pandas Essentials (Basic Functionality)

Published: 2019-04-10 21:14:17

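    1. Reindexing

    reindex rearranges the data to match the new index; any label that is not present in the original gets a missing value (NaN):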
    In [1]: import numpy as np
    In [2]: from pandas import Series, DataFrame
    In [3]: obj = Series([4.5,7.2,-5.3,3.6], index=["d","b","a","c"])
    In [4]: obj
    Out[4]:
    d    4.5
    b    7.2
    a   -5.3
    c    3.6
    dtype: float64
    In [6]: obj2 = obj.reindex(["a","b","c","d","e"])
    In [7]: obj2
    Out[7]:
    a   -5.3
    b    7.2
    c    3.6
    d    4.5
    e    NaN
    dtype: float64
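
When NaN is not the desired placeholder, reindex also accepts a fill_value argument. The following is a minimal sketch that reuses the Series above; the variable name obj_filled is only illustrative.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# Labels that are absent from the original index ("e" here) get 0 instead of NaN.
obj_filled = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)
print(obj_filled)
```

Because no NaN is introduced, the result has no missing entries, which can be convenient before arithmetic.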

     

    Passing method="ffill" forward-fills: each new label takes the last valid value before it:
    In [8]: obj3 = Series(["blue","purple","yellow"], index=[0,2,4])
    In [9]: obj3.reindex(range(6), method="ffill")
    Out[9]:
    0      blue
    1      blue
    2    purple
    3    purple
    4    yellow
    5    yellow
    dtype: object
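
For comparison, the opposite direction is available as method="bfill". This sketch assumes the same obj3 as above; the name filled_back is illustrative.

```python
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Backward fill: each new label takes the next valid value instead of the
# previous one. Label 5 has nothing after it, so it stays missing.
filled_back = obj3.reindex(range(6), method="bfill")
print(filled_back)
```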

     

    2. Dropping entries from an axis
    The drop method returns a new object with the specified labels removed from the given axis:
    In [12]: obj = Series(np.arange(5.), index=["a","b","c","d","e"])
    In [13]: new_obj = obj.drop("c")
    In [14]: new_obj
    Out[14]:
    a    0.0
    b    1.0
    d    3.0
    e    4.0
    dtype: float64

    With a DataFrame, labels can be dropped from either axis.
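
Dropping from either axis of a DataFrame might look like the sketch below. The frame is built here to match the data table used later in this post; rows_dropped and cols_dropped are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# axis=0 (rows) is the default.
rows_dropped = data.drop(["Colorado", "Ohio"])
# axis=1 drops columns instead.
cols_dropped = data.drop(["two", "four"], axis=1)

print(rows_dropped)
print(cols_dropped)
```

In both cases drop returns a new object and leaves data unchanged.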

     
    3. Indexing, selection, and filtering
    A Series can be indexed by more than just integers:
    In [4]: obj = Series(np.arange(4.), index=["a","b","c","d"])
    Out[6]:
    a    0.0
    b    1.0
    dtype: float64
    In [7]: obj[obj<2]
    Out[7]:
    a    0.0
    b    1.0
    dtype: float64

     

    Slicing a Series by label differs from ordinary Python slicing: the endpoint is included:
    In [8]: obj["b":"c"]
    Out[8]:
    b    1.0
    c    2.0
    dtype: float64
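
The difference between label-based and position-based slicing can be made explicit with loc and iloc. This is a small sketch using the same Series; by_label and by_position are illustrative names.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

# Label slice: both endpoints are included.
by_label = obj.loc["b":"c"]
# Positional slice: the stop position is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(by_label)
print(by_position)
```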

     

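    Indexing a DataFrame:
    In [9]: data = DataFrame(np.arange(16).reshape((4,4)), index=['Ohio','Colorado','Utah','New York'], columns=['one','two','three','four'])
    In [10]: data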
    Out[10]:
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    In [11]: data['two']
    Out[11]:
    Ohio         1
    Colorado     5
    Utah         9
    New York    13
    Name: two, dtype: int32
    In [12]: data[:2]
    Out[12]:
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7

     

    Indexing with a boolean DataFrame:
    In [13]: data > 5
    Out[13]:
                one    two  three   four
    Ohio      False  False  False  False
    Colorado  False  False   True   True
    Utah       True   True   True   True
    New York   True   True   True   True
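
A boolean DataFrame like the one above is typically used as a mask. The sketch below assumes data is the 4x4 frame shown earlier and works on a copy (masked, an illustrative name) so the original stays intact.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Set every element smaller than 5 to 0, using a boolean DataFrame as the mask.
masked = data.copy()
masked[masked < 5] = 0

# A boolean Series, by contrast, selects whole rows.
selected = data[data["three"] > 5]

print(masked)
print(selected)
```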

     

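    ix can select a subset of rows and columns (ix was deprecated in pandas 0.20 and removed in 1.0; current versions use loc for labels and iloc for integer positions instead):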
    In [18]: data.ix['Colorado',['two','three']]
    Out[18]:
    two      5
    three    6
    Name: Colorado, dtype: int32
    In [19]: data.ix[['Colorado','Utah'],[3,0,1]]
    Out[19]:
              four  one  two
    Colorado     7    4    5
    Utah        11    8    9
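
On pandas versions where ix is no longer available, the two selections above can be written with loc and iloc. This is a sketch of one possible translation, assuming data is the 4x4 frame shown earlier; row, by_label and by_position are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Purely label-based selection maps directly onto loc.
row = data.loc["Colorado", ["two", "three"]]

# Row labels mixed with column positions: convert the positions to labels...
by_label = data.loc[["Colorado", "Utah"], data.columns[[3, 0, 1]]]
# ...or use positions on both axes with iloc.
by_position = data.iloc[[1, 2], [3, 0, 1]]

print(row)
print(by_label)
```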

     

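    4. Arithmetic and data alignment
    When objects with different indexes are combined arithmetically, the index of the result is the union of the two indexes; labels found in only one operand yield NaN: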
    In [20]: s1 = Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])
    In [21]: s2 = Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a','c','e','f','g'])
    In [22]: s1+s2
    Out[22]:
    a    5.2
    c    1.1
    d    NaN
    e    0.0
    f    NaN
    g    NaN
    dtype: float64
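
If the NaN entries are unwanted, the add method with fill_value=0 treats a label missing from one side as 0. A minimal sketch with the same two Series (total is an illustrative name):

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

plain = s1 + s2                   # d, f and g are NaN
total = s1.add(s2, fill_value=0)  # missing side is treated as 0

print(total)
```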

     

    For DataFrames, alignment happens on both rows and columns:
    In [26]: df1
    Out[26]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    In [27]: df2
    Out[27]:
                b    c    d
    Ohio      0.0  1.0  2.0
    Texas     3.0  4.0  5.0
    Colorado  6.0  7.0  8.0
    In [28]: df1+df2
    Out[28]:
                b   c     d   e
    Colorado  NaN NaN   NaN NaN
    Ohio      3.0 NaN   6.0 NaN
    Oregon    NaN NaN   NaN NaN
    Texas     9.0 NaN  12.0 NaN
    Utah      NaN NaN   NaN NaN

     

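    Using the add method with fill_value=0, a value missing from one operand is treated as 0 (positions missing from both remain NaN):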
    In [30]: df2.add(df1,fill_value=0)
    Out[30]:
                b    c     d     e
    Colorado  6.0  7.0   8.0   NaN
    Ohio      3.0  1.0   6.0   5.0
    Oregon    9.0  NaN  10.0  11.0
    Texas     9.0  4.0  12.0   8.0
    Utah      0.0  NaN   1.0   2.0

     

    5. Operations between DataFrame and Series
    First, the difference between a 2-D array and one of its rows:
    In [31]: arr = np.arange(12.).reshape((3,4))
    In [32]: arr
    Out[32]:
    array([[ 0.,  1.,  2.,  3.],
           [ 4.,  5.,  6.,  7.],
           [ 8.,  9., 10., 11.]])
    In [33]: arr - arr[1]
    Out[33]:
    array([[-4., -4., -4., -4.],
           [ 0.,  0.,  0.,  0.],
           [ 4.,  4.,  4.,  4.]])

     

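    Operations between a DataFrame and a Series work similarly: by default the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows: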
    In [35]: frame = DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
    In [39]: series = frame.iloc[0]
    In [40]: frame
    Out[40]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    In [41]: series
    Out[41]:
    b    0.0
    d    1.0
    e    2.0
    Name: Utah, dtype: float64
    In [43]: frame - series
    Out[43]:
              b    d    e
    Utah    0.0  0.0  0.0
    Ohio    3.0  3.0  3.0
    Texas   6.0  6.0  6.0
    Oregon  9.0  9.0  9.0

     

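    If an index value is found in only one of the DataFrame's columns or the Series index, the two objects are reindexed to form the union. Here series2 is indexed by b, e, and f, so column d (missing from series2) and column f (missing from frame) come out as NaN: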
    In [45]: frame + series2
    Out[45]:
              b   d     e   f
    Utah    0.0 NaN   3.0 NaN
    Ohio    3.0 NaN   6.0 NaN
    Texas   6.0 NaN   9.0 NaN
    Oregon  9.0 NaN  12.0 NaN
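
The definition of series2 is not shown above. The sketch below assumes series2 = Series(range(3), index=['b','e','f']), which is consistent with the b and e columns of the output; the value assigned to f is a guess and does not affect the result.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

# Assumed definition: only the labels and the b/e values can be inferred
# from the output shown in the text.
series2 = pd.Series(range(3), index=["b", "e", "f"])

result = frame + series2
print(result)
```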

     

    To match on the row index and broadcast across the columns instead, use an arithmetic method and pass axis=0:
    In [46]: series3 = frame['d']
    In [47]: frame.sub(series3, axis=0)
    Out[47]:
              b    d    e
    Utah   -1.0  0.0  1.0
    Ohio   -1.0  0.0  1.0
    Texas  -1.0  0.0  1.0
    Oregon -1.0  0.0  1.0

     

    6. Function application and mapping
    NumPy ufuncs (element-wise array functions) also work on pandas objects:
    In [49]: frame = DataFrame(np.random.randn(4,3), columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
    In [50]: frame
    Out[50]:
                   b         d         e
    Utah    0.913051 -1.289725 -0.590573
    Ohio    1.417612 -1.835357 -0.010755
    Texas   0.328839 -0.121878 -1.209583
    Oregon  1.315330 -1.026557 -1.777427
     
    In [51]: np.abs(frame)
    Out[51]:
                   b         d         e
    Utah    0.913051  1.289725  0.590573
    Ohio    1.417612  1.835357  0.010755
    Texas   0.328839  0.121878  1.209583
    Oregon  1.315330  1.026557  1.777427
    The DataFrame apply method applies a function to the 1-D arrays formed by each column (the default) or each row (axis=1):
    In [52]: f = lambda x:x.max() - x.min()
    In [53]: frame.apply(f)
    Out[53]:
    b    1.088773
    d    1.713479
    e    1.766671
    dtype: float64
    In [54]: frame.apply(f, axis=1)
    Out[54]:
    Utah      2.202776
    Ohio      3.252969
    Texas     1.538421
    Oregon    3.092757
    dtype: float64
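
The function passed to apply does not have to return a scalar; it can also return a Series, in which case the result is a DataFrame. The sketch below hard-codes the random values printed above so that it is reproducible; min_max and summary are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame(
    [
        [0.913051, -1.289725, -0.590573],
        [1.417612, -1.835357, -0.010755],
        [0.328839, -0.121878, -1.209583],
        [1.315330, -1.026557, -1.777427],
    ],
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

def min_max(col):
    # Return several values per column as a labelled Series.
    return pd.Series([col.min(), col.max()], index=["min", "max"])

summary = frame.apply(min_max)
print(summary)
```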

     

    7. Sorting and ranking
    The sort_index method returns a new object sorted by index:
    In [57]: obj = Series(range(4), index=['d','a','b','c'])
    In [58]: obj
    Out[58]:
    d    0
    a    1
    b    2
    c    3
    dtype: int64
    In [59]: obj.sort_index()
    Out[59]:
    a    1
    b    2
    c    3
    d    0
    dtype: int64
    In [62]: frame.sort_index()
    Out[62]:
                   b         d         e
    Ohio    1.417612 -1.835357 -0.010755
    Oregon  1.315330 -1.026557 -1.777427
    Texas   0.328839 -0.121878 -1.209583
    Utah    0.913051 -1.289725 -0.590573
    In [63]: frame.sort_index(axis=1)
    Out[63]:
                   b         d         e
    Utah    0.913051 -1.289725 -0.590573
    Ohio    1.417612 -1.835357 -0.010755
    Texas   0.328839 -0.121878 -1.209583
    Oregon  1.315330 -1.026557 -1.777427

     

    Sorting in descending order:
    In [65]: frame.sort_index(axis=1,ascending=False)
    Out[65]:
                   e         d         b
    Utah   -0.590573 -1.289725  0.913051
    Ohio   -0.010755 -1.835357  1.417612
    Texas  -1.209583 -0.121878  0.328839
    Oregon -1.777427 -1.026557  1.315330

     

    Sorting by the values in a column:
    In [67]: frame.sort_values(by='b')
    Out[67]:
                   b         d         e
    Texas   0.328839 -0.121878 -1.209583
    Utah    0.913051 -1.289725 -0.590573
    Oregon  1.315330 -1.026557 -1.777427
    Ohio    1.417612 -1.835357 -0.010755
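
sort_values also accepts a list of column names, sorting by the first and using later ones to break ties. The small frame below is hypothetical and exists only to illustrate this.

```python
import pandas as pd

# Hypothetical data: column "a" has repeated values, so "b" decides the order
# within each group.
scores = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

by_two_columns = scores.sort_values(by=["a", "b"])
print(by_two_columns)
```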

     

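    Ranking (rank) is closely related to sorting: it assigns each value a rank and breaks ties according to a rule. By default, tied values each receive the mean of the ranks they span: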
    In [70]: obj
    Out[70]:
    0    7
    1   -5
    2    7
    3    4
    4    2
    5    0
    6    4
    dtype: int64
    In [71]: obj.rank()
    Out[71]:
    0    6.5
    1    1.0
    2    6.5
    3    4.5
    4    3.0
    5    2.0
    6    4.5
    dtype: float64

     

    With method='first', tied values are ranked in the order in which they appear in the data:
    In [72]: obj.rank(method='first')
    Out[72]:
    0    6.0
    1    1.0
    2    7.0
    3    4.0
    4    3.0
    5    2.0
    6    5.0
    dtype: float64
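
Other tie-breaking rules and descending ranks are available through the method and ascending parameters. A short sketch on the same values (the result names are illustrative):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# "min": every member of a tied group gets the lowest rank in the group.
low_rank = obj.rank(method="min")
# "dense": like "min", but ranks increase by 1 between groups, with no gaps.
dense_rank = obj.rank(method="dense")
# Descending order, with tied groups taking the highest rank in the group.
high_first = obj.rank(ascending=False, method="max")

print(low_rank.tolist())
print(dense_rank.tolist())
print(high_first.tolist())
```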

     

    8. Axis indexes with duplicate values
    The index's is_unique property reports whether its labels are unique:
    In [73]: obj = Series(range(5),index=['a','a','b','b','c'])
    In [74]: obj
    Out[74]:
    a    0
    a    1
    b    2
    b    3
    c    4
    dtype: int64
    In [75]: obj.index.is_unique
    Out[75]: False

     

    Selecting by a label that occurs more than once returns a Series rather than a scalar:
    In [76]: obj['a']
    Out[76]:
    a    0
    a    1
    dtype: int64
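
The return type therefore depends on whether the label is duplicated, which is worth keeping in mind when writing code that expects a scalar. The sketch below contrasts the two cases and shows the same behaviour on a small, hypothetical DataFrame.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

multi = obj["a"]   # duplicated label -> Series with two entries
single = obj["c"]  # unique label -> scalar

# Hypothetical frame with a duplicated row index.
df = pd.DataFrame(np.arange(12).reshape((4, 3)), index=["a", "a", "b", "b"])
rows_b = df.loc["b"]  # both rows labelled "b"

print(type(multi).__name__, single)
print(rows_b)
```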

    The same applies to DataFrame.
