Common DataFrame Methods (Part 1)
Some common methods
- head: returns the first five rows by default; pass an integer to choose how many rows to return
    >>> import pandas as pd
    >>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
    ...                               'monkey', 'parrot', 'shark', 'whale', 'zebra']})
    >>> df
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    5     parrot
    6      shark
    7      whale
    8      zebra
    >>> df.head()
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
- tail: returns the last five rows by default; pass an integer to choose how many rows to return
    >>> df.tail()
       animal
    4  monkey
    5  parrot
    6   shark
    7   whale
    8   zebra
    >>> df.tail(3)
      animal
    6  shark
    7  whale
    8  zebra
- shape: the number of rows and columns in the DataFrame, as a (rows, columns) tuple
    >>> df.shape
    (9, 1)
- info: prints the index, column data types, and memory usage
    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 9 entries, 0 to 8
    Data columns (total 1 columns):
    animal    9 non-null object
    dtypes: object(1)
    memory usage: 152.0 bytes
- mean: the mean of each column
    >>> import numpy as np
    >>> df = pd.DataFrame(np.random.rand(5, 5))
    >>> df
              0         1         2         3         4
    0  0.987926  0.556055  0.774863  0.926501  0.029973
    1  0.635812  0.698311  0.402425  0.727675  0.048129
    2  0.001094  0.329329  0.364231  0.754038  0.405464
    3  0.975270  0.388988  0.598047  0.355597  0.189753
    4  0.171976  0.334893  0.931219  0.967504  0.323952
    >>> df.mean()
    0    0.554416
    1    0.461516
    2    0.614157
    3    0.746263
    4    0.199454
    dtype: float64
- count: the number of non-null values in each column
    >>> df.count()
    0    5
    1    5
    2    5
    3    5
    4    5
    dtype: int64
- max: the maximum of each column
    >>> df.max()
    0    0.987926
    1    0.698311
    2    0.931219
    3    0.967504
    4    0.405464
    dtype: float64
- min: the minimum of each column
    >>> df.min()
    0    0.001094
    1    0.329329
    2    0.364231
    3    0.355597
    4    0.029973
    dtype: float64
- median: the median of each column
    >>> df.median()
    0    0.635812
    1    0.388988
    2    0.598047
    3    0.754038
    4    0.189753
    dtype: float64
- std: the standard deviation of each column
    >>> df.std()
    0    0.453900
    1    0.161072
    2    0.241820
    3    0.242105
    4    0.165573
    dtype: float64
- corr: the pairwise correlation coefficients between columns
    >>> df.corr()
              0         1         2         3         4
    0  1.000000  0.517374  0.142778 -0.401999 -0.836540
    1  0.517374  1.000000 -0.262423  0.076483 -0.882558
    2  0.142778 -0.262423  1.000000  0.458608 -0.044045
    3 -0.401999  0.076483  0.458608  1.000000  0.032442
    4 -0.836540 -0.882558 -0.044045  0.032442  1.000000
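The summary methods above all ran on a frame with no missing values. As a small sketch of how they behave when a value is missing, the example below uses a made-up `scores` frame (the column names and numbers are illustrative and do not come from the examples above):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value so the NaN handling is visible.
scores = pd.DataFrame({
    "math": [90.0, 80.0, np.nan, 70.0],
    "art": [60.0, 65.0, 70.0, 75.0],
})

first_two = scores.head(2)        # first 2 rows instead of the default 5
last_one = scores.tail(1)         # last row only

col_means = scores.mean()         # per column; NaN is skipped by default
non_null = scores.count()         # non-missing values per column
row_means = scores.mean(axis=1)   # axis=1 averages across columns, row by row

print(col_means)
print(non_null)
print(row_means)
```

Because missing values are skipped, `count` is the quickest way to see how many values each statistic was actually computed from.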
Retrieving data
Slicing can be used with all of the methods below
df[column]: selects a single column and returns it as a Series
    >>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
    ...                   index=['cobra', 'viper', 'sidewinder'],
    ...                   columns=['max_speed', 'shield'])
    >>> df
                max_speed  shield
    cobra               1       2
    viper               4       5
    sidewinder          7       8
    >>> df['max_speed']
    cobra         1
    viper         4
    sidewinder    7
    Name: max_speed, dtype: int64
df[[column1, column2]]: selects several columns and returns a DataFrame; note the list inside the brackets
    >>> df[['max_speed', 'shield']]
                max_speed  shield
    cobra               1       2
    viper               4       5
    sidewinder          7       8
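df.loc[row_label, column_label]: selects by index labels. Passing both a row and a column label returns a single element; passing only a row label or only a column label returns that row or column as a Series
    >>> df.loc["cobra", "max_speed"]
    1
    >>> df.loc["cobra":, "max_speed":]
                max_speed  shield
    cobra               1       2
    viper               4       5
    sidewinder          7       8
    >>> df.loc["cobra":,]
                max_speed  shield
    cobra               1       2
    viper               4       5
    sidewinder          7       8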
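df.iloc[row_position, column_position]: selects by integer position. Passing both a row and a column position returns a single element; passing only one of them returns that row or column as a Series
    >>> df.iloc[0, 0]
    1
    >>> df.iloc[0:, 0:]
                max_speed  shield
    cobra               1       2
    viper               4       5
    sidewinder          7       8
    >>> df.iloc[0:,]
                max_speed  shield
    cobra               1       2
    viper               4       5
    sidewinder          7       8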
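The slices in the loc and iloc examples are all open-ended, which hides one difference between the two accessors: a label slice includes its end label, while a position slice excludes its end position. A short sketch on the same snake data (the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

# Label slice: both endpoints are included, so 'viper' is part of the result.
by_label = df.loc['cobra':'viper', 'max_speed']

# Position slice: the end position is excluded, so only row 0 is returned.
by_position = df.iloc[0:1, 0]

one_row = df.loc['viper']    # only a row label: that row as a Series
one_cell = df.iloc[1, 1]     # row and column position: a single element

print(by_label)
print(by_position)
```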
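df.ix[0, 0]: combines loc and iloc, accepting either positions or index labels. Passing only a row or only a column returns that row or column as a Series. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the calls below only run on older versions; use loc or iloc in current code
    >>> df.ix['viper', 1:2]
    shield    5
    Name: viper, dtype: int64
    >>> df.ix['viper', 0:1]
    max_speed    4
    Name: viper, dtype: int64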
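On pandas versions without `.ix`, the same mix of a row label and a column position can be written with `loc` or `iloc` by converting one side first. This is one possible sketch, not the only way to do it; `row_label` and `col_pos` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

row_label = 'viper'   # row chosen by label
col_pos = 1           # column chosen by position

# Option 1: turn the column position into a label, then use loc.
via_loc = df.loc[row_label, df.columns[col_pos]]

# Option 2: turn the row label into a position, then use iloc.
via_iloc = df.iloc[df.index.get_loc(row_label), col_pos]

# Slice form, comparable to the old df.ix['viper', 1:2]; returns a Series.
as_series = df.loc[row_label, df.columns[1:2]]

print(via_loc, via_iloc)
print(as_series)
```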
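df.values[:, :]: returns the underlying data as a NumPy array indexed by position; selecting a single row or a single column returns a one-dimensional array
    >>> df.values[0:, :]
    array([[1, 2],
           [4, 5],
           [7, 8]], dtype=int64)
    >>> df.values[:1, :]
    array([[1, 2]], dtype=int64)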
df[df[column] > 10]: selects the rows that satisfy the condition
    >>> df[df['max_speed'] > 0]
                max_speed  shield
    cobra               1       2
    viper               4       5
    sidewinder          7       8
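The condition above happens to match every row. A sketch with conditions that actually filter, and with two conditions combined, might look like this (each condition needs its own parentheses, and `&` / `|` are used instead of `and` / `or`):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

# A single condition: keeps the rows where max_speed is greater than 1.
fast = df[df['max_speed'] > 1]

# Both conditions must hold (element-wise AND).
both = df[(df['max_speed'] > 1) & (df['shield'] < 8)]

# Either condition may hold (element-wise OR); loc also picks one column.
either = df.loc[(df['max_speed'] == 1) | (df['shield'] == 8), 'shield']

print(fast)
print(both)
print(either)
```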
df.sort_values([column1, column2], ascending=[False, True]): sorts by a column in ascending or descending order; when two or more columns are given, rows are sorted by the first column and ties are broken by the following columns in order
    >>> df.sort_values('max_speed')
                max_speed  shield
    cobra               1       2
    viper               4       5
    sidewinder          7       8
    >>> df.sort_values('max_speed', ascending=False)
                max_speed  shield
    sidewinder          7       8
    viper               4       5
    cobra               1       2
    >>> df.sort_values(['max_speed', 'shield'], ascending=False)
                max_speed  shield
    sidewinder          7       8
    viper               4       5
    cobra               1       2
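The snake data has no repeated `max_speed` values, so the second sort column never comes into play. The sketch below uses a made-up frame with ties to show a per-column `ascending` list at work:

```python
import pandas as pd

# Hypothetical data with repeated max_speed values so ties must be broken.
df = pd.DataFrame({'max_speed': [4, 1, 4, 1], 'shield': [5, 2, 9, 8]},
                  index=['viper', 'cobra', 'python', 'adder'])

# max_speed descending first; within equal max_speed, shield ascending.
ordered = df.sort_values(['max_speed', 'shield'], ascending=[False, True])

print(ordered)
```

`sort_values` returns a new DataFrame; `df` keeps its original row order.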
df.groupby([column1, column2]): groups by one or more columns and returns a groupby object, which produces results once an aggregation such as mean is applied
    >>> df.groupby('max_speed')
    <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001df64584588>
    >>> df.groupby('max_speed').mean()
               shield
    max_speed
    1               2
    4               5
    7               8
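Grouping is easier to read when groups hold more than one row. As a minimal sketch with made-up animal data (the values are illustrative), several aggregations can be computed per group at once, and the groupby object can also be iterated directly:

```python
import pandas as pd

# Hypothetical data: two rows per class so each group has something to aggregate.
df = pd.DataFrame({'class': ['bird', 'bird', 'mammal', 'mammal'],
                   'max_speed': [389.0, 24.0, 80.5, 58.0]})

# Select one column from the grouped object, then apply several aggregations.
stats = df.groupby('class')['max_speed'].agg(['mean', 'max', 'count'])

# Iterating yields (group key, sub-DataFrame) pairs.
sizes = {name: len(group) for name, group in df.groupby('class')}

print(stats)
print(sizes)
```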
Data cleaning
- df.columns = ['a', 'b', 'c', 'd']: renames all columns at once; the list must have one name per column
    >>> df.columns = ['a', 'b']
    >>> df
                a  b
    cobra       1  2
    viper       4  5
    sidewinder  7  8
- df.rename(mapper, axis): renames row or column labels using a dict or a function; axis chooses whether the mapper applies to the index or to the columns
    >>> df.rename(index=str, columns={'a': 'a', 'b': 'b'})
                a  b
    cobra       1  2
    viper       4  5
    sidewinder  7  8
    >>> df.rename(str.lower, axis='columns')
                a  b
    cobra       1  2
    viper       4  5
    sidewinder  7  8
    >>> df.rename({'cobra': 'a', 'viper': 'b', 'sidewinder': 'c'}, axis='index')
       a  b
    a  1  2
    b  4  5
    c  7  8
- df.set_index('column_one'): uses the given column as the index
    >>> df = pd.DataFrame({'month': [1, 4, 7, 10],
    ...                    'year': [2012, 2014, 2013, 2014],
    ...                    'sale': [55, 40, 84, 31]})
    >>> df
       month  year  sale
    0      1  2012    55
    1      4  2014    40
    2      7  2013    84
    3     10  2014    31
    >>> df.set_index('month')
           year  sale
    month
    1      2012    55
    4      2014    40
    7      2013    84
    10     2014    31
- df.reset_index: resets the row index to the default integer index; the old index becomes a regular column
    >>> df = pd.DataFrame([('bird', 389.0),
    ...                    ('bird', 24.0),
    ...                    ('mammal', 80.5),
    ...                    ('mammal', np.nan)],
    ...                   index=['falcon', 'parrot', 'lion', 'monkey'],
    ...                   columns=('class', 'max_speed'))
    >>> df
             class  max_speed
    falcon    bird      389.0
    parrot    bird       24.0
    lion    mammal       80.5
    monkey  mammal        NaN
    >>> df.reset_index()
        index   class  max_speed
    0  falcon    bird      389.0
    1  parrot    bird       24.0
    2    lion  mammal       80.5
    3  monkey  mammal        NaN
- df.isnull: checks every element for a missing value and returns a boolean DataFrame of the same shape, True where the value is missing and False otherwise; df.isna is an alias. An empty string does not count as missing
    >>> df = pd.DataFrame({'age': [5, 6, np.nan],
    ...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
    ...                             pd.Timestamp('1940-04-25')],
    ...                    'name': ['alfred', 'batman', ''],
    ...                    'toy': [None, 'batmobile', 'joker']})
    >>> df
       age       born    name        toy
    0  5.0        NaT  alfred       None
    1  6.0 1939-05-27  batman  batmobile
    2  NaN 1940-04-25              joker
    >>> df.isnull()
         age   born   name    toy
    0  False   True  False   True
    1  False  False  False  False
    2   True  False  False  False
    >>> df.isna()
         age   born   name    toy
    0  False   True  False   True
    1  False  False  False  False
    2   True  False  False  False
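- df.notnull: the inverse of isnull; returns a boolean DataFrame that is True where a value is present and False where it is missing; df.notna is an alias
    >>> df.notna()
         age   born  name    toy
    0   True  False  True  False
    1   True   True  True   True
    2  False   True  True   True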
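- df.dropna(axis): drops every row or column that contains a missing value; axis=0 drops rows, axis=1 drops columns. A new DataFrame is returned and df itself is left unchanged
    >>> df.dropna()
       age       born    name        toy
    1  6.0 1939-05-27  batman  batmobile
    >>> df
       age       born    name        toy
    0  5.0        NaT  alfred       None
    1  6.0 1939-05-27  batman  batmobile
    2  NaN 1940-04-25              joker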
- df.fillna(n): replaces every missing value in the DataFrame with n
    >>> df.fillna('hahaha')
          age                 born    name        toy
    0       5               hahaha  alfred     hahaha
    1       6  1939-05-27 00:00:00  batman  batmobile
    2  hahaha  1940-04-25 00:00:00              joker
- df.replace('a', 'b'): replaces every occurrence of 'a' in the DataFrame with 'b'
    >>> df = pd.DataFrame({'a': [0, 1, 2, 3, 4],
    ...                    'b': [5, 6, 7, 8, 9],
    ...                    'c': ['a', 'b', 'c', 'd', 'e']})
    >>> df.replace(0, 5)
       a  b  c
    0  5  5  a
    1  1  6  b
    2  2  7  c
    3  3  8  d
    4  4  9  e
- df[['a', 'b']].astype(type): converts the selected columns to the given data type, that is, it changes the dtype of each underlying Series
    >>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    >>> df.rename(index=str, columns={"a": "a", "b": "c"})
       a  c
    0  1  4
    1  2  5
    2  3  6
    >>> df = pd.DataFrame([('bird', 389.0), ('bird', 24.0), ('mammal', 80.5), ('mammal', np.nan)],
    ...                   index=['falcon', 'parrot', 'lion', 'monkey'],
    ...                   columns=('class', 'max_speed'))
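The cleaning methods above are usually chained rather than used one at a time. The sketch below strings `fillna`, `dropna`, and `astype` together on a made-up frame; the column names and the choice of fill value are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: floats with gaps that should end up as integers.
raw = pd.DataFrame({
    'age': [5.0, 6.0, np.nan],
    'score': [1.0, np.nan, 3.0],
    'name': ['alfred', 'batman', 'joker'],
})

# A dict fills only the named columns; here missing scores become 0.0.
filled = raw.fillna({'score': 0.0})

# subset limits the check to 'age', so only rows without an age are dropped.
complete = filled.dropna(subset=['age'])

# Convert several columns at once; this works because no NaN is left in them.
ints = complete[['age', 'score']].astype('int64')

print(ints)
print(ints.dtypes)
```

Converting a column that still contains NaN to `int64` raises an error, which is why the filling and dropping steps come first.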
Combining data
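- df.append(df2): appends the rows of df2 to the end of df, aligning on column names. If the two frames have different columns, the result contains all of them, with NaN where a frame has no such column. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the calls below only run on older versions; pd.concat covers the same use
    >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
    >>> df
       a  b
    0  1  2
    1  3  4
    >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('ab'))
    >>> df2
       a  b
    0  5  6
    1  7  8
    >>> df.append(df2)
       a  b
    0  1  2
    1  3  4
    0  5  6
    1  7  8
    >>> df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('bc'))
    >>> df3
       b  c
    0  5  6
    1  7  8
    >>> df.append(df3, sort=True)  # sort orders the columns of the result
         a  b    c
    0  1.0  2  NaN
    1  3.0  4  NaN
    0  NaN  5  6.0
    1  NaN  7  8.0
    >>> df.append(df3, sort=True, ignore_index=True)  # ignore_index renumbers the rows from 0
         a  b    c
    0  1.0  2  NaN
    1  3.0  4  NaN
    2  NaN  5  6.0
    3  NaN  7  8.0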
- pd.concat([df1, df2], axis=1): concatenates df2 onto df1 along the chosen axis; the default axis=0 stacks rows, axis=1 places the frames side by side
    >>> df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
    >>> df1
      letter  number
    0      a       1
    1      b       2
    >>> df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])
    >>> df2
      letter  number
    0      c       3
    1      d       4
    >>> pd.concat([df1, df2])
      letter  number
    0      a       1
    1      b       2
    0      c       3
    1      d       4
    >>> pd.concat([df1, df2], axis=1)
      letter  number letter  number
    0      a       1      c       3
    1      b       2      d       4
- df1.join(df2, on=columns, how='outer'): joins the columns of df2 onto df1, matching on the index by default
    >>> df = pd.DataFrame({'key': ['k0', 'k1', 'k2', 'k3', 'k4', 'k5'],
    ...                    'a': ['a0', 'a1', 'a2', 'a3', 'a4', 'a5']})
    >>> df
      key   a
    0  k0  a0
    1  k1  a1
    2  k2  a2
    3  k3  a3
    4  k4  a4
    5  k5  a5
    >>> df2 = pd.DataFrame({'key': ['k0', 'k1', 'k2'], 'b': ['b0', 'b1', 'b2']})
    >>> df2
      key   b
    0  k0  b0
    1  k1  b1
    2  k2  b2
    >>> # lsuffix is appended to overlapping column names from the left frame, rsuffix to those from the right
    >>> df.join(df2, lsuffix='_caller', rsuffix='_other')
      key_caller   a key_other    b
    0         k0  a0        k0   b0
    1         k1  a1        k1   b1
    2         k2  a2        k2   b2
    3         k3  a3       NaN  NaN
    4         k4  a4       NaN  NaN
    5         k5  a5       NaN  NaN
- df1.merge(df2, left_on='column1', right_on='column2'): merges the two frames database-style, matching column1 of df1 against column2 of df2; columns with the same name in both frames get the suffixes _x and _y
    >>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
    >>> df1
      lkey  value
    0  foo      1
    1  bar      2
    2  baz      3
    3  foo      5
    >>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]})
    >>> df2
      rkey  value
    0  foo      5
    1  bar      6
    2  baz      7
    3  foo      8
    >>> df1.merge(df2, left_on='lkey', right_on='rkey')
      lkey  value_x rkey  value_y
    0  foo        1  foo        5
    1  foo        1  foo        8
    2  foo        5  foo        5
    3  foo        5  foo        8
    4  bar        2  bar        6
    5  baz        3  baz        7
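As a rough sketch of how the combining methods above fit together on a current pandas version, the example below stacks rows with `pd.concat`, then compares a keyed `merge` with an index-based `join`. The `left` and `right` frames are made up for illustration, and `indicator=True` is an optional `merge` argument that the examples above do not use:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('bc'))

# Row-wise stacking with pd.concat: columns missing from a frame are filled
# with NaN, sort=True orders the columns, ignore_index=True renumbers the rows.
stacked = pd.concat([df, df3], ignore_index=True, sort=True)

# Hypothetical frames that share some, but not all, keys.
left = pd.DataFrame({'key': ['k0', 'k1', 'k2'], 'a': ['a0', 'a1', 'a2']})
right = pd.DataFrame({'key': ['k0', 'k1', 'k3'], 'b': ['b0', 'b1', 'b3']})

# how='outer' keeps keys from both sides; indicator=True adds a _merge column
# recording whether each row came from the left, the right, or both.
outer = left.merge(right, on='key', how='outer', indicator=True)
origin = dict(zip(outer['key'], outer['_merge'].astype(str)))

# join matches on the index, so the key column is moved into the index first.
joined = left.set_index('key').join(right.set_index('key'), how='left')

print(stacked)
print(outer)
print(joined)
```

With `how='left'`, the key `k2` is kept even though `right` has no such row, and its `b` value is NaN.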