A Series is a one-dimensional, array-like object. It consists of a sequence of values and an associated set of data labels, called its index.
In [42]: import pandas as pd

In [43]: pd.Series([2,3,7,1])
Out[43]:
0    2
1    3
2    7
3    1
dtype: int64

In [44]: f=pd.Series([2,3,7,1])

In [45]: f.index
Out[45]: RangeIndex(start=0, stop=4, step=1)

In [46]: f.values
Out[46]: array([2, 3, 7, 1], dtype=int64)
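As a small sketch of the same idea outside an interactive session, the snippet below builds a Series with the default integer index and another with a custom index. The variable names and the labels "a" to "d" are made up for illustration.

```python
import pandas as pd

# A Series built from a plain list gets a default integer index 0..n-1.
s = pd.Series([2, 3, 7, 1])

# The labels and the data are stored separately and can be read back.
labels = list(s.index)      # the index labels
values = s.to_numpy()       # the data as a NumPy array

# A custom index (hypothetical labels) can be supplied at construction time.
named = pd.Series([2, 3, 7, 1], index=["a", "b", "c", "d"])

print(labels)
print(values.tolist())
print(named["c"])
```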
You can think of a Series as an ordered dict: it maps index values to data values, and it behaves like a dict in many ways.
You can select a single value or a group of values by index label:
In [56]: f
Out[56]:
fang    3
liu     9
wang    2
su      3
dtype: int64

In [57]: f['fang']
Out[57]: 3

In [58]: f[['fang','su','liu']]
Out[58]:
fang    3
su      3
liu     9
dtype: int64
A Series also supports Python's in operator. Note that in tests the index labels, not the values:
In [56]: f
Out[56]:
fang    3
liu     9
wang    2
su      3
dtype: int64

In [59]: 'fang' in f
Out[59]: True

In [60]: 'liu' in f
Out[60]: True

In [61]: 'laohu' in f
Out[61]: False

In [62]: 9 in f
Out[62]: False
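Since `9 in f` returns False even though 9 is one of the values, it may help to see how a value lookup could be written instead. This is a minimal sketch; the variable names are illustrative only.

```python
import pandas as pd

f = pd.Series([3, 9, 2, 3], index=["fang", "liu", "wang", "su"])

# `in` on a Series checks the index labels, like `in` on a dict checks keys.
label_present = "liu" in f          # "liu" is a label
value_as_label = 9 in f             # 9 is a value, not a label

# To test for a value, search the underlying array or use isin().
value_in_array = 9 in f.values
value_via_isin = bool(f.isin([9]).any())

print(label_present, value_as_label, value_in_array, value_via_isin)
```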
Two Series can also be combined directly on their shared index: arithmetic matches values by index label. See the example below.
In [49]: f=pd.Series([3,9,2,3],index=['fang','liu','wang','su'])

In [50]: f
Out[50]:
fang    3
liu     9
wang    2
su      3
dtype: int64

In [51]: d=pd.Series([2,22,13,14],index=['fang','liu','wang','su'])

In [52]: d+f
Out[52]:
fang     5
liu     31
wang    15
su      17
dtype: int64
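The example above uses two Series with identical indexes. As a hedged sketch of what happens when the indexes only partly overlap (the data here is made up), labels that appear on one side only produce NaN, and `Series.add` with `fill_value` is one way to treat the missing side as zero.

```python
import pandas as pd

a = pd.Series([3, 9, 2], index=["fang", "liu", "wang"])
b = pd.Series([22, 2, 14], index=["fang", "liu", "su"])

# Values are matched by label; "wang" and "su" exist on one side only,
# so their sums are NaN and the result is upcast to float.
total = a + b

# add() with fill_value substitutes 0 for the missing side before adding.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```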
If you already have a Python dict, you can convert it directly into a Series:
In [54]: o={'fang':567,'su':456,'liu':110}

In [55]: pd.Series(o)
Out[55]:
fang    567
su      456
liu     110
dtype: int64
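When converting a dict, you can also pass an explicit `index` to choose which keys to keep and in what order. In this sketch the label "wang" is deliberately absent from the dict to show that missing keys become NaN.

```python
import pandas as pd

o = {"fang": 567, "su": 456, "liu": 110}

# Only the requested labels are kept, in the requested order.
# "wang" has no entry in the dict, so its value is NaN;
# "su" is not requested, so it is dropped.
s = pd.Series(o, index=["liu", "fang", "wang"])

print(s)
```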
Both the Series object itself and its index have a name attribute:
In [65]: f.name='count'

In [66]: f.index.name='people name'

In [67]: f
Out[67]:
people name
fang    3
liu     9
wang    2
su      3
Name: count, dtype: int64
The index of a Series can be changed by assignment:
In [69]: f.index=['zhou','wu','zheng','wang']

In [70]: f
Out[70]:
zhou     3
wu       9
zheng    2
wang     3
Name: count, dtype: int64
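Two details of index assignment are easy to miss: the new list must have the same length as the Series, and replacing the index with a plain list drops the old index name. The sketch below illustrates both, plus `rename` as one way to change a single label without touching the original. The replacement label "zhao" is made up.

```python
import pandas as pd

f = pd.Series([3, 9, 2, 3], index=["fang", "liu", "wang", "su"], name="count")
f.index.name = "people name"

# Assigning a plain list replaces the whole index; the index name is lost.
f.index = ["zhou", "wu", "zheng", "wang"]
name_after = f.index.name

# A list of the wrong length is rejected with a ValueError.
try:
    f.index = ["a", "b"]
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

# rename() returns a new Series with selected labels changed.
renamed = f.rename(index={"zhou": "zhao"})

print(name_after, mismatch_error)
print(renamed)
```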
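A DataFrame is a tabular data structure. It contains an ordered collection of columns, and each column can hold a different value type (numeric, string, boolean, and so on), whereas a NumPy array generally holds a single data type. Compared with a Series, a DataFrame has two indexes: a row index and a column index.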
There are many ways to create a DataFrame, but the most common and most intuitive is to build it from a dict of equal-length lists or arrays.
In [76]: data
Out[76]:
{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [77]: pd.DataFrame(data)
Out[77]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
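Another way is to use a nested dict. When a nested dict is passed to DataFrame, pandas interprets the outer keys as the columns and the inner keys as the row index.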
In [130]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [131]: pd.DataFrame(pop)
Out[131]:
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5
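In the output above, the inner keys are combined into one row index and any missing combination is filled with NaN. A short sketch of checking this, and of sorting the rows afterwards with `sort_index`, might look like this:

```python
import pandas as pd

pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

# Outer keys become columns, inner keys become row labels.
df = pd.DataFrame(pop)

# Nevada has no entry for 2000, so that cell is NaN.
missing = pd.isna(df.loc[2000, "Nevada"])

# sort_index() returns a new frame with the row labels in ascending order.
ordered = df.sort_index()

print(missing)
print(ordered)
```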
You can specify the column order when creating the DataFrame:
In [78]: pd.DataFrame(data,columns=['year','state','pop'])
Out[78]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
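The `columns` argument also selects columns, not just orders them. As a sketch, asking for a column name that is not a key of the dict (here a hypothetical "debt") gives a column filled with NaN rather than an error.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}

# "debt" is a made-up column name that does not exist in `data`.
df = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

# The unknown column is created and filled entirely with NaN.
all_missing = bool(df["debt"].isna().all())

print(df)
print(all_missing)
```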
Besides columns, a DataFrame also has an index of row labels, which can be set at creation time:
In [86]: data
Out[86]:
{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [87]: frame=pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four','five','six'])

In [88]: frame
Out[88]:
       year   state  pop
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2001  Nevada  2.4
five   2002  Nevada  2.9
six    2003  Nevada  3.2
To get the DataFrame's index and columns as lists:
In [89]: frame.index.to_list()
Out[89]: ['one', 'two', 'three', 'four', 'five', 'six']

In [90]: frame.columns.to_list()
Out[90]: ['year', 'state', 'pop']
As with a Series, the index and the columns of a DataFrame each have a name attribute:
In [93]: frame.index.name='num'

In [94]: frame.index.name
Out[94]: 'num'

In [95]: frame
Out[95]:
       year   state  pop
num
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2001  Nevada  2.4
five   2002  Nevada  2.9
six    2003  Nevada  3.2

In [96]: frame.columns.name='column_name'

In [97]: frame.columns.name
Out[97]: 'column_name'

In [98]: frame
Out[98]:
column_name  year   state  pop
num
one          2000    Ohio  1.5
two          2001    Ohio  1.7
three        2002    Ohio  3.6
four         2001  Nevada  2.4
five         2002  Nevada  2.9
six          2003  Nevada  3.2
To access a column you can write either df['year'] or df.year. The attribute form df.column only works for column names that already exist in the DataFrame.
In [99]: frame['year']
Out[99]:
num
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [100]: frame.year
Out[100]:
num
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
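One caveat with attribute access is worth a small sketch. The example frame has a column called 'pop', but DataFrame also has a method named `pop`, so `frame.pop` gives you the method rather than the column. Bracket access does not have this problem.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

# For an ordinary column name both forms return the same Series.
same_year = frame["year"].equals(frame.year)

# "pop" collides with the DataFrame.pop method, so attribute access
# returns the bound method, not the column.
attr_is_method = callable(frame.pop)

# Bracket access always refers to the column.
pop_values = frame["pop"].tolist()

print(same_year, attr_is_method, pop_values)
```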
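To select rows from a DataFrame by label, use loc. loc is also what you use to pick a single value by row and column, and there are two ways to write it: frame.loc['index','column'] or frame.loc['index']['column'].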
In [103]: frame
Out[103]:
column_name  year   state  pop
num
one          2000    Ohio  1.5
two          2001    Ohio  1.7
three        2002    Ohio  3.6
four         2001  Nevada  2.4
five         2002  Nevada  2.9
six          2003  Nevada  3.2

In [104]: frame.loc['one']
Out[104]:
column_name
year     2000
state    Ohio
pop       1.5
Name: one, dtype: object

In [105]: frame.loc['one','pop']
Out[105]: 1.5

In [106]: frame.loc['one']['pop']
Out[106]: 1.5
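Both loc forms return the same value when reading. For writing, the single-call form frame.loc[row, column] is the safer choice, because the chained form first produces an intermediate object and the assignment may not reach the original frame. A minimal sketch, reusing the article's data:

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data, index=["one", "two", "three", "four", "five", "six"])

# One row label gives a Series; row and column labels give a scalar.
row = frame.loc["one"]
cell_before = frame.loc["one", "pop"]

# Lists of labels select a sub-table.
sub = frame.loc[["one", "three"], ["year", "pop"]]

# Assign through a single loc call so the original frame is updated.
frame.loc["one", "pop"] = 9.9
cell_after = frame.loc["one", "pop"]

print(cell_before, cell_after)
print(sub)
```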
You can also use loc to select a column. The idea is to select all rows first and then pick the column from them:
In [107]: frame.loc[:,'pop']
Out[107]:
num
one      1.5
two      1.7
three    3.6
four     2.4
five     2.9
six      3.2
Name: pop, dtype: float64
Rows and columns of a DataFrame can be assigned to. If you want to give an entire column or an entire row the same value, just assign that value directly:
In [108]: frame
Out[108]:
column_name  year   state  pop
num
one          2000    Ohio  1.5
two          2001    Ohio  1.7
three        2002    Ohio  3.6
four         2001  Nevada  2.4
five         2002  Nevada  2.9
six          2003  Nevada  3.2

In [109]: frame['pop'] =22

In [110]: frame
Out[110]:
column_name  year   state  pop
num
one          2000    Ohio   22
two          2001    Ohio   22
three        2002    Ohio   22
four         2001  Nevada   22
five         2002  Nevada   22
six          2003  Nevada   22

In [118]: frame.loc['one']=23

In [119]: frame
Out[119]:
column_name  year   state  pop
num
one            23      23   23
two          2001    Ohio   22
three        2002    Ohio   22
four         2001  Nevada   22
five         2002  Nevada   22
six          2003  Nevada   22
If you want to assign different values to a row or a column, you need to pass a sequence whose length matches that row or column:
In [113]: frame.loc['one']=22,'fang','wang'

In [114]: frame
Out[114]:
column_name  year   state   pop
num
one            22    fang  wang
two          2001    Ohio    22
three        2002    Ohio    22
four         2001  Nevada    22
five         2002  Nevada    22
six          2003  Nevada    22
In [122]: frame['pop']=1,2,3,4,5,6

In [123]: frame
Out[123]:
column_name  year   state  pop
num
one            23      23    1
two          2001    Ohio    2
three        2002    Ohio    3
four         2001  Nevada    4
five         2002  Nevada    5
six          2003  Nevada    6
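To make the length rule concrete, here is a small sketch on a three-row frame (the data and the column names "bad" and "debt" are made up). A list of the wrong length raises a ValueError. A Series is a different case: it is matched to the frame by index label, so labels it does not cover end up as NaN.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

# A list of matching length is assigned position by position.
frame["pop"] = [1.5, 1.7, 3.6]

# A list of the wrong length is rejected.
try:
    frame["bad"] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = str(exc)

# A Series is aligned on index labels instead; "one" gets NaN.
frame["debt"] = pd.Series([-1.2, -1.5], index=["two", "three"])

print(length_error)
print(frame)
```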
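Assigning to a row or column that does not exist creates a new row or column. Here too, you need to pass a sequence whose length matches the existing rows or columns.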
In [123]: frame
Out[123]:
column_name  year   state  pop
num
one            23      23    1
two          2001    Ohio    2
three        2002    Ohio    3
four         2001  Nevada    4
five         2002  Nevada    5
six          2003  Nevada    6

In [125]: frame['ID']=range(6)

In [126]: frame
Out[126]:
column_name  year   state  pop  ID
num
one            23      23    1   0
two          2001    Ohio    2   1
three        2002    Ohio    3   2
four         2001  Nevada    4   3
five         2002  Nevada    5   4
six          2003  Nevada    6   5

In [128]: frame.loc['one']=range(4)

In [129]: frame
Out[129]:
column_name  year   state  pop  ID
num
one             0       1    2   3
two          2001    Ohio    2   1
three        2002    Ohio    3   2
four         2001  Nevada    4   3
five         2002  Nevada    5   4
six          2003  Nevada    6   5
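The session above creates a new column, while its row assignment targets the existing row 'one'. As a sketch of the row case, assigning through loc to a label that is not yet in the index appends a new row. The label 'three' and the values are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {"year": [2000, 2001], "state": ["Ohio", "Ohio"]},
    index=["one", "two"],
)

# Assigning to a new column name creates the column.
frame["ID"] = range(2)

# Assigning through loc to an unknown row label creates the row.
# The list has one value per existing column: year, state, ID.
frame.loc["three"] = [2002, "Nevada", 2]

print(frame)
```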
Unlike a Python set, a pandas index can contain duplicate labels:
In [136]: f=pd.DataFrame(data,index=[1,1,3,4,5,6])

In [137]: f
Out[137]:
    state  year  pop
1    Ohio  2000  1.5
1    Ohio  2001  1.7
3    Ohio  2002  3.6
4  Nevada  2001  2.4
5  Nevada  2002  2.9
6  Nevada  2003  3.2

In [138]: f.loc[1]
Out[138]:
  state  year  pop
1  Ohio  2000  1.5
1  Ohio  2001  1.7
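A practical consequence of duplicate labels is that the return type of loc depends on the label: a duplicated label gives a DataFrame, a unique one gives a Series. The sketch below uses a smaller made-up frame and checks for duplicates with `Index.is_unique`.

```python
import pandas as pd

df = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2001]},
    index=[1, 1, 3],
)

# is_unique reports whether every label appears only once.
unique = df.index.is_unique

# Label 1 appears twice, so loc returns a two-row DataFrame.
dup_rows = df.loc[1]

# Label 3 appears once, so loc returns a Series for that single row.
single_row = df.loc[3]

print(unique, type(dup_rows).__name__, type(single_row).__name__)
```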
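Those are the basic methods and attributes of the two pandas data structures, Series and DataFrame. Later posts will cover other ways of working with them, along with more of what pandas offers for data processing and data analysis.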
Python Data Analysis (5): pandas data structures, Series and DataFrame
Source: https://www.cnblogs.com/xiaosanye/p/12012731.html