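pandas is essential for data analysis in Python.
As a workhorse of Python data analysis, pandas is known for being fast and efficient.
To make data easier to work with, pandas provides its own data structures: Series and DataFrame.
The usual imports are as follows (NumPy is imported as well, because the examples below use np):
import pandas as pd
import numpy as np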
Series
A Series can be thought of as a single column of data.
In [4]: s = pd.Series(np.random.randn(4), name='daily returns')
In [5]: s
Out[5]:
0 0.430271
1 0.617328
2 -0.265421
3 -0.836113
Name: daily returns
The index starts at zero, just like a list.
A Series is built on top of NumPy arrays, so it supports similar operations.
In [6]: s * 100
Out[6]:
0 43.027108
1 61.732829
2 -26.542104
3 -83.611339
Name: daily returns
In [7]: np.abs(s)
Out[7]:
0 0.430271
1 0.617328
2 0.265421
3 0.836113
Name: daily returns
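As a small self-contained sketch of the same two operations, the snippet below uses fixed values (the rounded numbers from the output above, rather than fresh random draws) so that the result is reproducible. The variable names scaled and absolute are only illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values instead of np.random.randn(4), so the result is reproducible.
s = pd.Series([0.430271, 0.617328, -0.265421, -0.836113], name="daily returns")

scaled = s * 100        # element-wise arithmetic, as with a NumPy array
absolute = np.abs(s)    # NumPy ufuncs accept a Series and return a Series

print(scaled)
print(absolute)
print(type(absolute), absolute.name)
```

Both results keep the original index and name, which is the main practical difference from working with a bare NumPy array.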
But a Series also has more advanced features. For example:
In [8]: s.describe()
Out[8]:
count 4.000000
mean -0.013484
std 0.667092
min -0.836113
25% -0.408094
50% 0.082425
75% 0.477035
max 0.617328
describe() returns summary statistics for the data.
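To make the summary less of a black box, here is a minimal sketch that recomputes each figure with the corresponding Series method and compares it with describe(). It uses the rounded values printed earlier as fixed input; the name manual is just an illustrative choice.

```python
import pandas as pd

s = pd.Series([0.430271, 0.617328, -0.265421, -0.836113], name="daily returns")

summary = s.describe()   # itself a Series, indexed by statistic name

# The same figures computed one at a time.
manual = pd.Series({
    "count": s.count(),
    "mean": s.mean(),
    "std": s.std(),              # sample standard deviation (ddof=1)
    "min": s.min(),
    "25%": s.quantile(0.25),
    "50%": s.median(),
    "75%": s.quantile(0.75),
    "max": s.max(),
})

print(summary)
print((summary - manual).abs().max())   # largest difference between the two
```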
A Series also supports more flexible indexes.
In [9]: s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
In [10]: s
Out[10]:
AMZN 0.430271
AAPL 0.617328
MSFT -0.265421
GOOG -0.836113
Name: daily returns
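Seen this way, a Series also resembles a dictionary, except that its values are expected to share the same type.
A Series supports some dictionary-like operations as well:
In [11]: s['AMZN']
Out[11]: 0.43027108469945924
In [12]: s['AMZN'] = 0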
In [13]: s
Out[13]:
AMZN 0.000000
AAPL 0.617328
MSFT -0.265421
GOOG -0.836113
Name: daily returns
In [14]: 'AAPL' in s
Out[14]: True
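The dictionary analogy can be pushed a little further. The sketch below builds a Series directly from a dict and uses a few dict-style calls; the numbers and the 'TSLA' label are made up for illustration.

```python
import pandas as pd

# Hypothetical values; building a Series from a dict uses the keys as the index.
s = pd.Series({"AMZN": 0.43, "AAPL": 0.62, "MSFT": -0.27, "GOOG": -0.84},
              name="daily returns")

s["AMZN"] = 0.0                 # assign by label, like a dict
has_aapl = "AAPL" in s          # membership tests the index labels, not the values
missing = s.get("TSLA", 0.0)    # like dict.get: returns the default for an absent label

print(list(s.keys()))
print(has_aapl, missing)
```

Note that `in` checks labels only, so testing for a value such as 0.62 this way would return False.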
DataFrame
If a Series is one column of data, a DataFrame is several columns.
Reading a csv file into a DataFrame is very convenient. Suppose we have the following csv file, test_pwt.csv:
"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"
The read_csv() function reads the csv file, and its contents become a DataFrame.
In [28]: df = pd.read_csv('data/test_pwt.csv')
In [29]: type(df)
Out[29]: pandas.core.frame.DataFrame
In [30]: df
Out[30]:
country country isocode year POP XRAT tcgdp cc cg
0 Argentina ARG 2000 37335.653 0.999500 295072.218690 75.716805 5.578804
1 Australia AUS 2000 19053.186 1.724830 541804.652100 67.759026 6.720098
2 India IND 2000 1006300.297 44.941600 1728144.374800 64.575551 14.072206
3 Israel ISR 2000 6114.570 4.077330 129253.894230 64.436451 10.266688
4 Malawi MWI 2000 11801.505 59.543808 5026.221784 74.707624 11.658954
5 South Africa ZAF 2000 45064.098 6.939830 227242.369490 72.718710 5.726546
6 United States USA 2000 282171.957 1.000000 9898700.000000 72.347054 6.032454
7 Uruguay URY 2000 3219.793 12.099592 25255.961693 78.978740 5.108068
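For readers who want to try read_csv() without creating a file, the following sketch feeds it two rows of the csv above through an in-memory buffer. read_csv() accepts file-like objects as well as paths; the variable name csv_text is an illustrative assumption.

```python
import io
import pandas as pd

# Two rows copied from test_pwt.csv above, held in a string instead of a file.
csv_text = '''"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
'''

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.dtypes)             # quoted numbers are still inferred as numeric columns
print(df.columns.tolist())
```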
A DataFrame can be sliced by row position, and the result is still a DataFrame:
In [13]: df[2:5]
Out[13]:
country country isocode year POP XRAT tcgdp cc cg
2 India IND 2000 1006300.297 44.941600 1728144.374800 64.575551 14.072206
3 Israel ISR 2000 6114.570 4.077330 129253.894230 64.436451 10.266688
4 Malawi MWI 2000 11801.505 59.543808 5026.221784 74.707624 11.658954
To select columns of a DataFrame, index it with a list of column names:
In [14]: df[['country', 'tcgdp']]
Out[14]:
country tcgdp
0 Argentina 295072.218690
1 Australia 541804.652100
2 India 1728144.374800
3 Israel 129253.894230
4 Malawi 5026.221784
5 South Africa 227242.369490
6 United States 9898700.000000
7 Uruguay 25255.961693
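One detail worth spelling out is the difference between indexing with a single name and with a list of names. The sketch below uses a three-row frame with values taken from the table above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["India", "Israel", "Malawi"],
    "POP": [1006300.297, 6114.570, 11801.505],
    "tcgdp": [1728144.3748, 129253.89423, 5026.2217836],
})

one_col = df["tcgdp"]                  # a single name returns a Series
sub_frame = df[["country", "tcgdp"]]   # a list of names returns a DataFrame
still_frame = df[["tcgdp"]]            # even a one-element list returns a DataFrame

print(type(one_col).__name__, type(sub_frame).__name__, type(still_frame).__name__)
```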
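To select specific rows and specific columns at the same time, use .loc (the older .ix indexer has been removed from recent pandas versions):
In [21]: df.loc[2:5, ['country', 'tcgdp']]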
Out[21]:
country tcgdp
2 India 1728144.374800
3 Israel 129253.894230
4 Malawi 5026.221784
5 South Africa 227242.369490
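The output above contains row 5, while the earlier df[2:5] slice stopped at row 4. A minimal sketch of why: label-based selection includes the end label, whereas positional slicing excludes the end position. The frame below reuses six rows from the table; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Argentina", "Australia", "India", "Israel", "Malawi", "South Africa"],
    "tcgdp": [295072.21869, 541804.6521, 1728144.3748,
              129253.89423, 5026.2217836, 227242.36949],
})

by_label = df.loc[2:5, ["country", "tcgdp"]]   # labels 2..5, end label included
by_position = df.iloc[2:5, [0, 1]]             # positions 2..4, end excluded
plain_slice = df[2:5]                          # positional row slice, end excluded

print(len(by_label), len(by_position), len(plain_slice))
```

With the default integer index the labels and positions coincide, which is why the two styles are easy to confuse.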
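The pop() method removes a column from a DataFrame and returns it as a Series. The output below shows only POP and tcgdp left afterwards, so df was presumably first reduced to the country, POP and tcgdp columns, for example with df = df[['country', 'POP', 'tcgdp']]:
In [34]: countries = df.pop('country')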
In [35]: type(countries)
Out[35]: pandas.core.series.Series
In [36]: countries
Out[36]:
0 Argentina
1 Australia
2 India
3 Israel
4 Malawi
5 South Africa
6 United States
7 Uruguay
Name: country
In [37]: df
Out[37]:
POP tcgdp
0 37335.653 295072.218690
1 19053.186 541804.652100
2 1006300.297 1728144.374800
3 6114.570 129253.894230
4 11801.505 5026.221784
5 45064.098 227242.369490
6 282171.957 9898700.000000
7 3219.793 25255.961693
In [38]: df.index = countries
In [39]: df
Out[39]:
POP tcgdp
country
Argentina 37335.653 295072.218690
Australia 19053.186 541804.652100
India 1006300.297 1728144.374800
Israel 6114.570 129253.894230
Malawi 11801.505 5026.221784
South Africa 45064.098 227242.369490
United States 282171.957 9898700.000000
Uruguay 3219.793 25255.961693
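The pop-then-assign pattern above modifies df in place. As an alternative sketch, set_index() gives the same country-indexed layout in one step and leaves the original frame untouched. The three rows are taken from the table; the name indexed is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Argentina", "Australia", "India"],
    "POP": [37335.653, 19053.186, 1006300.297],
    "tcgdp": [295072.21869, 541804.6521, 1728144.3748],
})

indexed = df.set_index("country")    # returns a new DataFrame; df is unchanged

print(indexed)
print(indexed.loc["India", "POP"])   # rows can now be looked up by country name
```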
To rename the columns of a DataFrame:
In [40]: df.columns = 'population', 'total GDP'
In [41]: df
Out[41]:
population total GDP
country
Argentina 37335.653 295072.218690
Australia 19053.186 541804.652100
India 1006300.297 1728144.374800
Israel 6114.570 129253.894230
Malawi 11801.505 5026.221784
South Africa 45064.098 227242.369490
United States 282171.957 9898700.000000
Uruguay 3219.793 25255.961693
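Assigning to df.columns replaces every name at once and depends on the column order. A sketch of an alternative is rename() with an explicit old-to-new mapping, which can also rename just a subset. The small frame and the name renamed are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"POP": [37335.653, 19053.186], "tcgdp": [295072.21869, 541804.6521]},
    index=pd.Index(["Argentina", "Australia"], name="country"),
)

# Map old names to new ones; columns not mentioned would keep their names.
renamed = df.rename(columns={"POP": "population", "tcgdp": "total GDP"})

print(renamed.columns.tolist())
print(df.columns.tolist())   # the original frame is unchanged
```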
To do arithmetic on a column:
In [66]: df['population'] = df['population'] * 1e3
In [67]: df
Out[67]:
population total GDP
country
Argentina 37335653 295072.218690
Australia 19053186 541804.652100
India 1006300297 1728144.374800
Israel 6114570 129253.894230
Malawi 11801505 5026.221784
South Africa 45064098 227242.369490
United States 282171957 9898700.000000
Uruguay 3219793 25255.961693
To create a new column from existing data:
In [74]: df['GDP percap'] = df['total GDP'] * 1e6 / df['population']
In [75]: df
Out[75]:
population total GDP GDP percap
country
Argentina 37335653 295072.218690 7903.229085
Australia 19053186 541804.652100 28436.433261
India 1006300297 1728144.374800 1717.324719
Israel 6114570 129253.894230 21138.672749
Malawi 11801505 5026.221784 425.896679
South Africa 45064098 227242.369490 5042.647686
United States 282171957 9898700.000000 35080.381854
Uruguay 3219793 25255.961693 7843.970620
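The two steps above (rescaling the population, then dividing) can also be written as one expression that leaves the input frame untouched. The sketch below uses assign() on two rows from the table. The units are an assumption inferred from the 1e3 and 1e6 factors in the text (POP in thousands, tcgdp in millions), and the column name gdp_percap is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"POP": [282171.957, 11801.505], "tcgdp": [9898700.0, 5026.2217836]},
    index=pd.Index(["United States", "Malawi"], name="country"),
)

# Assumed units, inferred from the scaling factors used in the text:
# POP is in thousands of people and tcgdp is in millions.
result = df.assign(
    population=lambda d: d["POP"] * 1e3,
    gdp_percap=lambda d: d["tcgdp"] * 1e6 / (d["POP"] * 1e3),
)

print(result)
```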
DataFrame has built-in plotting based on matplotlib:
In [76]: df['GDP percap'].plot(kind='bar')
Out[76]: <matplotlib.axes.AxesSubplot at 0x2f22ed0>
In [77]: import matplotlib.pyplot as plt
In [78]: plt.show()
Sorting:
In [83]: df = df.sort_values(by='GDP percap', ascending=False)  # sort by GDP percap in descending order
In [84]: df
Out[84]:
population total GDP GDP percap
country
United States 282171957 9898700.000000 35080.381854
Australia 19053186 541804.652100 28436.433261
Israel 6114570 129253.894230 21138.672749
Argentina 37335653 295072.218690 7903.229085
Uruguay 3219793 25255.961693 7843.970620
South Africa 45064098 227242.369490 5042.647686
India 1006300297 1728144.374800 1717.324719
Malawi 11801505 5026.221784 425.896679
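As a short sketch of the sorting options, the snippet below contrasts sorting by a column's values, taking only the top rows, and sorting by the index labels. The four figures are copied from the table above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"GDP percap": [7903.229085, 28436.433261, 1717.324719, 35080.381854]},
    index=pd.Index(["Argentina", "Australia", "India", "United States"], name="country"),
)

ranked = df.sort_values(by="GDP percap", ascending=False)   # returns a sorted copy
top2 = df.nlargest(2, "GDP percap")                         # only the two largest rows
by_name = df.sort_index(ascending=False)                    # orders by the index labels

print(ranked)
print(top2)
print(by_name)
```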
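Using online data
pandas can read data fetched online through the urllib library, so the user does not have to download the file by hand (in Python 3, urllib2 has become urllib.request).
import urllib.request
url = 'http://research.stlouisfed.org/fred2/series/UNRATE/downloaddata/UNRATE.csv'
source = urllib.request.urlopen(url)
data = pd.read_csv(source, index_col=0, parse_dates=True, header=None)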
In [71]: type(data)
Out[71]: pandas.core.frame.DataFrame
In [72]: data.head() # A useful method to get a quick look at a data frame
Out[72]:
1
0
DATE VALUE
1948-01-01 3.4
1948-02-01 3.8
1948-03-01 4.0
1948-04-01 3.9
In [73]: data.describe() # Your output might differ slightly
Out[73]:
1
count 786
unique 81
top 5.4
freq 31
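The count / unique / top / freq summary above is the form describe() uses for non-numeric columns. A likely cause, sketched here on a few inline rows rather than the live download, is that header=None makes the DATE,VALUE line be read as data, so the whole column is kept as text. The names csv_text, as_text and as_numbers are illustrative.

```python
import io
import pandas as pd

# A few rows in the same layout as the head() output above (inline, no download).
csv_text = "DATE,VALUE\n1948-01-01,3.4\n1948-02-01,3.8\n1948-03-01,4.0\n1948-04-01,3.9\n"

# header=None: the header line is treated as a data row, so the column holds text.
as_text = pd.read_csv(io.StringIO(csv_text), header=None, index_col=0)

# Default header handling: the first line names the columns and VALUE is numeric.
as_numbers = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(as_text.describe())      # count / unique / top / freq
print(as_numbers.describe())   # count / mean / std / min / quartiles / max
```

Dropping header=None is therefore one way to get numeric statistics and a proper date index from a file that has a header line.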
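Some data can also be fetched online directly, without urllib. The same data can be retrieved with DataReader, which used to live in pandas.io.data and is now provided by the separate pandas-datareader package:
In [77]: import pandas_datareader.data as web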
In [78]: import datetime as dt # Standard Python date / time library
In [79]: start, end = dt.datetime(2006, 1, 1), dt.datetime(2012, 12, 31)
In [80]: data = web.DataReader('UNRATE', 'fred', start, end)
In [81]: type(data)
Out[81]: pandas.core.frame.DataFrame
In [82]: data.plot()
Out[82]: <matplotlib.axes.AxesSubplot at 0xcf79390>
In [83]: import matplotlib.pyplot as plt
In [84]: plt.show()
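DataReader needs a network connection, but the effect of the start and end arguments can be imitated offline. The sketch below builds a synthetic monthly series (made-up numbers, not real UNRATE data) and selects the same 2006 to 2012 window by slicing the date index.

```python
import numpy as np
import pandas as pd

# Synthetic monthly values standing in for a downloaded series (not real UNRATE data).
dates = pd.date_range("2005-01-01", periods=120, freq="MS")   # month starts, 2005-01 to 2014-12
data = pd.DataFrame({"UNRATE": np.linspace(4.0, 6.0, len(dates))}, index=dates)

# Partial-string slicing on a DatetimeIndex; both end years are included.
window = data.loc["2006":"2012"]

print(window.index.min(), window.index.max(), len(window))
```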
pandas basics
Original article: http://blog.csdn.net/myjiayan/article/details/42805957