Python for Data Analysis - 01

Date: 2015-10-19 17:15:29

1. Load the usa.gov data and read it (time_zones is a list holding the tz values): first method

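import json
path = 'B:/test/ch02/usagov_bitly_data2012-03-16-1331923249.txt'
# Each line of the file is one JSON record
with open(path) as f:
    records = [json.loads(line) for line in f]
#print(records[0])
# Not every record has a 'tz' field, so keep only those that do
time_zones = [rec['tz'] for rec in records if 'tz' in rec]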
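
To see what the two list comprehensions do without the data file at hand, here is a minimal sketch that parses a few made-up records from an in-memory buffer. The sample lines and the name `sample` are illustrative assumptions and are not taken from the real usa.gov file.

```python
import io
import json

# Hypothetical sample standing in for the real file: one JSON object per line.
sample = io.StringIO(
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}\n'
    '{"a": "GoogleMaps/RochesterNY"}\n'
    '{"tz": "", "a": "Mozilla/4.0"}\n'
)

records = [json.loads(line) for line in sample]
# Records without a 'tz' key are skipped; empty strings are kept.
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
print(time_zones)
```

Note that a record with an empty tz string still lands in the list, which is why empty values show up in the counts later on.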

Define custom counting functions that count how often each value occurs in time_zones

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
    
from collections import defaultdict
def get_counts2(sequence):
    counts = defaultdict(int)  # missing keys start at 0
    for x in sequence:
        counts[x] += 1
    return counts
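
The second function needs no if/else because defaultdict(int) supplies 0 for any key it has not seen yet. A minimal sketch of that behavior, using a made-up list of letters:

```python
from collections import defaultdict

counts = defaultdict(int)  # int() returns 0, so unseen keys start at 0
for x in ['a', 'b', 'a']:
    counts[x] += 1

print(dict(counts))
```

One side effect worth knowing: merely reading a missing key, such as counts['z'], inserts that key with the value 0.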

For example, to count the occurrences of America/New_York in time_zones (coun is a dictionary):

coun = get_counts2(time_zones)
print(coun['America/New_York'])

Output the top 10: count_dict is a dictionary, and its items() method yields the (key, value) pairs

def count_value(count_dict,n=10):
    value_count_dict = [(count,tz) for tz,count in count_dict.items()]
    value_count_dict.sort()
    return value_count_dict[-n:]

print(count_value(coun,10))
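
As an alternative to sorting the whole list, the top entries can probably be picked with heapq.nlargest from the standard library. This is a sketch of a different technique, not the method used above; the helper name `top_counts` and the sample dictionary are hypothetical.

```python
import heapq

def top_counts(count_dict, n=10):
    # Hypothetical helper: build (count, key) pairs and keep the n largest.
    pairs = [(count, key) for key, count in count_dict.items()]
    return heapq.nlargest(n, pairs)

sample_counts = {'America/New_York': 3, 'Europe/London': 1, '': 2}
print(top_counts(sample_counts, 2))
```

Unlike count_value, which returns its pairs in ascending order, this variant returns them largest first.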

Load the usa.gov data and read it (time_zones is a list holding the tz values): second method

Use the Counter class from collections:

import json
from collections import Counter
path = 'B:/test/ch02/usagov_bitly_data2012-03-16-1331923249.txt'
with open(path) as f:
    records = [json.loads(line) for line in f]
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
# Counter tallies the values; most_common(10) returns the 10 most frequent
coun = Counter(time_zones)
print(coun.most_common(10))
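
A small sketch of what Counter and most_common return, run on a made-up list rather than the real data (the names `sample_zones` and `sample_counter` are illustrative):

```python
from collections import Counter

# Hypothetical sample: New York three times, empty string twice, London once.
sample_zones = ['America/New_York', '', 'America/New_York',
                'Europe/London', '', 'America/New_York']
sample_counter = Counter(sample_zones)
print(sample_counter.most_common(2))
```

The pairs come back as (value, count), most frequent first, which is the reverse of the (count, value) order used by the hand-written count_value function.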

2. Display the data with a pandas DataFrame

from pandas import DataFrame, Series
import pandas as pd
import numpy as np
# Build a DataFrame with one row per JSON record
frame = DataFrame(records)

# Show the first 10 tz values
print(frame['tz'][:10])

This prints a summary view of tz. The Series object frame['tz'] also provides a value_counts() method for counting:

print(frame['tz'][:10])

tz_counts = frame['tz'].value_counts()
print(tz_counts[:10])
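
A minimal sketch of value_counts on a hand-made Series, assuming pandas is installed. The sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample: one missing value (None) and one empty string.
sample_tz = pd.Series(['America/New_York', '', 'America/New_York', None])
sample_counts = sample_tz.value_counts()
print(sample_counts)
```

By default value_counts sorts by frequency in descending order and leaves missing values out, so the None entry is not counted here while the empty string is.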

3. Generate a plot with matplotlib

The fillna method replaces missing values (NaN); empty strings are then labeled Unknown

clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
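
The two cleaning steps treat different cases: fillna handles NaN, while the boolean mask handles empty strings. A small sketch on invented data, assuming pandas and NumPy are installed:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one NaN and one empty string.
sample_tz = pd.Series(['America/New_York', np.nan, '', 'America/New_York'])

clean = sample_tz.fillna('Missing')  # NaN -> 'Missing'
clean[clean == ''] = 'Unknown'       # empty string -> 'Unknown'
print(clean.value_counts())
```

Because fillna returns a new Series, the original sample_tz is left unchanged.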

Draw with plot: kind='bar' selects a bar chart, and rot is the rotation angle of the tick labels

tz_counts[:5].plot(kind='bar', rot=0)
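
A self-contained sketch of the same plot call on made-up counts, assuming pandas and matplotlib are installed. Selecting the Agg backend is an assumption made here so that the sketch runs without a display; in an interactive session it is normally not needed.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no window is opened
import pandas as pd

# Hypothetical counts standing in for tz_counts[:5].
sample_counts = pd.Series([5, 3, 2],
                          index=['America/New_York', 'Unknown', 'Missing'])
ax = sample_counts.plot(kind='bar', rot=0)  # returns a matplotlib Axes
print(len(ax.patches))  # one rectangle per bar
```

The returned Axes can be used afterwards, for example ax.figure.savefig('tz_counts.png') to write the chart to a file.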

4. Split the a field of each record and take the first token; frame.a.dropna() and frame['a'].dropna() are equivalent

results = Series([x.split()[0] for x in frame.a.dropna()])

For a Series, dropna returns a new Series that contains only the non-null values together with their index labels.
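
A short sketch of dropna followed by the first-token extraction, using invented user-agent strings and assuming pandas and NumPy are installed:

```python
import numpy as np
import pandas as pd

# Hypothetical user-agent values with one missing entry.
agents = pd.Series(['Mozilla/5.0 (Windows NT 6.1)', np.nan,
                    'GoogleMaps/RochesterNY'])

non_null = agents.dropna()
print(list(non_null.index))  # surviving rows keep their original labels

# Split on whitespace and keep the first token of each string.
first_tokens = pd.Series([x.split()[0] for x in non_null])
print(list(first_tokens))
```

The index of non_null keeps the original labels 0 and 2, while the rebuilt first_tokens Series gets a fresh index starting at 0.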

Classify the records by whether a contains Windows or not, and count each group

# Keep only the rows that have a user-agent string
cframe = frame[frame.a.notnull()]
operating_system = np.where(cframe.a.str.contains('Windows'),
                            'Windows', 'Not Windows')
print(operating_system[:5])
# Group by time zone and operating-system label, then count each group
by_tz_os = cframe.groupby(['tz', operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
print(agg_counts[:10])
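
The same group-and-reshape steps on a three-row frame make the shape of the result easier to see. This is a sketch on invented data, assuming pandas and NumPy are installed; the names `sample`, `os_label` and `table` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical records: two from New York, one from London.
sample = pd.DataFrame({
    'tz': ['America/New_York', 'America/New_York', 'Europe/London'],
    'a': ['Mozilla/5.0 (Windows NT 6.1)',
          'Mozilla/5.0 (Macintosh)',
          'Mozilla/5.0 (Windows NT 5.1)'],
})

# Label each row, then count rows per (tz, label) pair.
os_label = np.where(sample['a'].str.contains('Windows'),
                    'Windows', 'Not Windows')
table = sample.groupby(['tz', os_label]).size().unstack().fillna(0)
print(table)
```

unstack moves the operating-system label into the columns, and fillna(0) fills the combinations that never occur, such as Europe/London with Not Windows.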


Original article: http://www.cnblogs.com/groupe/p/4887746.html
