pandas Team Study: Task 9

Date: 2021-01-08 09:36:34

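I. The cat object

Attributes of the cat object

Use astype to convert an ordinary Series into a categorical one, for example:

import pandas as pd
s = pd.Series(['man','woman','child','man','child'])
s = s.astype('category')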
s
Out[49]: 
0      man
1    woman
2    child
3      man
4    child
dtype: category
Categories (3, object): ['child', 'man', 'woman']

Use cat.categories to view the categories:

s.cat.categories
Out[50]: Index(['child', 'man', 'woman'], dtype='object')

cat.ordered shows whether the categories are ordered:

s.cat.ordered
Out[59]: False

Each value can also be encoded as an integer code; the code is the position of the value's category in categories:

s.cat.codes
Out[60]: 
0    1
1    2
2    0
3    1
4    0
dtype: int8
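
To make the relationship between codes and categories concrete, here is a minimal sketch that looks each code up in categories and recovers the original values. The variable names (codes, categories, decoded, with_missing) are illustrative only.

```python
import pandas as pd

s = pd.Series(['man', 'woman', 'child', 'man', 'child']).astype('category')

# By default the categories are sorted: child=0, man=1, woman=2
codes = s.cat.codes
categories = s.cat.categories

# Looking each code up in categories recovers the original values
decoded = categories.take(codes.to_numpy())
print(list(decoded))

# A missing value has no category and is encoded as -1
with_missing = pd.Series(['a', None]).astype('category')
print(with_missing.cat.codes.tolist())
```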

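Adding, removing, and renaming categories

Adding:

Use add_categories to add a category (like the other methods in this section, it returns a new Series and leaves s unchanged):

s.cat.add_categories('oldman')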
Out[61]: 
0      man
1    woman
2    child
3      man
4    child
dtype: category
Categories (4, object): ['child', 'man', 'woman', 'oldman']

Removing:

Use remove_categories to delete a category; values that belonged to it become missing:

s.cat.remove_categories('child')
Out[64]: 
0      man
1    woman
2      NaN
3      man
4      NaN
dtype: category
Categories (2, object): ['man', 'woman']

Use remove_unused_categories to drop categories that do not appear in the Series:

s = s.cat.add_categories('oldman')
s = s.cat.remove_unused_categories()
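
The two lines above print nothing, so the following sketch shows the category list at each step. The names with_extra and cleaned are illustrative.

```python
import pandas as pd

s = pd.Series(['man', 'woman', 'child', 'man', 'child']).astype('category')

# 'oldman' is added as a category, but no value uses it
with_extra = s.cat.add_categories('oldman')
print(list(with_extra.cat.categories))

# remove_unused_categories drops it again; the values are untouched
cleaned = with_extra.cat.remove_unused_categories()
print(list(cleaned.cat.categories))
```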

Use set_categories to set the categories directly; values whose original category is not in the new list become missing:

s.cat.set_categories(['Sophomore','PhD'])
Out[65]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
dtype: category
Categories (2, object): ['Sophomore', 'PhD']
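
In the example above none of the old categories survive. As a complementary sketch, when the new list overlaps the old one, only the values of the dropped categories become missing. The name partial is illustrative.

```python
import pandas as pd

s = pd.Series(['man', 'woman', 'child', 'man', 'child']).astype('category')

# 'man' and 'woman' are kept, 'child' is dropped, 'oldman' is new
partial = s.cat.set_categories(['man', 'woman', 'oldman'])

print(partial.isna().tolist())         # only the former 'child' rows are missing
print(list(partial.cat.categories))
```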

Renaming:

Use rename_categories:

s.cat.rename_categories({'child':'old'})
Out[66]: 
0      man
1    woman
2      old
3      man
4      old
dtype: category
Categories (3, object): ['old', 'man', 'woman']
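
Besides a dict, rename_categories also accepts a list of new names (matched by position) or a callable applied to every category. A short sketch, with by_list and upper as illustrative names:

```python
import pandas as pd

s = pd.Series(['man', 'woman', 'child', 'man', 'child']).astype('category')

# A list renames positionally: child -> C, man -> M, woman -> W
by_list = s.cat.rename_categories(['C', 'M', 'W'])
print(list(by_list))

# A callable is applied to each category name
upper = s.cat.rename_categories(str.upper)
print(list(upper))
```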

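II. Ordered categories

Establishing an order

Use reorder_categories to turn an unordered categorical into an ordered one. The list passed in must contain exactly the existing categories, with none added and none omitted, and ordered=True must be specified:

s = pd.Series(['man','woman','child','man','child'])
s = s.astype('category')
s = s.cat.reorder_categories(['child', 'woman','man'],ordered=True)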
s
Out[68]: 
0      man
1    woman
2    child
3      man
4    child
dtype: category
Categories (3, object): ['child' < 'woman' < 'man']

Use as_unordered to convert an ordered categorical back to unordered:

s.cat.as_unordered()
Out[69]: 
0      man
1    woman
2    child
3      man
4    child
dtype: category
Categories (3, object): ['child', 'woman', 'man']
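
As an alternative route to the same result, the order can be declared up front with a CategoricalDtype instead of converting and then reordering. This is a sketch of that option, not the approach used in the text; the names dtype and ordered_s are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the categories and their order in one step
dtype = CategoricalDtype(categories=['child', 'woman', 'man'], ordered=True)
ordered_s = pd.Series(['man', 'woman', 'child', 'man', 'child']).astype(dtype)

print(ordered_s.cat.ordered)
# min and max follow the declared order, not alphabetical order
print(ordered_s.min(), ordered_s.max())
```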

Sorting and comparison

Sorting:

After reorder_categories has established the order, sort_values sorts the values by category order rather than alphabetically:

s.sort_values()
Out[71]: 
2    child
4    child
1    woman
0      man
3      man
dtype: category
Categories (3, object): ['child' < 'woman' < 'man']

Use sort_index to sort by index:

s.sort_index()
Out[72]: 
0      man
1    woman
2    child
3      man
4    child
dtype: category
Categories (3, object): ['child' < 'woman' < 'man']

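Comparison:

== and != work on any categorical, while >, <, >= and <= require an ordered categorical and follow the category order, for example:

s == 'child'			# == comparison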
Out[73]: 
0    False
1    False
2     True
3    False
4     True
dtype: bool
	
s > 'child'				# > comparison
Out[74]: 
0     True
1     True
2    False
3     True
4    False
dtype: bool
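
The sketch below contrasts the two cases: on an unordered categorical, == works but > raises a TypeError, while the ordered version compares by category order. The names unordered, ordered and raised are illustrative.

```python
import pandas as pd

unordered = pd.Series(['man', 'woman', 'child']).astype('category')
ordered = unordered.cat.reorder_categories(['child', 'woman', 'man'], ordered=True)

# Equality is always allowed
print((unordered == 'child').tolist())

# Order comparisons on an unordered categorical raise TypeError
try:
    unordered > 'child'
    raised = False
except TypeError:
    raised = True
print(raised)

# With an order defined, > follows child < woman < man
print((ordered > 'child').tolist())
```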

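III. Interval categories

Constructing intervals with cut and qcut

Numeric values can be binned into intervals, mainly with the cut and qcut functions.

1). cut

The first argument is the sequence to bin
bins: the bins, given either as an integer (the number of equal-width bins) or as a list of cut points
right: defaults to True, meaning intervals are open on the left and closed on the right
labels: the names of the intervals
retbins: whether to also return the cut points (not returned by default)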

When bins is an integer:

s = pd.Series([1,2])
pd.cut(s, bins=2)
Out[75]: 
0    (0.999, 1.5]
1      (1.5, 2.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 1.5] < (1.5, 2.0]]

bins can also be a list of cut points:

pd.cut(s, bins=[-1,1.5,2,3])
Out[77]: 
0    (-1.0, 1.5]
1     (1.5, 2.0]
dtype: category
Categories (3, interval[float64]): [(-1.0, 1.5] < (1.5, 2.0] < (2.0, 3.0]]

Naming the intervals and returning the cut points:

res = pd.cut(s, bins=2, labels=['small', 'big'], retbins=True)

res[0]
Out[79]: 
0    small
1      big
dtype: category
Categories (2, object): ['small' < 'big']

res[1] 
Out[80]: array([0.999, 1.5  , 2.   ])
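
The right parameter is not exercised in the examples above. As a small sketch, the same cut points place a boundary value in different bins depending on which side is closed; right_closed and left_closed are illustrative names.

```python
import pandas as pd

s = pd.Series([1, 2])
bins = [0, 1, 2, 3]

# Default right=True: intervals look like (a, b], so 1 falls in (0, 1]
right_closed = pd.cut(s, bins=bins)

# right=False: intervals look like [a, b), so 1 falls in [1, 2)
left_closed = pd.cut(s, bins=bins, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```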

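2). qcut

qcut simply replaces the bins parameter with q. When q is an integer n, the data is binned at the n-quantiles; a list of floats can also be passed to give the quantile cut points directly.

When q is an integer: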

s = pd.Series([1,2,3,4,5,6])
pd.qcut(s,q=2)
Out[84]: 
0    (0.999, 3.5]
1    (0.999, 3.5]
2    (0.999, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 3.5] < (3.5, 6.0]]

When q is a list, it should run from 0 to 1; otherwise values outside the covered quantile range become missing:

pd.qcut(s,q=[0.1,0.5])
Out[86]: 
0             NaN
1    (1.499, 3.5]
2    (1.499, 3.5]
3             NaN
4             NaN
5             NaN
dtype: category
Categories (1, interval[float64]): [(1.499, 3.5]]

pd.qcut(s,q=[0,0.1,0.5,1])
Out[87]: 
0    (0.999, 1.5]
1      (1.5, 3.5]
2      (1.5, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (3, interval[float64]): [(0.999, 1.5] < (1.5, 3.5] < (3.5, 6.0]]
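
To illustrate how cut and qcut differ, the sketch below bins a small skewed sample (data chosen for illustration) both ways: cut makes bins of equal width, qcut makes bins holding equal numbers of values.

```python
import pandas as pd

skewed = pd.Series([1, 2, 3, 4, 5, 100])

# Equal width: the outlier stretches the range, so most values share one bin
equal_width = pd.cut(skewed, bins=2)

# Equal frequency: the split point is the median, so each bin gets half
equal_freq = pd.qcut(skewed, q=2)

print(equal_width.value_counts().tolist())
print(equal_freq.value_counts().tolist())
```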

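Constructing general intervals

An interval can be built with Interval. It has three elements: the left endpoint, the right endpoint, and which endpoints are closed, given as one of right, left, both, neither:

my_interval = pd.Interval(0, 1, 'right')

In [50]: my_interval
Out[50]: Interval(0, 1, closed='right')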
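
A brief sketch of what the closed side means in practice: membership tests with in, the mid and length attributes, and overlaps against a neighbouring interval. The neighbouring intervals are chosen only for illustration.

```python
import pandas as pd

my_interval = pd.Interval(0, 1, 'right')      # the interval (0, 1]

print(0 in my_interval)    # the left endpoint is open
print(1 in my_interval)    # the right endpoint is closed
print(my_interval.mid, my_interval.length)

# Intervals overlap only if they share a point that both include
touches_closed = my_interval.overlaps(pd.Interval(1, 2, 'left'))    # [1, 2)
touches_open = my_interval.overlaps(pd.Interval(1, 2, 'right'))     # (1, 2]
print(touches_closed, touches_open)
```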

A pd.IntervalIndex can be created in four ways: the class methods from_breaks, from_arrays and from_tuples, and the top-level function pd.interval_range. Each suits a different situation:

from_breaks: pass the break points directly

pd.IntervalIndex.from_breaks([1,3,6,10], closed='both')
Out[54]: 
IntervalIndex([[1, 3], [3, 6], [6, 10]],
              closed='both',
              dtype='interval[int64]')

from_arrays: pass the lists of left endpoints and right endpoints separately:

pd.IntervalIndex.from_arrays(left = [1,3,6,10], right = [5,4,9,11], closed ='neither')
Out[55]: 
IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

from_tuples: pass a list of (start, end) tuples:

pd.IntervalIndex.from_tuples([(1,5),(3,4),(6,9),(10,11)],closed='neither')
Out[56]: 
IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

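interval_range takes start, end, periods and freq (the start, the end, the number of intervals, and the interval length), any three of which determine the fourth:

Passing the number of intervals: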

pd.interval_range(start=1,end=5,periods=8)
Out[57]: 
IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')

Passing the interval length:

pd.interval_range(end=5,periods=8,freq=0.5)
Out[58]: 
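IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')

Attributes and methods of intervals

IntervalIndex has several commonly used attributes: left, right, mid and length, giving the left and right endpoints, the midpoint of the two endpoints, and the interval length.
IntervalIndex also has two commonly used methods, contains and overlaps, which check element-wise whether each interval contains a given value and whether it overlaps a given pd.Interval object.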
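
The attributes and methods just listed can be tried on the from_breaks index built earlier. A minimal sketch, with idx and probe as illustrative names:

```python
import pandas as pd

idx = pd.IntervalIndex.from_breaks([1, 3, 6, 10], closed='both')   # [1, 3], [3, 6], [6, 10]

print(idx.left.tolist(), idx.right.tolist())
print(idx.mid.tolist(), idx.length.tolist())

# contains: element-wise test of whether each interval contains the value.
# Both ends are closed, so 3 belongs to the first two intervals.
print(idx.contains(3).tolist())

# overlaps: element-wise test against one pd.Interval
probe = pd.Interval(5, 7, closed='both')
print(idx.overlaps(probe).tolist())
```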

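IV. Exercises

Ex1: Counting categories that do not appear

My answer:

The dropna parameter defaults to True, in which case categories that do not appear are not shown; when it is set to False, they are shown as well.

The idea is to take the categories of the row variable and the column variable, count for each pair the elements that belong to both, and finally, when dropna is True, drop the rows and columns that are all zero.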

import numpy as np
import pandas as pd

def my_crosstab(A, B, dropna=True):
    A = A.astype('category')
    B = B.astype('category')
    index1 = A.cat.categories
    index2 = B.cat.categories
    n = len(index1)
    m = len(index2)							# number of categories
    data = np.zeros([n, m])
    for i in range(n):
        for j in range(m):
            # count the elements that belong to both categories
            data[i, j] = sum((A == index1[i]) & (B == index2[j]))
    df = pd.DataFrame(data, index=index1, columns=index2)
    if dropna:								# drop all-zero rows and columns
        df = df.drop(df.index[(df == 0).all(axis=1)])
        df = df.drop(df.columns[(df == 0).all(axis=0)], axis=1)
    return df

Test with dropna=True:

import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['a','b','c','a'],'B':['cat','cat','dog','cat']})
res = my_crosstab(df.A,df.B)
res
Out[188]: 
   cat  dog
a  2.0  0.0
b  1.0  0.0
c  0.0  1.0

Test with dropna=False:

df = pd.DataFrame({'A':['a','b','c','a'],'B':['cat','cat','dog','cat']})
df.B = df.B.astype('category').cat.add_categories('sheep')
res = my_crosstab(df.A,df.B,dropna = False)
res
Out[191]: 
   cat  dog  sheep
a  2.0  0.0    0.0
b  1.0  0.0    0.0
c  0.0  1.0    0.0
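
As a possible alternative to the double loop, the counts can be accumulated directly from the integer codes introduced in section I. This is only a sketch: the function name crosstab_from_codes is hypothetical, it covers only the dropna=False behaviour, and it returns integer counts rather than floats.

```python
import numpy as np
import pandas as pd

def crosstab_from_codes(A, B):
    """Count co-occurrences of two categorical Series, keeping unused categories."""
    A = A.astype('category')
    B = B.astype('category')
    a_codes = A.cat.codes.to_numpy()
    b_codes = B.cat.codes.to_numpy()
    counts = np.zeros((len(A.cat.categories), len(B.cat.categories)), dtype=int)
    # Missing values are coded as -1; skip them so they do not index the last row/column
    valid = (a_codes >= 0) & (b_codes >= 0)
    # np.add.at accumulates correctly when the same (row, column) pair repeats
    np.add.at(counts, (a_codes[valid], b_codes[valid]), 1)
    return pd.DataFrame(counts, index=A.cat.categories, columns=B.cat.categories)

df = pd.DataFrame({'A': ['a', 'b', 'c', 'a'], 'B': ['cat', 'cat', 'dog', 'cat']})
df['B'] = df['B'].astype('category').cat.add_categories('sheep')
res = crosstab_from_codes(df['A'], df['B'])
print(res)
```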

EX.2

Source: https://www.cnblogs.com/zwrAI/p/14249344.html
