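Handling imbalanced class distributions (dealing with overfitting and underfitting)
What is an imbalanced class distribution
The harm caused by an imbalanced class distribution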
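One commonly cited harm is that plain accuracy becomes misleading. The following is a minimal sketch using only the standard library; the always-predict-0 "model" is a hypothetical stand-in, not something from the original article, and the 95:5 label split mirrors the data this article generates.

```python
# Labels with a 95:5 split, matching the data generated in this article
y_true = [0] * 95 + [1] * 5

# Hypothetical "model" that ignores its input and always predicts the majority class
y_pred = [0] * len(y_true)

# Overall accuracy: fraction of labels predicted correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true 1s that were predicted as 1
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```

The accuracy looks high even though not a single minority sample is detected, which is why the classes are usually rebalanced before training.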
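# Generate the data source
import pandas as pd
import numpy as np

x = np.random.randint(0, 100, size=(100, 3))
# 95 labels of class 0 (randint(0, 1) always returns 0) plus 5 labels of class 1
y = pd.Series(data=np.random.randint(0, 1, size=(95,)))
# Series.append was removed in pandas 2.0, so pd.concat is used instead
y = pd.concat([y, pd.Series(data=[1, 1, 1, 1, 1])], ignore_index=True).values
y = y.reshape((-1, 1))
all_data_np = np.concatenate((x, y), axis=1)
np.random.shuffle(all_data_np)
df = pd.DataFrame(all_data_np)
df.head()
df.shape  # (100, 4)
df[3].value_counts()  # the classes are imbalanced
# 0    95
# 1     5
# Name: 3, dtype: int64

X = df[[0, 1, 2]]
y = df[3]

# Oversample the minority class with SMOTE
from imblearn.over_sampling import SMOTE
s = SMOTE(k_neighbors=3)
# fit_sample was removed from imbalanced-learn; fit_resample is the current name
feature, target = s.fit_resample(X, y)  # returns a tuple
feature.shape  # the original 100 rows grow to 190
# (190, 3)
target.shape  # (190,)
target.value_counts()
# 1    95
# 0    95
# Name: 3, dtype: int64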
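To make the SMOTE step above less of a black box, here is a simplified sketch of the core idea: each synthetic point is placed on the line segment between a minority sample and one of its k nearest minority neighbors. This is an illustration under stated assumptions, not imbalanced-learn's actual implementation; the function name `smote_like_oversample`, its parameters, and the five sample points are all hypothetical.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base, excluding base itself
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Five made-up minority samples with three features each
minority = [(10, 20, 30), (12, 22, 28), (15, 18, 33), (11, 25, 31), (14, 21, 29)]

# 90 synthetic points bring the minority class from 5 up to 95 samples
new_points = smote_like_oversample(minority, n_new=90, k=3)
print(len(minority) + len(new_points))  # 95
```

Because every new point is an interpolation, it always stays inside the region spanned by the existing minority samples, and with only 5 minority samples `k` has to be smaller than 5, which is consistent with the `k_neighbors=3` setting used above.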
Balancing the samples by reducing the number of majority-class samples (this may discard a large amount of sample data)
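from imblearn.under_sampling import RandomUnderSampler
r = RandomUnderSampler()  # unlike SMOTE, there is no k_neighbors parameter here
# fit_sample was removed from imbalanced-learn; fit_resample is the current name
a, b = r.fit_resample(X, y)
a.shape  # (10, 3)
b.shape  # (10,)
b.value_counts()
# 1    5
# 0    5
# Name: 3, dtype: int64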
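What random undersampling does can be sketched by hand with the standard library alone. This is a minimal illustration, not the library's code; `random_undersample` and the placeholder rows are hypothetical, and the sketch assumes every class is cut down to the size of the smallest class.

```python
import random

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so that every class keeps only as many rows
    as the smallest class has (hypothetical helper)."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_keep = min(len(idxs) for idxs in by_class.values())
    # Sample n_keep indices per class without replacement
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_keep))
    kept.sort()  # keep the surviving rows in their original order
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Same 95:5 label split as in this article, with placeholder feature rows
labels = [0] * 95 + [1] * 5
rows = [(i, i + 1, i + 2) for i in range(100)]

rows_bal, labels_bal = random_undersample(rows, labels)
print(len(rows_bal), labels_bal.count(0), labels_bal.count(1))  # 10 5 5
```

Here 90 of the 95 majority rows are thrown away, which makes the data-loss risk noted in the heading concrete.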
Original article: https://www.cnblogs.com/wgwg/p/13386899.html