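1 Join the training set and the test set row-wise with pd.concat (DataFrame.append did this in older pandas, but it was removed in pandas 2.0). Prerequisite: drop the target column from the training set so that it has the same columns as the test set
all_data_predictors = pd.concat([home_data_predictors, test_data_predictors])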
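A minimal sketch of this step on toy data. The frame contents and the target column name "SalePrice" are assumptions made up for illustration, not taken from the note above.

```python
import pandas as pd

# Toy stand-ins for the real data; "SalePrice" is an assumed target name.
home_data = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street": ["Pave", "Grvl"],
    "SalePrice": [208500, 181500],
})
test_data_predictors = pd.DataFrame({"LotArea": [11622], "Street": ["Pave"]})

# Drop the target so both frames have exactly the same columns.
home_data_predictors = home_data.drop(columns=["SalePrice"])

# Stack the rows; the original index labels are kept (0, 1, 0 here).
all_data_predictors = pd.concat([home_data_predictors, test_data_predictors])
print(all_data_predictors.shape)  # (3, 2)
```

Note that the combined frame keeps the old index labels, so they repeat; passing ignore_index=True to pd.concat gives a fresh 0..n-1 index instead.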
2 To pull only the object columns or only the numeric columns out of a dataset into a single-type dataset, use df.select_dtypes()
object_all_data_predictors = all_data_predictors.select_dtypes(include=['object'])
numeric_all_data_predictors = all_data_predictors.select_dtypes(include=['number'])
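As a small illustration, select_dtypes also accepts exclude, so a frame can be split into a numeric part and "everything else" without naming the non-numeric dtype. The toy columns below are assumptions, and treating every non-numeric column as categorical is a simplification.

```python
import pandas as pd

# Toy frame; column names are made up for illustration.
df = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street": ["Pave", "Grvl"],
    "YearBuilt": [2003, 1976],
})

# Numeric columns only.
numeric_df = df.select_dtypes(include=["number"])
# Everything that is not numeric (here: the text column).
non_numeric_df = df.select_dtypes(exclude=["number"])

print(numeric_df.columns.tolist())      # ['LotArea', 'YearBuilt']
print(non_numeric_df.columns.tolist())  # ['Street']
```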
3 Show which columns of df contain missing values
missing_val_count_by_column = (df.isnull().sum())
print("Columns of df with missing values:")
print(missing_val_count_by_column[missing_val_count_by_column > 0])
4 Get a list of the columns of df that contain missing values
cols_with_missing_in_df = [col for col in df.columns if df[col].isnull().any()]  # a list comprehension, a common Python idiom
print("Length of cols_with_missing_in_df: " + str(len(cols_with_missing_in_df)))
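To make tips 3 and 4 concrete, here is a sketch that runs both on a tiny frame. The data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: two columns contain missing values, one does not.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "Alley": [None, None, "Grvl"],
    "LotArea": [8450, 9600, 11250],
})

# Tip 3: per-column count of missing values, keeping only the non-zero counts.
missing_val_count_by_column = df.isnull().sum()
missing_only = missing_val_count_by_column[missing_val_count_by_column > 0]

# Tip 4: just the names of the columns that have at least one missing value.
cols_with_missing_in_df = [col for col in df.columns if df[col].isnull().any()]
print(cols_with_missing_in_df)  # ['LotFrontage', 'Alley']
```

The Series from tip 3 is handy for inspection, while the plain list from tip 4 is the form that can be passed to df.drop(columns=...) or used to index df.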
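5 SimpleImputer returns a numpy array after imputing df, so the result has to be converted back into a df
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer(strategy='most_frequent')
imputed_df = pd.DataFrame(my_imputer.fit_transform(df), columns=df.columns)  # restore the column names lost in the array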
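A sketch of the round trip, assuming scikit-learn is installed. The toy data and the non-default index are assumptions chosen to show that both column names and index labels can be carried over when rebuilding the frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a non-default index, as might be left behind by earlier filtering.
df = pd.DataFrame(
    {"LotFrontage": [65.0, np.nan, 65.0, 80.0], "GarageCars": [2.0, 2.0, np.nan, 1.0]},
    index=[10, 11, 12, 13],
)

my_imputer = SimpleImputer(strategy="most_frequent")
imputed_array = my_imputer.fit_transform(df)  # plain numpy array, labels are gone

# Rebuild the frame with the original column names and index.
# Caveat: by default the imputer drops columns that are entirely NaN, in which
# case the array would have fewer columns than df.columns and this would fail.
imputed_df = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)
print(imputed_df.isnull().sum().sum())  # 0
```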
6 Filter the columns of df: get a list of the columns whose dtype is object and that have fewer than 10 distinct values
low_cardinality_cols = [cname for cname in df.columns if
                        df[cname].nunique() < 10 and
                        df[cname].dtype == "object"]
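7 Concatenate the training set and the test set first and only then apply one-hot encoding; otherwise the two sets may end up with different numbers of columns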
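The reason is easiest to see on a toy example. The sketch below uses pd.get_dummies as the encoder, which is an assumption (the note does not say which encoder it has in mind), and invented data in which one category never occurs in the test set.

```python
import pandas as pd

train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
test = pd.DataFrame({"Street": ["Pave", "Pave"]})  # "Grvl" never appears here

# Encoding each set on its own yields different column sets.
train_cols_alone = pd.get_dummies(train).columns.tolist()
test_cols_alone = pd.get_dummies(test).columns.tolist()
print(train_cols_alone)  # ['Street_Grvl', 'Street_Pave']
print(test_cols_alone)   # ['Street_Pave']

# Encoding the combined frame yields one shared column set for both parts.
all_data = pd.concat([train, test], ignore_index=True)
encoded = pd.get_dummies(all_data)
encoded_train = encoded.iloc[:len(train), :]
encoded_test = encoded.iloc[len(train):, :]
print(encoded_test.columns.tolist())  # ['Street_Grvl', 'Street_Pave']
```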
8 Before joining two dfs side by side (concat with axis=1), make their indexes match; this needs reset_index()
df1.reset_index(inplace=True, drop=True)
df2.reset_index(inplace=True, drop=True)
df3 = pd.concat([df1, df2], axis=1)
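Why the reset matters: with axis=1, pd.concat aligns rows by index label, not by position. A small sketch with invented frames whose indexes do not overlap:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"b": [3, 4]}, index=[5, 6])  # e.g. rows left over after filtering

# Without a reset, rows are matched on index labels: 4 rows, half the cells NaN.
misaligned = pd.concat([df1, df2], axis=1)
print(misaligned.shape)  # (4, 2)

# After resetting both indexes, rows pair up by position.
aligned = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1
)
print(aligned.shape)  # (2, 2)
```

This version uses the non-inplace form of reset_index, which leaves df1 and df2 themselves unchanged; the inplace form has the same effect on the concatenated result.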
9 Split a df by row position (here, take the first 1460 rows)
df2 = df1.iloc[0:1460,:]
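A sketch of splitting a combined frame back into its two parts. It assumes the hard-coded 1460 above is the number of training rows, and derives that number from the training frame instead; the toy frames are invented.

```python
import pandas as pd

train = pd.DataFrame({"x": [1, 2, 3]})
test = pd.DataFrame({"x": [4, 5]})
all_data = pd.concat([train, test], ignore_index=True)

# iloc slices are half-open: rows 0..n_train-1 go to the first part.
n_train = len(train)
train_part = all_data.iloc[:n_train, :]
# The second part starts at position n_train; reset its index to start from 0.
test_part = all_data.iloc[n_train:, :].reset_index(drop=True)

print(train_part["x"].tolist())  # [1, 2, 3]
print(test_part["x"].tolist())   # [4, 5]
```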
Original post: https://www.cnblogs.com/wangzhonghan/p/10488725.html