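For students just getting started with machine learning, Kaggle competitions are a major platform for learning and for trading ideas with competitors from around the world. The platform hosts several entry-level competitions where beginners can try out their skills.
This article works through the house price prediction competition, covering EDA, feature engineering, and model tuning, and shares some small tricks along the way. I hope it is useful; my own background is modest, so corrections and criticism are welcome.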
import pandas as pd
import numpy as np
!ls
data_description.txt
data_description.zip
kaggle house price.ipynb
sample_submission.csv
stacking-house-prices-walkthrough-to-top-5.ipynb
test.csv
train.csv
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
print(train.shape)
print(test.shape)
(1460, 81)
(1459, 80)
with open('data_description.txt', 'r') as f:
    # Print only the first line of the data description file
    for line in f:
        print(line)
        break
MSSubClass: Identifies the type of dwelling involved in the sale.
Here's a brief version of what you'll find in the data description file.
SaleCondition: Condition of sale
Start with the feature GrLivArea: above grade (ground) living area in square feet, i.e. the living area above ground level.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style='white', context='notebook', palette='deep')
plt.subplots(figsize=(15,8))
plt.subplot(1,2,1)
g= sns.regplot(x=train['GrLivArea'],y= train['SalePrice'],fit_reg=False).set_title('Before')
plt.subplot(1,2,2)
train= train.drop(train[train['GrLivArea']>4000].index)
g=sns.regplot(x=train['GrLivArea'],y=train['SalePrice'],fit_reg=False).set_title('After')
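The outlier filter used above can be tried on a small frame first. This is a minimal sketch; the `toy` data is made up for illustration and is not taken from train.csv.

```python
import pandas as pd

# Hypothetical toy data: the last two rows are very large houses
toy = pd.DataFrame({
    "GrLivArea": [1500, 1800, 2200, 4700, 5600],
    "SalePrice": [180000, 210000, 260000, 185000, 160000],
})

# Same pattern as above: collect the index of rows over the threshold, then drop them
outlier_idx = toy[toy["GrLivArea"] > 4000].index
toy_clean = toy.drop(outlier_idx)

print(len(toy), "->", len(toy_clean))
```

Dropping by index leaves gaps in the index, which is why the combined frame is later rebuilt with `reset_index(drop=True)`.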
# Save the sizes of the training and test sets for later use
ntrain = train.shape[0]
ntest = test.shape[0]
# Save the training target values, i.e. SalePrice
y_train = train.SalePrice.values
# sort=True makes the alphabetical column order explicit and avoids the pandas FutureWarning
all_data = pd.concat((train, test), sort=True).reset_index(drop=True)
all_data.drop(['SalePrice'],axis=1,inplace=True)
all_data.drop(['Id'],axis=1,inplace=True)
print('all data shape:{}'.format(all_data.shape))
all data shape:(2915, 79)
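The saved training-set size is presumably what lets the combined frame be split back into its two parts after feature engineering. A minimal sketch of that round trip, using hypothetical stand-in frames (`toy_train`, `toy_test`) rather than the real files:

```python
import pandas as pd

# Hypothetical stand-ins for train and test
toy_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})
toy_test = pd.DataFrame({"Id": [4, 5], "LotArea": [9550, 14260]})

n_toy_train = toy_train.shape[0]
toy_all = pd.concat((toy_train, toy_test), sort=True).reset_index(drop=True)
toy_all = toy_all.drop(["SalePrice", "Id"], axis=1)

# ... feature engineering on toy_all would go here ...

# Slice back by position: the first n_toy_train rows came from the training set
X_train = toy_all[:n_toy_train]
X_test = toy_all[n_toy_train:]
```

This only works if no rows are added, removed, or reordered between the concat and the split.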
import sys
all_data_na = all_data.isnull().sum()
all_data_na.sort_values(ascending=False)
PoolQC 2907
MiscFeature 2810
Alley 2717
Fence 2345
FireplaceQu 1420
LotFrontage 486
GarageFinish 159
GarageQual 159
GarageYrBlt 159
GarageCond 159
GarageType 157
BsmtCond 82
BsmtExposure 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
MasVnrType 24
MasVnrArea 23
MSZoning 4
BsmtHalfBath 2
Utilities 2
Functional 2
BsmtFullBath 2
Electrical 1
Exterior2nd 1
KitchenQual 1
GarageCars 1
Exterior1st 1
GarageArea 1
TotalBsmtSF 1
...
GrLivArea 0
YearRemodAdd 0
YearBuilt 0
WoodDeckSF 0
TotRmsAbvGrd 0
Street 0
ScreenPorch 0
SaleCondition 0
RoofStyle 0
RoofMatl 0
PoolArea 0
PavedDrive 0
OverallQual 0
OverallCond 0
OpenPorchSF 0
Neighborhood 0
MoSold 0
MiscVal 0
MSSubClass 0
LowQualFinSF 0
LotShape 0
LotConfig 0
LotArea 0
LandSlope 0
LandContour 0
KitchenAbvGr 0
HouseStyle 0
HeatingQC 0
Heating 0
1stFlrSF 0
Length: 79, dtype: int64
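Raw counts are easier to compare when expressed as a share of all rows. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = toy.isnull().mean() * 100
# Keep only columns that actually have missing values, largest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```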
all_data_na = all_data_na.drop(all_data_na[all_data_na==0].index).sort_values(ascending=False)
plt.subplots(figsize=(12,6))
all_data_na.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x128568710>
!pip install xgboost
train[all_data_na.index[:25]].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1456 entries, 0 to 1459
Data columns (total 25 columns):
PoolQC 5 non-null object
MiscFeature 54 non-null object
Alley 91 non-null object
Fence 280 non-null object
FireplaceQu 766 non-null object
LotFrontage 1197 non-null float64
GarageQual 1375 non-null object
GarageCond 1375 non-null object
GarageFinish 1375 non-null object
GarageYrBlt 1375 non-null float64
GarageType 1375 non-null object
BsmtExposure 1418 non-null object
BsmtCond 1419 non-null object
BsmtQual 1419 non-null object
BsmtFinType2 1418 non-null object
BsmtFinType1 1419 non-null object
MasVnrType 1448 non-null object
MasVnrArea 1448 non-null float64
MSZoning 1456 non-null object
BsmtFullBath 1456 non-null int64
BsmtHalfBath 1456 non-null int64
Utilities 1456 non-null object
Functional 1456 non-null object
Electrical 1455 non-null object
BsmtUnfSF 1456 non-null int64
dtypes: float64(3), int64(3), object(19)
memory usage: 295.8+ KB
# Categorical features where NaN means the house lacks the feature: fill with "None"
for col in ("PoolQC", 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageQual', 'GarageCond',
            'GarageFinish', 'GarageType','BsmtExposure','BsmtCond','BsmtQual','BsmtFinType2','BsmtFinType1',
            'MasVnrType'):
    all_data[col] = all_data[col].fillna('None')
print('Object-type missing values filled with "None" per the feature descriptions: done')
# Numeric features where NaN means the house lacks the feature: fill with 0
for col in ("GarageYrBlt", "GarageArea", "GarageCars", "BsmtFinSF1",
            "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "MasVnrArea",
            "BsmtFullBath", "BsmtHalfBath"):
    all_data[col] = all_data[col].fillna(0)
print('Numeric missing values filled with 0 per the feature descriptions: done')
# Categorical features with only a few missing values: fill with the mode
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
all_data["Functional"] = all_data["Functional"].fillna(all_data['Functional'].mode()[0])
print('Categorical features with few missing values filled with the mode: done')
all_data_na = all_data.isnull().sum()
print("Features with missing values: ", all_data_na.drop(all_data_na[all_data_na == 0].index))
Object-type missing values filled with "None" per the feature descriptions: done
Numeric missing values filled with 0 per the feature descriptions: done
Categorical features with few missing values filled with the mode: done
Features with missing values: LotFrontage 486
Utilities 2
dtype: int64
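The three filling strategies above differ in what a missing value is taken to mean. A minimal sketch on a hypothetical three-row frame, one column per strategy:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],          # NaN means "no pool"
    "GarageArea": [548.0, np.nan, 608.0],      # NaN means "no garage"
    "Electrical": ["SBrkr", "SBrkr", np.nan],  # NaN is genuinely unknown
})

# Absent feature, categorical: use an explicit "None" category
toy["PoolQC"] = toy["PoolQC"].fillna("None")
# Absent feature, numeric: use 0
toy["GarageArea"] = toy["GarageArea"].fillna(0)
# Unknown value, categorical: use the most frequent value
toy["Electrical"] = toy["Electrical"].fillna(toy["Electrical"].mode()[0])

print(toy)
```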
all_data.groupby(["Neighborhood"])['LotFrontage'].sum()
Neighborhood
Blmngtn 938.0
Blueste 273.0
BrDale 645.0
BrkSide 5300.0
ClearCr 1763.0
CollgCr 15694.0
Crawfor 5806.0
Edwards 11467.0
Gilbert 8237.0
IDOTRR 5415.0
MeadowV 845.0
Mitchel 6763.0
NAmes 28204.0
NPkVill 591.0
NWAmes 6929.0
NoRidge 4684.0
NridgHt 13722.0
OldTown 14147.0
SWISU 2599.0
Sawyer 7306.0
SawyerW 7491.0
Somerst 10457.0
StoneBr 2860.0
Timber 4626.0
Veenker 1152.0
Name: LotFrontage, dtype: float64
all_data['LotFrontage']=all_data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
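The group-wise median fill above can be checked on a small example. This is a sketch on made-up values; only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr"],
    "LotFrontage": [70.0, np.nan, 80.0, 60.0, np.nan],
})

# transform() returns a Series aligned with the original rows,
# so each NaN is replaced by the median of its own neighborhood
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(toy)
```

If a neighborhood had no observed LotFrontage at all, its median would itself be NaN and the gap would remain, so it is worth re-checking `isnull().sum()` afterwards.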
plt.subplots(figsize=(12,5))
plt.subplot(1,2,1)
g=sns.countplot(x='Utilities',data=train).set_title('Utilities_train')
plt.subplot(1,2,2)
g=sns.countplot(x='Utilities',data=test).set_title('Utilities_test')
train['Utilities'].value_counts()
AllPub 1455
NoSeWa 1
Name: Utilities, dtype: int64
test['Utilities'].value_counts()
AllPub 1457
Name: Utilities, dtype: int64
all_data = all_data.drop(['Utilities'], axis=1)
all_data_na = all_data.isnull().sum()
print("Features with missing values: ", len(all_data_na.drop(all_data_na[all_data_na == 0].index)))
Features with missing values: 0
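Utilities was dropped because almost every row has the same value. One way to look for such near-constant columns systematically is sketched below; the 95% cut-off and the `toy` frame are assumptions for illustration, not something this notebook uses.

```python
import pandas as pd

toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 99 + ["NoSeWa"],
    "Street": ["Pave"] * 60 + ["Grvl"] * 40,
})

# Share of the most frequent value in each column
top_share = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Hypothetical cut-off: flag columns dominated by one value in at least 95% of rows
near_constant = top_share[top_share >= 0.95].index.tolist()
print(near_constant)
```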
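# Correlate numeric columns only; newer pandas versions raise an error on object columns
corr = train.select_dtypes(include=[np.number]).corr()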
plt.subplots(figsize=(30,30))
cmap = sns.diverging_palette(150, 250, as_cmap=True)
sns.heatmap(corr, cmap="RdYlBu", vmax=1, vmin=-0.6, center=0.2, square=True, linewidths=0, cbar_kws={"shrink": .5}, annot = True)
<matplotlib.axes._subplots.AxesSubplot at 0x12901bc18>
Features that strongly influence SalePrice are good candidates for feature engineering. From the correlation matrix, we pick those most strongly correlated with the final sale price for further analysis.
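Reading the strongest correlations off a 30x30 heatmap is tedious, so they can also be ranked in code. A minimal sketch on synthetic data; the generated columns and coefficients are made up and only mimic the real feature names.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 500
quality = rng.randint(1, 11, size=n)
area = rng.normal(1500, 300, size=n)
unrelated = rng.normal(0, 1, size=n)
# Synthetic price driven by quality and area, plus noise
price = 20000 * quality + 100 * area + rng.normal(0, 5000, size=n)

toy = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Noise": unrelated,
    "SalePrice": price,
})

# Correlation of every feature with the target, ranked by absolute value
corr_with_price = toy.corr()["SalePrice"].drop("SalePrice")
top_features = corr_with_price.abs().sort_values(ascending=False).head(2).index.tolist()
print(top_features)
```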
# Polynomial transforms of the features most correlated with SalePrice
poly_features = ["OverallQual", "GrLivArea", "GarageCars", "GarageArea", "TotalBsmtSF",
                 "1stFlrSF", "FullBath", "TotRmsAbvGrd", "Fireplaces", "MasVnrArea",
                 "BsmtFinSF1", "LotFrontage", "WoodDeckSF", "OpenPorchSF", "2ndFlrSF"]
# Quadratic
for col in poly_features:
    all_data[col + "-2"] = all_data[col] ** 2
print("Quadratics done!...")
# Cubic
for col in poly_features:
    all_data[col + "-3"] = all_data[col] ** 3
print("Cubics done!...")
# Square Root
for col in poly_features:
    all_data[col + "-Sq"] = np.sqrt(all_data[col])
print("Roots done!...")
Quadratics done!...
Cubics done!...
Roots done!...
train['BsmtQual'].value_counts()
TA 649
Gd 618
Ex 117
Fa 35
Name: BsmtQual, dtype: int64
train.groupby(['BsmtQual'])['SalePrice'].mean()
"""
BsmtQual: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
"""
plt.subplots(figsize=(20,6))
plt.subplot(1,3,1) # box plot
sns.boxplot(x='BsmtQual',y='SalePrice',data=train,order= ['Fa', 'TA', 'Gd', 'Ex'])
plt.subplot(1,3,2) # strip plot: individual points grouped by the x-axis category
sns.stripplot(x='BsmtQual',y='SalePrice',data=train,size=5,jitter=True,order= ['Fa', 'TA', 'Gd', 'Ex'])
plt.subplot(1,3,3) # bar plot of the mean per category
sns.barplot(x='BsmtQual',y='SalePrice',data=train,order= ['Fa', 'TA', 'Gd', 'Ex'],estimator=np.mean)
<matplotlib.axes._subplots.AxesSubplot at 0x1263d5e10>
all_data['BsmtQual'] = all_data['BsmtQual'].map({"None":0, "Fa":1, "TA":2, "Gd":3, "Ex":4})
all_data['BsmtQual'].unique()
array([3, 2, 4, 0, 1])
all_data['BsmtQual'].value_counts()
2 1283
3 1209
4 254
1 88
0 81
Name: BsmtQual, dtype: int64
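One caveat with dictionary-based ordinal encoding: `Series.map` silently returns NaN for any category missing from the dictionary. The sketch below uses a hypothetical shared `quality_order` dictionary that also covers "Po", so its codes differ from the mapping used above; the sample values are made up.

```python
import pandas as pd

# Hypothetical shared ordering for the quality-style features
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

toy = pd.Series(["Gd", "TA", "None", "Ex", "Po"], name="BsmtQual")
encoded = toy.map(quality_order)

# map() yields NaN for keys missing from the dict, so verify nothing was lost
assert encoded.notnull().all(), "unmapped category found"
print(encoded.tolist())
```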
"""
BsmtCond: Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
"""
plt.subplots(figsize=(20,5))
plt.subplot(1,3,1)
sns.boxplot(x='BsmtCond',y='SalePrice',data=train,order=['Po','Fa','TA','Gd'])
plt.subplot(1,3,2)
sns.stripplot(x='BsmtCond',y='SalePrice',data=train,size=5,jitter=True,order= ['Po','Fa','TA','Gd'])
plt.subplot(1,3,3)
sns.barplot(x='BsmtCond',y='SalePrice',data=train,order=['Po','Fa','TA','Gd'])
<matplotlib.axes._subplots.AxesSubplot at 0x12ab8d6d8>
train['BsmtCond'].value_counts()
TA 1307
Gd 65
Fa 45
Po 2
Name: BsmtCond, dtype: int64
all_data['BsmtCond'] = all_data['BsmtCond'].map({"None":0, "Po":1, "Fa":2, "TA":3,"Gd":4, "Ex":5})
all_data['BsmtCond'].unique()
array([3, 4, 0, 2, 1])
"""
BsmtExposure: Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Minimum Exposure
No No Exposure
NA No Basement
"""
plt.subplots(figsize=(20,5))
plt.subplot(1,3,1)
sns.boxplot(x='BsmtExposure',y='SalePrice',data=train,order=['No','Mn','Av','Gd'])
plt.subplot(1,3,2)
sns.stripplot(x='BsmtExposure',y='SalePrice',data=train,size=5,jitter=True,order= ['No','Mn','Av','Gd'])
plt.subplot(1,3,3)
sns.barplot(x='BsmtExposure',y='SalePrice',data=train,order=['No','Mn','Av','Gd'])
<matplotlib.axes._subplots.AxesSubplot at 0x12b8e4470>
all_data['BsmtExposure'] = all_data['BsmtExposure'].map({"None":0, "No":1, "Mn":2, "Av":3,"Gd":4})
all_data['BsmtExposure'].unique()
array([1, 4, 2, 3, 0])
"""
BsmtFinType1: Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
"""
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.boxplot(x="BsmtFinType1", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
plt.subplot(1, 3, 2)
sns.stripplot(x="BsmtFinType1", y="SalePrice", data=train, size = 5, jitter = True, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
plt.subplot(1, 3, 3)
sns.barplot(x="BsmtFinType1", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
all_data = pd.get_dummies(all_data, columns = ["BsmtFinType1"], prefix="BsmtFinType1")
all_data.head(3)
| | 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | WoodDeckSF-Sq | OpenPorchSF-Sq | 2ndFlrSF-Sq | BsmtFinType1_ALQ | BsmtFinType1_BLQ | BsmtFinType1_GLQ | BsmtFinType1_LwQ | BsmtFinType1_None | BsmtFinType1_Rec | BsmtFinType1_Unf |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 856 | 854 | 0 | None | 3 | 1Fam | 3 | 1 | 706.0 | 0.0 | ... | 0.000000 | 7.810250 | 29.223278 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 1262 | 0 | 0 | None | 3 | 1Fam | 3 | 4 | 978.0 | 0.0 | ... | 17.262677 | 0.000000 | 0.000000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 920 | 866 | 0 | None | 3 | 1Fam | 3 | 2 | 486.0 | 0.0 | ... | 0.000000 | 6.480741 | 29.427878 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 rows × 129 columns
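BsmtFinType1 has no clear monotonic relationship with price in the plots, which is presumably why it is one-hot encoded rather than mapped to integers. A minimal sketch of what `pd.get_dummies` does to a single column, on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "BsmtFinType1": ["GLQ", "ALQ", "Unf", "GLQ"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# The original column is removed and one indicator column per category is appended
encoded = pd.get_dummies(toy, columns=["BsmtFinType1"], prefix="BsmtFinType1")
print(encoded.columns.tolist())
```

The indicator columns are created in sorted category order, and untouched columns such as LotArea keep their position at the front.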
from scipy.stats import pearsonr
grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25)
# GridSpec defines the grid geometry in which the subplots are placed: the number of rows and columns must be set, and layout parameters (e.g. left, right) can optionally be adjusted.
plt.subplots(figsize=(30,15))
plt.subplot(grid[0,0])
g = sns.regplot(x=train['BsmtFinSF1'], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train['BsmtFinSF1'], train['SalePrice'])[0]))
g.legend(loc='best')
plt.subplot(grid[0,1:])
sns.boxplot(x='Neighborhood',y='BsmtFinSF1',data=train)
plt.subplot(grid[1,0])
sns.barplot(x='BldgType',y= 'BsmtFinSF1',data=train)
plt.subplot(grid[1,1])
sns.barplot(x='HouseStyle',y ='BsmtFinSF1',data=train)
plt.subplot(grid[1,2])
sns.barplot(x='LotShape',y='BsmtFinSF1',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x129034e10>
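The legend label in the scatter plot comes from `pearsonr`, which returns the correlation coefficient together with a p-value. A minimal sketch on made-up, perfectly linear data:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1  # exact linear relationship

# First element is the correlation coefficient, second the p-value
r, p_value = pearsonr(x, y)

# "%.2f" rounds to two decimals; "%2f" would only set a minimum width of 2
label = "corr: %.2f" % r
print(label)
```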
bins = [-5,1000,2000,3000,float('inf')]
all_data['BsmtFinSF1_Band'] = pd.cut(all_data['BsmtFinSF1'], bins,labels=['1','2','3','4'])
all_data['BsmtFinSF1_Band'].unique()
all_data.drop('BsmtFinSF1',axis=1,inplace=True)
all_data = pd.get_dummies(all_data, columns = ["BsmtFinSF1_Band"], prefix="BsmtFinSF1")
all_data.head()
| | 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF2 | BsmtFinType2 | ... | BsmtFinType1_BLQ | BsmtFinType1_GLQ | BsmtFinType1_LwQ | BsmtFinType1_None | BsmtFinType1_Rec | BsmtFinType1_Unf | BsmtFinSF1_1 | BsmtFinSF1_2 | BsmtFinSF1_3 | BsmtFinSF1_4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 856 | 854 | 0 | None | 3 | 1Fam | 3 | 1 | 0.0 | Unf | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 1262 | 0 | 0 | None | 3 | 1Fam | 3 | 4 | 0.0 | Unf | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 920 | 866 | 0 | None | 3 | 1Fam | 3 | 2 | 0.0 | Unf | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 961 | 756 | 0 | None | 3 | 1Fam | 4 | 1 | 0.0 | Unf | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 1145 | 1053 | 0 | None | 4 | 1Fam | 3 | 3 | 0.0 | Unf | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 132 columns
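With explicit bin edges, `pd.cut` uses right-closed intervals, so a value exactly on an edge falls into the lower band. A minimal sketch with the same edges as above and made-up sample values:

```python
import pandas as pd

values = pd.Series([0, 500, 1000, 1500, 2500, 5644])
bins = [-5, 1000, 2000, 3000, float("inf")]

# Intervals are (-5, 1000], (1000, 2000], (2000, 3000], (3000, inf)
bands = pd.cut(values, bins, labels=["1", "2", "3", "4"])
print(bands.tolist())
```

The lower edge of -5 is what keeps a value of 0 inside the first band; with an edge of 0 it would fall outside and become NaN.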
"""
BsmtFinType2: Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
"""
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.boxplot(x="BsmtFinType2", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
plt.subplot(1, 3, 2)
sns.stripplot(x="BsmtFinType2", y="SalePrice", data=train, size = 5, jitter = True, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
plt.subplot(1, 3, 3)
sns.barplot(x="BsmtFinType2", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);
all_data = pd.get_dummies(all_data, columns = ["BsmtFinType2"], prefix="BsmtFinType2") # the columns argument must be a list
all_data.head(3)
"""
columns : list-like, default None
Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
"""
"""
BsmtFinSF2: Type 2 finished square feet
"""
grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25)
# GridSpec defines the grid geometry in which the subplots are placed: the number of rows and columns must be set, and layout parameters (e.g. left, right) can optionally be adjusted.
plt.subplots(figsize=(30,15))
plt.subplot(grid[0,0])
g = sns.regplot(x=train['BsmtFinSF2'], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train['BsmtFinSF2'], train['SalePrice'])[0]))
g.legend(loc='best')
plt.subplot(grid[0,1:])
sns.boxplot(x='Neighborhood',y='BsmtFinSF2',data=train)
plt.subplot(grid[1,0])
sns.barplot(x='BldgType',y= 'BsmtFinSF2',data=train)
plt.subplot(grid[1,1])
sns.barplot(x='HouseStyle',y ='BsmtFinSF2',data=train)
plt.subplot(grid[1,2])
sns.barplot(x='LotShape',y='BsmtFinSF2',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x12c7a68d0>
all_data['BsmtFinType2_None'].value_counts()
0 2835
1 80
Name: BsmtFinType2_None, dtype: int64
all_data['BsmtFinSf2_Flag'] = all_data['BsmtFinSF2'].map(lambda x:0 if x==0 else 1)
all_data.drop('BsmtFinSF2', axis=1, inplace=True)
all_data['BsmtFinSf2_Flag'].value_counts()
0 2568
1 347
Name: BsmtFinSf2_Flag, dtype: int64
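The zero/non-zero flag can also be computed without a Python-level lambda. A minimal sketch comparing the two forms on made-up values:

```python
import pandas as pd

toy = pd.Series([0.0, 32.0, 0.0, 668.0], name="BsmtFinSF2")

# Element-wise version, as used above
flag_map = toy.map(lambda x: 0 if x == 0 else 1)

# Vectorised equivalent: boolean comparison cast to integers
flag_vec = (toy != 0).astype(int)

print(flag_vec.tolist())
```

Both give the same result on data without missing values; the vectorised form is usually faster on large frames.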
"""
Unfinished square feet of basement area
"""
grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25)
# GridSpec defines the grid geometry in which the subplots are placed: the number of rows and columns must be set, and layout parameters (e.g. left, right) can optionally be adjusted.
plt.subplots(figsize=(30,15))
plt.subplot(grid[0,0])
g = sns.regplot(x=train['BsmtUnfSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train['BsmtUnfSF'], train['SalePrice'])[0]))
g.legend(loc='best')
plt.subplot(grid[0,1:])
sns.boxplot(x='Neighborhood',y='BsmtUnfSF',data=train)
plt.subplot(grid[1,0])
sns.barplot(x='BldgType',y= 'BsmtUnfSF',data=train)
plt.subplot(grid[1,1])
sns.barplot(x='HouseStyle',y ='BsmtUnfSF',data=train)
plt.subplot(grid[1,2])
sns.barplot(x='LotShape',y='BsmtUnfSF',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x118d8b940>
"""
This feature has a significant positive correlation with SalePrice, with a small proportion of data points having a value of 0.
This tells me that most houses will have some amount of square feet unfinished within the basement, and this actually positively contributes towards SalePrice.
The amount of unfinished square feet also varies widely based on location and style.
Whereas the average unfinished square feet within the basement is fairly consistent across the different lot shapes.
Since this is a continuous numeric feature with a significant correlation, I will bin this and create dummy variables.
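Summary: the feature is positively correlated with SalePrice.
Unfinished basement area shows little relationship with lot shape.
Because it is a continuous variable, it is binned and the bins are then one-hot encoded.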
"""
all_data['BsmtUnfSF_Band'] = pd.cut(all_data['BsmtUnfSF'], 3,labels=['1','2','3'])
all_data.drop('BsmtUnfSF',axis=1,inplace=True)
all_data['BsmtUnfSF_Band'].unique()
all_data['BsmtUnfSF_Band'] = all_data['BsmtUnfSF_Band'].astype(int)
all_data = pd.get_dummies(all_data, columns = ["BsmtUnfSF_Band"], prefix="BsmtUnfSF")
all_data.head()
| | 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFullBath | BsmtHalfBath | ... | BsmtFinType2_BLQ | BsmtFinType2_GLQ | BsmtFinType2_LwQ | BsmtFinType2_None | BsmtFinType2_Rec | BsmtFinType2_Unf | BsmtFinSf2_Flag | BsmtUnfSF_1 | BsmtUnfSF_2 | BsmtUnfSF_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 856 | 854 | 0 | None | 3 | 1Fam | 3 | 1 | 1.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 1262 | 0 | 0 | None | 3 | 1Fam | 3 | 4 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
2 | 920 | 866 | 0 | None | 3 | 1Fam | 3 | 2 | 1.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
3 | 961 | 756 | 0 | None | 3 | 1Fam | 4 | 1 | 1.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
4 | 1145 | 1053 | 0 | None | 4 | 1Fam | 3 | 3 | 1.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
5 rows × 140 columns
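Passing an integer to `pd.cut`, as done above, produces equal-width bins, so skewed data can leave some bands nearly empty. The sketch below contrasts that with `pd.qcut` (equal-frequency bins). `pd.qcut` is not used in this notebook and is shown only as an alternative; the sample values are made up.

```python
import pandas as pd

# Made-up skewed sample: one very large value stretches the range
values = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 2000])

# Equal-width bins over the full range: most values land in the first band
equal_width = pd.cut(values, 3, labels=["1", "2", "3"])

# Equal-frequency bins: edges are quantiles, so each band gets a similar count
equal_freq = pd.qcut(values, 3, labels=["1", "2", "3"])

print(equal_width.tolist())
print(equal_freq.tolist())
```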
"""
Total square feet of basement area.
"""
grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25)
# GridSpec defines the grid geometry in which the subplots are placed: the number of rows and columns must be set, and layout parameters (e.g. left, right) can optionally be adjusted.
plt.subplots(figsize=(30,15))
plt.subplot(grid[0,0])
g = sns.regplot(x=train['TotalBsmtSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train['TotalBsmtSF'], train['SalePrice'])[0]))
g.legend(loc='best')
plt.subplot(grid[0,1:])
sns.boxplot(x='Neighborhood',y='TotalBsmtSF',data=train)
plt.subplot(grid[1,0])
sns.barplot(x='BldgType',y= 'TotalBsmtSF',data=train)
plt.subplot(grid[1,1])
sns.barplot(x='HouseStyle',y ='TotalBsmtSF',data=train)
plt.subplot(grid[1,2])
sns.barplot(x='LotShape',y='TotalBsmtSF',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x12d9b3d30>
def get_feature_corr(feature_name):
    """Plot a numeric feature against SalePrice, Neighborhood, BldgType, HouseStyle and LotShape."""
    grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25)
    # GridSpec defines the grid geometry in which the subplots are placed: the number of rows and columns must be set, and layout parameters (e.g. left, right) can optionally be adjusted.
    plt.subplots(figsize=(30,15))
    plt.subplot(grid[0,0])
    g = sns.regplot(x=train[feature_name], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train[feature_name], train['SalePrice'])[0]))
    g.legend(loc='best')
    plt.subplot(grid[0,1:])
    sns.boxplot(x='Neighborhood',y=feature_name,data=train)
    plt.subplot(grid[1,0])
    sns.barplot(x='BldgType',y= feature_name,data=train)
    plt.subplot(grid[1,1])
    sns.barplot(x='HouseStyle',y =feature_name,data=train)
    plt.subplot(grid[1,2])
    sns.barplot(x='LotShape',y=feature_name,data=train)
    plt.show()
get_feature_corr('1stFlrSF')
"""
First floor square feet.
"""
all_data['1stFlrSF_Band'] = pd.cut(all_data['1stFlrSF'], 6,labels=['1','2','3','4','5','6'])
all_data['1stFlrSF_Band'].unique()
all_data['1stFlrSF_Band'] = all_data['1stFlrSF_Band'].astype(int)
all_data.drop('1stFlrSF', axis=1, inplace=True)
all_data = pd.get_dummies(all_data, columns = ["1stFlrSF_Band"], prefix="1stFlrSF")
all_data.head(3)
| | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFullBath | BsmtHalfBath | BsmtQual | ... | BsmtFinSf2_Flag | BsmtUnfSF_1 | BsmtUnfSF_2 | BsmtUnfSF_3 | 1stFlrSF_1 | 1stFlrSF_2 | 1stFlrSF_3 | 1stFlrSF_4 | 1stFlrSF_5 | 1stFlrSF_6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 854 | 0 | None | 3 | 1Fam | 3 | 1 | 1.0 | 0.0 | 3 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | None | 3 | 1Fam | 3 | 4 | 0.0 | 1.0 | 3 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 866 | 0 | None | 3 | 1Fam | 3 | 2 | 1.0 | 0.0 | 3 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 rows × 145 columns
get_feature_corr('2ndFlrSF')
"""
Second floor square feet.
"""
all_data['2ndFlrSF_Band'] = pd.cut(all_data['2ndFlrSF'], 6,labels=list('123456'))
all_data['2ndFlrSF_Band'].unique()
all_data=pd.get_dummies(all_data,columns=['2ndFlrSF_Band'],prefix="2ndFlrSF")
all_data.drop('2ndFlrSF', axis=1, inplace=True)
all_data.head()
| | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFullBath | BsmtHalfBath | BsmtQual | CentralAir | ... | 1stFlrSF_3 | 1stFlrSF_4 | 1stFlrSF_5 | 1stFlrSF_6 | 2ndFlrSF_1 | 2ndFlrSF_2 | 2ndFlrSF_3 | 2ndFlrSF_4 | 2ndFlrSF_5 | 2ndFlrSF_6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 1Fam | 3 | 1 | 1.0 | 0.0 | 3 | Y | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | None | 3 | 1Fam | 3 | 4 | 0.0 | 1.0 | 3 | Y | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | None | 3 | 1Fam | 3 | 2 | 1.0 | 0.0 | 3 | Y | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | None | 3 | 1Fam | 4 | 1 | 1.0 | 0.0 | 2 | Y | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 0 | None | 4 | 1Fam | 3 | 3 | 1.0 | 0.0 | 3 | Y | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 150 columns
get_feature_corr('LowQualFinSF')
'''
Low quality finished square feet (all floors)
'''
all_data['LowQualFinSF_Flag'] = all_data['LowQualFinSF'].map(lambda x:0 if x==0 else 1)
all_data.drop('LowQualFinSF', axis=1, inplace=True)
all_data['TotalBathrooms'] = all_data['BsmtHalfBath'] + all_data['BsmtFullBath'] + all_data['HalfBath'] + all_data['FullBath']
columns = ['BsmtHalfBath', 'BsmtFullBath', 'HalfBath', 'FullBath']
all_data.drop(columns, axis=1, inplace=True)
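The bathroom aggregation above can be written as a row-wise sum over a list of columns. A minimal sketch on two made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "BsmtHalfBath": [0.0, 1.0],
    "BsmtFullBath": [1.0, 0.0],
    "HalfBath": [1, 0],
    "FullBath": [2, 2],
})

bath_cols = ["BsmtHalfBath", "BsmtFullBath", "HalfBath", "FullBath"]

# Row-wise sum; half baths count the same as full baths, as in the code above
toy["TotalBathrooms"] = toy[bath_cols].sum(axis=1)
toy = toy.drop(bath_cols, axis=1)

print(toy)
```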
def get_feature_corr1(feature_name,order=None):
    """Plot SalePrice per category of a feature as a box plot, strip plot and bar plot."""
    plt.subplots(figsize =(20, 5))
    plt.subplot(1, 3, 1)
    sns.boxplot(x=feature_name, y="SalePrice", data=train,order=order)
    plt.subplot(1, 3, 2)
    sns.stripplot(x=feature_name, y="SalePrice", data=train, size = 5, jitter = True ,order=order);
    plt.subplot(1, 3, 3)
    sns.barplot(x=feature_name, y="SalePrice", data=train,order=order)
    plt.show()
get_feature_corr1('BedroomAbvGr',order=None)
"""
Bedrooms above grade (does not include basement bedrooms)
"""
get_feature_corr1('KitchenAbvGr',order=None)
get_feature_corr1('KitchenQual',order=['Fa','TA','Gd','Ex'])
print("""
This feature should be converted to an ordered category.
""")
This feature should be converted to an ordered category.
all_data['KitchenQual'] = all_data['KitchenQual'].map({"Fa":1, "TA":2, "Gd":3, "Ex":4})
all_data['KitchenQual'].unique()
array([3, 2, 4, 1])
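One thing worth checking with this kind of ordinal mapping: `Series.map` with a dict returns `NaN` for any value that is not a key. A minimal sketch, assuming a hypothetical quality column that contains a level (`"Po"`) missing from the mapping:

```python
import pandas as pd

# Hypothetical quality column; "Po" is deliberately absent from the mapping below.
quality = pd.Series(["Gd", "TA", "Ex", "Fa", "Po"])
order = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

encoded = quality.map(order)

# Values that are not dict keys come back as NaN, so looking at the NaN
# positions afterwards reveals any level the mapping missed.
unmapped = quality[encoded.isna()].unique().tolist()
print(unmapped)
```

Calling `.unique()` after the mapping, as the notebook does, serves the same purpose: a `nan` in the result signals a level that was not covered.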
get_feature_corr1('TotRmsAbvGrd')
get_feature_corr1('Fireplaces')
get_feature_corr1('FireplaceQu',order=['Po','Fa','TA','Gd','Ex'])
all_data['FireplaceQu'] = all_data['FireplaceQu'].map({"None":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5})
all_data['FireplaceQu'].unique()
array([0, 3, 4, 2, 5, 1])
get_feature_corr('GrLivArea')
all_data['GrLivArea_Band'] = pd.cut(all_data['GrLivArea'], 6,labels=list('123456'))
all_data['GrLivArea_Band'].unique()
all_data['GrLivArea_Band'] = all_data['GrLivArea_Band'].astype(int)
all_data.drop('GrLivArea',axis=1,inplace=True)
all_data = pd.get_dummies(all_data, columns = ["GrLivArea_Band"], prefix="GrLivArea")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | ... | 2ndFlrSF_5 | 2ndFlrSF_6 | LowQualFinSF_Flag | TotalBathrooms | GrLivArea_1 | GrLivArea_2 | GrLivArea_3 | GrLivArea_4 | GrLivArea_5 | GrLivArea_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 1Fam | 3 | 1 | 3 | Y | Norm | Norm | ... | 0 | 0 | 0 | 4.0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 0 | None | 3 | 1Fam | 3 | 4 | 3 | Y | Feedr | Norm | ... | 0 | 0 | 0 | 3.0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | None | 3 | 1Fam | 3 | 2 | 3 | Y | Norm | Norm | ... | 0 | 0 | 0 | 4.0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 rows × 152 columns
get_feature_corr1('MSSubClass')
all_data['MSSubClass'] = all_data['MSSubClass'].astype(str)
all_data = pd.get_dummies(all_data, columns = ["MSSubClass"], prefix="MSSubClass")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | ... | MSSubClass_30 | MSSubClass_40 | MSSubClass_45 | MSSubClass_50 | MSSubClass_60 | MSSubClass_70 | MSSubClass_75 | MSSubClass_80 | MSSubClass_85 | MSSubClass_90 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 1Fam | 3 | 1 | 3 | Y | Norm | Norm | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | None | 3 | 1Fam | 3 | 4 | 3 | Y | Feedr | Norm | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | None | 3 | 1Fam | 3 | 2 | 3 | Y | Norm | Norm | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 rows × 167 columns
get_feature_corr1('BldgType')
all_data['BldgType'] = all_data['BldgType'].astype(str)
all_data = pd.get_dummies(all_data, columns = ["BldgType"], prefix="BldgType")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | MSSubClass_70 | MSSubClass_75 | MSSubClass_80 | MSSubClass_85 | MSSubClass_90 | BldgType_1Fam | BldgType_2fmCon | BldgType_Duplex | BldgType_Twnhs | BldgType_TwnhsE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 rows × 171 columns
get_feature_corr1('HouseStyle')
all_data['HouseStyle'] = all_data['HouseStyle'].map({"2Story":"2Story", "1Story":"1Story", "1.5Fin":"1.5Story", "1.5Unf":"1.5Story",
"SFoyer":"SFoyer", "SLvl":"SLvl", "2.5Unf":"2.5Story", "2.5Fin":"2.5Story"})
all_data = pd.get_dummies(all_data, columns = ["HouseStyle"], prefix="HouseStyle")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | BldgType_2fmCon | BldgType_Duplex | BldgType_Twnhs | BldgType_TwnhsE | HouseStyle_1.5Story | HouseStyle_1Story | HouseStyle_2.5Story | HouseStyle_2Story | HouseStyle_SFoyer | HouseStyle_SLvl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 rows × 176 columns
get_feature_corr1('OverallQual')
get_feature_corr1('OverallCond')
get_feature_corr1('YearRemodAdd')
train['Remod_Diff'] = train['YearRemodAdd'] - train['YearBuilt']
plt.subplots(figsize =(40, 10))
sns.barplot(x="Remod_Diff", y="SalePrice", data=train);
all_data['Remod_Diff'] = all_data['YearRemodAdd'] - all_data['YearBuilt']
all_data.drop('YearRemodAdd', axis=1, inplace=True)
get_feature_corr1('YearBuilt')
all_data['YearBuilt_Band'] = pd.cut(all_data['YearBuilt'], 7,labels=list('1234567'))
all_data['YearBuilt_Band'].unique()
all_data['YearBuilt_Band'] = all_data['YearBuilt_Band'].astype(int)
all_data.drop('YearBuilt',axis=1,inplace=True)
all_data = pd.get_dummies(all_data, columns = ["YearBuilt_Band"], prefix="YearBuilt")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | HouseStyle_SFoyer | HouseStyle_SLvl | Remod_Diff | YearBuilt_1 | YearBuilt_2 | YearBuilt_3 | YearBuilt_4 | YearBuilt_5 | YearBuilt_6 | YearBuilt_7 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 rows × 182 columns
get_feature_corr1('Foundation')
all_data = pd.get_dummies(all_data, columns = ["Foundation"], prefix="Foundation")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | YearBuilt_4 | YearBuilt_5 | YearBuilt_6 | YearBuilt_7 | Foundation_BrkTil | Foundation_CBlock | Foundation_PConc | Foundation_Slab | Foundation_Stone | Foundation_Wood | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
3 rows × 187 columns
get_feature_corr1('Functional')
all_data['Functional'] = all_data['Functional'].map({"Sev":1, "Maj2":2, "Maj1":3, "Mod":4, "Min2":5, "Min1":6, "Typ":7})
all_data['Functional'].unique()
array([7, 6, 3, 5, 4, 2, 1])
get_feature_corr1('RoofStyle')
all_data = pd.get_dummies(all_data, columns = ["RoofStyle"], prefix="RoofStyle")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | Foundation_PConc | Foundation_Slab | Foundation_Stone | Foundation_Wood | RoofStyle_Flat | RoofStyle_Gable | RoofStyle_Gambrel | RoofStyle_Hip | RoofStyle_Mansard | RoofStyle_Shed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 rows × 192 columns
"""
Roof material.
"""
get_feature_corr1('RoofMatl')
all_data = pd.get_dummies(all_data, columns = ["RoofMatl"], prefix="RoofMatl")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | RoofStyle_Hip | RoofStyle_Mansard | RoofStyle_Shed | RoofMatl_CompShg | RoofMatl_Membran | RoofMatl_Metal | RoofMatl_Roll | RoofMatl_Tar&Grv | RoofMatl_WdShake | RoofMatl_WdShngl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 198 columns
get_feature_corr1('Exterior1st')
get_feature_corr1('Exterior2nd')
def Exter2(col):
    if col['Exterior2nd'] == col['Exterior1st']:
        return 1
    else:
        return 0
all_data['ExteriorMatch_Flag'] = all_data.apply(Exter2, axis=1)
all_data.drop('Exterior2nd', axis=1, inplace=True)
all_data = pd.get_dummies(all_data, columns = ["Exterior1st"], prefix="Exterior1st")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | Exterior1st_CemntBd | Exterior1st_HdBoard | Exterior1st_ImStucc | Exterior1st_MetalSd | Exterior1st_Plywood | Exterior1st_Stone | Exterior1st_Stucco | Exterior1st_VinylSd | Exterior1st_Wd Sdng | Exterior1st_WdShing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 rows × 212 columns
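The row-wise `apply` used for `ExteriorMatch_Flag` can also be written as a column comparison. The sketch below uses a three-row toy frame (the values are illustrative) and checks that both forms agree.

```python
import pandas as pd

# Toy rows; the material names are real category levels, the pairing is made up.
df = pd.DataFrame({
    "Exterior1st": ["VinylSd", "MetalSd", "Wd Sdng"],
    "Exterior2nd": ["VinylSd", "MetalSd", "Plywood"],
})

# Vectorized: compare the two columns element-wise, then cast True/False to 1/0.
match_flag = (df["Exterior1st"] == df["Exterior2nd"]).astype(int)

# Row-wise equivalent, in the style of the Exter2 helper.
row_wise = df.apply(lambda r: 1 if r["Exterior2nd"] == r["Exterior1st"] else 0, axis=1)

print(match_flag.tolist())
```

On a frame of a few thousand rows either form is fast enough; the vectorized one is simply shorter.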
get_feature_corr1('MasVnrType')
all_data = pd.get_dummies(all_data, columns = ["MasVnrType"], prefix="MasVnrType")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | Exterior1st_Plywood | Exterior1st_Stone | Exterior1st_Stucco | Exterior1st_VinylSd | Exterior1st_Wd Sdng | Exterior1st_WdShing | MasVnrType_BrkCmn | MasVnrType_BrkFace | MasVnrType_None | MasVnrType_Stone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
3 rows × 215 columns
get_feature_corr('MasVnrArea')
all_data.drop('MasVnrArea', axis=1, inplace=True)
get_feature_corr1('ExterQual',order=['Fa','TA','Gd', 'Ex'])
all_data['ExterQual'] = all_data['ExterQual'].map({"Fa":1, "TA":2, "Gd":3, "Ex":4})
all_data['ExterQual'].unique()
array([3, 2, 4, 1])
"""
Evaluates the present condition of the material on the exterior.
"""
get_feature_corr1('ExterCond',order=['Po','Fa',"TA",'Gd','Ex'])
all_data = pd.get_dummies(all_data, columns = ["ExterCond"], prefix="ExterCond")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | Exterior1st_WdShing | MasVnrType_BrkCmn | MasVnrType_BrkFace | MasVnrType_None | MasVnrType_Stone | ExterCond_Ex | ExterCond_Fa | ExterCond_Gd | ExterCond_Po | ExterCond_TA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 rows × 218 columns
"""
location of the Garage
"""
get_feature_corr1('GarageType')
all_data = pd.get_dummies(all_data, columns = ["GarageType"], prefix="GarageType")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | ExterCond_Gd | ExterCond_Po | ExterCond_TA | GarageType_2Types | GarageType_Attchd | GarageType_Basment | GarageType_BuiltIn | GarageType_CarPort | GarageType_Detchd | GarageType_None | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 rows × 224 columns
"""
Year Garage was built
"""
get_feature_corr1('GarageYrBlt')
plt.subplots(figsize =(50, 10))
sns.boxplot(x="GarageYrBlt", y="SalePrice", data=train);
plt.subplots(figsize =(50, 10))
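sns.violinplot(x = 'GarageYrBlt', y = 'SalePrice', data = train,
               linewidth = 2,    # line width
               width = 0.8,      # width of each violin relative to the spacing between categories
               palette = 'hls',  # color palette
               # order = {'Thur', 'Fri', 'Sat','Sun'},  # restrict the categories shown
               # scale = 'count',  # how violin width is scaled: area (equal area), count (by sample count), width (equal width)
               gridsize = 50,    # number of grid points for the density estimate; higher is smoother
               inner = 'box',    # what to draw inside each violin --> 'box','quart','point','stick',None
               # bw = 0.8        # kernel bandwidth; can usually be left unset
              )
### Some seaborn plot types I recently learned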
train['GarageYrBlt'].value_counts()
sns.distplot(train['GarageYrBlt'].dropna(), kde=True, bins=5, rug=True)
all_data['GarageYrBlt_Band'] = pd.qcut(all_data['GarageYrBlt'],3,labels=list('123'))
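# qcut picks bin edges from the quantiles of the values, so each bin holds roughly the same number of samples
# cut picks equally spaced bin edges from the values themselves, so every bin has the same width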
all_data['GarageYrBlt_Band'] = all_data['GarageYrBlt_Band'].astype(int)
all_data.drop(['GarageYrBlt'],axis=1,inplace=True)
all_data = pd.get_dummies(all_data, columns = ["GarageYrBlt_Band"], prefix="GarageYrBlt") # get_dummies removes the encoded column by default, so the band column does not need to be dropped manually
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | GarageType_2Types | GarageType_Attchd | GarageType_Basment | GarageType_BuiltIn | GarageType_CarPort | GarageType_Detchd | GarageType_None | GarageYrBlt_1 | GarageYrBlt_2 | GarageYrBlt_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 rows × 226 columns
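The difference between `pd.cut` and `pd.qcut` is easiest to see on a small skewed sample. The years below are made up for illustration only.

```python
import pandas as pd

# Made-up, right-heavy sample of years.
years = pd.Series([1900, 1950, 1990, 2000, 2005, 2010])

# Equal-width bins: each bin spans the same range of years.
equal_width = pd.cut(years, 3, labels=list("123"))

# Equal-frequency bins: edges are quantiles, so counts are balanced.
equal_freq = pd.qcut(years, 3, labels=list("123"))

print(equal_width.value_counts().sort_index().tolist())  # [1, 1, 4]
print(equal_freq.value_counts().sort_index().tolist())   # [2, 2, 2]
```

With skewed data, `cut` leaves some bands nearly empty while `qcut` balances them, which is the reason for using `qcut` on `GarageYrBlt` here.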
get_feature_corr1('GarageFinish')
all_data = pd.get_dummies(all_data, columns = ["GarageFinish"], prefix="GarageFinish")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | GarageType_CarPort | GarageType_Detchd | GarageType_None | GarageYrBlt_1 | GarageYrBlt_2 | GarageYrBlt_3 | GarageFinish_Fin | GarageFinish_None | GarageFinish_RFn | GarageFinish_Unf | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
3 rows × 229 columns
"""
size of the Garage in car capacity
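This feature is already numeric, so no further processing is needed. Houses with a 3-car garage sell for the highest prices; 4-car garages rarely change hands (5 samples).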
"""
get_feature_corr1('GarageCars')
get_feature_corr('GarageArea')
all_data['GarageArea_Band'] = pd.cut(all_data['GarageArea'],3,labels=list('123'))
all_data['GarageArea_Band'] =all_data['GarageArea_Band'].astype('int')
all_data.drop(['GarageArea'],axis=1,inplace=True)
all_data = pd.get_dummies(all_data, columns = ["GarageArea_Band"], prefix="GarageArea")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | GarageYrBlt_1 | GarageYrBlt_2 | GarageYrBlt_3 | GarageFinish_Fin | GarageFinish_None | GarageFinish_RFn | GarageFinish_Unf | GarageArea_1 | GarageArea_2 | GarageArea_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
3 rows × 231 columns
"""
Garage quality
"""
get_feature_corr1('GarageQual',order=['Po','Fa','TA','Gd','Ex'])
all_data['GarageQual'] = all_data['GarageQual'].map({"None":"None", "Po":"Low", "Fa":"Low", "TA":"TA", "Gd":"High", "Ex":"High"})
all_data['GarageQual'].unique()
array(['TA', 'Low', 'High', 'None'], dtype=object)
all_data = pd.get_dummies(all_data, columns = ["GarageQual"], prefix="GarageQual")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | GarageFinish_None | GarageFinish_RFn | GarageFinish_Unf | GarageArea_1 | GarageArea_2 | GarageArea_3 | GarageQual_High | GarageQual_Low | GarageQual_None | GarageQual_TA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
3 rows × 234 columns
"""
Garage condition.
"""
get_feature_corr1('GarageCond',order=['Po','Fa','TA','Gd','Ex'])
all_data['GarageCond']= all_data['GarageCond'].map({"None":'None',"Po":'Low','Fa':'Low','TA':'TA','Gd':'High','Ex':'High'})
all_data['GarageCond'].unique()
array(['TA', 'Low', 'None', 'High'], dtype=object)
all_data = pd.get_dummies(all_data, columns = ["GarageCond"], prefix="GarageCond")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | GarageArea_2 | GarageArea_3 | GarageQual_High | GarageQual_Low | GarageQual_None | GarageQual_TA | GarageCond_High | GarageCond_Low | GarageCond_None | GarageCond_TA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
3 rows × 237 columns
"""
Wood deck area in SF.
"""
get_feature_corr('WoodDeckSF')
def WoodDeckFlag(col):
    if col['WoodDeckSF'] == 0:
        return 1
    else:
        return 0
all_data['NoWoodDeck_Flag'] = all_data.apply(WoodDeckFlag, axis=1) # new feature
all_data['WoodDeckSF_Band'] = pd.cut(all_data['WoodDeckSF'], 4,labels=list('1234')) ## bin
all_data['WoodDeckSF_Band'] = all_data['WoodDeckSF_Band'].astype(int)
all_data.drop('WoodDeckSF', axis=1, inplace=True)
all_data = pd.get_dummies(all_data, columns = ["WoodDeckSF_Band"], prefix="WoodDeckSF")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | GarageQual_TA | GarageCond_High | GarageCond_Low | GarageCond_None | GarageCond_TA | NoWoodDeck_Flag | WoodDeckSF_1 | WoodDeckSF_2 | WoodDeckSF_3 | WoodDeckSF_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
3 rows × 241 columns
"""
OpenPorchSF, EnclosedPorch, 3SsnPorch & ScreenPorch
I will sum these features together to create a total porch in square feet feature.
"""
all_data['TotalPorchSF'] = all_data['OpenPorchSF'] + all_data['EnclosedPorch'] + all_data['3SsnPorch'] + all_data['ScreenPorch']
train['TotalPorchSF'] = train['OpenPorchSF'] + train['EnclosedPorch'] + train['3SsnPorch'] + train['ScreenPorch']
get_feature_corr('TotalPorchSF')
def PorchFlag(col):
    if col['TotalPorchSF'] == 0:
        return 1
    else:
        return 0
all_data['NoPorch_Flag'] = all_data.apply(PorchFlag, axis=1)
all_data['TotalPorchSF_Band'] = pd.cut(all_data['TotalPorchSF'], 4,labels=list('1234'))
all_data['TotalPorchSF_Band'].unique()
all_data['TotalPorchSF_Band'] = all_data['TotalPorchSF_Band'].astype(int)
all_data.drop('TotalPorchSF', axis=1, inplace=True)
all_data = pd.get_dummies(all_data, columns = ["TotalPorchSF_Band"], prefix="TotalPorchSF")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | NoWoodDeck_Flag | WoodDeckSF_1 | WoodDeckSF_2 | WoodDeckSF_3 | WoodDeckSF_4 | NoPorch_Flag | TotalPorchSF_1 | TotalPorchSF_2 | TotalPorchSF_3 | TotalPorchSF_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 rows × 246 columns
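A total over several columns can also be built from a list of column names, which makes it harder to count one column twice by accident. A minimal sketch on a toy frame (the numbers are illustrative, not taken from the dataset):

```python
import pandas as pd

# The four porch columns named in the note above, each listed exactly once.
porch_cols = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]

# Toy values for illustration.
df = pd.DataFrame({
    "OpenPorchSF": [61, 0, 42],
    "EnclosedPorch": [0, 0, 30],
    "3SsnPorch": [0, 0, 0],
    "ScreenPorch": [10, 0, 0],
})

# Row-wise sum across the listed columns.
df["TotalPorchSF"] = df[porch_cols].sum(axis=1)

# 1 when the house has no porch area at all, else 0.
df["NoPorch_Flag"] = (df["TotalPorchSF"] == 0).astype(int)

print(df[["TotalPorchSF", "NoPorch_Flag"]])
```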
"""
PoolArea Pool area in square feet.
"""
get_feature_corr('PoolArea')
def PoolFlag(col):
    if col['PoolArea'] == 0:
        return 0
    else:
        return 1
all_data['HasPool_Flag'] = all_data.apply(PoolFlag, axis=1)
all_data.drop('PoolArea', axis=1, inplace=True)
"""
Pool quality.
"""
get_feature_corr1('PoolQC',order=['Fa','Gd','Ex'])
all_data['PoolQC'].value_counts() # only 8 rows have a pool and the rest have none, so this quality feature carries little information
None 2907
Gd 3
Ex 3
Fa 2
Name: PoolQC, dtype: int64
all_data.drop('PoolQC', axis=1, inplace=True)
'''
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
'''
get_feature_corr1('Fence',order=['MnWw','GdWo','MnPrv','GdPrv'])
all_data = pd.get_dummies(all_data, columns = ["Fence"], prefix="Fence")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | TotalPorchSF_1 | TotalPorchSF_2 | TotalPorchSF_3 | TotalPorchSF_4 | HasPool_Flag | Fence_GdPrv | Fence_GdWo | Fence_MnPrv | Fence_MnWw | Fence_None | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 rows × 249 columns
"""
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
"""
get_feature_corr1('MSZoning')
all_data['MSZoning'].value_counts()
RL 2265
RM 460
FV 139
RH 26
C (all) 25
Name: MSZoning, dtype: int64
all_data = pd.get_dummies(all_data, columns = ["MSZoning"], prefix="MSZoning")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | Fence_GdPrv | Fence_GdWo | Fence_MnPrv | Fence_MnWw | Fence_None | MSZoning_C (all) | MSZoning_FV | MSZoning_RH | MSZoning_RL | MSZoning_RM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
3 rows × 253 columns
"""
This feature has many distinct values, and SalePrice varies considerably across them,
so we simply one-hot encode it.
"""
get_feature_corr1('Neighborhood')
all_data = pd.get_dummies(all_data, columns = ["Neighborhood"], prefix="Neighborhood")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Condition1 | Condition2 | Electrical | ... | Neighborhood_NoRidge | Neighborhood_NridgHt | Neighborhood_OldTown | Neighborhood_SWISU | Neighborhood_Sawyer | Neighborhood_SawyerW | Neighborhood_Somerst | Neighborhood_StoneBr | Neighborhood_Timber | Neighborhood_Veenker | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | Feedr | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | Norm | Norm | SBrkr | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 277 columns
print('condition1')
get_feature_corr1('Condition1')
print('condition2')
get_feature_corr1('Condition2')
condition1
condition2
'''
Condition1: Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
'''
all_data['Condition1'] = all_data['Condition1'].map({"Norm":"Norm", "Feedr":"Street", "PosN":"Pos", "Artery":"Street", "RRAe":"Train",
"RRNn":"Train", "RRAn":"Train", "PosA":"Pos", "RRNe":"Train"})
all_data['Condition2'] = all_data['Condition2'].map({"Norm":"Norm", "Feedr":"Street", "PosN":"Pos", "Artery":"Street", "RRAe":"Train",
"RRNn":"Train", "RRAn":"Train", "PosA":"Pos", "RRNe":"Train"})
def ConditionMatch(col):
    if col['Condition1'] == col['Condition2']:
        return 0
    else:
        return 1
all_data['Diff2ndCondition_Flag'] = all_data.apply(ConditionMatch, axis=1)
all_data.drop('Condition2', axis=1, inplace=True)
all_data = pd.get_dummies(all_data, columns = ["Condition1"], prefix="Condition1")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Electrical | EnclosedPorch | ExterQual | ... | Neighborhood_SawyerW | Neighborhood_Somerst | Neighborhood_StoneBr | Neighborhood_Timber | Neighborhood_Veenker | Diff2ndCondition_Flag | Condition1_Norm | Condition1_Pos | Condition1_Street | Condition1_Train | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | SBrkr | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | SBrkr | 0 | 2 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | SBrkr | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 rows × 280 columns
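Since `Condition1` and `Condition2` share one grouping, the mapping can be written once, group-first, and inverted into a lookup dict. This is a sketch of that idea on a four-row toy frame; the `groups` and `raw_to_group` names are illustrative and the grouping mirrors the mapping used above.

```python
import pandas as pd

# Same grouping as the mapping above, written as group -> raw levels.
groups = {
    "Norm": ["Norm"],
    "Street": ["Feedr", "Artery"],
    "Train": ["RRAe", "RRNn", "RRAn", "RRNe"],
    "Pos": ["PosN", "PosA"],
}
# Invert into raw level -> group for use with Series.map.
raw_to_group = {raw: group for group, raws in groups.items() for raw in raws}

# Toy rows for illustration.
df = pd.DataFrame({
    "Condition1": ["Norm", "Feedr", "RRAe", "PosN"],
    "Condition2": ["Norm", "Norm", "RRNn", "Artery"],
})
for col in ["Condition1", "Condition2"]:
    df[col] = df[col].map(raw_to_group)

# 1 when the second condition differs from the first after grouping.
df["Diff2ndCondition_Flag"] = (df["Condition1"] != df["Condition2"]).astype(int)
print(df)
```

Note that the flag is computed after grouping, so two different railroad codes (for example `RRAe` and `RRNn`) count as the same condition.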
"""
Linear feet of street connected to property.
"""
get_feature_corr('LotFrontage')
'''
Lot size in square feet.
'''
get_feature_corr('LotArea')
all_data['LotArea_Band'] = pd.qcut(all_data['LotArea'], 8,labels=list('12345678')) # use qcut to bin features whose distribution is skewed
all_data['LotArea_Band'].unique()
all_data['LotArea_Band'] = all_data['LotArea_Band'].astype(int)
all_data.drop('LotArea', axis=1, inplace=True)
all_data = pd.get_dummies(all_data, columns = ["LotArea_Band"], prefix="LotArea")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Electrical | EnclosedPorch | ExterQual | ... | Condition1_Street | Condition1_Train | LotArea_1 | LotArea_2 | LotArea_3 | LotArea_4 | LotArea_5 | LotArea_6 | LotArea_7 | LotArea_8 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | SBrkr | 0 | 3 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | SBrkr | 0 | 2 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | SBrkr | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 rows × 287 columns
"""
LotShape: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
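IR3 Irregular
This feature clearly affects the sale price: a lot needs not only a large area but also a reasonable shape, otherwise it is hard to sell at a high price.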
"""
get_feature_corr1('LotShape')
all_data = pd.get_dummies(all_data, columns = ["LotShape"], prefix="LotShape")
all_data.head(3)
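print("Lot shapes are concentrated in two values, Reg and IR1, and SalePrice varies considerably across the categories")
Lot shapes are concentrated in two values, Reg and IR1, and SalePrice varies considerably across the categories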
"""
LandContour: Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
"""
get_feature_corr1('LandContour')
all_data = pd.get_dummies(all_data, columns = ["LandContour"], prefix="LandContour")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Electrical | EnclosedPorch | ExterQual | ... | LotArea_7 | LotArea_8 | LotShape_IR1 | LotShape_IR2 | LotShape_IR3 | LotShape_Reg | LandContour_Bnk | LandContour_HLS | LandContour_Low | LandContour_Lvl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | SBrkr | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | SBrkr | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | SBrkr | 0 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 rows × 293 columns
"""
LotConfig: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac (dead-end street)
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
This feature describes the surroundings of the house.
"""
get_feature_corr1('LotConfig')
all_data['LotConfig'] = all_data['LotConfig'].map({"Inside":"Inside", "FR2":"FR", "Corner":"Corner", "CulDSac":"CulDSac", "FR3":"FR"})
all_data = pd.get_dummies(all_data, columns = ["LotConfig"], prefix="LotConfig")
all_data.head(3)
3SsnPorch | Alley | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Electrical | EnclosedPorch | ExterQual | ... | LotShape_IR3 | LotShape_Reg | LandContour_Bnk | LandContour_HLS | LandContour_Low | LandContour_Lvl | LotConfig_Corner | LotConfig_CulDSac | LotConfig_FR | LotConfig_Inside | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | None | 3 | 3 | 1 | 3 | Y | SBrkr | 0 | 3 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 0 | None | 3 | 3 | 4 | 3 | Y | SBrkr | 0 | 2 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 0 | None | 3 | 3 | 2 | 3 | Y | SBrkr | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
3 rows × 296 columns
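The merge-then-encode step used for LotConfig can be sketched on toy data. This is only an illustration: the small DataFrame below is made up, and `replace` is used in place of `map` so that values not listed are left untouched.

```python
import pandas as pd

# Toy data using the LotConfig codes from the data description (not the real dataset).
df = pd.DataFrame({"LotConfig": ["Inside", "FR2", "Corner", "CulDSac", "FR3", "Inside"]})

# Merge the two rare frontage categories into a single "FR" level;
# replace() leaves every value that is not listed unchanged.
df["LotConfig"] = df["LotConfig"].replace({"FR2": "FR", "FR3": "FR"})

# One-hot encode the merged column.
encoded = pd.get_dummies(df, columns=["LotConfig"], prefix="LotConfig")
print(sorted(encoded.columns))
```

Merging rare levels before encoding keeps the number of dummy columns small and avoids columns that are almost entirely zero.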
"""
LandSlope: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
"""
get_feature_corr1('LandSlope')
all_data['LandSlope'] = all_data['LandSlope'].map({"Gtl":1, "Mod":0, "Sev":0})
'''
Mod and Sev fall in the same SalePrice range, so the two are merged.
'''
all_data['LandSlope'].value_counts()
1 2774
0 141
Name: LandSlope, dtype: int64
get_feature_corr1('Street')
all_data.drop('Street', axis=1, inplace=True)
get_feature_corr1('Alley')
all_data['Alley'].value_counts()
None 2717
Grvl 120
Pave 78
Name: Alley, dtype: int64
all_data = pd.get_dummies(all_data, columns = ["Alley"], prefix="Alley")
all_data.head(3)
3SsnPorch | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Electrical | EnclosedPorch | ExterQual | FireplaceQu | ... | LandContour_HLS | LandContour_Low | LandContour_Lvl | LotConfig_Corner | LotConfig_CulDSac | LotConfig_FR | LotConfig_Inside | Alley_Grvl | Alley_None | Alley_Pave | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 3 | 1 | 3 | Y | SBrkr | 0 | 3 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 0 | 3 | 3 | 4 | 3 | Y | SBrkr | 0 | 2 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 0 | 3 | 3 | 2 | 3 | Y | SBrkr | 0 | 3 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
3 rows × 297 columns
"""
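PavedDrive: Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
Prices differ considerably between the categories and there is no clear ordinal relationship, so the feature is converted to one-hot columns.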
"""
get_feature_corr1('PavedDrive')
all_data=pd.get_dummies(all_data,columns=['PavedDrive'],prefix='PavedDrive')
all_data.head()
3SsnPorch | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Electrical | EnclosedPorch | ExterQual | FireplaceQu | ... | LotConfig_Corner | LotConfig_CulDSac | LotConfig_FR | LotConfig_Inside | Alley_Grvl | Alley_None | Alley_Pave | PavedDrive_N | PavedDrive_P | PavedDrive_Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 3 | 1 | 3 | Y | SBrkr | 0 | 3 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 0 | 3 | 3 | 4 | 3 | Y | SBrkr | 0 | 2 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
2 | 0 | 3 | 3 | 2 | 3 | Y | SBrkr | 0 | 3 | 3 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
3 | 0 | 3 | 4 | 1 | 2 | Y | SBrkr | 272 | 2 | 4 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
4 | 0 | 4 | 3 | 3 | 3 | Y | SBrkr | 0 | 3 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
5 rows × 299 columns
get_feature_corr1('Heating')
"""
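Almost all rows are GasA and the other categories have very few rows, so the feature can be reduced to a binary one: gas (GasA) heating versus every other heating type.
"""
all_data['Heating'] = all_data['Heating'].map({'GasA':1,'GasW':0,'Grav':0,'Wall':0,'OthW':0,'Floor':0})
# Note: the column is dropped on the next line, so the binary encoding above never reaches the model
all_data.drop('Heating', axis=1, inplace=True)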
all_data.head(3)
3SsnPorch | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | Electrical | EnclosedPorch | ExterQual | FireplaceQu | ... | LotConfig_Corner | LotConfig_CulDSac | LotConfig_FR | LotConfig_Inside | Alley_Grvl | Alley_None | Alley_Pave | PavedDrive_N | PavedDrive_P | PavedDrive_Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 3 | 1 | 3 | Y | SBrkr | 0 | 3 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 0 | 3 | 3 | 4 | 3 | Y | SBrkr | 0 | 2 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
2 | 0 | 3 | 3 | 2 | 3 | Y | SBrkr | 0 | 3 | 3 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
3 rows × 298 columns
"""
Heating quality and condition.
"""
get_feature_corr1('HeatingQC',order=['Po','Fa','TA','Gd','Ex'])
all_data['HeatingQC'] = all_data['HeatingQC'].map({"Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5})
all_data['HeatingQC'].unique()
array([5, 4, 3, 2, 1])
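For ordered quality grades such as HeatingQC, the integer mapping can be derived from the ordered list itself so that the codes cannot drift out of order. A minimal sketch, where `quality_order`, `quality_map` and the toy series are illustrative names that do not appear in the notebook:

```python
import pandas as pd

# Quality levels from worst to best, as passed to the plotting helper above.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# Build the mapping from the ordered list so the codes follow the ranking.
quality_map = {level: rank for rank, level in enumerate(quality_order, start=1)}

# Toy column; the real one would be all_data['HeatingQC'].
heating_qc = pd.Series(["Ex", "Gd", "TA", "Fa", "Po", "Ex"])
encoded = heating_qc.map(quality_map)
print(encoded.tolist())
```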
"""
Central air conditioning.
"""
get_feature_corr1('CentralAir')
all_data['CentralAir'] = all_data['CentralAir'].map({"Y":1,"N":0})
"""
Electrical system.
"""
get_feature_corr1('Electrical')
all_data['Electrical'] = all_data['Electrical'].map({'SBrkr':'SBrkr','FuseF':'Fuse','FuseA':'Fuse','FuseP':'Fuse','Mix':'Mix'})
all_data = pd.get_dummies(all_data, columns = ["Electrical"], prefix="Electrical")
all_data.head(3)
3SsnPorch | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | EnclosedPorch | ExterQual | FireplaceQu | Fireplaces | ... | LotConfig_Inside | Alley_Grvl | Alley_None | Alley_Pave | PavedDrive_N | PavedDrive_P | PavedDrive_Y | Electrical_Fuse | Electrical_Mix | Electrical_SBrkr | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 3 | 1 | 3 | 1 | 0 | 3 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
1 | 0 | 3 | 3 | 4 | 3 | 1 | 0 | 2 | 3 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
2 | 0 | 3 | 3 | 2 | 3 | 1 | 0 | 3 | 3 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
3 rows × 300 columns
all_data['MiscFeature'].value_counts()
None 2810
Shed 95
Gar2 5
Othr 4
TenC 1
Name: MiscFeature, dtype: int64
get_feature_corr1('MiscFeature')
'''
Too few informative rows; drop this feature.
'''
get_feature_corr1('MiscVal')
all_data['MiscVal'].value_counts()
"""
Too few informative rows; drop this feature.
"""
all_data.drop(['MiscVal','MiscFeature'],axis=1,inplace=True)
"""
month sold,Year Sold
"""
get_feature_corr1('MoSold')
get_feature_corr1('YrSold')
all_data = pd.get_dummies(all_data, columns = ["MoSold"], prefix="MoSold")
all_data = pd.get_dummies(all_data,columns=['YrSold'],prefix='YrSold')
all_data.head(3)
3SsnPorch | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | EnclosedPorch | ExterQual | FireplaceQu | Fireplaces | ... | MoSold_8 | MoSold_9 | MoSold_10 | MoSold_11 | MoSold_12 | YrSold_2006 | YrSold_2007 | YrSold_2008 | YrSold_2009 | YrSold_2010 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 3 | 1 | 3 | 1 | 0 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 3 | 3 | 4 | 3 | 1 | 0 | 2 | 3 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 0 | 3 | 3 | 2 | 3 | 1 | 0 | 3 | 3 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 rows × 313 columns
"""
SaleType: Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other
"""
get_feature_corr1('SaleType')
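# The key must be spelled 'ConLw' as in the data description; a value with no matching key is mapped to NaN
all_data['SaleType'] = all_data['SaleType'].map({'WD':"WD",'New':"New",'COD':"COD",'CWD':'Oth','ConLD':'Oth','ConLI':'Oth',
                                                 'ConLw':'Oth','Con':'Oth','Oth':'Oth'})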
all_data= pd.get_dummies(all_data,columns=['SaleType'],prefix='SaleType')
all_data.head()
3SsnPorch | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | EnclosedPorch | ExterQual | FireplaceQu | Fireplaces | ... | MoSold_12 | YrSold_2006 | YrSold_2007 | YrSold_2008 | YrSold_2009 | YrSold_2010 | SaleType_COD | SaleType_New | SaleType_Oth | SaleType_WD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 3 | 1 | 3 | 1 | 0 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 3 | 3 | 4 | 3 | 1 | 0 | 2 | 3 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 3 | 3 | 2 | 3 | 1 | 0 | 3 | 3 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 3 | 4 | 1 | 2 | 1 | 272 | 2 | 4 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 0 | 4 | 3 | 3 | 3 | 1 | 0 | 3 | 3 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 316 columns
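`Series.map` with a dictionary silently returns NaN for any value that has no key, so it can help to check for unmapped categories after a merge like the SaleType one. A minimal sketch on toy data; the series and the deliberately misspelled key are made up for illustration:

```python
import pandas as pd

# Toy SaleType values (not the real column).
sale_type = pd.Series(["WD", "New", "ConLw", "COD", "ConLD"])

# Hypothetical mapping with a deliberately misspelled key ("ConLW" instead of "ConLw").
mapping = {"WD": "WD", "New": "New", "COD": "COD", "ConLD": "Oth", "ConLW": "Oth"}

mapped = sale_type.map(mapping)

# Values without a matching key become NaN instead of raising an error,
# so list the original categories that ended up as NaN.
unmapped = sorted(sale_type[mapped.isna()].unique())
print(unmapped)
```

Rows that end up as NaN get all-zero dummy columns from `pd.get_dummies`, which is easy to miss later.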
"""
Condition of sale.
"""
get_feature_corr1('SaleCondition')
all_data = pd.get_dummies(all_data, columns = ["SaleCondition"], prefix="SaleCondition")
all_data.head(3)
3SsnPorch | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtQual | CentralAir | EnclosedPorch | ExterQual | FireplaceQu | Fireplaces | ... | SaleType_COD | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 3 | 1 | 3 | 1 | 0 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 3 | 3 | 4 | 3 | 1 | 0 | 2 | 3 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | 3 | 3 | 2 | 3 | 1 | 0 | 3 | 3 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 rows × 321 columns
from scipy.stats import skew, norm
plt.subplots(figsize=(15,12))
g = sns.distplot(train['SalePrice'],fit=norm,label="Skewness:%.2f" % (train['SalePrice'].skew()))
g.legend(loc='best')
<matplotlib.legend.Legend at 0x12f5f5cc0>
train["SalePrice"] = np.log1p(train["SalePrice"])
y_train = train["SalePrice"]
#Check the new distribution
plt.subplots(figsize=(15,10))
g = sns.distplot(train['SalePrice'], fit=norm, label = "Skewness : %.2f"%(train['SalePrice'].skew()));
g = g.legend(loc="best")
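The target is trained in `log1p` space, so predictions have to be mapped back with `expm1` before submission. A small sketch of the round trip and its effect on skewness, using synthetic log-normal prices rather than the competition data:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal); not the competition data.
prices = rng.lognormal(mean=12.0, sigma=0.4, size=2000)

log_prices = np.log1p(prices)    # transform applied to the target before training
restored = np.expm1(log_prices)  # inverse transform applied to predictions

print("skew before: %.2f, after: %.2f" % (skew(prices), skew(log_prices)))
```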
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
# Check how skewed they are
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
plt.subplots(figsize =(65, 20))
skewed_feats.plot(kind='bar');
from scipy.special import boxcox1p
skewness = skewed_feats[abs(skewed_feats) > 0.5]
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
all_data[feat] = boxcox1p(all_data[feat], lam)
print(skewness.shape[0], "skewed numerical features have been Box-Cox transformed")
294 skewed numerical features have been Box-Cox transformed
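For reference, `boxcox1p(x, lam)` computes ((1 + x)^lam - 1) / lam for a non-zero `lam`, and reduces to `log1p(x)` when `lam` is 0. The sketch below checks this on a few hand-picked values. With `lam = 0.15` an input of 1 becomes roughly 0.7305, which is why the one-hot columns that were transformed show 0.73046315 instead of 1 in the `X_train` array printed later.

```python
import numpy as np
from scipy.special import boxcox1p

x = np.array([0.0, 1.0, 10.0, 1000.0])
lam = 0.15

transformed = boxcox1p(x, lam)
# Closed form of the transform for lam != 0.
manual = ((1.0 + x) ** lam - 1.0) / lam

print(transformed)
```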
train = all_data[:ntrain]
test = all_data[ntrain:]
print(train.shape)
print(test.shape)
(1456, 321)
(1459, 321)
y_train.shape
(1456,)
import xgboost as xgb
model = xgb.XGBRegressor()
model.fit(train, y_train)
# Sort feature importances from the XGBoost model trained above
indices = np.argsort(model.feature_importances_)[::-1]
indices = indices[:75]
# Visualise these with a barplot
plt.subplots(figsize=(20, 15))
g = sns.barplot(y=train.columns[indices], x = model.feature_importances_[indices], orient='h')
g.set_xlabel("Relative importance",fontsize=12)
g.set_ylabel("Features",fontsize=12)
g.tick_params(labelsize=9)
g.set_title("XGB feature importance");
xgb_train = train.copy()
xgb_test = test.copy()
from sklearn.feature_selection import SelectFromModel
xgb_feat_red = SelectFromModel(model,prefit=True)
# Reduce the training and test datasets to the selected features
xgb_train = xgb_feat_red.transform(xgb_train)
xgb_test = xgb_feat_red.transform(xgb_test)
print('X_train: ', xgb_train.shape, '\nX_test: ', xgb_test.shape)
X_train: (1456, 47)
X_test: (1459, 47)
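When no `threshold` is given, `SelectFromModel` keeps the features whose importance is at least the mean importance (for estimators without an L1 penalty). The sketch below illustrates this on synthetic data. It swaps in a `RandomForestRegressor` for the XGBoost model so that it depends only on scikit-learn, and the data is made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Only the first two columns drive the target; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# With prefit=True the already-fitted model is used as-is; with no explicit
# threshold, features at or above the mean importance are kept.
selector = SelectFromModel(model, prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```

Passing an explicit `threshold` (for example `"median"` or a float) changes how many features survive.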
from sklearn import model_selection
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(xgb_train, y_train, test_size=0.3, random_state=42)
# X_train = predictor features for estimation dataset
# X_test = predictor variables for validation dataset
# Y_train = target variable for the estimation dataset
# Y_test = target variable for the validation dataset
print('X_train: ', X_train.shape, '\nX_test: ', X_test.shape, '\nY_train: ', Y_train.shape, '\nY_test: ', Y_test.shape)
X_train: (1019, 47)
X_test: (437, 47)
Y_train: (1019,)
Y_test: (437,)
X_train
array([[0.73046315, 3. , 0.73046315, ..., 0. , 0. ,
0. ],
[0.73046315, 3. , 0.73046315, ..., 0. , 0. ,
0. ],
[1.19431764, 2. , 0.73046315, ..., 0. , 0. ,
0. ],
...,
[1.8203341 , 3. , 0.73046315, ..., 0.73046315, 0. ,
0. ],
[0.73046315, 3. , 0.73046315, ..., 0. , 0. ,
0. ],
[1.54096276, 3. , 0.73046315, ..., 0. , 0. ,
0. ]])
# Import the different regression models from sklearn
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.kernel_ridge import KernelRidge
import xgboost as xgb
print('Algorithm packages imported!')
Algorithm packages imported!
# Model selection packages used for sampling dataset and optimising parameters
from sklearn import model_selection
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
print('Model selection packages imported!')
Model selection packages imported!
models = [KernelRidge(),ElasticNet(),Lasso(),GradientBoostingRegressor(),BayesianRidge(),LassoLarsIC(),RandomForestRegressor(),xgb.XGBRegressor()]
# Random sampling; a regular train_test_split with shuffle=True would also work
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
shuff = ShuffleSplit(n_splits=5,test_size=0.2,random_state=42)
# Create a DataFrame to hold each model's metrics
columns = ['Name','Parameters','Train mean_squared_error','Test mean_squared_error']
before_model_compare = pd.DataFrame(columns=columns)
# Add each model's parameters and results to the DataFrame
row_index=0
for alg in models:
    model_name = alg.__class__.__name__
    before_model_compare.loc[row_index,'Name'] = model_name
    before_model_compare.loc[row_index,'Parameters'] = str(alg.get_params())
    alg.fit(X_train,Y_train)
    # cross_val_score returns the negated MSE, so flip the sign before taking the square root (RMSE)
    training_results = np.sqrt((-cross_val_score(alg,X_train,Y_train,cv=shuff,scoring='neg_mean_squared_error')).mean())
    test_results = np.sqrt(((Y_test-alg.predict(X_test))**2).mean())
    # Despite the column names, the stored values are RMSE * 100
    before_model_compare.loc[row_index,"Train mean_squared_error"] = training_results*100
    before_model_compare.loc[row_index,'Test mean_squared_error'] = test_results*100
    row_index+=1
    print(row_index,model_name,"trained>>>>")
decimals = 3
before_model_compare['Train mean_squared_error'] = before_model_compare['Train mean_squared_error'].apply(lambda x:round(x,decimals))
# Round the Test column from its own values; reading from the Train column here makes both columns identical
before_model_compare['Test mean_squared_error'] = before_model_compare['Test mean_squared_error'].apply(lambda x:round(x,decimals))
before_model_compare
1 KernelRidge trained>>>>
2 ElasticNet trained>>>>
3 Lasso trained>>>>
4 GradientBoostingRegressor trained>>>>
5 BayesianRidge trained>>>>
6 LassoLarsIC trained>>>>
7 RandomForestRegressor trained>>>>
8 XGBRegressor trained>>>>
Name | Parameters | Train mean_squared_error | Test mean_squared_error | |
---|---|---|---|---|
0 | KernelRidge | {'alpha': 1, 'coef0': 1, 'degree': 3, 'gamma':... | 31.424 | 31.424 |
1 | ElasticNet | {'alpha': 1.0, 'copy_X': True, 'fit_intercept'... | 23.245 | 23.245 |
2 | Lasso | {'alpha': 1.0, 'copy_X': True, 'fit_intercept'... | 28.008 | 28.008 |
3 | GradientBoostingRegressor | {'alpha': 0.9, 'criterion': 'friedman_mse', 'i... | 12.381 | 12.381 |
4 | BayesianRidge | {'alpha_1': 1e-06, 'alpha_2': 1e-06, 'compute_... | 11.118 | 11.118 |
5 | LassoLarsIC | {'copy_X': True, 'criterion': 'aic', 'eps': 2.... | 11.818 | 11.818 |
6 | RandomForestRegressor | {'bootstrap': True, 'criterion': 'mse', 'max_d... | 14.299 | 14.299 |
7 | XGBRegressor | {'base_score': 0.5, 'booster': 'gbtree', 'cols... | 12.466 | 12.466 |
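The scoring convention used in the comparison loop can be isolated in a few lines: scikit-learn scorers follow "higher is better", so `neg_mean_squared_error` returns negated MSE values that must be sign-flipped before the square root. A minimal sketch on synthetic data, with `Ridge` standing in for any of the regressors above:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
neg_mse = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="neg_mean_squared_error")

# One negated MSE per split; average them, flip the sign, then take the root.
rmse = np.sqrt(-neg_mse.mean())
print("CV RMSE: %.4f" % rmse)
```

Because the target in the notebook is log-transformed, an RMSE computed this way is an error in log-price units.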
models = [KernelRidge(),ElasticNet(),Lasso(),GradientBoostingRegressor(),BayesianRidge(),LassoLarsIC(),RandomForestRegressor(),
xgb.XGBRegressor()]
KR_param_grid = {'alpha': [0.1], 'coef0': [100], 'degree': [1], 'gamma': [None], 'kernel': ['polynomial']}
EN_param_grid = {'alpha': [0.001], 'copy_X': [True], 'l1_ratio': [0.6], 'fit_intercept': [True], 'normalize': [False],
'precompute': [False], 'max_iter': [300], 'tol': [0.001], 'selection': ['random'], 'random_state': [None]}
LASS_param_grid = {'alpha': [0.0005], 'copy_X': [True], 'fit_intercept': [True], 'normalize': [False], 'precompute': [False],
'max_iter': [300], 'tol': [0.01], 'selection': ['random'], 'random_state': [None]}
GB_param_grid = {'loss': ['huber'], 'learning_rate': [0.1], 'n_estimators': [300], 'max_depth': [3],
'min_samples_split': [0.0025], 'min_samples_leaf': [5]}
BR_param_grid = {'n_iter': [200], 'tol': [0.00001], 'alpha_1': [0.00000001], 'alpha_2': [0.000005], 'lambda_1': [0.000005],
'lambda_2': [0.00000001], 'copy_X': [True]}
LL_param_grid = {'criterion': ['aic'], 'normalize': [True], 'max_iter': [100], 'copy_X': [True], 'precompute': ['auto'], 'eps': [0.000001]}
RFR_param_grid = {'n_estimators': [50], 'max_features': ['auto'], 'max_depth': [None], 'min_samples_split': [5], 'min_samples_leaf': [2]}
XGB_param_grid = {'max_depth': [3], 'learning_rate': [0.1], 'n_estimators': [300], 'booster': ['gbtree'], 'gamma': [0], 'reg_alpha': [0.1],
'reg_lambda': [0.7], 'max_delta_step': [0], 'min_child_weight': [1], 'colsample_bytree': [0.5], 'colsample_bylevel': [0.2],
'scale_pos_weight': [1]}
params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]
after_model_compare = pd.DataFrame(columns=columns)
row_index= 0
for alg, param_grid in zip(models, params_grid):
    gs_alg = GridSearchCV(alg,param_grid=param_grid,cv=shuff,scoring='neg_mean_squared_error',n_jobs=-1)
    model_name = alg.__class__.__name__
    after_model_compare.loc[row_index,'Name'] = model_name
    gs_alg.fit(X_train,Y_train)
    gs_best=gs_alg.best_estimator_
    after_model_compare.loc[row_index,"Parameters"] = str(gs_alg.best_params_)
    # best_score_ is the mean negated MSE across the CV splits
    after_training_results = np.sqrt(-gs_alg.best_score_)
    # Square the residuals (y - y_hat), not the predictions; otherwise the mean can be negative and sqrt returns NaN
    after_test_results = np.sqrt(((Y_test-gs_alg.predict(X_test))**2).mean())
    after_model_compare.loc[row_index,"Train mean_squared_error"] = after_training_results*100
    after_model_compare.loc[row_index,'Test mean_squared_error']= after_test_results*100
    row_index+=1
    print(row_index,model_name,"trained>>>>>")
decimals = 3
after_model_compare['Train mean_squared_error'] = after_model_compare['Train mean_squared_error'].apply(lambda x:round(x,decimals))
# Round the Test column from its own values; reading from the Train column here makes both columns identical
after_model_compare['Test mean_squared_error'] = after_model_compare['Test mean_squared_error'].apply(lambda x:round(x,decimals))
after_model_compare
1 KernelRidge trained>>>>>
2 ElasticNet trained>>>>>
3 Lasso trained>>>>>
4 GradientBoostingRegressor trained>>>>>
5 BayesianRidge trained>>>>>
6 LassoLarsIC trained>>>>>
7 RandomForestRegressor trained>>>>>
8 XGBRegressor trained>>>>>
Name | Parameters | Train mean_squared_error | Test mean_squared_error | |
---|---|---|---|---|
0 | KernelRidge | {'alpha': 0.1, 'coef0': 100, 'degree': 1, 'gam... | 11.140 | 11.140 |
1 | ElasticNet | {'alpha': 0.001, 'copy_X': True, 'fit_intercep... | 11.234 | 11.234 |
2 | Lasso | {'alpha': 0.0005, 'copy_X': True, 'fit_interce... | 11.203 | 11.203 |
3 | GradientBoostingRegressor | {'learning_rate': 0.1, 'loss': 'huber', 'max_d... | 11.966 | 11.966 |
4 | BayesianRidge | {'alpha_1': 1e-08, 'alpha_2': 5e-06, 'copy_X':... | 11.118 | 11.118 |
5 | LassoLarsIC | {'copy_X': True, 'criterion': 'aic', 'eps': 1e... | 11.818 | 11.818 |
6 | RandomForestRegressor | {'max_depth': None, 'max_features': 'auto', 'm... | 13.735 | 13.735 |
7 | XGBRegressor | {'booster': 'gbtree', 'colsample_bylevel': 0.2... | 11.964 | 11.964 |
models = [KernelRidge(),ElasticNet(),Lasso(),GradientBoostingRegressor(),BayesianRidge(),LassoLarsIC(),RandomForestRegressor(),xgb.XGBRegressor()]
names = ['KernelRidge','ElasticNet','Lasso','GradientBoostingRegressor','BayesianRidge','LassoLarsIC','RandomForest','XGBoost']
params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]
stacked_validation_train = pd.DataFrame()
stacked_test_train = pd.DataFrame()
row_index= 0
for alg, name, param_grid in zip(models, names, params_grid):
    gs_alg = GridSearchCV(alg,param_grid=param_grid,cv=shuff,scoring='neg_mean_squared_error',n_jobs=-1)
    gs_alg.fit(X_train,Y_train)
    gs_best = gs_alg.best_estimator_
    # DataFrame.insert: loc is the column position, column is the column name, value is the column contents
    stacked_validation_train.insert(loc=row_index,column=name,value=gs_best.predict(X_test))
    print(row_index+1,alg.__class__.__name__,"validation-set predictions stacked into the new training set")
    stacked_test_train.insert(loc=row_index,column=name,value=gs_best.predict(xgb_test))
    print(row_index+1,alg.__class__.__name__,"test-set predictions stacked into the new test set")
    print("---"*50)
    row_index+=1
print("First-layer processing finished; the new training and test sets are ready")
1 KernelRidge validation-set predictions stacked into the new training set
1 KernelRidge test-set predictions stacked into the new test set
------------------------------------------------------------------------------------------------------------------------------------------------------
2 ElasticNet validation-set predictions stacked into the new training set
2 ElasticNet test-set predictions stacked into the new test set
------------------------------------------------------------------------------------------------------------------------------------------------------
3 Lasso validation-set predictions stacked into the new training set
3 Lasso test-set predictions stacked into the new test set
------------------------------------------------------------------------------------------------------------------------------------------------------
4 GradientBoostingRegressor validation-set predictions stacked into the new training set
4 GradientBoostingRegressor test-set predictions stacked into the new test set
------------------------------------------------------------------------------------------------------------------------------------------------------
5 BayesianRidge validation-set predictions stacked into the new training set
5 BayesianRidge test-set predictions stacked into the new test set
------------------------------------------------------------------------------------------------------------------------------------------------------
6 LassoLarsIC validation-set predictions stacked into the new training set
6 LassoLarsIC test-set predictions stacked into the new test set
------------------------------------------------------------------------------------------------------------------------------------------------------
7 RandomForestRegressor validation-set predictions stacked into the new training set
7 RandomForestRegressor test-set predictions stacked into the new test set
------------------------------------------------------------------------------------------------------------------------------------------------------
8 XGBRegressor validation-set predictions stacked into the new training set
8 XGBRegressor test-set predictions stacked into the new test set
------------------------------------------------------------------------------------------------------------------------------------------------------
First-layer processing finished; the new training and test sets are ready
print(stacked_validation_train.shape)
stacked_validation_train.head()
# Predictions for the validation rows (the rows that correspond to Y_test)
(437, 8)
KernelRidge | ElasticNet | Lasso | GradientBoostingRegressor | BayesianRidge | LassoLarsIC | RandomForest | XGBoost | |
---|---|---|---|---|---|---|---|---|
0 | 12.096814 | 12.095574 | 12.095347 | 12.103610 | 12.095675 | 12.104932 | 12.170897 | 12.084927 |
1 | 11.952395 | 11.966939 | 11.964576 | 12.027570 | 11.957859 | 11.999328 | 12.066678 | 12.071651 |
2 | 11.798390 | 11.800390 | 11.807569 | 11.842686 | 11.807968 | 11.787126 | 11.880778 | 11.789903 |
3 | 11.834224 | 11.814334 | 11.820662 | 11.806835 | 11.840026 | 11.837654 | 11.755137 | 11.753889 |
4 | 11.287412 | 11.267859 | 11.271162 | 11.150576 | 11.289689 | 11.290524 | 11.328786 | 11.278980 |
print(stacked_test_train.shape)
stacked_test_train.head()
(1459, 8)
KernelRidge | ElasticNet | Lasso | GradientBoostingRegressor | BayesianRidge | LassoLarsIC | RandomForest | XGBoost | |
---|---|---|---|---|---|---|---|---|
0 | 11.655653 | 11.666206 | 11.661235 | 11.717153 | 11.664298 | 11.639410 | 11.735618 | 11.754628 |
1 | 12.033653 | 12.042914 | 12.039875 | 11.950150 | 12.032724 | 12.007921 | 11.956780 | 11.985191 |
2 | 12.121196 | 12.121925 | 12.124266 | 12.138572 | 12.125334 | 12.072644 | 12.097413 | 12.115376 |
3 | 12.194246 | 12.200128 | 12.201113 | 12.166538 | 12.196015 | 12.143436 | 12.095009 | 12.139894 |
4 | 12.171520 | 12.180859 | 12.179168 | 12.145913 | 12.167523 | 12.168576 | 12.178091 | 12.176064 |
stacked_validation_train.drop('Lasso',axis=1,inplace=True)
stacked_test_train.drop('Lasso',axis=1,inplace=True)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
meta_model = make_pipeline(RobustScaler(),Lasso(alpha=0.00001,copy_X=True,fit_intercept=True,normalize=False,precompute=False,
max_iter=10000,tol=0.0001,selection='random',random_state=42))
meta_model.fit(stacked_validation_train,Y_test)
meta_model_pred= np.expm1(meta_model.predict(stacked_test_train))
print("meta_model trained and test-set predictions generated")
meta_model trained and test-set predictions generated
/Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.7538551527086552, tolerance: 0.006483051719467419
positive)
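The two-layer scheme above (base models fitted on the estimation split, a meta-model fitted on their hold-out predictions) can be condensed into a small self-contained sketch. Everything here is illustrative: the data is synthetic, only two base models are used, and `X_new` merely stands in for the competition test set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=400)
X_new = rng.normal(size=(50, 6))  # stands in for the competition test set

X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.3, random_state=42)

base_models = {
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Layer 1: fit each base model on the estimation split, then predict the
# hold-out split (meta-model training data) and the new data (meta-model input).
hold_preds = np.column_stack([m.fit(X_fit, y_fit).predict(X_hold) for m in base_models.values()])
new_preds = np.column_stack([m.predict(X_new) for m in base_models.values()])

# Layer 2: the meta-model learns how to combine the base predictions.
meta = Lasso(alpha=1e-4, max_iter=10000).fit(hold_preds, y_hold)
final = meta.predict(new_preds)
print(final.shape)
```

The meta-model only ever sees predictions for rows the base models were not fitted on, which is what keeps it from simply rewarding the most overfitted base model.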
models = [KernelRidge(), ElasticNet(), Lasso(), GradientBoostingRegressor(), BayesianRidge(), LassoLarsIC(), RandomForestRegressor(), xgb.XGBRegressor()]
names = ['KernelRidge', 'ElasticNet', 'Lasso', 'Gradient Boosting', 'Bayesian Ridge', 'Lasso Lars IC', 'Random Forest', 'XGBoost']
params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]
final_predictions = pd.DataFrame()
row_index=0
for alg, name, param_grid in zip(models, names, params_grid):
    gs_alg = GridSearchCV(alg, param_grid = param_grid, cv = shuff, scoring = 'neg_mean_squared_error', n_jobs=-1)
    # Second layer: fit on the stacked validation predictions
    gs_alg.fit(stacked_validation_train, Y_test)
    gs_best = gs_alg.best_estimator_
    # expm1 undoes the log1p applied to SalePrice
    final_predictions.insert(loc = row_index, column = name, value = np.expm1(gs_best.predict(stacked_test_train)))
    print(row_index+1, alg.__class__.__name__, 'final results predicted added to table...')
    row_index+=1
print("-"*50)
print("Done")
final_predictions.head()
1 KernelRidge final results predicted added to table...
2 ElasticNet final results predicted added to table...
3 Lasso final results predicted added to table...
4 GradientBoostingRegressor final results predicted added to table...
5 BayesianRidge final results predicted added to table...
6 LassoLarsIC final results predicted added to table...
7 RandomForestRegressor final results predicted added to table...
8 XGBRegressor final results predicted added to table...
--------------------------------------------------
Done
KernelRidge | ElasticNet | Lasso | Gradient Boosting | Bayesian Ridge | Lasso Lars IC | Random Forest | XGBoost | |
---|---|---|---|---|---|---|---|---|
0 | 120698.786728 | 121126.968875 | 120569.541877 | 119545.552352 | 121817.672344 | 121618.593011 | 120774.731602 | 117987.320312 |
1 | 162778.261755 | 162293.616103 | 163198.661456 | 154034.245333 | 162888.953970 | 162663.194168 | 154944.085742 | 154422.265625 |
2 | 184187.690046 | 183822.395933 | 184145.902661 | 181996.954345 | 185167.984485 | 184643.383928 | 181824.224304 | 174336.687500 |
3 | 193128.541814 | 192388.040730 | 193035.580999 | 195110.109361 | 193760.580424 | 193069.794744 | 188563.541259 | 181933.593750 |
4 | 192957.823204 | 192839.290437 | 193289.070140 | 192292.299199 | 192910.466862 | 192890.725826 | 190770.891456 | 192144.093750 |
# Weighted blend of the meta-model and the second-layer models; the weights sum to 1
ensemble = (meta_model_pred*(1/10)
            + final_predictions['XGBoost']*(1.5/10)
            + final_predictions['Gradient Boosting']*(2/10)
            + final_predictions['Bayesian Ridge']*(1/10)
            + final_predictions['Lasso']*(1/10)
            + final_predictions['KernelRidge']*(1/10)
            + final_predictions['Lasso Lars IC']*(1/10)
            + final_predictions['Random Forest']*(1.5/10))
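When blend weights are tuned by hand, it is worth checking that they still sum to 1 so the blended prices stay on the same scale. A small sketch, assuming the weights are kept in a dictionary (the `weights` dict and the toy predictions are illustrative and not part of the notebook):

```python
import numpy as np

# Weights copied from the blend above, in tenths.
weights = {
    "meta_model": 1.0, "XGBoost": 1.5, "Gradient Boosting": 2.0, "Bayesian Ridge": 1.0,
    "Lasso": 1.0, "KernelRidge": 1.0, "Lasso Lars IC": 1.0, "Random Forest": 1.5,
}
total = sum(weights.values())
normalized = {name: w / total for name, w in weights.items()}

# Toy predictions: every model predicts 100000, so a properly
# normalized blend must return 100000 as well.
preds = {name: np.full(3, 100000.0) for name in weights}
blend = sum(normalized[name] * preds[name] for name in weights)
print(total, blend)
```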
submission = pd.DataFrame()
test1 = pd.read_csv('test.csv',index_col=False)
test_ID = test1['Id']
submission['Id'] = test_ID
submission['SalePrice'] = ensemble
submission.to_csv('final_submission.csv',index=False)
print("Submission file, created!")
Submission file, created!
Original article: https://www.cnblogs.com/onemorepoint/p/11236051.html