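Discrete feature encoding falls into two cases: features whose values have a meaningful order, and features whose values do not.
1. Features whose values have no meaningful order are one-hot encoded directly.
2. Features whose values have a meaningful order are encoded with a mapping to integers.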
    import pandas as pd

    # Toy data: color is unordered, size is ordered, length is numeric
    df = pd.DataFrame([
        ['green', 'M', 10.1, 'label1'],
        ['red', 'L', 13.5, 'label2'],
        ['blue', 'XL', 15.3, 'label2']])
    df.columns = ['color', 'size', 'length', 'label']
    df

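    # Ordered feature: map each size to an integer that preserves the order
    size_mapping = {
        'XL': 3,
        'L': 2,
        'M': 1}
    df['size'] = df['size'].map(size_mapping)

    # Class labels: sort the unique values so the integer ids are reproducible
    # (iterating a bare set of strings can give a different order on each run)
    label_mapping = {lab: idx for idx, lab in enumerate(sorted(set(df['label'])))}
    df['label'] = df['label'].map(label_mapping)
    df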
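
To read encoded values back, the mapping dictionary can be inverted. The following is a minimal sketch, assuming the same toy data and size_mapping as above; the names encoded, decoded, and inv_size_mapping are introduced here for illustration and are not part of the original post.

```python
import pandas as pd

df = pd.DataFrame(
    [['green', 'M', 10.1, 'label1'],
     ['red', 'L', 13.5, 'label2'],
     ['blue', 'XL', 15.3, 'label2']],
    columns=['color', 'size', 'length', 'label'])

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# Forward mapping: size strings -> ordered integers
encoded = df['size'].map(size_mapping)

# Hypothetical helper: swap keys and values to go back to the strings
inv_size_mapping = {v: k for k, v in size_mapping.items()}
decoded = encoded.map(inv_size_mapping)

print(encoded.tolist())
print(decoded.tolist())
```

This only works when the mapping is one-to-one; if two categories shared an integer, the inverse would silently keep just one of them.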

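One-hot encoding by calling pd.get_dummies directly
The function does not distinguish whether a feature has a meaningful order: every string column (here color, size, and label) is expanded into indicator columns, while the numeric length column is left unchanged.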
    import pandas as pd

    df = pd.DataFrame([
        ['green', 'M', 10.1, 'label1'],
        ['red', 'L', 13.5, 'label2'],
        ['blue', 'XL', 15.3, 'label2']])
    df.columns = ['color', 'size', 'length', 'label']
    # Encodes all string columns, including the ordered size column
    pd.get_dummies(df)
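
One way to combine the two rules is to map the ordered column first and then restrict one-hot encoding to the unordered column through the columns parameter of pd.get_dummies. This is a sketch under the assumption that only color should be one-hot encoded; the variable name encoded is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [['green', 'M', 10.1, 'label1'],
     ['red', 'L', 13.5, 'label2'],
     ['blue', 'XL', 15.3, 'label2']],
    columns=['color', 'size', 'length', 'label'])

# Ordered feature: keep the order by mapping to integers
df['size'] = df['size'].map({'XL': 3, 'L': 2, 'M': 1})

# Unordered feature: one-hot encode only the listed column
encoded = pd.get_dummies(df, columns=['color'])

print(list(encoded.columns))
```

With columns given, the label column is left as strings, so it still needs its own mapping if integer class ids are wanted.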

get_dummies usage:
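    import pandas as pd

    # A Series becomes one indicator column per distinct value
    s = pd.Series(list('abca'))
    pd.get_dummies(s)

    # For a DataFrame, prefix supplies one name per encoded string column (A, B)
    df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                       'C': [1, 2, 3]})
    pd.get_dummies(df, prefix=['col1', 'col2'])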
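
To make the effect of prefix easier to see, the sketch below prints the resulting column names for the same DataFrame. The variable name dummies is illustrative; the numeric column C is passed through untouched and appears before the generated indicator columns.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})

# One prefix per string column, in column order: A -> col1, B -> col2
dummies = pd.get_dummies(df, prefix=['col1', 'col2'])

print(list(dummies.columns))
# ['C', 'col1_a', 'col1_b', 'col2_a', 'col2_b', 'col2_c']
```
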
Python discrete feature encoding
Original article: https://www.cnblogs.com/fujian-code/p/9011589.html