Machine Learning: Handling Text and Categorical Attributes

Date: 2019-03-28 18:53:15

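Datasets often contain text-valued attributes. Because machine learning algorithms work on numbers, these text labels have to be converted to numeric form first. Scikit-learn provides several transformers for this purpose; this post summarizes how to use LabelEncoder, OneHotEncoder, and LabelBinarizer.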

1. Class diagram

[Figure: class diagram]

2. Using LabelEncoder

Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

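Example:

from sklearn.preprocessing import LabelEncoder

labels = [1, 2, 2, 6]
le = LabelEncoder()
le.fit(labels)  # learn the sorted set of distinct labels
print("classes_:", le.classes_)
print("transform:", le.transform([1, 2, 2, 6]))
print("fit_transform:", le.fit_transform(labels))
print("inverse_transform:", le.inverse_transform([0, 0, 1, 2]))

Output:

classes_: [1 2 6]
transform: [0 1 1 2]
fit_transform: [0 1 1 2]
inverse_transform: [1 1 2 6]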

When the labels are text instead:

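from sklearn.preprocessing import LabelEncoder

labels = ['aa', 'cc', 'dd', 'bb']
le = LabelEncoder()
le.fit(labels)  # classes are stored in sorted order
print("classes_:", le.classes_)
print("transform:", le.transform(['aa', 'cc', 'dd', 'bb']))
print("fit_transform:", le.fit_transform(labels))
print("inverse_transform:", le.inverse_transform([0, 2, 3, 1]))

Output:

classes_: ['aa' 'bb' 'cc' 'dd']
transform: [0 2 3 1]
fit_transform: [0 2 3 1]
inverse_transform: ['aa' 'cc' 'dd' 'bb']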
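The label-to-code mapping that LabelEncoder learns can be read off `classes_`, since the integer code of a label is its position in that sorted array. The following is a minimal sketch under that assumption; the variable names `mapping` and `unseen_ok` and the label 'zz' are illustrative only. It also shows that a label not seen during `fit` is rejected with a `ValueError`.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['aa', 'cc', 'dd', 'bb'])

# The code assigned to each label is its index in the sorted classes_ array.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# A label that was not present during fit cannot be transformed.
try:
    le.transform(['zz'])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print("unseen label accepted:", unseen_ok)
```

Because the codes follow sorted order, they carry no meaning beyond identity; two labels with adjacent codes are not more similar than any other pair.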

3. Using OneHotEncoder

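OneHotEncoder(n_values='auto', categorical_features='all', dtype=<class 'numpy.float64'>, sparse=True, handle_unknown='error')

This is the constructor signature from older scikit-learn releases. The n_values and categorical_features parameters were deprecated in version 0.20 and removed in 0.22, and sparse was later renamed sparse_output.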

(1) Parameters

n_values = 'auto': the number of columns used for each feature is inferred from the data, one column per distinct category. It can also be set explicitly:

- int : number of categorical values per feature.
            Each feature value should be in ``range(n_values)``
- array : ``n_values[i]`` is the number of categorical values in
          ``X[:, i]``. Each feature value should be
          in ``range(n_values[i])``

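categorical_features = 'all': specifies which features are encoded. By default every feature is treated as categorical and encoded. A subset can be selected with an array of column indices or a boolean mask.

dtype: the numeric type of the encoded output; floating point by default.

sparse: the output format. The default, True, returns a SciPy sparse matrix; with False a dense array is returned and no call to toarray() is needed.

handle_unknown: either 'error' or 'ignore', i.e. whether a category that was not seen during fit raises an error or is ignored during transform.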
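The effect of handle_unknown and dtype can be illustrated with a small sketch. The color strings and variable names below are made up for the example. It relies on the default sparse output and calls `toarray()`, which avoids depending on whether the installed version spells the dense switch `sparse` or `sparse_output`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['red'], ['green'], ['blue']])

# 'ignore' encodes categories unseen during fit as an all-zero row
# instead of raising an error; dtype selects the output number type.
enc = OneHotEncoder(handle_unknown='ignore', dtype=np.int64)
enc.fit(X_train)

X_new = np.array([['green'], ['purple']])  # 'purple' was never seen
encoded = enc.transform(X_new).toarray()
print(encoded)
```

With the default handle_unknown='error', the same `transform` call would raise a `ValueError` for 'purple'.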

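(2) Example

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
# Each column is one categorical feature; each row is one sample.
enc.fit([[0, 0, 3],
         [1, 1, 0],
         [0, 2, 1],
         [1, 0, 2]])
print(enc.transform([[0, 1, 3]]).toarray())

Output:

[[1. 0. 0. 1. 0. 0. 0. 0. 1.]]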

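Take the first feature, i.e. the first column $[0, 1, 0, 1]$. It has two distinct values, so it is encoded with two columns, and the value 0 in the sample $[0, 1, 3]$ becomes $[1, 0]$.

The second feature, the second column $[0, 1, 2, 0]$, has three distinct values and is encoded with three columns; the value 1 becomes $[0, 1, 0]$.

The third feature, the third column $[3, 0, 1, 2]$, has four distinct values and is encoded with four columns; the value 3 becomes $[0, 0, 0, 1]$.

Concatenating the three blocks gives the nine-element row $[1, 0, 0, 1, 0, 0, 0, 0, 1]$.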
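The per-feature breakdown above can be checked programmatically. This sketch assumes a scikit-learn version that exposes the fitted `categories_` attribute (0.20 or later); `sizes` and `blocks` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [[0, 0, 3],
     [1, 1, 0],
     [0, 2, 1],
     [1, 0, 2]]
enc = OneHotEncoder().fit(X)

# categories_ holds the sorted distinct values learned for each column,
# so its lengths give the width of each one-hot block.
sizes = [len(c) for c in enc.categories_]

row = enc.transform([[0, 1, 3]]).toarray()[0]
# Cut the flat encoded row at the cumulative block boundaries.
blocks = np.split(row, np.cumsum(sizes)[:-1])
for size, block in zip(sizes, blocks):
    print(size, block)
```

The total output width is the sum of the block sizes, here 2 + 3 + 4 = 9.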

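Example:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

array = np.array([1, 2, 3])
ohe = OneHotEncoder()
# The encoder expects a 2-D input, so reshape to a single column.
labels = ohe.fit_transform(array.reshape(-1, 1))
print(labels.toarray())

Output:

[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]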
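In recent scikit-learn releases the categorical_features argument is, as far as I know, no longer accepted, and selecting which columns to encode is typically done with `ColumnTransformer`. The sketch below is one way to do that and is not taken from the original post; the data and the transformer name 'onehot' are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Column 0 is categorical, column 1 is a numeric feature to keep as-is.
X = np.array([[0, 10.5],
              [1, 20.0],
              [2, 30.5]])

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), [0])],  # encode only column 0
    remainder='passthrough',             # append untouched columns
)
out = ct.fit_transform(X)
# The result may be sparse or dense depending on its density.
if hasattr(out, 'toarray'):
    out = out.toarray()
print(out)
```

The encoded columns come first in the output, followed by the passed-through columns.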

4. Using LabelBinarizer

(1) Parameters

neg_label : int (default: 0)
    Value with which negative labels must be encoded.

pos_label : int (default: 1)
    Value with which positive labels must be encoded.

sparse_output : boolean (default: False)
    True if the returned array from transform is desired to be in sparse
    CSR format.
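As a small illustration of neg_label and pos_label, the sketch below encodes a binary label with -1 and 1 instead of the default 0 and 1, then maps the result back. The variable names are illustrative.

```python
from sklearn.preprocessing import LabelBinarizer

# Encode the negative class as -1 and the positive class as 1.
lb = LabelBinarizer(neg_label=-1, pos_label=1)
encoded = lb.fit_transform(['yes', 'no', 'no', 'yes'])
print(encoded)

# inverse_transform recovers the original labels.
decoded = lb.inverse_transform(encoded)
print(decoded)
```

Classes are sorted, so 'no' is the negative class and 'yes' the positive one.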

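(2) Example

LabelBinarizer performs the work of LabelEncoder and OneHotEncoder in a single step: text categories are converted to integer categories and then to one-hot vectors. When there are exactly two classes, as in this example, the output is a single 0/1 column rather than two columns.

from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
print(lb.fit_transform(['yes', 'no', 'no', 'yes']))

Output:

[[1]
 [0]
 [0]
 [1]]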
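The equivalence between the two-step and one-step approaches is easier to see with more than two classes. The following sketch, with made-up labels, compares LabelEncoder followed by OneHotEncoder against LabelBinarizer alone.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

labels = ['cat', 'dog', 'bird', 'dog']

# Two steps: text -> integer codes -> one-hot rows.
codes = LabelEncoder().fit_transform(labels)
two_step = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

# One step: text -> one-hot rows.
one_step = LabelBinarizer().fit_transform(labels)

print(one_step)
print("same values:", np.array_equal(two_step, one_step))
```

The two results hold the same values, although the two-step version is floating point by default and the one-step version is integer.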

Source: https://www.cnblogs.com/xiaobingqianrui/p/10593182.html