Machine Learning Classification Algorithms: Decision Trees and Random Forests (2.3)

Date: 2021-05-26 14:57:09

Three Algorithms for Building Decision Trees

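Information gain is not the only splitting principle for decision trees; other criteria exist. They all work along similar lines, so we will not go through a worked calculation for each one.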

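ID3
- Chooses the split with the largest information gain
C4.5
- Chooses the split with the largest information gain ratio
CART
- Classification tree: chooses the split with the smallest Gini index; this is the default split criterion in sklearn
- Advantage: the splits are finer-grained (see the tree visualization in the later example)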
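
To make the three criteria concrete, here is a minimal sketch that computes them for one candidate split. The function names and the toy label lists (`parent`, `left`, `right`) are made up for illustration and are not taken from any library.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each class in `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def gini(labels):
    """Gini impurity: 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def information_gain(parent, children):
    """ID3 criterion: parent entropy minus size-weighted child entropy."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """C4.5 criterion: information gain divided by the split information."""
    total = len(parent)
    split_info = -sum(
        len(c) / total * math.log2(len(c) / total) for c in children if c
    )
    return information_gain(parent, children) / split_info if split_info > 0 else 0.0


def weighted_gini(children):
    """CART criterion: size-weighted Gini impurity of the child nodes."""
    total = sum(len(c) for c in children)
    return sum(len(c) / total * gini(c) for c in children)


# Hypothetical split: 10 samples divided into two child nodes of 5 each.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print("information gain:", information_gain(parent, [left, right]))
print("gain ratio:", gain_ratio(parent, [left, right]))
print("weighted Gini:", weighted_gini([left, right]))
```

For this toy split the parent entropy is 1 bit and the parent Gini impurity is 0.5; after the split the weighted Gini drops to 0.32 and the information gain is about 0.278. ID3 and C4.5 would prefer the split that maximizes their score, while CART prefers the one that minimizes the weighted Gini.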

Decision Tree API

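class sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=None)
- Decision tree classifier
- criterion: defaults to 'gini' (the Gini index); 'entropy' selects the entropy used for information gain instead
- max_depth: the maximum depth of the tree
- random_state: the random number seed
Some of these are hyperparameters, for example max_depth (the depth of the tree).
- The other hyperparameters are discussed together with random forests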
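
As a quick illustration of the three parameters above, the following sketch fits a small tree on the iris dataset. The dataset choice and the specific values (`criterion="entropy"`, `max_depth=3`, seed 22) are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset and hold out part of it for testing.
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=22
)

# Use entropy instead of the default Gini index and cap the tree depth.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=22)
tree.fit(x_train, y_train)

depth = tree.get_depth()
accuracy = tree.score(x_test, y_test)
print("tree depth:", depth)
print("test accuracy:", accuracy)
```

Fixing `random_state` makes both the split and the fitted tree reproducible between runs.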

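Workflow:
1) Get the data
2) Process the data
- Handle missing values
- Convert the feature values to dictionaries
3) Prepare the feature values and target values
4) Split the dataset
5) Feature engineering: dictionary feature extraction
6) Run the decision tree estimator
7) Evaluate the model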
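
The seven steps above can be sketched as follows. The source does not say which dataset it uses, so the passenger records here are entirely made up, and filling missing ages with the mean is just one possible choice.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1) Get the data: a tiny hypothetical passenger table.
records = [
    {"pclass": "1st", "age": 29.0, "sex": "female", "survived": 1},
    {"pclass": "1st", "age": None, "sex": "male", "survived": 0},
    {"pclass": "2nd", "age": 35.0, "sex": "female", "survived": 1},
    {"pclass": "3rd", "age": 22.0, "sex": "male", "survived": 0},
    {"pclass": "3rd", "age": None, "sex": "female", "survived": 1},
    {"pclass": "2nd", "age": 41.0, "sex": "male", "survived": 0},
    {"pclass": "1st", "age": 50.0, "sex": "female", "survived": 1},
    {"pclass": "3rd", "age": 18.0, "sex": "male", "survived": 0},
    {"pclass": "2nd", "age": 27.0, "sex": "female", "survived": 1},
    {"pclass": "3rd", "age": 30.0, "sex": "male", "survived": 0},
    {"pclass": "1st", "age": 45.0, "sex": "male", "survived": 1},
    {"pclass": "3rd", "age": 24.0, "sex": "female", "survived": 0},
]

# 2) Process the data: fill missing ages with the mean of the known ages.
known_ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
for r in records:
    if r["age"] is None:
        r["age"] = mean_age

# 3) Prepare feature dictionaries and target values.
features = [{k: r[k] for k in ("pclass", "age", "sex")} for r in records]
target = [r["survived"] for r in records]

# 4) Split the dataset.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=22
)

# 5) Feature engineering: one-hot encode string values, pass numbers through.
transfer = DictVectorizer(sparse=False)
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# 6) Fit the decision tree estimator.
estimator = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=22)
estimator.fit(x_train, y_train)

# 7) Evaluate the model.
accuracy = estimator.score(x_test, y_test)
print("feature names:", transfer.feature_names_)
print("test accuracy:", accuracy)
```

Note that the vectorizer is fitted on the training split only and then reused on the test split, so the test data never influences the encoding.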

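Advantages:
- Simple to understand and interpret; the tree can be visualized.
Disadvantages:
- A decision tree learner can build an overly complex tree that does not generalize well to new data; this is called overfitting.
Improvements:
- Pruning, as in the CART algorithm (already implemented in the decision tree API; covered under random forest parameter tuning)
- Random forests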
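
To see what pruning does to tree complexity, here is a small sketch comparing an unrestricted tree with two restricted ones. The source does not name the pruning parameters it has in mind; `max_depth` and `ccp_alpha` (cost-complexity pruning, which requires scikit-learn 0.22 or later) are used here as assumptions, and the values 3 and 0.02 are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=22
)

# Same seed for every model so that only the pruning setting differs.
models = {
    "unpruned": DecisionTreeClassifier(random_state=22),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=22),
    "ccp_alpha=0.02": DecisionTreeClassifier(ccp_alpha=0.02, random_state=22),
}

node_counts = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    node_counts[name] = model.tree_.node_count
    print(
        name,
        "| nodes:", node_counts[name],
        "| train acc:", model.score(x_train, y_train),
        "| test acc:", model.score(x_test, y_test),
    )
```

A large gap between training and test accuracy on the unpruned tree is the usual sign of overfitting; the pruned trees trade a little training accuracy for a smaller model.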

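Random Forest
Forest: a classifier made up of multiple decision trees
How it works
Training set:
- N samples, each with feature values and a target value
- M features
Two sources of randomness:
- Random training set: draw N samples at random from the N samples, with replacement (bootstrap sampling)
  For example, from [1, 2, 3, 4, 5] the training set for a new tree might be [2, 2, 3, 1, 5]
- Random features: draw m features at random from the M features
  M >> m, which has the effect of dimensionality reduction
Summary
- Runs effectively on large datasets
- Handles input samples with high-dimensional features, with no need for dimensionality reduction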
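
The two sources of randomness can be sketched with NumPy as below. The values of N and M and the choice m = sqrt(M) are assumptions for illustration. Also note that scikit-learn's RandomForestClassifier draws the feature subset at each split rather than once per tree, so this follows the simplified description in the text, not the library's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(22)

N = 5    # number of training samples (matches the [1, 2, 3, 4, 5] example)
M = 16   # total number of features (hypothetical)
m = int(np.sqrt(M))  # features per tree; sqrt(M) is a common heuristic

sample_ids = np.arange(1, N + 1)

# Randomness 1: bootstrap, i.e. draw N samples with replacement,
# so some ids may repeat and others may be left out.
bootstrap_ids = rng.choice(sample_ids, size=N, replace=True)

# Randomness 2: draw m distinct feature indices out of M.
feature_ids = rng.choice(M, size=m, replace=False)

print("original samples: ", sample_ids)
print("bootstrap samples:", bootstrap_ids)
print("selected features:", np.sort(feature_ids))
```

Each tree in the forest would repeat both draws with fresh random numbers, which is what makes the trees differ from one another.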

Example:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


def randomforest():
    """Tune a random forest on the iris dataset with grid search and cross-validation."""
    # 1) Load the dataset
    iris = load_iris()

    # 2) Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, random_state=22
    )

    # 3) Wrap the estimator in grid search with 3-fold cross-validation
    estimator = RandomForestClassifier()
    param_dict = {
        "n_estimators": [120, 200, 300, 500, 800, 1200],
        "max_depth": [5, 8, 15, 25, 30],
    }
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)

    # 4) Train: fits every parameter combination on every fold (this can be slow)
    estimator.fit(x_train, y_train)

    # 5) Evaluate the model
    # Method 1: compare the predictions with the true values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Predictions equal to true values:\n", y_test == y_predict)

    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)

    # Best parameters: best_params_
    print("Best parameters:\n", estimator.best_params_)
    # Best cross-validation score: best_score_
    print("Best score:\n", estimator.best_score_)
    # Best estimator: best_estimator_
    print("Best estimator:\n", estimator.best_estimator_)
    # Cross-validation results: cv_results_
    print("Cross-validation results:\n", estimator.cv_results_)


if __name__ == "__main__":
    randomforest()


Source: https://www.cnblogs.com/sima-3/p/14813044.html
