x1 | x2 | x3 | x4 | x5 | x6 | x7 | class |
---|---|---|---|---|---|---|---|
0 | 0 | 1 | 27 | 0 | 1 | 0 | 2 |
0 | 1 | 0 | 27 | 0 | 1 | 0 | 1 |
0 | 1 | 0 | 27 | 1 | 0 | 0 | 2 |
0 | 0 | 1 | 24 | 1 | 0 | 0 | 1 |
1 | 0 | 0 | 30 | 1 | 0 | 0 | 1 |
0 | 0 | 1 | 31 | 0 | 1 | 0 | 1 |
1 | 0 | 1 | 37 | 1 | 0 | 0 | 2 |
0 | 0 | 1 | 22 | 0 | 0 | 1 | 2 |
0 | 0 | 1 | 25 | 0 | 1 | 1 | 1 |
1 | 0 | 1 | 34 | 1 | 1 | 0 | 2 |
0 | 1 | 0 | 33 | 1 | 1 | 0 | 2 |
0 | 0 | 1 | 37 | 1 | 1 | 0 | 1 |
0 | 1 | 0 | 30 | 0 | 1 | 1 | 2 |
1 | 1 | 0 | 42 | 1 | 0 | 0 | 1 |
0 | 0 | 0 | 43 | 1 | 0 | 0 | 1 |
0 | 1 | 1 | 28 | 1 | 0 | 0 | 2 |
0 | 1 | 1 | 29 | 0 | 1 | 1 | 1 |
1 | 1 | 0 | 51 | 1 | 0 | 1 | 2 |
0 | 1 | 1 | 22 | 0 | 1 | 0 | 2 |
1 | 1 | 0 | 40 | 0 | 0 | 1 | 1 |
0 | 0 | 1 | 28 | 1 | 0 | 1 | 1 |
1 | 0 | 0 | 52 | 0 | 1 | 1 | 2 |
0 | 1 | 0 | 38 | 1 | 0 | 1 | 1 |
1 | 0 | 0 | 26 | 1 | 0 | 0 | 2 |
0 | 0 | 1 | 33 | 1 | 0 | 1 | 1 |
0 | 0 | 1 | 26 | 1 | 0 | 0 | 2 |
0 | 1 | 0 | 23 | 1 | 0 | 1 | 1 |
0 | 0 | 0 | 37 | 1 | 0 | 1 | 2 |
0 | 0 | 1 | 50 | 0 | 1 | 0 | 1 |
0 | 0 | 1 | 24 | 0 | 1 | 0 | 2 |

x1 | x2 | x3 | x4 | x5 | x6 | x7 | class |
---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 12 | 0 | 1 | |
0 | 0 | 0 | 1 | 10 | 0 | 1 | |
1 | 0 | 0 | 0 | 24 | 0 | 0 | |
1 | 0 | 0 | 0 | 27 | 1 | 0 | |
0 | 0 | 1 | 0 | 36 | 0 | 1 | |
0 | 1 | 0 | 1 | 12 | 0 | 1 | |
1 | 0 | 1 | 0 | 36 | 0 | 1 | |
1 | 0 | 0 | 1 | 24 | 0 | 0 | |
0 | 1 | 0 | 0 | 48 | 0 | 0 | |
1 | 0 | 0 | 1 | 6 | 1 | 0 | |
0 | 1 | 0 | 0 | 30 | 1 | 0 | |
0 | 0 | 1 | 0 | 18 | 0 | 1 | |
0 | 1 | 0 | 1 | 18 | 1 | 0 | |
0 | 0 | 1 | 0 | 24 | 1 | 0 | |
0 | 0 | 1 | 1 | 9 | 0 | 1 | |
0 | 1 | 0 | 1 | 18 | 0 | 1 | |
1 | 0 | 0 | 1 | 15 | 1 | 0 | |
0 | 1 | 0 | 1 | 24 | 1 | 0 | |
0 | 1 | 0 | 0 | 24 | 0 | 0 | |
0 | 1 | 0 | 1 | 36 | 0 | 0 | |

This article uses the C4.5 algorithm to classify the data above. Development environment: Visual Studio 2019, with the code written in C/C++.
The gain ratio of an attribute is defined as:
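The formula itself did not survive in this copy. As a reference point, and as an assumption rather than the author's exact notation, the usual C4.5 definition divides the information gain \(Gain(A_f)\) by the split information of attribute \(A_f\). The names \(Split(A_f)\) and \(n\) (the total number of samples) are introduced here for illustration:

\[
GainRatio(A_f) = \frac{Gain(A_f)}{Split(A_f)}, \qquad
Split(A_f) = -\sum_{s} \frac{n_s}{n} \log_2 \frac{n_s}{n}
\]

Here \(n_s\) is the number of samples in subset \(X_s\).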
Part of this formula is inherited from the ID3 algorithm, so the relevant background is briefly introduced next.
Where:
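The expression that belongs at this point is also missing from this copy. Assuming the standard ID3 definition, the entropy of the partition induced by attribute \(A_f\) would read as follows, with \(n\) (an assumed symbol) denoting the total number of samples and \(n_s\) the size of subset \(X_s\):

\[
E(A_f) = \sum_{s} \frac{n_s}{n} \left( -\sum_{j} p_{js} \log_2 p_{js} \right)
\]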
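In this expression \(p_{js}=\frac{n_{js}}{n_s}\), where \(p_{js}\) is the proportion of samples in subset \(X_s\) that belong to class \(c_j\). The smaller the entropy, the purer the partition that the attribute induces on the dataset. The information gain obtained when \(A_f\) partitions the dataset is computed as: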
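The gain formula is likewise missing from this copy. Under the standard ID3/C4.5 definitions, which are assumed here rather than taken from the original, \(Gain(A_f) = E(X) - E(A_f)\), where \(E(X)\) is the class entropy before the split and \(E(A_f)\) is the size-weighted entropy of the subsets \(X_s\). The C++ sketch below computes entropy, gain and gain ratio from a table of counts \(n_{js}\). The function names `entropy` and `gainRatio` are illustrative and not the author's code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Entropy (in bits) of a vector of counts, e.g. {3, 5} for two classes.
double entropy(const std::vector<int>& counts) {
    int total = 0;
    for (int c : counts) total += c;
    assert(total > 0);
    double h = 0.0;
    for (int c : counts) {
        if (c == 0) continue;  // the term 0 * log2(0) is taken as 0
        double p = static_cast<double>(c) / total;
        h -= p * std::log2(p);
    }
    return h;
}

// subsets[s][j] = n_js, the number of samples of class c_j in subset X_s.
// Returns Gain(A_f) / Split(A_f), or 0 when the split information is 0.
double gainRatio(const std::vector<std::vector<int>>& subsets) {
    assert(!subsets.empty());
    std::vector<int> classTotals(subsets[0].size(), 0);
    std::vector<int> subsetSizes;
    int n = 0;
    for (const auto& s : subsets) {
        assert(s.size() == classTotals.size());
        int ns = 0;
        for (std::size_t j = 0; j < s.size(); ++j) {
            classTotals[j] += s[j];
            ns += s[j];
        }
        subsetSizes.push_back(ns);
        n += ns;
    }
    double conditional = 0.0;  // E(A_f): weighted entropy after the split
    for (std::size_t s = 0; s < subsets.size(); ++s) {
        if (subsetSizes[s] == 0) continue;  // empty subsets contribute nothing
        conditional += static_cast<double>(subsetSizes[s]) / n * entropy(subsets[s]);
    }
    double gain = entropy(classTotals) - conditional;
    double split = entropy(subsetSizes);  // split information has the same form
    return split > 0.0 ? gain / split : 0.0;
}
```

For a binary attribute that separates two classes perfectly, with counts `{{2, 0}, {0, 2}}`, both the gain and the split information are 1 bit, so the ratio is 1. An attribute whose subsets have the same class mix as the whole set yields a ratio of 0.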
For continuous attribute values, C4.5 proceeds as follows:
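The steps themselves are missing from this copy. A common formulation, assumed here, is: sort the samples by the attribute's value, consider a cut point between each pair of adjacent distinct values, and keep the cut with the highest information gain. The sketch below follows that outline. The `Sample` type and the names `entropyOf` and `bestThreshold` are hypothetical, and midpoints are used as candidate thresholds, whereas some implementations pick an observed value instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Sample {
    double value;  // value of the continuous attribute
    int label;     // class index in [0, numClasses)
};

// Entropy (in bits) of class counts that sum to `total`.
double entropyOf(const std::vector<int>& counts, int total) {
    double h = 0.0;
    for (int c : counts) {
        if (c == 0) continue;
        double p = static_cast<double>(c) / total;
        h -= p * std::log2(p);
    }
    return h;
}

// Returns the threshold t (samples with value <= t go to the left branch)
// that maximises information gain, and stores that gain in bestGain.
// Returns NaN (and bestGain = -1) when all values are equal.
double bestThreshold(std::vector<Sample> samples, int numClasses, double& bestGain) {
    assert(!samples.empty() && numClasses > 0);
    std::sort(samples.begin(), samples.end(),
              [](const Sample& a, const Sample& b) { return a.value < b.value; });
    const int n = static_cast<int>(samples.size());
    std::vector<int> left(numClasses, 0), right(numClasses, 0);
    for (const Sample& s : samples) ++right[s.label];
    const double base = entropyOf(right, n);  // entropy before splitting

    bestGain = -1.0;
    double best = std::nan("");
    for (int i = 0; i + 1 < n; ++i) {
        // Move sample i from the right partition to the left one.
        ++left[samples[i].label];
        --right[samples[i].label];
        if (samples[i].value == samples[i + 1].value) continue;  // no cut between equal values
        const int nl = i + 1, nr = n - nl;
        const double gain =
            base - (nl * entropyOf(left, nl) + nr * entropyOf(right, nr)) / n;
        if (gain > bestGain) {
            bestGain = gain;
            best = (samples[i].value + samples[i + 1].value) / 2.0;
        }
    }
    return best;
}
```

As a usage example, four samples with values 22, 24, 30, 37 and labels 0, 0, 1, 1 give a threshold of 27 with a gain of 1 bit, since the cut separates the two classes completely.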
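C4.5 can handle samples that contain unknown attribute values. Simple approaches replace the unknown value with the most common value, or with the most common value among samples of the same class. C4.5 itself uses a probabilistic method: each possible value of the attribute is assigned a probability, and these probabilities are estimated from the samples whose value for that attribute is known.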
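The probability estimate described above can be sketched as a relative-frequency count over the known values. This is a minimal illustration under stated assumptions: attribute values are encoded as non-negative integers, and `kUnknown` and `valueProbabilities` are hypothetical names, not part of the author's code.

```cpp
#include <cassert>
#include <map>
#include <vector>

const int kUnknown = -1;  // hypothetical marker for a missing attribute value

// Relative frequency of each known value in one attribute column.
std::map<int, double> valueProbabilities(const std::vector<int>& column) {
    std::map<int, int> counts;
    int known = 0;
    for (int v : column) {
        if (v == kUnknown) continue;  // unknown entries do not vote
        ++counts[v];
        ++known;
    }
    assert(known > 0);
    std::map<int, double> probs;
    for (const auto& kv : counts) {
        probs[kv.first] = static_cast<double>(kv.second) / known;
    }
    return probs;
}
```

One way to use these probabilities is to send a sample whose value is unknown down every branch, weighting its contribution to branch `v` by `probs[v]`, so that the weights across branches sum to the sample's original weight.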
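Once the tree has been built, it can be converted into if-then rules. The rules are stored in a two-dimensional array: each row represents one rule of the tree, that is, one path from the root to a leaf, and each column of a row holds one node on that path.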
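The conversion described above amounts to a depth-first walk that records the current path whenever a leaf is reached. The sketch below assumes a hypothetical `Node` type whose `text` field holds either an attribute test or, for a leaf, a class label. A vector of vectors stands in for the two-dimensional array, because paths may differ in length.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node type: internal nodes carry an attribute test,
// leaves (no children) carry a class label.
struct Node {
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// One row per rule, one column per node on the root-to-leaf path.
using RuleTable = std::vector<std::vector<std::string>>;

// Depth-first walk; each time a leaf is reached the current path becomes a row.
void collectRules(const Node& node, std::vector<std::string>& path, RuleTable& rules) {
    path.push_back(node.text);
    if (node.children.empty()) {
        rules.push_back(path);
    } else {
        for (const auto& child : node.children) {
            collectRules(*child, path, rules);
        }
    }
    path.pop_back();  // backtrack before returning to the parent
}

RuleTable treeToRules(const Node& root) {
    RuleTable rules;
    std::vector<std::string> path;
    collectRules(root, path, rules);
    return rules;
}
```

Each row can then be read as a rule: the leading columns form the "if" conditions and the last column is the "then" class.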
Based on GainRatio, the split is made on the value of Outlook, producing three branches.

Original article: https://www.cnblogs.com/jianle23/p/14116017.html