
PySpark random forest feature importance

Date: 2019-02-27 19:46:43
# IMPORT
>>> import numpy
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> from pyspark.ml.classification import RandomForestClassifier

# PREPARE DATA
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)

# BUILD THE MODEL
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)

# FEATURE IMPORTANCES
>>> model.featureImportances
SparseVector(1, {0: 1.0}) 

  

Feature importance:

model.featureImportances
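
The importance vector holds one score per feature, in the order of the feature columns. A minimal sketch of ranking those scores by name follows. The helper `rank_feature_importances`, the feature names, and the sample scores are all hypothetical; with a fitted model, the scores would come from `model.featureImportances.toArray()`.

```python
def rank_feature_importances(importances, feature_names):
    """Pair each importance score with its feature name, largest first."""
    if len(importances) != len(feature_names):
        raise ValueError("expected one name per feature")
    pairs = zip(feature_names, (float(v) for v in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# With a fitted model, the scores would come from:
#   importances = model.featureImportances.toArray()
# The scores and names below are made up for illustration.
example_scores = [0.1, 0.6, 0.3]
example_names = ["f0", "f1", "f2"]

ranked = rank_feature_importances(example_scores, example_names)
for name, score in ranked:
    print(name, score)
```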

 

Probability:

predictions.select("probability", "label").show(1000)

The probability column holds the predicted probability of each class.
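
Each entry of a probability vector corresponds to one label index, so position 1 is the probability of the class indexed as 1. A rough sketch of reading one such vector is below. It assumes `predictions` is the output of `model.transform(...)` and that the vector has been converted to a plain list; the helper `most_likely_class` and the sample values are hypothetical.

```python
def most_likely_class(probabilities):
    """Return (class_index, probability) for the largest entry."""
    values = [float(p) for p in probabilities]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values[best]


# With a predictions DataFrame, one row's vector could be read as:
#   row = predictions.select("probability").first()
#   probs = row["probability"].toArray().tolist()
probs = [0.25, 0.75]  # made-up values for illustration

index, confidence = most_likely_class(probs)
print(index, confidence)
```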

 

Shuffling samples with pandas:

import pandas as pd
df = pd.read_excel("window regulator01 _0914新增样本.xlsx")
df = df.sample(frac=1)  # shuffle the rows
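
A small variation that may be useful: passing `random_state` makes the shuffle repeatable, and `reset_index(drop=True)` discards the old row order from the index. The sketch below uses a made-up in-memory frame in place of the Excel file.

```python
import pandas as pd

# Small in-memory frame standing in for the Excel data (values are made up).
df = pd.DataFrame({"feature": [10, 20, 30, 40, 50], "label": [0, 1, 0, 1, 1]})

# frac=1 samples every row exactly once, which amounts to a full shuffle.
# random_state fixes the order between runs; reset_index drops the old index.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```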

 

Random train/test split in PySpark

 

train, test = labeled_v.randomSplit([0.75, 0.25])
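
`randomSplit` assigns rows at random, so the resulting sizes are only approximately 75% and 25%, and the weights are normalized if they do not sum to 1. The pure-Python sketch below illustrates that idea. It is not Spark's implementation, and the function `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative upper bounds in [0, 1], e.g. [0.75, 1.0] for [0.75, 0.25].
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(1000), [0.75, 0.25], seed=42)
print(len(train), len(test))
```

In PySpark itself, a reproducible split can be requested with `labeled_v.randomSplit([0.75, 0.25], seed=42)`.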



Original: https://www.cnblogs.com/Allen-rg/p/10445893.html
