Recent keywords: classification algorithms, outlier detection, machine learning
This post applies several common algorithms to one and the same data set: k-means, decision tree, random forest, SVM (support vector machine), and artificial neural network (ANN).
First, a few tips and R blogs:
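a) How to quickly find the R package you want: open "Available CRAN Packages By Date of Publication" and search the page with Ctrl+F. It is very fast.
b) How to find a whole set of related R packages for one small field. Take randomForest as an example: on the package page you will see a line "In views: Environmetrics, MachineLearning". Follow it and you get the machine learning packages in one sweep, with no need to look them up one by one.
c) A pretty good R blog: http://mkseo.pe.kr/stats/ with posts in both Korean and English.
d) One more blog worth recommending. This is how the link came out after copy and paste; go ahead and click it.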
(1)k-means, tried on the spam data set
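    library(kernlab)
    library(magrittr)
    data(spam)
    set.seed(124)
    # column 58 is the label (type); scale the 57 predictors, then ask for 2 clusters
    res <- kmeans(spam[,-58] %>% sapply(scale), 2)
    table(spam$type, res$cluster)

                 1    2
      nonspam 2754   34
      spam    1813    0

    # Error rate: 1 - 2754/nrow(spam) = 0.4014345
    # Even with spam and nonspam at 50/50, pure guessing would be wrong only about 50% of the time.
    # In fact 2788 of the 4601 mails are nonspam, so always answering "nonspam" is wrong
    # 1813/4601 = 0.394 of the time, which is slightly better than k-means here.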
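The error-rate arithmetic above can be redone from the table alone. The sketch below is only an illustration: it retypes the four counts by hand (so it does not need kernlab), tries both ways of matching clusters to classes because k-means cluster numbers are arbitrary, and compares the result with the "always predict the larger class" baseline. The variable names are mine, not from the original run.

```r
# Cluster-by-class counts copied from the table above
tab <- matrix(c(2754, 34,
                1813,  0),
              nrow = 2, byrow = TRUE,
              dimnames = list(type = c("nonspam", "spam"),
                              cluster = c("1", "2")))

n <- sum(tab)

# k-means labels are arbitrary, so try both cluster-to-class assignments
# and keep the one with more correctly grouped mails
correct_a <- tab["nonspam", "1"] + tab["spam", "2"]  # cluster 1 = nonspam
correct_b <- tab["nonspam", "2"] + tab["spam", "1"]  # cluster 1 = spam
kmeans_error <- 1 - max(correct_a, correct_b) / n

# Baseline: always predict the larger class
baseline_error <- 1 - max(rowSums(tab)) / n

round(c(kmeans = kmeans_error, baseline = baseline_error), 4)
```

Checking both assignments matters in general; here the first one (cluster 1 = nonspam) is the better of the two.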
(2)decision tree
    library(tree)
    library(kernlab)
    library(dplyr)
    library(magrittr)
    library(caret)   # provides confusionMatrix()
    data(spam)
    # create train and test data sets (70% / 30%)
    set.seed(1859)
    train <- sample(nrow(spam), nrow(spam)*0.7, replace = FALSE)
    df.train <- spam[train,]
    df.test <- spam[-train,]
    # modeling
    tree.fit <- tree(type~., data=df.train)
    summary(tree.fit)
    # plot the decision tree
    plot(tree.fit, type = "uniform")
    text(tree.fit, pretty = 1, all = TRUE, cex = 0.7)
    # prediction
    pred <- predict(tree.fit, df.test, type = c("class"))
    # look at the prediction result
    confusionMatrix(pred, df.test[,58])

    Confusion Matrix and Statistics

              Reference
    Prediction nonspam spam
       nonspam     819   84
       spam         34  444

                   Accuracy : 0.9146

    # The decision tree picks the variables for you; the error rate on the test set is 1 - 0.9146 = 0.085
To quote the help page of the tree package: The left-hand-side (response) should be either a numerical vector when a regression tree will be fitted or a factor, when a classification tree is produced. So it can do both classification and regression!
For classification, note: Factor predictor variables can have up to 32 levels.
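Will a decision tree overfit? Not in this run, as far as I can see: when tree.fit predicts the test set, accuracy (0.9146) stays almost as high as on the training set (1 - 0.0842 = 0.9158). That is one small tree on one data set, though; a tree grown much deeper can overfit.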
The help page describes it like this: The split which maximizes the reduction in impurity is chosen, the data set split and the process repeated. Splitting continues until the terminal nodes are too small or too few to be split.
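To make "the split which maximizes the reduction in impurity" concrete, here is a minimal sketch for one numeric predictor. It is not the tree package's code: it uses Gini impurity because it is short to write (as far as I know tree() defaults to deviance), the helper names gini and best_split are made up, and the toy data are invented.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Best threshold on one numeric predictor: the cut that maximizes
# parent impurity minus the size-weighted impurity of the two children.
# Assumes x has at least two distinct values.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbours
  gain <- sapply(cuts, function(cut) {
    left  <- y[x < cut]
    right <- y[x >= cut]
    gini(y) -
      (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(cut = cuts[which.max(gain)], gain = max(gain))
}

# Toy data (made up): small x values are nonspam, large ones are spam
x <- c(0.0, 0.1, 0.2, 0.3, 1.5, 2.0, 2.4, 3.1)
y <- factor(c("nonspam", "nonspam", "nonspam", "nonspam",
              "spam", "spam", "spam", "spam"))
res <- best_split(x, y)
res
```

A real tree repeats this search over every predictor, splits on the winner, and then recurses into each child until the stopping rule quoted above kicks in.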
    > summary(tree.fit)

    Classification tree:
    tree(formula = type ~ ., data = df.train)
    Variables actually used in tree construction:
     [1] "charDollar"      "remove"          "charExclamation"
     [4] "george"          "hp"              "capitalLong"
     [7] "edu"             "num650"          "capitalTotal"
    [10] "free"            "capitalAve"
    Number of terminal nodes:  16
    Residual mean deviance:  0.432 = 1384 / 3204
    Misclassification error rate: 0.08416 = 271 / 3220
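The overfitting question can be checked with plain arithmetic: compare the training error reported by summary(tree.fit) with the test error implied by the confusion matrix. The sketch below just retypes the published counts; nothing is refitted, and the variable names are mine.

```r
# Training error, from summary(tree.fit): 271 of 3220 misclassified
train_error <- 271 / 3220

# Test error, from the confusion matrix: off-diagonal cells over the total
test_cm <- matrix(c(819,  84,
                     34, 444),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(prediction = c("nonspam", "spam"),
                                  reference  = c("nonspam", "spam")))
test_error <- 1 - sum(diag(test_cm)) / sum(test_cm)

round(c(train = train_error,
        test  = test_error,
        gap   = test_error - train_error), 4)
```

A small gap between the two is what "no sign of overfitting" means in practice; a training error far below the test error would be the warning sign.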
(3)random forest
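    library(randomForest)
    library(kernlab)   # provides the spam data set
    library(caret)     # provides confusionMatrix()
    library(magrittr)
    library(dplyr)
    data(spam)
    train <- sample(nrow(spam), nrow(spam)*0.7, replace = FALSE)
    df.train <- spam[train,]
    df.test <- spam[-train,]
    # Random forest needs a preset random seed for the result to be repeatable.
    # Note that sample() above runs before set.seed(), so the split itself is not fixed by this seed.
    set.seed(189)
    spam.rf <- randomForest(type~., data=df.train, mtry=3, do.trace=100, ntree=500, importance=TRUE, proximity=TRUE)
    # Printed while it runs: the error rate after every 100 trees grown
    ntree      OOB      1      2
      100:   5.56%  2.95%  9.49%
      200:   5.59%  2.69%  9.96%
      300:   5.53%  2.79%  9.65%
      400:   5.40%  2.84%  9.26%
      500:   5.28%  2.69%  9.18%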
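    spam.rf
    pred <- predict(spam.rf, df.test[,-58], type="class")
    confusionMatrix(pred, df.test[,58])

    Confusion Matrix and Statistics

              Reference
    Prediction nonspam spam
       nonspam     833   41
       spam         20  487

                   Accuracy : 0.9558

    # The error rate is very low: 1 - 0.9558 = 0.044 on the test set
    # Random forest has weighed the key variables by itself; the test error is close to the
    # final OOB estimate (5.28%), so there is no sign of overfitting in this run
    # The next command shows which variables are important
    varImpPlot(spam.rf)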
varImpPlot(spam.rf) produces a plot. In it, the variable charExclamation turns out to be crucial for predicting whether a mail is spam.
Will random forest overfit? See "Random Forest - How to handle overfitting".
Breiman claims that RF does not overfit. stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
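The OOB column in the do.trace output above is the out-of-bag error: each tree is grown on a bootstrap sample of the training rows, and the rows that were not drawn serve as a built-in test set for that tree. The sketch below only illustrates how many rows are left out per tree. It uses base R, not randomForest, and the seed and the number of draws are arbitrary choices of mine.

```r
set.seed(1)      # arbitrary seed for this illustration
n <- 3220        # size of a 70% training set of spam
n_draws <- 200   # hypothetical number of bootstrap samples ("trees")

# For each draw, take a bootstrap sample and record the share of rows left out
oob_share <- replicate(n_draws, {
  in_bag <- sample(n, n, replace = TRUE)
  1 - length(unique(in_bag)) / n
})

mean(oob_share)   # empirical share of out-of-bag rows
(1 - 1 / n)^n     # chance that one given row is never drawn; tends to exp(-1)
```

So roughly a third of the training rows sit out of each tree, which is why the OOB error is usually a reasonable stand-in for a separate test set.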
Random forest, too, can do both classification and regression.
(4)SVM (support vector machine)
    library(e1071)
    library(rpart)
    library(kernlab)   # provides the spam data set
    library(caret)     # provides confusionMatrix()
    data(spam)
    set.seed(1871)
    train <- sample(nrow(spam), nrow(spam)*0.7, replace = FALSE)
    df.train <- spam[train,]
    df.test <- spam[-train,]
    # x = the 57 predictors, y = the label in column 58
    model <- svm(df.train[,-58], df.train[,58])
    print(model)
    summary(model)
    pred <- predict(model, df.test[,-58])
    confusionMatrix(pred, df.test[,58])

    Confusion Matrix and Statistics

              Reference
    Prediction nonspam spam
       nonspam     789   59
       spam         37  496

                   Accuracy : 0.9305
Here is another SVM, this time through the caret package, following the article "Computational Prediction".
    library(caret)
    library(doMC)
    library(kernlab)    # provides the spam data set
    library(magrittr)   # provides %>%
    data(spam)
    set.seed(89)
    train <- sample(nrow(spam), nrow(spam)*0.7, replace = FALSE)
    df.train <- spam[train,]
    df.test <- spam[-train,]
    # Multi-core: registerDoMC() comes from the doMC package. I watched the system monitor
    # and CPU usage really did jump to 90%+ right away.
    # I will look into how this differs from the parallel package later.
    registerDoMC(cores=4)
    model <- train(df.train[, -58], df.train[, 58], method="svmRadial")
    predict(model, df.test[,-58]) %>% confusionMatrix(df.test[,58])
    # Accuracy is decent as well

    Confusion Matrix and Statistics

              Reference
    Prediction nonspam spam
       nonspam     816   70
       spam         42  453

                   Accuracy : 0.9189
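On the open question about the parallel package: doMC is a foreach backend that sits on top of the multicore (fork-based) functions in parallel, so the two are layers of the same mechanism rather than competitors. Below is a minimal sketch of calling parallel directly. The function slow_square is a made-up stand-in for an expensive step, and the core count of 2 is an arbitrary assumption.

```r
library(parallel)

# Forked workers are not available on Windows, so fall back to one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Hypothetical stand-in for an expensive step (e.g. one resample of a model fit)
slow_square <- function(i) {
  Sys.sleep(0.05)
  i^2
}

serial_res   <- lapply(1:8, slow_square)                        # one after another
parallel_res <- mclapply(1:8, slow_square, mc.cores = n_cores)  # spread over forked workers

identical(serial_res, parallel_res)
```

mclapply() is a drop-in replacement for lapply(); with doMC the same forking happens behind %dopar%, which is what caret's train() uses for its resampling loop.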
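    # Helper (not used in the code below): for about 10% of the rows,
    # set one randomly chosen column among the first four to NA
    fillInNa <- function(d) {
      naCount <- NROW(d) * 0.1
      for (i in sample(NROW(d), naCount)) {
        d[i, sample(4, 1)] <- NA
      }
      return(d)
    }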
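To see what fillInNa does, it can be tried on a small built-in data set. The sketch below uses iris, which is my choice and not part of the original post; it fits because iris has exactly four numeric columns, matching the sample(4, 1) in the helper. The seed is arbitrary.

```r
# The helper, repeated here so the example runs on its own
fillInNa <- function(d) {
  naCount <- NROW(d) * 0.1
  for (i in sample(NROW(d), naCount)) {
    d[i, sample(4, 1)] <- NA
  }
  return(d)
}

set.seed(42)                # arbitrary seed
iris_na <- fillInNa(iris)   # 150 rows, so 15 rows get one NA each

sum(is.na(iris_na))         # total number of cells set to NA
colSums(is.na(iris_na))     # how they are spread over the columns
```

Because sample() draws the rows without replacement, no row receives more than one NA, and the fifth column (Species) is never touched.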
(5)Artificial Neural Network (ANN)
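References: the article "R语言中最强的神经网络包RSNNS" ("RSNNS, the most powerful neural network package in R") and the RSNNS help pages. One line from the help pages, confusionMatrix(iris$targetsTrain,fitted.values(model)), did not work as written here; the class labels have to go through encodeClassLabels first. A likely cause is a name clash, since RSNNS and caret both provide a confusionMatrix and the output below is in caret's format. The version that works: confusionMatrix(encodeClassLabels(iris$targetsTrain),encodeClassLabels(fitted.values(model)))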
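What the two label helpers do can be imitated in a few lines of base R. To be clear, one_hot and arg_max below are stand-ins I wrote for illustration; they are not the RSNNS functions and only mimic the default winner-takes-all behaviour as I understand it. The network outputs are made up.

```r
# Stand-in for decodeClassLabels(): factor -> one 0/1 column per class
one_hot <- function(f) {
  m <- outer(as.integer(f), seq_along(levels(f)), "==") * 1
  colnames(m) <- levels(f)
  m
}

# Stand-in for encodeClassLabels(): index of the largest value in each row
arg_max <- function(m) {
  max.col(m, ties.method = "first")
}

labels  <- factor(c("nonspam", "spam", "spam", "nonspam"))
targets <- one_hot(labels)

# Made-up network outputs: one row per mail, one column per class
outputs <- rbind(c(0.9, 0.2),
                 c(0.3, 0.8),
                 c(0.6, 0.4),
                 c(0.7, 0.1))

table(target = arg_max(targets), predicted = arg_max(outputs))
```

The network is trained on the 0/1 matrix and returns a matrix of scores, so both sides have to be collapsed back to one class index per row before they can be cross-tabulated.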
    library(RSNNS)
    library(doMC)
    library(kernlab)   # provides the spam data set
    data(spam)
    set.seed(199)
    # shuffle the rows
    spam <- spam[sample(1:nrow(spam), nrow(spam)), 1:ncol(spam)]
    spamValues <- spam[,-58]
    spamTargets <- spam[,58]
    # one 0/1 column per class
    spamDecTargets <- decodeClassLabels(spamTargets)
    # ratio = 0.3 holds out 30% of the rows as the test set
    spam <- splitForTrainingAndTest(spamValues, spamDecTargets, ratio = 0.3)
    spam <- normTrainingAndTestSet(spam)
    # The model is then built with:
    # registerDoMC(cores=4)  # Activity Monitor showed CPU usage around 30%, so it is of no use here.
    # The next call takes a long time
    model <- mlp(spam$inputsTrain, spam$targetsTrain, size = 5,
                 learnFuncParams = c(0.1), maxit = 60,
                 inputsTest = spam$inputsTest,
                 targetsTest = spam$targetsTest)
    predictions <- predict(model, spam$inputsTest)
    # Note: the targets are passed first, so in the printout below the rows headed "Prediction"
    # hold the true labels and the "Reference" columns hold the predictions; accuracy is unaffected
    confusionMatrix(encodeClassLabels(spam$targetsTest), encodeClassLabels(predictions))

    Confusion Matrix and Statistics

              Reference
    Prediction   1   2
             1 805  42
             2  34 500

                   Accuracy : 0.945
                     95% CI : (0.9316, 0.9564)
        No Information Rate : 0.6075
        P-Value [Acc > NIR] : <2e-16

                      Kappa : 0.8843
     Mcnemar's Test P-Value : 0.422

                Sensitivity : 0.9595
                Specificity : 0.9225
             Pos Pred Value : 0.9504
             Neg Pred Value : 0.9363
                 Prevalence : 0.6075
             Detection Rate : 0.5829
       Detection Prevalence : 0.6133
          Balanced Accuracy : 0.9410

           'Positive' Class : 1
       user  system elapsed
    248.671  26.796 283.194
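Most of the statistics in the printout above follow from the four cells of the table, so they can be re-derived by hand. The sketch below retypes the counts and recomputes a few of them; it is an illustration of the arithmetic only, with variable names of my own, and it does not call caret.

```r
# Counts copied from the confusion matrix above (rows as printed, columns = Reference)
cm <- matrix(c(805,  42,
                34, 500),
             nrow = 2, byrow = TRUE,
             dimnames = list(row = c("1", "2"), reference = c("1", "2")))

n  <- sum(cm)
tp <- cm["1", "1"]; fp <- cm["1", "2"]
fn <- cm["2", "1"]; tn <- cm["2", "2"]

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # within reference column "1"
specificity <- tn / (tn + fp)   # within reference column "2"
prevalence  <- (tp + fn) / n    # share of reference class "1"

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, prevalence = prevalence,
        kappa = cohen_kappa), 4)
```

The prevalence equals the "No Information Rate" in the printout, because always predicting the larger reference class would be right exactly that often.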