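There are many criteria for judging how good a machine learning model is; common ones include accuracy, recall, and AUC. This post introduces AUC and how to compute it.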
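AUC is commonly used to evaluate binary classifiers. Each prediction of a binary classifier has one of four possible outcomes. Taking "does the patient have hypertension" as the example: a patient with hypertension who is predicted to have it is a true positive (TP); a patient with hypertension who is predicted not to have it is a false negative (FN); a person without hypertension who is predicted to have it is a false positive (FP); and a person without hypertension who is predicted not to have it is a true negative (TN).
From these four counts we get the true positive rate (TPR) and the false positive rate (FPR):
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)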
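As a minimal sketch of the two formulas, the snippet below counts the four outcomes for a single threshold and returns (TPR, FPR). The function name `tpr_fpr`, the four toy samples, and the convention that a score at or above the threshold counts as a positive prediction are assumptions made for illustration.

```python
def tpr_fpr(scores, labels, pos_label, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold  # assumed convention
        actual_positive = label == pos_label
        if predicted_positive and actual_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif actual_positive:
            fn += 1
        else:
            tn += 1
    # Assumes at least one positive and one negative sample.
    return tp / (tp + fn), fp / (fp + tn)


# Illustrative toy data; label 2 is treated as the positive class.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [1, 1, 2, 2]
tpr, fpr = tpr_fpr(scores, labels, pos_label=2, threshold=0.5)
print(tpr, fpr)  # 0.5 0.0
```

With the threshold at 0.5 only the sample scored 0.8 is predicted positive, so one of the two positives is found (TPR = 0.5) and no negative is misclassified (FPR = 0.0).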
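If we take many different thresholds on the predicted score, we get a series of (FPR, TPR) points. Connecting these points gives a curve called the ROC (Receiver Operating Characteristic) curve, and the area between this curve and the horizontal axis is the AUC. We can therefore compute the AUC as follows: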
    #!/usr/bin/env python3
    import sys


    def get_auc(arr_score, arr_label, pos_label):
        # Pair each score with its label and sort by score, highest first.
        score_label_list = [(float(s), int(l)) for s, l in zip(arr_score, arr_label)]
        score_label_list_sorted = sorted(score_label_list, key=lambda pair: pair[0], reverse=True)

        fp, tp = 0, 0
        lastfp, lasttp = 0, 0
        A = 0.0
        lastscore = None

        for score, label in score_label_list_sorted:
            # A new score value starts a new ROC point: add the trapezoid
            # between the previous point and the current one.
            if score != lastscore:
                A += trapezoid_area(fp, lastfp, tp, lasttp)
                lastscore = score
                lastfp, lasttp = fp, tp
            if label == pos_label:
                tp += 1
            else:
                fp += 1

        A += trapezoid_area(fp, lastfp, tp, lasttp)
        # Normalize by (negatives * positives) so the area lies in [0, 1].
        # Both classes must be present, otherwise this divides by zero.
        A /= (fp * tp)
        return A


    def trapezoid_area(x1, x2, y1, y2):
        delta = abs(x2 - x1)
        return delta * 0.5 * (y1 + y2)


    if __name__ == "__main__":
        if len(sys.argv) != 2:
            print("Error!\n%s pred_model_file" % sys.argv[0])
            sys.exit(-1)

        arr_score, arr_label = [], []
        # Each input line is: score <TAB> label
        with open(sys.argv[1]) as f:
            for line in f:
                line = line.strip().split('\t')
                if len(line) < 2:
                    continue
                arr_score.append(line[0])
                arr_label.append(line[1])
        print(arr_score)
        print(arr_label)
        print("AUC = %s" % get_auc(arr_score, arr_label, 2))
    F:\python_workspace\offline_evaluation>python model_evaluation.py pred_model_file.txt
    ['0.1', '0.4', '0.35', '0.8']
    ['1', '1', '2', '2']
    AUC = 0.75
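To see where a value like 0.75 comes from, the threshold sweep can be written out literally. The sketch below is not the script above but a slower, more direct version: it uses each distinct score as a threshold, collects the resulting (FPR, TPR) points, and sums the trapezoids under them. The names `roc_points` and `area_under`, and the convention that a score at or above the threshold counts as positive, are assumptions made for illustration.

```python
def roc_points(scores, labels, pos_label):
    """Return the (FPR, TPR) points obtained by using each distinct score as a threshold."""
    n_pos = sum(1 for label in labels if label == pos_label)
    n_neg = len(labels) - n_pos
    # Start at (0, 0): a threshold above every score predicts nothing as positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == pos_label)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l != pos_label)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoidal area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2
    return area


# Same scores and labels as in the run above; label 2 is the positive class.
points = roc_points([0.1, 0.4, 0.35, 0.8], [1, 1, 2, 2], pos_label=2)
print(points)              # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area_under(points))  # 0.75
```

This version rescans all samples for every threshold, so it is quadratic in the number of samples; sorting once and accumulating counts, as the script does, avoids that cost.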
The AUC can also be computed with sklearn; see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html#sklearn.metrics.auc
    >>> import numpy as np
    >>> from sklearn import metrics
    >>> y = np.array([1, 1, 2, 2])
    >>> pred = np.array([0.1, 0.4, 0.35, 0.8])
    >>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
    >>> metrics.auc(fpr, tpr)
    0.75
Another example:
    F:\python_workspace\offline_evaluation>python model_evaluation.py tmp.txt
    ['0.1', '0.2', '0.4', '0.5', '0.35', '0.8', '0.9', '0.95']
    ['1', '2', '1', '1', '2', '2', '2', '1']
    AUC = 0.5
    >>> import numpy as np
    >>> from sklearn import metrics
    >>> y = np.array([1, 2, 1, 1, 2, 2, 2, 1])
    >>> pred = np.array([0.1, 0.2, 0.4, 0.5, 0.35, 0.8, 0.9, 0.95])
    >>> fpr, tpr, ths = metrics.roc_curve(y, pred, pos_label=2)
    >>> metrics.auc(fpr, tpr)
    0.5
In both examples, sklearn returns the same AUC value as the code we wrote ourselves!
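A third, dependency-free cross-check uses a well-known equivalent reading of AUC: the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. This is a different technique from the trapezoid sum used in this post, offered only as a sanity check; the function name `pairwise_auc` is an assumption made for illustration.

```python
def pairwise_auc(scores, labels, pos_label):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == pos_label]
    neg = [s for s, l in zip(scores, labels) if l != pos_label]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    # Assumes both classes are present.
    return wins / (len(pos) * len(neg))


# The two examples from this post; label 2 is the positive class.
auc1 = pairwise_auc([0.1, 0.4, 0.35, 0.8], [1, 1, 2, 2], pos_label=2)
auc2 = pairwise_auc([0.1, 0.2, 0.4, 0.5, 0.35, 0.8, 0.9, 0.95],
                    [1, 2, 1, 1, 2, 2, 2, 1], pos_label=2)
print(auc1)  # 0.75
print(auc2)  # 0.5
```

The double loop is quadratic in the number of samples, so it only suits small checks like these.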
Original post: http://blog.csdn.net/lming_08/article/details/44284155