This article uses naive Bayes classification to classify documents.
To extract features from text, the text must first be split into tokens. The code below creates the training data directly in token-list form. The function has two return values: the training documents and a list of the class label for each document:

    def loadDataSet():
        # postingList is the collection of documents after token splitting
        postingList = [
            ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
            ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
            ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
            ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
            ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
            ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']
        ]
        classVec = [0, 1, 0, 1, 0, 1]  # class labels (1 = abusive, 0 = not)
        return postingList, classVec
Next, build a vocabulary of all the unique words that appear across the documents:

    # Returns a list of the unique words appearing across all documents
    def createVocabList(dataSet):
        vocabSet = set([])
        for document in dataSet:
            vocabSet = vocabSet | set(document)  # union with each document's words
        return list(vocabSet)  # return as a list
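As a quick sanity check, here is `createVocabList` on a made-up two-document corpus (the words below are invented for illustration):

```python
# Sanity check of createVocabList on a made-up two-document corpus.
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

docs = [['my', 'dog', 'barks'], ['my', 'cat', 'sleeps']]
vocab = createVocabList(docs)
print(sorted(vocab))  # ['barks', 'cat', 'dog', 'my', 'sleeps']
```

Note that a Python `set` has no defined iteration order, so the list ordering can vary between runs; this is why every document must be vectorized against the one fixed vocabulary list returned here.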
The next function converts the training data into vectors. Its input is the vocabulary and a document; its output is a document vector whose elements are 0 or 1, indicating whether each vocabulary word appears in the document (1 = present, 0 = absent):

    def setOfWords2Vec(vocabList, inputSet):  # vocabList: reference vocabulary; inputSet: document to check
        returnVec = [0] * len(vocabList)  # default 0 (word absent from inputSet)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1  # mark the word as present
        return returnVec
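For example, with a hypothetical four-word vocabulary (made up for illustration):

```python
# Minimal check of setOfWords2Vec: words present in the document map to 1,
# all other vocabulary positions stay 0 (vocabulary made up for illustration).
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

vocab = ['dog', 'cat', 'my', 'stupid']
print(setOfWords2Vec(vocab, ['my', 'dog', 'my']))  # [1, 0, 1, 0]
```

The duplicate 'my' still yields a 1, not a 2: the set-of-words model records only presence, not counts.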
Next comes the naive Bayes training function. The word counts are initialized to 1 and the denominators to 2 (Laplace smoothing), so that a word unseen in one class does not force a probability of 0, and the probabilities are kept in log form so that multiplying many small values does not underflow to 0:

    import numpy as np

    def trainNB0(trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)  # total number of training documents
        numWords = len(trainMatrix[0])   # vocabulary size
        pAbusive = sum(trainCategory) / numTrainDocs  # probability of class 1
        p0Num = np.ones(numWords)  # smoothed per-word counts, class 0
        p1Num = np.ones(numWords)  # smoothed per-word counts, class 1
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]         # class 1: per-word counts
                p1Denom += sum(trainMatrix[i])  # class 1: total word count
            else:
                p0Num += trainMatrix[i]         # class 0: per-word counts
                p0Denom += sum(trainMatrix[i])  # class 0: total word count
        p1Vect = np.log(p1Num / p1Denom)  # log P(word | class 1)
        p0Vect = np.log(p0Num / p0Denom)  # log P(word | class 0)
        return p0Vect, p1Vect, pAbusive

The classifier then compares the log posterior of each class. Because the probabilities are in log space, the product of per-word probabilities becomes a sum, and multiplying by the class prior becomes adding its log:

    # vec2Classify: the document vector to classify; p0Vec: log P(word|0);
    # p1Vec: log P(word|1); pClass1: probability of class 1
    def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
        p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
        p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
        if p1 > p0:
            return 1
        else:
            return 0
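The smoothing and the log-space decision can be verified by hand on a tiny example. The following sketch restates the two functions in log form and runs them on a made-up two-document, three-word corpus (vocabulary and counts invented for illustration):

```python
import numpy as np

# Log-space naive Bayes training with Laplace smoothing, as above.
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / numTrainDocs
    p0Num = np.ones(numWords); p1Num = np.ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    return np.log(p0Num / p0Denom), np.log(p1Num / p1Denom), pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0

# Toy corpus over the vocabulary ['good', 'bad', 'dog'], one document per class.
mat = [np.array([1, 0, 1]),   # class 0: contains 'good' and 'dog'
       np.array([0, 1, 1])]   # class 1: contains 'bad' and 'dog'
p0V, p1V, pAb = trainNB0(mat, [0, 1])
# Class 1 saw 2 words in total, so the smoothed estimates are
# P('bad'|1) = (1+1)/(2+2) = 0.5 and P('good'|1) = (0+1)/(2+2) = 0.25.
print(np.exp(p1V))  # [0.25 0.5  0.5 ]
# A document containing only 'bad' should be assigned to class 1.
print(classifyNB(np.array([0, 1, 0]), p0V, p1V, pAb))  # 1
```

Working through the decision by hand: for the document containing only 'bad', p1 = log 0.5 + log 0.5 and p0 = log 0.25 + log 0.5, so p1 > p0 and the classifier returns 1, as expected.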
The following function tests the classifier:

    def testingNB():
        listOPosts, listClasses = loadDataSet()
        myVocabList = createVocabList(listOPosts)  # vocabulary
        trainMat = []
        for postInDoc in listOPosts:
            trainMat.append(setOfWords2Vec(myVocabList, postInDoc))
        p0V, p1V, pAb = trainNB0(trainMat, listClasses)
        testEntry = ['love', 'my', 'dalmation']
        thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
        print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
        testEntry2 = ['stupid', 'garbage']
        thisDoc2 = np.array(setOfWords2Vec(myVocabList, testEntry2))
        print(testEntry2, 'classified as: ', classifyNB(thisDoc2, p0V, p1V, pAb))
Calling the function produces:

    ['love', 'my', 'dalmation'] classified as:  0
    ['stupid', 'garbage'] classified as:  1
Original post: http://www.cnblogs.com/weimusan/p/7499319.html