Getting the top N most frequent words across all txt files

Posted: 2020-04-11 17:02:34

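1. Use chain to flatten the two-dimensional list allwords
    from itertools import chain
    allwords = []
    allwords.append(words)    # words is the list of words from one file
    Unpack: chain(*allwords)
        The * passes each sub-list of allwords to chain as a separate argument, and chain then yields their elements one after another.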
2. Use next to pull elements out of a chain object
    c = chain([1,2,3],"hello",(1,2,3),map(str,range(3)))
    next(c) returns the next element of c: 1
    next(c) returns the next element of c: 2
3. Count how often each valid word occurs
    from collections import Counter
    freq = Counter(chain(*allwords))
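4. Counter maps each element of an iterable to the number of times it occurs
    Use the most_common method to get the most frequent elements, for example the top three:
        .most_common(3)
    Counter("dadasfafasfa")
        Counter({'a': 5, 'f': 3, 'd': 2, 's': 2})
    Counter("dadasfafasfa").most_common(2)
        [('a', 5), ('f', 3)]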
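The four notes above can be tied together in one short sketch. The sample word lists are made up for illustration and are not from the original data set. The sketch also shows a detail that is easy to miss: a chain object is a one-shot iterator, so elements already consumed with next are not seen again by later code.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample data: one sub-list of words per file.
allwords = [["free", "offer", "free"], ["meeting", "offer"], ["free"]]

# chain(*allwords) lazily yields the words of every sub-list in order.
flat = chain(*allwords)
first = next(flat)    # consumes "free"
second = next(flat)   # consumes "offer"

# flat has already lost two elements, so build a fresh chain for counting.
freq = Counter(chain(*allwords))
top2 = freq.most_common(2)

print(first, second)
print(top2)
```

When the outer list is large, chain.from_iterable(allwords) is an equivalent spelling that avoids unpacking the whole list into arguments.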

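Program:
from collections import Counter
from itertools import chain

allwords = []

def getTopWords(topN):
    # Process all text files in the current folder in order of file number,
    # e.g. 5.txt, 9.txt, 121.txt.
    # The training set holds 141 emails: 0.txt to 99.txt are spam,
    # 100.txt to 140.txt are legitimate emails.
    txtFiles = [str(i) + '.txt' for i in range(141)]
    # Collect the words from every email in the training set.
    # getWordsFromFile is not shown in this post; it is assumed to return
    # the list of valid words found in one file.
    for txtFile in txtFiles:
        allwords.append(getWordsFromFile(txtFile))
    # Count every word and return the topN most frequent ones.
    # most_common yields (word, count) pairs: w[0] is the word, w[1] its count.
    freq = Counter(chain(*allwords))
    return [w[0] for w in freq.most_common(topN)]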
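
The post does not show getWordsFromFile, so the program cannot be run on its own. The following is a minimal runnable sketch under stated assumptions: the getWordsFromFile here is a hypothetical stand-in that treats any run of ASCII letters as a word, the file list is passed in as a parameter instead of being hard-coded to 141 files, and the sample texts are made up. The original helper may well tokenize differently.

```python
import re
import tempfile
from collections import Counter
from itertools import chain
from pathlib import Path


def getWordsFromFile(txtFile):
    # Hypothetical stand-in: the original helper is not shown in the post.
    # Here a "word" is any run of ASCII letters, lowercased.
    text = Path(txtFile).read_text(encoding="utf-8")
    return re.findall(r"[a-z]+", text.lower())


def getTopWords(txtFiles, topN):
    # A local list is used instead of a module-level allwords, so repeated
    # calls do not accumulate words from earlier runs.
    allwords = [getWordsFromFile(f) for f in txtFiles]
    freq = Counter(chain(*allwords))
    return [word for word, count in freq.most_common(topN)]


# Tiny made-up corpus written to a temporary folder.
samples = ["Free offer, free prize", "Meeting at noon", "free offer today"]
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for i, text in enumerate(samples):
        path = Path(folder) / f"{i}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    top_words = getTopWords(paths, 2)

print(top_words)
```

Keeping the word list local is a deliberate difference from the program above, where calling getTopWords twice would append every file's words to the global allwords a second time and double all counts.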

2020-04-11

Original: https://www.cnblogs.com/hany-postq473111315/p/12680560.html
