Chinese Word Frequency Statistics

Posted: 2018-03-28 23:13:22

import jieba

# Read the whole novel; the with-block closes the file even if reading fails
with open('novel.txt','r',encoding='utf-8') as f:
    content=f.read()

# Replace Chinese punctuation and newlines with spaces
symbol='''。，“”！？\n（）；'''
for i in symbol:
    content=content.replace(i,' ')

# Segment the Chinese text into words with jieba
contentList=list(jieba.cut(content))

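# Build the word-frequency counts in a single pass
# (calling contentList.count(i) for every token rescans the whole list each time)
contentDict={}
for i in contentList:
    contentDict[i]=contentDict.get(i,0)+1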

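# Exclude grammatical words: pronouns, articles, conjunctions
exclude={' ','的','她','是','了','—','他','在','说','我','你','不','都','也',
         '和','有','着','就'}
for i in exclude:
    # pop with a default avoids a KeyError when a word never occurs in the text
    contentDict.pop(i,None)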

# Sort by frequency, highest first (the result is a list of (word, count) tuples)
contentDict=sorted(contentDict.items(),key=lambda e:e[1],reverse=True)

# Print the 20 most frequent words (fewer if the text has fewer distinct words)
for item in contentDict[:20]:
    print(item)
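
The counting, filtering, and sorting steps above can also be expressed with collections.Counter from the standard library. The following is only a sketch under stated assumptions, not the author's method: the names PUNCTUATION, STOPWORDS, strip_punctuation, and top_words are hypothetical helpers introduced here for illustration, and the sample tokens are made up so the block runs without jieba or novel.txt.

```python
from collections import Counter

# Hypothetical constants for illustration, copied from the values in the script.
PUNCTUATION = "。，“”！？\n（）；"
STOPWORDS = {" ", "的", "她", "是", "了", "—", "他", "在", "说", "我", "你",
             "不", "都", "也", "和", "有", "着", "就"}


def strip_punctuation(text, symbols=PUNCTUATION):
    """Replace every character in `symbols` with a space in one pass."""
    return text.translate(str.maketrans({ch: " " for ch in symbols}))


def top_words(tokens, stopwords=STOPWORDS, n=20):
    """Count each token once, skip stopwords, and return the n most common."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


if __name__ == "__main__":
    # Pre-segmented sample tokens; in the script they would come from
    # jieba.cut(strip_punctuation(content)).
    sample = ["我", "喜欢", "小说", "的", "小说", " ", "喜欢", "小说"]
    for word, count in top_words(sample, n=2):
        print(word, count)
```

Filtering stopwords while counting means no entry ever has to be deleted afterwards, and most_common(n) simply returns fewer pairs when the text has fewer than n distinct words.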

Output:

[Image: screenshot of the program output]

Source: https://www.cnblogs.com/ffde/p/8666602.html