Chinese Word Frequency Statistics

Time: 2017-09-29 22:02:22

Chinese Word Segmentation

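  1. Download a full-length Chinese novel and convert it to UTF-8 encoding.
  2. Use the jieba library to compute Chinese word frequencies and print the top 20 words with their counts.
  3. Exclude meaningless words and merge variants of the same word.
  4. Give a brief interpretation of the word frequency results.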
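Step 1 can be sketched as follows. This is only a minimal sketch: the helper name `convert_to_utf8` is hypothetical, and the default source encoding `gb18030` is an assumption (a common encoding for Chinese text files on Windows), not something stated in the task.

```python
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="gb18030"):
    """Re-encode a text file as UTF-8 and return the decoded text.

    src_encoding is an assumption; pass the real encoding of the
    downloaded file if it is known to be something else.
    """
    # Decode the source file using its (assumed) original encoding
    text = Path(src).read_text(encoding=src_encoding)
    # Write the same text back out encoded as UTF-8
    Path(dst).write_text(text, encoding="utf-8")
    return text
```

For example, `convert_to_utf8('raw.txt', 'xiaoshuo.txt')` would produce a file that can be opened with `encoding='utf-8'`; both file names here are placeholders.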
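    import jieba

    # Read the text to analyze (the file must be UTF-8 encoded)
    with open('D:\\xiaoshuo.txt', 'r', encoding='utf-8') as book:
        text = book.read()

    # Strip punctuation and whitespace before segmentation
    for ch in ',。!、 \n“”;:':
        text = text.replace(ch, '')

    # Segment the text and keep each distinct word once
    words = set(jieba.cut(text))

    # Count dictionary: occurrences of each multi-character word in the text
    counts = {}
    for w in words:
        if len(w) > 1:
            counts[w] = text.count(w)

    # Sort by count in descending order and print the top 20
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for item in items[:20]:
        print(item)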

    Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>>
    ============================= RESTART: D:/daa.py =============================
    Building prefix dict from the default dictionary ...
    Dumping model to file cache C:\Users\asus\AppData\Local\Temp\jieba.cache
    Loading model cost 1.306 seconds.
    Prefix dict has been built succesfully.
    ('父亲', 10)
    ('背影', 4)
    ('丧事', 3)
    ('北京', 3)
    ('散文', 3)
    ('茶房', 3)
    ('那年', 2)
    ('父母', 2)
    ('踌躇', 2)
    ('朱自清', 2)
    ('要紧', 2)
    ('终于', 2)
    ('日子', 2)
    ('一会', 2)
    ('一半', 2)
    ('子女', 2)
    ('描写', 2)
    ('回家', 2)
    ('不必', 2)
    ('为了', 2)
    >>>
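
Step 3 (excluding meaningless words and merging variants of the same word) might be sketched as below. This is not the script that produced the output above. It counts the tokens returned by the segmenter instead of substring matches in the text, so its numbers can differ. The names `STOP_WORDS`, `MERGE` and `top_words` are hypothetical, and the entries in the two tables are illustrative assumptions that should be chosen after inspecting the real results.

```python
from collections import Counter

# Hypothetical stop words: function words that carry little meaning
STOP_WORDS = {"为了", "不必", "终于"}
# Hypothetical merge table: map a variant to the word it should count as
MERGE = {"爸爸": "父亲"}


def top_words(tokens, n=20, stop_words=STOP_WORDS, merge=MERGE):
    """Return the n most frequent words among tokens.

    Single-character tokens and stop words are skipped, and variants
    listed in merge are counted under their canonical word.
    """
    counts = Counter()
    for tok in tokens:
        tok = merge.get(tok, tok)  # fold variants into one word
        if len(tok) > 1 and tok not in stop_words:
            counts[tok] += 1
    return counts.most_common(n)
```

With jieba installed, the function can be fed the segmenter output directly, for example `top_words(jieba.cut(text))`, where `text` is the cleaned string.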


Original: http://www.cnblogs.com/xiepingjian/p/7612830.html
