Python 3.6: Chinese Word Segmentation, Stop-Word Removal, and Word Cloud Visualization in PyCharm

Date: 2019-02-15 10:04:56

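While rendering the word cloud I ran into a problem where Chinese characters were not displayed. The fix is marked in the code: pass a font that contains Chinese glyphs through font_path.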

import glob
import random
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud


# Read the data
def get_content(path):
    with open(path, 'r', encoding='utf8', errors='ignore') as f:
        content = ''
        for line in f:
            # Strip leading and trailing whitespace from each line
            line = line.strip()
            content += line
        return content


# Return the `top` most frequent words as (word, count) pairs
def get_if(words, top=10):
    tf_dic = {}
    for w in words:
        # Use each token as the key and its number of occurrences as the value
        tf_dic[w] = tf_dic.get(w, 0) + 1
    return sorted(tf_dic.items(), key=lambda x: x[1], reverse=True)[:top]


# Load the stop-word list, one word per line
def stop_words(path):
    with open(path, 'r', encoding='utf-8', errors='ignore') as f:
        # Return a set so that membership tests are fast
        return {line.strip() for line in f}


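if __name__ == '__main__':

    # Collect the input text files; adjust the pattern to your data
    # (for a directory of .txt files, something like './chinese_english/*.txt')
    files = glob.glob('./chinese_english')

    # Read the content of every file into the corpus list
    corpus = [get_content(x) for x in files]

    # Pick a random index into corpus. randrange excludes the upper bound,
    # whereas randint(0, len(corpus)) could return an out-of-range index
    sample_inx = random.randrange(len(corpus))

    # Load the stop words once, then segment with jieba
    # (accurate mode is the default) and drop the stop words
    stops = stop_words('./stop_words.txt')
    split_words = [x for x in jieba.cut(corpus[sample_inx]) if x not in stops]

    # Print the randomly chosen sample
    print('Sample: ' + corpus[sample_inx])

    # Print the segmentation result of the sample
    print("\n----------------------->Segmentation")
    # print('Sample tokens: ' + '  '.join(split_words))
    for word in split_words:
        print('Sample token: ' + word)

    # Show the most frequent words
    print("\n---------------------------------->Word frequency statistics")
    # print('Top(10) words of the sample: ' + str(get_if(split_words)))
    for i in get_if(split_words):
        print('Top(10) words of the sample: ' + str(i))

    # Join the tokens with spaces before handing them to WordCloud
    word_cloud = " ".join(split_words)
    # font_path must point to a font with Chinese glyphs (here simfang.ttf);
    # this is the fix for Chinese words not showing up in the cloud
    my_wordcloud = WordCloud(font_path='simfang.ttf', collocations=False).generate(word_cloud)

    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()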
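
The stop-word step in the script above can also be tried in isolation. The following is a minimal sketch, assuming the tokens have already been produced by a segmenter such as jieba.cut. The helper names load_stop_words and filter_tokens are hypothetical and not part of the original script, the sample tokens are made up for illustration, and dropping whitespace-only tokens is an extra step the original does not perform.

```python
import os
import tempfile


def load_stop_words(path):
    """Read one stop word per line into a set (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8', errors='ignore') as f:
        return {line.strip() for line in f if line.strip()}


def filter_tokens(tokens, stops):
    """Keep tokens that are neither stop words nor pure whitespace."""
    return [t for t in tokens if t.strip() and t not in stops]


if __name__ == '__main__':
    # Write a tiny stop-word file so the example is self-contained
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                     encoding='utf-8') as tmp:
        tmp.write('的\n是\n')
    stops = load_stop_words(tmp.name)
    os.remove(tmp.name)

    # Made-up tokens standing in for the output of jieba.cut
    tokens = ['新鲜', '的', '烤面包', ' ', '是', '味道']
    print(filter_tokens(tokens, stops))
```

Loading the list once into a set keeps each membership test cheap, which matters when the sample contains many tokens.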

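Part of the output is shown below. The frequency list appears to come from a run with top=20 rather than the default of 10.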

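Sample token: 新鲜
Sample token: 烤面包
Sample token: 味道
Sample token: 某
Sample token: 一座
Sample token: 房里
Sample token: 飘
Sample token: 出来
Sample token: 也许
Sample token: 是
Sample token: 微风
Sample token: 轻拂
Sample token: 树叶
Sample token: 声音
Sample token: 或者
Sample token: 是
Sample token: 晨光
Sample token: 照射
Sample token: 轻轻
Sample token: 飘落
Sample token: 秋叶
Sample token: 上
Sample token: 方式
Sample token: 请
Sample token: 你们
Sample token: 寻找
Sample token: 东西
Sample token: 并且
Sample token: 记住
Sample token: 它们
Sample token: 吧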


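------------------------------>Word frequency statistics

Top(20) words of the sample: ('class', 3)
Top(20) words of the sample: ('一个', 3)
Top(20) words of the sample: ('一些', 3)
Top(20) words of the sample: ('放学', 3)
Top(20) words of the sample: ('东西', 3)
Top(20) words of the sample: ('I', 3)
Top(20) words of the sample: ('you', 3)
Top(20) words of the sample: ('你们', 3)
Top(20) words of the sample: ('人', 3)
Top(20) words of the sample: ('它', 3)
Top(20) words of the sample: ('也许', 3)
Top(20) words of the sample: ('way', 3)
Top(20) words of the sample: ('or', 3)
Top(20) words of the sample: ('it', 3)
Top(20) words of the sample: ('very', 2)
Top(20) words of the sample: ('school', 2)
Top(20) words of the sample: ('with', 2)
Top(20) words of the sample: ('when', 2)
Top(20) words of the sample: ('over', 2)
Top(20) words of the sample: ('things', 2)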
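
A frequency list like the one above can also be produced with the standard library's collections.Counter instead of a hand-built dictionary. This is a minimal sketch of that alternative, not the author's code: top_words is a hypothetical name, and the tokens are made up for illustration.

```python
from collections import Counter


def top_words(words, top=10):
    """Return the `top` most frequent (word, count) pairs, most frequent first."""
    return Counter(words).most_common(top)


if __name__ == '__main__':
    # Made-up tokens standing in for the filtered output of jieba.cut
    tokens = ['东西', '你们', '东西', '也许', '东西', '你们']
    for pair in top_words(tokens, top=2):
        print(pair)
```

Counter does the counting and the sorting in one call, so the function body shrinks to a single line while returning the same (word, count) tuples.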

Word cloud

[Image: word cloud generated from the sample]
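
Since the Chinese words only show up when WordCloud is given a font that contains Chinese glyphs, it can help to check that such a font file exists before building the cloud. The sketch below is one way to do that under stated assumptions: find_font is a hypothetical helper, and apart from simfang.ttf (the font used in the script) the candidate paths are guesses that need adjusting to the fonts installed on your system.

```python
import os


def find_font(candidates):
    """Return the first path in `candidates` that is an existing file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == '__main__':
    # Assumed candidate locations; only 'simfang.ttf' comes from the script
    candidates = [
        'simfang.ttf',
        r'C:\Windows\Fonts\simfang.ttf',
        '/usr/share/fonts/truetype/wqy/wqy-microhei.ttc',
    ]
    font = find_font(candidates)
    if font is None:
        print('No candidate font found; Chinese words may not be rendered.')
    else:
        print('Using font: ' + font)
        # The path can then be passed on in the same way as in the script:
        # WordCloud(font_path=font, collocations=False).generate(text)
```

Checking up front turns a silently broken picture into an explicit message about the missing font.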


Source: https://www.cnblogs.com/RHadoop-Hive/p/10381887.html
