Synonym lookup and keyword expansion with Tencent AI Lab's 8-million-word embeddings, gensim, and annoy

Date: 2020-08-24 14:33:04

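I have been building a keyword-matching system recently and, to improve its results, added a keyword-expansion feature. It uses the Tencent AI Lab word-vector file, which covers about 8 million words.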

Download page for the Tencent AI Lab embeddings: https://ai.tencent.com/ailab/nlp/zh/embedding.html (the working address at the time of writing).

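The first approach reads the vectors with gensim and looks up similar words. It uses a lot of memory and is slow, so a machine with at least 16 GB of RAM and a high-clock-speed CPU is recommended.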
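To make the memory cost more concrete, here is a minimal sketch of how a word2vec text file is laid out: a header line with the word count and dimension, then one word and its values per line. The function name `read_word2vec_text` and the toy data are made up for illustration and are not part of gensim. In gensim itself, `load_word2vec_format` accepts a `limit` argument that keeps only the first N rows, which is one way to try things out on a smaller machine.

```python
import io


def read_word2vec_text(lines, limit=None):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vdim' rows."""
    lines = iter(lines)
    count, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for row in lines:
        if limit is not None and len(vectors) >= limit:
            break  # stop early, keeping only the first `limit` rows
        parts = row.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not have exactly one word plus dim values
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Toy data standing in for the real file (values are made up, dimension 3)
sample = io.StringIO("3 3\n苹果 0.1 0.2 0.3\n香蕉 0.2 0.1 0.3\n汽车 0.9 0.0 0.1\n")
dim, vectors = read_word2vec_text(sample, limit=2)
print(dim, list(vectors))
```

Every word kept in memory costs one vector of `dim` floats, so cutting the row count cuts memory roughly in proportion, at the price of a smaller vocabulary.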

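import gensim

# Load the vectors in word2vec text format (slow for a file this large).
wv_from_text = gensim.models.KeyedVectors.load_word2vec_format('./Tencent_AILab_ChineseEmbedding.txt', binary=False)

# L2-normalize the vectors in place; this saves a lot of memory and most_similar still works.
# Note: init_sims is deprecated in gensim 4.x.
wv_from_text.init_sims(replace=True)

while True:
    keyword = input("Enter a keyword: ")
    try:
        print(wv_from_text.most_similar(positive=[keyword], topn=5))
    except KeyError:
        # most_similar raises KeyError for words that are not in the vocabulary
        print("Keyword not in vocabulary:", keyword)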

This returns the 5 most similar words along with their similarity scores.
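The ranking behind `most_similar` is cosine similarity between word vectors. The following is a minimal pure-Python sketch of that idea on a three-word toy vocabulary. The helper names and the vectors are invented for illustration, and gensim's real implementation is vectorized and much faster.

```python
import math


def unit(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word`."""
    if word not in vectors:
        return []  # out-of-vocabulary keyword
    q = unit(vectors[word])
    scored = []
    for other, v in vectors.items():
        if other == word:
            continue  # do not return the query itself
        u = unit(v)
        scored.append((other, sum(a * b for a, b in zip(q, u))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up 2-dimensional vectors, only to show the ranking
toy = {
    "苹果": [1.0, 0.0],
    "香蕉": [0.8, 0.6],
    "汽车": [0.0, 1.0],
}
result = most_similar("苹果", toy, topn=2)
print(result)
```

For keyword expansion, one reasonable choice is to keep only neighbours whose score exceeds a threshold picked by inspecting results on your own keywords.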

The code below uses the annoy module instead. It did not seem to support Windows, so use Linux, ideally with more than 32 GB of RAM and a high-clock-speed CPU.

from gensim.models import KeyedVectors
import json
from collections import OrderedDict
from annoy import AnnoyIndex

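# Loading takes a while and uses roughly 12 GB of memory once finished. Memory keeps growing
# during later steps, so use a machine with plenty of RAM when testing.
tc_wv_model = KeyedVectors.load_word2vec_format('./Tencent_AILab_ChineseEmbedding.txt', binary=False)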

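# Build a word -> ID mapping and save it as JSON, so the annoy index can later be loaded offline without the model
word_index = OrderedDict()
# gensim 4.x exposes the vocabulary as key_to_index; gensim 3.x used vocab
vocab = tc_wv_model.key_to_index if hasattr(tc_wv_model, 'key_to_index') else tc_wv_model.vocab
for counter, key in enumerate(vocab.keys()):
    word_index[key] = counter

with open('tc_word_index.json', 'w') as fp:
    json.dump(word_index, fp)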

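# Build the Annoy index over the Tencent vectors (about 8.82 million entries, 200 dimensions each)
tc_index = AnnoyIndex(200, 'angular')  # pass the metric explicitly rather than relying on the default
for key, i in word_index.items():  # iterate over every word in the Tencent vocabulary
    tc_index.add_item(i, tc_wv_model[key])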


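# Building also takes a long time. The n_trees parameter matters; the official documentation says:
# n_trees is provided during build time and affects the build time and the index size.
# A larger value will give more accurate results, but larger indexes.
# With no prior experience, I used 10 as in the documentation. By this point the whole process uses about 30 GB of memory.
tc_index.build(10)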

# Save the index to disk; loading it again on its own, together with the vocabulary, takes about 2 GB of memory
tc_index.save('tc_index_build10.index')

# Build the reverse ID -> word mapping
reverse_word_index = {value: key for key, value in word_index.items()}

# Test Annoy: the results for 自然语言处理 (natural language processing) are basically the same
# as those returned by the AINLP WeChat account backend (query there: 相似词 自然语言处理)
for item in tc_index.get_nns_by_item(word_index['自然语言处理'], 11):
    print(reverse_word_index[item])

# The results for English words seem to differ somewhat, though
for item in tc_index.get_nns_by_item(word_index['nlp'], 11):
    print(reverse_word_index[item])
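The lookups above ask for 11 items because the query word normally comes back as its own nearest neighbour. A small wrapper can drop it, handle out-of-vocabulary keywords, and convert Annoy's angular distance into a cosine similarity. This is a sketch under stated assumptions: `expand_keyword` and `FakeIndex` are hypothetical names, and `FakeIndex` only imitates the `get_nns_by_item(..., include_distances=True)` call of a real `AnnoyIndex` so the example runs without the 8-million-word index.

```python
def expand_keyword(word, index, word_index, reverse_word_index, topn=10):
    """Return up to `topn` (neighbour, cosine_similarity) pairs for `word`."""
    item_id = word_index.get(word)
    if item_id is None:
        return []  # out-of-vocabulary keyword
    # Ask for one extra result because the query item is normally returned first.
    ids, dists = index.get_nns_by_item(item_id, topn + 1, include_distances=True)
    results = []
    for neighbour_id, d in zip(ids, dists):
        if neighbour_id == item_id:
            continue
        # Annoy's angular distance is sqrt(2 * (1 - cos)), so cos = 1 - d^2 / 2.
        results.append((reverse_word_index[neighbour_id], 1.0 - d * d / 2.0))
    return results[:topn]


class FakeIndex:
    """Hypothetical stand-in for AnnoyIndex with fixed, made-up neighbours."""

    def get_nns_by_item(self, i, n, include_distances=False):
        ids, dists = [0, 1, 2][:n], [0.0, 0.5, 1.0][:n]
        return (ids, dists) if include_distances else ids


# Toy mappings standing in for the real tc_word_index.json contents
word_index = {"自然语言处理": 0, "nlp": 1, "文本挖掘": 2}
reverse_word_index = {v: k for k, v in word_index.items()}
expanded = expand_keyword("自然语言处理", FakeIndex(), word_index, reverse_word_index, topn=2)
print(expanded)
```

With the real index, `tc_index` would be passed in place of `FakeIndex()`, and the similarity values make it possible to filter expansions by a threshold instead of a fixed count.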

Original post: https://www.cnblogs.com/LiuXinyu12378/p/13553368.html
