I recently picked up the NLP work I had set aside. Just as I was about to start on relation extraction and needed to turn words into vectors, I came across Word2Vec, and down the rabbit hole I went.
1. The Java version
There are many implementations of Word2Vec out there. This experiment is mainly about the Python version, but to keep things simple at first (staying inside my current project) I tried the Java version, written by Sun Jian, the author of ansj. If I remember correctly, ansj is no longer maintained, but since this tool exists it was worth a try: import the project, train, then query. Simple enough, but without a corpus there wasn't much to see.
Repository: https://github.com/NLPchina/Word2VEC_java. For reasons I haven't figured out, once the corpus grew (1 GB of Chinese text, which isn't that large), the Java version died at around 4.17 GB of memory, even though I had given the JVM 10 GB of heap to be safe (e.g. via -Xmx10g). So the Java training step never completed on the large corpus; I'll try again later.
2. The Python version
Lacking a large corpus, I went hunting again: the State Language Commission's corpus, the corpora bundled with various segmentation tools, the Sogou corpus, the Peking University Chinese corpus, and so on. They were either impossible to download or too dated. Things turned around while browsing 52nlp: a post there pointed out that the Chinese Wikipedia is available as a high-quality corpus, so I grabbed it right away. (The experiment follows: http://www.52nlp.cn/中英文维基百科语料上的word2vec实验)
Environment: MacBook Pro (i5, 16 GB RAM, 256 GB SSD), Python 2.7, JDK 1.8
Steps:
1. Download the corpus (going straight to Chinese); currently you need:
https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
2. Parse the wiki dump
process_wiki.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
Put the script and the dump in the same directory and run: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text. The output looks roughly like this (I didn't keep a screenshot, so the log below is borrowed):
2015-03-11 17:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
2015-03-11 17:40:08,329: INFO: Saved 10000 articles
2015-03-11 17:40:45,501: INFO: Saved 20000 articles
2015-03-11 17:41:23,659: INFO: Saved 30000 articles
2015-03-11 17:42:01,748: INFO: Saved 40000 articles
2015-03-11 17:42:33,779: INFO: Saved 50000 articles
......
2015-03-11 17:55:23,094: INFO: Saved 200000 articles
2015-03-11 17:56:14,692: INFO: Saved 210000 articles
2015-03-11 17:57:04,614: INFO: Saved 220000 articles
2015-03-11 17:57:57,979: INFO: Saved 230000 articles
2015-03-11 17:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
2015-03-11 17:58:16,622: INFO: Finished Saved 232894 articles
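Once that finishes, it's worth a quick sanity check: process_wiki.py writes one article per line, tokens separated by spaces. A minimal peek, assuming the output file name from the command above:

# print the first 200 characters of the first extracted article
with open('wiki.zh.text') as f:
    print f.readline()[:200]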
After parsing, three more steps are needed: (1) convert Traditional Chinese to Simplified, (2) normalize everything to UTF-8, and (3) segment the text into words.
I already had tools on hand for all three, so I didn't use the ones from the 52nlp write-up; anything that achieves the same result is fine.
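For reference, here is a minimal sketch of those three steps, assuming the opencc command-line tool and the jieba package are installed (both are stand-ins for whatever tools you have; the file names are illustrative but match the training log below):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Steps (1)+(2): Traditional -> Simplified, output already UTF-8.
# Run the opencc CLI first (the config name varies by opencc version;
# older releases use zht2zhs.ini, newer ones t2s.json):
#   opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
# Step (3): segment each line with jieba.

import codecs
import jieba

with codecs.open('wiki.zh.text.jian', 'r', encoding='utf-8') as inp, \
     codecs.open('wiki.zh.text.jian.seg.utf-8', 'w', encoding='utf-8') as outp:
    for line in inp:
        # one article per line; join tokens with spaces for LineSentence
        outp.write(' '.join(jieba.cut(line.strip())) + '\n')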
Then train the model with train_word2vec_model.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)
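As an aside: the training log below shows 'skipgram'=1 and 'hierarchical softmax'=1, i.e. the gensim version of the time defaulted to skip-gram with hierarchical softmax and no negative sampling. A sketch of the same call with those settings made explicit (parameter names per the gensim 0.x API; treat as illustrative):

import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# same training call with the then-defaults spelled out:
# sg=1 -> skip-gram, hs=1 -> hierarchical softmax, negative=0 -> no negative sampling
model = Word2Vec(LineSentence('wiki.zh.text.jian.seg.utf-8'),
                 size=400, window=5, min_count=5,
                 sg=1, hs=1, negative=0,
                 workers=multiprocessing.cpu_count())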
Run: python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector (note the input is the converted, segmented file, not the raw wiki.zh.text).
As before, the output:
2015-03-11 18:50:02,586: INFO: running train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2015-03-11 18:50:02,592: INFO: collecting all words and their counts
2015-03-11 18:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-11 18:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types
2015-03-11 18:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types
2015-03-11 18:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types
...
2015-03-11 18:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types
2015-03-11 18:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types
2015-03-11 18:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types
2015-03-11 18:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences
2015-03-11 18:52:13,672: INFO: total 278291 word types after removing those with count<5
2015-03-11 18:52:13,673: INFO: constructing a huffman tree from 278291 words
2015-03-11 18:52:29,323: INFO: built huffman tree with maximum node depth 25
2015-03-11 18:52:29,683: INFO: resetting layer weights
2015-03-11 18:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-11 18:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s
2015-03-11 18:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s
2015-03-11 18:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s
2015-03-11 18:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s
2015-03-11 18:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s
2015-03-11 18:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s
2015-03-11 18:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s
......
2015-03-11 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s
2015-03-11 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s
2015-03-11 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s
2015-03-11 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s
2015-03-11 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s
2015-03-11 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None
2015-03-11 19:22:13,884: INFO: not storing attribute syn0norm
2015-03-11 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy
2015-03-11 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy
2015-03-11 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector
Once training finishes, you can use the model from Python.
Basic usage:
>>> import gensim
>>> model = gensim.models.Word2Vec.load("wiki.zh.text.model")
>>> result = model.most_similar(u"美女")
>>> for e in result:
...     print e[0], e[1]
...
帅哥 0.629959464073
正妹 0.607636809349
校花 0.566570997238
美腿 0.560691952705
女明星 0.556897878647
性感 0.548311054707
谐星 0.537560880184
大变身 0.52529746294
女丑 0.517377853394
辣妹 0.506102442741
>>> result = model.most_similar(positive=[u'中国', u'日本'], negative=[u'东京'])
>>> for e in result:
...     print e[0], e[1]
...
我国 0.525859713554
中国政府 0.455589711666
朝鲜民主主义人民共和国 0.433199852705
中华民国 0.430634796619
全中国 0.429285645485
美国 0.425486922264
境外 0.422223210335
台商 0.420866370201
英国 0.420089453459
中华人民共和国政府 0.41133800149
>>> result = model.most_similar(positive=[u'女人', u'国王'], negative=[u'男人'])
>>> for e in result:
...     print e[0], e[1]
...
王储 0.538514256477
王室 0.533518970013
四世 0.531962811947
一世 0.531662106514
王后 0.528761506081
王位 0.517430365086
君主 0.513949334621
摄政王 0.50737452507
二世 0.503388166428
六世 0.503049015999
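A few more calls worth knowing, sketched against the gensim API of that era (gensim 0.x; newer versions moved these methods onto KeyedVectors):

import gensim
model = gensim.models.Word2Vec.load("wiki.zh.text.model")

# cosine similarity between two words
print model.similarity(u'美女', u'帅哥')

# find the word that doesn't belong in the list
print model.doesnt_match(u'早餐 午餐 晚餐 美女'.split())

# the plain-text vectors written by save_word2vec_format can also be
# loaded directly (handy for interoperating with other tools)
from gensim.models import Word2Vec
model2 = Word2Vec.load_word2vec_format("wiki.zh.text.vector", binary=False)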
For more detailed usage, see:
https://radimrehurek.com/gensim/models/word2vec.html
http://rare-technologies.com/word2vec-tutorial/
Thanks to 52nlp.
word2vec experiments on Chinese Wikipedia, Python and Java versions
Original post: http://www.cnblogs.com/helloever/p/5280891.html