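The most annoying part of big data work is poor data quality. To import data into SequoiaDB, the text files must be UTF-8 encoded. enca reported the file's encoding as GB2312, but converting it to UTF-8 with enca failed with an error. Searching Google turned up no explanation, so I tried doing the conversion in Python instead: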
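# -*- coding: utf-8 -*-
from __future__ import print_function

import codecs
import sys

# The file to convert is passed as the first command-line argument.
if len(sys.argv) < 2:
    sys.exit(1)
filename = sys.argv[1]
print("File name:", filename)

# Work on raw bytes so the decoding step is explicit.
infile = open(filename, "rb")
outfile = open(filename + "utf8", "wb")

# Skip a UTF-8 BOM if present; otherwise rewind so the first bytes are kept.
bom = infile.read(3)
if bom != codecs.BOM_UTF8:
    infile.seek(0)

# Decode each line from GB2312 and write it back out as UTF-8.
for line in infile:
    outfile.write(line.decode("gb2312").encode("utf-8"))

infile.close()
outfile.close()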
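The script above stops with a UnicodeDecodeError at the first byte sequence that is not valid GB2312. The post does not say why enca failed, so the following is only a sketch for locating such lines, assuming stray bytes might be the cause. The helper names `find_undecodable_lines` and `convert_bytes` are hypothetical and not part of the original script.

```python
import codecs


def _strip_bom(data):
    """Drop a leading UTF-8 BOM, mirroring the check in the script above."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def find_undecodable_lines(data, encoding="gb2312"):
    """Return (line_number, reason) pairs for lines that fail to decode."""
    failures = []
    for number, line in enumerate(_strip_bom(data).split(b"\n"), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError as exc:
            failures.append((number, exc.reason))
    return failures


def convert_bytes(data, encoding="gb2312", errors="strict"):
    """Re-encode a byte string from `encoding` to UTF-8.

    Passing errors="replace" substitutes U+FFFD for undecodable bytes
    instead of raising, at the cost of losing the original characters.
    """
    return _strip_bom(data).decode(encoding, errors).encode("utf-8")
```

A possible use is to read the whole file with `open(filename, "rb").read()`, call `find_undecodable_lines` on the result to see which lines are problematic, and then decide whether to fix the input or to call `convert_bytes` with `errors="replace"`.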
Original post: http://www.cnblogs.com/gaoxing/p/4918134.html