Word Segmentation with ICTCLAS2015

Date: 2015-03-24 09:14:46

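For this year's Imagine Cup I worked on a component that involves semantic analysis, which needs word segmentation as its foundation. I used ICTCLAS2015 from the Chinese Academy of Sciences, and in this post I explain how to use it for word segmentation.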

ICTCLAS2015

Introduction

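Chinese lexical analysis is the foundation of, and the key to, Chinese information processing. Building on years of research, the Institute of Computing Technology of the Chinese Academy of Sciences developed the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System). Its main features are Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word detection, and it also supports user dictionaries. The system was refined over five years with six kernel upgrades and has now reached ICTCLAS3.0. ICTCLAS3.0 segments at 996 KB/s on a single machine with a segmentation accuracy of 98.45%; the API is no larger than 200 KB and the compressed dictionary data comes to less than 3 MB. It is billed as the best Chinese lexical analyzer in the world today.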

Download

http://ictclas.nlpir.org/downloads

Developing with ICTCLAS2015

Development platform used in this post

Operating system: Windows 8.1 x64
Programming language: Java
IDE: Eclipse

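Worked example

Preparation

Copy the Data folder and NLPIR.dll into the project directory.

Download the JNA library, jna-platform-4.1.0.jar, and add it to the build path. The com.sun.jna.Library and com.sun.jna.Native classes live in the core JNA jar, which jna-platform depends on, so that jar needs to be on the build path as well.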
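
If either the Data folder or the DLL is missing, the problem can surface only later as a library load error or a failed initialization, so it may help to check for both before doing anything else. The following is a minimal sketch that uses only the Java standard library. The class name `NlpirPreflight`, the method `missingResources`, and the DLL's relative path are assumptions made for illustration; none of them is part of ICTCLAS or JNA.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: checks that the files copied in the preparation step exist.
class NlpirPreflight {

    /**
     * Returns one message per missing resource. An empty list means both the
     * Data folder and the DLL were found under projectDir.
     *
     * @param projectDir      directory the application is started from
     * @param dllRelativePath where the DLL was copied, relative to projectDir
     *                        (an assumption; adjust to your own layout)
     */
    static List<String> missingResources(Path projectDir, String dllRelativePath) {
        List<String> problems = new ArrayList<>();

        // The dictionary folder shipped with the download
        Path dataDir = projectDir.resolve("Data");
        if (!Files.isDirectory(dataDir)) {
            problems.add("Data folder not found: " + dataDir);
        }

        // The native library that JNA will be asked to load
        Path dll = projectDir.resolve(dllRelativePath);
        if (!Files.isRegularFile(dll)) {
            problems.add("Native library not found: " + dll);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Check the current working directory and print anything that is missing
        for (String problem : missingResources(Paths.get("."), "libs/NLPIR.dll")) {
            System.out.println(problem);
        }
    }
}
```

Returning a list instead of throwing on the first problem means both missing files are reported in one run.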

Calling the C++ interface through JNA

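    import com.sun.jna.Library;
    import com.sun.jna.Native;

    // JNA interface declaring the NLPIR functions used in this post
    public interface CLibrary extends Library {
        // Load the native NLPIR library from ./libs and create the proxy instance
        CLibrary Instance = (CLibrary) Native.loadLibrary("./libs/NLPIR", CLibrary.class);
        // Initialize the system
        public int NLPIR_Init(byte[] sDataPath, int encoding, byte[] sLicenceCode);
        // Segment a paragraph of text
        public String NLPIR_ParagraphProcess(String sSrc, int bPOSTagged);
        // Extract keywords
        public String NLPIR_GetKeyWords(String sLine, int nMaxKeyLimit, boolean bWeightOut);
        // Shut the system down
        public void NLPIR_Exit();
        // Segment a whole file
        public double NLPIR_FileProcess(String sSourceFilename, String sResultFilename, int bPOStagged);
        // Import a user-defined dictionary
        public int NLPIR_ImportUserDict(String sFilename, boolean bOverwrite);
        // Add a user-defined word together with its part-of-speech tag
        public int NLPIR_AddUserWord(String sWords);
    }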

Segmenting a piece of text and returning the result with part-of-speech tags

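    // Imports needed by this method:
    // java.io.BufferedReader, java.io.File, java.io.FileInputStream, java.io.InputStreamReader

    /**
     * Reads a UTF-8 text file, segments its contents and returns the result
     * with part-of-speech tags.
     *
     * @param fileName path of the text file to segment
     * @return result[0] is the tagged segmentation, result[1] the keywords;
     *         null if NLPIR fails to initialize
     * @throws Exception if the charset name is not supported
     */
    public static String[] Segment(String fileName) throws Exception {
        // Holds the segmentation result and the keywords
        String[] result = {"", ""};
        StringBuilder source = new StringBuilder();
        String encoding = "UTF-8";

        // Read the text from the file
        File file = new File(fileName);
        if (file.isFile()) {
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader bufferedReader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), encoding))) {
                String temp;
                while ((temp = bufferedReader.readLine()) != null) {
                    source.append(temp); // line breaks are dropped, as before
                }
            } catch (Exception e) {
                System.out.println("Error while reading the file");
                e.printStackTrace();
            }
        } else {
            System.out.println("The specified file was not found");
        }
        String sourceString = source.toString();

        // Initialize NLPIR
        String argu = "";                 // empty data path
        String system_charset = "UTF-8";
        int charset_type = 1;             // encoding code passed together with UTF-8 text
        int init_flag = CLibrary.Instance.NLPIR_Init(argu.getBytes(system_charset), charset_type, "1".getBytes(system_charset));

        // Check the result before calling any other NLPIR function
        if (0 == init_flag) {
            System.out.println("init fail!");
            return null;
        }

        // Load the user-defined words only after initialization has succeeded
        AddUserWords("dic/dic.txt");

        String segmented = null; // tagged segmentation
        String keywords = null;  // extracted keywords
        try {
            // Segment with part-of-speech tagging switched on
            segmented = CLibrary.Instance.NLPIR_ParagraphProcess(sourceString, 1);
            // Extract up to 5 keywords, with weights
            keywords = CLibrary.Instance.NLPIR_GetKeyWords(sourceString, 5, true);
        } catch (Exception e) {
            e.printStackTrace();
        }
        result[0] = segmented;
        result[1] = keywords;
        // Return the segmentation result
        return result;
    }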
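
The tagged segmentation comes back as one long string, so the next step is usually to split it into words and tags. Here is a minimal sketch of that step. It assumes the output consists of whitespace-separated `word/tag` tokens; please check this against the output you actually get. `TaggedOutputParser` and `Token` are hypothetical names, and the sample string in `main` is made up for illustration rather than taken from a real ICTCLAS run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a tagged segmentation string into word/tag pairs.
class TaggedOutputParser {

    /** A word and its part-of-speech tag; the tag is empty if none was present. */
    static final class Token {
        public final String word;
        public final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Parses whitespace-separated "word/tag" tokens (assumed output layout). */
    static List<Token> parse(String tagged) {
        List<Token> tokens = new ArrayList<>();
        if (tagged == null) {
            return tokens;
        }
        for (String piece : tagged.trim().split("\\s+")) {
            if (piece.isEmpty()) {
                continue; // happens when the input is empty or only whitespace
            }
            // Split at the LAST slash so that a word containing '/' stays intact
            int slash = piece.lastIndexOf('/');
            if (slash <= 0) {
                tokens.add(new Token(piece, ""));
            } else {
                tokens.add(new Token(piece.substring(0, slash), piece.substring(slash + 1)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up sample input in the assumed layout
        for (Token t : parse("中文/n 分词/v")) {
            System.out.println(t.word + " -> " + t.tag);
        }
    }
}
```

With the method above, `TaggedOutputParser.parse(result[0])` would turn the first element of the returned array into a list of tokens.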

Adding user dictionary words with part-of-speech tags

    /**
     * Adds the words in a user dictionary file, one entry per line,
     * together with their part-of-speech tags.
     *
     * @param filePath path of the UTF-8 encoded dictionary file
     */
    public static void AddUserWords(String filePath) {
        String encoding = "UTF-8";
        File file = new File(filePath);
        if (!file.isFile()) {
            System.out.println("File not found!");
            return;
        }
        // try-with-resources closes the reader, which the original never did
        try (BufferedReader bufferReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), encoding))) {
            String lineText;
            while ((lineText = bufferReader.readLine()) != null) {
                // Each line is passed to NLPIR as one user word entry
                CLibrary.Instance.NLPIR_AddUserWord(lineText);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
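
One thing to watch for with a dictionary file such as dic/dic.txt: a UTF-8 file may start with a byte order mark, and Java's InputStreamReader passes it through as the character U+FEFF at the start of the first line, so the first entry would silently carry an invisible prefix. The sketch below strips that mark and skips blank lines before the entries are handed to NLPIR. `UserDictLines` and `clean` are hypothetical names, and the "word tag" layout of the sample lines is an assumption about the dictionary format, not something confirmed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: tidies raw dictionary lines before they are added as user words.
class UserDictLines {

    /** Removes a leading byte order mark, surrounding whitespace and blank lines. */
    static List<String> clean(List<String> rawLines) {
        List<String> cleaned = new ArrayList<>();
        for (int i = 0; i < rawLines.size(); i++) {
            String line = rawLines.get(i);
            // Only the first line can carry the mark; trim() does not remove U+FEFF
            if (i == 0 && line.startsWith("\uFEFF")) {
                line = line.substring(1);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                cleaned.add(line);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed "word tag" layout
        List<String> raw = Arrays.asList("\uFEFF博客 n", "", "  分词 v  ");
        for (String entry : clean(raw)) {
            System.out.println("[" + entry + "]");
        }
    }
}
```

In AddUserWords, the lines read from the file could be collected into a list, passed through `clean`, and then handed one by one to `CLibrary.Instance.NLPIR_AddUserWord`.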

Original post: http://blog.csdn.net/luoyhang003/article/details/44586731
