Oracle 支持流版式文件的全文检索,而原生的PostgreSQL是不支持流版式文件全文检索的。KingbaseES 通过ftutilx 插件将流版式文件转换成文本文件,从而支持流版式文件全文检索。
分别创建文本文件和word文件,内容如下:
oid | cfgname | cfgnamespace | cfgowner | cfgparser -------+------------+--------------+----------+----------- 3748 | simple | 11 | 10 | 3722 13288 | arabic | 11 | 10 | 3722 13290 | danish | 11 | 10 | 3722 13292 | dutch | 11 | 10 | 3722 13294 | english | 11 | 10 | 3722 13296 | finnish | 11 | 10 | 3722 13298 | french | 11 | 10 | 3722 13300 | german | 11 | 10 | 3722
create table ts_test(txt_clob clob,txt_blob blob,doc_clob clob,doc_blob blob); insert into ts_test values(clob_import(‘/home/kb21/temp.txt‘),blob_import(‘/home/kb21/temp.txt‘),clob_import(‘/home/kb21/temp.docx‘),blob_import(‘/home/kb21/temp.docx‘));
ERROR: invalid byte sequence for encoding "UTF8": 0x00
注意:这里把docx 用clob 导入时报错。
create table ts_test(txt_clob clob,txt_blob blob,doc_blob blob); insert into ts_test values(clob_import(‘/home/kb21/temp.txt‘,‘UTF8‘),blob_import(‘/home/kb21/temp.txt‘),blob_import(‘/home/kb21/temp.docx‘));
test=# select count(*) from ts_test where to_tsvector(txt_clob) @@ to_tsquery(‘simple‘); count ------- 1 (1 row) test=# select count(*) from ts_test where to_tsvector(txt_blob) @@ to_tsquery(‘simple‘); count ------- 1 (1 row) test=# select count(*) from ts_test where to_tsvector(doc_blob) @@ to_tsquery(‘simple‘); count ------- 0 (1 row)
结论:对于文本类型,不管存储数据类型的是clob,还是blob,全文检索都可以搜索到;对于docx类型,由于是流版式格式,全文检索无法使用。
shared_preload_libraries = ‘ftutilx, ......‘
ftutilx 依赖于 jre-1.8.0 运行时环境,部署后需要设置LD_LIBRARY_PATH 系统环境变量包含jre-1.8.0的libjvm.so 路径。
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/etc/alternatives/jre_1.8.0/lib/amd64/server
test=# create extension ftutilx; CREATE EXTENSION test=# \dx+ ftutilx Objects in extension "ftutilx" Object description ---------------------------- function extracttext(blob) (1 row) test=# select extracttext(doc_blob) from ts_test; extracttext ------------------------------------------------------------ oid | cfgname | cfgnamespace | cfgowner | cfgparser + -------+------------+--------------+----------+-----------+ 3748 | simple | 11 | 10 | 3722 + 13288 | arabic | 11 | 10 | 3722 + 13290 | danish | 11 | 10 | 3722 + 13292 | dutch | 11 | 10 | 3722 + 13294 | english | 11 | 10 | 3722 + 13296 | finnish | 11 | 10 | 3722 + 13298 | french | 11 | 10 | 3722 + test=# select to_tsvector(extracttext(doc_blob)) from ts_test; to_tsvector --------------------------------------------------------------------------------------------------------------------------------------------------------------- ‘10‘:9,14,19,24,29,34,39 ‘11‘:8,13,18,23,28,33,38 ‘13288‘:11 ‘13290‘:16 ‘13292‘:21 ‘13294‘:26 ‘13296‘:31 ‘13298‘:36 ‘3722‘:10,15,20,25,30,35,40 ‘3748‘:6 ‘arab ic‘:12 ‘cfgname‘:2 ‘cfgnamespace‘:3 ‘cfgowner‘:4 ‘cfgparser‘:5 ‘danish‘:17 ‘dutch‘:22 ‘english‘:27 ‘finnish‘:32 ‘french‘:37 ‘oid‘:1 ‘simple‘:7 (1 row)
可以看到, extracttext 的作用是将流版式的数据抽取成文本格式,然后再通过to_tsvector 进行分词。
原文:https://www.cnblogs.com/kingbase/p/15101739.html