I. The Problem
In Sphinx applications we often need to filter data with complex conditions, the equivalent of the WHERE clause in a MySQL query.
Sphinx's built-in filters, however, do not support complex logical combinations: individual filter conditions can only be ANDed together, and even OR is unsupported.
Our previous workaround was to flatten the logic at index-build time: when pulling data out of the source database to build the index, we used the MySQL SQL statement itself to expand the logic during import.
This approach causes many problems:
1) The Sphinx configuration file becomes complex and inflexible; whenever the business logic changes, the backend Sphinx configuration has to be modified and the index rebuilt;
2) Business presentation cannot be separated from the data, so frontend and backend are tightly coupled;
3) The SQL that fetches data from the source DB becomes complex and hard to maintain and tune, and it lengthens index build time;
4) Sphinx developers and maintainers must understand the business in detail before they can implement anything, which hinders the division of labor;
5) When the logic is too complex to be flattened at import time, this approach cannot implement the requirement at all.
II. The New Solution
The new solution uses the SetSelect() and SetFilter() APIs to perform logical operations on Sphinx attributes.
The examples below illustrate the approach.
1. Prepare the test environment and data
Create the tables and insert data. First, the source table:
CREATE TABLE `t_test_filter` (
  `Fcommodity_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  `FUserId` int(10) unsigned NOT NULL DEFAULT '0',
  `FUserName` varchar(32) NOT NULL DEFAULT '',
  `Fshopid` bigint(20) unsigned NOT NULL DEFAULT '0',
  `Ftitle` varchar(60) NOT NULL DEFAULT '',
  `FProperty` int(10) unsigned NOT NULL DEFAULT '0',
  `FTime` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`Fcommodity_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Create the SphinxSE table:
CREATE TABLE `t_test_filter_sphinxSE` (
  `id` bigint(20) unsigned NOT NULL,
  `weight` int(11) DEFAULT '1',
  `query` varchar(3072) NOT NULL,
  `total_found` int(11) DEFAULT '0',
  `total_match` int(11) DEFAULT '0',
  `query_time` int(11) DEFAULT '0',
  `FUserId` int(10) unsigned DEFAULT '0',
  `Fshopid` bigint(20) unsigned NOT NULL DEFAULT '0',
  `FProperty` int(10) unsigned NOT NULL DEFAULT '0',
  `FTime` int(10) unsigned NOT NULL DEFAULT '0',
  KEY `query` (`query`(1536))
) ENGINE=SPHINX DEFAULT CHARSET=gbk CONNECTION='sphinx://localhost:9412';
Insert the test data:
insert into t_test_filter values(4567,123,"test qq nick", 1234567,"title name",2324354,9876);
insert into t_test_filter values(2,34,"nick name", 4545,"title name",67,40);
insert into t_test_filter values(66,55,"test3 qq3 nick3", 22,"title3 name3 ",88,77);
insert into t_test_filter values(767,234,"test3 qq nick", 3433,"title3 name3 ",342342,89340);
insert into t_test_filter values(4545,123,"test qq nick", 1234567,"title name",2324354,9876);
insert into t_test_filter values(4546,778,"test qq nick", 1234567,"title name",2324354,9876);
2. Usage examples
Here are two examples using the SphinxSE interface to illustrate the method:
(1) The query keyword is nick, with the condition FUserId=34 || Fshopid=1234567.
Test the query through SphinxSE:
select * from t_test_filter_sphinxSE where query="nick;select=IF(FUserId=34,1,0) + IF(Fshopid=1234567,10,0) as match_qq; filter=match_qq,1,10,11";
(2) The query keyword is nick, with the condition FUserId=34 || (Fshopid=1234567 && FUserId=778).
select * from t_test_filter_sphinxSE where query="nick;select=IF(FUserId=34,1,0) + IF(Fshopid=1234567,10,0) + IF(FUserId=778,100,0) as match_qq; filter=match_qq,1,11,110";
To keep the number of accepted filter values from exploding as conditions combine, logical operators can simplify the expression:
select * from t_test_filter_sphinxSE where query="nick;select=IF(FUserId=34,1,0) OR (IF(Fshopid=1234567,1,0) AND IF(FUserId=778,1,0)) as match_qq; filter=match_qq,1";
These two statements return the same results.
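Why they are equivalent: in the first statement, match_qq is a positional encoding, with the three IF() calls contributing 1, 10, and 100 respectively, so each value of match_qq identifies exactly which conditions hold. Since FUserId cannot equal both 34 and 778, the reachable values are 0, 1, 10, 11, 100, and 110, and of those, 1, 11, and 110 are the ones that satisfy FUserId=34 || (Fshopid=1234567 && FUserId=778). In the second statement, the OR/AND operators collapse the whole expression to 0 or 1, so a single filter value suffices. With the test data above, both statements should return document 2 (FUserId=34) and document 4546 (Fshopid=1234567 and FUserId=778).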
Other complex queries can be composed on the same principle.
How it works:
1. The IF() function takes three arguments: it checks whether the first argument is nonzero (it is compared against 0.0), and returns the second argument if so, or the third if the first is zero;
2. The key part of the syntax above is the query string.
(1) The first token, nick, is the keyword to search for.
(2) The select clause is equivalent to the SetSelect() API call. It computes an expression over the attributes declared with sql_attr_xxxx in the configuration file and stores the result under an alias (match_qq here). The expression supports the operators +, -, *, /, <, >, <=, >=, =, and <>, among others.
(3) The filter clause is equivalent to the SetFilter() API call: it keeps only documents whose select result is one of the listed values. Different sets of accepted values express different combinations of conditions, which is what makes very complex filtering possible.
Other interfaces work the same way; for example, the C API and the PHP API can achieve this with the SetSelect() and SetFilter() calls.
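As an illustration, here is a minimal sketch of example (1) through the PHP API, assuming the standard sphinxapi.php client shipped with sphinx/coreseek and the searchd port from the appendix config (untested, per the note in the next section):

<?php
require_once('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('localhost', 9412);

// same expression as the select clause: FUserId=34 contributes 1,
// Fshopid=1234567 contributes 10
$cl->SetSelect('*, IF(FUserId=34,1,0) + IF(Fshopid=1234567,10,0) AS match_qq');

// same as the filter clause: keep match_qq in {1, 10, 11},
// i.e. FUserId=34 OR Fshopid=1234567
$cl->SetFilter('match_qq', array(1, 10, 11));

$result = $cl->Query('nick', 't_test_filter');
if ($result === false)
    echo $cl->GetLastError();
else
    print_r($result);
?>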
3. Advantages and limitations
1. SetSelect() supports fairly complex arithmetic and some logical combinations, so all kinds of logic can be implemented flexibly;
2. Business developers and the Sphinx operations/configuration staff can work independently of each other;
3. Version requirements:
(1) The SphinxSE interface requires coreseek 4.0.1 or sphinx 0.9.9-rc1 or later.
(2) The C API and PHP interfaces should already be supported in coreseek 3.2.14, which provides both SetSelect() and SetFilter(), though I have not tested them myself.
III. Appendix
The Sphinx configuration file is included here for easier analysis and testing:
# file: t_test_filter.conf
source t_test_filter
{
    type                = mysql
    sql_host            = localhost
    sql_user            = xxxx
    sql_pass            = xxx
    sql_db              = test
    sql_port            = 3306
    sql_query           = SELECT Fcommodity_id, FUserId, FUserName, Fshopid, Ftitle, FProperty, FTime FROM t_test_filter
    sql_attr_uint       = FUserId
    sql_attr_bigint     = Fshopid
    sql_attr_uint       = FProperty
    sql_attr_uint       = FTime
}

index t_test_filter
{
    source              = t_test_filter
    path                = /usr/local/coreseek/var/data/t_test_filter
    docinfo             = extern
    mlock               = 0
    morphology          = none
    exceptions          = /usr/local/coreseek/var/exceptions.txt
    min_word_len        = 1
    charset_type        = zh_cn.gbk
    charset_dictpath    = /usr/local/coreseek/dict
    ngram_len           = 1
    ngram_chars         = U+3000..U+2FA1F
    html_strip          = 0
    blend_mode          = trim_none
}
#############################################################################
## indexer settings
#############################################################################

indexer
{
    # memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
    # optional, default is 32M, max is 2047M, recommended is 256M to 1024M
    mem_limit           = 40M

    # maximum IO calls per second (for I/O throttling)
    # optional, default is 0 (unlimited)
    #
    # max_iops          = 40

    # maximum IO call size, bytes (for I/O throttling)
    # optional, default is 0 (unlimited)
    #
    # max_iosize        = 1048576

    # maximum xmlpipe2 field length, bytes
    # optional, default is 2M
    #
    # max_xmlpipe2_field = 4M

    # write buffer size, bytes
    # several (currently up to 4) buffers will be allocated
    # write buffers are allocated in addition to mem_limit
    # optional, default is 1M
    # write_buffer      = 1M
}
#############################################################################
## searchd settings
#############################################################################

searchd
{
    # hostname, port, or hostname:port, or /unix/socket/path to listen on
    # multi-value, multiple listen points are allowed
    # optional, default is 0.0.0.0:9312 (listen on all interfaces, port 9312)
    #
    # listen            = 127.0.0.1
    # listen            = 192.168.0.1:9312
    listen              = 9412
    # listen            = /var/run/searchd.sock
    listen              = localhost:9406:mysql41

    binlog_path         = /usr/local/coreseek/var/data

    # log file, searchd run info is logged here
    # optional, default is 'searchd.log'
    log                 = /usr/local/coreseek/var/log/searchd_9412.log

    # query log file, all search queries are logged here
    # optional, default is empty (do not log queries)
    query_log           = /usr/local/coreseek/var/log/query_9412.log

    # client read timeout, seconds
    # optional, default is 5
    read_timeout        = 5

    # request timeout, seconds
    # optional, default is 5 minutes
    client_timeout      = 300

    # maximum amount of children to fork (concurrent searches to run)
    # optional, default is 0 (unlimited)
    max_children        = 30

    # PID file, searchd process ID file name
    # mandatory
    pid_file            = /usr/local/coreseek/var/log/searchd_9412.pid

    # max amount of matches the daemon ever keeps in RAM, per-index
    # WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
    # default is 1000 (just like Google)
    max_matches         = 1000000

    # seamless rotate, prevents rotate stalls if precaching huge datasets
    # optional, default is 1
    seamless_rotate     = 1

    # whether to forcibly preopen all indexes on startup
    # optional, default is 0 (do not preopen)
    preopen_indexes     = 0

    # whether to unlink .old index copies on successful rotation.
    # optional, default is 1 (do unlink)
    unlink_old          = 1

    # attribute updates periodic flush timeout, seconds
    # updates will be automatically dumped to disk this frequently
    # optional, default is 0 (disable periodic flush)
    #
    # attr_flush_period = 900

    # instance-wide ondisk_dict defaults (per-index value take precedence)
    # optional, default is 0 (precache all dictionaries in RAM)
    #
    # ondisk_dict_default = 1

    # MVA updates pool size
    # shared between all instances of searchd, disables attr flushes!
    # optional, default size is 1M
    mva_updates_pool    = 1M

    # max allowed network packet size
    # limits both query packets from clients, and responses from agents
    # optional, default size is 8M
    max_packet_size     = 8M

    # crash log path
    # searchd will (try to) log crashed query to 'crash_log_path.PID' file
    # optional, default is empty (do not create crash logs)
    #
    # crash_log_path    = /usr/local/coreseek/var/log/crash

    # max allowed per-query filter count
    # optional, default is 256
    max_filters         = 256

    # max allowed per-filter values count
    # optional, default is 4096
    max_filter_values   = 4096

    # socket listen queue length
    # optional, default is 5
    #
    # listen_backlog    = 5

    # per-keyword read buffer size
    # optional, default is 256K
    #
    # read_buffer       = 256K

    # unhinted read size (currently used when reading hits)
    # optional, default is 32K
    #
    # read_unhinted     = 32K
}
# --eof--
Original article (in Chinese): http://www.cnblogs.com/littlehb/p/3631000.html