如果把解决方案的范围限定在开源关系数据库,答案可能只有一个,就是PostgreSQL的gin索引。
PostgreSQL的gin索引就是倒排索引,它不仅被用于全文检索还可以用在常规的数据类型上,比如int,varchar。
对于多维查询我们可以这样建索引:
1. 对所有等值条件涉及的
低基数字段,建立唯一一个多字段gin索引
2. 对选择性比较好的等值查询或范围查询
涉及的字段,另外建btree索引
可能有同学会有疑问,同样是多字段索引,为什么gin的多字段索引只要建一个就可以了,而btree的多字段索引却要考虑各种查询组合建若干个。这是由于gin多字段索引中的每个字段是等价的,不存在前导字段的说法,所以只要建一个唯一的gin多字段索引就可以覆盖所有的查询组合;而btree多字段索引则不同,如果查询条件中不包含suoyi前导字段,是无法利用索引的。
多字段gin索引的内部存储的每个键是(column number,key datum)这样的形式,所以可以区分不同的字段而不致混淆。存储的值是匹配key的所有记录的ctid集合。这个集合在记录数比较多的情况下采用btree的形式存储,并且经过了压缩,所以gin索引占用的存储空间很小,大约只有等价的btree索引的二十分之一,这也从另一方面提升了性能。
对于多维查询涉及的多个字段,包含在多字段gin索引中的字段,由gin索引做ctid的集合归并(取并集或交集),然后得到的ctid集合和其它索引得到的ctid集合再做BitmapAnd或BitmapOr归并。gin索引内部的ctid集合归并效率远高于索引间的ctid集合归并,而且gin索引对低基数字段的优化更好,所以充分利用gin索引的特性比为每个字段单独建一个btree索引再通过
BitmapAnd或BitmapOr归并结果集效率高的多。
4. 一个真实的案例
4.1 原始查询
下面这个SQL是某系统中一个真实SQL的简化版。
-
SELECT CASE WHEN gpppur.GB_BEGINDATE <= ‘2016-02-29 14:36:00‘ AND gpppur.GB_ENDDATE > ‘2016-02-29 14:36:00‘ THEN 1
-
WHEN gpppur.PREVIEW_BEGINDT <= ‘2016-02-29 14:36:00‘ AND gpppur.PREVIEW_ENDDT > ‘2016-02-29 14:36:00‘ THEN 2
-
ELSE 3 END AS flag,
-
gpppur.*
-
FROM T_MPS_INFO gpppur
-
WHERE gpppur.ATTRACT_TP = 0
-
AND gpppur.COLUMN_ID = 1
-
AND gpppur.FIELD2 = 1
-
AND gpppur.STATUS = 1
-
ORDER BY flag ASC,gpppur.PC_SORT_NUM ASC,gpppur.GB_BEGINDATE DESC
-
LIMIT 0,45
用的是MySQL数据库,总数据量是60w,其中建有
FIELD2+STATUS的多字段索引。
查询条件涉及的4个字段的值分布情况如下:
由于这几个字段的值分布极其不均的,我们构造下面这个lua脚本产生不同的select语句来模拟负载。
qx.lua:
-
pathtest = string.match(test, "(.*/)") or ""
-
-
dofile(pathtest .. "common.lua")
-
-
function thread_init(thread_id)
-
set_vars()
-
end
-
-
function event(thread_id)
-
local ATTRACT_TP,COLUMN_ID,FIELD2,STATUS
-
ATTRACT_TP = sb_rand_uniform(0, 10)
-
COLUMN_ID = sb_rand_uniform(1, 100)
-
FIELD2 = sb_rand_uniform(0, 10)
-
STATUS = sb_rand_uniform(0, 4)
-
-
rs = db_query("SELECT CASE WHEN gpppur.GB_BEGINDATE <= ‘2016-02-29 14:36:00‘ AND gpppur.GB_ENDDATE > ‘2016-02-29 14:36:00‘ THEN 1
-
WHEN gpppur.PREVIEW_BEGINDT <= ‘2016-02-29 14:36:00‘ AND gpppur.PREVIEW_ENDDT > ‘2016-02-29 14:36:00‘ THEN 2
-
ELSE 3 END AS flag,
-
gpppur.*
-
FROM T_MPS_INFO gpppur
-
WHERE gpppur.ATTRACT_TP = "..ATTRACT_TP.."
-
AND gpppur.COLUMN_ID = "..COLUMN_ID.."
-
AND gpppur.FIELD2 = "..FIELD2.."
-
AND gpppur.STATUS = "..STATUS.."
-
ORDER BY flag ASC,gpppur.PC_SORT_NUM ASC,gpppur.GB_BEGINDATE DESC
-
LIMIT 45")
-
end
然后用sysbench进行压测,结果在32并发时测得的qps是64。
-
[root@rh6375Gt20150507 ~]# sysbench --db-driver=mysql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --mysql-db=test --mysql-user=mysql --mysql-password=mysql --mysql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
-
sysbench 0.5: multi-threaded system evaluation benchmark
-
-
Running the test with following options:
-
Number of threads: 32
-
Random number generator seed is 0 and will be ignored
-
-
-
Threads started!
-
-
OLTP test statistics:
-
queries performed:
-
read: 825
-
write: 0
-
other: 0
-
total: 825
-
transactions: 0 (0.00 per sec.)
-
read/write requests: 825 (64.20 per sec.)
-
other operations: 0 (0.00 per sec.)
-
ignored errors: 0 (0.00 per sec.)
-
reconnects: 0 (0.00 per sec.)
-
-
General statistics:
-
total time: 12.8496s
-
total number of events: 825
-
total time taken by event execution: 399.6003s
-
response time:
-
min: 1.01ms
-
avg: 484.36ms
-
max: 12602.74ms
-
approx. 95 percentile: 222.79ms
-
-
Threads fairness:
-
events (avg/stddev): 25.7812/24.12
-
execution time (avg/stddev): 12.4875/0.23
4.2 优化后的查询
对于上面那个特定的SQL虽然我们可以通过建一个包含所有等值查询条件中4个字段(ATTRACT_TP,COLUMN_ID,FIELD2,STATUS)的组合索引进行优化,但是需要说明的是,这条SQL只是各种查询组合产生的1000多种不同SQL中的一个,每个SQL涉及的查询字段的组合是不一样的,我们不可能为每种组合都单独建一个多字段索引。
所以我们想到了
PostgreSQL的gin索引。为了使用PostgreSQL的gin索引,先把MySQL的表定义,索引和数据原封不动的迁移到PostgreSQL。
在添加gin索引前,先做了一个测试。另人惊讶的是,还没有开始进行优化,PostgreSQL测出的性能已经是MySQL的5倍(335/64=5)了。
-
[root@rh6375Gt20150507 ~]# sysbench --db-driver=pgsql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --pgsql-db=postgres --pgsql-user=postgres --pgsql-password=postgres --pgsql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
-
sysbench 0.5: multi-threaded system evaluation benchmark
-
-
Running the test with following options:
-
Number of threads: 32
-
Random number generator seed is 0 and will be ignored
-
-
-
Threads
-
-
OLTP test statistics:
-
queries performed:
-
read: 1948
-
write: 0
-
other: 0
-
total: 1948
-
transactions: 0 (0.00 per sec.)
-
read/write requests: 1948 (335.52 per sec.)
-
other operations: 0 (0.00 per sec.)
-
ignored errors: 0 (0.00 per sec.)
-
reconnects: 0 (0.00 per sec.)
-
-
General statistics:
-
total time: 5.8059s
-
total number of events: 1948
-
total time taken by event execution: 172.0538s
-
response time:
-
min: 0.90ms
-
avg: 88.32ms
-
max: 2885.69ms
-
approx. 95 percentile: 80.01ms
-
-
Threads fairness:
-
events (avg/stddev): 60.8750/27.85
-
execution time (avg/stddev): 5.3767/0.29
下一步,添加gin索引。
-
postgres=# create extension btree_gin;
-
CREATE EXTENSION
-
postgres=# create index idx3 on t_mps_info using gin(attract_tp, column_id, field2, status);
-
CREATE INDEX
再进行压测,测出的qps是5412,是MySQL的85倍(5412/64=85)。
-
[root@rh6375Gt20150507 ~]# sysbench --db-driver=pgsql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --pgsql-db=postgres --pgsql-user=postgres --pgsql-password=postgres --pgsql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
-
sysbench 0.5: multi-threaded system evaluation benchmark
-
-
Running the test with following options:
-
Number of threads: 32
-
Random number generator seed is 0 and will be ignored
-
-
-
Threads
-
-
OLTP test statistics:
-
queries performed:
-
read: 10000
-
write: 0
-
other: 0
-
total: 10000
-
transactions: 0 (0.00 per sec.)
-
read/write requests: 10000 (5412.80 per sec.)
-
other operations: 0 (0.00 per sec.)
-
ignored errors: 0 (0.00 per sec.)
-
reconnects: 0 (0.00 per sec.)
-
-
General statistics:
-
total time: 1.8475s
-
total number of events: 10000
-
total time taken by event execution: 58.2706s
-
response time:
-
min: 0.95ms
-
avg: 5.83ms
-
max: 68.36ms
-
approx. 95 percentile: 9.42ms
-
-
Threads fairness:
-
events (avg/stddev): 312.5000/47.80
-
execution time (avg/stddev): 1.8210/0.02
4.3 补充
作为对比,我们又在MySQL上添加了包含attract_tp, column_id, field2和status这4个字段的多字段索引,测出的qps是4000多,仍然不如PostgreSQL。可见业界广为流传的MySQL的简单查询性能优于PostgreSQL的说法不可信!(对于复杂查询PostgreSQL
的性能大大优于MySQL应该是大家的共识。我例子中的SQL不能算是复杂查询吧?)
5. 总结
gin索引(还包括类似的gist,spgist索引)是PostgreSQL的一大特色,
基于它可以挖掘出很多好玩的用法。对于本文提到的场景,有兴趣的同学可以把它和Oracle的bitmap索引以及基于搜索引擎(Elasticsearch,Solr等)的方案做个对比。另外,本人所知有限,如果有
其它更好的方案,希望能让我知道。