首页 > 其他 > 详细

ClickHouse 支持的join类型说明

时间:2020-04-16 10:28:34      阅读:857      评论:0      收藏:0      [点我收藏+]

ClickHouse 支持的join类型说明

按照代码Join.h的说明,ClickHouse支持14种Join,如下所示:

  * JOIN-s could be of these types:
  * - ALL × LEFT/INNER/RIGHT/FULL
  * - ANY × LEFT/INNER/RIGHT
  * - SEMI/ANTI x LEFT/RIGHT
  * - ASOF x LEFT/INNER
  * - CROSS

All和Any的区别如官网文档所示:

ANYALL

在使用ALL修饰符对JOIN进行修饰时,如果右表中存在多个与左表关联的数据,那么系统则将右表中所有可以与左表关联的数据全部返回在结果中。这与SQL标准的JOIN行为相同。
在使用ANY修饰符对JOIN进行修饰时,如果右表中存在多个与左表关联的数据,那么系统仅返回第一个与左表匹配的结果。如果左表与右表一一对应,不存在多余的行时,ANYALL的结果相同。

以INNER JOIN为例说明ANY和ALL的区别,先准备数据:

1、创建join_test库

create database join_test engine=Ordinary;

2、创建left_t1和right_t1表

create table left_t1(a UInt16,b UInt16,create_date date)Engine=MergeTree(create_date,a,8192);

create table right_t1(a UInt16,b UInt16,create_date date)Engine=MergeTree(create_date,a,8192);

3、插入数据

insert into left_t1 values(1,11,2020-3-20);

insert into left_t1 values(2,22,2020-3-20);

insert into left_t1 values(3,22,2020-3-20);

insert into right_t1 values(1,111,2020-3-20);

insert into right_t1 values(2,222,2020-3-20);

insert into right_t1 values(2,2222,2020-3-20);

4、查看分别增加ANY和ALL对INNER JOIN输出结果的影响

ALL INNER JOIN

select * from left_t1 all inner join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ALL INNER JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │       2222 │           1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

3 rows in set. Elapsed: 0.019 sec.

右表right_t1存在两条与左表left_t1匹配的结果,两条全部返回。

ANY INNER JOIN

select * from left_t1 any inner join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ANY INNER JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │       2222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

2 rows in set. Elapsed: 0.023 sec.

右表right_t1存在两条与左表left_t1匹配的结果,但是只返回一条。

INNER JOIN

内连接,将left_t1表和right_t1表所有满足left_t1.a=right_t1.a条件的记录进行连接,如下图所示:

select * from left_t1 all inner join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ALL INNER JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │       2222 │           1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

3 rows in set. Elapsed: 0.134 sec.

LEFT JOIN

左连接,在内连接的基础上,对于那些在right_t1表中找不到匹配记录的left_t1表中的记录,用空值或0进行连接,如下图所示:

select * from left_t1 all left join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ALL LEFT JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 3 │ 22 │  1975-06-21 │          0 │          0 │           0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │       2222 │           1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

4 rows in set. Elapsed: 0.013 sec.

RIGHT JOIN

右连接,在内连接的基础上,对于那些在left_t1表中找不到匹配记录的right_t1表中的记录,用空值或0进行连接,如下图所示:

select * from left_t1 all right join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ALL RIGHT JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │       2222 │           1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

3 rows in set. Elapsed: 0.021 sec.

FULL JOIN

全连接,在内连接的基础上,对于那些在left_t1表中找不到匹配记录的right_t1表中的记录和在right_t1表中找不到匹配记录的left_t1表中的记录,都用空值或0进行连接,如下图所示:

select * from left_t1 all full join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ALL FULL OUTER JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │       2222 │           1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 3 │ 22 │  1975-06-21 │          0 │          0 │           0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

4 rows in set. Elapsed: 0.046 sec.

SEMI LEFT JOIN  和  SEMI RIGHT JOIN    ANTI LEFT JOIN  和  ANTI RIGHT JOIN         Join.h中的解释如下:

  * SEMI JOIN filter left table by keys that are present in right table for LEFT JOIN, and filter right table by keys from left table
  * for RIGHT JOIN. In other words SEMI JOIN returns only rows which joining keys present in another table.
  * ANTI JOIN is the same as SEMI JOIN but returns rows with joining keys that are NOT present in another table.
  * SEMI/ANTI JOINs allow to get values from both tables. For filter table it gets any row with joining same key. For ANTI JOIN it returns
  * defaults other table columns.

意思是:使用SEMI LEFT JOIN时,使用右表中存在的key去过滤左表中的key,如果左表存在与右表相同的key,则输出。

            使用SEMI RIGHT JOIN时,使用左表中存在的key去过滤右表中的key,如果右表中存在与左表相同的key,则输出。

            换句话说,SEMI JOIN返回key在另外一个表中存在的记录行。

           ANTI JOIN和SEMI JOIN相反,他返回的是key在另外一个表中不存在的记录行。

           SEMI JOIN和ANTI JOIN都允许从两个表中获取数据。对于被过滤的表,返回的是与key相同的记录行。对于ANTI JOIN,另外一个表返回的是默认值,比如空值或0。

 select * from left_t1 semi left join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
SEMI LEFT JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │       2222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

2 rows in set. Elapsed: 0.052 sec.

 

select * from left_t1 semi right join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
SEMI RIGHT JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │           1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │           1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │       2222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

3 rows in set. Elapsed: 1.327 sec.

 

select * from left_t1 anti left join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ANTI LEFT JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 3 │ 22 │  1975-06-21 │          3 │          0 │           0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

1 rows in set. Elapsed: 0.061 sec.

 

select * from left_t1 anti right join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ANTI RIGHT JOIN right_t1 ON left_t1.a = right_t1.a

Ok.

0 rows in set. Elapsed: 0.024 sec.

ASOF LEFT  和  ASOF INNER 没有具体的语法,本来想通过查看执行计划来看看,但是采用下述方式后,没看到选择什么方式,暂时不知道怎么能走到这两个类型的处理方式上来。

clickhouse-client --send_logs_level=trace <<< ‘select * from join_test.left_t1,join_test.right_t1 where join_test.left_t1.a<>1 and join_test.right_t1.a<>1‘ > /dev/null


下面为Join.h中的说明:

  * ASOF JOIN is not-equi join. For one key column it finds nearest value to join according to join inequality.
  * It‘s expected that ANY|SEMI LEFT JOIN is more efficient that ALL one.
  *
  * If INNER is specified - leave only rows that have matching rows from "right" table.
  * If LEFT is specified - in case when there is no matching row in "right" table, fill it with default values instead.
  * If RIGHT is specified - first process as INNER, but track what rows from the right table was joined,
  *  and at the end, add rows from right table that was not joined and substitute default values for columns of left table.
  * If FULL is specified - first process as LEFT, but track what rows from the right table was joined,
  *  and at the end, add rows from right table that was not joined and substitute default values for columns of left table.
  *
  * Thus, LEFT and RIGHT JOINs are not symmetric in terms of implementation.
  *
  * All JOINs (except CROSS) are done by equality condition on keys (equijoin).
  * Non-equality and other conditions are not supported.

仅支持等值条件的Join,不支持非等值和其他条件的Join。
  *
  * Implementation:实现机制如下:
  *
  * 1. Build hash table in memory from "right" table.
  * This hash table is in form of keys -> row in case of ANY or keys -> [rows...] in case of ALL.
  * This is done in insertFromBlock method.
  *一般将小表作为右表,根据右表在内存中构建hash表。这部分实现在insertFromBlock中完成。
  * 2. Process "left" table and join corresponding rows from "right" table by lookups in the map.
  * This is done in joinBlock methods.
  *遍历左表,根据右表在内存中的map来连接对应行,这部分实现在joinBlock中完成。
  * In case of ANY LEFT JOIN - form new columns with found values or default values.
  * This is the most simple. Number of rows in left table does not change.
  *ANY LEFT JOIN左表行数量不变,使用匹配的值或默认值填充新列。
  * In case of ANY INNER JOIN - form new columns with found values,
  *  and also build a filter - in what rows nothing was found.
  * Then filter columns of "left" table.
  *ANY INNER JOIN 用满足条件的值构建新列,用不满足条件的行构建filter,然后用filter过滤左表。
  * In case of ALL ... JOIN - form new columns with all found rows,
  *  and also fill ‘offsets‘ array, describing how many times we need to replicate values of "left" table.
  * Then replicate columns of "left" table.
  *ALL...JOIN 将所有找到的行合并为新列,并填充offsets数组,并描述需要把左表的值复制多少次,然后复制左表的列。
  * How Nullable keys are processed:
  *如何处理NULL值:
  * NULLs never join to anything, even to each other.

NULL永远不会和任何值做JOIN,即使是NULL之间。
  * During building of map, we just skip keys with NULL value of any component.

构建Hash表的过程中,跳过任何NULL值。
  * During joining, we simply treat rows with any NULLs in key as non joined.
  *Join期间,将NULL值行视为未JOIN
  * Default values for outer joins (LEFT, RIGHT, FULL):
  *外部连接的默认值
  * Behaviour is controlled by ‘join_use_nulls‘ settings.

行为由join_use_nulls参数控制。
  * If it is false, we substitute (global) default value for the data type, for non-joined rows
  *  (zero, empty string, etc. and NULL for Nullable data types).
  * If it is true, we always generate Nullable column and substitute NULLs for non-joined rows,
  *  as in standard SQL.

分两种情况:当join_use_nulls参数为false时,用默认值替代未连接的行;当join_use_nulls为true时,用NULL替代未连接的行。

ANTI RIGHT JOIN

ClickHouse 支持的join类型说明

原文:https://www.cnblogs.com/snake-fly/p/12599924.html

(0)
(0)
   
举报
评论 一句话评论(0
关于我们 - 联系我们 - 留言反馈 - 联系我们:wmxa8@hotmail.com
© 2014 bubuko.com 版权所有
打开技术之扣,分享程序人生!