HIVE outer join TIPS

时间：2014-02-08 16:50:23 阅读：514 评论：0 收藏：0 [点我收藏+]

OUTER JOIN Gotcha
Before we discuss the other outer joins, let’s discuss a gotcha you should understand.
JOIN Statements | 101Recall what we said previously about speeding up queries by adding partition filters in
the WHERE clause. To speed up our previous query, we might choose to add predicates
that select on the exchange in both tables:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = ‘AAPL‘
> AND s.exchange = ‘NASDAQ‘ AND d.exchange = ‘NASDAQ‘;
1987-05-11
AAPL
77.0
0.015
1987-08-10
AAPL
48.25
0.015
1987-11-17
AAPL
35.0
0.02
1988-02-12
AAPL
41.0
0.02
1988-05-16
AAPL
41.25
0.02
...
However, the output has changed, even though we thought we were just adding an
optimization! We’re back to having approximately four stock records per year and we
have non- NULL entries for all the dividend values. In other words, we are back to the
original inner join!
This is actually common behavior for all outer joins in most SQL implementations. It
occurs because the JOIN clause is evaluated first, then the results are passed through
the WHERE clause. By the time the WHERE clause is reached, d.exchange is NULL most of the
time, so the “optimization” actually filters out all records except those on the day of
dividend payments.
One solution is straightforward; remove the clauses in the WHERE clause that reference
the dividends table:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = ‘AAPL‘ AND s.exchange = ‘NASDAQ‘;
...
1987-05-07
AAPL
80.25
NULL
1987-05-08
AAPL
79.0
NULL
1987-05-11
AAPL
77.0
0.015
1987-05-12
AAPL
75.5
NULL
1987-05-13
AAPL
78.5
NULL
...
This isn’t very satisfactory. You might wonder if you can move the predicates from the
WHERE clause into the ON clause, at least the partition filters. This does not work for outer
joins, despite documentation on the Hive Wiki that claims it should work (https://cwiki
.apache.org/confluence/display/Hive/LanguageManual+Joins).
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d
> ON s.ymd = d.ymd AND s.symbol = d.symbol
> AND s.symbol = ‘AAPL‘ AND s.exchange = ‘NASDAQ‘ AND d.exchange = ‘NASDAQ‘;
...
1962-01-02
GE
74.75
NULL
1962-01-02
IBM
572.0
NULL
1962-01-03
GE
74.0
NULL
1962-01-03
IBM
577.0
NULL
102 | Chapter 6: HiveQL: Queries1962-01-04
1962-01-04
1962-01-05
1962-01-05
...
GE
IBM
GE
IBM
73.12
NULL
571.25 NULL
71.25
NULL
560.0
NULL
The partition filters are ignored for OUTER JOINTS . However, using such filter predicates
in ON clauses for inner joins does work!
Fortunately, there is solution that works for all joins; use nested SELECT statements:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM
> (SELECT * FROM stocks WHERE symbol = ‘AAPL‘ AND exchange = ‘NASDAQ‘) s
> LEFT OUTER JOIN
> (SELECT * FROM dividends WHERE symbol = ‘AAPL‘ AND exchange = ‘NASDAQ‘) d
> ON s.ymd = d.ymd;
...
1988-02-10
AAPL
41.0
NULL
1988-02-11
AAPL
40.63
NULL
1988-02-12
AAPL
41.0
0.02
1988-02-16
AAPL
41.25
NULL
1988-02-17
AAPL
41.88
NULL
...
The nested SELECT statement performs the required “push down” to apply the partition
filters before data is joined.
WHERE clauses are evaluated after joins are performed, so WHERE clauses
should use predicates that only filter on column values that won’t be
NULL . Also, contrary to Hive documentation, partition filters don’t work
in ON clauses for OUTER JOINS , although they do work for INNER JOINS !

---------------------------------------------------------

LEFT SEMI-JOIN
A left semi-join returns records from the lefthand table if records are found in the right-
hand table that satisfy the ON predicates. It’s a special, optimized case of the more general
inner join. Most SQL dialects support an IN ... EXISTS construct to do the same thing.
For instance, the following query in Example 6-2 attempts to return stock records only
on the days of dividend payments, but it doesn’t work in Hive.
Example 6-2. Query that will not work in Hive
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);
Instead, you use the following LEFT SEMI JOIN syntax:
hive> SELECT s.ymd, s.symbol, s.price_close
> FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;
...
1962-11-05
IBM
361.5
1962-08-07
IBM
373.25
1962-05-08
IBM
459.5
1962-02-06
IBM
551.5
Note that the SELECT and WHERE clauses can’t reference columns from the righthand
table

----------------------

The hint still works, but it’s now deprecated as of Hive v0.7. However, you still have
to set a property, hive.auto.convert.join , to true before Hive will attempt the opti-
mization. It’s false by default:
hive> set hive.auto.convert.join=true;
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = ‘AAPL‘;
Note that you can also configure the threshold size for table files considered small
enough to use this optimization. Here is the default definition of the property (in bytes):
hive.mapjoin.smalltable.filesize=25000000

HIVE outer join TIPS

原文：http://www.cnblogs.com/briller/p/3540297.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年09月23日 (328)
2021年09月24日 (313)
2021年09月17日 (191)
2021年09月15日 (369)
2021年09月16日 (411)
2021年09月13日 (439)
2021年09月11日 (398)
2021年09月12日 (393)
2021年09月10日 (160)
2021年09月08日 (222)