OUTER JOIN Gotcha
Before we discuss the other outer joins, let’s discuss a
gotcha you should understand.
JOIN Statements | 101Recall what we said
previously about speeding up queries by adding partition filters in
the WHERE
clause. To speed up our previous query, we might choose to add
predicates
that select on the exchange in both tables:
hive> SELECT
s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN
dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol =
‘AAPL‘
> AND s.exchange = ‘NASDAQ‘ AND d.exchange =
‘NASDAQ‘;
1987-05-11
AAPL
77.0
0.015
1987-08-10
AAPL
48.25
0.015
1987-11-17
AAPL
35.0
0.02
1988-02-12
AAPL
41.0
0.02
1988-05-16
AAPL
41.25
0.02
...
However,
the output has changed, even though we thought we were just adding
an
optimization! We’re back to having approximately four stock records per
year and we
have non- NULL entries for all the dividend values. In other
words, we are back to the
original inner join!
This is actually common
behavior for all outer joins in most SQL implementations. It
occurs because
the JOIN clause is evaluated first, then the results are passed through
the
WHERE clause. By the time the WHERE clause is reached, d.exchange is NULL most
of the
time, so the “optimization” actually filters out all records except
those on the day of
dividend payments.
One solution is straightforward;
remove the clauses in the WHERE clause that reference
the dividends
table:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
>
FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol =
d.symbol
> WHERE s.symbol = ‘AAPL‘ AND s.exchange =
‘NASDAQ‘;
...
1987-05-07
AAPL
80.25
NULL
1987-05-08
AAPL
79.0
NULL
1987-05-11
AAPL
77.0
0.015
1987-05-12
AAPL
75.5
NULL
1987-05-13
AAPL
78.5
NULL
...
This
isn’t very satisfactory. You might wonder if you can move the predicates from
the
WHERE clause into the ON clause, at least the partition filters. This
does not work for outer
joins, despite documentation on the Hive Wiki that
claims it should work
(https://cwiki
.apache.org/confluence/display/Hive/LanguageManual+Joins).
hive>
SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT
OUTER JOIN dividends d
> ON s.ymd = d.ymd AND s.symbol = d.symbol
>
AND s.symbol = ‘AAPL‘ AND s.exchange = ‘NASDAQ‘ AND d.exchange =
‘NASDAQ‘;
...
1962-01-02
GE
74.75
NULL
1962-01-02
IBM
572.0
NULL
1962-01-03
GE
74.0
NULL
1962-01-03
IBM
577.0
NULL
102
| Chapter 6: HiveQL:
Queries1962-01-04
1962-01-04
1962-01-05
1962-01-05
...
GE
IBM
GE
IBM
73.12
NULL
571.25
NULL
71.25
NULL
560.0
NULL
The partition filters are ignored for
OUTER JOINTS . However, using such filter predicates
in ON clauses for inner
joins does work!
Fortunately, there is solution that works for all joins; use
nested SELECT statements:
hive> SELECT s.ymd, s.symbol, s.price_close,
d.dividend FROM
> (SELECT * FROM stocks WHERE symbol = ‘AAPL‘ AND exchange
= ‘NASDAQ‘) s
> LEFT OUTER JOIN
> (SELECT * FROM dividends WHERE
symbol = ‘AAPL‘ AND exchange = ‘NASDAQ‘) d
> ON s.ymd =
d.ymd;
...
1988-02-10
AAPL
41.0
NULL
1988-02-11
AAPL
40.63
NULL
1988-02-12
AAPL
41.0
0.02
1988-02-16
AAPL
41.25
NULL
1988-02-17
AAPL
41.88
NULL
...
The
nested SELECT statement performs the required “push down” to apply the
partition
filters before data is joined.
WHERE clauses are evaluated after
joins are performed, so WHERE clauses
should use predicates that only filter
on column values that won’t be
NULL . Also, contrary to Hive documentation,
partition filters don’t work
in ON clauses for OUTER JOINS , although they do
work for INNER JOINS !
---------------------------------------------------------
LEFT SEMI-JOIN
A left semi-join returns records from the lefthand table if
records are found in the right-
hand table that satisfy the ON predicates.
It’s a special, optimized case of the more general
inner join. Most SQL
dialects support an IN ... EXISTS construct to do the same thing.
For
instance, the following query in Example 6-2 attempts to return stock records
only
on the days of dividend payments, but it doesn’t work in
Hive.
Example 6-2. Query that will not work in Hive
SELECT s.ymd,
s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT
d.ymd, d.symbol FROM dividends d);
Instead, you use the following LEFT SEMI
JOIN syntax:
hive> SELECT s.ymd, s.symbol, s.price_close
> FROM
stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol =
d.symbol;
...
1962-11-05
IBM
361.5
1962-08-07
IBM
373.25
1962-05-08
IBM
459.5
1962-02-06
IBM
551.5
Note
that the SELECT and WHERE clauses can’t reference columns from the
righthand
table
----------------------
The hint still works, but it’s now
deprecated as of Hive v0.7. However, you still have
to
set a property, hive.auto.convert.join , to true before Hive will attempt the
opti-
mization. It’s false by
default:
hive> set
hive.auto.convert.join=true;
hive> SELECT s.ymd, s.symbol, s.price_close,
d.dividend
> FROM stocks s
JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
>
WHERE s.symbol = ‘AAPL‘;
Note
that you can also configure the threshold size for table files considered
small
enough to use this
optimization. Here is the default definition of the property (in
bytes):
hive.mapjoin.smalltable.filesize=25000000
原文:http://www.cnblogs.com/briller/p/3540297.html