一、什么时候可以避免执行MapReduce?select *where语句中只有分区字段
二、Jion优化驱动表最右边查询表表的大小从左边到右边依次增大标志机制显示的告知查询优化器哪张表示大表/*+streamtable(table_name)*/
三、Map-side聚合sethive.map.aggr=true;这个设置可以将顶层的聚合操作放在Map阶段执行,从而减轻清洗 阶段数据传输和Reduce阶段的执行时间,提升总体性能。缺点:该设置会消耗更多的内存。执行select count(1) from wlan;
四、Localhadoop本地模式SETmapred.job.tracker=local; 测试 select 1 from wlan limit 5;
下面两个参数是local mr中常用的控制参数:
1,hive.exec.mode.local.auto.inputbytes.max设置local mr的最大输入数据量,当输入数据量小于这个值的时候会 采用local mr的方式
2,hive.exec.mode.local.auto.tasks.max设置local mr的最大输入文件个数,当输入文件个数小于这个值的时 候会采用local mr的方式默认执行模式:
hive (default)> select count(1) t1;
Query ID = root_20150611185656_333185b7-e8b3-40b5-bc4c-2f11978f9822
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
sethive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
sethive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1433931422330_0001,Tracking URL = http://crxy176:8088/proxy/application_1433931422330_0001/
Kill Command = /usr/local/hadoop-2.6.0/bin/hadoop job -kill job_1433931422330_0001
Hadoop job information for Stage-1: number of mappers: 1; number ofreducers: 1
2015-06-11 18:56:44,749 Stage-1 map = 0%, reduce = 0%
2015-06-11 18:56:57,029 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec
2015-06-11 18:57:11,050 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.2 sec
MapReduce Total cumulative CPU time: 4 seconds 200 msec
Ended Job = job_1433931422330_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce:1 Cumulative CPU: 4.2 sec HDFS Read: 312 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 200 msec
OK
t1
100
Time taken: 46.573 seconds, Fetched: 1 row(s)
对比启动Localhadoop模式:
hive (default)> select count(1) t1;
Automatically selecting local only mode for query
Query ID = root_20150611185555_97e1a1d0-1958-4f35-8ea7-8face4cda85f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
sethive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
sethive.exec.reducers.max=<number>
In order to set a constant number of reducers:
setmapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-1: number of mappers: 0; number ofreducers: 0
2015-06-11 18:55:25,123 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1510342541_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 12HDFS Write: 22 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
t1
100
Time taken: 1.721 seconds, Fetched: 1 row(s)
五、索引Hive中的索引架构开放了一个接口,允许你根据这个接口去实现自己的索引。目前Hive自己有一个参考的索引实现(CompactIndex),后来在0.8版本中又加入位图索引。这里就讲讲CompactIndex。
/*在index_test_table表的id字段上创建索引*/
create index idx on table index_test_table(id)
as ‘org.apache.Hadoop.Hive.ql.index.compact.CompactIndexHandler‘with deferred rebuild;
alter index idx on index_test_table rebuild;
/*索引的剪裁。找到上面建的索引表,根据你最终要用的查询条件剪裁一下。*/
/*如果你想跟RDBMS一样建完索引就用,那是不行的,会直接报错,这也是其麻烦的地方*/
create table my_index
as select _bucketname, `_offsets`
from default__index_test_table_idx__ where id = 10;
/*现在可以用索引了,注意最终查询条件跟上面的剪裁条件一致*/
set Hive.index.compact.file = /user/Hive/warehouse/my_index;
set Hive.input.format =org.apache.Hadoop.Hive.ql.index.compact.HiveCompactIndexInputFormat;
select count(*) from index_test_table where id = 10;
六、数据倾斜所谓数据倾斜,说的是由于数据分布不均匀,个别值集中占据大部分数据量,加上Hadoop的计算模式,导致计算资源不均匀引起性能下降。
倾斜分成group by造成的倾斜和join造成的倾斜:
一个是Hive.Map.aggr,默认值已经为true,意思是会做Map端的combiner。所以如果你的group by查询只是做count(*)的话,其实是看不出倾斜效果的,但是如果你做的是count(distinct),那么还是会看出一点倾斜效果。
另一个参数是Hive.groupby. skewindata。这个参数的意思是做Reduce操作的时候,拿到的key并不是所有相同值给同一个Reduce,而是随机分发,然后Reduce做聚合,做完之后再做一轮MR,拿前面聚合过的数据再算结果。所以这个参数其实跟Hive.Map.aggr做的是类似的事情,只是拿到Reduce端来做,而且要额外启动一轮Job,所以其实不怎么推荐用,效果不明显。
改写SQL来优化
/*改写前*/
select a, count(distinctb) as c from tbl group by a;
/*改写后*/
select a, count(*) as c
from (selectdistinct a, b from tbl) group by a;
七、Job间并行
首先,在Hive生成的多个Job中,在有些情况下Job之间是可以并行的,典型的就是子查询。当需要执行多个子查询unionall或者join操作的时候,Job间并行就可以使用了。比如下面的代码就是一个可以并行的场景示意:
hive>FROM t4
INSERT OVERWRITE TABLE t3PARTITION (...) SELECT ...WHERE...
INSERT OVERWRITE TABLE t3PARTITION (...) SELECT ...WHERE...
INSERT OVERWRITE TABLE t3PARTITION (...) SELECT ...WHERE...
更多精彩内容请关注:http://bbs.superwu.cn
关注超人学院微信二维码:
Hive优化
原文:http://www.cnblogs.com/CRXY/p/4572150.html