hive> select * from new_test;
OK
1    20    qiu      20140101
2    43    liu      20140101
3    23    zheng    20140101
4    32    yang     20140101
5    24    qian     20140101
Time taken: 0.106 seconds
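The DDL for new_test is not shown in the post; judging from the select * output above and the datekey filter used in the insert further down, it presumably looks roughly like this (a guess for context, not taken from the original):

-- Hypothetical reconstruction of new_test, inferred from the output above.
create table new_test(
  id int,
  age int,
  name string,
  datekey string
)
row format delimited fields terminated by ',' lines terminated by '\n';

Next, create a bucketed, sorted target table: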
hive> create table student(
    >   id int,
    >   age int,
    >   name string
    > ) partitioned by (dt string)
    > clustered by (id) sorted by (age) into 3 buckets
    > row format delimited fields terminated by ',' lines terminated by '\n';
OK
Time taken: 0.348 seconds
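The post does not show the session settings used for the load, but on Hive versions of this era a bucketed insert usually needs either hive.enforce.bucketing enabled (so Hive allocates one reducer per bucket) or the reducer count set to the bucket count by hand. A hedged sketch of the likely settings:

-- Probable session settings for the bucketed insert below (not shown in the original post).
set hive.enforce.bucketing = true;   -- one reducer per bucket, 3 in this case
-- or, on older Hive, size the reduce stage manually:
-- set mapred.reduce.tasks = 3;

Then populate the bucketed table from new_test: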
from new_test
insert overwrite table student partition(dt='20140102')
select id, age, name where datekey='20140101' sort by age;
Inspect the bucket files on HDFS.
[root@hadoop00 /]# hadoop fs -ls /user/hive/warehous/hbmsdb.db/student/dt=20140102
Found 3 items
-rw-r--r--   3 root supergroup   11 2014-03-04 15:59 /user/hive/warehous/hbmsdb.db/student/dt=20140102/000000_0
-rw-r--r--   3 root supergroup   19 2014-03-04 15:59 /user/hive/warehous/hbmsdb.db/student/dt=20140102/000001_0
-rw-r--r--   3 root supergroup   19 2014-03-04 16:00 /user/hive/warehous/hbmsdb.db/student/dt=20140102/000002_0
[root@hadoop00 /]# hadoop fs -cat /user/hive/warehous/hbmsdb.db/student/dt=20140102/000000_0
3,23,zheng
[root@hadoop00 /]# hadoop fs -cat /user/hive/warehous/hbmsdb.db/student/dt=20140102/000001_0
1,20,qiu
4,32,yang
[root@hadoop00 /]# hadoop fs -cat /user/hive/warehous/hbmsdb.db/student/dt=20140102/000002_0
5,24,qian
2,43,liu
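The way rows landed in the three files is consistent with Hive hashing the bucketing column and taking the result modulo the bucket count; for an int column the hash is, to my understanding, the value itself, so the target bucket is effectively id % 3 (3 → bucket 0; 1, 4 → bucket 1; 2, 5 → bucket 2), and within each bucket the rows are sorted by age. A quick sanity check against new_test (illustrative only; the assumption is that Hive's hash of an int equals the int):

-- Illustration: predict the bucket file for each row, assuming hash(int) == the int,
-- so bucket = pmod(id, 3).
select id, age, name, pmod(id, 3) as expected_bucket
from new_test
where datekey = '20140101';
-- Expected: id 3 -> 000000_0; ids 1 and 4 -> 000001_0; ids 2 and 5 -> 000002_0,
-- matching the file contents shown above.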
Read the corresponding sample of the data.
hive> select * from student tablesample(bucket 2 out of 3 on id);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201403041248_0004, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201403041248_0004
Kill Command = /warehouse/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=http://hadoop00:9001 -kill job_201403041248_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-03-04 16:16:45,583 Stage-1 map = 0%, reduce = 0%
2014-03-04 16:16:51,616 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.23 sec
2014-03-04 16:16:52,623 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.23 sec
2014-03-04 16:16:53,630 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.23 sec
2014-03-04 16:16:54,636 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.23 sec
2014-03-04 16:16:55,643 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.23 sec
2014-03-04 16:16:56,650 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.23 sec
2014-03-04 16:16:57,657 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.23 sec
MapReduce Total cumulative CPU time: 1 seconds 230 msec
Ended Job = job_201403041248_0004
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.23 sec   HDFS Read: 251   HDFS Write: 37   SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 230 msec
OK
1    20    qiu     20140102
4    32    yang    20140102
Time taken: 19.554 seconds
TABLESAMPLE is Hive's sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y). y must be a multiple or a factor of the table's total bucket count, and Hive uses y to decide the sampling fraction. For example, if a table is split into 64 buckets: with y=32 the sample covers (64/32 =) 2 buckets of data, and with y=128 it covers (64/128 =) 1/2 of one bucket. x indicates which bucket the sampling starts from. For example, for a table with 32 buckets, tablesample(bucket 3 out of 16) samples (32/16 =) 2 buckets in total, namely the 3rd bucket and the (3+16 =) 19th bucket.
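Applied to the 3-bucket student table above, bucket 2 out of 3 reads exactly one bucket file (000001_0), which is why the query returned only ids 1 and 4. A couple of illustrative variants under the same rule (not run in the original post):

-- y = 6 is a multiple of 3, so this samples (3/6 =) half of one bucket.
select * from student tablesample(bucket 1 out of 6 on id);
-- y = 1 is a factor of 3, so this reads (3/1 =) all three buckets.
select * from student tablesample(bucket 1 out of 1 on id);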
Original post: http://blog.csdn.net/magicharvey/article/details/20466541