Hive Partitioned Tables

Date: 2015-08-17 13:44:10

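1. Partitioned tables

Suppose we have a log file in which every record carries a timestamp. If we partition the table by date, all records from the same day are stored in the same partition.

To avoid producing too many small files, it is advisable to partition only on discrete fields, that is, fields with a limited set of distinct values.
Partitioning does not get in the way of queries over a wide range: a query can still span multiple partitions.
What a partition really is: a partition directory created under the table's data directory.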

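2. Subpartitions

On top of the date partitions, each partition can be further subpartitioned by country, which speeds up queries that filter on geographic location.

What a subpartition really is: a subpartition directory created under the partition directory.

Partitions are defined with the partitioned by clause when the table is created. After the table exists, alter table statements can be used to add or remove partitions.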

create table logs (ts bigint, line string)
 partitioned by (dt string, country string)
 row format delimited fields terminated by '\t';
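
The alter table statements mentioned above could look roughly like the following. This is a minimal sketch against the logs table defined here; the partition value dt='2001-01-03' is a made-up example and not part of the original data set.

```sql
-- Add an empty partition. Hive creates the directory
-- dt=2001-01-03/country=GB under the table directory.
-- The date value is a hypothetical example.
ALTER TABLE logs ADD IF NOT EXISTS PARTITION (dt='2001-01-03', country='GB');

-- Remove the same partition again. For a managed (non-external) table
-- this also removes the data files stored in that partition.
ALTER TABLE logs DROP IF EXISTS PARTITION (dt='2001-01-03', country='GB');
```

The IF NOT EXISTS and IF EXISTS guards make both statements safe to rerun.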

When loading data, the partition values must be specified explicitly:

load data local inpath '/root/hive/file2'
 into table logs
 partition (dt='2001-01-01', country='GB');

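After a few more data files have been loaded into the logs table, the directory structure looks like this:

The logs table has two date partitions, each containing two country subpartitions. The data files themselves sit in the lowest-level directories.

/user/hive/warehouse/logs/dt=2001-01-01/country=GB/file1
                                                  /file2
                                       /country=US/file3
                         /dt=2001-01-02/country=GB/file4
                                       /country=US/file5
                                                  /file6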

Use the show partitions logs command to list the partitions of the logs table:

dt=2001-01-01/country=GB
dt=2001-01-01/country=US
dt=2001-01-02/country=GB
dt=2001-01-02/country=US
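
When a table has many partitions, the listing can be narrowed by giving a partial partition specification. The sketch below assumes the logs table and the partition values shown above.

```sql
-- List every partition of the table.
SHOW PARTITIONS logs;

-- List only the country subpartitions that belong to one date partition.
SHOW PARTITIONS logs PARTITION (dt='2001-01-01');
```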

To display the table's structure, run describe logs;

ts                      bigint                                 
line                    string                                      
dt                      string                                      
country                 string                                

# Partition Information          

# col_name              data_type               comment                         

dt                          string                                      
country                     string
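
To see that a partition really is just a directory, describe can also be pointed at a single partition. This is a small sketch that assumes the dt=2001-01-01/country=GB partition loaded earlier. The exact layout of the output varies between Hive versions, so none is shown here.

```sql
-- Show detailed metadata for one partition, including the
-- storage location (the partition directory) of its data files.
DESCRIBE FORMATTED logs PARTITION (dt='2001-01-01', country='GB');
```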

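Note that the column definitions in the partitioned by clause are full-fledged columns of the table, known as partition columns.
The data files, however, do not contain values for these columns, because the values are derived from the directory names.
Partition columns can be used in select statements in the usual way. Hive prunes the input so that only the relevant partitions are scanned.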

select ts, dt, line
 from logs
 where country='GB';

This query scans only file1, file2, and file4.

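Note also that this query returns the value of the dt partition column, which Hive reads from the directory name rather than from the data files.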
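
Pruning also applies when a query filters on both partition columns, and a query can still span several partitions, as noted in section 1. The following is an illustrative sketch against the logs table and the partition values used above. The alias records is chosen here for illustration only.

```sql
-- Filter on both partition columns: only the
-- dt=2001-01-01/country=US partition needs to be read.
SELECT ts, line
FROM logs
WHERE dt = '2001-01-01' AND country = 'US';

-- A range predicate on dt spans several date partitions. Comparing the
-- strings works because the dates use the sortable yyyy-MM-dd format.
SELECT dt, country, count(*) AS records
FROM logs
WHERE dt >= '2001-01-01' AND dt <= '2001-01-02'
GROUP BY dt, country;
```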


Original: http://www.cnblogs.com/skyl/p/4736283.html
