有时想要同时显示聚集前后的数据,这时引入了窗口函数,在SQL处理中,窗口函数都是最后一步执行
,而且仅位于 Order by 字句之前。
name,orderdate,cost jack,2017-01-01,10 tony,2017-01-02,15 jack,2017-02-03,23 tony,2017-01-04,29 jack,2017-01-05,46 jack,2017-04-06,42 tony,2017-01-07,50 jack,2017-01-08,55 mart,2017-04-08,62 mart,2017-04-09,68 neil,2017-05-10,12 mart,2017-04-11,75 neil,2017-06-12,80 mart,2017-04-13,94
建表 导数据
create table business( name string, orderdate string, cost int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,‘; load data local inpath "/opt/module/datas/business.txt" into table business;
相关函数说明
OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化
CURRENT ROW:当前行
n PRECEDING:往前 n 行数据
n FOLLOWING:往后 n 行数据
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING 表示到后面的终点
上面写 over 里面,下面写 over 前面。
LAG(col,n):往前第
n 行数据
LEAD(col,n):往后第
n 行数据
NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE 返回此行所属的组的编号。注意:n必须为int类型。
1 查询在2017年4月份购买过的顾客及总人数
select name,count(*) over() from business where substring(orderdate,1,7) = ‘2017-04‘ group by name; 结果: mart 2 jack 2
2 查询顾客的购买明细及月购买总额
select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from business; 解释:按照月划分数据 然后统计这个月的cost总和 jack 2017-01-01 10 205 jack 2017-01-08 55 205 tony 2017-01-07 50 205 jack 2017-01-05 46 205 tony 2017-01-04 29 205 tony 2017-01-02 15 205 jack 2017-02-03 23 23 mart 2017-04-13 94 341 jack 2017-04-06 42 341 mart 2017-04-11 75 341 mart 2017-04-09 68 341 mart 2017-04-08 62 341 neil 2017-05-10 12 12 neil 2017-06-12 80 80
3 查看顾客上次的购买时间
select name,orderdate,cost, lag(orderdate,1,‘defaulttime‘) over(partition by name order by orderdate ) as time1, lag(orderdate,2,‘defaulttime‘) over (partition by name order by orderdate) as time2 from business; 结果 : 姓名 日期 价格 前一天日期 前两天日期 jack 2017-01-01 10 defaulttime defaulttime jack 2017-01-05 46 2017-01-01 defaulttime jack 2017-01-08 55 2017-01-05 2017-01-01 jack 2017-02-03 23 2017-01-08 2017-01-05 jack 2017-04-06 42 2017-02-03 2017-01-08 mart 2017-04-08 62 defaulttime defaulttime mart 2017-04-09 68 2017-04-08 defaulttime mart 2017-04-11 75 2017-04-09 2017-04-08 mart 2017-04-13 94 2017-04-11 2017-04-09 neil 2017-05-10 12 defaulttime defaulttime neil 2017-06-12 80 2017-05-10 defaulttime tony 2017-01-02 15 defaulttime defaulttime tony 2017-01-04 29 2017-01-02 defaulttime tony 2017-01-07 50 2017-01-04 2017-01-02
4 查询前20%时间的订单信息
select * from ( select name,orderdate,cost, ntile(5) over(order by orderdate) sorted from business ) t where sorted = 1; 结果 : jack 2017-01-01 10 1 tony 2017-01-02 15 1 tony 2017-01-04 29 1 jack 2017-01-05 46 2 tony 2017-01-07 50 2 jack 2017-01-08 55 2 jack 2017-02-03 23 3 jack 2017-04-06 42 3 mart 2017-04-08 62 3 mart 2017-04-09 68 4 mart 2017-04-11 75 4 mart 2017-04-13 94 4 neil 2017-05-10 12 5 neil 2017-06-12 80 5
以下实验均关注最后一列
1. 所有行相加
select name,orderdate,cost,sum(cost) over() as sample1 from business; 结果 : mart 2017-04-13 94 661 neil 2017-06-12 80 661 mart 2017-04-11 75 661 neil 2017-05-10 12 661 mart 2017-04-09 68 661
2. 按name分组,组内数据相加
select name,orderdate,cost,sum(cost) over(partition by name) as sample2 from business; 结果 : jack 2017-01-05 46 176 jack 2017-01-08 55 176 jack 2017-01-01 10 176 jack 2017-04-06 42 176 jack 2017-02-03 23 176 ... tony 2017-01-04 29 94 tony 2017-01-02 15 94 tony 2017-01-07 50 94
3. 按name分组,组内数据累加
select name,orderdate,cost, sum(cost) over(partition by name order by orderdate) as sample3 from business; 跟下面类似 select name,orderdate,cost, sum(cost) over(distribute by name sort by orderdate) as sample3 from business; jack 2017-01-01 10 10 jack 2017-01-05 46 56 jack 2017-01-08 55 111 jack 2017-02-03 23 134 jack 2017-04-06 42 176 ...
4. 和sample3一样,由起点到当前行的聚合
select name,orderdate,cost, sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 from business; 结果 : jack 2017-01-01 10 10 jack 2017-01-05 46 56 jack 2017-01-08 55 111 jack 2017-02-03 23 134 jack 2017-04-06 42 176 ...
5. 当前行和前面一行做聚合
select name,orderdate,cost, sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5 from business; 结果 : jack 2017-01-01 10 10 jack 2017-01-05 46 56 = 46 + 10 jack 2017-01-08 55 101 = 44 + 46 jack 2017-02-03 23 78 = 23 + 55 jack 2017-04-06 42 65 = 42 + 23 ... tony 2017-01-02 15 15 tony 2017-01-04 29 44 tony 2017-01-07 50 79
6. 当前行和前边一行及后面一行
select name,orderdate,cost, sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6 from business; 结果 : jack 2017-01-01 10 56 = 10 + 46 jack 2017-01-05 46 111 = 46 + 10 + 55 jack 2017-01-08 55 124 = 55 + 46 + 23 jack 2017-02-03 23 120 = 23 + 55 + 42 jack 2017-04-06 42 65 = 42 + 23 ... tony 2017-01-02 15 44 tony 2017-01-04 29 94 tony 2017-01-07 50 79
7. 当前行及后面所有行
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 from business; 结果 : jack 2017-01-01 10 176 = 10 + 46 + 55 + 23 + 42 jack 2017-01-05 46 166 = 46 + 55 + 23 + 42 jack 2017-01-08 55 120 = 55 + 23 + 42 jack 2017-02-03 23 65 = 23 + 42 jack 2017-04-06 42 42 = 42 mart 2017-04-08 62 299 mart 2017-04-09 68 237 mart 2017-04-11 75 169 mart 2017-04-13 94 94 neil 2017-05-10 12 92 neil 2017-06-12 80 80 tony 2017-01-02 15 94 tony 2017-01-04 29 79 tony 2017-01-07 50 50
函数说明
rank():排序相同时会重复,总数不会变
dense_rank():排序相同时会重复,总数会减少
row_number():会根据顺序计算
select name,subject,score, rank() over(partition by subject order by score desc) rp, dense_rank() over(partition by subject order by score desc) drp, row_number() over(partition by subject order by score desc) rmp from score; 结果 : name subject score rp drp rmp 孙悟空 数学 95 1 1 1 宋宋 数学 86 2 2 2 婷婷 数学 85 3 3 3 大海 数学 56 4 4 4 宋宋 英语 84 1 1 1 大海 英语 84 1 1 2 婷婷 英语 78 3(跳过2)2 3 孙悟空 英语 68 4 3(总数少) 4 大海 语文 94 1 1 1 孙悟空 语文 87 2 2 2 婷婷 语文 65 3 3 3 宋宋 语文 64 4 4 4
CONCAT(string A, string B):
返回输入字符串
连接后
的结果,支持任意个输入字符串;
CONCAT_WS(separator, str1, str2,…):
特殊形式的 CONCAT()。第一个参数剩余参数间的分隔符。分隔符可以是与剩余参数一样的字符串。如果分隔符是 NULL,返回值也将为 NULL。COLLECT_SET(col):
只接受基本数据类型,主要作用是将某字段的值进行去重汇总,产生array类型字段。多行汇总成一个array类型。
EXPLODE(col):
将hive一列中复杂的array或者map结构拆分成多行。
LATERAL VIEW
用法:LATERAL VIEW udtf(expression) table Alias AS columnAlias
解释:用于和split, explode等UDTF一起使用,它能够将一列数据拆成多行数据,在此基础上可以对拆分后的数据进行聚合。
原文:https://www.cnblogs.com/qiu-hua/p/14731299.html