首页 > 解决方案 > Hive 窗口化附加输出

问题描述

给定以下数据:

CREATE TABLE dat (dt STRING, uxt BIGINT, temp FLOAT, city STRING);
INSERT INTO dat VALUES
('1/1/2000 0:53', 946687980, 100, 'A'),
('1/1/2000 0:59', 946688340, 28.9, 'A'),
('1/1/2000 13:54', 946734840, -1, 'A'),
('1/1/2000 13:55', 946734900, 30.9, 'A'),
('1/1/2000 22:53', 946767180, 30.9, 'A'),
('1/1/2000 22:59', 946767540, 30, 'A'),
('1/2/2000 1:25', 946776300, 121, 'A'),
('1/2/2000 1:53', 946777980, 28.9, 'A'),
('1/2/2000 2:53', 946781580, 28.9, 'A'),
('1/3/2000 1:53', 946864380, 10, 'A'),
('1/3/2000 11:20', 946898400, 15.1, 'A'),
('1/3/2000 11:53', 946900380, 18, 'A'),
('1/3/2000 21:00', 946933200, 17.1, 'A'),
('1/3/2000 21:53', 946936380, 16, 'A');

我正在使用一些窗口函数来查找每 24 小时的 maxtemp、mintemp 等:

select dt, uxt, maxtemp, mintemp, ABS(maxtemp - mintemp) as tempDiff, city
from(
select dt, uxt, max(temp) over (w) as maxtemp, min(temp) over (w) as 
mintemp, city
from dat
WINDOW w as (partition by city order by uxt range between CURRENT ROW and 
86400 FOLLOWING))t1
order by tempDiff desc;

这给了我以下输出(第一行):

dt                   uxt      maxtemp   mintemp tempdiff    city
2000-01-01 13:54    946734840   121.0   -1.0    122.0        A

我想在输出中出现 maxtemp 并努力寻找解决方案时添加“dt”。

输出的第一行如下所示:

dt                   uxt      maxtemp   mintemp tempdiff    city   maxdt   
2000-01-01 13:54    946698780   121        -1    122         A     '2000-01-02 01:25'

使用 first_value 查询:

select dt
  ,uxt
  ,max(temp) over w as maxtemp
  ,min(temp) over w as mintemp
  ,abs(max(temp) over w - min(temp) over w) as tempDiff
  ,first_value(dt) over (w order by temp desc) as maxdt
  ,city
from dat
order by tempDiff desc
WINDOW w as (partition by city order by uxt 
         range between CURRENT ROW and 86400 FOLLOWING);

正在产生这个输出(前两行):

dt               uxt        maxtemp mintemp tempdiff maxdt          city
1/1/2000 0:59    946688340  121.0   -1.0    122.0    1/2/2000 1:53  A
1/1/2000 0:53    946687980  121.0   -1.0    122.0    1/1/2000 0:53   A

它给出的 maxtemp 不在 24 小时日期范围内。此外,2000 年 1 月 1 日 0:53 的第二行的温度不是 121.0

标签: hive

解决方案


这可以通过first_value在内部查询中包含以下窗口函数来完成。

first_value(dt) over (partition by city order by uxt,temp desc 
                      range between CURRENT ROW and 7200 FOLLOWING)

另请注意,查询可以简化为(消除子查询,因为在这种情况下不需要子查询)

select dt
      ,uxt
      ,max(temp) over w as maxtemp
      ,min(temp) over w as mintemp
      ,abs(max(temp) over w - min(temp) over w) as tempDiff
      ,first_value(dt) over (w order by temp desc) as maxdt
      ,city
from dat
WINDOW w as (partition by city order by uxt 
             range between CURRENT ROW and 7200 FOLLOWING)

推荐阅读