首页 > 解决方案 > Impala Last_Value() 未按预期给出结果

问题描述

我在 Impala 中有一个表,其中有 Unix-Time 的时间信息(频率为 1 毫秒)和三个变量的信息,如下所示:

ts          Val1        Val2        Val3        
1.60669E+12 7541.76     0.55964607  267.1613        
1.60669E+12 7543.04     0.5607262   267.27805       
1.60669E+12 7543.04     0.5607241   267.22308       
1.60669E+12 7543.6797   0.56109643  267.25974       
1.60669E+12 7543.6797   0.56107396  267.30624       
1.60669E+12 7543.6797   0.56170875  267.2643    

我想重新采样数据并获取新时间窗口的最后一个值。例如,如果我想重新采样为 10Sec 频率,则输出应该是 10Sec 窗口的最后一个值,如下所示:

ts                      val1_Last       Val2_Last       Val3_Last   
2020-11-29 22:30:00     7541.76         0.55964607      267.1613
2020-11-29 22:30:10     7542.3994       0.5613486       267.31238
2020-11-29 22:30:20     7542.3994       0.5601791       267.22842
2020-11-29 22:30:30     7544.32         0.56069416      267.20248

为了得到这个结果,我正在运行以下查询:

select distinct *
from (
select ts,
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1, 
last_value(Val2) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val2,
last_value(Val3) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val3 
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts , 
Val1 as Val1, 
Val2 as Val2, 
Val3 as Val3
FROM Sensor_Data.Table where unit='Unit1'  
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt 
order by ts

我在一些有时会导致问题的论坛上阅读过,所以我尝试使用withLAST_VALUE()来实现相同的目的。查询如下:FIRST_VALUEORDER BY DESC

select distinct *
from (
select ts,
first_value(Val1) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val1, 
first_value(Val2) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val2,
first_value(Val3) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts , 
Val1 as Val1, 
val2 as Val2, 
Val3 as Val3
FROM product_sofcdtw_ops.as_operated_full_backup where unit='FCS05-09'  
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt 
order by ts

但在这两种情况下,我都没有得到预期的结果。重新采样的时间ts按预期出现(窗口为 10 秒),但我得到了0-9 秒、10-19 秒、...窗口之间Val1Val2随机值。Val3

从逻辑上讲,这个查询看起来不错,我没有发现任何问题。任何人都可以解释为什么我没有使用这个查询得到正确的答案。

谢谢 !!!

标签: sqldatabaseimpalaresamplingqsqlquery

解决方案


问题是这一行:

last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1, 

您正在按同一列进行分区和排序,ts因此没有排序(或者更具体地说,按整个分区中恒定的值排序会导致任意排序)。您需要保留原始ts 以使其工作,使用它进行订购:

select ts,
        last_value(Val1) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val1, 
        last_value(Val2) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val2,
        last_value(Val3) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val3 
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts_10, 
             t.*
      FROM Sensor_Data.Table t
      WHERE unit = 'Unit1' AND
            cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00'
     ) t

顺便说一句,问题last_value()在于当您忽略窗口框架(窗口函数规范的rowsrange部分)时,它会出现意外行为。

问题是默认规范是range between unbounded preceding and current row,这意味着last_value()它只获取当前行中的值。

另一方面,first_value()使用默认框架可以正常工作。但是,如果您包含显式框架,则两者是等效的。


推荐阅读