首页 > 解决方案 > SQL (Hive):在使用 GROUP BY 进行聚合时使用窗口函数

问题描述

我在雅典娜Hive / Presto)中有下表:

CREATE EXTERNAL TABLE tmp (
    id STRING,
    updated_at TIMESTAMP,
    location STRING,
    direction STRING
)
LOCATION 's3://path'; 

我需要对字段进行聚合和计数,同时id选择相对于组内最新的和(分区再次打开)。locationdirectiontimestampid

到目前为止,我想出了以下查询,首先利用窗口函数,然后再分组:

SELECT
    b.id,
    MAX(b.latest_location) AS "latest_location",  -- It seems it is not possible to use first_value() on GROUP BY
    MAX(b.latest_direction) AS "latest_direction",
    COUNT(*) AS "total"
FROM (
    SELECT
        a.id,
        first_value(a.location) OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS "latest_location",
        first_value(a.direction) OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS "latest_direction"
    FROM tmp a
) b
GROUP BY b.id;

我首先尝试同时通过聚合和窗口聚合进行分组,但似乎引擎不允许这样做。是否可以编写更有效的查询(可能没有子查询)?

标签: sqlhivehiveqlprestoamazon-athena

解决方案


SELECT DISTINCT
    id,
    first_value(a.location)  OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_location,
    first_value(a.direction) OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_direction,
    count(*) OVER (PARTITION BY id) as total
FROM tmp

在您的原始查询中,max基本上是一个虚拟聚合,因为所有行都具有相同的值。并且group by基本上是在distinct做这里所做的事情。


推荐阅读