(REDSHIFT) Vertical merge / FIRST_VALUE() as an aggregate

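Problem description

(This is Redshift-specific, so answers should take its columnar nature, sort order, etc. into account.)

For each category, I need the first non-NULL value from each column when the rows are ordered by timestamp.

Essentially, the same as FIRST_VALUE(), but as an aggregate.

Or, put another way, COALESCE() as an aggregate.

Redshift, however, lacks the niceties of later PostgreSQL versions or of Oracle. So I'm looking for options to test against my 100-million-row import :)

(I don't like either of my own options, but I'm struggling to find better ones.)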


Sample input

 category | row_timestamp | value_a | value_b | value_c
----------+---------------+---------+---------+---------

    01    |      001      |   NULL  |   NULL  |     4
    01    |      010      |      7  |   NULL  |  NULL
    01    |      100      |   NULL  |      1  |     2
    01    |      999      |      6  |      3  |     6

    02    |      001      |      1  |   NULL  |  NULL
    02    |      010      |   NULL  |      2  |  NULL
    02    |      100      |   NULL  |      1  |     9
    02    |      999      |      6  |      3  |     2

Expected results

 category | value_a | value_b | value_c
----------+---------+---------+---------
    01    |      7  |      1  |     4
    02    |      1  |      2  |     9
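
To check any candidate query against a small sample before running it on the full import, the requirement can be pinned down as a tiny reference implementation. This is only a sketch in Python, not Redshift code: the names `SAMPLE_ROWS` and `first_non_null_per_category` are made up for illustration, and the timestamps from the sample input are written as plain integers.

```python
from typing import Dict, List, Optional, Tuple

# (category, row_timestamp, value_a, value_b, value_c), copied from the sample input.
# Timestamps such as 001 and 010 are written as the integers 1 and 10 (an assumption).
Row = Tuple[str, int, Optional[int], Optional[int], Optional[int]]

SAMPLE_ROWS: List[Row] = [
    ("01", 1, None, None, 4),
    ("01", 10, 7, None, None),
    ("01", 100, None, 1, 2),
    ("01", 999, 6, 3, 6),
    ("02", 1, 1, None, None),
    ("02", 10, None, 2, None),
    ("02", 100, None, 1, 9),
    ("02", 999, 6, 3, 2),
]


def first_non_null_per_category(rows: List[Row]) -> Dict[str, List[Optional[int]]]:
    """For each category, return the earliest non-NULL value of every value column."""
    result: Dict[str, List[Optional[int]]] = {}
    # Visit rows in (category, timestamp) order so the first value seen per column wins.
    for category, _ts, *values in sorted(rows, key=lambda r: (r[0], r[1])):
        firsts = result.setdefault(category, [None] * len(values))
        for i, value in enumerate(values):
            if firsts[i] is None and value is not None:
                firsts[i] = value
    return result


print(first_non_null_per_category(SAMPLE_ROWS))
# {'01': [7, 1, 4], '02': [1, 2, 9]}
```

The printed lists follow the column order value_a, value_b, value_c and match the expected results table.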


Current solutions

SELECT DISTINCT
    category,
    FIRST_VALUE(value_a IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
                 AS value_a,

    FIRST_VALUE(value_b IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
                 AS value_b,

    FIRST_VALUE(value_c IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
                 AS value_c
FROM
    mytable

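It works, but the DISTINCT can end up being applied to many hundreds or thousands of rows. Less than ideal.

The following might be workable if it were for only one or two columns (but it's for a dozen or so columns, so this is horrible)...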

WITH
    sorted_value_a AS
(
    SELECT
        category,
        value_a,
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_a IS NULL, row_timestamp  -- FALSE sorts first: non-NULL rows lead
                          )
                              AS row_ordinal
    FROM
        myTable
),
    sorted_value_b AS
(
    SELECT
        category,
        value_b,
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_b IS NULL, row_timestamp  -- FALSE sorts first: non-NULL rows lead
                          )
                              AS row_ordinal
    FROM
        myTable
),
    sorted_value_c AS
(
    SELECT
        category,
        value_c,
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_c IS NULL, row_timestamp  -- FALSE sorts first: non-NULL rows lead
                          )
                              AS row_ordinal
    FROM
        myTable
)
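SELECT
    a.category,
    a.value_a,
    b.value_b,
    c.value_c
FROM
    sorted_value_a   AS a
INNER JOIN
    sorted_value_b   AS b
        ON  b.category    = a.category
        AND b.row_ordinal = 1   -- keep only the top-ranked row per category
INNER JOIN
    sorted_value_c   AS c
        ON  c.category    = a.category
        AND c.row_ordinal = 1
WHERE
    a.row_ordinal = 1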

Tags: sql, amazon-redshift, aggregation

Solution


Well, I don't know if this is pretty, but you can do:

select category, value_a, value_b, value_c, value_d
from (select category,
             -- ordered by row_timestamp desc, lag(... ignore nulls) returns the nearest LATER non-NULL value
             coalesce(value_a, lag(value_a ignore nulls) over (partition by category order by row_timestamp desc)) as value_a,
             coalesce(value_b, lag(value_b ignore nulls) over (partition by category order by row_timestamp desc)) as value_b,
             coalesce(value_c, lag(value_c ignore nulls) over (partition by category order by row_timestamp desc)) as value_c,
             coalesce(value_d, lag(value_d ignore nulls) over (partition by category order by row_timestamp desc)) as value_d,
             -- seqnum = 1 is the earliest row of each category
             row_number() over (partition by category order by row_timestamp) as seqnum
      from mytable t
     ) t
where seqnum = 1;
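
The COALESCE-plus-LAG idea boils down to carrying the nearest non-NULL value along each partition and then keeping a single row per category. Below is a minimal Python sketch of that mechanism, arranged so that the row kept is the earliest one and the carried value comes from later rows, which is the direction the expected results call for. The function name `earliest_row_with_backfill` and the single-column row layout are assumptions made for illustration; they are not part of the answer.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, int, Optional[int]]  # (category, row_timestamp, value_a)


def earliest_row_with_backfill(rows: List[Row]) -> Dict[str, Optional[int]]:
    """Emulate COALESCE(value, nearest later non-NULL value), keeping the earliest row per category."""
    filled: Dict[str, Optional[int]] = {}
    # Walk each category from newest to oldest, like a window ordered by row_timestamp DESC.
    for category, _ts, value in sorted(rows, key=lambda r: (r[0], -r[1])):
        # The row's own value wins; otherwise reuse the value carried from later rows.
        filled[category] = value if value is not None else filled.get(category)
    # The last row visited per category is its earliest row, so its filled value is what remains.
    return filled


rows = [
    ("01", 1, None), ("01", 10, 7), ("01", 100, None), ("01", 999, 6),
    ("02", 1, 1), ("02", 10, None), ("02", 100, None), ("02", 999, 6),
]
print(earliest_row_with_backfill(rows))
# {'01': 7, '02': 1}
```

The values 7 and 1 agree with the value_a column of the expected results; the same walk applies independently to every other column.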
