Getting the deltas of a custom aggregate with date ranges

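Problem description

I need to find an efficient way to write a query that reports the deltas of an aggregate, along with the start and end dates of each value.

Requirements

What I tried

I tried creating a CTE that generates all possible ranges for a category and then joining it back to the main query, in order to split sub-categories that span multiple ranges. I then grouped by range and took MAX(is_active).

While this was a good start (at that point all I needed to do was combine consecutive ranges with the same value), the query was very slow. I am less familiar with Postgres than with other flavors of SQL, so I decided my time was better spent asking someone with more experience for help.

Source data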

+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| id | start_dt   | end_dt     | cat_id | sub_cat_id | is_active | comment                                             |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| 1  | 2018-01-01 | 2018-01-31 | 1      | 1001       | 1         | (null)                                              |
| 2  | 2018-02-01 | 2018-02-14 | 1      | 1001       | 0         | (null)                                              |
| 3  | 2018-02-15 | 2018-02-28 | 1      | 1001       | 0         | cat 1 is_active is unchanged despite new record.    |
| 4  | 2018-03-01 | 2018-03-30 | 1      | 1001       | 1         | (null)                                              |
| 5  | 2018-01-01 | 2018-01-15 | 2      | 2001       | 1         | (null)                                              |
| 6  | 2018-01-01 | 2018-01-31 | 2      | 2002       | 1         | (null)                                              |
| 7  | 2018-01-15 | 2018-02-10 | 2      | 2001       | 0         | cat 2 should still be active until 2002 is inactive |
| 8  | 2018-02-01 | 2018-02-14 | 2      | 2002       | 0         | cat 2 is inactive                                   |
| 9  | 2018-02-10 | 2018-03-15 | 2      | 2001       | 0         | this record will cause trouble                      |
| 10 | 2018-02-15 | 2018-03-30 | 2      | 2002       | 1         | cat 2 should be active again                        |
| 11 | 2018-03-15 | 2018-03-30 | 2      | 2001       | 1         | cat 2 is_active is unchanged despite new record.    |
| 12 | 2018-04-01 | 2018-04-30 | 2      | 2001       | 0         | cat 2 ends in a zero                                |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+

Expected results

+------------+------------+--------+-----------+
| start_dt   | end_dt     | cat_id | is_active |
+------------+------------+--------+-----------+
| 2018-01-01 | 2018-01-31 | 1      | 1         |
| 2018-02-01 | 2018-02-28 | 1      | 0         |
| 2018-03-01 | 2018-03-30 | 1      | 1         |
| 2018-01-01 | 2018-01-31 | 2      | 1         |
| 2018-02-01 | 2018-02-14 | 2      | 0         |
| 2018-02-15 | 2018-03-30 | 2      | 1         |
| 2018-04-01 | 2018-04-30 | 2      | 0         |
+------------+------------+--------+-----------+

Here is a SELECT statement to help you write your own tests.

SELECT id,start_dt::date start_date,end_dt::date end_date,cat_id,sub_cat_id,is_active::int is_active,comment
FROM (VALUES 
    (1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
    (2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
    (3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
    (4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
    (5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
    (6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
    (7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
    (8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
    (9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
    (10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active again'),
    (11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
    (12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')

) src ( "id","start_dt","end_dt","cat_id","sub_cat_id","is_active","comment" )

Tags: sql, postgresql

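Solution


So a category is active on a given date if at least one of its sub-categories is active on that date. If no sub-category is active on that date, the category is inactive. This logic was not clear to me from the original question.


I mentioned Itzik Ben-Gan's article Packing Intervals, which describes one way to handle this.

With this approach you pack all the active intervals and ignore the inactive ones entirely. The gaps left after packing the active intervals are inactive.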
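To make the packing idea concrete, here is a minimal Python sketch of the logic only, not Itzik Ben-Gan's SQL solution. The function name `pack_active_intervals` and the tuple layout are hypothetical. It also assumes that two date ranges that touch (one ends the day before the next starts) should be merged, which the original answer does not state.

```python
from datetime import date, timedelta

def pack_active_intervals(rows):
    """Merge overlapping or touching active intervals per category.

    rows: iterable of (start_dt, end_dt, cat_id, is_active), dates inclusive.
    Returns {cat_id: [(start_dt, end_dt), ...]} ordered by start_dt.
    """
    one_day = timedelta(days=1)
    packed = {}
    # Inactive rows are ignored entirely; only active intervals are packed.
    active = sorted((r for r in rows if r[3] == 1), key=lambda r: (r[2], r[0]))
    for start_dt, end_dt, cat_id, _ in active:
        intervals = packed.setdefault(cat_id, [])
        if intervals and start_dt <= intervals[-1][1] + one_day:
            # Overlaps or touches the previous interval: extend it.
            prev_start, prev_end = intervals[-1]
            intervals[-1] = (prev_start, max(prev_end, end_dt))
        else:
            intervals.append((start_dt, end_dt))
    return packed

# Rows for cat_id 2, copied from the source data above.
rows = [
    (date(2018, 1, 1), date(2018, 1, 15), 2, 1),
    (date(2018, 1, 1), date(2018, 1, 31), 2, 1),
    (date(2018, 1, 15), date(2018, 2, 10), 2, 0),
    (date(2018, 2, 1), date(2018, 2, 14), 2, 0),
    (date(2018, 2, 10), date(2018, 3, 15), 2, 0),
    (date(2018, 2, 15), date(2018, 3, 30), 2, 1),
    (date(2018, 3, 15), date(2018, 3, 30), 2, 1),
    (date(2018, 4, 1), date(2018, 4, 30), 2, 0),
]
packed = pack_active_intervals(rows)
```

For this input, `packed[2]` holds two intervals, 2018-01-01 to 2018-01-31 and 2018-02-15 to 2018-03-30, which are the two active rows for category 2 in the expected results.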

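If you never have dates that are neither active nor inactive, this is the final answer. If such "undefined" dates can occur, things may get tricky.


A completely different approach is to use a calendar table (either a permanent table or a series of dates generated on the fly). Join each row of the original table to the calendar table to expand it, creating one row for every date in the given interval.

Then group everything by category and date and set the is_active flag to MAX (if at least one sub-category has is_active=1 on that date, MAX will be 1, i.e. the category is active as well).

This approach is easier to understand and should work reasonably well as long as the intervals are not too long.

Something like this: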

SELECT
    Calendar.dt
    ,src.cat_id
    ,MAX(src.is_active) AS is_active
    -- we don't even need to know sub_cat_id
FROM
    src
    INNER JOIN Calendar
        ON  Calendar.dt >= src.start_dt
        AND Calendar.dt <= src.end_dt
GROUP BY
    Calendar.dt
    ,src.cat_id
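
The same expand-and-aggregate step can be sketched outside SQL to check the logic by hand. This is an illustrative Python version under the assumption that dates are inclusive on both ends. The name `expand_and_aggregate` and the tuple layout are hypothetical.

```python
from datetime import date, timedelta

def expand_and_aggregate(rows):
    """Expand each interval into single dates and keep MAX(is_active).

    rows: iterable of (start_dt, end_dt, cat_id, is_active), dates inclusive.
    Returns {(dt, cat_id): is_active} with one entry per covered date.
    """
    daily = {}
    for start_dt, end_dt, cat_id, is_active in rows:
        dt = start_dt
        while dt <= end_dt:
            key = (dt, cat_id)
            # Equivalent of GROUP BY dt, cat_id with MAX(is_active).
            daily[key] = max(daily.get(key, 0), is_active)
            dt += timedelta(days=1)
    return daily

# Rows 5 to 8 of the source data (cat_id 2).
rows = [
    (date(2018, 1, 1), date(2018, 1, 15), 2, 1),
    (date(2018, 1, 1), date(2018, 1, 31), 2, 1),
    (date(2018, 1, 15), date(2018, 2, 10), 2, 0),
    (date(2018, 2, 1), date(2018, 2, 14), 2, 0),
]
daily = expand_and_aggregate(rows)
```

On 2018-01-20 sub-category 2001 is inactive but 2002 is still active, so the entry for that date is 1. On 2018-02-05 both are inactive, so the entry is 0.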

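This gives you one row per date and category. Now you need to merge consecutive dates back into intervals. You can again use the Packing Intervals approach or some simpler variation of gaps-and-islands.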

Sample data

WITH src AS
(
    SELECT id,start_dt::date start_dt,end_dt::date end_dt,cat_id,sub_cat_id,is_active,comment
    FROM (VALUES 
        (1,  '2018-01-01', '2018-01-31', 1, 1001, 1, null),
        (2,  '2018-02-01', '2018-02-14', 1, 1001, 0, null),
        (3,  '2018-02-15', '2018-02-28', 1, 1001, 0, 'cat 1 is_active is unchanged despite new record.'),
        (4,  '2018-03-01', '2018-03-30', 1, 1001, 1, null),
        (5,  '2018-01-01', '2018-01-15', 2, 2001, 1, null),
        (6,  '2018-01-01', '2018-01-31', 2, 2002, 1, null),
        (7,  '2018-01-15', '2018-02-10', 2, 2001, 0, 'cat 2 should still be active until 2002 is inactive'),
        (8,  '2018-02-01', '2018-02-14', 2, 2002, 0, 'cat 2 is inactive'),
        (9,  '2018-02-10', '2018-03-15', 2, 2001, 0, 'cat 2 is_active is unchanged despite new record.'),
        (10, '2018-02-15', '2018-03-30', 2, 2002, 1, 'cat 2 should be active again'),
        (11, '2018-03-15', '2018-03-30', 2, 2001, 1, 'cat 2 is_active is unchanged despite new record.'),
        (12, '2018-04-01', '2018-04-30', 2, 2001, 0, 'cat 2 ends in 0.')
    ) src ( id,start_dt,end_dt,cat_id,sub_cat_id,is_active,comment)
)
,Calendar AS
(
    -- OP Note: Union of all dates from source produced 30% faster results.
    -- OP Note 2: Including the cat_id (which was indexed FK), Made Query 8x faster.
    SELECT cat_id, start_dt dt FROM src
    UNION SELECT cat_id, end_dt dt FROM src 
    /*SELECT dt::date dt
    FROM (
        SELECT MIN(start_dt) min_start, MAX(end_dt) max_end
        FROM src
    ) max_ranges
    CROSS JOIN generate_series(min_start, max_end, '1 day'::interval) dt*/
)

Main query

Examine the result of each intermediate CTE to fully understand how it works.

-- expand intervals into individual dates
,CTE_Dates
AS
(
    SELECT
        Calendar.dt
        ,src.cat_id
        ,MAX(src.is_active) AS is_active
        -- we don't even need to know sub_cat_id
    FROM
        src
        INNER JOIN Calendar
            ON  Calendar.dt >= src.start_dt
            AND Calendar.dt <= src.end_dt
            AND Calendar.cat_id = src.cat_id
    GROUP BY
        Calendar.dt
        ,src.cat_id
)
-- simple gaps-and-islands
,CTE_rn
AS
(
    SELECT
        *
        ,ROW_NUMBER() OVER (PARTITION BY cat_id ORDER BY dt) AS rn1
        ,ROW_NUMBER() OVER (PARTITION BY cat_id, is_active ORDER BY dt) AS rn2
    FROM CTE_Dates
)
-- diff of row numbers gives us a group's "ID"
-- condense each island and gap back into interval using simple GROUP BY
SELECT
    MIN(dt) AS start_dt
    ,MAX(dt) AS end_dt
    ,cat_id
    ,is_active
FROM CTE_rn
GROUP BY
    cat_id
    ,is_active
    ,rn1 - rn2
ORDER BY
    cat_id
    ,start_dt
;
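
The row-number difference used in CTE_rn can be hard to see at first, so here is a small Python sketch of the same idea. It is only an illustration: the function name `condense` and the tuple layout are hypothetical, and `rn1` and `rn2` mirror the two ROW_NUMBER() columns of the query.

```python
from datetime import date, timedelta

def condense(daily):
    """Collapse per-date rows into intervals with the row-number trick.

    daily: iterable of (dt, cat_id, is_active), one row per date and category.
    Returns a list of (start_dt, end_dt, cat_id, is_active).
    """
    rn1, rn2, groups = {}, {}, {}
    for dt, cat_id, is_active in sorted(daily, key=lambda r: (r[1], r[0])):
        # rn1: row number within the category, ordered by date.
        rn1[cat_id] = rn1.get(cat_id, 0) + 1
        # rn2: row number within the category and flag, ordered by date.
        key = (cat_id, is_active)
        rn2[key] = rn2.get(key, 0) + 1
        # The difference stays constant while the flag does not change.
        island = (cat_id, is_active, rn1[cat_id] - rn2[key])
        groups.setdefault(island, []).append(dt)
    result = [(min(dts), max(dts), cat_id, is_active)
              for (cat_id, is_active, _), dts in groups.items()]
    return sorted(result, key=lambda r: (r[2], r[0]))

# Six consecutive days for a single category with flags 1, 1, 0, 0, 0, 1.
flags = [1, 1, 0, 0, 0, 1]
daily = [(date(2018, 1, 1) + timedelta(days=i), 1, flag)
         for i, flag in enumerate(flags)]
intervals = condense(daily)
```

Here the differences are 0, 0, 2, 2, 2, 3, so the six days collapse into three intervals. Note that the row numbers count rows rather than calendar days, so a missing date between two rows with the same flag would not start a new interval.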

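Second variant without a generic calendar

It may perform better, because this variant does not have to scan the src table (twice) to build a temporary list of dates, sort that list to remove duplicates, and then join to that temporary list, which most likely has no supporting indexes. It does generate more rows, though.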

-- remove Calendar CTE above, 
-- use generate_series() to generate the exact range of dates we need 
-- without joining to generic Calendar table

-- expand intervals into individual dates
,CTE_Dates
AS
(
    SELECT
        Dates.dt
        ,src.cat_id
        ,MAX(src.is_active) AS is_active
        -- we don't even need to know sub_cat_id
    FROM
        src
        INNER JOIN LATERAL
        (
            SELECT dt::date
            FROM generate_series(src.start_dt, src.end_dt, '1 day'::interval) AS s(dt)
        ) AS Dates ON true
    GROUP BY
        Dates.dt
        ,src.cat_id
)
-- simple gaps-and-islands
,CTE_rn
AS
(
    SELECT
        *
        ,ROW_NUMBER() OVER (PARTITION BY cat_id ORDER BY dt) AS rn1
        ,ROW_NUMBER() OVER (PARTITION BY cat_id, is_active ORDER BY dt) AS rn2
    FROM CTE_Dates
)
-- diff of row numbers gives us a group's "ID"
-- condense each island and gap back into interval using simple GROUP BY
SELECT
    MIN(dt) AS start_dt
    ,MAX(dt) AS end_dt
    ,cat_id
    ,is_active
FROM CTE_rn
GROUP BY
    cat_id
    ,is_active
    ,rn1 - rn2
ORDER BY
    cat_id
    ,start_dt
;

Result

+------------+------------+--------+-----------+
|  start_dt  |   end_dt   | cat_id | is_active |
+------------+------------+--------+-----------+
| 2018-01-01 | 2018-01-31 |      1 |         1 |
| 2018-02-01 | 2018-02-28 |      1 |         0 |
| 2018-03-01 | 2018-03-30 |      1 |         1 |
| 2018-01-01 | 2018-01-31 |      2 |         1 |
| 2018-02-01 | 2018-02-14 |      2 |         0 |
| 2018-02-15 | 2018-03-30 |      2 |         1 |
| 2018-04-01 | 2018-04-30 |      2 |         0 |
+------------+------------+--------+-----------+

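Also, CTEs are known to be an "optimization fence" in Postgres (this applies to versions before PostgreSQL 12; since version 12 the planner can inline a non-recursive, side-effect-free CTE that is referenced only once), so performance may change if you inline these CTEs into a single query. You need to test this on your system with your data.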

