首页 > 解决方案 > 检测间隙并创建缺失的时间段行

问题描述

设置:

我有一个包含以下列的表格:

对于每个令牌名称,我有一个“总体”/一般时间段(例如 01.01.2020 - 31.12.2020)和每个令牌名称的多行,每行都有一个特定的子时间段。举个例子:

子时间段应该是连续的,并且应该以与总体/一般时间段相同的日期开始和结束。时间段之间没有重叠。

问题:

一些令牌缺少一些行/子时间段。例如,第一个、中间或最后一个子时间段可能会丢失。

我一直在绞尽脑汁如何:

  1. 找出差距
  2. 填满它们

我考虑过使用窗口函数来检测是否存在间隙,但我陷入了如何从那里开始创建具有时间段的缺失行的问题。

     CASE 
     WHEN -- First entry of the token art
        LAG(sub_date_end ,1) OVER (   
            PARTITION BY token
            ORDER BY     token, sub_date_start ASC
        ) IS NULL  -- no previous one == first row.
        THEN 
                  CASE -- check gap between row and overall
                    WHEN sub_date_start != overall_date_start 
                    THEN 'gap detected between start and first row'
                  ELSE '' -- no gap 
                 END 
     WHEN -- Gap between row's start and previous row's end
            LAG(sub_date_end ,1) OVER (
            PARTITION BY token
            ORDER BY     token, sub_date_start ASC
            ) != sub_date_start -1  
           THEN 'gap detected between this and the previous one'
    ELSE '' -- no gap

    END AS gaps_between_rows

伪代码解释:

标签: sqlamazon-redshift

解决方案


这只是完成您列出的想法。

此虚拟数据有 4 条缺失记录 - 包括第一条记录和最后一条记录。

COALESCE如果第一条记录丢失,可用于处理LAG将导致NULL. 最后一条记录需要单独处理。

WITH dummy_data AS (
  -- Missing 2019-08-01 to 2019-12-31
  SELECT 1 AS token, '2019-08-01' AS overall_date_start, '2021-05-01'::DATE AS overall_date_end, '2020-01-01'::DATE AS sub_date_start, '2020-04-28'::DATE AS sub_date_end
  UNION SELECT 1, '2019-08-01'::DATE, '2021-05-01'::DATE, '2020-04-29'::DATE, '2020-08-01'::DATE
  -- Missing 2020-08-02 to 2020-09-01
  UNION SELECT 1, '2019-08-01'::DATE, '2021-05-01'::DATE, '2020-09-02'::DATE, '2020-10-01'::DATE
  -- Missing 2020-10-02 to 2020-11-01
  UNION SELECT 1, '2019-08-01'::DATE, '2021-05-01'::DATE, '2020-11-02'::DATE, '2021-02-12'::DATE
  -- Missing  2021-02-13 to 2021-05-01
),
missing_last AS (
  SELECT
    token,
    overall_date_start,
    overall_date_end,
    
    MAX(sub_date_end) + 1 AS new_sub_date_start,
    overall_date_end AS new_sub_date_end,
    new_sub_date_start < overall_date_end AS is_missing
  FROM dummy_data
  GROUP BY token, overall_date_start, overall_date_end
),
missing_other AS (
  SELECT
    token,
    overall_date_start,
    overall_date_end,
    
    COALESCE(
      LAG(sub_date_end, 1) OVER (
        PARTITION BY token
        ORDER BY token, sub_date_start
      ) + 1,
      overall_date_start
    ) AS new_sub_date_start,
    sub_date_start - 1 AS new_sub_date_end,
    
    new_sub_date_start != sub_date_start AS is_missing
  FROM dummy_data
),
missing_all AS (
  SELECT * FROM missing_last
  UNION ALL SELECT * FROM missing_other
)
SELECT
  token,
  overall_date_start,
  overall_date_end,
  new_sub_date_start AS sub_date_start,
  new_sub_date_end AS sub_date_end
FROM missing_all
WHERE is_missing
ORDER BY sub_date_start
-- token    overall_date_start  overall_date_end  sub_date_start  sub_date_end
--     1            2019-08-01        2021-05-01      2019-08-01    2020-12-31
--     1            2019-08-01        2021-05-01      2020-08-02    2020-09-01
--     1            2019-08-01        2021-05-01      2020-10-02    2020-11-01  
--     1            2019-08-01        2021-05-01      2021-02-13    2021-05-01  

推荐阅读