首页 > 解决方案 > 如何确保每个日期和部署都存在并且窗口功能不会因为数据不存在而跳过几天?

问题描述

在 Athena 工作,我有一个查询,该查询生成一个表,其中包含针对日期和部署的每次迭代的多个度量值。

WITH
  core AS (
   SELECT /* create dataset for window functions to work on */
     "deployment"
   , "macaddress"
   , "date"
   FROM 
     (
      SELECT /* de-dupe the base dataset based on a macaddress being valid once per day per deployment */
        "_deployment" "deployment"
      , "macaddress"
      , "date"("timestamp") "date"
      , "row_number"() OVER (PARTITION BY "_deployment", "macaddress" ORDER BY "date"("timestamp") ASC) "rn"
      FROM
        registration
   ) 
   WHERE (CAST("rn" AS varchar(2)) = '1')
) 
SELECT /* window functions to aggregate */
  "count"("macaddress") "registrations"
, "deployment"
, "date"
, "lag"("count"("macaddress"), 1) OVER (PARTITION BY "deployment" ORDER BY "date" ASC) "yesterdayCount"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) "Past7Days"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING) "7daysTarget"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) "Past30Days"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING) "30daysTarget"
, "avg"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) "10DayTrend"
FROM
  core
GROUP BY "deployment", "date"

我遇到的问题与并非每个部署都有每个日期的数据这一事实有关,因此窗口函数没有按预期方式工作,它们对前一个 7 求和rows,而不是前一个 7 days

我所追求的输出是一个表格,date在数据集中最早和最新之间有一行(最新的date将继续移动,表格每晚更新)并且每个都deployment存在于数据集中,数据集中是否有数据或不。

因此,如果部署 1 有以下几天的数据:

2020-04-1
2020-04-2
2020-04-3
2020-04-5
2020-04-6
2020-04-7
2020-04-8

我希望该Past7Days字段为to2020-04-8的总和。目前,结果是 to 的总和,因为它正在回顾行,而不是日期,并且丢失了。2020-04-22020-04-082020-04-012020-04-082020-04-4

决赛桌需要采用以下格式:

registrations | deployment | date | yesterdayCount | Past7Days | 7daysTarget | Past30Days | 30daysTarget | 10DayTrend

并且在日期或部署方面没有差距。

我的工作理论是,我需要在最初的最里面的 SELECT 语句(重复数据删除语句)上建立它。我有一个日历表并且一直在加入它,但是我被deployment日历中没有字段的事实所困扰,因此加入是不完整的,仅包含date. 如果可能的话,我想避免每次部署都有一个日历表。

预先感谢您的任何帮助。

标签: sqlwindow-functionsamazon-athena

解决方案


只需通过将部署与日历表交叉连接来创建部署日历。就像是

WITH
  core AS (
   SELECT /* create dataset for window functions to work on */
     "deployment"
   , "macaddress"
   , "date"
   FROM 
     (
      SELECT /* de-dupe the base dataset based on a macaddress being valid once per day per deployment */
        "_deployment" "deployment"
      , "macaddress"
      , "date"("timestamp") "date"
      , "row_number"() OVER (PARTITION BY "_deployment", "macaddress" ORDER BY "date"("timestamp") ASC) "rn"
      FROM
        registration
   ) 
   WHERE (CAST("rn" AS varchar(2)) = '1')
) ,
deploymentcalendar as (
select t.deployment,ct.date 
from (select distinct deploymet from core) t
cross join calendar_table ct
)
SELECT /* window functions to aggregate */
  "count"("macaddress") "registrations"
, "deployment"
, "date"
, "lag"("count"("macaddress"), 1) OVER (PARTITION BY dc."deployment" ORDER BY "date" ASC) "yesterdayCount"
, "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) "Past7Days"
, "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING) "7daysTarget"
, "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) "Past30Days"
, "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING) "30daysTarget"
, "avg"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) "10DayTrend"
FROM
  core
RIGHT JOIN deploymentcalendar dc on dc.deployment=core.deployment and dc.date=core.date
GROUP BY core."deployment", core."date"

推荐阅读