sql - 如何确保每个日期和部署都存在并且窗口功能不会因为数据不存在而跳过几天?
问题描述
在 Athena 工作,我有一个查询,该查询生成一个表,其中包含针对日期和部署的每次迭代的多个度量值。
WITH
core AS (
SELECT /* create dataset for window functions to work on */
"deployment"
, "macaddress"
, "date"
FROM
(
SELECT /* de-dupe the base dataset based on a macaddress being valid once per day per deployment */
"_deployment" "deployment"
, "macaddress"
, "date"("timestamp") "date"
, "row_number"() OVER (PARTITION BY "_deployment", "macaddress" ORDER BY "date"("timestamp") ASC) "rn"
FROM
registration
)
WHERE (CAST("rn" AS varchar(2)) = '1')
)
SELECT /* window functions to aggregate */
"count"("macaddress") "registrations"
, "deployment"
, "date"
, "lag"("count"("macaddress"), 1) OVER (PARTITION BY "deployment" ORDER BY "date" ASC) "yesterdayCount"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) "Past7Days"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING) "7daysTarget"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) "Past30Days"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING) "30daysTarget"
, "avg"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) "10DayTrend"
FROM
core
GROUP BY "deployment", "date"
我遇到的问题与并非每个部署都有每个日期的数据这一事实有关,因此窗口函数没有按预期方式工作,它们对前一个 7 求和rows
,而不是前一个 7 days
。
我所追求的输出是一个表格,date
在数据集中最早和最新之间有一行(最新的date
将继续移动,表格每晚更新)并且每个都deployment
存在于数据集中,数据集中是否有数据或不。
因此,如果部署 1 有以下几天的数据:
2020-04-1
2020-04-2
2020-04-3
2020-04-5
2020-04-6
2020-04-7
2020-04-8
我希望该Past7Days
字段为to2020-04-8
的总和。目前,结果是 to 的总和,因为它正在回顾行,而不是日期,并且丢失了。2020-04-2
2020-04-08
2020-04-01
2020-04-08
2020-04-4
决赛桌需要采用以下格式:
registrations | deployment | date | yesterdayCount | Past7Days | 7daysTarget | Past30Days | 30daysTarget | 10DayTrend
并且在日期或部署方面没有差距。
我的工作理论是,我需要在最初的最里面的 SELECT 语句(重复数据删除语句)上建立它。我有一个日历表并且一直在加入它,但是我被deployment
日历中没有字段的事实所困扰,因此加入是不完整的,仅包含date
. 如果可能的话,我想避免每次部署都有一个日历表。
预先感谢您的任何帮助。
解决方案
只需通过将部署与日历表交叉连接来创建部署日历。就像是
WITH
core AS (
SELECT /* create dataset for window functions to work on */
"deployment"
, "macaddress"
, "date"
FROM
(
SELECT /* de-dupe the base dataset based on a macaddress being valid once per day per deployment */
"_deployment" "deployment"
, "macaddress"
, "date"("timestamp") "date"
, "row_number"() OVER (PARTITION BY "_deployment", "macaddress" ORDER BY "date"("timestamp") ASC) "rn"
FROM
registration
)
WHERE (CAST("rn" AS varchar(2)) = '1')
) ,
deploymentcalendar as (
select t.deployment,ct.date
from (select distinct deploymet from core) t
cross join calendar_table ct
)
SELECT /* window functions to aggregate */
"count"("macaddress") "registrations"
, "deployment"
, "date"
, "lag"("count"("macaddress"), 1) OVER (PARTITION BY dc."deployment" ORDER BY "date" ASC) "yesterdayCount"
, "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) "Past7Days"
, "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING) "7daysTarget"
, "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) "Past30Days"
, "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING) "30daysTarget"
, "avg"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) "10DayTrend"
FROM
core
RIGHT JOIN deploymentcalendar dc on dc.deployment=core.deployment and dc.date=core.date
GROUP BY core."deployment", core."date"
推荐阅读
- batch-file - 如何将文件打包成同名但后缀不同的 ZIP 文件?
- reactjs - 如何在 React js 中使用 mxgraph 创建调整大小的元素
- google-apps-script - 在使用同一列中的最新内容进行更新之前,如何使用 Google Apps 脚本仅清除“A 列”
- spring - 异常打开套接字 - MongoDB,Docker
- ruby-on-rails - Rails:添加帖子而不刷新页面
- rust - 排序向量
> 按第一个元素升序排列,当第一个元素相等时按第二个元素降序排列 - sql-server - 如何在 CREATE TABLE 语句中自动计算部门的当前员工数?
- python - 更新值时是否可以使用python dict理解?
- java - 如何压缩同一目录中的多个文件夹?
- r - 在 R 中打开一个 4 GB 的 geojson