首页 > 解决方案 > 如何在 BigQuery 中连续四个星期循环过去六个月的数据

问题描述

我在大查询中有表,例如如下所示。(按日期分区)。我必须为下面详细提到的这个问题编写标准的 sql 查询。

student_id     date      duration(in hours)
  1          2020-05-10   7             
  2          2020-05-10   8
  3          2020-05-10   8
  1          2020-05-11   8
  2          2020-05-11   7
  3          2020-05-12   6

这是我们几乎每天都在添加数据的表格,因此数据将增长得非常快。我必须找到在过去六个月中连续 4 周出勤时间超过 7 小时的学生 ID(每天检查,最近几个月本周将增加 1 周。)并将学生类型转换为好学生。例如在编程语言中。

for(start week->1 - end_week-> 4 till last six months):
      if duration >=7 for date
        boolean true
      start_week = 2 //start week is incremented by 1 week for next loop
      end_week = 5

对于任何学生,如果过去六个月的数据持续时间连续 4 周超过 7 小时,那么他就是好学生。这对我来说看起来非常具有挑战性,因为我在 bigquery 和 mysql 方面处于平均水平。我不知道如何实现这一点。

标签: sqlgoogle-bigquery

解决方案


这是您的用例示例

# Only for initiate the test with your data
with sample as (
  select 1 as ID,  DATE("2020-05-10") as d, 7 as hour
  union all             
  select 2 as ID,  DATE("2020-05-10") as d, 8 as hour
  union all
  select 3 as ID,  DATE("2020-05-10") as d, 8 as hour
  union all
  select 1 as ID,  DATE("2020-05-11") as d, 8 as hour
  union all
  select 2 as ID,  DATE("2020-05-11") as d, 7 as hour
  union all
  select 3 as ID,  DATE("2020-05-12") as d, 6 as hour
), 
# Create an array of date to take into account the missing days (important for the sum over the 28 previous days)
date_array as (
  select  dd from UNNEST(GENERATE_DATE_ARRAY('2020-05-10', '2020-05-15', INTERVAL 1 DAY)) dd
), 
# Product of existing IDs and possible date on the range
data_grid as (
  select distinct ID, dd from sample, date_array
), 
# Perform a right outer join to add missing date to the logs that you have in your sample data
merged_data as (
select data_grid.ID,d,hour,dd from sample RIGHT outer join data_grid on sample.d = data_grid.dd and sample.ID = data_grid.ID
)
# Sum per ID the 27 previous day in sliding windows (every day, the day and the last 27 are added)
select ID,dd, SUM(hour)
  OVER (
    PARTITION BY ID
    ORDER BY dd
    ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
  ) AS total_purchases
  from merged_data 

推荐阅读