Grouping a chain of actions by 1-minute intervals in SQL (BigQuery)

Problem description

I need to group my data into 1-minute intervals in order to process a chain of actions. My data looks like this:

id    MetroId   Time                  ActionName   refererurl
111   a         2020-09-01-09:19:00   First        www.stackoverflow/a12345
111   b         2020-09-01-12:36:54   First        www.stackoverflow/a12345
111   f         2020-09-01-12:36:56   First        www.stackoverflow/xxxx
111   b         2020-09-01-12:36:58   Midpoint     www.stackoverflow/a12345
111   f         2020-09-01-12:37:01   Midpoint     www.stackoverflow/xxx
111   b         2020-09-01-12:37:03   Third        www.stackoverflow/a12345
111   b         2020-09-01-12:37:09   Complete     www.stackoverflow/a12345
222   d         2020-09-01-15:17:44   First        www.stackoverflow/a2222
222   d         2020-09-01-15:17:48   Midpoint     www.stackoverflow/a2222
222   d         2020-09-01-15:18:05   Third        www.stackoverflow/a2222

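I need to fetch the data under the following condition: for a given id (x_id) and URL (x_url), if there is a row whose action_name is Complete, take that row. If there is no Complete, take Third, and so on down the chain.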

SELECT
  ARRAY_AGG(current_query_result 
    ORDER BY CASE ActionName
      WHEN 'Complete' THEN 1
      WHEN 'Third' THEN 2
      WHEN 'Midpoint' THEN 3
      WHEN 'First' THEN 4
    END
    LIMIT 1
  )[OFFSET(0)]
FROM
    (
        SELECT d.id, c.Time, c.ActionName, c.refererurl, c.MetroId
        FROM
            `bq_query_table_c` c
            INNER JOIN `bq_table_d` d ON d.id = c.CreativeId
        WHERE
            c.refererurl LIKE "https://www.stackoverflow/%"
            AND c.ActionName in ('First', 'Midpoint', 'Third', 'Complete')
    ) current_query_result
GROUP BY
    id,
    refererurl,
    MetroId,
    TIMESTAMP_SUB(
      PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time),
      INTERVAL MOD(UNIX_SECONDS(PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time)), 1 * 60) SECOND
    )

Expected output:

id    MetroId   Time                  ActionName   refererurl
111   a         2020-09-01-09:19:00   First        www.stackoverflow/a12345
111   f         2020-09-01-12:37:01   Midpoint     www.stackoverflow/xxx
111   b         2020-09-01-12:37:09   Complete     www.stackoverflow/a12345
222   c         2020-09-01-15:18:05   Third        www.stackoverflow/a2222

Tags: sql, google-bigquery, greatest-n-per-group, window-functions, gaps-and-islands

Solution

The following is for BigQuery Standard SQL:

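#standardSQL
WITH temp AS (
  -- Parse the string column into a TIMESTAMP once, for reuse below.
  SELECT *, PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time) ts
  FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts, time_lag) FROM (
  SELECT *,
    -- Seconds until the next remaining row for the same id (NULL for the last row).
    TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, SECOND) time_lag
  FROM (
    -- Per (id, url, minute), keep only the row with the most advanced action.
    SELECT AS VALUE ARRAY_AGG(t
        ORDER BY STRPOS('First,Midpoint,Third,Complete', action_name) DESC
        LIMIT 1
      )[OFFSET(0)]
    FROM temp t
    WHERE action_name IN ('First', 'Midpoint', 'Third', 'Complete')
    GROUP BY id, url,
      TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND)
  )
)
-- Drop rows that are followed by another row less than 60 seconds later.
WHERE NOT IFNULL(time_lag, 777) < 60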
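
The solution query relies on two small expressions: one that rounds a timestamp down to the start of its minute, and one that turns an action name into a sortable rank. The sketch below evaluates both on three hand-written rows so they can be inspected in isolation. The inline rows and the column aliases (minute_bucket, minute_bucket_trunc, priority) are illustrative assumptions, not part of the original answer. TIMESTAMP_TRUNC(ts, MINUTE) is shown only as a possibly more readable equivalent for timestamps like these.

```sql
#standardSQL
-- Illustrative check of the bucketing and ranking expressions (hypothetical inline rows).
SELECT
  ts,
  -- Subtract the seconds past the minute to get the start of the 1-minute bucket.
  TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND) AS minute_bucket,
  -- Built-in truncation; should give the same bucket for these rows.
  TIMESTAMP_TRUNC(ts, MINUTE) AS minute_bucket_trunc,
  action_name,
  -- Position of the name in the ordered list: a later position means a more advanced action.
  STRPOS('First,Midpoint,Third,Complete', action_name) AS priority
FROM UNNEST([
  STRUCT(TIMESTAMP '2020-09-01 12:36:58' AS ts, 'Midpoint' AS action_name),
  STRUCT(TIMESTAMP '2020-09-01 12:37:03', 'Third'),
  STRUCT(TIMESTAMP '2020-09-01 12:37:09', 'Complete')
])
ORDER BY ts
```

Because STRPOS returns 1 for First, 7 for Midpoint, 16 for Third and 22 for Complete, sorting by it in descending order puts the most advanced action first. That is the row ARRAY_AGG(... LIMIT 1)[OFFSET(0)] keeps.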

You can test the solution query using the sample data from your question, as in the following example:

#standardSQL
WITH `project.dataset.bq_table` AS (
  SELECT 111 id, '2020-09-01-09:19:00' time, 'First' action_name, 'www.stackoverflow/a12345' url UNION ALL
  SELECT 111, '2020-09-01-12:36:54', 'First', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:36:58', 'Midpoint', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:37:03', 'Third', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:37:09', 'Complete', 'www.stackoverflow/a12345' UNION ALL
  SELECT 222, '2020-09-01-15:17:44', 'First', 'www.stackoverflow/a2222' UNION ALL
  SELECT 222, '2020-09-01-15:17:48', 'Midpoint', 'www.stackoverflow/a2222' UNION ALL
  SELECT 222, '2020-09-01-15:18:05', 'Third', 'www.stackoverflow/a2222' 
), temp AS (
  SELECT *, PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time) ts
  FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts, time_lag) FROM (
  SELECT * ,
    TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, SECOND) time_lag
  FROM (
    SELECT 
      AS VALUE ARRAY_AGG(t 
        ORDER BY STRPOS('First,Midpoint,Third,Complete',action_name) DESC 
        LIMIT 1
      )[OFFSET(0)]
    FROM temp t
    WHERE action_name IN ('First', 'Midpoint', 'Third', 'Complete')
    GROUP BY id, url, 
      TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND
      )   
  )
)
WHERE NOT IFNULL(time_lag, 777) < 60   

Result

Row     id      time                    action_name     url  
1       111     2020-09-01-09:19:00     First           www.stackoverflow/a12345     
2       111     2020-09-01-12:37:09     Complete        www.stackoverflow/a12345     
3       222     2020-09-01-15:18:05     Third           www.stackoverflow/a2222    
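
To see why the Midpoint row at 12:36:58 does not appear in the result, it can help to look at time_lag before the final filter is applied. The sketch below is a reduced variant of the query, under stated assumptions: it uses only the four id 111 rows from 12:36 to 12:37, already-parsed timestamps, and no url column, and it leaves out the outer WHERE clause. The alias events is hypothetical.

```sql
#standardSQL
-- Reduced sketch: per-minute winners with time_lag left visible (no final filter).
WITH events AS (
  SELECT * FROM UNNEST([
    STRUCT(111 AS id, TIMESTAMP '2020-09-01 12:36:54' AS ts, 'First' AS action_name),
    STRUCT(111, TIMESTAMP '2020-09-01 12:36:58', 'Midpoint'),
    STRUCT(111, TIMESTAMP '2020-09-01 12:37:03', 'Third'),
    STRUCT(111, TIMESTAMP '2020-09-01 12:37:09', 'Complete')
  ])
)
SELECT
  *,
  -- Seconds until the next per-minute winner for the same id; NULL for the last one.
  TIMESTAMP_DIFF(LEAD(ts) OVER (PARTITION BY id ORDER BY ts), ts, SECOND) AS time_lag
FROM (
  -- One row per (id, minute): the most advanced action seen in that minute.
  SELECT AS VALUE ARRAY_AGG(e
      ORDER BY STRPOS('First,Midpoint,Third,Complete', action_name) DESC
      LIMIT 1
    )[OFFSET(0)]
  FROM events e
  GROUP BY id, TIMESTAMP_TRUNC(ts, MINUTE)
)
ORDER BY id, ts
```

On these rows the 12:36 bucket is won by Midpoint (12:36:58) and the 12:37 bucket by Complete (12:37:09). The Midpoint row is followed 11 seconds later, so WHERE NOT IFNULL(time_lag, 777) < 60 removes it. The Complete row has a NULL lag, which IFNULL replaces with 777, so it is kept.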

Note: I am still not 100% sure about your use case, but the above is based on what has been discussed in the comments so far.

