SQL join on timestamp differences from the same table

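Problem description

I'm not sure how to write this SQL query in BigQuery. I have an events table with a name and a timestamp. Say the table contains only two kinds of events: A and B. What I want is to query the table for all instances of event A, find the next closest B for each one, and create a new column with the time difference. B will always happen after A.

For example, if I have a table that looks like this: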

A1 | 1:00 pm
B5 | 2:00 pm
A3 | 3:00 pm
B9 | 5:00 pm

My result table would be:

A1 | 1 hour
A3 | 2 hours

The query I came up with is:

SELECT
CAST(TIMESTAMP_DIFF((SELECT MIN(sub.time)
FROM table sub
WHERE sub.time > main.time), main.time, SECOND) AS INT64) duration 
FROM table main

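This works fine to get the table I want, but I'd also like to include an additional column from the subquery, so that the result looks like: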

A1 | 1 hour  | B5Column
A3 | 2 hours | B9Column

I tried the following query:

SELECT
(SELECT
 sub.SubQueryColumn
 FROM table sub
 WHERE sub.time > main.time
 ORDER BY sub.time asc
 LIMIT 1) SubColumn,
CAST(TIMESTAMP_DIFF((SELECT MIN(sub.time)
FROM table sub
WHERE sub.time > main.time), main.time, SECOND) AS INT64) duration 
FROM table main

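But it doesn't work. The error I get is:

Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

Can I get some help?

Tags: sql, google-bigquery

Solution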


Below is for BigQuery Standard SQL:

#standardSQL
SELECT event, TIMESTAMP_DIFF(b_time, time, SECOND) duration, b_event
  FROM (
  SELECT event, time,
    LEAD(time) OVER(PARTITION BY grp ORDER BY time) b_time,
    LEAD(event) OVER(PARTITION BY grp ORDER BY time) b_event
  FROM (
    SELECT *, 
      COUNTIF(STARTS_WITH(event, 'A')) OVER(ORDER BY time) grp
    FROM `project.dataset.your_table` t
  )
)
WHERE STARTS_WITH(event, 'A')
-- ORDER BY time

You can test the full query with the dummy data from your question, like this:

#standardSQL
WITH `project.dataset.your_table` AS (
  SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
  SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
  SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
  SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00' 
)
SELECT event, TIMESTAMP_DIFF(b_time, time, SECOND) duration, b_event
  FROM (
  SELECT event, time,
    LEAD(time) OVER(PARTITION BY grp ORDER BY time) b_time,
    LEAD(event) OVER(PARTITION BY grp ORDER BY time) b_event
  FROM (
    SELECT *, 
      COUNTIF(STARTS_WITH(event, 'A')) OVER(ORDER BY time) grp
    FROM `project.dataset.your_table` t
  )
)
WHERE STARTS_WITH(event, 'A')
ORDER BY time   

The result is:

Row event   duration    b_event  
1   A1      3600        B5   
2   A3      7200        B9   
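To see why this pairs each A with its B, it can help to look at the intermediate grp column on its own. The sketch below runs only the innermost SELECT against the same dummy data; the alias and the trailing comments are added here for illustration, and the grp values in the comments follow from counting the A rows up to each timestamp.

```sql
#standardSQL
WITH `project.dataset.your_table` AS (
  SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
  SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
  SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
  SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00'
)
SELECT event, time,
  -- Running count of A events seen so far, including the current row
  COUNTIF(STARTS_WITH(event, 'A')) OVER(ORDER BY time) AS grp
FROM `project.dataset.your_table`
ORDER BY time
-- event  grp
-- A1     1
-- B5     1
-- A3     2
-- B9     2
```

Every A starts a new group and the B that follows it lands in the same group, so LEAD(...) OVER(PARTITION BY grp ORDER BY time) evaluated on an A row returns the values of that B.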

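Please note: the solution above relies on the statement in your question that "B will always happen after A", and so expects each A to be followed directly by its B. If instead you have a sequence like the one below, where two A events occur before the next B: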

WITH `project.dataset.your_table` AS (
  SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
  SELECT 'A2', TIMESTAMP '2018-01-01 1:30:00' UNION ALL
  SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
  SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
  SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00' 
)  

the result will be:

Row event   duration    b_event  
1   A1      null        null     
2   A2      1800        B5   
3   A3      7200        B9     

If you need to handle this case, try the following:

#standardSQL
WITH `project.dataset.your_table` AS (
  SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
  SELECT 'A2', TIMESTAMP '2018-01-01 1:30:00' UNION ALL
  SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
  SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
  SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00' 
)
SELECT event, TIMESTAMP_DIFF(b_time, time, SECOND) duration, b_event
FROM (
  SELECT event, time, type, grp,
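    -- Rows that tie on grp have no guaranteed order inside the frame, so only B rows are kept as candidates
    FIRST_VALUE(IF(type = 'B', event, NULL) IGNORE NULLS) OVER(ORDER BY grp RANGE BETWEEN 1 FOLLOWING AND 1 FOLLOWING) b_event,
    FIRST_VALUE(IF(type = 'B', time, NULL) IGNORE NULLS) OVER(ORDER BY grp RANGE BETWEEN 1 FOLLOWING AND 1 FOLLOWING) b_time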
  FROM (
    SELECT event, time, SUBSTR(event, 1, 1) type,
      COUNTIF(STARTS_WITH(event, 'B')) OVER(ORDER BY time) grp
    FROM `project.dataset.your_table` t
  )
)
WHERE STARTS_WITH(event, 'A')
ORDER BY time  

This version returns:

Row event   duration    b_event  
1   A1      3600        B5   
2   A2      1800        B5   
3   A3      7200        B9     
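As a rough illustration of how this second version finds the matching B, the sketch below shows only the intermediate grp column for the five-row dummy data. The alias and the trailing comments are added here for illustration; the grp values follow from counting the B rows up to each timestamp.

```sql
#standardSQL
WITH `project.dataset.your_table` AS (
  SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
  SELECT 'A2', TIMESTAMP '2018-01-01 1:30:00' UNION ALL
  SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
  SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
  SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00'
)
SELECT event, time,
  -- Running count of B events seen so far, including the current row
  COUNTIF(STARTS_WITH(event, 'B')) OVER(ORDER BY time) AS grp
FROM `project.dataset.your_table`
ORDER BY time
-- event  grp
-- A1     0
-- A2     0
-- B5     1
-- A3     1
-- B9     2
```

Here every B starts a new group, so all A rows waiting for the same B share one grp value, and that B sits in the group numbered grp + 1. The frame RANGE BETWEEN 1 FOLLOWING AND 1 FOLLOWING over ORDER BY grp restricts the window to exactly the rows whose grp is one higher, which is where FIRST_VALUE picks up the B's event and time. Note that the window has no PARTITION BY, so on a large table this may be processed as a single partition.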
