首页 > 解决方案 > BigQuery:连接来自不同表的最接近当前行时间的行的所有列

问题描述

我有一个 BigQuery 数据库,其中包含相当长的表,其中包含 IoT 事件和遥测数据,我需要在其中找到最接近每个事件的遥测数据点。

由于(据我了解与性能相关的)BigQuery 子查询的限制,到目前为止,我还没有设法获得结果。

我发现了另一个类似的问题(BigQuery: Join based on nearest timestamp),它有一个有希望的答案(https://stackoverflow.com/a/58172418/21217),但是将其调整到我的表中我得到了错误消息Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.,即使我试过了用一些 where 子句限制遥测查询。

我的查询目前如下所示:

select event, (
  select as struct *
  from master.telemetry as telemetry
  where EXTRACT(DATE FROM event.time) = EXTRACT(DATE FROM telemetry.time) and event.module_id = telemetry.module_id
  order by ABS(TIMESTAMP_DIFF(event.time, telemetry.time, SECOND))
  limit 1
) as closest_telemetry
from master.event as event

不太确定针对此特定用例还可以尝试什么。我只是想知道将基于时间的分区引入遥测表是否会有所帮助?

编辑:示例数据 JSON

事件:

{
  "id": "33373",
  "module_id": "module-standard-7",
  "time": "2019-11-05 15:09:31.650209 UTC",
  "state": "fault"
},

遥测:

{
  "module_id": "module-standard-7",
  "latitude": "50.721649169921875",
  "longitude": "-1.8665790557861328",
  "time": "2019-12-07 16:37:35.721218 UTC",
  ... (many other columns which are not relevant in this case)
},

所需的输出(遥测合并到事件中):

{
  "id": "33373",
  "module_id": "module-standard-7",
  "time": "2019-11-05 15:09:31.650209 UTC",
  "state": "fault"
  "telemetry.latitude": "50.721649169921875",
  "telemetry.longitude": "-1.8665790557861328",
  "telemetry.time": "2019-12-07 16:37:35.721218 UTC",
},

如果遥测字段不嵌套会更好,但希望将其时间保留为单独的列以了解它们的距离。

标签: sqlgoogle-bigquery

解决方案


这是一个替代答案 - 在这种情况下,我创建了物理表以确保当表不是临时的WITH语句时运行:

CREATE TABLE temp.a AS (
  SELECT * FROM UNNEST(
    [STRUCT(TIMESTAMP('2018-01-02 20:01:00') AS time, 'monkey' AS animal)
     ,STRUCT('2018-03-04 10:10:10', 'lion')
     ,STRUCT('2018-07-04 10:10:10', 'donkey')
    ]) 
)
;
CREATE TABLE temp.b AS (
  SELECT * FROM UNNEST(
    [STRUCT(TIMESTAMP('2017-01-02 10:01:00') AS time, 'one' AS festival)
     ,STRUCT('2019-03-04 10:10:10', 'two')
     ,STRUCT('2018-07-04 10:10:10', 'three')
     ,STRUCT('2018-03-05 10:10:10', 'four')
    ]) 
)
;


SELECT b.*,
  ARRAY_AGG(a 
   ORDER BY ABS(TIMESTAMP_DIFF(b.time, a.time, SECOND))
   LIMIT 1)[OFFSET(0)] closest
FROM temp.b
JOIN temp.a 
ON ABS(timestamp_diff(b.time,a.time, DAY)) < 360*2
GROUP BY 1, 2

该子句ON ABS(timestamp_diff(b.time,a.time, DAY)) < 360*2是一个健全性检查,以避免爆炸性连接。360*2如果您的数据比此处提供的更密集,请降低限制。

在此处输入图像描述


推荐阅读