Folding large BigQuery results

Question

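Is there a simple way to perform an OCaml-like fold_left over the results of a BigQuery query, where each iteration corresponds to one row of the result?

Which product or approach would be the simplest way to do this?

Since I don't know which product or language would work, I can't be more specific, but the pseudocode would look like this: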

let my_init = []
let my_folder = fun state row ->
  // append for now, but it will be complicated. I need to do some set operations here. The point is that I need some way of transferring "state" across rows, when I iterate over rows in a predefined order.
  row.col1 :: state

let query = "SELECT col1, col2, col3 FROM table1 ORDER BY timestamp"
query |> List.fold my_folder my_init

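What I want to get out of this simplified example is the final "state".

--- Update ---

There is no limit on the number of rows: the more data we receive, the more rows there are. Typically the count exceeds a few million, but it can be larger.

Here is a simplified example that shows the main problem I am facing. We have a table with a few columns (timestamp, user_id and operation_json). For example, the following are valid rows: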

----------+---------+----------------------------------------------
timestamp | user_id | operation_json
----------+---------+----------------------------------------------
1         | id1     | [ { "op": "add", "set": "set1" } ]
2         | id2     | [ { "op": "add", "set": "set1" } ]
3         | id1     | [ { "op": "add", "set": "set2" } ]
4         | id3     | [ { "op": "add", "set": "set2" } ]
5         | id1     | [ { "op": "remove", "set": "set1" } ]
----------+---------+----------------------------------------------

As a result, I want to obtain the set of users that belong to each set, i.e.,

set1 |-> { id2 }
set2 |-> { id1, id3 }

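I think a fold_left-like operation would be convenient here. The state would be a map from each set name to a set of user IDs, and the initial state would be the empty map.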
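To make the desired fold concrete, here is a minimal JavaScript sketch of it run over the five sample rows held in memory. The names `rows`, `applyRow` and `finalState` are illustrative and not part of the question, and the sketch assumes the rows are already ordered by timestamp and that `operation_json` is a JSON array of `{op, set}` objects as in the sample.

```javascript
// The sample rows from the table above, already ordered by timestamp.
const rows = [
  { timestamp: 1, user_id: "id1", operation_json: '[{"op":"add","set":"set1"}]' },
  { timestamp: 2, user_id: "id2", operation_json: '[{"op":"add","set":"set1"}]' },
  { timestamp: 3, user_id: "id1", operation_json: '[{"op":"add","set":"set2"}]' },
  { timestamp: 4, user_id: "id3", operation_json: '[{"op":"add","set":"set2"}]' },
  { timestamp: 5, user_id: "id1", operation_json: '[{"op":"remove","set":"set1"}]' },
];

// Folder: (state, row) -> state, where state is a Map from set name to Set of user IDs.
function applyRow(state, row) {
  for (const { op, set } of JSON.parse(row.operation_json)) {
    if (!state.has(set)) state.set(set, new Set());
    if (op === "add") state.get(set).add(row.user_id);
    else if (op === "remove") state.get(set).delete(row.user_id);
  }
  return state;
}

// fold_left: start from the empty map and thread the state through every row.
const finalState = rows.reduce(applyRow, new Map());

for (const [set, users] of finalState) {
  console.log(set, "|->", [...users]);
}
```

On the sample rows this leaves id2 in set1 and id1, id3 in set2, which is the result described above.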

Tags: sql, google-bigquery

Solution

Below is a [quick and simple] example for BigQuery Standard SQL

#standardSQL
-- fold: sums the array elements in JavaScript, starting from init
CREATE TEMP FUNCTION fold(arr ARRAY<INT64>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  // parseInt is used in case the values arrive as strings
  const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue);
  // seed the fold with init instead of a hard-coded 5
  return arr.reduce(reducer, parseInt(init));
""";
WITH `project.dataset.table` AS (
  SELECT 1 id, [1, 2, 3, 4] arr, 5 initial_state UNION ALL
  SELECT 2, [1, 2, 3, 4, 5, 6, 7], 10
)
SELECT id, fold(arr, initial_state) result
FROM `project.dataset.table`

The output is

Row id  result
1   1   15.0
2   2   38.0

I think this is self-explanatory.
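Since the UDF body is plain JavaScript, the reducer can also be tried outside BigQuery. The sketch below is a stand-alone stand-in for the UDF, not BigQuery code. It assumes that the array values may reach the UDF as strings, which would explain the `parseInt` call; that assumption about how INT64 values are passed is not verified here. It uses the array and the initial value 5 from the first row of the example.

```javascript
// Same reducer as in the UDF body; parseInt accepts numbers and numeric strings alike.
const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue, 10);

// Stand-in for the UDF: fold the array, starting from the given initial value.
function fold(arr, init) {
  return arr.reduce(reducer, parseInt(init, 10));
}

const asNumbers = fold([1, 2, 3, 4], 5);
const asStrings = fold(["1", "2", "3", "4"], "5");

// Without parseInt, + turns into string concatenation once a string is involved.
const concatenated = ["1", "2", "3", "4"].reduce((acc, cur) => acc + cur, 5);

console.log(asNumbers, asStrings, concatenated); // 15 15 51234
```

Both calls to `fold` give 15, matching the first row of the output, while the version without `parseInt` produces the string "51234".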

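See more about JS UDFs

Folding a list of rows

The example below extends the one above. Here you assemble an array from the rows of the result before applying the fold function (of course, you need to keep in mind the limits that apply to UDFs, how large a row holding your ARRAY can be, and so on).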

#standardSQL
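-- fold: sums the array elements in JavaScript, starting from init
CREATE TEMP FUNCTION fold(arr ARRAY<INT64>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  // parseInt is used in case the values arrive as strings
  const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue);
  // seed the fold with init instead of a hard-coded 5
  return arr.reduce(reducer, parseInt(init));
""";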
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 item UNION ALL
  SELECT 1, 2 UNION ALL 
  SELECT 1, 3 UNION ALL 
  SELECT 1, 4 UNION ALL 
  SELECT 2, 1 UNION ALL 
  SELECT 2, 2 UNION ALL 
  SELECT 2, 3 UNION ALL 
  SELECT 2, 4 UNION ALL 
  SELECT 2, 5 UNION ALL 
  SELECT 2, 6 UNION ALL 
  SELECT 2, 7 
)
SELECT id, fold(ARRAY_AGG(item), 5) result
FROM `project.dataset.table`  
GROUP BY id
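If the aggregated array for the millions of rows mentioned in the question would exceed those limits, one alternative is to fold on the client while the rows are being read, so that only the state is kept in memory. This is not part of the original answer; it is a sketch in which `fetchRows` is a hypothetical stand-in for whatever client call yields the rows in the required order.

```javascript
// Hypothetical stand-in for a paged or streamed query result: yields rows one
// at a time instead of building one large array.
function* fetchRows() {
  for (let item = 1; item <= 7; item++) {
    yield { id: 2, item };
  }
}

// fold_left over any iterable: the state is threaded from one row to the next.
function foldRows(rows, folder, init) {
  let state = init;
  for (const row of rows) {
    state = folder(state, row);
  }
  return state;
}

const total = foldRows(fetchRows(), (state, row) => state + row.item, 5);
console.log(total); // 33
```

The folder passed to `foldRows` has the same `(state, row)` shape as `my_folder` in the question's pseudocode, so a more complicated state such as a map of sets fits the same way.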

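Note that if you need more than one field from each row, you can use an ARRAY of STRUCTs, as in the example below

ARRAY_AGG(STRUCT(id, item) ORDER BY id)

Of course, you then need to adjust the signature of the fold UDF accordingly

For example: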

#standardSQL
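-- fold: sums the item field of each struct in JavaScript, starting from init
CREATE TEMP FUNCTION fold(arr ARRAY<STRUCT<id INT64, item INT64>>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  // each array element is an object; read its item field
  const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue.item);
  // seed the fold with init instead of a hard-coded 5
  return arr.reduce(reducer, parseInt(init));
""";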
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 item UNION ALL
  SELECT 1, 2 UNION ALL 
  SELECT 1, 3 UNION ALL 
  SELECT 1, 4 UNION ALL 
  SELECT 2, 1 UNION ALL 
  SELECT 2, 2 UNION ALL 
  SELECT 2, 3 UNION ALL 
  SELECT 2, 4 UNION ALL 
  SELECT 2, 5 UNION ALL 
  SELECT 2, 6 UNION ALL 
  SELECT 2, 7 
)
SELECT id, fold(ARRAY_AGG(t), 5) result
FROM `project.dataset.table` t 
GROUP BY id
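The query above can be mimicked in plain JavaScript to check what the struct-based reducer computes per group. In this sketch, `table`, `groups` and `results` are illustrative names, and it assumes that each STRUCT reaches the UDF as an object with `id` and `item` properties, as the UDF body's use of `currentValue.item` implies.

```javascript
// The sample rows of the WITH clause above.
const table = [
  { id: 1, item: 1 }, { id: 1, item: 2 }, { id: 1, item: 3 }, { id: 1, item: 4 },
  { id: 2, item: 1 }, { id: 2, item: 2 }, { id: 2, item: 3 }, { id: 2, item: 4 },
  { id: 2, item: 5 }, { id: 2, item: 6 }, { id: 2, item: 7 },
];

// Same shape as the UDF body: read one field of each struct and add it up.
function fold(arr, init) {
  const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue.item, 10);
  return arr.reduce(reducer, parseInt(init, 10));
}

// Mimic GROUP BY id with ARRAY_AGG(t): collect the rows of each id into an array.
const groups = new Map();
for (const row of table) {
  if (!groups.has(row.id)) groups.set(row.id, []);
  groups.get(row.id).push(row);
}

const results = [...groups].map(([id, arr]) => ({ id, result: fold(arr, 5) }));
console.log(results);
```

With the initial value 5 this gives 15 for id 1 and 33 for id 2. Because every field of the row is available to the reducer, the same pattern can carry richer state than a running sum.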
