首页 > 解决方案 > 查询超出 BigQuery GROUP BY 中关于大数据的资源限制

问题描述

我有两张桌子 -

StringList只有一列

SearchString
abc
def
ghi
lmn

我有另一个表格短语,一列

Phrase
xyzabcefg
ddflkd
fdjkd
abcdefghi
extreme

我需要一个包含两列的输出表 -

搜索字符串 短语

搜索字符串将是第一个表中的搜索字符串,短语将是第二个表中的所有数据,其中搜索字符串出现在短语中

像下面的东西 -

SearchString Phrases
abc   [xyzabcefg, abcdefghi]
def   [abcdefghi]
ghi   [abcdefghi]

我能够得到结果,但查询根本没有优化,我在第二张表中有大约 240 万条记录,在第一张表中有 7.5K 条记录。

我需要将此查询扩展到第二个表中的 240M 记录,但此查询现在即使在 1.5M 记录上也不起作用。

给出以下错误 -

查询超出资源限制。使用了 13256.532365742061 CPU 秒,并且此查询必须使用少于 9200.0 CPU 秒。

我尝试了各种方法,但没有成功。我目前的查询是 -

With first_batch as (
SELECT
a.Phrase,
b.SearchString
FROM `project-id.DS_WORK.Phrase`  a
join `project-id.DS_WORK.StringList` b
ON a.Phrase LIKE '%' || b.SearchString || '%' )
SELECT
first_batch.SearchString,
'[' || STRING_AGG(first_batch.Phrase) || ']' AS Phrases
FROM first_batch
GROUP BY first_batch.SearchString;


任何帮助将不胜感激。

我正在考虑在用户定义的函数中使用循环,但我首先想检查是否有更简单的方法来实现这一点。

在此处输入图像描述

标签: google-bigqueryquery-optimization

解决方案


在 BigQuery 中,您应该避免group by在大数据上使用。原因是数据是在小的分布式节点上处理的。group by 或 order by 等功能需要一个节点上的所有数据,这是不可能的。

尝试改用窗口函数。我会重写你的查询

With first_batch as (
SELECT
a.Phrase,
b.SearchString
FROM `project-id.DS_WORK.Phrase`  a
join `project-id.DS_WORK.StringList` b
ON a.Phrase LIKE '%' || b.SearchString || '%' )
SELECT 
a.SearchString,
'[' || a.Phrases || ']' AS Phrases
FROM ( SELECT 
first_batch.SearchString as SearchString,
ROW_NUMBER() OVER(PARTITION BY first_batch.Phrase) AS rn,
STRING_AGG(first_batch.Phrase)
  OVER (
    PARTITION BY first_batch.Phrase
  ) AS Phrases
FROM first_batch ) AS a
WHERE a.rn = 1

如果您需要尝试使用真实数据进行简单查询,请使用以下命令:

#standardSQL
WITH `table` AS (
  SELECT '1' AS UserId, 'A' AS Name UNION ALL
  SELECT '1' AS UserId, 'B' AS Name UNION ALL
  SELECT '1' AS UserId, 'C' AS Name UNION ALL
  SELECT '2' AS UserId, 'A' AS Name UNION ALL
  SELECT '3' AS UserId, 'B' AS Name
)
SELECT a.UserId, a.Name FROM (SELECT
UserId,
ROW_NUMBER() OVER(PARTITION BY UserId) AS rn,
STRING_AGG(Name)
  OVER (
    PARTITION BY UserId
  ) AS Name
from `table`) as a
where a.rn=1

如果窗口函数没有帮助,您可以使用不同的嵌套选择来编写类似于以下的查询

#standardSQL
WITH `table` AS (
  SELECT '1' AS UserId, 'A' AS Name UNION ALL
  SELECT '1' AS UserId, 'B' AS Name UNION ALL
  SELECT '1' AS UserId, 'C' AS Name UNION ALL
  SELECT '2' AS UserId, 'A' AS Name UNION ALL
  SELECT '3' AS UserId, 'B' AS Name
)
SELECT a.UserId,
(SELECT STRING_AGG(Name) from `table` as t where a.UserId = t.UserId) as Names
FROM (SELECT
DISTINCT UserId,
from `table`) as a

推荐阅读