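sql - BigQuery - Get aggregated results for every group, even when some groups have no members
Problem description
Using BigQuery, I want to group pages by title in a single query and compute various metrics per group. Because the title rules are not mutually exclusive (one title can belong to several groups), I did this: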
SELECT SUM(views) views, ..., title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`,
UNNEST([
CASE WHEN (title LIKE '%game%')
THEN 'games_group' END,
CASE WHEN (title LIKE '%sport%')
THEN 'sports_group' END,
CASE WHEN (title LIKE '%rarerare%')
THEN 'rare_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
GROUP BY title_group
I get:
views ... title_group
3414469869 ...
4355264 ... games_group
1361074 ... sports_group
What I would like to get is slightly different:
views ... title_group
4355264 ... games_group
1361074 ... sports_group
0 ... rare_group
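That is, I need a result row for every group, even when no title matches that group.
How can I modify my query to get this result?
Solution
The simplest way is to wrap the original query (without any changes) in an extra LEFT JOIN against the full list of group names, as in the example below: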
#standardSQL
SELECT IFNULL(views, 0) views, title_group
FROM (SELECT title_group FROM UNNEST(['games_group', 'sports_group', 'rare_group']) AS title_group)
LEFT JOIN (
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN UNNEST([
CASE WHEN (title LIKE '%game%') THEN 'games_group' END,
CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END,
CASE WHEN (title LIKE '%rarerare%') THEN 'rare_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
GROUP BY title_group
)
USING(title_group)
-- ORDER BY views DESC
with output
Row views title_group
1 4355264 games_group
2 1361074 sports_group
3 0 rare_group
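To see why the wrapper behaves this way, here is a minimal sketch of the same pattern run on a few inline rows instead of the public table. The names `sample_pageviews`, `grouped` and `expected_groups`, and the sample values, are made up for illustration; only the shape of the query comes from the answer above.

```sql
#standardSQL
WITH sample_pageviews AS (
  -- Hypothetical stand-in for the pageviews table (illustrative values only).
  SELECT 'video_game' AS title, 10 AS views UNION ALL
  SELECT 'sports_game', 5 UNION ALL   -- matches both '%game%' and '%sport%'
  SELECT 'main_page', 100             -- matches no pattern
),
grouped AS (
  -- Same shape as the original query: each title yields one array element
  -- per rule, and rules that do not match yield NULL.
  SELECT SUM(views) AS views, title_group
  FROM sample_pageviews
  CROSS JOIN UNNEST([
    CASE WHEN title LIKE '%game%' THEN 'games_group' END,
    CASE WHEN title LIKE '%sport%' THEN 'sports_group' END,
    CASE WHEN title LIKE '%rarerare%' THEN 'rare_group' END
  ]) AS title_group
  GROUP BY title_group
),
expected_groups AS (
  -- The complete list of groups that must appear in the result.
  SELECT title_group
  FROM UNNEST(['games_group', 'sports_group', 'rare_group']) AS title_group
)
SELECT IFNULL(views, 0) AS views, title_group
FROM expected_groups
LEFT JOIN grouped USING (title_group)
ORDER BY views DESC
```

With these three rows the inner query should yield 15 for games_group, 5 for sports_group, and one extra row whose title_group is NULL (the non-matching rules). The LEFT JOIN drops that NULL row, because NULL never satisfies the equality implied by USING, and it keeps rare_group, whose missing sum IFNULL turns into 0.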
If you are open to optimizing/refactoring the query above, consider the following version (it produces the same output):
#standardSQL
WITH title_groups AS (
SELECT title_group, pattern
FROM UNNEST([STRUCT<title_group STRING, pattern STRING>
('games_group', '%game%'), ('sports_group', '%sport%'), ('rare_group', '%rarerare%')
])
)
SELECT IFNULL(views, 0) views, title_group FROM title_groups
LEFT JOIN (
SELECT SUM(CASE WHEN title LIKE pattern THEN views END) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN title_groups
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
GROUP BY title_group
)
USING(title_group)
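As a possible further simplification (not part of the answer above, so treat it as a sketch under assumptions): once the rules live in `title_groups`, the CROSS JOIN already pairs every page with every group, so each group gets a row as long as at least one page survives the WHERE filter. Under that assumption the outer LEFT JOIN can be dropped. The inline `sample_pageviews` rows below are hypothetical and stand in for the real table.

```sql
#standardSQL
WITH sample_pageviews AS (
  -- Hypothetical stand-in for the pageviews table (illustrative values only).
  SELECT 'video_game' AS title, 10 AS views UNION ALL
  SELECT 'sports_game', 5 UNION ALL
  SELECT 'main_page', 100
),
title_groups AS (
  SELECT title_group, pattern
  FROM UNNEST([STRUCT<title_group STRING, pattern STRING>
    ('games_group', '%game%'), ('sports_group', '%sport%'), ('rare_group', '%rarerare%')
  ])
)
-- Every (page, group) pair is present, so pages that do not match a group
-- simply contribute 0 to that group's sum.
SELECT SUM(IF(title LIKE pattern, views, 0)) AS views, title_group
FROM sample_pageviews
CROSS JOIN title_groups
GROUP BY title_group
ORDER BY views DESC
```

The trade-off is that if the filtered table is empty, this variant returns no rows at all, whereas the LEFT JOIN version still returns one zero row per group. The CROSS JOIN also multiplies the intermediate row count by the number of groups, which may matter when there are many rules.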