BigQuery: get aggregate results for every group, even when some groups have no members

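Problem description

Using BigQuery, I want a single query that groups pages by title and computes several metrics per group. Because the title rules are not mutually exclusive (one title can match more than one group), I did this: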

SELECT SUM(views) views, ..., title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`, 
      UNNEST([
                CASE WHEN (title LIKE '%game%') 
                THEN 'games_group' END, 
                CASE WHEN (title LIKE '%sport%') 
                THEN 'sports_group' END, 
                CASE WHEN (title LIKE '%rarerare%')
                THEN 'rare_group' END
             ]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
GROUP BY title_group

I get:

views       ...   title_group
3414469869  ... 
4355264     ...   games_group
1361074     ...   sports_group

What I would like to get is slightly different:

views     ...   title_group
4355264   ...   games_group
1361074   ...   sports_group
0         ...   rare_group

That is, I need a row for every group, even when no title matches it.

How can I modify my query to get this result?

Tags: sql, group-by, google-bigquery, case

Solution


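The simplest approach is to wrap the original query, unchanged, in an extra LEFT JOIN against the full list of groups, as in the example below. The LEFT JOIN keeps every listed group, IFNULL turns the missing sum into 0, and the NULL group (rows where a CASE did not match) drops out because NULL never compares equal in USING.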

#standardSQL
SELECT IFNULL(views, 0) views, title_group
FROM (SELECT title_group FROM UNNEST(['games_group', 'sports_group', 'rare_group']) AS title_group)
LEFT JOIN (
  SELECT SUM(views) views, title_group
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  CROSS JOIN UNNEST([
    CASE WHEN (title LIKE '%game%') THEN 'games_group' END, 
    CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END, 
    CASE WHEN (title LIKE '%rarerare%') THEN 'rare_group' END
  ]) AS title_group
  WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
  GROUP BY title_group
)
USING(title_group)
-- ORDER BY views DESC   

with output

Row views   title_group  
1   4355264 games_group  
2   1361074 sports_group     
3   0       rare_group    
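
To try the pattern without scanning the public table, here is a minimal sketch of the same LEFT JOIN wrapper over a handful of inline rows. The `pages` CTE and its four rows are made up for illustration and stand in for `fh-bigquery.wikipedia_v3.pageviews_2019`; the `aggregated` CTE name is likewise an assumption. Everything else follows the query above.

```sql
#standardSQL
-- Hypothetical toy rows standing in for the pageviews table.
WITH pages AS (
  SELECT 'video game' AS title, 10 AS views UNION ALL
  SELECT 'game of sport', 5 UNION ALL
  SELECT 'sports car', 3 UNION ALL
  SELECT 'unrelated', 100
),
-- Every group that must appear in the result, matched or not.
title_groups AS (
  SELECT title_group
  FROM UNNEST(['games_group', 'sports_group', 'rare_group']) AS title_group
),
-- The original aggregation, unchanged: one array element per rule,
-- NULL where the rule does not match.
aggregated AS (
  SELECT SUM(views) AS views, title_group
  FROM pages
  CROSS JOIN UNNEST([
    CASE WHEN title LIKE '%game%' THEN 'games_group' END,
    CASE WHEN title LIKE '%sport%' THEN 'sports_group' END,
    CASE WHEN title LIKE '%rarerare%' THEN 'rare_group' END
  ]) AS title_group
  GROUP BY title_group
)
-- LEFT JOIN keeps all three groups; the NULL group has no partner and is dropped.
SELECT IFNULL(views, 0) AS views, title_group
FROM title_groups
LEFT JOIN aggregated USING (title_group)
ORDER BY views DESC
```

With these toy rows, games_group sums to 15 (10 + 5), sports_group to 8 (5 + 3), and rare_group comes back as 0. The title 'game of sport' is counted in both groups, which is the non-exclusive behavior the question asks for.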

If you are open to refactoring the above, consider the following version (which produces the same output):

#standardSQL
WITH title_groups AS (
  SELECT title_group, pattern 
  FROM UNNEST([STRUCT<title_group STRING, pattern STRING>
    ('games_group', '%game%'), ('sports_group', '%sport%'), ('rare_group', '%rarerare%')
  ])
)
SELECT IFNULL(views, 0) views, title_group FROM title_groups
LEFT JOIN (
  SELECT SUM(CASE WHEN title LIKE pattern THEN views END) views, title_group 
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  CROSS JOIN title_groups
  WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
  GROUP BY title_group
)
USING(title_group)
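
One possible further simplification, offered as a sketch rather than as part of the original answer: because the refactored inner query already cross joins every source row with every group, each group should show up after GROUP BY as long as at least one source row survives the WHERE filter. In that case the outer LEFT JOIN can be dropped and IFNULL applied directly to the sum. The `pages` CTE below is again hypothetical toy data.

```sql
#standardSQL
-- Hypothetical toy rows standing in for the pageviews table.
WITH pages AS (
  SELECT 'video game' AS title, 10 AS views UNION ALL
  SELECT 'game of sport', 5 UNION ALL
  SELECT 'sports car', 3 UNION ALL
  SELECT 'unrelated', 100
),
title_groups AS (
  SELECT title_group, pattern
  FROM UNNEST([STRUCT<title_group STRING, pattern STRING>
    ('games_group', '%game%'), ('sports_group', '%sport%'), ('rare_group', '%rarerare%')
  ])
)
-- Each (row, group) pair contributes its views only when the title matches
-- the group's pattern; a group with no matches sums to NULL, which IFNULL turns into 0.
SELECT IFNULL(SUM(IF(title LIKE pattern, views, NULL)), 0) AS views, title_group
FROM pages
CROSS JOIN title_groups
GROUP BY title_group
ORDER BY title_group
```

The trade-off is the empty-input case: if the filter removes every source row, this version returns no rows at all, whereas the LEFT JOIN version still returns one row per group with 0 views. Keep the LEFT JOIN if that case matters.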
