首页 > 解决方案 > 如何将数组中元素的出现计数为 SQL 表中的新列?

问题描述

假设我有一个调用的表my.table和一个split已经定义并返回字符串数组的函数。

SELECT split(lang) as langs
FROM my.table 

which returns:

+-----------------------------+
|           langs             |
+-----------------------------+
| [French, English, English]  |
+-----------------------------+
| [Dutch, French, English]    |
+-----------------------------+
| [English]                   |
+-----------------------------+
| [French, Dutch]             |
+-----------------------------+

现在我正在尝试将unnest上述内容转换为每种语言出现的表格,例如:

+--------------------------+
| English | French | Dutch | 
+--------------------------+
|    2    |    1   |   0   |    # corresponds to [French, English, English] (0 Dutch)
+--------------------------+
|    1    |    1   |   1   | 
+--------------------------+
|    1    |    0   |   0   | 
+--------------------------+
|    0    |    1   |   1   | 
+--------------------------+

我可以用一种天真的方式计算说“英语”的总数,例如:

WITH x AS (SELECT split(lang) as langs
FROM my.table)
SELECT count(arr_item) as English
FROM x, UNNEST(arr) as arr_item where arr_item = 'English'

编辑:每一行可能包含重复的元素,如[English, English, French]. 请参阅第一个表:row1。

所以那个输出显示在第二个表中。

标签: google-bigquery

解决方案


以下是 BigQuery 标准 SQL

您的数据中最有可能的语言数量是事先不知道的 - 所以我建议采用以下方法,首先收集数据中的所有语言并按字母顺序排列,然后为每一行生成代表各自语言存在的 0 和 1 向量他们在该基本语言列表中的位置

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'French,English' langs UNION ALL
  SELECT 'Dutch,French,English' UNION ALL
  SELECT 'English' UNION ALL
  SELECT 'French,Dutch' 
), base AS (
  SELECT STRING_AGG(lang ORDER BY lang) all_langs
  FROM (
    SELECT DISTINCT lang
    FROM `project.dataset.table`, 
    UNNEST(SPLIT(langs)) lang
  )
)
SELECT langs, all_langs,
  (SELECT STRING_AGG(IF(lang IS NULL, '0', '1') ORDER BY pos)
    FROM UNNEST(SPLIT(all_langs)) base_lang WITH OFFSET pos
    LEFT JOIN UNNEST(SPLIT(langs)) lang
    ON base_lang = lang
  ) AS value
FROM `project.dataset.table` t
CROSS JOIN base b 

结果是

Row langs                   all_langs               value    
1   French,English          Dutch,English,French    0,1,1    
2   Dutch,French,English    Dutch,English,French    1,1,1    
3   English                 Dutch,English,French    0,1,0    
4   French,Dutch            Dutch,English,French    1,0,1     

希望,这将为您的特定用例提供良好的起点

注意:BigQuery 不支持原生 PIVOT,因此上述方法很可能是最适合您的方法

...我的行已经是字符串数组了...我有 ['French', 'English'] 而不是 'French, English' ... 那么这仍然有效吗?

绝对 - 是的!您需要做的唯一更改是替换UNNEST(SPLIT(langs))UNNEST(langs)如下示例

#standardSQL
WITH `project.dataset.table` AS (
  SELECT ['French','English'] langs UNION ALL
  SELECT ['Dutch','French','English'] UNION ALL
  SELECT ['English'] UNION ALL
  SELECT ['French','Dutch'] 
), base AS (
  SELECT STRING_AGG(lang ORDER BY lang) all_langs
  FROM (
    SELECT DISTINCT lang
    FROM `project.dataset.table`, 
    UNNEST(langs) lang
  )
)
SELECT langs, all_langs,
  (SELECT STRING_AGG(IF(lang IS NULL, '0', '1') ORDER BY pos)
    FROM UNNEST(SPLIT(all_langs)) base_lang WITH OFFSET pos
    LEFT JOIN UNNEST(langs) lang
    ON base_lang = lang
  ) AS value
FROM `project.dataset.table` t
CROSS JOIN base b   

结果

在此处输入图像描述

如果一行是 [French, English, English]。想要的是 0,1,2

见下面的例子

#standardSQL
WITH `project.dataset.table` AS (
  SELECT ['French','English','English'] langs UNION ALL
  SELECT ['Dutch','French','English'] UNION ALL
  SELECT ['English','English'] UNION ALL
  SELECT ['French','Dutch'] 
), base AS (
  SELECT STRING_AGG(lang ORDER BY lang) all_langs
  FROM (
    SELECT DISTINCT lang
    FROM `project.dataset.table`, 
    UNNEST(langs) lang
  )
)
SELECT langs, all_langs,
  ARRAY_TO_STRING(ARRAY(SELECT CAST(SUM(IF(lang IS NULL, 0, 1)) AS STRING) 
    FROM UNNEST(SPLIT(all_langs)) base_lang WITH OFFSET pos
    LEFT JOIN UNNEST(langs) lang
    ON base_lang = lang
    GROUP BY base_lang
    ORDER BY MIN(pos)
  ), ',') AS value
FROM `project.dataset.table` t
CROSS JOIN base b   

结果

在此处输入图像描述


推荐阅读