google-bigquery - 如何将数组中元素的出现计数为 SQL 表中的新列?
问题描述
假设我有一个调用的表my.table
和一个split
已经定义并返回字符串数组的函数。
SELECT split(lang) as langs
FROM my.table
which returns:
+-----------------------------+
| langs |
+-----------------------------+
| [French, English, English] |
+-----------------------------+
| [Dutch, French, English] |
+-----------------------------+
| [English] |
+-----------------------------+
| [French, Dutch] |
+-----------------------------+
现在我正在尝试将unnest
上述内容转换为每种语言出现的表格,例如:
+--------------------------+
| English | French | Dutch |
+--------------------------+
| 2 | 1 | 0 | # corresponds to [French, English, English] (0 Dutch)
+--------------------------+
| 1 | 1 | 1 |
+--------------------------+
| 1 | 0 | 0 |
+--------------------------+
| 0 | 1 | 1 |
+--------------------------+
我可以用一种天真的方式计算说“英语”的总数,例如:
WITH x AS (SELECT split(lang) as langs
FROM my.table)
SELECT count(arr_item) as English
FROM x, UNNEST(arr) as arr_item where arr_item = 'English'
编辑:每一行可能包含重复的元素,如[English, English, French]
. 请参阅第一个表:row1。
所以那个输出显示在第二个表中。
解决方案
以下是 BigQuery 标准 SQL
您的数据中最有可能的语言数量是事先不知道的 - 所以我建议采用以下方法,首先收集数据中的所有语言并按字母顺序排列,然后为每一行生成代表各自语言存在的 0 和 1 向量他们在该基本语言列表中的位置
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'French,English' langs UNION ALL
SELECT 'Dutch,French,English' UNION ALL
SELECT 'English' UNION ALL
SELECT 'French,Dutch'
), base AS (
SELECT STRING_AGG(lang ORDER BY lang) all_langs
FROM (
SELECT DISTINCT lang
FROM `project.dataset.table`,
UNNEST(SPLIT(langs)) lang
)
)
SELECT langs, all_langs,
(SELECT STRING_AGG(IF(lang IS NULL, '0', '1') ORDER BY pos)
FROM UNNEST(SPLIT(all_langs)) base_lang WITH OFFSET pos
LEFT JOIN UNNEST(SPLIT(langs)) lang
ON base_lang = lang
) AS value
FROM `project.dataset.table` t
CROSS JOIN base b
结果是
Row langs all_langs value
1 French,English Dutch,English,French 0,1,1
2 Dutch,French,English Dutch,English,French 1,1,1
3 English Dutch,English,French 0,1,0
4 French,Dutch Dutch,English,French 1,0,1
希望,这将为您的特定用例提供良好的起点
注意:BigQuery 不支持原生 PIVOT,因此上述方法很可能是最适合您的方法
...我的行已经是字符串数组了...我有 ['French', 'English'] 而不是 'French, English' ... 那么这仍然有效吗?
绝对 - 是的!您需要做的唯一更改是替换UNNEST(SPLIT(langs))
为UNNEST(langs)
如下示例
#standardSQL
WITH `project.dataset.table` AS (
SELECT ['French','English'] langs UNION ALL
SELECT ['Dutch','French','English'] UNION ALL
SELECT ['English'] UNION ALL
SELECT ['French','Dutch']
), base AS (
SELECT STRING_AGG(lang ORDER BY lang) all_langs
FROM (
SELECT DISTINCT lang
FROM `project.dataset.table`,
UNNEST(langs) lang
)
)
SELECT langs, all_langs,
(SELECT STRING_AGG(IF(lang IS NULL, '0', '1') ORDER BY pos)
FROM UNNEST(SPLIT(all_langs)) base_lang WITH OFFSET pos
LEFT JOIN UNNEST(langs) lang
ON base_lang = lang
) AS value
FROM `project.dataset.table` t
CROSS JOIN base b
结果
如果一行是 [French, English, English]。想要的是 0,1,2
见下面的例子
#standardSQL
WITH `project.dataset.table` AS (
SELECT ['French','English','English'] langs UNION ALL
SELECT ['Dutch','French','English'] UNION ALL
SELECT ['English','English'] UNION ALL
SELECT ['French','Dutch']
), base AS (
SELECT STRING_AGG(lang ORDER BY lang) all_langs
FROM (
SELECT DISTINCT lang
FROM `project.dataset.table`,
UNNEST(langs) lang
)
)
SELECT langs, all_langs,
ARRAY_TO_STRING(ARRAY(SELECT CAST(SUM(IF(lang IS NULL, 0, 1)) AS STRING)
FROM UNNEST(SPLIT(all_langs)) base_lang WITH OFFSET pos
LEFT JOIN UNNEST(langs) lang
ON base_lang = lang
GROUP BY base_lang
ORDER BY MIN(pos)
), ',') AS value
FROM `project.dataset.table` t
CROSS JOIN base b
结果
推荐阅读
- django - Django密码重置没有反向匹配
- mime - HEIC文件签名
- sql - 使用空值/限制对空列进行排序时,Postgresql 的奇怪行为
- google-app-maker - 将来自两个数据源的数据合并到一个表中
- r - ggplot2:创建一个带有单独观察的箱须图
- typescript - 打字稿不能转换为泛型类型变量
- apache-kafka - 使用卡夫卡流将数据从输入主题写入输出主题时卡夫卡消费者出错
- python - 在 Anaconda 中安装 py2exe 时出现无法满足的错误
- javascript - 如何在 Bootstrap 4 手风琴示例中折叠边框?
- android - 如何在 android 的工作管理器中传递 POJO 类?