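google-bigquery - One-hot encoding (dummy variables) with BigQuery
Question
I want to create dummy variables (one-hot encoding) for my categorical columns using BigQuery rather than Pandas. I will end up with about 200 columns, so I cannot write them out by hand and hard-code them.
Test dataset (the real one has many more variables):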
WITH table AS (
  SELECT 1001 AS ID, 'blue' AS Color, 'big' AS size UNION ALL
  SELECT 1002 AS ID, 'yellow' AS Color, 'medium' AS size UNION ALL
  SELECT 1003 AS ID, 'red' AS Color, 'small' AS size UNION ALL
  SELECT 1004 AS ID, 'blue' AS Color, 'small' AS size
)
SELECT *
FROM table
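Expected result: one row per ID, with a separate 0/1 column for each distinct value of Color and of size.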
Answer
The following is for BigQuery Standard SQL:
-- Collect the distinct category values that will become columns.
DECLARE Colors, Sizes ARRAY<STRING>;
SET (Colors, Sizes) = (
  SELECT AS STRUCT ARRAY_AGG(DISTINCT Color), ARRAY_AGG(DISTINCT Size)
  FROM `project.dataset.table`
);
-- Build one COUNTIF(...) column per value and run the generated statement.
EXECUTE IMMEDIATE '''
CREATE TEMP TABLE result AS
SELECT ID, ''' || (
  SELECT STRING_AGG("COUNTIF(Color = '" || Color || "') AS Color_" || Color ORDER BY Color)
  FROM UNNEST(Colors) AS Color
) || (
  SELECT ', ' || STRING_AGG("COUNTIF(Size = '" || Size || "') AS Size_" || Size ORDER BY Size)
  FROM UNNEST(Sizes) AS Size
) || '''
FROM `project.dataset.table`
GROUP BY ID
''';
-- Read the one-hot encoded rows back; the ordering belongs on this final SELECT.
SELECT * FROM result ORDER BY ID;
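To make the string concatenation easier to follow, here is a hand-written sketch of the SELECT that the script would generate for the four test rows, apart from whitespace. It assumes the distinct values are exactly blue, red, yellow and big, medium, small. The CTE name `sample_rows` is a hypothetical stand-in for `project.dataset.table`, so the query can be run on its own.

```sql
-- Inline copy of the question's four test rows (sample_rows is a made-up name).
WITH sample_rows AS (
  SELECT 1001 AS ID, 'blue' AS Color, 'big' AS Size UNION ALL
  SELECT 1002, 'yellow', 'medium' UNION ALL
  SELECT 1003, 'red', 'small' UNION ALL
  SELECT 1004, 'blue', 'small'
)
-- The column list below corresponds to what the two STRING_AGG subqueries build:
-- one COUNTIF per distinct value, ordered alphabetically within each variable.
SELECT ID,
  COUNTIF(Color = 'blue') AS Color_blue,
  COUNTIF(Color = 'red') AS Color_red,
  COUNTIF(Color = 'yellow') AS Color_yellow,
  COUNTIF(Size = 'big') AS Size_big,
  COUNTIF(Size = 'medium') AS Size_medium,
  COUNTIF(Size = 'small') AS Size_small
FROM sample_rows
GROUP BY ID
ORDER BY ID;
```

Because the query groups by ID, each COUNTIF counts matching rows per ID. The result is a strict 0/1 indicator only when ID is unique, as it is in the test data.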
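When applied to the sample data from your question, the output is:
ID    Color_blue  Color_red  Color_yellow  Size_big  Size_medium  Size_small
1001  1           0          0             1         0            0
1002  0           0          1             0         1            0
1003  0           1          0             0         0            1
1004  1           0          0             0         0            1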
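With around 200 generated columns, some category values may contain spaces, hyphens, or quotes. These would produce invalid column aliases or break the quoted literal in the generated SQL. The following is a minimal sketch of one way to guard against that, not part of the original answer. It uses `FORMAT` with `%T` to emit a properly quoted literal and `REGEXP_REPLACE` to derive a safe alias. The temp table `sample` and its rows are hypothetical, and only one variable is shown to keep it short.

```sql
-- DECLARE has to come before any other statement in a BigQuery script.
DECLARE Colors ARRAY<STRING>;

-- Hypothetical data: 'light blue' contains a space and would not work as a bare alias suffix.
CREATE TEMP TABLE sample AS
SELECT 1001 AS ID, 'light blue' AS Color UNION ALL
SELECT 1002, 'red' UNION ALL
SELECT 1003, 'light blue';

SET Colors = (SELECT ARRAY_AGG(DISTINCT Color) FROM sample);

EXECUTE IMMEDIATE '''
SELECT ID, ''' || (
  SELECT STRING_AGG(
    -- %T renders the value as a quoted, escaped string literal;
    -- the alias keeps only letters, digits, and underscores.
    FORMAT("COUNTIF(Color = %T) AS Color_%s",
           Color, REGEXP_REPLACE(Color, r'[^A-Za-z0-9_]', '_'))
    ORDER BY Color)
  FROM UNNEST(Colors) AS Color
) || '''
FROM sample
GROUP BY ID
ORDER BY ID
''';
```

One limitation of this sketch: two distinct values that sanitize to the same alias (for example 'light blue' and 'light-blue') would collide, so a real pipeline would likely need an extra disambiguation step.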