scala - Spark根据条件添加多列
问题描述
输入样本数据集
+--------------+----------+-----------+
| col_a | col_b | col_label |
+------------------+-----+------------+
| aa | hy | Ford |
| bb | NA | Mahindra |
| | my | Ford |
| de | cz | Toyota |
| NA Except xy | mg | Hyundai |
| NA | gh | Toyota |
+--------------+---------+------------+
预期输出数据集
+--------------+----------+------+------+------------+------+----------+----------+---------+---------+-------+--------+--------+--------+----------+
| col_a | col_b | col_c | col_d| col_brand | Ford | Mahindra | Toyota | Hyundai | BMW | AUDI | Nissan | flag_a | flag_b | flag_ab |
+--------------+---------+-------+------+------------+------+----------+----------+---------+---------+-------+--------+--------+--------+----------+
| aa | hy | aahy | | Ford | yes | no | no | no | no | no | no | 0 | 0 | 1 |
| bb | NA | bb | | Mahindra | no | yes | no | no | no | no | no | 1 | 0 | 0 |
| NA | my | my | | Ford | yes | no | no | no | no | no | no | 0 | 1 | 0 |
| de | cz | decz | | Toyota | no | no | yes | no | no | no | no | 0 | 0 | 1 |
| NA except xy | mg | mg | xy | Hyundai | no | no | no | yes | no | no | no | 0 | 1 | 0 |
| NA | gh | gh | | Toyota | no | no | yes | no | no | no | no | 0 | 1 | 0 |
+--------------+---------+-------+------|------------+------+----------+----------+---------+---------+--------+-------+--------+--------+----------+
解释:
第 1 列。 col_a 保持原样
列 2.col_b 保持原样
第 3 列。 col_c => concat(col_a,col_b)。如果具有 NA 或 NA 除了或 NULL 的值必须被视为空白(“”)
第 4 列 col_d => 从 col_a 捕获“NA 除外”之后的任何数据
第 5 列。 col_brand 保持原样
第 6 列到第 12 列 => => 静态品牌列表是福特、马恒达、丰田、现代、宝马、奥迪、日产。=> 将 col_brand 与静态品牌列表 (6.1) 进行比较,适当地为行分配是或否
第 13 到 15 列(flag_a、flag_b 和 flag_ab)=> 根据 col_a 和 col_b 中存在的值分配 1 或 0 => NA / NA 除了 / Null 必须被视为空白(“”)
谢谢你
解决方案
检查以下 SQL(通过 spark-shell),其中在设置临时 flag_a、flag_b(在 CTE 中)并用作检索 col_d 的分隔符时,“NA”和“NA except”不区分大小写:
df.createOrReplaceTempView("df_table")
spark.sql("""
WITH t1 AS (
SELECT IF(col_a = '' OR col_a is NULL, 'NA', col_a) as col_a
, IF(col_b = '' OR col_b is NULL, 'NA', col_b) as col_b
, IF(col_a rlike '^(?i)NA(?:\\s+except\\b|$)' OR col_a = '' OR col_a is NULL, 0, 1) as flag_a
, IF(col_b rlike '^(?i)NA(?:\\s+except\\b|$)' OR col_b = '' OR col_b is NULL, 0, 1) as flag_b
, col_label as col_brand
FROM df_table
)
SELECT col_a
, col_b
, concat(IF(flag_a > 0, col_a, ''), IF(flag_b > 0, col_b, '')) as col_c
, coalesce(split(col_a, '(?i)NA except ')[1], '') as col_d
, col_brand
, IF(col_brand = 'Ford', 'yes', 'no') as Ford
, IF(col_brand = 'Mahindra', 'yes', 'no') as Mahindra
, IF(col_brand = 'Toyota', 'yes', 'no') as Toyota
, IF(col_brand = 'Hyundai', 'yes', 'no') as Hyundai
, IF(col_brand = 'BMW', 'yes', 'no') as BMW
, IF(col_brand = 'AUDI', 'yes', 'no') as AUDI
, IF(col_brand = 'Nissan', 'yes', 'no') as Nissan
, IF(flag_a = 1 AND flag_b = 0, 1, 0) as flag_a
, IF(flag_a = 0 AND flag_b = 1, 1, 0) as flag_b
, IF(flag_a = 1 AND flag_b = 1, 1, 0) as flag_ab
FROM t1
""").show
+------------+-----+-----+-----+---------+----+--------+------+-------+---+----+------+------+------+-------+
| col_a|col_b|col_c|col_d|col_brand|Ford|Mahindra|Toyota|Hyundai|BMW|AUDI|Nissan|flag_a|flag_b|flag_ab|
+------------+-----+-----+-----+---------+----+--------+------+-------+---+----+------+------+------+-------+
| aa| hy| aahy| | Ford| yes| no| no| no| no| no| no| 0| 0| 1|
| bb| NA| bb| | Mahindra| no| yes| no| no| no| no| no| 1| 0| 0|
| NA| my| my| | Ford| yes| no| no| no| no| no| no| 0| 1| 0|
| de| cz| decz| | Toyota| no| no| yes| no| no| no| no| 0| 0| 1|
|NA Except xy| mg| mg| xy| Hyundai| no| no| no| yes| no| no| no| 0| 1| 0|
| NA| gh| gh| | Toyota| no| no| yes| no| no| no| no| 0| 1| 0|
+------------+-----+-----+-----+---------+----+--------+------+-------+---+----+------+------+------+-------+
推荐阅读
- reactjs - 我认为渲染工作两次
- javascript - Can I select all child elements of a specific element with a specific class without writing a list of selectors for each child?
- nginx - nginx proxy in front of Minio-like S3 server bad request
- javascript - How can I query select all inputs from table column
- excel - 使用 Chrome 和 VBA
- android - 当用户单击edittext之外的任何其他位置或屏幕上的任何其他位置时,如何隐藏edittext光标?
- bazel - 我可以从 .bzl 文件加载常用规则吗?
- sql - SQL Join on Like 运算符
- c++ - 如何遍历双端队列并搜索元素?
- java - 获取不同代理类型 AnyLogic 的等待时间