首页 > 解决方案 > Spark根据条件添加多列

问题描述

输入样本数据集

+--------------+----------+-----------+
| col_a        | col_b   |  col_label |
+------------------+-----+------------+
| aa           | hy      | Ford       |
| bb           | NA      | Mahindra   |
|              | my      |  Ford      |
| de           | cz      | Toyota     |
| NA Except xy | mg      |  Hyundai   |
| NA           | gh      |  Toyota    |
+--------------+---------+------------+

预期输出数据集

+--------------+----------+------+------+------------+------+----------+----------+---------+---------+-------+--------+--------+--------+----------+
| col_a        | col_b   | col_c | col_d| col_brand  | Ford | Mahindra | Toyota   | Hyundai |  BMW    | AUDI  | Nissan | flag_a | flag_b | flag_ab  |
+--------------+---------+-------+------+------------+------+----------+----------+---------+---------+-------+--------+--------+--------+----------+
| aa           | hy      |  aahy |      |  Ford      |  yes |   no     |    no    |   no    |   no    |  no   |   no   |   0    |   0    |   1      |
| bb           | NA      |  bb   |      | Mahindra   |  no  |   yes    |    no    |   no    |   no    |  no   |   no   |   1    |   0    |   0      |
| NA           | my      |  my   |      | Ford       |  yes |   no     |    no    |   no    |   no    |  no   |   no   |   0    |   1    |   0      |
| de           | cz      |  decz |      | Toyota     |  no  |   no     |    yes   |   no    |   no    |  no   |   no   |   0    |   0    |   1      |
| NA except xy | mg      |  mg   |  xy  | Hyundai    |  no  |   no     |    no    |   yes   |   no    |  no   |   no   |   0    |   1    |   0      |
| NA           | gh      |  gh   |      | Toyota     |  no  |   no     |    yes   |   no    |   no    |  no   |   no   |   0    |   1    |   0      |
+--------------+---------+-------+------|------------+------+----------+----------+---------+---------+--------+-------+--------+--------+----------+

解释:

第 1 列。 col_a 保持原样

列 2.col_b 保持原样

第 3 列。 col_c => concat(col_a,col_b)。如果具有 NA 或 NA 除了或 NULL 的值必须被视为空白(“”)

第 4 列 col_d => 从 col_a 捕获“NA 除外”之后的任何数据

第 5 列。 col_brand 保持原样

第 6 列到第 12 列 => => 静态品牌列表是福特、马恒达、丰田、现代、宝马、奥迪、日产。=> 将 col_brand 与静态品牌列表 (6.1) 进行比较,适当地为行分配是或否

第 13 到 15 列(flag_a、flag_b 和 flag_ab)=> 根据 col_a 和 col_b 中存在的值分配 1 或 0 => NA / NA 除了 / Null 必须被视为空白(“”)

谢谢你

标签: scalaapache-spark

解决方案


检查以下 SQL(通过 spark-shell),其中在设置临时 flag_a、flag_b(在 CTE 中)并用作检索 col_d 的分隔符时,“NA”和“NA except”不区分大小写:

df.createOrReplaceTempView("df_table")

spark.sql("""

    WITH t1 AS (
        SELECT IF(col_a = '' OR col_a is NULL, 'NA', col_a) as col_a
        ,      IF(col_b = '' OR col_b is NULL, 'NA', col_b) as col_b
        ,      IF(col_a rlike '^(?i)NA(?:\\s+except\\b|$)' OR col_a = '' OR col_a is NULL, 0, 1) as flag_a
        ,      IF(col_b rlike '^(?i)NA(?:\\s+except\\b|$)' OR col_b = '' OR col_b is NULL, 0, 1) as flag_b
        ,      col_label as col_brand
        FROM df_table
    )
    SELECT col_a
    ,      col_b
    ,      concat(IF(flag_a > 0, col_a, ''), IF(flag_b > 0, col_b, '')) as col_c
    ,      coalesce(split(col_a, '(?i)NA except ')[1], '') as col_d
    ,      col_brand
    ,      IF(col_brand = 'Ford', 'yes', 'no') as Ford
    ,      IF(col_brand = 'Mahindra', 'yes', 'no') as Mahindra
    ,      IF(col_brand = 'Toyota', 'yes', 'no') as Toyota
    ,      IF(col_brand = 'Hyundai', 'yes', 'no') as Hyundai
    ,      IF(col_brand = 'BMW', 'yes', 'no') as BMW
    ,      IF(col_brand = 'AUDI', 'yes', 'no') as AUDI
    ,      IF(col_brand = 'Nissan', 'yes', 'no') as Nissan
    ,      IF(flag_a = 1 AND flag_b = 0, 1, 0) as flag_a
    ,      IF(flag_a = 0 AND flag_b = 1, 1, 0) as flag_b
    ,      IF(flag_a = 1 AND flag_b = 1, 1, 0) as flag_ab
    FROM t1

""").show
+------------+-----+-----+-----+---------+----+--------+------+-------+---+----+------+------+------+-------+
|       col_a|col_b|col_c|col_d|col_brand|Ford|Mahindra|Toyota|Hyundai|BMW|AUDI|Nissan|flag_a|flag_b|flag_ab|
+------------+-----+-----+-----+---------+----+--------+------+-------+---+----+------+------+------+-------+
|          aa|   hy| aahy|     |     Ford| yes|      no|    no|     no| no|  no|    no|     0|     0|      1|
|          bb|   NA|   bb|     | Mahindra|  no|     yes|    no|     no| no|  no|    no|     1|     0|      0|
|          NA|   my|   my|     |     Ford| yes|      no|    no|     no| no|  no|    no|     0|     1|      0|
|          de|   cz| decz|     |   Toyota|  no|      no|   yes|     no| no|  no|    no|     0|     0|      1|
|NA Except xy|   mg|   mg|   xy|  Hyundai|  no|      no|    no|    yes| no|  no|    no|     0|     1|      0|
|          NA|   gh|   gh|     |   Toyota|  no|      no|   yes|     no| no|  no|    no|     0|     1|      0|
+------------+-----+-----+-----+---------+----+--------+------+-------+---+----+------+------+------+-------+

推荐阅读