Split on the last occurrence of a delimiter into two columns, based on a column condition, in a Spark DataFrame

Problem description

I have the following Spark DataFrame:

+------+--------------------+-----------------+--------------------+
|entity|            instance|             name|               value|
+------+--------------------+-----------------+--------------------+
|Column|            ARM_PRED|  Histogram.abs. |            638065.0|
|Column|            ARM_PRED|Histogram.ratio. |                 1.0|
|Column|       ATTR044_Count|       Compliance|  0.9886814039322013|
|Column|ATTR090_MissingCount|       Compliance|  0.9997570780406385|
|Column|             ATTR045|             Mean|  17101.922291929834|
|Column|             ATTR045|          Maximum|           4000000.0|
|Column|             ATTR045|          Minimum|                 0.0|
|Column|             ATTR045|StandardDeviation|   32981.67451994775|
|Column|       ATTR020_Count|       Compliance|    0.99814909139351|
|Column|ATTR036_MissingCount|       Compliance|0.011260608245241473|
|Column|       ATTR011_Count|       Compliance|    0.99814909139351|
|Column|             ATTR069|             Mean|  0.8678189434810735|
|Column|             ATTR069|          Maximum|               103.0|

When the name column has the value Compliance, I need to split the instance column on the last occurrence of the delimiter "_", keeping the first part in the instance column and moving the second part into the name column.

Expected output:

+------+--------------------+-----------------+--------------------+
|entity|            instance|             name|               value|
+------+--------------------+-----------------+--------------------+
|Column|            ARM_PRED|  Histogram.abs. |            638065.0|
|Column|            ARM_PRED|Histogram.ratio. |                 1.0|
|Column|             ATTR044|            Count|  0.9886814039322013|
|Column|             ATTR090|     MissingCount|  0.9997570780406385|
|Column|             ATTR045|             Mean|  17101.922291929834|
|Column|             ATTR045|          Maximum|           4000000.0|
|Column|             ATTR045|          Minimum|                 0.0|
|Column|             ATTR045|StandardDeviation|   32981.67451994775|
|Column|             ATTR020|            Count|    0.99814909139351|
|Column|             ATTR036|     MissingCount|0.011260608245241473|
|Column|             ATTR011|            Count|    0.99814909139351|
|Column|             ATTR069|             Mean|  0.8678189434810735|
|Column|             ATTR069|          Maximum|               103.0|

The code below applies the split to every record, but I need it to run only when the name column has the value Compliance:

df.withColumn("instance", split($"instance", "_(?!.*_)"))
  .withColumn("name", $"instance"(1))
  .withColumn("instance", $"instance"(0))
  .show

Any help is appreciated!

Tags: scala, apache-spark, apache-spark-sql

Solution


You can use the `when` function to apply the split conditionally.

Example:

df
 .withColumn("_instance", split($"instance", "_(?!.*_)")) // split on the last "_" into a temp array column
 .withColumn("instance", when($"name" === "Compliance", $"_instance"(0)).otherwise($"instance"))
 .withColumn("name", when($"name" === "Compliance", $"_instance"(1)).otherwise($"name"))
 .drop("_instance")
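As a side note, the regex `_(?!.*_)` matches an underscore that is not followed by any later underscore, i.e. only the last `_` in the string. Since Spark's `split` uses Java regex semantics, a quick plain-Scala sanity check of the pattern (outside Spark) behaves the same way:

```scala
// An underscore not followed by another underscore anywhere after it:
// only the LAST "_" in the string matches, so split produces two parts.
val lastUnderscore = "_(?!.*_)"

val parts = "ATTR090_MissingCount".split(lastUnderscore)
// parts(0) == "ATTR090", parts(1) == "MissingCount"

// A value containing several underscores still splits only at the last one:
val multi = "A_B_C".split(lastUnderscore)
// multi(0) == "A_B", multi(1) == "C"
```

A string without any underscore yields a single-element array, so indexing element 1 of the split result returns null in Spark rather than failing, which is harmless here because such rows are guarded by the `when($"name" === "Compliance", ...)` condition.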
