首页 > 解决方案 > 条件满足时如何用“stage{col's}”开头的所有列的NULL替换值

问题描述

我有一个场景,最终的数据框如下所示,这是加入阶段和基础的结果。

+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key             |ICC_key                 |suff_key       |stage_{timestamp}      |stage_{code}|stage_{dol1}|stage_{dol2}|final_{timestamp}      |final_{code}|final_{dol1}|final_{dol2}|
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222                |222                     |1              |2019-02-02 21:50:25.585|9123        |20.00       |1000.00     |2019-03-02 21:50:25.585|7123        |30.00       |200.00      |
|333                |333                     |1              |2020-03-03 21:50:25.585|8123        |30.00       |200.00      |2020-01-03 21:50:25.585|823         |30.00       |200.00      |
|444                |444                     |1              |2020-04-03 21:50:25.585|8123        |30.00       |200.00      |null                   |null        |null        |null        |
|555                |333                     |1              |null                   |null        |null        |null        |2020-05-03 21:50:25.585|813         |30.00       |200.00      |
|111                |111                     |1              |2020-01-01 21:50:25.585|A123        |10.00       |99.00       |null                   |null        |null        |null        |
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+

我正在寻找一个逻辑,在 final_{timestamp} > stage_{timestamp} 的每一行上,必须将值替换为“null”,所有列都以 stage_{} 开头。

如下所示:

+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key             |ICC_key                 |suff_key       |stage_{timestamp}      |stage_{code}|stage_{dol1}|stage_{dol2}|final_{timestamp}      |final_{code}|final_{dol1}|final_{dol2}|
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222                |222                     |1              |null                   |null        |null        |null        |2019-03-02 21:50:25.585|7123        |30.00       |200.00      |
|333                |333                     |1              |2020-03-03 21:50:25.585|8123        |30.00       |200.00      |2020-01-03 21:50:25.585|823         |30.00       |200.00      |
|444                |444                     |1              |2020-04-03 21:50:25.585|8123        |30.00       |200.00      |null                   |null        |null        |null        |
|555                |333                     |1              |null                   |null        |null        |null        |2020-05-03 21:50:25.585|813         |30.00       |200.00      |
|111                |111                     |1              |2020-01-01 21:50:25.585|A123        |10.00       |99.00       |null                   |null        |null        |null        |
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+

如果您能帮助我理解逻辑,那就太好了。"""

标签: scalaapache-spark

解决方案


检查下面的代码。

健康)状况

scala> val expr = col("final_{timestamp}") > col("stage_{timestamp}")

条件Matched

scala> val matched = df
                      .columns
                      .filter(_.startsWith("stage"))
                      .map(c => (when(expr,lit(null)).otherwise(col(c))).as(c))

条件Not Matched

scala> val notMatched = df
                         .columns
                         .filter(!_.startsWith("stage"))
                         .map(c => col(c).as(c))

组合Not MatchedMatched

scala> val allColumns = notMatched ++ matched

最后结果


scala> df.select(allColumns:_*).show(false)
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key|ICC_key|suff_key|final_{timestamp}      |final_{code}|final_{dol1}|final_{dol2}|stage_{timestamp}      |stage_{code}|stage_{dol1}|stage_{dol2}|
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222   |222    |1       |2019-03-02 21:50:25.585|7123        |30.00       |200.00      |null                   |null        |null        |null        |
|333   |333    |1       |2020-01-03 21:50:25.585|823         |30.00       |200.00      |2020-03-03 21:50:25.585|8123        |30.00       |200.00      |
|444   |444    |1       |null                   |null        |null        |null        |null                   |null        |null        |null        |
|555   |333    |1       |2020-05-03 21:50:25.585|813         |30.00       |200.00      |null                   |null        |null        |null        |
|111   |111    |1       |null                   |null        |null        |null        |null                   |null        |null        |null        |
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+

推荐阅读