scala - 条件满足时如何用“stage{col's}”开头的所有列的NULL替换值
问题描述
我有一个场景,最终的数据框如下所示,这是加入阶段和基础的结果。
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key |ICC_key |suff_key |stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |2019-02-02 21:50:25.585|9123 |20.00 |1000.00 |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |
|333 |333 |1 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |
|444 |444 |1 |2020-04-03 21:50:25.585|8123 |30.00 |200.00 |null |null |null |null |
|555 |333 |1 |null |null |null |null |2020-05-03 21:50:25.585|813 |30.00 |200.00 |
|111 |111 |1 |2020-01-01 21:50:25.585|A123 |10.00 |99.00 |null |null |null |null |
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
我正在寻找一个逻辑,在 final_{timestamp} > stage_{timestamp} 的每一行上,必须将值替换为“null”,所有列都以 stage_{} 开头。
如下所示:
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key |ICC_key |suff_key |stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |null |null |null |null |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |
|333 |333 |1 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |
|444 |444 |1 |2020-04-03 21:50:25.585|8123 |30.00 |200.00 |null |null |null |null |
|555 |333 |1 |null |null |null |null |2020-05-03 21:50:25.585|813 |30.00 |200.00 |
|111 |111 |1 |2020-01-01 21:50:25.585|A123 |10.00 |99.00 |null |null |null |null |
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
如果您能帮助我理解逻辑,那就太好了。"""
解决方案
检查下面的代码。
健康)状况
scala> val expr = col("final_{timestamp}") > col("stage_{timestamp}")
条件Matched
列
scala> val matched = df
.columns
.filter(_.startsWith("stage"))
.map(c => (when(expr,lit(null)).otherwise(col(c))).as(c))
条件Not Matched
列
scala> val notMatched = df
.columns
.filter(!_.startsWith("stage"))
.map(c => col(c).as(c))
组合Not Matched
和Matched
列
scala> val allColumns = notMatched ++ matched
最后结果
scala> df.select(allColumns:_*).show(false)
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key|ICC_key|suff_key|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |null |null |null |null |
|333 |333 |1 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |
|444 |444 |1 |null |null |null |null |null |null |null |null |
|555 |333 |1 |2020-05-03 21:50:25.585|813 |30.00 |200.00 |null |null |null |null |
|111 |111 |1 |null |null |null |null |null |null |null |null |
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+