scala - Update column values in a Spark DataFrame based on another column
Problem description
I have a Spark DataFrame as described below.
// Needed for .toDF outside spark-shell (spark-shell imports it automatically).
import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
  (1,"", "SNACKS", "BISCUITS - AMBIENT", "BISCUITS - AMBIENT", "", "REFLETS DE FRANCE CROQUANT", "UNCOATED BISCUIT", "NO PROMOTION", "", "", "400G","",""),
  (2,"GROCERY", "BISCUITS", "SWEET BISCUITS ", "BISCUITS - AMBIENT", "", "", "AMBIENT BISCUIT", "NO PROMOTION", "", "", "400G","","CHOCOS")
))
  .toDF("id", "c4", "c1001", "c1002", "c1003", "c1008", "c1008_unmasked", "c1009", "c1011", "c1012", "c1013", "c1015", "c1016", "c1016_unmasked")
data.show(false)
Sample input:
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|id |c4     |c1001   |c1002             |c1003             |c1008|c1008_unmasked            |c1009            |c1011       |c1012|c1013|c1015|c1016|c1016_unmasked|
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|1  |       |SNACKS  |BISCUITS - AMBIENT|BISCUITS - AMBIENT|     |REFLETS DE FRANCE CROQUANT|UNCOATED BISCUIT |NO PROMOTION|     |     |400G |     |              |
|2  |GROCERY|BISCUITS|SWEET BISCUITS    |BISCUITS - AMBIENT|     |                          |AMBIENT BISCUIT  |NO PROMOTION|     |     |400G |     |CHOCOS        |
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
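A column cXXXX must be filled with the value "MASKED" only when the corresponding cXXXX_unmasked column has a value. Please see the sample output below for a better understanding.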
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|id |c4     |c1001   |c1002             |c1003             |c1008 |c1008_unmasked            |c1009            |c1011       |c1012|c1013|c1015|c1016 |c1016_unmasked|
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|1  |       |SNACKS  |BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE CROQUANT|UNCOATED BISCUIT |NO PROMOTION|     |     |400G |      |              |
|2  |GROCERY|BISCUITS|SWEET BISCUITS    |BISCUITS - AMBIENT|      |                          |AMBIENT BISCUIT  |NO PROMOTION|     |     |400G |MASKED|CHOCOS        |
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
Thanks in advance.
Solution
Here is my attempt.
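import org.apache.spark.sql.functions.{col, lit, when}

// Every "<name>_unmasked" column drives the masking of its "<name>" counterpart.
val cols = data.columns.filter(_.endsWith("_unmasked"))
val new_data = cols.foldLeft(data) { (df, c) =>
  val target = c.stripSuffix("_unmasked")
  // Mask only when the unmasked value is non-null and non-empty; otherwise keep the existing value.
  df.withColumn(target, when(col(c).isNotNull && col(c) =!= "", lit("MASKED")).otherwise(col(target)))
}
new_data.show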
+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
| id|     c4|   c1001|             c1002|             c1003| c1008|      c1008_unmasked|            c1009|       c1011|c1012|c1013|c1015| c1016|c1016_unmasked|
+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
|  1|       |  SNACKS|BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE...| UNCOATED BISCUIT|NO PROMOTION|     |     | 400G|      |              |
|  2|GROCERY|BISCUITS|   SWEET BISCUITS |BISCUITS - AMBIENT|      |                    |  AMBIENT BISCUIT|NO PROMOTION|     |     | 400G|MASKED|        CHOCOS|
+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
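If there are many cXXXX_unmasked columns, the same rule can be applied in a single select rather than one withColumn call per column, which keeps the query plan flat. The following is only a sketch under some assumptions: the helper name maskFromUnmasked, the local SparkSession, and the reduced five-column sample are illustrative and not part of the original question or answer.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("mask-sketch").getOrCreate()
import spark.implicits._

// Reduced version of the question's data (only the columns relevant to masking).
val sample = Seq(
  (1, "", "REFLETS DE FRANCE CROQUANT", "", ""),
  (2, "", "", "", "CHOCOS")
).toDF("id", "c1008", "c1008_unmasked", "c1016", "c1016_unmasked")

// Hypothetical helper: masks every column that has a "<name><suffix>" sibling,
// building all output columns first and projecting them with one select.
def maskFromUnmasked(df: DataFrame, suffix: String = "_unmasked"): DataFrame = {
  // Names of the columns to overwrite, e.g. "c1008" for "c1008_unmasked".
  val targets = df.columns.filter(_.endsWith(suffix)).map(_.stripSuffix(suffix)).toSet

  val projected: Array[Column] = df.columns.map { name =>
    if (targets.contains(name)) {
      val source = col(name + suffix)
      // Non-null, non-empty source => "MASKED"; otherwise keep the current value.
      when(source.isNotNull && source =!= "", lit("MASKED")).otherwise(col(name)).as(name)
    } else {
      col(name)
    }
  }
  df.select(projected: _*)
}

val masked = maskFromUnmasked(sample)
masked.show(false)
```

Because the projection is built from the existing column list, the column order is preserved, and an unmasked column without a matching cXXXX column is simply left alone instead of creating a new column.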