scala - Duplicating Spark DataFrame rows based on splitting column values in Scala
Problem description
I have the following code in Scala:
val fullCertificateSourceDf = certificateSourceDf
.withColumn("Stage", when(col("Data.WorkBreakdownUp1Summary").isNotNull && col("Data.WorkBreakdownUp1Summary")=!="", rtrim(regexp_extract($"Data.WorkBreakdownUp1Summary","^.*?(?= - *[a-zA-Z])",0))).otherwise(""))
.withColumn("SubSystem", when(col("Data.ProcessBreakdownSummaryList").isNotNull && col("Data.ProcessBreakdownSummaryList")=!="", regexp_extract($"Data.ProcessBreakdownSummaryList","^.*?(?= - *[a-zA-Z])",0)).otherwise(""))
.withColumn("System", when(col("Data.ProcessBreakdownUp1SummaryList").isNotNull && col("Data.ProcessBreakdownUp1SummaryList")=!="", regexp_extract($"Data.ProcessBreakdownUp1SummaryList","^.*?(?= - *[a-zA-Z])",0)).otherwise(""))
.withColumn("Facility", when(col("Data.ProcessBreakdownUp2Summary").isNotNull && col("Data.ProcessBreakdownUp2Summary")=!="", regexp_extract($"Data.ProcessBreakdownUp2Summary","^.*?(?= - *[a-zA-Z])",0)).otherwise(""))
.withColumn("Area", when(col("Data.ProcessBreakdownUp3Summary").isNotNull && col("Data.ProcessBreakdownUp3Summary")=!="", regexp_extract($"Data.ProcessBreakdownUp3Summary","^.*?(?= - *[a-zA-Z])",0)).otherwise(""))
.select("Data.ID",
"Data.CertificateID",
"Data.CertificateTag",
"Data.CertificateDescription",
"Data.WorkBreakdownUp1Summary",
"Data.ProcessBreakdownSummaryList",
"Data.ProcessBreakdownUp1SummaryList",
"Data.ProcessBreakdownUp2Summary",
"Data.ProcessBreakdownUp3Summary",
"Data.ActualStartDate",
"Data.ActualEndDate",
"Data.ApprovedDate",
"Data.CurrentState",
"DataType",
"PullDate",
"PullTime",
"Stage",
"System",
"SubSystem",
"Facility",
"Area"
)
.filter((col("Stage").isNotNull) && (length(col("Stage"))>0))
.filter(((col("SubSystem").isNotNull) && (length(col("SubSystem"))>0)) || ((col("System").isNotNull) && (length(col("System"))>0)) || ((col("Facility").isNotNull) && (length(col("Facility"))>0)) || ((col("Area").isNotNull) && (length(col("Area"))>0))
)
.select("*")
This DataFrame, fullCertificateSourceDf, contains the following data:
I have hidden some columns for brevity.
I want the data to look like this:
We split on two columns: ProcessBreakdownSummaryList and ProcessBreakdownUp1SummaryList. Both are comma-separated lists.
Note that when ProcessBreakdownSummaryList (CS10-100-22-10 - Mine Intake Air Fan Heater System, CS10-100-81-10 - Mine Services Switchgear) and ProcessBreakdownUp1SummaryList (CS10-100-22 - Service Shaft Ventilation, CS10-100-81 - Service Shaft Electrical) are the same, we should split only once.
However, if they differ, as with ProcessBreakdownSummaryList (CS10-100-22-10 - Mine Intake Air Fan Heater System, CS10-100-81-10 - Mine Services Switchgear) and ProcessBreakdownUp1SummaryList (CS10-100-22 - Service Shaft Ventilation, CS10-100-34 - Service Shaft Electrical), it should split again to produce a third row.
Thanks in advance for your help with this.
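Solution
You can solve this in several ways; I think the simplest approach for complex processing like this is to do it in Scala. Read all the columns, including "ProcessBreakdownSummaryList" and "ProcessBreakdownUp1SummaryList", compare whether their values are the same or different, and emit multiple rows for a single input row. Then flatMap the output to get a DataFrame containing all the rows you need.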
val fullCertificateSourceDf = // your code
fullCertificateSourceDf.map { row =>
  // select("Data.ID", ...) names each output column after the last path segment,
  // so the fields are read here as "ID", "ProcessBreakdownSummaryList", and so on
  val id = row.getAs[String]("ID")
  // ... read all columns
  val processBreakdownSummaryList = row.getAs[String]("ProcessBreakdownSummaryList")
  val processBreakdownUp1SummaryList = row.getAs[String]("ProcessBreakdownUp1SummaryList")
  // split processBreakdownSummaryList on ","
  // split processBreakdownUp1SummaryList on ","
  // compare them for equality
  // let's say you end up with 4 rows
  // return those 4 rows as a List of tuples of strings, like
  // List((id, certificateId, certificateTag, ..distinct values of processBreakdownUp1SummaryList...), (...) ...)
  // all columns (id, certificateId, certificateTag, etc.) are repeated for each distinct value of processBreakdownUp1SummaryList and processBreakdownSummaryList
}.flatMap(identity(_)).toDF("column1","column2"...)
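The "split, compare, emit rows" step can be sketched in plain Scala, independent of Spark. This is only one possible reading of the requirement, not the rule from the question or the answer: it assumes a sub-system entry belongs to a system entry when the system's code (the text before " - ") followed by "-" is a prefix of the sub-system's code, and that descriptions never contain commas. The names `BreakdownPairs`, `code`, `entries`, and `pair` are hypothetical.

```scala
// Hypothetical helper illustrating one assumed pairing rule; not part of the original answer.
object BreakdownPairs {
  // "CS10-100-22 - Service Shaft Ventilation" -> "CS10-100-22"
  private def code(entry: String): String = entry.split(" - ", 2)(0).trim

  // Split a comma-separated list into trimmed, non-empty entries (null-safe).
  private def entries(list: String): Seq[String] =
    Option(list).toSeq.flatMap(_.split(",").toSeq).map(_.trim).filter(_.nonEmpty)

  /** Returns one (subSystem, system) pair per output row.
    * Assumed rule: a sub-system pairs with the first system whose code, followed by "-",
    * is a prefix of the sub-system's code. Unmatched entries get a row of their own,
    * with an empty string on the other side.
    */
  def pair(subSystemList: String, systemList: String): Seq[(String, String)] = {
    val subs = entries(subSystemList)
    val systems = entries(systemList)
    val matched = subs.map { sub =>
      (sub, systems.find(sys => code(sub).startsWith(code(sys) + "-")).getOrElse(""))
    }
    val used = matched.map(_._2).toSet
    val leftovers = systems.filterNot(used.contains).map(sys => ("", sys))
    matched ++ leftovers
  }
}
```

With the values from the question, the matching lists yield two pairs, while the lists that differ in the second system (CS10-100-34) yield three. Inside the flatMap, each pair would become one output tuple alongside the repeated columns such as id and certificateId.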
Here is an example of splitting one row into multiple rows:
// assumes `spark` is the active SparkSession (as in spark-shell); the import provides the encoders flatMap needs
import spark.implicits._

val employees = spark.createDataFrame(Seq(("E1",100.0,"a,b"), ("E2",200.0,"e,f"), ("E3",300.0,"c,d"))).toDF("employee","salary","clubs")
employees.flatMap { r =>
  // emit one (employee, salary, club) tuple per comma-separated club
  r.getAs[String]("clubs").split(",").toSeq.map { c =>
    (r.getAs[String]("employee"), r.getAs[Double]("salary"), c)
  }
}.toDF("employee","salary","clubs").show(false)
The result looks like this:
+--------+------+-----+
|employee|salary|clubs|
+--------+------+-----+
|E1      |100.0 |a    |
|E1      |100.0 |b    |
|E2      |200.0 |e    |
|E2      |200.0 |f    |
|E3      |300.0 |c    |
|E3      |300.0 |d    |
+--------+------+-----+
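For a plain one-column split like the one above, Spark's built-in `split` and `explode` functions are an alternative to mapping over rows by hand. This is a different technique from the one in the answer, shown only as a minimal sketch; the object name `ExplodeClubs`, the method `explodeClubs`, and the local-mode session are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, split}

// Hypothetical wrapper object; only split and explode are Spark built-ins.
object ExplodeClubs {
  // split turns "a,b" into an array column; explode emits one row per array element,
  // repeating the other columns of the input row.
  def explodeClubs(employees: DataFrame): DataFrame =
    employees.withColumn("clubs", explode(split(col("clubs"), ",")))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
    import spark.implicits._

    val employees = Seq(("E1", 100.0, "a,b"), ("E2", 200.0, "e,f"), ("E3", 300.0, "c,d"))
      .toDF("employee", "salary", "clubs")

    explodeClubs(employees).show(false)
    spark.stop()
  }
}
```

This covers a single list column. Pairing two list columns with a custom rule, as in the question, still needs extra logic on top of it, which is where the row-wise flatMap approach is more flexible.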