Need to flatten a DataFrame based on one column in Scala

Problem description

I have a DataFrame with the following schema:

 root
 |-- name: string (nullable = true)
 |-- roll: string (nullable = true)
 |-- subjectID: string (nullable = true)

The values in the DataFrame are as follows:

+-------------------+---------+--------------------+
|               name|     roll|           SubjectID|
+-------------------+---------+--------------------+
|                sam|ta1i3dfk4|            xy|av|mm|
|               royc|rfhqdbnb3|                   a|
|             alcaly|ta1i3dfk4|               xx|zz|
+-------------------+---------+--------------------+

I need to derive a DataFrame by flattening SubjectID, as shown below. Note that SubjectID is also a string.

+-------------------+---------+--------------------+
|               name|     roll|           SubjectID|
+-------------------+---------+--------------------+
|                sam|ta1i3dfk4|                  xy|
|                sam|ta1i3dfk4|                  av|
|                sam|ta1i3dfk4|                  mm|
|               royc|rfhqdbnb3|                   a|
|             alcaly|ta1i3dfk4|                  xx|
|             alcaly|ta1i3dfk4|                  zz|
+-------------------+---------+--------------------+

Any suggestions?

Tags: scala, dataframe, apache-spark-sql

Solution


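You can flatten the column with the explode function: split the string on the pipe delimiter, then explode the resulting array into one row per element. Example: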

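    // Assumes a SparkSession named `spark` is in scope (as in spark-shell)
    import org.apache.spark.sql.functions.{col, explode, split}
    import spark.implicits._

    val inputDF = Seq(
      ("sam", "ta1i3dfk4", "xy|av|mm"),
      ("royc", "rfhqdbnb3", "a"),
      ("alcaly", "rfhqdbnb3", "xx|zz")
    ).toDF("name", "roll", "subjectIDs")

    // Split `subjectIDs` on the literal pipe (escaped, since the pattern is a regex),
    // then explode the array into one row per ID
    val resultDF = inputDF
      .withColumn("subjectIDs", split(col("subjectIDs"), "\\|"))
      .withColumn("subjectIDs", explode(col("subjectIDs")))

    resultDF.show()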

    +------+---------+----------+ 
    |  name|     roll|subjectIDs|
    +------+---------+----------+
    |   sam|ta1i3dfk4|        xy|
    |   sam|ta1i3dfk4|        av|
    |   sam|ta1i3dfk4|        mm|
    |  royc|rfhqdbnb3|         a|
    |alcaly|rfhqdbnb3|        xx|
    |alcaly|rfhqdbnb3|        zz|
    +------+---------+----------+
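
To see what split followed by explode does to each row, here is a minimal sketch in plain Scala collections, with no Spark dependency. The case class Student and the function flattenSubjects are hypothetical names introduced only for illustration; they are not part of the Spark API or of the answer above.

```scala
// Hypothetical row type mirroring the three string columns of the DataFrame.
final case class Student(name: String, roll: String, subjectIDs: String)

// Split the pipe-delimited field and emit one copy of the row per piece,
// which mirrors what split + explode do on the DataFrame.
// The -1 limit keeps trailing empty strings instead of dropping them.
def flattenSubjects(rows: Seq[Student]): Seq[Student] =
  rows.flatMap { row =>
    row.subjectIDs.split("\\|", -1).toSeq.map(id => row.copy(subjectIDs = id))
  }

val rows = Seq(
  Student("sam", "ta1i3dfk4", "xy|av|mm"),
  Student("royc", "rfhqdbnb3", "a"),
  Student("alcaly", "rfhqdbnb3", "xx|zz")
)

val flattened = flattenSubjects(rows)
flattened.foreach(println)
```

The flatMap here plays the role of explode: a row whose field splits into three pieces becomes three rows, and a row with no delimiter passes through unchanged.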
