How to use explode to transform while also applying a custom modification to another single column

Problem Description

I have a dataset like the following:

Input Dataset

Id, Parent_id, Data
-----------------------
1, NULL, favorite: 3
2, NULL, favorite: 4

Output Dataset

Id, Parent_Id, Data
------------------------
1, NULL, favorite: 3
1_t1, 1, favorite: 3
1_t2, 1, favorite: 3
1_t3, 1, favorite: 3
2, NULL, favorite: 4
2_t1, 2, favorite: 4
2_t2, 2, favorite: 4
2_t3, 2, favorite: 4
2_t4, 2, favorite: 4

As you can see above, I am trying to explode the favorite count attribute in the data column into its own separate rows, using the parent_id column to indicate each row's root record.

So far I have tried to do this with the Spark SQL explode function, but I could not get it to work.

Tags: sql, apache-spark, dataframe, apache-spark-sql

Solution


If I understand your question correctly, you are trying to generate/create new rows from each existing row based on the number in the data column, producing that many new rows, each with its own id and with parent_id pointing back at the original record.

If that is the case, you can use the map and flatMap operations as follows:

import org.apache.spark.sql.{Row, SparkSession}

import scala.collection.mutable.ArrayBuffer

val sparkSession = SparkSession.builder().appName("explode-example").master("local[*]").getOrCreate()
import sparkSession.sqlContext.implicits._

// sample input matching the question
val input = Seq(("1", "NULL", "favorite:3"), ("2", "NULL", "favorite:4")).toDF("id", "parent_id", "data")

input.printSchema()
input.show(false)

// for each row, keep the original record and emit one child row per favorite count,
// suffixing the id with _t<index> and pointing parent_id at the original id
val resultRDD = input.rdd.map(row => {
  val list = new ArrayBuffer[Row]()
  list += row

  // number of child rows to generate, taken from the "favorite:N" value
  val pointer = row.getAs[String]("data").split(":")(1).toInt

  for (index <- 1 to pointer) {
    val newId = s"${row.getAs[String]("id")}_t$index"
    list += Row.fromSeq(Seq(newId, row.getAs[String]("id"), row.getAs[String]("data")))
  }

  list
}).flatMap(_.toIterator)

// rebuild a DataFrame with the original schema
val resultDF = sparkSession.createDataFrame(resultRDD, input.schema)
resultDF.show(false)

The result will be:

root
 |-- id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- data: string (nullable = true)

+---+---------+----------+
|id |parent_id|data      |
+---+---------+----------+
|1  |NULL     |favorite:3|
|2  |NULL     |favorite:4|
+---+---------+----------+

+----+---------+----------+
|id  |parent_id|data      |
+----+---------+----------+
|1   |NULL     |favorite:3|
|1_t1|1        |favorite:3|
|1_t2|1        |favorite:3|
|1_t3|1        |favorite:3|
|2   |NULL     |favorite:4|
|2_t1|2        |favorite:4|
|2_t2|2        |favorite:4|
|2_t3|2        |favorite:4|
|2_t4|2        |favorite:4|
+----+---------+----------+
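
Since the question mentions trying the Spark SQL explode function, here is a minimal alternative sketch that stays in the DataFrame API instead of dropping to RDDs. It assumes Spark 2.4 or later (where the built-in sequence function is available); the names children, n, idx and alternativeDF are placeholders chosen for illustration, and the column names follow the example above.

import org.apache.spark.sql.functions._

// build one child row per favorite count: id suffixed with _t<index>,
// parent_id pointing at the original id, data copied unchanged
val children = input
  .withColumn("n", split(col("data"), ":").getItem(1).cast("int"))
  .withColumn("idx", explode(sequence(lit(1), col("n"))))
  .select(
    concat(col("id"), lit("_t"), col("idx")).as("id"),
    col("id").as("parent_id"),
    col("data"))

// append the generated rows to the original ones
val alternativeDF = input.unionByName(children).orderBy("id")
alternativeDF.show(false)

The RDD-based solution above works on any Spark version; this sketch avoids the round trip through Row objects but relies on sequence, so it is only an option on newer releases.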
