How can a join of two Datasets automatically produce a typed Dataset?

Problem

Can Spark infer the schema automatically and convert the result of a join back into a Dataset, without the programmer having to create a case class for every join?

    import spark.implicits._

    case class DfLeftClass(
      id: Long,
      name: String,
      age: Int
    )
    val dfLeft = Seq(
      (1, "Tim", 30),
      (2, "John", 15),
      (3, "Pens", 20)
    ).toDF("id", "name", "age").as[DfLeftClass]

    case class DfRightClass(
      id: Long,
      name: String,
      age: Int,
      hobby: String
    )
    val dfRight = Seq(
      (1, "Tim", 30, "Swimming"),
      (2, "John", 15, "Reading"),
      (3, "Pens", 20, "Programming")
    ).toDF("id", "name", "age", "hobby").as[DfRightClass]

    val joined: DataFrame = dfLeft.join(dfRight) // this results in a DataFrame instead of a Dataset

Tags: scala, apache-spark, functional-programming

Solution

To stay in the Dataset API, you can use joinWith. This function returns a Dataset of tuples containing both sides of the join:

val joined: Dataset[(DfLeftClass, DfRightClass)] = dfLeft.joinWith(dfRight,
                          dfLeft.col("id").eqNullSafe(dfRight.col("id")))

Result:

+-------------+--------------------------+
|_1           |_2                        |
+-------------+--------------------------+
|{1, Tim, 30} |{1, Tim, 30, Swimming}    |
|{2, John, 15}|{2, John, 15, Reading}    |
|{3, Pens, 20}|{3, Pens, 20, Programming}|
+-------------+--------------------------+

From here you can keep working with the tuples, or map each tuple to a third case class.
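As a sketch of the second option, assuming the `joined` Dataset from the `joinWith` call above and `import spark.implicits._` in scope (the name `JoinedClass` is introduced here purely for illustration):

    // A third case class combining fields from both sides (hypothetical name).
    case class JoinedClass(
      id: Long,
      name: String,
      age: Int,
      hobby: String
    )

    // Map each (left, right) tuple into the combined case class,
    // keeping the result a strongly typed Dataset[JoinedClass].
    val joinedTyped: Dataset[JoinedClass] = joined.map { case (l, r) =>
      JoinedClass(l.id, l.name, l.age, r.hobby)
    }

The case class must be defined outside the method that calls `map` (or at the top level in a notebook/shell) so that Spark can derive an encoder for it via `spark.implicits._`.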
