How to create multiple DataFrames using the same case class

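Problem description

How can I create multiple DataFrames using the same case class? Suppose I want to create several DataFrames, one with 5 columns and another with 3 columns. How would I do that with a single case class?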

Tags: scala, apache-spark, hadoop

Solution


You cannot directly create two DataFrames with different numbers of columns from a single case class. Take the case class FlightData below: a DataFrame created from it contains 3 columns. You can still get a second DataFrame from the same case class by selecting only some of its columns. If you have two different files and each file has a different structure, you need to create two separate case classes.

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  // Assumes an existing SparkSession named `spark`, as in spark-shell.
  import spark.implicits._

  // Define the case class at top level (not inside a method) so Spark can derive its encoder.
  case class FlightData(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: Int)

  val someData = Seq(
    Row("United States", "Romania", 15),
    Row("United States", "Croatia", 1),
    Row("United States", "Ireland", 344),
    Row("Egypt", "United States", 15)
  )

  // Column names and types must match the fields of FlightData.
  val flightDataSchema = List(
    StructField("DEST_COUNTRY_NAME", StringType, true),
    StructField("ORIGIN_COUNTRY_NAME", StringType, true),
    StructField("count", IntegerType, true)
  )

  // Three-column Dataset typed by FlightData.
  val dataDS = spark.createDataFrame(
    spark.sparkContext.parallelize(someData),
    StructType(flightDataSchema)
  ).as[FlightData]

  // Selecting a subset of columns yields a second, one-column DataFrame.
  val dataDS_2 = dataDS.select($"DEST_COUNTRY_NAME")
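
The last point in the answer (one case class per structure) might look roughly like the sketch below. It is only an illustration under assumptions: `Destination`, `TwoCaseClasses`, and `destinationsOf` are hypothetical names that do not appear in the original answer, and the `local[*]` master is there only so the example runs on its own.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// One case class per column layout. `Destination` is a hypothetical name.
case class FlightData(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: Int)
case class Destination(DEST_COUNTRY_NAME: String)

object TwoCaseClasses {
  // Project the three-column Dataset down to one column, then bind the
  // result to the narrower case class so it stays a typed Dataset.
  def destinationsOf(flights: Dataset[FlightData]): Dataset[Destination] =
    flights
      .select("DEST_COUNTRY_NAME")
      .as[Destination](Encoders.product[Destination])

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the answer).
    val spark = SparkSession.builder()
      .appName("two-case-classes")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build the typed Dataset directly from case class instances.
    val flights = Seq(
      FlightData("United States", "Romania", 15),
      FlightData("Egypt", "United States", 15)
    ).toDS()

    val destinations = destinationsOf(flights)

    flights.printSchema()      // schema with three columns
    destinations.printSchema() // schema with one column

    spark.stop()
  }
}
```

The trade-off is the one the answer describes: a plain `select` is enough if an untyped DataFrame is acceptable, while a second case class is needed when the narrower result should be typed as well.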
