How to change the data types of DataFrame columns based on a case class in Scala/Spark

Problem description

I am trying to convert the data types of some columns based on a case class.

val simpleDf = Seq(("James",34,"2006-01-01","true","M",3000.60),
                     ("Michael",33,"1980-01-10","true","F",3300.80),
                     ("Robert",37,"1995-01-05","false","M",5000.50)
                 ).toDF("firstName","age","jobStartDate","isGraduated","gender","salary")

// Output
simpleDf.printSchema()
root
|-- firstName: string (nullable = true)
|-- age: integer (nullable = false)
|-- jobStartDate: string (nullable = true)
|-- isGraduated: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: double (nullable = false)

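Here I want to change the data type of jobStartDate to timestamp and of isGraduated to boolean. Is it possible to do this conversion using a case class? I know it can be done by casting each column individually, but in my case I need to map the incoming DF according to a defined case class.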

case class empModel(firstName:String, 
                       age:Integer, 
                       jobStartDate:java.sql.Timestamp, 
                       isGraduated:Boolean, 
                       gender:String,
                       salary:Double
                      )

val newDf = simpleDf.as[empModel].toDF
newDf.show(false)

I get an error because of the string-to-timestamp conversion. Is there a workaround?

Tags: scala, apache-spark, apache-spark-sql

Solution


You can generate a schema from the case class using ScalaReflection:

import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.ScalaReflection


val schema = ScalaReflection.schemaFor[empModel].dataType.asInstanceOf[StructType]

You can now pass this schema when loading a file into a DataFrame.
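As a rough illustration of passing the schema at load time, the sketch below reads a small CSV file with the schema derived from the case class. The temporary file, its contents, the local SparkSession settings, and the "yyyy-MM-dd" timestampFormat option are assumptions made for this example; they are not part of the original question.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Same case class as in the question.
case class empModel(firstName: String,
                    age: Integer,
                    jobStartDate: java.sql.Timestamp,
                    isGraduated: Boolean,
                    gender: String,
                    salary: Double)

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("schema-on-read")
  .getOrCreate()

// Hypothetical input file, written here only so the sketch is self-contained.
val csvPath = Files.createTempFile("employees", ".csv")
Files.write(csvPath,
  """firstName,age,jobStartDate,isGraduated,gender,salary
    |James,34,2006-01-01,true,M,3000.60
    |Michael,33,1980-01-10,true,F,3300.80
    |Robert,37,1995-01-05,false,M,5000.50""".stripMargin.getBytes("UTF-8"))

// Derive the schema from the case class, as in the answer.
val schema = ScalaReflection.schemaFor[empModel].dataType.asInstanceOf[StructType]

// Apply the schema while reading, so the columns arrive already typed.
val typedDf: DataFrame = spark.read
  .option("header", "true")
  .option("timestampFormat", "yyyy-MM-dd") // assumption: date-only timestamps in the file
  .schema(schema)
  .csv(csvPath.toString)

typedDf.printSchema()
```

Note that file sources generally treat every column of a user-supplied schema as nullable, so the nullable flags printed here may differ from those in the case-class schema.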

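Alternatively, if you want to cast some or all of the columns after the DataFrame has been read, you can iterate over the schema fields and cast each column to the corresponding data type, for example with foldLeft: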

val df = schema.fields.foldLeft(simpleDf){ 
  (df, s) => df.withColumn(s.name, df(s.name).cast(s.dataType))     
}

df.printSchema

//root
// |-- firstName: string (nullable = true)
// |-- age: integer (nullable = true)
// |-- jobStartDate: timestamp (nullable = true)
// |-- isGraduated: boolean (nullable = false)
// |-- gender: string (nullable = true)
// |-- salary: double (nullable = false)

df.show
//+---------+---+-------------------+-----------+------+------+
//|firstName|age|       jobStartDate|isGraduated|gender|salary|
//+---------+---+-------------------+-----------+------+------+
//|    James| 34|2006-01-01 00:00:00|       true|     M|3000.6|
//|  Michael| 33|1980-01-10 00:00:00|       true|     F|3300.8|
//|   Robert| 37|1995-01-05 00:00:00|      false|     M|5000.5|
//+---------+---+-------------------+-----------+------+------+
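
Since the question ultimately calls .as[empModel], here is a minimal sketch of a variant that casts all columns in a single select and then converts the result to a typed Dataset. It swaps ScalaReflection for Encoders.product, which is Spark's public API for deriving a schema from a case class; that swap, the local SparkSession settings, and the names casted and typedDs are choices made for this example rather than part of the original answer.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.col

// Same case class as in the question.
case class empModel(firstName: String,
                    age: Integer,
                    jobStartDate: java.sql.Timestamp,
                    isGraduated: Boolean,
                    gender: String,
                    salary: Double)

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cast-to-case-class")
  .getOrCreate()
import spark.implicits._

// Same sample data as in the question.
val simpleDf = Seq(
  ("James", 34, "2006-01-01", "true", "M", 3000.60),
  ("Michael", 33, "1980-01-10", "true", "F", 3300.80),
  ("Robert", 37, "1995-01-05", "false", "M", 5000.50)
).toDF("firstName", "age", "jobStartDate", "isGraduated", "gender", "salary")

// Derive the target schema from the case class via the public Encoders API.
val schema = Encoders.product[empModel].schema

// Cast every column named in the schema in one projection,
// keeping the original column names explicitly.
val casted = simpleDf.select(
  schema.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*
)

// The column types now line up with the case class, so .as[empModel] resolves.
val typedDs: Dataset[empModel] = casted.as[empModel]
typedDs.show(false)
```

A single select builds one projection instead of one per column, which tends to keep the query plan smaller than repeated withColumn calls on wide DataFrames. Values that cannot be cast become null by default, so a null in a primitive field such as isGraduated would still fail when the rows are decoded.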
