Two similar UDFs produce different results in an unconventional date conversion in Spark Scala

Problem description

I have a simple Spark DataFrame:

val df = Seq("24-12-2017","25-01-2016").toDF("dates")
df.show()
+----------+
|     dates|
+----------+
|24-12-2017|
|25-01-2016|
+----------+

To convert it to the desired format, I use the following snippet:

import java.text.SimpleDateFormat

def fmt(d:String) = {
    val f = new SimpleDateFormat("dd-MM-yyyy").parse(d).getTime
    new java.sql.Timestamp(f)
}

val fmtTimestamp = udf(fmt(_:String):java.sql.Timestamp)

df.select($"dates",fmtTimestamp($"dates")).show
+----------+-------------------+
|     dates|         UDF(dates)|
+----------+-------------------+
|24-12-2017|2017-12-24 00:00:00|
|25-01-2016|2016-01-25 00:00:00|
+----------+-------------------+

and the conversion works as expected.

However, when I try a simplified version, everything breaks:

import java.text.SimpleDateFormat

def fmt(d:String) = {
    new SimpleDateFormat("dd-MM-yyyy").parse(d)
}

val fmtTimestamp = udf(fmt(_:String):java.util.Date)

java.lang.UnsupportedOperationException: Schema for type java.util.Date is not supported
  org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
  org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
  scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
  org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
  org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
  org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
  org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
  org.apache.spark.sql.functions$.udf(functions.scala:3898)
  $sess.cmd17Wrapper$Helper.<init>(cmd17.sc:7)
  $sess.cmd17Wrapper.<init>(cmd17.sc:718)
  $sess.cmd17$.<init>(cmd17.sc:563)
  $sess.cmd17$.<clinit>(cmd17.sc:-1)

What could be the reason that the first case completes while the second one breaks?

Tags: java, scala, apache-spark, user-defined-functions, simpledateformat

Solution

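The stack trace points directly at the cause: functions.udf calls ScalaReflection.schemaFor on the UDF's return type at definition time, in order to map it to a Catalyst data type. Spark only supports a fixed set of external types here; for dates and times those are java.sql.Date (DateType) and java.sql.Timestamp (TimestampType). SimpleDateFormat.parse returns a java.util.Date, which has no Catalyst mapping, so the simplified version fails before it ever processes a row. The first version works only because it wraps the parsed milliseconds in a java.sql.Timestamp.

If you want to keep the simplified UDF, returning java.sql.Date is enough (a minimal sketch; the names fmtDate and fmtDateUdf are illustrative):

import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf

// java.sql.Date maps to Catalyst DateType, so schema derivation succeeds
def fmtDate(d: String): java.sql.Date = {
    val millis = new SimpleDateFormat("dd-MM-yyyy").parse(d).getTime
    new java.sql.Date(millis)
}

val fmtDateUdf = udf(fmtDate _)
df.select($"dates", fmtDateUdf($"dates")).show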

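Alternatively, drop the UDF entirely. Since Spark 2.2 the built-in to_timestamp and to_date functions accept a format string, so the same conversion can be expressed natively (a sketch assuming Spark 2.2+):

import org.apache.spark.sql.functions.{to_date, to_timestamp}

// Built-in parsing: no UDF and no reflection over a return type
df.select(
    $"dates",
    to_timestamp($"dates", "dd-MM-yyyy").as("ts"),
    to_date($"dates", "dd-MM-yyyy").as("date")
).show

A built-in expression also stays visible to the Catalyst optimizer, whereas a UDF is an opaque black box.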