How to understand to_timestamp for converting a String to a Timestamp in Spark Scala?

Problem Description

I'm confused by the to_timestamp function, because its two forms, to_timestamp(s: Column, fmt: String) and to_timestamp(s: Column), produce very different results.

I tried the following scenarios:

// In spark-shell, spark.implicits._ and org.apache.spark.sql.functions._
// are already in scope; in a standalone app you would need:
//   import spark.implicits._
//   import org.apache.spark.sql.functions.to_timestamp
val dfMilli = Seq(
  ("2019-02-19 03:14:12.254", "20190219T031412.254"),
  ("9999-12-31 23:23:41.182", "99991231T232341.182")
).toDF("Standard", "Formatted")

val dfMicro = Seq(
  ("2019-02-19 03:14:12.254183", "20190219T031412.254183"),
  ("9999-12-31 23:23:41.182109", "99991231T232341.182109")
).toDF("Standard", "Formatted")

val dfNano = Seq(
  ("2019-02-19 03:14:12.254183893", "20190219T031412.254183893"),
  ("9999-12-31 23:23:41.182109821", "99991231T232341.182109821")
).toDF("Standard", "Formatted")

Then I compute the conversion with to_timestamp on both the Standard and Formatted columns:

val dfMilliTs = dfMilli
  .withColumn("StandardTs", to_timestamp('Standard))
  .withColumn("FormattedTs", to_timestamp('Formatted, "yyyyMMdd'T'HHmmss.SSS"))

val dfMicroTs = dfMicro
  .withColumn("StandardTs", to_timestamp('Standard))
  .withColumn("FormattedTs", to_timestamp('Formatted, "yyyyMMdd'T'HHmmss.SSSSSS"))

val dfNanoTs = dfNano
  .withColumn("StandardTs", to_timestamp('Standard))
  .withColumn("FormattedTs", to_timestamp('Formatted, "yyyyMMdd'T'HHmmss.SSSSSSSSS"))

And got these results:

scala> dfMilliTs.show(false)
+-----------------------+-------------------+-----------------------+-------------------+
|Standard               |Formatted          |StandardTs             |FormattedTs        |
+-----------------------+-------------------+-----------------------+-------------------+
|2019-02-19 03:14:12.254|20190219T031412.254|2019-02-19 03:14:12.254|2019-02-19 03:14:12|
|9999-12-31 23:23:41.182|99991231T232341.182|9999-12-31 23:23:41.182|9999-12-31 23:23:41|
+-----------------------+-------------------+-----------------------+-------------------+


scala> dfMicroTs.show(false)
+--------------------------+----------------------+--------------------------+-----------+
|Standard                  |Formatted             |StandardTs                |FormattedTs|
+--------------------------+----------------------+--------------------------+-----------+
|2019-02-19 03:14:12.254183|20190219T031412.254183|2019-02-19 03:14:12.254183|null       |
|9999-12-31 23:23:41.182109|99991231T232341.182109|9999-12-31 23:23:41.182109|null       |
+--------------------------+----------------------+--------------------------+-----------+


scala> dfNanoTs.show(false)
+-----------------------------+-------------------------+--------------------------+-----------+
|Standard                     |Formatted                |StandardTs                |FormattedTs|
+-----------------------------+-------------------------+--------------------------+-----------+
|2019-02-19 03:14:12.254183893|20190219T031412.254183893|2019-02-19 03:14:12.254183|null       |
|9999-12-31 23:23:41.182109821|99991231T232341.182109821|9999-12-31 23:23:41.182109|null       |
+-----------------------------+-------------------------+--------------------------+-----------+

So as you can see, the two forms of the to_timestamp function produce very different results:

  1. to_timestamp(s: Column) correctly converts millisecond and microsecond timestamps to TimestampType.

  2. For nanoseconds, to_timestamp(s: Column) can still handle the input, but it truncates the value to microseconds.

  3. to_timestamp(s: Column, fmt: String) only supports milliseconds, and even then it truncates the result to whole seconds; microsecond and nanosecond inputs come back as null (a workaround sketch follows this list).

  4. Do you think I'm using the wrong format string?
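
For item 3 above, one way around the truncation is to rewrite the formatted string into the canonical shape that the no-format overload parses with full microsecond precision. This is only a sketch I'm adding for illustration (the regex and the dfMicroWorkaround name are mine, not part of the original experiment):

// Workaround sketch: "20190219T031412.254183" -> "2019-02-19 03:14:12.254183",
// which the no-format overload (i.e. Cast) parses with microsecond precision.
import org.apache.spark.sql.functions.regexp_replace

val dfMicroWorkaround = dfMicro.withColumn(
  "FormattedTs",
  to_timestamp(regexp_replace(
    'Formatted,
    "(\\d{4})(\\d{2})(\\d{2})T(\\d{2})(\\d{2})(\\d{2})\\.(\\d+)",
    "$1-$2-$3 $4:$5:$6.$7")))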

I know that TimestampType in Spark is based on java.sql.Timestamp, a subclass of java.util.Date (java.util.Date itself only keeps milliseconds, while java.sql.Timestamp carries the extra fraction in a separate nanosecond field). But why do the two forms of to_timestamp follow such different logic when converting strings? Could to_timestamp(s: Column, fmt: String) support the conversion better?
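
To make the precision story concrete, here is a small snippet (my own illustration, not from the original experiment) showing where the millisecond/nanosecond split lives:

// java.util.Date keeps only milliseconds; java.sql.Timestamp adds a
// separate nanosecond field on top of it.
val ts = java.sql.Timestamp.valueOf("2019-02-19 03:14:12.254183893")
println(ts.getTime)  // epoch millis, fraction truncated to .254
println(ts.getNanos) // 254183893, the full fractional second in nanos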

Update

Based on the Spark code on GitHub: datetimeExpressions.scala

I found the definitions behind the two to_timestamp forms:

  1. The standard form to_timestamp(s: Column) is implemented as Cast(left, TimestampType), i.e. it delegates directly to the Cast expression.

  2. The formatted form to_timestamp(s: Column, fmt: String) is implemented as GetTimestamp(left, format), a separate expression.

Next, I need to dig further into these two expressions, Cast and GetTimestamp; my working theory is sketched below.
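
If my reading is right, in Spark 2.x the formatted path behaves like unix_timestamp followed by a cast, which would explain the whole-second truncation: unix_timestamp parses the string and returns seconds since the epoch. A minimal sketch of that assumed equivalence (dfMilliViaUnix is my own name):

import org.apache.spark.sql.functions.unix_timestamp
import org.apache.spark.sql.types.TimestampType

// Assumed equivalence in Spark 2.x: parse to whole seconds, then cast,
// so the .254 fraction is dropped before the timestamp is built.
val dfMilliViaUnix = dfMilli.withColumn(
  "FormattedTs",
  unix_timestamp('Formatted, "yyyyMMdd'T'HHmmss.SSS").cast(TimestampType))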

Update

It looks like this has now been fixed in the master-snapshot version of Spark. [SPARK-27438]

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

Very nice, and here are the new results: to_timestamp(s: Column, fmt: String) can now handle microseconds, and nanosecond-formatted strings no longer come back as errors (they are truncated to microseconds, which matches the microsecond precision of TimestampType itself). Great!

scala> dfMilliTs.show(false)
+-----------------------+-------------------+-----------------------+-----------------------+
|Standard               |Formatted          |StandardTs             |FormattedTs            |
+-----------------------+-------------------+-----------------------+-----------------------+
|2019-02-19 03:14:12.254|20190219T031412.254|2019-02-19 03:14:12.254|2019-02-19 03:14:12.254|
|9999-12-31 23:23:41.182|99991231T232341.182|9999-12-31 23:23:41.182|9999-12-31 23:23:41.182|
+-----------------------+-------------------+-----------------------+-----------------------+


scala> dfMicroTs.show(false)
+--------------------------+----------------------+--------------------------+--------------------------+
|Standard                  |Formatted             |StandardTs                |FormattedTs               |
+--------------------------+----------------------+--------------------------+--------------------------+
|2019-02-19 03:14:12.254183|20190219T031412.254183|2019-02-19 03:14:12.254183|2019-02-19 03:14:12.254183|
|9999-12-31 23:23:41.182109|99991231T232341.182109|9999-12-31 23:23:41.182109|9999-12-31 23:23:41.182109|
+--------------------------+----------------------+--------------------------+--------------------------+


scala> dfNanoTs.show(false)
+-----------------------------+-------------------------+--------------------------+--------------------------+
|Standard                     |Formatted                |StandardTs                |FormattedTs               |
+-----------------------------+-------------------------+--------------------------+--------------------------+
|2019-02-19 03:14:12.254183893|20190219T031412.254183893|2019-02-19 03:14:12.254183|2019-02-19 03:14:12.254183|
|9999-12-31 23:23:41.182109821|99991231T232341.182109821|9999-12-31 23:23:41.182109|9999-12-31 23:23:41.182109|
+-----------------------------+-------------------------+--------------------------+--------------------------+
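
As a quick sanity check (my own addition, assuming the 3.0-SNAPSHOT session above), the two paths can be compared directly; in these results they now agree down to the sub-second part:

// Expect 0: StandardTs and FormattedTs match down to the microsecond.
dfMicroTs.filter('StandardTs =!= 'FormattedTs).count()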

Tags: apache-spark, apache-spark-sql
