apache-spark - How should I understand to_timestamp for converting a String to a Timestamp in Spark Scala?
Problem description
I am confused by the to_timestamp function, because its two forms, to_timestamp(s: Column, fmt: String) and to_timestamp(s: Column), produce very different results.
I tried the following scenarios:
val dfMilli = Seq(
("2019-02-19 03:14:12.254", "20190219T031412.254"),
("9999-12-31 23:23:41.182", "99991231T232341.182")
).toDF("Standard", "Formatted")
val dfMicro = Seq(
("2019-02-19 03:14:12.254183", "20190219T031412.254183"),
("9999-12-31 23:23:41.182109", "99991231T232341.182109")
).toDF("Standard", "Formatted")
val dfNano = Seq(
("2019-02-19 03:14:12.254183893", "20190219T031412.254183893"),
("9999-12-31 23:23:41.182109821", "99991231T232341.182109821")
).toDF("Standard", "Formatted")
Then I applied to_timestamp to both the "Standard" and "Formatted" columns:
val dfMilliTs = dfMilli
.withColumn("StandardTs", to_timestamp('Standard))
.withColumn("FormattedTs", to_timestamp('Formatted, "yyyyMMdd'T'HHmmss.SSS"))
val dfMicroTs = dfMicro
.withColumn("StandardTs", to_timestamp('Standard))
.withColumn("FormattedTs", to_timestamp('Formatted, "yyyyMMdd'T'HHmmss.SSSSSS"))
val dfNanoTs = dfNano
.withColumn("StandardTs", to_timestamp('Standard))
.withColumn("FormattedTs", to_timestamp('Formatted, "yyyyMMdd'T'HHmmss.SSSSSSSSS"))
And got these results:
scala> dfMilliTs.show(false)
+-----------------------+-------------------+-----------------------+-------------------+
|Standard |Formatted |StandardTs |FormattedTs |
+-----------------------+-------------------+-----------------------+-------------------+
|2019-02-19 03:14:12.254|20190219T031412.254|2019-02-19 03:14:12.254|2019-02-19 03:14:12|
|9999-12-31 23:23:41.182|99991231T232341.182|9999-12-31 23:23:41.182|9999-12-31 23:23:41|
+-----------------------+-------------------+-----------------------+-------------------+
scala> dfMicroTs.show(false)
+--------------------------+----------------------+--------------------------+-----------+
|Standard |Formatted |StandardTs |FormattedTs|
+--------------------------+----------------------+--------------------------+-----------+
|2019-02-19 03:14:12.254183|20190219T031412.254183|2019-02-19 03:14:12.254183|null |
|9999-12-31 23:23:41.182109|99991231T232341.182109|9999-12-31 23:23:41.182109|null |
+--------------------------+----------------------+--------------------------+-----------+
scala> dfNanoTs.show(false)
+-----------------------------+-------------------------+--------------------------+-----------+
|Standard |Formatted |StandardTs |FormattedTs|
+-----------------------------+-------------------------+--------------------------+-----------+
|2019-02-19 03:14:12.254183893|20190219T031412.254183893|2019-02-19 03:14:12.254183|null |
|9999-12-31 23:23:41.182109821|99991231T232341.182109821|9999-12-31 23:23:41.182109|null |
+-----------------------------+-------------------------+--------------------------+-----------+
So, as you can see, the two forms of to_timestamp behave very differently:
to_timestamp(s: Column) correctly converts both millisecond and microsecond timestamps to TimestampType.
For nanoseconds, to_timestamp(s: Column) still handles the string, but truncates it to microseconds.
to_timestamp(s: Column, fmt: String) only supports milliseconds, and even then it truncates the result down to whole seconds.
Do you think I used a wrong format string?
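One plausible explanation, sketched here with plain JDK classes and no Spark involved (assuming Spark 2.x parses the format string with java.text.SimpleDateFormat under the hood, configured non-leniently): in SimpleDateFormat, 'S' means a literal millisecond count, not a fraction of a second, so "SSSSSS" cannot mean microseconds.

```scala
import java.text.{ParseException, SimpleDateFormat}

object FractionParseDemo extends App {
  // In SimpleDateFormat, 'S' is "millisecond count", not "fraction of a second",
  // so ".254183" is read as 254,183 milliseconds (over four minutes).
  val lenient = new SimpleDateFormat("yyyyMMdd'T'HHmmss.SSSSSS")
  val parsed  = lenient.parse("20190219T031412.254183")
  val shown   = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(parsed)
  println(shown) // not 03:14:12.254 -- the excess "milliseconds" spill into minutes and seconds

  // With lenient mode off (which is how I understand Spark 2.x sets up its
  // formatter), the out-of-range millisecond value is rejected outright;
  // a failed parse is what the DataFrame surfaces as null.
  val strict = new SimpleDateFormat("yyyyMMdd'T'HHmmss.SSSSSS")
  strict.setLenient(false)
  try strict.parse("20190219T031412.254183")
  catch { case _: ParseException => println("strict parse failed") }
}
```

This would account for both symptoms: a lenient parse lands on the wrong instant, and a strict parse fails and becomes null.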
I know that TimestampType in Spark is based on java.sql.Timestamp, a subclass of java.util.Date (the Date part only supports milliseconds, although Timestamp adds its own nanosecond field, and Spark internally stores timestamps at microsecond precision). But the string-conversion logic of the two forms of to_timestamp differs wildly. Could to_timestamp(s: Column, fmt: String) support the conversion better?
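That precision claim can be checked outside Spark with a minimal JDK-only sketch: java.sql.Timestamp itself keeps a separate nanosecond field on top of the millisecond resolution of its java.util.Date parent, so the microsecond ceiling observed in StandardTs would come from Spark's internal representation (a long count of microseconds since the epoch), not from java.sql.Timestamp.

```scala
import java.sql.Timestamp

object TimestampPrecisionDemo extends App {
  // Timestamp.valueOf accepts fractions up to nanosecond precision.
  val ts = Timestamp.valueOf("2019-02-19 03:14:12.254183893")
  println(ts.getNanos)       // 254183893 -- the full nanosecond fraction is kept
  println(ts.getTime % 1000) // 254       -- the Date part holds only milliseconds
}
```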
Update:
Based on the Spark source on GitHub, datetimeExpressions.scala,
I found the definitions of the two to_timestamp variants.
The standard form to_timestamp(s: Column) is implemented as Cast(left, TimestampType), i.e. it uses the Cast expression directly.
The formatted form to_timestamp(s: Column, fmt: String) is implemented as GetTimestamp(left, format), a separate expression.
Next, I need to investigate the Cast and GetTimestamp expressions further.
Update:
It seems the master-snapshot version of Spark has already fixed this. [SPARK-27438]
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT
/_/
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
Very nice, here are the new results: to_timestamp(s: Column, fmt: String) can now handle microseconds, and nanosecond-formatted strings no longer produce errors. Great!:
scala> dfMilliTs.show(false)
+-----------------------+-------------------+-----------------------+-----------------------+
|Standard |Formatted |StandardTs |FormattedTs |
+-----------------------+-------------------+-----------------------+-----------------------+
|2019-02-19 03:14:12.254|20190219T031412.254|2019-02-19 03:14:12.254|2019-02-19 03:14:12.254|
|9999-12-31 23:23:41.182|99991231T232341.182|9999-12-31 23:23:41.182|9999-12-31 23:23:41.182|
+-----------------------+-------------------+-----------------------+-----------------------+
scala> dfMicroTs.show(false)
+--------------------------+----------------------+--------------------------+--------------------------+
|Standard |Formatted |StandardTs |FormattedTs |
+--------------------------+----------------------+--------------------------+--------------------------+
|2019-02-19 03:14:12.254183|20190219T031412.254183|2019-02-19 03:14:12.254183|2019-02-19 03:14:12.254183|
|9999-12-31 23:23:41.182109|99991231T232341.182109|9999-12-31 23:23:41.182109|9999-12-31 23:23:41.182109|
+--------------------------+----------------------+--------------------------+--------------------------+
scala> dfNanoTs.show(false)
+-----------------------------+-------------------------+--------------------------+--------------------------+
|Standard |Formatted |StandardTs |FormattedTs |
+-----------------------------+-------------------------+--------------------------+--------------------------+
|2019-02-19 03:14:12.254183893|20190219T031412.254183893|2019-02-19 03:14:12.254183|2019-02-19 03:14:12.254183|
|9999-12-31 23:23:41.182109821|99991231T232341.182109821|9999-12-31 23:23:41.182109|9999-12-31 23:23:41.182109|
+-----------------------------+-------------------------+--------------------------+--------------------------+
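This matches my reading that Spark 3.0 switched its datetime parsing from SimpleDateFormat to java.time.DateTimeFormatter (the pattern semantics below are plain JDK behavior, not Spark API): in java.time, 'S' genuinely means fraction-of-second, so six S's parse a true microsecond fraction.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object JavaTimeFractionDemo extends App {
  // In java.time patterns, 'S' is fraction-of-second, so "SSSSSS" reads the
  // six digits after the dot as a microsecond fraction rather than a raw
  // millisecond count.
  val fmt = DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss.SSSSSS")
  val ldt = LocalDateTime.parse("20190219T031412.254183", fmt)
  println(ldt)         // 2019-02-19T03:14:12.254183
  println(ldt.getNano) // 254183000
}
```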