Spark DataFrames: convert a unix exponential number to a string integer to get a timestamp

Problem description

The Spark dataframe below has start_t and end_t in unix epoch format, but with an exponent E in them.

+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+----------------+
| alt_t|           end_t|engine_fuel_rate_t|   lat_t|left_max_t|left_min_t|  lon_t|plm3_incl_t|right_max_t|right_min_t|road_class_u8|speed_t|sprung_weight_t|         start_t|
+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+----------------+
|1237.5|1.521956985733E9|                 0|-27.7314|       0.0|       0.0|22.9552|        1.5|        0.0|        0.0|            0|   17.4|          198.0| 1.52195698056E9|
|1236.5|1.521956989922E9|                 0|-27.7316|       0.0|       0.0|22.9552|       -3.3|        0.0|        0.0|            0|   17.6|          156.1|1.521956985733E9|
|1234.5|1.521956995378E9|                 0|-27.7318|       0.0|       0.0|22.9552|       -2.7|        0.0|        0.0|            0|   11.9|          148.6|1.521956989922E9|
|1230.5|1.521957001498E9|                 0| -27.732|       0.0|       0.0|22.9551|        2.3|        0.0|        0.0|            0|   13.2|          169.1|1.521956995378E9|

Since the column is a double, it cannot be converted directly to a timestamp; doing so throws an error saying the input needs to be a string. Converting anyway yields null:

+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+-------+
| alt_t|           end_t|engine_fuel_rate_t|   lat_t|left_max_t|left_min_t|  lon_t|plm3_incl_t|right_max_t|right_min_t|road_class_u8|speed_t|sprung_weight_t|start_t|
+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+-------+
|1237.5|1.521956985733E9|                 0|-27.7314|       0.0|       0.0|22.9552|        1.5|        0.0|        0.0|            0|   17.4|          198.0|   null|
|1236.5|1.521956989922E9|                 0|-27.7316|       0.0|       0.0|22.9552|       -3.3|        0.0|        0.0|            0|   17.6|          156.1|   null|
|1234.5|1.521956995378E9|                 0|-27.7318|       0.0|       0.0|22.9552|       -2.7|        0.0|        0.0|            0|   11.9|          148.6|   null|

So I used the following code:

%scala

// cast the unix columns to string first, then parse with unix_timestamp
val df2 = df.withColumn("start_t", df("start_t").cast("string"))
val df3 = df2.withColumn("end_t", df2("end_t").cast("string"))
val filteredDF = df3.withColumn("start_t", unix_timestamp($"start_t", "yyyyMMddHHmmss").cast("timestamp"))
filteredDF.show()

I get null in start_t and think it is caused by the E (exponential notation). I tested the value in pandas/Python, where the date is valid and produces a result. I know there is a way to change this using precision. I am trying to convert it to a timestamp in the format yyyy-MM-dd HH:mm:ss, with separate columns for the time and the date.

Note: a similar question was asked but was not answered: Scala Spark: Convert Double Column to Date Time Column in dataframe.

Tags: scala, apache-spark

Solution


Chain the conversions from String -> Double -> Timestamp. Casting a numeric column to timestamp makes Spark interpret the value as seconds since the Unix epoch, so the fractional part survives as sub-second precision. The following works:

scala> val df = Seq(("1237.5","1.521956985733E9"),("1236.5","1.521956989922E9"),("1234.5","1.521956995378E9"),("1230.5","1.521957001498E9")).toDF("alt_t","end_t")
df: org.apache.spark.sql.DataFrame = [alt_t: string, end_t: string]

scala> df.withColumn("end_t",'end_t.cast("double").cast("timestamp")).show(false)
+------+-----------------------+
|alt_t |end_t                  |
+------+-----------------------+
|1237.5|2018-03-25 01:49:45.733|
|1236.5|2018-03-25 01:49:49.922|
|1234.5|2018-03-25 01:49:55.378|
|1230.5|2018-03-25 01:50:01.498|
+------+-----------------------+


scala>
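The question's frame has both start_t and end_t in this format. A minimal sketch applying the same chained cast to both columns (assuming df is the original dataframe from the question; if the columns are already doubles, the cast("double") step is a harmless no-op):

import org.apache.spark.sql.functions.col

val converted = df
  .withColumn("start_t", col("start_t").cast("double").cast("timestamp")) // epoch seconds -> timestamp
  .withColumn("end_t",   col("end_t").cast("double").cast("timestamp"))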

Update 1

scala> val df = Seq(("1237.5","1.521956985733E9"),("1236.5","1.521956989922E9"),("1234.5","1.521956995378E9"),("1230.5","1.521957001498E9")).toDF("alt_t","end_t").withColumn("end_t",'end_t.cast("double").cast("timestamp"))
df: org.apache.spark.sql.DataFrame = [alt_t: string, end_t: timestamp]

scala> df.printSchema
root
 |-- alt_t: string (nullable = true)
 |-- end_t: timestamp (nullable = true)


scala>
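To also get the yyyy-MM-dd HH:mm:ss representation and the separate date and time columns the question asks for, a minimal sketch on top of the converted frame from the sketch above (the column names end_s, date_t and time_t are illustrative, not from the original post):

import org.apache.spark.sql.functions.{col, date_format, to_date}

val withParts = converted
  .withColumn("end_s",  date_format(col("end_t"), "yyyy-MM-dd HH:mm:ss")) // full timestamp as string
  .withColumn("date_t", to_date(col("end_t")))                            // date part only
  .withColumn("time_t", date_format(col("end_t"), "HH:mm:ss"))            // time part only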
