String to timestamp Pyspark / Spark

Problem description

I need to convert the input string 08-DEC-2011 00.00.00 into the timestamp 20111208000000. The data comes from a file:

"CLIENTCONTEXTID","SRVR","CLNT","USERNAME","UPDATEDTM"
1202,"jbosswabcd6","100.126.164.172","SUSER",08-DEC-2011 00.00.00
1653,"jbossbabcd4","17.26.164.176","SUSER",08-DEC-2011 00.00.00
1654,"jbossbabcd4","12.26.164.221","SUSER",08-JAN-2011 00.00.00
1655,"jbossbabcd4","17.26.164.223","SuperUser",08-DEC-2011 00.00.00
1656,"jbossbabcd4","17.26.164.221","SUSER",08-DEC-2011 00.00.00
1657,"jbossbabcd4","17.26.164.226","SUSER",08-DEC-2011 00.00.00
1658,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
1659,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
2802,"jbosswabcd1","172.20.19.130","SuperUser",08-DEC-2011 00.00.00

As a first step, I am trying to convert it to a timestamp:

from pyspark.sql import functions as F
from pyspark.sql.functions import unix_timestamp
df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df.createOrReplaceTempView("tablesss")    
spark.sql('select UPDATEDTM,cast(unix_timestamp(UPDATEDTM,"dd-MMM-YYYY HH.mm.ss") as timestamp) columnn12 from tablesss').show()

I also tried another approach:

df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df=df.withColumn("datetype_timestamp",F.to_timestamp(F.col("UPDATEDTM"),"dd-MMM-YYYY HH.mm.ss"))
df.select(F.col("UPDATEDTM"),F.col("datetype_timestamp")).show()

Both lead to the following result:

+--------------------+-------------------+
|           UPDATEDTM| datetype_timestamp|
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
+--------------------+-------------------+

However, if I use Scala:

val df = spark.read.option("header",true).csv("/hdfs/context.csv")
df.withColumn("datetype_timestamp", to_timestamp(col("UPDATEDTM"),"dd-MMM-yyyy HH.mm.ss")).show(false)

it produces:

+--------------------+-------------------+
|UPDATEDTM           |datetype_timestamp |
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
+--------------------+-------------------+

I am not sure what the difference between Scala and PySpark is. It is strange that the same call behaves differently.

After getting this timestamp, I want to apply Python's strftime to convert it to the desired format.

Please help identify the problem with this conversion and the best way to do it.

Tags: scala, pyspark, timestamp, unix-timestamp, to-timestamp

Solution


Use yyyy (the ordinary calendar year) instead of YYYY (the week-based year).

In your Scala code you used yyyy, but in PySpark you used YYYY. With YYYY the parser resolves the date from the week-based year alone: week-of-year and day-of-week default to the first week of week-year 2011, which (with the default locale, where weeks start on Sunday) is 26 December 2010. That is exactly the 2010-12-26 in your output, regardless of the day and month in the input.
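
To see the two patterns diverge side by side, here is a minimal PySpark sketch (assuming Spark 2.x, whose legacy SimpleDateFormat parser quietly resolves week-based patterns; Spark 3.x rejects 'Y' in parse patterns unless spark.sql.legacy.timeParserPolicy is set to LEGACY):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
demo = spark.createDataFrame([("08-DEC-2011 00.00.00",)], ["UPDATEDTM"])

demo.select(
    #YYYY is the week-based year: the parse collapses to the first week
    #of week-year 2011, i.e. 2010-12-26, ignoring the day and month
    F.to_timestamp("UPDATEDTM", "dd-MMM-YYYY HH.mm.ss").alias("week_year"),
    #yyyy is the plain calendar year, giving the expected 2011-12-08
    F.to_timestamp("UPDATEDTM", "dd-MMM-yyyy HH.mm.ss").alias("calendar_year"),
).show()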

Example:

df.show()
#+--------------------+
#|           UPDATEDTM|
#+--------------------+
#|08-DEC-2011 00.00.00|
#+--------------------+

df.withColumn("dd",to_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss')).show()

#using unix_timestamp function
df.withColumn("dd",unix_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss').cast("timestamp")).show()
#+--------------------+-------------------+
#|           UPDATEDTM|                 dd|
#+--------------------+-------------------+
#|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
#+--------------------+-------------------+
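
From there, to get to the 20111208000000 string the question ultimately asks for, there is no need to round-trip through Python's strftime: date_format can render the parsed timestamp directly in Spark. A minimal sketch of that last step:

from pyspark.sql.functions import col, date_format, to_timestamp

#parse with the calendar-year pattern, then format straight to the target string
df.withColumn("ts", to_timestamp(col("UPDATEDTM"), "dd-MMM-yyyy HH.mm.ss")) \
  .withColumn("formatted", date_format(col("ts"), "yyyyMMddHHmmss")) \
  .select("UPDATEDTM", "formatted").show()
#+--------------------+--------------+
#|           UPDATEDTM|     formatted|
#+--------------------+--------------+
#|08-DEC-2011 00.00.00|20111208000000|
#+--------------------+--------------+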
