
Problem description

This is a follow-up to this question (Validating and changing date formats in pyspark). The solution there works perfectly for that scenario, but what if I also have a timestamp format and a few more different date formats, as shown below?

df = sc.parallelize([['12-21-2006'],
                     ['05/30/2007'],
                     ['01-01-1984'],
                     ['22-12-2017'],
                     ['12222019'],
                     ['2020/12/23'],
                     ['2020-12-23'],
                     ['12.11.2020'],
                     ['22/02/2012'],
                     ['2020/12/23 04:50:10'],
                     ['12/23/1996 05:56:20'],
                     ['23/12/2002 10:30:50'],
                     ['24.12.1990'],
                     ['12/03/20']]).toDF(["Date"])

df.show()

+-------------------+
|               Date|
+-------------------+
|         12-21-2006|
|         05/30/2007|
|         01-01-1984|
|         22-12-2017|
|           12222019|
|         2020/12/23|
|         2020-12-23|
|         12.11.2020|
|         22/02/2012|
|2020/12/23 04:50:10|
|12/23/1996 05:56:20|
|23/12/2002 10:30:50|
|         24.12.1990|
|           12/03/20|
+-------------------+

When I try to solve this in the same way (Validating and changing date formats in pyspark), I run into an error. As far as I can tell, the error is caused by the timestamp formats, even though records in formats such as MM/dd/yyyy and dd/MM/yyyy can obviously be converted to the desired format.

sdf = df.withColumn("d1", F.to_date(F.col("Date"),'yyyy/MM/dd')) \
  .withColumn("d2", F.to_date(F.col("Date"),'yyyy-MM-dd')) \
  .withColumn("d3", F.to_date(F.col("Date"),'MM/dd/yyyy')) \
  .withColumn("d4", F.to_date(F.col("Date"),'MM-dd-yyyy')) \
  .withColumn("d5", F.to_date(F.col("Date"),'MMddyyyy')) \
  .withColumn("d6", F.to_date(F.col("Date"),'MM.dd.yyyy')) \
  .withColumn("d7", F.to_date(F.col("Date"),'dd-MM-yyyy')) \
  .withColumn("d8", F.to_date(F.col("Date"),'dd/MM/yy')) \
  .withColumn("d9", F.to_date(F.col("Date"),'yyyy/MM/dd HH:MM:SS'))\
  .withColumn("d10", F.to_date(F.col("Date"),'MM/dd/yyyy HH:MM:SS'))\
  .withColumn("d11", F.to_date(F.col("Date"),'dd/MM/yyyy HH:MM:SS'))\
  .withColumn("d12", F.to_date(F.col("Date"),'dd.MM.yyyy')) \
  .withColumn("d13", F.to_date(F.col("Date"),'dd-MM-yy')) \
  .withColumn("result", F.coalesce("d1", "d2", "d3", "d4",'d5','d6','d7','d8','d9','d10','d11','d12','d13')) 
sdf.show()


org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34.0 failed 1 times, most recent failure: Lost task 0.0 in stage 34.0 (TID 34, ip-10-191-0-117.eu-west-1.compute.internal, executor driver): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '01-01-1984' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.

Is there a better way to solve this? I would like to know whether there is a function or library that can convert any of these date formats into a single date format.

Tags: dataframe, date, pyspark, apache-spark-sql, date-format

Solution
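
The SparkUpgradeException comes from Spark 3.0 switching to the stricter java.time-based parser. Patterns such as dd/MM/yy and dd-MM-yy no longer silently accept four-digit years like 1984 or 2012, and the time patterns above use HH:MM:SS where minutes and seconds should be written mm and ss (upper-case MM means month and SS means fraction of a second). As the error message itself suggests, the quickest workaround is to fall back to the legacy parser. A minimal sketch, assuming a Spark 3.x SparkSession named spark:

# Quick workaround (sketch only): restore the pre-3.0 lenient parser,
# exactly as the SparkUpgradeException message suggests.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

With LEGACY set, the original coalesce over d1 to d13 should run without the exception, but the lenient SimpleDateFormat rules can silently produce wrong dates for the timestamp rows, so fixing the patterns as sketched below is generally safer.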


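A cleaner option is to keep the new parser, set spark.sql.legacy.timeParserPolicy to CORRECTED so that a value a pattern cannot parse simply yields null, fix the time patterns, and let coalesce pick the first pattern that succeeds. Below is a sketch of that approach, assuming the df built above and a SparkSession named spark; the pattern list also adds a plain dd/MM/yyyy, which rows like 22/02/2012 need under the strict parser, and the names formats and result are only illustrative:

import pyspark.sql.functions as F

# CORRECTED: use the new (Spark 3.x) parser and treat strings a pattern
# cannot parse as null instead of raising SparkUpgradeException.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")

# Candidate patterns, tried in this order. Note lower-case mm/ss for
# minutes and seconds; upper-case MM is month and SS is fraction of a second.
formats = ["yyyy/MM/dd", "yyyy-MM-dd", "MM/dd/yyyy", "MM-dd-yyyy",
           "MMddyyyy", "MM.dd.yyyy", "dd-MM-yyyy", "dd/MM/yyyy",
           "dd/MM/yy", "dd.MM.yyyy",
           "yyyy/MM/dd HH:mm:ss", "MM/dd/yyyy HH:mm:ss",
           "dd/MM/yyyy HH:mm:ss"]

# Try every pattern and keep the first one that parses successfully.
result = df.withColumn(
    "result",
    F.coalesce(*[F.to_date(F.col("Date"), fmt) for fmt in formats]))
result.show(truncate=False)

There is no built-in function that can infer the intended format from the data alone: ambiguous strings such as 12/03/20 or 01-01-1984 match several of these patterns, so the order of the list is what decides how such ties are resolved, and that choice has to come from knowledge of the data rather than from a library.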