Why doesn't writing timestamps before 1900 on spark-3 throw SparkUpgradeException?

Problem description

On the page https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-proleptic-calendar-date-time-management/read we can read:

Reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may have been written by Spark 2.x or legacy versions of Hive, which use a legacy hybrid calendar that is different from the Proleptic Gregorian calendar of Spark 3.0+.

Consider the following case, which does not throw an exception:

scala> spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInWrite")
res27: String = EXCEPTION
scala> Seq(java.sql.Timestamp.valueOf("1899-01-01 00:00:00")).toDF("col").write.parquet("/tmp/someDate")
scala> // why didn't this throw an exception?

Whereas a date before 1582 does throw the exception:

scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.parquet("/tmp/someOtherDate")
21/03/10 19:07:19 ERROR Utils: Aborting task
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.

Can someone explain this difference?

Tags: scala, apache-spark, parquet

Solution


I am on Spark 3.1.2 and I have tested both cases; the exception is thrown in both. Please refer to the following:

scala> Seq(java.sql.Timestamp.valueOf("1899-01-01 00:00:00")).toDF("col").write.parquet("/tmp/someDate")
22/01/04 18:03:53 ERROR Utils: Aborting task
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet INT96 files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.int96RebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.
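
Note that the 3.1.2 message above names spark.sql.legacy.parquet.int96RebaseModeInWrite, not the datetimeRebaseModeInWrite key the question checked: by default Spark writes TimestampType as Parquet INT96, and that write path is guarded by its own rebase config. That separate INT96 config appears to have been introduced in Spark 3.1, which would plausibly explain why the timestamp write in the question passed silently on an earlier build. A minimal check, assuming a Spark 3.1+ shell (the res numbers are illustrative):

scala> // the date/non-INT96 timestamp rebase mode that the question inspected
scala> spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInWrite")
res0: String = EXCEPTION

scala> // the INT96-specific rebase mode that actually guards TimestampType writes
scala> spark.conf.get("spark.sql.legacy.parquet.int96RebaseModeInWrite")
res1: String = EXCEPTION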

And the second case:

scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.parquet("/tmp/someOtherDate1")
22/01/04 18:05:08 ERROR Utils: Aborting task
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.
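
As both messages suggest, the writes go through once the corresponding rebase mode is set. A minimal sketch using the two config keys quoted in the messages above; the /tmp paths are illustrative, and CORRECTED is only safe if the files will be read exclusively by Spark 3.0+ or other proleptic-Gregorian readers:

scala> // rebase mode for DATE and non-INT96 timestamp columns
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
scala> // rebase mode for timestamps stored as INT96 (Spark's default physical type)
scala> spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.parquet("/tmp/someOtherDate2")
scala> Seq(java.sql.Timestamp.valueOf("1899-01-01 00:00:00")).toDF("col").write.parquet("/tmp/someDate2")

Per the message text, use LEGACY instead of CORRECTED if the files may later be read by Spark 2.x or legacy Hive, to get maximum interoperability.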
