apache-spark - Spark: comparing a date with a timestamp gives a nonsensical result, '2018-01-01' is less than '2018-01-01 00:00:00'
Question
I am using Spark and comparing a date with a timestamp, and I don't understand what is going on.
Here is code to reproduce it (pyspark):
query = '''with data as (
    select date('2018-01-01') as d
         , timestamp('2018-01-01') as t
)
select d < t as natural_lt
     , d = t as natural_eq
     , d > t as natural_gt
     , d < date(t) as cast_date_lt
     , d = date(t) as cast_date_eq
     , d > date(t) as cast_date_gt
     , timestamp(d) < t as cast_timestamp_lt
     , timestamp(d) = t as cast_timestamp_eq
     , timestamp(d) > t as cast_timestamp_gt
from data
'''
spark.sql(query).show()
Result:
+----------+----------+----------+------------+------------+------------+-----------------+-----------------+-----------------+
|natural_lt|natural_eq|natural_gt|cast_date_lt|cast_date_eq|cast_date_gt|cast_timestamp_lt|cast_timestamp_eq|cast_timestamp_gt|
+----------+----------+----------+------------+------------+------------+-----------------+-----------------+-----------------+
|      true|     false|     false|       false|        true|       false|            false|             true|            false|
+----------+----------+----------+------------+------------+------------+-----------------+-----------------+-----------------+
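This is completely contrary to my expectations. We get that '2018-01-01' is less than '2018-01-01 00:00:00', yet there is clearly nothing on that date before the time 00:00:00, so I find this counterintuitive.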
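I would expect either an exception (comparing a date with a timestamp is ambiguous) or a comparison made by casting both sides to date or both to timestamp (with 2018-01-01 treated as 2018-01-01 00:00:00 for the comparison).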
Can anyone explain why the comparison behaves this way? More importantly, can I make Spark behave the way I expect? Can I make Spark throw an exception?
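Solution
This happens because both the date and the timestamp are implicitly cast to strings before the comparison, so they are compared lexicographically. The string '2018-01-01' is a strict prefix of '2018-01-01 00:00:00' and therefore sorts first, which is why natural_lt is true and natural_eq is false.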
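The effect can be reproduced outside Spark. The following is a minimal sketch in plain Python, assuming the two values render as the strings '2018-01-01' and '2018-01-01 00:00:00' when cast to string; it illustrates the comparison semantics only and is not Spark's implementation.

```python
from datetime import date, datetime

d = date(2018, 1, 1)
t = datetime(2018, 1, 1, 0, 0, 0)

# What the implicit string cast amounts to: both sides become text.
d_str = d.isoformat()                      # '2018-01-01'
t_str = t.strftime("%Y-%m-%d %H:%M:%S")    # '2018-01-01 00:00:00'

# Lexicographic comparison: a strict prefix of another string sorts first.
string_lt = d_str < t_str
string_eq = d_str == t_str

# Comparing as timestamps instead: midnight of the date equals the timestamp.
d_as_ts = datetime.combine(d, datetime.min.time())
timestamp_lt = d_as_ts < t
timestamp_eq = d_as_ts == t

print(string_lt, string_eq, timestamp_lt, timestamp_eq)
# prints: True False False True
```

The first pair matches the natural_lt and natural_eq columns in the question; the second pair matches cast_timestamp_lt and cast_timestamp_eq.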
Here is the analyzed logical plan of your query:
+- Project [(cast(d#46 as string) < cast(t#47 as string)) AS natural_lt#37, (cast(d#46 as string) = cast(t#47 as string)) AS natural_eq#38, (cast(d#46 as string) > cast(t#47 as string)) AS natural_gt#39, (d#46 < cast(t#47 as date)) AS cast_date_lt#40, (d#46 = cast(t#47 as date)) AS cast_date_eq#41, (d#46 > cast(t#47 as date)) AS cast_date_gt#42, (cast(d#46 as timestamp) < t#47) AS cast_timestamp_lt#43, (cast(d#46 as timestamp) = t#47) AS cast_timestamp_eq#44, (cast(d#46 as timestamp) > t#47) AS cast_timestamp_gt#45]
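As for getting an exception: one possible approach is to inspect the analyzed plan text and fail when a comparison has fallen back to string casts on both sides. The sketch below is a hypothetical helper, not a Spark feature; the names `find_string_cast_comparisons` and `assert_no_string_cast_comparisons` are invented for illustration, and the regular expression assumes the plan is printed in the format shown above.

```python
import re

# Matches a comparison whose operands are both implicit casts to string,
# e.g. "(cast(d#46 as string) < cast(t#47 as string))".
_STRING_CAST_COMPARISON = re.compile(
    r"cast\((\S+) as string\)\s*(<=>|<=|>=|<|>|=)\s*cast\((\S+) as string\)"
)

def find_string_cast_comparisons(plan_text):
    """Return (left, operator, right) for each string-cast comparison found."""
    return [m.groups() for m in _STRING_CAST_COMPARISON.finditer(plan_text)]

def assert_no_string_cast_comparisons(plan_text):
    """Raise ValueError if the plan compares two values via string casts."""
    hits = find_string_cast_comparisons(plan_text)
    if hits:
        raise ValueError(f"comparison falls back to string cast: {hits}")

# Shortened excerpt of the plan above, used as sample input.
plan = (
    "Project [(cast(d#46 as string) < cast(t#47 as string)) AS natural_lt#37, "
    "(d#46 = cast(t#47 as date)) AS cast_date_eq#41, "
    "(cast(d#46 as timestamp) > t#47) AS cast_timestamp_gt#45]"
)
print(find_string_cast_comparisons(plan))
# prints: [('d#46', '<', 't#47')]
```

In pyspark, `spark.sql(query).explain(True)` prints the plans for visual inspection. To feed the plan text to a check like this one, a commonly used but internal route is `spark.sql(query)._jdf.queryExecution().analyzed().toString()`; treat that as an assumption that may change between versions.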
Jira: https://issues.apache.org/jira/browse/SPARK-23549 (Fix Version/s: 2.4.0)