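apache-spark-sql - Spark 2.4.0 gives "Detected implicit cartesian product" exception for left join with empty right DF
Question
It seems that between Spark 2.2.1 and Spark 2.4.0, the behavior of a left join with an empty right DataFrame changed from succeeding to throwing "AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans".
For example: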
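import org.apache.spark.sql.functions.lit
import spark.implicits._  // needed for toDF; `spark` is the active SparkSession

// Empty DataFrame (zero rows) whose columns are defined only by literals
val emptyDf = spark.emptyDataFrame
  .withColumn("id", lit(0L))
  .withColumn("brand", lit(""))
val nonemptyDf = ((1L, "a") :: Nil).toDF("id", "size")
val neje = nonemptyDf.join(emptyDf, Seq("id"), "left")
neje.show()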
In 2.2.1, the result is
+---+----+-----+
| id|size|brand|
+---+----+-----+
|  1|   a| null|
+---+----+-----+
In 2.4.0, however, I get the following exception:
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
Here is the full plan explanation for the latter:
> neje.explain(true)
== Parsed Logical Plan ==
'Join UsingJoin(LeftOuter,List(id))
:- Project [_1#275L AS id#278L, _2#276 AS size#279]
:  +- LocalRelation [_1#275L, _2#276]
+- Project [id#53L,  AS brand#55]
   +- Project [0 AS id#53L]
      +- LogicalRDD false
== Analyzed Logical Plan ==
id: bigint, size: string, brand: string
Project [id#278L, size#279, brand#55]
+- Join LeftOuter, (id#278L = id#53L)
   :- Project [_1#275L AS id#278L, _2#276 AS size#279]
   :  +- LocalRelation [_1#275L, _2#276]
   +- Project [id#53L,  AS brand#55]
      +- Project [0 AS id#53L]
         +- LogicalRDD false
== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
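Additional observations:
- If only the left DataFrame is empty, the join succeeds.
- The same change in behavior applies to a right join with an empty left DataFrame.
- Interestingly, though, both versions fail with an AnalysisException for an inner join if both DataFrames are empty.
Is this a regression or by design? The earlier behavior seems more correct to me. I could not find anything relevant in the Spark release notes, the Spark JIRA issues, or Stack Overflow questions.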
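The variations listed above could be checked side by side with something like the following. This is only a sketch: the local session, the `emptyWith` helper, and the `nonemptyRight` DataFrame are hypothetical names introduced here, and the empty inputs are assumed to be built the same way as `emptyDf` in the question. No expected outcomes are shown, since they depend on the Spark version.

```scala
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object JoinVariants {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; an existing `spark` works as well.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("join-variants")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical helper: an empty DataFrame with literal-defined columns,
    // mirroring how emptyDf is constructed in the question.
    def emptyWith(col: String): DataFrame =
      spark.emptyDataFrame.withColumn("id", lit(0L)).withColumn(col, lit(""))

    val nonemptyLeft  = Seq((1L, "a")).toDF("id", "size")
    val nonemptyRight = Seq((1L, "x")).toDF("id", "brand")

    // Each case is built lazily so a failure in one does not stop the others.
    val cases: Seq[(String, () => DataFrame)] = Seq(
      "left join, empty right"  -> (() => nonemptyLeft.join(emptyWith("brand"), Seq("id"), "left")),
      "left join, empty left"   -> (() => emptyWith("size").join(nonemptyRight, Seq("id"), "left")),
      "right join, empty left"  -> (() => emptyWith("size").join(nonemptyRight, Seq("id"), "right")),
      "inner join, both empty"  -> (() => emptyWith("size").join(emptyWith("brand"), Seq("id"), "inner"))
    )

    cases.foreach { case (name, build) =>
      // count() is an action, so it forces the optimizer to run on the plan.
      val outcome = Try(build().count()) match {
        case Success(n) => s"ok, $n rows"
        case Failure(e) => s"failed: ${e.getClass.getSimpleName}"
      }
      println(s"$name: $outcome")
    }

    spark.stop()
  }
}
```

Running the same program against each Spark version would show which cases differ between them.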
Answer
I did not have exactly your problem, but I did hit the same error, and I fixed it by explicitly allowing cross joins:
spark.conf.set("spark.sql.crossJoin.enabled", "true")
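For context, here is a minimal sketch of how that setting might be applied to the example from the question. The local master and the application name are assumptions made for illustration, and the setting is passed at session construction rather than through `spark.conf.set`; either route sets the same configuration key named in the error message.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object CrossJoinWorkaround {
  def main(args: Array[String]): Unit = {
    // Assumption: local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cross-join-workaround")
      // Same key as in the answer, set when the session is created.
      .config("spark.sql.crossJoin.enabled", "true")
      .getOrCreate()
    import spark.implicits._

    // DataFrames from the question.
    val emptyDf = spark.emptyDataFrame
      .withColumn("id", lit(0L))
      .withColumn("brand", lit(""))
    val nonemptyDf = Seq((1L, "a")).toDF("id", "size")

    // The left join that raised the AnalysisException in 2.4.0.
    nonemptyDf.join(emptyDf, Seq("id"), "left").show()

    spark.stop()
  }
}
```

Note that this setting applies to the whole session, so other joins that accidentally turn into cartesian products would no longer be flagged either.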