apache-spark - 谓词下推不适用于 Spark Dataframe 中的完全外连接
问题描述
Spark Dataframe 中的完全外连接似乎没有发生谓词下推
当连接类型为内部时,谓词下推似乎有效。但是当它是fullouter时,它不会下推谓词
scala> val left = Seq((0, "a"), (1, "b"), (2, "c")).toDF("id", "val")
2019-07-03 13:46:40 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
left: org.apache.spark.sql.DataFrame = [id: int, val: string]
scala> val right = Seq ((2, "c"), (3, "d")).toDF("id", "val_2")
right: org.apache.spark.sql.DataFrame = [id: int, val_2: string]
scala> val df = left.join(right, Seq("id"), "fullouter")
df: org.apache.spark.sql.DataFrame = [id: int, val: string ... 1 more field]
scala> df.show
+---+----+-----+
| id| val|val_2|
+---+----+-----+
| 1| b| null|
| 3|null| d|
| 2| c| c|
| 0| a| null|
+---+----+-----+
scala> val df = left.join(right, Seq("id"), "fullouter").where($"id" === 1)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, val: string ... 1 more field]
scala> df.explain
== Physical Plan ==
*(3) Project [coalesce(id#5, id#14) AS id#33, val#6, val_2#15]
+- *(3) Filter (coalesce(id#5, id#14) = 1)
+- SortMergeJoin [id#5], [id#14], FullOuter
:- *(1) Sort [id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#5, 200)
: +- LocalTableScan [id#5, val#6]
+- *(2) Sort [id#14 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#14, 200)
+- LocalTableScan [id#14, val_2#15]
scala> val df = left.join(right, Seq("id"), "inner").where($"id" === 1)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, val: string ... 1 more field]
scala> df.explain
== Physical Plan ==
*(2) Project [id#5, val#6, val_2#15]
+- *(2) BroadcastHashJoin [id#5], [id#14], Inner, BuildRight
:- *(2) Project [_1#2 AS id#5, _2#3 AS val#6]
: +- *(2) Filter (_1#2 = 1)
: +- LocalTableScan [_1#2, _2#3]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- *(1) Project [_1#11 AS id#14, _2#12 AS val_2#15]
+- *(1) Filter (_1#11 = 1)
+- LocalTableScan [_1#11, _2#12]
解决方案
我们可以在内连接中下推谓词,因为结果是相同的。但是,如果您在完全外连接中下推谓词,它将得到不同的结果(与不下推谓词的结果相比)。因此,在完全外连接中,谓词不能下推。你会在 mysql 或 postgresql 中找到同样的东西。
推荐阅读
- google-cloud-firestore - API Explorer 的 Firestore 的 projects.databases.documents.list 的正则表达式不正确
- regex - sed 版本提取
- node.js - 如何在环境中使用 docker-compose
- pytorch - PyTorch C++ 在 Ubuntu aarch64 上运行缓慢
- java - 在模块解析的上下文中,“由主机系统决定”是什么意思?
- apache-spark - Apache spark 与本地 s3 集成
- javascript - 无法在反应中单击由 Material-UI 制作的 PaginationItem 链接
- django - Django - Custom Join 的 F('') 在列引用中包含单引号,导致 Postgres 认为它是字符串而不是列引用
- asp.net-core-webapi - 模型包含接口属性未提交给 Web API 中的控制器
- javascript - 在 settimeout 角度内添加延迟