首页 > 解决方案 > Is there any difference between joining tables and exists when filtering a table?

问题描述

I have two tables A and B, and wanted to get a subset of A whose key k is also in B.

One option is by using join

select A.*
from A
join B on A.k = B.k

The other is

select A.*
from A
where exists (select *, B.k from B where A.k = B.k)

If the field k in B is unique, I feel they are the same. For for spark, is exist really considered by the subquery?

标签: apache-sparkapache-spark-sql

解决方案


最简单,最真实的方法是explain查询和比较他们的物理计划。

scala> println(spark.version)
2.4.0

scala> sql("select A.* from A join B on A.k = B.k").explain
== Physical Plan ==
*(2) Project [k#10L]
+- *(2) BroadcastHashJoin [k#10L], [k#6L], Inner, BuildRight
   :- *(2) Project [id#8L AS k#10L]
   :  +- *(2) Range (0, 10, step=1, splits=8)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *(1) Project [id#4L AS k#6L]
         +- *(1) Range (0, 10, step=1, splits=8)

scala> sql("""select * from a where exists (select *, B.k from B where A.k = B.k)""").explain
== Physical Plan ==
*(2) Project [id#8L AS k#10L]
+- *(2) BroadcastHashJoin [id#8L], [k#6L], LeftSemi, BuildRight
   :- *(2) Range (0, 10, step=1, splits=8)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *(1) Project [id#4L AS k#6L, id#4L AS k#6L]
         +- *(1) Range (0, 10, step=1, splits=8)

他们看起来很像,不是吗?

我觉得他们是一样的

它们如上所证明。


推荐阅读