apache-spark - Is there any difference between joining tables and exists when filtering a table?
Problem description
I have two tables, A and B, and want to get the subset of A whose key k also appears in B.
One option is to use a join:
select A.*
from A
join B on A.k = B.k
The other is to use exists:
select A.*
from A
where exists (select *, B.k from B where A.k = B.k)
If the field k in B is unique, I feel they are the same. For Spark, is exists really evaluated as a subquery?
Solution
The simplest and most reliable way to find out is to explain the two queries and compare their physical plans.
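The answer does not show how A and B were registered. The Range (0, 10, step=1, splits=8) and Project [id AS k] nodes in its plans suggest both came from spark.range(10) with id renamed to k, so a minimal sketch for reproducing the comparison, under that assumption, might look like this. The object name, method name, and local master setting are hypothetical, and the expression IDs and splits count in your output will differ.

```scala
import org.apache.spark.sql.SparkSession

object ExplainJoinVsExists {
  // Registers A and B as temp views with a single bigint column k (values 0..9).
  // Assumption: this mirrors the Range(0, 10) nodes seen in the answer's plans.
  def registerViews(spark: SparkSession): Unit = {
    spark.range(10).selectExpr("id as k").createOrReplaceTempView("A")
    spark.range(10).selectExpr("id as k").createOrReplaceTempView("B")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run; drop this inside spark-shell
      .appName("explain-join-vs-exists")
      .getOrCreate()

    registerViews(spark)

    // Print the physical plan of each query so they can be compared side by side.
    spark.sql("select A.* from A join B on A.k = B.k").explain()
    spark.sql("select A.* from A where exists (select *, B.k from B where A.k = B.k)").explain()

    spark.stop()
  }
}
```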
scala> println(spark.version)
2.4.0
scala> sql("select A.* from A join B on A.k = B.k").explain
== Physical Plan ==
*(2) Project [k#10L]
+- *(2) BroadcastHashJoin [k#10L], [k#6L], Inner, BuildRight
   :- *(2) Project [id#8L AS k#10L]
   :  +- *(2) Range (0, 10, step=1, splits=8)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *(1) Project [id#4L AS k#6L]
         +- *(1) Range (0, 10, step=1, splits=8)
scala> sql("""select * from a where exists (select *, B.k from B where A.k = B.k)""").explain
== Physical Plan ==
*(2) Project [id#8L AS k#10L]
+- *(2) BroadcastHashJoin [id#8L], [k#6L], LeftSemi, BuildRight
   :- *(2) Range (0, 10, step=1, splits=8)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *(1) Project [id#4L AS k#6L, id#4L AS k#6L]
         +- *(1) Range (0, 10, step=1, splits=8)
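They look quite similar, don't they? Both queries compile to a BroadcastHashJoin. The notable difference is the join type: Inner for the join, LeftSemi for exists.
"I feel they are the same"
They are, as shown above, provided k is unique in B as you assumed. If B contained duplicate keys, the inner join would return a row of A once per matching row of B, while the LeftSemi join would still return it only once.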
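To see where Inner and LeftSemi diverge, here is a small sketch that goes beyond the original answer and makes k non-unique in B. The sample data, object name, and method name are hypothetical, and the exists subquery is simplified to select 1.

```scala
import org.apache.spark.sql.SparkSession

object JoinVsExistsDuplicates {
  // Returns (row count of the inner join, row count of the exists query)
  // for a B in which the key 1 appears twice.
  def counts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    Seq(1L, 2L, 3L).toDF("k").createOrReplaceTempView("A")
    Seq(1L, 1L, 2L).toDF("k").createOrReplaceTempView("B") // k = 1 is duplicated

    val joined = spark.sql("select A.* from A join B on A.k = B.k")
    val semi = spark.sql("select A.* from A where exists (select 1 from B where A.k = B.k)")

    (joined.count(), semi.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("join-vs-exists-duplicates")
      .getOrCreate()

    val (joinRows, existsRows) = counts(spark)
    // Inner join: k=1 matches two rows of B, k=2 matches one, so 3 rows.
    // Exists (LeftSemi): each matching row of A is kept once, so 2 rows.
    println(s"join: $joinRows, exists: $existsRows")

    spark.stop()
  }
}
```

If B can contain duplicate keys and only a filter on A is wanted, the exists form avoids needing a distinct after the join.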