apache-spark - 广播哈希联接不适用于 Spark SQL 中的完全联接?
问题描述
我有两个小表要进行如下全外连接,我认为它应该使用广播连接,但是它选择了排序合并连接,我想知道为什么。
test("SparkTest 0461") {
val spark = SparkSession.builder().master("local").appName("SparkTest0460").getOrCreate()
import spark.implicits._
val data1 = Seq((1, 2), (1, 7), (3, 6), (5, 4), (1, 10), (6, 7), (2, 5))
val data2 = Seq(9, 4, 2, 7, 6, 8)
val x = 10L * 1024*1024
spark.sql(s"set spark.sql.autoBroadcastJoinThreshold=$x")
spark.createDataset(data1).toDF("a", "b").createOrReplaceTempView("x")
spark.createDataset(data2).toDF("c").createOrReplaceTempView("y")
val df = spark.sql(
"""
select * from x full join y on a = c
""".stripMargin(' '))
df.explain(true)
}
实物图如下,说明使用的是SMJ
== Physical Plan ==
SortMergeJoinExec [a#11], [c#19], FullOuter
:- *(1) SortExec [a#11 ASC NULLS FIRST], false, 0
: +- ShuffleExchangeExec hashpartitioning(a#11, 200)
: +- LocalTableScanExec [a#11, b#12]
+- *(2) SortExec [c#19 ASC NULLS FIRST], false, 0
+- ShuffleExchangeExec hashpartitioning(c#19, 200)
+- LocalTableScanExec [c#19]
解决方案
不支持 BroadcastHashJoin full outer join
。检查此链接以获取详细信息。
如果您替换full outer join
为任何受支持的联接,物理计划将显示它选择了 BroadcastHashJoin。
例如,
val dfOuter = spark.sql(""" select * from x outer join y on a = c """.stripMargin(' '))
dfOuter.explain(true)
给
== Parsed Logical Plan ==
'Project [*]
+- 'Join Inner, ('a = 'c)
:- 'SubqueryAlias outer
: +- 'UnresolvedRelation `x`
+- 'UnresolvedRelation `y`
== Analyzed Logical Plan ==
a: int, b: int, c: int
Project [a#75, b#76, c#82]
+- Join Inner, (a#75 = c#82)
:- SubqueryAlias outer
: +- SubqueryAlias x
: +- Project [_1#72 AS a#75, _2#73 AS b#76]
: +- LocalRelation [_1#72, _2#73]
+- SubqueryAlias y
+- Project [value#80 AS c#82]
+- LocalRelation [value#80]
== Optimized Logical Plan ==
Join Inner, (a#75 = c#82)
:- LocalRelation [a#75, b#76]
+- LocalRelation [c#82]
== Physical Plan ==
*(1) BroadcastHashJoin [a#75], [c#82], Inner, BuildRight
:- LocalTableScan [a#75, b#76]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- LocalTableScan [c#82]
推荐阅读
- php - 有没有办法在使用 htmlLoadFile 函数在 php 中加载 html 文件时传递自定义变量?
- amazon-s3 - 将 S3 静态站点与应用程序负载均衡器结合使用
- jhipster - 无法解析占位符“spring.security.oauth2.client.provider.oidc.issuer-uri”和创建名为“securityConfiguration”的 bean 时出错
- asp.net - Uploading Multiple Records from a CSV using LINQ
- http - How to access strava API?
- sql - 如何使用sql显示哪些学生还在上学
- c# - Invoke generic method on all the derived types in type hierarchy
- visual-studio - 如何在 boost 测试函数中获取函数的返回值?
- python - 如何在 Python 中用“”替换我的自定义字符?
- python - 一次跨多列查找线性趋势