apache-spark - Spark SQL explain plan computes a temporary table multiple times
Question
I'm new to Spark SQL and am using explain to understand how it optimizes queries. I assumed that a table defined in a WITH clause and referenced more than once would be computed only once.
However, in the optimized logical plan of the explain output below, the table location_with_count appears in two separate subtrees.
Does that mean it gets computed twice, or is this just an artifact of how the plan is displayed?
In [24]: sql = """
...: WITH location_with_count AS (
...: SELECT uid, country_code, city_code, count() over (PARTITION BY country_code, city_code) as c
...: FROM location
...: ),
...:
...: rs AS (
...: SELECT uid, country_code, city_code,
...: row_number() over (PARTITION BY country_code, city_code
...: ORDER BY uid DESC) AS Rank
...: FROM location_with_count as uc
...: WHERE uc.c > 10
...: )
...:
...: (SELECT uid, country_code, city_code FROM rs WHERE Rank <= 10)
...: union
...: (SELECT uid, country_code, city_code FROM location_with_count WHERE c <= 10)
...: """
In [25]: session.sql(sql).explain(True)
== Parsed Logical Plan ==
CTE [location_with_count, rs]
: :- 'SubqueryAlias location_with_count
: : +- 'Project ['uid, 'country_code, 'city_code, 'count() windowspecdefinition('country_code, 'city_code, UnspecifiedFrame) AS c#281]
: : +- 'UnresolvedRelation `location`
: +- 'SubqueryAlias rs
: +- 'Project ['uid, 'country_code, 'city_code, 'row_number() windowspecdefinition('country_code, 'city_code, 'uid DESC NULLS LAST, UnspecifiedFrame) AS Rank#282]
: +- 'Filter ('uc.c > 10)
: +- 'SubqueryAlias uc
: +- 'UnresolvedRelation `location_with_count`
+- 'Distinct
+- 'Union
:- 'Project ['uid, 'country_code, 'city_code]
: +- 'Filter ('Rank <= 10)
: +- 'UnresolvedRelation `rs`
+- 'Project ['uid, 'country_code, 'city_code]
+- 'Filter ('c <= 10)
+- 'UnresolvedRelation `location_with_count`
== Analyzed Logical Plan ==
uid: bigint, country_code: string, city_code: string
Distinct
+- Union
:- Project [uid#283L, country_code#284, city_code#287]
: +- Filter (Rank#282 <= 10)
: +- SubqueryAlias rs
: +- Project [uid#283L, country_code#284, city_code#287, Rank#282]
: +- Project [uid#283L, country_code#284, city_code#287, Rank#282, Rank#282]
: +- Window [row_number() windowspecdefinition(country_code#284, city_code#287, uid#283L DESC NULLS LAST, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Rank#282], [country_code#284, city_code#287], [uid#283L DESC NULLS LAST]
: +- Project [uid#283L, country_code#284, city_code#287]
: +- Filter (c#281L > cast(10 as bigint))
: +- SubqueryAlias uc
: +- SubqueryAlias location_with_count
: +- Project [uid#283L, country_code#284, city_code#287, c#281L]
: +- Project [uid#283L, country_code#284, city_code#287, c#281L, c#281L]
: +- Window [count() windowspecdefinition(country_code#284, city_code#287, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS c#281L], [country_code#284, city_code#287]
: +- Project [uid#283L, country_code#284, city_code#287]
: +- SubqueryAlias location
: +- Relation[uid#283L,country_code#284,city_code#287] parquet
+- Project [uid#283L, country_code#284, city_code#287]
+- Filter (c#281L <= cast(10 as bigint))
+- SubqueryAlias location_with_count
+- Project [uid#283L, country_code#284, city_code#287, c#281L]
+- Project [uid#283L, country_code#284, city_code#287, c#281L, c#281L]
+- Window [count() windowspecdefinition(country_code#284, city_code#287, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS c#281L], [country_code#284, city_code#287]
+- Project [uid#283L, country_code#284, city_code#287]
+- SubqueryAlias location
+- Relation[uid#283L,country_code#284,city_code#287] parquet
== Optimized Logical Plan ==
Aggregate [uid#283L, country_code#284, city_code#287], [uid#283L, country_code#284, city_code#287]
+- Union
:- Project [uid#283L, country_code#284, city_code#287]
: +- Filter (isnotnull(Rank#282) && (Rank#282 <= 10))
: +- Window [row_number() windowspecdefinition(country_code#284, city_code#287, uid#283L DESC NULLS LAST, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Rank#282], [country_code#284, city_code#287], [uid#283L DESC NULLS LAST]
: +- Project [uid#283L, country_code#284, city_code#287]
: +- Filter (c#281L > 10)
: +- Window [0 AS c#281L], [country_code#284, city_code#287]
: +- Project [uid#283L, country_code#284, city_code#287]
: +- Relation[uid#283L,country_code#284,city_code#287] parquet
+- Project [uid#283L, country_code#284, city_code#287]
+- Filter (c#281L <= 10)
+- Window [0 AS c#281L], [country_code#284, city_code#287]
+- Project [uid#283L, country_code#284, city_code#287]
+- Relation[uid#283L,country_code#284,city_code#287] parquet
== Physical Plan ==
*HashAggregate(keys=[uid#283L, country_code#284, city_code#287], functions=[], output=[uid#283L, country_code#284, city_code#287])
+- Exchange hashpartitioning(uid#283L, country_code#284, city_code#287, 200)
+- *HashAggregate(keys=[uid#283L, country_code#284, city_code#287], functions=[], output=[uid#283L, country_code#284, city_code#287])
+- Union
:- *Project [uid#283L, country_code#284, city_code#287]
: +- *Filter (isnotnull(Rank#282) && (Rank#282 <= 10))
: +- Window [row_number() windowspecdefinition(country_code#284, city_code#287, uid#283L DESC NULLS LAST, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Rank#282], [country_code#284, city_code#287], [uid#283L DESC NULLS LAST]
: +- *Sort [country_code#284 ASC NULLS FIRST, city_code#287 ASC NULLS FIRST, uid#283L DESC NULLS LAST], false, 0
: +- *Project [uid#283L, country_code#284, city_code#287]
: +- *Filter (c#281L > 10)
: +- Window [0 AS c#281L], [country_code#284, city_code#287]
: +- *Sort [country_code#284 ASC NULLS FIRST, city_code#287 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(country_code#284, city_code#287, 200)
: +- *Project [uid#283L, country_code#284, city_code#287]
: +- *FileScan parquet default.location[uid#283L,country_code#284,city_code#287] Batched: true, Format: Parquet, Location: InMemoryFileIndex[.../location], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<uid:bigint,country_code:string,city_code:string>
+- *Project [uid#283L, country_code#284, city_code#287]
+- *Filter (c#281L <= 10)
+- Window [0 AS c#281L], [country_code#284, city_code#287]
+- *Sort [country_code#284 ASC NULLS FIRST, city_code#287 ASC NULLS FIRST], false, 0
+- ReusedExchange [uid#283L, country_code#284, city_code#287], Exchange hashpartitioning(country_code#284, city_code#287, 200)
In the physical plan, I see
ReusedExchange [uid#283L, country_code#284, city_code#287], Exchange hashpartitioning(country_code#284, city_code#287, 200)
Does this actually indicate that location_with_count is reused?
Answer
SubqueryAlias logical operators are eventually removed by the EliminateSubqueryAliases logical optimization. An alias is merely a pointer (reference) to the same part of a query and takes no part in execution.
You can find some information under EliminateSubqueryAliases Logical Optimization.
The ReuseSubquery physical query optimization is supposed to avoid executing the same subquery multiple times.
You can find some information under ReuseSubquery Physical Query Optimization.
Does this actually indicate that location_with_count is reused?
I'd hope so. Note, though, what exactly is reused: the ReusedExchange replaces only the Exchange hashpartitioning(country_code#284, city_code#287, 200) and everything below it, so the Parquet scan and that shuffle run once, while the operators above it (the Window, Filter and Project) still appear in both branches of the Union.