Redshift: Nested Loop Join in the query plan

Problem description

I have a fact table t_session that, simplified, looks like this:

+------------+----------+-------------+
| start_hm   | end_hm   | device_id   |
|------------+----------+-------------|
| 0          | 10       | 111         |
| 2          | 10       | 112         |
| 12         | 20       | 113         |
| 60         | 90       | 111         |
| 60         | 90       | 112         |
+------------+----------+-------------+

I also have a dimension table dim_time with 1440 records, one for each combination of hour 0-23 and minute 0-59 — i.e., every hour-minute combination of a day. tk is a running number from 0 to 1439:

+------+--------+----------+
| tk   | hour   | minute   |
|------+--------+----------|
| 0    | 0      | 0        |
| 1    | 0      | 1        |
| 2    | 0      | 2        |
............................
| 60   | 1      | 0        |
| 61   | 1      | 1        |
| 62   | 1      | 2        |
............................
| 120  | 2      | 0        |
| 121  | 2      | 1        |
| 122  | 2      | 2        |
............................

I want to count the number of active device_ids per minute. In the real application there is another table dim_date and six other relations, but let's keep it simple for this question.

A device is active in the time slots between start_hm and end_hm. Both start_hm and end_hm values are between 0 and 1439.

select tk, count(distinct device_id)
from t_session
join dim_time on tk between start_hm and end_hm
group by tk
order by tk;
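To make the expected result concrete, the sample tables and the query above can be reproduced at small scale on any SQL engine; here is a quick sketch using Python's built-in sqlite3 module (purely illustrative — not Redshift):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Fact table: the five sample sessions from the question
cur.execute("CREATE TABLE t_session (start_hm INT, end_hm INT, device_id INT)")
cur.executemany("INSERT INTO t_session VALUES (?, ?, ?)",
                [(0, 10, 111), (2, 10, 112), (12, 20, 113),
                 (60, 90, 111), (60, 90, 112)])

# Dimension table: one row per minute of the day, tk = hour * 60 + minute
cur.execute("CREATE TABLE dim_time (tk INT, hour INT, minute INT)")
cur.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                [(tk, tk // 60, tk % 60) for tk in range(1440)])

rows = cur.execute("""
    SELECT tk, COUNT(DISTINCT device_id)
    FROM t_session
    JOIN dim_time ON tk BETWEEN start_hm AND end_hm
    GROUP BY tk
    ORDER BY tk
""").fetchall()

print(rows[:3])  # → [(0, 1), (1, 1), (2, 2)]
```

Each minute in which at least one session is active appears with its distinct device count; minutes with no active session (e.g. tk = 11) are absent from the result.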

This query executes very slowly. When I look at the execution plan, it complains about a nested loop.

+--------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                     |
|--------------------------------------------------------------------------------------------------------------------------------|
| XN Limit  (cost=1000002000820.94..1000002000820.97 rows=10 width=8)                                                            |
|   ->  XN Merge  (cost=1000002000820.94..1000002000821.44 rows=200 width=8)                                                     |
|         Merge Key: tk                                                                                                          |
|         ->  XN Network  (cost=1000002000820.94..1000002000821.44 rows=200 width=8)                                             |
|               Send to leader                                                                                                   |
|               ->  XN Sort  (cost=1000002000820.94..1000002000821.44 rows=200 width=8)                                          |
|                     Sort Key: tk                                                                                               |
|                     ->  XN HashAggregate  (cost=2000812.80..2000813.30 rows=200 width=8)                                       |
|                           ->  XN Subquery Scan volt_dt_0  (cost=2000764.80..2000796.80 rows=3200 width=8)                      |
|                                 ->  XN HashAggregate  (cost=2000764.80..2000764.80 rows=3200 width=8)                          |
|                                       ->  XN Nested Loop DS_BCAST_INNER  (cost=0.00..2000748.80 rows=3200 width=8)             |
|                                             Join Filter: (("outer".tk <= "inner".end_hm) AND ("outer".tk >= "inner".start_hm)) |
|                                             ->  XN Seq Scan on dim_time  (cost=0.00..28.80 rows=2880 width=4)                  |
|                                             ->  XN Seq Scan on t_session  (cost=0.00..0.10 rows=10 width=12)                   |
| ----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----                        |
+--------------------------------------------------------------------------------------------------------------------------------+

I understand where the nested loop comes from: for every row of t_session it has to loop over dim_time.

Is it possible to modify my query to avoid the nested loop and improve performance?
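One direction often suggested for this kind of interval-counting problem (my own assumption, not something stated in the question) is to avoid the range join entirely: emit a +1 event at start_hm and a -1 event at end_hm + 1 for each session, then take a running sum over tk. Assuming each device has at most one active session per minute (true for the sample data), the running sum equals the distinct-device count. A minimal sketch of the idea in plain Python:

```python
from collections import defaultdict

# The five sample sessions from the question
sessions = [(0, 10, 111), (2, 10, 112), (12, 20, 113),
            (60, 90, 111), (60, 90, 112)]

# +1 when a session starts, -1 the minute after it ends
delta = defaultdict(int)
for start_hm, end_hm, device_id in sessions:
    delta[start_hm] += 1
    delta[end_hm + 1] -= 1

# Running sum over all 1440 minutes = number of active sessions per minute
active = {}
running = 0
for tk in range(1440):
    running += delta[tk]
    if running:
        active[tk] = running

print(active[0], active[2], active[60])  # → 1 2 2
```

In SQL the same idea maps to an equi-join of the start/end events against dim_time plus a window SUM(...) OVER (ORDER BY tk), which a planner can execute without a range join; whether that actually helps on Redshift with the real schema is something I have not verified.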

Update: The same query runs very fast on Postgres, and its execution plan has no Cartesian product warning.

+--------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                   |
|--------------------------------------------------------------------------------------------------------------|
| Limit  (cost=85822.07..85839.17 rows=10 width=12)                                                            |
|   ->  GroupAggregate  (cost=85822.07..88284.47 rows=1440 width=12)                                           |
|         Group Key: dim_time.tk                                                                               |
|         ->  Sort  (cost=85822.07..86638.07 rows=326400 width=8)                                              |
|               Sort Key: dim_time.tk                                                                          |
|               ->  Nested Loop  (cost=0.00..51467.40 rows=326400 width=8)                                     |
|                     Join Filter: ((dim_time.tk >= t_session.start_hm) AND (dim_time.tk <= t_session.end_hm)) |
|                     ->  Seq Scan on t_session  (cost=0.00..30.40 rows=2040 width=12)                         |
|                     ->  Materialize  (cost=0.00..32.60 rows=1440 width=4)                                    |
|                           ->  Seq Scan on dim_time  (cost=0.00..25.40 rows=1440 width=4)                     |
+--------------------------------------------------------------------------------------------------------------+

Update 2:

The t_session table has device_id as its DISTKEY column and start_date (not shown in the simplified example) as its SORTKEY: sessions naturally arrive ordered by start_date.

The dim_time table has tk as SORTKEY and DISTSTYLE ALL.
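For reference, the layout described above corresponds roughly to DDL like the following (column lists taken from the simplified example, with the start_date column added as stated; a sketch, not the actual production DDL):

```sql
CREATE TABLE t_session (
    start_date DATE,
    start_hm   INT,
    end_hm     INT,
    device_id  INT
)
DISTKEY (device_id)
SORTKEY (start_date);

CREATE TABLE dim_time (
    tk     INT,
    hour   INT,
    minute INT
)
DISTSTYLE ALL
SORTKEY (tk);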

On Redshift, execution takes 5-6 minutes for 40,000 sessions per day; on Postgres it is just a few seconds.

The Redshift cluster has two dc2.large nodes.

Tags: amazon-redshift, sql-optimization
