PostgreSQL 10: large(?) IN list in a query filter

Problem description

I am dealing with a relatively simple query:

SELECT row.id, row.name FROM things AS row 
WHERE row.type IN ( 
'00000000-0000-0000-0000-000000031201',
...
 )
ORDER BY row.name ASC, row.id ASC 
LIMIT 2000;
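The plans shown below were obtained with `EXPLAIN (ANALYZE)`, which executes the query and reports estimated vs. actual row counts; a sketch, assuming a psql session against the same database (the full UUID list is elided here):

```sql
-- EXPLAIN (ANALYZE) runs the query and shows planned vs. actual rows per node.
EXPLAIN (ANALYZE)
SELECT row.id, row.name
FROM things AS row
WHERE row.type IN (
    '00000000-0000-0000-0000-000000031201'
    -- ... remaining UUIDs elided ...
)
ORDER BY row.name ASC, row.id ASC
LIMIT 2000;
```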

The problem:

If the list contains 25 or fewer UUIDs, the query is fine:

 Limit  (cost=21530.51..21760.51 rows=2000 width=55) (actual time=5.057..7.780 rows=806 loops=1)
   ->  Gather Merge  (cost=21530.51..36388.05 rows=129196 width=55) (actual time=5.055..6.751 rows=806 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=20530.50..20853.49 rows=129196 width=55) (actual time=2.273..2.546 rows=403 loops=2)
               Sort Key: name, id
               Sort Method: quicksort  Memory: 119kB
               ->  Parallel Index Only Scan using idx_things_type_name_id on things row  (cost=0.69..9562.28 rows=129196 width=55) (actual time=0.065..0.840 rows=403 loops=2)
                     Index Cond: (type = ANY ('{00000000-0000-0000-0000-000000031201,... (< 24 more)}'::text[]))
                     Heap Fetches: 0
 Planning time: 0.202 ms
 Execution time: 8.485 ms

But as soon as the list exceeds 25 elements, a different index is used and the query execution time really shoots up:

 Limit  (cost=1000.58..15740.63 rows=2000 width=55) (actual time=11.553..29789.670 rows=952 loops=1)
   ->  Gather Merge  (cost=1000.58..2400621.01 rows=325592 width=55) (actual time=11.551..29855.053 rows=952 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Index Scan using idx_things_name_id on things row  (cost=0.56..2362039.59 rows=135663 width=55) (actual time=3.570..24437.039 rows=317 loops=3)
               Filter: ((type)::text = ANY ('{00000000-0000-0000-0000-000000031201,... (> 24 more)}'::text[]))
               Rows Removed by Filter: 5478258
 Planning time: 0.209 ms
 Execution time: 29857.454 ms

Details:

The table contains 16,435,726 rows and 17 columns. The 3 columns relevant to the query are:

The relevant indexes are:

There are 70 distinct type values, 2 of which account for about 15M rows. Those two are not in the IN list.
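The skew described above can be verified directly with an aggregate; a minimal sketch, assuming read access to the table:

```sql
-- Count rows per type to confirm that 2 of the ~70 distinct values
-- dominate the table (~15M of ~16.4M rows).
SELECT type, count(*) AS rows_per_type
FROM things
GROUP BY type
ORDER BY rows_per_type DESC;
```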

Experiments and questions:

I first checked whether this index would help:

CREATE INDEX idx_things_name_id_type ON things (name, id, type);

It did, but only slightly. 12 s is not acceptable:

 Limit  (cost=1000.71..7638.73 rows=2000 width=55) (actual time=5.888..12120.907 rows=952 loops=1)
   ->  Gather Merge  (cost=1000.71..963238.21 rows=289917 width=55) (actual time=5.886..12154.580 rows=952 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Index Only Scan using idx_things_name_id_type on things row  (cost=0.69..928774.57 rows=120799 width=55) (actual time=1.024..9852.923 rows=317 loops=3)
               Filter: ((type)::text = ANY ('{00000000-0000-0000-0000-000000031201,... 37 more}'::text[]))
               Rows Removed by Filter: 5478258
               Heap Fetches: 0
 Planning time: 0.638 ms
 Execution time: 12156.817 ms

I know large IN lists are not efficient in Postgres, but I am surprised to hit that point already at 25 elements. Or is the problem here something else?

I tried the solutions suggested in other posts (INNER JOIN against a VALUES list, also changing IN to IN (VALUES ...), ...) but it made things worse. Here is one example experiment:

SELECT row.id, row.name
FROM things AS row 
WHERE row.type IN (VALUES ('00000000-0000-0000-0000-000000031201'), ... )
ORDER BY row.name ASC, row.id ASC 
LIMIT 2000;
 Limit  (cost=0.56..1254.91 rows=2000 width=55) (actual time=45.718..847919.632 rows=952 loops=1)
   ->  Nested Loop Semi Join  (cost=0.56..10298994.72 rows=16421232 width=55) (actual time=45.714..847917.788 rows=952 loops=1)
         Join Filter: ((row.type)::text = "*VALUES*".column1)
         Rows Removed by Join Filter: 542360414
         ->  Index Scan using idx_things_name_id on things row  (cost=0.56..2170484.38 rows=16421232 width=92) (actual time=0.132..61387.582 rows=16435726 loops=1)
         ->  Materialize  (cost=0.00..0.58 rows=33 width=32) (actual time=0.001..0.022 rows=33 loops=16435726)
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.41 rows=33 width=32) (actual time=0.004..0.030 rows=33 loops=1)
 Planning time: 1.131 ms
 Execution time: 847920.680 ms
(9 rows)

The query plan from INNER JOIN (VALUES ...):

 Limit  (cost=0.56..1254.91 rows=2000 width=55) (actual time=38.289..847714.160 rows=952 loops=1)
   ->  Nested Loop  (cost=0.56..10298994.72 rows=16421232 width=55) (actual time=38.287..847712.333 rows=952 loops=1)
         Join Filter: ((row.type)::text = "*VALUES*".column1)
         Rows Removed by Join Filter: 542378006
         ->  Index Scan using idx_things_name_id on things row  (cost=0.56..2170484.38 rows=16421232 width=92) (actual time=0.019..60303.676 rows=16435726 loops=1)
         ->  Materialize  (cost=0.00..0.58 rows=33 width=32) (actual time=0.001..0.022 rows=33 loops=16435726)
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.41 rows=33 width=32) (actual time=0.002..0.029 rows=33 loops=1)
 Planning time: 0.247 ms
 Execution time: 847715.215 ms
(9 rows)

Am I doing something wrong here?

Any hints on how to deal with this? If more information is needed, I will add it as requested.

P.S. The column/table/index names were "anonymized" to comply with company policy, so please don't point at the silly names :)

Tags: postgresql, performance

Solution


I actually figured out what was going on. The Postgres planner's decision was correct, but it based that decision on imperfect statistics. The key lines of the query plans are these:

Below 25 UUIDs:

->  Gather Merge  (cost=21530.51..36388.05 **rows=129196** width=55) 
(actual time=5.055..6.751 **rows=806** loops=1)

An overestimate of ~160x.

Above 25 UUIDs:

->  Gather Merge  (cost=1000.58..2400621.01 **rows=325592** width=55) 
(actual time=11.551..29855.053 **rows=952** loops=1)

An overestimate of ~342(!)x.

If there really were 325,592 matching rows, then scanning the index on (name, id), which is already sorted as required, and filtering from it might well be the most efficient plan. But because of the overestimate, Postgres has to discard over 5M rows to fetch the complete result:

**Rows Removed by Filter: 5478258**

I guess Postgres figured that sorting 325,592 rows (the > 25 UUID query) would be so expensive that using the already-sorted index is preferable, whereas the 129,196 rows (the < 25 UUID query) could be sorted in memory.

I looked at pg_stats, and the statistics there were pretty useless.
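For reference, the per-column statistics the planner relies on can be inspected through the standard `pg_stats` view; a sketch (these are real columns of the view):

```sql
-- What the planner knows about the type column: the number of distinct
-- values, the most-common-values list with their frequencies, and the
-- histogram bounds into which the remaining values fall.
SELECT n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'things'
  AND attname   = 'type';
```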

This is because the few types that do not appear in the query occur so frequently that they dominate the statistics, while the UUIDs that do appear fall into the histogram and get overestimated.

Increasing the statistics target for this column:

ALTER TABLE things ALTER COLUMN type SET STATISTICS 1000;

solved the problem.
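Note that `SET STATISTICS` only changes the sampling target; the statistics themselves are rebuilt on the next `ANALYZE`, so the full sequence is roughly:

```sql
-- Raise the per-column statistics target (the global default is 100,
-- the maximum is 10000); a larger target means a longer MCV list and
-- a finer histogram for this column.
ALTER TABLE things ALTER COLUMN type SET STATISTICS 1000;

-- Re-collect statistics so the planner actually sees the new sample.
ANALYZE things;
```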

Now the query executes in ~8 ms also for UUID lists with more than 25 elements.

Update: prompted by claims that the ::text cast visible in the query plan was the culprit, I went to a test server and ran:

ALTER TABLE things ALTER COLUMN type TYPE text;

Now there is no cast, but nothing changed:

 Limit  (cost=1000.58..20590.31 rows=2000 width=55) (actual time=12.761..30611.878 rows=952 loops=1)
   ->  Gather Merge  (cost=1000.58..2386715.56 rows=243568 width=55) (actual time=12.759..30682.784 rows=952 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Index Scan using idx_things_name_id on things row  (cost=0.56..2357601.74 rows=101487 width=55) (actual time=3.994..24190.693 rows=317 loops=3)
               Filter: (type = ANY ('{00000000-0000-0000-0000-000000031201,... (> 24 more)}'::text[]))
               Rows Removed by Filter: 5478258
 Planning time: 0.227 ms
 Execution time: 30685.092 ms
(9 rows)
