首页 > 解决方案 > 什么决定了postgres中同一张表中每个索引的成本

问题描述

我正在使用 PostgreSQL 服务器 12 和 psql 12。我有一个非常大的表(大约 600 万个元组),它有几列。说就像

People(
   bigint id,
   varchar company_type,
   bigint complany_id,
   varchar department_type,
   bigint department_id,
   ......
)

我有几个索引:

"people_pkey" PRIMARY KEY, btree (id),
"unique_person" UNIQUE, btree (company_type, company_id, department_type, department_id),
"company" btree (company_type, company_id),
"department" btree (department_type, department_id)

现在我有这个简单的查询

EXPLAIN ANALYZE SELECT array(
    SELECT DISTINCT my_people.company_id
    FROM people AS my_people
    WHERE
    "my_people"."company_type" = 'Some_company' AND
    "my_people"."department_type" = 'Some_department' AND
    "my_people"."department_id" = ANY(ARRAY[1,2,3,4,5,6,7])
) a

                                                                                                               QUERY PLAN                                                                                                  
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=8.60..8.61 rows=1 width=32) (actual time=2.377..2.378 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Unique  (cost=0.56..8.60 rows=1 width=8) (actual time=2.373..2.374 rows=0 loops=1)
           ->  Index Scan using company on people my_people  (cost=0.56..8.60 rows=1 width=8) (actual time=2.373..2.373 rows=0 loops=1)
                 Index Cond: ((company_type)::text = 'Some_company'::text)
                 Filter: ((department_type)::text = 'Some_department'::text) AND (department_id = ANY ('{1,2,3,4,5,6,7}'::integer[])))
                 Rows Removed by Filter: 1189
 Planning Time: 0.873 ms
 Execution Time: 2.405 ms

如果它使用“部门”索引,查询时间将得到优化,但这样做,它使用“公司”。

我试过使用 pg_hint_plan 来强制它使用我想要的索引,它会比“公司”快得多。

/*+ IndexScan(my_people department) */ EXPLAIN ANALYZE SELECT array(
    SELECT DISTINCT my_people.company_id
    FROM people AS my_people
    WHERE
    "my_people"."company_type" = 'Some_company' AND
    "my_people"."department_type" = 'Some_department' AND
    "my_people"."department_id" = ANY(ARRAY[1,2,3,4,5,6,7])
) a

                                                                                                                                                                        QUERY PLAN                                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=503435.96..503435.97 rows=1 width=32) (actual time=0.073..0.074 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Unique  (cost=503435.95..503435.96 rows=1 width=8) (actual time=0.070..0.071 rows=0 loops=1)
           ->  Sort  (cost=503435.95..503435.96 rows=1 width=8) (actual time=0.070..0.071 rows=0 loops=1)
                 Sort Key: my_people.id
                 Sort Method: quicksort  Memory: 25kB
                 ->  Index Scan using department on people my_people  (cost=0.56..503435.94 rows=1 width=8) (actual time=0.066..0.067 rows=0 loops=1)
                       Index Cond: (((department_type)::text = 'Some_department'::text) AND (department_id = ANY ('{1,2,3,4,5,6,7}'::integer[])))
                       Filter: ((Company_type)::text = 'Some_company'::text)
                       Rows Removed by Filter: 1
 Planning Time: 0.252 ms
 Execution Time: 0.096 ms
(12 rows)


/*+ IndexScan(my_people unique_person) */ EXPLAIN ANALYZE SELECT array(
    SELECT DISTINCT my_people.company_id
    FROM people AS my_people
    WHERE
    "my_people"."company_type" = 'Some_company' AND
    "my_people"."department_type" = 'Some_department' AND
    "my_people"."department_id" = ANY(ARRAY[1,2,3,4,5,6,7])
) a
                                                                                     QUERY PLAN                                                                                      
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=8.60..8.61 rows=1 width=32) (actual time=1.821..1.822 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Unique  (cost=0.56..8.60 rows=1 width=8) (actual time=1.818..1.818 rows=0 loops=1)
           ->  Index Scan using unique_person on people my_people  (cost=0.56..8.60 rows=1 width=8) (actual time=1.817..1.817 rows=0 loops=1)
                 Index Cond: (((company_type)::text = 'Some_company'::text) AND ((department_type)::text = 'Some_department'::text))
                 Filter: (department_id = ANY ('{1,2,3,4,5,6,7}'::integer[]))
                 Rows Removed by Filter: 994
 Planning Time: 0.258 ms
 Execution Time: 1.842 ms
(9 rows)

然后我想也许是它ScalarArrayOpExpr使它效率低下。所以我把查询改成了这个。这明显更快,但我仍然需要提示 postgres 使用“部门”索引。

/*+ IndexScan(my_people department) */ EXPLAIN ANALYZE SELECT array(
                SELECT a.*
                FROM unnest(ARRAY[1,2,3,4,5,6,7]) as t(fid)
                    , LATERAL (
                SELECT DISTINCT my_people.id
                FROM people AS my_people
                WHERE
                  "my_people"."company_type" = 'Some_company' AND
                  "my_people"."department_type" = 'Some_department' AND
                  "my_people"."target_id" = t.fid
                ) a
              ) b
;
                                                                                      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=3658.94..3658.95 rows=1 width=32) (actual time=0.092..0.094 rows=1 loops=1)
   InitPlan 1 (returns $1)
     ->  Nested Loop  (cost=522.67..3658.94 rows=7 width=8) (actual time=0.090..0.091 rows=0 loops=1)
           ->  Function Scan on unnest t  (cost=0.00..0.07 rows=7 width=4) (actual time=0.005..0.006 rows=7 loops=1)
           ->  Unique  (cost=522.67..522.68 rows=1 width=8) (actual time=0.011..0.012 rows=0 loops=7)
                 ->  Sort  (cost=522.67..522.67 rows=1 width=8) (actual time=0.011..0.011 rows=0 loops=7)
                       Sort Key: my_people.id
                       Sort Method: quicksort  Memory: 25kB
                       ->  Index Scan using department on people my_people  (cost=0.56..522.66 rows=1 width=8) (actual time=0.010..0.010 rows=0 loops=7)
                             Index Cond: (((department_type)::text = 'Some_department'::text) AND (department_id = t.fid))
                             Filter: ((company_type)::text = 'Some_company'::text)
                             Rows Removed by Filter: 0
 Planning Time: 0.248 ms
 Execution Time: 0.120 ms
(14 rows)

当使用“company”时,索引扫描的成本最低,所以我认为这就是psql会使用这个索引的原因。但是,这种行为会大大减慢我的查询速度,我想知道 psql 如何确定要使用的索引。

更新1:

我试过VACUUM (VERBOSE, ANALYZE) people;了,这是输出。似乎什么都没有改变,但现在我的查询使用了我需要的索引。

VACUUM (VERBOSE, ANALYZE) people;
INFO:  vacuuming "people"
INFO:  index "people_pkey" now contains 66865768 row versions in 183343 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.11 s, system: 0.41 s, elapsed: 3.24 s.
INFO:  index "unique_person" now contains 66865768 row versions in 867318 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.54 s, system: 1.97 s, elapsed: 6.13 s.
INFO:  index "department" now contains 66865768 row versions in 308674 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.21 s, system: 0.67 s, elapsed: 1.32 s.
INFO:  "people": found 0 removable, 66865768 nonremovable row versions in 1943422 out of 1943422 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 85284
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 6.00 s, system: 13.11 s, elapsed: 30.15 s.
INFO:  vacuuming "pg_toast.pg_toast_1418456"
INFO:  index "pg_toast_1418456_index" now contains 2344 row versions in 9 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "pg_toast_1418456": found 0 removable, 2344 nonremovable row versions in 534 out of 534 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 85284
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  analyzing "people"
INFO:  "people": scanned 30000 of 1943422 pages, containing 1032582 live rows and 0 dead rows; 30000 rows in sample, 66891419 estimated total rows
VACUUM

标签: postgresql

解决方案


您的标题非常笼统,因此我假设您正在寻找一般情况下解决此类问题的方法,而不是针对您的特定情况的固定答案。

计划者首选计划的 EXPLAIN ANALYZE 具有:

Rows Removed by Filter: 1189

但是,EXPLAIN 输出并没有告诉您它希望使用过滤器删除多少行,而只是告诉您它实际删除了多少行。(这种缺乏信息可以说是解释分析中的一个缺陷)。基于较低的总成本估算,它似乎没想到会移除那么多。你可以运行一个变体查询来查看它期望索引返回多少行(除了 1 之外它默默地期望过滤器删除):

explain select * from person where company_type = 'Some_company'

对于你真正希望它使用的索引,计划也有点莫名其妙:

             ->  Index Scan using department on people my_people  (cost=0.56..503435.94 rows=1 width=8) (actual time=0.066..0.067 rows=0 loops=1)
                   Index Cond: (((department_type)::text = 'Some_department'::text) AND (department_id = ANY ('{1,2,3,4,5,6,7}'::integer[])))
                   Filter: ((Company_type)::text = 'Some_company'::text)
                   Rows Removed by Filter: 1

对于大量“由过滤器删除”而言,这里的成本似乎太小了,而对于过滤的行数少,它现在似乎太高了。但同样,我们只能看到它实际删除了多少行,而不是它预期的多少。要查看在过滤之前希望此索引扫描找到多少,您可以运行:

explain select * from person where department_type = 'Some_department' AND department_id = ANY ('{1,2,3,4,5,6,7}'::integer[])

推荐阅读