首页 > 解决方案 > Postgres 索引在大表上被忽略

问题描述

我有一个非常简单的查询,在 Postgres 11.12 上需要超过 10 分钟才能完成单个连接:

SELECT COUNT(r.*) as active_jobs 
    FROM project_raw_data r 
        INNER JOIN jobs j ON j.id = r.processing_job_id 
    WHERE r.ind_requires_processing = True AND
          r.processing_error = False AND
          r.processing_job_id IS NOT NULL AND
          j.finished IS NULL AND
          j.started IS NOT NULL;

project_raw_data表有 ~50M 行,大小约为 500GB(其中有一些更大的元数据,此查询中涉及的列都是booleanor timezone),并且jobs表有 ~1M 行。

我在project_raw_data表上应用了以下索引:

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_project_raw_data_processing_active ON  
    project_raw_data 
    USING btree (ind_requires_processing, processing_error, processing_job_id);

jobs桌子上的这个:

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_studio_jobs_active ON 
    jobs 
    USING btree (id) WHERE ((finished IS NULL) AND (started IS NOT NULL));

如果我查看 PgAnalyze 索引应该已经是这个查询的最佳选择:

PgAnalyze 中的 SQL 语句和索引分析

所以我有点不了解为什么这需要 10 分钟并从磁盘读取 70GB (!)(在查看分析时)。

这是来自的输出EXPLAIN ANALYZE

Finalize Aggregate  (cost=9860168.44..9860168.45 rows=1 width=8) (actual time=690627.757..690628.919 rows=1 loops=1)
  ->  Gather  (cost=9860168.23..9860168.44 rows=2 width=8) (actual time=690625.874..690628.908 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=9859168.23..9859168.24 rows=1 width=8) (actual time=690623.471..690623.475 rows=1 loops=3)
              ->  Hash Join  (cost=97.43..9859165.28 rows=1180 width=1135) (actual time=690623.464..690623.467 rows=0 loops=3)
                    Hash Cond: (r.processing_job_id = j.id)
                    ->  Parallel Seq Scan on project_raw_data r  (cost=0.00..9845488.13 rows=5173208 width=1139) (actual time=690623.462..690623.463 rows=0 loops=3)
                          Filter: (ind_requires_processing AND (NOT processing_error) AND (processing_job_id IS NOT NULL))
                          Rows Removed by Filter: 16661364
                    ->  Hash  (cost=93.42..93.42 rows=321 width=4) (never executed)
                          ->  Index Only Scan using idx_studio_jobs_active on jobs j  (cost=0.27..93.42 rows=321 width=4) (never executed)
                                Heap Fetches: 0
Planning Time: 2.207 ms
Execution Time: 690629.645 ms

我们看到project_raw_data表上的索引被完全跳过了。但为什么?

标签: postgresql

解决方案


@a_horse_with_no_name 在评论中建议在vacuum analyze桌子上运行就可以了。查询现在完全使用索引,并在 100 毫秒内完成。反而。


推荐阅读