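sql - How do I make Postgres avoid a double sequential scan for this seek pagination query?
Problem description
Schema
- I have a bunch of posts stored in one table (feed_items)
- I have a table that records which user ID liked/disliked which feed_item_id (feed_item_likes_dislikes)
- I have another table that records which user ID loved/was angered by which feed_item_id (feed_item_love_anger)
- I have a fourth table that records which feed_item_id has which tags, where tags is an ARRAY of varchar (feed_item_tags)
- The total number of likes/dislikes per post is stored in a materialized view (feed_item_likes_dislikes_aggregate)
- The total number of love/anger reactions is stored in another materialized view (feed_item_love_anger_aggregate)
- A post can be liked/disliked and loved/angered at the same time (a business requirement, unfortunately)
- feed_items has 2 columns of type TSVECTOR named title_vector and summary_vector, which help find posts by search keyword (full-text search in Postgres)
Problem
- I want to find all posts in DESCENDING order of their pubdate and feed_item_id
- Some posts are published at the same time, so I want to paginate with the seek pagination method, using (pubdate, feed_item_id) < (value1, value2)
My query for page 1
Find posts with likes > 0 and the word "scam" in the title or summary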
SELECT
fi.feed_item_id,
pubdate,
link,
title,
summary,
author,
feed_id,
likes,
dislikes,
love,
anger,
tags
FROM
feed_items fi
LEFT JOIN
feed_item_tags t
ON fi.feed_item_id = t.feed_item_id
LEFT JOIN
feed_item_love_anger_aggregate bba
ON fi.feed_item_id = bba.feed_item_id
LEFT JOIN
feed_item_likes_dislikes_aggregate lda
ON fi.feed_item_id = lda.feed_item_id
WHERE
(
title_vector @@ to_tsquery('scam')
OR summary_vector @@ to_tsquery('scam')
)
AND 'for' = ANY(tags)
AND likes > 0
ORDER BY
pubdate DESC,
feed_item_id DESC LIMIT 3;
EXPLAIN ANALYZE for page 1
Limit (cost=2.83..16.88 rows=3 width=233) (actual time=0.075..0.158 rows=3 loops=1)
-> Nested Loop Left Join (cost=2.83..124.53 rows=26 width=233) (actual time=0.074..0.157 rows=3 loops=1)
-> Nested Loop (cost=2.69..116.00 rows=26 width=217) (actual time=0.067..0.146 rows=3 loops=1)
Join Filter: (t.feed_item_id = fi.feed_item_id)
Rows Removed by Join Filter: 73
-> Index Scan using idx_feed_items_pubdate_feed_item_id_desc on feed_items fi (cost=0.14..68.77 rows=76 width=62) (actual time=0.016..0.023 rows=3 loops=1)
Filter: ((title_vector @@ to_tsquery('scam'::text)) OR (summary_vector @@ to_tsquery('scam'::text)))
Rows Removed by Filter: 1
-> Materialize (cost=2.55..8.56 rows=34 width=187) (actual time=0.016..0.037 rows=25 loops=3)
-> Hash Join (cost=2.55..8.39 rows=34 width=187) (actual time=0.044..0.091 rows=36 loops=1)
Hash Cond: (t.feed_item_id = lda.feed_item_id)
-> Seq Scan on feed_item_tags t (cost=0.00..5.25 rows=67 width=155) (actual time=0.009..0.043 rows=67 loops=1)
Filter: ('for'::text = ANY ((tags)::text[]))
Rows Removed by Filter: 33
-> Hash (cost=1.93..1.93 rows=50 width=32) (actual time=0.029..0.029 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on feed_item_likes_dislikes_aggregate lda (cost=0.00..1.93 rows=50 width=32) (actual time=0.004..0.013 rows=50 loops=1)
Filter: (likes > 0)
Rows Removed by Filter: 24
-> Index Scan using idx_feed_item_love_anger_aggregate on feed_item_love_anger_aggregate bba (cost=0.14..0.32 rows=1 width=32) (actual time=0.002..0.003 rows=0 loops=3)
Index Cond: (feed_item_id = fi.feed_item_id)
Planning Time: 0.601 ms
Execution Time: 0.195 ms
(23 rows)
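Despite having appropriate indexes on all tables, it performs 2 sequential scans
My query for page N
Take the pubdate and feed_item_id of the third result from the query above and load the next 3 results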
SELECT
fi.feed_item_id,
pubdate,
link,
title,
summary,
author,
feed_id,
likes,
dislikes,
love,
anger,
tags
FROM
feed_items fi
LEFT JOIN
feed_item_tags t
ON fi.feed_item_id = t.feed_item_id
LEFT JOIN
feed_item_love_anger_aggregate bba
ON fi.feed_item_id = bba.feed_item_id
LEFT JOIN
feed_item_likes_dislikes_aggregate lda
ON fi.feed_item_id = lda.feed_item_id
WHERE
(
pubdate,
fi.feed_item_id
)
< ('2020-06-19 19:50:00+05:30', 'bc5c8dfe-13a9-d97a-a328-0e5b8990c500')
AND
(
title_vector @@ to_tsquery('scam')
OR summary_vector @@ to_tsquery('scam')
)
AND 'for' = ANY(tags)
AND likes > 0
ORDER BY
pubdate DESC,
feed_item_id DESC LIMIT 3;
EXPLAIN ANALYZE for the page N query. Despite the filter, it still performs 2 sequential scans
Limit (cost=2.83..17.13 rows=3 width=233) (actual time=0.082..0.199 rows=3 loops=1)
-> Nested Loop Left Join (cost=2.83..121.97 rows=25 width=233) (actual time=0.081..0.198 rows=3 loops=1)
-> Nested Loop (cost=2.69..113.67 rows=25 width=217) (actual time=0.073..0.185 rows=3 loops=1)
Join Filter: (t.feed_item_id = fi.feed_item_id)
Rows Removed by Join Filter: 183
-> Index Scan using idx_feed_items_pubdate_feed_item_id_desc on feed_items fi (cost=0.14..67.45 rows=74 width=62) (actual time=0.014..0.034 rows=6 loops=1)
Index Cond: (ROW(pubdate, feed_item_id) < ROW('2020-06-19 19:50:00+05:30'::timestamp with time zone, 'bc5c8dfe-13a9-d97a-a328-0e5b8990c500'::uuid))
Filter: ((title_vector @@ to_tsquery('scam'::text)) OR (summary_vector @@ to_tsquery('scam'::text)))
Rows Removed by Filter: 2
-> Materialize (cost=2.55..8.56 rows=34 width=187) (actual time=0.009..0.022 rows=31 loops=6)
-> Hash Join (cost=2.55..8.39 rows=34 width=187) (actual time=0.050..0.098 rows=36 loops=1)
Hash Cond: (t.feed_item_id = lda.feed_item_id)
-> Seq Scan on feed_item_tags t (cost=0.00..5.25 rows=67 width=155) (actual time=0.009..0.044 rows=67 loops=1)
Filter: ('for'::text = ANY ((tags)::text[]))
Rows Removed by Filter: 33
-> Hash (cost=1.93..1.93 rows=50 width=32) (actual time=0.028..0.029 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on feed_item_likes_dislikes_aggregate lda (cost=0.00..1.93 rows=50 width=32) (actual time=0.005..0.014 rows=50 loops=1)
Filter: (likes > 0)
Rows Removed by Filter: 24
-> Index Scan using idx_feed_item_love_anger_aggregate on feed_item_love_anger_aggregate bba (cost=0.14..0.32 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=3)
Index Cond: (feed_item_id = fi.feed_item_id)
Planning Time: 0.596 ms
Execution Time: 0.236 ms
(24 rows)
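I have set up the required tables and indexes. Can someone tell me how to fix the query so that it uses index scans at best, or at least reduces the number of sequential scans to 1?
Solution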
The construct 'for' = ANY(tags) cannot use a GIN index. To be able to use one, you need to rewrite it as something like '{for}' <@ tags.
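A minimal sketch of that rewrite, assuming a GIN index on feed_item_tags.tags. The index name idx_feed_item_tags_tags_gin is hypothetical, and if an equivalent GIN index already exists the CREATE INDEX step can be skipped. Only the tag predicate changes; the rest of the query stays as posted.

```sql
-- Hypothetical index name; a GIN index with the default array operator class
-- supports the containment operators @> and <@ on array columns.
CREATE INDEX IF NOT EXISTS idx_feed_item_tags_tags_gin
    ON feed_item_tags USING gin (tags);

-- Same filter as 'for' = ANY(tags), written as array containment:
-- "the one-element array {for} is contained in tags".
SELECT
    t.feed_item_id,
    t.tags
FROM
    feed_item_tags t
WHERE
    '{for}' <@ t.tags;
```

In the page 1 and page N queries, this would mean replacing the line AND 'for' = ANY(tags) with AND '{for}' <@ tags.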
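However, the planner will still choose not to use the index, because the table is too small and the condition is too unselective (in the plans above, 67 of the 100 rows in feed_item_tags match). If you want to force the index to be used, to prove that it is capable of doing so, you can first run set enable_seqscan=off.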
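One way to run that check without changing the setting for the rest of the session is to scope it to a transaction, as sketched below. This assumes a GIN index on feed_item_tags.tags already exists. Note that enable_seqscan = off only makes sequential scans look very expensive to the planner; it does not forbid them, so it is a diagnostic aid rather than a production setting.

```sql
BEGIN;

-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL enable_seqscan = off;

-- With sequential scans discouraged, the plan should show whether the
-- containment predicate can be served by the GIN index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    t.feed_item_id,
    t.tags
FROM
    feed_item_tags t
WHERE
    '{for}' <@ t.tags;

ROLLBACK;
```

If the plan still shows a Seq Scan on feed_item_tags with the setting off, the predicate or the index definition is the thing to look at, not the table size.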