postgresql - 为什么 PostgreSQL 中多维数据集列上的 GIST 索引实际上会使 K-Nearest Neighbor (KNN) ORDER BY 查询变得更糟?
问题描述
添加GIST索引实际上似乎会使 PostgreSQL 中列上的 K-最近邻 (KNN)ORDER BY
查询变得更糟。为什么会这样,可以做些什么呢?cube
这就是我的意思。在 PostgreSQL 数据库中,我有一个表,其 DDL 是create sample (id serial primary key, title text, embedding cube)
其中的列是使用 Google 语言模型获得embedding
的嵌入向量。数据类型由我安装的多维数据集扩展提供title
。cube
顺便说一下,这些是维基百科文章的标题。无论如何,有 100 万条记录。然后,我使用以下查询执行 KNN 查询。此查询distance
使用欧几里得距离算子进行定义<->
,但其他两个指标的结果相似。它执行ORDER BY
并应用 aLIMIT
以查找 10 篇具有“相似”标题的 Wikipedia 文章(最相似的是目标标题本身)。这一切都很好。
select sample.title, sample.embedding <-> cube('(0.18936706, -0.12455666, -0.31581765, 0.0192692, -0.07364611, 0.07851536, 0.0290586, -0.02582532, -0.03378124, -0.10564457, -0.03903799, 0.08668878, -0.15357816, -0.17793414, -0.01826405, 0.01969068, 0.11386908, 0.1555583, 0.09368557, 0.13697313, -0.05610929, -0.06536788, -0.12212707, 0.26356605, -0.06004387, -0.01966437, -0.1250324, -0.16645767, -0.13525756, 0.22482251, -0.1709727, 0.28966117, -0.07927769, -0.02498624, -0.10018375, -0.10923951, 0.04770213, 0.11573371, 0.04619929, 0.05216618, 0.19176421, 0.12948817, 0.08719034, -0.16109011, -0.02411379, -0.05638905, -0.37334979, 0.31225419, 0.0744801, 0.27044332)') distance from sample order by distance limit 10;
然而,令我困惑的是,如果我在embedding
列上放置 GIST 索引,查询性能实际上会更差。添加索引后,查询计划会按照预期的方式更改,只要它使用索引即可。但是……它变慢了!
这似乎与以下说明的文档背道而驰cube
:
此外,立方体 GiST 索引可用于在 ORDER BY 子句中使用度量运算符 <->、<#> 和 <=> 查找最近的邻居
他们甚至提供了一个示例查询,这与我的非常相似。
SELECT c FROM test ORDER BY c <-> cube(array[0.5,0.5,0.5]) LIMIT 1
这是删除索引之前的查询计划和时间信息。
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.41..6.30 rows=10 width=29)
-> Index Scan using sample_embedding_idx on sample (cost=0.41..589360.33 rows=999996 width=29)
Order By: (embedding <-> '(0.18936706, -0.12455666, -0.31581765, 0.0192692, -0.07364611, 0.07851536, 0.0290586, -0.02582532, -0.03378124, -0.10564457, -0.03903799, 0.08668878, -0.15357816, -0.17793414, -0.01826405, 0.01969068, 0.11386908, 0.1555583, 0.09368557, 0.13697313, -0.05610929, -0.06536788, -0.12212707, 0.26356605, -0.06004387, -0.01966437, -0.1250324, -0.16645767, -0.13525756, 0.22482251, -0.1709727, 0.28966117, -0.07927769, -0.02498624, -0.10018375, -0.10923951, 0.04770213, 0.11573371, 0.04619929, 0.05216618, 0.19176421, 0.12948817, 0.08719034, -0.16109011, -0.02411379, -0.05638905, -0.37334979, 0.31225419, 0.0744801, 0.27044332)'::cube)
(3 rows)
title | distance
----------------------+--------------------
david petrarca | 0.5866321762629475
david adamski | 0.5866321762629475
richard ansdell | 0.6239883862603475
linda darke | 0.6392124797481789
ilias tsiliggiris | 0.6996660649119893
watson, jim | 0.7059481479504834
sk radni%c4%8dki | 0.71718948226995
burnham, pa | 0.7384858030758069
arthur (europa-park) | 0.7468462897336924
ivan kecojevic | 0.7488206082281348
(10 rows)
Time: 1226.457 ms (00:01.226)
而且,这是删除索引后的查询计划和时间信息。
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=74036.32..74037.48 rows=10 width=29)
-> Gather Merge (cost=74036.32..171264.94 rows=833330 width=29)
Workers Planned: 2
-> Sort (cost=73036.29..74077.96 rows=416665 width=29)
Sort Key: ((embedding <-> '(0.18936706, -0.12455666, -0.31581765, 0.0192692, -0.07364611, 0.07851536, 0.0290586, -0.02582532, -0.03378124, -0.10564457, -0.03903799, 0.08668878, -0.15357816, -0.17793414, -0.01826405, 0.01969068, 0.11386908, 0.1555583, 0.09368557, 0.13697313, -0.05610929, -0.06536788, -0.12212707, 0.26356605, -0.06004387, -0.01966437, -0.1250324, -0.16645767, -0.13525756, 0.22482251, -0.1709727, 0.28966117, -0.07927769, -0.02498624, -0.10018375, -0.10923951, 0.04770213, 0.11573371, 0.04619929, 0.05216618, 0.19176421, 0.12948817, 0.08719034, -0.16109011, -0.02411379, -0.05638905, -0.37334979, 0.31225419, 0.0744801, 0.27044332)'::cube))
-> Parallel Seq Scan on sample (cost=0.00..64032.31 rows=416665 width=29)
(6 rows)
title | distance
----------------------+--------------------
david petrarca | 0.5866321762629475
david adamski | 0.5866321762629475
richard ansdell | 0.6239883862603475
linda darke | 0.6392124797481789
ilias tsiliggiris | 0.6996660649119893
watson, jim | 0.7059481479504834
sk radni%c4%8dki | 0.71718948226995
burnham, pa | 0.7384858030758069
arthur (europa-park) | 0.7468462897336924
ivan kecojevic | 0.7488206082281348
(10 rows)
Time: 381.419 ms
注意:
- 带索引:1226.457 毫秒
- 无索引:381.419 毫秒
这种非常令人费解的行为!所有这些都记录在 GitHub存储库中,以便其他人可以尝试。我将添加有关如何生成嵌入向量的文档,但这不是必需的,因为在快速入门中我展示了可以从我的 Google Drive 文件夹中下载预先计算的嵌入向量。
附录
在下面的评论中要求提供explain (analyze, buffers)
. 这里就是哪里
- 我重新创建(覆盖)索引
- 我运行查询
explain (analyze, buffers)
- 我删除索引
explain (analyze, buffers)
我再次运行查询
pgbench=# create index on sample using gist (embedding) include (title);
CREATE INDEX
Time: 51966.315 ms (00:51.966)
pgbench=#
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.41..4.15 rows=10 width=29) (actual time=3215.956..3216.667 rows=10 loops=1)
Buffers: shared hit=1439 read=87004 written=7789
-> Index Only Scan using sample_embedding_title_idx on sample (cost=0.41..373768.39 rows=999999 width=29) (actual time=3215.932..3216.441 rows=10 loops=1)
Order By: (embedding <-> '(0.18936706, -0.12455666, -0.31581765, 0.0192692, -0.07364611, 0.07851536, 0.0290586, -0.02582532, -0.03378124, -0.10564457, -0.03903799, 0.08668878, -0.15357816, -0.17793414, -0.01826405, 0.01969068, 0.11386908, 0.1555583, 0.09368557, 0.13697313, -0.05610929, -0.06536788, -0.12212707, 0.26356605, -0.06004387, -0.01966437, -0.1250324, -0.16645767, -0.13525756, 0.22482251, -0.1709727, 0.28966117, -0.07927769, -0.02498624, -0.10018375, -0.10923951, 0.04770213, 0.11573371, 0.04619929, 0.05216618, 0.19176421, 0.12948817, 0.08719034, -0.16109011, -0.02411379, -0.05638905, -0.37334979, 0.31225419, 0.0744801, 0.27044332)'::cube)
Heap Fetches: 0
Buffers: shared hit=1439 read=87004 written=7789
Planning:
Buffers: shared hit=14 read=6 dirtied=2
Planning Time: 0.432 ms
Execution Time: 3316.266 ms
(10 rows)
Time: 3318.333 ms (00:03.318)
pgbench=# drop index sample_embedding_title_idx;
DROP INDEX
Time: 182.324 ms
pgbench=#
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=74036.35..74037.52 rows=10 width=29) (actual time=6052.845..6057.210 rows=10 loops=1)
Buffers: shared hit=70 read=58830
-> Gather Merge (cost=74036.35..171265.21 rows=833332 width=29) (actual time=6052.825..6057.021 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=70 read=58830
-> Sort (cost=73036.33..74077.99 rows=416666 width=29) (actual time=6002.928..6003.019 rows=8 loops=3)
Sort Key: ((embedding <-> '(0.18936706, -0.12455666, -0.31581765, 0.0192692, -0.07364611, 0.07851536, 0.0290586, -0.02582532, -0.03378124, -0.10564457, -0.03903799, 0.08668878, -0.15357816, -0.17793414, -0.01826405, 0.01969068, 0.11386908, 0.1555583, 0.09368557, 0.13697313, -0.05610929, -0.06536788, -0.12212707, 0.26356605, -0.06004387, -0.01966437, -0.1250324, -0.16645767, -0.13525756, 0.22482251, -0.1709727, 0.28966117, -0.07927769, -0.02498624, -0.10018375, -0.10923951, 0.04770213, 0.11573371, 0.04619929, 0.05216618, 0.19176421, 0.12948817, 0.08719034, -0.16109011, -0.02411379, -0.05638905, -0.37334979, 0.31225419, 0.0744801, 0.27044332)'::cube))
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=70 read=58830
Worker 0: Sort Method: top-N heapsort Memory: 26kB
Worker 1: Sort Method: top-N heapsort Memory: 26kB
-> Parallel Seq Scan on sample (cost=0.00..64032.33 rows=416666 width=29) (actual time=0.024..3090.103 rows=333333 loops=3)
Buffers: shared read=58824
Planning:
Buffers: shared hit=3 read=3 dirtied=1
Planning Time: 0.129 ms
Execution Time: 6057.388 ms
(18 rows)
Time: 6053.284 ms (00:06.053)
解决方案
推荐阅读
- python - 一个请求下载多个文件名?Python
- java - 有没有一种“正确”的方法来避免“必须是最终的”错误?
- python - 在 Python C 扩展中使用来自 PyObjects 的数据而不持有 GIL
- c++ - Constraing template parameter by another template with any specialization type
- python - 从几个大型 Pandas 数据帧中有效地提取一些值
- java - Open telephone number link
- mongodb - 如何在 $project 中将 $toString 用于 MongoDB 上的子文档字段?
- python - 斐波那契数列在 python 中不起作用,解释器崩溃
- java - 用于渲染 2D 和 3D 交互式图形的 Java API(Javaview 替代品)
- c# - 将属性绑定到表达式