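postgresql - Edge NGram search in PostgreSQL
Question
I need to implement search-as-you-type autocomplete over a large set of companies (more than 80,000,000). A company name should match when it contains a word that starts with the search query, for example: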
+-------+----------------------------------+
| term  | results                          |
+-------+----------------------------------+
| gen   | general motors; general electric |
| geno  | genoptix; genomic health         |
| genom | genoma group; genomic health     |
+-------+----------------------------------+
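The pg_trgm module with a GIN index gives similar behavior, but it does not solve my problem.
ElasticSearch, for example, has an Edge NGram Tokenizer that does exactly what I need.
From the documentation: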
The edge_ngram tokenizer first breaks the text down into words
whenever it encounters one of a list of specified characters,
then it emits N-grams of each word
where the start of the N-gram is anchored to the beginning of the word.
Edge N-Grams are useful for search-as-you-type queries.
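To make the quoted definition concrete, here is a minimal Python sketch of edge n-gram generation. It is only an illustration, not ElasticSearch's implementation: the function name edge_ngrams, the min_gram parameter, and the choice to split on anything that is not a letter or digit are assumptions made for this example.

```python
import re


def edge_ngrams(text, min_gram=1):
    """Return the sorted, de-duplicated edge n-grams of every word in text.

    Assumption for this sketch: words are runs of letters/digits and
    everything is lowercased before the n-grams are taken.
    """
    words = re.findall(r"[^\W_]+", text.lower())
    grams = set()
    for word in words:
        # Every prefix of the word, anchored at its first character.
        for length in range(min_gram, len(word) + 1):
            grams.add(word[:length])
    return sorted(grams)


print(edge_ngrams("gen"))  # ['g', 'ge', 'gen']
```

A query term such as "gen" matches a name when it is equal to one of the n-grams produced for that name.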
Is there a similar solution in PostgreSQL?
Answer
I created a custom tokenizer:
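CREATE OR REPLACE FUNCTION edge_gram_tsvector(input_text text) RETURNS tsvector AS
$BODY$
BEGIN
    -- to_tsvector(text) uses default_text_search_config, so the lexemes
    -- (including any stemming) depend on that setting.
    RETURN (
        SELECT array_to_tsvector(array_agg(DISTINCT substring(lexeme FOR len)))
        FROM unnest(to_tsvector(input_text)) AS t,
             generate_series(1, length(lexeme)) AS len  -- every prefix length of the lexeme
    );
END;
$BODY$
IMMUTABLE  -- required for use in an index expression
LANGUAGE plpgsql;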
The function generates all edge n-grams, like this:
postgres=# select edge_gram_tsvector('general electric');
edge_gram_tsvector
-----------------------------------------------------------------------------------------
'e' 'el' 'ele' 'elec' 'elect' 'electr' 'g' 'ge' 'gen' 'gene' 'gener' 'genera' 'general'
(1 row)
Then I create a GIN index on the resulting tsvector:
create index on company using gin(edge_gram_tsvector(name));
A search query then looks like this:
b2bdb_master=# select name from company where edge_gram_tsvector(name) @@ 'electric'::tsquery limit 3;
name
--------------------------------------------
General electric
Electriciantalk
Galesburg Electric Industrial Supply
(3 rows)
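A direct cast to tsquery does not normalize its input, while the lexemes produced by to_tsvector are lowercased, so raw user input usually needs some preparation before it is sent to the query above. The following Python sketch shows one possible way to do that on the application side. The helper name build_prefix_tsquery and the choice to AND the words together are assumptions for this example, not part of the original answer.

```python
import re


def build_prefix_tsquery(user_input):
    """Turn raw user input into a tsquery string such as 'gen & ele'.

    Lowercases the input, keeps only runs of letters/digits (so tsquery
    operators typed by the user cannot cause a syntax error), and requires
    every word to match. Returns None when no usable word is left.
    """
    words = re.findall(r"[^\W_]+", user_input.lower())
    if not words:
        return None
    return " & ".join(words)


print(build_prefix_tsquery("Gen  Ele"))  # gen & ele

# Hypothetical usage with a DB-API cursor (for example psycopg2):
# cursor.execute(
#     "select name from company"
#     " where edge_gram_tsvector(name) @@ %s::tsquery limit 10",
#     (build_prefix_tsquery("Gen Ele"),),
# )
```

Passing the string as a bound parameter, rather than concatenating it into the SQL text, keeps the query safe from injection.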
The solution performs quite well:
explain analyse select * from company where edge_gram_tsvector(name) @@ 'electric'::tsquery;
Bitmap Heap Scan on company (cost=175.13..27450.31 rows=20752 width=2247) (actual time=0.224..1.019 rows=343 loops=1)
Recheck Cond: (edge_gram_tsvector((name)::text) @@ '''electric'''::tsquery)
Heap Blocks: exact=342
-> Bitmap Index Scan on company_edge_gram_tsvector_idx (cost=0.00..169.94 rows=20752 width=0) (actual time=0.138..0.138 rows=343 loops=1)
Index Cond: (edge_gram_tsvector((name)::text) @@ '''electric'''::tsquery)
Planning Time: 0.216 ms
Execution Time: 1.100 ms