Searching by word similarity in PostgreSQL?

Question

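Let's say I have a table named questions in a PostgreSQL database. As you can see below, it contains records that look alike to a human but are different strings to the database. Is it possible to fetch all records that are at least 90% similar to a list of questions?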

| QUESTION_ID | QUESTION_TEXT                                    |
|-------------|--------------------------------------------------|
| 1           | What is your favorite movie, cartoon and series? |
| 2           | What is your favorite movie cartoon and series   |
| 3           | what is your favorite Movie, Cartoon and Series  |
| 4           | Do you like apple?                               |
| 5           | do you like Apple                                |

Right now I use the following query, which returns only 2 records:

select
    *
from
    questions
where
    question_text in (
        'What is your favorite movie, cartoon and series?',
        'Do you like apple?'
    )

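As far as I know, PostgreSQL has the pg_trgm module, which supports similarity search through the word_similarity function. How do I correctly add this function to my query?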

Tags: sql, postgresql

Answer


You would set it up like this:

CREATE EXTENSION pg_trgm;
CREATE INDEX ON questions USING gin (question_text gin_trgm_ops);

Then you can search efficiently like this:

SELECT question_id
FROM questions
WHERE question_text % 'What is your favorite movie, cartoon and series?';
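
To match against several reference questions at once, as the IN list in the question does, one option is to join against a VALUES list. This is a sketch rather than the answer's own method; the aliases q, t, target and score are illustrative names, and it assumes the pg_trgm extension from above is installed:

```sql
-- Compare every row with each reference question and keep the pairs
-- that the % operator considers similar under the current threshold.
SELECT q.question_id,
       q.question_text,
       t.target,
       similarity(q.question_text, t.target) AS score  -- value between 0 and 1
FROM questions AS q
JOIN (VALUES
        ('What is your favorite movie, cartoon and series?'),
        ('Do you like apple?')
     ) AS t(target)
  ON q.question_text % t.target
ORDER BY t.target, score DESC;
```

Returning the score next to each match makes it easier to judge whether the threshold is set sensibly for the data.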

% is the similarity operator. The threshold above which two strings count as similar is set with the parameter pg_trgm.similarity_threshold (default 0.3).
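
For the 90% requirement in the question, one possible approach is to raise the threshold for the current session before searching. A minimal sketch, assuming PostgreSQL 9.6 or later (where pg_trgm.similarity_threshold is available) and the questions table shown in the question:

```sql
-- Require a similarity of at least 0.9 for the % operator in this session.
SET pg_trgm.similarity_threshold = 0.9;

SELECT question_id,
       question_text
FROM questions
WHERE question_text % 'What is your favorite movie, cartoon and series?';
```

pg_trgm lowercases text and ignores non-alphanumeric characters when it extracts trigrams, so variants that differ only in case or punctuation should still score at or near 1.0. Note that % compares whole strings using similarity(); the word_similarity function mentioned in the question has its own operator, <%, and its own setting, pg_trgm.word_similarity_threshold.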

See the documentation for more information.

