Elasticsearch autocomplete suggestions based on prefix and custom tokenizers

Problem description

I am currently developing an autocomplete feature using ngrams.

I have the following filter and analyzer:

"nGram_filter": {
    "type": "nGram",
    "min_gram": 3,
    "max_gram": 10,
    "token_chars": [
        "letter",
        "digit",
        "punctuation",
        "symbol"
    ]
},
"nGram_analyzer": {
    "type": "custom",
    "tokenizer": "whitespace",
    "filter": [
        "lowercase",
        "asciifolding",
        "nGram_filter"
    ]
}

Now, when I index the sample text `test_table_for analyzers` and search for the string `testtableanalyzers`, I get that record back. So I know the tokens are being created with the filters I specified, and it is working.

But I need to add another feature on top of this: I also need prefix matching. For example, when I search for `test_table` (10 chars) I get results, because the max n-gram is 10; but when I try `test_table_for` it returns zero results, because the record `test_table_for analyzers` has no such token.
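The zero-result case can be seen with a small Python sketch (an illustration of the nGram filter's behavior with `min_gram: 3` / `max_gram: 10`, not Elasticsearch code):

```python
def ngram_filter(token, min_gram=3, max_gram=10):
    """Emit every substring of length min_gram..max_gram,
    mimicking Elasticsearch's nGram token filter on one token."""
    return [
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    ]

# The whitespace tokenizer splits "test_table_for analyzers" into two
# tokens; look at the grams produced for the first one:
grams = ngram_filter("test_table_for")
print("test_table" in grams)      # True: the 10-char prefix fits max_gram
print("test_table_for" in grams)  # False: 14 chars exceeds max_gram
```

Since no gram is longer than 10 characters, the 14-character query term can never match a stored token.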

How can I add prefix-based matching on top of the existing n-gram analyzer? I should still get results for matches of up to 10 characters (which currently works), and in addition I should get suggestions whenever the search string matches a record from its beginning.

Tags: elasticsearch, lucene, n-gram

Solution


This isn't possible with a single analyzer. You have to create another field that stores `edge_ngram` tokens, which will be used for the prefix search. Below is an index mapping that also includes your current analyzer.

Index mapping

{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 30
                },
                "nGram_filter": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "digit",
                        "punctuation",
                        "symbol"
                    ]
                }
            },
            "analyzer": {
                "prefixanalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter"
                    ]
                },
                "ngramanalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "nGram_filter"
                    ]
                }
            }
        },
        "index.max_ngram_diff" : 30
    },
    "mappings": {
        "properties": {
            "title_prefix": {
                "type": "text",
                "analyzer": "prefixanalyzer",
                "search_analyzer": "standard"
            },
            "title" :{
                "type": "text",
                "analyzer": "ngramanalyzer",
                "search_analyzer": "standard"
            }
        }
    }
}
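Note the `"index.max_ngram_diff": 30` setting: since Elasticsearch 7, indexing rejects an ngram filter whose `max_gram - min_gram` exceeds this limit (the default is 1). A minimal sketch of that guard (an illustration with a hypothetical helper name, not the Elasticsearch implementation):

```python
def ngram_diff_allowed(min_gram, max_gram, max_ngram_diff=1):
    """Mirror Elasticsearch's ngram-diff guard; the index-level
    default for max_ngram_diff is 1."""
    return (max_gram - min_gram) <= max_ngram_diff

# With the default limit, the 3..10 nGram filter would be rejected:
print(ngram_diff_allowed(3, 10))      # False
# The mapping above raises the limit to 30, so it is accepted:
print(ngram_diff_allowed(3, 10, 30))  # True
```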

Now you can use the `_analyze` API to confirm the prefix tokens:

{
    "analyzer": "prefixanalyzer",
    "text" : "test_table_for analyzers"
}

And your token `test_table_for` is present, as shown below:

{
    "tokens": [
        { "token": "t", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "te", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "tes", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test_", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test_t", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test_ta", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test_tab", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test_tabl", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test_table", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test_table_", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test_table_f", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test_table_fo", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "test_table_for", "start_offset": 0, "end_offset": 14, "type": "<ALPHANUM>", "position": 0 },
        { "token": "a", "start_offset": 15, "end_offset": 24, "type": "<ALPHANUM>", "position": 1 },
        { "token": "an", "start_offset": 15, "end_offset": 24, "type": "<ALPHANUM>", "position": 1 },
        { "token": "ana", "start_offset": 15, "end_offset": 24, "type": "<ALPHANUM>", "position": 1 },
        { "token": "anal", "start_offset": 15, "end_offset": 24, "type": "<ALPHANUM>", "position": 1 },
        { "token": "analy", "start_offset": 15, "end_offset": 24, "type": "<ALPHANUM>", "position": 1 },
        { "token": "analyz", "start_offset": 15, "end_offset": 24, "type": "<ALPHANUM>", "position": 1 },
        { "token": "analyze", "start_offset": 15, "end_offset": 24, "type": "<ALPHANUM>", "position": 1 },
        { "token": "analyzer", "start_offset": 15, "end_offset": 24, "type": "<ALPHANUM>", "position": 1 },
        { "token": "analyzers", "start_offset": 15, "end_offset": 24, "type": "<ALPHANUM>", "position": 1 }
    ]
}
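The token list above can be reproduced with a plain-Python sketch of the `edge_ngram` filter (an illustration only, assuming the filter's default left-edge behavior with `min_gram: 1` / `max_gram: 30`):

```python
def edge_ngram_filter(token, min_gram=1, max_gram=30):
    """Emit left-edge prefixes of the token, mimicking
    Elasticsearch's edge_ngram token filter."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

tokens = edge_ngram_filter("test_table_for")
print(tokens[:4])  # ['t', 'te', 'tes', 'test']
print(tokens[-1])  # 'test_table_for'
```

Because every prefix up to the full token length is indexed, the 14-character query term is now a stored token.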

Now you can use a `multi_match` query over both fields, which will give you the search results you want:

Search query

{
    "query": {
        "multi_match": {
            "query": "test_table_for",
            "fields": [
                "title",
                "title_prefix"
            ]
        }
    }
}

Search result

 "hits": [
            {
                "_index": "so_63981157",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.45920232,
                "_source": {
                    "title_prefix": "test_table_for analyzers",
                    "title": "test_table_for analyzers"
                }
            }
        ]
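Putting it together: at search time the `standard` search analyzer leaves `test_table_for` intact as a single term, which matches an `edge_ngram` token in `title_prefix` even though it is too long to be an n-gram in `title`. A plain-Python sketch of that matching logic (an illustration, not how Lucene actually matches or scores):

```python
def ngram_filter(token, min_gram=3, max_gram=10):
    """Substrings of length min_gram..max_gram (nGram filter sketch)."""
    return [
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    ]

def edge_ngram_filter(token, min_gram=1, max_gram=30):
    """Left-edge prefixes (edge_ngram filter sketch)."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def index_tokens(text):
    """Approximate both index-time analyzers; lower()/split() stands in
    for the standard tokenizer, which keeps underscores inside a token."""
    title, title_prefix = set(), set()
    for word in text.lower().split():
        title.update(ngram_filter(word))
        title_prefix.update(edge_ngram_filter(word))
    return title, title_prefix

title, title_prefix = index_tokens("test_table_for analyzers")
query = "test_table_for"          # one term after the standard search analyzer
print(query in title)             # False: longer than max_gram 10
print(query in title_prefix)      # True: it is a left-edge prefix
```

So the `multi_match` query hits the document through `title_prefix`, while shorter queries like `test_table` still match through `title` as before.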
