Is there a way to disable fuzzy search on numbers within text in Elasticsearch?

Problem description

I have a few strings, for example:

1. 'any text marium malik 127'
2. 'other text marium malik 1.7 other text'
3. 'marium malik 1 7' etc. 
4. 'any other text only'

Mapping:

'terms' => ['type' => 'text', 'analyzer' => 'new_analyzer']

'new_analyzer' => [
    'tokenizer' => 'standard',
    'filter' => [
        'word_delimiter', 'lowercase',
        'shingles_2_3', 'remove_space',
    ],
],

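If I enable fuzziness and set it to auto, then search for "marium malik 127", I also get the second and third strings in my results because of the fuzziness, which I do not want. Is there any way to disable fuzziness for numbers?

Full mapping: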

'body' => [
    'settings' => [
        'analysis' => [
            'analyzer' => [
                'extract_number_analyzer' => [
                    'tokenizer' => 'standard',
                    'filter' => ['extract_numbers', 'decimal_digit'],
                ],
                'new_analyzer' => [
                    'tokenizer' => 'standard',
                    'filter' => [
                        'word_delimiter', 'lowercase', 'word_combination', 'length2', 'remove_space',
                    ],
                ],
            ],
            'filter' => [
                'word_combination' => [
                    'type' => 'shingle',
                    'min_shingle_size' => 2,
                    'max_shingle_size' => 3,
                    'output_unigrams' => true,
                ],
                'extract_numbers' => [
                    'type' => 'keep_types',
                    'types' => ['<NUM>'],
                ],
                'remove_space' => [
                    'type' => 'pattern_replace',
                    'pattern' => ' ',
                    'replacement' => '',
                ],
                'length2' => [
                    'type' => 'length',
                    'min' => '3',
                ],
            ],
        ],
    ],
    'mappings' => [
        '_doc' => [
            'terms' => [
                'type' => 'text',
                'analyzer' => 'new_analyzer',
                'fields' => [
                    'extracted_number' => [
                        'type' => 'text',
                        'analyzer' => 'extract_number_analyzer',
                    ],
                ],
            ],
        ],
    ],
]

Tags: elasticsearch

Solution


You can use the keep_types token filter to keep only the numeric tokens in a subfield.

Analyzer example:

PUT /keep_types_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "extract_number_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["extract_numbers", "decimal_digit"]
                }
            },
            "filter" : {
                "extract_numbers" : {
                    "type" : "keep_types",
                    "types" : [ "<NUM>" ]
                }
            }
        }
    }
}
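
To get a feel for what this analyzer would store in the subfield, here is a rough pure-Python approximation of "standard tokenizer + keep_types <NUM>". It does not call Elasticsearch; the function name extract_numbers and the regular expression are assumptions made for illustration, and the real tokenizer follows Unicode word-break rules that this sketch only imitates for simple ASCII input.

```python
import re

# Assumption: a <NUM> token is a run of ASCII digits, optionally joined by
# single '.' or ',' separators (e.g. "1.7"), and not glued to letters.
# This only imitates the standard tokenizer; it is not the real thing.
_NUM = re.compile(r"(?<![0-9A-Za-z])[0-9]+(?:[.,][0-9]+)*(?![0-9A-Za-z])")


def extract_numbers(text):
    """Return the numeric tokens the subfield would roughly contain."""
    return _NUM.findall(text)


docs = [
    "any text marium malik 127",
    "other text marium malik 1.7 other text",
    "marium malik 1 7",
    "any other text only",
]

for doc in docs:
    print(repr(doc), "->", extract_numbers(doc))
```

Under these assumptions the four sample strings yield "127", then "1.7", then "1" and "7", then nothing. Only the first document carries the exact token "127", which is what the non-fuzzy clause relies on.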

Then, in the mapping:

...
{
  "terms": {
    "type": "text",
    "analyzer": "new_analyzer",
    "fields": {
      "extracted_number": {
        "type": "text",
        "analyzer": "extract_number_analyzer"
      }
    }
  }
}
...

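Then, at query time, add a clause that matches the numeric subfield without fuzziness. A document will then match only if its numbers match exactly, while the rest of the text is still matched with fuzziness.

Query example: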

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "terms": {
              "query": "marium malik 127",
              "fuzziness": "auto"
            }
          }
        },
        {
          "match": {
            "terms.extracted_number": { // or whatever your subfield name is
              "query": "marium malik 127",
              "zero_terms_query": "all" // match everything if the query contains no number
            }
          }
        }
      ]
    }
  }
}
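
If you build this request in code, a small helper keeps the two clauses in sync. The sketch below only assembles the request body as a Python dict and prints it as JSON; build_query and its parameter names are hypothetical, and the field names "terms" and "extracted_number" are taken from the mapping above. Sending the body to Elasticsearch is left to whichever client you use.

```python
import json


def build_query(text, field="terms", number_subfield="extracted_number"):
    """Fuzzy match on the text field, plus an exact match on extracted numbers."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Fuzzy clause: tolerates typos in the words.
                    {"match": {field: {"query": text, "fuzziness": "AUTO"}}},
                    # Exact clause: no fuzziness, so numbers must match as-is.
                    # zero_terms_query=all lets queries without numbers pass.
                    {
                        "match": {
                            f"{field}.{number_subfield}": {
                                "query": text,
                                "zero_terms_query": "all",
                            }
                        }
                    },
                ]
            }
        }
    }


print(json.dumps(build_query("marium malik 127"), indent=2))
```

The same search text is passed to both clauses on purpose: the subfield's analyzer discards everything except the numeric tokens, so there is no need to split the input yourself.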
