Ignore term frequency but keep positions

Problem description

I have an index with text fields. I want scoring to ignore term frequency while keeping positions, so that match_phrase queries still work.

My index is defined as follows:

curl --location --request PUT 'localhost:9200/my-index-001' \
--header 'Content-Type: application/json' \
--data-raw '{
    "mappings": {
        "autocomplete": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "row_autocomplete"
                },
                "name": {
                    "type": "text",
                    "analyzer": "row_autocomplete"
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "row_autocomplete": {
                    "tokenizer": "icu_tokenizer",
                    "filter": ["icu_folding", "autocomplete_filter", "lowercase"]
                }
            },
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            }
        }
    }
}'

Indexed data

[
    {
        "title": "university",
        "name": "london and EC london English"
    },
    {
        "title": "city",
        "name": "london"
    }
]

When I run a match query like this one, I expect the city document to get the higher score:

POST _search

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "name": {
                            "query": "london"
                        }
                    }
                },
                {
                    "match_phrase": {
                        "name": {
                            "query": "london",
                        }
                    }
                }
            ]
        }
    }
}

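Because of term frequency, the two documents get different scores (the university's score is actually higher than the city's). What I want is for term frequency to count only once, with scoring still taking fieldLength into account. The city's fieldLength is smaller than the university's, so if the repeated termFreq could be ignored, the city would score higher than the university. For reference, here is what Elasticsearch's explain output shows: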

GET _explain

# city's _explain
{
    "value": 2.0785222,
    "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
    "details": [
        {
            "value": 6.0,
            "description": "termFreq=6.0",
            "details": []
        },
        {
            "value": 2.0,
            "description": "fieldLength",
            "details": []
        },
        ...
    ]
}

# university's explain
{
    "value": 2.1087635,
    "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
    "details": [
        {
            "value": 24.0,
            "description": "termFreq=24.0",
            "details": []
        },
        {
            "value": 29.0,
            "description": "fieldLength",
            "details": []
        },
        ...
    ]
}
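To make the arithmetic concrete, here is a minimal Python sketch of the tfNorm formula quoted in the explain output. The termFreq and fieldLength values are copied from the output above. k1 = 1.2 and b = 0.75 are Elasticsearch's BM25 defaults and are assumed here because the index settings do not override them. avgFieldLength is elided in the explain output, so AVG_FIELD_LENGTH below is an assumed placeholder, and the function and variable names are purely illustrative.

```python
# tfNorm as described in the explain output:
#   (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
K1 = 1.2   # Elasticsearch's default BM25 k1 (assumed, not overridden in the mapping)
B = 0.75   # Elasticsearch's default BM25 b (assumed, not overridden in the mapping)

# Assumption: the real avgFieldLength is elided ("...") in the explain output.
AVG_FIELD_LENGTH = 35.0


def tf_norm(freq, field_length, avg_field_length=AVG_FIELD_LENGTH, k1=K1, b=B):
    """Compute the BM25 tfNorm factor for one term in one document."""
    length_factor = 1 - b + b * field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * length_factor)


# termFreq and fieldLength as reported by _explain above.
docs = {
    "city": {"freq": 6.0, "field_length": 2.0},
    "university": {"freq": 24.0, "field_length": 29.0},
}

for label, doc in docs.items():
    as_indexed = tf_norm(doc["freq"], doc["field_length"])
    counted_once = tf_norm(1.0, doc["field_length"])  # term frequency capped at 1
    print(f"{label}: as indexed={as_indexed:.4f}, freq counted once={counted_once:.4f}")
```

Under these assumptions the university's tfNorm is higher when the reported frequencies are used, while capping the frequency at 1 leaves only the length normalization and puts the city first, which is the behavior asked for.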

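I tried a few things. For example, in the index mapping I can set index_options=docs to ignore term frequency, but that also disables term positions, so I can no longer use match_phrase queries.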
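The trade-off described above follows from how the index_options levels nest: each level stores everything the previous one does plus one more kind of data. The Python sketch below is only a model of that nesting, not an Elasticsearch API; the dictionary and helper names are hypothetical.

```python
# Model of what each index_options level stores in the inverted index.
# Each level includes everything stored by the levels before it.
INDEX_OPTIONS = {
    "docs": {"doc_ids"},
    "freqs": {"doc_ids", "term_frequencies"},
    "positions": {"doc_ids", "term_frequencies", "positions"},
    "offsets": {"doc_ids", "term_frequencies", "positions", "offsets"},
}


def supports_phrase_queries(option):
    """Phrase queries need term positions."""
    return "positions" in INDEX_OPTIONS[option]


def stores_term_frequency(option):
    """Term frequencies feed into scoring whenever they are indexed."""
    return "term_frequencies" in INDEX_OPTIONS[option]


def levels_matching(want_phrase, want_term_frequency):
    """List the levels that satisfy both requirements."""
    return [
        option
        for option in INDEX_OPTIONS
        if supports_phrase_queries(option) == want_phrase
        and stores_term_frequency(option) == want_term_frequency
    ]


# Positions without term frequencies: no level offers this combination.
print(levels_matching(want_phrase=True, want_term_frequency=False))
```

Since no level keeps positions while dropping frequencies, the mapping's index_options alone cannot give the combination asked for here.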

Does anyone have any ideas?

Thanks in advance.

Tags: elasticsearch

Solution


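I indexed the two sample documents you provided using the default index mapping, so both the title and name fields are text fields. I then ran the same query for london, and it returned the higher score for the document that contains only london, as shown below: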

"hits": [
            {
                "_index": "matchrel",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.51518387,
                "_source": {
                    "title": "city",
                    "name": "london"
                }
            },
            {
                "_index": "matchrel",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.41750965,
                "_source": {
                    "title": "university",
                    "name": "london university and EC London English"
                }
            }
        ]

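Also, since you haven't explained your use case in detail and the information is limited, it seems this can be achieved easily with the following query, which also returns a higher score for the london document: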

{"query": {"match_phrase": {"name": "london"}}}

And its search results:

 "hits": [
            {
                "_index": "matchrel",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.25759193, // note score
                "_source": {
                    "title": "city",
                    "name": "london"
                }
            },
            {
                "_index": "matchrel",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.20875482,
                "_source": {
                    "title": "university",
                    "name": "london university and EC London English"
                }
            }
        ]
