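elasticsearch - Ignore term frequency but keep positions
Question
I have an index with text fields. I want scoring to ignore term frequency but keep positions, so that match_phrase queries still work.
My index is defined as follows: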
curl --location --request PUT 'localhost:9200/my-index-001' \
--header 'Content-Type: application/json' \
--data-raw '{
"mappings": {
"autocomplete": {
"properties": {
"title": {
"type": "text",
"analyzer": "row_autocomplete"
},
"name": {
"type": "text",
"analyzer": "row_autocomplete"
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"row_autocomplete": {
"tokenizer": "icu_tokenizer",
"filter": ["icu_folding", "autocomplete_filter", "lowercase"]
}
},
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}
}
}
}'
Indexed data:
[
{
"title": "university",
"name": "london and EC london English"
},
{
"title": "city",
"name": "london"
}
]
When I run a match query like this, I expect city to get the higher score:
POST _search
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "london"
}
}
},
{
"match_phrase": {
"name": {
"query": "london"
}
}
}
]
}
}
}
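Because of term frequency, the two documents get different scores (university actually scores higher than city). What I want is for a term to be counted only once and for the ranking to follow fieldLength. city's fieldLength is smaller than university's, so if repeated termFreq were ignored, city would score higher than university. Here is what Elasticsearch's explain output shows: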
GET _explain
# city's _explain
{
"value": 2.0785222,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 6.0,
"description": "termFreq=6.0",
"details": []
},
{
"value": 2.0,
"description": "fieldLength",
"details": []
},
...
]
}
# university's explain
{
"value": 2.1087635,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 24.0,
"description": "termFreq=24.0",
"details": []
},
{
"value": 29.0,
"description": "fieldLength",
"details": []
},
...
]
}
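To make the effect concrete, the tfNorm formula from the explain output can be evaluated directly. This is only a sketch: k1=1.2 and b=0.75 are the BM25 defaults, and avgFieldLength is elided in the output above, so the value 35.4 used here is a hypothetical stand-in chosen for illustration, not a number taken from the index.

```python
def tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """BM25 term-frequency component, as written in the explain description."""
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


# ASSUMPTION: avgFieldLength is not shown in the explain output; 35.4 is a
# hypothetical value used only to illustrate the behaviour of the formula.
AVG_FIELD_LENGTH = 35.4

# termFreq and fieldLength as reported by _explain
city_actual = tf_norm(6.0, 2.0, AVG_FIELD_LENGTH)
university_actual = tf_norm(24.0, 29.0, AVG_FIELD_LENGTH)

# Same field lengths, but every term counted only once
city_once = tf_norm(1.0, 2.0, AVG_FIELD_LENGTH)
university_once = tf_norm(1.0, 29.0, AVG_FIELD_LENGTH)

print("reported freq:     city=%.4f university=%.4f" % (city_actual, university_actual))
print("freq counted once: city=%.4f university=%.4f" % (city_once, university_once))
```

With the reported frequencies, university comes out slightly ahead. With the frequency fixed at 1, only the length normalization is left and the shorter city field wins, which is the ordering I am after.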
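I have tried a few things. For example, I can set index_options=docs in the index mapping to ignore term frequency, but that also disables term positions, so I can no longer use match_phrase queries.
Does anyone have an idea?
Thanks in advance.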
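Answer
I indexed your two sample documents using the default index mapping, so both the title and name fields are plain text fields, and ran the same query. It gives the higher score to the document that contains only london, as shown below: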
"hits": [
{
"_index": "matchrel",
"_type": "_doc",
"_id": "1",
"_score": 0.51518387,
"_source": {
"title": "city",
"name": "london"
}
},
{
"_index": "matchrel",
"_type": "_doc",
"_id": "2",
"_score": 0.41750965,
"_source": {
"title": "university",
"name": "london university and EC London English"
}
}
]
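Also, since you have not explained your use case in detail and the information is limited, it seems this can be achieved simply with the following query, which also gives the london document the higher score:
{"query": {"match_phrase": {"name": "london"}}}
And its search result: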
"hits": [
{
"_index": "matchrel",
"_type": "_doc",
"_id": "1",
"_score": 0.25759193, // note score
"_source": {
"title": "city",
"name": "london"
}
},
{
"_index": "matchrel",
"_type": "_doc",
"_id": "2",
"_score": 0.20875482,
"_source": {
"title": "university",
"name": "london university and EC London English"
}
}
]
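If you need to keep the row_autocomplete analyzer and still want term frequency ignored explicitly, one option that may be worth trying is a custom similarity on the name field. index_options is left at its default for text fields (positions), so match_phrase keeps working. The sketch below assumes your Elasticsearch version supports the scripted similarity type. The name tf_once is hypothetical, the script is a BM25-like formula with the frequency fixed at 1, and I have not run it against your data. Python is used here only to build and print the JSON fragments.

```python
import json

# ASSUMPTION: BM25-like score with freq fixed at 1, so doc.freq is never read.
# k1 and b are the BM25 defaults; the average length is derived from field stats.
SCRIPT = (
    "double k1 = 1.2; double b = 0.75; "
    "double avgLen = (double) field.sumTotalTermFreq / field.docCount; "
    "double idf = Math.log(1 + (field.docCount - term.docFreq + 0.5) / (term.docFreq + 0.5)); "
    "double norm = (k1 + 1) / (1 + k1 * (1 - b + b * doc.length / avgLen)); "
    "return query.boost * idf * norm;"
)

# Goes under "settings" next to the existing "analysis" block.
# "tf_once" is a hypothetical name for the custom similarity.
settings_fragment = {
    "similarity": {
        "tf_once": {
            "type": "scripted",
            "script": {"source": SCRIPT},
        }
    }
}

# Replacement definition for the "name" field in the mapping.
# index_options is not set, so positions are still indexed.
name_field = {
    "type": "text",
    "analyzer": "row_autocomplete",
    "similarity": "tf_once",
}

print(json.dumps(settings_fragment, indent=2))
print(json.dumps(name_field, indent=2))
```

Changing the similarity of an existing field generally means creating a new index and reindexing, so test this on a copy first and compare the _explain output for both documents.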