elasticsearch - How to use standard tokenizer with preserve_original?
Problem description
I created two custom analyzers as shown below, but neither works the way I want.
Here is what I want in my inverted index: for example, for the word reb-tn2000xxxl,
I need to have reb, tn2000xxxl, and reb-tn2000xxxl in my inverted index.
{
  "analysis": {
    "filter": {
      "my_word_delimiter": {
        "split_on_numerics": "true",
        "generate_word_parts": "true",
        "preserve_original": "true",
        "generate_number_parts": "true",
        "catenate_all": "true",
        "split_on_case_change": "true",
        "type": "word_delimiter"
      }
    },
    "analyzer": {
      "my_analyzer": {
        "filter": [
          "standard",
          "lowercase",
          "my_word_delimiter"
        ],
        "type": "custom",
        "tokenizer": "whitespace"
      },
      "standard_caseinsensitive": {
        "filter": [
          "standard",
          "lowercase"
        ],
        "type": "custom",
        "tokenizer": "keyword"
      },
      "my_delimiter": {
        "filter": [
          "lowercase",
          "my_word_delimiter"
        ],
        "type": "custom",
        "tokenizer": "standard"
      }
    }
  }
}
If I use my_analyzer, which uses the whitespace tokenizer, the result looks like this when I check with curl:
curl -XGET "index/_analyze?analyzer=my_analyzer&pretty=true" -d "reb-tn2000xxxl"
{
"tokens" : [ {
"token" : "reb-tn2000xxxl",
"start_offset" : 0,
"end_offset" : 14,
"type" : "word",
"position" : 0
}, {
"token" : "reb",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
}, {
"token" : "rebtn2000xxxl",
"start_offset" : 0,
"end_offset" : 14,
"type" : "word",
"position" : 0
}, {
"token" : "tn",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "2000",
"start_offset" : 6,
"end_offset" : 10,
"type" : "word",
"position" : 2
}, {
"token" : "xxxl",
"start_offset" : 10,
"end_offset" : 14,
"type" : "word",
"position" : 3
} ]
}
So here I am missing the tn2000xxxl split, which I could get by using the standard tokenizer instead of whitespace. The problem is that once I use standard, as the my_delimiter custom analyzer does, I no longer have the original value in the inverted index. It seems that the standard tokenizer and the preserve_original filter don't work together. I read somewhere that this is because the standard tokenizer already splits the input before the filter is applied, so the "original" the filter sees is no longer the full string. But how can I keep the original while still splitting the way the standard tokenizer does?
curl -XGET "index/_analyze?analyzer=my_delimiter&pretty=true" -d "reb-tn2000xxxl"
{
  "tokens" : [ {
    "token" : "reb",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "tn2000xxxl",
    "start_offset" : 4,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "tn",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "tn2000xxxl",
    "start_offset" : 4,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "2000",
    "start_offset" : 6,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "xxxl",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
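The output above hints at the cause: the full string reb-tn2000xxxl never appears, while tn2000xxxl shows up twice at position 1 (once as a preserved "original", once catenated). To confirm that the split happens before the filter runs, the standard tokenizer can be tested on its own with something like the query-param form used above (the exact _analyze syntax varies by Elasticsearch version):
curl -XGET "index/_analyze?tokenizer=standard&pretty=true" -d "reb-tn2000xxxl"
This already returns reb and tn2000xxxl as two separate tokens, so by the time my_word_delimiter runs, the hyphenated string is gone and preserve_original can only preserve the pieces.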
Solution
In Elasticsearch, you can have multi-fields on your mapping. The behavior that you are describing is actually pretty common. You can have your main text field analyzed with the standard analyzer and a keyword field as well. Here's an example mapping using multi-fields from the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
In this example, the "city" field will be analyzed with the standard analyzer, and "city.raw" will be the non-analyzed keyword field. In other words, "city.raw" is the original string.
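Applied to the analyzers from the question, a mapping along these lines should give both behaviors at once: the main field is split by my_delimiter, while a raw sub-field analyzed with the question's own standard_caseinsensitive analyzer (keyword tokenizer plus lowercase) keeps the whole lowercased original, e.g. reb-tn2000xxxl. This is only a sketch: the field name product_code is made up for illustration, and it assumes the analysis settings above are registered on the index.
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "product_code": {
          "type": "text",
          "analyzer": "my_delimiter",
          "fields": {
            "raw": {
              "type": "text",
              "analyzer": "standard_caseinsensitive"
            }
          }
        }
      }
    }
  }
}
Searches that need the exact original (e.g. a match for reb-tn2000xxxl) can then target product_code.raw, while queries against product_code match the split tokens such as reb, tn, 2000, xxxl, and tn2000xxxl.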