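Requirements
- For English words, i.e. tokens of type <ALPHANUM>, emit every prefix of the word; Chinese characters, i.e. tokens of the ideographic type, are emitted one character per token. For example, "EVA新世纪福音战士" becomes [E, EV, EVA, 新, 世, 纪, 福, 音, 战, 士], and "EVA EGG京" becomes [E, EV, EVA, E, EG, EGG, 京].
- Filter out English stop words.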
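The first rule can be illustrated with a small Python sketch. It is not the plugin's code (the plugin itself is built with Maven for ES). The function name `edge_ngram2`, the (term, type) input shape, and the pass-through of other token types are assumptions made for illustration; the type labels mirror those assigned by the standard tokenizer.

```python
def edge_ngram2(tokens):
    """Expand (term, type) pairs following the rule described above.

    `tokens` stands in for a tokenizer's output; the pair shape is
    an assumption of this sketch, not the plugin's API.
    """
    out = []
    for term, token_type in tokens:
        if token_type == "<ALPHANUM>":
            # English word: emit every prefix, shortest first.
            out.extend(term[:i] for i in range(1, len(term) + 1))
        elif token_type == "<IDEOGRAPHIC>":
            # Chinese text: one token per character.
            out.extend(term)
        else:
            # Other token types pass through unchanged (assumption).
            out.append(term)
    return out


tokens = [("EVA", "<ALPHANUM>"), ("EGG", "<ALPHANUM>"), ("京", "<IDEOGRAPHIC>")]
print(edge_ngram2(tokens))
```

For the "EVA EGG京" example this prints ['E', 'EV', 'EVA', 'E', 'EG', 'EGG', '京'], the same list as in the requirement.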
Software versions
ES 7.2.1
Kibana 7.2.1
Implementation
https://github.com/ralgond/elasticsearch-edgengram2-token-filter
Installing the plugin
git clone git@github.com:ralgond/elasticsearch-edgengram2-token-filter.git
cd elasticsearch-edgengram2-token-filter
mvn clean package
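Copy target/releases/elasticsearch-edgengram2-token-filter-7.2.1.zip into the plugins directory of ES, unzip it into a folder named elasticsearch-edgengram2-token-filter-7.2.1, then delete the zip file.
Restart ES.
Check the startup log to confirm that the plugin has been loaded.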
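Besides reading the log, the installed plugins can be listed over HTTP with the `_cat/plugins` API, whose default columns are node name, component and version. The Python sketch below only parses that text. The helper name `plugin_listed`, the `edgengram2` name fragment and the localhost address in the comment are assumptions, since the name under which the plugin registers is not stated here.

```python
def plugin_listed(cat_plugins_text, name_fragment):
    """Return True if any row's component column contains name_fragment."""
    for line in cat_plugins_text.splitlines():
        columns = line.split()
        # Default column order of _cat/plugins: name, component, version.
        if len(columns) >= 2 and name_fragment in columns[1]:
            return True
    return False


# To fetch the text from a local node (assumed address), one could run:
#   import urllib.request
#   text = urllib.request.urlopen("http://localhost:9200/_cat/plugins").read().decode()
#   print(plugin_listed(text, "edgengram2"))
```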
Usage
First, define an index:
DELETE idx-15
PUT idx-15
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stop",
            "edge_ngram2"
          ]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
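The search result further down contains a document with ID 1 and the title "the EVA EGG京", so such a document has to be indexed before querying. A minimal Python sketch of that request follows, assuming ES is reachable at http://localhost:9200 without authentication; the helper name `build_index_request` is made up for this example.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_id, doc):
    """Build a PUT <index>/_doc/<id> request.

    refresh=true makes the document visible to search right away.
    """
    body = json.dumps(doc, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/_doc/{doc_id}?refresh=true",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_index_request(
    "http://localhost:9200", "idx-15", 1, {"title": "the EVA EGG京"}
)
# urllib.request.urlopen(req)  # uncomment to send it to a running node
```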
Then, with a document titled "the EVA EGG京" indexed under ID 1, run the query:
GET idx-15/_search
{
  "query": {
    "match": {
      "title": {
        "query": "EV"
      }
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}
This returns the following search result:
{
  "took" : 156,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.3754495,
    "hits" : [
      {
        "_index" : "idx-15",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.3754495,
        "_source" : {
          "title" : "the EVA EGG京"
        },
        "highlight" : {
          "title" : [
            "the <em>EV</em>A EGG京"
          ]
        }
      }
    ]
  }
}
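The hit can be explained by how the two analyzers interact. At index time the title is lowercased, stop words are dropped and every English word is expanded into its prefixes, while the query text goes through the standard analyzer and becomes the single term "ev". The Python sketch below imitates this with a regular expression in place of the real tokenizer and a tiny stop-word set. It assumes that each prefix token keeps the offsets of the characters it covers, which is inferred from the highlight above rather than from the plugin's source.

```python
import re

# Illustrative subset only; the real _english_ list is longer.
STOP_WORDS = {"a", "an", "and", "of", "the"}

# An ASCII alphanumeric run, or a single CJK unified ideograph.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def index_tokens(text):
    """Return (term, start, end) triples as the index analyzer might emit them."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        term = m.group().lower()
        if term in STOP_WORDS:
            continue
        if term.isascii():
            # One token per prefix, each covering only its own characters.
            for i in range(1, len(term) + 1):
                tokens.append((term[:i], m.start(), m.start() + i))
        else:
            tokens.append((term, m.start(), m.end()))
    return tokens


def highlight(text, query):
    """Wrap every indexed span whose term equals a lowercased query term."""
    query_terms = {m.group().lower() for m in TOKEN_RE.finditer(query)}
    spans = sorted((s, e) for term, s, e in index_tokens(text) if term in query_terms)
    out, pos = [], 0
    for s, e in spans:
        if s < pos:
            continue  # skip spans overlapping one already emitted
        out.append(text[pos:s])
        out.append("<em>" + text[s:e] + "</em>")
        pos = e
    out.append(text[pos:])
    return "".join(out)


print(highlight("the EVA EGG京", "EV"))
```

This prints the <em>EV</em>A EGG京, matching the highlight in the response. Searching for "the" finds nothing in this sketch, because the stop filter removed that word at index time.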