Elasticsearch (1): Writing a Token Filter Plugin

ralgo 2021-01-18 21:08

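Requirements

  1. For English words, i.e. tokens of type <ALPHANUM>, emit every prefix of the word (edge n-grams). Chinese characters, i.e. tokens of the ideographic type, are split into single characters. For example, "EVA新世纪福音战士" is split into [E, EV, EVA, 新, 世, 纪, 福, 音, 战, 士], and "EVA EGG京" is split into [E, EV, EVA, E, EG, EGG, 京].

  2. Filter out English stop words.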
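
The two requirements can be illustrated with a small standalone sketch. This is not the plugin's code and does not use the Lucene APIs. The class name `EdgeNgramSketch` and its helpers are hypothetical, and the sketch assumes that an "English word" is a run of ASCII letters or digits and that stop words are matched case-insensitively.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class EdgeNgramSketch {

    // Splits text into prefixes of ASCII words and single ideographic characters.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean asciiAlnum = cp < 128 && Character.isLetterOrDigit(cp);
            if (asciiAlnum) {
                word.appendCodePoint(cp);   // still inside an English word
                continue;
            }
            flush(word, stopWords, out);    // any other character ends the word
            if (Character.isIdeographic(cp)) {
                out.add(new String(Character.toChars(cp)));  // one token per character
            }
        }
        flush(word, stopWords, out);
        return out;
    }

    // Emits all prefixes of the buffered word unless it is a stop word.
    private static void flush(StringBuilder word, Set<String> stopWords, List<String> out) {
        if (word.length() == 0) {
            return;
        }
        String w = word.toString();
        word.setLength(0);
        if (stopWords.contains(w.toLowerCase(Locale.ROOT))) {
            return;                         // requirement 2: drop stop words
        }
        for (int end = 1; end <= w.length(); end++) {
            out.add(w.substring(0, end));   // requirement 1: E, EV, EVA
        }
    }
}
```

Calling `EdgeNgramSketch.analyze("EVA EGG京", Set.of())` returns the same list as the second example above. In the real plugin the splitting into words and single characters is done by the tokenizer, and the filter only has to expand the <ALPHANUM> tokens.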

Software versions

ES 7.2.1
Kibana 7.2.1

Implementation

https://github.com/ralgond/elasticsearch-edgengram2-token-filter

Installing the plugin

git clone git@github.com:ralgond/elasticsearch-edgengram2-token-filter.git

cd elasticsearch-edgengram2-token-filter

mvn clean package

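Copy target/releases/elasticsearch-edgengram2-token-filter-7.2.1.zip into the ES plugins directory, unzip it into a folder named elasticsearch-edgengram2-token-filter-7.2.1, then delete the zip file.

Restart ES.

Check the startup log to confirm that the plugin has been loaded.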
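
Besides reading the log, the installed plugins can be listed through the `_cat/plugins` endpoint. The following is a minimal sketch using the JDK 11 HTTP client. The class `PluginCheck`, the base URL `http://localhost:9200`, and the assumption that the plugin's reported name contains `edgengram2` are illustrative and not taken from the original article.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class PluginCheck {

    // Returns true if any line of the _cat/plugins response mentions the name.
    static boolean isListed(String catPluginsBody, String pluginName) {
        return catPluginsBody.lines().anyMatch(line -> line.contains(pluginName));
    }

    // Fetches the plain-text plugin list from a node, e.g. http://localhost:9200.
    static String fetchPlugins(String baseUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(baseUrl + "/_cat/plugins"))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

With a node running, `PluginCheck.isListed(PluginCheck.fetchPlugins("http://localhost:9200"), "edgengram2")` should return true once the plugin is loaded, provided the name assumption holds for your build.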

Usage

First, define an index:

DELETE idx-15
PUT idx-15
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stop",
            "edge_ngram2"
          ]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
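
The mapping analyzes `title` with `my_analyzer` at index time but with `standard` at search time, so the query text is lowercased but not expanded into prefixes. The sketch below imitates that asymmetry in plain Java to show why a prefix query can match. It is only an approximation: `AnalyzerChainSketch` is a hypothetical class, the stop list is a small subset of the `_english_` list, the input is assumed to be already split into tokens, and treating any all-ASCII token as an English word is a simplification.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalyzerChainSketch {

    // Small illustrative subset of the _english_ stop word list.
    static final Set<String> STOP = Set.of("the", "a", "an", "and", "of");

    // Index side: lowercase, then stop filter, then prefix expansion.
    static Set<String> indexTerms(List<String> tokens) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);      // "lowercase"
            if (STOP.contains(lower)) {
                continue;                                       // "my_stop"
            }
            boolean ascii = lower.chars().allMatch(c -> c < 128);
            if (!ascii) {
                terms.add(lower);                               // single CJK character stays as is
                continue;
            }
            for (int end = 1; end <= lower.length(); end++) {
                terms.add(lower.substring(0, end));             // "edge_ngram2"
            }
        }
        return terms;
    }

    // Search side: the query is only lowercased, never expanded into prefixes.
    static boolean matches(Set<String> terms, String query) {
        return terms.contains(query.toLowerCase(Locale.ROOT));
    }
}
```

For the tokens `the`, `EVA`, `EGG`, `京`, the indexed terms include `ev`, so `matches(terms, "EV")` is true, while `matches(terms, "VA")` is false because only prefixes are indexed. Expanding the query as well would make a search for "EVA" also match documents that merely contain words starting with "E".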

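Then run a query. The result shown afterwards assumes that a document with ID 1 and the title "the EVA EGG京" has already been indexed into idx-15: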

GET idx-15/_search
{
  "query": {
    "match": {
      "title": {
        "query":  "EV"
      }
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

The search result:

{
  "took" : 156,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.3754495,
    "hits" : [
      {
        "_index" : "idx-15",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.3754495,
        "_source" : {
          "title" : "the EVA EGG京"
        },
        "highlight" : {
          "title" : [
            "the <em>EV</em>A EGG京"
          ]
        }
      }
    ]
  }
}
