Elasticsearch deduplication

Problem description

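I am using Elasticsearch 6.5.x and index documents into an ES index through an external web crawler. My index is shown below. I want to fetch records based on a filter. With the query below, a term query for https returns different results than a term query for http. Records 1 and 2 are the same page; they differ only in https versus http in the URL. How can I compare records on the part of the URL field that follows the protocol? If two records carry the same information, how can I show just one of them together with the remaining unique records?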

Index:

"title":"About elastic search"
"content":"Elasticsearch is an open source distributed, RESTful search and analytics engine capable of solving a growing number of use cases."
"URL":"https://www.elastic.co/webinars/getting-started-elasticsearch"

"title":"About elastic search"
"content":"Elasticsearch is an open source distributed, RESTful search and analytics engine capable of solving a growing number of use cases."
"URL":"http://www.elastic.co/webinars/getting-started-elasticsearch"

"title":"About Similarity"
"content":"A similarity (scoring / ranking model) defines how matching documents are scored. Similarity is per field, meaning that via the mapping one can define a different similarity per field."
"URL":"https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index-modules-similarity.html"

"title":"SQL Access"
"content":"This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features."
"URL":"http://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-sql.html"

Query:

GET test-index/_search
{
   "query":{
      "bool":{
         "must":{
            "query_string":{
               "query":"test"
            }
         },
         "filter":{
            "bool":{
               "must":{
                  "term":{ "url":"https" }
               }
            }
         }
      }
   }
}

Tags: elasticsearch

Solution
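
One possible approach, sketched here in Python under stated assumptions, is to derive a comparison key from each URL by dropping the protocol and then keep a single record per key. The helper names `url_key` and `dedupe_by_url` are hypothetical, and the preference for the https copy when both variants exist is an assumption rather than something stated in the question. The sample records are the four documents from the index above, with the content field left out for brevity.

```python
from urllib.parse import urlsplit


def url_key(url):
    """Return the URL without its protocol (hypothetical helper)."""
    parts = urlsplit(url)
    # Host names are case-insensitive, so lower-case the host only.
    key = parts.netloc.lower() + parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def dedupe_by_url(records, prefer_scheme="https"):
    """Keep one record per protocol-less URL.

    Assumption: when both variants exist, the `prefer_scheme` copy wins.
    """
    best = {}
    for rec in records:
        key = url_key(rec["URL"])
        current = best.get(key)
        if current is None:
            best[key] = rec
        elif (urlsplit(rec["URL"]).scheme == prefer_scheme
              and urlsplit(current["URL"]).scheme != prefer_scheme):
            best[key] = rec
    return list(best.values())


records = [
    {"title": "About elastic search",
     "URL": "https://www.elastic.co/webinars/getting-started-elasticsearch"},
    {"title": "About elastic search",
     "URL": "http://www.elastic.co/webinars/getting-started-elasticsearch"},
    {"title": "About Similarity",
     "URL": "https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index-modules-similarity.html"},
    {"title": "SQL Access",
     "URL": "http://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-sql.html"},
]

unique = dedupe_by_url(records)
for rec in unique:
    print(rec["title"], rec["URL"])
```

This sketch deduplicates on the client after the search returns. If the same key were computed at index time and stored in its own keyword field (a hypothetical `url_key` field, for instance), the `collapse` search parameter could likely return one hit per key from Elasticsearch itself, but that depends on a mapping the question does not show.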

