elasticsearch - Elasticsearch: Querying nested objects
Problem description
Dear elasticsearch experts,
I have a problem querying nested objects. Let's use the following simplified mapping:
{
"mappings" : {
"_doc" : {
"properties" : {
"companies" : {
"type": "nested",
"properties" : {
"company_id": { "type": "long" },
"name": { "type": "text" }
}
},
"title": { "type": "text" }
}
}
}
}
And put some documents into the index:
PUT my_index/_doc/1
{
"title" : "CPU release",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 2, "name" : "Intel" }
]
}
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/3
{
"title" : "GPU release 2018-03-01",
"companies" : [
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/4
{
"title" : "Chipset release",
"companies" : [
{ "company_id" : 2, "name" : "Intel" }
]
}
Now I want to execute a query like this:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": {
"bool": {
"must": [
{ "match": { "companies.name": "AMD" } }
]
}
},
"inner_hits" : {}
}
}
]
}
}
}
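As a result, I want to get the matching companies together with the number of matching documents. So the query above should give me:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 }
]
The following query: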
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": { "match_all": {} },
"inner_hits" : {}
}
}
]
}
}
}
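should give me all companies assigned to a document whose title contains "GPU", together with the number of matching documents:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 },
{ "company_id" : 3, "name" : "Nvidia", "matched_documents": 2 }
]
Is there any way to achieve this result with good performance? I am explicitly not interested in the matching documents themselves, only in the number of matching documents and the nested objects.
Thanks for your help.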
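Solution
What you need to do, in terms of Elasticsearch, is:
- filter the "parent" documents on the desired criteria (such as having GPU in title, or a mention of Nvidia in the companies list);
- group the "nested" documents into buckets by a certain criterion (e.g. company_id);
- count how many "nested" documents there are in each bucket.
Each object in a nested array is indexed as a separate hidden document, which complicates life a bit. Let's see how to aggregate on them.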
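To make the three steps concrete, here is a minimal in-memory sketch in Python. It does not talk to Elasticsearch; it only mimics the filter, group and count logic on the four sample documents from the question. The names DOCS and count_nested_by_company are hypothetical, and the title match is approximated by a case-insensitive whitespace token check, which is an assumption rather than real text analysis.

```python
from collections import Counter

# The four sample documents from the question, keyed by document id.
DOCS = {
    1: {"title": "CPU release", "companies": [
        {"company_id": 1, "name": "AMD"}, {"company_id": 2, "name": "Intel"}]},
    2: {"title": "GPU release 2018-01-10", "companies": [
        {"company_id": 1, "name": "AMD"}, {"company_id": 3, "name": "Nvidia"}]},
    3: {"title": "GPU release 2018-03-01", "companies": [
        {"company_id": 3, "name": "Nvidia"}]},
    4: {"title": "Chipset release", "companies": [
        {"company_id": 2, "name": "Intel"}]},
}


def count_nested_by_company(docs, title_term):
    """Filter parents by title, then count nested objects per company_id."""
    counts = Counter()
    for doc in docs.values():
        # Step 1: keep only "parent" documents whose title contains the term.
        # (Assumption: a lowercase whitespace split stands in for the analyzer.)
        if title_term.lower() not in doc["title"].lower().split():
            continue
        # Steps 2 and 3: bucket each nested object by company_id and count it.
        for company in doc["companies"]:
            counts[company["company_id"]] += 1
    return dict(counts)


print(count_nested_by_company(DOCS, "GPU"))  # {1: 1, 3: 2}
```

The aggregations in the rest of this answer ask Elasticsearch to do the same grouping and counting on the server side.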
So how to aggregate and count on the nested documents?
You can achieve this with a combination of the nested, terms and top_hits aggregations:
POST my_index/_doc/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "GPU"
}
},
{
"nested": {
"path": "companies",
"query": {
"match_all": {}
}
}
}
]
}
},
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
This will give the following output:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3, <== how many "nested" documents there were in total
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3, <== this bucket's key: "company_id": 3
"doc_count": 2, <== how many "nested" documents there were with such company_id?
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [ <== an example, "top hit" for such company_id
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
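Note that for Nvidia we have "doc_count": 2.
But what if we want to count the number of "parent" objects that have Nvidia vs Intel?
What if we want to count parent objects based on a nested bucket?
It can be achieved with the reverse_nested aggregation.
We need to change our query just a little bit: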
POST my_index/_doc/_search
{
"query": { ... },
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
},
"original doc count": { <== ask ES to count how many "parent" documents fall into this bucket
"reverse_nested": {}
}
}
}
}
}
}
}
The result will look like this:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3,
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 2,
"original doc count": {
"doc_count": 2 <== how many "parent" documents have such company_id
},
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"original doc count": {
"doc_count": 1
},
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
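To get from this response to the compact list asked for in the question, the buckets can be flattened on the client side. The following is a minimal sketch in Python, assuming the response has already been parsed into a dict and that the aggregation names match the query above. The function name companies_with_counts and the trimmed sample dict are hypothetical and only serve as illustration.

```python
def companies_with_counts(response,
                          nested_agg="Extract nested",
                          terms_agg="By company id",
                          hits_agg="Examples of such company_id",
                          parents_agg="original doc count"):
    """Flatten the aggregation buckets into one small dict per company."""
    buckets = response["aggregations"][nested_agg][terms_agg]["buckets"]
    result = []
    for bucket in buckets:
        # The single top hit carries the nested object's _source (id and name).
        source = bucket[hits_agg]["hits"]["hits"][0]["_source"]
        result.append({
            "company_id": bucket["key"],
            "name": source["name"],
            # reverse_nested doc_count: number of parent documents in the bucket.
            "matched_documents": bucket[parents_agg]["doc_count"],
        })
    return result


# Trimmed copy of the response shown above, keeping only the fields used here.
sample = {
    "aggregations": {
        "Extract nested": {
            "doc_count": 3,
            "By company id": {
                "buckets": [
                    {
                        "key": 3,
                        "doc_count": 2,
                        "original doc count": {"doc_count": 2},
                        "Examples of such company_id": {"hits": {"hits": [
                            {"_source": {"company_id": 3, "name": "Nvidia"}}
                        ]}},
                    },
                    {
                        "key": 1,
                        "doc_count": 1,
                        "original doc count": {"doc_count": 1},
                        "Examples of such company_id": {"hits": {"hits": [
                            {"_source": {"company_id": 1, "name": "AMD"}}
                        ]}},
                    },
                ]
            },
        }
    }
}

for row in companies_with_counts(sample):
    print(row)
```

If only the counts matter and the documents themselves do not, setting "size": 0 at the top level of the search request should keep the regular hits out of the response.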
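How can I spot the difference?
To make the difference evident, let's change the data a bit and add another Nvidia item to the companies list of one document: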
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
The last query (the one with reverse_nested) will give us the following:
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 3, <== 3 "nested" documents with Nvidia
"original doc count": {
"doc_count": 2 <== but only 2 "parent" documents
},
"Examples of such company_id": {
"hits": {
"total": 3,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 2
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
As you can see, this is a subtle difference that is hard to grasp, but it changes the semantics completely.
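The two numbers can be reproduced outside Elasticsearch with a few lines of Python. This is only an illustrative sketch: gpu_docs is a hypothetical stand-in that lists the company_id of every nested object in the two documents matching "GPU" after the change above.

```python
from collections import Counter

# Document 2 after the change (Nvidia listed twice) and the untouched document 3.
gpu_docs = {
    2: [1, 3, 3],  # company_id of each nested object in document 2
    3: [3],        # company_id of each nested object in document 3
}

nested_counts = Counter()  # corresponds to the terms bucket's doc_count
parent_counts = Counter()  # corresponds to the reverse_nested doc_count

for company_ids in gpu_docs.values():
    nested_counts.update(company_ids)       # every nested object is counted
    parent_counts.update(set(company_ids))  # each parent counts once per company

print(nested_counts[3], parent_counts[3])  # 3 2
```

Deduplicating per parent with set() plays the role that reverse_nested plays in the aggregation: it moves the count from nested objects back to the documents that contain them.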
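What about performance?
While in most cases the performance of nested queries and aggregations should be sufficient, it does come at a certain cost. It is therefore recommended to avoid nested or parent-child types when tuning for search speed.
In Elasticsearch the best performance is often achieved through denormalization, although there is no single recipe and you should choose the data model according to your needs.
Hope this clarifies nested a bit for you!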