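elasticsearch - bucket_selector aggregation and size: optimization
Problem description
I have a question about the bucket_selector aggregation. (Tested environments: ES 6.8 and ES 7 basic on CentOS 7.)
In my use case I need to delete documents that are duplicates ("dupes") on a chosen property. The index is not large, about 2 million records. The query to find those records looks like this: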
GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "sameIds": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "return doc['nestedObjects.id'].value"
            },
            "size": 1000
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "totalCount": "byId._count"
                },
                "script": {
                  "source": "params.totalCount > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}
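I do get buckets back. To keep the query and the load light, I run it with size: 1000 and then issue the next query to get more dupes, repeating until none come back. The problem, however, is that the number of dupes returned is too small. I checked the result of the query by setting size: 2000000: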
GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "sameIds": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "return doc['nestedObjects.id'].value"
            },
            "size": 2000000 <-- too big
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "totalCount": "byId._count"
                },
                "script": {
                  "source": "params.totalCount > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}
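As I understand it, the first step is that the terms aggregation actually creates the buckets as stated in the query (at most "size" of them), and only then does bucket_selector filter those buckets down to what I need. That is why I see this behavior. To get all the buckets I have to raise "search.max_buckets" to 2000000.
Converted to a query using a composite aggregation: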
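The order of operations described above can be illustrated with a toy model. This is plain Python, not Elasticsearch code, and it rests on simplifying assumptions: a single shard, buckets ranked by nested document count (the default terms ordering by doc_count), and made-up sample data. The function name `terms_then_selector` is hypothetical.

```python
def terms_then_selector(buckets, size):
    """Toy model: truncate to the top `size` buckets first, filter afterwards.

    `buckets` is a list of (key, nested_doc_count, parent_doc_count) tuples.
    """
    # Step 1: the terms aggregation keeps only the `size` largest buckets,
    # ranked here by nested doc count, with ties broken by key.
    top = sorted(buckets, key=lambda b: (-b[1], b[0]))[:size]
    # Step 2: the bucket_selector runs on the surviving buckets only
    # (params.totalCount > 1, where totalCount is the parent doc count).
    return [key for key, _nested, parent_count in top if parent_count > 1]


# Made-up data: "a" and "b" repeat many times inside ONE parent document,
# so they rank first but are not duplicates across documents.
sample = [("a", 5, 1), ("b", 4, 1), ("c", 2, 2), ("d", 2, 2), ("e", 1, 1)]

print(terms_then_selector(sample, size=2))  # []
print(terms_then_selector(sample, size=5))  # ['c', 'd']
```

In this model a small size can return few or even zero dupes, because the filter never sees the buckets that were cut off in step 1.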
GET index_id1/_search
{
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "compositeAgg": {
          "composite": {
            "after": {
              "termsAgg": "03f10a7d-0162-4409-8647-c643274d6727"
            },
            "size": 1000,
            "sources": [
              {
                "termsAgg": {
                  "terms": {
                    "script": {
                      "lang": "painless",
                      "source": "return doc['nestedObjects.id'].value"
                    }
                  }
                }
              }
            ]
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "script": {
                  "source": "params.totalCount > 1"
                },
                "buckets_path": {
                  "totalCount": "byId._count"
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
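As I understand it, this does the same thing, except that I need to make 2000 calls (size: 1000 each) to go through the whole index. Does the composite aggregation cache results, or why else would this be better? Is there perhaps a better approach for this case?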
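For reference, the paging loop implied by those 2000 calls could be sketched as follows. This is a minimal sketch under stated assumptions: `search` is a hypothetical callable that sends the request body to the index and returns the parsed JSON response as a dict, the aggregation names match the query above, and the response carries an "after_key" for as long as more pages remain. The function name `iter_duplicate_buckets` is also hypothetical.

```python
import copy


def iter_duplicate_buckets(search, body,
                           nested_name="byNested",
                           composite_name="compositeAgg"):
    """Yield every bucket that survives bucket_selector, page by page.

    `search` is an assumed callable: request body (dict) -> response (dict).
    """
    body = copy.deepcopy(body)  # leave the caller's query untouched
    composite = body["aggs"][nested_name]["aggs"][composite_name]["composite"]
    composite.pop("after", None)  # start from the first page

    while True:
        response = search(body)
        result = response["aggregations"][nested_name][composite_name]
        # A page may be empty after filtering even though more pages follow.
        for bucket in result["buckets"]:
            yield bucket
        after_key = result.get("after_key")
        if after_key is None:
            break  # no after_key: nothing left to page through
        composite["after"] = after_key  # resume after the last key seen
```

Note that the loop stops on a missing after_key rather than on an empty page, since bucket_selector may remove every bucket from a page that is not the last one.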