Bucket_selector aggregation and size: optimization

Question

I have a question about the bucket_selector aggregation. (Environment tested: ES 6.8 and ES 7 basic on CentOS 7)

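In my use case I need to remove documents if a selected property has dupes (duplicate values). The index is not big, about 2 million records. The query to find those records looks like this: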

GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "sameIds": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "return doc['nestedObjects.id'].value"
            },
            "size": 1000
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "totalCount": "byId._count"
                },
                "script": {
                  "source": "params.totalCount > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}

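I get the buckets back. To ease the query and the load, I limit it with size: 1000 and plan to issue the next query to fetch more dupes until none are returned. The problem, however, is that the number of dupes returned is too small. I checked the query result by setting size: 2000000, which is too big for regular use: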

GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "sameIds": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "return doc['nestedObjects.id'].value"
            },
            "size": 2000000
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "totalCount": "byId._count"
                },
                "script": {
                  "source": "params.totalCount > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}

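As I understand it, the first step actually creates the buckets as stated in the query, and only then does bucket_selector filter them down to what I need. That is why I see this behavior. To get all the buckets back I had to raise "search.max_buckets" to 2000000.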
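The order of operations described above can be sketched in plain Python. This is a simplified single-shard model, not Elasticsearch code: the function name `terms_then_bucket_selector` and the sample data are hypothetical, and the bucket ordering (nested doc count descending, then key ascending) is assumed to mirror the default `terms` order. It only illustrates how truncating to `size` before filtering can hide duplicates; the exact cause in a real index may differ.

```python
from collections import Counter


def terms_then_bucket_selector(parent_docs, size):
    """Simplified model: build terms buckets, truncate to `size`, then filter.

    parent_docs: list of lists, one inner list of nested ids per parent document.
    Returns the ids whose parent-document count is greater than 1 among the
    buckets that survived the `size` cut.
    """
    nested_count = Counter()  # bucket doc_count in the nested context
    parent_count = Counter()  # reverse_nested count: parent docs per id
    for nested_ids in parent_docs:
        nested_count.update(nested_ids)
        parent_count.update(set(nested_ids))

    # terms step (assumed order): doc_count descending, key ascending, keep `size`.
    top = sorted(nested_count, key=lambda k: (-nested_count[k], k))[:size]

    # bucket_selector step: runs afterwards, only on the buckets that were kept.
    return [key for key in top if parent_count[key] > 1]


# Hypothetical data: "a" and "b" repeat inside one parent document only,
# while "c" and "d" are real duplicates across two parent documents.
docs = [["a", "a", "a"], ["b", "b"], ["c"], ["c"], ["d"], ["d"]]
small = terms_then_bucket_selector(docs, size=2)
large = terms_then_bucket_selector(docs, size=4)
```

In this model the small `size` keeps only buckets that are later filtered out, so the duplicates outside the window are never reported.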

Converted to a query that uses the composite aggregation:

GET index_id1/_search
{
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "compositeAgg": {
          "composite": {
            "after": {
              "termsAgg": "03f10a7d-0162-4409-8647-c643274d6727"
            },
            "size": 1000,
            "sources": [
              {
                "termsAgg": {
                  "terms": {
                    "script": {
                      "lang": "painless",
                      "source": "return doc['nestedObjects.id'].value"
                    }
                  }
                }
              }
            ]
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "script": {
                  "source": "params.totalCount > 1"
                },
                "buckets_path": {
                  "totalCount": "byId._count"
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

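As I understand it, this does the same thing, except that I need to make 2000 calls (size: 1000 each) to go over the whole index. Does the composite agg cache results, or why is this better? Is there perhaps a better approach for this case?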
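The repeated calls mentioned above could be driven by a small loop. The following is a minimal sketch, assuming each composite response carries an `after_key` field while further pages may exist. `search` is a hypothetical callable that takes a request body as a dict and returns the response as a dict (for example, a thin wrapper around a client's search call); `iter_duplicate_ids` is likewise an illustrative name. The loop stops when `after_key` is absent rather than on an empty page, on the assumption that `bucket_selector` may filter out every bucket of a page.

```python
import copy


def iter_duplicate_ids(search, body, agg_path=("byNested", "compositeAgg")):
    """Yield the `termsAgg` key of every bucket returned, page by page.

    search: hypothetical callable, request body dict -> response dict.
    body:   the composite query shown above, as a dict (it is not modified).
    """
    outer, inner = agg_path
    body = copy.deepcopy(body)
    composite = body["aggs"][outer]["aggs"][inner]["composite"]
    composite.pop("after", None)  # always start from the first page

    while True:
        response = search(body)
        result = response["aggregations"][outer][inner]
        for bucket in result["buckets"]:
            yield bucket["key"]["termsAgg"]

        after_key = result.get("after_key")
        if after_key is None:
            break  # no further pages
        composite["after"] = after_key  # resume after the last key seen
```

Each request only needs the previous page's `after_key`, so the caller never has to hold more than one page of buckets at a time.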

Tags: elasticsearch

Solution

