Dynamic thresholds in PromQL queries (two labels used in the group filter)

Problem description

The following PromQL rules use a single grouping label (instance) and work as expected, producing a dynamic threshold per instance.

    - record: threshold_NodeHighCpuLoad_warning
      expr: 10
      labels:
        instance: host.example.net:9100

    - record: threshold_NodeHighCpuLoad_critical
      expr: 85
      labels:
        instance: host.example.net:9100

    - record: query_NodeHighCpuLoad
      expr: 100 - (avg by(app,job,instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    - alert: NodeHighCpuLoadCritical
      expr:  query_NodeHighCpuLoad > on (instance) group_left() ( threshold_NodeHighCpuLoad_critical or on (instance) query_NodeHighCpuLoad * 0 + 90) or absent (query_NodeHighCpuLoad)*-1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Host high CPU load (instance {{ $labels.instance }})
        description: CPU load\n  VALUE = {{ $value }}

    - alert: NodeHighCpuLoadWarning
      expr:  query_NodeHighCpuLoad > on (instance) group_left() ( threshold_NodeHighCpuLoad_warning or on (instance) query_NodeHighCpuLoad * 0 + 80) or absent (query_NodeHighCpuLoad)*-1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load (instance {{ $labels.instance }})
        description: CPU load\n  VALUE = {{ $value }}

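The following PromQL rules try to use two grouping labels (container, pod) but do not work. I suspect the problem lies in how the labels are matched.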

    - record: threshold_ContainerHighCpuLoad_warning
      expr: 0
      labels:
        container: gitlab

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 1
      labels:
        container: gitlab

    - record: threshold_ContainerHighCpuLoad_warning
      expr: 1
      labels:
        container: prometheus

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 2
      labels:
        container: prometheus

    - record: query_ContainerHighCpuLoad
      expr: (sum by(pod, namespace, job, instance, image, name, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))

    - alert: ContainerHighCpuLoadWarning
      expr:  query_ContainerHighCpuLoad > on (container,pod) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container,pod) query_ContainerHighCpuLoad * 0 + .5) or absent(query_ContainerHighCpuLoad)*-1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load ({{$labels.container}} {{ $labels.namespace }}/{{ $labels.pod }})
        description: CPU load\n  VALUE = {{ $value }}

    - alert: ContainerHighCpuLoadCritical
      expr:  query_ContainerHighCpuLoad > on (container,pod) group_left() ( threshold_ContainerHighCpuLoad_critical or on (container,pod) query_ContainerHighCpuLoad * 0 + 1) or absent(query_ContainerHighCpuLoad)*-1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Host high CPU load ({{$labels.container}} {{ $labels.namespace }}/{{ $labels.pod }})
        description: CPU load\n  VALUE = {{ $value }}

I tried adding a catch-all value for the pod label to the container threshold, as shown below, but that did not work.

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 1
      labels:
        container: gitlab
        pod: ".*"

I suspect the value is compared with "=" rather than "=~", so it does not match.

I found that if I add the following, I get the expected result. However, because pod names are dynamic, I need some kind of regex match.

    - record: threshold_ContainerHighCpuLoad_warning
      expr: 0
      labels:
        container: gitlab
        pod: gitlab-67dd9b7d59-np4js

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 1
      labels:
        container: gitlab
        pod: gitlab-67dd9b7d59-np4js

Does anyone know how to solve this?

Thanks! (Kendall Chenoweth)

Tags: prometheus, promql

Solution


I figured it out. I changed the query to use sum without (a list of labels) instead of sum by.

    - record: query_ContainerHighCpuLoad
      expr: (sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))

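When the time series should not be split by certain label values, as in query_ContainerHighCpuLoad, those labels have to be listed in without(...) so that they are aggregated away.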
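To make the reason concrete, here is an annotated sketch of the catch-all attempt from the question. It relies on two properties of Prometheus: labels set on a recording rule are stored as literal strings, and vector matching with on (...) pairs series only when the listed label values are exactly equal. The pod name in the comments is the one quoted in the question; nothing here is new configuration.

```yaml
# The catch-all attempt from the question, annotated.
- record: threshold_ContainerHighCpuLoad_critical
  expr: 1
  labels:
    container: gitlab
    pod: ".*"   # stored as the literal string ".*", never interpreted as a regex

# Series produced by the rule above:
#   threshold_ContainerHighCpuLoad_critical{container="gitlab", pod=".*"}
# Series on the left-hand side of the comparison:
#   query_ContainerHighCpuLoad{container="gitlab", pod="gitlab-67dd9b7d59-np4js", ...}
#
# on (container, pod) requires both values to be identical, and
# "gitlab-67dd9b7d59-np4js" is not equal to ".*", so the two series never pair.
```

This is why removing pod from the query series, rather than trying to wildcard it on the threshold side, is the workable route.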

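To find out which labels are splitting your time series, first run the expanded form of the alert query (shown next) with every label that is not needed for the join listed in without(...). Then remove labels from the list one at a time to see which ones cause the problem and therefore have to stay in the final without list.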

Start with this query.

    (sum without(id, node, service, name, image, pod, instance, cpu, endpoint, job, metrics_path) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) (sum without(id, node, service, pod, name, image, instance, cpu, endpoint, job, metrics_path) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) * 0 + .4)

This exercise yields the following query.

    (sum without(id, node, service, name, image, pod, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) (sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) * 0 + .4)

Now you can update the definition of query_ContainerHighCpuLoad and simplify the expression.

    query_ContainerHighCpuLoad > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) query_ContainerHighCpuLoad * 0 + .4)
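Put together as a rule file, the pieces might look like the sketch below. It is only an illustration of how the simplified expression could be wired up, not the poster's actual configuration. The group name and the gitlab namespace value are hypothetical. The namespace label on the threshold rule is an assumption: the comparison matches on (container, namespace), so a threshold series carrying only a container label would not pair with any query series. The summary annotation drops the pod name because pod is no longer present after the aggregation.

```yaml
groups:
  - name: container-cpu-thresholds   # hypothetical group name
    rules:
      # CPU usage with pod, instance and similar per-replica labels aggregated away
      - record: query_ContainerHighCpuLoad
        expr: sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))

      # Per-container override. The namespace label is assumed here so that
      # the series can match on (container, namespace) in the alert below.
      - record: threshold_ContainerHighCpuLoad_warning
        expr: 0
        labels:
          container: gitlab
          namespace: gitlab   # hypothetical namespace

      - alert: ContainerHighCpuLoadWarning
        # Containers without an override fall back to the default of .4
        expr: query_ContainerHighCpuLoad > on (container, namespace) group_left() (threshold_ContainerHighCpuLoad_warning or on (container) query_ContainerHighCpuLoad * 0 + .4)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Container high CPU load ({{ $labels.container }} in {{ $labels.namespace }})
          description: "CPU load\n  VALUE = {{ $value }}"
```

A critical variant would follow the same shape, with threshold_ContainerHighCpuLoad_critical and its own default in place of .4.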

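Because some labels (for example, instance) are suppressed by the aggregation but are still useful when troubleshooting, the pods behind an alert can be looked up with the following kubectl command.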

    kubectl get po -n monitoring -o jsonpath='{range .items[*]}{"\n"}{"pod: "}{.metadata.name}/{.metadata.namespace}: {range .spec.containers[*]}{.name}{","}{end}{"\n"}{end}' | grep "prometheus,"

In this command, the namespace (monitoring) and the container name (prometheus) select the relevant pods, producing output such as:

    pod: prometheus-kubeprom-kube-prometheus-s-prometheus-0/monitoring: prometheus,config-reloader,

Because the lookup is by container name, it may return more than one pod.

