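prometheus - Dynamic thresholds in PromQL queries (two labels used in the group filter)
Question
The following PromQL rules use a single group filter (instance) and work as expected to produce a dynamic threshold.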
- record: threshold_NodeHighCpuLoad_warning
  expr: 10
  labels:
    instance: host.example.net:9100
- record: threshold_NodeHighCpuLoad_critical
  expr: 85
  labels:
    instance: host.example.net:9100
- record: query_NodeHighCpuLoad
  expr: 100 - (avg by(app,job,instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- alert: NodeHighCpuLoadCritical
  expr: query_NodeHighCpuLoad > on (instance) group_left() ( threshold_NodeHighCpuLoad_critical or on (instance) query_NodeHighCpuLoad * 0 + 90) or absent (query_NodeHighCpuLoad)*-1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Host high CPU load (instance {{ $labels.instance }})
    description: CPU load\n VALUE = {{ $value }}
- alert: NodeHighCpuLoadWarning
  expr: query_NodeHighCpuLoad > on (instance) group_left() ( threshold_NodeHighCpuLoad_warning or on (instance) query_NodeHighCpuLoad * 0 + 80) or absent (query_NodeHighCpuLoad)*-1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host high CPU load (instance {{ $labels.instance }})
    description: CPU load\n VALUE = {{ $value }}
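For readers unfamiliar with this pattern, the critical alert can be laid out piece by piece. The sketch below is the same expression as above, only reflowed into a YAML block scalar with PromQL `#` comments; the layout and comments are editorial and not part of the original rules.

```yaml
- alert: NodeHighCpuLoadCritical
  expr: |
    # Left-hand side: current CPU usage, one series per instance.
    query_NodeHighCpuLoad
      > on (instance) group_left()
    (
        # Per-instance override, used when a threshold series exists ...
        threshold_NodeHighCpuLoad_critical
      or on (instance)
        # ... otherwise a default of 90 built from the query itself,
        # so every instance has exactly one threshold to compare against.
        query_NodeHighCpuLoad * 0 + 90
    )
    # If the metric is missing altogether, return -1 so the alert still fires.
    or absent(query_NodeHighCpuLoad) * -1
  for: 5m
  labels:
    severity: critical
```

The `or on (instance)` step only adds default series for instances that have no explicit threshold, which is what makes the threshold dynamic.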
The following PromQL rules try to use two group filters (container, pod) but do not work. I suspect the cause is label matching.
- record: threshold_ContainerHighCpuLoad_warning
  expr: 0
  labels:
    container: gitlab
- record: threshold_ContainerHighCpuLoad_critical
  expr: 1
  labels:
    container: gitlab
- record: threshold_ContainerHighCpuLoad_warning
  expr: 1
  labels:
    container: prometheus
- record: threshold_ContainerHighCpuLoad_critical
  expr: 2
  labels:
    container: prometheus
- record: query_ContainerHighCpuLoad
  expr: (sum by(pod, namespace, job, instance, image, name, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))
- alert: ContainerHighCpuLoadWarning
  expr: query_ContainerHighCpuLoad > on (container,pod) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container,pod) query_ContainerHighCpuLoad * 0 + .5) or absent(query_ContainerHighCpuLoad)*-1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host high CPU load ({{$labels.container}} {{ $labels.namespace }}/{{ $labels.pod }})
    description: CPU load\n VALUE = {{ $value }}
- alert: ContainerHighCpuLoadCritical
  expr: query_ContainerHighCpuLoad > on (container,pod) group_left() ( threshold_ContainerHighCpuLoad_critical or on (container,pod) query_ContainerHighCpuLoad * 0 + 1) or absent(query_ContainerHighCpuLoad)*-1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Host high CPU load ({{$labels.container}} {{ $labels.namespace }}/{{ $labels.pod }})
    description: CPU load\n VALUE = {{ $value }}
I tried adding a catch-all pod value to the container's threshold, as shown below, but that does not work.
- record: threshold_ContainerHighCpuLoad_critical
  expr: 1
  labels:
    container: gitlab
    pod: ".*"
I suspect it is evaluated as "=" rather than "=~", so it does not match.
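That suspicion is consistent with how vector matching works: `on (container, pod)` pairs series only when the listed label values are identical, and a value set under `labels:` in a recording rule is stored as a plain string. The annotated fragment below is a sketch of what the catch-all attempt produces; the selector in the final comment is only an illustration of where regular expressions are interpreted.

```yaml
# This rule yields a series whose pod label is the two-character string ".*":
#   threshold_ContainerHighCpuLoad_critical{container="gitlab", pod=".*"}  1
- record: threshold_ContainerHighCpuLoad_critical
  expr: 1
  labels:
    container: gitlab
    pod: ".*"   # literal label value, never treated as a pattern
# It would pair only with a query series that also carries pod=".*".
# Regular expressions are interpreted in label selectors, for example:
#   query_ContainerHighCpuLoad{pod=~"gitlab-.*"}
```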
I found that I get the expected result if I add the following. However, because pod names are dynamic, I need some kind of regular-expression match.
- record: threshold_ContainerHighCpuLoad_warning
  expr: 0
  labels:
    container: gitlab
    pod: gitlab-67dd9b7d59-np4js
- record: threshold_ContainerHighCpuLoad_critical
  expr: 1
  labels:
    container: gitlab
    pod: gitlab-67dd9b7d59-np4js
Does anyone know how to solve this?
Thanks! Kendall Chenoweth
Answer
I figured it out. I changed the query to use sum without (label list).
- record: query_ContainerHighCpuLoad
  expr: (sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))
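When you do not want the time series split by certain label values, as in query_ContainerHighCpuLoad, those labels have to be listed in the without clause so that they are ignored.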
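To make the effect concrete, the annotated rule below sketches what happens to two series that differ only in their pod label. The second pod name, the namespace "default", and the sample values are invented for illustration.

```yaml
# Hypothetical raw series for one container running in two pods:
#   {container="gitlab", namespace="default", pod="gitlab-67dd9b7d59-np4js", ...}  0.30
#   {container="gitlab", namespace="default", pod="gitlab-67dd9b7d59-xxxxx", ...}  0.20
#
# After "sum without (..., pod, ...)" the pod label is dropped and the two
# series are added together into one:
#   {container="gitlab", namespace="default", ...}  0.50
- record: query_ContainerHighCpuLoad
  expr: |
    sum without (id, node, service, pod, name, image, instance) (
      rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])
    )
```

Note that the recorded value is now the total across all pods of a container, so thresholds need to be chosen with that in mind.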
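To find out which labels are splitting your time series, first run the alert PromQL (next section) with the query expanded inline, and put every label that is not needed for the join into the without list. Then remove them one at a time to determine which ones cause problems and have to stay in the final without list.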
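A supplementary check, not part of the original answer, is to count how many series share each join key. The rule name below is hypothetical, and the expression can equally be pasted into the Prometheus expression browser.

```yaml
# Hypothetical helper rule: returns one sample for every (container, namespace)
# pair that still has more than one series in query_ContainerHighCpuLoad.
# An empty result means each join key is unique.
- record: check_ContainerHighCpuLoad_duplicate_keys
  expr: count by (container, namespace) (query_ContainerHighCpuLoad) > 1
```

If this returns anything, at least one remaining label is still splitting the series for that key.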
Start with this query.
(sum without(id, node, service, name, image, pod, instance, cpu, endpoint, job, metrics_path) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) (sum without(id, node, service, pod, name, image, instance, cpu, endpoint, job, metrics_path) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) * 0 + .4)
This exercise yields the following query.
(sum without(id, node, service, name, image, pod, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) (sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) * 0 + .4)
Now you can update the definition of query_ContainerHighCpuLoad and simplify the expression.
query_ContainerHighCpuLoad > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) query_ContainerHighCpuLoad * 0 + .4)
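The simplified expression can be placed back into an alert rule. The sketch below assumes the threshold series carry both a container and a namespace label, because `on (container, namespace)` only pairs series whose values for both labels are identical; the namespace label on the threshold rule is an assumption and does not appear in the original rules. Since pod is aggregated away, the summary references only labels that survive.

```yaml
# Assumption: the threshold is keyed by container AND namespace so that it
# can pair with the left-hand side under "on (container, namespace)".
- record: threshold_ContainerHighCpuLoad_warning
  expr: 1
  labels:
    container: prometheus
    namespace: monitoring

- alert: ContainerHighCpuLoadWarning
  expr: |
    query_ContainerHighCpuLoad
      > on (container, namespace) group_left()
    (
        threshold_ContainerHighCpuLoad_warning
      or on (container)
        query_ContainerHighCpuLoad * 0 + 0.4
    )
  for: 5m
  labels:
    severity: warning
  annotations:
    # pod is no longer present on the aggregated series, so it is not used here.
    summary: Container high CPU load ({{ $labels.container }} in {{ $labels.namespace }})
    description: "CPU load\n VALUE = {{ $value }}"
```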
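Because some labels (for example, instance) are dropped by the aggregation but are still relevant when troubleshooting, they can be recovered with the following kubectl command.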
kubectl get po -n monitoring -o jsonpath='{range .items[*]}{"\n"}{"pod: "}{.metadata.name}/{.metadata.namespace}: {range .spec.containers[*]}{.name}{","}{end}{"\n"}' | grep "prometheus,"
In this command, the namespace (monitoring) and the container name (prometheus) select the output, for example:
pod: prometheus-kubeprom-kube-prometheus-s-prometheus-0/monitoring: prometheus,config-reloader,
Because the lookup is done by container name, it may return more than one instance.