prometheus - Defining a shared Prometheus alert with different alert thresholds per service
Question
I have defined some alerts with expressions like the following:
sum(rate(some_error_metric[1m])) BY (namespace,application) > 10
sum(rate(some_other_error_metric[1m])) BY (namespace,application) > 10
...
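Currently, these alerts fire whenever any of our applications emits these metrics at a rate above 10 per minute.
Instead of hard-coding the threshold of 10, I would like to be able to specify a different threshold for each application.
For example, application_1 should alert at a rate of 10 per minute, application_2 should alert at a rate of 20 per minute, and so on.
Is this possible without duplicating the alerts for each application?
This StackOverflow question, "Dynamic label values in Promethues alerting rules", suggests that recording rules can achieve what I want, but following the pattern suggested in the only answer to that question produces recording rules that Prometheus does not seem to be able to parse: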
- record: application_1_warning_threshold
  expr: warning_threshold{application="application_1"} 10
- record: application_2_warning_threshold
  expr: warning_threshold{application="application_2"} 20
...
Answer
Here is my configuration for a TasksMissing alert with different per-job thresholds:
groups:
  - name: availability.rules
    rules:

      # Expected number of tasks per job and environment.
      - record: job_env:up:count
        expr: count(up) without (instance)

      # Actually up and running tasks per job and environment.
      - record: job_env:up:sum
        expr: sum(up) without (instance)

      # Ratio of up and running to expected tasks per job and environment.
      - record: job_env:up:ratio
        expr: job_env:up:sum / job_env:up:count

      # Global warning and critical availability ratio thresholds.
      - record: job:up:ratio_warning_threshold
        expr: 0.7
      - record: job:up:ratio_critical_threshold
        expr: 0.5

      # Job-specific warning and critical availability ratio thresholds.

      # Always alert if one Prometheus instance is down.
      - record: job:up:ratio_critical_threshold
        labels:
          job: prometheus
        expr: 0.99

      # Never alert for some-batch-job instances down:
      - record: job:up:ratio_warning_threshold
        labels:
          job: some-batch-job
        expr: 0
      - record: job:up:ratio_critical_threshold
        labels:
          job: some-batch-job
        expr: 0

      # TasksMissing is fired when a certain percentage of tasks belonging to a job are down. Namely:
      #
      #     job_env:up:ratio < job:up:ratio_(warning|critical)_threshold
      #
      # with a job-specific warning/critical threshold when defined, or the global default otherwise.

      - alert: TasksMissing
        expr: |
          # Default warning threshold is < 70%
            job_env:up:ratio
          < on(job) group_left()
            (
                job:up:ratio_warning_threshold
              or on(job)
                  count by(job) (job_env:up:ratio) * 0
                + on() group_left()
                  job:up:ratio_warning_threshold{job=""}
            )
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
          description:
            '...'

      - alert: TasksMissing
        expr: |
          # Default critical threshold is < 50%
            job_env:up:ratio
          < on(job) group_left()
            (
                job:up:ratio_critical_threshold
              or on(job)
                  count by(job) (job_env:up:ratio) * 0
                + on() group_left()
                  job:up:ratio_critical_threshold{job=""}
            )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
          description:
            '...'
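The same pattern can be carried over to the error-rate alerts from the question. The following is only a sketch under stated assumptions, not a tested configuration: the group name, the recorded series name application:some_error_metric:rate_warning_threshold, and the alert name SomeErrorRateHigh are made up for illustration, and the override values are the ones mentioned in the question. It also shows why the rules in the question do not parse: `warning_threshold{application="application_1"} 10` is a sample in exposition format rather than a PromQL expression, so the label has to go into the rule's `labels:` field and `expr` holds only the value.

```yaml
groups:
  - name: error_rate.rules            # hypothetical group name
    rules:

      # Global default threshold: recorded without an `application` label.
      - record: application:some_error_metric:rate_warning_threshold
        expr: 10

      # Per-application override (value taken from the question).
      - record: application:some_error_metric:rate_warning_threshold
        labels:
          application: application_2
        expr: 20

      - alert: SomeErrorRateHigh      # hypothetical alert name
        expr: |
          # Use the per-application threshold when one is recorded,
          # otherwise fall back to the unlabelled global default.
            sum by (namespace, application) (rate(some_error_metric[1m]))
          > on(application) group_left()
            (
                application:some_error_metric:rate_warning_threshold
              or on(application)
                  count by (application) (rate(some_error_metric[1m])) * 0
                + on() group_left()
                  application:some_error_metric:rate_warning_threshold{application=""}
            )
        for: 2m
        labels:
          severity: warning
```

The right-hand side of `or` builds one series per application carrying the default value (0 plus the unlabelled threshold), and `or on(application)` keeps it only for applications that have no override of their own. Note that rate() returns a per-second value, so these thresholds are compared against per-second rates unless the expression is scaled (for example multiplied by 60).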