Defining a shared Prometheus alert with different alert thresholds per service

Problem description

I have some alerts defined with expressions like the following:

sum(rate(some_error_metric[1m])) BY (namespace,application) > 10
sum(rate(some_other_error_metric[1m])) BY (namespace,application) > 10
...

Currently, the above alerts fire when any of our applications emits these metrics at a rate above 10 per minute.

Instead of hard-coding the threshold of 10, I would like to be able to specify a different threshold per application.

For example, application_1 should alert at a rate of 10 per minute, application_2 at a rate of 20 per minute, and so on.

Is this possible without duplicating the alert for each application?

The Stack Overflow question "Dynamic label values in Prometheus alerting rules" suggests that recording rules can achieve what I want, but following the pattern proposed in that question's only answer produces recording rules that Prometheus does not seem able to parse:

  - record: application_1_warning_threshold
    expr: warning_threshold{application="application_1"} 10
  - record: application_2_warning_threshold
    expr: warning_threshold{application="application_2"} 20
  ...
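(The syntax error in the snippet above is that a recording rule cannot attach labels inline in the expression; labels on a recording rule belong in a separate `labels:` block, with the constant alone as the expression. A corrected sketch of the same idea, reusing the question's hypothetical `warning_threshold` names, would be:)

```yaml
groups:
- name: thresholds.rules
  rules:
  # Per-application thresholds, recorded as constants with a labels block.
  - record: warning_threshold
    labels:
      application: application_1
    expr: "10"
  - record: warning_threshold
    labels:
      application: application_2
    expr: "20"
```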

Tags: prometheus, prometheus-alertmanager

Solution


Here is my configuration for a TasksMissing alert with different per-job thresholds:

groups:
- name: availability.rules
  rules:

  # Expected number of tasks per job and environment.
  - record: job_env:up:count
    expr: count(up) without (instance)

  # Actually up and running tasks per job and environment.
  - record: job_env:up:sum
    expr: sum(up) without (instance)

  # Ratio of up and running to expected tasks per job and environment.
  - record: job_env:up:ratio
    expr: job_env:up:sum / job_env:up:count

  # Global warning and critical availability ratio thresholds.
  - record: job:up:ratio_warning_threshold
    expr: 0.7
  - record: job:up:ratio_critical_threshold
    expr: 0.5


  # Job-specific warning and critical availability ratio thresholds.

  # Always alert if one Prometheus instance is down.
  - record: job:up:ratio_critical_threshold
    labels:
      job: prometheus
    expr: 0.99

  # Never alert for some-batch-job instances down:
  - record: job:up:ratio_warning_threshold
    labels:
      job: some-batch-job
    expr: 0
  - record: job:up:ratio_critical_threshold
    labels:
      job: some-batch-job
    expr: 0


  # TasksMissing is fired when a certain percentage of tasks belonging to a job are down. Namely:
  #
  #     job_env:up:ratio < job:up:ratio_(warning|critical)_threshold
  #
  # with a job-specific warning/critical threshold when defined, or the global default otherwise.

  - alert: TasksMissing
    expr: |
      # Default warning threshold is < 70%
        job_env:up:ratio
      < on(job) group_left()
        (
            job:up:ratio_warning_threshold
          or on(job)
              count by(job) (job_env:up:ratio) * 0
            + on() group_left()
              job:up:ratio_warning_threshold{job=""}
        )
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
      description:
       '...'

  - alert: TasksMissing
    expr: |
      # Default critical threshold is < 50%
        job_env:up:ratio
      < on(job) group_left()
        (
            job:up:ratio_critical_threshold
          or on(job)
              count by(job) (job_env:up:ratio) * 0
            + on() group_left()
              job:up:ratio_critical_threshold{job=""}
        )
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
      description:
       '...'
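The parenthesized subexpression in each alert is the defaulting idiom doing the real work: `or` prefers the job-specific threshold series when one exists, and otherwise falls back to a default series synthesized for every job. Broken apart, the fallback branch reads roughly as:

```promql
# One 0-valued series per job that currently has a ratio ...
count by(job) (job_env:up:ratio) * 0
# ... plus the single label-less global threshold (the recording rules with
# a bare constant expr carry no job label, so {job=""} matches them),
# yielding one default-threshold series per job:
  + on() group_left()
    job:up:ratio_warning_threshold{job=""}
```

The `< on(job) group_left()` comparison in the alert then matches each `job_env:up:ratio` series against exactly one threshold series per job, whether specific or defaulted.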
