
# Alerting Strategies

On-call alerting strategies for actionable, low-noise alert systems that reduce fatigue and improve response times

## Quick Summary
You are an expert in on-call alerting design for building observable systems.

## Key Points

- **Alert rule**: A condition (typically a PromQL expression) evaluated periodically. When it becomes true for a sustained duration, it fires.
- **Severity levels**: Critical (pages on-call immediately), Warning (reviewed during business hours), Info (logged for awareness).
- **Alertmanager**: Prometheus's alert routing, grouping, inhibition, and silencing engine.
- **Routing tree**: A hierarchy of matchers that directs alerts to the correct notification channel (PagerDuty, Slack, email).
- **Grouping**: Combining related alerts into a single notification to reduce noise during cascading failures.
- **Inhibition**: Suppressing lower-priority alerts when a higher-priority root-cause alert is already firing.
- **Silencing**: Temporarily muting alerts during planned maintenance.
- **Escalation policy**: Rules that escalate an unacknowledged alert to backup responders or management after a timeout.

## Quick Example

```
Level 1: Primary on-call engineer — notified immediately
  Timeout: 5 minutes
Level 2: Secondary on-call engineer — notified after 5 min with no ack
  Timeout: 10 minutes
Level 3: Engineering manager — notified after 15 min total with no ack
```

# Alerting Strategies — Observability

You are an expert in on-call alerting design for building observable systems.

## Overview

Alerting is the bridge between monitoring and human action. A well-designed alerting system wakes on-call engineers only for conditions that genuinely require human intervention, routes alerts to the right responder, and provides enough context to start diagnosis immediately. Poorly designed alerts cause fatigue, erode trust in monitoring, and ultimately lead to real incidents being ignored.

## Core Concepts

- **Alert rule**: A condition (typically a PromQL expression) evaluated periodically. When it becomes true for a sustained duration, it fires.
- **Severity levels**: Critical (pages on-call immediately), Warning (reviewed during business hours), Info (logged for awareness).
- **Alertmanager**: Prometheus's alert routing, grouping, inhibition, and silencing engine.
- **Routing tree**: A hierarchy of matchers that directs alerts to the correct notification channel (PagerDuty, Slack, email).
- **Grouping**: Combining related alerts into a single notification to reduce noise during cascading failures.
- **Inhibition**: Suppressing lower-priority alerts when a higher-priority root-cause alert is already firing.
- **Silencing**: Temporarily muting alerts during planned maintenance.
- **Escalation policy**: Rules that escalate an unacknowledged alert to backup responders or management after a timeout.
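Grouping in particular is easy to reason about with a toy model. The sketch below is a deliberate simplification, not Alertmanager's actual implementation; the alert dictionaries and label names are illustrative:

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of their group_by labels.

    Each bucket becomes one notification, so a cascading failure that
    fires many alerts for one service produces a handful of pages
    instead of dozens.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

burst = [
    {"alertname": "HighErrorRate", "service": "checkout", "instance": "a"},
    {"alertname": "HighErrorRate", "service": "checkout", "instance": "b"},
    {"alertname": "HighP99Latency", "service": "checkout", "instance": "a"},
]
# Three firing alerts collapse into two notifications.
print(len(group_alerts(burst, ["alertname", "service"])))  # → 2
```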

## Implementation Patterns

### Prometheus alerting rules

```yaml
# alerting_rules.yml
groups:
  - name: service-health
    rules:
      # High error rate — pages on-call
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "{{ $labels.service }} error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"
          dashboard_url: "https://grafana.internal/d/svc-overview?var-service={{ $labels.service }}"

      # Elevated latency — warns during business hours
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 2
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "{{ $labels.service }} p99 latency above 2s"
          runbook_url: "https://wiki.internal/runbooks/high-latency"

      # Approaching disk capacity
      - alert: DiskSpaceRunningLow
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "{{ $labels.instance }} disk predicted full within 24h"
```
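Rules like these can be unit-tested with `promtool test rules` before deployment. A sketch of a test file for the HighErrorRate rule above — the file names and synthetic series values are illustrative:

```yaml
# alerting_rules_test.yml — run with: promtool test rules alerting_rules_test.yml
rule_files:
  - alerting_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic data: ~1 error/s vs ~9 successes/s, i.e. a 10% error rate
      - series: 'http_requests_total{service="checkout", status="500"}'
        values: '0+60x20'
      - series: 'http_requests_total{service="checkout", status="200"}'
        values: '0+540x20'
    alert_rule_test:
      # Well past the 5m `for` window, HighErrorRate should be firing
      - eval_time: 12m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: checkout
              severity: critical
              team: platform
```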

### Alertmanager configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: default-slack
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      continue: true
    - match:
        severity: critical
      receiver: slack-critical
    - match:
        severity: warning
      receiver: slack-warnings
      group_wait: 1m
      repeat_interval: 12h

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: "{{ .CommonAnnotations.summary }}"
        details:
          service: "{{ .CommonLabels.service }}"
          runbook: "{{ .CommonAnnotations.runbook_url }}"

  - name: slack-critical
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T.../B.../..."
        channel: "#incidents"
        title: "[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}"
        text: "{{ .CommonAnnotations.description }}\nRunbook: {{ .CommonAnnotations.runbook_url }}"

  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T.../B.../..."
        channel: "#platform-warnings"
        title: "{{ .CommonAnnotations.summary }}"

  - name: default-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T.../B.../..."
        channel: "#monitoring"

inhibit_rules:
  # If the entire cluster is down, suppress individual service alerts
  - source_matchers:
      - alertname = ClusterDown
    target_matchers:
      - severity = warning
    equal: [cluster]
```
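The routing behavior above (two `severity: critical` routes chained with `continue: true`) can be sanity-checked with a toy model. This Python sketch is a simplification of Alertmanager's tree walk, not its real implementation:

```python
def route(labels: dict, node: dict) -> list:
    """Walk a routing tree: the first matching child wins unless it sets
    `continue: true`, in which case later siblings are also tried.
    Falls back to this node's receiver when no child matches."""
    receivers = []
    for child in node.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            receivers.extend(route(labels, child))
            if not child.get("continue", False):
                return receivers
    return receivers or [node["receiver"]]

# The routing tree from the alertmanager.yml above
tree = {
    "receiver": "default-slack",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-oncall", "continue": True},
        {"match": {"severity": "critical"}, "receiver": "slack-critical"},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}
print(route({"severity": "critical"}, tree))  # → ['pagerduty-oncall', 'slack-critical']
print(route({"severity": "info"}, tree))      # → ['default-slack']
```

Note how a critical alert reaches both PagerDuty and Slack only because the first route sets `continue: true`; without it, evaluation would stop at the first match.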

### PagerDuty escalation policy (conceptual)

```
Level 1: Primary on-call engineer — notified immediately
  Timeout: 5 minutes
Level 2: Secondary on-call engineer — notified after 5 min with no ack
  Timeout: 10 minutes
Level 3: Engineering manager — notified after 15 min total with no ack
```
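The timeouts compound: each level is engaged at a cumulative offset from the initial page (5 min, then 5 + 10 = 15 min). A minimal sketch of that arithmetic, with no acknowledgements; the function and parameter names are illustrative:

```python
def engaged_levels(minutes_since_page: float, offsets: tuple = (0, 5, 15)) -> list:
    """Return the 1-based escalation levels engaged by `minutes_since_page`.

    `offsets` are cumulative minutes after the initial page at which each
    level is notified, assuming no one has acknowledged the alert.
    """
    return [i + 1 for i, t in enumerate(offsets) if minutes_since_page >= t]

print(engaged_levels(0))   # → [1]            primary paged immediately
print(engaged_levels(7))   # → [1, 2]         secondary engaged after 5 min
print(engaged_levels(20))  # → [1, 2, 3]      manager engaged after 15 min
```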

### Alert annotation template for context-rich notifications

```yaml
# Shown for the HighErrorRate rule; the alert name is written literally
# because alertname is not available via $labels in rule templates.
annotations:
  summary: "{{ $labels.service }}: HighErrorRate"
  description: |
    **What**: HighErrorRate is firing for {{ $labels.service }}
    **Impact**: Customer-facing requests may be failing or slow
    **Current value**: {{ $value | humanizePercentage }}
    **Threshold**: 5%
    **Duration**: condition sustained for at least the rule's `for` period
  runbook_url: "https://wiki.internal/runbooks/high-error-rate"
  dashboard_url: "https://grafana.internal/d/svc-overview?var-service={{ $labels.service }}"
  logs_url: "https://grafana.internal/explore?query={{ $labels.service }}"
```

## Core Philosophy

An alert is a request for human attention, and human attention is the most expensive resource in your system. Every alert that fires and does not require action is not just noise — it actively degrades your ability to respond to real incidents. Alert fatigue is not a morale problem; it is a reliability problem. When engineers learn to ignore alerts because most of them are false positives, the real alerts get ignored too. The bar for a paging alert should be ruthlessly high: if this fires at 3 AM, will the on-call engineer need to take immediate action to prevent user impact?

Alerting should be symptom-based, not cause-based. Users do not care that a pod restarted, that CPU hit 90%, or that a background job queue is growing. They care that their requests are failing, slow, or not being processed. Symptom-based alerts (high error rate, elevated latency, SLO budget burning) page for conditions that users actually feel. Cause-based observations (pod restart, high CPU, queue depth) belong on dashboards and in warning-level notifications reviewed during business hours. This distinction is what separates an alert system that supports incident response from one that undermines it.

Context is the difference between an alert that accelerates incident response and one that delays it. An alert notification that says "HighErrorRate firing" tells the on-call engineer almost nothing. An alert that says "order-service error rate is 12% (threshold 5%), started 6 minutes ago" with links to the relevant dashboard, log query, and runbook gives the engineer everything they need to start diagnosis in seconds rather than minutes. Every minute spent figuring out what an alert means is a minute of user impact that could have been avoided.

## Anti-Patterns

- **Alerting on every metric threshold.** Creating alerts for CPU > 80%, memory > 70%, disk > 60% on every host produces a constant stream of non-actionable noise. These are dashboard metrics, not alert conditions. Alert on symptoms (error rate, latency) and use resource metrics as diagnostic context during incidents.

- **No `for` duration on alerts.** Firing an alert the instant a threshold is crossed means transient spikes — a single slow garbage collection, a momentary network blip — generate pages. Always use a sustained duration (`for: 5m`) that matches the acceptable detection delay for the alert's severity.

- **Copy-pasted thresholds across services.** Using the same 5% error rate threshold for every service ignores that services have fundamentally different baselines. A payment service at 0.1% error rate is in crisis; a recommendation service at 5% may be normal. Tune thresholds per service based on historical behavior and user impact.

- **Alerts without runbooks.** An alert that fires without a linked runbook forces the on-call engineer to improvise diagnosis and remediation under pressure. Every critical alert must have a `runbook_url` annotation linking to step-by-step instructions that a non-expert can follow.

- **Never reviewing alert quality.** Alerts accumulate over time as engineers add them reactively during incidents but rarely remove or tune them. Without quarterly reviews of alert firing frequency, false-positive rate, and time-to-acknowledge, the alert system degrades until it is actively harmful.
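The `for` duration anti-pattern is easy to demonstrate with a toy evaluator. This sketch is a simplification of Prometheus's pending/firing state machine (one evaluation per minute, hypothetical sample values), showing that a transient spike never fires while a sustained breach does:

```python
def firing_minutes(samples, threshold, for_minutes):
    """Evaluate one sample per minute; return the minutes at which the
    alert is firing, i.e. the condition has held for `for_minutes`
    consecutive evaluations."""
    held, firing = 0, []
    for minute, value in enumerate(samples):
        held = held + 1 if value > threshold else 0
        if held >= for_minutes:
            firing.append(minute)
    return firing

spike = [0.12, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]  # one-minute blip
sustained = [0.12] * 8                                     # real incident
print(firing_minutes(spike, 0.05, 5))      # → []
print(firing_minutes(sustained, 0.05, 5))  # → [4, 5, 6, 7]
```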

## Best Practices

- **Every alert must be actionable.** If no human action is required, it should be a dashboard metric, not an alert. Ask: "If I get paged for this at 3 AM, what will I do?"
- **Every critical alert needs a runbook.** Include the `runbook_url` annotation linking to step-by-step diagnosis and remediation instructions.
- **Alert on symptoms, not causes.** Page on "error rate > 5%" (a symptom users feel), not "pod restarted" (a cause that may self-heal). Cause-based alerts should be warnings at most.
- **Use `for` duration to avoid flapping.** A 5-minute `for` clause ensures transient spikes do not page. Tune this per alert based on acceptable detection delay.
- **Group related alerts.** Use Alertmanager's `group_by` to batch alerts from the same service into one notification during cascading failures.
- **Review alert quality quarterly.** Track false-positive rate, time-to-acknowledge, and total page count. Retire or tune alerts that are regularly ignored.
- **Keep critical alert count low.** If on-call engineers receive more than a handful of critical pages per week, the system is too noisy. Fix root causes or demote to warning.

## Common Pitfalls

- **Alert fatigue.** Too many non-actionable alerts train engineers to ignore all alerts, including real ones. Ruthlessly prune.
- **Missing context in notifications.** An alert that says "HighErrorRate firing" without the service name, current value, or runbook link wastes precious incident response time.
- **Alerting on raw metric values instead of rates.** Counter values only go up; alert on `rate()` or percentage changes.
- **No `for` clause.** Without a sustained duration, single scrape blips cause pages.
- **Copy-pasting thresholds.** Each service has different baseline characteristics. A 1% error rate is catastrophic for a payment service but normal for a best-effort recommendation service. Tune thresholds per service.
- **Not testing alerts.** Use `promtool test rules` to unit-test alerting rules with synthetic input. Deploy rule changes through CI just like application code.

