# Alerting Strategies
On-call alerting strategies for actionable, low-noise alert systems that reduce fatigue and improve response times
You are an expert in on-call alerting design for building observable systems.
## Overview
Alerting is the bridge between monitoring and human action. A well-designed alerting system wakes on-call engineers only for conditions that genuinely require human intervention, routes alerts to the right responder, and provides enough context to start diagnosis immediately. Poorly designed alerts cause fatigue, erode trust in monitoring, and ultimately lead to real incidents being ignored.
## Core Concepts
- **Alert rule**: A condition (typically a PromQL expression) evaluated periodically. When it becomes true for a sustained duration, it fires.
- **Severity levels**: Critical (pages on-call immediately), Warning (reviewed during business hours), Info (logged for awareness).
- **Alertmanager**: Prometheus's alert routing, grouping, inhibition, and silencing engine.
- **Routing tree**: A hierarchy of matchers that directs alerts to the correct notification channel (PagerDuty, Slack, email).
- **Grouping**: Combining related alerts into a single notification to reduce noise during cascading failures.
- **Inhibition**: Suppressing lower-priority alerts when a higher-priority root-cause alert is already firing.
- **Silencing**: Temporarily muting alerts during planned maintenance.
- **Escalation policy**: Rules that escalate an unacknowledged alert to backup responders or management after a timeout.
## Implementation Patterns
### Prometheus alerting rules

```yaml
# alerting_rules.yml
groups:
  - name: service-health
    rules:
      # High error rate — pages on-call
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "{{ $labels.service }} error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"
          dashboard_url: "https://grafana.internal/d/svc-overview?var-service={{ $labels.service }}"

      # Elevated latency — warns during business hours
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 2
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "{{ $labels.service }} p99 latency above 2s"
          runbook_url: "https://wiki.internal/runbooks/high-latency"

      # Approaching disk capacity
      - alert: DiskSpaceRunningLow
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "{{ $labels.instance }} disk predicted full within 24h"
```
### Alertmanager configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: default-slack
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      continue: true
    - match:
        severity: critical
      receiver: slack-critical
    - match:
        severity: warning
      receiver: slack-warnings
      group_wait: 1m
      repeat_interval: 12h

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: "{{ .CommonAnnotations.summary }}"
        details:
          service: "{{ .CommonLabels.service }}"
          runbook: "{{ .CommonAnnotations.runbook_url }}"
  - name: slack-critical
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T.../B.../..."
        channel: "#incidents"
        title: "[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}"
        text: "{{ .CommonAnnotations.description }}\nRunbook: {{ .CommonAnnotations.runbook_url }}"
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T.../B.../..."
        channel: "#platform-warnings"
        title: "{{ .CommonAnnotations.summary }}"
  - name: default-slack
    slack_configs:
      # api_url is required per receiver unless slack_api_url is set in global
      - api_url: "https://hooks.slack.com/services/T.../B.../..."
        channel: "#monitoring"

inhibit_rules:
  # If the entire cluster is down, suppress individual service alerts
  - source_matchers:
      - alertname = ClusterDown
    target_matchers:
      - severity = warning
    equal: [cluster]
```
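Recurring maintenance silences can also be declared in the configuration itself rather than created by hand each time. A minimal sketch using Alertmanager's `time_intervals` and `mute_time_intervals` (available in v0.24+); the interval name, matcher, and schedule here are assumptions, not part of the config above:

```yaml
# Mute infrastructure warnings during a weekly maintenance window.
time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: ["saturday"]
        times:
          - start_time: "02:00"
            end_time: "04:00"

route:
  routes:
    - matchers:
        - team = infrastructure
      receiver: slack-warnings
      mute_time_intervals: [weekly-maintenance]
```

For one-off maintenance, an ad-hoc silence (via the Alertmanager UI or `amtool silence add`) remains the better fit; recurring windows belong in config so they survive restarts and are reviewable in version control.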
### PagerDuty escalation policy (conceptual)

```
Level 1: Primary on-call engineer — notified immediately
         Timeout: 5 minutes
Level 2: Secondary on-call engineer — notified after 5 min with no ack
         Timeout: 10 minutes
Level 3: Engineering manager — notified after 15 min total with no ack
```
### Alert annotation template for context-rich notifications

```yaml
annotations:
  # Note: alertname is not part of $labels during rule evaluation, and there is
  # no template variable for the `for` duration, so both are written literally.
  summary: "{{ $labels.service }}: HighErrorRate"
  description: |
    **What**: HighErrorRate is firing for {{ $labels.service }}
    **Impact**: Customer-facing requests may be failing or slow
    **Current value**: {{ $value | humanizePercentage }}
    **Threshold**: 5%
    **Duration**: Condition has been true for at least 5m (the rule's `for` period)
  runbook_url: "https://wiki.internal/runbooks/high-error-rate"
  dashboard_url: "https://grafana.internal/d/svc-overview?var-service={{ $labels.service }}"
  logs_url: "https://grafana.internal/explore?query={{ $labels.service }}"
```
## Core Philosophy
An alert is a request for human attention, and human attention is the most expensive resource in your system. Every alert that fires and does not require action is not just noise — it actively degrades your ability to respond to real incidents. Alert fatigue is not a morale problem; it is a reliability problem. When engineers learn to ignore alerts because most of them are false positives, the real alerts get ignored too. The bar for a paging alert should be ruthlessly high: if this fires at 3 AM, will the on-call engineer need to take immediate action to prevent user impact?
Alerting should be symptom-based, not cause-based. Users do not care that a pod restarted, that CPU hit 90%, or that a background job queue is growing. They care that their requests are failing, slow, or not being processed. Symptom-based alerts (high error rate, elevated latency, SLO budget burning) page for conditions that users actually feel. Cause-based observations (pod restart, high CPU, queue depth) belong on dashboards and in warning-level notifications reviewed during business hours. This distinction is what separates an alert system that supports incident response from one that undermines it.
Context is the difference between an alert that accelerates incident response and one that delays it. An alert notification that says "HighErrorRate firing" tells the on-call engineer almost nothing. An alert that says "order-service error rate is 12% (threshold 5%), started 6 minutes ago" with links to the relevant dashboard, log query, and runbook gives the engineer everything they need to start diagnosis in seconds rather than minutes. Every minute spent figuring out what an alert means is a minute of user impact that could have been avoided.
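The symptom-based, SLO-budget framing can be made concrete with a multi-window burn-rate rule. A sketch assuming a 99.9% availability SLO (0.1% error budget), where a 14.4x burn rate pages because it would exhaust a 30-day budget in roughly two days; metric names follow the earlier examples:

```yaml
# Page when the error budget is burning 14.4x faster than sustainable,
# measured over both a short and a long window.
- alert: ErrorBudgetBurnFast
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
    and
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning 14.4x faster than sustainable"
```

The short 5-minute window lets the alert resolve quickly once the burn stops, while the 1-hour window keeps a brief spike from paging at all.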
## Anti-Patterns

- **Alerting on every metric threshold.** Creating alerts for CPU > 80%, memory > 70%, disk > 60% on every host produces a constant stream of non-actionable noise. These are dashboard metrics, not alert conditions. Alert on symptoms (error rate, latency) and use resource metrics as diagnostic context during incidents.
- **No `for` duration on alerts.** Firing an alert the instant a threshold is crossed means transient spikes — a single slow garbage collection, a momentary network blip — generate pages. Always use a sustained duration (`for: 5m`) that matches the acceptable detection delay for the alert's severity.
- **Copy-pasted thresholds across services.** Using the same 5% error rate threshold for every service ignores that services have fundamentally different baselines. A payment service at 0.1% error rate is in crisis; a recommendation service at 5% may be normal. Tune thresholds per service based on historical behavior and user impact.
- **Alerts without runbooks.** An alert that fires without a linked runbook forces the on-call engineer to improvise diagnosis and remediation under pressure. Every critical alert must have a `runbook_url` annotation linking to step-by-step instructions that a non-expert can follow.
- **Never reviewing alert quality.** Alerts accumulate over time as engineers add them reactively during incidents but rarely remove or tune them. Without quarterly reviews of alert firing frequency, false-positive rate, and time-to-acknowledge, the alert system degrades until it is actively harmful.
## Best Practices

- **Every alert must be actionable.** If no human action is required, it should be a dashboard metric, not an alert. Ask: "If I get paged for this at 3 AM, what will I do?"
- **Every critical alert needs a runbook.** Include the `runbook_url` annotation linking to step-by-step diagnosis and remediation instructions.
- **Alert on symptoms, not causes.** Page on "error rate > 5%" (a symptom users feel), not "pod restarted" (a cause that may self-heal). Cause-based alerts should be warnings at most.
- **Use a `for` duration to avoid flapping.** A 5-minute `for` clause ensures transient spikes do not page. Tune this per alert based on acceptable detection delay.
- **Group related alerts.** Use Alertmanager's `group_by` to batch alerts from the same service into one notification during cascading failures.
- **Review alert quality quarterly.** Track false-positive rate, time-to-acknowledge, and total page count. Retire or tune alerts that are regularly ignored.
- **Keep the critical alert count low.** If on-call engineers receive more than a handful of critical pages per week, the system is too noisy. Fix root causes or demote to warning.
## Common Pitfalls

- **Alert fatigue.** Too many non-actionable alerts train engineers to ignore all alerts, including real ones. Ruthlessly prune.
- **Missing context in notifications.** An alert that says "HighErrorRate firing" without the service name, current value, or runbook link wastes precious incident response time.
- **Alerting on raw metric values instead of rates.** Counter values only go up; alert on `rate()` or percentage changes.
- **No `for` clause.** Without a sustained duration, single scrape blips cause pages.
- **Copy-pasting thresholds.** Each service has different baseline characteristics. A 1% error rate is catastrophic for a payment service but normal for a best-effort recommendation service. Tune thresholds per service.
- **Not testing alerts.** Use `promtool test rules` to unit-test alerting rules with synthetic input. Deploy rule changes through CI just like application code.
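The `promtool test rules` practice above can be sketched as a unit test for the HighErrorRate rule defined earlier; the test file name, service name, and synthetic series are assumptions:

```yaml
# tests.yml — run with: promtool test rules tests.yml
rule_files:
  - alerting_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # ~16.7% error rate: 10 errors and 50 successes per minute
      - series: 'http_requests_total{service="checkout", status="500"}'
        values: "0+10x20"
      - series: 'http_requests_total{service="checkout", status="200"}'
        values: "0+50x20"
    alert_rule_test:
      # The rule needs 5m of breach (`for: 5m`); by 10m it should be firing.
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: checkout
              severity: critical
              team: platform
            exp_annotations:
              summary: "checkout error rate above 5%"
              description: "Error rate is 16.67% over the last 5 minutes."
              runbook_url: "https://wiki.internal/runbooks/high-error-rate"
              dashboard_url: "https://grafana.internal/d/svc-overview?var-service=checkout"
```

Running this in CI turns alert rules into tested code: a change that silently stops an alert from firing fails the build instead of failing on-call.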
## Related Skills

- **Distributed Tracing**: OpenTelemetry distributed tracing patterns for end-to-end request visibility across microservices
- **Health Checks**: Health check endpoint patterns for liveness, readiness, and startup probes in distributed services
- **Incident Response**: Incident response and postmortem patterns for structured handling, communication, and learning from production incidents
- **Log Aggregation**: Centralized log aggregation patterns for collecting, indexing, and querying logs across distributed systems
- **Metrics Collection**: Prometheus and Grafana metrics collection patterns for monitoring application and infrastructure health
- **SLI/SLO**: SLI, SLO, and error budget patterns for defining and managing service reliability targets