# Incident Response and Postmortems

Incident response and postmortem patterns for structured handling, communication, and learning from production incidents.
You are an expert in incident response and postmortems for building observable systems.
## Overview
Incident response is the process of detecting, triaging, mitigating, and resolving production incidents in a structured way. Postmortems (also called retrospectives or incident reviews) are the practice of analyzing what happened after the incident is resolved, extracting lessons, and driving systemic improvements. Together, they form a continuous improvement loop: incidents reveal weaknesses, postmortems identify fixes, and those fixes reduce future incidents.
## Core Concepts

- **Incident**: An unplanned event that causes or risks causing service degradation or outage that impacts users.
- **Severity levels**: Typically SEV1 (critical, widespread user impact), SEV2 (significant, partial impact), SEV3 (minor, limited impact), SEV4 (informational).
- **Incident Commander (IC)**: The person who coordinates the response — delegates tasks, tracks progress, manages communications. Not necessarily the most senior engineer.
- **Communication Lead**: Manages external and internal status updates (status page, stakeholder messages, customer support).
- **Subject Matter Expert (SME)**: Engineers with deep knowledge of the affected system who perform diagnosis and remediation.
- **Mitigation vs. Root Cause Fix**: Mitigation restores service (rollback, feature flag, scaling). Root cause fix addresses the underlying defect. Always mitigate first.
- **Blameless postmortem**: A review focused on systemic failures (process, tooling, architecture) rather than individual fault. The goal is learning, not punishment.
## Implementation Patterns

### Incident severity matrix

```yaml
# incident-severity.yaml
severities:
  SEV1:
    description: "Complete service outage or data loss affecting all users"
    response_time: "5 minutes"
    communication: "Status page updated every 15 min, exec stakeholders notified"
    who_is_paged: "Primary on-call, secondary on-call, engineering manager"
    examples:
      - "Payment processing is completely down"
      - "All API requests returning 500"
      - "Data corruption detected in production database"
  SEV2:
    description: "Major feature degraded, significant subset of users impacted"
    response_time: "15 minutes"
    communication: "Status page updated every 30 min"
    who_is_paged: "Primary on-call"
    examples:
      - "Search is returning stale results"
      - "Elevated error rate (>5%) on checkout flow"
      - "Mobile app cannot load user profiles"
  SEV3:
    description: "Minor feature degraded, small subset of users impacted"
    response_time: "Business hours"
    communication: "Internal status update"
    who_is_paged: "None (ticket created)"
    examples:
      - "Email notifications delayed by 10 minutes"
      - "Admin dashboard slow to load"
  SEV4:
    description: "Cosmetic issue or internal tooling degradation"
    response_time: "Next sprint"
    communication: "None"
    who_is_paged: "None (backlog item)"
```
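A severity matrix like this is most useful when incident tooling reads it at declaration time to decide who gets paged. A minimal Python sketch of that lookup; the dict mirrors the YAML above (inlined so the sketch stays self-contained), and the target names like `primary-oncall` are illustrative, not a standard schema:

```python
# Severity-to-paging lookup mirroring incident-severity.yaml above.
# In practice you would load the YAML file; the dict is inlined here.
SEVERITIES = {
    "SEV1": {"response_time": "5 minutes",
             "pages": ["primary-oncall", "secondary-oncall", "eng-manager"]},
    "SEV2": {"response_time": "15 minutes", "pages": ["primary-oncall"]},
    "SEV3": {"response_time": "business hours", "pages": []},  # ticket only
    "SEV4": {"response_time": "next sprint", "pages": []},     # backlog item
}

def who_to_page(severity: str) -> list[str]:
    """Return the paging targets for a declared severity level."""
    try:
        return SEVERITIES[severity]["pages"]
    except KeyError:
        # Unknown severity: err on the side of paging someone.
        return ["primary-oncall"]
```

Failing closed on an unknown severity (page the primary rather than nobody) follows the "declare early" principle: a spurious page is cheaper than a silent outage.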
### Incident response runbook template

```markdown
## Incident Response Runbook: [Service/Component Name]

### Detection
- What alerts fire? List alert names and links.
- What dashboards to check first? Include URLs.

### Triage (first 5 minutes)
1. Check the service dashboard: [link]
2. Check error logs: `{service="<name>", level="error"}`
3. Check recent deployments: `kubectl rollout history deployment/<name>`
4. Check dependency health: [dependency dashboard link]

### Common scenarios and mitigations
| Symptom | Likely cause | Mitigation |
|---------|-------------|------------|
| 5xx spike after deploy | Bad code release | `kubectl rollout undo deployment/<name>` |
| Latency spike, no deploy | Database saturation | Check slow query log, kill long-running queries |
| Connection timeouts | Dependency outage | Enable circuit breaker / fallback |
| OOM kills | Memory leak | Restart pods, increase memory limit temporarily |

### Escalation
- If not mitigated in 30 minutes, page secondary on-call
- If data integrity is at risk, page database team
- If customer-facing SLA is breached, notify VP Engineering

### Communication templates
**Status page (investigating)**:
"We are investigating elevated error rates on [service]. Some users may experience [symptom]. We will provide updates every [15/30] minutes."

**Status page (mitigated)**:
"The issue with [service] has been mitigated. [Brief description of fix]. We are monitoring for recurrence."

**Status page (resolved)**:
"The incident affecting [service] has been fully resolved. A postmortem will be published within 5 business days."
```
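The runbook's scenario table can also be encoded as data so that on-call tooling surfaces the suggested first mitigation next to the firing alert. A sketch under that assumption; the symptom keys and the fallback escalation message are illustrative, not a standard schema:

```python
# The runbook's "common scenarios" table as data. Keys are illustrative
# symptom identifiers; values are (likely cause, suggested mitigation).
MITIGATIONS = {
    "5xx_spike_after_deploy": (
        "Bad code release",
        "kubectl rollout undo deployment/<name>",
    ),
    "latency_spike_no_deploy": (
        "Database saturation",
        "Check slow query log, kill long-running queries",
    ),
    "connection_timeouts": (
        "Dependency outage",
        "Enable circuit breaker / fallback",
    ),
    "oom_kills": (
        "Memory leak",
        "Restart pods, increase memory limit temporarily",
    ),
}

def suggest_mitigation(symptom: str) -> str:
    """Map a known symptom to its mitigate-first action."""
    cause, action = MITIGATIONS.get(
        symptom, ("Unknown", "Escalate to secondary on-call"))
    return f"Likely cause: {cause}. Try: {action}"
```

Keeping the table as data rather than prose means the runbook and the tooling cannot drift apart, and the "mitigate first" default is always one lookup away.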
### Incident timeline tracking

```markdown
## Incident Timeline: 2026-03-15 Payment Processing Outage

| Time (UTC) | Event |
|------------|--------------------------------------------------------------|
| 14:02 | Deploy v2.14.0 of payment-service completed |
| 14:05 | HighErrorRate alert fires for payment-service (SEV1) |
| 14:06 | On-call engineer Alice acknowledges page |
| 14:08 | IC role assumed by Alice, #incident-20260315 channel created |
| 14:10 | Error logs show NPE in new payment validation code path |
| 14:12 | Decision: rollback to v2.13.9 |
| 14:14 | Rollback initiated: `kubectl rollout undo` |
| 14:17 | New pods healthy, error rate dropping |
| 14:22 | Error rate back to baseline, mitigation confirmed |
| 14:25 | Status page updated: "Mitigated" |
| 14:45 | Root cause confirmed: null check missing for new field |
| 15:00 | Incident resolved, status page updated |

**Key metrics:**
- TTD: 3 minutes (deploy to detection)
- TTM: 15 minutes (detection to mitigation)
- Total: 20 minutes user impact
```
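With timestamps recorded in the timeline, the headline metrics fall out of simple subtraction rather than estimation. A minimal sketch using the timestamps from the example above (same-day HH:MM values; a real tool would use full timezone-aware timestamps):

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Timestamps from the timeline above.
deploy_done   = "14:02"   # v2.14.0 deploy completed
alert_fired   = "14:05"   # HighErrorRate alert (detection)
status_update = "14:25"   # status page marked "Mitigated"

ttd = minutes_between(deploy_done, alert_fired)        # time to detect: 3
impact = minutes_between(alert_fired, status_update)   # user-visible impact: 20
```

Computing these from the timeline, rather than asking responders to recall them afterwards, keeps the postmortem's metadata consistent with the record.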
### Postmortem template

```markdown
## Postmortem: Payment Processing Outage — 2026-03-15

### Metadata
- **Date**: 2026-03-15
- **Duration**: 20 minutes (14:05 - 14:25 UTC)
- **Severity**: SEV1
- **Incident Commander**: Alice
- **Author**: Alice, reviewed by Bob
- **Status**: Action items in progress

### Summary
A deploy of payment-service v2.14.0 introduced a null pointer exception
in the payment validation code path, causing 100% of payment requests to
fail for 20 minutes. The issue was mitigated by rolling back to v2.13.9.

### Impact
- 100% of payment attempts failed for 20 minutes
- Approximately 1,200 failed transactions
- Estimated revenue impact: $45,000 in delayed orders (most retried successfully)
- No data loss or corruption

### Root Cause
PR #1847 added a new `billing_region` field to the payment request schema.
The validation function accessed `request.billing_region.code` without a
null check. The field was optional but the code assumed it was always present.
The existing unit tests only covered the case where the field was populated.

### Timeline
[See incident timeline above]

### What went well
- Alert fired within 3 minutes of the deploy completing
- On-call responded and acknowledged within 1 minute
- Rollback was executed quickly (under 5 minutes from decision to effect)
- Clear deployment history made identifying the suspect change easy

### What went poorly
- No integration test covering the null case for the new field
- The deploy happened on a Friday afternoon (higher risk, lower staffing)
- Status page was not updated until 20 minutes after the alert fired

### Action Items
| Action | Owner | Priority | Ticket | Status |
|--------|-------|----------|--------|--------|
| Add null-safety tests for all optional payment fields | Bob | P1 | ENG-4521 | In Progress |
| Add integration test for payment validation edge cases | Carol | P1 | ENG-4522 | TODO |
| Implement deploy freeze policy for Fridays after 3 PM | Alice | P2 | ENG-4523 | TODO |
| Automate status page update on SEV1 alert | Dave | P2 | ENG-4524 | TODO |
| Add canary deployment stage for payment-service | Bob | P2 | ENG-4525 | TODO |

### Lessons Learned
1. Optional fields in API schemas need defensive coding and null-safety tests.
2. High-risk services (payment) need canary deployments, not immediate full rollouts.
3. Status page updates should be semi-automated for SEV1 incidents.
```
### Incident metrics to track

```
# Mean Time to Detect (MTTD)
# Track as a custom metric emitted by your incident management tool

# Mean Time to Mitigate (MTTM)
# Time from detection to mitigation

# Incident frequency by severity
count(incidents) by (severity, month)

# Postmortem action item completion rate
completed_action_items / total_action_items * 100
```
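If your incident tool can export incident and action-item records, the last two metrics reduce to a counter and a ratio. A sketch under that assumption; the record fields (`severity`, `month`, `done`) and the sample data are hypothetical, standing in for whatever your tool exports:

```python
from collections import Counter

# Hypothetical exports from an incident management tool.
incidents = [
    {"id": "INC-101", "severity": "SEV1", "month": "2026-03"},
    {"id": "INC-102", "severity": "SEV2", "month": "2026-03"},
    {"id": "INC-103", "severity": "SEV2", "month": "2026-04"},
]
action_items = [
    {"ticket": "ENG-4521", "done": True},
    {"ticket": "ENG-4522", "done": False},
    {"ticket": "ENG-4523", "done": False},
    {"ticket": "ENG-4524", "done": True},
]

# Incident frequency by (severity, month) — the count(...) by (...) above.
frequency = Counter((i["severity"], i["month"]) for i in incidents)

# Postmortem action item completion rate, in percent.
completion_rate = 100 * sum(a["done"] for a in action_items) / len(action_items)
```

Trending these per month is what makes them leading indicators: a falling completion rate predicts repeat incidents before they happen.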
## Core Philosophy
The purpose of incident response is to restore service, not to find the root cause. This distinction matters because the instinct during an incident is to diagnose — to understand why something broke before taking action. But every minute spent diagnosing is a minute of user impact. The fastest path to recovery is almost always a known-safe mitigation: rollback, feature flag toggle, scaling up, or failing over. Root cause analysis is essential, but it belongs in the postmortem, not in the incident. Train your team to reach for the rollback before reaching for the debugger.
Blameless postmortems are not about being nice — they are about being effective. When engineers fear punishment for causing incidents, they hide information, delay declaring incidents, and avoid touching risky systems that need improvement. A blame culture produces cover-ups, not reliability. The blameless approach asks "what systemic conditions allowed a single human action to cause this outcome?" because the answer always reveals a fixable system problem: missing guardrails, inadequate testing, unclear documentation, or insufficient observability. Humans will always make errors; systems must be resilient to them.
Incident response is a skill that atrophies without practice. Organizations that only exercise their incident process during real incidents discover broken runbooks, unfamiliar tools, and unclear role assignments at the worst possible time. Game days, tabletop exercises, and chaos engineering create the muscle memory that makes real incidents feel routine rather than panicked. The goal is not to prevent all incidents — that is impossible — but to make the response so practiced and systematic that incidents are resolved quickly and calmly, with lessons captured and acted upon.
## Anti-Patterns

- **Diagnosing before mitigating.** Spending twenty minutes finding the root cause while users are impacted, when a two-minute rollback would have restored service immediately. Always mitigate first (rollback, toggle, scale) and investigate later in the postmortem.
- **Diffuse responsibility.** "Someone should update the status page" means nobody will. Without explicitly assigned roles (Incident Commander, Communication Lead, Subject Matter Expert) at the start of every incident, critical tasks fall through the cracks and coordination breaks down.
- **Postmortems without follow-through.** Writing a thorough postmortem with five action items and then never completing them is worse than not writing the postmortem at all: it creates a false sense that the problem has been addressed. Track action items with owners, deadlines, and ticket numbers, and review completion weekly.
- **Hero culture.** Relying on one expert to diagnose and fix every incident creates a single point of failure and prevents the team from building incident response skills. Rotate the IC role, write runbooks that non-experts can follow, and cross-train team members on critical systems.
- **Skipping postmortems for "small" incidents.** Dismissing a five-minute outage as too minor for a postmortem misses the systemic insight it might reveal. Small incidents often share root causes with the large incidents that follow. Write at least a brief postmortem for every SEV1 and SEV2.
## Best Practices

- **Mitigate first, diagnose later.** The priority during an incident is restoring service. Roll back, toggle feature flags, scale up, or fail over. Root cause analysis happens in the postmortem.
- **Declare incidents early.** It is better to declare an incident that turns out to be minor than to delay coordination on a real outage. Lower the threshold for declaring.
- **Use a dedicated incident channel.** Create a Slack/Teams channel per incident for real-time coordination. Keep the channel focused; social chatter goes elsewhere.
- **Assign roles explicitly.** "Someone should update the status page" means nobody will. Assign IC, Communication Lead, and SME roles by name at the start.
- **Write blameless postmortems.** Focus on "what systemic conditions allowed this to happen," not "who made the mistake." People make errors; systems should be resilient to them.
- **Track action items to completion.** A postmortem without follow-through is wasted effort. Review open action items in weekly team meetings.
- **Measure incident response metrics.** Track MTTD, MTTM, incident frequency, and action item completion rate over time. These are leading indicators of operational maturity.
## Common Pitfalls

- **Skipping the postmortem.** "It was just a quick fix, no need for a postmortem." Small incidents often reveal systemic issues. Write at least a brief postmortem for every SEV1 and SEV2.
- **Blame culture.** If engineers fear punishment for causing incidents, they will hide information, delay declaring incidents, and avoid risky but valuable improvements. Blame erodes safety.
- **Action items without owners or deadlines.** "We should add more tests" is not an action item. "Bob will add null-safety tests for payment validation by March 22 (ENG-4521)" is.
- **Hero culture.** Relying on one expert to fix every incident is a single point of failure. Document runbooks, rotate the IC role, and cross-train team members.
- **Not practicing.** Run tabletop exercises or game days where the team simulates an incident. This builds muscle memory for roles, tools, and communication patterns before a real incident happens.
- **Ignoring repeat incidents.** If the same root cause appears in multiple postmortems, the action items from previous postmortems were either inadequate or never completed. Escalate systemic recurring issues.
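Catching repeat incidents is mechanical if postmortems carry a root-cause tag. A minimal sketch, assuming a hypothetical index mapping incident IDs to root-cause tags (both the tags and the sample data are illustrative):

```python
from collections import defaultdict

# Hypothetical postmortem index: incident id -> root-cause tag.
postmortems = {
    "INC-090": "missing-null-check",
    "INC-101": "missing-null-check",
    "INC-102": "db-connection-pool-exhaustion",
}

def recurring_root_causes(pms: dict[str, str]) -> dict[str, list[str]]:
    """Root causes seen in more than one postmortem, with their incidents."""
    by_cause = defaultdict(list)
    for incident, cause in pms.items():
        by_cause[cause].append(incident)
    return {cause: ids for cause, ids in by_cause.items() if len(ids) > 1}
```

Any root cause this flags is a signal that earlier action items were inadequate or never completed, and should be escalated rather than re-filed.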
## Related Skills

- **Alerting Strategies**: On-call alerting strategies for actionable, low-noise alert systems that reduce fatigue and improve response times
- **Distributed Tracing**: OpenTelemetry distributed tracing patterns for end-to-end request visibility across microservices
- **Health Checks**: Health check endpoint patterns for liveness, readiness, and startup probes in distributed services
- **Log Aggregation**: Centralized log aggregation patterns for collecting, indexing, and querying logs across distributed systems
- **Metrics Collection**: Prometheus and Grafana metrics collection patterns for monitoring application and infrastructure health
- **SLI / SLO**: SLI, SLO, and error budget patterns for defining and managing service reliability targets