Incident Response Runbooks

Write runbooks the on-call engineer at 03:00 AM can actually follow.

The runbook is the artifact the on-call engineer reads at 03:00 AM when an alert fires for a system they don't fully remember. Write it for that moment. Not for the engineer who designed the system; for the engineer who has been awake for fourteen minutes, has cortisol in their bloodstream, and needs to know what to do next.

The Two Kinds of Runbook

Distinguish procedural runbooks from diagnostic runbooks; they are different artifacts and they should not be combined.

A procedural runbook is for known issues with known fixes. The alert fires; the runbook describes the steps; the engineer executes them; the issue is resolved. Procedural runbooks are short, ordered, and unambiguous. "If you see X, do Y." If a procedural runbook starts asking the engineer to investigate, it has become a diagnostic runbook in disguise and should be split.

A diagnostic runbook is for an investigation. The alert fires; the engineer doesn't know what's happening; the runbook helps them figure it out. Diagnostic runbooks are decision trees. "Check this dashboard. If you see condition A, go to section 3. If you see condition B, go to section 4." Diagnostic runbooks are longer and branching.

Most teams write hybrid runbooks. Hybrid runbooks fail because the engineer can't tell whether they should be following a procedure or investigating. Decide which kind of runbook this is, write the appropriate kind, and link to other runbooks for the cases the current one doesn't handle.

The Header

Open with the metadata the on-call engineer needs in five seconds:

  • Trigger — what alert or symptom this runbook responds to. Match the alert name exactly so the engineer can grep.
  • Severity — your team's scale; the engineer knows whether to escalate immediately.
  • Customer impact — what happens to users when this fires. Helps the engineer prioritize.
  • Owner team — who maintains this runbook; who to ping if it's wrong.
  • Last verified — date when someone last walked through the runbook and confirmed it works. Runbooks rot; the date tells the engineer how much to trust it.

The header is the equivalent of a function signature. Five seconds to know whether you're in the right place.
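One way to keep the header honest is to treat it as structured data and refuse to publish a runbook with an empty field. The sketch below is only an illustration of that idea: the class name, field names, alert name, severity label, and date are all assumptions, not part of any particular tool.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class RunbookHeader:
    """The five header fields listed above; the field names are illustrative."""
    trigger: str          # exact alert name, so the engineer can grep for it
    severity: str         # on your team's own scale
    customer_impact: str  # what users experience when this fires
    owner_team: str       # who maintains the runbook
    last_verified: date   # when someone last walked through it


def missing_fields(header: RunbookHeader) -> list[str]:
    """Return the names of header fields that were left empty."""
    return [f.name for f in fields(header) if not getattr(header, f.name)]


def render(header: RunbookHeader) -> str:
    """Render the header as the short block that opens the runbook."""
    return "\n".join(
        f"{f.name.replace('_', ' ').capitalize()}: {getattr(header, f.name)}"
        for f in fields(header)
    )


header = RunbookHeader(
    trigger="PaymentSvc5xxRateHigh",   # hypothetical alert name
    severity="SEV2",                   # hypothetical severity label
    customer_impact="Checkout requests fail for some users",
    owner_team="payment-team",
    last_verified=date(2024, 1, 15),   # hypothetical date
)
print(render(header))
```

A check like `missing_fields` could run in review or CI so that a runbook without an owner or a verification date never reaches the on-call engineer.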

The Procedural Body

For procedural runbooks, the body is a numbered list of steps. Each step is a single action with a single observable result.

1. Check the payment-svc dashboard at https://...
   - Expected: 5xx rate < 1%
   - If 5xx rate is elevated, continue to step 2.
   - If 5xx rate is normal, the alert was a false positive; ack and snooze.

2. Check recent deploys with `kubectl rollout history deploy/payment-svc`.
   - Expected: most recent deploy is the current production version.
   - If a deploy happened in the last 30 minutes, continue to step 3.
   - If no recent deploy, skip to step 6.

3. Roll back to previous revision: `kubectl rollout undo deploy/payment-svc`.
4. Wait 90 seconds. Check the dashboard again.
5. If the dashboard recovers, file an incident with the rolled-back commit
   SHA and notify the deploying engineer in #payment-team. Stop here.
   If it does not recover, continue to step 6.
6. Page the payment-team on-call engineer; this runbook does not cover the
   non-deploy scenarios.

The numbered list is intentional. Reading at 03:00 AM, the engineer wants linear steps with explicit checkpoints. Bullet lists are wrong; they invite skimming. Ordered steps with branching conditions are correct.

Each step has: a specific command or action, the expected result, and what to do if the result is different. Without the "what if different" branch, the engineer hits unexpected output and stops.
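The three-part rule for a step (action, expected result, what to do otherwise) is mechanical enough to lint. The following is a minimal sketch, assuming steps are stored as structured records; the `Step` class and `lint_steps` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str        # the single command or action to perform
    expected: str      # the observable result that counts as "as expected"
    if_expected: str   # what to do when the result matches
    if_different: str  # what to do when it does not (the else-branch)


def lint_steps(steps: list[Step]) -> list[str]:
    """Return one message per missing part, in step order."""
    problems = []
    for number, step in enumerate(steps, start=1):
        for part in ("action", "expected", "if_expected", "if_different"):
            if not getattr(step, part).strip():
                problems.append(f"step {number}: missing {part}")
    return problems


steps = [
    Step(
        action="Check the payment-svc dashboard",
        expected="5xx rate < 1%",
        if_expected="False positive; ack and snooze",
        if_different="Continue to step 2",
    ),
    Step(
        action="Roll back: kubectl rollout undo deploy/payment-svc",
        expected="Dashboard recovers within 90 seconds",
        if_expected="File an incident and notify #payment-team",
        if_different="",  # deliberately left empty to show the lint firing
    ),
]

for problem in lint_steps(steps):
    print(problem)
```

The second step is flagged because it has no else-branch, which is exactly the place where an engineer would hit unexpected output and stop.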

The Diagnostic Body

For diagnostic runbooks, the body is a decision tree. Lay it out as nested sections, each with a decision question and links forward.

## Step 1: Is the database the bottleneck?

Run: `select * from pg_stat_activity where state = 'active'`

- **More than 100 active queries?** → Step 2A: connection pool exhaustion
- **Long-running queries (>30s)?** → Step 2B: query degradation
- **Normal activity (100 or fewer active queries, none long)?** → Step 3: not the database

## Step 2A: Connection pool exhaustion
...

Each branch leads to a leaf — a procedural section the engineer can execute. Diagnostic runbooks are not "investigate the system from scratch." They are pre-computed paths through the decision space, written by the engineers who know the system best, executed by whoever is on-call.
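A decision tree of this kind can be written down as plain data and walked mechanically, which makes it easy to confirm that every path ends at a leaf. This is a rough sketch under assumptions: the node ids, the observation keys (`active_queries`, `longest_query_seconds`), and the `walk` helper are invented for illustration, with thresholds borrowed from the example above.

```python
# Each node is either a question with ordered (condition, next_node) branches
# plus a fallback, or a leaf naming the procedural section to execute.
TREE = {
    "step_1": {
        "question": "Is the database the bottleneck?",
        "branches": [
            (lambda obs: obs["active_queries"] > 100, "step_2a"),
            (lambda obs: obs["longest_query_seconds"] > 30, "step_2b"),
        ],
        "otherwise": "step_3",  # every question has an explicit fallback
    },
    "step_2a": {"leaf": "Connection pool exhaustion"},
    "step_2b": {"leaf": "Query degradation"},
    "step_3": {"leaf": "Not the database"},
}


def walk(tree, observations, start="step_1"):
    """Follow the first matching branch at each node until a leaf is reached.

    Returns (path of node ids visited, name of the leaf section).
    """
    path, node_id = [], start
    while True:
        path.append(node_id)
        node = tree[node_id]
        if "leaf" in node:
            return path, node["leaf"]
        node_id = next(
            (target for matches, target in node["branches"] if matches(observations)),
            node["otherwise"],
        )


path, section = walk(TREE, {"active_queries": 140, "longest_query_seconds": 2})
print(" -> ".join(path), "|", section)
```

Giving every question node an `otherwise` target means no combination of observations can leave the engineer without a next step.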

The Escalation Criteria

Every runbook needs explicit escalation criteria. Three lines, near the top:

  • Page the team owner if: [specific conditions]
  • Page the on-call manager if: [worse conditions]
  • Declare an incident if: [worst conditions]

The on-call engineer does not have to guess when to escalate. The runbook tells them. This is one of the highest-value sections of any runbook; without it, engineers hesitate to escalate (afraid of overreacting) and the incident grows.
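Writing the criteria as ordered rules forces each bracketed placeholder to become a concrete, checkable condition. The sketch below assumes made-up signals and thresholds (`error_rate`, `minutes_elapsed`, `runbook_exhausted`); substitute whatever your team actually measures.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EscalationRule:
    action: str                       # what the engineer should do
    applies: Callable[[dict], bool]   # condition evaluated on current signals


# Ordered worst-first so the most severe matching rule wins.
# All thresholds and signal names are hypothetical placeholders.
RULES = [
    EscalationRule("Declare an incident",
                   lambda s: s["error_rate"] >= 0.25),
    EscalationRule("Page the on-call manager",
                   lambda s: s["minutes_elapsed"] >= 30),
    EscalationRule("Page the team owner",
                   lambda s: s["runbook_exhausted"]),
]


def escalation_for(signals: dict) -> Optional[str]:
    """Return the most severe escalation that applies, or None if none do."""
    for rule in RULES:
        if rule.applies(signals):
            return rule.action
    return None


print(escalation_for(
    {"error_rate": 0.02, "minutes_elapsed": 45, "runbook_exhausted": True}
))
```

Whether or not the rules are ever executed by code, writing them this precisely removes the judgment call that makes engineers hesitate.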

The Examples

Include worked examples at the bottom of the runbook. A real incident from the past, with the actual dashboard screenshots, log lines, and command outputs. Annotated.

The example anchors the runbook in reality. The engineer reading at 03:00 AM compares what they're seeing to the example; if it matches, they trust the runbook. If it differs, they know to be cautious.

Verification Cycles

Verify runbooks quarterly. Schedule a runbook verification day; rotate the on-call team through the runbooks they own; have them walk through each one and confirm that the commands still work, the dashboards still exist, and the escalation contacts are still accurate. Update the "last verified" date.

Untouched runbooks rot. Commands change. URLs move. Team contacts leave the company. The engineer at 03:00 AM following a rotted runbook is worse off than one with no runbook — they trust it for longer than they should.
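If the "last verified" dates are machine-readable, finding the runbooks that are overdue for a walk-through is a few lines of code. A minimal sketch, assuming a 92-day cutoff as an approximation of one quarter and a hypothetical inventory of runbook names and dates:

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=92)  # roughly one quarter; an assumed cutoff


def stale_runbooks(last_verified: dict[str, date], today: date) -> list[str]:
    """Return names of runbooks verified more than MAX_AGE ago, oldest first."""
    overdue = [
        (verified, name)
        for name, verified in last_verified.items()
        if today - verified > MAX_AGE
    ]
    return [name for _, name in sorted(overdue)]


# Hypothetical inventory: runbook name -> date of last verification.
inventory = {
    "payment-svc-5xx": date(2024, 5, 2),
    "db-connection-pool": date(2023, 11, 20),
    "queue-backlog": date(2024, 1, 8),
}

print(stale_runbooks(inventory, today=date(2024, 6, 1)))
```

Sorting oldest first gives the verification day a natural agenda: start with the runbook nobody has touched the longest.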

The Game Day

Once a year, run a game day. Pick a runbook. Trigger the underlying failure (in staging, ideally) and have a non-author engineer follow the runbook. Watch where they hesitate, what they get wrong, what they ask. Update the runbook based on what you learn.

Game days catch the assumed knowledge that the author baked in. The author who wrote the runbook knows the system; the runbook needs to work for the engineer who doesn't.

Anti-Patterns

Procedural and diagnostic mixed. The reader can't tell if they should follow steps or investigate. Split into two runbooks; link them.

Bullet lists for procedures. Bullets invite skimming. Numbered ordered steps are correct.

No "what if different" branches. Each step needs an else-branch. Without it, the engineer hits unexpected output and stops.

Implicit escalation. The runbook does not say when to escalate. Engineers hesitate. Make it explicit.

Stale runbooks. No verification date or one older than a year. The engineer should know whether to trust the runbook; the date tells them.

Author-perspective. Written for someone who already knows the system. The runbook is for the engineer who doesn't. Write for them.
