# Writing Blameless Postmortems
Write postmortems that turn outages into learning, not blame. Covers the blameless framing, the header block, the timeline, contributing factors, action items, and follow-through.
You are writing a postmortem. The incident happened. Customers were affected. Engineers were paged. The fix is in. Now the team needs to learn from it — and the postmortem is the artifact that does that learning. Write it well or the same incident will happen again, on a different surface, in six months.
## The Blameless Framing
Open with a statement that the postmortem is blameless. The reader needs to know this — both the engineers reading and the stakeholders skimming. The blameless framing is not a courtesy; it is a methodological choice. People who fear blame describe what happened defensively, hiding the small choices that produced the failure. People who trust the framing describe what happened accurately, and the accuracy is what makes the learning possible.
The framing rules out the question "whose fault was it" and replaces it with "what about our system, our process, our tooling, our context allowed this to happen." Individuals exercising reasonable judgment under the conditions they had are not the failure mode the postmortem investigates. The postmortem investigates the conditions.
Phrase it explicitly: "This document is blameless. Names appear only to make the timeline reconstructible. The question is not who, but what allowed this." The single line at the top sets the tone for the whole document.
## The Header Block
Stamp the postmortem with metadata the reader needs at a glance:
- **Incident ID and title** — short, descriptive ("INC-0421: Payment provider degraded from 14:02–15:48 UTC")
- **Severity** — your team's severity scale (SEV-1, SEV-2, etc.) with the criteria for that level
- **Detected at** — first signal, alert, or report
- **Resolved at** — the moment the system was confirmed stable
- **Duration** — derived from the above
- **Customer impact** — quantified: requests failed, users affected, revenue lost, MAU touched
- **Author and reviewers** — the person writing the postmortem, the people reviewing it
- **Status** — draft / under review / final
The header is not narrative. It is a set of facts the reader should be able to assess in twenty seconds without reading the body.
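A sketch of what this block might look like for the INC-0421 example used throughout this document. The severity criteria wording, reviewer names, and support-channel detail are illustrative assumptions, not prescribed values:

```
**INC-0421: Payment provider degraded from 14:02–15:48 UTC**

| Field           | Value                                                       |
|-----------------|-------------------------------------------------------------|
| Severity        | SEV-2 (partial loss of a critical customer-facing function) |
| Detected at     | 14:02 UTC, first customer report via support                |
| Resolved at     | 15:48 UTC, error rate confirmed stable at baseline          |
| Duration        | 1h 46m                                                      |
| Customer impact | Payment requests degraded for ~60% of customers             |
| Author          | alice@ (on-call); reviewers: bob@, payments lead            |
| Status          | Final                                                       |
```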
## The Summary
Write a 2–4 sentence summary that captures: what failed, who was affected, why it happened, and how it was fixed. The reader who only reads the summary should leave with a correct mental model of the incident.
Resist the urge to be technical in the summary. The summary is for executives, for cross-team readers, for the future on-call engineer scanning postmortems for prior art. Save the technical depth for the body.
Bad summary: "The payment service experienced elevated error rates due to connection pool exhaustion."
Good summary: "Between 14:02 and 15:48 UTC, our payment provider was degraded for 60% of customers because a code change in the connection-pool library leaked sockets under load. The fix was a rollback. Underlying cause: the new library version had a regression that our integration tests did not catch."
## The Timeline
Reconstruct the timeline minute by minute. Use UTC, in 24-hour time, for every entry. Each entry has a timestamp, an actor or system, and an event:
| Time (UTC) | Actor/system | Event |
|------------|--------------|-------|
| 14:02 | payment-svc | First customer reports timeout via support |
| 14:04 | pagerduty | SEV-2 page to the on-call engineer |
| 14:11 | alice@ | Joins #incident-payment |
| 14:18 | alice@ | Identifies elevated 5xx in payment-svc dashboard |
| 14:23 | bob@ | Joins; checks recent deploys |
| 14:31 | bob@ | Notes payment-svc deployed v3.4.1 at 13:55 |
| 14:38 | alice@ | Begins rollback to v3.4.0 |
| 14:47 | payment-svc | Rolled back; error rate begins recovering |
| 15:48 | payment-svc | Confirmed stable at 0.02% baseline |
The timeline is detective work. Reconstruct from chat logs, monitoring system event histories, deploy logs, and the on-call engineer's notes. Do not paraphrase. Quote dashboards and logs verbatim where they show what people knew at the time.
The timeline often reveals the postmortem's most important learning: how long it took to detect, how long to identify cause, how long to mitigate. These three intervals — time to detect, time to identify, time to mitigate — are the operational metrics the postmortem produces.
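Worked out for the sample timeline above. Where each interval starts and ends is a judgment call; here detection is the page, identification is spotting the bad deploy, and mitigation is the completed rollback:

```
time to detect    14:02 → 14:04    2 min   (first report → page)
time to identify  14:04 → 14:31   27 min   (page → bad deploy identified)
time to mitigate  14:31 → 14:47   16 min   (cause identified → rollback done)
```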
## The Contributing Factors
Most production incidents have multiple contributing factors. The single-cause postmortem is rare and usually wrong. Identify three to seven factors. For each, describe what the factor was and how it contributed (one is written out after the list):
- **Code change** — A code change that introduced the latent fault. Describe what the change was meant to do, what it actually did, and why the difference was not caught.
- **Test gap** — The tests that should have caught the fault but didn't. Describe what the tests covered and what the gap was.
- **Deploy practice** — How the change reached production. Describe the rollout strategy and what about it amplified the fault's blast radius.
- **Monitoring gap** — How long the issue went undetected. Describe what would have detected it sooner.
- **Runbook gap** — What the on-call engineer didn't know and had to figure out in the moment. Describe what the runbook should have contained.
- **External dependency** — A behavior of an upstream service or library that changed without notice.
- **Operational pressure** — Team conditions (one engineer on-call, end of quarter, holiday weekend) that shaped the response.
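For the INC-0421 example, the first factor might read as follows. The reason for the upgrade and the load threshold are assumptions for illustration; the rest comes from the summary above:

```
Code change. v3.4.1 bumped the connection-pool library, intended as a
behavior-neutral dependency upgrade. Under load the new version leaked
sockets until the pool was exhausted. The leak only appears under sustained
concurrency, a condition the integration suite never produces, which is why
the gap between intended and actual behavior was not caught before deploy.
```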
Resist the temptation to list "the root cause." Postmortems that name a single root cause typically miss the system-level factors that allowed that cause to produce a customer-facing incident. Most causes are necessary; few are sufficient.
## The Action Items
Each contributing factor should produce one or more action items. Action items have:
- A specific change to make
- An owner (team or individual)
- A target date
- A tracking ticket
Distinguish three types:
- **Fix the immediate cause.** The patch that prevents this exact bug from recurring.
- **Reduce blast radius next time.** The deploy practice, feature flag, canary stage, or circuit breaker that would have contained the fault.
- **Reduce time to detect/identify/mitigate.** The dashboard, alert, runbook entry, or training that would speed up response.
A postmortem that produces only "fix the immediate cause" action items has not done the systemic learning. The patch eliminates this exact bug; the systemic items eliminate the class of bug.
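For the INC-0421 example, one item of each type might look like this. Owners, target dates, and ticket IDs are placeholders:

```
- Pin the connection-pool library to v3.4.0 until the socket leak is fixed
  upstream. Owner: payments team. Target: June 14. Ticket: PAY-1234.
  (fixes the immediate cause)
- Add a canary stage to payment-svc deploys so a bad build serves a fraction
  of traffic before full rollout. Owner: platform team. Target: July 1.
  Ticket: PLAT-0088. (reduces blast radius)
- Alert on payment-svc 5xx rate above baseline so the page fires before the
  first customer report. Owner: payments on-call rotation. Target: June 21.
  Ticket: PAY-1235. (reduces time to detect)
```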
Cap action items at six to ten. More than that signals the team is dumping every adjacent improvement opportunity into the postmortem; the action items will not be done. Pick the highest-leverage items and let the rest live in the team's normal backlog.
## The What Went Well
Include a section on what went well. Detection that fired correctly, runbooks that worked, communication that flowed, decisions that were correct. Postmortems that read as catalogs of failure miss that the response also contained successful judgments — and those successful judgments are part of the team's accumulated competence, worth naming explicitly so that the next on-call engineer knows what to repeat.
The "what went well" section is also a defense against the postmortem feeling punitive. The same engineers who made mistakes also made good calls. The postmortem honors both.
## The Distribution
Distribute the postmortem widely. The team that ran the incident reads it carefully. Adjacent teams read it for transferable lessons. Leadership reads it for severity and trends. The postmortem belongs in a searchable archive (the company's wiki, an incident management system) so that future engineers researching similar incidents can find it.
Time-box the review. The postmortem is more useful one week after the incident than three months later. Set a deadline at the time the incident is resolved and hold the team to it. Drafts older than a month are typically dead.
## Anti-Patterns
**Naming individuals as causes.** "Engineer X deployed a bad change" is the failure mode of un-blameless postmortems. Replace with "the deploy practice did not require a canary stage for this class of change."
**Single-cause root cause.** Real incidents have multiple necessary contributors. Identify them; do not collapse them into a falsely simple narrative.
**Action item dump.** Twenty action items signal the team is using the postmortem to file every adjacent improvement. Most will not be done. Cap at six to ten.
**Postmortem as performance review.** The document is not used to evaluate the engineer's competence. It is used to learn about the system. If the postmortem reveals an engineer needs training, that conversation happens elsewhere.
**Late postmortem.** A postmortem written a month after the incident has lost the clarity of recent memory. Write within one week.
**No follow-up.** Action items checked into the tracker but never completed. Schedule a 30-day review where the postmortem author confirms the action items are done, and reassign or remove the ones that are not.