# Writing Blameless Postmortems
Write postmortems that turn outages into learning, not blame. Covers the blameless framing, the header block, the timeline, contributing factors, action items, and follow-through.
You are writing a postmortem. The incident happened. Customers were affected. Engineers were paged. The fix is in. Now the team needs to learn from it — and the postmortem is the artifact that does that learning. Write it well or the same incident will happen again, on a different surface, in six months.
## The Blameless Framing
Open with a statement that the postmortem is blameless. The reader needs to know this — both the engineers reading and the stakeholders skimming. The blameless framing is not a courtesy; it is a methodological choice. People who fear blame describe what happened defensively, hiding the small choices that produced the failure. People who trust the framing describe what happened accurately, and the accuracy is what makes the learning possible.
The framing rules out the question "whose fault was it" and replaces it with "what about our system, our process, our tooling, our context allowed this to happen." Individuals exercising reasonable judgment under the conditions they had are not the failure mode the postmortem investigates. The postmortem investigates the conditions.
Phrase it explicitly: "This document is blameless. Names appear only to make the timeline reconstructible. The question is not who, but what allowed this." The single line at the top sets the tone for the whole document.
## The Header Block
Stamp the postmortem with metadata the reader needs at a glance:
- **Incident ID and title** — short, descriptive ("INC-0421: Payment provider degraded from 14:02–15:48 UTC")
- **Severity** — your team's severity scale (SEV-1, SEV-2, etc.) with the criteria for that level
- **Detected at** — first signal, alert, or report
- **Resolved at** — the moment the system was confirmed stable
- **Duration** — derived from the above
- **Customer impact** — quantified: requests failed, users affected, revenue lost, MAU touched
- **Author and reviewers** — the person writing the postmortem, the people reviewing it
- **Status** — draft / under review / final
The header is not narrative. It is a set of facts the reader should be able to assess in twenty seconds without reading the body.
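A sketch of what this block might look like for the INC-0421 example used throughout this document. The severity criteria wording, reviewer names, and support-channel detail are illustrative assumptions, not prescribed values:

```
**INC-0421: Payment provider degraded from 14:02–15:48 UTC**

| Field           | Value                                                       |
|-----------------|-------------------------------------------------------------|
| Severity        | SEV-2 (partial loss of a critical customer-facing function) |
| Detected at     | 14:02 UTC, first customer report via support                |
| Resolved at     | 15:48 UTC, error rate confirmed stable at baseline          |
| Duration        | 1h 46m                                                      |
| Customer impact | Payment requests degraded for ~60% of customers             |
| Author          | alice@ (on-call); reviewers: bob@, payments lead            |
| Status          | Final                                                       |
```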
## The Summary
Write a 2–4 sentence summary that captures: what failed, who was affected, why it happened, and how it was fixed. The reader who only reads the summary should leave with a correct mental model of the incident.
Resist the urge to be technical in the summary. The summary is for executives, for cross-team readers, for the future on-call engineer scanning postmortems for prior art. Save the technical depth for the body.
Bad summary: "The payment service experienced elevated error rates due to connection pool exhaustion."
Good summary: "Between 14:02 and 15:48 UTC, our payment provider was degraded for 60% of customers because a code change in the connection-pool library leaked sockets under load. The fix was a rollback. Underlying cause: the new library version had a regression that our integration tests did not catch."
## The Timeline
Reconstruct the timeline minute by minute. Use UTC, in 24-hour time, for every entry. Each entry has a timestamp, an actor or system, and an event:
| Time (UTC) | Actor/system | Event |
|------------|--------------|-------|
| 14:02 | payment-svc | First customer reports timeout via support |
| 14:04 | pagerduty | SEV-2 page to the on-call engineer |
| 14:11 | alice@ | Joins #incident-payment |
| 14:18 | alice@ | Identifies elevated 5xx in payment-svc dashboard |
| 14:23 | bob@ | Joins; checks recent deploys |
| 14:31 | bob@ | Notes payment-svc deployed v3.4.1 at 13:55 |
| 14:38 | alice@ | Begins rollback to v3.4.0 |
| 14:47 | payment-svc | Rolled back; error rate begins recovering |
| 15:48 | payment-svc | Confirmed stable at 0.02% baseline |
The timeline is detective work. Reconstruct from chat logs, monitoring system event histories, deploy logs, and the on-call engineer's notes. Do not paraphrase. Quote dashboards and logs verbatim where they show what people knew at the time.
The timeline often reveals the postmortem's most important learning: how long it took to detect, how long to identify cause, how long to mitigate. These three intervals — time to detect, time to identify, time to mitigate — are the operational metrics the postmortem produces.
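Worked out for the sample timeline above. Where each interval starts and ends is a judgment call; here detection is the page, identification is spotting the bad deploy, and mitigation is the completed rollback:

```
time to detect    14:02 → 14:04    2 min   (first report → page)
time to identify  14:04 → 14:31   27 min   (page → bad deploy identified)
time to mitigate  14:31 → 14:47   16 min   (cause identified → rollback done)
```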
## The Contributing Factors
Most production incidents have multiple contributing factors. The single-cause postmortem is rare and usually wrong. Identify three to seven factors. For each, describe what the factor was and how it contributed (one is written out after the list):
- **Code change** — A code change that introduced the latent fault. Describe what the change was meant to do, what it actually did, and why the difference was not caught.
- **Test gap** — The tests that should have caught the fault but didn't. Describe what the tests covered and what the gap was.
- **Deploy practice** — How the change reached production. Describe the rollout strategy and what about it amplified the fault's blast radius.
- **Monitoring gap** — How long the issue went undetected. Describe what would have detected it sooner.
- **Runbook gap** — What the on-call engineer didn't know and had to figure out in the moment. Describe what the runbook should have contained.
- **External dependency** — A behavior of an upstream service or library that changed without notice.
- **Operational pressure** — Team conditions (one engineer on-call, end of quarter, holiday weekend) that shaped the response.
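For the INC-0421 example, the first factor might read as follows. The reason for the upgrade and the load threshold are assumptions for illustration; the rest comes from the summary above:

```
Code change. v3.4.1 bumped the connection-pool library, intended as a
behavior-neutral dependency upgrade. Under load the new version leaked
sockets until the pool was exhausted. The leak only appears under sustained
concurrency, a condition the integration suite never produces, which is why
the gap between intended and actual behavior was not caught before deploy.
```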
Resist the temptation to list "the root cause." Postmortems that name a single root cause typically miss the system-level factors that allowed that cause to produce a customer-facing incident. Most causes are necessary; few are sufficient.
## The Action Items
Each contributing factor should produce one or more action items. Action items have:
- A specific change to make
- An owner (team or individual)
- A target date
- A tracking ticket
Distinguish three types:
- **Fix the immediate cause.** The patch that prevents this exact bug from recurring.
- **Reduce blast radius next time.** The deploy practice, feature flag, canary stage, or circuit breaker that would have contained the fault.
- **Reduce time to detect/identify/mitigate.** The dashboard, alert, runbook entry, or training that would speed up response.
A postmortem that produces only "fix the immediate cause" action items has not done the systemic learning. The patch eliminates this exact bug; the systemic items eliminate the class of bug.
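For the INC-0421 example, one item of each type might look like this. Owners, target dates, and ticket IDs are placeholders:

```
- Pin the connection-pool library to v3.4.0 until the socket leak is fixed
  upstream. Owner: payments team. Target: June 14. Ticket: PAY-1234.
  (fixes the immediate cause)
- Add a canary stage to payment-svc deploys so a bad build serves a fraction
  of traffic before full rollout. Owner: platform team. Target: July 1.
  Ticket: PLAT-0088. (reduces blast radius)
- Alert on payment-svc 5xx rate above baseline so the page fires before the
  first customer report. Owner: payments on-call rotation. Target: June 21.
  Ticket: PAY-1235. (reduces time to detect)
```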
Cap action items at six to ten. More than that signals the team is dumping every adjacent improvement opportunity into the postmortem; the action items will not be done. Pick the highest-leverage items and let the rest live in the team's normal backlog.
## The What Went Well
Include a section on what went well. Detection that fired correctly, runbooks that worked, communication that flowed, decisions that were correct. Postmortems that read as catalogs of failure miss that the response also contained successful judgments — and those successful judgments are part of the team's accumulated competence, worth naming explicitly so that the next on-call engineer knows what to repeat.
The "what went well" section is also a defense against the postmortem feeling punitive. The same engineers who made mistakes also made good calls. The postmortem honors both.
## The Distribution
Distribute the postmortem widely. The team that ran the incident reads it carefully. Adjacent teams read it for transferable lessons. Leadership reads it for severity and trends. The postmortem belongs in a searchable archive (the company's wiki, an incident management system) so that future engineers researching similar incidents can find it.
Time-box the review. The postmortem is more useful one week after the incident than three months later. Set a deadline at the time the incident is resolved and hold the team to it. Drafts older than a month are typically dead.
## Anti-Patterns
**Naming individuals as causes.** "Engineer X deployed a bad change" is the failure mode of un-blameless postmortems. Replace with "the deploy practice did not require a canary stage for this class of change."
**Single-cause root cause.** Real incidents have multiple necessary contributors. Identify them; do not collapse them into a falsely simple narrative.
**Action item dump.** Twenty action items signal the team is using the postmortem to file every adjacent improvement. Most will not be done. Cap at six to ten.
**Postmortem as performance review.** The document is not used to evaluate the engineer's competence. It is used to learn about the system. If the postmortem reveals an engineer needs training, that conversation happens elsewhere.
**Late postmortem.** A postmortem written a month after the incident has lost the clarity of recent memory. Write within one week.
**No follow-up.** Action items checked into the tracker but never completed. Schedule a 30-day review where the postmortem author confirms the action items are done, and reassign or remove the ones that are not.