Incident Severity Classification
Define a severity scale that triggers the right response without over- or under-mobilizing the team.
Severity is the most consequential decision early in an incident. SEV-1 means the entire engineering team is engaged; SEV-3 means the on-call handles it. Get severity wrong and you either burn the team on minor issues or you let major incidents fester unstaffed.
The severity scale is the artifact that prevents both failure modes. Define it once, document it explicitly, and apply it consistently. Then engineers don't have to invent the criteria during an incident.
The Five-Level Scale
Most teams use four or five levels. Here is a five-level scale that maps to most production environments. Adjust the thresholds to your domain — but keep the structure.
SEV-1: Critical
The system is down or critically degraded for the majority of users. Revenue is being lost. The company's brand is at risk. Examples: payment processing fully offline, primary application returning 5xx for >50% of requests, customer data exposure incident, security breach in progress.
Response:
- Engineering on-call is paged immediately.
- Engineering manager and senior leadership are notified within 15 minutes.
- A dedicated incident commander is named.
- A communication channel is created and stays open.
- Status page is updated within 15 minutes.
- All-hands available; calendars cleared as needed.
- Resolution time target: under 1 hour.
- Postmortem mandatory; written within 1 week.
SEV-2: Major
A major feature or significant subset of users is affected, but the system is still partially functional. Examples: search broken for users in one geography, checkout flow degraded but cart still works, scheduled job failing for 24 hours.
Response:
- Engineering on-call is paged.
- Team channel notified.
- Engineering manager notified within 30 minutes.
- Status page updated within 30 minutes.
- Resolution time target: under 4 hours.
- Postmortem mandatory; written within 1 week.
SEV-3: Minor
A non-critical feature is affected, or a problem affects a small subset of users in a non-blocking way. Examples: a secondary feature returning errors, performance degradation under specific conditions, a backend job running slow.
Response:
- On-call engineer notified during business hours.
- Logged in incident tracking but no page outside business hours unless escalated.
- Resolution time target: under 1 business day.
- Postmortem optional; team's discretion.
SEV-4: Cosmetic
A bug or issue with no functional impact. Examples: a UI element misaligned, a tooltip not appearing, a typo on an error page.
Response:
- Filed as a normal bug ticket.
- No paging, no incident channel.
- Resolution time target: standard backlog priority.
SEV-5: Internal-only
A failure in internal tooling that does not affect customers. Examples: internal dashboard broken, build pipeline slow but not blocked, internal staging environment degraded.
Response:
- Filed as ticket; team handles in normal flow.
- No external communication.
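The five levels above are easiest to apply consistently when the scale lives in code rather than only in a wiki page. A minimal sketch of that encoding follows; the field names and the `SEVERITY_SCALE` table are illustrative, not a real tool's schema, so adapt them to your own paging and ticketing setup.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SeverityLevel:
    """One row of the severity scale. Time fields are minutes; None means not required."""
    name: str
    page_oncall: bool
    notify_leadership_within: Optional[int]
    status_page_within: Optional[int]
    resolution_target: str
    postmortem: str  # "mandatory", "optional", or "none"

# Hypothetical encoding of the scale documented above.
SEVERITY_SCALE = {
    1: SeverityLevel("SEV-1: Critical", True, 15, 15, "under 1 hour", "mandatory"),
    2: SeverityLevel("SEV-2: Major", True, 30, 30, "under 4 hours", "mandatory"),
    3: SeverityLevel("SEV-3: Minor", False, None, None, "under 1 business day", "optional"),
    4: SeverityLevel("SEV-4: Cosmetic", False, None, None, "standard backlog priority", "none"),
    5: SeverityLevel("SEV-5: Internal-only", False, None, None, "normal team flow", "none"),
}
```

With the scale as data, the same table can drive paging rules, status-page reminders, and the quarterly audit, so the documentation and the tooling cannot drift apart.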
Customer Impact, Not Internal Severity
Define severity by customer impact, not internal complexity. A bug that requires a complete rewrite of an internal service is not SEV-1 unless it is also breaking customer functionality. A simple config change that takes down the primary application is SEV-1 even if the fix is one line.
This distinction prevents two common mistakes:
- Engineer overreaction: a complex internal problem feels critical to the engineer working on it, but is invisible to customers. The engineer wants to escalate; severity criteria push back.
- Engineer underreaction: a simple-looking issue ("it's just a config typo") may be customer-facing critical. Severity criteria force the question: how many customers are affected and how badly?
Escalation: Going Up the Scale
Incidents can escalate. A SEV-3 that goes unresolved for hours and starts affecting more users may become SEV-2. The runbook for severity should describe the escalation triggers:
- Customer-impact metrics cross a threshold (error rate goes from 1% to 10%).
- Estimated time to resolution exceeds the SEV's window.
- Adjacent systems start to fail (cascading effects).
- External attention picks up (social media, press, support volume).
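The escalation triggers above can be checked mechanically. Here is a hedged sketch; every parameter name is illustrative, and the 10% error-rate threshold is an example you would replace with your own documented value.

```python
def should_escalate(error_rate: float,
                    eta_minutes: float,
                    sev_window_minutes: float,
                    cascading: bool,
                    external_attention: bool,
                    error_rate_threshold: float = 0.10) -> bool:
    """Return True if any documented escalation trigger fires.

    Parameters are placeholders for real signals: error_rate from your
    metrics system, eta_minutes from the incident commander's estimate,
    sev_window_minutes from the current SEV's resolution target.
    """
    return (
        error_rate >= error_rate_threshold      # impact metric crossed the threshold
        or eta_minutes > sev_window_minutes     # ETA exceeds the SEV's window
        or cascading                            # adjacent systems starting to fail
        or external_attention                   # social media, press, support volume
    )
```

A check like this does not replace the incident commander's judgment; it prompts the explicit "we are now SEV-2" declaration at the moment a trigger fires, rather than letting severity creep silently.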
When an incident escalates, the response level escalates. SEV-3 to SEV-2 means more people get notified, the manager is paged, the status page is updated. The escalation should be explicit; the incident commander declares "we are now SEV-2" and the team responds accordingly.
De-escalation: Going Down the Scale
Severity also de-escalates. A SEV-1 that has been mitigated (root cause not fixed but customer impact contained) may step down to SEV-2 or SEV-3 while engineers do the long-tail work. De-escalation lets the team release pressure — leadership stops being paged every 30 minutes, the incident channel quiets down, the engineers can focus on root cause without the meta-overhead of high-severity coordination.
De-escalate explicitly. The incident commander announces "we are stepping down from SEV-1 to SEV-2 because the customer impact has been mitigated; we continue to investigate root cause." Without explicit de-escalation, the incident drags on with full SEV-1 overhead long after the fire is out.
The Customer-Impact Threshold Question
Most teams define SEV-1 as "majority of users affected" or similar. Pick a specific number. "More than 25% of requests failing" is better than "majority." Engineers can check the metric and make the call without arguing.
Calibrate the threshold deliberately. If SEV-1 is "more than 50% of requests failing," you will rarely declare it even during serious outages. If it is "more than 5% of requests failing," you will have frequent SEV-1s and the team will burn out. Calibrate to your traffic and risk tolerance.
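A specific threshold can be written directly as a classification function, which removes the mid-incident argument entirely. This is a minimal sketch; the cut points (25% and 5%) are examples in the spirit of the paragraph above, not recommended values.

```python
from typing import Optional

def classify_by_error_rate(failing_fraction: float) -> Optional[int]:
    """Map the fraction of failing requests to a severity level.

    The thresholds below are illustrative; calibrate them to your own
    traffic and risk tolerance, then document them alongside the scale.
    """
    if failing_fraction > 0.25:   # "more than 25% of requests failing"
        return 1
    if failing_fraction > 0.05:
        return 2
    if failing_fraction > 0.0:
        return 3
    return None  # no measurable customer-facing request impact
```

An engineer at 03:00 checks one dashboard number, runs it through the documented thresholds, and declares severity without debate.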
The "What If We're Not Sure" Default
When the severity is ambiguous, default high. SEV-2 if you're unsure between SEV-2 and SEV-3. The cost of overreacting (engineers join an incident channel that turns out to be minor) is much lower than the cost of underreacting (an actual SEV-1 sits at SEV-3 for an hour while the on-call works alone).
You can always step down. You cannot recover the time lost while a critical incident was misclassified.
The Severity Audit
Once a quarter, audit the past quarter's incidents. Did the severities applied match the actual customer impact in retrospect? Were any incidents misclassified? Were any classes systematically over- or under-classified?
The audit catches drift. Teams sometimes lower their de facto severity bar over time (everything becomes SEV-3) or raise it (everything becomes SEV-1). The audit recalibrates.
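The drift the audit looks for can be summarized from incident records. A sketch, assuming you can attach a retrospective severity to each closed incident (the pair format here is hypothetical):

```python
from collections import Counter
from typing import Iterable, Tuple

def audit(incidents: Iterable[Tuple[int, int]]) -> Counter:
    """Summarize severity drift over a quarter.

    incidents: (declared_sev, retrospective_sev) pairs, where lower
    numbers are more severe (SEV-1 > SEV-2 > ...). Returns counts of
    over-classified, under-classified, and correct incidents.
    """
    drift = Counter()
    for declared, actual in incidents:
        if declared < actual:
            drift["over-classified"] += 1    # declared more severe than impact warranted
        elif declared > actual:
            drift["under-classified"] += 1   # declared less severe than impact warranted
        else:
            drift["correct"] += 1
    return drift
```

A quarter where "under-classified" dominates means the de facto bar has risen and real SEV-1s are sitting at SEV-3; a quarter of "over-classified" means the team is burning itself on minor issues. Either result is the recalibration signal.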
Anti-Patterns
Severity by engineer effort. "This is hard to fix, so it's SEV-1." Severity is customer impact, not engineer difficulty.
No customer-impact threshold. "Majority of users" without a specific percentage. Engineers argue during the incident.
Implicit escalation. Severity creeps up without anyone announcing it. Some team members are responding to SEV-3, others to SEV-1. Coordination breaks.
No de-escalation. Incident stays SEV-1 for 12 hours after customer impact is gone. Engineers burn out. Declare de-escalation explicitly.
Severity scale not in writing. Each engineer has their own mental model. Document the scale; train the team on it.
Severity drift. Over time, every incident becomes SEV-2. Audit quarterly; recalibrate.