Incident Management

## Core Philosophy

Incident management is the structured process of detecting, responding to, and recovering from service disruptions while minimizing customer impact. The goal is not to prevent all incidents, which is impossible, but to detect them quickly, respond effectively, communicate transparently, and learn from every occurrence. A blameless culture, one that treats incidents as learning opportunities rather than failures to punish, is essential for long-term reliability improvement.

## Key Techniques

- **Severity Classification**: Define clear severity levels (SEV1-SEV4) based on customer impact, not technical symptoms. SEV1 means significant customer-facing impact; SEV4 means a minor issue with a workaround available.
- **Incident Commander Role**: Designate a single person to coordinate the response, make decisions, and manage communication. The incident commander (IC) does not debug; the IC orchestrates.
- **Communication Cadence**: Send status updates to stakeholders at intervals that match the severity. For example, a SEV1 gets an update every 15 minutes.
- **Blameless Postmortems**: After resolution, conduct structured reviews focused on systemic causes and preventive actions rather than individual blame.
- **Runbooks**: Maintain step-by-step guides for known failure scenarios so that any on-call engineer can begin diagnosis and mitigation immediately.
- **War Room Protocol**: Establish dedicated communication channels and video bridges for major incidents, with clear roles for participants.
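
The severity and cadence rules above can be sketched in code. This is a minimal illustration, not a standard: the `Impact` fields, the percentage thresholds, and every update interval other than the 15-minute SEV1 interval are assumptions made for the example and should be tuned per service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # significant customer-facing impact
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4  # minor issue with a workaround available


@dataclass(frozen=True)
class Impact:
    """Customer impact, not technical symptoms (hypothetical fields)."""
    customers_affected_pct: float
    workaround_available: bool


def classify(impact: Impact) -> Severity:
    """Map customer impact to a severity level.

    The thresholds are illustrative assumptions, not values from the text.
    """
    if impact.customers_affected_pct >= 25 and not impact.workaround_available:
        return Severity.SEV1
    if impact.customers_affected_pct >= 5:
        return Severity.SEV2
    if not impact.workaround_available:
        return Severity.SEV3
    return Severity.SEV4


# Only the SEV1 interval comes from the text; the rest are assumed.
UPDATE_INTERVAL = {
    Severity.SEV1: timedelta(minutes=15),
    Severity.SEV2: timedelta(minutes=30),
    Severity.SEV3: timedelta(hours=2),
    Severity.SEV4: timedelta(hours=24),
}


def next_update_due(last_update: datetime, severity: Severity) -> datetime:
    """Return when the next stakeholder update is due."""
    return last_update + UPDATE_INTERVAL[severity]
```

Keeping the classifier a pure function of customer impact makes it easy to agree on before an incident and to re-run when the impact changes, which supports declaring early and downgrading later.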

## Best Practices

- Define what constitutes an incident before one occurs. Ambiguity during an active incident wastes critical time on classification debates.
- Page the right people immediately. Under-escalation causes more damage than over-escalation.
- Separate mitigation from root cause analysis. Restore service first, investigate causes afterward.
- Keep a timeline of actions taken during the incident for the postmortem.
- Communicate externally through status pages even when the full picture is unclear. Silence is worse than partial information.
- Track mean time to detect (MTTD) and mean time to resolve (MTTR) as key metrics.
- Conduct regular incident response drills to test processes before real incidents.
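
MTTD and MTTR fall out directly from the timestamps a timeline already records. The sketch below assumes a hypothetical `IncidentRecord` with three timestamps and measures both metrics from the start of impact. Some teams measure MTTR from detection instead, so treat that choice as an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentRecord:
    """Key timestamps taken from an incident timeline (hypothetical shape)."""
    started_at: datetime   # when customer impact began
    detected_at: datetime  # when monitoring or a human noticed
    resolved_at: datetime  # when service was restored


def _mean(deltas: list[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("no incidents to average")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[IncidentRecord]) -> timedelta:
    """Mean time to detect: average of detection minus start of impact."""
    return _mean([i.detected_at - i.started_at for i in incidents])


def mttr(incidents: list[IncidentRecord]) -> timedelta:
    """Mean time to resolve: average of resolution minus start of impact."""
    return _mean([i.resolved_at - i.started_at for i in incidents])
```

For two incidents detected after 4 and 6 minutes and resolved after 40 and 60 minutes, `mttd` returns 5 minutes and `mttr` returns 50 minutes.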

## Common Patterns

- **Detect → Triage → Mitigate → Resolve → Review**: The standard incident lifecycle that ensures no phase is skipped.
- **On-Call Rotation**: Distribute incident response duty across the team with clear handoff procedures and compensation.
- **Automated Remediation**: For well-understood failure modes, implement automated responses (restart service, scale up, failover) that resolve incidents before a human is paged.
- **Incident Review Board**: Weekly review of recent incidents to identify patterns, track action item completion, and prioritize reliability investments.
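
One way to make sure no lifecycle phase is skipped is to model the lifecycle as a small state machine that only allows a move to the next phase. The `Incident` class and its `advance` method below are hypothetical names for this sketch, not the API of any real tool.

```python
from enum import Enum


class Phase(Enum):
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    REVIEW = 5


class Incident:
    """Tracks an incident through the lifecycle, one phase at a time."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]

    def advance(self, to: Phase) -> None:
        """Move to the next phase; reject skipped or backward moves."""
        if to.value != self.phase.value + 1:
            raise ValueError(
                f"cannot move from {self.phase.name} to {to.name}"
            )
        self.phase = to
        self.history.append(to)
```

Calling `advance(Phase.RESOLVE)` on an incident still in `TRIAGE` raises `ValueError`, which forces the mitigation step to be recorded before the incident can be closed out.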

## Anti-Patterns

- Blaming individuals for incidents. This drives hiding and underreporting rather than learning and improvement.
- Not declaring incidents early enough. When in doubt, declare and downgrade later rather than escalating too late.
- Postmortems that produce action items nobody follows up on. Track completion rates and hold teams accountable for preventive measures.
- Hero culture where the same senior engineers are always paged for every incident. This creates single points of failure and burnout.
- Ignoring near-misses. An incident that was caught before customer impact is still a valuable learning opportunity.
- Not practicing incident response. A process that has never been tested will fail when it matters most.
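
Tracking postmortem follow-through can be as simple as a completion rate and an overdue list. The sketch below assumes a hypothetical `ActionItem` record. The field names are illustrative and not tied to any particular ticketing system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ActionItem:
    """A postmortem follow-up task (hypothetical shape)."""
    incident_id: str
    owner: str
    due: date
    completed_on: Optional[date] = None


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of action items completed; 1.0 when there are none."""
    if not items:
        return 1.0
    done = sum(1 for item in items if item.completed_on is not None)
    return done / len(items)


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open action items whose due date has already passed."""
    return [
        item for item in items
        if item.completed_on is None and item.due < today
    ]
```

Reviewing the output of `overdue` in a recurring incident review keeps preventive work visible after the urgency of the incident has faded.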
