
Incident Management

Coordinate effective incident response from detection through resolution and postmortem review.



Core Philosophy

Incident management is the structured process of detecting, responding to, and recovering from service disruptions while minimizing customer impact. The goal is not to prevent all incidents — that is impossible — but to detect them quickly, respond effectively, communicate transparently, and learn from every occurrence. A blameless culture that treats incidents as learning opportunities rather than failures to punish is essential for long-term reliability improvement.

Key Techniques

  • Severity Classification: Define clear severity levels (SEV1-SEV4) based on customer impact, not technical symptoms. SEV1 means significant customer-facing impact; SEV4 means minor issues with workarounds available.
  • Incident Commander Role: Designate a single person to coordinate response, make decisions, and manage communication. The IC does not debug — they orchestrate.
  • Communication Cadence: Establish regular status updates to stakeholders at intervals appropriate to severity. SEV1 gets updates every 15 minutes.
  • Blameless Postmortems: After resolution, conduct structured reviews focused on systemic causes and preventive actions rather than individual blame.
  • Runbooks: Maintain step-by-step guides for known failure scenarios that enable any on-call engineer to begin diagnosis and mitigation immediately.
  • War Room Protocol: Establish dedicated communication channels and video bridges for major incidents, with clear roles for participants.
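The severity levels and communication cadence above can be captured in a small policy table so that classification and update intervals are decided before an incident, not during one. This is a minimal sketch; the `SEV1`–`SEV4` names follow the text, but the specific intervals and flags for SEV2/SEV3/SEV4 are illustrative assumptions, not prescribed values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    description: str
    update_interval_min: int  # stakeholder update cadence in minutes
    page_immediately: bool    # whether to page on-call right away

# SEV1's 15-minute cadence comes from the text; the rest are assumptions.
SEVERITY_POLICIES = {
    "SEV1": SeverityPolicy("Significant customer-facing impact", 15, True),
    "SEV2": SeverityPolicy("Degraded service, partial customer impact", 30, True),
    "SEV3": SeverityPolicy("Minor impact, workaround available", 60, False),
    "SEV4": SeverityPolicy("Cosmetic or low-impact issue", 240, False),
}

def update_interval(severity: str) -> int:
    """Minutes between stakeholder updates for a given severity."""
    return SEVERITY_POLICIES[severity].update_interval_min
```

Encoding the policy as data rather than tribal knowledge means any on-call engineer, or an automated incident bot, can apply it consistently.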

Best Practices

  • Define what constitutes an incident before one occurs. Ambiguity during an active incident wastes critical time on classification debates.
  • Page the right people immediately. Under-escalation causes more damage than over-escalation.
  • Separate mitigation from root cause analysis. Restore service first, investigate causes afterward.
  • Keep a timeline of actions taken during the incident for the postmortem.
  • Communicate externally through status pages even when the full picture is unclear. Silence is worse than partial information.
  • Track mean time to detect (MTTD) and mean time to resolve (MTTR) as key metrics.
  • Conduct regular incident response drills to test processes before real incidents.
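Computing MTTD and MTTR is straightforward once each incident records its key timestamps. A minimal sketch, assuming each incident is a dict with `started`, `detected`, and `resolved` datetimes; note that conventions vary (some teams measure MTTR from detection rather than from incident start), so state yours explicitly. Here MTTD is start-to-detection and MTTR is start-to-resolution.

```python
from datetime import datetime

def _mean_minutes(pairs):
    """Mean elapsed minutes across (start, end) datetime pairs."""
    pairs = list(pairs)
    total_seconds = sum((end - start).total_seconds() for start, end in pairs)
    return total_seconds / len(pairs) / 60

def incident_metrics(incidents):
    """Return (MTTD, MTTR) in minutes for a list of incident records."""
    mttd = _mean_minutes((i["started"], i["detected"]) for i in incidents)
    mttr = _mean_minutes((i["started"], i["resolved"]) for i in incidents)
    return mttd, mttr

# Sample data: two incidents with start/detect/resolve timestamps.
incidents = [
    {"started": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 5),
     "resolved": datetime(2024, 1, 1, 10, 45)},
    {"started": datetime(2024, 1, 2, 14, 0),
     "detected": datetime(2024, 1, 2, 14, 15),
     "resolved": datetime(2024, 1, 2, 15, 0)},
]
mttd, mttr = incident_metrics(incidents)  # 10.0 and 52.5 minutes
```

Tracking these over time shows whether monitoring improvements (MTTD) and response-process improvements (MTTR) are actually paying off.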

Common Patterns

  • Detect → Triage → Mitigate → Resolve → Review: The standard incident lifecycle that ensures no phase is skipped.
  • On-Call Rotation: Distribute incident response duty across the team with clear handoff procedures and compensation.
  • Automated Remediation: For well-understood failure modes, implement automated responses (restart service, scale up, failover) that resolve incidents before a human is paged.
  • Incident Review Board: Weekly review of recent incidents to identify patterns, track action item completion, and prioritize reliability investments.
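The automated-remediation pattern can be sketched as a check/remediate loop: attempt the known fix a bounded number of times, and escalate to a human only if the service is still unhealthy. The function and service names here are hypothetical; a real implementation would wrap an actual health check and restart/failover action.

```python
import logging

def auto_remediate(name, is_healthy, remediate, max_attempts=3):
    """Try a known remediation before paging a human.

    Returns True if the health check passes within max_attempts
    remediations; False means escalate to the on-call engineer.
    """
    for attempt in range(1, max_attempts + 1):
        if is_healthy():
            return True
        logging.warning("%s unhealthy, remediation attempt %d", name, attempt)
        remediate()  # e.g. restart service, scale up, fail over
    return is_healthy()

# Toy failure mode: the service recovers after one "restart".
state = {"up": False}
recovered = auto_remediate(
    "payments-api",
    is_healthy=lambda: state["up"],
    remediate=lambda: state.update(up=True),  # stand-in for a real restart
)
```

Bounding the attempts matters: an unbounded remediation loop can mask a genuine failure mode that needs human investigation.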

Anti-Patterns

  • Blaming individuals for incidents. This drives hiding and underreporting rather than learning and improvement.
  • Not declaring incidents early enough. When in doubt, declare and downgrade later; escalating too late causes far more damage than an unnecessary declaration.
  • Postmortems that produce action items nobody follows up on. Track completion rates and hold teams accountable for preventive measures.
  • Hero culture where the same senior engineers are always paged for every incident. This creates single points of failure and burnout.
  • Ignoring near-misses. An incident that was caught before customer impact is still a valuable learning opportunity.
  • Not practicing incident response. A process that has never been tested will fail when it matters most.
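Holding teams accountable for postmortem action items starts with measuring completion. A minimal sketch, assuming action items are tracked as records with a `done` flag; the field names are illustrative, and a real tracker would also record owners and due dates.

```python
def action_item_completion(items):
    """Fraction of postmortem action items actually completed."""
    if not items:
        return 1.0  # vacuously complete
    return sum(1 for item in items if item["done"]) / len(items)

# Hypothetical action items from a single postmortem.
items = [
    {"title": "Add alert on queue depth", "done": True},
    {"title": "Write failover runbook", "done": True},
    {"title": "Load-test the fallback path", "done": False},
]
rate = action_item_completion(items)  # 2 of 3 complete
```

Reviewing this rate at the weekly incident review board surfaces preventive work that is quietly slipping.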