Incident Commander Role
Serve as the incident commander during an active production incident.
The incident commander is the role that holds the incident together. Engineers investigate root cause; the IC coordinates the investigation, decides what to communicate to whom, and keeps the team's energy directed. Without an IC, every engineer in the channel is simultaneously coordinating, investigating, and communicating, and all three suffer.
The IC role is a discipline. The person playing it is not necessarily the most technical engineer in the room. The IC is the engineer who can hold the situation in their head, decide who is doing what, and keep the response coherent.
When to Name an IC
Name an incident commander when:
- The incident is SEV-1 or SEV-2 (always)
- The incident has more than three engineers active
- The incident has crossed the 30-minute mark with no clear path to resolution
- Stakeholders outside engineering are starting to ask questions
For SEV-3 and below, the on-call engineer typically plays both IC and investigator. For SEV-1 and SEV-2, separating the roles is what allows the response to scale.
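The four criteria above can be read as a simple "any of these" rule. The following is a minimal sketch, not a prescribed tool: the `IncidentSnapshot` type and its field names are hypothetical, and treating "crossed the 30-minute mark" as 30 minutes or more is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IncidentSnapshot:
    """Hypothetical summary of an incident at one moment in time."""
    severity: int              # 1 = SEV-1, 2 = SEV-2, 3 = SEV-3, ...
    active_engineers: int      # engineers currently working the incident
    minutes_elapsed: int       # time since the incident was declared
    has_clear_path: bool       # is there a clear path to resolution?
    external_questions: bool   # are stakeholders outside engineering asking?


def should_name_ic(s: IncidentSnapshot) -> bool:
    """Return True if any one of the four criteria is met."""
    return (
        s.severity <= 2                                      # SEV-1 or SEV-2: always
        or s.active_engineers > 3                            # more than three engineers
        or (s.minutes_elapsed >= 30 and not s.has_clear_path)
        or s.external_questions
    )
```

For example, `should_name_ic(IncidentSnapshot(3, 2, 45, False, False))` is true because a SEV-3 has run past 30 minutes with no clear path, while the same incident with a clear path returns false.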
The IC's Job
The IC has three responsibilities. Keep them distinct; an IC who also tries to investigate root cause will drop at least one of them.
1. Coordinate the Response
The IC tracks who is doing what. If the team is splitting into investigation streams (one engineer on the database, one on the application, one on the deploy pipeline), the IC writes those assignments down in the incident channel, in a pinned message, or in a shared doc, and keeps them current.
The IC also makes the calls about what to try next. "We've ruled out a deploy. Let's check the database. Alice, take the connection pool. Bob, take query times. Report back in 10 minutes." The IC is making decisions; the engineers are executing.
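One way to keep assignments current is to record each one with its report-back time and re-render the pinned message whenever something changes. This is an illustrative sketch only; `Assignment`, `assign`, and `render_pinned` are hypothetical names, and the 10-minute default simply mirrors the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Assignment:
    engineer: str
    stream: str                # what this engineer is investigating
    report_back_at: datetime   # when the IC expects an update


def assign(engineer, stream, now, report_in_minutes=10):
    """Create an assignment with a report-back deadline relative to `now`."""
    return Assignment(engineer, stream, now + timedelta(minutes=report_in_minutes))


def render_pinned(assignments):
    """Render the assignment list as text suitable for a pinned message."""
    lines = ["Current assignments:"]
    for a in assignments:
        lines.append(
            f"- {a.engineer}: {a.stream} (report back {a.report_back_at:%H:%M})"
        )
    return "\n".join(lines)
```

Calling `render_pinned([assign("Alice", "connection pool", now), assign("Bob", "query times", now)])` produces a short list the IC can paste into the channel and replace as streams open and close.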
2. Track Status
The IC keeps a running summary of the incident's state. Every 15 minutes, the IC posts a status update to the incident channel:
- What we know
- What we're investigating
- What we've tried that didn't work
- What we're going to try next
- Customer impact (current)
This status update has multiple audiences: the engineers who joined the incident late, the leaders who are watching, the support team waiting to update customers. The status update is a courtesy to all of them.
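The five-part update above lends itself to a fixed template, so every post has the same shape and late joiners know where to look. The sketch below is one possible layout under stated assumptions: the `StatusUpdate` class and its field names are hypothetical, and the "(nothing yet)" placeholder is an illustrative choice.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    known: list
    investigating: list
    tried_and_failed: list
    next_steps: list
    customer_impact: str

    def render(self) -> str:
        """Render the update in the same section order every time."""
        sections = [
            ("What we know", self.known),
            ("What we're investigating", self.investigating),
            ("What we've tried that didn't work", self.tried_and_failed),
            ("What we're going to try next", self.next_steps),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # Keep empty sections visible so readers see nothing was skipped.
            for item in (items or ["(nothing yet)"]):
                lines.append(f"  - {item}")
        lines.append(f"Customer impact (current): {self.customer_impact}")
        return "\n".join(lines)
```

Keeping empty sections in the output is deliberate: a reader can tell the difference between "nothing has been ruled out yet" and "the IC forgot that section".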
3. Communicate with Stakeholders
The IC is the single voice from the incident to the rest of the company. Engineering leadership wants updates; the IC provides them. Support wants to know what to tell customers; the IC tells them. Marketing wants to know if the status page should be updated; the IC decides.
This single-voice principle prevents the engineers in the channel from being interrupted by leadership questions every five minutes. The IC handles the questions; the engineers stay focused on root cause.
The communication is calibrated. To engineering leadership: "we're investigating, no ETA, will update in 30 minutes." To support: "the symptoms users are seeing are [X]; please tell them we're aware and working on it." To the status page: "Investigating: payment provider degraded for some users." Each audience gets the communication appropriate to them.
What the IC Doesn't Do
The IC does not investigate root cause. The temptation is strong, especially if the IC is technical and has ideas. Resist. The moment the IC starts hypothesizing about the database, they have stopped being the IC.
If the IC has technical ideas, they tell the engineers to consider them. They do not pursue them personally.
The IC also does not type into the production system. They do not run commands. They do not deploy. The IC's job is to coordinate; the moment they take action they are an investigator and someone else needs to be the IC.
The Handoff
Long incidents need IC handoffs. After two to three hours, the IC's judgment degrades. They have been holding too much state for too long. Hand off to a fresh IC.
The handoff is a structured artifact:
- Current state of the investigation
- What's been ruled out
- What's being tried now
- Outstanding decisions the new IC will need to make
- Communication threads that are open with stakeholders
Take 10 minutes for the handoff. The new IC reads the running summary, asks questions, and confirms they have the picture before the old IC drops off. Bad handoffs lose state; the new IC re-investigates things the old team already ruled out, and progress regresses.
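The handoff artifact described above can be captured as a small record with an explicit confirmation step, so the outgoing IC does not drop off before the incoming IC has the picture. This is a minimal sketch; `Handoff`, its fields, and `ready_to_hand_off` are hypothetical names, and requiring a non-empty current-state summary is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    current_state: str                   # current state of the investigation
    ruled_out: list                      # what's been ruled out
    in_progress: list                    # what's being tried now
    outstanding_decisions: list          # decisions the new IC will need to make
    open_stakeholder_threads: list       # open communication threads
    acknowledged_by_new_ic: bool = False


def ready_to_hand_off(h: Handoff) -> bool:
    """The old IC may drop off only once the state is written down
    and the new IC has confirmed they have the picture."""
    return bool(h.current_state.strip()) and h.acknowledged_by_new_ic
```

The `ruled_out` list is the part that prevents regressions: it is what stops the new IC from re-investigating ground the previous team already covered.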
The Decision Authority
The IC is the decision-maker during the incident. Engineers debating whether to roll back, restart a service, page another team, or update the status page bring the decision to the IC. The IC decides.
This authority is granted explicitly when the IC role is named. "Carol is IC; her decisions stand." Without explicit authority, decisions get debated by committee in the channel and the response slows.
The IC's decisions can be appealed to the on-call manager or to engineering leadership, but only after the incident, in the postmortem. During the incident, the IC's call is final.
The Calm Voice
The IC's voice is calm in writing and in audio. Not artificially calm; not so calm it suggests the IC doesn't grasp the severity. But not panicked. The team mirrors the IC's tone; if the IC is anxious, the team becomes anxious; if the IC is collected, the team is collected.
The calm voice is harder during a SEV-1 at 03:00 than at 14:00. Practice it. The IC who has been in this seat before is more valuable than the IC who knows the system better but has never run an incident.
The IC's Notes
The IC keeps detailed notes during the incident. Timestamps, decisions, what was tried, what people said. The notes feed directly into the postmortem timeline. Without good IC notes, the postmortem author has to reconstruct from chat logs after the fact, and details are lost.
The notes are also a hedge against the IC's own forgetting. By the time the IC hands off or the incident is over, they will not remember which thing was tried at which time. The notes are the memory.
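Timestamped notes are easiest to reuse when each entry is structured at the moment it is written. The following sketch assumes a simple in-memory log; `Note`, `IncidentLog`, and the `kind` labels are hypothetical, and a real team might keep the same information in a shared doc or chat thread instead.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Note:
    at: datetime   # when it happened (UTC)
    kind: str      # e.g. "decision", "action", "observation"
    text: str


class IncidentLog:
    """Append-only notes that can be replayed as a postmortem timeline."""

    def __init__(self):
        self._notes = []

    def add(self, kind, text, at=None):
        # Default to the current UTC time when no timestamp is supplied.
        note = Note(at or datetime.now(timezone.utc), kind, text)
        self._notes.append(note)
        return note

    def timeline(self):
        """Return notes in chronological order, one formatted line each."""
        ordered = sorted(self._notes, key=lambda n: n.at)
        return [f"{n.at:%H:%M} [{n.kind}] {n.text}" for n in ordered]
```

Sorting by timestamp rather than insertion order means a note added late ("at 03:05 the error rate started rising") still lands in the right place in the timeline.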
After the Incident
After resolution, the IC is responsible for:
- Confirming the incident is fully resolved (not just mitigated)
- Closing the incident channel with a final summary
- Initiating the postmortem (or naming an author)
- Thanking the engineers who responded
The thank-you is not a courtesy; it is part of building the team's incident response culture. Engineers who feel acknowledged for stepping up to incidents continue to step up. Engineers who don't, don't.
Anti-Patterns
IC investigates root cause. The IC starts typing commands. They have stopped being the IC. Find another IC or hand off.
No status updates. Engineering leadership is asking what's happening every five minutes because they don't have the running summary. Post status updates; the questions stop.
Multiple voices to stakeholders. Engineers in the channel are individually replying to leadership questions. The signal is fragmented. Route through the IC.
No handoff at the 3-hour mark. The IC's judgment is degrading. They are still on the call. Hand off; pick up the pace again.
Decisions by committee. The team debates whether to roll back. The IC's job is to decide. They decide; the team executes.
Panicked IC. The team mirrors the IC's tone. If the IC is anxious, the response is anxious. Train for calm, in writing and in audio.