Customer Communication During Incidents
Customers experience incidents through what you communicate, not through the technical reality. Two outages with identical impact and identical resolution time can feel completely different to customers depending on whether they were updated. Communication is part of the incident response, not separate from it.
Communication is also a separable role: the engineers fixing the incident should not be writing the status page. The IC may write it, or someone designated for communications. Either way, communication is its own discipline with its own rules.
The Three Audiences
You are communicating to three audiences, often through three channels:
- Active users right now, who are experiencing the failure: in-app banners, error messages, support contact options.
- Subscribed status-page watchers, who care about your reliability: status page updates.
- The broader public (and press, social media, customers who don't know there's an issue): social media, email, blog posts after the fact.
Each audience has different information needs and different tolerance for technical detail. Treat them as distinct.
The First Communication
The first customer-facing communication should go out within 15 minutes of the incident's confirmed start. The clock starts not at the alert, but at the moment you've confirmed the issue is real.
The first communication is short and contains four things:
- Acknowledgment: yes, there is an issue, and we know about it.
- Scope: who is affected (all users, users in a region, users of a feature).
- What we're doing: investigating, fixing, mitigating.
- Next update: when you'll communicate again.
```
We're investigating reports of slow page loads affecting some users
in North America. Our team is on it; we'll post the next update by
14:30 UTC.
```
The first communication does not need to explain the cause; you don't know the cause yet. It does not need to apologize at length. It does not need to be elegant. It needs to acknowledge the issue, scope it, and commit to a next update.
The "next update" commitment is binding. If you say "next update at 14:30," post an update at 14:30 even if you have nothing new to say. Customers who can predict your communication cadence trust you more than customers who get sporadic updates.
The Cadence
During the incident, communicate at a regular cadence:
- SEV-1: every 30 minutes, even if there's no progress
- SEV-2: every 60 minutes
- SEV-3: when status changes, no fixed cadence
A "no progress" update is itself useful. "We're still investigating; no ETA yet; next update in 30 minutes" tells customers you haven't forgotten about them.
Drop into the cadence and stay in it. Skipping updates because "nothing new to share" is the most common communication failure during incidents.
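The cadence is mechanical enough to encode so the comms owner doesn't have to track it by hand. Here is a minimal TypeScript sketch, assuming the severity levels above; the `Severity` type and `nextUpdateDue` helper are illustrative, not taken from any particular tool:

```typescript
type Severity = "SEV-1" | "SEV-2" | "SEV-3";

// Fixed update intervals in minutes; null means no fixed cadence
// (SEV-3 updates only when status changes).
const CADENCE_MINUTES: Record<Severity, number | null> = {
  "SEV-1": 30,
  "SEV-2": 60,
  "SEV-3": null,
};

// Given when the last update was posted, return when the next one is due,
// or null if this severity only requires updates on status changes.
function nextUpdateDue(severity: Severity, lastUpdate: Date): Date | null {
  const minutes = CADENCE_MINUTES[severity];
  if (minutes === null) return null;
  return new Date(lastUpdate.getTime() + minutes * 60_000);
}

// Example: a SEV-1 update posted at 14:00 UTC is due again by 14:30 UTC.
const due = nextUpdateDue("SEV-1", new Date("2024-06-01T14:00:00Z"));
console.log(due?.toISOString()); // "2024-06-01T14:30:00.000Z"
```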
The Tone
Match the tone to the situation. A 5-minute degradation does not warrant the rhetorical weight of a 4-hour outage. A data exposure incident does not warrant the casualness of a typo bug.
Three tone calibrations:
- Acknowledging — early in the incident, you know something is wrong but not what. Tone is calm, factual, brief.
- Updating — during the investigation, you're sharing what you've learned. Tone is informative, slightly more detailed, still factual.
- Resolving — once fixed, you're explaining what happened and what you're doing about it. Tone is more substantial, includes acknowledgment of impact and what comes next.
Avoid: corporate jargon, passive voice that obscures responsibility, defensiveness, false urgency, or false calm. The tone the customer wants is the tone of an engineer who knows what's happening and is fixing it.
What Not to Say
There are categories of statement to avoid in real-time updates:
- Speculation about cause when you don't know yet. Wait until you're sure.
- ETAs you can't keep. "Should be fixed in 30 minutes" creates a clock. If you miss it, customers feel deceived.
- Technical detail that exposes you. "We've identified a memory leak in the connection pool" is fine; "Engineer X's PR introduced a bug" is not.
- Blame on third parties when you haven't confirmed. "AWS is having issues" without verification creates a different mess if it's not actually AWS.
- Apologies that read as performative. "We sincerely apologize for any inconvenience" is white noise. If you mean it, mean it specifically: "We know this affected your ability to check out; we're sorry."
The Status Page
The status page is the canonical record of the incident's customer-facing communication. Update it at every cadence interval. Update it when status genuinely changes. Use the status page's structured fields: incident state (investigating / identified / monitoring / resolved), affected components, severity if your tool supports it.
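Those structured fields map naturally onto a small data shape. A sketch, assuming a hypothetical JSON endpoint at status.example.com; the `IncidentUpdate` interface and `postStatusUpdate` helper are stand-ins for whatever client your status-page provider actually offers:

```typescript
type IncidentState = "investigating" | "identified" | "monitoring" | "resolved";

interface IncidentUpdate {
  state: IncidentState;          // where you are in the incident lifecycle
  affectedComponents: string[];  // e.g. ["API", "Checkout"]
  message: string;               // the human-readable update text
  nextUpdateAt?: string;         // the binding next-update commitment (ISO 8601)
}

// Hypothetical helper; replace with your status-page provider's client.
async function postStatusUpdate(update: IncidentUpdate): Promise<void> {
  await fetch("https://status.example.com/api/incidents/current/updates", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(update),
  });
}

// The first communication from the example above, as structured data:
await postStatusUpdate({
  state: "investigating",
  affectedComponents: ["Page loads (North America)"],
  message:
    "We're investigating reports of slow page loads affecting some users " +
    "in North America. Our team is on it.",
  nextUpdateAt: "2024-06-01T14:30:00Z",
});
```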
The status page also serves the post-incident audience. People researching whether to use your product will find your status page; the way you ran an incident is part of your reliability story. A status page that shows "investigating, identified, monitoring, resolved" with timestamps and clear updates tells the reader you have a mature incident process. A status page with three updates over four hours and a vague resolution tells the opposite.
The In-App Banner
For active users, an in-app banner is more honest than letting the application fail without acknowledgment. If the user is hitting a degraded service, tell them in the application: "We're experiencing issues with [feature]. We're working on it; check status.example.com for updates."
The banner appears at the top of the relevant pages. It is dismissible. It links to the status page. When the issue resolves, the banner goes away (don't leave stale banners up for hours after the fix).
The in-app banner is technically tricky to deploy if the issue affects deployment. Have a pre-built mechanism — a feature flag, a config edit, a CDN-hosted banner — that you can flip without deploying.
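A minimal sketch of what such a mechanism can look like, assuming a hypothetical flag service at flags.example.com; the endpoint, flag shape, and function names are illustrative:

```typescript
// Fetch the banner config from a flag service; a disabled or missing flag
// means no banner. Flipping the flag requires no application deploy.
async function fetchIncidentBanner(): Promise<{ text: string; link: string } | null> {
  const res = await fetch("https://flags.example.com/flags/incident-banner");
  if (!res.ok) return null; // fail closed: no banner if the flag service errors
  const flag = await res.json();
  return flag.enabled ? { text: flag.text, link: flag.link } : null;
}

// Render the banner at the top of the page if the flag is on.
async function mountIncidentBanner(): Promise<void> {
  const banner = await fetchIncidentBanner();
  if (!banner) return;

  const el = document.createElement("div");
  el.setAttribute("role", "status");
  el.textContent = `${banner.text} `;

  const link = document.createElement("a");
  link.href = banner.link; // points at the status page
  link.textContent = "Check the status page for updates";
  el.appendChild(link);

  const dismiss = document.createElement("button");
  dismiss.textContent = "Dismiss";
  dismiss.onclick = () => el.remove(); // dismissible, per the guidance above
  el.appendChild(dismiss);

  document.body.prepend(el);
}
```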
The Resolution Communication
When the incident is resolved, post a resolution message. This is the most important customer communication of the incident; it is what customers will remember.
The resolution message includes:
- Confirmation that the issue is resolved.
- A brief explanation of what happened, in non-technical language.
- An apology, specific to the impact.
- What you're doing to prevent recurrence.
- A pointer to a future detailed write-up if there will be one (for major incidents, customers sometimes appreciate a public postmortem).
```
Resolved: At 14:02 UTC, our payment processing became degraded for
about 60% of customers. Customers experienced timeouts and failed
payments for approximately 1 hour 46 minutes. The issue was caused
by a regression in a recent code update; we rolled back at 14:38
and the service has been stable since 15:48.

We know this affected your ability to complete purchases, and we're
sorry. We've identified the gap in our pre-deploy testing that
allowed this regression and are adding test coverage this week.
A more detailed postmortem will be published next week.
```
The Public Postmortem
For major incidents, consider a public postmortem. Many companies do not publish them; some do. Public postmortems are a strong signal of operational maturity, but they require care: they expose technical details that may be sensitive, and they require the engineering and communications teams to align on what's said.
A public postmortem is not a replacement for the internal blameless postmortem. It is a derivative — the internal postmortem with names removed, internal-system-detail abstracted, and the narrative tightened for an external audience.
If you publish public postmortems, do so consistently. Selectively publishing them — only for incidents that look good in the telling — destroys the credibility of the practice.
Anti-Patterns
Late first communication. First update goes out 90 minutes into a SEV-1. Customers were confused for 90 minutes; the trust hit is large.
Cadence drift. "Next update in 30 minutes" turns into 90 minutes with no apology. Either commit to a cadence and hold it, or don't promise one.
Speculation as fact. "AWS is having issues" before you've confirmed. If wrong, you've damaged a vendor relationship publicly and confused your customers.
False ETA. "Fixed in 30 minutes" said hopefully. Now you have a clock. Better to say "no ETA yet, next update in 30 minutes."
Performative apology. "We sincerely apologize for any inconvenience" — empty. Make it specific.
Stale banner. In-app banner up for two days after the incident resolved. Customers think you have ongoing issues.
Inconsistent public postmortems. Publishing only the well-handled incidents and burying the messy ones. Customers notice.