Incident Communication Playbook
Communicate effectively during production incidents. Covers status page management, stakeholder updates, customer communication templates, internal escalation, timeline documentation, and the patterns that maintain trust when things go wrong.
Technical resolution is half the work during an incident. The other half is communication: keeping customers informed, coordinating responders, and updating stakeholders. Poor communication during an incident amplifies the impact — customers who know what is happening are understanding, but customers left in the dark become furious.
Communication Timeline
T+0: Incident detected (alert fires)
Internal: Post in #incidents channel
Template: "🔴 INVESTIGATING: [Service] — [Symptom]
Impact: [Who is affected]
Severity: [SEV-1/2/3]
IC: @[Incident Commander]
Status page: Updating now"
T+5: Initial status page update
External: "We are investigating reports of [issue].
We will provide an update within 30 minutes."
Tone: Acknowledge, no speculation
T+15: First substantive update
External: "We have identified the issue affecting [service].
Our engineering team is working on a fix.
Estimated resolution: [time estimate or 'investigating']"
Internal: "Root cause identified: [brief]. Working on [fix]."
T+30: Progress update (every 30 min until resolved)
External: "Our team continues to work on resolving [issue].
Current status: [mitigation in progress/workaround available]
Next update in 30 minutes."
T+resolution: Resolution
External: "The issue has been resolved. [Service] is operating normally.
We will publish a detailed postmortem within 48 hours."
Internal: Close #incidents thread, schedule postmortem
T+48h: Postmortem published
External: "Incident Report: [Summary, root cause, what we're doing to prevent recurrence]"
Status Page Management
class StatusPageManager:
"""Manage public-facing incident status."""
severities = {
"SEV-1": {
"status_page_impact": "major_outage",
"update_frequency_minutes": 15,
"stakeholder_notification": True,
"customer_notification": True,
},
"SEV-2": {
"status_page_impact": "partial_outage",
"update_frequency_minutes": 30,
"stakeholder_notification": True,
"customer_notification": False, # Status page only
},
"SEV-3": {
"status_page_impact": "degraded_performance",
"update_frequency_minutes": 60,
"stakeholder_notification": False,
"customer_notification": False,
},
}
def create_incident(self, title, severity, affected_components):
config = self.severities[severity]
# Create status page incident
self.statuspage.create_incident(
title=title,
impact=config["status_page_impact"],
components=affected_components,
status="investigating",
body=f"We are investigating reports of {title.lower()}. "
f"We will provide an update within "
f"{config['update_frequency_minutes']} minutes.",
)
# Notify stakeholders
if config["stakeholder_notification"]:
self.notify_stakeholders(title, severity)
# Schedule reminder for next update
self.schedule_update_reminder(config["update_frequency_minutes"])
Customer Communication Templates
Investigating:
"We're aware of an issue affecting [specific feature/service].
Our team is actively investigating. We'll share more details
within [time frame]."
Identified:
"We've identified the cause of the issue affecting [service].
Our team is implementing a fix. We expect to resolve this
within [time estimate]. In the meantime, [workaround if any]."
Resolved:
"The issue affecting [service] has been resolved as of [time].
All systems are operating normally. We apologize for the
disruption and will share a detailed incident report within
48 hours."
Postmortem:
"On [date], [service] experienced [duration] of [impact].
Root cause: [brief, non-technical explanation].
We have implemented [specific fix] to prevent recurrence.
Additional improvements planned: [list]."
RULES:
✓ Be honest — never say "no data was affected" unless verified
✓ Be specific — "payment processing" not "some services"
✓ Give timelines — even "we'll update in 30 minutes" is a timeline
✓ Acknowledge impact — "we know this affects your business"
✗ Never blame vendors — "our infrastructure provider" not "AWS"
✗ Never speculate — only communicate confirmed information
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No status page updates | Customers flood support | Update status page within 5 minutes |
| Technical jargon in updates | Customers confused, more support tickets | Plain language, impact-focused |
| ”Everything is fine” when it is not | Trust destroyed when truth emerges | Honest, proportionate communication |
| Update only when resolved | Hours of silence = hours of customer panic | Regular updates even if no progress |
| No postmortem published | Customers fear recurrence | Publish postmortem within 48 hours |
Incident communication is a trust exercise. During an outage, your status page becomes the most important page on your website. Treat communication with the same urgency as technical resolution.