Incident Communication Playbook

Technical resolution is half the work during an incident. The other half is communication: keeping customers informed, coordinating responders, and updating stakeholders. Poor communication amplifies an incident's impact: customers who know what is happening tend to be understanding, while customers left in the dark grow frustrated and lose trust.

Communication Timeline

T+0: Incident detected (alert fires)
  Internal: Post in #incidents channel
  Template: "🔴 INVESTIGATING: [Service] — [Symptom]
             Impact: [Who is affected]
             Severity: [SEV-1/2/3]
             IC: @[Incident Commander]
             Status page: Updating now"

T+5: Initial status page update
  External: "We are investigating reports of [issue].
            We will provide an update within 30 minutes."
  Tone: Acknowledge, no speculation

T+15: First substantive update
  External: "We have identified the issue affecting [service].
            Our engineering team is working on a fix.
            Estimated resolution: [time estimate or 'investigating']"
  Internal: "Root cause identified: [brief]. Working on [fix]."

T+30: Progress update (repeat every 30 min until resolved; SEV-1 incidents use the shorter 15-minute cadence defined under Status Page Management)
  External: "Our team continues to work on resolving [issue].
            Current status: [mitigation in progress/workaround available]
            Next update in 30 minutes."

T+resolution: Resolution
  External: "The issue has been resolved. [Service] is operating normally.
            We will publish a detailed postmortem within 48 hours."
  Internal: Close #incidents thread, schedule postmortem

T+48h: Postmortem published
  External: "Incident Report: [Summary, root cause, what we're doing to prevent recurrence]"
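
The timeline above can be turned into concrete due times once the detection time is known. The following is a minimal sketch, not a prescribed implementation: the function name `update_schedule` and its parameters are hypothetical, and it assumes the 48-hour postmortem window starts at resolution, as the resolution message suggests.

```python
from datetime import datetime, timedelta, timezone


def update_schedule(detected_at, resolved_at, interval_minutes=30):
    """Return (label, due_time) pairs following the timeline above."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    checkpoints = [
        ("internal post in #incidents", timedelta(minutes=0)),
        ("initial status page update", timedelta(minutes=5)),
        ("first substantive update", timedelta(minutes=15)),
    ]
    # Keep only the fixed checkpoints that fall before resolution.
    schedule = [
        (label, detected_at + offset)
        for label, offset in checkpoints
        if detected_at + offset < resolved_at
    ]
    # Recurring progress updates start at T+30 and repeat until resolved.
    due = detected_at + timedelta(minutes=30)
    while due < resolved_at:
        schedule.append(("progress update", due))
        due += timedelta(minutes=interval_minutes)
    schedule.append(("resolution notice", resolved_at))
    # Assumption: the 48-hour postmortem window starts at resolution.
    schedule.append(("postmortem due", resolved_at + timedelta(hours=48)))
    return schedule


detected = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
for label, due in update_schedule(detected, detected + timedelta(minutes=70)):
    print(f"{due:%Y-%m-%d %H:%M} {label}")
```

Passing `interval_minutes=15` would match the tighter cadence a SEV-1 incident calls for.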

Status Page Management

class StatusPageManager:
    """Manage public-facing incident status."""
    
    severities = {
        "SEV-1": {
            "status_page_impact": "major_outage",
            "update_frequency_minutes": 15,
            "stakeholder_notification": True,
            "customer_notification": True,
        },
        "SEV-2": {
            "status_page_impact": "partial_outage",
            "update_frequency_minutes": 30,
            "stakeholder_notification": True,
            "customer_notification": False,  # Status page only
        },
        "SEV-3": {
            "status_page_impact": "degraded_performance",
            "update_frequency_minutes": 60,
            "stakeholder_notification": False,
            "customer_notification": False,
        },
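    }

    def __init__(self, statuspage):
        # `statuspage` is an injected client exposing create_incident(...);
        # its concrete type is not specified in this playbook.
        self.statuspage = statuspage
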
    def create_incident(self, title, severity, affected_components):
        config = self.severities[severity]
        
        # Create status page incident
        self.statuspage.create_incident(
            title=title,
            impact=config["status_page_impact"],
            components=affected_components,
            status="investigating",
            body=f"We are investigating reports of {title.lower()}. "
                 f"We will provide an update within "
                 f"{config['update_frequency_minutes']} minutes.",
        )
        
        # Notify stakeholders
        if config["stakeholder_notification"]:
            self.notify_stakeholders(title, severity)
        
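        # Schedule reminder for next update
        self.schedule_update_reminder(config["update_frequency_minutes"])

    def notify_stakeholders(self, title, severity):
        """Alert internal stakeholders; the delivery channel is deployment-specific."""
        raise NotImplementedError

    def schedule_update_reminder(self, minutes):
        """Remind the incident commander to post the next update in `minutes`."""
        raise NotImplementedError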
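
The `customer_notification` flag is defined for each severity, but `create_incident` as shown never reads it. Below is a minimal sketch of one way to turn both flags into a list of channels. The helper `notification_channels`, the `NOTIFY` table, and the channel names are hypothetical and not part of the original class; the flag values are copied from the `severities` table.

```python
# Condensed copy of the notification flags from StatusPageManager.severities.
NOTIFY = {
    "SEV-1": {"stakeholder_notification": True, "customer_notification": True},
    "SEV-2": {"stakeholder_notification": True, "customer_notification": False},
    "SEV-3": {"stakeholder_notification": False, "customer_notification": False},
}


def notification_channels(severity):
    """Hypothetical helper: list every channel to use for a given severity."""
    try:
        config = NOTIFY[severity]
    except KeyError:
        # Fail loudly on a typo such as "SEV1" instead of notifying nobody.
        raise ValueError(f"unknown severity: {severity!r}") from None
    channels = ["status_page"]  # the status page is always updated
    if config["stakeholder_notification"]:
        channels.append("stakeholders")
    if config["customer_notification"]:
        channels.append("customers")
    return channels


for sev in ("SEV-1", "SEV-2", "SEV-3"):
    print(sev, notification_channels(sev))
```

Rejecting unknown severities explicitly is a design choice: during an incident, a mistyped severity should surface immediately rather than silently skip notifications.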

Customer Communication Templates

Investigating:
  "We're aware of an issue affecting [specific feature/service]. 
   Our team is actively investigating. We'll share more details 
   within [time frame]."

Identified:
  "We've identified the cause of the issue affecting [service]. 
   Our team is implementing a fix. We expect to resolve this 
   within [time estimate]. In the meantime, [workaround if any]."

Resolved:
  "The issue affecting [service] has been resolved as of [time]. 
   All systems are operating normally. We apologize for the 
   disruption and will share a detailed incident report within 
   48 hours."

Postmortem:
  "On [date], [service] experienced [duration] of [impact]. 
   Root cause: [brief, non-technical explanation]. 
   We have implemented [specific fix] to prevent recurrence. 
   Additional improvements planned: [list]."
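
A template is only safe to publish once every bracketed placeholder has been filled in. The following sketch shows one way to enforce that; the function `fill` and the regular expression are illustrative assumptions, not part of any status page tool.

```python
import re

# Matches bracketed placeholders such as [service] or [time frame].
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")

INVESTIGATING = (
    "We're aware of an issue affecting [specific feature/service]. "
    "Our team is actively investigating. We'll share more details "
    "within [time frame]."
)


def fill(template, values):
    """Replace each [placeholder] with values[placeholder]; fail if any is missing."""
    missing = [name for name in PLACEHOLDER.findall(template) if name not in values]
    if missing:
        # Refuse to produce a message that would leak a raw placeholder.
        raise KeyError(f"unfilled placeholders: {missing}")
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


print(fill(INVESTIGATING, {
    "specific feature/service": "payment processing",
    "time frame": "30 minutes",
}))
```

Raising an error instead of leaving the brackets in place prevents a half-filled template from reaching customers under time pressure.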

RULES:
  ✓ Be honest — never say "no data was affected" unless verified
  ✓ Be specific — "payment processing" not "some services"
  ✓ Give timelines — even "we'll update in 30 minutes" is a timeline
  ✓ Acknowledge impact — "we know this affects your business"
  ✗ Never blame vendors — "our infrastructure provider" not "AWS"
  ✗ Never speculate — only communicate confirmed information
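
Some of these rules can be checked mechanically before an update goes out. The sketch below is a rough pre-publish check under stated assumptions: the word lists and the function `lint_update` are hypothetical examples, and a keyword match is no substitute for a human review.

```python
import re

# Hypothetical word lists; adjust them to your own vendors and phrasing.
VENDOR_NAMES = ("aws", "azure", "gcp")
VAGUE_PHRASES = ("some services", "some users")
SPECULATION = ("probably", "might be", "we think", "possibly")
TIME_COMMITMENT = re.compile(r"\b\d+\s*(minutes?|hours?)\b")


def lint_update(text):
    """Return a list of rule violations found in a draft customer update."""
    lowered = text.lower()
    problems = []
    for vendor in VENDOR_NAMES:
        if re.search(rf"\b{re.escape(vendor)}\b", lowered):
            problems.append(f"names a vendor: {vendor}")
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            problems.append(f"vague wording: {phrase}")
    for phrase in SPECULATION:
        if phrase in lowered:
            problems.append(f"speculation: {phrase}")
    if not TIME_COMMITMENT.search(lowered):
        problems.append("no timeline given")
    return problems


draft = "AWS is probably down, affecting some services."
for problem in lint_update(draft):
    print(problem)
```

A draft that passes this check can still be dishonest or unclear, so the check works best as a reminder rather than a gate.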

Anti-Patterns

Anti-Pattern	Consequence	Fix
No status page updates	Customers flood support	Update status page within 5 minutes
Technical jargon in updates	Customers confused, more support tickets	Plain language, impact-focused
"Everything is fine" when it is not	Trust destroyed when truth emerges	Honest, proportionate communication
Update only when resolved	Hours of silence = hours of customer panic	Regular updates even if no progress
No postmortem published	Customers fear recurrence	Publish postmortem within 48 hours
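
The "update only when resolved" anti-pattern shows up as long gaps between status page updates. As a sketch of how a postmortem review might detect it, the hypothetical function `overdue_gaps` below lists every pair of consecutive updates spaced further apart than the agreed cadence.

```python
from datetime import datetime, timedelta


def overdue_gaps(update_times, max_gap_minutes):
    """Return (previous, next) update pairs spaced more than max_gap_minutes apart."""
    limit = timedelta(minutes=max_gap_minutes)
    ordered = sorted(update_times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > limit
    ]


start = datetime(2024, 1, 1, 12, 0)
updates = [start + timedelta(minutes=m) for m in (0, 5, 20, 80)]
for earlier, later in overdue_gaps(updates, max_gap_minutes=30):
    print(f"silent from {earlier:%H:%M} to {later:%H:%M}")
```

Here the updates at 12:20 and 13:20 are an hour apart, so that gap is the only one reported against a 30-minute cadence.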

Incident communication is a trust exercise. During an outage, your status page becomes the most important page on your website. Treat communication with the same urgency as technical resolution.
