
Incident Communication Playbook

Communicate effectively during production incidents. Covers status page management, stakeholder updates, customer communication templates, internal escalation, timeline documentation, and the patterns that maintain trust when things go wrong.

Technical resolution is only half the work during an incident. The other half is communication: keeping customers informed, coordinating responders, and updating stakeholders. Poor communication amplifies the impact of an outage. Customers who know what is happening tend to be understanding, while customers left in the dark become furious.


Communication Timeline

T+0: Incident detected (alert fires)
  Internal: Post in #incidents channel
  Template: "🔴 INVESTIGATING: [Service] — [Symptom]
             Impact: [Who is affected]
             Severity: [SEV-1/2/3]
             IC: @[Incident Commander]
             Status page: Updating now"

T+5: Initial status page update
  External: "We are investigating reports of [issue].
            We will provide an update within 30 minutes."
  Tone: Acknowledge, no speculation

T+15: First substantive update
  External: "We have identified the issue affecting [service].
            Our engineering team is working on a fix.
            Estimated resolution: [time estimate or 'investigating']"
  Internal: "Root cause identified: [brief]. Working on [fix]."

T+30: Progress update (every 30 min until resolved)
  External: "Our team continues to work on resolving [issue].
            Current status: [mitigation in progress/workaround available]
            Next update in 30 minutes."

T+resolution: Resolution
  External: "The issue has been resolved. [Service] is operating normally.
            We will publish a detailed postmortem within 48 hours."
  Internal: Close #incidents thread, schedule postmortem

T+48h: Postmortem published
  External: "Incident Report: [Summary, root cause, what we're doing to prevent recurrence]"
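
The timeline above can be turned into concrete due times once an incident is detected. The following is a minimal sketch, not a prescribed implementation: the function name `planned_updates` and the `horizon_minutes` planning window are assumptions introduced for illustration, while the T+0, T+5, T+15, and 30-minute cadence come from the timeline itself.

```python
from datetime import datetime, timedelta

# Fixed early milestones taken from the timeline above (minutes after detection).
EARLY_MILESTONES = [
    (0, "internal: post in #incidents"),
    (5, "external: initial status page update"),
    (15, "external: first substantive update"),
]


def planned_updates(detected_at, horizon_minutes, interval_minutes=30):
    """Return (due_time, label) pairs up to horizon_minutes after detection.

    horizon_minutes is a hypothetical planning window, not a resolution estimate.
    """
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    plan = [(m, label) for m, label in EARLY_MILESTONES if m <= horizon_minutes]
    offset = 30  # progress updates start at T+30 in the timeline
    while offset <= horizon_minutes:
        plan.append((offset, "external: progress update"))
        offset += interval_minutes
    return [(detected_at + timedelta(minutes=m), label) for m, label in plan]


if __name__ == "__main__":
    # Example detection time chosen purely for illustration.
    for due, label in planned_updates(datetime(2024, 1, 1, 12, 0), 90):
        print(due.strftime("%H:%M"), label)
```

Generating the schedule up front gives the Incident Commander a checklist of promised updates, so a missed update is visible rather than forgotten.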

Status Page Management

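class StatusPageManager:
    """Manage public-facing incident status."""

    # Per-severity policy: status page impact, update cadence, and who is notified.
    severities = {
        "SEV-1": {
            "status_page_impact": "major_outage",
            "update_frequency_minutes": 15,
            "stakeholder_notification": True,
            "customer_notification": True,
        },
        "SEV-2": {
            "status_page_impact": "partial_outage",
            "update_frequency_minutes": 30,
            "stakeholder_notification": True,
            "customer_notification": False,  # Status page only
        },
        "SEV-3": {
            "status_page_impact": "degraded_performance",
            "update_frequency_minutes": 60,
            "stakeholder_notification": False,
            "customer_notification": False,
        },
    }

    def __init__(self, statuspage):
        # statuspage: any client object that exposes create_incident(...)
        # with the keyword arguments used below (assumed interface).
        self.statuspage = statuspage

    def create_incident(self, title, severity, affected_components):
        if severity not in self.severities:
            raise ValueError(f"Unknown severity: {severity!r}")
        config = self.severities[severity]

        # Create status page incident
        self.statuspage.create_incident(
            title=title,
            impact=config["status_page_impact"],
            components=affected_components,
            status="investigating",
            body=f"We are investigating reports of {title.lower()}. "
                 f"We will provide an update within "
                 f"{config['update_frequency_minutes']} minutes.",
        )

        # Notify stakeholders
        if config["stakeholder_notification"]:
            self.notify_stakeholders(title, severity)

        # Notify customers directly (SEV-1 only in the table above)
        if config["customer_notification"]:
            self.notify_customers(title, severity)

        # Schedule reminder for next update
        self.schedule_update_reminder(config["update_frequency_minutes"])

    # Integration hooks: implement these for your own messaging and scheduling tools.
    def notify_stakeholders(self, title, severity):
        raise NotImplementedError

    def notify_customers(self, title, severity):
        raise NotImplementedError

    def schedule_update_reminder(self, minutes):
        raise NotImplementedError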
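
The reminder step depends on knowing when the next update is due for a given severity. A minimal sketch of that check follows, assuming the cadence values from the severities table above; the names `UPDATE_FREQUENCY_MINUTES`, `next_update_due`, and `is_update_overdue` are hypothetical helpers, not part of any status page product.

```python
from datetime import datetime, timedelta

# Cadence values copied from the severities table above.
UPDATE_FREQUENCY_MINUTES = {"SEV-1": 15, "SEV-2": 30, "SEV-3": 60}


def next_update_due(severity, last_update_at):
    """Return the time by which the next status page update should be posted."""
    return last_update_at + timedelta(minutes=UPDATE_FREQUENCY_MINUTES[severity])


def is_update_overdue(severity, last_update_at, now):
    """True when the promised update window has passed without a new update."""
    return now > next_update_due(severity, last_update_at)
```

A scheduler or chat bot could call `is_update_overdue` once a minute and ping the Incident Commander when it returns `True`.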

Customer Communication Templates

Investigating:
  "We're aware of an issue affecting [specific feature/service]. 
   Our team is actively investigating. We'll share more details 
   within [time frame]."

Identified:
  "We've identified the cause of the issue affecting [service]. 
   Our team is implementing a fix. We expect to resolve this 
   within [time estimate]. In the meantime, [workaround if any]."

Resolved:
  "The issue affecting [service] has been resolved as of [time]. 
   All systems are operating normally. We apologize for the 
   disruption and will share a detailed incident report within 
   48 hours."

Postmortem:
  "On [date], [service] experienced [duration] of [impact]. 
   Root cause: [brief, non-technical explanation]. 
   We have implemented [specific fix] to prevent recurrence. 
   Additional improvements planned: [list]."
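
Templates like these are easy to send with a placeholder still in them. The sketch below shows one way to fill the bracketed placeholders and refuse to produce a message if any are left unfilled. The helper name `fill_template` and the sample values are assumptions for illustration; the template text is the Resolved template above.

```python
import re

# Matches bracketed placeholders such as [service] or [time].
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")

RESOLVED = (
    "The issue affecting [service] has been resolved as of [time]. "
    "All systems are operating normally. We apologize for the "
    "disruption and will share a detailed incident report within "
    "48 hours."
)


def fill_template(template, values):
    """Replace [name] placeholders with values; raise if any are missing."""
    missing = []

    def substitute(match):
        key = match.group(1)
        if key in values:
            return str(values[key])
        missing.append(key)
        return match.group(0)

    text = PLACEHOLDER.sub(substitute, template)
    if missing:
        raise ValueError(f"Unfilled placeholders: {missing}")
    return text


if __name__ == "__main__":
    print(fill_template(RESOLVED, {"service": "payment processing", "time": "14:32 UTC"}))
```

Failing loudly on a missing value is a deliberate choice here: a delayed update is usually less damaging than one that reads "affecting [service]".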

RULES:
  ✓ Be honest — never say "no data was affected" unless verified
  ✓ Be specific — "payment processing" not "some services"
  ✓ Give timelines — even "we'll update in 30 minutes" is a timeline
  ✓ Acknowledge impact — "we know this affects your business"
  ✗ Never blame vendors — "our infrastructure provider" not "AWS"
  ✗ Never speculate — only communicate confirmed information
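
Some of these rules can be checked mechanically before an update is published. The following is a rough sketch under stated assumptions: the word lists (`VENDOR_NAMES`, `VAGUE_PHRASES`, `SPECULATION_MARKERS`) are hypothetical examples to be replaced with your own, and a keyword check of this kind can only flag candidates for human review, not judge honesty or tone.

```python
import re

# Hypothetical word lists for illustration; tune them to your own environment.
VENDOR_NAMES = ("aws", "azure", "gcp", "cloudflare")
VAGUE_PHRASES = ("some services", "some users", "a few customers")
SPECULATION_MARKERS = ("probably", "we think", "might be", "possibly")

# A timeline is approximated as a number followed by minutes or hours.
TIMELINE = re.compile(r"\b\d+\s*(?:minutes?|hours?)\b")


def lint_update(message):
    """Return a list of rule violations found in a draft customer update."""
    lowered = message.lower()
    problems = []
    for vendor in VENDOR_NAMES:
        if re.search(rf"\b{re.escape(vendor)}\b", lowered):
            problems.append(f"names a vendor: {vendor}")
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            problems.append(f"vague wording: {phrase}")
    for marker in SPECULATION_MARKERS:
        if marker in lowered:
            problems.append(f"speculation: {marker}")
    if "no data was affected" in lowered:
        problems.append("claims no data was affected; confirm this is verified")
    if not TIMELINE.search(lowered):
        problems.append("gives no timeline for the next update or resolution")
    return problems
```

For example, a draft such as "AWS is probably down, so some services may be affected." would be flagged for naming a vendor, speculating, vague wording, and giving no timeline.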

Anti-Patterns

| Anti-Pattern | Consequence | Fix |
| --- | --- | --- |
| No status page updates | Customers flood support | Update status page within 5 minutes |
| Technical jargon in updates | Customers confused, more support tickets | Plain language, impact-focused |
| "Everything is fine" when it is not | Trust destroyed when truth emerges | Honest, proportionate communication |
| Update only when resolved | Hours of silence = hours of customer panic | Regular updates even if no progress |
| No postmortem published | Customers fear recurrence | Publish postmortem within 48 hours |
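
Two of these anti-patterns, a late first update and long stretches of silence, can be detected after the fact from the timestamps of the updates that were posted. The sketch below is one possible postmortem check; the function name `audit_update_gaps` and its default thresholds (5 minutes for the first update, 30 minutes between updates) are assumptions drawn from the figures used in this playbook.

```python
from datetime import datetime, timedelta


def audit_update_gaps(detected_at, update_times, resolved_at,
                      first_update_minutes=5, max_gap_minutes=30):
    """Return findings about late or missing status page updates."""
    updates = sorted(update_times)
    if not updates:
        return ["no status page updates were posted"]

    findings = []
    first_delay = updates[0] - detected_at
    if first_delay > timedelta(minutes=first_update_minutes):
        minutes = int(first_delay.total_seconds() // 60)
        findings.append(f"first update posted after {minutes} minutes")

    # Compare each update with the next one, ending at the resolution time.
    checkpoints = updates + [resolved_at]
    for earlier, later in zip(checkpoints, checkpoints[1:]):
        gap = later - earlier
        if gap > timedelta(minutes=max_gap_minutes):
            minutes = int(gap.total_seconds() // 60)
            findings.append(
                f"{minutes} minutes of silence after {earlier.strftime('%H:%M')}"
            )
    return findings
```

Running this against the incident timeline during the postmortem gives a factual basis for discussing communication, separate from the technical root cause.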

Incident communication is a trust exercise. During an outage, your status page becomes the most important page on your website. Treat communication with the same urgency as technical resolution.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
