
Incident Management That Learns: From Detection to Post-Mortem

Build an incident management process that gets better over time. Covers severity classification, role assignment during incidents, communication templates, post-mortem culture, and the follow-through systems that turn incidents into improvements.

Every production incident reveals two things: a technical failure and an organizational one. The server crashed because of a memory leak — that is the technical failure. Nobody noticed for 45 minutes because the alert was noisy and ignored — that is the organizational failure. Most teams fix the memory leak and move on. The best teams fix the alerting, the response process, and the handoff that let the memory leak reach production in the first place.

This guide covers how to build an incident management process that does not just fight fires but prevents them from recurring.


Severity Classification

Every incident must be classified within the first 5 minutes. This determines the response.

| Severity | Definition | Response | Communication |
|---|---|---|---|
| SEV-1 | Complete outage or data loss affecting all users | All-hands, incident commander assigned, war room opened | Status page updated every 15 min, exec leadership notified |
| SEV-2 | Major feature degraded, subset of users affected | On-call + relevant team leads, incident commander optional | Status page updated every 30 min, stakeholders notified |
| SEV-3 | Minor feature impact, workaround available | On-call handles independently | Internal notification only |
| SEV-4 | No user impact, internal monitoring triggered | Fix in normal sprint | No external communication |

When in doubt, classify higher. Over-classifying a SEV-3 as a SEV-2 wastes perhaps 30 minutes of a few extra people's time. Under-classifying a SEV-1 as a SEV-2 can cost two hours of under-resourced response and damage customer trust. Err on the side of higher severity.
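
One way to keep the table actionable is to encode it next to your alerting config, so the policy is looked up rather than remembered under pressure. A minimal sketch in Python; the ResponsePolicy structure and policy_for helper are illustrative, not part of any particular tooling.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ResponsePolicy:
    definition: str
    response: str
    status_interval_min: Optional[int]  # None means no external communication

# The severity table above, encoded so tooling can look it up.
SEVERITY_POLICIES = {
    "SEV-1": ResponsePolicy(
        "Complete outage or data loss affecting all users",
        "All-hands, incident commander assigned, war room opened",
        15,
    ),
    "SEV-2": ResponsePolicy(
        "Major feature degraded, subset of users affected",
        "On-call + relevant team leads, incident commander optional",
        30,
    ),
    "SEV-3": ResponsePolicy(
        "Minor feature impact, workaround available",
        "On-call handles independently",
        None,
    ),
    "SEV-4": ResponsePolicy(
        "No user impact, internal monitoring triggered",
        "Fix in normal sprint",
        None,
    ),
}

def policy_for(severity: str) -> ResponsePolicy:
    # When in doubt, err on the side of higher severity.
    return SEVERITY_POLICIES.get(severity.upper(), SEVERITY_POLICIES["SEV-1"])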


Incident Roles

Clear roles prevent the chaos of 12 engineers all trying to fix the same thing while nobody communicates with customers.

┌───────────────────────────────────────────────────┐
│  INCIDENT COMMANDER (IC)                          │
│  Owns the incident. Coordinates. Does NOT debug.  │
│  - Assigns roles                                  │
│  - Manages timeline                               │
│  - Makes decisions when the team is stuck         │
│  - Calls for escalation                           │
├───────────────────────────────────────────────────┤
│  TECHNICAL LEAD                                   │
│  Owns the technical investigation.                │
│  - Drives root cause analysis                     │
│  - Assigns debugging tasks to engineers           │
│  - Proposes mitigation strategies                 │
│  - Implements fixes                               │
├───────────────────────────────────────────────────┤
│  COMMUNICATIONS LEAD                              │
│  Owns all communication — internal and external.  │
│  - Updates status page                            │
│  - Posts to incident Slack channel                │
│  - Drafts customer communications                 │
│  - Shields tech team from stakeholder questions   │
├───────────────────────────────────────────────────┤
│  SCRIBE                                           │
│  Documents everything in real-time.               │
│  - Timeline of events and decisions               │
│  - Actions taken and their results                │
│  - This becomes the post-mortem foundation        │
└───────────────────────────────────────────────────┘

Role Assignment Rules

| Scenario | IC | Tech Lead | Comms |
|---|---|---|---|
| Small team (< 5 eng) | On-call engineer doubles as IC + tech lead | Same person | Engineering manager |
| Medium team (5-20) | Engineering manager or senior engineer | On-call + domain expert | Designated comms person |
| Large org (20+) | Trained IC from rotation | Subject matter expert | Dedicated incident comms team |
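
If you want these rules to be executable, a small helper can suggest a starting lineup from team size and the on-call roster. A rough sketch; the function name and arguments are made up for illustration.

def suggest_roles(team_size: int, on_call: str, manager: str,
                  domain_expert: str, trained_ics: list) -> dict:
    # Mirrors the role assignment rules table above.
    if team_size < 5:
        # On-call doubles as IC and tech lead; the manager handles comms.
        return {"ic": on_call, "tech_lead": on_call, "comms": manager}
    if team_size <= 20:
        return {"ic": manager, "tech_lead": domain_expert,
                "comms": "designated comms person"}
    # Large org: pull a trained IC from the rotation and use the comms team.
    ic = trained_ics[0] if trained_ics else manager
    return {"ic": ic, "tech_lead": domain_expert, "comms": "incident comms team"}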

The Incident Timeline

T+0:00  Alert fires or customer reports issue
T+0:05  On-call acknowledges. Initial assessment: what is broken, who is affected?
T+0:10  Severity classified. IC assigned if SEV-1/SEV-2.
T+0:15  War room opened (Slack channel/video call). Roles assigned.
T+0:20  Status page updated: "We are investigating an issue with [service]"
T+0:30  First mitigation attempt (rollback, restart, failover)
T+1:00  If not resolved: escalate. Pull in additional engineers.
        Status page update with ETA or "still investigating"
T+2:00  If still ongoing: executive escalation for SEV-1
        Consider: is this a full outage that needs customer notification?
T+???   Issue resolved. Monitoring confirms stability for 30 min.
T+end   IC declares incident resolved. Status page updated.
T+24hr  Post-mortem document drafted.
T+72hr  Post-mortem review meeting held.
T+7d    Action items assigned and tracked.
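
The timeline is easier to hold to when something nags the IC at each checkpoint. A minimal sketch that turns the milestones above into overdue reminders; the notify hook is a placeholder for whatever chat or paging tool you use.

from datetime import datetime, timedelta

# Minutes after T+0 at which the IC should be nudged, per the timeline above.
MILESTONES = [
    (5, "Acknowledge the alert and make an initial assessment."),
    (10, "Classify severity; assign an IC for SEV-1/SEV-2."),
    (15, "Open the war room and assign roles."),
    (20, "Post the first status page update."),
    (30, "Attempt first mitigation (rollback, restart, failover)."),
    (60, "If unresolved, escalate and update the status page."),
    (120, "SEV-1: escalate to executives; consider customer notification."),
]

def notify(message: str) -> None:
    print(message)  # placeholder: wire this to your chat or paging tool

def overdue_milestones(started_at: datetime, now: datetime) -> list:
    # Return a reminder for every milestone the incident has already passed.
    elapsed = now - started_at
    return [f"T+{m}m: {text}" for m, text in MILESTONES
            if elapsed >= timedelta(minutes=m)]

if __name__ == "__main__":
    start = datetime.now() - timedelta(minutes=25)
    for reminder in overdue_milestones(start, datetime.now()):
        notify(reminder)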

Communication Templates

Having pre-written templates eliminates the stress of composing messages during an incident.

Internal Slack (Initial Alert)

🔴 INCIDENT DECLARED — SEV-[1/2]

What: [Brief description of the issue]
Impact: [Who is affected and how]
IC: @[name]
Tech Lead: @[name]
War Room: #incident-[date]-[service]

If you are not assigned a role, please stay out of the war room
unless you have specific information to contribute.
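
Because the wording is fixed, the declaration can be posted by a chatbot or a one-line CLI instead of typed by hand. A minimal sketch using a Slack incoming webhook; the webhook URL and the declare_incident helper are placeholders, not existing tooling.

import requests  # third-party: pip install requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def declare_incident(severity: str, what: str, impact: str,
                     ic: str, tech_lead: str, war_room: str) -> None:
    # Fill the initial alert template above and post it to the incident channel.
    text = (
        f":red_circle: INCIDENT DECLARED — {severity}\n\n"
        f"What: {what}\n"
        f"Impact: {impact}\n"
        f"IC: @{ic}\n"
        f"Tech Lead: @{tech_lead}\n"
        f"War Room: {war_room}\n\n"
        "If you are not assigned a role, please stay out of the war room "
        "unless you have specific information to contribute."
    )
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
    response.raise_for_status()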

Status Page Update

Investigating: We are currently investigating an issue affecting
[service/feature]. Some users may experience [symptom]. Our team
is working to resolve this as quickly as possible. We will provide
an update within [30/60] minutes.

Update: We have identified the cause of the issue affecting
[service/feature]. We are implementing a fix and expect service
to be restored within [timeframe].

Resolved: The issue affecting [service/feature] has been resolved.
Service is operating normally. We will publish a detailed incident
report within 72 hours.
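
The status page messages are the same three templates with the brackets filled in, so they can be generated rather than composed under pressure. A small sketch using Python's standard string.Template; nothing here is tied to a specific status page provider.

from string import Template

STATUS_TEMPLATES = {
    "investigating": Template(
        "Investigating: We are currently investigating an issue affecting "
        "$service. Some users may experience $symptom. Our team is working "
        "to resolve this as quickly as possible. We will provide an update "
        "within $interval minutes."
    ),
    "identified": Template(
        "Update: We have identified the cause of the issue affecting "
        "$service. We are implementing a fix and expect service to be "
        "restored within $eta."
    ),
    "resolved": Template(
        "Resolved: The issue affecting $service has been resolved. Service "
        "is operating normally. We will publish a detailed incident report "
        "within 72 hours."
    ),
}

def status_update(phase: str, **fields: str) -> str:
    # Render one of the templates above, e.g.:
    # status_update("investigating", service="checkout",
    #               symptom="failed payments", interval="30")
    return STATUS_TEMPLATES[phase].substitute(**fields)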

Post-Mortem Culture

The post-mortem is where incidents become improvements. Done well, it is the most valuable process in your engineering organization. Done poorly — or not done at all — you are guaranteed to repeat the same failures.

The Blameless Principle

❌ "John deployed without testing, which caused the outage."
✅ "The deployment process allowed changes to reach production without
    automated testing, which resulted in a regression."

The difference:
  - First one blames a person. John will never volunteer information again.
  - Second one identifies a system gap. Now you fix the system.

Post-Mortem Document Template

# Incident Post-Mortem: [Title]

**Date:** [Date of incident]
**Duration:** [Start time - End time]
**Severity:** SEV-[1/2/3]
**IC:** [Name]
**Author:** [Name]

## Summary
[2-3 sentences: what happened, who was affected, how long]

## Impact
- Users affected: [number or percentage]
- Revenue impact: [estimated, if applicable]
- Duration: [minutes/hours]

## Timeline
| Time | Event |
|---|---|
| 14:32 | Monitoring alert fired: API error rate > 5% |
| 14:35 | On-call acknowledged, began investigation |
| 14:42 | Identified root cause: database connection pool exhausted |
| 14:45 | Restarted connection pool. Errors cleared. |
| 14:55 | Confirmed stable. Incident resolved. |

## Root Cause
[Detailed technical explanation of what caused the incident]

## What Went Well
- Alert fired within 3 minutes of impact
- On-call responded and acknowledged within 5 minutes
- Root cause identified quickly due to good logging

## What Went Poorly
- Connection pool monitoring was not in place
- Runbook for this scenario did not exist
- Status page was not updated until T+20

## Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
| Add connection pool monitoring | @alice | 2024-08-01 | ☐ |
| Write runbook for DB connection issues | @bob | 2024-08-05 | ☐ |
| Automate status page updates on SEV-1 | @carol | 2024-08-15 | ☐ |
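
Since the scribe's timeline already exists when the incident closes, the post-mortem draft can be pre-populated rather than started from a blank page. A minimal sketch that emits the skeleton above from incident data; the Incident structure is illustrative.

from dataclasses import dataclass, field

@dataclass
class Incident:
    title: str
    date: str
    duration: str
    severity: str
    ic: str
    author: str
    timeline: list = field(default_factory=list)  # list of (time, event) pairs

def draft_postmortem(inc: Incident) -> str:
    # Pre-fill the header and timeline; leave the analysis sections for humans.
    rows = "\n".join(f"| {t} | {event} |" for t, event in inc.timeline)
    return (
        f"# Incident Post-Mortem: {inc.title}\n\n"
        f"**Date:** {inc.date}\n"
        f"**Duration:** {inc.duration}\n"
        f"**Severity:** {inc.severity}\n"
        f"**IC:** {inc.ic}\n"
        f"**Author:** {inc.author}\n\n"
        "## Summary\n_TODO_\n\n"
        "## Impact\n_TODO_\n\n"
        "## Timeline\n| Time | Event |\n|---|---|\n" + rows + "\n\n"
        "## Root Cause\n_TODO_\n\n"
        "## What Went Well\n_TODO_\n\n"
        "## What Went Poorly\n_TODO_\n\n"
        "## Action Items\n| Action | Owner | Due Date | Status |\n|---|---|---|---|\n"
    )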

The Follow-Through Problem

Most action items from post-mortems die in the backlog. Track them separately and hold the team accountable:

| Metric | Target | What It Tells You |
|---|---|---|
| Post-mortem completion rate | 100% for SEV-1/2 | Are you doing post-mortems? |
| Action item completion rate | > 80% within 30 days | Are you actually learning? |
| Repeat incident rate | Decreasing trend | Are the fixes working? |
| Mean time to detect (MTTD) | Decreasing trend | Is monitoring improving? |
| Mean time to resolve (MTTR) | Decreasing trend | Is response improving? |
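
MTTD and MTTR fall out directly once each incident record carries three timestamps: when impact started, when it was detected, and when it was resolved. A minimal sketch assuming that record shape; the field names are illustrative, and the sample record loosely mirrors the example timeline above.

from datetime import datetime
from statistics import mean

def _mean_minutes(pairs):
    # Average gap in minutes between (earlier, later) datetime pairs.
    return mean((later - earlier).total_seconds() / 60 for earlier, later in pairs)

def mttd(incidents):
    # Mean time to detect: impact start to detection.
    return _mean_minutes((i["impact_started"], i["detected_at"]) for i in incidents)

def mttr(incidents):
    # Mean time to resolve: detection to resolution.
    return _mean_minutes((i["detected_at"], i["resolved_at"]) for i in incidents)

incidents = [
    {"impact_started": datetime(2024, 7, 1, 14, 29),
     "detected_at": datetime(2024, 7, 1, 14, 32),
     "resolved_at": datetime(2024, 7, 1, 14, 55)},
]
print(f"MTTD: {mttd(incidents):.1f} min, MTTR: {mttr(incidents):.1f} min")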

Implementation Checklist

  • Define severity levels with clear, unambiguous criteria
  • Assign and train incident commanders (at least 3 people who can serve as IC)
  • Create a dedicated incident Slack channel template
  • Write communication templates for status page updates
  • Establish the post-mortem process: document within 24 hours, review within 72 hours
  • Enforce blameless post-mortems (no individual blame, only system gaps)
  • Track action items from post-mortems in a separate tracker with due dates
  • Measure MTTD and MTTR trends monthly
  • Run incident response drills quarterly (game days)
  • Review repeat incidents: if the same root cause appears twice, the first fix failed