
Incident Management That Learns: From Detection to Post-Mortem

Build an incident management process that gets better over time. Covers severity classification, role assignment during incidents, communication templates, post-mortem culture, and the follow-through systems that turn incidents into improvements.

Every production incident reveals two things: a technical failure and an organizational one. The server crashed because of a memory leak — that is the technical failure. Nobody noticed for 45 minutes because the alert was noisy and ignored — that is the organizational failure. Most teams fix the memory leak and move on. The best teams fix the alerting, the response process, and the handoff that let the memory leak reach production in the first place.

This guide covers how to build an incident management process that does not just fight fires but prevents them from recurring.


Severity Classification

Every incident must be classified within the first 5 minutes. This determines the response.

| Severity | Definition | Response | Communication |
|---|---|---|---|
| SEV-1 | Complete outage or data loss affecting all users | All-hands, incident commander assigned, war room opened | Status page updated every 15 min, exec leadership notified |
| SEV-2 | Major feature degraded, subset of users affected | On-call + relevant team leads, incident commander optional | Status page updated every 30 min, stakeholders notified |
| SEV-3 | Minor feature impact, workaround available | On-call handles independently | Internal notification only |
| SEV-4 | No user impact, internal monitoring triggered | Fix in normal sprint | No external communication |

When in doubt, classify higher. Over-classifying a SEV-3 as a SEV-2 wastes perhaps 30 minutes of a few extra people's time. Under-classifying a SEV-1 as a SEV-2 can cost two hours of under-resourced response and damage customer trust. Err on the side of higher severity.
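
One way to keep the table actionable is to encode it next to your alerting config, so the policy is looked up rather than remembered under pressure. A minimal sketch in Python; the ResponsePolicy structure and policy_for helper are illustrative, not part of any particular tooling.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ResponsePolicy:
    definition: str
    response: str
    status_interval_min: Optional[int]  # None means no external communication

# The severity table above, encoded so tooling can look it up.
SEVERITY_POLICIES = {
    "SEV-1": ResponsePolicy(
        "Complete outage or data loss affecting all users",
        "All-hands, incident commander assigned, war room opened",
        15,
    ),
    "SEV-2": ResponsePolicy(
        "Major feature degraded, subset of users affected",
        "On-call + relevant team leads, incident commander optional",
        30,
    ),
    "SEV-3": ResponsePolicy(
        "Minor feature impact, workaround available",
        "On-call handles independently",
        None,
    ),
    "SEV-4": ResponsePolicy(
        "No user impact, internal monitoring triggered",
        "Fix in normal sprint",
        None,
    ),
}

def policy_for(severity: str) -> ResponsePolicy:
    # When in doubt, err on the side of higher severity.
    return SEVERITY_POLICIES.get(severity.upper(), SEVERITY_POLICIES["SEV-1"])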


Incident Roles

Clear roles prevent the chaos of 12 engineers all trying to fix the same thing while nobody communicates with customers.

┌───────────────────────────────────────────────────┐
│  INCIDENT COMMANDER (IC)                          │
│  Owns the incident. Coordinates. Does NOT debug.  │
│  - Assigns roles                                  │
│  - Manages timeline                               │
│  - Makes decisions when the team is stuck         │
│  - Calls for escalation                           │
├───────────────────────────────────────────────────┤
│  TECHNICAL LEAD                                   │
│  Owns the technical investigation.                │
│  - Drives root cause analysis                     │
│  - Assigns debugging tasks to engineers           │
│  - Proposes mitigation strategies                 │
│  - Implements fixes                               │
├───────────────────────────────────────────────────┤
│  COMMUNICATIONS LEAD                              │
│  Owns all communication — internal and external.  │
│  - Updates status page                            │
│  - Posts to incident Slack channel                │
│  - Drafts customer communications                 │
│  - Shields tech team from stakeholder questions   │
├───────────────────────────────────────────────────┤
│  SCRIBE                                           │
│  Documents everything in real-time.               │
│  - Timeline of events and decisions               │
│  - Actions taken and their results                │
│  - This becomes the post-mortem foundation        │
└───────────────────────────────────────────────────┘

Role Assignment Rules

| Scenario | IC | Tech Lead | Comms |
|---|---|---|---|
| Small team (< 5 eng) | On-call engineer doubles as IC + tech lead | Same person | Engineering manager |
| Medium team (5-20) | Engineering manager or senior engineer | On-call + domain expert | Designated comms person |
| Large org (20+) | Trained IC from rotation | Subject matter expert | Dedicated incident comms team |
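
If you want these rules to be executable, a small helper can suggest a starting lineup from team size and the on-call roster. A rough sketch; the function name and arguments are made up for illustration.

def suggest_roles(team_size: int, on_call: str, manager: str,
                  domain_expert: str, trained_ics: list) -> dict:
    # Mirrors the role assignment rules table above.
    if team_size < 5:
        # On-call doubles as IC and tech lead; the manager handles comms.
        return {"ic": on_call, "tech_lead": on_call, "comms": manager}
    if team_size <= 20:
        return {"ic": manager, "tech_lead": domain_expert,
                "comms": "designated comms person"}
    # Large org: pull a trained IC from the rotation and use the comms team.
    ic = trained_ics[0] if trained_ics else manager
    return {"ic": ic, "tech_lead": domain_expert, "comms": "incident comms team"}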

The Incident Timeline

T+0:00  Alert fires or customer reports issue
T+0:05  On-call acknowledges. Initial assessment: what is broken, who is affected?
T+0:10  Severity classified. IC assigned if SEV-1/SEV-2.
T+0:15  War room opened (Slack channel/video call). Roles assigned.
T+0:20  Status page updated: "We are investigating an issue with [service]"
T+0:30  First mitigation attempt (rollback, restart, failover)
T+1:00  If not resolved: escalate. Pull in additional engineers.
        Status page update with ETA or "still investigating"
T+2:00  If still ongoing: executive escalation for SEV-1
        Consider: is this a full outage that needs customer notification?
T+???   Issue resolved. Monitoring confirms stability for 30 min.
T+end   IC declares incident resolved. Status page updated.
T+24hr  Post-mortem document drafted.
T+72hr  Post-mortem review meeting held.
T+7d    Action items assigned and tracked.
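
The timeline is easier to hold to when something nags the IC at each checkpoint. A minimal sketch that turns the milestones above into overdue reminders; the notify hook is a placeholder for whatever chat or paging tool you use.

from datetime import datetime, timedelta

# Minutes after T+0 at which the IC should be nudged, per the timeline above.
MILESTONES = [
    (5, "Acknowledge the alert and make an initial assessment."),
    (10, "Classify severity; assign an IC for SEV-1/SEV-2."),
    (15, "Open the war room and assign roles."),
    (20, "Post the first status page update."),
    (30, "Attempt first mitigation (rollback, restart, failover)."),
    (60, "If unresolved, escalate and update the status page."),
    (120, "SEV-1: escalate to executives; consider customer notification."),
]

def notify(message: str) -> None:
    print(message)  # placeholder: wire this to your chat or paging tool

def overdue_milestones(started_at: datetime, now: datetime) -> list:
    # Return a reminder for every milestone the incident has already passed.
    elapsed = now - started_at
    return [f"T+{m}m: {text}" for m, text in MILESTONES
            if elapsed >= timedelta(minutes=m)]

if __name__ == "__main__":
    start = datetime.now() - timedelta(minutes=25)
    for reminder in overdue_milestones(start, datetime.now()):
        notify(reminder)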

Communication Templates

Having pre-written templates eliminates the stress of composing messages during an incident.

Internal Slack (Initial Alert)

🔴 INCIDENT DECLARED — SEV-[1/2]

What: [Brief description of the issue]
Impact: [Who is affected and how]
IC: @[name]
Tech Lead: @[name]
War Room: #incident-[date]-[service]

If you are not assigned a role, please stay out of the war room
unless you have specific information to contribute.
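
Because the wording is fixed, the declaration can be posted by a chatbot or a one-line CLI instead of typed by hand. A minimal sketch using a Slack incoming webhook; the webhook URL and the declare_incident helper are placeholders, not existing tooling.

import requests  # third-party: pip install requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def declare_incident(severity: str, what: str, impact: str,
                     ic: str, tech_lead: str, war_room: str) -> None:
    # Fill the initial alert template above and post it to the incident channel.
    text = (
        f":red_circle: INCIDENT DECLARED — {severity}\n\n"
        f"What: {what}\n"
        f"Impact: {impact}\n"
        f"IC: @{ic}\n"
        f"Tech Lead: @{tech_lead}\n"
        f"War Room: {war_room}\n\n"
        "If you are not assigned a role, please stay out of the war room "
        "unless you have specific information to contribute."
    )
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
    response.raise_for_status()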

Status Page Update

Investigating: We are currently investigating an issue affecting
[service/feature]. Some users may experience [symptom]. Our team
is working to resolve this as quickly as possible. We will provide
an update within [30/60] minutes.

Update: We have identified the cause of the issue affecting
[service/feature]. We are implementing a fix and expect service
to be restored within [timeframe].

Resolved: The issue affecting [service/feature] has been resolved.
Service is operating normally. We will publish a detailed incident
report within 72 hours.
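
The status page messages are the same three templates with the brackets filled in, so they can be generated rather than composed under pressure. A small sketch using Python's standard string.Template; nothing here is tied to a specific status page provider.

from string import Template

STATUS_TEMPLATES = {
    "investigating": Template(
        "Investigating: We are currently investigating an issue affecting "
        "$service. Some users may experience $symptom. Our team is working "
        "to resolve this as quickly as possible. We will provide an update "
        "within $interval minutes."
    ),
    "identified": Template(
        "Update: We have identified the cause of the issue affecting "
        "$service. We are implementing a fix and expect service to be "
        "restored within $eta."
    ),
    "resolved": Template(
        "Resolved: The issue affecting $service has been resolved. Service "
        "is operating normally. We will publish a detailed incident report "
        "within 72 hours."
    ),
}

def status_update(phase: str, **fields: str) -> str:
    # Render one of the templates above, e.g.:
    # status_update("investigating", service="checkout",
    #               symptom="failed payments", interval="30")
    return STATUS_TEMPLATES[phase].substitute(**fields)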

Post-Mortem Culture

The post-mortem is where incidents become improvements. Done well, it is the most valuable process in your engineering organization. Done poorly — or not done at all — you are guaranteed to repeat the same failures.

The Blameless Principle

❌ "John deployed without testing, which caused the outage."
✅ "The deployment process allowed changes to reach production without
    automated testing, which resulted in a regression."

The difference:
  - First one blames a person. John will never volunteer information again.
  - Second one identifies a system gap. Now you fix the system.

Post-Mortem Document Template

# Incident Post-Mortem: [Title]

**Date:** [Date of incident]
**Duration:** [Start time - End time]
**Severity:** SEV-[1/2/3]
**IC:** [Name]
**Author:** [Name]

## Summary
[2-3 sentences: what happened, who was affected, how long]

## Impact
- Users affected: [number or percentage]
- Revenue impact: [estimated, if applicable]
- Duration: [minutes/hours]

## Timeline
| Time | Event |
|---|---|
| 14:32 | Monitoring alert fired: API error rate > 5% |
| 14:35 | On-call acknowledged, began investigation |
| 14:42 | Identified root cause: database connection pool exhausted |
| 14:45 | Restarted connection pool. Errors cleared. |
| 14:55 | Confirmed stable. Incident resolved. |

## Root Cause
[Detailed technical explanation of what caused the incident]

## What Went Well
- Alert fired within 3 minutes of impact
- On-call responded and acknowledged within 5 minutes
- Root cause identified quickly due to good logging

## What Went Poorly
- Connection pool monitoring was not in place
- Runbook for this scenario did not exist
- Status page was not updated until T+20

## Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
| Add connection pool monitoring | @alice | 2024-08-01 | ☐ |
| Write runbook for DB connection issues | @bob | 2024-08-05 | ☐ |
| Automate status page updates on SEV-1 | @carol | 2024-08-15 | ☐ |
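
Since the scribe's timeline already exists when the incident closes, the post-mortem draft can be pre-populated rather than started from a blank page. A minimal sketch that emits the skeleton above from incident data; the Incident structure is illustrative.

from dataclasses import dataclass, field

@dataclass
class Incident:
    title: str
    date: str
    duration: str
    severity: str
    ic: str
    author: str
    timeline: list = field(default_factory=list)  # list of (time, event) pairs

def draft_postmortem(inc: Incident) -> str:
    # Pre-fill the header and timeline; leave the analysis sections for humans.
    rows = "\n".join(f"| {t} | {event} |" for t, event in inc.timeline)
    return (
        f"# Incident Post-Mortem: {inc.title}\n\n"
        f"**Date:** {inc.date}\n"
        f"**Duration:** {inc.duration}\n"
        f"**Severity:** {inc.severity}\n"
        f"**IC:** {inc.ic}\n"
        f"**Author:** {inc.author}\n\n"
        "## Summary\n_TODO_\n\n"
        "## Impact\n_TODO_\n\n"
        "## Timeline\n| Time | Event |\n|---|---|\n" + rows + "\n\n"
        "## Root Cause\n_TODO_\n\n"
        "## What Went Well\n_TODO_\n\n"
        "## What Went Poorly\n_TODO_\n\n"
        "## Action Items\n| Action | Owner | Due Date | Status |\n|---|---|---|---|\n"
    )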

The Follow-Through Problem

Most action items from post-mortems die in the backlog. Track them separately and hold the team accountable:

| Metric | Target | What It Tells You |
|---|---|---|
| Post-mortem completion rate | 100% for SEV-1/2 | Are you doing post-mortems? |
| Action item completion rate | > 80% within 30 days | Are you actually learning? |
| Repeat incident rate | Decreasing trend | Are the fixes working? |
| Mean time to detect (MTTD) | Decreasing trend | Is monitoring improving? |
| Mean time to resolve (MTTR) | Decreasing trend | Is response improving? |
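
MTTD and MTTR fall out directly once each incident record carries three timestamps: when impact started, when it was detected, and when it was resolved. A minimal sketch assuming that record shape; the field names are illustrative, and the sample record loosely mirrors the example timeline above.

from datetime import datetime
from statistics import mean

def _mean_minutes(pairs):
    # Average gap in minutes between (earlier, later) datetime pairs.
    return mean((later - earlier).total_seconds() / 60 for earlier, later in pairs)

def mttd(incidents):
    # Mean time to detect: impact start to detection.
    return _mean_minutes((i["impact_started"], i["detected_at"]) for i in incidents)

def mttr(incidents):
    # Mean time to resolve: detection to resolution.
    return _mean_minutes((i["detected_at"], i["resolved_at"]) for i in incidents)

incidents = [
    {"impact_started": datetime(2024, 7, 1, 14, 29),
     "detected_at": datetime(2024, 7, 1, 14, 32),
     "resolved_at": datetime(2024, 7, 1, 14, 55)},
]
print(f"MTTD: {mttd(incidents):.1f} min, MTTR: {mttr(incidents):.1f} min")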

Implementation Checklist

  • Define severity levels with clear, unambiguous criteria
  • Assign and train incident commanders (at least 3 people who can serve as IC)
  • Create a dedicated incident Slack channel template
  • Write communication templates for status page updates
  • Establish the post-mortem process: document within 24 hours, review within 72 hours
  • Enforce blameless post-mortems (no individual blame, only system gaps)
  • Track action items from post-mortems in a separate tracker with due dates
  • Measure MTTD and MTTR trends monthly
  • Run incident response drills quarterly (game days)
  • Review repeat incidents: if the same root cause appears twice, the first fix failed