# Incident Management
Handle production incidents effectively from detection through resolution to postmortem. Covers incident severity classification, the incident commander role, communication templates, timeline documentation, and how to build an incident management program that improves with every incident.
Every production system will have incidents. The difference between an organization that handles them well and one that does not is not the absence of failure — it is the presence of a structured, practiced, and continuously improving incident management process.
## Severity Classification
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV-1 | Full outage, data loss, security breach | Immediate (< 5 min) | Site down, payment processing failed |
| SEV-2 | Major feature degraded, significant user impact | < 15 min | Search broken, slow response times |
| SEV-3 | Minor feature degraded, workaround exists | < 1 hour | Admin panel slow, non-critical job failures |
| SEV-4 | Cosmetic, no user impact | Next business day | Typo in email, log noise |
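The severity matrix above can be encoded directly, so paging logic is driven by policy rather than judgment calls at 3 AM. This is a minimal sketch; the table and threshold names mirror the matrix, and the "page immediately" rule (sub-15-minute SLA) is an illustrative assumption, not a standard.

```python
# Hypothetical severity policy table; levels and response times come
# from the severity matrix, the paging rule is an assumption.
RESPONSE_SLA_MINUTES = {
    "SEV-1": 5,      # full outage, data loss, security breach
    "SEV-2": 15,     # major feature degraded, significant user impact
    "SEV-3": 60,     # minor feature degraded, workaround exists
    "SEV-4": None,   # cosmetic -- next business day, no paging SLA
}

def requires_immediate_page(severity: str) -> bool:
    """Page a human right away for anything with a sub-15-minute SLA."""
    sla = RESPONSE_SLA_MINUTES[severity]
    return sla is not None and sla <= 15

print(requires_immediate_page("SEV-1"))  # True
print(requires_immediate_page("SEV-3"))  # False
```

Keeping this table in code (or config) means alerting tools, runbooks, and dashboards all agree on what each severity demands.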
## Incident Roles

### Incident Commander (IC)
Responsibilities:
- Coordinates response (does NOT debug)
- Delegates investigation tasks
- Makes escalation decisions
- Communicates status externally
- Decides when to declare "resolved"
Not responsible for:
- Writing code
- Debugging logs
- Making architectural decisions under pressure
### Technical Lead
Responsibilities:
- Leads technical investigation
- Proposes and implements fixes
- Advises IC on blast radius and risk
- Coordinates with engineers working the issue
### Communications Lead
Responsibilities:
- Updates status page
- Sends customer notifications
- Updates internal stakeholders
- Posts in incident channel at regular intervals
## Incident Lifecycle

### Phase 1: Detection (0-5 min)
- Alert fires or customer reports issue
- On-call acknowledges
- Initial assessment: severity + impact
### Phase 2: Triage (5-15 min)
- Incident declared (Slack channel created)
- Roles assigned (IC, Tech Lead, Comms)
- Hypothesis formed based on symptoms
### Phase 3: Investigation (15-60 min)
- Systematic debugging
- Dashboard and log review
- Root cause identification
- Impact scope defined
### Phase 4: Mitigation (varies)
- Implement fix (rollback, config change, hotfix)
- Verify fix resolves user-facing symptoms
- Monitor recovery
### Phase 5: Resolution
- Confirm all systems nominal
- Update status page to "resolved"
- Schedule postmortem
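The five phases above form a simple state machine, which some teams encode in their incident tooling so the bot can prompt for the right next step. This is a sketch under assumptions: the phase names come from this section, but the allowed-transition rules (e.g. a failed mitigation reopening investigation) are illustrative, not prescriptive.

```python
# Illustrative lifecycle state machine; transition rules are assumptions.
ALLOWED = {
    "detection":     {"triage"},
    "triage":        {"investigation"},
    "investigation": {"mitigation"},
    "mitigation":    {"resolution", "investigation"},  # a failed fix reopens investigation
    "resolution":    set(),
}

class Incident:
    def __init__(self):
        self.phase = "detection"
        self.timeline = [("detection", "alert fired / customer report")]

    def advance(self, next_phase: str, note: str = ""):
        if next_phase not in ALLOWED[self.phase]:
            raise ValueError(f"cannot move {self.phase} -> {next_phase}")
        self.phase = next_phase
        self.timeline.append((next_phase, note))  # doubles as the timeline doc

inc = Incident()
inc.advance("triage", "SEV-1 declared, roles assigned")
inc.advance("investigation", "connection pool theory")
inc.advance("mitigation", "restarting pods")
inc.advance("resolution", "error rate nominal")
print(inc.phase)  # resolution
```

A side benefit: recording each transition with a note produces the timestamped timeline the postmortem needs, for free.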
## Communication Templates

### Internal Update (Every 15-30 min)
```text
🔴 SEV-1 Incident: Checkout failures
Status: Investigating
Impact: ~30% of checkout attempts failing with 500 errors
Duration: 23 minutes so far
Team: @alice (IC), @bob (Tech Lead), @carol (Comms)
Current theory: Database connection pool exhaustion
Next action: Restarting order-service pods
ETA to next update: 15 minutes
```
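Generating the internal update from structured fields keeps every update in the same shape, which matters when people skim a busy channel. A minimal sketch, assuming nothing about any chat platform's API; the field names are illustrative.

```python
# Hedged sketch: format the internal update template from structured
# fields. Field names are illustrative, not any tool's schema.
def format_internal_update(sev, title, status, impact, minutes,
                           team, theory, next_action, eta_min=15):
    return "\n".join([
        f"🔴 {sev} Incident: {title}",
        f"Status: {status}",
        f"Impact: {impact}",
        f"Duration: {minutes} minutes so far",
        f"Team: {team}",
        f"Current theory: {theory}",
        f"Next action: {next_action}",
        f"ETA to next update: {eta_min} minutes",
    ])

print(format_internal_update(
    "SEV-1", "Checkout failures", "Investigating",
    "~30% of checkout attempts failing with 500 errors",
    23, "@alice (IC), @bob (Tech Lead), @carol (Comms)",
    "Database connection pool exhaustion",
    "Restarting order-service pods"))
```

The Comms Lead fills in the fields; the template guarantees no update ever omits impact, current theory, or the next-update ETA.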
### Customer Status Page

```text
Investigating: Checkout Issues

We are investigating reports of checkout failures. Some customers
may experience errors when completing purchases. Our team is
actively working on a resolution.

Posted: 3:45 PM EDT
Next update: 4:15 PM EDT
```
## Postmortem

### Structure
```markdown
# Incident Postmortem: Checkout Failures (2026-03-04)

## Summary
Checkout failures affecting 30% of purchases for 47 minutes.

## Timeline
15:23 - Alert: order-service error rate > 5%
15:25 - On-call acknowledges, begins investigation
15:30 - IC declares SEV-1, incident channel created
15:35 - Root cause identified: connection pool exhausted
15:38 - Mitigation: order-service pods restarted
15:42 - Error rate returning to normal
16:10 - Resolved, all systems nominal

## Root Cause
Connection leak in new ORM version deployed at 14:00.
Connections were not being returned on timeout errors.

## Impact
- 1,247 failed checkout attempts
- Estimated revenue impact: $45,000
- 47 minutes of degraded service

## Action Items
- [ ] Fix: Upgrade ORM to patched version (P0, @bob, by 03/06)
- [ ] Detect: Add connection pool utilization alert (P1, @carol, by 03/07)
- [ ] Prevent: Add connection leak detection to CI (P2, @dave, by 03/14)
- [ ] Process: Add ORM to critical dependency test matrix (P2, @alice, by 03/14)
```
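The root cause in the example, connections not returned on timeout errors, is a classic leak pattern: the release happens on the happy path only. A toy sketch of the bug class and its fix, assuming nothing about the actual ORM; `ConnectionPool` here is a stand-in, not a real library API.

```python
import contextlib
import queue

# Toy stand-in for a connection pool; not a real ORM's API.
class ConnectionPool:
    def __init__(self, size=2):
        self._pool = queue.Queue()
        for i in range(size):
            self._pool.put(f"conn-{i}")

    def acquire(self, timeout=1.0):
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

    @contextlib.contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn          # the query may raise (e.g. TimeoutError) here
        finally:
            self.release(conn)  # returned even on errors -- this is the fix

pool = ConnectionPool(size=2)
for _ in range(10):  # with the leak, 10 timeouts would exhaust a 2-conn pool
    with contextlib.suppress(TimeoutError):
        with pool.connection():
            raise TimeoutError("simulated slow query")
print(pool._pool.qsize())  # 2 -- every connection came back
```

The leaky variant releases the connection only after a successful query, so each timeout permanently removes one connection; under load, that exhausts the pool in minutes, matching the incident's symptoms.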
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No defined severity levels | Every incident is a fire drill | Define levels up front; match the response to the severity |
| IC also debugging | Nobody coordinating response | Separate coordination from investigation |
| Blame individuals | People hide mistakes | Blameless postmortems |
| No postmortem | Same incident recurs | Postmortem + action items after every SEV-1/2 |
| Action items without owners/dates | Items never completed | Owner + due date + tracking |
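The "action items without owners/dates" anti-pattern is easy to lint mechanically. A small sketch that checks checklist lines against the `(P?, @owner, by MM/DD)` convention used in the postmortem example above; the regex is an illustrative assumption, adjust it to your own template.

```python
import re

# Matches the action-item style from the postmortem example:
# "- [ ] Fix: ... (P0, @bob, by 03/06)". Illustrative, not a standard.
ITEM_RE = re.compile(r"- \[[ x]\] .*\(P[0-3], @\w+, by \d{2}/\d{2}\)")

def missing_owner_or_date(lines):
    """Return checklist lines lacking a priority, owner, or due date."""
    return [line for line in lines
            if line.strip().startswith("- [")
            and not ITEM_RE.fullmatch(line.strip())]

items = [
    "- [ ] Fix: Upgrade ORM to patched version (P0, @bob, by 03/06)",
    "- [ ] Detect: Add connection pool utilization alert (P1)",  # no owner/date
]
print(missing_owner_or_date(items))  # flags only the second item
```

Run this in CI against the postmortem repo and incomplete action items get caught before the document is merged, not months later.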
Incident management is a skill. Like any skill, it improves with practice. Run tabletop exercises, review postmortems, and treat every incident as an opportunity to make the system and the process better.