# Incident Management
Handle production incidents effectively from detection through resolution to postmortem. Covers incident severity classification, the incident commander role, communication templates, timeline documentation, and how to build an incident management program that improves with every incident.
Every production system will have incidents. The difference between an organization that handles them well and one that does not is not the absence of failure — it is the presence of a structured, practiced, and continuously improving incident management process.
## Severity Classification
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV-1 | Full outage, data loss, security breach | Immediate (< 5 min) | Site down, payment processing failed |
| SEV-2 | Major feature degraded, significant user impact | < 15 min | Search broken, slow response times |
| SEV-3 | Minor feature degraded, workaround exists | < 1 hour | Admin panel slow, non-critical job failures |
| SEV-4 | Cosmetic, no user impact | Next business day | Typo in email, log noise |
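The severity matrix above can be encoded directly, so paging logic is driven by policy rather than judgment calls at 3 AM. This is a minimal sketch; the table and threshold names mirror the matrix, and the "page immediately" rule (sub-15-minute SLA) is an illustrative assumption, not a standard.

```python
# Hypothetical severity policy table; levels and response times come
# from the severity matrix, the paging rule is an assumption.
RESPONSE_SLA_MINUTES = {
    "SEV-1": 5,      # full outage, data loss, security breach
    "SEV-2": 15,     # major feature degraded, significant user impact
    "SEV-3": 60,     # minor feature degraded, workaround exists
    "SEV-4": None,   # cosmetic -- next business day, no paging SLA
}

def requires_immediate_page(severity: str) -> bool:
    """Page a human right away for anything with a sub-15-minute SLA."""
    sla = RESPONSE_SLA_MINUTES[severity]
    return sla is not None and sla <= 15

print(requires_immediate_page("SEV-1"))  # True
print(requires_immediate_page("SEV-3"))  # False
```

Keeping this table in code (or config) means alerting tools, runbooks, and dashboards all agree on what each severity demands.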
## Incident Roles

### Incident Commander (IC)
Responsibilities:
- Coordinates response (does NOT debug)
- Delegates investigation tasks
- Makes escalation decisions
- Communicates status externally
- Decides when to declare "resolved"
Not responsible for:
- Writing code
- Debugging logs
- Making architectural decisions under pressure
### Technical Lead
Responsibilities:
- Leads technical investigation
- Proposes and implements fixes
- Advises IC on blast radius and risk
- Coordinates with engineers working the issue
### Communications Lead
Responsibilities:
- Updates status page
- Sends customer notifications
- Updates internal stakeholders
- Posts in incident channel at regular intervals
## Incident Lifecycle

### Phase 1: Detection (0-5 min)
- Alert fires or customer reports issue
- On-call acknowledges
- Initial assessment: severity + impact
### Phase 2: Triage (5-15 min)
- Incident declared (Slack channel created)
- Roles assigned (IC, Tech Lead, Comms)
- Hypothesis formed based on symptoms
### Phase 3: Investigation (15-60 min)
- Systematic debugging
- Dashboard and log review
- Root cause identification
- Impact scope defined
### Phase 4: Mitigation (varies)
- Implement fix (rollback, config change, hotfix)
- Verify fix resolves user-facing symptoms
- Monitor recovery
### Phase 5: Resolution
- Confirm all systems nominal
- Update status page to "resolved"
- Schedule postmortem
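The five phases above form a simple state machine, which some teams encode in their incident tooling so the bot can prompt for the right next step. This is a sketch under assumptions: the phase names come from this section, but the allowed-transition rules (e.g. a failed mitigation reopening investigation) are illustrative, not prescriptive.

```python
# Illustrative lifecycle state machine; transition rules are assumptions.
ALLOWED = {
    "detection":     {"triage"},
    "triage":        {"investigation"},
    "investigation": {"mitigation"},
    "mitigation":    {"resolution", "investigation"},  # a failed fix reopens investigation
    "resolution":    set(),
}

class Incident:
    def __init__(self):
        self.phase = "detection"
        self.timeline = [("detection", "alert fired / customer report")]

    def advance(self, next_phase: str, note: str = ""):
        if next_phase not in ALLOWED[self.phase]:
            raise ValueError(f"cannot move {self.phase} -> {next_phase}")
        self.phase = next_phase
        self.timeline.append((next_phase, note))  # doubles as the timeline doc

inc = Incident()
inc.advance("triage", "SEV-1 declared, roles assigned")
inc.advance("investigation", "connection pool theory")
inc.advance("mitigation", "restarting pods")
inc.advance("resolution", "error rate nominal")
print(inc.phase)  # resolution
```

A side benefit: recording each transition with a note produces the timestamped timeline the postmortem needs, for free.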
## Communication Templates

### Internal Update (Every 15-30 min)
```text
🔴 SEV-1 Incident: Checkout failures
Status: Investigating
Impact: ~30% of checkout attempts failing with 500 errors
Duration: 23 minutes so far
Team: @alice (IC), @bob (Tech Lead), @carol (Comms)
Current theory: Database connection pool exhaustion
Next action: Restarting order-service pods
ETA to next update: 15 minutes
```
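Generating the internal update from structured fields keeps every update in the same shape, which matters when people skim a busy channel. A minimal sketch, assuming nothing about any chat platform's API; the field names are illustrative.

```python
# Hedged sketch: format the internal update template from structured
# fields. Field names are illustrative, not any tool's schema.
def format_internal_update(sev, title, status, impact, minutes,
                           team, theory, next_action, eta_min=15):
    return "\n".join([
        f"🔴 {sev} Incident: {title}",
        f"Status: {status}",
        f"Impact: {impact}",
        f"Duration: {minutes} minutes so far",
        f"Team: {team}",
        f"Current theory: {theory}",
        f"Next action: {next_action}",
        f"ETA to next update: {eta_min} minutes",
    ])

print(format_internal_update(
    "SEV-1", "Checkout failures", "Investigating",
    "~30% of checkout attempts failing with 500 errors",
    23, "@alice (IC), @bob (Tech Lead), @carol (Comms)",
    "Database connection pool exhaustion",
    "Restarting order-service pods"))
```

The Comms Lead fills in the fields; the template guarantees no update ever omits impact, current theory, or the next-update ETA.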
### Customer Status Page

```text
Investigating: Checkout Issues

We are investigating reports of checkout failures. Some customers
may experience errors when completing purchases. Our team is
actively working on a resolution.

Posted: 3:45 PM EDT
Next update: 4:15 PM EDT
```
## Postmortem

### Structure
```markdown
# Incident Postmortem: Checkout Failures (2026-03-04)

## Summary
Checkout failures affecting 30% of purchases for 47 minutes.

## Timeline
15:23 - Alert: order-service error rate > 5%
15:25 - On-call acknowledges, begins investigation
15:30 - IC declares SEV-1, incident channel created
15:35 - Root cause identified: connection pool exhausted
15:38 - Mitigation: order-service pods restarted
15:42 - Error rate returning to normal
16:10 - Resolved, all systems nominal

## Root Cause
Connection leak in new ORM version deployed at 14:00.
Connections were not being returned on timeout errors.

## Impact
- 1,247 failed checkout attempts
- Estimated revenue impact: $45,000
- 47 minutes of degraded service

## Action Items
- [ ] Fix: Upgrade ORM to patched version (P0, @bob, by 03/06)
- [ ] Detect: Add connection pool utilization alert (P1, @carol, by 03/07)
- [ ] Prevent: Add connection leak detection to CI (P2, @dave, by 03/14)
- [ ] Process: Add ORM to critical dependency test matrix (P2, @alice, by 03/14)
```
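The root cause in the example, connections not returned on timeout errors, is a classic leak pattern: the release happens on the happy path only. A toy sketch of the bug class and its fix, assuming nothing about the actual ORM; `ConnectionPool` here is a stand-in, not a real library API.

```python
import contextlib
import queue

# Toy stand-in for a connection pool; not a real ORM's API.
class ConnectionPool:
    def __init__(self, size=2):
        self._pool = queue.Queue()
        for i in range(size):
            self._pool.put(f"conn-{i}")

    def acquire(self, timeout=1.0):
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

    @contextlib.contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn          # the query may raise (e.g. TimeoutError) here
        finally:
            self.release(conn)  # returned even on errors -- this is the fix

pool = ConnectionPool(size=2)
for _ in range(10):  # with the leak, 10 timeouts would exhaust a 2-conn pool
    with contextlib.suppress(TimeoutError):
        with pool.connection():
            raise TimeoutError("simulated slow query")
print(pool._pool.qsize())  # 2 -- every connection came back
```

The leaky variant releases the connection only after a successful query, so each timeout permanently removes one connection; under load, that exhausts the pool in minutes, matching the incident's symptoms.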
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No defined severity levels | Every incident is a fire drill | Define levels up front; match the response to the severity |
| IC also debugging | Nobody coordinating response | Separate coordination from investigation |
| Blame individuals | People hide mistakes | Blameless postmortems |
| No postmortem | Same incident recurs | Postmortem + action items after every SEV-1/2 |
| Action items without owners/dates | Items never completed | Owner + due date + tracking |
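The "action items without owners/dates" anti-pattern is easy to lint mechanically. A small sketch that checks checklist lines against the `(P?, @owner, by MM/DD)` convention used in the postmortem example above; the regex is an illustrative assumption, adjust it to your own template.

```python
import re

# Matches the action-item style from the postmortem example:
# "- [ ] Fix: ... (P0, @bob, by 03/06)". Illustrative, not a standard.
ITEM_RE = re.compile(r"- \[[ x]\] .*\(P[0-3], @\w+, by \d{2}/\d{2}\)")

def missing_owner_or_date(lines):
    """Return checklist lines lacking a priority, owner, or due date."""
    return [line for line in lines
            if line.strip().startswith("- [")
            and not ITEM_RE.fullmatch(line.strip())]

items = [
    "- [ ] Fix: Upgrade ORM to patched version (P0, @bob, by 03/06)",
    "- [ ] Detect: Add connection pool utilization alert (P1)",  # no owner/date
]
print(missing_owner_or_date(items))  # flags only the second item
```

Run this in CI against the postmortem repo and incomplete action items get caught before the document is merged, not months later.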
Incident management is a skill. Like any skill, it improves with practice. Run tabletop exercises, review postmortems, and treat every incident as an opportunity to make the system and the process better.