
Incident Management

Handle production incidents effectively, from detection through resolution to postmortem. Covers severity classification, the incident commander role, communication templates, timeline documentation, and building an incident management program that improves with every incident.

Every production system will have incidents. The difference between an organization that handles them well and one that does not is not the absence of failure — it is the presence of a structured, practiced, and continuously improving incident management process.


Severity Classification

| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV-1 | Full outage, data loss, security breach | Immediate (< 5 min) | Site down, payment processing failed |
| SEV-2 | Major feature degraded, significant user impact | < 15 min | Search broken, slow response times |
| SEV-3 | Minor feature degraded, workaround exists | < 1 hour | Admin panel slow, non-critical job failures |
| SEV-4 | Cosmetic, no user impact | Next business day | Typo in email, log noise |
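
If paging or ticketing tooling needs to act on these levels, encoding the table in one place keeps the runbook and the automation from drifting apart. A minimal sketch in Python — the enum and SLA values simply restate the table above; nothing here is prescribed by any particular tool:

```python
from datetime import timedelta
from enum import Enum

class Severity(Enum):
    SEV1 = "SEV-1"  # full outage, data loss, security breach
    SEV2 = "SEV-2"  # major feature degraded, significant user impact
    SEV3 = "SEV-3"  # minor feature degraded, workaround exists
    SEV4 = "SEV-4"  # cosmetic, no user impact

# Maximum time from alert to first response, mirroring the table.
# None means "next business day" rather than a hard timer.
RESPONSE_SLA = {
    Severity.SEV1: timedelta(minutes=5),
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
    Severity.SEV4: None,
}

def sla_breached(severity: Severity, elapsed: timedelta) -> bool:
    """True if time-to-response has exceeded this severity's SLA."""
    sla = RESPONSE_SLA[severity]
    return sla is not None and elapsed > sla
```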

Incident Roles

Incident Commander (IC)

Responsibilities:
  - Coordinates response (does NOT debug)
  - Delegates investigation tasks
  - Makes escalation decisions
  - Communicates status externally
  - Decides when to declare "resolved"
  
Not responsible for:
  - Writing code
  - Debugging logs
  - Making architectural decisions under pressure

Technical Lead

Responsibilities:
  - Leads technical investigation
  - Proposes and implements fixes
  - Advises IC on blast radius and risk
  - Coordinates with engineers working the issue

Communications Lead

Responsibilities:
  - Updates status page
  - Sends customer notifications
  - Updates internal stakeholders
  - Posts in incident channel at regular intervals
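
One way to enforce this role split is to make all three roles required fields when an incident is declared. A hypothetical record type — field names are illustrative, not any specific platform's schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Incident:
    title: str
    severity: str           # "SEV-1" .. "SEV-4", per the table above
    declared_at: datetime   # timezone-aware declaration timestamp
    commander: str          # coordinates the response, does not debug
    tech_lead: str          # leads the technical investigation
    comms_lead: str         # owns the status page and stakeholder updates
    channel: str = ""       # e.g. "#inc-checkout-failures"
```

Because `commander`, `tech_lead`, and `comms_lead` have no defaults, an incident cannot be declared in this scheme without all three roles filled.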

Incident Lifecycle

Phase 1: Detection (0-5 min)
  - Alert fires or customer reports issue
  - On-call acknowledges
  - Initial assessment: severity + impact

Phase 2: Triage (5-15 min)
  - Incident declared (Slack channel created)
  - Roles assigned (IC, Tech Lead, Comms)
  - Hypothesis formed based on symptoms

Phase 3: Investigation (15-60 min)
  - Systematic debugging
  - Dashboard and log review
  - Root cause identification
  - Impact scope defined

Phase 4: Mitigation (varies)
  - Implement fix (rollback, config change, hotfix)
  - Verify fix resolves user-facing symptoms
  - Monitor recovery

Phase 5: Resolution
  - Confirm all systems nominal
  - Update status page to "resolved"
  - Schedule postmortem
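
These phase boundaries can be encoded as an explicit state machine, so tooling can reject shortcuts such as jumping from triage straight to "resolved." A sketch — the transition map is an assumption about sensible flow, not a standard:

```python
# Legal transitions between the lifecycle phases described above.
TRANSITIONS = {
    "detection": {"triage"},
    "triage": {"investigation"},
    "investigation": {"mitigation", "triage"},      # new findings may reopen triage
    "mitigation": {"resolution", "investigation"},  # a failed fix goes back
    "resolution": set(),                            # terminal state
}

def advance(current: str, target: str) -> str:
    """Move an incident to the next phase, rejecting illegal jumps."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```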

Communication Templates

Internal Update (Every 15-30 min)

🔴 SEV-1 Incident: Checkout failures

Status: Investigating
Impact: ~30% of checkout attempts failing with 500 errors
Duration: 23 minutes so far
Team: @alice (IC), @bob (Tech Lead), @carol (Comms)

Current theory: Database connection pool exhaustion
Next action: Restarting order-service pods
ETA to next update: 15 minutes
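
Posting this update on a regular cadence is easy to automate. A sketch that renders the template from incident state, assuming the hypothetical `Incident` record sketched earlier — adapt the field names to whatever your tooling stores:

```python
from datetime import datetime, timezone

def internal_update(inc, status, impact, theory, next_action,
                    interval_min=15):
    """Render the internal update template from current incident state."""
    # Assumes inc.declared_at is timezone-aware.
    minutes = int((datetime.now(timezone.utc) - inc.declared_at)
                  .total_seconds() // 60)
    return "\n".join([
        f"🔴 {inc.severity} Incident: {inc.title}",
        "",
        f"Status: {status}",
        f"Impact: {impact}",
        f"Duration: {minutes} minutes so far",
        f"Team: {inc.commander} (IC), {inc.tech_lead} (Tech Lead), "
        f"{inc.comms_lead} (Comms)",
        "",
        f"Current theory: {theory}",
        f"Next action: {next_action}",
        f"ETA to next update: {interval_min} minutes",
    ])
```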

Customer Status Page

Investigating: Checkout Issues
We are investigating reports of checkout failures. Some customers
may experience errors when completing purchases. Our team is 
actively working on a resolution.

Posted: 3:45 PM EDT
Next update: 4:15 PM EDT

Postmortem

Structure

# Incident Postmortem: Checkout Failures (2026-03-04)

## Summary
Checkout failures affecting 30% of purchases for 47 minutes.

## Timeline
15:23 - Alert: order-service error rate > 5%
15:25 - On-call acknowledges, begins investigation  
15:30 - IC declares SEV-1, incident channel created
15:35 - Root cause identified: connection pool exhausted
15:38 - Mitigation: order-service pods restarted
15:42 - Error rate returning to normal
16:10 - Resolved, all systems nominal

## Root Cause
Connection leak in new ORM version deployed at 14:00.
Connections were not being returned on timeout errors.

## Impact
- 1,247 failed checkout attempts
- Estimated revenue impact: $45,000
- 47 minutes of degraded service

## Action Items
- [ ] Fix: Upgrade ORM to patched version (P0, @bob, by 03/06)
- [ ] Detect: Add connection pool utilization alert (P1, @carol, by 03/07)
- [ ] Prevent: Add connection leak detection to CI (P2, @dave, by 03/14)
- [ ] Process: Add ORM to critical dependency test matrix (P2, @alice, by 03/14)
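
Action items only close the loop if someone checks them. A small sketch that enforces the owner-plus-due-date rule and surfaces overdue items for a recurring review — field names are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    priority: str   # "P0" | "P1" | "P2"
    owner: str      # required: no unowned items
    due: date       # required: no open-ended items

def overdue(items: list[ActionItem], today: date | None = None) -> list[ActionItem]:
    """Items past their due date, for the weekly postmortem review."""
    today = today or date.today()
    return [item for item in items if item.due < today]
```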

Anti-Patterns

| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No defined severity levels | Every incident is a fire drill | Classify, respond appropriately |
| IC also debugging | Nobody coordinating response | Separate coordination from investigation |
| Blame individuals | People hide mistakes | Blameless postmortems |
| No postmortem | Same incident recurs | Postmortem + action items after every SEV-1/2 |
| Action items without owners/dates | Items never completed | Owner + due date + tracking |

Incident management is a skill. Like any skill, it improves with practice. Run tabletop exercises, review postmortems, and treat every incident as an opportunity to make the system and the process better.
