An incident is an investment you have already made. The post-incident review is where you collect the return. Without a good review, the same incident repeats. With a good review, you fix the systemic issue and become more resilient.
Most post-incident reviews fail. They produce a list of action items that nobody follows up on, blame individuals for systemic problems, and focus on the immediate trigger rather than the conditions that made the failure possible. This guide covers the process that produces genuine organizational learning.
Review Structure
Timeline:
Incident occurs → Mitigated → Resolved
Within 24 hours: Draft incident timeline
Within 3 days: Schedule review meeting
Within 1 week: Conduct review (45-60 min)
Within 2 weeks: Publish write-up, assign actions
Within 30 days: Follow up on action items
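Teams that automate these reminders often derive each deadline from the incident's resolution timestamp. A minimal sketch of that calculation in Python (the function and key names are illustrative, not from any particular incident tool):

```python
from datetime import datetime, timedelta

# Review cadence relative to incident resolution, mirroring the schedule above.
REVIEW_CADENCE = {
    "draft_timeline": timedelta(hours=24),
    "schedule_review": timedelta(days=3),
    "conduct_review": timedelta(weeks=1),
    "publish_writeup": timedelta(weeks=2),
    "follow_up_on_actions": timedelta(days=30),
}

def review_deadlines(resolved_at: datetime) -> dict:
    """Compute the due date for each review milestone."""
    return {step: resolved_at + delta for step, delta in REVIEW_CADENCE.items()}

# Example: the incident used throughout this guide resolved at 14:45 UTC on March 15.
deadlines = review_deadlines(datetime(2024, 3, 15, 14, 45))
```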
| Phase | Duration | Purpose |
|---|---|---|
| Timeline review | 15 min | Walk through what happened, when |
| Contributing factors | 20 min | Why did this happen? (not who) |
| What worked well | 10 min | What prevented this from being worse? |
| Action items | 15 min | Systemic changes to prevent recurrence |
Timeline Construction
Build the timeline BEFORE the meeting using logs, metrics, and chat history (a sketch for merging these sources programmatically follows the example):
14:23 Deployment of checkout-api v2.4.1 begins
14:25 Deployment complete, canary receiving 5% traffic
14:27 Alert: checkout error rate > 1% (PagerDuty)
14:28 On-call engineer acknowledges alert, begins investigation
14:31 Engineer identifies new deployment as potential cause
14:33 Decision: rollback deployment
14:35 Rollback initiated
14:38 Rollback complete, error rate returning to baseline
14:42 Confirmed: error rate back to normal
14:45 Incident resolved
Duration: 22 minutes
Impact: ~300 failed checkout attempts
Revenue impact: estimated $4,500
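When the raw events live in several systems, a short script can do the first merge pass before a human edits the result. A minimal sketch, assuming each source has already been exported as (timestamp, source, message) tuples (the variable and field names are illustrative):

```python
from datetime import datetime

# Hypothetical pre-exported events from log queries, alerting history,
# and the incident channel's chat export.
events = [
    (datetime(2024, 3, 15, 14, 27), "pagerduty", "Alert: checkout error rate > 1%"),
    (datetime(2024, 3, 15, 14, 23), "deploys", "checkout-api v2.4.1 deployment begins"),
    (datetime(2024, 3, 15, 14, 33), "slack", "Decision: rollback deployment"),
]

# Merge everything into one chronological timeline, ready for human cleanup.
for ts, source, message in sorted(events):
    print(f"{ts:%H:%M}  [{source}] {message}")
```

The script only orders events; deciding which ones matter is still editorial work, which is why a person drafts the timeline before the meeting.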
Contributing Factor Analysis
Ask "what conditions made this possible?" not "who caused this?"
Surface cause: Deployment contained a bug
↓
Why? Code review did not catch the bug
↓
Why? The failure mode only occurs under specific data conditions
↓
Why? Test environment does not have production-like data
↓
Root cause: Test data management does not reflect production patterns
Contributing factors (not a single root cause; incidents have multiple contributors):
1. Test environment lacks production-representative data
2. Canary analysis did not compare error rates before promoting
3. Rollback procedure took 5 minutes (could be faster)
4. No automated rollback on error rate spike (a sketch of such a trigger follows this list)
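To make factor 4 concrete: an automated rollback trigger can be as simple as polling the post-deploy error rate and comparing it to a pre-deploy baseline. A minimal sketch, where get_error_rate and rollback are hypothetical stand-ins for your metrics client and deploy tooling:

```python
import time

def watch_and_rollback(get_error_rate, rollback, baseline: float,
                       threshold_multiple: float = 2.0,
                       watch_seconds: int = 600, poll_interval: int = 30) -> bool:
    """Poll the error rate after a deploy; roll back if it exceeds
    threshold_multiple x the pre-deploy baseline. Returns True on rollback."""
    deadline = time.time() + watch_seconds
    while time.time() < deadline:
        if get_error_rate() > baseline * threshold_multiple:
            rollback()
            return True
        time.sleep(poll_interval)
    return False
```

Passing the metrics query and the rollback action in as callables keeps the policy (the 2x threshold) separate from the plumbing, so the trigger itself is easy to unit-test.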
Questions to Ask
| Question | Purpose |
|---|---|
| "What information would have helped you act sooner?" | Identify observability gaps |
| "What did you consider but decide not to do?" | Understand decision-making under pressure |
| "Has something similar happened before?" | Identify recurring patterns |
| "What surprised you?" | Surface gaps in mental models |
| "What went well during the response?" | Reinforce good practices |
Blameless ≠ Unaccountable
❌ Blame:
"John deployed buggy code without proper testing"
→ John feels attacked, hides future mistakes
✅ Blameless:
"The deployment process allowed code to reach production without
catching this failure mode"
→ Team focuses on systemic improvement
✅ Blameless and accountable:
"We identified that our canary analysis does not automatically
compare error rates. The platform team will implement automated
canary analysis. Owner: Sarah. Due: March 30."
→ Specific, owned, systemic improvement
Action Item Quality
| Bad Action Item | Problem | Good Action Item |
|---|---|---|
| "Be more careful with deployments" | Vague, no systemic change | "Implement automated canary analysis that blocks promotion if error rate > 0.5%" |
| "Add more tests" | No specific scope | "Add integration test for the payment retry path with declined card codes" |
| "Review deployment process" | No outcome defined | "Document and implement automated rollback trigger when error rate exceeds 2x baseline" |
| "Improve monitoring" | Could mean anything | "Add checkout error rate dashboard with 1-minute granularity and SLO burn rate alerts" |
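To ground the first row: automated canary analysis at its simplest compares the canary's error rate against both an absolute ceiling and the stable fleet's baseline before allowing promotion. A minimal sketch of the gating decision (rates are fractions; the relative check is an assumption layered on top of the 0.5% ceiling from the action item):

```python
def should_promote_canary(canary_error_rate: float,
                          baseline_error_rate: float,
                          absolute_ceiling: float = 0.005) -> bool:
    """Gate canary promotion on error rate."""
    # Block if the canary exceeds the absolute ceiling (0.5%).
    if canary_error_rate > absolute_ceiling:
        return False
    # Also block if the canary is clearly worse than the stable fleet,
    # even while under the ceiling.
    if canary_error_rate > baseline_error_rate * 2:
        return False
    return True
```

A production gate would also require a minimum request count before trusting the rates; with only 5% of traffic on the canary, a handful of failures can produce a misleading spike.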
Action Item Template
## Action Item: [Descriptive title]
**Owner:** [Specific person]
**Due:** [Specific date]
**Priority:** P1 / P2 / P3
**Type:** Prevent / Detect / Mitigate
**Description:**
[What specifically needs to change and why]
**Definition of Done:**
[How we know this is complete]
**Tracks against incident:** INC-2024-042
Severity Levels and Review Requirements
| Severity | Impact | Review Required | Audience |
|---|---|---|---|
| SEV1 | Revenue loss, data breach, full outage | Formal review within 3 days | Leadership + engineering |
| SEV2 | Partial outage, degraded service | Written review within 1 week | Engineering team |
| SEV3 | Minor impact, limited users affected | Brief write-up, no meeting | Team lead |
| SEV4 | Near-miss, caught before impact | Note in incident tracker | On-call record |
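If your incident tooling routes reviews automatically, this policy reads naturally as data. A minimal sketch mirroring the table (field names are illustrative; due_days is left unset where the table gives no deadline):

```python
# Review policy keyed by severity, mirroring the table above.
REVIEW_POLICY = {
    "SEV1": {"review": "formal review", "due_days": 3, "audience": "leadership + engineering"},
    "SEV2": {"review": "written review", "due_days": 7, "audience": "engineering team"},
    "SEV3": {"review": "brief write-up, no meeting", "due_days": None, "audience": "team lead"},
    "SEV4": {"review": "note in incident tracker", "due_days": None, "audience": "on-call record"},
}
```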
Publishing the Write-Up
# Incident Report: Checkout Service Outage — March 15, 2024
**Severity:** SEV2
**Duration:** 22 minutes (14:23 - 14:45 UTC)
**Impact:** ~300 failed checkout attempts, ~$4,500 estimated revenue impact
**On-call:** [Name]
**Review Date:** March 18, 2024
## Summary
A deployment of checkout-api v2.4.1 introduced a bug in the payment
retry logic that caused failures for transactions where the initial
charge was declined.
## Timeline
[Detailed timeline here]
## Contributing Factors
[Analysis here]
## What Went Well
- On-call responded within 1 minute of alert
- Rollback procedure worked as documented
- Customer support was notified within 5 minutes
## Action Items
1. [P1] Implement automated canary analysis — Owner: Sarah — Due: March 30
2. [P2] Add integration test for payment retry with declined cards — Owner: Mike — Due: March 25
3. [P2] Implement automated rollback on 2x error rate — Owner: Platform Team — Due: April 15
Implementation Checklist