Sustainable Incident On-Call

On-call should not be a punishment. Engineers who dread on-call produce worse code, miss more incidents, and eventually leave the company. Sustainable on-call requires deliberate design of rotations, alerts, escalation paths, and, most importantly, organizational incentives.

Rotation Design

Minimum viable rotation:
  - At least 6-8 engineers in rotation
  - 1-week on-call shifts (not 2+ weeks)
  - Follow-the-sun for global teams
  - Clear primary + secondary escalation

Schedule:
  Week 1: Alice (primary), Bob (secondary)
  Week 2: Bob (primary), Carol (secondary)
  Week 3: Carol (primary), Dave (secondary)
  ...
  
  Secondary is paged only if the primary doesn't acknowledge within 10 minutes
  
  After-hours: Primary handles Sev1/Sev2 only
  Business hours: Primary handles all severities
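
The schedule above follows a simple pattern: each week's secondary becomes the next week's primary. A minimal sketch of that pattern, assuming a hypothetical `build_rotation` helper and a made-up six-person team (only Alice, Bob, Carol, and Dave appear in the schedule above; the start date is arbitrary):

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return one (week_start, primary, secondary) tuple per week.

    The secondary for a given week becomes the primary the following week,
    matching the Alice/Bob/Carol/Dave pattern in the schedule.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Hypothetical team: Erin and Frank are invented for illustration.
team = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"]
rotation = build_rotation(team, start=date(2024, 1, 1), weeks=6)
for week_start, primary, secondary in rotation:
    print(f"{week_start}: {primary} (primary), {secondary} (secondary)")
```

With six engineers, each person is primary for one week in six, which is the low end of the 6-8 engineer guideline.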

Alert Quality Metrics

Signal-to-noise ratio:
  Total alerts per week: 40
  Actionable alerts: 15 (37.5%)
  False positives: 25 (62.5%)  ← THIS IS THE PROBLEM

Target:
  Actionable rate: > 80%
  False positive rate: < 20%
  Alerts per on-call shift: < 20 (including business hours)
  Night pages (11PM-7AM): < 2 per week
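
The arithmetic behind these numbers is simple enough to automate. The sketch below computes both rates and checks them against the targets listed above. The function names are hypothetical, and, as in the example week, every non-actionable alert is counted as a false positive. The night-page count in the usage line is an assumed value, since the example week does not state one.

```python
def alert_quality(total, actionable):
    """Return (actionable_rate, false_positive_rate) as fractions of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= actionable <= total:
        raise ValueError("actionable must be between 0 and total")
    actionable_rate = actionable / total
    return actionable_rate, 1 - actionable_rate


def meets_targets(total, actionable, night_pages):
    """Check one week-long shift against the targets listed above."""
    actionable_rate, false_positive_rate = alert_quality(total, actionable)
    return {
        "actionable_rate_over_80pct": actionable_rate > 0.80,
        "false_positive_rate_under_20pct": false_positive_rate < 0.20,
        "alerts_per_shift_under_20": total < 20,
        "night_pages_under_2": night_pages < 2,
    }


# The example week: 40 alerts, 15 actionable. night_pages=3 is an assumption.
print(alert_quality(40, 15))
print(meets_targets(40, 15, night_pages=3))
```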

Metrics to track:
  - Alerts per shift (total and after-hours)
  - Time-to-acknowledge (< 5 min target)
  - Time-to-resolve per severity
  - False positive rate
  - Escalation rate
  - Sleep disruption count
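
Several of these metrics can be derived from a plain log of pages. The following is a rough sketch under stated assumptions: the `Page` record and its fields are hypothetical, timestamps are taken to be in the on-call engineer's local time, and sleep disruptions are approximated as pages fired in the 11PM-7AM window.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Page:
    fired_at: datetime   # local time of the on-call engineer (assumption)
    acked_at: datetime
    escalated: bool      # True if the secondary had to be paged


def is_night(ts):
    """True for the 11PM-7AM window used in the night-page target."""
    return ts.hour >= 23 or ts.hour < 7


def shift_metrics(pages):
    """Summarize one shift's pages into a few of the tracked metrics."""
    if not pages:
        raise ValueError("no pages to summarize")
    ack_minutes = [(p.acked_at - p.fired_at).total_seconds() / 60 for p in pages]
    return {
        "alerts": len(pages),
        "night_pages": sum(is_night(p.fired_at) for p in pages),
        "median_ack_minutes": median(ack_minutes),
        "escalation_rate": sum(p.escalated for p in pages) / len(pages),
    }


# Made-up sample data for illustration only.
pages = [
    Page(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 3), False),
    Page(datetime(2024, 1, 3, 23, 30), datetime(2024, 1, 3, 23, 42), True),
    Page(datetime(2024, 1, 4, 6, 15), datetime(2024, 1, 4, 6, 19), False),
    Page(datetime(2024, 1, 4, 7, 0), datetime(2024, 1, 4, 7, 2), False),
]
metrics = shift_metrics(pages)
print(metrics)
```

The median is used for time-to-acknowledge so that one slow night-time response does not dominate the number. Teams that care about worst cases may prefer a high percentile instead.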

Escalation Policies

escalation_policy:
  sev1_critical:  # Revenue-impacting, full outage
    timeout_minutes: 5
    chain:
      - {notify: primary_on_call, after_minutes: 0}
      - {notify: secondary_on_call, after_minutes: 5}
      - {notify: engineering_manager, after_minutes: 15}
      - {notify: vp_engineering, after_minutes: 30}
    channels: [phone, sms, slack]

  sev2_high:  # Degraded service, partial impact
    timeout_minutes: 15
    chain:
      - {notify: primary_on_call, after_minutes: 0}
      - {notify: secondary_on_call, after_minutes: 15}
    channels: [push_notification, slack]

  sev3_medium:  # Non-urgent, business hours response
    timeout_minutes: 240
    chain:
      - {notify: primary_on_call, after_minutes: 0}
    channels: [slack]
    restriction: business_hours_only
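
To make the chain concrete, here is a small sketch of how a paging tool might decide who has been notified at a given moment. It assumes the "after N min" values are measured from the initial page, which the policy does not state explicitly. The `ESCALATION_CHAINS` table and `notified_so_far` function are hypothetical names, not any real paging product's API.

```python
# Minutes after the initial page at which each responder is notified,
# transcribed from the policy above (cumulative timing is an assumption).
ESCALATION_CHAINS = {
    "sev1_critical": [
        (0, "primary_on_call"),
        (5, "secondary_on_call"),
        (15, "engineering_manager"),
        (30, "vp_engineering"),
    ],
    "sev2_high": [
        (0, "primary_on_call"),
        (15, "secondary_on_call"),
    ],
    "sev3_medium": [
        (0, "primary_on_call"),
    ],
}


def notified_so_far(severity, minutes_unacknowledged):
    """Return everyone who should have been notified by this point."""
    chain = ESCALATION_CHAINS[severity]
    return [who for after, who in chain if minutes_unacknowledged >= after]


# A Sev1 page that has gone 20 minutes without acknowledgement:
print(notified_so_far("sev1_critical", 20))
```

A real implementation would stop escalating once someone acknowledges and would honor the business-hours restriction on Sev3. Both are omitted here to keep the sketch short.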

Burnout Prevention

Organizational practices:
  ☐ No more than 1 week on-call per 6-8 week cycle
  ☐ Comp time: day off after on-call week with pages
  ☐ Monetary compensation for after-hours pages
  ☐ Blameless postmortems (learn, don't punish)
  ☐ Toil budget: 20% of sprint for alert cleanup
  ☐ Regular alert review: delete noisy alerts

Individual protections:
  ☐ Right to silence phone during protected sleep hours
  ☐ Secondary on-call handles overflow
  ☐ Escalation to management for extended incidents
  ☐ No on-call during PTO, sick leave, or parental leave

Red flags:
  ⚠ Same person always on-call (team too small)
  ⚠ Alerts that page but require no action
  ⚠ Incident postmortems result in more alerts (not fewer)
  ⚠ Engineers refuse to join on-call rotation

Compensation Models

Model	How It Works	When to Use
Flat stipend	$500-$1,500/month for being on-call	Small teams, simple rotation
Per-page payment	$50-$200 per after-hours page	High-reliability teams
Comp time	1 day off per on-call week	Organizations that value time
Hybrid	Stipend + per-page + comp time	Best practice for large orgs
Nothing	No compensation	Fastest way to lose engineers
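
To compare the models for a given month, a rough calculation like the one below can help. The default amounts ($1,000 stipend, $100 per page) are assumptions picked from inside the ranges in the table, and `on_call_pay` is a hypothetical helper, not a recommendation for specific figures.

```python
def on_call_pay(model, after_hours_pages, on_call_weeks=1,
                stipend=1000, per_page=100):
    """Return (cash_dollars, comp_days) for one month under a given model.

    stipend and per_page are illustrative defaults chosen from the
    ranges in the table above.
    """
    if model == "flat_stipend":
        return stipend, 0
    if model == "per_page":
        return per_page * after_hours_pages, 0
    if model == "comp_time":
        return 0, on_call_weeks
    if model == "hybrid":
        return stipend + per_page * after_hours_pages, on_call_weeks
    if model == "nothing":
        return 0, 0
    raise ValueError(f"unknown model: {model}")


# One on-call week in the month with four after-hours pages (made-up numbers).
for model in ("flat_stipend", "per_page", "comp_time", "hybrid", "nothing"):
    print(model, on_call_pay(model, after_hours_pages=4))
```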

Anti-Patterns

Anti-Pattern	Consequence	Fix
No on-call compensation	Engineers resent on-call; attrition rises	Stipend + comp time minimum
Alert fatigue (50+ alerts/week)	Real incidents missed	Aggressive alert cleanup
Same 2-3 people always on-call	Burnout, knowledge concentration	Grow rotation to 6-8 minimum
No follow-the-sun	Night pages for everyone	Distribute across time zones
Heroic culture (“just deal with it”)	Unsustainable, leads to burnout	Systematic, measured on-call health

On-call is a team responsibility, not an individual burden. If your on-call rotation is burning people out, the problem is the system, not the people.
