
Sustainable Incident On-Call

Design on-call rotations that protect engineer health while maintaining service reliability. Covers rotation design, escalation policies, compensation models, burnout prevention, alert quality, and the organizational practices that make on-call sustainable.

On-call should not be a punishment. Engineers who dread on-call produce worse code, miss more incidents, and eventually leave the company. Sustainable on-call requires deliberate design — of rotations, alerts, escalation paths, and most importantly, organizational incentives.


Rotation Design

Minimum viable rotation:
  - At least 6-8 engineers in rotation
  - 1 week on-call shifts (not 2+)
  - Follow-the-sun for global teams
  - Clear primary + secondary escalation

Schedule:
  Week 1: Alice (primary), Bob (secondary)
  Week 2: Bob (primary), Carol (secondary)
  Week 3: Carol (primary), Dave (secondary)
  ...
  
  Secondary is paged only if the primary doesn't acknowledge within 10 minutes
  
  After-hours: Primary handles Sev1/Sev2 only
  Business hours: Primary handles all severities
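
The schedule above follows a simple pattern: each week's secondary becomes the next week's primary. A minimal Python sketch of that pattern follows. The function name `build_rotation`, the start date, and the engineers Erin and Frank are illustrative assumptions, not part of the original guide.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return one (week_start, primary, secondary) tuple per week.

    The secondary in a given week becomes the primary the following week,
    matching the Alice/Bob/Carol/Dave pattern in the schedule.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Erin and Frank are hypothetical names added to reach six engineers.
team = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"]
for week_start, primary, secondary in build_rotation(team, date(2024, 1, 1), 6):
    print(f"{week_start}: {primary} (primary), {secondary} (secondary)")
```

With six engineers, each person is primary one week in six, which is the low end of the recommended rotation size.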

Alert Quality Metrics

Signal-to-noise ratio:
  Total alerts per week: 40
  Actionable alerts: 15 (37.5%)
  False positives: 25 (62.5%)  ← THIS IS THE PROBLEM

Target:
  Actionable rate: > 80%
  False positive rate: < 20%
  Alerts per on-call shift: < 20 (including business hours)
  Night pages (11PM-7AM): < 2 per week
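
The arithmetic behind these numbers can be sketched in a few lines of Python. This assumes, as in the example above, that every non-actionable alert counts as a false positive, and that shifts last one week so weekly and per-shift counts coincide. The function names are hypothetical.

```python
def alert_quality(total_alerts, actionable_alerts):
    """Return (actionable_rate, false_positive_rate) as fractions of total."""
    if total_alerts <= 0:
        raise ValueError("total_alerts must be positive")
    false_positives = total_alerts - actionable_alerts
    return actionable_alerts / total_alerts, false_positives / total_alerts


def meets_targets(total_alerts, actionable_alerts, night_pages):
    """Check one week-long shift against the targets listed above."""
    actionable_rate, false_positive_rate = alert_quality(total_alerts, actionable_alerts)
    return {
        "actionable_rate_over_80pct": actionable_rate > 0.80,
        "false_positive_rate_under_20pct": false_positive_rate < 0.20,
        "alerts_per_shift_under_20": total_alerts < 20,
        "night_pages_under_2": night_pages < 2,
    }


# The example week above: 40 alerts, 15 actionable.
print(alert_quality(40, 15))  # (0.375, 0.625)
```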

Metrics to track:
  - Alerts per shift (total and after-hours)
  - Time-to-acknowledge (< 5 min target)
  - Time-to-resolve per severity
  - False positive rate
  - Escalation rate
  - Sleep disruption count
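
Most of these metrics can be derived from a plain log of pages. The sketch below assumes a hypothetical `Page` record with timestamps in the on-call engineer's local time, and treats "sleep disruption count" as the number of pages fired between 11PM and 7AM. Neither the record shape nor that definition comes from a specific paging tool.

```python
from dataclasses import dataclass
from datetime import datetime, time
from statistics import median

NIGHT_START = time(23, 0)  # 11PM, from the night-page target
NIGHT_END = time(7, 0)     # 7AM


@dataclass
class Page:
    fired_at: datetime         # local time of the on-call engineer (assumption)
    acknowledged_at: datetime
    actionable: bool
    escalated: bool


def is_night(moment):
    """True if the moment falls in the 11PM-7AM window, which wraps midnight."""
    t = moment.time()
    return t >= NIGHT_START or t < NIGHT_END


def shift_metrics(pages):
    """Summarize one shift's pages into the tracked metrics."""
    if not pages:
        raise ValueError("no pages to summarize")
    ack_minutes = [
        (p.acknowledged_at - p.fired_at).total_seconds() / 60 for p in pages
    ]
    return {
        "alerts": len(pages),
        "night_pages": sum(is_night(p.fired_at) for p in pages),
        "median_ack_minutes": median(ack_minutes),
        "false_positive_rate": sum(not p.actionable for p in pages) / len(pages),
        "escalation_rate": sum(p.escalated for p in pages) / len(pages),
    }
```

Time-to-resolve per severity would need a resolution timestamp and a severity field on each record. Both are omitted here to keep the sketch short.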

Escalation Policies

escalation_policy:
  sev1_critical:  # Revenue-impacting, full outage
    timeout: 5 min
    chain:
      - primary_on_call
      - secondary_on_call    # after 5 min
      - engineering_manager  # after 15 min
      - vp_engineering       # after 30 min
    channels: [phone, sms, slack]

  sev2_high:  # Degraded service, partial impact
    timeout: 15 min
    chain:
      - primary_on_call
      - secondary_on_call    # after 15 min
    channels: [push_notification, slack]

  sev3_medium:  # Non-urgent, business hours response
    timeout: 4 hours
    chain:
      - primary_on_call
    channels: [slack]
    restriction: business_hours_only
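
To make the escalation timing concrete, the sketch below encodes the chains above as (target, minutes-after-page) pairs and answers the question "who should have been paged by now?" for an unacknowledged incident. The data layout and the function name `paged_so_far` are illustrative assumptions, not the schema of any particular paging product.

```python
# Each entry is (target, minutes after the initial page at which it is notified).
ESCALATION_POLICY = {
    "sev1_critical": [
        ("primary_on_call", 0),
        ("secondary_on_call", 5),
        ("engineering_manager", 15),
        ("vp_engineering", 30),
    ],
    "sev2_high": [
        ("primary_on_call", 0),
        ("secondary_on_call", 15),
    ],
    "sev3_medium": [
        ("primary_on_call", 0),
    ],
}


def paged_so_far(severity, minutes_unacknowledged):
    """Return, in order, everyone who should have been paged by now."""
    chain = ESCALATION_POLICY[severity]
    return [target for target, after in chain if minutes_unacknowledged >= after]


print(paged_so_far("sev1_critical", 16))
# ['primary_on_call', 'secondary_on_call', 'engineering_manager']
```

A real implementation would also stop escalating once someone acknowledges and would honor the business-hours restriction on Sev3. Both are left out here.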

Burnout Prevention

Organizational practices:
  ☐ No more than 1 week on-call per 6-8 week cycle
  ☐ Comp time: day off after on-call week with pages
  ☐ Monetary compensation for after-hours pages
  ☐ Blameless postmortems (learn, don't punish)
  ☐ Toil budget: 20% of sprint for alert cleanup
  ☐ Regular alert review: delete noisy alerts

Individual protections:
  ☐ Right to silence phone during protected sleep hours
  ☐ Secondary on-call handles overflow
  ☐ Escalation to management for extended incidents
  ☐ No on-call during PTO, sick leave, or parental leave

Red flags:
  ⚠ Same person always on-call (team too small)
  ⚠ Alerts that page but require no action
  ⚠ Incident postmortems result in more alerts (not fewer)
  ⚠ Engineers refuse to join on-call rotation
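
The first red flag, together with the "1 week per 6-8 week cycle" rule, can be checked mechanically against rotation history. The sketch below assumes a hypothetical list with one primary name per week and flags anyone who was primary twice within the minimum cycle length. The function name and the default of 6 weeks are assumptions taken from the low end of the range above.

```python
def too_frequent(primaries_by_week, min_cycle_weeks=6):
    """Return engineers who were primary twice within min_cycle_weeks."""
    last_seen = {}
    flagged = set()
    for week, engineer in enumerate(primaries_by_week):
        if engineer in last_seen and week - last_seen[engineer] < min_cycle_weeks:
            flagged.add(engineer)
        last_seen[engineer] = week
    return sorted(flagged)


# Alice is primary again after only two weeks; Bob returns after six.
history = ["Alice", "Bob", "Alice", "Carol", "Dave", "Erin", "Frank", "Bob"]
print(too_frequent(history))  # ['Alice']
```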

Compensation Models

Model            | How It Works                          | When to Use
-----------------|---------------------------------------|------------------------------
Flat stipend     | $500-$1,500/month for being on-call   | Small teams, simple rotation
Per-page payment | $50-$200 per after-hours page         | High-reliability teams
Comp time        | 1 day off per on-call week            | Organizations that value time
Hybrid           | Stipend + per-page + comp time        | Best practice for large orgs
Nothing          | No compensation                       | Fastest way to lose engineers
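
As a worked example of the hybrid model, the sketch below adds a flat stipend, per-page payments, and comp days for one month. The figures used ($500 stipend, $50 per page) are assumptions picked from the low end of the ranges above, and the count of three after-hours pages is invented for illustration.

```python
def hybrid_compensation(stipend, per_page_rate, after_hours_pages, on_call_weeks):
    """Return (cash, comp_days) for one month under the hybrid model."""
    cash = stipend + per_page_rate * after_hours_pages
    comp_days = on_call_weeks  # 1 day off per on-call week
    return cash, comp_days


# One on-call week in the month, with three after-hours pages (hypothetical).
print(hybrid_compensation(500, 50, 3, 1))  # (650, 1)
```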

Anti-Patterns

Anti-Pattern                         | Consequence                          | Fix
-------------------------------------|--------------------------------------|------------------------------------
No on-call compensation              | Engineers resent on-call, attrition  | Stipend + comp time minimum
Alert fatigue (50+ alerts/week)      | Real incidents missed                | Aggressive alert cleanup
Same 2-3 people always on-call       | Burnout, knowledge concentration     | Grow rotation to 6-8 minimum
No follow-the-sun                    | Night pages for everyone             | Distribute across time zones
Heroic culture (“just deal with it”) | Unsustainable, leads to burnout      | Systematic, measured on-call health

On-call is a team responsibility, not an individual burden. If your on-call rotation is burning people out, the problem is the system, not the people.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
