On-Call Engineering | The Garnet Wiki

On-call is the commitment that production is someone’s explicit responsibility at every moment. Without on-call, incidents are handled by whoever happens to notice, which means a critical alert at 3 AM can go unnoticed until customers complain at 9 AM.

Done well, on-call is a manageable responsibility that develops deep system understanding. Done poorly, it is a burnout machine that drives engineers to quit.

Rotation Design

Team Size Requirements

Minimum viable on-call: 4 engineers
  - Each engineer is on-call 1 week per month
  - Allows for PTO, sick days, swaps

Healthy on-call: 6-8 engineers
  - Each engineer is on-call 1 week every 6-8 weeks
  - Sustainable long-term
  - Buffer for coverage gaps

Rotation Patterns

Primary + Secondary Model:
  Primary:   First responder, handles all pages
  Secondary: Backup if primary doesn't acknowledge within 5 min, or for escalation
  
  Week 1: Alice (P), Bob (S)
  Week 2: Bob (P), Carol (S)
  Week 3: Carol (P), Dave (S)
  Week 4: Dave (P), Alice (S)
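
The rotation above can be generated rather than maintained by hand. The following is a minimal sketch, assuming that each week's secondary is simply the next week's primary, as in the four-person example; the function name `build_rotation` is hypothetical and not part of any on-call tool.

```python
def build_rotation(engineers, weeks):
    """Return a list of (week_number, primary, secondary) tuples.

    Assumption: the secondary for a given week is the engineer who will
    be primary the following week, wrapping around at the end of the list.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    n = len(engineers)
    schedule = []
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week + 1) % n]
        schedule.append((week + 1, primary, secondary))
    return schedule


if __name__ == "__main__":
    team = ["Alice", "Bob", "Carol", "Dave"]
    for week, primary, secondary in build_rotation(team, 4):
        print(f"Week {week}: {primary} (P), {secondary} (S)")
```

With this ordering, every engineer shadows as secondary the week before taking primary, which gives them context on any ongoing issues.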

Handoff Protocol

End of shift checklist:
  1. Document any ongoing issues in the incident channel
  2. List any alerts that fired and their resolution
  3. Note any alerts that need investigation but are not urgent
  4. Flag any upcoming maintenance or deployments
  5. Update the on-call tool (PagerDuty schedule)

Escalation Policy

escalation_policy:
  - level: 1
    target: primary_oncall
    timeout: 5_minutes
    
  - level: 2
    target: secondary_oncall
    timeout: 10_minutes
    
  - level: 3
    target: engineering_manager
    timeout: 15_minutes
    
  - level: 4
    target: vp_engineering
    timeout: 30_minutes
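
To make the timing concrete, here is a small sketch of how the policy above could be evaluated. It assumes each `timeout` is the time a level has to acknowledge before the page moves to the next level, so the waits accumulate (5, then 15, then 30 minutes after the first page). That reading is an assumption; paging tools differ in how they define timeouts. `EscalationLevel` and `current_target` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    level: int
    target: str
    timeout_minutes: int  # assumed: time this level has to acknowledge


POLICY = [
    EscalationLevel(1, "primary_oncall", 5),
    EscalationLevel(2, "secondary_oncall", 10),
    EscalationLevel(3, "engineering_manager", 15),
    EscalationLevel(4, "vp_engineering", 30),
]


def current_target(policy, minutes_unacknowledged):
    """Return the level holding the page after the given unacknowledged time."""
    elapsed = 0
    for step in policy:
        elapsed += step.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return step
    # Policy exhausted: the page stays with the last level.
    return policy[-1]


if __name__ == "__main__":
    for minutes in (0, 5, 15, 30):
        print(minutes, current_target(POLICY, minutes).target)
```

Under these assumptions, an unacknowledged page reaches the engineering manager 15 minutes after it first fired.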

Alert Quality

Alert Classification

Page (wake someone up):
  - Service is down
  - Error rate > 5% for 5+ minutes
  - Customer-facing latency > 5s
  - Data pipeline SLA breach imminent

Notify (Slack, can wait):
  - Disk usage > 80%
  - Elevated error rate (1-5%)
  - Certificate expiring within 14 days
  - Non-critical job failure

Log (dashboard only):
  - CPU spike (no customer impact)
  - Cache hit rate below optimal
  - Non-critical dependency degraded
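
The three tiers can be expressed as a routing function. The sketch below encodes only the numeric thresholds listed above. The `Signal` fields are hypothetical, and treating an error rate above 5% that has lasted less than 5 minutes as "notify" is an assumption, since the classification does not cover that case.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    PAGE = "page"
    NOTIFY = "notify"
    LOG = "log"


@dataclass
class Signal:
    # Hypothetical snapshot of a service's health metrics.
    service_up: bool = True
    error_rate_pct: float = 0.0
    error_rate_minutes: float = 0.0  # how long the error rate has been elevated
    latency_seconds: float = 0.0     # customer-facing latency
    disk_usage_pct: float = 0.0
    cert_days_remaining: int = 365


def classify(signal):
    """Map a signal to page / notify / log, checking the most severe rules first."""
    if not signal.service_up:
        return Severity.PAGE
    if signal.error_rate_pct > 5 and signal.error_rate_minutes >= 5:
        return Severity.PAGE
    if signal.latency_seconds > 5:
        return Severity.PAGE
    if signal.error_rate_pct >= 1:
        # Includes short spikes above 5% (assumption, see lead-in).
        return Severity.NOTIFY
    if signal.disk_usage_pct > 80:
        return Severity.NOTIFY
    if signal.cert_days_remaining <= 14:
        return Severity.NOTIFY
    return Severity.LOG
```

Checking page conditions before notify conditions ensures that a signal matching both tiers is routed to the more urgent one.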

Alert Hygiene

Review alerts monthly:

Last Month Alert Report:
  Total pages:           47
  Actionable:            31 (66%)   ← target: >80%
  False positives:       12 (26%)   ← target: <10%
  Noisy (duplicate):      4 (8%)   ← target: 0%
  
  Actions:
  - Tune threshold on memory alert (caused 8 false positives)
  - Deduplicate disk space alerts
  - Add alert for the issue that required customer escalation
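
A report like this is easy to compute from a month of page records. The following sketch assumes each page has already been labeled with one of three hypothetical categories (`actionable`, `false_positive`, `duplicate`) and checks the shares against the targets listed above. It returns unrounded percentages and leaves presentation to the caller.

```python
def alert_report(counts):
    """Convert {category: page_count} into {category: percent_of_total}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pages recorded")
    return {category: 100 * n / total for category, n in counts.items()}


def missed_targets(percentages):
    """Return the categories whose share misses the review targets."""
    checks = {
        "actionable": lambda pct: pct > 80,      # target: >80%
        "false_positive": lambda pct: pct < 10,  # target: <10%
        "duplicate": lambda pct: pct == 0,       # target: 0%
    }
    return [name for name, ok in checks.items()
            if not ok(percentages.get(name, 0))]


if __name__ == "__main__":
    last_month = {"actionable": 31, "false_positive": 12, "duplicate": 4}
    report = alert_report(last_month)
    for category, pct in report.items():
        print(f"{category}: {pct:.1f}%")
    print("missed targets:", missed_targets(report))
```

For the sample month above, all three targets are missed, which is what motivates the listed actions.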

Runbooks

Every alert should have a linked runbook:

# Runbook: High Error Rate on Order Service

## Alert
`order_service_error_rate > 5% for 5 minutes`

## Impact
Customers may see errors during checkout

## Diagnosis Steps
1. Check error logs: `kubectl logs -l app=order-service --tail=100`
2. Check dependent services:
   - Payment service: https://grafana.internal/d/payment
   - Database: https://grafana.internal/d/postgres
3. Check recent deploys: `kubectl rollout history deploy/order-service`

## Common Causes

### Bad Deploy
Symptoms: Error spike immediately after deploy
Fix: `kubectl rollout undo deploy/order-service`

### Database Connection Exhaustion
Symptoms: "connection pool exhausted" in logs
Fix: Restart service: `kubectl rollout restart deploy/order-service`
Root cause: Investigate the connection leak on the next business day

### Downstream Service Failure
Symptoms: Timeout errors to payment-service or inventory-service
Fix: Check downstream service status, activate circuit breaker if needed

## Escalation
If unable to resolve within 15 minutes, escalate to secondary on-call.
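
The rule that every alert links to a runbook can be enforced with a small check, for example in CI. This is a sketch only: it assumes alert definitions are available as dictionaries with a `name` and an optional `runbook_url` field, both of which are hypothetical and would need to match whatever your alerting configuration actually uses. The URL in the example is a placeholder.

```python
def alerts_missing_runbooks(alerts):
    """Return the names of alerts that have no non-empty runbook_url."""
    return [
        alert["name"]
        for alert in alerts
        if not (alert.get("runbook_url") or "").strip()
    ]


if __name__ == "__main__":
    alerts = [
        # Placeholder URL, for illustration only.
        {"name": "order_service_error_rate",
         "runbook_url": "https://wiki.example/runbooks/order-service-error-rate"},
        {"name": "disk_usage_high", "runbook_url": ""},
        {"name": "memory_pressure"},
    ]
    missing = alerts_missing_runbooks(alerts)
    if missing:
        print("alerts without runbooks:", ", ".join(missing))
```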

Compensation and Sustainability

On-Call Compensation Models

Model	Details
Flat weekly rate	$500-1000 per on-call week
Per-incident bonus	Base rate + $X per page responded to
Time-off in lieu	1 comp day per on-call week
Higher base salary	On-call expectation factored into total comp

Burnout Prevention

Maximum consecutive days: 7 (no multi-week rotations)
Follow-the-sun: For global teams, hand off to a team in another timezone
No-page improvements: Track and eliminate recurring pages
Post-incident recovery: If paged after midnight, no meetings before noon
Opt-out periods: Engineers can block on-call during exam weeks, moves, etc.
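
The 7-day limit can be checked automatically against a schedule. The sketch below assumes the schedule is available as a list with one primary engineer per calendar day; that representation and the function name `over_limit` are illustrative, not taken from any scheduling tool.

```python
from itertools import groupby

MAX_CONSECUTIVE_DAYS = 7


def over_limit(daily_primary, limit=MAX_CONSECUTIVE_DAYS):
    """Return {engineer: longest_streak} for streaks longer than `limit` days.

    `daily_primary` holds one engineer name per consecutive calendar day.
    """
    worst = {}
    for engineer, run in groupby(daily_primary):
        streak = len(list(run))
        if streak > limit:
            worst[engineer] = max(streak, worst.get(engineer, 0))
    return worst


if __name__ == "__main__":
    schedule = ["Alice"] * 7 + ["Bob"] * 8 + ["Carol"] * 7
    print(over_limit(schedule))
```

Running a check like this after every swap catches accidental back-to-back weeks before they reach the engineer.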

Anti-Patterns

Anti-Pattern	Consequence	Fix
Same 2 engineers always on-call	Burnout, knowledge hoarding	Minimum 4-person rotation
No runbooks	On-call relies on tribal knowledge	Every alert links to a runbook
More than 5 pages per week	Alert fatigue, pages get ignored	Tune alerts, fix noisy ones
No compensation	Resentment, people avoid on-call	Compensate fairly
No feedback loop	Same alerts page forever	Monthly alert review, action items

On-call is not a punishment. It is an investment in production reliability and in engineers’ deep system understanding. But it must be designed with the same care as any other engineering system.