Toil Budgets and Elimination
Measure, budget, and systematically eliminate toil in SRE organizations. Covers toil taxonomy, measurement frameworks, automation ROI calculation, toil elimination strategies, and the organizational patterns that prevent toil from consuming engineering capacity.
Google's SRE book defines toil as manual, repetitive, automatable, tactical work that has no enduring value and scales linearly as the service grows. If your team spends 60% of its time on toil, only 40% is left for engineering work. The SRE target is to keep toil below 50% of each engineer's time, and ideally under 30%. Every percentage point of toil reduction is reclaimed engineering capacity.
Toil Taxonomy
Characteristics of toil:
Manual: Requires a human to run a script or click a button
Repetitive: Done over and over, not a novel task
Automatable: A machine could do it (if someone built the automation)
Reactive: Triggered by an external event, not a proactive choice
No enduring value: Does not improve the service permanently
Scales with service: More users = more toil
NOT toil:
✅ Architecture design (creative, non-repetitive)
✅ Code review (valuable human judgment)
✅ Incident postmortems (permanent improvement)
✅ Building automation (eliminates future toil)
✅ On-call with novel incidents (requires engineering thinking)
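One way to apply the checklist above is to count how many toil characteristics a task matches. The sketch below is illustrative only: the `Task` class, its field names, and the trait judgments for the two example tasks are assumptions, and the score is a rough signal rather than an official threshold.

```python
from dataclasses import dataclass

# The six characteristics from the list above, as hypothetical field names.
TOIL_TRAITS = (
    "manual",
    "repetitive",
    "automatable",
    "reactive",
    "no_enduring_value",
    "scales_with_service",
)


@dataclass
class Task:
    """Hypothetical record of a task judged against the toil checklist."""
    name: str
    manual: bool
    repetitive: bool
    automatable: bool
    reactive: bool
    no_enduring_value: bool
    scales_with_service: bool


def toil_score(task: Task) -> int:
    """Count how many of the six toil characteristics the task matches."""
    return sum(bool(getattr(task, trait)) for trait in TOIL_TRAITS)


# Illustrative judgments, not measurements.
cert_rotation = Task("Manual certificate rotation",
                     manual=True, repetitive=True, automatable=True,
                     reactive=False, no_enduring_value=True,
                     scales_with_service=True)
architecture = Task("Architecture design",
                    manual=True, repetitive=False, automatable=False,
                    reactive=False, no_enduring_value=False,
                    scales_with_service=False)

for task in (cert_rotation, architecture):
    print(f"{task.name}: {toil_score(task)}/{len(TOIL_TRAITS)} traits")
```

The more traits a task matches, the stronger the case for logging it as toil.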
Measuring Toil
Toil Tracking
# Track toil weekly per team member (hours are totals for the week)
weekly_toil_log:
  engineer: "Alice"
  week: "2026-W10"
  total_hours: 40
  toil_hours: 14
  toil_percentage: 35%
  toil_breakdown:
    - activity: "Manual certificate rotation"
      hours: 3
      frequency: weekly
      automatable: true
      automation_effort: "2 days"
    - activity: "Scaling database replicas"
      hours: 2
      frequency: twice_weekly
      automatable: true
      automation_effort: "1 day"
    - activity: "Answering capacity requests via ticket"
      hours: 4
      frequency: daily
      automatable: true
      automation_effort: "1 week"
    - activity: "Debugging user-reported 500 errors"
      hours: 5
      frequency: daily
      automatable: partially
      automation_effort: "2 weeks (better error handling)"
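The percentage in a log like this can be derived from the breakdown rather than entered by hand. The following is a minimal sketch, assuming each `hours` value is the weekly total for that activity; `ToilItem`, `toil_percentage`, and `TOIL_BUDGET_PCT` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

TOIL_BUDGET_PCT = 50.0  # the SRE ceiling discussed in the introduction


@dataclass
class ToilItem:
    activity: str
    hours: float  # hours spent on this activity during the week


def toil_percentage(total_hours: float, items: list[ToilItem]) -> float:
    """Return toil as a percentage of total working hours."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    toil_hours = sum(item.hours for item in items)
    return 100.0 * toil_hours / total_hours


alice_week = [
    ToilItem("Manual certificate rotation", 3),
    ToilItem("Scaling database replicas", 2),
    ToilItem("Answering capacity requests via ticket", 4),
    ToilItem("Debugging user-reported 500 errors", 5),
]

pct = toil_percentage(40, alice_week)
status = "over budget" if pct > TOIL_BUDGET_PCT else "within budget"
print(f"Toil: {pct:.0f}% ({status})")
```

Deriving the figure this way also catches logs where the breakdown does not add up to the reported `toil_hours`.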
Automation ROI Calculation
Toil: Manual certificate rotation
Time per occurrence: 45 minutes
Frequency: 12 times per month
Monthly cost: 12 × 45 min = 9 hours
Annual cost: 9 × 12 = 108 hours × $150/hr = $16,200
Automation cost:
Development: 16 hours = $2,400
Testing: 4 hours = $600
Maintenance: 2 hours/month = $3,600/year
Total first year: $6,600
ROI:
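First year savings: $16,200 - $6,600 = $9,600 (145% ROI on the $6,600 first-year cost; 59% of the manual cost)
Second year savings: $16,200 - $3,600 = $12,600 per year
Payback period: ~3 months ($3,000 build and test cost ÷ $1,050 net monthly savings); ~5 months if the full first-year cost is set against gross savings of $1,350/month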
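The same arithmetic can be wrapped in a small function so it is easy to rerun for other toil items. This is a sketch under stated assumptions: `automation_roi` and its parameter names are hypothetical, ROI is defined as net first-year savings divided by first-year cost (dividing by the manual cost instead gives the 59% figure), and payback divides the one-time build cost by net monthly savings (setting the full first-year cost against gross savings instead gives roughly 5 months).

```python
def automation_roi(
    minutes_per_run: float,
    runs_per_month: float,
    hourly_rate: float,
    build_hours: float,
    maintenance_hours_per_month: float,
) -> dict[str, float]:
    """Estimate first-year ROI and payback for automating one toil item."""
    annual_manual_cost = minutes_per_run / 60 * runs_per_month * 12 * hourly_rate
    build_cost = build_hours * hourly_rate
    annual_maintenance = maintenance_hours_per_month * 12 * hourly_rate
    first_year_cost = build_cost + annual_maintenance
    first_year_net = annual_manual_cost - first_year_cost
    # Net monthly savings once the automation is running.
    monthly_net = (annual_manual_cost - annual_maintenance) / 12
    return {
        "annual_manual_cost": annual_manual_cost,
        "first_year_cost": first_year_cost,
        "first_year_net": first_year_net,
        "roi_pct": 100 * first_year_net / first_year_cost,
        "payback_months": build_cost / monthly_net,
    }


# Certificate rotation example: 16 h development + 4 h testing = 20 build hours.
result = automation_roi(
    minutes_per_run=45,
    runs_per_month=12,
    hourly_rate=150,
    build_hours=20,
    maintenance_hours_per_month=2,
)
print(f"Net first-year savings: ${result['first_year_net']:,.0f}")
print(f"ROI: {result['roi_pct']:.0f}%")
print(f"Payback: {result['payback_months']:.1f} months")
```

The function does not guard against automation whose maintenance costs more than the manual work; in that case `monthly_net` is zero or negative and the payback figure is meaningless.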
XKCD rule of thumb:
If automating a task saves 30 minutes and you do it weekly:
You save 26 hours per year, so you can spend up to 130 hours automating it and still break even over 5 years
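The rule of thumb is plain multiplication: time saved per run, times runs per year, times the horizon in years. A minimal sketch follows; the function name and the five-year default are assumptions for illustration. Thirty minutes saved weekly is 26 hours per year, or 130 hours over five years.

```python
def automation_time_budget_hours(
    minutes_saved_per_run: float,
    runs_per_year: float,
    horizon_years: float = 5,
) -> float:
    """Maximum hours worth spending on automation to break even over the horizon."""
    return minutes_saved_per_run * runs_per_year * horizon_years / 60


weekly_30_min = automation_time_budget_hours(30, runs_per_year=52)
one_year_only = automation_time_budget_hours(30, runs_per_year=52, horizon_years=1)
print(f"5-year budget: {weekly_30_min:.0f} hours")
print(f"1-year budget: {one_year_only:.0f} hours")
```

A shorter horizon is the safer choice when the underlying system may be replaced before the automation pays for itself.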
Elimination Strategies
Automation Ladder
Level 0: Fully manual
  Human does everything, every time
Level 1: Documented procedure
  Human follows a runbook step by step
Level 2: Partially automated
  Script handles most steps, human monitors
Level 3: Fully automated with human trigger
  Human runs script, it does everything
Level 4: Fully automated with machine trigger
  Event triggers automation, human reviews result
Level 5: Fully autonomous
  Event triggers automation, no human involved
  Human alerted only on failure
Goal: Move every toil item up the ladder.
Most items should be Level 3+ within 6 months.
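Tracking each toil item's rung makes the Level 3+ goal checkable. The sketch below models the ladder as an ordered enum and lists the items still below the target. The enum member names and the levels assigned to the example items are hypothetical.

```python
from enum import IntEnum


class AutomationLevel(IntEnum):
    """The six rungs of the automation ladder, ordered low to high."""
    FULLY_MANUAL = 0
    DOCUMENTED_PROCEDURE = 1
    PARTIALLY_AUTOMATED = 2
    HUMAN_TRIGGERED = 3
    MACHINE_TRIGGERED = 4
    AUTONOMOUS = 5


TARGET_LEVEL = AutomationLevel.HUMAN_TRIGGERED  # "Level 3+" goal


def below_target(
    levels: dict[str, AutomationLevel],
    target: AutomationLevel = TARGET_LEVEL,
) -> list[str]:
    """Return the names of toil items that have not reached the target rung."""
    return sorted(name for name, level in levels.items() if level < target)


# Illustrative current state; these level assignments are assumptions.
current_levels = {
    "Manual certificate rotation": AutomationLevel.DOCUMENTED_PROCEDURE,
    "Scaling database replicas": AutomationLevel.PARTIALLY_AUTOMATED,
    "Create new database": AutomationLevel.MACHINE_TRIGGERED,
}

for name in below_target(current_levels):
    print(f"Needs work: {name} (Level {current_levels[name].value})")
```

Because `IntEnum` members compare as integers, raising the bar later only requires changing `TARGET_LEVEL`.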
Self-Service Platforms
# Instead of: ticket → someone on the team runs the automation → done
# Build self-service so engineers do it themselves
self_service_catalog:
  - name: "Create new database"
    old_process: "File ticket → DBA reviews → DBA provisions → 3 days"
    new_process: "Click form → automated provisioning → 5 minutes"
    toil_eliminated: "4 hours/week (DBA team)"
  - name: "Request cloud IAM role"
    old_process: "File ticket → security review → manual creation → 2 days"
    new_process: "PR to IAM repo → automated review + Terraform apply → 30 min"
    toil_eliminated: "6 hours/week (security team)"
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Not measuring toil | Cannot improve what you don’t measure | Weekly toil tracking per engineer |
| Automating only easy toil | High-impact toil remains | Prioritize by ROI, not difficulty |
| Toil > 50% accepted as normal | No engineering capacity for improvement | SRE contract: toil must stay < 50% |
| Band-aid automation | Automates symptoms, not root cause | Fix underlying system issues |
| No toil budget in sprint planning | Toil elimination never prioritized | Dedicated automation sprint capacity |
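To illustrate "prioritize by ROI, not difficulty", the sketch below ranks the four activities from the weekly toil log two ways: by automation effort (easiest first) and by net hours reclaimed in the first year, used here as a simple ROI proxy. Several inputs are assumptions rather than data from the log: an 8-hour day, a 5-day week for converting effort estimates, and a 50% automatable share for the "partially" automatable debugging item.

```python
from dataclasses import dataclass

HOURS_PER_DAY = 8     # assumption: 8-hour working day
WEEKS_PER_YEAR = 52


@dataclass
class Candidate:
    activity: str
    weekly_toil_hours: float
    effort_days: float                 # one-time automation effort
    automatable_fraction: float = 1.0  # share of the toil that automation removes

    def net_hours_first_year(self) -> float:
        """Hours reclaimed in year one, after subtracting the build effort."""
        saved = self.weekly_toil_hours * self.automatable_fraction * WEEKS_PER_YEAR
        return saved - self.effort_days * HOURS_PER_DAY


candidates = [
    Candidate("Manual certificate rotation", 3, effort_days=2),
    Candidate("Scaling database replicas", 2, effort_days=1),
    Candidate("Answering capacity requests via ticket", 4, effort_days=5),
    # "partially" automatable: 0.5 is an assumed share, not a measured one.
    Candidate("Debugging user-reported 500 errors", 5, effort_days=10,
              automatable_fraction=0.5),
]

by_ease = sorted(candidates, key=lambda c: c.effort_days)
by_roi = sorted(candidates, key=lambda c: c.net_hours_first_year(), reverse=True)

print("Easiest first: ", [c.activity for c in by_ease])
print("Best ROI first:", [c.activity for c in by_roi])
```

Under these assumptions the capacity-request workflow ranks first by net hours reclaimed even though it takes a week to build, while an easiest-first ordering would put it third.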
Toil is the tax on engineering capacity. Every hour spent on toil is an hour not spent on reliability improvements, feature work, or career development. Measure it, budget it, and systematically eliminate it.