Toil Elimination: Automating the Work Nobody Should Be Doing
Identify and eliminate engineering toil with systematic automation. Covers toil budgets, automation ROI calculation, scripting patterns, and building a culture where repetitive work is treated as a bug to be fixed.
Toil is the work that keeps the lights on but adds no lasting value. It is the manual database promotion, the weekly report that someone copies from three dashboards into a spreadsheet, the runbook step that says “SSH into the production server and restart the process.” Toil feels productive because it keeps you busy, but busyness is not progress: every hour spent on toil is an hour not spent improving the system.
Google’s SRE book defines toil as work that is manual, repetitive, automatable, tactical, of no enduring value, and that scales linearly with service growth. If your service doubles in size and the amount of toil doubles with it, you are maintaining the problem, not solving it.
This guide is about building systematic toil elimination into your engineering culture.
The Toil Budget
SRE teams at Google aim to keep toil below 50% of each engineer's time. If your team spends more than half its time on operational toil, it has no capacity to improve the systems it operates, which means the toil will only grow.
| Toil Level | Engineering Impact | What to Do |
|---|---|---|
| < 25% | Healthy. Team has capacity for projects. | Maintain. Keep automating. |
| 25-50% | Sustainable but uncomfortable. | Dedicate 1 sprint/quarter to automation. |
| 50-75% | Unsustainable. Team is a support desk. | Stop feature work. Automate top 3 toil sources. |
| > 75% | Crisis. Engineers are quitting. | Executive escalation. Hire or automate immediately. |
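As a rough sketch, the bands in the table above can be encoded as a small lookup so that a team report can label its current toil level. The function name `toil_level` is hypothetical, and the handling of boundary values (exactly 50% is placed in the 25-50% band, and so on) is an assumption, since the table does not say which band a boundary belongs to.

```python
def toil_level(toil_hours: float, total_hours: float) -> tuple[float, str]:
    """Return (toil percentage, recommended action) using the bands from the table."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    pct = toil_hours / total_hours * 100
    if pct < 25:
        action = "Maintain. Keep automating."
    elif pct <= 50:  # assumption: boundary values fall in the lower band
        action = "Dedicate 1 sprint/quarter to automation."
    elif pct <= 75:
        action = "Stop feature work. Automate top 3 toil sources."
    else:
        action = "Executive escalation. Hire or automate immediately."
    return round(pct, 1), action


# Example: 22 hours of toil in a 40-hour week
print(toil_level(22, 40))
# → (55.0, 'Stop feature work. Automate top 3 toil sources.')
```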
Measuring Toil
You cannot eliminate what you do not measure. Track toil for 2 weeks:
# Toil Log Template (have each engineer fill this out for 2 weeks)
## [Engineer Name] — Week of [Date]
| Task | Time Spent | Frequency | Could It Be Automated? |
|---|---|---|---|
| Restart payment worker | 15 min | 3x/week | Yes — healthcheck + auto-restart |
| Generate weekly metrics report | 45 min | 1x/week | Yes — scheduled query + Slack |
| Rotate API keys for partner | 30 min | 2x/month | Yes — automated rotation |
| Investigate false positive alerts | 20 min | Daily | Yes — fix alert thresholds |
| Onboard new service to monitoring | 2 hrs | 1x/month | Partially — template dashboards |
**Total toil per week: about 3.9 hours (roughly 10% of a 40-hour week), counting daily tasks 5x/week and averaging monthly tasks over 4 weeks**
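To turn a log like this into a weekly total, the time per occurrence has to be multiplied by how often the task occurs. The following is a minimal sketch, not a prescribed tool: the names `ToilEntry` and `weekly_toil_hours` are hypothetical, and it assumes a 5-day work week and 4 weeks per month when averaging. The result depends on those assumptions, so it will not necessarily match a hand-tallied total.

```python
from dataclasses import dataclass

# Assumed factors for converting each frequency unit to occurrences per week
OCCURRENCES_PER_WEEK = {"day": 5.0, "week": 1.0, "month": 1 / 4}


@dataclass
class ToilEntry:
    task: str
    minutes: float  # time spent per occurrence
    count: float    # occurrences per period
    period: str     # "day", "week", or "month"

    def weekly_minutes(self) -> float:
        return self.minutes * self.count * OCCURRENCES_PER_WEEK[self.period]


def weekly_toil_hours(entries) -> float:
    """Sum the weekly-averaged time of all entries, in hours."""
    return sum(e.weekly_minutes() for e in entries) / 60


# The example rows from the template
log = [
    ToilEntry("Restart payment worker", 15, 3, "week"),
    ToilEntry("Generate weekly metrics report", 45, 1, "week"),
    ToilEntry("Rotate API keys for partner", 30, 2, "month"),
    ToilEntry("Investigate false positive alerts", 20, 1, "day"),
    ToilEntry("Onboard new service to monitoring", 120, 1, "month"),
]

hours = weekly_toil_hours(log)
print(f"{hours:.1f} hours/week ({hours / 40:.1%} of a 40-hour week)")
```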
The Automation ROI Calculator
Not all toil is worth automating. The classic xkcd “Is It Worth the Time?” chart applies — but with a twist: you should also factor in the cognitive cost and error risk of manual work.
def calculate_automation_roi(
    manual_time_minutes: float,
    frequency_per_month: float,
    automation_build_hours: float,
    error_probability: float = 0.05,
    error_cost_hours: float = 2.0,
) -> dict:
    """
    Calculate whether automation is worth building.
    Returns a dict with the months until the automation pays for itself,
    the annual hours saved, and a worth-automating flag.
    """
    # Time spent doing the task by hand each month
    monthly_manual_cost_hours = (manual_time_minutes * frequency_per_month) / 60
    # Include the hidden cost of errors from manual execution
    monthly_error_cost_hours = frequency_per_month * error_probability * error_cost_hours
    total_monthly_savings = monthly_manual_cost_hours + monthly_error_cost_hours
    if total_monthly_savings == 0:
        # Nothing to save: return the same dict shape so callers need no special case
        return {
            'payback_months': float('inf'),
            'annual_savings_hours': 0.0,
            'worth_automating': False,
        }
    payback_months = automation_build_hours / total_monthly_savings
    return {
        'payback_months': round(payback_months, 1),
        'annual_savings_hours': round(total_monthly_savings * 12, 0),
        'worth_automating': payback_months < 6,  # 6-month threshold
    }

# Examples:
print(calculate_automation_roi(
    manual_time_minutes=30,
    frequency_per_month=20,      # Almost daily
    automation_build_hours=8,    # One day to automate
))
# → {'payback_months': 0.7, 'annual_savings_hours': 144.0, 'worth_automating': True}
print(calculate_automation_roi(
    manual_time_minutes=10,
    frequency_per_month=1,       # Once a month
    automation_build_hours=16,   # Two days to automate
))
# → {'payback_months': 60.0, 'annual_savings_hours': 3.0, 'worth_automating': False}
The Decision Framework
| Task Profile | Automate? |
|---|---|
| Done daily + takes > 10 min | ✅ Always |
| Done weekly + takes > 30 min | ✅ Yes |
| Done monthly + takes > 2 hours | ✅ Probably |
| Done monthly + takes < 15 min | ❌ Not worth it |
| Done once a year | ❌ Write a runbook instead |
| Error-prone regardless of frequency | ✅ Yes — human error risk justifies it |
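The table above can be read as a short rule list. The sketch below is one possible encoding, with two assumptions that the table does not state: the error-prone rule is checked first (the table says "regardless of frequency"), and any combination the table does not cover falls back to a fuller ROI calculation. The function name `quick_decision` is hypothetical.

```python
def quick_decision(frequency: str, minutes: float, error_prone: bool = False) -> str:
    """Apply the quick-decision rules.

    frequency is one of "daily", "weekly", "monthly", "yearly".
    minutes is the time one manual run takes.
    """
    if error_prone:
        # Assumption: error risk overrides every frequency rule
        return "Yes: human error risk justifies it"
    if frequency == "daily" and minutes > 10:
        return "Always"
    if frequency == "weekly" and minutes > 30:
        return "Yes"
    if frequency == "monthly" and minutes > 120:
        return "Probably"
    if frequency == "monthly" and minutes < 15:
        return "Not worth it"
    if frequency == "yearly":
        return "Write a runbook instead"
    # Assumption: the table leaves gaps (e.g. weekly tasks under 30 minutes)
    return "Unclear: run the ROI calculation"


print(quick_decision("daily", 20))     # → Always
print(quick_decision("monthly", 10))   # → Not worth it
print(quick_decision("weekly", 15))    # → Unclear: run the ROI calculation
```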
Common Toil Patterns and Solutions
Pattern 1: Manual Data Tasks
| Toil | Automation |
|---|---|
| “Copy data from dashboard to spreadsheet for weekly report” | Scheduled SQL query → formatted email/Slack message |
| “Export CSV, transform in Excel, upload to partner SFTP” | Python script with pandas on cron/Airflow |
| “Check 5 dashboards every morning” | Unified dashboard with anomaly detection alerts |
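# Example: Replace manual weekly report with automated delivery
import os
import smtplib
import time
from datetime import datetime
from email.mime.text import MIMEText

import psycopg2
import schedule

DATABASE_URL = os.environ["DATABASE_URL"]  # assumption: connection string comes from the environment

def format_report(rows):
    # Format the query rows as an HTML table (illustrative helper)
    header = "<tr><th>Week</th><th>Signups</th><th>Conversions</th><th>Rate</th></tr>"
    body = "".join(f"<tr><td>{w:%Y-%m-%d}</td><td>{s}</td><td>{c}</td><td>{r}%</td></tr>" for w, s, c, r in rows)
    return f"<table>{header}{body}</table>"

def send_email(to, subject, body):
    # Assumed setup: an SMTP relay on localhost and a placeholder sender address
    msg = MIMEText(body, "html")
    msg["Subject"], msg["From"], msg["To"] = subject, "reports@company.com", to
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def generate_weekly_report():
    conn = psycopg2.connect(DATABASE_URL)
    try:
        with conn.cursor() as cursor:
            cursor.execute("""
                SELECT
                    date_trunc('week', created_at) as week,
                    count(*) as signups,
                    count(*) filter (where converted) as conversions,
                    round(count(*) filter (where converted)::numeric / count(*) * 100, 1) as rate
                FROM users
                WHERE created_at > now() - interval '4 weeks'
                GROUP BY 1 ORDER BY 1
            """)
            rows = cursor.fetchall()
    finally:
        conn.close()  # release the connection even if the query fails
    send_email(
        to="team@company.com",
        subject=f"Weekly Metrics Report - {datetime.now().strftime('%B %d')}",
        body=format_report(rows),  # Format as HTML table
    )

schedule.every().monday.at("09:00").do(generate_weekly_report)
# schedule only registers the job; this loop is what actually runs it
while True:
    schedule.run_pending()
    time.sleep(60)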
Pattern 2: Infrastructure Toil
| Toil | Automation |
|---|---|
| “SSH into server and restart service” | Kubernetes liveness probe + auto-restart |
| “Manually scale up before expected traffic spike” | Scheduled autoscaling policy |
| “Check certificate expiry dates” | cert-manager with auto-renewal |
| “Rotate database passwords quarterly” | Vault dynamic credentials (auto-rotating) |
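For services that do not run on Kubernetes, the idea behind the first row (check health on a schedule, restart after repeated failures) can be sketched by hand. This is an illustration of the pattern only, not how Kubernetes implements probes. The function name `watchdog_step` and its parameters are hypothetical, and the default of three consecutive failures is an assumption similar in spirit to a probe's `failureThreshold` setting.

```python
from typing import Callable


def watchdog_step(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    failures: int,
    failure_threshold: int = 3,
) -> int:
    """Run one health check and restart after repeated consecutive failures.

    Returns the updated count of consecutive failures.
    """
    if is_healthy():
        return 0  # any success resets the counter
    failures += 1
    if failures >= failure_threshold:
        restart()
        return 0  # start counting again after a restart
    return failures


# Simulated run: one success, three failures in a row, then recovery
checks = iter([True, False, False, False, True])
restarts = []
failures = 0
for _ in range(5):
    failures = watchdog_step(lambda: next(checks), lambda: restarts.append("restart"), failures)

print(len(restarts))  # → 1 (triggered by the third consecutive failure)
```

In a real deployment the two callables would wrap an HTTP health check and the service manager's restart command, and the step would run on a timer. Requiring several consecutive failures avoids restarting on a single transient error.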
Pattern 3: Operational Toil
| Toil | Automation |
|---|---|
| “Onboard new engineer to 15 systems” | Onboarding script + IDP provisioning |
| “Create Jira ticket for every alert” | PagerDuty → Jira integration |
| “Update runbook after every incident” | Post-mortem template auto-generates runbook update |
| “Check compliance evidence monthly” | Evidence collection script on cron |
Building a Toil-Elimination Culture
The hardest part of eliminating toil is not the automation itself — it is convincing people that manual work is a problem worth solving. Many engineers take pride in being the person who “knows how to restart the billing system,” even though that knowledge should be in a script, not a person.
Principles
- Toil is a bug, not a feature. Track it the same way you track software defects.
- Automate the second time. The first time you do something manually is learning. The second time is a signal to automate.
- Make it easy to report toil. A Slack command, a form, a tag in your issue tracker. If reporting toil is itself toilsome, nobody will report it.
- Celebrate automation wins. “This script replaces 4 hours/week of manual work” should be announced the same way you announce feature launches.
- Budget for elimination. Dedicate 10-20% of sprint capacity to automation. If you wait for “free time,” it will never happen.
Implementation Checklist
- Have every engineer log their toil for 2 weeks using the template above
- Calculate your team’s toil percentage (target: < 50%)
- Rank toil items by time × frequency × error risk
- Automate the top 3 toil sources in the next sprint
- Set up a “toil backlog” in your issue tracker, separate from feature work
- Allocate 10-20% of sprint capacity to toil elimination permanently
- Track automation ROI: hours saved per month for each automation built
- Review toil metrics quarterly and celebrate the trend line going down
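The ranking step in the checklist (time × frequency × error risk) can be sketched as a simple score. The function names are hypothetical, and so is the risk scale: this sketch assumes error risk is a weight from 1 (low) to 3 (high), because the checklist does not define one. The risk weights in the example data are illustrative values, not measurements.

```python
def toil_score(minutes: float, per_month: float, risk_weight: int) -> float:
    """Score = time per run × runs per month × error-risk weight (assumed 1-3 scale)."""
    return minutes * per_month * risk_weight


def top_toil_sources(items, n=3):
    """Return the n highest-scoring (task, score) pairs, highest first."""
    scored = [(task, toil_score(m, f, r)) for task, m, f, r in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# (task, minutes per run, runs per month, illustrative risk weight)
items = [
    ("Restart payment worker", 15, 12, 2),
    ("Generate weekly metrics report", 45, 4, 1),
    ("Rotate API keys for partner", 30, 2, 3),
    ("Investigate false positive alerts", 20, 20, 1),
    ("Onboard new service to monitoring", 120, 1, 2),
]

for task, score in top_toil_sources(items):
    print(f"{score:>6.0f}  {task}")
```

With these example weights, the top three are the false positive alerts (400), the payment worker restarts (360), and monitoring onboarding (240).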