Toil Elimination: Automating the Work Nobody Should Be Doing

Toil is the work that keeps the lights on but adds no lasting value. It is the manual database promotion, the weekly report that someone copies from three dashboards into a spreadsheet, the runbook step that says “SSH into the production server and restart the process.” Toil feels productive because it keeps you busy, but busyness is not progress: every hour spent on toil is an hour not spent improving the system.

Google’s SRE book defines toil as work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. If your service doubles in size and the amount of toil doubles with it, you are maintaining the problem, not solving it.

This guide is about building systematic toil elimination into your engineering culture.

The Toil Budget

Google’s SRE organization aims to keep toil below 50% of each engineer’s time. If your team spends more than half its time on operational toil, it has no capacity to improve the systems it operates, which means the toil will only grow.

Toil Level	Engineering Impact	What to Do
< 25%	Healthy. Team has capacity for projects.	Maintain. Keep automating.
25-50%	Sustainable but uncomfortable.	Dedicate 1 sprint/quarter to automation.
50-75%	Unsustainable. Team is a support desk.	Stop feature work. Automate top 3 toil sources.
> 75%	Crisis. Engineers are quitting.	Executive escalation. Hire or automate immediately.
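
As a rough sketch, the bands in the table above can be encoded as a small lookup. The function name `toil_band` is hypothetical, and assigning the boundary values (exactly 25%, 50%, 75%) to the higher band is an assumption, since the table leaves that choice open.

```python
def toil_band(toil_hours: float, total_hours: float) -> tuple[float, str]:
    """Return (toil percentage, recommended action) using the table's bands."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    pct = 100 * toil_hours / total_hours
    # Assumption: a boundary value (25, 50, 75) falls into the higher band.
    if pct < 25:
        action = "Maintain. Keep automating."
    elif pct < 50:
        action = "Dedicate 1 sprint/quarter to automation."
    elif pct < 75:
        action = "Stop feature work. Automate top 3 toil sources."
    else:
        action = "Executive escalation. Hire or automate immediately."
    return round(pct, 1), action


print(toil_band(10, 40))
# → (25.0, 'Dedicate 1 sprint/quarter to automation.')
```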

Measuring Toil

You cannot eliminate what you do not measure. Track toil for 2 weeks:

# Toil Log Template (have each engineer fill this out for 2 weeks)

## [Engineer Name] — Week of [Date]

| Task | Time Spent | Frequency | Could It Be Automated? |
|---|---|---|---|
| Restart payment worker | 15 min | 3x/week | Yes — healthcheck + auto-restart |
| Generate weekly metrics report | 45 min | 1x/week | Yes — scheduled query + Slack |
| Rotate API keys for partner | 30 min | 2x/month | Yes — automated rotation |
| Investigate false positive alerts | 20 min | Daily | Yes — fix alert thresholds |
| Onboard new service to monitoring | 2 hrs | 1x/month | Partially — template dashboards |

**Total toil this week: about 5.7 hours (roughly 14% of a 40-hour week), counting each monthly task once**
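
Once the logs are collected, adding them up by hand is itself a small piece of toil. The sketch below sums a single week's log and turns it into a percentage. `ToilEntry` and both helper functions are hypothetical names. The sample data is only the five rows shown in the template, with "Daily" read as 5 working days and each monthly task assumed to have happened once that week.

```python
from dataclasses import dataclass


@dataclass
class ToilEntry:
    task: str
    minutes: float           # time spent per occurrence
    times_this_week: float   # how often it happened during the logged week


def weekly_toil_hours(entries: list[ToilEntry]) -> float:
    """Total hours of toil recorded in one engineer's weekly log."""
    return sum(e.minutes * e.times_this_week for e in entries) / 60


def toil_percentage(entries: list[ToilEntry], work_hours: float = 40.0) -> float:
    """Toil as a percentage of the working week."""
    return 100 * weekly_toil_hours(entries) / work_hours


log = [
    ToilEntry("Restart payment worker", 15, 3),
    ToilEntry("Generate weekly metrics report", 45, 1),
    ToilEntry("Rotate API keys for partner", 30, 1),         # assumed: once this week
    ToilEntry("Investigate false positive alerts", 20, 5),   # "Daily" = 5 working days
    ToilEntry("Onboard new service to monitoring", 120, 1),  # assumed: once this week
]

print(f"{weekly_toil_hours(log):.1f} hours of toil "
      f"({toil_percentage(log):.1f}% of a 40-hour week)")
# → 5.7 hours of toil (14.2% of a 40-hour week)
```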

The Automation ROI Calculator

Not all toil is worth automating. The classic xkcd “Is It Worth the Time?” chart applies — but with a twist: you should also factor in the cognitive cost and error risk of manual work.

def calculate_automation_roi(
    manual_time_minutes: float,
    frequency_per_month: float,
    automation_build_hours: float,
    error_probability: float = 0.05,
    error_cost_hours: float = 2.0,
) -> dict:
    """
    Calculate whether automation is worth building.

    Returns a dict with the months until the automation pays for itself,
    the hours saved per year, and whether payback is under 6 months.
    """
    # Hours of hands-on work the manual task consumes each month
    monthly_manual_cost_hours = (manual_time_minutes * frequency_per_month) / 60

    # Include the hidden cost of errors from manual execution:
    # expected failed runs per month × hours to recover from each
    monthly_error_cost_hours = frequency_per_month * error_probability * error_cost_hours

    total_monthly_savings = monthly_manual_cost_hours + monthly_error_cost_hours

    if total_monthly_savings <= 0:
        # Nothing to save, so the automation never pays for itself.
        # Return the same dict shape as the normal path.
        return {
            'payback_months': float('inf'),
            'annual_savings_hours': 0.0,
            'worth_automating': False,
        }

    payback_months = automation_build_hours / total_monthly_savings

    return {
        'payback_months': round(payback_months, 1),
        'annual_savings_hours': round(total_monthly_savings * 12, 0),
        'worth_automating': payback_months < 6,  # 6-month threshold
    }


# Examples:
print(calculate_automation_roi(
    manual_time_minutes=30,
    frequency_per_month=20,       # Almost daily
    automation_build_hours=8,     # One day to automate
))
# → {'payback_months': 0.7, 'annual_savings_hours': 144.0, 'worth_automating': True}

print(calculate_automation_roi(
    manual_time_minutes=10,
    frequency_per_month=1,        # Once a month
    automation_build_hours=16,    # Two days to automate
))
# → {'payback_months': 60.0, 'annual_savings_hours': 3.0, 'worth_automating': False}

The Decision Framework

Task Profile	Automate?
Done daily + takes > 10 min	✅ Always
Done weekly + takes > 30 min	✅ Yes
Done monthly + takes > 2 hours	✅ Probably
Done monthly + takes < 15 min	❌ Not worth it
Done once a year	❌ Write a runbook instead
Error-prone regardless of frequency	✅ Yes — human error risk justifies it
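
The table can also be read as a short chain of rules. The sketch below is one possible encoding, not a definitive policy: the function name `quick_decision` and the frequency labels are hypothetical, and the fallback for cases the table does not cover (for example, a 45-minute monthly task) is an assumption.

```python
def quick_decision(frequency: str, minutes: float, error_prone: bool = False) -> str:
    """Apply the quick-decision table.

    frequency is one of "daily", "weekly", "monthly", "yearly".
    """
    if error_prone:
        # "Error-prone regardless of frequency": the error risk justifies it
        return "automate"
    if frequency == "yearly":
        return "write a runbook"
    if frequency == "daily" and minutes > 10:
        return "automate"
    if frequency == "weekly" and minutes > 30:
        return "automate"
    if frequency == "monthly" and minutes > 120:
        return "probably automate"
    if frequency == "monthly" and minutes < 15:
        return "not worth it"
    # Assumption: the table is silent here, so fall back to a payback estimate.
    return "no quick answer: estimate the payback period"


print(quick_decision("daily", 20))                      # → automate
print(quick_decision("monthly", 10))                    # → not worth it
print(quick_decision("yearly", 300, error_prone=True))  # → automate
```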

Common Toil Patterns and Solutions

Pattern 1: Manual Data Tasks

Toil	Automation
“Copy data from dashboard to spreadsheet for weekly report”	Scheduled SQL query → formatted email/Slack message
“Export CSV, transform in Excel, upload to partner SFTP”	Python script with `pandas` on cron/Airflow
“Check 5 dashboards every morning”	Unified dashboard with anomaly detection alerts

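# Example: Replace manual weekly report with automated delivery
import os
import smtplib
import time
from datetime import datetime
from email.mime.text import MIMEText

import psycopg2
import schedule

# Assumed configuration: connection string and mail relay come from the environment
DATABASE_URL = os.environ["DATABASE_URL"]
SMTP_HOST = os.environ.get("SMTP_HOST", "localhost")

def format_report(rows):
    """Format the query rows as an HTML table."""
    cells = "".join(
        f"<tr><td>{week:%Y-%m-%d}</td><td>{signups}</td>"
        f"<td>{conversions}</td><td>{rate}%</td></tr>"
        for week, signups, conversions, rate in rows
    )
    return ("<table><tr><th>Week</th><th>Signups</th>"
            f"<th>Conversions</th><th>Rate</th></tr>{cells}</table>")

def send_email(to, subject, body):
    msg = MIMEText(body, "html")
    msg["Subject"], msg["To"] = subject, to
    msg["From"] = "reports@company.com"  # placeholder sender address
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

def generate_weekly_report():
    conn = psycopg2.connect(DATABASE_URL)
    try:
        with conn.cursor() as cursor:
            cursor.execute("""
                SELECT
                    date_trunc('week', created_at) as week,
                    count(*) as signups,
                    count(*) filter (where converted) as conversions,
                    round(count(*) filter (where converted)::numeric / count(*) * 100, 1) as rate
                FROM users
                WHERE created_at > now() - interval '4 weeks'
                GROUP BY 1 ORDER BY 1
            """)
            rows = cursor.fetchall()
    finally:
        conn.close()  # release the connection even if the query fails

    send_email(
        to="team@company.com",
        subject=f"Weekly Metrics Report - {datetime.now().strftime('%B %d')}",
        body=format_report(rows),
    )

schedule.every().monday.at("09:00").do(generate_weekly_report)

# schedule only fires jobs when polled, so keep a loop running
while True:
    schedule.run_pending()
    time.sleep(60)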

Pattern 2: Infrastructure Toil

Toil	Automation
“SSH into server and restart service”	Kubernetes liveness probe + auto-restart
“Manually scale up before expected traffic spike”	Scheduled autoscaling policy
“Check certificate expiry dates”	cert-manager with auto-renewal
“Rotate database passwords quarterly”	Vault dynamic credentials (auto-rotating)
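
For services that do not run on Kubernetes, the "health check plus auto-restart" idea from the first row can be approximated with a small watchdog loop. This is a minimal sketch under stated assumptions, not a replacement for a real supervisor. The `watch` function and its parameters are hypothetical, loosely modeled on a liveness probe's failure threshold and polling period, and the health check and restart action are passed in by the caller.

```python
import time
from typing import Callable, Optional


def watch(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    failure_threshold: int = 3,
    period_seconds: float = 10.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll is_healthy; call restart after failure_threshold consecutive failures.

    Returns the number of restarts performed. max_checks bounds the loop
    (None means run until the process is stopped).
    """
    failures = restarts = checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            ok = is_healthy()
        except Exception:
            ok = False  # a crashing health check counts as a failed check
        failures = 0 if ok else failures + 1
        if failures >= failure_threshold:
            restart()
            restarts += 1
            failures = 0  # start counting again after the restart
        sleep(period_seconds)
    return restarts


# Simulated run: three consecutive failures trigger exactly one restart
results = iter([True, False, False, False, True, False])
restarts = watch(lambda: next(results), lambda: print("restarting service"),
                 max_checks=6, sleep=lambda _: None)
print(restarts)  # → 1
```

Requiring several consecutive failures before restarting keeps a single slow response from bouncing the service.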

Pattern 3: Operational Toil

Toil	Automation
“Onboard new engineer to 15 systems”	Onboarding script + IDP provisioning
“Create Jira ticket for every alert”	PagerDuty → Jira integration
“Update runbook after every incident”	Post-mortem template auto-generates runbook update
“Check compliance evidence monthly”	Evidence collection script on cron

Building a Toil-Elimination Culture

The hardest part of eliminating toil is not the automation itself — it is convincing people that manual work is a problem worth solving. Many engineers take pride in being the person who “knows how to restart the billing system,” even though that knowledge should be in a script, not a person.

Principles

Toil is a bug, not a feature. Track it the same way you track software defects.
Automate the second time. The first time you do something manually is learning. The second time is a signal to automate.
Make it easy to report toil. A Slack command, a form, a tag in your issue tracker. If reporting toil is itself toilsome, nobody will report it.
Celebrate automation wins. “This script replaces 4 hours/week of manual work” should be announced the same way you announce feature launches.
Budget for elimination. Dedicate 10-20% of sprint capacity to automation. If you wait for “free time,” it will never happen.

Implementation Checklist

Have every engineer log their toil for 2 weeks using the template above
Calculate your team’s toil percentage (target: < 50%)
Rank toil items by time × frequency × error risk
Automate the top 3 toil sources in the next sprint
Set up a “toil backlog” in your issue tracker, separate from feature work
Allocate 10-20% of sprint capacity to toil elimination permanently
Track automation ROI: hours saved per month for each automation built
Review toil metrics quarterly and celebrate the trend line going down
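
The ranking step in the checklist (time × frequency × error risk) can be sketched as a simple score. `ToilItem` and `top_toil` are hypothetical names, the error probabilities and monthly frequencies are illustrative placeholders and not measured data, and weighting by `(1 + error_probability)` is an assumption made so that a task with zero error risk still gets a non-zero score.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    name: str
    minutes_per_run: float
    runs_per_month: float
    error_probability: float  # chance that one manual run goes wrong (0 to 1)

    @property
    def score(self) -> float:
        # time × frequency × error risk; the (1 + p) weighting is an assumption
        return self.minutes_per_run * self.runs_per_month * (1 + self.error_probability)


def top_toil(items: list[ToilItem], n: int = 3) -> list[ToilItem]:
    """Return the n highest-scoring toil items, worst first."""
    return sorted(items, key=lambda item: item.score, reverse=True)[:n]


# Illustrative values only
items = [
    ToilItem("Restart payment worker", 15, 13, 0.05),
    ToilItem("Generate weekly metrics report", 45, 4, 0.10),
    ToilItem("Rotate API keys for partner", 30, 2, 0.20),
    ToilItem("Investigate false positive alerts", 20, 20, 0.05),
    ToilItem("Onboard new service to monitoring", 120, 1, 0.10),
]

print([item.name for item in top_toil(items)])
# → ['Investigate false positive alerts', 'Restart payment worker', 'Generate weekly metrics report']
```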