Runbook Automation
Convert manual operational runbooks into automated, self-executing workflows. Covers runbook templating, incident automation, approval workflows, escalation logic, and the patterns that turn tribal knowledge into reliable automation.
A runbook is a set of steps an operator follows to respond to an alert or perform a task. A manual runbook is documentation. An automated runbook is infrastructure. The difference: a manual runbook can take 15-45 minutes, and the outcome depends on who is on call. An automated runbook typically finishes in 30-90 seconds and executes the same way every time.
Manual to Automated
Manual Runbook: "Database Connection Pool Exhaustion"
1. SSH to database server
2. Run: SELECT count(*) FROM pg_stat_activity;
3. Identify idle connections older than 1 hour
4. Kill idle connections: SELECT pg_terminate_backend(pid) ...
5. Check application logs for connection leaks
6. Restart affected application pod if needed
7. Verify pool usage is normal
8. Update incident ticket
Time: 15-30 minutes
Depends on: Who is on call, their experience
Risk: Wrong connection killed, wrong pod restarted
Automated Runbook:
1. Alert triggers automation
2. Script checks pg_stat_activity
3. Auto-kills connections idle > 1 hour
4. Verifies pool usage drops below threshold
5. If not: requests approval, then restarts oldest application pod
6. Posts summary to incident channel
7. Creates resolution ticket
Time: 30-90 seconds
Depends on: The automation itself, not who is on call (runs the same every time)
Risk: Bounded (only idle connections are killed automatically, never active ones; a pod restart needs approval)
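Steps 2 and 3 of the automated runbook can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: `IDLE_CONNECTIONS_SQL` and `select_idle_pids` are hypothetical names, and the query assumes the standard `pid`, `state`, and `state_change` columns of `pg_stat_activity`. The selection logic is kept as a pure function so the "never kill an active connection" rule can be tested without a database.

```python
from datetime import datetime, timedelta, timezone

# Query a real implementation might run (assumption: standard pg_stat_activity columns).
# It excludes the automation's own backend so the script never selects itself.
IDLE_CONNECTIONS_SQL = """
SELECT pid, state, state_change
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '1 hour'
  AND pid <> pg_backend_pid();
"""


def select_idle_pids(rows, now, max_idle=timedelta(hours=1)):
    """Return the pids of connections that have been idle longer than max_idle.

    rows: iterable of dicts with 'pid', 'state', and 'state_change' keys.
    Active connections are never selected, whatever their age.
    """
    return [
        row["pid"]
        for row in rows
        if row["state"] == "idle" and now - row["state_change"] > max_idle
    ]


# Usage with made-up sample rows.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample_rows = [
    {"pid": 101, "state": "active", "state_change": now - timedelta(hours=3)},
    {"pid": 102, "state": "idle", "state_change": now - timedelta(hours=2)},
    {"pid": 103, "state": "idle", "state_change": now - timedelta(minutes=10)},
]
pids_to_terminate = select_idle_pids(sample_rows, now)
```

Each selected pid would then be passed to `pg_terminate_backend(pid)`. Keeping the filter separate from the kill makes the blast radius easy to review before anything is terminated.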
Implementation
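```python
import asyncio

# Automated runbook framework.
# RunbookLog and the step classes are assumed to be defined elsewhere.
class Runbook:
    def __init__(self, name, trigger, steps, escalation, approval_gate=None):
        self.name = name
        self.trigger = trigger
        self.steps = steps
        self.escalation = escalation
        self.approval_gate = approval_gate  # e.g. an ApprovalGate instance

    def request_approval(self, step, context):
        # The approval gate is async; block until it returns a decision.
        return asyncio.run(self.approval_gate.request_approval(step, context))

    def execute(self, context):
        log = RunbookLog(self.name, context)
        for step in self.steps:
            try:
                # Ask for approval before the step runs, not after.
                if getattr(step, "requires_approval", False):
                    approval = self.request_approval(step, context)
                    if not approval.granted:
                        log.record(step.name, "APPROVAL_DENIED", approval)
                        self.escalation.escalate(log)
                        return log
                result = step.execute(context)
                log.record(step.name, "SUCCESS", result)
            except Exception as e:
                # Any failure stops the runbook and hands off to a human.
                log.record(step.name, "FAILED", str(e))
                self.escalation.escalate(log)
                return log
        log.complete()
        return log
```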
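The framework calls `RunbookLog(name, context)`, `log.record(...)`, and `log.complete()` without showing the class. One possible shape for it is sketched below, assuming an in-memory list of entries is enough. The field names and the `to_json` helper are hypothetical and only illustrate the "log every step, every decision" idea.

```python
import json
from datetime import datetime, timezone


class RunbookLog:
    """In-memory audit trail for one runbook execution (illustrative sketch)."""

    def __init__(self, runbook_name, context):
        self.runbook_name = runbook_name
        self.context = context
        self.entries = []
        self.completed = False

    def record(self, step_name, status, detail=None):
        # Store a timestamped entry; repr() keeps arbitrary results serializable.
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "step": step_name,
            "status": status,
            "detail": repr(detail),
        })

    def complete(self):
        # Called only when every step finished without failure or denial.
        self.completed = True

    def to_json(self):
        return json.dumps({
            "runbook": self.runbook_name,
            "completed": self.completed,
            "entries": self.entries,
        }, indent=2)


# Usage: record two steps, then mark the run complete.
log = RunbookLog("Database Connection Pool Exhaustion", context={"alert_id": "example"})
log.record("Count active connections", "SUCCESS", 95)
log.record("Kill idle connections > 1hr", "SUCCESS", [102])
log.complete()
```

Because a failed or denied run returns before `complete()` is called, the `completed` flag distinguishes a clean run from one that escalated.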
```python
# Define a runbook.
# The helper functions (check_active_connections, etc.) are defined elsewhere.
db_pool_exhaustion = Runbook(
    name="Database Connection Pool Exhaustion",
    trigger=Alert("db_active_connections > 90% of max"),
    steps=[
        CheckStep("Count active connections", check_active_connections),
        CheckStep("Identify idle connections", find_idle_connections),
        ActionStep("Kill idle connections > 1hr", kill_idle_connections,
                   requires_approval=False),
        VerifyStep("Verify pool usage < 70%", verify_pool_usage),
        ConditionalStep("Restart oldest app pod if still high",
                        condition=lambda ctx: ctx.pool_usage > 0.7,
                        action=restart_oldest_pod,
                        requires_approval=True),
        NotifyStep("Post resolution to Slack", post_resolution),
    ],
    escalation=PagerDutyEscalation(team="database-oncall"),
)
```
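The step classes used in this definition are not shown. A minimal sketch of how they could look is given below, assuming each step wraps a callable that takes the shared context. The `Step` base class and the `"SKIPPED"` return value are assumptions made for illustration, not part of the original framework.

```python
from types import SimpleNamespace


class Step:
    """Base step: a named callable plus an approval flag (hypothetical design)."""

    def __init__(self, name, action, requires_approval=False):
        self.name = name
        self.action = action
        self.requires_approval = requires_approval

    def execute(self, context):
        return self.action(context)


# These step kinds differ only in intent, so they reuse the base behavior.
class CheckStep(Step):
    """Read-only inspection."""


class ActionStep(Step):
    """Changes system state."""


class VerifyStep(Step):
    """Re-measures state after an action, typically storing it on the context."""


class NotifyStep(Step):
    """Reports the outcome to humans."""


class ConditionalStep(Step):
    """Runs its action only when condition(context) is true."""

    def __init__(self, name, condition, action, requires_approval=False):
        super().__init__(name, action, requires_approval)
        self.condition = condition

    def execute(self, context):
        if not self.condition(context):
            return "SKIPPED"
        return self.action(context)


# Usage: the restart runs only while pool usage is still above 70%.
restart = ConditionalStep(
    "Restart oldest app pod if still high",
    condition=lambda ctx: ctx.pool_usage > 0.7,
    action=lambda ctx: "restarted",
    requires_approval=True,
)
healthy = restart.execute(SimpleNamespace(pool_usage=0.55))
still_high = restart.execute(SimpleNamespace(pool_usage=0.85))
```

Keeping the condition inside the step means the runbook definition stays a flat list, and a skipped step still leaves an entry in the log.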
Approval Workflows
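```python
class ApprovalGate:
    """Some actions need human approval before executing."""

    def __init__(self, slack, timeout_seconds=300):
        # slack: a client wrapper that provides post_approval_request()
        self.slack = slack
        self.timeout_seconds = timeout_seconds

    async def request_approval(self, action, context):
        # Tell approvers what actually happens on timeout for this action.
        on_timeout = "Auto-approves" if action.safe_to_auto_approve else "Auto-denies"
        minutes = self.timeout_seconds // 60
        message = (
            f"🔧 Runbook: {context.runbook_name}\n"
            f"📋 Action: {action.description}\n"
            f"⚡ Impact: {action.impact_description}\n"
            "React ✅ to approve, ❌ to deny\n"
            f"{on_timeout} in {minutes} minutes if no response"
        )
        result = await self.slack.post_approval_request(
            channel="#incident-approvals",
            message=message,
            timeout_seconds=self.timeout_seconds,
            auto_approve_on_timeout=action.safe_to_auto_approve,
        )
        return result
```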
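The timeout behavior is the part of an approval gate that is easiest to get wrong, so it helps to see it in isolation. The sketch below uses `asyncio.wait_for` on a future that some chat integration would resolve when a human reacts. `Approval` and `wait_for_decision` are hypothetical names, and treating a timeout as a denial unless the action is marked safe is an assumption about the intended policy.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Approval:
    granted: bool
    reason: str


async def wait_for_decision(decision, timeout_seconds, auto_approve_on_timeout):
    """Wait for a human decision; fall back to the timeout policy if none arrives.

    decision: an awaitable (e.g. asyncio.Future) resolving to True (approve)
    or False (deny).
    """
    try:
        granted = await asyncio.wait_for(decision, timeout=timeout_seconds)
        return Approval(granted, "human decision")
    except asyncio.TimeoutError:
        # Only actions marked safe are approved without a human response.
        return Approval(auto_approve_on_timeout, "timeout")


async def demo():
    loop = asyncio.get_running_loop()

    # Nobody responds: a risky action is denied, a safe one is approved.
    risky = await wait_for_decision(loop.create_future(), 0.01, False)
    safe = await wait_for_decision(loop.create_future(), 0.01, True)

    # A human approves before the timeout.
    answered = loop.create_future()
    answered.set_result(True)
    approved = await wait_for_decision(answered, 0.01, False)
    return risky, safe, approved


risky, safe, approved = asyncio.run(demo())
```

Recording the `reason` alongside the decision lets the audit trail distinguish an explicit human approval from an automatic one.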
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Automate without testing | Automation causes bigger incident | Test in staging, canary in production |
| No approval gates | Destructive actions run without review | Approval for high-impact actions |
| All-or-nothing automation | Cannot partially succeed | Step-by-step with failure handling |
| No audit trail | Cannot review what automation did | Log every step, every decision |
| Automate rarely-run procedures | High maintenance, low value | Start with frequent, well-understood procedures |
The best runbooks are the ones you never have to run manually. Automate the frequent ones first, then work your way to the rare but critical ones.