Runbook Automation | The Garnet Wiki

A runbook is the set of steps an operator follows to respond to an alert or perform a task. A manual runbook is documentation. An automated runbook is infrastructure. The difference: manual runbooks take 15-45 minutes and depend on who is on call. Automated runbooks take 30-90 seconds and execute the same way every time.

Manual to Automated

Manual Runbook: "Database Connection Pool Exhaustion"
  1. SSH to database server
  2. Run: SELECT count(*) FROM pg_stat_activity;
  3. Identify idle connections older than 1 hour
  4. Kill idle connections: SELECT pg_terminate_backend(pid) ...
  5. Check application logs for connection leaks
  6. Restart affected application pod if needed
  7. Verify pool usage is normal
  8. Update incident ticket

  Time: 15-30 minutes
  Depends on: Who is on call, their experience
  Risk: Wrong connection killed, wrong pod restarted

Automated Runbook:
  1. Alert triggers automation
  2. Script checks pg_stat_activity
  3. Auto-kills connections idle > 1 hour
  4. Verifies pool usage drops below threshold
  5. If not: restarts oldest application pod
  6. Posts summary to incident channel
  7. Creates resolution ticket
  
  Time: 30-90 seconds
  Depends on: Nothing (runs the same every time)
  Risk: Bounded (only kills idle connections, never active; the pod restart is gated by approval)
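
Steps 2 and 3 of the automated runbook could be sketched roughly as follows. This is a minimal sketch, not the implementation behind this page: the function name `select_idle_pids` and the row shape (a list of dicts) are assumptions made for illustration. `pid`, `state`, and `state_change` are real columns of `pg_stat_activity`, and `pg_terminate_backend` is the PostgreSQL function named in the manual runbook. The selection logic is a pure function so it can be tested without a database.

```python
from datetime import datetime, timedelta, timezone

# Query the automation would run first (these columns exist in pg_stat_activity).
IDLE_QUERY = "SELECT pid, state, state_change FROM pg_stat_activity"

# Statement issued once per selected pid; the pid is passed as a query parameter.
TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"


def select_idle_pids(rows, now, max_idle=timedelta(hours=1)):
    """Return pids of connections that have been idle longer than max_idle.

    rows: iterable of dicts with "pid", "state", "state_change" (hypothetical shape).
    Only state == "idle" qualifies, so active sessions are never selected.
    """
    cutoff = now - max_idle
    return [
        row["pid"]
        for row in rows
        if row["state"] == "idle" and row["state_change"] < cutoff
    ]


# Example data standing in for a pg_stat_activity result set.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"pid": 101, "state": "idle", "state_change": now - timedelta(hours=3)},
    {"pid": 102, "state": "active", "state_change": now - timedelta(hours=3)},
    {"pid": 103, "state": "idle", "state_change": now - timedelta(minutes=5)},
]
print(select_idle_pids(rows, now))  # [101]
```

Keeping the filter separate from the database call is what makes the "never active" bound checkable: the rule lives in one small function with its own test.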

Implementation

# Automated runbook framework
class Runbook:
    def __init__(self, name, trigger, steps, escalation):
        self.name = name
        self.trigger = trigger
        self.steps = steps
        self.escalation = escalation
    
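    def execute(self, context):
        log = RunbookLog(self.name, context)
        
        for step in self.steps:
            try:
                # Ask for approval before the step runs; approving after
                # execution would be too late for a destructive action.
                # request_approval is expected to block until the approval
                # gate (see Approval Workflows) returns a decision.
                if getattr(step, "requires_approval", False):
                    approval = self.request_approval(step, context)
                    if not approval.granted:
                        log.record(step.name, "APPROVAL_DENIED", approval)
                        self.escalation.escalate(log)
                        return log
                
                result = step.execute(context)
                log.record(step.name, "SUCCESS", result)
                
            except Exception as e:
                # Stop at the first failure and hand the partial log to a human
                log.record(step.name, "FAILED", str(e))
                self.escalation.escalate(log)
                return log
        
        log.complete()
        return log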

# Define a runbook
db_pool_exhaustion = Runbook(
    name="Database Connection Pool Exhaustion",
    trigger=Alert("db_active_connections > 90% of max"),
    steps=[
        CheckStep("Count active connections", check_active_connections),
        CheckStep("Identify idle connections", find_idle_connections),
        ActionStep("Kill idle connections > 1hr", kill_idle_connections,
                   requires_approval=False),
        VerifyStep("Verify pool usage < 70%", verify_pool_usage),
        ConditionalStep("Restart oldest app pod if still high",
                       condition=lambda ctx: ctx.pool_usage > 0.7,
                       action=restart_oldest_pod,
                       requires_approval=True),
        NotifyStep("Post resolution to Slack", post_resolution),
    ],
    escalation=PagerDutyEscalation(team="database-oncall"),
)
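
The step types and `RunbookLog` used above are not defined on this page. One possible shape for them is sketched below, assuming that every step wraps a function taking the shared context and that the context is a plain attribute bag. All class bodies here are hypothetical, and approval handling and escalation are left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, Callable


@dataclass
class Step:
    """Base step: a name plus a function that receives the shared context."""
    name: str
    func: Callable[[Any], Any]
    requires_approval: bool = False

    def execute(self, context):
        return self.func(context)


class CheckStep(Step):
    """Read-only inspection step."""


class ActionStep(Step):
    """Step that changes state."""


class ConditionalStep(Step):
    """Runs its action only when the condition holds for the context."""

    def __init__(self, name, condition, action, requires_approval=False):
        super().__init__(name, action, requires_approval)
        self.condition = condition

    def execute(self, context):
        if not self.condition(context):
            return "skipped"
        return self.func(context)


@dataclass
class RunbookLog:
    runbook_name: str
    context: Any
    entries: list = field(default_factory=list)
    completed: bool = False

    def record(self, step_name, status, detail):
        self.entries.append((step_name, status, detail))

    def complete(self):
        self.completed = True


def kill_idle(ctx):
    ctx.pool_usage = 0.4  # pretend that terminating idle connections freed the pool
    return "terminated idle connections"


steps = [
    CheckStep("Count active connections", lambda ctx: ctx.pool_usage),
    ActionStep("Kill idle connections > 1hr", kill_idle),
    ConditionalStep("Restart oldest app pod if still high",
                    condition=lambda ctx: ctx.pool_usage > 0.7,
                    action=lambda ctx: "restarted",
                    requires_approval=True),
]

ctx = SimpleNamespace(pool_usage=0.95)
log = RunbookLog("Database Connection Pool Exhaustion", ctx)
for step in steps:
    log.record(step.name, "SUCCESS", step.execute(ctx))
log.complete()
print(log.entries[-1])
# ('Restart oldest app pod if still high', 'SUCCESS', 'skipped')
```

Because the kill step brings pool usage under 70% in this run, the conditional restart records "skipped" instead of restarting anything.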

Approval Workflows

class ApprovalGate:
    """Some actions need human approval before executing."""
    
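    async def request_approval(self, action, context):
        # Only promise auto-approval when the action is marked safe for it;
        # this keeps the message consistent with auto_approve_on_timeout below.
        if action.safe_to_auto_approve:
            timeout_note = "Auto-approves in 5 minutes if no response"
        else:
            timeout_note = "Not auto-approved: request times out in 5 minutes"
        
        message = f"""
        🔧 Runbook: {context.runbook_name}
        📋 Action: {action.description}
        ⚡ Impact: {action.impact_description}
        
        React ✅ to approve, ❌ to deny
        {timeout_note}
        """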
        
        result = await self.slack.post_approval_request(
            channel="#incident-approvals",
            message=message,
            timeout_seconds=300,
            auto_approve_on_timeout=action.safe_to_auto_approve,
        )
        
        return result
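
The timeout behavior is the part of an approval gate that is easiest to get wrong, so here is a minimal sketch of it in isolation. It assumes the human decision arrives as an awaitable that resolves to `True` or `False`; `wait_for_decision` and `Approval` are hypothetical names, and no Slack API is involved. `asyncio.wait_for` is the standard-library call that enforces the timeout.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Approval:
    granted: bool
    reason: str  # "human" or "timeout"


async def wait_for_decision(decision, timeout_seconds, auto_approve_on_timeout):
    """Wait for a human decision; on timeout, fall back to the configured policy."""
    try:
        approved = await asyncio.wait_for(decision, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Nobody reacted in time: grant only if the action is safe to auto-approve.
        return Approval(auto_approve_on_timeout, "timeout")
    return Approval(approved, "human")


async def demo():
    async def never_responds():
        await asyncio.sleep(3600)

    async def approves_quickly():
        return True

    safe = await wait_for_decision(never_responds(), 0.01, True)
    risky = await wait_for_decision(never_responds(), 0.01, False)
    human = await wait_for_decision(approves_quickly(), 1, False)
    return safe, risky, human


safe, risky, human = asyncio.run(demo())
print(safe.granted, risky.granted, human.granted)  # True False True
```

The design choice worth noting is that silence only counts as consent for actions explicitly marked safe. For everything else a timeout is treated as a denial, which sends the runbook to escalation.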

Anti-Patterns

Anti-Pattern	Consequence	Fix
Automate without testing	Automation causes bigger incident	Test in staging, canary in production
No approval gates	Destructive actions run without review	Approval for high-impact actions
All-or-nothing automation	Cannot partially succeed	Step-by-step with failure handling
No audit trail	Cannot review what automation did	Log every step, every decision
Automate rarely-run procedures	High maintenance, low value	Start with frequent, well-understood procedures
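
For the "No audit trail" row, one lightweight option is to write each step outcome and decision as a JSON line. The sketch below assumes an append-only text stream and a hypothetical `audit` helper; the field names are illustrative, not a standard.

```python
import io
import json
from datetime import datetime, timezone


def audit(stream, runbook, step, status, detail=None):
    """Append one JSON line describing a step outcome or decision."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "runbook": runbook,
        "step": step,
        "status": status,
        "detail": detail,
    }
    stream.write(json.dumps(entry) + "\n")
    return entry


# In-memory stream for the example; a real setup would use a file or log shipper.
trail = io.StringIO()
audit(trail, "db-pool-exhaustion", "Kill idle connections > 1hr", "SUCCESS",
      {"terminated": 12})
audit(trail, "db-pool-exhaustion", "Restart oldest app pod", "APPROVAL_DENIED",
      {"by": "timeout"})

records = [json.loads(line) for line in trail.getvalue().splitlines()]
print([r["status"] for r in records])  # ['SUCCESS', 'APPROVAL_DENIED']
```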

The best runbooks are the ones you never have to run manually. Automate the frequent ones first, then work your way to the rare but critical ones.
