
Site Reliability Engineering (SRE) Practices

A guide to implementing SRE practices: error budgets, incident management, blameless postmortems, capacity planning, toil reduction, and building reliability into engineering culture.

SRE is what happens when you treat operations as a software engineering problem. Google coined the term and the discipline: instead of operations teams manually managing infrastructure, SREs write code to automate operational work, define reliability targets as engineering constraints, and use error budgets to balance feature velocity with stability.


SRE vs Traditional Ops

| Aspect | Traditional Ops | SRE |
|---|---|---|
| Goal | Keep things running | Balance reliability with velocity |
| Work | Manual, reactive (firefighting) | Automated, proactive (engineering) |
| Reliability | 100% uptime (impossible goal) | Target based on error budget |
| Incidents | Blame someone | Blameless postmortems |
| Toil | Accepted as normal | Measured and reduced |
| Feature vs reliability | Constant conflict | Error budget arbitrates |

Error Budgets in Practice

  Month starts: 0.1% error budget available (99.9% monthly SLO)
  ├── Week 1: Deploy feature A → 0.02% consumed → 0.08% budget left
  ├── Week 2: Deploy feature B → 0.05% consumed → 0.03% budget left
  ├── Week 3: Incident (30 min) → 0.07% consumed → 0.04% over budget
  └── Week 4: Deploy feature C → 0.01% consumed → 0.05% over budget

  Total consumed: 0.15% against a 0.1% budget → 150% → OVER BUDGET
  Action: Freeze deployments, focus on reliability
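The month's accounting can be sketched as a few lines of Python. The event list and variable names are illustrative, not from any real tracking system; the percentages are those from the tree above, expressed as fractions of total requests.

```python
# Sketch of monthly error-budget accounting for a 99.9% SLO.
SLO = 0.999
BUDGET = 1 - SLO  # 0.1% of requests (or minutes) may fail per month

# (event, fraction of the month's traffic that failed) -- illustrative values
events = [
    ("Deploy feature A", 0.0002),
    ("Deploy feature B", 0.0005),
    ("Incident (30 min)", 0.0007),
    ("Deploy feature C", 0.0001),
]

consumed = sum(cost for _, cost in events)
ratio = consumed / BUDGET
print(f"Consumed {consumed:.2%} against a {BUDGET:.2%} budget ({ratio:.0%})")
if ratio > 1:
    print("OVER BUDGET: freeze deployments, focus on reliability")
```

Running this prints `Consumed 0.15% against a 0.10% budget (150%)` followed by the over-budget warning, matching the arithmetic above.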

Incident Management

Severity Levels

| Severity | Impact | Response | Example |
|---|---|---|---|
| SEV-1 | Full service outage | Immediate, all hands | Payment processing down |
| SEV-2 | Partial outage, workaround exists | 15 min response | Search broken, browsing works |
| SEV-3 | Degraded performance | Business hours | Elevated latency, no errors |
| SEV-4 | Minor issue, no user impact | Next business day | Internal tool slow |
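A severity table like this is most useful when it is encoded somewhere machines can read it, so paging tools apply the same policy every time. A minimal sketch, with hypothetical names (no specific alerting tool is assumed):

```python
# Hypothetical mapping of severity level to response expectation.
RESPONSE_POLICY = {
    "SEV-1": "immediate, all hands",
    "SEV-2": "respond within 15 minutes",
    "SEV-3": "business hours",
    "SEV-4": "next business day",
}

def response_for(severity: str) -> str:
    """Look up the response expectation; unknown severities get triaged."""
    return RESPONSE_POLICY.get(severity, "triage manually")
```

Routing unknown severities to manual triage, rather than failing, keeps a misclassified alert from being silently dropped.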

Blameless Postmortem

## Incident: Order Service Outage
**Date**: 2025-03-15 | **Duration**: 47 minutes | **Severity**: SEV-1

### Summary
Order service became unresponsive due to database connection pool
exhaustion caused by a query without timeout.

### Timeline
- 14:23 - Deploy v2.3.1 with new order search feature
- 14:31 - Database connection pool warnings in logs
- 14:38 - PagerDuty alert: order API error rate > 5%
- 14:42 - Incident declared, on-call begins investigation
- 14:55 - Root cause identified: new query scanning full table
- 15:02 - Rollback to v2.3.0 initiated
- 15:10 - Service recovered, error rates normalized

### Root Cause
The new search query filtered on an unindexed column, causing
full table scans. Each scan held a database connection for
30+ seconds, exhausting the 50-connection pool.

### Contributing Factors (not root causes)
- No query execution time limit
- No load testing of new search feature
- Connection pool size matched production load exactly (no headroom)

### Action Items
| Action | Owner | Due |
|---|---|---|
| Add query timeout (5s max) | Backend team | 3/18 |
| Load test search feature | QA | 3/20 |
| Increase connection pool to 100 | Platform | 3/16 |
| Add connection pool utilization alert | SRE | 3/17 |
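The pool exhaustion in this incident follows directly from Little's law: the average number of connections held equals the query arrival rate times how long each query holds a connection. A sketch with the incident's numbers (function name and rates are our own illustration):

```python
def connections_needed(queries_per_second: float, seconds_per_query: float) -> float:
    """Little's law: average connections held = arrival rate x hold time."""
    return queries_per_second * seconds_per_query

# Healthy: 100 qps of indexed queries, each holding a connection for 50 ms
# -> only ~5 connections held, well inside a 50-connection pool.
print(connections_needed(100, 0.05))

# Incident: even 2 qps of full-table scans at 30 s each demands
# ~60 connections -- more than the entire 50-connection pool.
print(connections_needed(2, 30))
```

This is why the action items attack both factors: a 5 s query timeout caps the hold time, and a larger pool adds headroom against the arrival rate.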

Toil Reduction

| Activity | Toil? | Automate? |
|---|---|---|
| Restarting crashed pods | Yes (repetitive, no value) | Auto-restart + root cause fix |
| Certificate renewal | Yes (manual, predictable) | cert-manager auto-renewal |
| Capacity review | No (requires judgment) | Assist with data, human decides |
| Incident response | Partially (runbook, then judgment) | Automate runbook steps, escalate unknowns |
| On-call handoff | Yes (if manual) | Automated handoff with context |
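Toil only gets reduced if it is measured first. A minimal sketch of the 50% cap check, assuming toil hours are logged somewhere (the numbers and names here are illustrative):

```python
# Sketch: flag when a week's toil exceeds the 50% cap.
TOIL_CAP = 0.5  # toil should consume at most half of SRE time

def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Fraction of working time spent on toil."""
    return toil_hours / total_hours

week = toil_fraction(toil_hours=22, total_hours=40)  # 55% this week
if week > TOIL_CAP:
    print(f"Toil at {week:.0%} exceeds the {TOIL_CAP:.0%} cap: prioritize automation")
```

The point of the cap is the conversation it forces: once toil crosses 50%, automation work takes priority over new operational requests.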

Anti-Patterns

| Anti-pattern | Problem | Fix |
|---|---|---|
| 100% uptime target | Impossible, prevents all change | Set realistic SLO (99.9% ≈ 43 min downtime/month) |
| Toil accepted as normal | Engineers burned out on repetitive work | Measure toil, cap at 50%, automate the rest |
| Blame-focused postmortems | People hide mistakes, learning stops | Blameless postmortems focused on systems |
| SRE as rebranded ops | Same work, new title | Engineering work: automation, tooling, code |
| No error budget policy | No mechanism to balance features vs reliability | Formal error budget with documented actions |
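The "99.9% ≈ 43 min/month" conversion is just arithmetic on the SLO, and it is worth computing rather than memorizing, since each extra nine cuts the allowance by 10x. A quick sketch (30-day month assumed):

```python
def allowed_downtime_minutes(slo: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Downtime permitted per period (default: a 30-day month) at a given SLO."""
    return (1 - slo) * period_minutes

print(round(allowed_downtime_minutes(0.999), 1))   # 43.2 min/month at 99.9%
print(round(allowed_downtime_minutes(0.9999), 1))  # 4.3 min/month at 99.99%
```

Seeing that 99.99% leaves barely four minutes a month makes the cost of an extra nine concrete when negotiating SLOs.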

Checklist

  • SLOs defined for all customer-facing services
  • Error budgets calculated and tracked monthly
  • Error budget policy: documented actions at budget thresholds
  • Incident management: severity levels, on-call rotation, escalation
  • Postmortems: blameless, action items tracked to completion
  • Toil measurement: tracked, capped at 50% of SRE time
  • Capacity planning: data-driven, reviewed quarterly
  • On-call: fair rotation, compensated, handoffs documented
  • Runbooks: every alert has a corresponding runbook
  • Reliability reviews: regular review of SLO performance

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For SRE consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
