Site Reliability Engineering (SRE) is what happens when you treat operations as a software engineering problem. Google coined the term and the discipline: instead of operations teams manually managing infrastructure, SREs write code to automate operational work, define reliability targets as engineering constraints, and use error budgets to balance feature velocity with stability.
## SRE vs Traditional Ops
| Aspect | Traditional Ops | SRE |
|---|---|---|
| Goal | Keep things running | Balance reliability with velocity |
| Work | Manual, reactive (firefighting) | Automated, proactive (engineering) |
| Reliability | 100% uptime (impossible goal) | Target based on error budget |
| Incidents | Blame someone | Blameless postmortems |
| Toil | Accepted as normal | Measured and reduced |
| Feature vs Reliability | Constant conflict | Error budget arbitrates |
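The "error budget arbitrates" row can be sketched as a simple deploy gate. This is an illustrative policy, not a standard API; the function name and thresholds are assumptions for the example:

```python
def can_deploy(budget_consumed_fraction: float) -> str:
    """Illustrative error-budget policy gate.

    budget_consumed_fraction: share of the period's error budget already
    spent (1.0 = fully spent). Thresholds are example policy, not a standard.
    """
    if budget_consumed_fraction < 0.8:
        return "deploy"              # plenty of budget left
    if budget_consumed_fraction < 1.0:
        return "deploy-with-review"  # nearly spent: extra scrutiny
    return "freeze"                  # over budget: reliability work only
```

The point is that the decision is mechanical: neither the feature team nor the reliability team argues case by case; the budget arbitrates.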
## Error Budgets in Practice
```
Month starts: 100% of error budget available (SLO 99.9% → 0.1% budget)
├── Week 1: Deploy feature A → 0.02% availability consumed → 20% of budget used
├── Week 2: Deploy feature B → 0.05% consumed → 70% of budget used (cumulative)
├── Week 3: Incident (30 min) → 0.07% consumed → 140% of budget used
└── Week 4: Deploy feature C → 0.01% consumed → 150% of budget used

Total: 0.15% availability lost against a 0.1% budget → 150% consumed → OVER BUDGET
Action: freeze feature deployments, focus on reliability work
```
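The arithmetic above reduces to one formula: budget consumed = downtime divided by the downtime the SLO allows. A minimal sketch, assuming a 30-day month and a function name chosen for illustration:

```python
def budget_consumed(slo: float, downtime_minutes: float,
                    period_minutes: float = 30 * 24 * 60) -> float:
    """Fraction of the error budget consumed by the given downtime.

    slo: availability target (e.g. 0.999). The error budget is (1 - slo)
    of the period: 0.1% of a 30-day month is about 43.2 minutes.
    """
    budget_minutes = (1 - slo) * period_minutes
    return downtime_minutes / budget_minutes

# The 30-minute incident in Week 3 alone consumes ~69% of a 99.9% monthly budget.
```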
## Incident Management
### Severity Levels
| Severity | Impact | Response | Example |
|---|---|---|---|
| SEV-1 | Full service outage | Immediate, all hands | Payment processing down |
| SEV-2 | Partial outage, workaround exists | 15 min response | Search broken, browsing works |
| SEV-3 | Degraded performance | Business hours | Elevated latency, no errors |
| SEV-4 | Minor issue, no user impact | Next business day | Internal tool slow |
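The response column of this table can be encoded so that paging tooling enforces it. The mapping below uses the document's own example values; the names, and the 8-hour reading of "business hours", are assumptions for illustration:

```python
from datetime import timedelta

# Severity → maximum time to first response, per the table above.
RESPONSE_TARGETS = {
    "SEV-1": timedelta(minutes=0),   # immediate, all hands
    "SEV-2": timedelta(minutes=15),  # partial outage, workaround exists
    "SEV-3": timedelta(hours=8),     # business hours (8h window assumed)
    "SEV-4": timedelta(days=1),      # next business day
}

def response_deadline(severity: str, declared_at):
    """Latest acceptable first-response time for an incident declared at declared_at."""
    return declared_at + RESPONSE_TARGETS[severity]
```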
### Blameless Postmortem
```markdown
## Incident: Order Service Outage
**Date**: 2025-03-15 | **Duration**: 47 minutes | **Severity**: SEV-1

### Summary
Order service became unresponsive due to database connection pool
exhaustion caused by a query without timeout.

### Timeline
- 14:23 - Deploy v2.3.1 with new order search feature
- 14:31 - Database connection pool warnings in logs
- 14:38 - PagerDuty alert: order API error rate > 5%
- 14:42 - Incident declared, on-call begins investigation
- 14:55 - Root cause identified: new query scanning full table
- 15:02 - Rollback to v2.3.0 initiated
- 15:10 - Service recovered, error rates normalized

### Root Cause
New search query lacked a WHERE clause index, causing
full table scans. Each scan held a database connection for
30+ seconds, exhausting the 50-connection pool.

### Contributing Factors (not root causes)
- No query execution time limit
- No load testing of new search feature
- Connection pool size matched production load exactly (no headroom)

### Action Items
| Action | Owner | Due |
|---|---|---|
| Add query timeout (5s max) | Backend team | 3/18 |
| Load test search feature | QA | 3/20 |
| Increase connection pool to 100 | Platform | 3/16 |
| Add connection pool utilization alert | SRE | 3/17 |
```
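The root cause can be sanity-checked with Little's law: average concurrent connections ≈ query arrival rate × time each query holds a connection. A sketch using the incident's numbers (the helper name is illustrative):

```python
def connections_needed(arrival_rate_per_s: float, hold_time_s: float) -> float:
    """Little's law: average concurrent connections = arrival rate x hold time."""
    return arrival_rate_per_s * hold_time_s

# Queries holding a connection for 30 s exhaust a 50-connection pool
# at only ~1.7 requests/second:
pool_size = 50
hold_time_s = 30.0
breaking_rate = pool_size / hold_time_s  # ~1.67 req/s

# The 5 s query timeout from the action items raises that ceiling to 10 req/s,
# which is why the timeout (not just a bigger pool) addresses the failure mode.
```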
## Toil Reduction
| Activity | Toil? | Automate? |
|---|---|---|
| Restarting crashed pods | Yes (repetitive, no value) | Auto-restart + root cause fix |
| Certificate renewal | Yes (manual, predictable) | cert-manager auto-renewal |
| Capacity review | No (requires judgment) | Assist with data, human decides |
| Incident response | Partially (runbook, then judgment) | Automate runbook steps, escalate unknowns |
| On-call handoff | Yes (if manual) | Automated handoff with context |
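"Measured and reduced" implies toil is tracked as a number with a cap. A minimal sketch, assuming toil is logged in hours per engineer per period (the 50% cap is Google's published guideline; the function names are illustrative):

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of engineering time spent on toil."""
    return toil_hours / total_hours

def over_toil_cap(toil_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True when toil exceeds the cap and automation work should be prioritized."""
    return toil_fraction(toil_hours, total_hours) > cap
```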
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| 100% uptime target | Impossible, prevents all change | Set realistic SLO (99.9% = 43 min/month) |
| Toil accepted as normal | Engineers burned out on repetitive work | Measure toil, cap at 50%, automate the rest |
| Blame-focused postmortems | People hide mistakes, learning stops | Blameless postmortems focused on systems |
| SRE as rebranded ops | Same work, new title | Engineering work: automation, tooling, code |
| No error budget policy | No mechanism to balance features vs reliability | Formal error budget with documented actions |
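The "99.9% = 43 min/month" figure in the first fix generalizes to a one-line calculation, useful when proposing an SLO. A sketch assuming a 30-day month (the function name is illustrative):

```python
def allowed_downtime_minutes(slo: float,
                             period_minutes: float = 30 * 24 * 60) -> float:
    """Downtime permitted per period by an availability SLO.

    99.9% over a 30-day month allows ~43.2 minutes, matching the table above;
    each extra nine shrinks the allowance tenfold.
    """
    return (1 - slo) * period_minutes
```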
## Checklist
- Reliability targets defined as SLOs with an explicit error budget
- Documented error budget policy (e.g., deploy freeze when over budget)
- Incident severity levels defined with response expectations
- Blameless postmortems with owned, dated action items
- Toil measured, capped, and automated where repetitive
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For SRE consulting, visit garnetgrid.com.
:::
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting
Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.