
Site Reliability Engineering (SRE) Practices

A guide to implementing SRE practices: error budgets, incident management, blameless postmortems, capacity planning, toil reduction, and building reliability into engineering culture.

SRE is what happens when you treat operations as a software engineering problem. Google coined the term and the discipline: instead of operations teams manually managing infrastructure, SREs write code to automate operational work, define reliability targets as engineering constraints, and use error budgets to balance feature velocity with stability.


SRE vs Traditional Ops

| Aspect | Traditional Ops | SRE |
|---|---|---|
| Goal | Keep things running | Balance reliability with velocity |
| Work | Manual, reactive (firefighting) | Automated, proactive (engineering) |
| Reliability | 100% uptime (impossible goal) | Target based on error budget |
| Incidents | Blame someone | Blameless postmortems |
| Toil | Accepted as normal | Measured and reduced |
| Feature vs reliability | Constant conflict | Error budget arbitrates |

Error Budgets in Practice

  Month starts: 0.1% error budget available (99.9% monthly SLO)
  ├── Week 1: Deploy feature A → 0.02% consumed → 0.08% budget left
  ├── Week 2: Deploy feature B → 0.05% consumed → 0.03% budget left
  ├── Week 3: Incident (30 min) → 0.07% consumed → 0.04% over budget
  └── Week 4: Deploy feature C → 0.01% consumed → 0.05% over budget

  Total consumed: 0.15% against a 0.1% budget → 150% → OVER BUDGET
  Action: Freeze deployments, focus on reliability
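The month's accounting can be sketched as a few lines of Python. The event list and variable names are illustrative, not from any real tracking system; the percentages are those from the tree above, expressed as fractions of total requests.

```python
# Sketch of monthly error-budget accounting for a 99.9% SLO.
SLO = 0.999
BUDGET = 1 - SLO  # 0.1% of requests (or minutes) may fail per month

# (event, fraction of the month's traffic that failed) -- illustrative values
events = [
    ("Deploy feature A", 0.0002),
    ("Deploy feature B", 0.0005),
    ("Incident (30 min)", 0.0007),
    ("Deploy feature C", 0.0001),
]

consumed = sum(cost for _, cost in events)
ratio = consumed / BUDGET
print(f"Consumed {consumed:.2%} against a {BUDGET:.2%} budget ({ratio:.0%})")
if ratio > 1:
    print("OVER BUDGET: freeze deployments, focus on reliability")
```

Running this prints `Consumed 0.15% against a 0.10% budget (150%)` followed by the over-budget warning, matching the arithmetic above.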

Incident Management

Severity Levels

| Severity | Impact | Response | Example |
|---|---|---|---|
| SEV-1 | Full service outage | Immediate, all hands | Payment processing down |
| SEV-2 | Partial outage, workaround exists | 15 min response | Search broken, browsing works |
| SEV-3 | Degraded performance | Business hours | Elevated latency, no errors |
| SEV-4 | Minor issue, no user impact | Next business day | Internal tool slow |
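A severity table like this is most useful when it is encoded somewhere machines can read it, so paging tools apply the same policy every time. A minimal sketch, with hypothetical names (no specific alerting tool is assumed):

```python
# Hypothetical mapping of severity level to response expectation.
RESPONSE_POLICY = {
    "SEV-1": "immediate, all hands",
    "SEV-2": "respond within 15 minutes",
    "SEV-3": "business hours",
    "SEV-4": "next business day",
}

def response_for(severity: str) -> str:
    """Look up the response expectation; unknown severities get triaged."""
    return RESPONSE_POLICY.get(severity, "triage manually")
```

Routing unknown severities to manual triage, rather than failing, keeps a misclassified alert from being silently dropped.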

Blameless Postmortem

## Incident: Order Service Outage
**Date**: 2025-03-15 | **Duration**: 47 minutes | **Severity**: SEV-1

### Summary
Order service became unresponsive due to database connection pool
exhaustion caused by a query without timeout.

### Timeline
- 14:23 - Deploy v2.3.1 with new order search feature
- 14:31 - Database connection pool warnings in logs
- 14:38 - PagerDuty alert: order API error rate > 5%
- 14:42 - Incident declared, on-call begins investigation
- 14:55 - Root cause identified: new query scanning full table
- 15:02 - Rollback to v2.3.0 initiated
- 15:10 - Service recovered, error rates normalized

### Root Cause
The new search query filtered on an unindexed column, causing
full table scans. Each scan held a database connection for
30+ seconds, exhausting the 50-connection pool.

### Contributing Factors (not root causes)
- No query execution time limit
- No load testing of new search feature
- Connection pool size matched production load exactly (no headroom)

### Action Items
| Action | Owner | Due |
|---|---|---|
| Add query timeout (5s max) | Backend team | 3/18 |
| Load test search feature | QA | 3/20 |
| Increase connection pool to 100 | Platform | 3/16 |
| Add connection pool utilization alert | SRE | 3/17 |
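The pool exhaustion in this incident follows directly from Little's law: the average number of connections held equals the query arrival rate times how long each query holds a connection. A sketch with the incident's numbers (function name and rates are our own illustration):

```python
def connections_needed(queries_per_second: float, seconds_per_query: float) -> float:
    """Little's law: average connections held = arrival rate x hold time."""
    return queries_per_second * seconds_per_query

# Healthy: 100 qps of indexed queries, each holding a connection for 50 ms
# -> only ~5 connections held, well inside a 50-connection pool.
print(connections_needed(100, 0.05))

# Incident: even 2 qps of full-table scans at 30 s each demands
# ~60 connections -- more than the entire 50-connection pool.
print(connections_needed(2, 30))
```

This is why the action items attack both factors: a 5 s query timeout caps the hold time, and a larger pool adds headroom against the arrival rate.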

Toil Reduction

| Activity | Toil? | Automate? |
|---|---|---|
| Restarting crashed pods | Yes (repetitive, no value) | Auto-restart + root cause fix |
| Certificate renewal | Yes (manual, predictable) | cert-manager auto-renewal |
| Capacity review | No (requires judgment) | Assist with data, human decides |
| Incident response | Partially (runbook, then judgment) | Automate runbook steps, escalate unknowns |
| On-call handoff | Yes (if manual) | Automated handoff with context |
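Toil only gets reduced if it is measured first. A minimal sketch of the 50% cap check, assuming toil hours are logged somewhere (the numbers and names here are illustrative):

```python
# Sketch: flag when a week's toil exceeds the 50% cap.
TOIL_CAP = 0.5  # toil should consume at most half of SRE time

def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Fraction of working time spent on toil."""
    return toil_hours / total_hours

week = toil_fraction(toil_hours=22, total_hours=40)  # 55% this week
if week > TOIL_CAP:
    print(f"Toil at {week:.0%} exceeds the {TOIL_CAP:.0%} cap: prioritize automation")
```

The point of the cap is the conversation it forces: once toil crosses 50%, automation work takes priority over new operational requests.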

Anti-Patterns

| Anti-pattern | Problem | Fix |
|---|---|---|
| 100% uptime target | Impossible, prevents all change | Set realistic SLO (99.9% ≈ 43 min downtime/month) |
| Toil accepted as normal | Engineers burned out on repetitive work | Measure toil, cap at 50%, automate the rest |
| Blame-focused postmortems | People hide mistakes, learning stops | Blameless postmortems focused on systems |
| SRE as rebranded ops | Same work, new title | Engineering work: automation, tooling, code |
| No error budget policy | No mechanism to balance features vs reliability | Formal error budget with documented actions |
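The "99.9% ≈ 43 min/month" conversion is just arithmetic on the SLO, and it is worth computing rather than memorizing, since each extra nine cuts the allowance by 10x. A quick sketch (30-day month assumed):

```python
def allowed_downtime_minutes(slo: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Downtime permitted per period (default: a 30-day month) at a given SLO."""
    return (1 - slo) * period_minutes

print(round(allowed_downtime_minutes(0.999), 1))   # 43.2 min/month at 99.9%
print(round(allowed_downtime_minutes(0.9999), 1))  # 4.3 min/month at 99.99%
```

Seeing that 99.99% leaves barely four minutes a month makes the cost of an extra nine concrete when negotiating SLOs.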

Checklist

  • SLOs defined for all customer-facing services
  • Error budgets calculated and tracked monthly
  • Error budget policy: documented actions at budget thresholds
  • Incident management: severity levels, on-call rotation, escalation
  • Postmortems: blameless, action items tracked to completion
  • Toil measurement: tracked, capped at 50% of SRE time
  • Capacity planning: data-driven, reviewed quarterly
  • On-call: fair rotation, compensated, handoffs documented
  • Runbooks: every alert has a corresponding runbook
  • Reliability reviews: regular review of SLO performance

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For SRE consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
