SLO-Based Alerting
Replace threshold-based alerts with SLO-driven alerting that reduces noise and focuses on user impact. Covers error budgets, burn rate alerts, multi-window strategies, alert routing, and the patterns that eliminate alert fatigue while catching real incidents.
Traditional alerting fires when a metric crosses a threshold: “CPU > 80% → page the on-call.” This produces noise because high CPU might not affect users, and low CPU does not mean everything is fine. SLO-based alerting fires when the error budget is burning too fast — meaning real users are experiencing real problems.
From Threshold to SLO
Threshold Alerting (Old):
IF cpu_utilization > 80% THEN page
IF response_time > 500ms THEN page
IF error_rate > 1% THEN page
Problems:
- High CPU with happy users = unnecessary page
- 500ms latency might be fine for batch jobs
- 1% error rate: is that 10 errors or 10,000?
SLO-Based Alerting (New):
SLO: 99.9% of requests succeed within 300ms
Error budget: 0.1% = 43.2 minutes/month
Alert: IF burning error budget 10x faster than sustainable THEN page
Benefits:
- Only alerts when users are impacted
- Severity proportional to user impact
- Error budget provides decision framework
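To make the burn-rate condition concrete, here is a minimal sketch of the arithmetic behind the example SLO above (the 10x multiplier and the variable names are illustrative, not taken from any particular tool):

```python
# Turning "burning the budget 10x faster than sustainable" into an error-rate threshold.
slo_target = 0.999
allowed_error_rate = 1 - slo_target        # 0.001: at most 0.1% of requests may fail
burn_rate_threshold = 10                   # the multiplier from the alert condition above
alert_error_rate = burn_rate_threshold * allowed_error_rate
# 10 * 0.001 = 0.01 -> page once more than 1% of requests are failing
```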
Error Budget Math
# SLO: 99.9% availability per month
slo_target = 0.999

# Error budget per month
total_minutes_per_month = 30 * 24 * 60  # 43,200 minutes
error_budget_minutes = total_minutes_per_month * (1 - slo_target)
# 43,200 * 0.001 = 43.2 minutes/month

# Sustainable burn rate: consume budget evenly over 30 days
sustainable_burn_rate = 1.0  # 1x = budget consumed exactly over the month

# Alert thresholds based on burn rate:
# 14.4x burn rate: budget consumed in ~2 days (fast burn, high urgency)
# 6x burn rate: budget consumed in ~5 days (moderate burn)
# 3x burn rate: budget consumed in ~10 days (slow burn)
# 1x burn rate: budget consumed exactly at month end (normal)

class ErrorBudgetCalculator:
    def __init__(self, slo_target, window_days=30):
        self.slo_target = slo_target
        self.error_budget = (1 - slo_target) * window_days * 24 * 60  # minutes

    def current_burn_rate(self, errors_last_hour, total_requests_last_hour):
        """Calculate the current burn rate as a multiple of the sustainable rate."""
        error_rate = errors_last_hour / total_requests_last_hour
        allowed_error_rate = 1 - self.slo_target
        return error_rate / allowed_error_rate

    def budget_remaining(self, errors_this_month, total_requests_this_month):
        """Calculate the remaining error budget as a fraction (0.0 to 1.0)."""
        error_rate = errors_this_month / total_requests_this_month
        budget_used = error_rate / (1 - self.slo_target)
        return max(0, 1 - budget_used)
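A quick usage sketch of the calculator above; the request and error counts are invented for illustration and would normally come from your metrics backend:

```python
calc = ErrorBudgetCalculator(slo_target=0.999)

# Last hour: 120 failed requests out of 20,000 (0.6% error rate)
burn = calc.current_burn_rate(errors_last_hour=120, total_requests_last_hour=20_000)
print(f"Current burn rate: {burn:.1f}x sustainable")   # 6.0x -> moderate burn

# Month to date: 9,000 failures out of 30,000,000 requests (0.03% error rate)
remaining = calc.budget_remaining(errors_this_month=9_000, total_requests_this_month=30_000_000)
print(f"Error budget remaining: {remaining:.0%}")       # 70%
```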
Multi-Window Burn Rate Alerts
# Fast burn: Page immediately
- alert: HighBurnRate_Page
  expr: |
    (
      1 - (sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))
    ) > (14.4 * (1 - 0.999))
    and
    (
      1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))
    ) > (14.4 * (1 - 0.999))
  labels:
    severity: page
  annotations:
    summary: "Burning error budget 14.4x faster than sustainable"
    impact: "SLO will be breached within ~2 days at this rate"

# Slow burn: Ticket for review
- alert: SlowBurnRate_Ticket
  expr: |
    (
      1 - (sum(rate(http_requests_total{code!~"5.."}[6h])) / sum(rate(http_requests_total[6h])))
    ) > (3 * (1 - 0.999))
    and
    (
      1 - (sum(rate(http_requests_total{code!~"5.."}[30m])) / sum(rate(http_requests_total[30m])))
    ) > (3 * (1 - 0.999))
  labels:
    severity: ticket
  annotations:
    summary: "Burning error budget 3x faster than sustainable"
    impact: "SLO at risk if the trend continues for 10 days"
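The same multi-window logic can be sketched outside Prometheus, for example in a custom checker. This is an illustration only; `error_ratio` is a placeholder for whatever query layer you use, not a real API:

```python
def should_alert(error_ratio, slo_target, burn_rate, long_window, short_window):
    """Multi-window check: alert only if BOTH windows exceed the burn-rate threshold.

    The long window catches sustained burn; the short window confirms it is still
    happening now, so a spike that has already recovered does not page anyone.
    """
    threshold = burn_rate * (1 - slo_target)
    return error_ratio(long_window) > threshold and error_ratio(short_window) > threshold

def error_ratio(window):
    # Stand-in for a real metrics query; the ratios below are made up for illustration.
    observed = {"1h": 0.02, "5m": 0.03, "6h": 0.004, "30m": 0.002}
    return observed[window]

# Mirrors the two Prometheus rules above.
page = should_alert(error_ratio, 0.999, burn_rate=14.4, long_window="1h", short_window="5m")
ticket = should_alert(error_ratio, 0.999, burn_rate=3, long_window="6h", short_window="30m")
print(page, ticket)  # True False: fast burn pages; the slow-burn short window has recovered
```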
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Threshold alerts for everything | Alert fatigue, pages ignored | Migrate to SLO-based alerting |
| No error budget tracking | Cannot make trade-off decisions | Dashboard + monthly reviews |
| SLO without consequences | SLO is meaningless | Freeze launches when budget exhausted |
| Single window burn rate | False positives from spikes | Multi-window (1h AND 5m) |
| Same severity for all alerts | Everything is urgent = nothing is | Tiered: page (fast burn) vs ticket (slow burn) |
SLO-based alerting is not about having fewer alerts — it is about having the right alerts. When your pager goes off, it should mean real users are having real problems.