SRE Capacity Forecasting
Predict infrastructure capacity needs before they become outages. Covers demand forecasting models, resource utilization projections, capacity planning automation, and the patterns that ensure infrastructure scales ahead of growth instead of behind it.
Capacity planning fails when it is reactive. By the time a database runs out of disk space or an application server hits 100% CPU, you already have an outage. Capacity forecasting is the discipline of predicting future resource needs from historical growth, seasonal patterns, and planned business events, so that infrastructure scales proactively rather than in response to pages.
Forecasting Framework
Inputs for Capacity Forecasting:
Historical Data:
├── CPU utilization (last 6-12 months)
├── Memory usage trends
├── Storage growth rate (GB/month)
├── Network throughput trends
├── Request rate (RPS) growth
└── Database connection counts
Business Context:
├── Expected user growth (marketing campaigns)
├── Feature launches (new features = new load patterns)
├── Seasonal patterns (Black Friday, end of quarter)
├── Acquisition/merger (sudden traffic increase)
└── Geographic expansion (new regions)
Output → Capacity Runway:
"At current growth rate:
CPU headroom: 4.2 months before needing to scale
Storage: 2.1 months before 80% utilization
Database connections: 6.8 months before pool exhaustion
With Black Friday event:
CPU headroom: 1.3 months — ACTION REQUIRED"
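The runway figures above reduce to simple arithmetic: remaining headroom divided by growth rate. The following is a minimal sketch, assuming utilization is expressed as a fraction of capacity, growth is linear, and a business event can be approximated as a multiplier on the load projected for the event date. The function names, the multiplier model, and the sample numbers are illustrative assumptions, not taken from any particular tool.

```python
def runway_months(current: float, growth_per_month: float,
                  threshold: float = 0.8) -> float:
    """Months until utilization reaches the threshold under linear growth."""
    if current >= threshold:
        return 0.0  # already at or past the threshold
    if growth_per_month <= 0:
        return float("inf")  # flat or shrinking usage never reaches it
    return (threshold - current) / growth_per_month


def event_peak(current: float, growth_per_month: float,
               months_until_event: float, event_multiplier: float) -> float:
    """Projected utilization at an event: baseline trend times an assumed uplift."""
    baseline = current + growth_per_month * months_until_event
    return baseline * event_multiplier


# Illustrative numbers only (hypothetical service at 59% CPU, growing 5 points/month)
cpu_runway = runway_months(current=0.59, growth_per_month=0.05)
peak = event_peak(current=0.59, growth_per_month=0.05,
                  months_until_event=2, event_multiplier=1.5)

print(f"CPU runway at current growth: {cpu_runway:.1f} months")
print(f"Projected utilization at event: {peak:.0%}")
if peak >= 0.8:
    print("ACTION REQUIRED: event peak exceeds the 80% threshold")
```

Checking the event peak separately from the steady-state runway matters because a resource with months of runway on the trend line can still breach its threshold on the day of the event.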
Forecasting Implementation
import numpy as np
from datetime import datetime, timedelta


class CapacityForecaster:
    """Predict when resources will hit capacity thresholds."""

    def forecast_resource(self, metric_name: str,
                          history: list, threshold: float = 0.8):
        """Project when a resource will hit threshold."""
        # Each history entry holds one day's average utilization (0.0-1.0)
        days = np.array([h["day_number"] for h in history], dtype=float)
        values = np.array([h["utilization"] for h in history], dtype=float)

        # Fit linear regression (simple but effective for most resources);
        # only the slope (growth per day) is needed for the projection
        slope, _intercept = np.polyfit(days, values, 1)

        # Most recent observed utilization
        current_value = float(values[-1])

        # Already at or past the threshold: there is no runway left
        if current_value >= threshold:
            return {
                "metric": metric_name,
                "current": round(current_value, 3),
                "trend": "over_threshold",
                "runway_days": 0.0,
                "urgency": "critical",
                "action": f"Scale {metric_name} immediately",
            }

        # Flat or declining usage never reaches the threshold
        if slope <= 0:
            return {
                "metric": metric_name,
                "current": round(current_value, 3),
                "trend": "flat_or_declining",
                "runway_days": float('inf'),
                "action": "No scaling needed",
            }

        # Calculate days until threshold
        days_to_threshold = float((threshold - current_value) / slope)
        threshold_date = datetime.now() + timedelta(days=days_to_threshold)

        # Classify urgency
        if days_to_threshold < 30:
            urgency = "critical"
            action = f"Scale {metric_name} immediately"
        elif days_to_threshold < 90:
            urgency = "warning"
            action = f"Plan {metric_name} scaling for next quarter"
        else:
            urgency = "healthy"
            action = f"{metric_name} has sufficient runway"

        return {
            "metric": metric_name,
            "current": round(current_value, 3),
            "growth_rate_per_day": round(float(slope), 5),
            "runway_days": round(days_to_threshold, 1),
            "threshold_date": threshold_date.isoformat(),
            "urgency": urgency,
            "action": action,
        }
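The forecaster reads `history` as a list of dicts with `day_number` and `utilization` keys, one entry per day. How raw monitoring samples get into that shape is not specified here, so the following is only a sketch. It assumes samples arrive as `(timestamp, utilization)` tuples, and the helper name `to_daily_history` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def to_daily_history(samples):
    """Collapse (timestamp, utilization) samples into one average per day.

    Returns a list of {"day_number", "utilization"} dicts, where day_number
    counts days since the earliest sample.
    """
    by_date = defaultdict(list)
    for ts, utilization in samples:
        by_date[ts.date()].append(utilization)
    if not by_date:
        return []

    first_day = min(by_date)
    return [
        {"day_number": (day - first_day).days, "utilization": mean(values)}
        for day, values in sorted(by_date.items())
    ]


# Made-up samples for illustration
samples = [
    (datetime(2024, 1, 1, 0, 0), 0.40),
    (datetime(2024, 1, 1, 12, 0), 0.50),
    (datetime(2024, 1, 2, 6, 0), 0.46),
    (datetime(2024, 1, 4, 6, 0), 0.48),  # Jan 3 has no samples; the gap is kept
]
history = to_daily_history(samples)
for entry in history:
    print(entry)
```

Days with no samples are skipped rather than filled in, so `day_number` stays faithful to the calendar and the regression slope is still expressed per real day.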
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No forecasting, only monitoring | Outages from resource exhaustion | Project utilization trends weekly |
| Linear forecasting only | Misses seasonal spikes | Add seasonal decomposition and factor in business events |
| Forecast compute, ignore storage | Storage fills silently | Forecast ALL resource dimensions |
| No buffer in forecasts | Scale exactly at limit, no headroom | Target 80% threshold, not 100% |
| Annual capacity review only | 12 months between assessments | Monthly automated forecasting |
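To illustrate the "linear forecasting only" row, one simple form of seasonal decomposition is an additive model: a linear trend plus an average offset for each position in a repeating cycle, such as day of week. The sketch below assumes additive weekly seasonality and at least two full cycles of daily data. The function names are hypothetical and the data is synthetic; a production setup would more likely rely on a dedicated time-series library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


def seasonal_forecast(values, horizon, period=7):
    """Forecast `horizon` steps ahead: linear trend plus additive
    per-position seasonal offsets (day of week when period=7)."""
    n_cycles = len(values) // period
    if n_cycles < 2:
        raise ValueError("need at least two full cycles of history")
    start = len(values) - n_cycles * period  # drop the oldest partial cycle
    xs = list(range(start, len(values)))
    ys = values[start:]

    # Fit the trend through per-cycle means so the seasonal pattern averages out
    centers = [mean(xs[c * period:(c + 1) * period]) for c in range(n_cycles)]
    cycle_means = [mean(ys[c * period:(c + 1) * period]) for c in range(n_cycles)]
    slope, intercept = fit_line(centers, cycle_means)

    # Average detrended residual for each position in the cycle
    buckets = [[] for _ in range(period)]
    for x, y in zip(xs, ys):
        buckets[x % period].append(y - (intercept + slope * x))
    offsets = [mean(b) for b in buckets]

    return [
        intercept + slope * x + offsets[x % period]
        for x in range(len(values), len(values) + horizon)
    ]


def first_breach(forecast, threshold=0.8):
    """Index of the first forecast step at or above threshold, or None."""
    return next((i for i, v in enumerate(forecast) if v >= threshold), None)


# Synthetic history: slow linear growth plus a weekly spike on one day
values = [0.5 + 0.002 * d + (0.1 if d % 7 == 5 else 0.0) for d in range(28)]
forecast = seasonal_forecast(values, horizon=120)
print("First forecast day at or above 80%:", first_breach(forecast))
```

On data like this, the seasonal forecast crosses the threshold on a spike day well before the trend line alone would, which is exactly the gap a purely linear projection hides.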
Capacity forecasting is the difference between scaling proactively (“we need to add capacity next month”) and scaling reactively (“production is down because the disk is full”). The former is engineering; the latter is firefighting.