Verified by Garnet Grid

SRE Capacity Forecasting

Predict infrastructure capacity needs before they become outages. Covers demand forecasting models, resource utilization projections, capacity planning automation, and the patterns that ensure infrastructure scales ahead of growth instead of behind it.

Capacity planning fails when it is reactive. By the time a database runs out of disk space or an application server hits 100% CPU, you have an outage. Capacity forecasting is the discipline of predicting future resource needs based on historical growth, seasonal patterns, and planned business events — so infrastructure scales proactively, not in response to pages.


Forecasting Framework

Inputs for Capacity Forecasting:

Historical Data:
  ├── CPU utilization (last 6-12 months)
  ├── Memory usage trends
  ├── Storage growth rate (GB/month)
  ├── Network throughput trends
  ├── Request rate (RPS) growth
  └── Database connection counts

Business Context:
  ├── Expected user growth (marketing campaigns)
  ├── Feature launches (new features = new load patterns)
  ├── Seasonal patterns (Black Friday, end of quarter)
  ├── Acquisition/merger (sudden traffic increase)
  └── Geographic expansion (new regions)

Output → Capacity Runway:
  "At current growth rate:
   CPU headroom: 4.2 months before needing to scale
   Storage: 2.1 months before 80% utilization
   Database connections: 6.8 months before pool exhaustion
   
   With Black Friday event:
   CPU headroom: 1.3 months — ACTION REQUIRED"
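The runway figures above come down to simple arithmetic: remaining headroom divided by the growth rate, with a planned event treated as a one-off jump in utilization. The following is a minimal sketch under that linear-growth assumption. The function name `runway_months` and the `event_uplift` parameter are hypothetical, and the input numbers are illustrative rather than taken from a real system.

```python
def runway_months(current: float, growth_per_month: float,
                  threshold: float = 0.8, event_uplift: float = 0.0) -> float:
    """Months until utilization reaches the threshold under linear growth.

    event_uplift is an assumed one-off jump in utilization (for example,
    a planned sales event) added on top of the current level.
    """
    headroom = threshold - (current + event_uplift)
    if headroom <= 0:
        return 0.0            # already at or past the threshold
    if growth_per_month <= 0:
        return float("inf")   # flat or shrinking usage never reaches it
    return headroom / growth_per_month


# Illustrative inputs: 50% utilization today, growing 5 points per month.
baseline = runway_months(current=0.50, growth_per_month=0.05)
# Same resource, assuming the event adds 20 points of load at once.
with_event = runway_months(current=0.50, growth_per_month=0.05, event_uplift=0.20)
print(f"baseline: {baseline:.1f} months, with event: {with_event:.1f} months")
```

Modeling the event as a flat uplift is a deliberate simplification. Where past events are on record, the measured peak from the previous event is likely a better input than a guess.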

Forecasting Implementation

import numpy as np
from datetime import datetime, timedelta

class CapacityForecaster:
    """Predict when resources will hit capacity thresholds."""
    
    def forecast_resource(self, metric_name: str, 
                          history: list, threshold: float = 0.8):
        """Project when a resource will hit threshold."""
        
        # Extract daily averages
        days = np.array([h["day_number"] for h in history])
        values = np.array([h["utilization"] for h in history])
        
        # Fit a linear trend (a reasonable first approximation when growth is
        # steady; it will not capture seasonal spikes)
        slope, intercept = np.polyfit(days, values, 1)
        
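        # Use the most recent observation as the starting point
        current_value = float(values[-1])

        # Already at or past the threshold: there is no runway left
        if current_value >= threshold:
            return {
                "metric": metric_name,
                "current": round(current_value, 3),
                "trend": "threshold_exceeded",
                "runway_days": 0.0,
                "urgency": "critical",
                "action": f"Scale {metric_name} immediately",
            }

        # Flat or declining usage never reaches the threshold
        if slope <= 0:
            return {
                "metric": metric_name,
                "current": round(current_value, 3),
                "trend": "flat_or_declining",
                "runway_days": float('inf'),
                "action": "No scaling needed",
            }

        # Days until the trend carries the current value to the threshold
        days_to_threshold = (threshold - current_value) / slope
        threshold_date = datetime.now() + timedelta(days=days_to_threshold)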
        
        # Classify urgency
        if days_to_threshold < 30:
            urgency = "critical"
            action = f"Scale {metric_name} immediately"
        elif days_to_threshold < 90:
            urgency = "warning"
            action = f"Plan {metric_name} scaling for next quarter"
        else:
            urgency = "healthy"
            action = f"{metric_name} has sufficient runway"
        
        return {
            "metric": metric_name,
            "current": round(current_value, 3),
            "growth_rate_per_day": round(slope, 5),
            "runway_days": round(days_to_threshold, 1),
            "threshold_date": threshold_date.isoformat(),
            "urgency": urgency,
            "action": action,
        }
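The `forecast_resource` method expects each `history` entry to carry a `day_number` and a `utilization` value, but it does not show where those daily averages come from. One possible way to build them from timestamped samples is sketched below. The helper name `to_daily_history` and the `(timestamp, utilization)` sample format are assumptions for illustration, and the sample values are made up.

```python
from collections import defaultdict
from datetime import datetime


def to_daily_history(samples):
    """Collapse (timestamp, utilization) samples into one average per day.

    Returns a list of {"day_number": int, "utilization": float} dicts,
    with day_number counted from the first day present in the data.
    """
    by_day = defaultdict(list)
    for ts, value in samples:
        by_day[ts.date()].append(value)
    if not by_day:
        return []

    first_day = min(by_day)
    return [
        {
            "day_number": (day - first_day).days,
            "utilization": sum(vals) / len(vals),
        }
        for day, vals in sorted(by_day.items())
    ]


# Illustrative samples only; note the gap on January 3.
samples = [
    (datetime(2024, 1, 1, 0, 0), 0.40),
    (datetime(2024, 1, 1, 12, 0), 0.50),
    (datetime(2024, 1, 2, 6, 0), 0.46),
    (datetime(2024, 1, 4, 6, 0), 0.48),
]
history = to_daily_history(samples)
print(history)
```

Days with no samples are skipped rather than filled in, so `day_number` keeps the true spacing and the regression slope is not distorted by gaps.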

Anti-Patterns

| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No forecasting, only monitoring | Outages from resource exhaustion | Project utilization trends weekly |
| Linear forecasting only | Miss seasonal spikes | Seasonal decomposition, factor in business events |
| Forecast compute, ignore storage | Storage fills silently | Forecast ALL resource dimensions |
| No buffer in forecasts | Scale exactly at limit, no headroom | Target 80% threshold, not 100% |
| Annual capacity review only | 12 months between assessments | Monthly automated forecasting |
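To illustrate the "linear forecasting only" row, the sketch below adds a repeating weekly offset on top of a linear trend and reports the first forecast day that reaches the threshold. It is a simple trend-plus-average-residual approach, not a full seasonal decomposition, and the function name `first_breach_day`, its parameters, and the synthetic data are all assumptions for illustration.

```python
import numpy as np


def first_breach_day(days, values, threshold=0.8, period=7, horizon=365):
    """First future day whose trend-plus-seasonal forecast reaches threshold.

    Returns None if no breach occurs within `horizon` days.
    """
    days = np.asarray(days, dtype=float)
    values = np.asarray(values, dtype=float)

    # 1. Fit the linear trend and measure what it fails to explain.
    slope, intercept = np.polyfit(days, values, 1)
    residuals = values - (slope * days + intercept)

    # 2. Average the residuals for each position in the cycle.
    phase = days.astype(int) % period
    seasonal = np.array([
        residuals[phase == p].mean() if np.any(phase == p) else 0.0
        for p in range(period)
    ])

    # 3. Project trend plus seasonal offset and find the first breach.
    future = np.arange(days[-1] + 1, days[-1] + 1 + horizon)
    forecast = slope * future + intercept + seasonal[future.astype(int) % period]
    breaches = np.nonzero(forecast >= threshold)[0]
    return int(future[breaches[0]]) if breaches.size else None


# Synthetic data: steady growth with a spike on one day of each week.
days = np.arange(56)
values = 0.5 + 0.002 * days + 0.05 * (days % 7 == 5)
breach = first_breach_day(days, values)
print("first forecast breach on day", breach)
```

In this synthetic series the weekly spike crosses the threshold before the plain trend line does, which is the gap a linear-only forecast would miss. One-off business events are not periodic and would still need to be added by hand.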

Capacity forecasting is the difference between scaling proactively (“we need to add capacity next month”) and scaling reactively (“production is down because the disk is full”). The former is engineering; the latter is firefighting.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
