Capacity Planning: Scaling Infrastructure Before You Need To
Predict and provision infrastructure capacity before demand outpaces supply. Covers load modeling, bottleneck identification, scaling strategies, cost-capacity tradeoffs, and the planning process that prevents both outages and over-provisioning.
Capacity planning is the discipline of having enough infrastructure to handle your traffic — tomorrow, next month, and during the annual spike — without paying for resources you do not need today. It sits at the intersection of engineering and finance: too little capacity causes outages, too much wastes money.
Most teams do capacity planning reactively: they add more servers after an outage. This guide covers how to plan proactively so you never have the “we ran out of capacity” conversation with your CEO.
The Capacity Planning Process
1. Understand current usage
   "We serve 10,000 requests/second with 40% CPU utilization
   on 20 servers. Database handles 5,000 queries/second."
2. Model growth
   "Traffic grows 15% month-over-month. Black Friday is 3× normal.
   Marketing campaign in October expected to add 25%."
3. Identify bottlenecks
   "At 15,000 req/s, the database connection pool saturates.
   At 20,000 req/s, we run out of API server CPU."
4. Plan capacity additions
   "Need 10 more API servers by October.
   Need database upgrade (or read replicas) by November."
5. Budget and approve
   "Additional infrastructure costs $X/month.
   Prevents an outage that costs $Y/hour."
6. Execute and verify
   "Deploy additional capacity. Load test to verify."
Resource Utilization Tracking
| Resource | Measure | Healthy Range | Danger Zone |
|---|---|---|---|
| CPU | Average and p99 utilization | 40-60% average | > 80% sustained |
| Memory | Used / Available | 50-70% | > 85% |
| Disk I/O | IOPS and throughput | < 60% of provisioned | > 80% |
| Network | Bandwidth utilization | < 50% of link capacity | > 70% |
| Database connections | Active / Max | < 60% of pool | > 80% of pool |
| Queue depth | Messages waiting | < 100 messages | Growing consistently |
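The healthy/danger thresholds above are easy to encode as a first-pass automated check. A minimal sketch in Python (the threshold values mirror the table; the resource names and current readings are illustrative, and in practice the inputs would come from your metrics system):

```python
# Flag resources whose utilization exceeds the danger-zone thresholds above.
# Thresholds mirror the table; wire the inputs to your metrics system in practice.
DANGER_THRESHOLDS = {
    "cpu": 0.80,              # > 80% sustained
    "memory": 0.85,           # > 85%
    "disk_io": 0.80,          # > 80% of provisioned IOPS
    "network": 0.70,          # > 70% of link capacity
    "db_connections": 0.80,   # > 80% of pool
}

def danger_zone_report(utilization: dict[str, float]) -> list[str]:
    """Return the resources whose utilization exceeds their danger threshold."""
    return [
        f"{name}: {value:.0%} (threshold {DANGER_THRESHOLDS[name]:.0%})"
        for name, value in utilization.items()
        if value > DANGER_THRESHOLDS.get(name, 1.0)
    ]

current = {"cpu": 0.55, "memory": 0.88, "db_connections": 0.82}
for line in danger_zone_report(current):
    print("DANGER:", line)   # memory and db_connections are flagged
```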
Utilization over time — identify the trend:
100% ────────────────────────── Capacity limit
90% ─────────────────────╱──── Danger zone
80% ──────────────────╱──────
70% ───────────────╱─────────
60% ────────────╱──────────── ← Current utilization trend
50% ─────────╱───────────────
40% ──────╱──────────────────
30% ───╱─────────────────────
0% ╱────────────────────────
Jan Feb Mar Apr May Jun Jul Aug
At 15% monthly growth starting from 40% in January:
Jan: 40% → Apr: ~61% → Jun: ~80% → Aug: ~107% (outage)
The 60% planning threshold is crossed in April;
the 80% danger zone arrives in June.
Lead time for new infrastructure: 2-4 weeks.
Therefore: start procurement in April, when utilization
crosses 60%, so new capacity is live before June.
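The same projection can be scripted rather than read off the chart. A minimal sketch, assuming compounding month-over-month growth and using the example figures above (40% starting utilization, 15% growth):

```python
import math

def months_until(limit: float, current: float, monthly_growth: float) -> float:
    """Months until utilization reaches `limit`: solves current * (1+g)^n = limit."""
    return math.log(limit / current) / math.log(1.0 + monthly_growth)

current, growth = 0.40, 0.15  # January: 40% utilization, growing 15%/month
for label, limit in [("plan at 60%", 0.60),
                     ("danger at 80%", 0.80),
                     ("outage at 100%", 1.00)]:
    print(f"{label}: {months_until(limit, current, growth):.1f} months of runway")
# plan at 60%: 2.9 months; danger at 80%: 5.0 months; outage at 100%: 6.6 months
```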
Load Modeling
Calculating Capacity Requirements
Current state:
10,000 requests/second
20 servers
40% CPU utilization per server
→ Each server handles 500 req/s at 40% CPU
→ Each server max capacity ≈ 1,250 req/s (at 100% CPU)
→ Comfortable capacity (at 70% CPU): 875 req/s per server
Target state (Black Friday, 3× traffic):
30,000 requests/second
At 875 req/s per server (70% target utilization):
→ Need 30,000 / 875 ≈ 34.3, round up to 35 servers
→ Currently have 20
→ Need 15 additional servers
Plus buffer (20% for unexpected spikes):
→ 35 × 1.2 = 42 servers total
→ Need 22 additional servers
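The worked example generalizes to a small helper. A sketch under the same assumption the example makes, that request throughput scales linearly with CPU (verify that with a load test before trusting the output); the function name and defaults are illustrative:

```python
import math

def servers_needed(target_rps: float,
                   current_rps: float,
                   current_servers: int,
                   current_cpu: float,
                   target_cpu: float = 0.70,
                   buffer: float = 0.20) -> int:
    """Servers required for target_rps, assuming throughput scales linearly with CPU."""
    per_server_rps = current_rps / current_servers   # 500 req/s at 40% CPU
    max_rps = per_server_rps / current_cpu           # ≈ 1,250 req/s at 100% CPU
    comfortable_rps = max_rps * target_cpu           # ≈ 875 req/s at 70% CPU
    return math.ceil(target_rps / comfortable_rps * (1.0 + buffer))

total = servers_needed(target_rps=30_000, current_rps=10_000,
                       current_servers=20, current_cpu=0.40)
print(total, "servers total,", total - 20, "additional")  # 42 total, 22 additional
```

Here the 20% buffer is applied before rounding; the worked example rounds to 35 first and then buffers, which lands on the same 42 in this case.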
The Bottleneck Hierarchy
Typical bottleneck order (what breaks first):
1. Database connections (pool exhaustion)
2. Database query latency (CPU or I/O bound)
3. Application server memory (GC pressure, memory leaks)
4. Application server CPU (compute-bound workloads)
5. Network bandwidth (large payloads, media serving)
6. External API rate limits (third-party dependencies)
7. DNS resolution (often overlooked, cache TTL issues)
Start investigation at #1 and work down.
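To see which layer gives out first for your own numbers, rank each resource by the traffic multiple at which it saturates. A minimal sketch, assuming load on each resource scales linearly with traffic; the loads and limits shown are the illustrative figures from this guide's example system:

```python
# Rank resources by the traffic multiple at which they saturate.
# Loads and limits are illustrative; measure your own system under load,
# and note the linear-scaling assumption.
resources = {
    # name: (load at current traffic, hard limit)
    "db_connection_pool": (133, 200),        # saturates near 1.5x (~15,000 req/s)
    "api_server_cpu_rps": (10_000, 20_000),  # practical ceiling near 2.0x
    "network_gbps": (2.0, 8.0),              # plenty of headroom (4.0x)
}

def saturation_multiplier(load: float, limit: float) -> float:
    """Traffic multiple at which this resource hits its limit."""
    return limit / load

for name, (load, limit) in sorted(resources.items(),
                                  key=lambda kv: saturation_multiplier(*kv[1])):
    print(f"{name}: saturates at {saturation_multiplier(load, limit):.1f}x current traffic")
# db_connection_pool first (1.5x), then api_server_cpu_rps (2.0x), then network (4.0x)
```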
Each bottleneck has a different scaling strategy.
Scaling Strategies
| Strategy | Best For | Limitation |
|---|---|---|
| Vertical scaling (bigger instance) | Database, single-threaded workloads | Physical limits, expensive at top tier |
| Horizontal scaling (more instances) | Stateless services, web servers | Requires stateless design |
| Caching | Read-heavy workloads | Cache invalidation complexity |
| Read replicas | Read-heavy database workloads | Replication lag |
| CDN | Static assets, media | Limited benefit for dynamic, per-user content |
| Async processing | Background jobs, batch operations | Increased system complexity |
| Sharding | Very large datasets | Application complexity, cross-shard queries |
Auto-Scaling Configuration
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 5    # Never go below 5 (night time traffic)
  maxReplicas: 50   # Never go above 50 (budget control)
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # Scale up when CPU > 60%
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"      # Scale when > 500 req/s per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1 min before scaling up
      policies:
        - type: Percent
          value: 50                     # Add up to 50% more pods
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10                     # Remove max 10% at a time
          periodSeconds: 120
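One caveat on the second metric: CPU utilization works with the standard metrics-server, but a per-pod metric like `http_requests_per_second` only resolves if a custom metrics adapter (for example, Prometheus Adapter) exposes it, so treat that metric name as illustrative. Once applied, `kubectl get hpa checkout-api --watch` shows the live replica count and current metric values against their targets.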
Implementation Checklist
- Track utilization for all critical resources: CPU, memory, disk, network, connections
- Model traffic growth: monthly growth rate + known events (campaigns, launches)
- Identify your bottleneck hierarchy: what breaks first as traffic increases?
- Set utilization thresholds: alert at 70%, plan at 60% sustained
- Calculate capacity runway: “At current growth, we hit limits in N weeks”
- Configure auto-scaling for stateless services with sensible min/max limits
- Slow down scale-down (5 min stabilization) to prevent oscillation
- Load test quarterly at 2× current peak traffic
- Budget capacity additions quarterly with 20% buffer
- Document capacity assumptions and review them when architecture changes