
Kubernetes Resource Management: Stop Wasting Money on Over-Provisioned Pods

Master Kubernetes resource requests, limits, QoS classes, and horizontal/vertical autoscaling. Covers the real cost of over-provisioning, how to rightsize workloads, and the resource management patterns that save money without risking stability.

The average Kubernetes cluster is 60-70% over-provisioned. Engineers set CPU requests to 500m and memory to 512Mi because that is what the template said — regardless of whether the pod actually uses 50m of CPU and 128Mi of memory. Multiply this by 200 pods and you are paying for 10 nodes when 4 would suffice.

The reason this happens is fear. Nobody wants to be the person who under-provisioned a pod and caused an OOM kill in production. So they over-provision, the cluster is stable, and the cloud bill silently grows. This guide covers how to rightsize workloads based on data, not fear.


Requests vs Limits: What They Actually Do

Setting | What It Does | What Happens If Wrong
CPU request | Guaranteed CPU. The scheduler uses this to place pods. | Too high → wasted capacity and the pod may not fit on any node. Too low → the pod is starved of CPU when the node is under contention.
CPU limit | Maximum CPU. The pod is throttled (not killed) above this. | Too high → noisy-neighbor risk. Too low → artificial performance ceiling.
Memory request | Guaranteed memory. The scheduler uses this to place pods. | Too high → wasted capacity. Too low → OOM killed under load.
Memory limit | Maximum memory. The pod is OOM killed above this. | Too high → risk of node memory pressure. Too low → OOM kills.

apiVersion: v1
kind: Pod
metadata:
  name: api                   # example name
spec:
  containers:
    - name: api
      image: nginx:1.25       # placeholder image, for illustration only
      resources:
        requests:
          cpu: "100m"       # 0.1 CPU cores guaranteed
          memory: "256Mi"   # 256 MB guaranteed
        limits:
          cpu: "500m"       # Throttled above 0.5 cores
          memory: "512Mi"   # OOM killed above 512 MB

QoS Classes

Kubernetes assigns Quality of Service classes based on how you set requests and limits:

QoS Class | Condition | Eviction Priority
Guaranteed | requests == limits for all containers | Last to be evicted (safest)
Burstable | Requests or limits set, but the pod does not meet the Guaranteed criteria | Evicted after BestEffort
BestEffort | No requests or limits set | First to be evicted (most risky)

Production recommendation: Set requests for ALL pods (never BestEffort in production). Set memory limits always. CPU limits are debatable — some teams remove them to avoid throttling.
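
For workloads that should be evicted last, set requests equal to limits so the pod lands in the Guaranteed class. A minimal sketch (pod name, image, and values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: payments-api            # illustrative name
spec:
  containers:
    - name: api
      image: nginx:1.25         # placeholder image for illustration
      resources:
        requests:
          cpu: "250m"
          memory: "512Mi"
        limits:
          cpu: "250m"           # equal to the request
          memory: "512Mi"       # equal to the request → Guaranteed QoS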

The CPU Limit Debate

Approach | Argument
Set CPU limits | Prevents noisy neighbors. Predictable performance.
Remove CPU limits | Avoids artificial throttling. Pods can use idle CPU. Better utilization.
Compromise | Set generous CPU limits (3x the request). Throttling only under extreme load.

Google’s internal research suggests removing CPU limits and relying on requests for scheduling gives better overall cluster utilization. However, this only works if your cluster has headroom and your monitoring catches CPU contention.
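
A rough sketch of the compromise row above, shown as the resources stanza of a container (values are illustrative): keep the memory limit, and set the CPU limit at roughly three times the request so throttling only appears under heavy bursts.

resources:
  requests:
    cpu: "200m"        # what the scheduler reserves
    memory: "256Mi"
  limits:
    cpu: "600m"        # ~3x the request; throttled only under extreme bursts
    memory: "512Mi"    # memory limit is always set
# Teams that remove CPU limits entirely would omit the cpu line under limits
# and rely on the request alone for scheduling.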


Rightsizing Workloads

Step 1: Measure Actual Usage

# CPU: P95 usage over 14 days (per pod)
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{namespace="production"}[5m])[14d:5m]
)

# Memory: P95 usage over 14 days (per pod)
quantile_over_time(0.95,
  container_memory_working_set_bytes{namespace="production"}[14d]
)

Step 2: Compare to Requests

Pod: checkout-api
  CPU request:  500m    |  CPU P95 usage:    80m   |  Over-provisioned: 84%
  Memory request: 512Mi |  Memory P95 usage: 180Mi |  Over-provisioned: 65%

  Recommendation:
    CPU request:    120m (P95 × 1.5 safety margin)
    Memory request: 270Mi (P95 × 1.5 safety margin)
    Memory limit:   400Mi (P99 × 1.5 safety margin)

  Monthly savings per replica: ~$12
  × 5 replicas = $60/month
  × 40 similar pods = $2,400/month

Step 3: Apply with VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  updatePolicy:
    updateMode: "Off"  # Start with "Off" — recommendations only
    # Switch to "Auto" after validating recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "50m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"

Horizontal Pod Autoscaling (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3               # Never go below 3 for HA
  maxReplicas: 20              # Cost ceiling
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1 min before scaling up
      policies:
        - type: Pods
          value: 4                       # Add max 4 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25                      # Remove max 25% of pods at a time
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Scale when CPU > 70% of request
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80  # Scale when memory > 80% of request

HPA Common Mistakes

Mistake | What Happens | Fix
minReplicas = 1 | Single point of failure. No HA. | Set minReplicas ≥ 2 (3 for critical services)
maxReplicas too high | Autoscaler provisions 100 pods, $10K bill | Set a realistic max based on actual load patterns
Scaling on CPU only | Memory-bound apps never scale | Add a memory metric alongside CPU
No stabilization window | Rapid scale up/down (flapping) | Set scaleDown stabilization ≥ 300s
HPA + VPA on the same metric | They fight each other | Use VPA for vertical and HPA for horizontal scaling, on different metrics (see the sketch below)
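
One common way to resolve the last row, sketched under the assumption that your VPA version supports the controlledResources field: let the HPA handle CPU-driven horizontal scaling while the VPA adjusts memory only, and drop the memory metric from the HPA above so the two controllers never act on the same resource.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["memory"]   # VPA manages memory only
        # CPU stays with the HPA; remove the memory metric from the HPA
        # so the two controllers do not fight over the same resource.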

Node Sizing Strategy

Approach | Pros | Cons
Few large nodes | Better bin-packing, less overhead | Bigger blast radius per node failure
Many small nodes | Smaller blast radius, faster replacement | More scheduling overhead, worse utilization
Mixed sizes | Flexible, optimize for different workloads | More complex management

General guidance:
  - Production: 4-16 vCPU, 16-64GB nodes (balance of flexibility and utilization)
  - Batch/ML: Large nodes (32+ vCPU) for GPU or memory-intensive workloads
  - System pods: Small dedicated nodes (2 vCPU) for kube-system, monitoring

  Never run production on fewer than 3 nodes (HA requirement).
  Target 60-75% node utilization (leave headroom for bursts and pod scheduling).

Implementation Checklist

  • Audit all pods: compare resource requests to actual P95 usage over 14 days
  • Rightsize top 20 most over-provisioned pods (biggest savings, lowest risk)
  • Set memory limits on all production pods (prevent unbounded memory growth)
  • Deploy VPA in recommendation mode to get ongoing rightsizing suggestions
  • Configure HPA for stateless services with scaleDown stabilization ≥ 300 seconds
  • Set minReplicas ≥ 3 for all critical services
  • Never run BestEffort pods in production (always set resource requests; a LimitRange can enforce this, see the sketch after this list)
  • Monitor cluster utilization: target 60-75% node CPU/memory usage
  • Review resource usage monthly and apply rightsizing recommendations
  • Calculate monthly savings from rightsizing and report to leadership
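
Two of the items above (memory limits on every pod, no BestEffort pods) can be enforced rather than merely audited. A minimal sketch of a namespace LimitRange that injects default requests and limits into any container that omits them; the namespace and values are illustrative:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        cpu: "100m"
        memory: "128Mi"
      default:               # applied when a container sets no limits
        cpu: "500m"
        memory: "256Mi"

Containers that declare their own requests and limits are unaffected; the defaults only fill in missing values.
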
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
