
Kubernetes Resource Management: Stop Wasting Money on Over-Provisioned Pods

Master Kubernetes resource requests, limits, QoS classes, and horizontal/vertical autoscaling. Covers the real cost of over-provisioning, how to rightsize workloads, and the resource management patterns that save money without risking stability.

The average Kubernetes cluster is 60-70% over-provisioned. Engineers set CPU requests to 500m and memory to 512Mi because that is what the template said — regardless of whether the pod actually uses 50m of CPU and 128Mi of memory. Multiply this by 200 pods and you are paying for 10 nodes when 4 would suffice.

The reason this happens is fear. Nobody wants to be the person who under-provisioned a pod and caused an OOM kill in production. So they over-provision, the cluster is stable, and the cloud bill silently grows. This guide covers how to rightsize workloads based on data, not fear.


Requests vs Limits: What They Actually Do

Setting | What It Does | What Happens If Wrong
CPU request | Guaranteed CPU. The scheduler uses this to place pods. | Too high → wasted capacity and the pod may not fit on any node. Too low → the pod is starved of CPU when the node is under contention.
CPU limit | Maximum CPU. The pod is throttled (not killed) above this. | Too high → noisy-neighbor risk. Too low → artificial performance ceiling.
Memory request | Guaranteed memory. The scheduler uses this to place pods. | Too high → wasted capacity. Too low → OOM killed under load.
Memory limit | Maximum memory. The pod is OOM killed above this. | Too high → risk of node memory pressure. Too low → OOM kills.

apiVersion: v1
kind: Pod
metadata:
  name: api                   # example name
spec:
  containers:
    - name: api
      image: nginx:1.25       # placeholder image, for illustration only
      resources:
        requests:
          cpu: "100m"       # 0.1 CPU cores guaranteed
          memory: "256Mi"   # 256 MB guaranteed
        limits:
          cpu: "500m"       # Throttled above 0.5 cores
          memory: "512Mi"   # OOM killed above 512 MB

QoS Classes

Kubernetes assigns Quality of Service classes based on how you set requests and limits:

QoS Class | Condition | Eviction Priority
Guaranteed | requests == limits for all containers | Last to be evicted (safest)
Burstable | Requests or limits set, but the pod does not meet the Guaranteed criteria | Evicted after BestEffort
BestEffort | No requests or limits set | First to be evicted (most risky)

Production recommendation: Set requests for ALL pods (never BestEffort in production). Set memory limits always. CPU limits are debatable — some teams remove them to avoid throttling.
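
For workloads that should be evicted last, set requests equal to limits so the pod lands in the Guaranteed class. A minimal sketch (pod name, image, and values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: payments-api            # illustrative name
spec:
  containers:
    - name: api
      image: nginx:1.25         # placeholder image for illustration
      resources:
        requests:
          cpu: "250m"
          memory: "512Mi"
        limits:
          cpu: "250m"           # equal to the request
          memory: "512Mi"       # equal to the request → Guaranteed QoS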

The CPU Limit Debate

Approach | Argument
Set CPU limits | Prevents noisy neighbors. Predictable performance.
Remove CPU limits | Avoids artificial throttling. Pods can use idle CPU. Better utilization.
Compromise | Set generous CPU limits (3x the request). Throttling only under extreme load.

Google’s internal research suggests removing CPU limits and relying on requests for scheduling gives better overall cluster utilization. However, this only works if your cluster has headroom and your monitoring catches CPU contention.
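
A rough sketch of the compromise row above, shown as the resources stanza of a container (values are illustrative): keep the memory limit, and set the CPU limit at roughly three times the request so throttling only appears under heavy bursts.

resources:
  requests:
    cpu: "200m"        # what the scheduler reserves
    memory: "256Mi"
  limits:
    cpu: "600m"        # ~3x the request; throttled only under extreme bursts
    memory: "512Mi"    # memory limit is always set
# Teams that remove CPU limits entirely would omit the cpu line under limits
# and rely on the request alone for scheduling.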


Rightsizing Workloads

Step 1: Measure Actual Usage

# CPU: P95 usage over 14 days (per pod)
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{namespace="production"}[5m])[14d:5m]
)

# Memory: P95 usage over 14 days (per pod)
quantile_over_time(0.95,
  container_memory_working_set_bytes{namespace="production"}[14d]
)

Step 2: Compare to Requests

Pod: checkout-api
  CPU request:  500m    |  CPU P95 usage:    80m   |  Over-provisioned: 84%
  Memory request: 512Mi |  Memory P95 usage: 180Mi |  Over-provisioned: 65%

  Recommendation:
    CPU request:    120m (P95 × 1.5 safety margin)
    Memory request: 270Mi (P95 × 1.5 safety margin)
    Memory limit:   400Mi (P99 × 1.5 safety margin)

  Monthly savings per replica: ~$12
  × 5 replicas = $60/month
  × 40 similar pods = $2,400/month

Step 3: Apply with VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  updatePolicy:
    updateMode: "Off"  # Start with "Off" — recommendations only
    # Switch to "Auto" after validating recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "50m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"

Horizontal Pod Autoscaling (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3               # Never go below 3 for HA
  maxReplicas: 20              # Cost ceiling
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1 min before scaling up
      policies:
        - type: Pods
          value: 4                       # Add max 4 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25                      # Remove max 25% of pods at a time
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Scale when CPU > 70% of request
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80  # Scale when memory > 80% of request

HPA Common Mistakes

Mistake | What Happens | Fix
minReplicas = 1 | Single point of failure. No HA. | Set minReplicas ≥ 2 (3 for critical services)
maxReplicas too high | Autoscaler provisions 100 pods, $10K bill | Set a realistic max based on actual load patterns
Scaling on CPU only | Memory-bound apps never scale | Add a memory metric alongside CPU
No stabilization window | Rapid scale up/down (flapping) | Set scaleDown stabilization ≥ 300s
HPA + VPA on the same metric | They fight each other | Use VPA for vertical and HPA for horizontal scaling, on different metrics (see the sketch below)
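
One common way to resolve the last row, sketched under the assumption that your VPA version supports the controlledResources field: let the HPA handle CPU-driven horizontal scaling while the VPA adjusts memory only, and drop the memory metric from the HPA above so the two controllers never act on the same resource.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["memory"]   # VPA manages memory only
        # CPU stays with the HPA; remove the memory metric from the HPA
        # so the two controllers do not fight over the same resource.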

Node Sizing Strategy

Approach | Pros | Cons
Few large nodes | Better bin-packing, less overhead | Bigger blast radius per node failure
Many small nodes | Smaller blast radius, faster replacement | More scheduling overhead, worse utilization
Mixed sizes | Flexible, optimize for different workloads | More complex management

General guidance:
  - Production: 4-16 vCPU, 16-64GB nodes (balance of flexibility and utilization)
  - Batch/ML: Large nodes (32+ vCPU) for GPU or memory-intensive workloads
  - System pods: Small dedicated nodes (2 vCPU) for kube-system, monitoring

  Never run production on fewer than 3 nodes (HA requirement).
  Target 60-75% node utilization (leave headroom for bursts and pod scheduling).

Implementation Checklist

  • Audit all pods: compare resource requests to actual P95 usage over 14 days
  • Rightsize top 20 most over-provisioned pods (biggest savings, lowest risk)
  • Set memory limits on all production pods (prevent unbounded memory growth)
  • Deploy VPA in recommendation mode to get ongoing rightsizing suggestions
  • Configure HPA for stateless services with scaleDown stabilization ≥ 300 seconds
  • Set minReplicas ≥ 3 for all critical services
  • Never run BestEffort pods in production (always set resource requests; a LimitRange can enforce this, see the sketch after this list)
  • Monitor cluster utilization: target 60-75% node CPU/memory usage
  • Review resource usage monthly and apply rightsizing recommendations
  • Calculate monthly savings from rightsizing and report to leadership
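
Two of the items above (memory limits on every pod, no BestEffort pods) can be enforced rather than merely audited. A minimal sketch of a namespace LimitRange that injects default requests and limits into any container that omits them; the namespace and values are illustrative:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        cpu: "100m"
        memory: "128Mi"
      default:               # applied when a container sets no limits
        cpu: "500m"
        memory: "256Mi"

Containers that declare their own requests and limits are unaffected; the defaults only fill in missing values.
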
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
