Spot Instance Engineering: Architecting for Interruption
Leverage spot and preemptible instances for 60-90% cost savings by designing workloads that handle interruption gracefully. Covers workload selection, interruption handling, diversification strategies, and operational patterns for reliable spot-based infrastructure.
Spot instances are the cheapest compute available from any cloud provider — 60-90% below on-demand pricing. The catch: the provider can reclaim them with as little as two minutes' notice. This makes spot instances useless for stateful single-instance workloads, and invaluable for workloads that can be distributed, retried, or interrupted without data loss.
The engineering challenge is designing your infrastructure so that a 2-minute eviction notice is a routine event, not an emergency.
Cost Comparison
AWS (us-east-1):
m5.xlarge on-demand: $0.192/hr
m5.xlarge spot: $0.058/hr (70% savings)
GCP:
n2-standard-4 on-demand: $0.194/hr
n2-standard-4 preemptible: $0.048/hr (75% savings)
Azure:
Standard_D4s_v3 pay-as-you-go: $0.192/hr
Standard_D4s_v3 spot: $0.038/hr (80% savings)
At scale, this translates to six-figure annual savings.
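A back-of-envelope check makes the claim concrete. The fleet size below is a hypothetical example; the prices come from the table above:

```python
# Annual savings for a hypothetical fleet of 100 m5.xlarge instances
# running 24/7, using the us-east-1 prices quoted above.
ON_DEMAND = 0.192   # $/hr
SPOT = 0.058        # $/hr
FLEET_SIZE = 100
HOURS_PER_YEAR = 24 * 365

annual_savings = (ON_DEMAND - SPOT) * FLEET_SIZE * HOURS_PER_YEAR
print(f"${annual_savings:,.0f} per year")  # prints "$117,384 per year"
```

Even a modest 100-instance fleet clears six figures; larger fleets scale linearly.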
Workload Classification
Excellent for Spot
- Batch processing: MapReduce, ETL, data pipelines
- CI/CD workers: Build agents, test runners
- Training ML models: Checkpointing handles interruption
- Stateless web servers: Behind a load balancer with multiple instances
- Rendering: Video transcoding, image processing
- Big data: Spark, EMR, Dataproc clusters
Poor for Spot
- Single-instance databases: Eviction = downtime
- Stateful services with no replication: Data loss on eviction
- Long-running transactions: Cannot be retried cheaply
- Real-time systems with strict SLAs: Eviction causes SLA violations
Interruption Handling
AWS Spot Interruption Notice
AWS provides a two-minute warning via the instance metadata service (the example below uses IMDSv1; if IMDSv2 is enforced on the instance, requests must also carry a session token header):
import requests
import time

INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def check_interruption():
    try:
        response = requests.get(INSTANCE_ACTION_URL, timeout=1)
        if response.status_code == 200:  # 404 means no interruption scheduled
            action = response.json()
            print(f"Spot interruption: {action['action']} at {action['time']}")
            initiate_graceful_shutdown()
    except requests.exceptions.RequestException:
        pass  # Metadata service unreachable; treat as no interruption

# Poll every 5 seconds
while True:
    check_interruption()
    time.sleep(5)
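The instance-action document contains an `action` field (`stop` or `terminate`) and a UTC `time` field. A small helper (a sketch, not part of any AWS SDK) can turn that into a countdown for shutdown logic:

```python
from datetime import datetime, timezone

def seconds_until_interruption(payload, now=None):
    """Seconds remaining until the scheduled interruption.

    `payload` is the parsed JSON from the spot/instance-action endpoint,
    e.g. {"action": "terminate", "time": "2024-06-01T12:02:00Z"}.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    t = datetime.strptime(payload["time"], "%Y-%m-%dT%H:%M:%SZ")
    t = t.replace(tzinfo=timezone.utc)
    return (t - now).total_seconds()
```

With roughly 120 seconds on the clock, the shutdown sequence below can budget its phases against the actual deadline instead of a fixed guess.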
Graceful Shutdown on Eviction
When an interruption notice arrives:
- Stop accepting new work (deregister from load balancer)
- Complete in-flight work (finish current request/job)
- Checkpoint state (save progress to durable storage)
- Signal the orchestrator (request replacement instance)
def initiate_graceful_shutdown():
    # Stop accepting new work: deregister from the ALB target group
    deregister_from_target_group()
    # Checkpoint the current batch job so a replacement can resume it
    if current_job:
        current_job.save_checkpoint(s3_bucket)
    # Drain connections (max 90 seconds, leaving ~30s for cleanup)
    drain_connections(timeout=90)
    # Final cleanup
    flush_logs()
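`drain_connections` is left abstract above; a minimal sketch of what it might look like follows. The `active_count` callback is a hypothetical hook into your server's connection tracker:

```python
import time

def drain_connections(timeout, active_count=lambda: 0, poll_interval=1.0):
    """Wait for in-flight connections to finish, up to `timeout` seconds.

    `active_count` is a hypothetical callable returning the number of
    currently open connections. Returns True if fully drained, False
    if the deadline was reached with connections still open.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if active_count() == 0:
            return True
        time.sleep(poll_interval)
    return active_count() == 0
```

The key property is the hard deadline: draining must never run past its budget, because the remaining cleanup steps still need their slice of the two-minute window.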
Diversification Strategies
A single instance type's spot pool can be exhausted. Diversify across multiple instance types and availability zones:
AWS Fleet Configuration
{
  "SpotOptions": {
    "AllocationStrategy": "capacity-optimized-prioritized",
    "InstanceInterruptionBehavior": "terminate"
  },
  "Overrides": [
    { "InstanceType": "m5.xlarge",  "Priority": 1 },
    { "InstanceType": "m5a.xlarge", "Priority": 2 },
    { "InstanceType": "m5d.xlarge", "Priority": 3 },
    { "InstanceType": "m4.xlarge",  "Priority": 4 },
    { "InstanceType": "r5.large",   "Priority": 5 }
  ]
}
Kubernetes with Spot Node Pools
# Karpenter v1alpha5 API; newer Karpenter releases use the NodePool resource
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-workers
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - m5.xlarge
        - m5a.xlarge
        - m5d.xlarge
        - m5n.xlarge
        - m6i.xlarge
  limits:
    resources:
      cpu: 200
  ttlSecondsAfterEmpty: 30
Hybrid Architectures
The most reliable pattern combines on-demand for baseline capacity with spot for variable load:
On-Demand Base (always running):
└── 3 instances → handles P10 traffic
Spot Fleet (auto-scaled):
└── 0-20 instances → handles P10 to P90
Fallback to On-Demand:
└── If spot capacity unavailable, auto-scale on-demand
ASG Mixed Instance Policy
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 3                  # Always 3 on-demand
    OnDemandPercentageAboveBaseCapacity: 0   # Everything else is spot
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: m5.xlarge
      - InstanceType: m5a.xlarge
      - InstanceType: m5d.xlarge
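To see what a policy like this buys, a quick blended-cost sketch (the load of 10 instances is a hypothetical example; prices are the m5.xlarge figures from earlier):

```python
ON_DEMAND = 0.192  # $/hr, m5.xlarge us-east-1
SPOT = 0.058       # $/hr
BASE = 3           # OnDemandBaseCapacity

def blended_hourly_cost(total_instances):
    """Hourly cost with BASE on-demand instances and everything above on spot."""
    on_demand = min(total_instances, BASE)
    spot = max(total_instances - BASE, 0)
    return on_demand * ON_DEMAND + spot * SPOT

# At 10 instances: 3 on-demand + 7 spot ≈ $0.98/hr vs $1.92/hr all on-demand
cost = blended_hourly_cost(10)
print(f"blended: ${cost:.3f}/hr vs ${10 * ON_DEMAND:.2f}/hr all on-demand")
```

Roughly half the cost at typical load, while the on-demand base guarantees a floor of capacity that no spot eviction can touch.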
ML Training with Spot
Machine learning training is the ideal spot workload — long-running but checkpointable:
import torch

# Save a checkpoint every N steps. Note: torch.save cannot write directly
# to an s3:// URI — write locally, then upload. upload_to_s3 and
# find_latest_checkpoint are placeholder helpers, not library functions.
def save_checkpoint(model, optimizer, epoch, step, loss):
    local_path = f"/tmp/model_epoch{epoch}_step{step}.pt"
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, local_path)
    upload_to_s3(local_path, f"s3://checkpoints/model_epoch{epoch}_step{step}.pt")

# Resume from the latest checkpoint, if one exists
def load_latest_checkpoint(model, optimizer):
    latest = find_latest_checkpoint('s3://checkpoints/')  # downloads to a local path
    if latest:
        checkpoint = torch.load(latest)
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        return checkpoint['epoch'], checkpoint['step']
    return 0, 0
Checkpoint every 15-30 minutes. On interruption, a new spot instance picks up from the last checkpoint.
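The interval is a tradeoff: longer intervals lose more work per interruption, shorter ones spend more wall-clock time writing checkpoints. A quick way to reason about it (the 30-second checkpoint write time is a hypothetical figure; measure it for your model size):

```python
def checkpoint_tradeoff(interval_min, write_min):
    """Expected work lost per interruption (~half an interval on average)
    and the fraction of wall-clock time spent writing checkpoints."""
    expected_loss_min = interval_min / 2
    overhead = write_min / (interval_min + write_min)
    return expected_loss_min, overhead

# 20-minute interval, 30-second checkpoint write
loss, overhead = checkpoint_tradeoff(20, 0.5)
print(f"~{loss:.0f} min lost per interruption, {overhead:.1%} checkpoint overhead")
```

With these numbers, an interruption costs about 10 minutes of recomputation while checkpointing consumes under 3% of training time, which is why a 15-30 minute interval is a reasonable default.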
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Single instance type | Capacity exhaustion | Diversify across 5+ types |
| No interruption handler | Abrupt termination, lost work | Poll metadata, checkpoint |
| Spot-only with no fallback | Complete outage if spot unavailable | Hybrid on-demand + spot |
| Large monolithic jobs | All progress lost on eviction | Break into small checkpointable units |
| Ignoring AZ distribution | All spot capacity in one AZ | Spread across 3+ AZs |
Spot instances are not a cost optimization hack — they are an architectural pattern. The discount compensates you for building resilient, distributed, stateless systems. And those architectural properties have value far beyond the cost savings.