
Spot Instance Engineering: Architecting for Interruption

Leverage spot and preemptible instances for 60-90% cost savings by designing workloads that handle interruption gracefully. Covers workload selection, interruption handling, diversification strategies, and operational patterns for reliable spot-based infrastructure.

Spot instances are the cheapest compute a cloud provider offers, typically 60-90% below on-demand pricing. The catch: the provider can reclaim them on short notice, 2 minutes on AWS and about 30 seconds on GCP and Azure. This makes spot instances a poor fit for stateful single-instance workloads and invaluable for workloads that can be distributed, retried, or interrupted without data loss.

The engineering challenge is designing your infrastructure so that a 2-minute eviction notice is a routine event, not an emergency.


Cost Comparison

AWS (us-east-1):
  m5.xlarge on-demand:  $0.192/hr
  m5.xlarge spot:       $0.058/hr  (70% savings)
  
GCP:
  n2-standard-4 on-demand:  $0.194/hr
  n2-standard-4 preemptible: $0.048/hr  (75% savings)

Azure:
  Standard_D4s_v3 pay-as-you-go:  $0.192/hr
  Standard_D4s_v3 spot:            $0.038/hr  (80% savings)

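At scale, this translates to six-figure annual savings: 100 m5.xlarge instances running around the clock cost about $117,000 less per year on spot at the prices above. Spot prices vary by region, zone, and time, so treat these figures as a snapshot.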
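The same arithmetic can be run for each provider. The sketch below uses the hourly prices from the comparison above; the fleet size of 100 instances and the assumption that every instance runs 24x7 are illustrative, not taken from a real bill.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; assumes instances run continuously

# (on-demand $/hr, spot $/hr) from the cost comparison above
PRICES = {
    "AWS m5.xlarge": (0.192, 0.058),
    "GCP n2-standard-4": (0.194, 0.048),
    "Azure Standard_D4s_v3": (0.192, 0.038),
}


def annual_savings(on_demand_hourly, spot_hourly, instance_count):
    """Return yearly savings in dollars from running the fleet on spot."""
    hourly_delta = on_demand_hourly - spot_hourly
    return hourly_delta * instance_count * HOURS_PER_YEAR


def savings_percent(on_demand_hourly, spot_hourly):
    """Return the spot discount as a percentage of the on-demand price."""
    return (1 - spot_hourly / on_demand_hourly) * 100


if __name__ == "__main__":
    fleet_size = 100  # hypothetical fleet size
    for name, (on_demand, spot) in PRICES.items():
        saved = annual_savings(on_demand, spot, fleet_size)
        pct = savings_percent(on_demand, spot)
        print(f"{name}: ${saved:,.0f}/year saved ({pct:.0f}% discount)")
```

Real savings will be lower if instances do not run continuously or if evictions force some work to be repeated.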


Workload Classification

Excellent for Spot

  • Batch processing: MapReduce, ETL, data pipelines
  • CI/CD workers: Build agents, test runners
  • Training ML models: Checkpointing handles interruption
  • Stateless web servers: Behind a load balancer with multiple instances
  • Rendering: Video transcoding, image processing
  • Big data: Spark, EMR, Dataproc clusters

Poor for Spot

  • Single-instance databases: Eviction = downtime
  • Stateful services with no replication: Data loss on eviction
  • Long-running transactions: Cannot be retried cheaply
  • Real-time systems with strict SLAs: Eviction causes SLA violations

Interruption Handling

AWS Spot Interruption Notice

AWS provides a 2-minute warning via the instance metadata service:

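import time

import requests

METADATA_URL = "http://169.254.169.254/latest"

def get_imds_token():
    # IMDSv2 needs a session token; instances that enforce it reject untokened requests
    response = requests.put(
        f"{METADATA_URL}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
        timeout=1,
    )
    response.raise_for_status()
    return response.text

def check_interruption():
    """Return True if AWS has scheduled this instance for interruption."""
    try:
        response = requests.get(
            f"{METADATA_URL}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": get_imds_token()},
            timeout=1,
        )
    except requests.exceptions.RequestException:
        return False  # Metadata service unreachable; retry on the next poll
    if response.status_code != 200:
        return False  # 404 means no interruption is scheduled
    action = response.json()
    print(f"Spot interruption: {action['action']} at {action['time']}")
    return True

# Poll every 5 seconds; stop polling once an interruption is announced
while not check_interruption():
    time.sleep(5)
initiate_graceful_shutdown()  # Defined in the next section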

Graceful Shutdown on Eviction

When an interruption notice arrives:

  1. Stop accepting new work (deregister from load balancer)
  2. Complete in-flight work (finish current request/job)
  3. Checkpoint state (save progress to durable storage)
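  4. Signal the orchestrator (request replacement instance)

def initiate_graceful_shutdown():
    # Helper functions, current_job, and s3_bucket are application-specific placeholders

    # Stop accepting new work: deregister from the ALB target group
    deregister_from_target_group()

    # Checkpoint the current batch job so a replacement instance can resume it
    if current_job:
        current_job.save_checkpoint(s3_bucket)

    # Drain connections (max 90 seconds, leaving 30s of the 2-minute notice for cleanup)
    drain_connections(timeout=90)

    # Final cleanup
    flush_logs()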
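Because the whole sequence has to fit inside the notice window, it can help to track the remaining time explicitly. The following is a minimal sketch, not a definitive implementation: `run_shutdown`, its parameters, and the convention that each step accepts a timeout in seconds are all assumptions made for illustration.

```python
import time


def run_shutdown(steps, budget_seconds=120.0, reserve_seconds=30.0,
                 clock=time.monotonic):
    """Run (name, func) steps in order within a fixed time budget.

    Each func receives the number of seconds it may use. Non-final steps
    are skipped once only the reserve is left, so the final step (for
    example, flushing logs) always gets a chance to run.
    Returns (completed_names, skipped_names).
    """
    deadline = clock() + budget_seconds
    completed, skipped = [], []
    for index, (name, func) in enumerate(steps):
        is_last = index == len(steps) - 1
        remaining = deadline - clock()
        if not is_last and remaining <= reserve_seconds:
            skipped.append(name)  # Not enough time left before the reserve
            continue
        # Non-final steps may not eat into the reserve
        func(remaining if is_last else remaining - reserve_seconds)
        completed.append(name)
    return completed, skipped
```

Steps should be listed in order of importance, with checkpointing before anything that can stall, since a slow early step causes later non-final steps to be skipped.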

Diversification Strategies

A spot pool for a single instance type in a single availability zone can be exhausted. Diversify across multiple instance types and availability zones:

AWS Fleet Configuration

{
  "SpotOptions": {
    "AllocationStrategy": "capacity-optimized-prioritized",
    "InstanceInterruptionBehavior": "terminate"
  },
  "Overrides": [
    { "InstanceType": "m5.xlarge",  "Priority": 1 },
    { "InstanceType": "m5a.xlarge", "Priority": 2 },
    { "InstanceType": "m5d.xlarge", "Priority": 3 },
    { "InstanceType": "m4.xlarge",  "Priority": 4 },
    { "InstanceType": "r5.large",   "Priority": 5 }
  ]
}
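The fleet configuration above diversifies by instance type only. One way to also spread across availability zones is to emit one override per instance type and subnet pair. The sketch below generates such a list; `build_overrides` and the subnet IDs in the usage example are hypothetical, and the resulting dictionaries are meant to be placed in the `Overrides` list of a fleet request.

```python
from itertools import product


def build_overrides(instance_types, subnet_ids):
    """Return one override per (instance type, subnet) pair.

    Priority follows the order of instance_types (1 = most preferred)
    and is shared by every subnet for that type.
    """
    overrides = []
    prioritized = enumerate(instance_types, start=1)
    for (priority, instance_type), subnet_id in product(prioritized, subnet_ids):
        overrides.append({
            "InstanceType": instance_type,
            "SubnetId": subnet_id,
            "Priority": priority,
        })
    return overrides


if __name__ == "__main__":
    # Placeholder subnet IDs, one per availability zone
    for override in build_overrides(
        ["m5.xlarge", "m5a.xlarge"], ["subnet-aaa", "subnet-bbb", "subnet-ccc"]
    ):
        print(override)
```

Five instance types across three zones yields fifteen spot pools to draw from instead of five.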

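Kubernetes with Spot Node Pools

The manifest below uses Karpenter's karpenter.sh/v1alpha5 Provisioner API. Newer Karpenter releases replace Provisioner with NodePool, so adapt the manifest to the version you run.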

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-workers
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - m5.xlarge
        - m5a.xlarge
        - m5d.xlarge
        - m5n.xlarge
        - m6i.xlarge
  limits:
    resources:
      cpu: 200
  ttlSecondsAfterEmpty: 30

Hybrid Architectures

The most reliable pattern combines on-demand for baseline capacity with spot for variable load:

On-Demand Base (always running):
  └── 3 instances → handles P10 traffic
  
Spot Fleet (auto-scaled):
  └── 0-20 instances → handles P10 to P90
  
Fallback to On-Demand:
  └── If spot capacity unavailable, auto-scale on-demand

ASG Mixed Instance Policy

MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 3           # Always 3 on-demand
    OnDemandPercentageAboveBaseCapacity: 0  # Everything else is spot
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: m5.xlarge
      - InstanceType: m5a.xlarge
      - InstanceType: m5d.xlarge
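To see how the two distribution settings interact, the sketch below computes the on-demand and spot split for a given desired capacity. `split_capacity` is a hypothetical helper, and rounding the on-demand share up is an assumption; with the 0% setting used above, rounding does not come into play.

```python
import math


def split_capacity(desired, on_demand_base=3, on_demand_pct_above_base=0):
    """Return (on_demand, spot) instance counts for a desired group size."""
    # The base is filled with on-demand first, up to the desired size
    base = min(desired, on_demand_base)
    above_base = desired - base
    # Assumption: the on-demand share above the base is rounded up
    extra_on_demand = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


if __name__ == "__main__":
    for desired in (2, 3, 10, 23):
        on_demand, spot = split_capacity(desired)
        print(f"desired={desired}: {on_demand} on-demand, {spot} spot")
```

With the policy above, a group of 10 runs 3 on-demand and 7 spot instances, and a group at the hybrid pattern's peak of 23 runs 3 on-demand and 20 spot.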

ML Training with Spot

Machine learning training is the ideal spot workload — long-running but checkpointable:

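import io

import boto3
import torch

s3 = boto3.client("s3")
BUCKET = "checkpoints"

# Save checkpoint every N steps
def save_checkpoint(model, optimizer, epoch, step, loss):
    # torch.save cannot write to an s3:// path, so serialize to memory and upload
    buffer = io.BytesIO()
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, buffer)
    buffer.seek(0)
    s3.upload_fileobj(buffer, BUCKET, f'model_epoch{epoch}_step{step}.pt')

def find_latest_checkpoint():
    # Most recently written object in the bucket (first 1,000 keys only)
    objects = s3.list_objects_v2(Bucket=BUCKET).get('Contents', [])
    return max(objects, key=lambda o: o['LastModified'])['Key'] if objects else None

# Resume from latest checkpoint
def load_latest_checkpoint(model, optimizer):
    latest = find_latest_checkpoint()
    if latest:
        buffer = io.BytesIO()
        s3.download_fileobj(BUCKET, latest, buffer)
        buffer.seek(0)
        checkpoint = torch.load(buffer)
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        return checkpoint['epoch'], checkpoint['step']
    return 0, 0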

Checkpoint every 15-30 minutes. On interruption, a new spot instance picks up from the last checkpoint.
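A time-based trigger is one way to hit that interval regardless of how long each training step takes. The following is a minimal sketch; `CheckpointTimer` is a hypothetical helper, and the injectable `clock` parameter exists only to make it testable.

```python
import time


class CheckpointTimer:
    """Tracks whether enough wall-clock time has passed to checkpoint again."""

    def __init__(self, interval_seconds=15 * 60, clock=time.monotonic):
        self.interval_seconds = interval_seconds
        self._clock = clock
        self._last_saved = clock()

    def due(self):
        """Return True once interval_seconds have elapsed since the last save."""
        return self._clock() - self._last_saved >= self.interval_seconds

    def mark_saved(self):
        """Restart the interval after a checkpoint has been written."""
        self._last_saved = self._clock()


# Usage inside a training loop (save_checkpoint is the function shown earlier):
#
#   timer = CheckpointTimer(interval_seconds=15 * 60)
#   for step, batch in enumerate(loader):
#       loss = train_step(batch)          # train_step is a placeholder
#       if timer.due():
#           save_checkpoint(model, optimizer, epoch, step, loss)
#           timer.mark_saved()
```

With a 15-minute interval, an eviction costs at most about 15 minutes of recomputation plus the time to start a replacement instance.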


Anti-Patterns

Anti-Pattern                | Consequence                          | Fix
Single instance type        | Capacity exhaustion                  | Diversify across 5+ types
No interruption handler     | Abrupt termination, lost work        | Poll metadata, checkpoint
Spot-only with no fallback  | Complete outage if spot unavailable  | Hybrid on-demand + spot
Large monolithic jobs       | All progress lost on eviction        | Break into small checkpointable units
Ignoring AZ distribution    | All spot capacity in one AZ          | Spread across 3+ AZs
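The fix for large monolithic jobs, breaking them into small checkpointable units, can be sketched as a loop that records progress after every chunk. This is an illustration under stated assumptions: `run_job`, `load_progress`, and `save_progress` are hypothetical names, and the progress file is written to a local path for simplicity, whereas on a real spot instance it would need to live on durable storage that survives eviction.

```python
import json
import os


def load_progress(path):
    """Return the index of the next chunk to process (0 if starting fresh)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_chunk"]


def save_progress(path, next_chunk):
    """Record progress atomically so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_chunk": next_chunk}, f)
    os.replace(tmp_path, path)


def run_job(items, process_chunk, progress_path, chunk_size=100):
    """Process items in chunks, resuming after the last completed chunk.

    Returns the number of chunks processed in this run.
    """
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    start = load_progress(progress_path)
    for index in range(start, len(chunks)):
        process_chunk(chunks[index])
        save_progress(progress_path, index + 1)  # Chunk is done; never redo it
    return len(chunks) - start
```

An eviction in the middle of a chunk loses only that chunk's work, so `process_chunk` should be safe to run twice on the same input.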

Spot instances are not a cost optimization hack — they are an architectural pattern. The discount compensates you for building resilient, distributed, stateless systems. And those architectural properties have value far beyond the cost savings.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
