Spot Instance Engineering: Architecting for Interruption
Leverage spot and preemptible instances for 60-90% cost savings by designing workloads that handle interruption gracefully. Covers workload selection, interruption handling, diversification strategies, and operational patterns for reliable spot-based infrastructure.
Spot instances are the cheapest compute available from any cloud provider — 60-90% below on-demand pricing. The catch: the provider can reclaim them with as little as two minutes' notice. This makes spot instances useless for stateful single-instance workloads, and invaluable for workloads that can be distributed, retried, or interrupted without data loss.
The engineering challenge is designing your infrastructure so that a 2-minute eviction notice is a routine event, not an emergency.
Cost Comparison
AWS (us-east-1):
m5.xlarge on-demand: $0.192/hr
m5.xlarge spot: $0.058/hr (70% savings)
GCP:
n2-standard-4 on-demand: $0.194/hr
n2-standard-4 preemptible: $0.048/hr (75% savings)
Azure:
Standard_D4s_v3 pay-as-you-go: $0.192/hr
Standard_D4s_v3 spot: $0.038/hr (80% savings)
At scale, this translates to six-figure annual savings.
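A back-of-envelope check makes the claim concrete. The fleet size below is a hypothetical example; the prices come from the table above:

```python
# Annual savings for a hypothetical fleet of 100 m5.xlarge instances
# running 24/7, using the us-east-1 prices quoted above.
ON_DEMAND = 0.192   # $/hr
SPOT = 0.058        # $/hr
FLEET_SIZE = 100
HOURS_PER_YEAR = 24 * 365

annual_savings = (ON_DEMAND - SPOT) * FLEET_SIZE * HOURS_PER_YEAR
print(f"${annual_savings:,.0f} per year")  # prints "$117,384 per year"
```

Even a modest 100-instance fleet clears six figures; larger fleets scale linearly.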
Workload Classification
Excellent for Spot
- Batch processing: MapReduce, ETL, data pipelines
- CI/CD workers: Build agents, test runners
- Training ML models: Checkpointing handles interruption
- Stateless web servers: Behind a load balancer with multiple instances
- Rendering: Video transcoding, image processing
- Big data: Spark, EMR, Dataproc clusters
Poor for Spot
- Single-instance databases: Eviction = downtime
- Stateful services with no replication: Data loss on eviction
- Long-running transactions: Cannot be retried cheaply
- Real-time systems with strict SLAs: Eviction causes SLA violations
Interruption Handling
AWS Spot Interruption Notice
AWS provides a two-minute warning via the instance metadata service (the example below uses IMDSv1; if IMDSv2 is enforced on the instance, requests must also carry a session token header):
import requests
import time

INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def check_interruption():
    try:
        response = requests.get(INSTANCE_ACTION_URL, timeout=1)
        if response.status_code == 200:  # 404 means no interruption scheduled
            action = response.json()
            print(f"Spot interruption: {action['action']} at {action['time']}")
            initiate_graceful_shutdown()
    except requests.exceptions.RequestException:
        pass  # Metadata service unreachable; treat as no interruption

# Poll every 5 seconds
while True:
    check_interruption()
    time.sleep(5)
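The instance-action document contains an `action` field (`stop` or `terminate`) and a UTC `time` field. A small helper (a sketch, not part of any AWS SDK) can turn that into a countdown for shutdown logic:

```python
from datetime import datetime, timezone

def seconds_until_interruption(payload, now=None):
    """Seconds remaining until the scheduled interruption.

    `payload` is the parsed JSON from the spot/instance-action endpoint,
    e.g. {"action": "terminate", "time": "2024-06-01T12:02:00Z"}.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    t = datetime.strptime(payload["time"], "%Y-%m-%dT%H:%M:%SZ")
    t = t.replace(tzinfo=timezone.utc)
    return (t - now).total_seconds()
```

With roughly 120 seconds on the clock, the shutdown sequence below can budget its phases against the actual deadline instead of a fixed guess.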
Graceful Shutdown on Eviction
When an interruption notice arrives:
- Stop accepting new work (deregister from load balancer)
- Complete in-flight work (finish current request/job)
- Checkpoint state (save progress to durable storage)
- Signal the orchestrator (request replacement instance)
def initiate_graceful_shutdown():
    # Stop accepting new work: deregister from the ALB target group
    deregister_from_target_group()
    # Checkpoint the current batch job so a replacement can resume it
    if current_job:
        current_job.save_checkpoint(s3_bucket)
    # Drain connections (max 90 seconds, leaving ~30s for cleanup)
    drain_connections(timeout=90)
    # Final cleanup
    flush_logs()
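`drain_connections` is left abstract above; a minimal sketch of what it might look like follows. The `active_count` callback is a hypothetical hook into your server's connection tracker:

```python
import time

def drain_connections(timeout, active_count=lambda: 0, poll_interval=1.0):
    """Wait for in-flight connections to finish, up to `timeout` seconds.

    `active_count` is a hypothetical callable returning the number of
    currently open connections. Returns True if fully drained, False
    if the deadline was reached with connections still open.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if active_count() == 0:
            return True
        time.sleep(poll_interval)
    return active_count() == 0
```

The key property is the hard deadline: draining must never run past its budget, because the remaining cleanup steps still need their slice of the two-minute window.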
Diversification Strategies
A single instance type's spot pool can be exhausted. Diversify across multiple instance types and availability zones:
AWS Fleet Configuration
{
  "SpotOptions": {
    "AllocationStrategy": "capacity-optimized-prioritized",
    "InstanceInterruptionBehavior": "terminate"
  },
  "Overrides": [
    { "InstanceType": "m5.xlarge",  "Priority": 1 },
    { "InstanceType": "m5a.xlarge", "Priority": 2 },
    { "InstanceType": "m5d.xlarge", "Priority": 3 },
    { "InstanceType": "m4.xlarge",  "Priority": 4 },
    { "InstanceType": "r5.large",   "Priority": 5 }
  ]
}
Kubernetes with Spot Node Pools
# Karpenter v1alpha5 API; newer Karpenter releases use the NodePool resource
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-workers
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - m5.xlarge
        - m5a.xlarge
        - m5d.xlarge
        - m5n.xlarge
        - m6i.xlarge
  limits:
    resources:
      cpu: 200
  ttlSecondsAfterEmpty: 30
Hybrid Architectures
The most reliable pattern combines on-demand for baseline capacity with spot for variable load:
On-Demand Base (always running):
└── 3 instances → handles P10 traffic
Spot Fleet (auto-scaled):
└── 0-20 instances → handles P10 to P90
Fallback to On-Demand:
└── If spot capacity unavailable, auto-scale on-demand
ASG Mixed Instance Policy
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 3                  # Always 3 on-demand
    OnDemandPercentageAboveBaseCapacity: 0   # Everything else is spot
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: m5.xlarge
      - InstanceType: m5a.xlarge
      - InstanceType: m5d.xlarge
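To see what a policy like this buys, a quick blended-cost sketch (the load of 10 instances is a hypothetical example; prices are the m5.xlarge figures from earlier):

```python
ON_DEMAND = 0.192  # $/hr, m5.xlarge us-east-1
SPOT = 0.058       # $/hr
BASE = 3           # OnDemandBaseCapacity

def blended_hourly_cost(total_instances):
    """Hourly cost with BASE on-demand instances and everything above on spot."""
    on_demand = min(total_instances, BASE)
    spot = max(total_instances - BASE, 0)
    return on_demand * ON_DEMAND + spot * SPOT

# At 10 instances: 3 on-demand + 7 spot ≈ $0.98/hr vs $1.92/hr all on-demand
cost = blended_hourly_cost(10)
print(f"blended: ${cost:.3f}/hr vs ${10 * ON_DEMAND:.2f}/hr all on-demand")
```

Roughly half the cost at typical load, while the on-demand base guarantees a floor of capacity that no spot eviction can touch.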
ML Training with Spot
Machine learning training is the ideal spot workload — long-running but checkpointable:
import torch

# Save a checkpoint every N steps. Note: torch.save cannot write directly
# to an s3:// URI — write locally, then upload. upload_to_s3 and
# find_latest_checkpoint are placeholder helpers, not library functions.
def save_checkpoint(model, optimizer, epoch, step, loss):
    local_path = f"/tmp/model_epoch{epoch}_step{step}.pt"
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, local_path)
    upload_to_s3(local_path, f"s3://checkpoints/model_epoch{epoch}_step{step}.pt")

# Resume from the latest checkpoint, if one exists
def load_latest_checkpoint(model, optimizer):
    latest = find_latest_checkpoint('s3://checkpoints/')  # downloads to a local path
    if latest:
        checkpoint = torch.load(latest)
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        return checkpoint['epoch'], checkpoint['step']
    return 0, 0
Checkpoint every 15-30 minutes. On interruption, a new spot instance picks up from the last checkpoint.
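The interval is a tradeoff: longer intervals lose more work per interruption, shorter ones spend more wall-clock time writing checkpoints. A quick way to reason about it (the 30-second checkpoint write time is a hypothetical figure; measure it for your model size):

```python
def checkpoint_tradeoff(interval_min, write_min):
    """Expected work lost per interruption (~half an interval on average)
    and the fraction of wall-clock time spent writing checkpoints."""
    expected_loss_min = interval_min / 2
    overhead = write_min / (interval_min + write_min)
    return expected_loss_min, overhead

# 20-minute interval, 30-second checkpoint write
loss, overhead = checkpoint_tradeoff(20, 0.5)
print(f"~{loss:.0f} min lost per interruption, {overhead:.1%} checkpoint overhead")
```

With these numbers, an interruption costs about 10 minutes of recomputation while checkpointing consumes under 3% of training time, which is why a 15-30 minute interval is a reasonable default.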
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Single instance type | Capacity exhaustion | Diversify across 5+ types |
| No interruption handler | Abrupt termination, lost work | Poll metadata, checkpoint |
| Spot-only with no fallback | Complete outage if spot unavailable | Hybrid on-demand + spot |
| Large monolithic jobs | All progress lost on eviction | Break into small checkpointable units |
| Ignoring AZ distribution | All spot capacity in one AZ | Spread across 3+ AZs |
Spot instances are not a cost optimization hack — they are an architectural pattern. The discount compensates you for building resilient, distributed, stateless systems. And those architectural properties have value far beyond the cost savings.