Spot instances are AWS's best-kept semi-secret.
Up to 90% cheaper than on-demand. Same performance. Same hardware.
The catch? AWS can reclaim them with 2 minutes' notice.
This scares most people away. It shouldn't. With the right approach, Spot is production-ready for more workloads than you'd think.
Spot instances are spare AWS capacity sold at steep discounts. When AWS needs that capacity back, your instance gets terminated.
The economics are simple:
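To make the arithmetic concrete, here is a small sketch. The hourly price and fleet size are made-up placeholders, not AWS list prices, and the 70% discount is just one point inside the "up to 90%" range; plug in your own numbers.

```python
# Rough monthly cost comparison. All inputs are illustrative assumptions.
HOURS_PER_MONTH = 730  # common approximation for an average month


def monthly_cost(hourly_price: float, instances: int,
                 hours: int = HOURS_PER_MONTH) -> float:
    """Cost of running `instances` machines for `hours` at `hourly_price`."""
    return hourly_price * instances * hours


def spot_savings(on_demand_hourly: float, discount: float, instances: int):
    """Return (on_demand_cost, spot_cost, savings) for one month.

    `discount` is the Spot discount as a fraction, e.g. 0.70 for 70%.
    """
    on_demand = monthly_cost(on_demand_hourly, instances)
    spot = on_demand * (1 - discount)
    return on_demand, spot, on_demand - spot


if __name__ == "__main__":
    # Hypothetical: 10 instances at $0.20/hour on-demand, 70% Spot discount.
    on_demand, spot, saved = spot_savings(0.20, 0.70, 10)
    print(f"on-demand: ${on_demand:,.2f}  spot: ${spot:,.2f}  saved: ${saved:,.2f}")
```

Actual Spot prices vary by instance type, region, and Availability Zone, so treat the discount as an input to measure, not a guarantee.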
"But what if my instance gets terminated?"
Let's look at real data:
A well-architected Spot deployment is often more reliable than a single on-demand instance.
Build jobs are naturally interruptible. If a build gets killed, just restart it. With modern CI systems, this is automatic.
Savings: 60-80% on build infrastructure.
Kubernetes already handles pod scheduling and node failures. Add Spot nodes with proper taints and tolerations, and Kubernetes handles the rest.
Savings: 50-70% on EKS node groups.
Jobs that can checkpoint and resume are perfect for Spot. AWS Batch has native Spot support.
Savings: 70-90% on batch compute.
If your dev environment goes down for 2 minutes while AWS reclaims capacity, does anyone care?
Savings: 60-80% on non-production.
With load balancers, health checks, and auto-scaling, web tiers can absorb Spot interruptions gracefully.
Savings: 50-70% on web servers.
Model training can checkpoint. When Spot gets reclaimed, resume from checkpoint on a new instance.
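The checkpoint-and-resume pattern is the same for training and batch jobs. Here is a minimal sketch, assuming the job's state fits in a small JSON file; the `run` function, the `stop_after` parameter (which only simulates an interruption), and the toy "work" are all hypothetical stand-ins for a real training loop.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so an interruption never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str) -> dict:
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path: str, steps: int, stop_after=None) -> dict:
    """Do `steps` units of work, checkpointing after each one.

    `stop_after` simulates a Spot interruption after that many units.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated interruption
        state["total"] += state["step"]  # stand-in for one training step
        state["step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    print(run(ckpt, steps=10, stop_after=4))  # "interrupted" partway through
    print(run(ckpt, steps=10))                # new instance picks up where it left off
```

On a real Spot instance the checkpoint has to live somewhere that survives the instance, such as S3 or a shared filesystem, rather than on local disk.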
Savings: 70-90% on training compute.
Don't request only m5.xlarge. Request capacity across multiple instance types:
m5.xlarge, m5a.xlarge, m5n.xlarge, m5d.xlarge,
m4.xlarge, r5.large, c5.xlarge
More instance types = more pools = lower interruption probability.
When using Auto Scaling Groups, set the allocation strategy to capacity-optimized.
This automatically launches in pools with the most available capacity, reducing interruption risk.
Don't go 100% Spot. Use a mix:
If Spot gets interrupted, your on-demand baseline keeps you running.
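Diversification, the capacity-optimized strategy, and the on-demand baseline can all be expressed in one Auto Scaling Group's mixed instances policy. The sketch below builds that policy as a plain dictionary and passes it to boto3's `create_auto_scaling_group`. The group name, launch template name, sizes, and the base/percentage split are illustrative assumptions, not recommendations from AWS.

```python
from typing import Any, Dict, List


def mixed_instances_policy(launch_template_name: str,
                           instance_types: List[str],
                           on_demand_base: int = 2,
                           on_demand_pct_above_base: int = 30) -> Dict[str, Any]:
    """Build a MixedInstancesPolicy: several instance types, an on-demand
    baseline, and capacity-optimized Spot allocation."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type = one more Spot pool to draw from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Always-on-demand instances that survive any Spot interruption.
            "OnDemandBaseCapacity": on_demand_base,
            # Share of capacity above the base that is on-demand; rest is Spot.
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


def create_group(policy: Dict[str, Any], subnet_ids: List[str]) -> None:
    """Create the group. Requires boto3 and AWS credentials; names are hypothetical."""
    import boto3  # imported here so the policy builder works without boto3

    boto3.client("autoscaling").create_auto_scaling_group(
        AutoScalingGroupName="spot-demo",
        MinSize=2,
        MaxSize=20,
        DesiredCapacity=10,
        VPCZoneIdentifier=",".join(subnet_ids),
        MixedInstancesPolicy=policy,
    )


if __name__ == "__main__":
    types = ["m5.xlarge", "m5a.xlarge", "m5n.xlarge", "m5d.xlarge", "m4.xlarge"]
    print(mixed_instances_policy("my-template", types))
```

Spreading the group across subnets in several Availability Zones adds more pools on top of the instance-type diversity.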
When AWS reclaims Spot, you get 2 minutes' notice via:
/spot/termination-time)
Use this time to:
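A minimal sketch of catching the notice from inside the instance, assuming IMDSv2 is enabled and using only the Python standard library. The instance metadata service returns 404 on the `spot/termination-time` path until a notice has been issued. The `on_notice` callback is a hypothetical hook where your checkpointing or connection draining would go, and the 5-second poll interval is an arbitrary choice.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_termination_time(timeout: float = 1.0) -> Optional[str]:
    """Return the termination timestamp if a notice was issued, else None."""
    # IMDSv2: get a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/termination-time",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode().strip()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def wait_for_notice(on_notice: Callable[[str], None],
                    fetch: Callable[[], Optional[str]] = fetch_termination_time,
                    interval: float = 5.0,
                    max_polls: Optional[int] = None) -> Optional[str]:
    """Poll until a notice appears, then call `on_notice` once and return."""
    polls = 0
    while max_polls is None or polls < max_polls:
        when = fetch()
        if when is not None:
            on_notice(when)
            return when
        polls += 1
        time.sleep(interval)
    return None


if __name__ == "__main__":
    # Only meaningful on an EC2 Spot instance.
    wait_for_notice(lambda when: print(f"Reclaim at {when}: checkpoint and drain now"))
```

Passing `fetch` as a parameter keeps the loop testable off EC2: swap in a fake that returns `None` a few times and then a timestamp.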
Don't manage Spot instances manually. Use:
Both handle instance replacement automatically.
EKS + Spot is powerful. Here's the setup:
Create node groups specifically for Spot:
nodeGroups:
  - name: spot-workers
    instanceTypes: [m5.large, m5a.large, m5d.large]
    capacityType: SPOT
    desiredCapacity: 10
    labels:
      node-type: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
Only schedule interruptible workloads on Spot nodes:
tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
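One detail worth spelling out: a toleration only allows a pod onto the tainted Spot nodes; it doesn't force it there. To pin a workload to Spot, pair the toleration with a node selector on the `node-type: spot` label from the node group. Here is a sketch that generates such a Deployment manifest as JSON, which `kubectl apply -f` accepts. The deployment name, image, and replica count are hypothetical.

```python
import json
from typing import Any, Dict


def spot_deployment(name: str, image: str, replicas: int = 3) -> Dict[str, Any]:
    """Deployment that tolerates the spot taint AND selects spot-labeled nodes."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Matches the label set on the Spot node group.
                    "nodeSelector": {"node-type": "spot"},
                    # Matches the taint set on the Spot node group.
                    "tolerations": [{
                        "key": "spot",
                        "operator": "Equal",
                        "value": "true",
                        "effect": "NoSchedule",
                    }],
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical workload; pipe the output to `kubectl apply -f -`.
    print(json.dumps(spot_deployment("batch-worker", "example.com/batch-worker:latest"), indent=2))
```

Workloads without the toleration stay off the Spot nodes entirely, which is what keeps non-interruptible services on on-demand capacity.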
Karpenter is a Kubernetes autoscaler that's Spot-native. It automatically:
It's the easiest way to run Spot on Kubernetes.
Requesting only one instance type limits you to one Spot pool. When that pool runs out, you get interrupted.
Fix: Request at least 10 instance types that meet your requirements.
Your application needs to handle termination gracefully. If you ignore the warning, you'll lose in-progress work.
Fix: Implement termination handling. Checkpoint work. Drain connections.
Databases, queues, and stateful services shouldn't run on Spot. The interruption handling is too complex.
Fix: Use Spot for stateless workloads. Keep stateful services on on-demand or Reserved.
All-Spot deployments can experience cascading failures during high-interruption periods.
Fix: Maintain an on-demand baseline. 20-30% on-demand is a good starting point.
If you don't track Spot savings, you can't prove ROI.
Fix: Use Cost Explorer to compare Spot vs. on-demand spend. Celebrate the wins.
Answer these questions:
1. Can your workload handle 2-minute interruptions?
2. Do you have flexible capacity requirements?
3. Is your workload stateless or checkpoint-able?
4. Are you running Kubernetes?
Here's your first Spot experiment:
1. Identify a candidate workload: Dev environment, CI/CD, batch job
2. Set up a Spot-based Auto Scaling Group with 5+ instance types
3. Run it for a week and monitor interruptions
4. Measure savings in Cost Explorer
5. Expand if successful
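The measuring step can be scripted instead of clicked through. Here is a sketch using the Cost Explorer API via boto3, grouping spend by the `PURCHASE_TYPE` dimension. The date range is a placeholder, the sample response at the bottom is made-up data for illustration, and pagination is ignored for brevity.

```python
from collections import defaultdict
from typing import Any, Dict


def fetch_costs(start: str, end: str) -> Dict[str, Any]:
    """Query Cost Explorer. Requires boto3 and credentials with ce:GetCostAndUsage."""
    import boto3  # imported here so the parser below works without boto3

    return boto3.client("ce").get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},  # YYYY-MM-DD, end is exclusive
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "PURCHASE_TYPE"}],
    )


def cost_by_purchase_type(response: Dict[str, Any]) -> Dict[str, float]:
    """Sum cost per purchase type across all returned time periods."""
    totals: Dict[str, float] = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            amount = group["Metrics"]["UnblendedCost"]["Amount"]
            totals[group["Keys"][0]] += float(amount)
    return dict(totals)


if __name__ == "__main__":
    # Made-up response shaped like the API's output, for illustration only.
    sample = {"ResultsByTime": [
        {"Groups": [
            {"Keys": ["Spot Instances"], "Metrics": {"UnblendedCost": {"Amount": "12.5"}}},
            {"Keys": ["On Demand Instances"], "Metrics": {"UnblendedCost": {"Amount": "40.0"}}},
        ]},
        {"Groups": [
            {"Keys": ["Spot Instances"], "Metrics": {"UnblendedCost": {"Amount": "7.5"}}},
        ]},
    ]}
    print(cost_by_purchase_type(sample))
```

This only shows what you spent on each purchase type. To estimate savings, compare the Spot total against what the same instance-hours would have cost on-demand.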
Start small. Learn. Scale up.
The 70% savings are real. The interruption fear is overblown.