Spot Instances: The Practical Guide to 70% Savings

Spot instances are AWS's best-kept semi-secret.

Up to 90% cheaper than on-demand. Same performance. Same hardware.

The catch? AWS can reclaim them with two minutes' notice.

This scares most people away. It shouldn't. With the right approach, Spot is production-ready for more workloads than you'd think.

What Are Spot Instances?

Spot instances are spare AWS capacity sold at steep discounts. When AWS needs that capacity back, your instance gets terminated.

The economics are simple:

  • AWS has unused capacity
  • Unused capacity generates $0 revenue
  • Selling it cheap generates some revenue
  • Everyone wins (if you can handle interruptions)

The Interruption Reality

"But what if my instance gets terminated?"

Let's look at the numbers (AWS publishes interruption frequency per instance type in its Spot Instance Advisor):

  • Most Spot pools have <5% monthly interruption rate
  • Some pools have <1% interruption rate
  • You get two minutes' warning before termination
  • Diversifying across instance types dramatically reduces interruption risk

A well-architected Spot deployment is often more reliable than a single on-demand instance.

Perfect Workloads for Spot

CI/CD Build Fleets

Build jobs are naturally interruptible. If a build gets killed, just restart it. With modern CI systems, this is automatic.

Savings: 60-80% on build infrastructure.

Kubernetes Worker Nodes

Kubernetes already handles pod scheduling and node failures. Add Spot nodes with proper taints and tolerations, and Kubernetes handles the rest.

Savings: 50-70% on EKS node groups.

Batch Processing

Jobs that can checkpoint and resume are perfect for Spot. AWS Batch has native Spot support.

Savings: 70-90% on batch compute.

Dev/Test Environments

If your dev environment goes down for a few minutes while AWS reclaims capacity and a replacement instance launches, does anyone care?

Savings: 60-80% on non-production.

Stateless Web Tiers

With load balancers, health checks, and auto-scaling, web tiers can absorb Spot interruptions gracefully.

Savings: 50-70% on web servers.

Training ML Models

Model training can checkpoint. When Spot gets reclaimed, resume from checkpoint on a new instance.

Savings: 70-90% on training compute.
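Checkpoint-and-resume is the pattern behind both the batch and training cases above. Here is a minimal sketch of it, not a real training loop: the function names, the JSON checkpoint format, and the `stop_after` parameter (which only simulates a reclaim) are all assumptions made for illustration. On a real Spot instance the checkpoint would need to live somewhere that outlives the instance, such as S3 or a shared file system.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last checkpointed step, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_checkpoint(path, step):
    """Write to a temp file, then rename, so a reclaim mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, path)


def train(path, total_steps, checkpoint_every=10, stop_after=None):
    """Resume from the last checkpoint and run until done.

    `stop_after` is a hypothetical knob that simulates a Spot reclaim
    after that many steps in this run.
    """
    step = load_checkpoint(path)
    steps_run = 0
    while step < total_steps:
        if stop_after is not None and steps_run >= stop_after:
            return step  # simulated interruption: work since the last checkpoint is lost
        # ... one unit of real work (a training step, a batch item) goes here ...
        step += 1
        steps_run += 1
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, step)
    return step
```

If the first run is interrupted at step 25 with a checkpoint every 10 steps, the replacement instance resumes from step 20. The most you ever redo is one checkpoint interval, so the interval is the knob that trades checkpoint overhead against lost work.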

Spot Strategies That Work

1. Diversify Instance Types

Don't request only m5.xlarge. Request capacity across multiple instance types:

m5.xlarge, m5a.xlarge, m5n.xlarge, m5d.xlarge,
m4.xlarge, r5.large, c5.xlarge

More instance types = more pools = lower interruption probability.
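A bit of arithmetic shows why this matters. A Spot pool is one instance type in one Availability Zone, so seven instance types across three zones gives you up to 21 pools. The sketch below assumes the fleet is spread as evenly as possible and that one pool is reclaimed in full. Both are simplifications (real fleets are rarely even, and reclaims are rarely all-or-nothing), and the function name is made up for this example.

```python
def capacity_lost_if_one_pool_reclaimed(total_instances, pool_count):
    """Worst-case share of the fleet lost when a single pool is fully reclaimed.

    Assumes instances are spread as evenly as possible across pools,
    so the largest pool holds ceil(total_instances / pool_count) instances.
    """
    if total_instances < 1 or pool_count < 1:
        raise ValueError("total_instances and pool_count must be at least 1")
    largest_pool = -(-total_instances // pool_count)  # ceiling division
    return largest_pool / total_instances


for pools in (1, 4, 21):
    share = capacity_lost_if_one_pool_reclaimed(21, pools)
    print(f"{pools:>2} pools: up to {share:.0%} of the fleet lost at once")
```

With one pool, a single reclaim can take out everything. With 21 pools, the same event costs you one instance out of 21.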

2. Use Capacity-Optimized Allocation

When using Auto Scaling Groups, set the allocation strategy to capacity-optimized.

This automatically launches in pools with the most available capacity, reducing interruption risk.
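In API terms, the allocation strategy lives in the Auto Scaling group's mixed instances policy. The sketch below only builds that policy as a plain dictionary in the shape `create_auto_scaling_group` expects. The helper name, the launch template name, and the 30% on-demand default (borrowed from the mix this guide suggests) are assumptions, so adjust them to your setup.

```python
def build_mixed_instances_policy(launch_template_name, instance_types, on_demand_percentage=30):
    """Build a MixedInstancesPolicy dict for an EC2 Auto Scaling group.

    `launch_template_name` must refer to a launch template you have already
    created; the name used in the example below is hypothetical.
    """
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Each override adds one instance type, and so one more set of Spot pools.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Share of capacity (above any base capacity) that stays on-demand.
            "OnDemandPercentageAboveBaseCapacity": on_demand_percentage,
            # Launch Spot from the pools with the most spare capacity.
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


policy = build_mixed_instances_policy(
    "spot-web-template",  # hypothetical launch template name
    ["m5.xlarge", "m5a.xlarge", "m5n.xlarge", "m5d.xlarge"],
)
```

With boto3 installed and credentials configured, this dictionary would be passed as the `MixedInstancesPolicy` argument to `boto3.client("autoscaling").create_auto_scaling_group(...)`, alongside `AutoScalingGroupName`, `MinSize`, `MaxSize`, and `VPCZoneIdentifier`.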

3. Mix Spot with On-Demand

Don't go 100% Spot. Use a mix:

  • 30% on-demand (guaranteed baseline)
  • 70% Spot (discounted capacity)

If Spot gets interrupted, your on-demand baseline keeps you running.
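It helps to work out what a mix does to the bill. The on-demand slice earns no discount, so fleet-wide savings are always lower than the Spot discount itself. A small sketch of that arithmetic follows. The function name is made up, and the discount figures are illustrative inputs, not quoted prices.

```python
def blended_savings(on_demand_share, spot_discount):
    """Fraction saved versus running the whole fleet on-demand.

    on_demand_share: fraction of capacity kept on-demand (0.3 means 30%)
    spot_discount:   Spot discount versus on-demand (0.7 means 70% cheaper)
    """
    spot_share = 1.0 - on_demand_share
    blended_cost = on_demand_share + spot_share * (1.0 - spot_discount)
    return 1.0 - blended_cost


print(round(blended_savings(0.3, 0.7), 2))  # 0.49
print(round(blended_savings(0.3, 0.9), 2))  # 0.63
```

So a 30/70 mix at a 70% Spot discount saves about 49% overall, and even a 90% discount tops out around 63%. That is still a large number, but it is worth knowing when you set expectations with finance.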

4. Handle the 2-Minute Warning

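When AWS reclaims Spot, you get two minutes' notice via:

  • Instance metadata (/latest/meta-data/spot/termination-time, or spot/instance-action)
  • EventBridge (formerly CloudWatch Events), as an "EC2 Spot Instance Interruption Warning" event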

Use this time to:

  • Drain connections gracefully
  • Checkpoint in-progress work
  • Deregister from load balancers
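One way to catch the warning from inside the instance is to poll the instance metadata service. The sketch below uses IMDSv2 (a session token, then a GET on `spot/instance-action`, which returns 404 until a notice is issued). The function names, the 5-second poll interval, and the `on_interrupt` callback are assumptions. What you do in the callback (drain, checkpoint, deregister) is up to your application.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=2):
    """Return the pending Spot action as a dict, or None if no notice has been issued."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    action_req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def watch_for_interruption(on_interrupt, fetch=fetch_instance_action,
                           poll_seconds=5, max_polls=None):
    """Poll until a notice appears, then call on_interrupt(notice) exactly once."""
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = fetch()
        if notice is not None:
            on_interrupt(notice)
            return notice
        polls += 1
        time.sleep(poll_seconds)
    return None
```

A typical use would be `watch_for_interruption(drain_and_checkpoint)` in a background thread or sidecar process, where `drain_and_checkpoint` is your own function. The `fetch` parameter exists so the loop can be tested off-instance with a fake.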

5. Use Spot Fleet or EC2 Auto Scaling

Don't manage Spot instances manually. Use:

  • Spot Fleet: For diverse, flexible capacity
  • EC2 Auto Scaling: For fleets that need scaling policies, health checks, and load balancer integration

Both handle instance replacement automatically.

Spot on Kubernetes

EKS + Spot is powerful. Here's the setup:

Dedicated Spot Node Groups

Create node groups specifically for Spot:

nodeGroups:
  - name: spot-workers
    instanceTypes: [m5.large, m5a.large, m5d.large]
    capacityType: SPOT
    desiredCapacity: 10
    labels:
      node-type: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule

Pod Tolerations

Only schedule interruptible workloads on Spot nodes:

tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
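To see why this pairing works, here is a simplified model of how Kubernetes matches tolerations against taints. It is a sketch of the documented matching rules, not the scheduler's actual code, and the function names are invented for this example.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False  # an empty effect on the toleration matches any effect
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if operator == "Exists":
        # Exists ignores the value; an empty key tolerates every taint.
        return key is None or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations, node_taints):
    """A pod may land on a node only if every blocking taint is tolerated."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in blocking)


spot_taint = {"key": "spot", "value": "true", "effect": "NoSchedule"}
spot_toleration = {"key": "spot", "operator": "Equal", "value": "true", "effect": "NoSchedule"}

print(schedulable([spot_toleration], [spot_taint]))  # True
print(schedulable([], [spot_taint]))                 # False
```

Note the direction of the guarantee: the taint keeps pods without the toleration off Spot nodes, but a toleration alone does not force a pod onto Spot. To pin interruptible workloads to Spot nodes, pair the toleration with a nodeSelector on the node-type: spot label.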

Use Karpenter

Karpenter is a Kubernetes autoscaler that's Spot-native. It automatically:

  • Provisions Spot when available
  • Falls back to on-demand when needed
  • Consolidates workloads efficiently

It's the easiest way to run Spot on Kubernetes.

Common Mistakes

Mistake #1: Single Instance Type

Requesting only one instance type limits you to one Spot pool. When that pool runs out, you get interrupted.

Fix: Request at least 10 instance types that meet your requirements.

Mistake #2: Not Handling Interruptions

Your application needs to handle termination gracefully. If you ignore the warning, you'll lose in-progress work.

Fix: Implement termination handling. Checkpoint work. Drain connections.

Mistake #3: Using Spot for Stateful Workloads

Databases, queues, and stateful services shouldn't run on Spot. The interruption handling is too complex.

Fix: Use Spot for stateless workloads. Keep stateful services on on-demand or Reserved.

Mistake #4: Going 100% Spot

All-Spot deployments can experience cascading failures during high-interruption periods.

Fix: Maintain an on-demand baseline. 20-30% on-demand is a good starting point.

Mistake #5: Not Monitoring Savings

If you don't track Spot savings, you can't prove ROI.

Fix: Use Cost Explorer to compare Spot vs. on-demand spend. Celebrate the wins.
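If you prefer to pull these numbers programmatically, the Cost Explorer API can group EC2 spend by purchase option. The sketch below builds the request for `get_cost_and_usage` and totals the response. The helper names and example dates are assumptions, and the exact group labels you get back depend on your account.

```python
def build_purchase_type_query(start, end):
    """Request EC2 compute cost grouped by purchase option.

    start/end are ISO dates (YYYY-MM-DD); the end date is exclusive.
    """
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "PURCHASE_TYPE"}],
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                "Values": ["Amazon Elastic Compute Cloud - Compute"],
            }
        },
    }


def cost_by_purchase_type(response):
    """Sum UnblendedCost per purchase option across all returned periods."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            name = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[name] = totals.get(name, 0.0) + amount
    return totals
```

With boto3 and credentials in place, the call would look like `boto3.client("ce").get_cost_and_usage(**build_purchase_type_query("2024-01-01", "2024-02-01"))`. Keep in mind this reports what you spent on each purchase option. To state savings, you still need to estimate what the same Spot hours would have cost at on-demand rates.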

Is Spot Right for You?

Answer these questions:

1. Can your workload handle being interrupted with two minutes' notice?

  • Yes → Spot is likely a good fit
  • No → Can you make it interruptible? If not, skip Spot.

2. Do you have flexible capacity requirements?

  • Yes → Spot Fleet with diversified instances
  • No → Spot might not be reliable enough

3. Is your workload stateless or checkpoint-able?

  • Yes → Spot is great
  • No → Keep it on on-demand

4. Are you running Kubernetes?

  • Yes → Spot is almost always worth it for worker nodes
  • No → Still worth exploring for batch/dev/test

Getting Started

Here's your first Spot experiment:

1. Identify a candidate workload: Dev environment, CI/CD, batch job
2. Set up a Spot-based Auto Scaling Group with 5+ instance types
3. Run it for a week and monitor interruptions
4. Measure savings in Cost Explorer
5. Expand if successful

Start small. Learn. Scale up.

The 70% savings are real. The interruption fear is overblown.
