Using Spot Instances in Production: A Practical Guide

2025-12-20 • 4 min read

Tags: Infrastructure, AWS, Spot Instances, Architecture

The Spot Instance Opportunity

Spot instances offer 70-90% discounts compared to on-demand pricing. Yet most companies don't use them in production. Why? Fear of interruptions.

This fear is overblown. With proper architecture, spot instances are perfectly viable for production workloads.

Understanding Interruptions

AWS reclaims spot instances when it needs the capacity back, typically when demand exceeds available capacity. Interruptions are:

Infrequent: Average interruption rate of 2-3% (1-2 times per month for a fleet of 100 instances)
Predictable: AWS gives a 2-minute notice before interruption
Variable: Some AZs and instance types are more stable than others
Time-dependent: Less common during off-peak hours
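
To put the interruption rate in perspective, here is a rough back-of-the-envelope sketch. It assumes the 2-3% figure is a per-instance monthly rate and that interruptions are independent, which is a simplification: in practice they tend to cluster by instance type and AZ. The function names are illustrative.

```python
def expected_interruptions(fleet_size: int, monthly_rate: float) -> float:
    """Expected number of interrupted instances per month."""
    return fleet_size * monthly_rate


def prob_at_least_one(fleet_size: int, monthly_rate: float) -> float:
    """Probability that at least one instance is interrupted in a month,
    assuming independent interruptions (a simplifying assumption)."""
    return 1.0 - (1.0 - monthly_rate) ** fleet_size


if __name__ == "__main__":
    for rate in (0.02, 0.03):
        expected = expected_interruptions(100, rate)
        probability = prob_at_least_one(100, rate)
        print(f"rate={rate:.0%} expected={expected:.1f} p(any)={probability:.2f}")
```

The takeaway is that at fleet scale some interruption in a given month is likely, so the architecture should treat it as routine rather than exceptional.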

When to Use Spot Instances

Good Candidates

Stateless web services with load balancing
Batch jobs with retries
Background processing
Dev/test/staging environments
Analytics workloads with checkpointing

Poor Candidates

Single-instance stateful services
Sensitive database write operations
Real-time trading systems
Any workload without graceful shutdown capability

Architecture for Spot Reliability

Pattern 1: Spot Fleet with On-Demand Baseline

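# AWS Auto Scaling Group configuration (mixed instances policy, abridged)
DesiredCapacity: 10
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandPercentageAboveBaseCapacity: 30
    SpotAllocationStrategy: capacity-optimized

# 7 spot instances + 3 on-demand = 10 total
# Cost: roughly 37-51% of full on-demand price (at 70-90% spot discounts)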

Benefits:

Guaranteed baseline capacity from on-demand instances
Cost savings from spot fleet
Graceful degradation if spot instances are interrupted
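
The cost of a mixed fleet can be estimated with a small calculation. The sketch below assumes all instances are the same size and that the spot discount is constant, neither of which holds exactly in practice; `blended_cost_fraction` is an illustrative name, not an AWS API.

```python
def blended_cost_fraction(on_demand_share: float, spot_discount: float) -> float:
    """Fleet cost as a fraction of an all-on-demand fleet of the same size.

    on_demand_share: fraction of instances that are on-demand (0.0 to 1.0)
    spot_discount:   spot discount relative to on-demand (0.0 to 1.0)
    """
    if not 0.0 <= on_demand_share <= 1.0 or not 0.0 <= spot_discount <= 1.0:
        raise ValueError("shares and discounts must be between 0 and 1")
    spot_share = 1.0 - on_demand_share
    # On-demand instances pay full price; spot instances pay the discounted price.
    return on_demand_share + spot_share * (1.0 - spot_discount)


if __name__ == "__main__":
    for discount in (0.70, 0.90):
        fraction = blended_cost_fraction(0.30, discount)
        print(f"spot discount {discount:.0%}: {fraction:.0%} of on-demand cost")
```

For a 30% on-demand split with 70-90% spot discounts, this works out to roughly 37-51% of the all-on-demand cost. The on-demand share sets a floor on the bill, so a larger baseline trades savings for guaranteed capacity.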

Pattern 2: Application-Level Graceful Shutdown

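import os
import signal
import threading
from flask import Flask

app = Flask(__name__)
GRACEFUL_SHUTDOWN_TIMEOUT = 30  # seconds allowed for in-flight requests

def handle_sigterm(signum, frame):
    """Handle SIGTERM, sent when the instance or pod is being shut down
    (for example by a drain triggered from the 2-minute spot notice)"""
    print("Received SIGTERM - initiating graceful shutdown", flush=True)

    # Step 1: Stop accepting new requests
    app.config['ACCEPTING_REQUESTS'] = False

    # Step 2: Give in-flight requests time to complete. A timer is used
    # because sleeping inside the handler would block the main thread.
    # Step 3: Exit. os._exit is used because sys.exit() called from the
    # timer thread would only end that thread, not the process.
    threading.Timer(GRACEFUL_SHUTDOWN_TIMEOUT, os._exit, args=(0,)).start()

signal.signal(signal.SIGTERM, handle_sigterm)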

@app.before_request
def check_accepting_requests():
    if not app.config.get('ACCEPTING_REQUESTS', True):
        return "Service shutting down", 503
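
A fixed wait is simple but usually longer than necessary. As a possible refinement, the sketch below counts in-flight requests so that shutdown can proceed as soon as the service is idle. `InFlightTracker` is a hypothetical helper, not part of Flask; wiring it into request handling is left to the application.

```python
import threading
import time


class InFlightTracker:
    """Counts in-flight requests so shutdown waits only as long as needed."""

    def __init__(self) -> None:
        self._count = 0
        self._cond = threading.Condition()

    def __enter__(self) -> "InFlightTracker":
        with self._cond:
            self._count += 1
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_idle(self, timeout: float) -> bool:
        """Block until no requests are in flight or the timeout expires.

        Returns True if the service drained in time, False otherwise.
        """
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout)


if __name__ == "__main__":
    tracker = InFlightTracker()

    def fake_request() -> None:
        with tracker:
            time.sleep(0.2)  # simulated request work

    worker = threading.Thread(target=fake_request)
    worker.start()
    time.sleep(0.05)  # let the simulated request start
    print("drained:", tracker.wait_idle(timeout=5))
    worker.join()
```

Each request handler would run inside `with tracker:`, and the shutdown path would call `wait_idle` with the shutdown timeout as an upper bound instead of always waiting the full period.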

Pattern 3: Spot Interruption Handler

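# Detect interruption notice from EC2 metadata service (IMDSv2)
import time
import requests
from datetime import datetime

IMDS = 'http://169.254.169.254/latest'

def check_spot_interruption():
    """Poll for spot interruption notice"""
    try:
        # IMDSv2 requires a session token before reading metadata
        token = requests.put(
            f'{IMDS}/api/token',
            headers={'X-aws-ec2-metadata-token-ttl-seconds': '60'},
            timeout=1
        ).text
        response = requests.get(
            f'{IMDS}/meta-data/spot/instance-action',
            headers={'X-aws-ec2-metadata-token': token},
            timeout=1
        )
    except requests.exceptions.RequestException:
        # Covers connection errors and timeouts
        return False
    # A 404 means no interruption is currently scheduled
    if response.status_code == 200:
        action = response.json()
        # Normalize the trailing "Z" for Python versions before 3.11
        action_time = datetime.fromisoformat(action['time'].replace('Z', '+00:00'))
        print(f"Spot interruption scheduled at {action_time}")
        return True
    return False

# In your monitoring loop:
while not check_spot_interruption():
    time.sleep(5)
# Drain connection pool, stop accepting requests
trigger_graceful_shutdown()  # provided by your application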
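
Once a notice arrives, it helps to know how much time is left to drain. The following is a minimal sketch that parses an instance-action payload and computes the remaining seconds. The sample payload follows the shape AWS documents for this endpoint; `seconds_until_interruption` is an illustrative name.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(payload: str, now: datetime) -> float:
    """Return the seconds remaining before the scheduled action.

    `payload` is the JSON body of the instance-action response and
    `now` must be a timezone-aware datetime.
    """
    action = json.loads(payload)
    # Normalize the trailing "Z" so fromisoformat works on older Python versions.
    action_time = datetime.fromisoformat(action["time"].replace("Z", "+00:00"))
    return max(0.0, (action_time - now).total_seconds())


if __name__ == "__main__":
    sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
    now = datetime(2017, 9, 18, 8, 20, 30, tzinfo=timezone.utc)
    remaining = seconds_until_interruption(sample, now)
    print(f"{remaining:.0f}s left to drain")
```

The remaining time can be used as the upper bound for draining, rather than assuming the full two minutes are still available when the notice is first seen.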

Kubernetes Spot Instance Strategy

# Deployment with node affinity for spot instances
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker-app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: worker
      terminationGracePeriodSeconds: 120
      containers:
      - name: worker
        image: myapp:latest
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]
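
The timing values in the manifest have to fit together: the preStop hook counts against terminationGracePeriodSeconds, and everything has to finish inside the two-minute spot notice. The sketch below is a simple sanity check under those assumptions. The 30-second application timeout is taken from the Flask example earlier, and `shutdown_budget_ok` is a hypothetical helper, not a Kubernetes API.

```python
SPOT_NOTICE_SECONDS = 120  # two-minute spot interruption notice


def shutdown_budget_ok(pre_stop_sleep: int,
                       app_shutdown_timeout: int,
                       termination_grace_period: int,
                       notice_seconds: int = SPOT_NOTICE_SECONDS) -> bool:
    """Check that preStop plus application shutdown fits inside the pod's
    grace period, and that the grace period fits inside the spot notice."""
    needed = pre_stop_sleep + app_shutdown_timeout
    return needed <= termination_grace_period <= notice_seconds


if __name__ == "__main__":
    # Values from the manifest above: preStop sleep 15s, grace period 120s.
    print(shutdown_budget_ok(pre_stop_sleep=15,
                             app_shutdown_timeout=30,
                             termination_grace_period=120))
```

In practice some of the two minutes is consumed before the pod receives its termination signal, so leaving headroom below 120 seconds is a reasonable precaution.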

Real-World Results

Case Study: Data Processing Pipeline

Before: 100 on-demand instances

Cost: $3,000/month
CPU utilization: 20%

After: 70 spot + 30 on-demand instances

Cost: $900/month (70% savings)
Same throughput and reliability

The key: Proper batch job architecture with retries and checkpointing means individual instance failures are non-events.

Case Study: Web Service

Before: 20 on-demand instances

Cost: $6,000/month
Availability: 99.9% (expected, with on-demand)

After: 10 on-demand + 20 spot instances

Cost: $2,800/month (53% savings)
Availability: 99.95% (better, due to multi-zone setup)
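
The savings percentages in both case studies follow directly from the before and after monthly costs. A short sketch of the arithmetic, using only the figures quoted above:

```python
def savings_percent(before: float, after: float) -> float:
    """Percentage saved when monthly cost drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


if __name__ == "__main__":
    print(f"Data pipeline: {savings_percent(3000, 900):.0f}% savings")
    print(f"Web service:   {savings_percent(6000, 2800):.0f}% savings")
```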

Monitoring and Alerting

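# CloudWatch metrics for spot instance health
import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_spot_metrics(terminated_count, spot_running, spot_desired):
    """Publish termination count and fleet coverage as custom metrics"""
    # Avoid division by zero when no spot capacity is requested
    coverage = (spot_running / spot_desired) * 100 if spot_desired else 100.0
    cloudwatch.put_metric_data(
        Namespace='SpotInstances',
        MetricData=[
            {
                'MetricName': 'SpotInstanceTerminations',
                'Value': terminated_count,
                'Unit': 'Count'
            },
            {
                'MetricName': 'SpotFleetCoverage',
                'Value': coverage,
                'Unit': 'Percent'
            }
        ]
    )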

Set alarms for:

Sustained spot interruption rates > 5%
Spot fleet falling below 80% desired capacity
Slow graceful shutdowns (taking longer than 30s)
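
The three alarm conditions can be expressed as a small evaluation function. This is a sketch of the logic only, not a CloudWatch alarm definition: the `SpotSample` fields are hypothetical names, and "sustained" is assumed here to mean three consecutive monitoring periods.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpotSample:
    """One monitoring period. Field names are hypothetical."""
    interruption_rate_pct: float  # interrupted / running spot instances, percent
    fleet_coverage_pct: float     # running / desired spot instances, percent
    slowest_shutdown_s: float     # longest graceful shutdown observed, seconds


def evaluate_alarms(samples: List[SpotSample],
                    sustained_periods: int = 3) -> List[str]:
    """Return the names of alarms that should fire.

    `samples` is ordered oldest to newest.
    """
    alarms: List[str] = []
    if not samples:
        return alarms
    # "Sustained" means every one of the most recent periods is above 5%.
    recent = samples[-sustained_periods:]
    if len(recent) == sustained_periods and all(
            s.interruption_rate_pct > 5 for s in recent):
        alarms.append("sustained_interruption_rate")
    latest = samples[-1]
    if latest.fleet_coverage_pct < 80:
        alarms.append("low_fleet_coverage")
    if latest.slowest_shutdown_s > 30:
        alarms.append("slow_graceful_shutdown")
    return alarms


if __name__ == "__main__":
    history = [SpotSample(6.0, 95.0, 12.0),
               SpotSample(7.5, 90.0, 14.0),
               SpotSample(8.0, 75.0, 41.0)]
    print(evaluate_alarms(history))
```

Requiring several consecutive periods for the interruption-rate alarm avoids paging on a single burst, while coverage and shutdown time are checked against the latest period only.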

Cost-Benefit Analysis

Costs of Using Spot

Engineering complexity for graceful shutdown
Potential for interruption-induced failures (rare with proper design)
Monitoring overhead

Benefits of Using Spot

70-90% cost reduction on the spot portion of the fleet
Ability to scale to much larger fleet at same cost
Better resource utilization across organization

Best Practices

Start with batch jobs: Easiest to implement, highest ROI
Use capacity-optimized allocation: Better stability than lowest-price
Diversify instance types and AZs: Reduce correlation of interruptions
Implement graceful shutdown: 2-minute notice window is often enough
Monitor carefully: Track interruption rates and patterns
Test interruptions: Chaos engineer your spot fleet regularly

Conclusion

Spot instances are not just for non-critical workloads anymore. With proper architecture, they're a powerful tool for cost reduction that shouldn't be left on the table. Start small, prove it works, then scale aggressively.