Using Spot Instances in Production: A Practical Guide

The Spot Instance Opportunity

Spot instances offer 70-90% discounts compared to on-demand pricing. Yet many companies don't use them in production. Why? Fear of interruptions.

This fear is overblown. With proper architecture, spot instances are viable for many production workloads.

Understanding Interruptions

AWS reclaims spot instances when it needs the capacity back, typically because demand exceeds available capacity. Interruptions have a few characteristics worth knowing:

  • Infrequent: Average interruption rate of 2-3% per month (roughly 2-3 interruptions per month for a fleet of 100 instances)
  • Predictable: AWS gives a 2-minute notice before interruption
  • Geographic: Some AZs and instance types are more stable than others
  • Time-dependent: Interruptions are less common during off-peak hours

When to Use Spot Instances

Good Candidates

  • Stateless web services with load balancing
  • Batch jobs with retries
  • Background processing
  • Dev/test/staging environments
  • Analytics workloads with checkpointing

Poor Candidates

  • Single-instance stateful services
  • Sensitive database write operations
  • Real-time trading systems
  • Any workload without graceful shutdown capability

Architecture for Spot Reliability

Pattern 1: Spot Fleet with On-Demand Baseline

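# AWS Auto Scaling Group configuration (CloudFormation-style, abbreviated)
DesiredCapacity: 10
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 0
    OnDemandPercentageAboveBaseCapacity: 30
    SpotAllocationStrategy: capacity-optimized

# 7 spot instances + 3 on-demand = 10 total
# Cost: roughly 37-51% of full on-demand price (at 70-90% spot discounts)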

Benefits:

  • Guaranteed baseline capacity from on-demand instances
  • Cost savings from spot fleet
  • Graceful degradation if spot instances are interrupted
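
The cost of a mixed fleet follows directly from the on-demand share and the spot discount. The sketch below works through that arithmetic for the 30% on-demand split above; the function name `blended_cost_fraction` is illustrative, and it assumes all instances share the same on-demand price.

```python
def blended_cost_fraction(on_demand_share: float, spot_discount: float) -> float:
    """Return fleet cost as a fraction of the all-on-demand price.

    on_demand_share: fraction of instances that are on-demand (0.0-1.0)
    spot_discount:   spot discount versus on-demand pricing (0.0-1.0)
    """
    if not 0.0 <= on_demand_share <= 1.0:
        raise ValueError("on_demand_share must be between 0 and 1")
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    spot_share = 1.0 - on_demand_share
    # On-demand instances pay full price; spot instances pay the discounted price.
    return on_demand_share + spot_share * (1.0 - spot_discount)


if __name__ == "__main__":
    for discount in (0.70, 0.90):
        fraction = blended_cost_fraction(0.30, discount)
        print(f"spot discount {discount:.0%}: fleet costs {fraction:.0%} of on-demand")
```

With 30% on-demand capacity, the fleet lands between about 37% and 51% of the full on-demand price, depending on where the spot discount falls in the 70-90% range.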

Pattern 2: Application-Level Graceful Shutdown

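import signal
import sys
import time
from flask import Flask

app = Flask(__name__)
GRACEFUL_SHUTDOWN_TIMEOUT = 30  # seconds allowed for in-flight requests

def handle_sigterm(signum, frame):
    """Handle SIGTERM, sent when the instance or container is shutting down.

    The spot notice itself arrives via instance metadata (see Pattern 3);
    SIGTERM follows once the OS or orchestrator starts stopping the process.
    """
    print("Received SIGTERM - initiating graceful shutdown")

    # Step 1: Stop accepting new requests
    app.config['ACCEPTING_REQUESTS'] = False

    # Step 2: Wait for in-flight requests to complete
    # (a fixed sleep assumes every request finishes within the timeout)
    time.sleep(GRACEFUL_SHUTDOWN_TIMEOUT)

    # Step 3: Exit cleanly
    sys.exit(0)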

signal.signal(signal.SIGTERM, handle_sigterm)

@app.before_request
def check_accepting_requests():
    if not app.config.get('ACCEPTING_REQUESTS', True):
        return "Service shutting down", 503
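
A fixed sleep always waits the full timeout, even when no requests are in flight. One possible refinement, sketched below, is to count in-flight requests and exit as soon as the count reaches zero. The `InFlightTracker` class is hypothetical and not part of Flask; in a Flask app it could be entered in a `before_request` hook and exited in a `teardown_request` hook, or wrapped around the handler body.

```python
import threading
import time


class InFlightTracker:
    """Counts in-flight requests and lets a shutdown path wait for them to drain."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def __enter__(self):
        with self._cond:
            self._count += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()
        return False  # never swallow exceptions raised by the request

    def wait_for_drain(self, timeout: float) -> bool:
        """Block until no requests are in flight or the timeout expires.

        Returns True if the tracker drained, False if it timed out.
        """
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout)


if __name__ == "__main__":
    tracker = InFlightTracker()

    def fake_request():
        with tracker:
            time.sleep(0.2)  # simulated request work

    worker = threading.Thread(target=fake_request)
    worker.start()
    time.sleep(0.05)  # give the simulated request time to start
    print("drained:", tracker.wait_for_drain(timeout=5))
    worker.join()
```

The shutdown handler would then call `wait_for_drain(GRACEFUL_SHUTDOWN_TIMEOUT)` in place of the sleep, keeping the timeout as an upper bound.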

Pattern 3: Spot Interruption Handler

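# Detect interruption notice from EC2 metadata service (IMDSv2)
import requests
from datetime import datetime

IMDS = 'http://169.254.169.254/latest'

def check_spot_interruption():
    """Poll for spot interruption notice; return True if one is scheduled"""
    try:
        # IMDSv2 requires a session token on every metadata request
        token = requests.put(
            f'{IMDS}/api/token',
            headers={'X-aws-ec2-metadata-token-ttl-seconds': '60'},
            timeout=1
        ).text
        response = requests.get(
            f'{IMDS}/meta-data/spot/instance-action',
            headers={'X-aws-ec2-metadata-token': token},
            timeout=1
        )
        # A 404 response means no interruption is scheduled
        if response.status_code == 200:
            action = response.json()
            # The timestamp ends in 'Z', which fromisoformat rejects before Python 3.11
            action_time = datetime.fromisoformat(action['time'].replace('Z', '+00:00'))
            print(f"Spot interruption scheduled at {action_time}")
            return True
    except requests.exceptions.RequestException:
        # Metadata service unreachable or timed out (e.g. not running on EC2)
        pass
    return False

# In your monitoring loop (poll every few seconds):
if check_spot_interruption():
    # Drain connection pool, stop accepting requests.
    # trigger_graceful_shutdown() is your application's own drain routine.
    trigger_graceful_shutdown()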
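
The monitoring loop itself can be sketched as a small helper that polls until an interruption shows up. The function `watch_for_interruption` and its parameters are illustrative names, not an AWS or library API. The check and the shutdown action are passed in as callables, so the loop can be exercised without an EC2 instance.

```python
import time
from typing import Callable, Optional


def watch_for_interruption(
    check: Callable[[], bool],
    on_interrupt: Callable[[], None],
    interval_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() repeatedly; run on_interrupt() once when it returns True.

    Returns True if an interruption was detected, False if max_polls ran out.
    max_polls=None polls forever.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if check():
            on_interrupt()
            return True
        polls += 1
        sleep(interval_seconds)
    return False


if __name__ == "__main__":
    # Simulated metadata responses: two quiet polls, then a notice.
    answers = iter([False, False, True])
    detected = watch_for_interruption(
        check=lambda: next(answers),
        on_interrupt=lambda: print("draining..."),
        interval_seconds=0.01,
    )
    print("detected:", detected)
```

In a real service this would typically run in a background thread, with `check_spot_interruption` as the check. The poll interval should be small relative to the 2-minute notice so that most of the window remains for draining.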

Kubernetes Spot Instance Strategy

# Deployment with node affinity for spot instances
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker-app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: worker
      terminationGracePeriodSeconds: 120
      containers:
      - name: worker
        image: myapp:latest
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]

Real-World Results

Case Study: Data Processing Pipeline

Before: 100 on-demand instances

  • Cost: $3,000/month
  • CPU utilization: 20%

After: 70 spot + 30 on-demand instances

  • Cost: $900/month (70% savings)
  • Same throughput and reliability

The key: Proper batch job architecture with retries and checkpointing means individual instance failures are non-events.
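
To make the checkpointing idea concrete, here is a minimal sketch of a batch loop that records its progress after every item and resumes from that point on restart. The function names and the JSON checkpoint layout are assumptions for illustration, not the pipeline from the case study. It also assumes the checkpoint path lives on storage that survives the instance, such as a network file system.

```python
import json
import os
import tempfile
from typing import Callable, Sequence


def load_checkpoint(path: str) -> int:
    """Return the index of the next unprocessed item (0 if no checkpoint exists)."""
    try:
        with open(path) as f:
            return int(json.load(f)["next_index"])
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write the checkpoint atomically so an interruption cannot leave it half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def process_with_checkpoints(items: Sequence, handle: Callable, path: str) -> int:
    """Process items in order, resuming after the last checkpoint.

    Returns the number of items handled in this run.
    """
    handled = 0
    for index in range(load_checkpoint(path), len(items)):
        handle(items[index])
        save_checkpoint(path, index + 1)
        handled += 1
    return handled
```

If the instance disappears between `handle` and `save_checkpoint`, the item is processed again on the next run, so the handler should be idempotent. That is the usual trade-off of at-least-once processing.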

Case Study: Web Service

Before: 20 on-demand instances

  • Cost: $6,000/month
  • Availability: 99.9% (expected, with on-demand)

After: 10 on-demand + 20 spot instances

  • Cost: $2,800/month (53% savings)
  • Availability: 99.95% (better, due to multi-zone setup)

Monitoring and Alerting

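# CloudWatch metrics for spot instance health
import boto3

cloudwatch = boto3.client('cloudwatch')

# Placeholder values; in practice, collect these from your fleet
terminated_count = 2   # spot terminations in this reporting period
spot_running = 68      # spot instances currently running
spot_desired = 70      # spot instances requested

# Guard against division by zero when no spot capacity is requested
coverage = (spot_running / spot_desired) * 100 if spot_desired else 100.0

cloudwatch.put_metric_data(
    Namespace='SpotInstances',
    MetricData=[
        {
            'MetricName': 'SpotInstanceTerminations',
            'Value': terminated_count,
            'Unit': 'Count'
        },
        {
            'MetricName': 'SpotFleetCoverage',
            'Value': coverage,
            'Unit': 'Percent'
        }
    ]
)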

Set alarms for:

  • Sustained spot interruption rates > 5%
  • Spot fleet falling below 80% desired capacity
  • Slow graceful shutdowns (taking longer than 30s)
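
The thresholds above can be expressed as a small check, which is handy for unit-testing alerting logic before wiring it into CloudWatch. This is a sketch: the function and the alarm names are hypothetical, and it evaluates a single measurement window, so the "sustained" part would still come from the alarm's evaluation periods.

```python
from typing import List


def evaluate_spot_alarms(
    interrupted: int,
    fleet_size: int,
    spot_running: int,
    spot_desired: int,
    shutdown_seconds: float,
) -> List[str]:
    """Return the names of the alarm conditions breached in this window."""
    breached = []
    # Interruption rate above 5% of the fleet
    if fleet_size > 0 and interrupted / fleet_size > 0.05:
        breached.append("HighInterruptionRate")
    # Running spot capacity below 80% of desired
    if spot_desired > 0 and spot_running / spot_desired < 0.80:
        breached.append("LowSpotCoverage")
    # Graceful shutdown taking longer than 30 seconds
    if shutdown_seconds > 30:
        breached.append("SlowGracefulShutdown")
    return breached


if __name__ == "__main__":
    print(evaluate_spot_alarms(
        interrupted=6, fleet_size=100,
        spot_running=50, spot_desired=70,
        shutdown_seconds=45,
    ))
```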

Cost-Benefit Analysis

Costs of Using Spot

  • Engineering complexity for graceful shutdown
  • Potential for interruption-induced failures (rare with proper design)
  • Monitoring overhead

Benefits of Using Spot

  • 70-90% cost reduction on the spot portion of the fleet
  • Ability to scale to much larger fleet at same cost
  • Better resource utilization across organization

Best Practices

  1. Start with batch jobs: Easiest to implement, highest ROI
  2. Use capacity-optimized allocation: Better stability than lowest-price
  3. Diversify instance types and AZs: Reduce correlation of interruptions
  4. Implement graceful shutdown: 2-minute notice window is often enough
  5. Monitor carefully: Track interruption rates and patterns
  6. Test interruptions: Chaos engineer your spot fleet regularly

Conclusion

Spot instances are not just for non-critical workloads anymore. With proper architecture, they're a powerful tool for cost reduction that shouldn't be left on the table. Start small, prove it works, then scale aggressively.