Using Spot Instances in Production: A Practical Guide
The Spot Instance Opportunity
Spot instances offer 70-90% discounts compared to on-demand pricing. Yet most companies don't use them in production. Why? Fear of interruptions.
This fear is overblown. With proper architecture, spot instances are perfectly viable for production workloads.
Understanding Interruptions
AWS reclaims spot instances when it needs the capacity back, typically when demand in that capacity pool exceeds what is available. Key characteristics:
- Frequency: an average interruption rate of 2-3% per month, or roughly 2-3 interruptions per month for a fleet of 100 instances
- Advance warning: AWS gives a 2-minute notice before interruption
- Pool-dependent: some AZs and instance types are more stable than others
- Time-dependent: interruptions are less common during off-peak hours
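To put the interruption rate in fleet terms, here is a rough back-of-the-envelope sketch. It assumes the 2-3% figure is a per-instance monthly rate and that interruptions are independent, which is optimistic because instances in the same capacity pool tend to be reclaimed together. The function names are illustrative only.

```python
def expected_interruptions(fleet_size: int, monthly_rate: float) -> float:
    """Expected number of interruptions per month across the whole fleet."""
    return fleet_size * monthly_rate


def prob_at_least_one(fleet_size: int, monthly_rate: float) -> float:
    """Chance that at least one instance is interrupted in a month.

    Assumes independent interruptions (an assumption, not an AWS guarantee).
    """
    return 1.0 - (1.0 - monthly_rate) ** fleet_size


for rate in (0.02, 0.03):
    count = expected_interruptions(100, rate)
    chance = prob_at_least_one(100, rate)
    print(f"rate {rate:.0%}: ~{count:.1f} interruptions/month, "
          f"P(at least one) = {chance:.2f}")
```

The takeaway is that at fleet scale an interruption in any given month is close to certain, so the design question is how cheaply each one is absorbed, not whether it happens.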
When to Use Spot Instances
Good Candidates
- Stateless web services with load balancing
- Batch jobs with retries
- Background processing
- Dev/test/staging environments
- Analytics workloads with checkpointing
Poor Candidates
- Single-instance stateful services
- Sensitive database write operations
- Real-time trading systems
- Any workload without graceful shutdown capability
Architecture for Spot Reliability
Pattern 1: Spot Fleet with On-Demand Baseline
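# AWS Auto Scaling group configuration (abridged).
# The two distribution settings below belong under
# MixedInstancesPolicy.InstancesDistribution; OnDemandBaseCapacity is left at 0.
DesiredCapacity: 10
OnDemandPercentageAboveBaseCapacity: 30
SpotAllocationStrategy: capacity-optimized
# Result: 3 on-demand + 7 spot instances = 10 total
# Cost: roughly 37-51% of full on-demand price, assuming a 70-90% spot discount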
Benefits:
- Guaranteed baseline capacity from on-demand instances
- Cost savings from spot fleet
- Graceful degradation if spot instances are interrupted
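The cost of a mixed fleet is a weighted average of the two prices. The sketch below works that arithmetic through for a 30% on-demand share, assuming the 70-90% spot discount quoted earlier and identical instance types on both sides. The function name is hypothetical.

```python
def blended_cost_fraction(on_demand_share: float, spot_discount: float) -> float:
    """Fleet cost as a fraction of an all-on-demand fleet of the same size."""
    if not 0.0 <= on_demand_share <= 1.0:
        raise ValueError("on_demand_share must be between 0 and 1")
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    spot_share = 1.0 - on_demand_share
    # On-demand instances pay full price; spot instances pay the discounted price.
    return on_demand_share + spot_share * (1.0 - spot_discount)


for discount in (0.70, 0.90):
    fraction = blended_cost_fraction(0.30, discount)
    print(f"spot discount {discount:.0%}: fleet costs {fraction:.0%} of full on-demand")
```

The on-demand baseline sets a floor on the bill: with 30% of the fleet on-demand, the blended cost cannot drop below 30% of the original no matter how deep the spot discount is.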
Pattern 2: Application-Level Graceful Shutdown
import signal
import sys
import time

from flask import Flask

app = Flask(__name__)
GRACEFUL_SHUTDOWN_TIMEOUT = 30  # seconds to let in-flight requests finish


def handle_sigterm(signum, frame):
    """Drain and exit on SIGTERM.

    SIGTERM arrives when the OS or an orchestrator stops the process. The
    2-minute spot notice itself is delivered through instance metadata or
    EventBridge (see Pattern 3), so something must turn it into a SIGTERM.
    """
    print("Received SIGTERM - initiating graceful shutdown")
    # Step 1: Stop accepting new requests
    app.config['ACCEPTING_REQUESTS'] = False
    # Step 2: Give in-flight requests time to complete
    time.sleep(GRACEFUL_SHUTDOWN_TIMEOUT)
    # Step 3: Exit cleanly
    sys.exit(0)


signal.signal(signal.SIGTERM, handle_sigterm)


@app.before_request
def check_accepting_requests():
    """Reject new requests with 503 once shutdown has started."""
    if not app.config.get('ACCEPTING_REQUESTS', True):
        return "Service shutting down", 503
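A fixed sleep always waits the full timeout, even when nothing is in flight. One possible refinement, sketched below under the assumption that request handlers can be wrapped in a context manager, is to count in-flight requests and wait only until the count reaches zero or a deadline passes. `InFlightTracker` and `fake_request` are hypothetical names, not part of Flask.

```python
import threading
import time


class InFlightTracker:
    """Counts requests that are currently being served."""

    def __init__(self) -> None:
        self._count = 0
        self._cond = threading.Condition()

    def __enter__(self):
        with self._cond:
            self._count += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._cond:
            self._count -= 1
            self._cond.notify_all()
        return False  # never swallow exceptions from the request

    def wait_until_idle(self, timeout: float) -> bool:
        """Block until nothing is in flight or the timeout expires.

        Returns True if the service drained in time, False otherwise.
        """
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout)


tracker = InFlightTracker()


def fake_request(duration: float) -> None:
    """Stand-in for a request handler that takes `duration` seconds."""
    with tracker:
        time.sleep(duration)


worker = threading.Thread(target=fake_request, args=(0.2,))
worker.start()
time.sleep(0.05)  # give the simulated request time to start
drained = tracker.wait_until_idle(timeout=5.0)
worker.join()
print("drained:", drained)
```

In a real service the wait would replace the fixed `time.sleep` in the shutdown path, with the timeout kept comfortably inside the 2-minute window.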
Pattern 3: Spot Interruption Handler
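# Detect the interruption notice from the EC2 instance metadata service
from datetime import datetime

import requests

INSTANCE_ACTION_URL = 'http://169.254.169.254/latest/meta-data/spot/instance-action'


def check_spot_interruption():
    """Poll for a spot interruption notice; return True if one is pending."""
    try:
        # Instances that enforce IMDSv2 also need a session token header here.
        response = requests.get(INSTANCE_ACTION_URL, timeout=0.1)
    except requests.exceptions.RequestException:
        # Timeouts and connection errors: treat as "no notice" and poll again.
        return False
    if response.status_code != 200:
        # A 404 means no interruption is scheduled.
        return False
    action = response.json()
    # The timestamp ends in "Z", which fromisoformat() rejects before Python 3.11.
    action_time = datetime.fromisoformat(action['time'].replace('Z', '+00:00'))
    print(f"Spot interruption scheduled at {action_time}")
    return True


# In your monitoring loop (trigger_graceful_shutdown is defined elsewhere):
if check_spot_interruption():
    # Drain connection pool, stop accepting requests
    trigger_graceful_shutdown()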
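Once a notice is detected, it helps to know how much of the 2-minute window is actually left, since polling delay eats into it. The sketch below parses a payload in the shape the metadata endpoint returns (an `action` and a UTC `time`) and derives a drain budget. The sample payload, the function names, and the 15-second safety margin are assumptions for illustration.

```python
from datetime import datetime, timezone


def seconds_until_action(payload: dict, now: datetime) -> float:
    """Seconds between `now` and the scheduled interruption time."""
    # Normalise the trailing "Z" so this also parses on Python < 3.11.
    action_time = datetime.fromisoformat(payload["time"].replace("Z", "+00:00"))
    return (action_time - now).total_seconds()


def drain_budget(payload: dict, now: datetime, safety_margin: float = 15.0) -> float:
    """Time available for draining, leaving a margin for process exit."""
    return max(0.0, seconds_until_action(payload, now) - safety_margin)


# Illustrative payload and clock; real code would pass datetime.now(timezone.utc).
sample = {"action": "terminate", "time": "2017-09-18T08:22:00Z"}
now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)

print("seconds left:", seconds_until_action(sample, now))
print("drain budget:", drain_budget(sample, now))
```

Passing `now` in explicitly keeps the calculation testable without mocking the clock.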
Kubernetes Spot Instance Strategy
# Deployment with node affinity for spot instances
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker-app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: worker
      terminationGracePeriodSeconds: 120
      containers:
      - name: worker
        image: myapp:latest
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]
Real-World Results
Case Study: Data Processing Pipeline
Before: 100 on-demand instances
- Cost: $3,000/month
- CPU utilization: 20%
After: 70 spot + 30 on-demand instances
- Cost: $900/month (70% savings)
- Same throughput and reliability
The key: Proper batch job architecture with retries and checkpointing means individual instance failures are non-events.
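As a minimal sketch of what checkpointing can look like, the example below records the index of the next unprocessed item after each step, so a replacement instance resumes where the interrupted one stopped. It assumes items are processed in order and that processing is safe to repeat for the one item that may be in flight. The checkpoint is written to a local temp directory only for demonstration; on real spot instances it would need to live in durable shared storage. All names are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the index of the next item to process (0 if no checkpoint)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def run_batch(items, path, process, stop_after=None) -> bool:
    """Process items from the last checkpoint; return True when all are done.

    `stop_after` simulates an interruption after that many items.
    """
    done = 0
    for index in range(load_checkpoint(path), len(items)):
        if stop_after is not None and done >= stop_after:
            return False  # simulated interruption
        process(items[index])
        save_checkpoint(path, index + 1)
        done += 1
    return True


processed = []
items = list(range(10))
with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, "checkpoint.json")
    first_run = run_batch(items, checkpoint_path, processed.append, stop_after=4)
    second_run = run_batch(items, checkpoint_path, processed.append)

print("first run finished:", first_run)
print("second run finished:", second_run)
print("items processed:", processed)
```

Checkpointing after every item is the simplest policy; a real pipeline would likely checkpoint per chunk to limit write overhead.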
Case Study: Web Service
Before: 20 on-demand instances
- Cost: $6,000/month
- Availability: 99.9% (expected, with on-demand)
After: 10 on-demand + 20 spot instances
- Cost: $2,800/month (53% savings)
- Availability: 99.95% (better, due to multi-zone setup)
Monitoring and Alerting
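# CloudWatch metrics for spot instance health
import boto3

cloudwatch = boto3.client('cloudwatch')


def publish_spot_metrics(terminated_count, spot_running, spot_desired):
    """Publish termination count and spot coverage as custom metrics."""
    # Guard against division by zero when no spot capacity is requested
    coverage = (spot_running / spot_desired) * 100 if spot_desired else 100.0
    cloudwatch.put_metric_data(
        Namespace='SpotInstances',
        MetricData=[
            {
                'MetricName': 'SpotInstanceTerminations',
                'Value': terminated_count,
                'Unit': 'Count'
            },
            {
                'MetricName': 'SpotFleetCoverage',
                'Value': coverage,
                'Unit': 'Percent'
            }
        ]
    )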
Set alarms for:
- Sustained spot interruption rates > 5%
- Spot fleet falling below 80% desired capacity
- Slow graceful shutdowns (timeout > 30s)
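The capacity alarm above could be expressed roughly as follows. This sketch only builds the keyword arguments for CloudWatch's `put_metric_alarm` and takes the client as a parameter, so it can be inspected without AWS credentials. The alarm name, the 60-second period, and the five evaluation periods are assumptions; the namespace and metric name match the metrics published earlier.

```python
def coverage_alarm_params(threshold_percent: float = 80.0) -> dict:
    """Keyword arguments for an alarm on low spot fleet coverage."""
    return {
        "AlarmName": "spot-fleet-coverage-low",  # hypothetical name
        "Namespace": "SpotInstances",
        "MetricName": "SpotFleetCoverage",
        "Statistic": "Average",
        "Period": 60,             # seconds per datapoint (assumed)
        "EvaluationPeriods": 5,   # sustained for 5 minutes (assumed)
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
        # Treat a silent metric as a problem rather than as healthy.
        "TreatMissingData": "breaching",
    }


def create_coverage_alarm(cloudwatch_client, threshold_percent: float = 80.0) -> None:
    """Create or update the alarm using a boto3 CloudWatch client."""
    cloudwatch_client.put_metric_alarm(**coverage_alarm_params(threshold_percent))


params = coverage_alarm_params()
print(params["MetricName"], params["ComparisonOperator"], params["Threshold"])
```

With real credentials, the call would be `create_coverage_alarm(boto3.client('cloudwatch'))`. Requiring several consecutive periods keeps a single brief dip during instance replacement from paging anyone.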
Cost-Benefit Analysis
Costs of Using Spot
- Engineering complexity for graceful shutdown
- Potential for interruption-induced failures (rare with proper design)
- Monitoring overhead
Benefits of Using Spot
- 70-90% cost reduction on the spot portion of the fleet
- Ability to scale to a much larger fleet at the same cost
- Better resource utilization across the organization
Best Practices
- Start with batch jobs: Easiest to implement, highest ROI
- Use capacity-optimized allocation: Better stability than lowest-price
- Diversify instance types and AZs: Reduce correlation of interruptions
- Implement graceful shutdown: The 2-minute notice window is often enough to drain in-flight work
- Monitor carefully: Track interruption rates and patterns
- Test interruptions: Chaos engineer your spot fleet regularly
Conclusion
Spot instances are not just for non-critical workloads anymore. With proper architecture, they're a powerful tool for cost reduction that shouldn't be left on the table. Start small, prove it works, then scale aggressively.