## Scaling Overview

Production LLM API gateways require sophisticated scaling strategies to handle variable load while maintaining performance and cost efficiency. This guide covers horizontal scaling, auto-scaling, capacity planning, and performance optimization.
### Horizontal Scaling

- Add more gateway instances
- Stateless architecture
- Load balancer distribution
- Regional deployment

### Vertical Scaling

- Increase per-instance resources
- More CPU/memory
- Limited scalability ceiling
- Higher per-instance cost
## Horizontal Scaling Architecture

### Kubernetes Horizontal Pod Autoscaler

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
```
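At its core, the HPA scales proportionally to metric pressure, then clamps the result to the configured replica bounds. A minimal sketch of that rule in Python, simplified from the formula the Kubernetes documentation gives (this is a model of the decision, not the controller's actual code):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 3,
                     max_replicas: int = 50) -> int:
    """Simplified HPA rule: scale replicas proportionally to how far the
    observed metric is from its target, then clamp to the replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# CPU at 90% average utilization against the 70% target, 10 replicas running:
print(desired_replicas(10, 90, 70))  # 13
```

The same rule is evaluated per metric; with multiple metrics configured (as above), the HPA takes the largest proposed replica count.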
## Auto-Scaling Strategies

### Predictive Scaling

Use historical data to predict traffic patterns and pre-scale before demand spikes:

```yaml
predictive_scaling:
  enabled: true
  lookback_period: 7d
  prediction_window: 1h
  patterns:
    - name: daily_peak
      schedule: "0 9 * * 1-5"
      min_replicas: 10
    - name: weekly_low
      schedule: "0 0 * * 0,6"
      min_replicas: 2
```
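The config above is declarative; the underlying idea can be sketched in a few lines. All names here are illustrative, and a real implementation would use a proper forecasting model rather than a bucketed average over the lookback period:

```python
import math
from collections import defaultdict
from statistics import mean

def predict_replicas(history, weekday, hour, rps_per_replica=1000,
                     min_replicas=3, headroom=1.2):
    """Predict replicas for a (weekday, hour) slot from historical traffic.

    `history` is a list of (weekday, hour, rps) samples covering the
    lookback period (e.g. the last 7 days)."""
    buckets = defaultdict(list)
    for wd, hr, rps in history:
        buckets[(wd, hr)].append(rps)
    samples = buckets.get((weekday, hour))
    if not samples:
        return min_replicas  # no data for this slot; fall back to the floor
    predicted_rps = mean(samples) * headroom  # pre-scale above the average
    return max(min_replicas, math.ceil(predicted_rps / rps_per_replica))
```

For the Monday-9am slot, two historical samples of 8,000 and 10,000 RPS predict 9,000 × 1.2 = 10,800 RPS, i.e. 11 replicas at 1,000 RPS each.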
### Event-Driven Scaling
Scale based on business events and external triggers:
- Marketing campaigns: Pre-scale before product launches or promotions
- External integrations: Scale when partner traffic increases
- Cost thresholds: Scale down during low-revenue periods
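One simple way to express this is to let each active business event raise the replica floor, and take the highest floor in effect. The event shape below is hypothetical:

```python
def effective_min_replicas(base_min: int, active_events: list[dict]) -> int:
    """Raise the replica floor while any business event is active.

    Each event is an illustrative dict: {"name": ..., "min_replicas": ...}.
    The highest requested floor wins; with no events, the base floor holds."""
    floors = [event["min_replicas"] for event in active_events]
    return max([base_min, *floors])

launch = [{"name": "product_launch", "min_replicas": 20}]
print(effective_min_replicas(3, launch))  # 20
```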
### Scaling Cooldown Period

Always configure cooldown periods (typically 3-5 minutes) to prevent thrashing. Rapid scaling up and down causes instability and increases costs.
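A cooldown can be sketched as a small guard that scaling logic consults before acting (illustrative only; real autoscalers build this in, as the HPA's `stabilizationWindowSeconds` does). The injectable clock is there for testability:

```python
import time

class CooldownGuard:
    """Reject scaling decisions made within `cooldown` seconds of the
    last applied change, to prevent thrashing."""

    def __init__(self, cooldown: float = 240.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock
        self._last_change = float("-inf")  # allow the first change

    def allow(self) -> bool:
        """Return True (and start a new cooldown) if enough time has
        passed since the last change; otherwise False."""
        now = self.clock()
        if now - self._last_change < self.cooldown:
            return False
        self._last_change = now
        return True
```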
## Capacity Planning

### Resource Requirements Calculator

| Traffic Level | Total CPU | Total Memory | Instances |
|---|---|---|---|
| 1K req/s | 2 cores | 4 GB | 3 |
| 10K req/s | 16 cores | 32 GB | 8 |
| 100K req/s | 128 cores | 256 GB | 50+ |
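The table can be generalized into a rough calculator. The per-instance throughput and resource figures below are assumptions loosely consistent with the table, not benchmarks; calibrate them with load tests against your own gateway:

```python
import math

def plan_capacity(peak_rps: int, rps_per_instance: int = 1500,
                  cores_per_instance: int = 2, gb_per_instance: int = 4,
                  min_instances: int = 3, headroom: float = 1.3):
    """Rough capacity plan: size for peak RPS plus headroom, never
    dropping below a minimum fleet for availability."""
    instances = max(min_instances,
                    math.ceil(peak_rps * headroom / rps_per_instance))
    return {
        "instances": instances,
        "total_cores": instances * cores_per_instance,
        "total_memory_gb": instances * gb_per_instance,
    }

print(plan_capacity(10_000))
```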
### Cost Optimization

```yaml
# Right-sizing recommendations
instance_sizing:
  small:
    cpu: 1
    memory: 2Gi
    max_rps: 500
    cost_per_hour: "$0.05"
  medium:
    cpu: 4
    memory: 8Gi
    max_rps: 3000
    cost_per_hour: "$0.20"
  large:
    cpu: 16
    memory: 32Gi
    max_rps: 15000
    cost_per_hour: "$0.80"
```
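Given a size catalog like the one above, picking the cheapest fleet for a target throughput is a one-pass comparison. The catalog mirrors the illustrative config values, not real cloud pricing:

```python
import math

# Mirrors the instance_sizing config above (illustrative prices).
SIZES = {
    "small":  {"max_rps": 500,   "cost_per_hour": 0.05},
    "medium": {"max_rps": 3000,  "cost_per_hour": 0.20},
    "large":  {"max_rps": 15000, "cost_per_hour": 0.80},
}

def cheapest_fleet(target_rps: int) -> tuple[str, int, float]:
    """Return (size, instance_count, hourly_cost) for the size class that
    serves `target_rps` at the lowest total hourly cost."""
    best = None
    for name, spec in SIZES.items():
        count = math.ceil(target_rps / spec["max_rps"])
        cost = count * spec["cost_per_hour"]
        if best is None or cost < best[2]:
            best = (name, count, cost)
    return best
```

For 6,000 RPS, two mediums ($0.40/h) beat twelve smalls ($0.60/h) and one large ($0.80/h), so larger is not automatically cheaper.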
## Performance Optimization

### Connection Pooling

```yaml
connection_pool:
  max_connections: 1000
  max_per_host: 100
  idle_timeout: 60s
  keep_alive: true

# Optimize for LLM API characteristics
timeouts:
  connect: 5s
  request: 120s  # LLM responses can be slow
  response_header: 30s
```
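The interaction between the global cap and the per-host cap can be sketched with semaphores. This is a toy model of pool admission, not a real HTTP client:

```python
import threading
from collections import defaultdict
from contextlib import contextmanager

class ConnectionPool:
    """Toy pool enforcing a global cap and a per-host cap, mirroring
    max_connections / max_per_host in the config above."""

    def __init__(self, max_connections: int = 1000, max_per_host: int = 100):
        self._global = threading.BoundedSemaphore(max_connections)
        self._per_host = defaultdict(
            lambda: threading.BoundedSemaphore(max_per_host))

    @contextmanager
    def connection(self, host: str):
        # Acquire the global slot first, then the per-host slot; release
        # the global slot again if the host is saturated.
        if not self._global.acquire(blocking=False):
            raise RuntimeError("global connection limit reached")
        if not self._per_host[host].acquire(blocking=False):
            self._global.release()
            raise RuntimeError(f"per-host limit reached for {host}")
        try:
            yield host  # a real pool would hand back a socket here
        finally:
            self._per_host[host].release()
            self._global.release()
```

Failing fast on saturation (rather than queueing indefinitely) keeps a slow upstream from tying up every gateway worker.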
### Load Balancing Strategies
- Round Robin: Simple, even distribution across instances
- Least Connections: Route to instances with fewest active requests
- Weighted: Send more traffic to larger instances
- Geographic: Route to nearest region for latency optimization
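Least connections, for instance, reduces to picking the instance with the fewest in-flight requests. A minimal sketch (instance names are illustrative):

```python
def pick_least_connections(active: dict[str, int]) -> str:
    """Least-connections routing: choose the instance with the fewest
    in-flight requests, breaking ties alphabetically for determinism."""
    return min(sorted(active), key=lambda instance: active[instance])

active = {"gw-1": 12, "gw-2": 4, "gw-3": 9}
print(pick_least_connections(active))  # gw-2
```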
### Connection Limits

Monitor connection pool utilization closely; exhausting the pool causes cascading failures upstream. Set operational limits at roughly 80% of the theoretical maximum so the pool degrades gracefully instead of failing at the edge.
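The 80% rule and a utilization alert can be expressed directly (the 90% alert threshold below is an assumption, not from this guide):

```python
def safe_connection_limit(theoretical_max: int, factor: float = 0.8) -> int:
    """Operational connection limit: 80% of the hard ceiling, leaving
    headroom before exhaustion."""
    return int(round(theoretical_max * factor))

def utilization_alert(in_use: int, limit: int, threshold: float = 0.9) -> bool:
    """Fire an alert when pool utilization approaches the operational limit."""
    return in_use / limit >= threshold

print(safe_connection_limit(1000))  # 800
```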