Authentication Layers
Strong authentication is the first line of defense for LLM APIs. Implement multiple authentication methods to balance security with developer experience, choosing the right approach based on your use case and user type.
API Key Authentication
Simple, stateless authentication ideal for server-to-server communication. Keys should be long, randomly generated strings stored securely. Implement key rotation policies and scope keys to specific permissions.
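A minimal sketch of key handling along these lines, using Python's `secrets` module; the `sk_` prefix and helper names are illustrative, not a standard:

```python
import hashlib
import hmac
import secrets

def generate_api_key(prefix: str = "sk") -> str:
    """Generate a long, random API key with an identifying prefix."""
    return f"{prefix}_{secrets.token_urlsafe(32)}"

def hash_api_key(key: str) -> str:
    """Store only a hash of the key server-side, never the plaintext."""
    return hashlib.sha256(key.encode()).hexdigest()

def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Constant-time comparison to avoid timing side channels."""
    return hmac.compare_digest(hash_api_key(presented), stored_hash)
```

Scoping and rotation would layer on top of this: store the hash alongside a scope list and an expiry date, and issue a new key before revoking the old one.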
JWT Token Auth
Stateless tokens with embedded claims for user identity and permissions. Excellent for microservices and distributed systems. Include expiration times and validate signatures on every request.
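To make the signature-and-expiry checks concrete, here is a stdlib-only HS256 sketch; a production service would use a maintained library such as PyJWT, and the helper names here are illustrative:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def make_jwt(claims: dict, secret: bytes) -> str:
    """Build an HS256-signed JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"

def verify_jwt(token: str, secret: bytes) -> dict:
    """Verify signature first, then expiry; raise ValueError on failure."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("invalid signature")
    claims = json.loads(b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims
```

Note the order: the signature is checked before any claim is trusted, and expiry is enforced on every request rather than only at issuance.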
OAuth 2.0 Flow
Industry-standard authorization for user-facing applications. Supports delegated access, fine-grained scopes, and token refresh. Essential for third-party integrations and user consent flows.
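The fine-grained-scopes piece can be sketched on the resource-server side. OAuth 2.0 delivers granted scopes as a space-delimited string (per RFC 6749); the scope names below are hypothetical:

```python
def has_required_scopes(granted: str, required: set) -> bool:
    """Check that every scope the endpoint requires was granted.
    `granted` is the space-delimited scope string from the access token."""
    return required.issubset(granted.split())
```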
mTLS (Mutual TLS)
Highest security level requiring client certificates. Zero-trust networking approach where both client and server verify identity. Best for internal services and high-security environments.
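A server-side sketch of requiring client certificates with Python's `ssl` module; the certificate file paths are placeholders you would supply:

```python
import ssl

def mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Build a server-side TLS context that refuses any client
    lacking a certificate signed by the trusted CA."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file and key_file:
        ctx.load_cert_chain(cert_file, key_file)  # server's own identity
    if ca_file:
        ctx.load_verify_locations(ca_file)        # CA that signs client certs
    ctx.verify_mode = ssl.CERT_REQUIRED           # reject clients with no valid cert
    return ctx

# Usage (paths are placeholders):
# ctx = mtls_server_context("server.crt", "server.key", "clients-ca.pem")
```

The key line is `verify_mode = ssl.CERT_REQUIRED`: without it, TLS authenticates only the server, not the client.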
Rate Limiting Fundamentals
Rate limiting protects your LLM infrastructure from abuse, prevents runaway costs, and ensures fair resource distribution. Implement limits at multiple levels: global, per-user, per-endpoint, and per-model.
| Tier | Limit | Notes |
|---|---|---|
| Strict | 10-50 req/min | Best for free tiers and untrusted users. Prevents abuse but may frustrate legitimate power users. |
| Balanced | 100-500 req/min | Suitable for most production applications. Allows burst traffic while preventing sustained overload. |
| Permissive | 1000+ req/min | For trusted internal services and premium tiers. Monitor closely for anomalies. |
💡 Multi-Dimensional Limits
Implement concurrent limits across multiple dimensions: requests per minute, tokens per hour, cost per day, and concurrent connections. This prevents abuse while allowing flexible usage patterns.
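An in-process sketch of multi-dimensional limiting with rolling windows; the dimension names and limits are illustrative, and a distributed deployment would back the state with Redis instead:

```python
import time
from collections import defaultdict, deque

class MultiDimensionalLimiter:
    """A request must pass every dimension; usage is recorded only if all pass."""
    def __init__(self, limits):
        # limits: {dimension: (max_amount, window_seconds)}
        self.limits = limits
        self.events = defaultdict(deque)  # (user, dim) -> deque[(timestamp, amount)]

    def allow(self, user, usage, now=None):
        now = time.time() if now is None else now
        # First pass: check every dimension without mutating state
        for dim, amount in usage.items():
            max_amount, window = self.limits[dim]
            q = self.events[(user, dim)]
            while q and q[0][0] <= now - window:
                q.popleft()  # evict events outside the window
            if sum(a for _, a in q) + amount > max_amount:
                return False
        # Second pass: all checks passed, record the usage
        for dim, amount in usage.items():
            self.events[(user, dim)].append((now, amount))
        return True
```

The two-pass structure matters: checking and recording in one pass would charge a user for requests that another dimension then rejects.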
Rate Limiting Algorithms
| Algorithm | How It Works | Best For | Trade-offs |
|---|---|---|---|
| Token Bucket | Tokens added at fixed rate, consumed per request | Burst-heavy traffic | Complex state management |
| Leaky Bucket | Requests queue, processed at fixed rate | Smooth traffic flow | Higher latency for bursts |
| Fixed Window | Count requests in time windows | Simple implementation | Window boundary spikes |
| Sliding Window | Weighted count across window boundaries | Accurate limiting | More memory usage |
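Of the four, the sliding window is the least obvious to implement. A common approximation weights the previous fixed window by how much of it still overlaps the sliding window, trading a little accuracy for O(1) memory per key; this sketch uses that approach:

```python
import time

class SlidingWindowLimiter:
    """Sliding-window counter: estimate = prev_window * overlap + current_window."""
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window      # window length in seconds
        self.counts = {}          # key -> {window_index: count}

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        idx = int(now // self.window)
        buckets = self.counts.setdefault(key, {})
        prev = buckets.get(idx - 1, 0)
        curr = buckets.get(idx, 0)
        # Fraction of the previous window still inside the sliding window
        overlap = 1 - (now % self.window) / self.window
        if prev * overlap + curr + 1 > self.limit:
            return False
        buckets[idx] = curr + 1
        # Drop stale windows to bound memory
        for old in [i for i in buckets if i < idx - 1]:
            del buckets[old]
        return True
```

This smooths the boundary spikes of a fixed window without storing a timestamp per request.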
Security Strategies
Cost-Based Throttling
Implement dynamic rate limiting based on token cost rather than just request count. Expensive models consume more rate limit budget. This prevents budget overruns from high-token requests.
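A sketch of cost-based accounting; the model names and per-1k-token prices are hypothetical, and the daily reset is elided:

```python
# Hypothetical prices in cost units per 1,000 tokens
MODEL_COST_PER_1K = {"small-model": 0.5, "large-model": 5.0}

def request_cost(model: str, tokens: int) -> float:
    """Convert a request into cost units so expensive models
    drain the rate-limit budget faster than cheap ones."""
    return MODEL_COST_PER_1K[model] * tokens / 1000

class CostBudget:
    def __init__(self, daily_budget: float):
        self.daily_budget = daily_budget
        self.spent = {}  # user -> cost units spent today (reset job elided)

    def allow(self, user, model, tokens):
        cost = request_cost(model, tokens)
        if self.spent.get(user, 0.0) + cost > self.daily_budget:
            return False
        self.spent[user] = self.spent.get(user, 0.0) + cost
        return True
```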
Adaptive Limits
Adjust limits dynamically based on system load, time of day, or user behavior patterns. Reduce limits during peak hours, increase during off-peak. Respond to system health in real-time.
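The load-based part can be as simple as a step function over a utilization metric; the thresholds and scaling factors below are illustrative:

```python
def adaptive_limit(base_limit: int, load: float) -> int:
    """Scale a user's rate limit down as system load (0.0-1.0) rises."""
    if load < 0.5:
        return base_limit                      # headroom: full limit
    if load < 0.8:
        return int(base_limit * 0.6)           # degrade gracefully
    return max(1, int(base_limit * 0.2))       # shed load, keep a trickle
```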
Circuit Breakers
Automatically trip when error rates exceed thresholds. Prevent cascade failures by failing fast. Allow gradual recovery with half-open state testing before full restoration.
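The closed / open / half-open state machine described above can be sketched as follows (thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """Closed -> open on repeated failures; after a cooldown, half-open
    lets one probe through before fully closing again."""
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self, now=None):
        now = time.time() if now is None else now
        if self.state == "open":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"  # permit a single probe
                return True
            return False  # fail fast
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self, now=None):
        now = time.time() if now is None else now
        self.failures += 1
        # A half-open probe failing re-opens immediately
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
```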
User Quotas
Assign individual quotas per user or team. Track usage against allocated budgets. Notify users approaching limits and implement hard stops at quota boundaries.
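A minimal sketch of quota tracking with a warning threshold and a hard stop; the 80% warning ratio is an assumption:

```python
class QuotaTracker:
    """Track usage against a per-user quota; warn near the limit,
    hard-stop at the boundary."""
    def __init__(self, quota, warn_ratio=0.8):
        self.quota = quota
        self.warn_ratio = warn_ratio
        self.used = {}  # user -> units consumed this period

    def consume(self, user, amount):
        used = self.used.get(user, 0)
        if used + amount > self.quota:
            return "denied", used  # hard stop at the quota boundary
        self.used[user] = used + amount
        status = "warning" if self.used[user] >= self.quota * self.warn_ratio else "ok"
        return status, self.used[user]
```

The "warning" status is where you would hook in user notifications before the hard stop is reached.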
Implementation Example
```python
import time

import redis


class TokenBucketRateLimiter:
    def __init__(self, redis_client, capacity, refill_rate):
        self.redis = redis_client
        self.capacity = capacity        # Max tokens
        self.refill_rate = refill_rate  # Tokens per second

    def allow_request(self, key, tokens_needed=1):
        """Check if request is allowed under rate limit."""
        now = time.time()

        # Get current bucket state
        bucket = self.redis.hgetall(key)

        if not bucket:
            # Initialize new bucket
            self.redis.hset(key, mapping={
                'tokens': self.capacity - tokens_needed,
                'last_update': now
            })
            self.redis.expire(key, 3600)
            return True, self.capacity - tokens_needed

        # Calculate refill since the last update
        last_update = float(bucket['last_update'])
        current_tokens = float(bucket['tokens'])
        elapsed = now - last_update

        # Add refilled tokens, capped at capacity
        new_tokens = min(
            self.capacity,
            current_tokens + (elapsed * self.refill_rate)
        )

        if new_tokens >= tokens_needed:
            # Allow request, consume tokens
            self.redis.hset(key, mapping={
                'tokens': new_tokens - tokens_needed,
                'last_update': now
            })
            return True, new_tokens - tokens_needed

        # Deny request
        return False, new_tokens


# Usage. decode_responses=True is required so hash fields come back as
# strings rather than bytes. NOTE: this read-modify-write sequence is not
# atomic; production deployments should wrap it in a Redis Lua script.
limiter = TokenBucketRateLimiter(
    redis.Redis(decode_responses=True),
    capacity=100,    # 100 requests
    refill_rate=1    # 1 request per second
)
allowed, remaining = limiter.allow_request("user:123")
```
⚠️ Common Pitfalls
Avoid race conditions in distributed rate limiters by using atomic Redis operations. Don't rely solely on client-side rate limiting. Always implement server-side validation. Handle clock skew in distributed systems. Test edge cases around limit boundaries.
🔗 Related Security Resources
Continue learning: Why Use LLM Proxy | Production Deployment | Load Balancing Strategies | Enterprise Requirements