Why Kubernetes?
Understanding the benefits of deploying LLM proxy on Kubernetes
Kubernetes provides enterprise-grade orchestration for your LLM proxy: automatic scaling, self-healing, rolling updates, and efficient resource utilization. Deploying on Kubernetes lets your AI infrastructure handle production workloads with high availability, and the platform's declarative configuration and mature ecosystem make it a strong choice for running LLM infrastructure at scale.
Auto-Scaling
Automatically scale based on CPU, memory, or custom metrics like request rate or queue depth
Self-Healing
Automatic recovery from failures with health checks and pod restart policies
Resource Management
Efficient resource allocation with requests, limits, and quality of service classes
Rolling Updates
Zero-downtime deployments with automatic rollback on failure detection
Prerequisites
Required components and configuration before deployment
Kubernetes Cluster
A running Kubernetes cluster (v1.25+) with kubectl configured. Options include managed services or self-hosted clusters.
- Managed: EKS, GKE, AKS
- Self-hosted: kubeadm, k3s
- Local: minikube, kind
- Minimum 3 worker nodes recommended
Container Registry
Access to a container registry for storing LLM proxy images. Use a public or private registry depending on your security requirements; for private registries, see the pull-secret sketch after this list.
- Docker Hub
- GitHub Container Registry
- AWS ECR / GCR / ACR
- Private registry with TLS
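If you use a private registry, the Deployment's pod template must reference a pull secret. A sketch of the relevant fragment, assuming a `kubernetes.io/dockerconfigjson` Secret named `regcred` (hypothetical) already exists in the namespace:

```yaml
# Fragment of the Deployment's pod template (full manifest appears later).
# Assumes a kubernetes.io/dockerconfigjson Secret named "regcred" exists,
# e.g. created with `kubectl create secret docker-registry regcred ...`.
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regcred                                  # hypothetical secret name
      containers:
        - name: llm-proxy
          image: registry.example.com/llm-proxy:v1.0.0   # placeholder private image
```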
Storage & Database
Persistent storage for cache data and a PostgreSQL database for metadata, usage tracking, and API key management. A PVC example follows this list.
- StorageClass for PVCs
- PostgreSQL (managed or self-hosted)
- Redis for caching
- Backup strategy defined
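A minimal PersistentVolumeClaim sketch for the Redis cache data, assuming a StorageClass named `standard` exists in your cluster:

```yaml
# Minimal PVC sketch for Redis cache data.
# "standard" is a placeholder; use a StorageClass present in your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
  namespace: llm-proxy
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 10Gi
```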
Secrets & Certificates
API keys stored securely in Kubernetes Secrets, with SSL/TLS certificates configured for Ingress; see the cert-manager sketch after this list.
- Kubernetes Secrets for API keys
- cert-manager for TLS
- External Secrets Operator (optional)
- Sealed Secrets for GitOps
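A sketch of a cert-manager ClusterIssuer for Let's Encrypt, assuming cert-manager is installed and an nginx Ingress controller answers HTTP-01 challenges (the email and issuer name are placeholders):

```yaml
# Sketch: cert-manager ClusterIssuer using the Let's Encrypt ACME endpoint.
# Assumes cert-manager is installed and nginx Ingress handles HTTP-01.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                  # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key    # where the ACME account key is stored
    solvers:
      - http01:
          ingress:
            class: nginx
```

An Ingress then requests a certificate by referencing this issuer via the `cert-manager.io/cluster-issuer` annotation.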
Deployment Configuration
Complete Kubernetes manifests for production deployment
Namespace & Secrets
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm-proxy
---
apiVersion: v1
kind: Secret
metadata:
  name: llm-proxy-secrets
  namespace: llm-proxy
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-your-openai-key"
  ANTHROPIC_API_KEY: "sk-ant-your-key"
  LITELLM_MASTER_KEY: "sk-master-key"
  DATABASE_URL: "postgresql://user:pass@postgres:5432/litellm"
  REDIS_URL: "redis://redis:6379"
```
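The Secret above uses placeholder values; avoid committing real keys to version control (see Sealed Secrets or External Secrets under Prerequisites). Alongside the Secret, the proxy needs a model routing config. A minimal sketch as a ConfigMap, assuming the LiteLLM image is started with `--config /app/config.yaml`; the model names are illustrative:

```yaml
# Sketch: LiteLLM model routing config delivered as a ConfigMap.
# The Deployment would mount this at /app/config.yaml and start the
# proxy with `--config /app/config.yaml`; model names are examples.
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-proxy-config
  namespace: llm-proxy
data:
  config.yaml: |
    model_list:
      - model_name: gpt-4o
        litellm_params:
          model: openai/gpt-4o
          api_key: os.environ/OPENAI_API_KEY
      - model_name: claude-sonnet
        litellm_params:
          model: anthropic/claude-3-5-sonnet-20240620
          api_key: os.environ/ANTHROPIC_API_KEY
```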
Deployment Manifest
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-proxy
  namespace: llm-proxy
  labels:
    app: llm-proxy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-proxy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: llm-proxy
    spec:
      containers:
        - name: llm-proxy
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          envFrom:
            - secretRef:
                name: llm-proxy-secrets
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 4000
            initialDelaySeconds: 5
            periodSeconds: 5
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: llm-proxy
                topologyKey: kubernetes.io/hostname
```
Configure pod anti-affinity to spread replicas across different nodes. This ensures high availability even if a node fails. For production, consider topology spread constraints for more control over pod distribution across zones and regions.
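A minimal topology spread sketch for the pod spec above, assuming your nodes carry the standard `topology.kubernetes.io/zone` label:

```yaml
# Pod spec fragment: spread replicas evenly across availability zones.
# topology.kubernetes.io/zone is a well-known node label; ScheduleAnyway
# keeps this a soft preference rather than a hard scheduling requirement.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: llm-proxy
```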
Auto-Scaling Configuration
Configure Horizontal Pod Autoscaler for dynamic scaling
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-proxy-hpa
  namespace: llm-proxy
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-proxy
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
```
Scaling Metrics
| Metric Type | Target Value | Description |
|---|---|---|
| CPU Utilization | 70% | Scale when CPU usage exceeds threshold |
| Memory Utilization | 80% | Scale based on memory consumption |
| Custom: Request Rate | 100 req/s | Scale based on incoming request rate |
| Custom: Queue Depth | 50 requests | Scale when request queue builds up |
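The two custom rows above require a metrics pipeline that the resource-based HPA does not. A sketch of a per-pod metric entry for the HPA's `spec.metrics` list, assuming an adapter such as prometheus-adapter exposes the metric; the name `llm_requests_per_second` is hypothetical:

```yaml
# Additional entry for the HPA's spec.metrics list above.
# Requires a metrics adapter (e.g. prometheus-adapter) exposing a
# per-pod metric; "llm_requests_per_second" is a hypothetical name.
- type: Pods
  pods:
    metric:
      name: llm_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"    # target average per pod, matching the table
```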
Monitoring Setup
Deploy comprehensive monitoring with Prometheus and Grafana
Prometheus
Collect metrics from LLM proxy pods including request rate, latency, errors, and custom business metrics
Grafana
Visualize metrics with pre-built dashboards for LLM proxy performance, costs, and usage analytics
AlertManager
Configure alerts for high error rates, resource exhaustion, and abnormal usage patterns
Log Aggregation
Centralize logs with Loki or ELK stack for debugging and audit trail requirements
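If you run the Prometheus Operator, scraping can be declared with a ServiceMonitor. A sketch, assuming a Service labeled `app: llm-proxy` exposes a port named `http` and that the proxy serves Prometheus metrics at `/metrics` (verify the path for your image):

```yaml
# Sketch: Prometheus Operator ServiceMonitor for the proxy.
# Selects a Service labeled app: llm-proxy; assumes the Prometheus
# Operator CRDs are installed and the proxy exposes /metrics.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-proxy
  namespace: llm-proxy
spec:
  selector:
    matchLabels:
      app: llm-proxy
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```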
Before going live:
- Configure resource requests and limits for all containers
- Set up pod disruption budgets (sketch below)
- Implement network policies (sketch below)
- Enable audit logging
- Configure TLS for all internal communication
- Establish disaster recovery procedures with tested backups
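Two of those items as sketches: a PodDisruptionBudget that keeps at least two replicas up during voluntary disruptions (node drains, upgrades), and a NetworkPolicy restricting ingress to the ingress controller's namespace. The `ingress-nginx` namespace name is an assumption; adjust to your controller:

```yaml
# Sketch: keep at least 2 proxy replicas during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-proxy-pdb
  namespace: llm-proxy
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: llm-proxy
---
# Sketch: only allow traffic from the ingress controller's namespace
# to the proxy port. "ingress-nginx" is a placeholder namespace name.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-proxy-ingress
  namespace: llm-proxy
spec:
  podSelector:
    matchLabels:
      app: llm-proxy
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 4000
```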