Why Kubernetes?
Understanding the benefits of deploying LLM proxy on Kubernetes
Kubernetes provides enterprise-grade orchestration for your LLM proxy: automatic scaling, self-healing, rolling updates, and efficient resource utilization. Deploying on Kubernetes lets your AI infrastructure handle production workloads with high availability, and the platform's declarative configuration and mature ecosystem make it a strong choice for running LLM infrastructure at scale.
Auto-Scaling
Automatically scale based on CPU, memory, or custom metrics like request rate or queue depth
Self-Healing
Automatic recovery from failures with health checks and pod restart policies
Resource Management
Efficient resource allocation with requests, limits, and quality of service classes
Rolling Updates
Zero-downtime deployments with automatic rollback on failure detection
Prerequisites
Required components and configuration before deployment
Kubernetes Cluster
A running Kubernetes cluster (v1.25+) with kubectl configured. Options include managed services or self-hosted clusters.
- Managed: EKS, GKE, AKS
- Self-hosted: kubeadm, k3s
- Local: minikube, kind
- Minimum 3 worker nodes recommended
Container Registry
Access to a container registry for storing LLM proxy images. Use a public or private registry depending on your security requirements; for private registries, see the pull-secret sketch after this list.
- Docker Hub
- GitHub Container Registry
- AWS ECR / GCR / ACR
- Private registry with TLS
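If you use a private registry, the Deployment's pod template must reference a pull secret. A sketch of the relevant fragment, assuming a `kubernetes.io/dockerconfigjson` Secret named `regcred` (hypothetical) already exists in the namespace:

```yaml
# Fragment of the Deployment's pod template (full manifest appears later).
# Assumes a kubernetes.io/dockerconfigjson Secret named "regcred" exists,
# e.g. created with `kubectl create secret docker-registry regcred ...`.
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regcred                                  # hypothetical secret name
      containers:
        - name: llm-proxy
          image: registry.example.com/llm-proxy:v1.0.0   # placeholder private image
```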
Storage & Database
Persistent storage for cache data and a PostgreSQL database for metadata, usage tracking, and API key management. A PVC example follows this list.
- StorageClass for PVCs
- PostgreSQL (managed or self-hosted)
- Redis for caching
- Backup strategy defined
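A minimal PersistentVolumeClaim sketch for the Redis cache data, assuming a StorageClass named `standard` exists in your cluster:

```yaml
# Minimal PVC sketch for Redis cache data.
# "standard" is a placeholder; use a StorageClass present in your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
  namespace: llm-proxy
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 10Gi
```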
Secrets & Certificates
API keys stored securely in Kubernetes Secrets, with SSL/TLS certificates configured for Ingress; see the cert-manager sketch after this list.
- Kubernetes Secrets for API keys
- cert-manager for TLS
- External Secrets Operator (optional)
- Sealed Secrets for GitOps
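A sketch of a cert-manager ClusterIssuer for Let's Encrypt, assuming cert-manager is installed and an nginx Ingress controller answers HTTP-01 challenges (the email and issuer name are placeholders):

```yaml
# Sketch: cert-manager ClusterIssuer using the Let's Encrypt ACME endpoint.
# Assumes cert-manager is installed and nginx Ingress handles HTTP-01.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                  # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key    # where the ACME account key is stored
    solvers:
      - http01:
          ingress:
            class: nginx
```

An Ingress then requests a certificate by referencing this issuer via the `cert-manager.io/cluster-issuer` annotation.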
Deployment Configuration
Complete Kubernetes manifests for production deployment
Namespace & Secrets
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm-proxy
---
apiVersion: v1
kind: Secret
metadata:
  name: llm-proxy-secrets
  namespace: llm-proxy
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-your-openai-key"
  ANTHROPIC_API_KEY: "sk-ant-your-key"
  LITELLM_MASTER_KEY: "sk-master-key"
  DATABASE_URL: "postgresql://user:pass@postgres:5432/litellm"
  REDIS_URL: "redis://redis:6379"
```
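The Secret above uses placeholder values; avoid committing real keys to version control (see Sealed Secrets or External Secrets under Prerequisites). Alongside the Secret, the proxy needs a model routing config. A minimal sketch as a ConfigMap, assuming the LiteLLM image is started with `--config /app/config.yaml`; the model names are illustrative:

```yaml
# Sketch: LiteLLM model routing config delivered as a ConfigMap.
# The Deployment would mount this at /app/config.yaml and start the
# proxy with `--config /app/config.yaml`; model names are examples.
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-proxy-config
  namespace: llm-proxy
data:
  config.yaml: |
    model_list:
      - model_name: gpt-4o
        litellm_params:
          model: openai/gpt-4o
          api_key: os.environ/OPENAI_API_KEY
      - model_name: claude-sonnet
        litellm_params:
          model: anthropic/claude-3-5-sonnet-20240620
          api_key: os.environ/ANTHROPIC_API_KEY
```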
Deployment Manifest
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-proxy
  namespace: llm-proxy
  labels:
    app: llm-proxy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-proxy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: llm-proxy
    spec:
      containers:
        - name: llm-proxy
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          envFrom:
            - secretRef:
                name: llm-proxy-secrets
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 4000
            initialDelaySeconds: 5
            periodSeconds: 5
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: llm-proxy
                topologyKey: kubernetes.io/hostname
```
Configure pod anti-affinity to spread replicas across different nodes. This ensures high availability even if a node fails. For production, consider topology spread constraints for more control over pod distribution across zones and regions.
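A minimal topology spread sketch for the pod spec above, assuming your nodes carry the standard `topology.kubernetes.io/zone` label:

```yaml
# Pod spec fragment: spread replicas evenly across availability zones.
# topology.kubernetes.io/zone is a well-known node label; ScheduleAnyway
# keeps this a soft preference rather than a hard scheduling requirement.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: llm-proxy
```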
Auto-Scaling Configuration
Configure Horizontal Pod Autoscaler for dynamic scaling
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-proxy-hpa
  namespace: llm-proxy
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-proxy
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
```
Scaling Metrics
| Metric Type | Target Value | Description |
|---|---|---|
| CPU Utilization | 70% | Scale when CPU usage exceeds threshold |
| Memory Utilization | 80% | Scale based on memory consumption |
| Custom: Request Rate | 100 req/s | Scale based on incoming request rate |
| Custom: Queue Depth | 50 requests | Scale when request queue builds up |
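The two custom rows above require a metrics pipeline that the resource-based HPA does not. A sketch of a per-pod metric entry for the HPA's `spec.metrics` list, assuming an adapter such as prometheus-adapter exposes the metric; the name `llm_requests_per_second` is hypothetical:

```yaml
# Additional entry for the HPA's spec.metrics list above.
# Requires a metrics adapter (e.g. prometheus-adapter) exposing a
# per-pod metric; "llm_requests_per_second" is a hypothetical name.
- type: Pods
  pods:
    metric:
      name: llm_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"    # target average per pod, matching the table
```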
Monitoring Setup
Deploy comprehensive monitoring with Prometheus and Grafana
Prometheus
Collect metrics from LLM proxy pods including request rate, latency, errors, and custom business metrics
Grafana
Visualize metrics with pre-built dashboards for LLM proxy performance, costs, and usage analytics
AlertManager
Configure alerts for high error rates, resource exhaustion, and abnormal usage patterns
Log Aggregation
Centralize logs with Loki or ELK stack for debugging and audit trail requirements
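If you run the Prometheus Operator, scraping can be declared with a ServiceMonitor. A sketch, assuming a Service labeled `app: llm-proxy` exposes a port named `http` and that the proxy serves Prometheus metrics at `/metrics` (verify the path for your image):

```yaml
# Sketch: Prometheus Operator ServiceMonitor for the proxy.
# Selects a Service labeled app: llm-proxy; assumes the Prometheus
# Operator CRDs are installed and the proxy exposes /metrics.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-proxy
  namespace: llm-proxy
spec:
  selector:
    matchLabels:
      app: llm-proxy
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```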
Before going live:
- Configure resource requests and limits for all containers
- Set up pod disruption budgets (sketch below)
- Implement network policies (sketch below)
- Enable audit logging
- Configure TLS for all internal communication
- Establish disaster recovery procedures with tested backups
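Two of those items as sketches: a PodDisruptionBudget that keeps at least two replicas up during voluntary disruptions (node drains, upgrades), and a NetworkPolicy restricting ingress to the ingress controller's namespace. The `ingress-nginx` namespace name is an assumption; adjust to your controller:

```yaml
# Sketch: keep at least 2 proxy replicas during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-proxy-pdb
  namespace: llm-proxy
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: llm-proxy
---
# Sketch: only allow traffic from the ingress controller's namespace
# to the proxy port. "ingress-nginx" is a placeholder namespace name.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-proxy-ingress
  namespace: llm-proxy
spec:
  podSelector:
    matchLabels:
      app: llm-proxy
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 4000
```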