☸️ Kubernetes Deployment

Deploy LLM Proxy on Kubernetes

Master production-grade LLM proxy deployment on Kubernetes. Learn Helm charts, Deployments, Services, Ingress configuration, Horizontal Pod Autoscaling, Secrets management, ConfigMaps, persistent storage, and comprehensive monitoring with Prometheus and Grafana.

🎯 Why Kubernetes?

Understanding the benefits of deploying LLM proxy on Kubernetes

Kubernetes provides enterprise-grade orchestration for your LLM proxy, offering automatic scaling, self-healing, rolling updates, and efficient resource utilization. Deploying on Kubernetes ensures your AI infrastructure can handle production workloads with high availability and reliability. The platform's declarative configuration and robust ecosystem make it the ideal choice for organizations serious about AI at scale.

🔄 Auto-Scaling

Automatically scale based on CPU, memory, or custom metrics like request rate or queue depth

🛡️ Self-Healing

Automatic recovery from failures with health checks and pod restart policies

📦 Resource Management

Efficient resource allocation with requests, limits, and quality of service classes

🚀 Rolling Updates

Zero-downtime deployments with automatic rollback on failure detection

📋 Prerequisites

Required components and configuration before deployment

1. Kubernetes Cluster

A running Kubernetes cluster (v1.25+) with kubectl configured. Options include managed services or self-hosted clusters.

  • Managed: EKS, GKE, AKS
  • Self-hosted: kubeadm, k3s
  • Local: minikube, kind
  • Minimum 3 worker nodes recommended

2. Container Registry

Access to container registry for storing LLM proxy images. Use public or private registries based on your security requirements.

  • Docker Hub
  • GitHub Container Registry
  • AWS ECR / GCR / ACR
  • Private registry with TLS

3. Storage & Database

Persistent storage for cache data and PostgreSQL database for metadata, usage tracking, and API key management.

  • StorageClass for PVCs
  • PostgreSQL (managed or self-hosted)
  • Redis for caching
  • Backup strategy defined

4. Secrets & Certificates

API keys stored securely in Kubernetes Secrets, and SSL/TLS certificates configured for Ingress, typically via cert-manager (a ClusterIssuer sketch follows this list).

  • Kubernetes Secrets for API keys
  • cert-manager for TLS
  • External Secrets Operator (optional)
  • Sealed Secrets for GitOps
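
As a sketch of the certificate piece, assuming cert-manager is installed and you terminate TLS with Let's Encrypt via the NGINX ingress controller (the email address is a placeholder):

clusterissuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com  # placeholder; use a monitored address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx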

🚀 Deployment Configuration

Complete Kubernetes manifests for production deployment

Kubernetes Architecture Overview: Ingress (NGINX / Traefik) → Service (ClusterIP / LoadBalancer) → Deployment (3+ replicas) → Pods (LLM Proxy)

Namespace & Secrets

namespace-and-secrets.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm-proxy
---
apiVersion: v1
kind: Secret
metadata:
  name: llm-proxy-secrets
  namespace: llm-proxy
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-your-openai-key"
  ANTHROPIC_API_KEY: "sk-ant-your-key"
  LITELLM_MASTER_KEY: "sk-master-key"
  DATABASE_URL: "postgresql://user:pass@postgres:5432/litellm"
  REDIS_URL: "redis://redis:6379"
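
💡 Note: stringData values are stored base64-encoded, not encrypted, in etcd. For GitOps workflows, avoid committing this file with real keys; use the Sealed Secrets or External Secrets Operator options listed in the prerequisites instead.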

Deployment Manifest

deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-proxy
  namespace: llm-proxy
  labels:
    app: llm-proxy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-proxy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: llm-proxy
    spec:
      containers:
      - name: llm-proxy
        image: ghcr.io/berriai/litellm:main-latest
        ports:
        - containerPort: 4000
        envFrom:
        - secretRef:
            name: llm-proxy-secrets
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        # LiteLLM's /health endpoint checks every configured model and requires
        # authentication, so it is a poor fit for probes. The proxy exposes
        # lightweight /health/liveliness and /health/readiness endpoints for
        # exactly this purpose.
        livenessProbe:
          httpGet:
            path: /health/liveliness
            port: 4000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/readiness
            port: 4000
          initialDelaySeconds: 5
          periodSeconds: 5
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: llm-proxy
              topologyKey: kubernetes.io/hostname
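
Service & Ingress

The architecture above routes traffic through a Service and an Ingress. Here is a minimal sketch assuming the NGINX ingress controller and cert-manager are installed; the hostname is a placeholder, and letsencrypt-prod matches the ClusterIssuer sketched in the prerequisites.

service-and-ingress.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-proxy
  namespace: llm-proxy
  labels:
    app: llm-proxy
spec:
  type: ClusterIP
  selector:
    app: llm-proxy
  ports:
  - name: http
    port: 80
    targetPort: 4000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-proxy
  namespace: llm-proxy
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  rules:
  - host: llm-proxy.example.com  # placeholder hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llm-proxy
            port:
              number: 80
  tls:
  - hosts:
    - llm-proxy.example.com
    secretName: llm-proxy-tls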
💡 Best Practice: Pod Anti-Affinity

Configure pod anti-affinity to spread replicas across different nodes. This ensures high availability even if a node fails. For production, consider topology spread constraints for more control over pod distribution across zones and regions.
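
As a sketch, a zone-level topology spread constraint could be added to the pod template like this (assuming your nodes carry the standard topology.kubernetes.io/zone label):

topology-spread.yaml (pod template fragment)
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: llm-proxy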

📈 Auto-Scaling Configuration

Configure Horizontal Pod Autoscaler for dynamic scaling

hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-proxy-hpa
  namespace: llm-proxy
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-proxy
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
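
The behavior block damps flapping: with stabilizationWindowSeconds: 300, the HPA acts on the highest replica recommendation from the past five minutes before removing pods, and scale-down is capped at 2 pods per minute, while scale-up reacts within a minute and can add up to 4 pods per minute.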

Scaling Metrics

Metric Type             Target Value   Description
CPU Utilization         70%            Scale when CPU usage exceeds the threshold
Memory Utilization      80%            Scale based on memory consumption
Custom: Request Rate    100 req/s      Scale based on incoming request rate
Custom: Queue Depth     50 requests    Scale when the request queue builds up
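
The two custom rows require an external metrics pipeline such as Prometheus Adapter or KEDA; they are not available out of the box. As a sketch, assuming an adapter exposes a per-pod metric named litellm_requests_per_second (the metric name is an assumption), the request-rate entry in the HPA metrics list would look like:

hpa-custom-metric.yaml (metrics list fragment)
  - type: Pods
    pods:
      metric:
        # Assumed name; must match a metric served by your metrics adapter.
        name: litellm_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"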

📊 Monitoring Setup

Deploy comprehensive monitoring with Prometheus and Grafana

📈 Prometheus

Collect metrics from LLM proxy pods including request rate, latency, errors, and custom business metrics

📊 Grafana

Visualize metrics with pre-built dashboards for LLM proxy performance, costs, and usage analytics

🚨 AlertManager

Configure alerts for high error rates, resource exhaustion, and abnormal usage patterns

📝 Log Aggregation

Centralize logs with Loki or ELK stack for debugging and audit trail requirements
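
On the Prometheus side, here is a minimal sketch assuming the kube-prometheus-stack operator is installed and the proxy exposes Prometheus metrics at /metrics (verify the endpoint for your LiteLLM version and configuration):

servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-proxy
  namespace: llm-proxy
  labels:
    release: prometheus  # assumption; must match your Prometheus operator's selector
spec:
  selector:
    matchLabels:
      app: llm-proxy
  endpoints:
  - port: http
    path: /metrics
    interval: 30s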

⚠️ Production Checklist

Before going live:

  • Configure resource limits for all containers
  • Set up pod disruption budgets (see the sketch below)
  • Implement network policies (see the sketch below)
  • Enable audit logging
  • Configure TLS for all internal communication
  • Establish disaster recovery procedures with tested backups
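
A minimal sketch of the first two items, using the labels and namespace from this page (the ingress-nginx namespace selector is an assumption; point it at wherever your ingress controller runs):

pdb-and-networkpolicy.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-proxy
  namespace: llm-proxy
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: llm-proxy
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-proxy-ingress
  namespace: llm-proxy
spec:
  podSelector:
    matchLabels:
      app: llm-proxy
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Assumption: the ingress controller runs in the ingress-nginx namespace.
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 4000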