AI API Proxy Kubernetes Deployment Guide

Complete production-ready guide to deploying AI API proxies on Kubernetes. Includes Helm charts, HPA autoscaling, service mesh integration, and 2026 best practices for high-availability enterprise deployments.

Production Kubernetes Architecture

A well-architected Kubernetes deployment ensures scalability, reliability, and maintainability for your AI API proxy infrastructure.

1. Ingress Controller

NGINX or Traefik routing traffic to API services with SSL termination and rate limiting.

2. Service Mesh (Istio/Linkerd)

Advanced traffic management, observability, and security policies across all services.

3. Horizontal Pod Autoscaler

Dynamic scaling based on CPU, memory, or custom metrics to handle varying API loads.

4. StatefulSets + Persistent Storage

Reliable state management for API rate limiting, caching, and session data.
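Component 4 is typically backed by a small Redis StatefulSet holding rate-limit counters and cached responses. A minimal sketch, assuming the stock `redis:7-alpine` image and your cluster's default storage class (both are illustrative choices, not requirements of the proxy):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: proxy-state-redis
  namespace: production
spec:
  serviceName: proxy-state-redis
  replicas: 3
  selector:
    matchLabels:
      app: proxy-state-redis
  template:
    metadata:
      labels:
        app: proxy-state-redis
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        ports:
        - containerPort: 6379
        volumeMounts:
        - name: data
          mountPath: /data
  # Each replica gets its own PersistentVolumeClaim for durable state
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```

A headless Service named `proxy-state-redis` is also needed so each pod gets a stable DNS identity.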

Step-by-Step Production Deployment

1. Helm Chart Configuration

Create a production-ready Helm chart for your AI API proxy:

values.yaml
# AI API Proxy Helm Values
replicaCount: 3
image:
  repository: ai-proxy/api-gateway
  tag: v2.6.0
  pullPolicy: IfNotPresent

resources:
  limits:
    cpu: "1000m"
    memory: "2Gi"
  requests:
    cpu: "200m"
    memory: "512Mi"

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilization: 70
  targetMemoryUtilization: 80

service:
  type: ClusterIP
  port: 8080

ingress:
  enabled: true
  className: "nginx"
  hosts:
    - host: api.example.com
      paths:
        - path: /
          pathType: Prefix

config:
  openai:
    apiKeySecret: openai-secrets
    rateLimit: 5000
  anthropic:
    apiKeySecret: anthropic-secrets
    rateLimit: 3000
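The `config` block above is rendered by the chart's templates. A hypothetical `templates/configmap.yaml` showing how the per-provider rate limits could be wired through (the `ai-proxy.fullname` helper and key names are assumptions about this chart's layout):

```yaml
# templates/configmap.yaml (hypothetical template for this chart)
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "ai-proxy.fullname" . }}-config
data:
  # Values flow in from the config: block of values.yaml
  openai-rate-limit: {{ .Values.config.openai.rateLimit | quote }}
  anthropic-rate-limit: {{ .Values.config.anthropic.rateLimit | quote }}
```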

2. Kubernetes Manifests

Deployment manifest with liveness and readiness probes:

deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-api-proxy
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-api-proxy
  template:
    metadata:
      labels:
        app: ai-api-proxy
    spec:
      containers:
      - name: ai-proxy
        image: ai-proxy/api-gateway:v2.6.0
        ports:
        - containerPort: 8080
        env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: openai-secrets
              key: apiKey
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: anthropic-secrets
              key: apiKey
        resources:
          requests:
            memory: "512Mi"
            cpu: "200m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
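The deployment assumes the `openai-secrets` and `anthropic-secrets` Secrets already exist in the `production` namespace. A sketch of creating them declaratively, with placeholder values only (never commit real keys to version control):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: openai-secrets
  namespace: production
type: Opaque
stringData:
  apiKey: "<your-openai-api-key>"   # placeholder, replace at deploy time
---
apiVersion: v1
kind: Secret
metadata:
  name: anthropic-secrets
  namespace: production
type: Opaque
stringData:
  apiKey: "<your-anthropic-api-key>"   # placeholder, replace at deploy time
```

In practice, prefer sourcing these from an external secret manager (e.g. External Secrets Operator or sealed-secrets) rather than plain manifests.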

3. Service Mesh Integration

Istio VirtualService for advanced traffic management. Note that the fault block on the anthropic route injects a 7s delay into 0.1% of requests; that is a resilience-testing setting and should be removed for normal production traffic:

virtualservice.yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ai-api-proxy
spec:
  hosts:
  - ai-api.example.com
  gateways:
  - public-gateway
  http:
  - name: "openai-route"
    match:
    - headers:
        x-model-type:
          exact: "openai"
    route:
    - destination:
        host: ai-api-proxy
        port:
          number: 8080
    retries:
      attempts: 3
      perTryTimeout: 2s
    timeout: 30s
    
  - name: "anthropic-route"
    match:
    - headers:
        x-model-type:
          exact: "anthropic"
    route:
    - destination:
        host: ai-api-proxy
        port:
          number: 8080
    fault:
      delay:
        percentage:
          value: 0.1
        fixedDelay: 7s
    timeout: 60s
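The retry and timeout policies above pair naturally with a DestinationRule that limits connection load and ejects unhealthy pods. A sketch with illustrative thresholds (tune them against your actual traffic):

```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: ai-api-proxy
spec:
  host: ai-api-proxy
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
    # Temporarily remove pods that keep returning 5xx from the load-balancing pool
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
```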
4. HPA Configuration

Dynamic scaling based on CPU utilization and custom metrics (requests per second).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-api-proxy-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-api-proxy
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests-per-second
      target:
        type: AverageValue
        averageValue: "1000"
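The `requests-per-second` Pods metric is not built into Kubernetes; it must be exposed through the custom metrics API, typically via prometheus-adapter. A sketch of the adapter rule, assuming the proxy exports a Prometheus counter named `http_requests_total` with `namespace` and `pod` labels (both the counter name and labels are assumptions about your application):

```yaml
# prometheus-adapter values fragment (hypothetical metric names)
rules:
  custom:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^http_requests_total$"
      as: "requests-per-second"
    # Convert the raw counter into a per-pod rate the HPA can average
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```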
5. Security Policies

Pod Security Admission and NetworkPolicy for secure cluster operations. PodSecurityPolicy was deprecated in Kubernetes 1.21 and removed in 1.25, so enforce the equivalent restrictions with the restricted Pod Security Standard plus a container-level securityContext.

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted

In the proxy Deployment, the same restrictions are expressed per container:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  seccompProfile:
    type: RuntimeDefault
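A NetworkPolicy complements the pod-level restrictions by limiting which peers can reach the proxy. A minimal sketch admitting traffic only from the ingress controller's namespace (the `ingress-nginx` namespace name is an assumption about your cluster; `kubernetes.io/metadata.name` is a label Kubernetes sets automatically):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-api-proxy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: ai-api-proxy
  policyTypes:
  - Ingress
  ingress:
  # Only the ingress controller namespace may reach the proxy port
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
```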

Monitoring and Observability

Metric              | Target        | Alert Threshold | Tool
--------------------|---------------|-----------------|------------------
CPU Utilization     | 70% average   | 90% for 5 min   | Prometheus
Memory Usage        | 80% average   | 90% for 5 min   | Grafana
Request Rate        | 1,000 req/s   | 2,000 req/s     | Prometheus
Error Rate          | 0.1%          | 1%              | Jaeger
Response Time (p95) | 500 ms        | 1,000 ms        | Kiali
Pod Restarts        | 0             | 3 in 10 min     | Kubernetes Events
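The alert thresholds above can be encoded declaratively as a PrometheusRule (requires the Prometheus Operator). A sketch covering two of the rows; the metric names `container_cpu_usage_seconds_total` and `kube_pod_container_resource_limits` come from cAdvisor and kube-state-metrics, while `http_requests_total` and the `job` label are assumptions about what the proxy exports:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-api-proxy-alerts
  namespace: production
spec:
  groups:
  - name: ai-api-proxy
    rules:
    # CPU above 90% of limits for 5 minutes
    - alert: HighCPUUtilization
      expr: |
        sum(rate(container_cpu_usage_seconds_total{pod=~"ai-api-proxy.*"}[5m]))
          / sum(kube_pod_container_resource_limits{pod=~"ai-api-proxy.*",resource="cpu"}) > 0.9
      for: 5m
      labels:
        severity: critical
    # 5xx responses above 1% of total traffic
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{job="ai-api-proxy",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="ai-api-proxy"}[5m])) > 0.01
      for: 5m
      labels:
        severity: critical
```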