NGINX LLM Proxy Configuration Guide
Configure a high-performance LLM API proxy with load balancing,
caching, SSL/TLS termination, rate limiting, and monitoring.
00

Introduction

NGINX is the industry-standard reverse proxy for handling high-throughput LLM API traffic. This guide covers a complete configuration for production deployments: multi-provider load balancing, intelligent caching strategies, SSL/TLS termination, rate limiting, and performance tuning.

Setting up NGINX as an LLM proxy provides several key advantages for production AI applications. NGINX excels at handling concurrent connections, provides sophisticated caching mechanisms, supports advanced load balancing algorithms, and offers fine-grained control over request routing. Whether you're proxying OpenAI, Anthropic, or multiple AI providers, NGINX provides the performance and flexibility needed for enterprise-grade deployments.

High Performance

Handle thousands of concurrent connections with minimal resource usage. NGINX's event-driven architecture keeps performance steady under load.

🔄 Load Balancing

Distribute traffic across multiple LLM providers with configurable algorithms, including round-robin, least connections, and weighted distribution.

💾 Smart Caching

Reduce API costs, often by 40-70% for workloads with repeated queries, with intelligent response caching. Configure cache keys, TTLs, and invalidation strategies for your use case.

🔒 SSL/TLS Termination

Offload TLS encryption from your application servers. Automatic certificate management with Let's Encrypt integration.

📊 Monitoring

Built-in metrics and logging for request tracking. Integrate with Prometheus, Grafana, and the ELK stack for comprehensive observability.

🛡️ Rate Limiting

Protect against quota exhaustion and cost overruns. Configure per-user, per-IP, or global rate limits with custom error responses.

01

Basic Setup

Initial NGINX installation and configuration for LLM proxy functionality

/etc/nginx/sites-available/llm-proxy.conf NGINX
# Basic LLM Proxy Configuration
server {
    listen 80;
    server_name api.yourdomain.com;

    # Redirect HTTP to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    # SSL Configuration
    ssl_certificate /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;

    # Proxy to OpenAI API
    location /v1/ {
        proxy_pass https://api.openai.com/v1/;
        proxy_set_header Host api.openai.com;

        # $openai_api_key is not a built-in variable; define it in an
        # included file kept out of version control, e.g.:
        #   set $openai_api_key "sk-...";
        # nginx -t will fail with "unknown variable" if it is missing.
        proxy_set_header Authorization "Bearer $openai_api_key";
        proxy_ssl_server_name on;
        
        # Timeouts (raise read timeout if long completions get cut off)
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }
}
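Chat completion endpoints typically stream tokens as server-sent events, but NGINX buffers upstream responses by default, holding the entire reply until generation finishes. Below is a minimal sketch of a streaming-friendly location, a more specific match that takes precedence over the /v1/ prefix above; it assumes the same $openai_api_key variable:

# Streaming (SSE) pass-through for token-by-token responses
location /v1/chat/completions {
    proxy_pass https://api.openai.com/v1/chat/completions;
    proxy_set_header Host api.openai.com;
    proxy_set_header Authorization "Bearer $openai_api_key";
    proxy_ssl_server_name on;

    proxy_buffering off;             # forward upstream chunks immediately
    proxy_http_version 1.1;          # needed for chunked transfer
    proxy_set_header Connection "";  # clear the default "close"
    proxy_read_timeout 300s;         # long generations keep the stream open
}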
02

Upstream Configuration

Configure multiple LLM providers with load balancing and health checks

// LLM Provider Load Balancing Architecture
//
// Client (Your Application)
//   -> NGINX (Load Balancer)
//        -> OpenAI (GPT-4/Turbo)
//        -> Anthropic (Claude 3)
//        -> Azure (OpenAI Service)
/etc/nginx/conf.d/upstreams/llm-providers.conf NGINX
# OpenAI Upstream with multiple endpoints
upstream openai_backend {
    least_conn;
    
    # Primary endpoint
    server api.openai.com:443 weight=3;
    
    # Keepalive connections
    keepalive 32;
    keepalive_timeout 60s;
}

# Anthropic Upstream
upstream anthropic_backend {
    server api.anthropic.com:443;
    keepalive 16;
}

# Multi-provider upstream for load balancing
# Note: OpenAI and Anthropic expose different paths, auth headers, and
# request schemas, so a mixed pool like this only works behind a
# translation layer or with providers offering compatible APIs.
upstream llm_fallback {
    server api.openai.com:443 weight=2;
    server api.anthropic.com:443 weight=1;
    keepalive 24;
}
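Upstream keepalive only takes effect when the proxied connection uses HTTP/1.1 with the Connection header cleared, and proxying to a named upstream over TLS requires the SNI name to be set explicitly (it otherwise defaults to the upstream's name). A sketch of a location wired to the openai_backend pool above:

location /v1/ {
    proxy_pass https://openai_backend;

    # Required for upstream keepalive connections to be reused
    proxy_http_version 1.1;
    proxy_set_header Connection "";

    # TLS to the upstream: send the correct SNI and Host
    proxy_set_header Host api.openai.com;
    proxy_ssl_server_name on;
    proxy_ssl_name api.openai.com;

    # Try the next pool member on connection-level failures
    proxy_next_upstream error timeout http_502 http_503;
}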

Load Balancing Algorithms

Algorithm           Use Case                  Configuration
Round Robin         Equal distribution        (default, no directive)
Least Connections   Variable request times    least_conn;
IP Hash             Session persistence       ip_hash;
Weighted            Capacity-based routing    weight=N;
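These algorithms compose with per-server parameters. A sketch combining least_conn with weights and a failover-only backup, using hypothetical internal gateway hosts (llm-gw-*.internal is a placeholder naming scheme):

upstream llm_gateways {
    least_conn;
    server llm-gw-1.internal:8443 weight=2;   # larger instance gets more traffic
    server llm-gw-2.internal:8443 weight=1;
    server llm-gw-3.internal:8443 backup;     # only used when the others are down
    keepalive 16;
}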
03

Response Caching

Implement intelligent caching to reduce API costs and improve response times

💡 Cost Savings with Caching

Response caching can reduce your LLM API costs by 40-70% for applications with repeated queries. Cache identical requests, similar prompts, or even implement semantic caching for maximum efficiency. Configure appropriate TTLs based on your content's freshness requirements.

/etc/nginx/conf.d/caching.conf NGINX
# Cache path configuration
proxy_cache_path /var/cache/nginx/llm
    levels=1:2
    keys_zone=llm_cache:100m
    max_size=10g
    inactive=24h
    use_temp_path=off;

# Cache configuration in server block
server {
    # ... other config ...
    
    location /v1/chat/completions {
        proxy_pass https://openai_backend/v1/chat/completions;
        
        # Enable caching
        proxy_cache llm_cache;
        
        # NGINX only caches GET/HEAD by default; completions are POST
        proxy_cache_methods POST;
        
        # Cache key based on request body. Note: $request_body is only
        # populated when the body fits in client_body_buffer_size;
        # larger bodies spill to disk and would collide on an empty key.
        proxy_cache_key "$request_method|$request_uri|$request_body";
        
        # Per-status TTLs: cache successes for 24 hours, rate-limit
        # responses briefly, and never cache server errors
        proxy_cache_valid 200 24h;
        proxy_cache_valid 429 1m;
        proxy_cache_valid 500 0;
        
        # Allow cache bypass via an X-No-Cache request header
        proxy_cache_bypass $http_x_no_cache;
        
        # Expose HIT/MISS/BYPASS status to clients
        add_header X-Cache-Status $upstream_cache_status;
        
        # ... other proxy settings ...
    }
}
⚠️ Cache Key Considerations

For chat completions, messages may have minor variations that don't affect responses. Consider normalizing whitespace, sorting message keys, or using semantic similarity for cache keys. Always test cache behavior before deploying to production to ensure expected functionality.
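One related caveat: streamed responses should not be written to or served from the cache. Because proxy_cache accepts a variable (and the value off disables caching), a map can route such requests around it. The X-Stream request header below is a hypothetical client convention; substitute whatever signal your application uses:

# In the http block: choose a cache zone per request
map $http_x_stream $llm_cache_zone {
    default llm_cache;   # normal requests use the shared cache
    "1"     off;         # streaming requests bypass caching entirely
}

# Then, inside the location block shown above:
#     proxy_cache $llm_cache_zone;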

04

SSL/TLS Configuration

Secure your proxy endpoint with proper SSL/TLS configuration

/etc/nginx/snippets/ssl-params.conf NGINX
# Modern SSL configuration
# (the cipher list below applies to TLS 1.2; TLS 1.3 cipher suites
# are negotiated automatically by OpenSSL)
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
ssl_prefer_server_ciphers off;

# SSL Session Configuration
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 1d;
ssl_session_tickets off;

# HSTS
add_header Strict-Transport-Security "max-age=63072000" always;
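The snippet is designed to be pulled into every TLS-terminating server block with an include, so all virtual hosts share one TLS policy. For example, the basic server from section 01 becomes:

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;

    # Shared TLS policy from the snippet above
    include /etc/nginx/snippets/ssl-params.conf;

    # ... locations ...
}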
05

Performance Tuning

Optimize NGINX for high-throughput LLM API traffic

/etc/nginx/nginx.conf NGINX
# Worker Processes
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 4096;
    multi_accept on;
    use epoll;    # Linux-specific; omit on BSD/macOS (kqueue is used there)
}

http {
    # Connection optimization
    keepalive_timeout 65;
    keepalive_requests 1000;
    
    # Buffer settings. Keep client_body_buffer_size large enough to
    # hold prompt bodies in memory; $request_body (used in cache keys
    # above) is empty once a body spills to a temp file.
    client_body_buffer_size 128k;
    proxy_buffer_size 128k;
    proxy_buffers 4 256k;
    proxy_busy_buffers_size 256k;
    
    # Request body handling for large prompts
    client_max_body_size 50m;
    client_body_in_file_only off;
    
    # Logging format
    log_format llm_json '{"time":"$time_iso8601",'
                   '"method":"$request_method",'
                   '"uri":"$request_uri",'
                   '"status":$status,'
                   '"bytes":$body_bytes_sent,'
                   '"duration":$request_time,'
                   '"cache":"$upstream_cache_status"}';
    
    access_log /var/log/nginx/llm-access.log llm_json;
}

Key Performance Parameters

Parameter              Recommended Value    Purpose
worker_processes       auto (CPU cores)     Parallel request handling
worker_connections     4096+                Concurrent connections per worker
keepalive              32+                  Connection pooling to upstreams
proxy_buffers          4 256k               Response buffering
client_max_body_size   50m                  Large prompt support
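For the Prometheus and Grafana integration mentioned earlier, the usual starting point is the stub_status endpoint (part of the standard ngx_http_stub_status_module shipped with distribution packages), which exporters such as nginx-prometheus-exporter scrape. A minimal local-only sketch:

# Metrics endpoint, reachable only from the machine itself
server {
    listen 127.0.0.1:8080;

    location /stub_status {
        stub_status;        # active connections, accepts, handled, requests
        access_log off;     # keep scraper traffic out of the access log
        allow 127.0.0.1;
        deny all;
    }
}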
06

Rate Limiting

Protect your API quota and control costs with rate limiting

/etc/nginx/conf.d/rate-limiting.conf NGINX
# Rate limiting zones
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

server {
    location /v1/ {
        # Rate limiting with burst
        limit_req zone=api_limit burst=20 nodelay;
        limit_conn conn_limit 10;
        
        # limit_req/limit_conn reject with 503 by default; switch to
        # 429 so the error_page below actually fires
        limit_req_status 429;
        limit_conn_status 429;
        
        # Custom error response
        error_page 429 = @rate_limited;
        
        # ... proxy configuration ...
    }
    
    location @rate_limited {
        default_type application/json;
        return 429 '{"error":"Rate limit exceeded"}';
    }
}
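The zones above key on client IP. For the per-user limits mentioned in the feature list, key the zone on whatever identifies the caller instead. The X-Api-Key header here is a hypothetical auth convention, with a fallback to IP when it is absent:

# http context: per-user zone keyed on the caller's API key
map $http_x_api_key $rl_key {
    default $http_x_api_key;
    ""      $binary_remote_addr;   # anonymous callers fall back to IP
}

limit_req_zone $rl_key zone=user_limit:10m rate=5r/s;

server {
    location /v1/ {
        limit_req zone=user_limit burst=10 nodelay;
        limit_req_status 429;
        # ... proxy configuration ...
    }
}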