NGINX LLM Proxy Configuration Guide
Configure a high-performance LLM API proxy with load balancing,
caching, SSL/TLS termination, rate limiting, and monitoring.
00

Introduction

NGINX is the industry-standard reverse proxy for handling high-throughput LLM API traffic. This guide covers a complete configuration for production deployments: multi-provider load balancing, intelligent caching strategies, SSL/TLS termination, rate limiting, and performance tuning.

Setting up NGINX as an LLM proxy provides several key advantages for production AI applications. NGINX excels at handling concurrent connections, provides sophisticated caching mechanisms, supports advanced load balancing algorithms, and offers fine-grained control over request routing. Whether you're proxying OpenAI, Anthropic, or multiple AI providers, NGINX provides the performance and flexibility needed for enterprise-grade deployments.

High Performance

Handle thousands of concurrent connections with minimal resource usage. NGINX's event-driven architecture keeps performance steady under load.

🔄 Load Balancing

Distribute traffic across multiple LLM providers with configurable algorithms, including round-robin, least connections, and weighted distribution.

💾 Smart Caching

Reduce API costs, often by 40-70% for workloads with repeated queries, with intelligent response caching. Configure cache keys, TTLs, and invalidation strategies for your use case.

🔒 SSL/TLS Termination

Offload TLS encryption from your application servers. Automatic certificate management with Let's Encrypt integration.

📊 Monitoring

Built-in metrics and logging for request tracking. Integrate with Prometheus, Grafana, and the ELK stack for comprehensive observability.

🛡️ Rate Limiting

Protect against quota exhaustion and cost overruns. Configure per-user, per-IP, or global rate limits with custom error responses.

01

Basic Setup

Initial NGINX installation and configuration for LLM proxy functionality

/etc/nginx/sites-available/llm-proxy.conf NGINX
# Basic LLM Proxy Configuration
server {
    listen 80;
    server_name api.yourdomain.com;

    # Redirect HTTP to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    # SSL Configuration
    ssl_certificate /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;

    # Proxy to OpenAI API
    location /v1/ {
        proxy_pass https://api.openai.com/v1/;
        proxy_set_header Host api.openai.com;

        # $openai_api_key is not a built-in variable; define it in an
        # included file kept out of version control, e.g.:
        #   set $openai_api_key "sk-...";
        # nginx -t will fail with "unknown variable" if it is missing.
        proxy_set_header Authorization "Bearer $openai_api_key";
        proxy_ssl_server_name on;
        
        # Timeouts (raise read timeout if long completions get cut off)
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }
}
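Chat completion endpoints typically stream tokens as server-sent events, but NGINX buffers upstream responses by default, holding the entire reply until generation finishes. Below is a minimal sketch of a streaming-friendly location, a more specific match that takes precedence over the /v1/ prefix above; it assumes the same $openai_api_key variable:

# Streaming (SSE) pass-through for token-by-token responses
location /v1/chat/completions {
    proxy_pass https://api.openai.com/v1/chat/completions;
    proxy_set_header Host api.openai.com;
    proxy_set_header Authorization "Bearer $openai_api_key";
    proxy_ssl_server_name on;

    proxy_buffering off;             # forward upstream chunks immediately
    proxy_http_version 1.1;          # needed for chunked transfer
    proxy_set_header Connection "";  # clear the default "close"
    proxy_read_timeout 300s;         # long generations keep the stream open
}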
02

Upstream Configuration

Configure multiple LLM providers with load balancing and health checks

// LLM Provider Load Balancing Architecture
//
// Client (Your Application)
//   -> NGINX (Load Balancer)
//        -> OpenAI (GPT-4/Turbo)
//        -> Anthropic (Claude 3)
//        -> Azure (OpenAI Service)
/etc/nginx/conf.d/upstreams/llm-providers.conf NGINX
# OpenAI Upstream with multiple endpoints
upstream openai_backend {
    least_conn;
    
    # Primary endpoint
    server api.openai.com:443 weight=3;
    
    # Keepalive connections
    keepalive 32;
    keepalive_timeout 60s;
}

# Anthropic Upstream
upstream anthropic_backend {
    server api.anthropic.com:443;
    keepalive 16;
}

# Multi-provider upstream for load balancing
# Note: OpenAI and Anthropic expose different paths, auth headers, and
# request schemas, so a mixed pool like this only works behind a
# translation layer or with providers offering compatible APIs.
upstream llm_fallback {
    server api.openai.com:443 weight=2;
    server api.anthropic.com:443 weight=1;
    keepalive 24;
}
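Upstream keepalive only takes effect when the proxied connection uses HTTP/1.1 with the Connection header cleared, and proxying to a named upstream over TLS requires the SNI name to be set explicitly (it otherwise defaults to the upstream's name). A sketch of a location wired to the openai_backend pool above:

location /v1/ {
    proxy_pass https://openai_backend;

    # Required for upstream keepalive connections to be reused
    proxy_http_version 1.1;
    proxy_set_header Connection "";

    # TLS to the upstream: send the correct SNI and Host
    proxy_set_header Host api.openai.com;
    proxy_ssl_server_name on;
    proxy_ssl_name api.openai.com;

    # Try the next pool member on connection-level failures
    proxy_next_upstream error timeout http_502 http_503;
}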

Load Balancing Algorithms

Algorithm           Use Case                  Configuration
Round Robin         Equal distribution        (default, no directive)
Least Connections   Variable request times    least_conn;
IP Hash             Session persistence       ip_hash;
Weighted            Capacity-based routing    weight=N;
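These algorithms compose with per-server parameters. A sketch combining least_conn with weights and a failover-only backup, using hypothetical internal gateway hosts (llm-gw-*.internal is a placeholder naming scheme):

upstream llm_gateways {
    least_conn;
    server llm-gw-1.internal:8443 weight=2;   # larger instance gets more traffic
    server llm-gw-2.internal:8443 weight=1;
    server llm-gw-3.internal:8443 backup;     # only used when the others are down
    keepalive 16;
}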
03

Response Caching

Implement intelligent caching to reduce API costs and improve response times

💡 Cost Savings with Caching

Response caching can reduce your LLM API costs by 40-70% for applications with repeated queries. Cache identical requests, similar prompts, or even implement semantic caching for maximum efficiency. Configure appropriate TTLs based on your content's freshness requirements.

/etc/nginx/conf.d/caching.conf NGINX
# Cache path configuration
proxy_cache_path /var/cache/nginx/llm
    levels=1:2
    keys_zone=llm_cache:100m
    max_size=10g
    inactive=24h
    use_temp_path=off;

# Cache configuration in server block
server {
    # ... other config ...
    
    location /v1/chat/completions {
        proxy_pass https://openai_backend/v1/chat/completions;
        
        # Enable caching
        proxy_cache llm_cache;
        
        # NGINX only caches GET/HEAD by default; completions are POST
        proxy_cache_methods POST;
        
        # Cache key based on request body. Note: $request_body is only
        # populated when the body fits in client_body_buffer_size;
        # larger bodies spill to disk and would collide on an empty key.
        proxy_cache_key "$request_method|$request_uri|$request_body";
        
        # Per-status TTLs: cache successes for 24 hours, rate-limit
        # responses briefly, and never cache server errors
        proxy_cache_valid 200 24h;
        proxy_cache_valid 429 1m;
        proxy_cache_valid 500 0;
        
        # Allow cache bypass via an X-No-Cache request header
        proxy_cache_bypass $http_x_no_cache;
        
        # Expose HIT/MISS/BYPASS status to clients
        add_header X-Cache-Status $upstream_cache_status;
        
        # ... other proxy settings ...
    }
}
⚠️ Cache Key Considerations

For chat completions, messages may have minor variations that don't affect responses. Consider normalizing whitespace, sorting message keys, or using semantic similarity for cache keys. Always test cache behavior before deploying to production to ensure expected functionality.
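One related caveat: streamed responses should not be written to or served from the cache. Because proxy_cache accepts a variable (and the value off disables caching), a map can route such requests around it. The X-Stream request header below is a hypothetical client convention; substitute whatever signal your application uses:

# In the http block: choose a cache zone per request
map $http_x_stream $llm_cache_zone {
    default llm_cache;   # normal requests use the shared cache
    "1"     off;         # streaming requests bypass caching entirely
}

# Then, inside the location block shown above:
#     proxy_cache $llm_cache_zone;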

04

SSL/TLS Configuration

Secure your proxy endpoint with proper SSL/TLS configuration

/etc/nginx/snippets/ssl-params.conf NGINX
# Modern SSL configuration
# (the cipher list below applies to TLS 1.2; TLS 1.3 cipher suites
# are negotiated automatically by OpenSSL)
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
ssl_prefer_server_ciphers off;

# SSL Session Configuration
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 1d;
ssl_session_tickets off;

# HSTS
add_header Strict-Transport-Security "max-age=63072000" always;
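The snippet is designed to be pulled into every TLS-terminating server block with an include, so all virtual hosts share one TLS policy. For example, the basic server from section 01 becomes:

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;

    # Shared TLS policy from the snippet above
    include /etc/nginx/snippets/ssl-params.conf;

    # ... locations ...
}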
05

Performance Tuning

Optimize NGINX for high-throughput LLM API traffic

/etc/nginx/nginx.conf NGINX
# Worker Processes
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 4096;
    multi_accept on;
    use epoll;    # Linux-specific; omit on BSD/macOS (kqueue is used there)
}

http {
    # Connection optimization
    keepalive_timeout 65;
    keepalive_requests 1000;
    
    # Buffer settings. Keep client_body_buffer_size large enough to
    # hold prompt bodies in memory; $request_body (used in cache keys
    # above) is empty once a body spills to a temp file.
    client_body_buffer_size 128k;
    proxy_buffer_size 128k;
    proxy_buffers 4 256k;
    proxy_busy_buffers_size 256k;
    
    # Request body handling for large prompts
    client_max_body_size 50m;
    client_body_in_file_only off;
    
    # Logging format
    log_format llm_json '{"time":"$time_iso8601",'
                   '"method":"$request_method",'
                   '"uri":"$request_uri",'
                   '"status":$status,'
                   '"bytes":$body_bytes_sent,'
                   '"duration":$request_time,'
                   '"cache":"$upstream_cache_status"}';
    
    access_log /var/log/nginx/llm-access.log llm_json;
}

Key Performance Parameters

Parameter              Recommended Value    Purpose
worker_processes       auto (CPU cores)     Parallel request handling
worker_connections     4096+                Concurrent connections per worker
keepalive              32+                  Connection pooling to upstreams
proxy_buffers          4 256k               Response buffering
client_max_body_size   50m                  Large prompt support
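For the Prometheus and Grafana integration mentioned earlier, the usual starting point is the stub_status endpoint (part of the standard ngx_http_stub_status_module shipped with distribution packages), which exporters such as nginx-prometheus-exporter scrape. A minimal local-only sketch:

# Metrics endpoint, reachable only from the machine itself
server {
    listen 127.0.0.1:8080;

    location /stub_status {
        stub_status;        # active connections, accepts, handled, requests
        access_log off;     # keep scraper traffic out of the access log
        allow 127.0.0.1;
        deny all;
    }
}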
06

Rate Limiting

Protect your API quota and control costs with rate limiting

/etc/nginx/conf.d/rate-limiting.conf NGINX
# Rate limiting zones
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

server {
    location /v1/ {
        # Rate limiting with burst
        limit_req zone=api_limit burst=20 nodelay;
        limit_conn conn_limit 10;
        
        # limit_req/limit_conn reject with 503 by default; switch to
        # 429 so the error_page below actually fires
        limit_req_status 429;
        limit_conn_status 429;
        
        # Custom error response
        error_page 429 = @rate_limited;
        
        # ... proxy configuration ...
    }
    
    location @rate_limited {
        default_type application/json;
        return 429 '{"error":"Rate limit exceeded"}';
    }
}
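The zones above key on client IP. For the per-user limits mentioned in the feature list, key the zone on whatever identifies the caller instead. The X-Api-Key header here is a hypothetical auth convention, with a fallback to IP when it is absent:

# http context: per-user zone keyed on the caller's API key
map $http_x_api_key $rl_key {
    default $http_x_api_key;
    ""      $binary_remote_addr;   # anonymous callers fall back to IP
}

limit_req_zone $rl_key zone=user_limit:10m rate=5r/s;

server {
    location /v1/ {
        limit_req zone=user_limit burst=10 nodelay;
        limit_req_status 429;
        # ... proxy configuration ...
    }
}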