Introduction
NGINX is one of the most widely deployed reverse proxies and well suited to high-throughput LLM API traffic. This guide covers a complete production configuration, including multi-provider load balancing, intelligent caching strategies, SSL/TLS termination, rate limiting, and performance tuning.
Setting up NGINX as an LLM proxy provides several key advantages for production AI applications. NGINX excels at handling concurrent connections, provides sophisticated caching mechanisms, supports advanced load balancing algorithms, and offers fine-grained control over request routing. Whether you're proxying OpenAI, Anthropic, or multiple AI providers, NGINX provides the performance and flexibility needed for enterprise-grade deployments.
High Performance
Handle thousands of concurrent connections with minimal resource usage. Event-driven architecture ensures optimal performance under load.
Load Balancing
Distribute traffic across multiple LLM providers with configurable algorithms including round-robin, least connections, and weighted distribution.
Smart Caching
Reduce API costs by 40-70% on workloads with repeated queries through intelligent response caching. Configure cache keys, TTLs, and invalidation strategies for your use case.
SSL/TLS Termination
Offload SSL encryption from your application servers. Certificate issuance and renewal can be automated with Let's Encrypt and certbot.
Monitoring
Built-in metrics and logging for request tracking. Integrate with Prometheus, Grafana, and ELK stack for comprehensive observability.
Rate Limiting
Protect against quota exhaustion and cost overruns. Configure per-user, per-IP, or global rate limits with custom responses.
Basic Setup
Initial NGINX installation and configuration for LLM proxy functionality
Step 1: Install NGINX
Install NGINX on your server. Choose between package manager installation or Docker container based on your infrastructure preferences.
- Ubuntu/Debian: apt install nginx
- CentOS/RHEL: yum install nginx
- Docker: docker run nginx:alpine
- Compile from source for custom modules
Step 2: Create Configuration Directory
Organize your configuration files for maintainability. Separate upstream definitions, SSL settings, and location blocks into dedicated files; an include layout tying these directories together is sketched after this list.
- /etc/nginx/conf.d/upstreams/
- /etc/nginx/conf.d/ssl/
- /etc/nginx/sites-available/
- /etc/nginx/snippets/
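A minimal top-level layout, assuming one file per concern; the glob patterns and the snippets/proxy-defaults.conf name are illustrative conventions, not requirements:

```nginx
# /etc/nginx/nginx.conf (http block excerpt)
http {
    # Upstream pools, one file per provider
    include /etc/nginx/conf.d/upstreams/*.conf;

    # Shared SSL settings
    include /etc/nginx/conf.d/ssl/*.conf;

    # Enabled sites (symlinks pointing into sites-available)
    include /etc/nginx/sites-enabled/*;
}

# Inside an individual server block, pull in reusable fragments:
# include /etc/nginx/snippets/proxy-defaults.conf;  # illustrative name
```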
Step 3: Basic Proxy Configuration
Create the initial configuration file with essential proxy settings. This includes request forwarding, header manipulation, and timeout configuration.
```nginx
# Basic LLM Proxy Configuration
server {
    listen 80;
    server_name api.yourdomain.com;

    # Redirect HTTP to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    # SSL Configuration
    ssl_certificate /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;

    # Proxy to OpenAI API
    location /v1/ {
        proxy_pass https://api.openai.com/v1/;
        proxy_set_header Host api.openai.com;
        # $openai_api_key is not a built-in variable; it must be defined
        # (one approach is sketched after this block)
        proxy_set_header Authorization "Bearer $openai_api_key";
        proxy_ssl_server_name on;

        # Timeouts
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }
}
```
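NGINX does not expand environment variables in its configuration, so `$openai_api_key` has to come from somewhere. A minimal sketch, assuming the key is substituted into the file at deploy time (keep the file readable only by root, and never commit real keys):

```nginx
server {
    # Define the variable referenced by the Authorization header above.
    # The literal is a placeholder to be replaced by your deploy tooling.
    set $openai_api_key "sk-REPLACE_AT_DEPLOY";

    # ... rest of the server block ...
}
```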
Upstream Configuration
Configure multiple LLM providers with load balancing and health checks
```nginx
# OpenAI upstream
upstream openai_backend {
    least_conn;

    # Primary endpoint; add more server lines to balance across endpoints
    server api.openai.com:443 weight=3;

    # Keepalive connections
    keepalive 32;
    keepalive_timeout 60s;
}

# Anthropic upstream
upstream anthropic_backend {
    server api.anthropic.com:443;
    keepalive 16;
}

# Multi-provider upstream for weighted distribution. Note: the two
# providers expose incompatible APIs, so mixing them in one pool only
# works behind a request-translation layer.
upstream llm_fallback {
    server api.openai.com:443 weight=2;
    server api.anthropic.com:443 weight=1;
    keepalive 24;
}
```
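Pointing proxy_pass at these pools needs two more things: upstream keepalive only works over HTTP/1.1 with a cleared Connection header, and each provider must receive its own Host header and SNI. A sketch of location blocks wired to the upstreams above (the /openai/ and /anthropic/ prefixes are illustrative):

```nginx
location /openai/ {
    proxy_pass https://openai_backend/;
    proxy_http_version 1.1;
    proxy_set_header Connection "";        # required for upstream keepalive
    proxy_set_header Host api.openai.com;  # Host the provider expects
    proxy_ssl_server_name on;              # send SNI during the TLS handshake
    proxy_ssl_name api.openai.com;
}

location /anthropic/ {
    proxy_pass https://anthropic_backend/;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header Host api.anthropic.com;
    proxy_ssl_server_name on;
    proxy_ssl_name api.anthropic.com;
}
```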
Load Balancing Algorithms
| Algorithm | Use Case | Configuration |
|---|---|---|
| Round Robin | Equal distribution | Default (no directive) |
| Least Connections | Variable request times | least_conn; |
| IP Hash | Session persistence | ip_hash; |
| Weighted | Capacity-based routing | server ... weight=N; |
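Weights distribute steady-state traffic but do not provide failover on their own. For that, mark a secondary as backup so it only receives traffic when the primary is unavailable; a sketch:

```nginx
upstream llm_failover {
    # Primary: taken out of rotation for 30s after 3 failed attempts
    server api.openai.com:443 max_fails=3 fail_timeout=30s;

    # Backup: used only while the primary is marked down
    server api.anthropic.com:443 backup;

    keepalive 16;
}
```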
Response Caching
Implement intelligent caching to reduce API costs and improve response times
Response caching can reduce your LLM API costs by 40-70% for applications with repeated queries. Cache identical requests, similar prompts, or even implement semantic caching for maximum efficiency. Configure appropriate TTLs based on your content's freshness requirements.
```nginx
# Cache path configuration
proxy_cache_path /var/cache/nginx/llm levels=1:2 keys_zone=llm_cache:100m
                 max_size=10g inactive=24h use_temp_path=off;

# Cache configuration in server block
server {
    # ... other config ...

    location /v1/chat/completions {
        proxy_pass https://openai_backend/v1/chat/completions;

        # Enable caching
        proxy_cache llm_cache;

        # Chat completions are POSTs; NGINX caches only GET/HEAD by default
        proxy_cache_methods POST;

        # Cache key based on request body ($request_body is empty if the
        # body spills to a temp file, so size client_body_buffer_size to fit)
        proxy_cache_key "$request_method|$request_uri|$request_body";

        # Cache valid responses for 24 hours
        proxy_cache_valid 200 24h;
        proxy_cache_valid 429 1m;
        proxy_cache_valid 500 0;

        # Allow cache bypass
        proxy_cache_bypass $http_x_no_cache;

        # Add cache status header
        add_header X-Cache-Status $upstream_cache_status;

        # ... other proxy settings ...
    }
}
```
For chat completions, messages may have minor variations that don't affect responses. Consider normalizing whitespace, sorting message keys, or using semantic similarity for cache keys. Always test cache behavior before deploying to production.
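Correctness matters as much as hit rate: a shared cache must never serve one user's completion to another. A minimal sketch, assuming a hypothetical X-User-Id header marks authenticated, user-specific traffic:

```nginx
# http block: map any non-empty X-User-Id header to a cache-skip flag.
# The header name is illustrative; use whatever identifies users in your stack.
map $http_x_user_id $skip_cache {
    ""      0;  # anonymous traffic may share cached responses
    default 1;  # user-specific traffic bypasses the cache
}

# Inside the cached location block:
# proxy_cache_bypass $skip_cache;  # skip the cache lookup
# proxy_no_cache     $skip_cache;  # and do not store the response
```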
SSL/TLS Configuration
Secure your proxy endpoint with proper SSL/TLS configuration
```nginx
# Modern SSL Configuration
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
ssl_prefer_server_ciphers off;

# SSL Session Configuration
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 1d;
ssl_session_tickets off;

# HSTS
add_header Strict-Transport-Security "max-age=63072000" always;
```
Step A: Obtain SSL Certificate
Use Let's Encrypt for free, automatically-renewed certificates. Certbot handles certificate issuance and renewal automatically.
- Install certbot and python3-certbot-nginx
- Run: certbot --nginx -d api.yourdomain.com
- Test renewal: certbot renew --dry-run
- Auto-renewal configured via systemd timer
Step B: Configure OCSP Stapling
Improve SSL handshake performance by enabling OCSP stapling, which lets NGINX serve a cached certificate-status response so clients skip a round trip to the CA during validation; a complete snippet follows this list.
- ssl_stapling on;
- ssl_stapling_verify on;
- resolver 8.8.8.8 8.8.4.4 valid=300s;
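ssl_stapling_verify additionally requires the issuer chain via ssl_trusted_certificate, which the list above omits. A sketch assuming the Let's Encrypt layout used earlier:

```nginx
# Server block: OCSP stapling
ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/letsencrypt/live/api.yourdomain.com/chain.pem;

# Resolver for OCSP responder lookups
resolver 8.8.8.8 8.8.4.4 valid=300s;
resolver_timeout 5s;
```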
Performance Tuning
Optimize NGINX for high-throughput LLM API traffic
```nginx
# Worker Processes
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 4096;
    multi_accept on;
    use epoll;
}

http {
    # Connection optimization
    keepalive_timeout 65;
    keepalive_requests 1000;

    # Buffer settings
    client_body_buffer_size 128k;
    proxy_buffer_size 128k;
    proxy_buffers 4 256k;
    proxy_busy_buffers_size 256k;

    # Request body handling for large prompts
    client_max_body_size 50m;
    client_body_in_file_only off;

    # Logging format
    log_format llm_json '{"time":"$time_iso8601",'
                        '"method":"$request_method",'
                        '"uri":"$request_uri",'
                        '"status":$status,'
                        '"bytes":$body_bytes_sent,'
                        '"duration":$request_time,'
                        '"cache":"$upstream_cache_status"}';

    access_log /var/log/nginx/llm-access.log llm_json;
}
```
Key Performance Parameters
| Parameter | Recommended Value | Purpose |
|---|---|---|
| worker_processes | auto (CPU cores) | Parallel request handling |
| worker_connections | 4096+ | Concurrent connections per worker |
| keepalive | 32+ | Connection pooling to upstream |
| proxy_buffers | 4 256k | Response buffering |
| client_max_body_size | 50m | Large prompt support |
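One caveat these buffer settings do not cover: token streaming. Providers stream completions as server-sent events, and proxy buffering holds chunks back until the response finishes. A sketch, assuming streamed requests can be isolated to their own location (the path is illustrative):

```nginx
# Streaming (SSE) completions: disable buffering so tokens reach the
# client as the upstream emits them.
location /v1/chat/completions/stream {
    proxy_buffering off;      # flush upstream chunks immediately
    proxy_cache off;          # streamed responses cannot be cached
    proxy_read_timeout 300s;  # long generations hold the connection open
    # ... proxy_pass and headers as in the earlier location blocks ...
}
```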
Rate Limiting
Protect your API quota and control costs with rate limiting
```nginx
# Rate limiting zones
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

server {
    location /v1/ {
        # Rate limiting with burst
        limit_req zone=api_limit burst=20 nodelay;
        limit_conn conn_limit 10;

        # Rejections return 503 by default; use 429 so error_page matches
        limit_req_status 429;
        limit_conn_status 429;

        # Custom error response
        error_page 429 = @rate_limited;

        # ... proxy configuration ...
    }

    location @rate_limited {
        default_type application/json;
        return 429 '{"error":"Rate limit exceeded"}';
    }
}
```
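The zone above keys on client IP; the per-user limits mentioned earlier need a different key. A sketch, assuming clients authenticate with a Bearer token in the Authorization header:

```nginx
# http block: key rate limits on the bearer token when present, else on IP.
map $http_authorization $rl_key {
    default                 $binary_remote_addr;  # anonymous: per-IP
    "~^Bearer (?<token>.+)" $token;               # authenticated: per-token
}

limit_req_zone $rl_key zone=per_key_limit:10m rate=5r/s;

# Then, inside the location block:
# limit_req zone=per_key_limit burst=10 nodelay;
```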