Python • FastAPI • Async

Python LLM Proxy with FastAPI

Build modern async LLM proxy servers with Python FastAPI. Leverage automatic OpenAPI documentation, type hints, and native async streaming for high-performance AI gateways.

Why FastAPI for LLM Proxies?

FastAPI has become the go-to framework for building AI and machine learning APIs in Python. Its async-first architecture perfectly matches the I/O-bound nature of LLM proxy workloads, where most time is spent waiting for upstream API responses rather than CPU computation.

The automatic OpenAPI documentation generation makes FastAPI LLM proxies self-documenting. Developers can explore available endpoints, request schemas, and response formats through an interactive Swagger UI, accelerating integration and reducing support overhead.

Type hints and Pydantic models provide static analysis support in the editor and runtime request validation, catching malformed requests and configuration errors before they reach production. This is particularly valuable for LLM APIs with complex nested request structures and multiple optional parameters.

300% Faster than Flask
Auto OpenAPI Docs
Native Async Support
Type Safe Validation

Complete Proxy Implementation

Build a production-ready LLM proxy with streaming support, authentication, and error handling. This implementation proxies OpenAI-compatible APIs while adding custom functionality.

main.py
from fastapi import FastAPI, HTTPException, Depends
from fastapi.responses import StreamingResponse, JSONResponse
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from pydantic import BaseModel
from typing import List, Optional
import httpx
import os

app = FastAPI(
    title="LLM Proxy Server",
    description="High-performance proxy for LLM APIs",
    version="1.0.0"
)

security = HTTPBearer()

# Pydantic models for request validation
class ChatMessage(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False

# Configuration
OPENAI_BASE_URL = os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Authentication dependency
async def verify_api_key(
    credentials: HTTPAuthorizationCredentials = Depends(security)
):
    valid_keys = os.getenv("VALID_API_KEYS", "").split(",")
    if credentials.credentials not in valid_keys:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return credentials.credentials

# Chat completions proxy with optional streaming
@app.post("/v1/chat/completions")
async def chat_completions(
    request: ChatRequest,
    api_key: str = Depends(verify_api_key)
):
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json"
    }

    if request.stream:
        # The HTTP client must outlive this handler, so the stream
        # generator below creates and closes its own client.
        return StreamingResponse(
            stream_response(request, headers),
            media_type="text/event-stream"
        )

    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(
            f"{OPENAI_BASE_URL}/chat/completions",
            json=request.dict(),
            headers=headers
        )
        return JSONResponse(
            content=response.json(),
            status_code=response.status_code
        )

# Stream generator for SSE responses
async def stream_response(request: ChatRequest, headers: dict):
    async with httpx.AsyncClient(timeout=120.0) as client:
        async with client.stream(
            "POST",
            f"{OPENAI_BASE_URL}/chat/completions",
            json=request.dict(),
            headers=headers
        ) as response:
            async for chunk in response.aiter_bytes():
                yield chunk

@app.get("/health")
async def health_check():
    return {"status": "healthy"}
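Start the proxy with uvicorn main:app after setting OPENAI_API_KEY and VALID_API_KEYS in the environment. The snippet below is a minimal client sketch, not part of the proxy itself; it assumes the server listens on localhost:8000 and that the hypothetical key proxy-key-1 is included in VALID_API_KEYS.

client_example.py
# Calling the proxy -- assumes it runs on localhost:8000 and that
# "proxy-key-1" appears in the proxy's VALID_API_KEYS.
import httpx

response = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer proxy-key-1"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=120.0,
)
print(response.json())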

Key Features

⚡

Native Async Support

Leverage Python's asyncio for concurrent request handling. Serve thousands of simultaneous connections with uvicorn and async/await.

📚

Auto Documentation

Interactive Swagger UI and ReDoc documentation generated automatically from your code. No separate documentation maintenance.

✅

Request Validation

Pydantic models validate and serialize requests automatically. Catch errors before they reach your proxy logic.

🔄

Streaming Support

Proxy streaming responses efficiently using async generators. Support real-time token delivery for chat applications.

🔐

Built-in Security

OAuth2, API key authentication, and dependency injection for clean security patterns. Protect your LLM endpoints.

🧪

Easy Testing

TestClient provides a simple interface for unit and integration testing. Mock external APIs with httpx mock transports.
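The sketch below illustrates the TestClient workflow against the proxy defined in main.py above. The key test-key is a placeholder that the test environment must list in VALID_API_KEYS; mocking the upstream OpenAI call with an httpx mock transport would additionally require injecting the transport into the proxy's client creation, which is omitted here.

test_proxy.py
# Minimal TestClient sketch for the proxy in main.py.
# "test-key" is a placeholder added to VALID_API_KEYS for the test run.
import os
os.environ["VALID_API_KEYS"] = "test-key"

from fastapi.testclient import TestClient
from main import app

client = TestClient(app)

def test_health():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "healthy"}

def test_rejects_invalid_key():
    response = client.post(
        "/v1/chat/completions",
        headers={"Authorization": "Bearer wrong-key"},
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "hi"}],
        },
    )
    assert response.status_code == 401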

Middleware Implementation

Add rate limiting, logging, and caching as FastAPI middleware. The middleware pattern enables clean separation of cross-cutting concerns.

middleware.py
from fastapi import Request, Response
from starlette.middleware.base import BaseHTTPMiddleware
from typing import Callable
import time
import json

class RateLimitMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, requests_per_minute: int = 60):
        super().__init__(app)
        self.rpm = requests_per_minute
        self.requests = {}  # client_ip -> [request_count, window_start]

    async def dispatch(self, request: Request, call_next: Callable):
        client_ip = request.client.host

        # Check rate limit within a 60-second window per client IP
        current_time = time.time()
        if client_ip in self.requests:
            count, window_start = self.requests[client_ip]
            if current_time - window_start > 60:
                self.requests[client_ip] = [1, current_time]
            elif count >= self.rpm:
                return Response(
                    content=json.dumps({"error": "Rate limit exceeded"}),
                    status_code=429,
                    media_type="application/json"
                )
            else:
                self.requests[client_ip][0] += 1
        else:
            self.requests[client_ip] = [1, current_time]

        return await call_next(request)


class LoggingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next: Callable):
        start_time = time.time()
        response = await call_next(request)
        duration = time.time() - start_time
        print(f"[{request.method}] {request.url.path} - {duration:.3f}s")
        return response
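To activate these middlewares, register them on the application from main.py. A minimal sketch, assuming middleware.py sits next to main.py; note that the middleware added last runs first on each request.

# In main.py, after creating the app -- registration sketch.
from middleware import RateLimitMiddleware, LoggingMiddleware

app.add_middleware(LoggingMiddleware)
app.add_middleware(RateLimitMiddleware, requests_per_minute=60)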

Benefits of FastAPI

Developer Experience

Automatic IDE support with type hints. Editor autocomplete catches errors before runtime.

Production Ready

Built on Starlette for production-grade performance. Handles high concurrency with minimal resources.

Extensible

Rich ecosystem of extensions and middleware. Add WebSocket support, CORS, and more (a CORS sketch follows this list).

Modern Python

Leverages Python 3.8+ features including async/await, type hints, and dataclasses.
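As referenced in the Extensible card above, enabling CORS is a single middleware registration. A minimal sketch for main.py; the allowed origin below is a placeholder for your frontend's domain.

# CORS sketch for main.py -- https://app.example.com is a placeholder origin.
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.example.com"],
    allow_methods=["GET", "POST"],
    allow_headers=["Authorization", "Content-Type"],
)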

Deployment Options

Docker: Package your FastAPI proxy in a minimal Docker container using Python slim images. Multi-stage builds keep image sizes small.

Kubernetes: Deploy with multiple replicas behind a service. Use health check endpoints for automatic pod management.

Serverless: Run on AWS Lambda using the Mangum adapter for ASGI compatibility, or on other serverless platforms with an equivalent ASGI adapter.
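For AWS Lambda, a thin ASGI entry point is enough. A minimal sketch, assuming the mangum package is installed and reusing the app from main.py; note that SSE streaming may be constrained by the platform's response buffering.

lambda_handler.py
# Serverless entry point sketch using Mangum.
from mangum import Mangum
from main import app

# Lambda is configured to invoke this callable.
handler = Mangum(app)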

Build Your FastAPI LLM Proxy

Create modern, async LLM gateways with Python's most popular web framework.

Get Started