Python • FastAPI • Async

Python LLM Proxy with FastAPI

Build modern async LLM proxy servers with Python FastAPI. Leverage automatic OpenAPI documentation, type hints, and native async streaming for high-performance AI gateways.

Why FastAPI for LLM Proxies?

FastAPI has become the go-to framework for building AI and machine learning APIs in Python. Its async-first architecture perfectly matches the I/O-bound nature of LLM proxy workloads, where most time is spent waiting for upstream API responses rather than CPU computation.

The automatic OpenAPI documentation generation makes FastAPI LLM proxies self-documenting. Developers can explore available endpoints, request schemas, and response formats through an interactive Swagger UI, accelerating integration and reducing support overhead.

Type hints and Pydantic models provide static analysis support in the editor and runtime request validation, catching malformed requests and configuration errors before they reach production. This is particularly valuable for LLM APIs with complex nested request structures and multiple optional parameters.

300% Faster than Flask
Auto OpenAPI Docs
Native Async Support
Type Safe Validation

Complete Proxy Implementation

Build a production-ready LLM proxy with streaming support, authentication, and error handling. This implementation proxies OpenAI-compatible APIs while adding custom functionality.

main.py
from fastapi import FastAPI, HTTPException, Depends
from fastapi.responses import StreamingResponse, JSONResponse
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from pydantic import BaseModel
from typing import List, Optional
import httpx
import os

app = FastAPI(
    title="LLM Proxy Server",
    description="High-performance proxy for LLM APIs",
    version="1.0.0"
)

security = HTTPBearer()

# Pydantic models for request validation
class ChatMessage(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False

# Configuration
OPENAI_BASE_URL = os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Authentication dependency
async def verify_api_key(
    credentials: HTTPAuthorizationCredentials = Depends(security)
):
    valid_keys = os.getenv("VALID_API_KEYS", "").split(",")
    if credentials.credentials not in valid_keys:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return credentials.credentials

# Chat completions proxy with optional streaming
@app.post("/v1/chat/completions")
async def chat_completions(
    request: ChatRequest,
    api_key: str = Depends(verify_api_key)
):
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json"
    }

    if request.stream:
        # The HTTP client must outlive this handler, so the stream
        # generator below creates and closes its own client.
        return StreamingResponse(
            stream_response(request, headers),
            media_type="text/event-stream"
        )

    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(
            f"{OPENAI_BASE_URL}/chat/completions",
            json=request.dict(),
            headers=headers
        )
        return JSONResponse(
            content=response.json(),
            status_code=response.status_code
        )

# Stream generator for SSE responses
async def stream_response(request: ChatRequest, headers: dict):
    async with httpx.AsyncClient(timeout=120.0) as client:
        async with client.stream(
            "POST",
            f"{OPENAI_BASE_URL}/chat/completions",
            json=request.dict(),
            headers=headers
        ) as response:
            async for chunk in response.aiter_bytes():
                yield chunk

@app.get("/health")
async def health_check():
    return {"status": "healthy"}
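Start the proxy with uvicorn main:app after setting OPENAI_API_KEY and VALID_API_KEYS in the environment. The snippet below is a minimal client sketch, not part of the proxy itself; it assumes the server listens on localhost:8000 and that the hypothetical key proxy-key-1 is included in VALID_API_KEYS.

client_example.py
# Calling the proxy -- assumes it runs on localhost:8000 and that
# "proxy-key-1" appears in the proxy's VALID_API_KEYS.
import httpx

response = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer proxy-key-1"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=120.0,
)
print(response.json())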

Key Features

⚡

Native Async Support

Leverage Python's asyncio for concurrent request handling. Serve thousands of simultaneous connections with uvicorn and async/await.

📚

Auto Documentation

Interactive Swagger UI and ReDoc documentation generated automatically from your code. No separate documentation maintenance.

✅

Request Validation

Pydantic models validate and serialize requests automatically. Catch errors before they reach your proxy logic.

🔄

Streaming Support

Proxy streaming responses efficiently using async generators. Support real-time token delivery for chat applications.

🔐

Built-in Security

OAuth2, API key authentication, and dependency injection for clean security patterns. Protect your LLM endpoints.

🧪

Easy Testing

TestClient provides a simple interface for unit and integration testing. Mock external APIs with httpx mock transports.
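The sketch below illustrates the TestClient workflow against the proxy defined in main.py above. The key test-key is a placeholder that the test environment must list in VALID_API_KEYS; mocking the upstream OpenAI call with an httpx mock transport would additionally require injecting the transport into the proxy's client creation, which is omitted here.

test_proxy.py
# Minimal TestClient sketch for the proxy in main.py.
# "test-key" is a placeholder added to VALID_API_KEYS for the test run.
import os
os.environ["VALID_API_KEYS"] = "test-key"

from fastapi.testclient import TestClient
from main import app

client = TestClient(app)

def test_health():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "healthy"}

def test_rejects_invalid_key():
    response = client.post(
        "/v1/chat/completions",
        headers={"Authorization": "Bearer wrong-key"},
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "hi"}],
        },
    )
    assert response.status_code == 401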

Middleware Implementation

Add rate limiting, logging, and caching as FastAPI middleware. The middleware pattern enables clean separation of cross-cutting concerns.

middleware.py
from fastapi import Request, Response
from starlette.middleware.base import BaseHTTPMiddleware
from typing import Callable
import time
import json

class RateLimitMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, requests_per_minute: int = 60):
        super().__init__(app)
        self.rpm = requests_per_minute
        self.requests = {}  # client_ip -> [request_count, window_start]

    async def dispatch(self, request: Request, call_next: Callable):
        client_ip = request.client.host

        # Check rate limit within a 60-second window per client IP
        current_time = time.time()
        if client_ip in self.requests:
            count, window_start = self.requests[client_ip]
            if current_time - window_start > 60:
                self.requests[client_ip] = [1, current_time]
            elif count >= self.rpm:
                return Response(
                    content=json.dumps({"error": "Rate limit exceeded"}),
                    status_code=429,
                    media_type="application/json"
                )
            else:
                self.requests[client_ip][0] += 1
        else:
            self.requests[client_ip] = [1, current_time]

        return await call_next(request)


class LoggingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next: Callable):
        start_time = time.time()
        response = await call_next(request)
        duration = time.time() - start_time
        print(f"[{request.method}] {request.url.path} - {duration:.3f}s")
        return response
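To activate these middlewares, register them on the application from main.py. A minimal sketch, assuming middleware.py sits next to main.py; note that the middleware added last runs first on each request.

# In main.py, after creating the app -- registration sketch.
from middleware import RateLimitMiddleware, LoggingMiddleware

app.add_middleware(LoggingMiddleware)
app.add_middleware(RateLimitMiddleware, requests_per_minute=60)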

Benefits of FastAPI

Developer Experience

Automatic IDE support with type hints. Editor autocomplete catches errors before runtime.

Production Ready

Built on Starlette for production-grade performance. Handles high concurrency with minimal resources.

Extensible

Rich ecosystem of extensions and middleware. Add WebSocket support, CORS, and more (a CORS sketch follows this list).

Modern Python

Leverages Python 3.8+ features including async/await, type hints, and dataclasses.
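As referenced in the Extensible card above, enabling CORS is a single middleware registration. A minimal sketch for main.py; the allowed origin below is a placeholder for your frontend's domain.

# CORS sketch for main.py -- https://app.example.com is a placeholder origin.
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.example.com"],
    allow_methods=["GET", "POST"],
    allow_headers=["Authorization", "Content-Type"],
)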

Deployment Options

Docker: Package your FastAPI proxy in a minimal Docker container using Python slim images. Multi-stage builds keep image sizes small.

Kubernetes: Deploy with multiple replicas behind a service. Use health check endpoints for automatic pod management.

Serverless: Run on AWS Lambda using the Mangum adapter for ASGI compatibility, or on other serverless platforms with an equivalent ASGI adapter.
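For AWS Lambda, a thin ASGI entry point is enough. A minimal sketch, assuming the mangum package is installed and reusing the app from main.py; note that SSE streaming may be constrained by the platform's response buffering.

lambda_handler.py
# Serverless entry point sketch using Mangum.
from mangum import Mangum
from main import app

# Lambda is configured to invoke this callable.
handler = Mangum(app)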

Build Your FastAPI LLM Proxy

Create modern, async LLM gateways with Python's most popular web framework.

Get Started