Why Expose Ollama as an OpenAI-Compatible API?
Bridge the gap between local AI models and existing tooling
Complete Privacy
Run AI workloads entirely on your infrastructure. No data leaves your network, ensuring compliance with strict data protection requirements and enabling sensitive use cases that cannot rely on cloud APIs.
Zero Marginal Cost
After initial hardware investment, run unlimited API calls without per-token charges. Perfect for development, testing, and production workloads with predictable, fixed infrastructure costs.
Drop-in Replacement
Seamlessly integrate with existing OpenAI SDK code by simply changing the base URL. No code refactoring required for most applications, enabling quick migration to local models.
Low Latency
Eliminate network round-trips to cloud providers. Local inference provides consistent, low-latency responses ideal for real-time applications, interactive tools, and latency-sensitive workflows.
Implementation Guide
Step-by-step setup to expose Ollama with OpenAI compatibility
Install and Configure Ollama
Download Ollama from ollama.com and install it on your system. Pull your desired models using the CLI: ollama pull llama3, ollama pull mistral, or ollama pull codellama. Verify the installation with ollama list to see the available models.
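If you prefer to script this step rather than run CLI commands, Ollama's native REST API exposes the same operations. A minimal sketch, assuming the default server address and the requests library; the model name and timeout values are illustrative:

```python
# Sketch: pull a model and list installed models via Ollama's native REST API
# (equivalent to `ollama pull llama3` and `ollama list` on the CLI).
import requests

OLLAMA_BASE = "http://localhost:11434"  # default Ollama server address

# Pull a model; with stream disabled the server replies with a single status object.
pull = requests.post(
    f"{OLLAMA_BASE}/api/pull",
    json={"name": "llama3", "stream": False},
    timeout=600,
)
pull.raise_for_status()
print("pull status:", pull.json().get("status"))

# List installed models (the same data the CLI shows with `ollama list`).
tags = requests.get(f"{OLLAMA_BASE}/api/tags", timeout=30)
for model in tags.json().get("models", []):
    print(model["name"])
```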
Enable Ollama API Server
Ollama runs an API server on port 11434 by default. The native Ollama API provides model management, chat completions, and embeddings. Make sure the server is running with ollama serve, or check the menu bar icon on macOS and the system tray on Windows.
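To confirm the native API is reachable before adding any compatibility layer, you can call it directly. A minimal sketch, assuming the default port and a model such as llama3 already pulled:

```python
# Sketch: quick health check against Ollama's native API on the default port.
# /api/tags lists installed models; /api/chat runs a chat completion in
# Ollama's own (non-OpenAI) request format.
import requests

OLLAMA_BASE = "http://localhost:11434"

# Confirm the server is up and see which models it can serve.
models = requests.get(f"{OLLAMA_BASE}/api/tags", timeout=10).json().get("models", [])
print("installed models:", [m["name"] for m in models])

# A native chat request; note the shape differs from OpenAI's /v1/chat/completions.
reply = requests.post(
    f"{OLLAMA_BASE}/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "stream": False,
    },
    timeout=300,
)
print(reply.json()["message"]["content"])
```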
Deploy OpenAI-Compatible Layer
Use Ollama's built-in OpenAI compatibility endpoints under /v1, or deploy a translation layer such as the LiteLLM proxy or custom middleware that converts OpenAI-format requests into Ollama's native API format.
```python
# Simple FastAPI proxy that exposes Ollama behind OpenAI-style endpoints
import os
import time
import uuid

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
OLLAMA_BASE = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")


@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()

    # Map the OpenAI request format to Ollama's native /api/chat format.
    # Streaming is not handled by this minimal proxy, so it is forced off.
    ollama_request = {
        "model": body.get("model", "llama3"),
        "messages": body.get("messages", []),
        "stream": False,
        "options": {
            "temperature": body.get("temperature", 0.7),
            "num_predict": body.get("max_tokens", 2048),
        },
    }

    # Forward the request to Ollama.
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{OLLAMA_BASE}/api/chat",
            json=ollama_request,
            timeout=300.0,
        )
        data = response.json()

    # Map Ollama's response back to the OpenAI chat.completion shape.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": data.get("model", ollama_request["model"]),
        "choices": [
            {
                "index": 0,
                "message": data.get("message", {}),
                "finish_reason": "stop",
            }
        ],
        "usage": {
            "prompt_tokens": data.get("prompt_eval_count", 0),
            "completion_tokens": data.get("eval_count", 0),
            "total_tokens": data.get("prompt_eval_count", 0) + data.get("eval_count", 0),
        },
    }


@app.get("/v1/models")
async def list_models():
    # Ollama's /api/tags lists locally installed models.
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{OLLAMA_BASE}/api/tags")
    models = response.json().get("models", [])
    return {
        "object": "list",
        "data": [{"id": m["name"], "object": "model"} for m in models],
    }
```
Ollama 0.1.26+ includes native OpenAI compatibility. Simply point your OpenAI SDK to http://localhost:11434/v1 with any API key. No proxy needed for basic use cases. The /v1/chat/completions and /v1/models endpoints work out of the box with standard OpenAI client libraries.
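As a quick check of the built-in compatibility layer, listing models through the official OpenAI Python SDK should work with no proxy at all. A minimal sketch, assuming the openai package (v1+) is installed:

```python
# Sketch: verify the native OpenAI-compatible endpoints with the official SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Hits Ollama's /v1/models endpoint; each entry is a locally installed model.
for model in client.models.list():
    print(model.id)
```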
SDK Integration Examples
Connect your applications using familiar OpenAI SDKs
```python
from openai import OpenAI

# Point to the local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required by the SDK but unused
)

# Standard OpenAI API call
response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)
```
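Streaming responses work through the same compatibility endpoint. A sketch that reuses the client configuration above:

```python
# Sketch: stream tokens from a local model through the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)

# Chunks arrive as they are generated; delta.content can be None on the final chunk.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```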
Comparing Ollama Models with OpenAI
| Feature | Ollama Local | OpenAI Cloud |
|---|---|---|
| Data Privacy | ✓ Complete local control | Cloud processing |
| Cost per Token | ✓ Zero marginal cost | Pay per usage |
| Latency | ✓ Local network speed | Internet dependent |
| Model Selection | Open source models | ✓ Proprietary GPT-4 |
| Rate Limits | ✓ Hardware dependent only | API quotas apply |
| Offline Access | ✓ Works offline | Requires internet |
Advanced Configuration
Optimize your Ollama OpenAI gateway for production
```yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  openai-proxy:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
```
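After bringing the stack up with docker compose up -d, applications talk to the proxy service rather than to Ollama directly. A minimal sketch, assuming the openai-proxy container runs the FastAPI proxy shown earlier on port 8000:

```python
# Sketch: point the OpenAI SDK at the proxy service from the compose file above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # openai-proxy service, not Ollama itself
    api_key="ollama",                     # placeholder; the proxy does not check it
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Summarize why a local gateway helps."}],
)
print(response.choices[0].message.content)
```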
Explore more resources: LLM Proxy Authentication | Cloudflare Workers AI Gateway | LM Studio Proxy | Reduce LLM API Costs