Why Expose Ollama as an OpenAI-Compatible API?
Bridge the gap between local AI models and existing tooling
Complete Privacy
Run AI workloads entirely on your infrastructure. No data leaves your network, ensuring compliance with strict data protection requirements and enabling sensitive use cases that cannot rely on cloud APIs.
Zero Marginal Cost
After initial hardware investment, run unlimited API calls without per-token charges. Perfect for development, testing, and production workloads with predictable, fixed infrastructure costs.
Drop-in Replacement
Seamlessly integrate with existing OpenAI SDK code by simply changing the base URL. No code refactoring required for most applications, enabling quick migration to local models.
Low Latency
Eliminate network round-trips to cloud providers. Local inference provides consistent, low-latency responses ideal for real-time applications, interactive tools, and latency-sensitive workflows.
Implementation Guide
Step-by-step setup to expose Ollama with OpenAI compatibility
Install and Configure Ollama
Download Ollama from ollama.com and install it on your system. Pull your desired models using the CLI: ollama pull llama3, ollama pull mistral, or ollama pull codellama. Verify the installation with ollama list to see the available models.
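If you prefer to script this step rather than run CLI commands, Ollama's native REST API exposes the same operations. A minimal sketch, assuming the default server address and the requests library; the model name and timeout values are illustrative:

```python
# Sketch: pull a model and list installed models via Ollama's native REST API
# (equivalent to `ollama pull llama3` and `ollama list` on the CLI).
import requests

OLLAMA_BASE = "http://localhost:11434"  # default Ollama server address

# Pull a model; with stream disabled the server replies with a single status object.
pull = requests.post(
    f"{OLLAMA_BASE}/api/pull",
    json={"name": "llama3", "stream": False},
    timeout=600,
)
pull.raise_for_status()
print("pull status:", pull.json().get("status"))

# List installed models (the same data the CLI shows with `ollama list`).
tags = requests.get(f"{OLLAMA_BASE}/api/tags", timeout=30)
for model in tags.json().get("models", []):
    print(model["name"])
```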
Enable Ollama API Server
Ollama runs an API server on port 11434 by default. The native Ollama API provides model management, chat completions, and embeddings. Make sure the server is running with ollama serve, or check the menu bar icon on macOS and the system tray on Windows.
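To confirm the native API is reachable before adding any compatibility layer, you can call it directly. A minimal sketch, assuming the default port and a model such as llama3 already pulled:

```python
# Sketch: quick health check against Ollama's native API on the default port.
# /api/tags lists installed models; /api/chat runs a chat completion in
# Ollama's own (non-OpenAI) request format.
import requests

OLLAMA_BASE = "http://localhost:11434"

# Confirm the server is up and see which models it can serve.
models = requests.get(f"{OLLAMA_BASE}/api/tags", timeout=10).json().get("models", [])
print("installed models:", [m["name"] for m in models])

# A native chat request; note the shape differs from OpenAI's /v1/chat/completions.
reply = requests.post(
    f"{OLLAMA_BASE}/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "stream": False,
    },
    timeout=300,
)
print(reply.json()["message"]["content"])
```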
Deploy OpenAI-Compatible Layer
Use Ollama's built-in OpenAI compatibility endpoints under /v1, or deploy a translation layer such as the LiteLLM proxy or custom middleware that converts OpenAI-format requests into Ollama's native API format.
```python
# Simple FastAPI proxy that exposes Ollama behind OpenAI-style endpoints
import os
import time
import uuid

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
OLLAMA_BASE = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")


@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()

    # Map the OpenAI request format to Ollama's native /api/chat format.
    # Streaming is not handled by this minimal proxy, so it is forced off.
    ollama_request = {
        "model": body.get("model", "llama3"),
        "messages": body.get("messages", []),
        "stream": False,
        "options": {
            "temperature": body.get("temperature", 0.7),
            "num_predict": body.get("max_tokens", 2048),
        },
    }

    # Forward the request to Ollama.
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{OLLAMA_BASE}/api/chat",
            json=ollama_request,
            timeout=300.0,
        )
        data = response.json()

    # Map Ollama's response back to the OpenAI chat.completion shape.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": data.get("model", ollama_request["model"]),
        "choices": [
            {
                "index": 0,
                "message": data.get("message", {}),
                "finish_reason": "stop",
            }
        ],
        "usage": {
            "prompt_tokens": data.get("prompt_eval_count", 0),
            "completion_tokens": data.get("eval_count", 0),
            "total_tokens": data.get("prompt_eval_count", 0) + data.get("eval_count", 0),
        },
    }


@app.get("/v1/models")
async def list_models():
    # Ollama's /api/tags lists locally installed models.
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{OLLAMA_BASE}/api/tags")
    models = response.json().get("models", [])
    return {
        "object": "list",
        "data": [{"id": m["name"], "object": "model"} for m in models],
    }
```
Ollama 0.1.26+ includes native OpenAI compatibility. Simply point your OpenAI SDK to http://localhost:11434/v1 with any API key. No proxy needed for basic use cases. The /v1/chat/completions and /v1/models endpoints work out of the box with standard OpenAI client libraries.
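As a quick check of the built-in compatibility layer, listing models through the official OpenAI Python SDK should work with no proxy at all. A minimal sketch, assuming the openai package (v1+) is installed:

```python
# Sketch: verify the native OpenAI-compatible endpoints with the official SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Hits Ollama's /v1/models endpoint; each entry is a locally installed model.
for model in client.models.list():
    print(model.id)
```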
SDK Integration Examples
Connect your applications using familiar OpenAI SDKs
```python
from openai import OpenAI

# Point to the local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required by the SDK but unused
)

# Standard OpenAI API call
response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)
```
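Streaming responses work through the same compatibility endpoint. A sketch that reuses the client configuration above:

```python
# Sketch: stream tokens from a local model through the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)

# Chunks arrive as they are generated; delta.content can be None on the final chunk.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```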
Comparing Ollama Models with OpenAI
| Feature | Ollama Local | OpenAI Cloud |
|---|---|---|
| Data Privacy | ✓ Complete local control | Cloud processing |
| Cost per Token | ✓ Zero marginal cost | Pay per usage |
| Latency | ✓ Local network speed | Internet dependent |
| Model Selection | Open source models | ✓ Proprietary GPT-4 |
| Rate Limits | ✓ Hardware dependent only | API quotas apply |
| Offline Access | ✓ Works offline | Requires internet |
Advanced Configuration
Optimize your Ollama OpenAI gateway for production
```yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  openai-proxy:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
```
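After bringing the stack up with docker compose up -d, applications talk to the proxy service rather than to Ollama directly. A minimal sketch, assuming the openai-proxy container runs the FastAPI proxy shown earlier on port 8000:

```python
# Sketch: point the OpenAI SDK at the proxy service from the compose file above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # openai-proxy service, not Ollama itself
    api_key="ollama",                     # placeholder; the proxy does not check it
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Summarize why a local gateway helps."}],
)
print(response.choices[0].message.content)
```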
Explore more resources: LLM Proxy Authentication | Cloudflare Workers AI Gateway | LM Studio Proxy | Reduce LLM API Costs