
Go LLM Proxy Server

Build high-performance LLM proxy servers with Go's exceptional concurrency model. Leverage goroutines, channels, and the standard library for production-ready AI gateways.

Why Build an LLM Proxy in Go?

Go's unique combination of simplicity, performance, and built-in concurrency makes it an excellent choice for building LLM proxy servers. The language's design philosophy aligns perfectly with the requirements of high-throughput AI API gateways that need to handle thousands of concurrent connections efficiently.

The goroutine-based concurrency model enables straightforward handling of simultaneous LLM API requests without the complexity of thread pools or callback-based async patterns. Each incoming request spawns a lightweight goroutine, allowing Go to efficiently multiplex thousands of connections onto a small number of OS threads.
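As a minimal illustration (a standalone toy server, not part of the proxy below), the standard net/http server already dispatches each request to its own goroutine, so no explicit thread or pool management is needed:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// net/http runs each handler invocation in its own goroutine,
	// so a slow handler only delays the request that hit it.
	http.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second) // stand-in for a slow upstream LLM call
		fmt.Fprintln(w, "done")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}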

Go's rich standard library provides everything needed for HTTP servers, JSON processing, and TLS termination out of the box. Combined with the strong typing and compile-time error checking, Go enables building reliable LLM proxies with minimal external dependencies.

~2 KB Initial Goroutine Stack
Millions of Concurrent Connections
~10 MB Binary Size
<1 ms Proxy Overhead

Core Implementation

Let's build a production-ready LLM proxy server in Go. This implementation includes request forwarding, response streaming, and error handling for OpenAI-compatible APIs.

main.go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// ProxyServer handles LLM API requests.
type ProxyServer struct {
	targetURL  string
	httpClient *http.Client
	apiKey     string
}

// ChatRequest represents an OpenAI-compatible chat completion request.
type ChatRequest struct {
	Model    string        `json:"model"`
	Messages []ChatMessage `json:"messages"`
	Stream   bool          `json:"stream"`
}

type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

func NewProxyServer(targetURL, apiKey string) *ProxyServer {
	return &ProxyServer{
		targetURL: targetURL,
		apiKey:    apiKey,
		httpClient: &http.Client{
			Timeout: 120 * time.Second,
		},
	}
}

// HandleChat processes chat completion requests.
func (p *ProxyServer) HandleChat(w http.ResponseWriter, r *http.Request) {
	// Parse the incoming request.
	var req ChatRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Build the proxy request, propagating the client's context so
	// upstream work is cancelled if the client disconnects.
	body, err := json.Marshal(req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	proxyReq, err := http.NewRequestWithContext(
		r.Context(),
		http.MethodPost,
		p.targetURL+"/v1/chat/completions",
		bytes.NewReader(body),
	)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Set headers.
	proxyReq.Header.Set("Authorization", "Bearer "+p.apiKey)
	proxyReq.Header.Set("Content-Type", "application/json")

	// Forward the request upstream.
	resp, err := p.httpClient.Do(proxyReq)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// Stream the response back, preserving the upstream status code.
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}

func main() {
	proxy := NewProxyServer(
		"https://api.openai.com",
		os.Getenv("OPENAI_API_KEY"),
	)

	http.HandleFunc("/v1/chat/completions", proxy.HandleChat)

	log.Println("Server starting on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
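To try it locally, assuming a valid OPENAI_API_KEY is set (the model name below is only an example; use whatever your upstream supports):

export OPENAI_API_KEY=...
go run main.go

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'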

Advanced Features

Goroutine Concurrency

Handle thousands of simultaneous connections with lightweight goroutines. Each request runs independently without blocking others.


Streaming Support

Efficiently proxy streaming responses using io.Pipe and the http.Flusher interface. Support real-time token delivery for chat applications.
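A sketch of what the streaming path might look like, assuming the upstream body is a server-sent-events stream; StreamResponse and the buffer size are illustrative, not part of the proxy above:

package main

import (
	"fmt"
	"io"
	"net/http"
)

// StreamResponse copies an upstream streaming body to the client,
// flushing after every chunk so tokens arrive in real time.
func StreamResponse(w http.ResponseWriter, upstream io.Reader) error {
	flusher, ok := w.(http.Flusher)
	if !ok {
		return fmt.Errorf("response writer does not support flushing")
	}
	w.Header().Set("Content-Type", "text/event-stream")

	buf := make([]byte, 4096)
	for {
		n, err := upstream.Read(buf)
		if n > 0 {
			if _, werr := w.Write(buf[:n]); werr != nil {
				return werr
			}
			flusher.Flush() // push this chunk to the client immediately
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}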


Built-in Metrics

Expose basic metrics with the standard expvar package, or Prometheus metrics via a dedicated client library. Monitor request rates, latencies, and errors.
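A minimal sketch using the standard library's expvar, which publishes counters as JSON on /debug/vars; the metric names are illustrative, and a Prometheus setup would use prometheus/client_golang instead:

package main

import (
	"expvar"
	"net/http"
)

var (
	requestsTotal = expvar.NewInt("proxy_requests_total")
	errorsTotal   = expvar.NewInt("proxy_errors_total")
)

// CountRequests increments a request counter around the wrapped handler.
// Importing expvar registers /debug/vars on the default mux, so the
// counters are visible without extra wiring.
func CountRequests(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Add(1)
		next.ServeHTTP(w, r)
	})
}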


Middleware Pattern

Implement authentication, rate limiting, and logging as composable middleware chains. Clean separation of concerns.


Response Caching

Implement caching layers using sync.Map or Redis clients. Reduce upstream API costs for repeated queries.
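A sketch of an in-process cache built on sync.Map; the key would typically be a hash of the request body, and the TTL handling is simplified (expired entries are evicted lazily on read):

package main

import (
	"sync"
	"time"
)

// cacheEntry pairs a cached response body with an expiry time.
type cacheEntry struct {
	body      []byte
	expiresAt time.Time
}

// ResponseCache is a minimal TTL cache safe for concurrent use.
type ResponseCache struct {
	entries sync.Map // string -> cacheEntry
	ttl     time.Duration
}

func NewResponseCache(ttl time.Duration) *ResponseCache {
	return &ResponseCache{ttl: ttl}
}

// Get returns the cached body for key, evicting it if expired.
func (c *ResponseCache) Get(key string) ([]byte, bool) {
	v, ok := c.entries.Load(key)
	if !ok {
		return nil, false
	}
	entry := v.(cacheEntry)
	if time.Now().After(entry.expiresAt) {
		c.entries.Delete(key)
		return nil, false
	}
	return entry.body, true
}

// Set stores body under key with the cache's TTL.
func (c *ResponseCache) Set(key string, body []byte) {
	c.entries.Store(key, cacheEntry{body: body, expiresAt: time.Now().Add(c.ttl)})
}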


Load Balancing

Distribute requests across multiple LLM providers using round-robin or weighted algorithms. Implement health checking.
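A round-robin selector might look like the following sketch (health checking and weighting are left out, and the names are illustrative):

package main

import "sync/atomic"

// RoundRobin rotates through upstream base URLs, safe for concurrent use.
type RoundRobin struct {
	targets []string
	counter atomic.Uint64
}

func NewRoundRobin(targets []string) *RoundRobin {
	return &RoundRobin{targets: targets}
}

// Next returns the next target in rotation.
func (rr *RoundRobin) Next() string {
	n := rr.counter.Add(1)
	return rr.targets[int((n-1)%uint64(len(rr.targets)))]
}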

Middleware Implementation

Implement authentication, rate limiting, and logging as composable middleware functions. This pattern enables clean separation of concerns and easy extensibility.

middleware.go
package main

import (
	"net/http"
	"sync"
	"time"
)

// Middleware is a function that wraps an http.Handler.
type Middleware func(http.Handler) http.Handler

// Chain composes multiple middleware around a handler.
func Chain(h http.Handler, middlewares ...Middleware) http.Handler {
	for _, m := range middlewares {
		h = m(h)
	}
	return h
}

// RateLimit returns token-bucket rate-limiting middleware.
func RateLimit(requestsPerSecond int) Middleware {
	var mu sync.Mutex
	tokens := float64(requestsPerSecond)
	lastUpdate := time.Now()

	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			mu.Lock()
			// Refill the bucket based on elapsed time. Tokens are tracked
			// as float64 so fractional refills are not lost between
			// closely spaced requests.
			now := time.Now()
			elapsed := now.Sub(lastUpdate).Seconds()
			tokens += elapsed * float64(requestsPerSecond)
			if tokens > float64(requestsPerSecond) {
				tokens = float64(requestsPerSecond)
			}
			lastUpdate = now

			if tokens < 1 {
				mu.Unlock()
				http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
				return
			}
			tokens--
			mu.Unlock()

			next.ServeHTTP(w, r)
		})
	}
}

// Authenticate returns middleware that checks the X-API-Key header
// against a set of valid keys.
func Authenticate(validKeys map[string]bool) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			key := r.Header.Get("X-API-Key")
			if !validKeys[key] {
				http.Error(w, "unauthorized", http.StatusUnauthorized)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}
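Wiring the middleware into the proxy might look like the following sketch, which would replace the main function from main.go; the rate limit and API key set are placeholders. Note that with Chain as written, the last middleware listed ends up outermost, so Authenticate runs before RateLimit:

package main

import (
	"log"
	"net/http"
	"os"
)

func main() {
	proxy := NewProxyServer("https://api.openai.com", os.Getenv("OPENAI_API_KEY"))

	handler := Chain(
		http.HandlerFunc(proxy.HandleChat),
		RateLimit(100), // placeholder: 100 requests/second
		Authenticate(map[string]bool{"demo-key": true}), // placeholder key set
	)

	http.Handle("/v1/chat/completions", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}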

Architecture Overview

Request Processing Pipeline

Client Request
Go Proxy
Middleware
LLM Provider
Response

The architecture leverages Go's net/http server as the foundation. Incoming requests pass through a middleware chain for authentication, rate limiting, and logging before being forwarded to the target LLM provider. Streaming responses are piped directly back to clients with minimal overhead.

Key Benefits

Single Binary Deployment

Compile to a static binary with no runtime dependencies. Deploy anywhere without worrying about interpreter versions or library conflicts.

Efficient Memory Usage

Go's garbage collector is optimized for low latency. Handle high request volumes with predictable memory footprint and minimal pause times.

Cross-Platform Support

Compile for Linux, macOS, Windows, and ARM architectures with ease. Deploy on servers, containers, or edge devices from the same codebase.

Strong Typing

Catch errors at compile time with Go's type system. Refactor confidently with IDE support and static analysis tools.

Built-in Testing

Write unit tests, benchmarks, and integration tests using the standard testing package. Achieve high test coverage with minimal tooling.

Rich Ecosystem

Leverage hundreds of high-quality packages for Redis, PostgreSQL, Prometheus, and more. The Go community maintains production-ready libraries.

Production Considerations

Graceful Shutdown: Implement graceful shutdown to complete in-flight requests before terminating. http.Server's Shutdown method stops accepting new connections and waits for active requests to finish, bounded by a context deadline.
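A minimal sketch; the 30-second drain deadline is an arbitrary choice:

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for SIGINT or SIGTERM, then drain in-flight requests.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, os.Interrupt, syscall.SIGTERM)
	<-stop

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}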

Health Checks: Expose health check endpoints for load balancers and orchestration platforms. Include dependency status for comprehensive monitoring.
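A sketch of such an endpoint; checkUpstream is a hypothetical dependency probe, stubbed out here:

package main

import (
	"encoding/json"
	"net/http"
)

// checkUpstream is a placeholder; a real probe might issue a cheap
// request to the LLM provider and return non-nil on failure.
func checkUpstream() error { return nil }

// healthHandler reports overall and dependency status for load
// balancers and orchestrators.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	status := map[string]string{"status": "ok"}
	code := http.StatusOK
	if err := checkUpstream(); err != nil {
		status["status"] = "degraded"
		status["upstream"] = err.Error()
		code = http.StatusServiceUnavailable
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	json.NewEncoder(w).Encode(status)
}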

Configuration Management: Use environment variables or configuration files for deployment flexibility. Consider Viper for hierarchical configuration with multiple sources.
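For simple deployments, plain environment variables with defaults go a long way before reaching for Viper; the variable names below are illustrative:

package main

import "os"

// envOr reads an environment variable, falling back to a default.
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

// Usage (illustrative variable names):
//   listenAddr := envOr("PROXY_LISTEN_ADDR", ":8080")
//   targetURL  := envOr("PROXY_TARGET_URL", "https://api.openai.com")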

Structured Logging: Implement structured logging with zap or zerolog for production deployments. Include request IDs, timing information, and error details.
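A sketch of request logging middleware with zap (an external dependency, go.uber.org/zap); the field names are illustrative:

package main

import (
	"net/http"
	"time"

	"go.uber.org/zap"
)

// LogRequests logs method, path, and latency for every request.
// A production logger can be constructed with zap.NewProduction().
func LogRequests(logger *zap.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		logger.Info("request handled",
			zap.String("method", r.Method),
			zap.String("path", r.URL.Path),
			zap.Duration("elapsed", time.Since(start)),
		)
	})
}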

Start Building Your Go LLM Proxy

Build high-performance, production-ready LLM gateways with Go's powerful concurrency primitives.
