Why Cache LLM Responses?
Caching is one of the most effective strategies for reducing LLM API costs and improving response times. Every cached response eliminates an expensive API call while delivering instant results to users. For applications with repetitive query patterns, caching can reduce API volume by 40-70%, translating directly to proportional cost savings.
Beyond cost savings, caching dramatically improves user experience. Cached responses return in milliseconds rather than seconds, enabling snappy application performance. This is particularly valuable for chat applications, search interfaces, and any system where latency impacts user satisfaction and engagement.
Without Caching
Every query triggers an API call, which means high latency (1-5 seconds), unpredictable costs, exposure to rate limits during traffic spikes, and a degraded user experience at peak load.
With Caching
Repeated queries are served instantly from the cache: low latency (<50 ms), predictable costs, protection against rate limits, and a consistent user experience regardless of upstream API status.
Exact Match Caching
Exact match caching stores responses keyed by the complete input text. When an identical query arrives, the cached response is returned without calling the LLM API. This approach is simple to implement and highly effective for applications with predictable, repetitive queries.
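Before worrying about key generation, the core mechanic fits in a few lines. The sketch below assumes a plain in-memory dict and a placeholder call_llm function standing in for your actual API client:

# Minimal in-memory exact-match cache (illustrative sketch).
# `call_llm` is a placeholder for whatever function actually hits the API.
exact_cache = {}

def cached_complete(prompt, call_llm):
    if prompt in exact_cache:          # identical input -> return cached response
        return exact_cache[prompt]
    response = call_llm(prompt)        # cache miss -> pay for one API call
    exact_cache[prompt] = response
    return response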
Generate a deterministic cache key by hashing the complete input including prompt, parameters, and model configuration. This ensures consistent cache lookups for identical requests while keeping cache keys compact and efficient.
import hashlib
import json

def generate_cache_key(prompt, model, temperature=0.7, **kwargs):
    """Generate deterministic cache key from request params"""
    cache_data = {
        "prompt": prompt,
        "model": model,
        "temperature": temperature,
        **kwargs
    }
    # Create stable JSON string
    data_string = json.dumps(cache_data, sort_keys=True)
    # Generate SHA-256 hash
    return hashlib.sha256(data_string.encode()).hexdigest()

# Usage example
key = generate_cache_key(
    prompt="Explain quantum computing",
    model="gpt-4",
    temperature=0.7
)
# Returns: "a1b2c3d4e5f6..."
Exact caching works best for FAQ systems, documentation search, command-line tools, and any application where users frequently submit identical queries. It's less effective for conversational AI where context changes with each message.
Semantic Caching
Semantic caching goes beyond exact matches by identifying similar queries using vector embeddings. Two users asking "What is Python?" and "Explain Python language" receive the same cached response because the semantic meaning is nearly identical. This dramatically increases cache hit rates compared to exact matching.
Convert each query into a vector embedding using a sentence transformer model. Compare the query embedding against cached embeddings to find semantically similar requests, and return the cached response when cosine similarity exceeds a defined threshold (typically 0.90-0.95).
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = similarity_threshold
        self.cache = {}  # {query: (embedding, response)}

    def find_similar(self, query):
        """Find semantically similar cached query"""
        query_embedding = self.model.encode(query)
        for cached_query, (embedding, response) in self.cache.items():
            # Cosine similarity between the new query and each cached query
            similarity = np.dot(query_embedding, embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(embedding)
            )
            if similarity >= self.threshold:
                return response, similarity
        return None, 0.0

    def add(self, query, response):
        """Add query-response pair to cache"""
        embedding = self.model.encode(query)
        self.cache[query] = (embedding, response)
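Usage follows the same check-then-call pattern as exact matching. The queries and responses below are purely illustrative:

cache = SemanticCache(similarity_threshold=0.92)
cache.add("What is Python?", "Python is a high-level programming language...")

# A differently worded but semantically similar query hits the cache
response, similarity = cache.find_similar("Explain the Python language")
if response is not None:
    print(f"Cache hit (similarity {similarity:.2f}): {response}")
else:
    print("Cache miss - call the LLM and add the result to the cache")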
Complete Implementation
Here's a Redis-backed exact-match caching layer suitable as a starting point for production: it provides distributed storage, TTL-based expiration, and cache hit-rate tracking, and can be extended with the semantic matching shown above.
import redis
import hashlib
import json
from openai import OpenAI

class CachedLLMProxy:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.client = OpenAI()
        self.stats = {"hits": 0, "misses": 0}

    def complete(self, prompt, model="gpt-3.5-turbo", ttl=86400, **kwargs):
        """Generate completion with caching"""
        # Generate cache key
        cache_key = self._make_key(prompt, model, **kwargs)

        # Check cache first
        cached = self.redis.get(cache_key)
        if cached:
            self.stats["hits"] += 1
            return json.loads(cached)

        # Cache miss - call API
        self.stats["misses"] += 1
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )

        # Cache the response
        result = response.choices[0].message.content
        self.redis.setex(cache_key, ttl, json.dumps(result))
        return result

    def _make_key(self, prompt, model, **kwargs):
        """Create deterministic cache key"""
        data = {"p": prompt, "m": model, **kwargs}
        return f"llm:{hashlib.md5(json.dumps(data, sort_keys=True).encode()).hexdigest()}"

    def hit_rate(self):
        """Calculate cache hit rate"""
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total > 0 else 0
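A typical usage pattern looks like the following, assuming a local Redis instance and an OPENAI_API_KEY in the environment:

proxy = CachedLLMProxy(redis_url="redis://localhost:6379")

# First call hits the API and stores the response for 24 hours (ttl=86400)
answer = proxy.complete("Explain quantum computing", model="gpt-3.5-turbo")

# An identical request is now served from Redis without an API call
answer = proxy.complete("Explain quantum computing", model="gpt-3.5-turbo")

print(f"Cache hit rate: {proxy.hit_rate():.0%}")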
Cache Invalidation Strategies
Effective cache invalidation ensures users receive accurate, up-to-date responses while maintaining high cache hit rates. The right strategy depends on your use case and how frequently the underlying information changes.
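The ttl passed to setex above already provides time-based expiration. For data that changes on known events, one common approach is to namespace keys by data source or version and delete that namespace when the underlying content is republished. The sketch below assumes redis-py; the llm:docs-v1: prefix is illustrative and not part of the proxy above:

# Illustrative sketch: event-based invalidation by key prefix (assumes redis-py)
import redis

r = redis.from_url("redis://localhost:6379")

def invalidate_namespace(prefix):
    """Delete every cached response whose key starts with the given prefix,
    e.g. after the documents behind a knowledge base are re-published."""
    keys = list(r.scan_iter(match=f"{prefix}*"))
    if keys:
        r.delete(*keys)

# Example: drop all cached answers generated from the "docs-v1" corpus
invalidate_namespace("llm:docs-v1:")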
Don't cache responses for time-sensitive queries. Be careful with user-specific context in conversations. Consider privacy implications when caching user data. Always test cache hit rates in production to validate effectiveness.
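A lightweight way to enforce the first two points is a cacheability check before the lookup and a per-user component in the cache key. The keyword list and user_id parameter below are illustrative assumptions, not a complete solution:

import hashlib

# Illustrative heuristic: skip caching for obviously time-sensitive queries
TIME_SENSITIVE_HINTS = ("today", "current", "latest", "right now")

def is_cacheable(prompt):
    lowered = prompt.lower()
    return not any(hint in lowered for hint in TIME_SENSITIVE_HINTS)

def user_scoped_key(prompt, user_id):
    """Include the user ID so personalized context never leaks across users."""
    return f"llm:{user_id}:{hashlib.sha256(prompt.encode()).hexdigest()}"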