Why Cache LLM Responses?
Caching is one of the most effective strategies for reducing LLM API costs and improving response times. Every cached response eliminates an expensive API call while delivering instant results to users. For applications with repetitive query patterns, caching can reduce API volume by 40-70%, translating directly to proportional cost savings.
Beyond cost savings, caching dramatically improves user experience. Cached responses return in milliseconds rather than seconds, enabling snappy application performance. This is particularly valuable for chat applications, search interfaces, and any system where latency impacts user satisfaction and engagement.
Without Caching
Every query triggers an API call, which means high latency (1-5 seconds), unpredictable costs, exposure to rate limits during traffic spikes, and a degraded user experience at peak load.
With Caching
Repeated queries are served instantly from the cache: low latency (<50 ms), predictable costs, protection against rate limits, and a consistent user experience regardless of upstream API status.
Exact Match Caching
Exact match caching stores responses keyed by the complete input text. When an identical query arrives, the cached response is returned without calling the LLM API. This approach is simple to implement and highly effective for applications with predictable, repetitive queries.
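Before worrying about key generation, the core mechanic fits in a few lines. The sketch below assumes a plain in-memory dict and a placeholder call_llm function standing in for your actual API client:

# Minimal in-memory exact-match cache (illustrative sketch).
# `call_llm` is a placeholder for whatever function actually hits the API.
exact_cache = {}

def cached_complete(prompt, call_llm):
    if prompt in exact_cache:          # identical input -> return cached response
        return exact_cache[prompt]
    response = call_llm(prompt)        # cache miss -> pay for one API call
    exact_cache[prompt] = response
    return response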
Generate a deterministic cache key by hashing the complete input including prompt, parameters, and model configuration. This ensures consistent cache lookups for identical requests while keeping cache keys compact and efficient.
import hashlib
import json

def generate_cache_key(prompt, model, temperature=0.7, **kwargs):
    """Generate deterministic cache key from request params"""
    cache_data = {
        "prompt": prompt,
        "model": model,
        "temperature": temperature,
        **kwargs
    }
    # Create stable JSON string
    data_string = json.dumps(cache_data, sort_keys=True)
    # Generate SHA-256 hash
    return hashlib.sha256(data_string.encode()).hexdigest()

# Usage example
key = generate_cache_key(
    prompt="Explain quantum computing",
    model="gpt-4",
    temperature=0.7
)
# Returns: "a1b2c3d4e5f6..."
Exact caching works best for FAQ systems, documentation search, command-line tools, and any application where users frequently submit identical queries. It's less effective for conversational AI where context changes with each message.
Semantic Caching
Semantic caching goes beyond exact matches by identifying similar queries using vector embeddings. Two users asking "What is Python?" and "Explain Python language" receive the same cached response because the semantic meaning is nearly identical. This dramatically increases cache hit rates compared to exact matching.
Convert each query into a vector embedding using a sentence transformer model. Compare the query embedding against cached embeddings to find semantically similar requests, and return the cached response when cosine similarity exceeds a defined threshold (typically 0.90-0.95).
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = similarity_threshold
        self.cache = {}  # {query: (embedding, response)}

    def find_similar(self, query):
        """Find semantically similar cached query"""
        query_embedding = self.model.encode(query)
        for cached_query, (embedding, response) in self.cache.items():
            # Cosine similarity between the new query and each cached query
            similarity = np.dot(query_embedding, embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(embedding)
            )
            if similarity >= self.threshold:
                return response, similarity
        return None, 0.0

    def add(self, query, response):
        """Add query-response pair to cache"""
        embedding = self.model.encode(query)
        self.cache[query] = (embedding, response)
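Usage follows the same check-then-call pattern as exact matching. The queries and responses below are purely illustrative:

cache = SemanticCache(similarity_threshold=0.92)
cache.add("What is Python?", "Python is a high-level programming language...")

# A differently worded but semantically similar query hits the cache
response, similarity = cache.find_similar("Explain the Python language")
if response is not None:
    print(f"Cache hit (similarity {similarity:.2f}): {response}")
else:
    print("Cache miss - call the LLM and add the result to the cache")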
Complete Implementation
Here's a Redis-backed exact-match caching layer suitable as a starting point for production: it provides distributed storage, TTL-based expiration, and cache hit-rate tracking, and can be extended with the semantic matching shown above.
import redis
import hashlib
import json
from openai import OpenAI

class CachedLLMProxy:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.client = OpenAI()
        self.stats = {"hits": 0, "misses": 0}

    def complete(self, prompt, model="gpt-3.5-turbo", ttl=86400, **kwargs):
        """Generate completion with caching"""
        # Generate cache key
        cache_key = self._make_key(prompt, model, **kwargs)

        # Check cache first
        cached = self.redis.get(cache_key)
        if cached:
            self.stats["hits"] += 1
            return json.loads(cached)

        # Cache miss - call API
        self.stats["misses"] += 1
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )

        # Cache the response
        result = response.choices[0].message.content
        self.redis.setex(cache_key, ttl, json.dumps(result))
        return result

    def _make_key(self, prompt, model, **kwargs):
        """Create deterministic cache key"""
        data = {"p": prompt, "m": model, **kwargs}
        return f"llm:{hashlib.md5(json.dumps(data, sort_keys=True).encode()).hexdigest()}"

    def hit_rate(self):
        """Calculate cache hit rate"""
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total > 0 else 0
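A typical usage pattern looks like the following, assuming a local Redis instance and an OPENAI_API_KEY in the environment:

proxy = CachedLLMProxy(redis_url="redis://localhost:6379")

# First call hits the API and stores the response for 24 hours (ttl=86400)
answer = proxy.complete("Explain quantum computing", model="gpt-3.5-turbo")

# An identical request is now served from Redis without an API call
answer = proxy.complete("Explain quantum computing", model="gpt-3.5-turbo")

print(f"Cache hit rate: {proxy.hit_rate():.0%}")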
Cache Invalidation Strategies
Effective cache invalidation ensures users receive accurate, up-to-date responses while maintaining high cache hit rates. The right strategy depends on your use case and how frequently the underlying information changes.
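The ttl passed to setex above already provides time-based expiration. For data that changes on known events, one common approach is to namespace keys by data source or version and delete that namespace when the underlying content is republished. The sketch below assumes redis-py; the llm:docs-v1: prefix is illustrative and not part of the proxy above:

# Illustrative sketch: event-based invalidation by key prefix (assumes redis-py)
import redis

r = redis.from_url("redis://localhost:6379")

def invalidate_namespace(prefix):
    """Delete every cached response whose key starts with the given prefix,
    e.g. after the documents behind a knowledge base are re-published."""
    keys = list(r.scan_iter(match=f"{prefix}*"))
    if keys:
        r.delete(*keys)

# Example: drop all cached answers generated from the "docs-v1" corpus
invalidate_namespace("llm:docs-v1:")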
Don't cache responses for time-sensitive queries. Be careful with user-specific context in conversations. Consider privacy implications when caching user data. Always test cache hit rates in production to validate effectiveness.
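A lightweight way to enforce the first two points is a cacheability check before the lookup and a per-user component in the cache key. The keyword list and user_id parameter below are illustrative assumptions, not a complete solution:

import hashlib

# Illustrative heuristic: skip caching for obviously time-sensitive queries
TIME_SENSITIVE_HINTS = ("today", "current", "latest", "right now")

def is_cacheable(prompt):
    lowered = prompt.lower()
    return not any(hint in lowered for hint in TIME_SENSITIVE_HINTS)

def user_scoped_key(prompt, user_id):
    """Include the user ID so personalized context never leaks across users."""
    return f"llm:{user_id}:{hashlib.sha256(prompt.encode()).hexdigest()}"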