Caching Strategies

🎯 Exact Match Cache

Store responses for exact prompt matches, with fast lookups and minimal overhead. Best for FAQ-style queries and repetitive requests.

  • SHA-256 hash for keys
  • Sub-millisecond lookups
  • Configurable TTL
  • Memory-efficient storage

🧠 Semantic Cache

Cache responses based on meaning, not exact text. Prompts are embedded as vectors, and a similarity search matches semantically similar queries.

  • Vector embeddings storage
  • Similarity threshold tuning
  • Redis Stack support
  • Up to 90% cost savings on near-duplicate queries

⏱️ TTL-Based Cache

Automatic expiration for time-sensitive content. Useful for models with knowledge cutoffs or for frequently updated information (see the per-model TTL sketch after this list).

  • Flexible TTL per model
  • Lazy expiration
  • Memory optimization
  • Background refresh
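
A minimal sketch of flexible per-model TTLs, assuming a hypothetical MODEL_TTLS map; the model names and durations are illustrative. Redis SETEX writes the value and its expiry atomically, and Redis removes expired keys lazily on access plus via background sampling, so no cleanup job is needed.

import json
import redis

# Hypothetical per-model TTL map (seconds); model names are illustrative.
MODEL_TTLS = {
    "gpt-4o": 3600,        # general-purpose answers: keep for 1 hour
    "gpt-4o-mini": 86400,  # cheap, stable answers: keep for 1 day
}
DEFAULT_TTL = 3600

r = redis.from_url("redis://localhost:6379")

def cache_with_model_ttl(key: str, model: str, response: dict) -> None:
    # SETEX stores the value and its expiry in one atomic command;
    # expiration itself is handled entirely by Redis (lazy expiration).
    r.setex(key, MODEL_TTLS.get(model, DEFAULT_TTL), json.dumps(response))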

🔀 Multi-Level Cache

Hierarchical caching with L1 (local, in-process) and L2 (Redis) layers minimizes latency while maximizing cache coverage across deployments (see the sketch after this list).

  • In-memory L1 cache
  • Distributed L2 Redis
  • Automatic promotion
  • Cache warming
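
A minimal sketch of the L1/L2 read path with automatic promotion, assuming a plain dict as the per-process L1; a production L1 would normally be a proper LRU (e.g. cachetools).

import json
import redis

class TwoLevelCache:
    """L1: in-process dict (per replica). L2: shared Redis."""

    def __init__(self, redis_url="redis://localhost:6379", l1_max=10_000):
        self.l1 = {}
        self.l1_max = l1_max
        self.l2 = redis.from_url(redis_url)

    def get(self, key):
        if key in self.l1:                # L1 hit: no network round-trip
            return self.l1[key]
        cached = self.l2.get(key)         # L2 hit: one Redis round-trip
        if cached is not None:
            value = json.loads(cached)
            self._promote(key, value)     # automatic promotion into L1
            return value
        return None

    def _promote(self, key, value):
        if len(self.l1) >= self.l1_max:   # crude FIFO eviction; use an LRU in practice
            self.l1.pop(next(iter(self.l1)))
        self.l1[key] = value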

Architecture Flow

Client Request (prompt + model)
        ↓
LLM Proxy (cache check)
        ↓
Redis Cache (lookup)
        ↓
  Cache hit  → return the cached response immediately
  Cache miss → call the LLM API, store the response, and cache it for future requests
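
In code, the whole flow condenses to a read-through wrapper. A sketch using the LLMCache class from the Python Setup example below together with litellm's completion call; the model name is illustrative.

import litellm

def cached_completion(cache, prompt, model="gpt-4o-mini"):
    # 1. Cache check
    hit = cache.get(prompt, model)
    if hit is not None:
        return hit                      # cache hit: return stored response

    # 2. Cache miss: call the LLM API
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content

    # 3. Store the response and cache it for future requests
    cache.set(prompt, model, answer)
    return answer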

Implementation Examples

Python Setup

# Install dependencies first:
#   pip install redis litellm

# Basic Redis cache setup
import redis
import hashlib
import json

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600  # 1 hour default

    def get_cache_key(self, prompt, model):
        # Key on both model and prompt so different models never collide
        return hashlib.sha256(
            f"{model}:{prompt}".encode()
        ).hexdigest()

    def get(self, prompt, model):
        key = self.get_cache_key(prompt, model)
        cached = self.redis.get(key)
        return json.loads(cached) if cached else None

    def set(self, prompt, model, response):
        key = self.get_cache_key(prompt, model)
        self.redis.setex(key, self.ttl, json.dumps(response))
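
A quick smoke test of the class above, assuming a Redis server on localhost; the prompt and stored payload are illustrative.

cache = LLMCache()
cache.set("What is Redis?", "gpt-4o-mini",
          {"content": "Redis is an in-memory key-value store."})
assert cache.get("What is Redis?", "gpt-4o-mini")["content"].startswith("Redis")
assert cache.get("A different prompt", "gpt-4o-mini") is None  # exact match only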

Redis Config

# Redis configuration for optimal caching
# redis.conf

maxmemory 4gb
maxmemory-policy allkeys-lru

# Enable Redis Stack modules for semantic search
loadmodule /path/to/redisearch.so
loadmodule /path/to/redisjson.so

# Persistence options
save 900 1        # Save after 900 sec if at least 1 key changed
appendonly yes    # AOF persistence

# Connection handling
tcp-backlog 511
timeout 0
tcp-keepalive 300

Semantic Cache

# Semantic caching with embeddings
import numpy as np
from redis.commands.search.query import Query

class SemanticCache:
    def __init__(self, redis_client, threshold=0.95):
        self.redis = redis_client
        self.threshold = threshold  # minimum cosine similarity for a hit

    def find_similar(self, embedding):
        # Vector similarity search: nearest neighbour by cosine distance
        query = (
            Query("*=>[KNN 1 @embedding $vec AS score]")
            .return_fields("response", "score")
            .dialect(2)
        )
        results = self.redis.ft("idx:cache").search(
            query,
            query_params={"vec": embedding.astype(np.float32).tobytes()},
        )
        if results.total:
            # With a COSINE index, RediSearch reports distance (1 - similarity),
            # so convert before comparing against the similarity threshold.
            similarity = 1 - float(results.docs[0].score)
            if similarity >= self.threshold:
                return results.docs[0].response
        return None
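
find_similar() assumes an existing idx:cache index over hash entries carrying an embedding vector field and a response text field. A sketch of creating that index and writing entries with redis-py, assuming 1536-dimensional float32 embeddings and a cache: key prefix (both are assumptions, not requirements).

import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.from_url("redis://localhost:6379")
DIM = 1536  # embedding width; depends on your embedding model

# Create the index that find_similar() queries (requires Redis Stack / RediSearch).
# Raises if the index already exists, so run once at setup time.
r.ft("idx:cache").create_index(
    fields=[
        TextField("response"),
        VectorField(
            "embedding",
            "HNSW",
            {"TYPE": "FLOAT32", "DIM": DIM, "DISTANCE_METRIC": "COSINE"},
        ),
    ],
    definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
)

def store(entry_id: str, embedding: np.ndarray, response: str) -> None:
    # Embeddings must be stored as raw float32 bytes to match TYPE above.
    r.hset(f"cache:{entry_id}", mapping={
        "embedding": embedding.astype(np.float32).tobytes(),
        "response": response,
    })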

Performance Impact

80% API Call Reduction
With typical cache hit rates of 60-80%, the bulk of external API calls and their costs disappear; at a 70% hit rate, for example, one million requests need only ~300,000 upstream calls.

100x Faster Response
Cached responses return in under 10 ms versus 1-3 seconds for LLM API calls, for a near-instant user experience.

Rate Limit Protection
Cache hits consume no provider quota, so cached responses keep being served even when a provider rate-limits you, maintaining service continuity.
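
A back-of-envelope model of these figures; the per-call cost and latencies below are assumptions for illustration only.

def effective(hit_rate, api_cost=0.01, api_latency=2.0, cache_latency=0.01):
    """Expected per-request cost ($) and latency (s) at a given cache hit rate."""
    cost = (1 - hit_rate) * api_cost  # cache hits are (near-)free
    latency = hit_rate * cache_latency + (1 - hit_rate) * api_latency
    return cost, latency

print(effective(0.7))  # 70% hit rate: ~$0.003/request, ~0.61 s average latency
print(effective(0.8))  # 80% hit rate: ~$0.002/request, ~0.41 s average latency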

Configuration Options

Parameter             Default  Description
cache_ttl             3600     Time-to-live in seconds for cached responses
similarity_threshold  0.95     Minimum similarity score for semantic cache hits
max_cache_size        4GB      Maximum memory allocation for cache
cache_models          all      Which models to cache (can filter by model name)
cache_streaming       true     Cache streaming responses chunk by chunk
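
One way these options might be wired together in application code; the CACHE_CONFIG dict and should_cache helper are hypothetical, not part of any library's API.

# Hypothetical cache configuration mirroring the table above.
CACHE_CONFIG = {
    "cache_ttl": 3600,             # seconds
    "similarity_threshold": 0.95,  # semantic-cache hit cutoff
    "max_cache_size": "4gb",       # maps to Redis maxmemory
    "cache_models": "all",         # or a list like ["gpt-4o", "gpt-4o-mini"]
    "cache_streaming": True,       # buffer and cache streamed chunks
}

def should_cache(model: str, cfg=CACHE_CONFIG) -> bool:
    # "all" caches every model; otherwise filter by model name.
    allowed = cfg["cache_models"]
    return allowed == "all" or model in allowed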