Caching Strategies

🎯 Exact Match Cache

Store responses for exact prompt matches, with fast lookups and minimal overhead. Best for FAQ-style queries and repetitive requests.

  • SHA-256 hash for keys
  • Sub-millisecond lookups
  • Configurable TTL
  • Memory-efficient storage

🧠 Semantic Cache

Cache responses based on meaning, not exact text. Prompts are embedded as vectors, and a similarity search matches semantically similar queries.

  • Vector embeddings storage
  • Similarity threshold tuning
  • Redis Stack support
  • Up to 90% cost savings on near-duplicate queries

⏱️ TTL-Based Cache

Automatic expiration for time-sensitive content. Useful for models with knowledge cutoffs or for frequently updated information (see the per-model TTL sketch after this list).

  • Flexible TTL per model
  • Lazy expiration
  • Memory optimization
  • Background refresh
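
A minimal sketch of flexible per-model TTLs, assuming a hypothetical MODEL_TTLS map; the model names and durations are illustrative. Redis SETEX writes the value and its expiry atomically, and Redis removes expired keys lazily on access plus via background sampling, so no cleanup job is needed.

import json
import redis

# Hypothetical per-model TTL map (seconds); model names are illustrative.
MODEL_TTLS = {
    "gpt-4o": 3600,        # general-purpose answers: keep for 1 hour
    "gpt-4o-mini": 86400,  # cheap, stable answers: keep for 1 day
}
DEFAULT_TTL = 3600

r = redis.from_url("redis://localhost:6379")

def cache_with_model_ttl(key: str, model: str, response: dict) -> None:
    # SETEX stores the value and its expiry in one atomic command;
    # expiration itself is handled entirely by Redis (lazy expiration).
    r.setex(key, MODEL_TTLS.get(model, DEFAULT_TTL), json.dumps(response))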

🔀 Multi-Level Cache

Hierarchical caching with L1 (local, in-process) and L2 (Redis) layers minimizes latency while maximizing cache coverage across deployments (see the sketch after this list).

  • In-memory L1 cache
  • Distributed L2 Redis
  • Automatic promotion
  • Cache warming
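
A minimal sketch of the L1/L2 read path with automatic promotion, assuming a plain dict as the per-process L1; a production L1 would normally be a proper LRU (e.g. cachetools).

import json
import redis

class TwoLevelCache:
    """L1: in-process dict (per replica). L2: shared Redis."""

    def __init__(self, redis_url="redis://localhost:6379", l1_max=10_000):
        self.l1 = {}
        self.l1_max = l1_max
        self.l2 = redis.from_url(redis_url)

    def get(self, key):
        if key in self.l1:                # L1 hit: no network round-trip
            return self.l1[key]
        cached = self.l2.get(key)         # L2 hit: one Redis round-trip
        if cached is not None:
            value = json.loads(cached)
            self._promote(key, value)     # automatic promotion into L1
            return value
        return None

    def _promote(self, key, value):
        if len(self.l1) >= self.l1_max:   # crude FIFO eviction; use an LRU in practice
            self.l1.pop(next(iter(self.l1)))
        self.l1[key] = value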

Architecture Flow

Client Request (prompt + model)
        ↓
LLM Proxy (cache check)
        ↓
Redis Cache (lookup)
        ↓
  Cache hit  → return the cached response immediately
  Cache miss → call the LLM API, store the response, and cache it for future requests
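
In code, the whole flow condenses to a read-through wrapper. A sketch using the LLMCache class from the Python Setup example below together with litellm's completion call; the model name is illustrative.

import litellm

def cached_completion(cache, prompt, model="gpt-4o-mini"):
    # 1. Cache check
    hit = cache.get(prompt, model)
    if hit is not None:
        return hit                      # cache hit: return stored response

    # 2. Cache miss: call the LLM API
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content

    # 3. Store the response and cache it for future requests
    cache.set(prompt, model, answer)
    return answer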

Implementation Examples

Python Setup

# Install dependencies first:
#   pip install redis litellm

# Basic Redis cache setup
import redis
import hashlib
import json

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600  # 1 hour default

    def get_cache_key(self, prompt, model):
        # Key on both model and prompt so different models never collide
        return hashlib.sha256(
            f"{model}:{prompt}".encode()
        ).hexdigest()

    def get(self, prompt, model):
        key = self.get_cache_key(prompt, model)
        cached = self.redis.get(key)
        return json.loads(cached) if cached else None

    def set(self, prompt, model, response):
        key = self.get_cache_key(prompt, model)
        self.redis.setex(key, self.ttl, json.dumps(response))
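
A quick smoke test of the class above, assuming a Redis server on localhost; the prompt and stored payload are illustrative.

cache = LLMCache()
cache.set("What is Redis?", "gpt-4o-mini",
          {"content": "Redis is an in-memory key-value store."})
assert cache.get("What is Redis?", "gpt-4o-mini")["content"].startswith("Redis")
assert cache.get("A different prompt", "gpt-4o-mini") is None  # exact match only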

Redis Config

# Redis configuration for optimal caching
# redis.conf

maxmemory 4gb
maxmemory-policy allkeys-lru

# Enable Redis Stack modules for semantic search
loadmodule /path/to/redisearch.so
loadmodule /path/to/redisjson.so

# Persistence options
save 900 1        # Save after 900 sec if at least 1 key changed
appendonly yes    # AOF persistence

# Connection handling
tcp-backlog 511
timeout 0
tcp-keepalive 300

Semantic Cache

# Semantic caching with embeddings
import numpy as np
from redis.commands.search.query import Query

class SemanticCache:
    def __init__(self, redis_client, threshold=0.95):
        self.redis = redis_client
        self.threshold = threshold  # minimum cosine similarity for a hit

    def find_similar(self, embedding):
        # Vector similarity search: nearest neighbour by cosine distance
        query = (
            Query("*=>[KNN 1 @embedding $vec AS score]")
            .return_fields("response", "score")
            .dialect(2)
        )
        results = self.redis.ft("idx:cache").search(
            query,
            query_params={"vec": embedding.astype(np.float32).tobytes()},
        )
        if results.total:
            # With a COSINE index, RediSearch reports distance (1 - similarity),
            # so convert before comparing against the similarity threshold.
            similarity = 1 - float(results.docs[0].score)
            if similarity >= self.threshold:
                return results.docs[0].response
        return None
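
find_similar() assumes an existing idx:cache index over hash entries carrying an embedding vector field and a response text field. A sketch of creating that index and writing entries with redis-py, assuming 1536-dimensional float32 embeddings and a cache: key prefix (both are assumptions, not requirements).

import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.from_url("redis://localhost:6379")
DIM = 1536  # embedding width; depends on your embedding model

# Create the index that find_similar() queries (requires Redis Stack / RediSearch).
# Raises if the index already exists, so run once at setup time.
r.ft("idx:cache").create_index(
    fields=[
        TextField("response"),
        VectorField(
            "embedding",
            "HNSW",
            {"TYPE": "FLOAT32", "DIM": DIM, "DISTANCE_METRIC": "COSINE"},
        ),
    ],
    definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
)

def store(entry_id: str, embedding: np.ndarray, response: str) -> None:
    # Embeddings must be stored as raw float32 bytes to match TYPE above.
    r.hset(f"cache:{entry_id}", mapping={
        "embedding": embedding.astype(np.float32).tobytes(),
        "response": response,
    })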

Performance Impact

80% API Call Reduction
With typical cache hit rates of 60-80%, the bulk of external API calls and their costs disappear; at a 70% hit rate, for example, one million requests need only ~300,000 upstream calls.

100x Faster Response
Cached responses return in under 10 ms versus 1-3 seconds for LLM API calls, for a near-instant user experience.

Rate Limit Protection
Cache hits consume no provider quota, so cached responses keep being served even when a provider rate-limits you, maintaining service continuity.
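
A back-of-envelope model of these figures; the per-call cost and latencies below are assumptions for illustration only.

def effective(hit_rate, api_cost=0.01, api_latency=2.0, cache_latency=0.01):
    """Expected per-request cost ($) and latency (s) at a given cache hit rate."""
    cost = (1 - hit_rate) * api_cost  # cache hits are (near-)free
    latency = hit_rate * cache_latency + (1 - hit_rate) * api_latency
    return cost, latency

print(effective(0.7))  # 70% hit rate: ~$0.003/request, ~0.61 s average latency
print(effective(0.8))  # 80% hit rate: ~$0.002/request, ~0.41 s average latency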

Configuration Options

Parameter             Default  Description
cache_ttl             3600     Time-to-live in seconds for cached responses
similarity_threshold  0.95     Minimum similarity score for semantic cache hits
max_cache_size        4GB      Maximum memory allocation for cache
cache_models          all      Which models to cache (can filter by model name)
cache_streaming       true     Cache streaming responses chunk by chunk
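
One way these options might be wired together in application code; the CACHE_CONFIG dict and should_cache helper are hypothetical, not part of any library's API.

# Hypothetical cache configuration mirroring the table above.
CACHE_CONFIG = {
    "cache_ttl": 3600,             # seconds
    "similarity_threshold": 0.95,  # semantic-cache hit cutoff
    "max_cache_size": "4gb",       # maps to Redis maxmemory
    "cache_models": "all",         # or a list like ["gpt-4o", "gpt-4o-mini"]
    "cache_streaming": True,       # buffer and cache streamed chunks
}

def should_cache(model: str, cfg=CACHE_CONFIG) -> bool:
    # "all" caches every model; otherwise filter by model name.
    allowed = cfg["cache_models"]
    return allowed == "all" or model in allowed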