Prompt Caching Guide

Reduce API costs and latency by caching prompt responses. Learn implementation strategies and best practices.

- 70% cost reduction
- 10x faster responses
- 50% API calls saved
- 99% cache hit rate

Caching Strategies

Choose the right strategy for your use case

Exact Match

Cache responses for identical prompts. Simple and highly effective for repeated queries.

60-70% savings
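
A minimal exact-match sketch in JavaScript, assuming a callModel(prompt) function that wraps your provider's API (a hypothetical name, not part of Gateway): the prompt is hashed and the digest used as the cache key, so only a byte-identical prompt reuses the stored response.

// Exact-match cache: hash the full prompt and key on the digest.
const crypto = require('node:crypto');

const store = new Map();

function hashPrompt(prompt) {
  return crypto.createHash('sha256').update(prompt).digest('hex');
}

async function cachedCompletion(prompt) {
  const key = hashPrompt(prompt);
  if (store.has(key)) {
    return store.get(key); // hit: return the stored response, no API call
  }
  const response = await callModel(prompt); // miss: call the provider API
  store.set(key, response);
  return response;
}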

Prefix Caching

Cache common prompt prefixes. Share the cache across similar prompts that differ only in their variable parts.

40-50% savings
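
One way to sketch the keying, assuming the shared prefix is a fixed block of system instructions at the start of each request: hash only the first prefixLength characters, so prompts that differ after the prefix map to the same entry.

// Prefix key: prompts sharing the same leading system instructions
// produce the same key, even when the user portion varies.
const crypto = require('node:crypto');

function prefixKey(prompt, prefixLength = 500) {
  const prefix = prompt.slice(0, prefixLength);
  return crypto.createHash('sha256').update(prefix).digest('hex');
}

Typically what gets reused here is the processed prompt prefix rather than a full response, which is consistent with the lower savings figure above compared to exact matching.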

Semantic Caching

Cache responses for similar prompts using embeddings. Prompts are matched by semantic similarity rather than exact text.

30-40% savings
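
A sketch of the lookup side, assuming a hypothetical embed(prompt) function that returns an embedding vector (in practice, a call to an embeddings API): compare the new prompt's vector against stored entries and reuse a response when cosine similarity clears a threshold.

// Semantic cache: match on embedding similarity, not exact text.
const entries = []; // { vector, response }

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function semanticLookup(prompt, threshold = 0.95) {
  const vector = await embed(prompt); // embed() is a hypothetical embeddings call
  for (const entry of entries) {
    if (cosine(vector, entry.vector) >= threshold) {
      return entry.response; // a semantically similar prompt was cached
    }
  }
  return null; // miss: caller queries the API and stores the result
}

async function semanticStore(prompt, response) {
  entries.push({ vector: await embed(prompt), response });
}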

Configuration

// Prompt caching configuration
const cache = {
  strategy: 'prefix',      // caching strategy: exact, prefix, or semantic
  ttl: 3600,               // time-to-live per entry, in seconds
  prefixLength: 500,       // length of the shared prefix to key on
  maxSize: 1000,           // maximum number of cached entries
  invalidation: {
    manual: true,          // allow manually triggered refreshes
    auto: 'model-update'   // auto-invalidate when the model updates
  }
};
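
As a usage sketch only; createCache, wrap, and callModel are illustrative names, not a documented Gateway API:

// Hypothetical usage of the configuration above.
const promptCache = createCache(cache);

async function ask(prompt) {
  // Returns the cached response on a hit; otherwise calls the model
  // and stores the result under the configured strategy and TTL.
  return promptCache.wrap(prompt, () => callModel(prompt));
}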

Key Features

Automatic Cache Invalidation

Invalidate the cache automatically when models update, or trigger a refresh manually

Prefix-Based Sharing

Share cache across prompts with common system instructions

TTL Management

Set time-to-live per cache entry for freshness control
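
A sketch of how per-entry TTL might be enforced, with expired entries treated as misses and evicted lazily on read:

// TTL check: entries store an expiry timestamp computed at write time.
function getFresh(store, key) {
  const entry = store.get(key);
  if (!entry) return null;
  if (Date.now() > entry.expiresAt) {
    store.delete(key); // stale: evict and force a refresh
    return null;
  }
  return entry.response;
}

function putWithTtl(store, key, response, ttlSeconds = 3600) {
  store.set(key, { response, expiresAt: Date.now() + ttlSeconds * 1000 });
}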

Analytics Dashboard

Monitor hit rates, savings, and cache performance

Frequently Asked Questions

How does prompt caching work?
Gateway stores a hash of each prompt alongside its response. When an identical prompt arrives, the cached response is returned without an API call.
When should I invalidate the cache?
Invalidate when the model updates, when prompt templates change, or on a scheduled interval to keep responses fresh.
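
One common way to handle the model-update case, shown here as a sketch: fold the model identifier into the cache key, so a new model version naturally misses every old entry without an explicit flush.

// Keying on model ID + prompt: updating the model changes every key,
// which invalidates old entries automatically.
const crypto = require('node:crypto');

function cacheKey(modelId, prompt) {
  return crypto.createHash('sha256')
    .update(`${modelId}\u0000${prompt}`)
    .digest('hex');
}
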
Does cached data affect response quality?
No. A cached response is an exact copy of a previous API response, so quality is identical. Note that with semantic caching the match itself is approximate: a sufficiently similar prompt receives a previously generated response, so choose the similarity threshold carefully.

Related Resources

- Context Window: window management
- Token Optimization: cost reduction
- Prompt Engineering: optimization
