Prompt Caching and Semantic Caching: Lower Latency and Cost
Learn exact prompt caching, prefix caching, semantic caching, TTLs, invalidation, cache safety, and when caching LLM responses is a bad idea.
Why Cache LLM Work?
LLM calls are slower and more expensive than most service calls. Caching can reduce latency, token spend, and provider load when requests repeat or share common prefixes.
Key idea: Cache only when the answer is safe to reuse. Correctness, privacy, freshness, and personalization matter more than savings.
Caching Layers
| Cache Type | Match Condition | Good For |
|---|---|---|
| Exact-match cache | Same normalized request hash | Deterministic repeated tasks |
| Semantic cache | Similar meaning above threshold | FAQ and support questions |
| Prefix cache | Shared prompt prefix reused | Long system prompts and static context |
| Retrieval cache | Same query returns same chunks | RAG retrieval reuse |
Exact Prompt Caching
Exact caching hashes the normalized prompt and model parameters.
Cache Key Inputs
| Input | Why It Belongs in Key |
|---|---|
| Model name and version | Different models produce different outputs |
| System prompt | Changes behavior |
| User prompt | Main input |
| Temperature and decoding params | Affects output |
| Tool schema | Changes valid output |
| Prompt version | Supports safe rollout |
| Tenant or permission scope | Prevents data leakage |
Exact caching is safest for deterministic extraction, classification, and repeated internal tasks.
Semantic Caching
Semantic caching embeds the user request and looks for a previously answered request with high similarity.
Threshold Tuning
| Threshold | Behavior |
|---|---|
| Too high | Low hit rate, safer |
| Too low | More hits, higher wrong-answer risk |
| Per intent | Best practical approach |
Semantic caching is useful when many users ask equivalent questions in different words. It is dangerous when small wording changes matter.
Do not semantic-cache sensitive or personalized answers unless the cache key includes the permission and personalization boundary.
Prefix Caching
Many LLM requests share a long prefix: system instructions, tool definitions, policy text, or static product documentation. Prefix caching lets the provider or serving layer reuse computation for that shared prefix.
Prefix Cache Design
- Keep stable instructions at the beginning.
- Avoid changing timestamps or request IDs in the prefix.
- Put dynamic user content later.
- Version prompt prefixes deliberately.
- Measure cache hit rate and latency impact.
Caching RAG Systems
RAG has more cacheable parts than a plain LLM call.
| Layer | Cache Key |
|---|---|
| Parsed document | Document ID plus content hash |
| Embeddings | Chunk hash plus embedding model |
| Retrieval result | Query, filters, index version |
| Rerank result | Query, candidates, reranker version |
| Final answer | Prompt, context IDs, model, user scope |
Final-answer caching is the riskiest. Retrieval and embedding caches are usually safer.
TTL and Invalidation
| Data Type | TTL Strategy |
|---|---|
| Static docs | Long TTL with content-hash invalidation |
| Product policies | Short TTL or versioned invalidation |
| User-specific answers | Very short TTL or no cache |
| Compliance answers | Avoid final-answer cache |
| Analytics summaries | TTL aligned with source refresh |
Invalidation Events
When Not to Cache
Do not cache blindly.
| Situation | Why |
|---|---|
| Medical, legal, or financial decisions | High-stakes freshness and correctness |
| Personalized account answers | Privacy and permission risk |
| Rapidly changing data | Stale response risk |
| Creative generation | Users expect variation |
| Tool-dependent output | Side effects and current state matter |
| Safety-sensitive moderation | Policy and context must be current |
What to Remember for Interviews
- Exact caching is safest: Hash the full normalized request and parameters.
- Semantic caching is approximate: Tune thresholds by task and risk.
- Prefix caching rewards stable prompts: Put reusable content first.
- Cache boundaries must include permissions: Avoid cross-tenant or cross-user leaks.
- Invalidate by version: Prompt, model, document, embedding, and ACL versions matter.
Practice: Design caching for a RAG support bot. Decide which layers to cache, how to key them, how to invalidate them, and which answers should never be cached.