Prompt Caching and Semantic Caching: Lower Latency and Cost

Learn exact prompt caching, prefix caching, semantic caching, TTLs, invalidation, cache safety, and when caching LLM responses is a bad idea.

prompt cachingsemantic cacheprefix cachecost reductionlatency

Why Cache LLM Work?

LLM calls are slower and more expensive than most service calls. Caching can reduce latency, token spend, and provider load when requests repeat or share common prefixes.

✅

Key idea: Cache only when the answer is safe to reuse. Correctness, privacy, freshness, and personalization matter more than savings.

Caching Layers

Cache Type	Match Condition	Good For
Exact-match cache	Same normalized request hash	Deterministic repeated tasks
Semantic cache	Similar meaning above threshold	FAQ and support questions
Prefix cache	Shared prompt prefix reused	Long system prompts and static context
Retrieval cache	Same query returns same chunks	RAG retrieval reuse

Exact Prompt Caching

Exact caching hashes the normalized prompt and model parameters.

Cache Key Inputs

Input	Why It Belongs in Key
Model name and version	Different models produce different outputs
System prompt	Changes behavior
User prompt	Main input
Temperature and decoding params	Affects output
Tool schema	Changes valid output
Prompt version	Supports safe rollout
Tenant or permission scope	Prevents data leakage

Exact caching is safest for deterministic extraction, classification, and repeated internal tasks.

Semantic Caching

Semantic caching embeds the user request and looks for a previously answered request with high similarity.

Threshold Tuning

Threshold	Behavior
Too high	Low hit rate, safer
Too low	More hits, higher wrong-answer risk
Per intent	Best practical approach

Semantic caching is useful when many users ask equivalent questions in different words. It is dangerous when small wording changes matter.

⚠️

Do not semantic-cache sensitive or personalized answers unless the cache key includes the permission and personalization boundary.

Prefix Caching

Many LLM requests share a long prefix: system instructions, tool definitions, policy text, or static product documentation. Prefix caching lets the provider or serving layer reuse computation for that shared prefix.

Prefix Cache Design

Keep stable instructions at the beginning.
Avoid changing timestamps or request IDs in the prefix.
Put dynamic user content later.
Version prompt prefixes deliberately.
Measure cache hit rate and latency impact.

Caching RAG Systems

RAG has more cacheable parts than a plain LLM call.

Layer	Cache Key
Parsed document	Document ID plus content hash
Embeddings	Chunk hash plus embedding model
Retrieval result	Query, filters, index version
Rerank result	Query, candidates, reranker version
Final answer	Prompt, context IDs, model, user scope

Final-answer caching is the riskiest. Retrieval and embedding caches are usually safer.

TTL and Invalidation

Data Type	TTL Strategy
Static docs	Long TTL with content-hash invalidation
Product policies	Short TTL or versioned invalidation
User-specific answers	Very short TTL or no cache
Compliance answers	Avoid final-answer cache
Analytics summaries	TTL aligned with source refresh

Invalidation Events

When Not to Cache

Do not cache blindly.

Situation	Why
Medical, legal, or financial decisions	High-stakes freshness and correctness
Personalized account answers	Privacy and permission risk
Rapidly changing data	Stale response risk
Creative generation	Users expect variation
Tool-dependent output	Side effects and current state matter
Safety-sensitive moderation	Policy and context must be current

What to Remember for Interviews

Exact caching is safest: Hash the full normalized request and parameters.
Semantic caching is approximate: Tune thresholds by task and risk.
Prefix caching rewards stable prompts: Put reusable content first.
Cache boundaries must include permissions: Avoid cross-tenant or cross-user leaks.
Invalidate by version: Prompt, model, document, embedding, and ACL versions matter.

✅

Practice: Design caching for a RAG support bot. Decide which layers to cache, how to key them, how to invalidate them, and which answers should never be cached.

LLM Gateway and Routing: Model Selection, Fallbacks, and Cost Control

Agentic Patterns and Tool Use: ReAct, Function Calling, and Orchestration