Prompt Caching and Semantic Caching: Lower Latency and Cost

Learn exact prompt caching, prefix caching, semantic caching, TTLs, invalidation, cache safety, and when caching LLM responses is a bad idea.

Tags: prompt caching, semantic cache, prefix cache, cost reduction, latency

Why Cache LLM Work?

LLM calls are slower and more expensive than most service calls. Caching can reduce latency, token spend, and provider load when requests repeat or share common prefixes.

Key idea: Cache only when the answer is safe to reuse. Correctness, privacy, freshness, and personalization matter more than savings.


Caching Layers

Cache Type | Match Condition | Good For
Exact-match cache | Same normalized request hash | Deterministic repeated tasks
Semantic cache | Similar meaning above threshold | FAQ and support questions
Prefix cache | Shared prompt prefix reused | Long system prompts and static context
Retrieval cache | Same query returns same chunks | RAG retrieval reuse

Exact Prompt Caching

Exact caching hashes the normalized prompt together with the model parameters and reuses a stored response only when the hash matches exactly.

Cache Key Inputs

Input | Why It Belongs in Key
Model name and version | Different models produce different outputs
System prompt | Changes behavior
User prompt | Main input
Temperature and decoding params | Affect output
Tool schema | Changes valid output
Prompt version | Supports safe rollout
Tenant or permission scope | Prevents data leakage

Exact caching is safest for deterministic extraction, classification, and repeated internal tasks.
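
The key inputs listed above can be combined into a single hash. The following is a minimal sketch, not a definitive implementation: the function names, the whitespace-only normalization rule, and the choice of SHA-256 over canonical JSON are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Optional


def normalize_prompt(text: str) -> str:
    """Collapse whitespace runs and trim the ends; case is preserved."""
    return " ".join(text.split())


def exact_cache_key(
    model: str,
    system_prompt: str,
    user_prompt: str,
    decoding: dict,
    tool_schema: Optional[dict],
    prompt_version: str,
    scope: str,
) -> str:
    """Return a SHA-256 hex digest over every input that can change the answer."""
    payload = {
        "model": model,                      # model name and version
        "system": normalize_prompt(system_prompt),
        "user": normalize_prompt(user_prompt),
        "decoding": decoding,                # temperature and other decoding params
        "tools": tool_schema,
        "prompt_version": prompt_version,
        "scope": scope,                      # tenant or permission scope
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical values: the two calls differ only in whitespace and share a key.
key_a = exact_cache_key("model-v1", "You are a classifier.", "Is  this spam?",
                        {"temperature": 0.0}, None, "p3", "tenant-a")
key_b = exact_cache_key("model-v1", "You are a classifier.", " Is this spam? ",
                        {"temperature": 0.0}, None, "p3", "tenant-a")
```

Because the scope is part of the payload, the same prompt from a different tenant produces a different key and cannot hit another tenant's entry.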


Semantic Caching

Semantic caching embeds the user request and looks for a previously answered request with high similarity.

Threshold Tuning

Threshold | Behavior
Too high | Low hit rate, safer
Too low | More hits, higher wrong-answer risk
Per intent | Best practical approach

Semantic caching is useful when many users ask equivalent questions in different words. It is dangerous when small wording changes matter.

⚠️ Do not semantic-cache sensitive or personalized answers unless lookups are restricted to entries inside the same permission and personalization boundary.
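
A small sketch can make the lookup, the per-intent threshold, and the scope boundary concrete. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and the class name, intent labels, and threshold values are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, thresholds: dict, default_threshold: float = 0.95):
        self.thresholds = thresholds            # per-intent similarity thresholds
        self.default_threshold = default_threshold
        self.entries = {}                       # (scope, intent) -> [(vector, answer)]

    def put(self, scope: str, intent: str, question: str, answer: str) -> None:
        bucket = self.entries.setdefault((scope, intent), [])
        bucket.append((toy_embed(question), answer))

    def get(self, scope: str, intent: str, question: str):
        """Return the best answer in this scope and intent, or None on a miss."""
        threshold = self.thresholds.get(intent, self.default_threshold)
        query = toy_embed(question)
        best_score, best_answer = 0.0, None
        # Only entries stored under the same scope are ever compared.
        for vector, answer in self.entries.get((scope, intent), []):
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= threshold else None


cache = SemanticCache({"faq": 0.7})
cache.put("public", "faq", "how do i reset my password", "Use the reset link.")
hit = cache.get("public", "faq", "how do i reset my password please")
miss_meaning = cache.get("public", "faq", "how do i delete my account")
miss_scope = cache.get("tenant-b", "faq", "how do i reset my password")
```

In this toy setup the second question shares four of six words with the stored one and still falls below the 0.7 threshold; with a lower threshold it would have returned the password answer to an account-deletion question, which is the wrong-answer risk described above.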


Prefix Caching

Many LLM requests share a long prefix: system instructions, tool definitions, policy text, or static product documentation. Prefix caching lets the provider or serving layer reuse computation for that shared prefix.

Prefix Cache Design

  1. Keep stable instructions at the beginning.
  2. Avoid changing timestamps or request IDs in the prefix.
  3. Put dynamic user content later.
  4. Version prompt prefixes deliberately.
  5. Measure cache hit rate and latency impact.
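
The ordering rules above can be sketched as a prompt builder that keeps stable content first and per-request content last. This is an illustration under assumptions: the constant names, the sample instruction and policy text, and the helper functions are hypothetical, and the shared length is measured in characters as a rough proxy, whereas serving layers typically match prefixes on tokens.

```python
import hashlib
import os

PREFIX_VERSION = "v1"  # hypothetical label, bumped deliberately on prompt changes
SYSTEM_INSTRUCTIONS = "You are a support assistant. Answer from the policy text only."
POLICY_TEXT = "Example policy text goes here."  # placeholder static context


def build_prompt(user_message: str, request_id: str):
    """Return (prefix, suffix): stable content first, per-request content last."""
    prefix = f"[prompt {PREFIX_VERSION}]\n{SYSTEM_INSTRUCTIONS}\n\n{POLICY_TEXT}\n"
    # Request IDs and user text stay out of the prefix so it never varies.
    suffix = f"\nRequest ID: {request_id}\nUser: {user_message}\n"
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash for tagging hit-rate and latency metrics by prefix version."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


def shared_prefix_length(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


prefix_1, suffix_1 = build_prompt("Where is my order?", "req-1")
prefix_2, suffix_2 = build_prompt("Can I get a refund?", "req-2")
reusable = shared_prefix_length(prefix_1 + suffix_1, prefix_2 + suffix_2)
```

If the request ID were placed at the top instead, the two prompts would diverge after a handful of characters and almost nothing would be reusable.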

Caching RAG Systems

RAG has more cacheable parts than a plain LLM call.

Layer | Cache Key
Parsed document | Document ID plus content hash
Embeddings | Chunk hash plus embedding model
Retrieval result | Query, filters, index version
Rerank result | Query, candidates, reranker version
Final answer | Prompt, context IDs, model, user scope

Final-answer caching is the riskiest. Retrieval and embedding caches are usually safer.
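
The per-layer keys in the table can be sketched as small key-builder functions. The function names and the hashing scheme are assumptions for illustration, as is the choice to sort rerank candidates (treating them as a set) while preserving the order of context IDs in the final-answer key.

```python
import hashlib
import json


def _digest(*parts) -> str:
    """Hash a tuple of JSON-serializable parts in a canonical form."""
    canonical = json.dumps(parts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def parsed_document_key(document_id: str, content: bytes) -> str:
    # A content change yields a new key, so stale parses are never reused.
    return _digest("parsed", document_id, hashlib.sha256(content).hexdigest())


def embedding_key(chunk_text: str, embedding_model: str) -> str:
    chunk_hash = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    return _digest("embedding", chunk_hash, embedding_model)


def retrieval_key(query: str, filters: dict, index_version: str) -> str:
    return _digest("retrieval", query, filters, index_version)


def rerank_key(query: str, candidate_ids: list, reranker_version: str) -> str:
    # Assumption: the reranker's output depends on the candidate set, not its order.
    return _digest("rerank", query, sorted(candidate_ids), reranker_version)


def final_answer_key(prompt: str, context_ids: list, model: str, user_scope: str) -> str:
    # Context order is preserved because it changes the prompt the model sees.
    return _digest("answer", prompt, context_ids, model, user_scope)
```

The layer name is included as the first part of every key so that two layers with coincidentally identical inputs cannot collide in a shared store.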


TTL and Invalidation

Data Type | TTL Strategy
Static docs | Long TTL with content-hash invalidation
Product policies | Short TTL or versioned invalidation
User-specific answers | Very short TTL or no cache
Compliance answers | Avoid final-answer cache
Analytics summaries | TTL aligned with source refresh
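
The table can be turned into a small policy map plus an expiring store. This is a sketch under assumptions: the TTL values are hypothetical placeholders rather than recommendations, the analytics entry assumes a daily source refresh, and the class is an in-memory stand-in for whatever cache backend is actually used.

```python
import time

# Hypothetical TTLs in seconds; None means "do not cache this data type".
TTL_SECONDS = {
    "static_docs": 7 * 24 * 3600,      # long TTL
    "product_policies": 3600,          # short TTL
    "user_specific": 60,               # very short TTL
    "compliance": None,                # avoid final-answer cache
    "analytics_summary": 24 * 3600,    # assumes the source refreshes daily
}


class TTLCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock            # injectable so expiry can be tested
        self._store = {}               # key -> (expires_at, value)

    def put(self, key: str, value, data_type: str) -> bool:
        """Store the value if its data type is cacheable; return whether it was stored."""
        ttl = TTL_SECONDS.get(data_type)   # unknown types are not cached
        if ttl is None:
            return False
        self._store[key] = (self._clock() + ttl, value)
        return True

    def get(self, key: str):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]
            return None
        return value
```

For static docs, content-hash invalidation comes from the key rather than the TTL: if the key includes a hash of the document content, an edited document simply misses the old entry.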

Invalidation Events
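
A minimal sketch of version-based invalidation, assuming the relevant events are changes to the prompt, model, document, embedding, or ACL version. The class and field names are hypothetical, and the approach shown is lazy invalidation: a stale entry is dropped when it is next read rather than when the version changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versions:
    prompt: str
    model: str
    document: str
    embedding: str
    acl: str


class VersionedCache:
    def __init__(self):
        self._store = {}  # key -> (Versions at write time, value)

    def put(self, key: str, value, versions: Versions) -> None:
        self._store[key] = (versions, value)

    def get(self, key: str, current: Versions):
        """Return the value only if it was written under the current versions."""
        entry = self._store.get(key)
        if entry is None:
            return None
        stored, value = entry
        if stored != current:
            # Any version bump (for example an ACL change) invalidates the entry.
            del self._store[key]
            return None
        return value


v1 = Versions(prompt="p3", model="model-v1", document="d12", embedding="e2", acl="a7")
v2 = Versions(prompt="p3", model="model-v1", document="d12", embedding="e2", acl="a8")

store = VersionedCache()
store.put("answer:refunds", "cached answer", v1)
before_change = store.get("answer:refunds", v1)
after_acl_change = store.get("answer:refunds", v2)
```

Comparing the full version tuple is deliberately conservative: an entry is discarded even when the changed version did not affect it, which trades some hit rate for safety.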


When Not to Cache

Do not cache blindly.

Situation | Why
Medical, legal, or financial decisions | High-stakes freshness and correctness
Personalized account answers | Privacy and permission risk
Rapidly changing data | Stale response risk
Creative generation | Users expect variation
Tool-dependent output | Side effects and current state matter
Safety-sensitive moderation | Policy and context must be current
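
The situations in the table can be encoded as a gate that runs before any final-answer cache write. This is a sketch, assuming requests have already been labeled with these traits upstream; the `RequestTraits` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestTraits:
    high_stakes: bool = False            # medical, legal, or financial decision
    personalized: bool = False           # account-specific answer
    rapidly_changing_data: bool = False
    creative: bool = False
    uses_tools: bool = False             # output depends on tool calls
    moderation: bool = False             # safety-sensitive moderation


def cache_block_reasons(traits: RequestTraits) -> list:
    """Return every reason the final answer should not be cached."""
    reasons = []
    if traits.high_stakes:
        reasons.append("high-stakes freshness and correctness")
    if traits.personalized:
        reasons.append("privacy and permission risk")
    if traits.rapidly_changing_data:
        reasons.append("stale response risk")
    if traits.creative:
        reasons.append("users expect variation")
    if traits.uses_tools:
        reasons.append("side effects and current state matter")
    if traits.moderation:
        reasons.append("policy and context must be current")
    return reasons


def may_cache_final_answer(traits: RequestTraits) -> bool:
    """Allow caching only when no blocking reason applies."""
    return not cache_block_reasons(traits)


faq_request = RequestTraits()
billing_request = RequestTraits(personalized=True, rapidly_changing_data=True)
```

Returning the reasons rather than a bare boolean makes it easier to log why a response was not cached and to audit the policy later.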

What to Remember for Interviews

  1. Exact caching is safest: Hash the full normalized request and parameters.
  2. Semantic caching is approximate: Tune thresholds by task and risk.
  3. Prefix caching rewards stable prompts: Put reusable content first.
  4. Cache boundaries must include permissions: Avoid cross-tenant or cross-user leaks.
  5. Invalidate by version: Prompt, model, document, embedding, and ACL versions matter.

Practice: Design caching for a RAG support bot. Decide which layers to cache, how to key them, how to invalidate them, and which answers should never be cached.