Streaming and Latency Optimization: TTFT, SSE, KV Cache, and Batching

Design low-latency LLM experiences using streaming, Server-Sent Events, time-to-first-token optimization, KV cache management, speculative decoding, batching, and context reduction.

Why LLM Latency Feels Different

Traditional APIs usually return one complete response. LLMs generate tokens sequentially, so latency has multiple parts: time to start, speed while generating, and total completion time.

Metric	Meaning
TTFT	Time to first token
Inter-token latency	Delay between streamed tokens
Tokens per second	Generation throughput
Total latency	Time until full response is complete
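
The four metrics in the table can be derived from a request start time and the arrival time of each streamed token. The sketch below is one minimal way to do that; the names `StreamMetrics` and `compute_stream_metrics` are illustrative and not part of any provider SDK, and timestamps are assumed to be in seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamMetrics:
    ttft_s: float               # time to first token
    mean_inter_token_s: float   # average gap between consecutive tokens
    tokens_per_second: float    # decode throughput after the first token
    total_latency_s: float      # request start until the last token


def compute_stream_metrics(request_start: float, token_times: List[float]) -> StreamMetrics:
    """Derive latency metrics from a start time and token arrival times (seconds)."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    # Throughput is measured over the decode phase only, so TTFT does not skew it.
    decode_time = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else 0.0
    return StreamMetrics(ttft, mean_gap, tps, total)
```

In a client, `token_times` could be filled by calling `time.monotonic()` each time a streamed chunk arrives. Reporting TTFT separately from throughput keeps a slow prefill from being mistaken for slow decoding.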

Key idea: Streaming does not reduce total generation time, but it makes the experience feel faster because the user sees output as soon as the first token is ready.

Latency Breakdown

Major Contributors

Contributor	Optimization
Network round trip	Keep gateway close to users and providers
Queueing	Capacity planning and rate limits
Prompt processing	Shorter context, prefix caching
Retrieval	Cache retrieval results and keep reranking cheap
Generation	Smaller model, shorter answer, speculative decoding
Client rendering	Stream progressively

Server-Sent Events

SSE is a common streaming protocol for token-by-token responses over HTTP.

event: token
data: {"text":"The"}

event: token
data: {"text":" answer"}

event: done
data: {}
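
A server can produce the event stream above by writing each event as an `event:` line, a `data:` line, and a terminating blank line, with the response served as `Content-Type: text/event-stream`. The following is a minimal sketch of that framing only; the helper names `format_sse` and `stream_tokens` are hypothetical, and wiring the generator into a web framework is left out.

```python
import json
from typing import Dict, Iterable, Iterator


def format_sse(event: str, data: Dict) -> str:
    """Serialize one event in SSE wire format: fields, then a blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one `token` event per text piece, then a final `done` event."""
    for text in tokens:
        yield format_sse("token", {"text": text})
    # The explicit `done` event lets the client distinguish a finished
    # answer from a dropped connection.
    yield format_sse("done", {})
```

Calling `"".join(stream_tokens(["The", " answer"]))` reproduces the three events shown above.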

SSE vs WebSockets

Protocol	Best For
SSE	One-way server-to-client token streaming
WebSocket	Bidirectional realtime interaction
Plain HTTP	Non-interactive background tasks

SSE is usually simpler for chat and assistant responses.

Optimizing TTFT

Time to first token depends heavily on input size, provider queueing, routing, and model choice.

Technique	Effect
Reduce prompt tokens	Faster prefill
Cache stable prefix	Reuse prompt processing
Use smaller model	Faster first token
Avoid unnecessary tools before response	Less pre-generation work
Stream immediately	Better perceived latency
Keep retrieval top-k small	Less context assembly overhead

Large context windows are not free: Sending 100K tokens because the model supports it can create high latency, high cost, and weaker attention to the important parts.
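
One simple way to keep context small is to give retrieved chunks a fixed token budget and stop adding them once the budget is spent. The sketch below assumes the chunks are already sorted by relevance; `approx_token_count` is a crude whitespace-based stand-in for a real tokenizer, and both function names are illustrative.

```python
from typing import Callable, List


def approx_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_chunks_to_budget(
    ranked_chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Keep the highest-ranked chunks whose combined size fits in max_tokens."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(chunk)
        used += cost
    return selected
```

In practice the counter would be the tokenizer of the model being called, since word counts can differ noticeably from token counts.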

KV Cache

During generation, transformer models store the attention key and value tensors computed for previous tokens and reuse them at every decode step instead of recomputing them. This store is called the KV cache.

The KV cache improves generation speed, but it consumes memory. Longer contexts and more concurrent requests require more memory.
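
The memory cost can be estimated with simple arithmetic: each token stores one key and one value vector per layer and per KV head. The sketch below assumes standard attention with an unquantized cache and no paging or sharing between requests; the model shape in the usage note is illustrative rather than a description of any specific deployment.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Estimate KV cache size in bytes for `batch_size` sequences of `seq_len` tokens."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size
```

For a hypothetical model with 32 layers, 32 KV heads, and a head dimension of 128 in fp16, this gives 512 KiB per token, so one 4,096-token sequence needs 2 GiB and eight concurrent ones need 16 GiB. Memory grows linearly in both context length and concurrency.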

Serving Implications

Concern	Impact
Long prompts	More prefill work and KV memory
Long outputs	More decode steps
Many concurrent streams	More active KV cache
Larger models	More GPU memory
Batching	Better utilization but may affect latency

Batching

Batching groups multiple inference requests to improve GPU utilization.

Batch Type	Use Case
Static batching	Offline jobs with predictable shapes
Dynamic batching	Online serving with variable arrivals
Continuous batching	LLM serving where requests join and leave during decoding

Batching improves throughput but can add queueing delay. Interactive systems need careful latency budgets.
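
The trade-off can be seen in a small simulation of dynamic batching: a batch is dispatched either when it is full or when its oldest request has waited a maximum time. This is a simplified sketch that ignores how long the GPU stays busy with the previous batch; `form_batches` and its parameters are hypothetical names chosen for illustration.

```python
from typing import List, Tuple


def form_batches(
    arrivals_ms: List[float],
    max_batch_size: int,
    max_wait_ms: float,
) -> List[Tuple[float, List[int]]]:
    """Group requests (sorted by arrival time) into batches.

    Returns (dispatch_time_ms, request_indices) for each batch.
    """
    batches: List[Tuple[float, List[int]]] = []
    i, n = 0, len(arrivals_ms)
    while i < n:
        # The oldest request in the batch sets the dispatch deadline.
        deadline = arrivals_ms[i] + max_wait_ms
        j = i
        while j < n and j - i < max_batch_size and arrivals_ms[j] <= deadline:
            j += 1
        if j - i == max_batch_size:
            dispatch = arrivals_ms[j - 1]  # full: dispatch when the last member arrives
        else:
            dispatch = deadline            # not full: dispatch when the wait expires
        batches.append((dispatch, list(range(i, j))))
        i = j
    return batches
```

With arrivals at 0, 2, 3, 20, and 21 ms, a batch size of 3, and a 10 ms wait, the first three requests leave together at 3 ms, while the last two wait until 30 ms for a batch that never fills. That extra wait is the queueing delay an interactive latency budget has to account for.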

Speculative Decoding

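Speculative decoding uses a smaller draft model to propose several tokens ahead and a larger target model to verify them together, keeping the proposals that match what the larger model would have produced.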

It can reduce generation latency when draft tokens are often accepted. It is mostly a serving-layer optimization and may not be exposed by every provider.
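
The propose-and-verify loop can be sketched with toy models. The version below is the greedy variant only: a draft token is accepted when it equals the target model's greedy choice, which is simpler than the sampling-based acceptance rule used with non-greedy decoding. Both models are stand-in functions from a token prefix to a next token, and all names are illustrative.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the greedy next token


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    num_new_tokens: int,
    draft_len: int = 4,
) -> List[int]:
    """Greedy speculative decoding; output matches plain greedy decoding with `target`."""
    out = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(out) < goal:
        # 1. The draft model proposes up to draft_len tokens autoregressively.
        k = min(draft_len, goal - len(out))
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target model checks each proposed position. A real server scores
        #    all positions in one forward pass; this loop calls it per position
        #    for clarity.
        for tok in proposal:
            expected = target(out)  # out holds exactly the tokens accepted so far
            if tok != expected:
                out.append(expected)  # reject: keep the target's token, drop the rest
                break
            out.append(tok)           # accept the draft token
    return out[len(prompt):]
```

Because every kept token is either confirmed or replaced by the target model, the result is identical to decoding with the target alone. The speedup in a real system comes from verifying several positions per target pass, so it depends on how often the draft is right.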

Product UX Patterns

Pattern	Why It Helps
Stream answer text	User sees progress quickly
Show retrieval progress	Makes waiting understandable
Render citations as they arrive	Builds trust
Allow cancel	Saves cost and user time
Generate concise by default	Lower latency and cost
Continue button	Avoids long automatic generations

Latency optimization is not only backend work. UX can reduce perceived waiting and unnecessary generation.
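
Cancellation, for example, only saves cost if the server actually stops pulling tokens from the model once the user gives up. The sketch below shows one way to do that with a `threading.Event` set by the request handler; `stream_until_cancelled` is a hypothetical helper, and how the cancel signal reaches the server (client disconnect, explicit endpoint) is assumed to be handled elsewhere.

```python
import threading
from typing import Iterable, Iterator


def stream_until_cancelled(tokens: Iterable[str], cancelled: threading.Event) -> Iterator[str]:
    """Forward tokens from an upstream iterator until `cancelled` is set."""
    upstream = iter(tokens)
    # Check the flag before requesting the next token, so no further
    # generation work is triggered after the user cancels.
    while not cancelled.is_set():
        try:
            token = next(upstream)
        except StopIteration:
            break  # upstream finished normally
        yield token
```

If the upstream is a provider stream, closing that connection after the loop exits is what actually stops billing for further tokens, where the provider supports it.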

What to Remember for Interviews

Measure TTFT, inter-token latency, and total latency separately.
SSE is the default simple streaming choice for web apps.
Context size drives prefill cost: retrieve and assemble carefully.
KV cache is a memory constraint: concurrency and context length matter.
Batching improves throughput, but may add queueing latency.

Practice: Design a streaming chat API for a RAG assistant. Include retrieval timing, SSE events, cancellation, provider timeout, and metrics for TTFT and total latency.
