Gen AI Systems

Streaming and Latency Optimization: TTFT, SSE, KV Cache, and Batching

Design low-latency LLM experiences using streaming, Server-Sent Events, time-to-first-token optimization, KV cache management, speculative decoding, batching, and context reduction.

Tags: streaming, SSE, TTFT, KV cache, speculative decoding

Why LLM Latency Feels Different

Traditional APIs usually return one complete response. LLMs generate tokens sequentially, so latency has multiple parts: time to start, speed while generating, and total completion time.

| Metric | Meaning |
| --- | --- |
| TTFT | Time to first token |
| Inter-token latency | Delay between streamed tokens |
| Tokens per second | Generation throughput |
| Total latency | Time until full response is complete |

Key idea: Streaming generally does not shorten total generation time, but it makes the experience feel faster because output starts appearing earlier.
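To make the four metrics concrete, here is a minimal sketch that derives them from a request start time and a list of token arrival times. The `StreamMetrics` and `compute_stream_metrics` names and the timestamps are illustrative assumptions, not part of any provider SDK.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamMetrics:
    ttft: float               # seconds from request start to first token
    mean_inter_token: float   # mean gap between consecutive tokens, seconds
    tokens_per_second: float  # decode throughput after the first token
    total_latency: float      # seconds from request start to last token


def compute_stream_metrics(request_start: float, token_times: List[float]) -> StreamMetrics:
    """Derive the four metrics from a start time and token arrival times."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    decode_time = token_times[-1] - token_times[0]
    tps = len(gaps) / decode_time if decode_time > 0 else 0.0
    return StreamMetrics(ttft, mean_gap, tps, total)


# Hypothetical timestamps in seconds: request sent at t=0, first token at 0.5 s,
# then one token every 0.25 s.
metrics = compute_stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

With these numbers, a user waits 0.5 s to see the first token even though the full answer takes 1.5 s, which is why TTFT and total latency are worth tracking separately.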


Latency Breakdown

Major Contributors

| Contributor | Optimization |
| --- | --- |
| Network round trip | Keep gateway close to users and providers |
| Queueing | Capacity planning and rate limits |
| Prompt processing | Shorter context, prefix caching |
| Retrieval | Cache retrieval results and rerank efficiently |
| Generation | Smaller model, shorter answer, speculative decoding |
| Client rendering | Stream progressively |
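Before optimizing any one contributor, it helps to see where the time goes. The following is a minimal sketch of per-stage timing; the `timed` helper and the stage names are assumptions for illustration, and the `time.sleep` calls stand in for real retrieval and generation work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

stage_seconds: Dict[str, float] = {}


@contextmanager
def timed(stage: str) -> Iterator[None]:
    """Accumulate wall-clock time spent in one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_seconds[stage] = stage_seconds.get(stage, 0.0) + elapsed


# Placeholder work standing in for real retrieval and generation calls.
with timed("retrieval"):
    time.sleep(0.01)
with timed("generation"):
    time.sleep(0.02)

total = sum(stage_seconds.values())
for stage, seconds in stage_seconds.items():
    print(f"{stage}: {seconds * 1000:.1f} ms ({seconds / total:.0%} of total)")
```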

Server-Sent Events

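SSE is a common way to stream token-by-token responses over HTTP. The server keeps a single response open with the `text/event-stream` content type and writes events as they become available; a blank line ends each event.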

```txt
event: token
data: {"text":"The"}

event: token
data: {"text":" answer"}

event: done
data: {}
```
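The event stream above can be produced and consumed with a few lines of code. This is a simplified sketch: `format_sse` and `parse_sse` are hypothetical helper names, and the parser handles only `event:` and `data:` fields, ignoring `id:`, `retry:`, and comment lines that a full client would also support.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def format_sse(event: str, data: Dict) -> str:
    """Serialize one event: an `event:` line, a `data:` line, then a blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, Dict]]:
    """Yield (event, data) pairs; a blank line terminates each event."""
    event = "message"  # default event type when no `event:` field is sent
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":
            if data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())


stream = (
    format_sse("token", {"text": "The"})
    + format_sse("token", {"text": " answer"})
    + format_sse("done", {})
)

answer = ""
for event, data in parse_sse(stream.splitlines()):
    if event == "token":
        answer += data["text"]  # a UI would append this to the visible message
print(answer)
```

Wrapping each token in JSON keeps leading spaces and newlines inside the token text intact, which raw `data:` lines would not reliably preserve.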

SSE vs WebSockets

| Protocol | Best For |
| --- | --- |
| SSE | One-way server-to-client token streaming |
| WebSocket | Bidirectional realtime interaction |
| Plain HTTP | Non-interactive background tasks |

SSE is usually simpler for chat and assistant responses.


Optimizing TTFT

Time to first token depends heavily on input size, provider queueing, routing, and model choice.

| Technique | Effect |
| --- | --- |
| Reduce prompt tokens | Faster prefill |
| Cache stable prefix | Reuse prompt processing |
| Use smaller model | Faster first token |
| Avoid unnecessary tools before response | Less pre-generation work |
| Stream immediately | Better perceived latency |
| Keep retrieval top-k small | Less context assembly overhead |
Warning: Large context windows are not free. Sending 100K tokens because the model supports it can create high latency, high cost, and weaker attention to the important parts.
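One way to keep prompts small is to cap retrieved context by both chunk count and token budget. The sketch below assumes a hypothetical retriever that returns `(score, text)` pairs, and it counts whitespace-separated words as a crude stand-in for a real tokenizer.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def assemble_context(chunks: List[Tuple[float, str]], token_budget: int, top_k: int) -> List[str]:
    """Keep the highest-scoring chunks that fit the budget, at most top_k of them."""
    selected: List[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if len(selected) == top_k:
            break
        cost = count_tokens(text)
        if used + cost <= token_budget:  # skip chunks that would overflow the budget
            selected.append(text)
            used += cost
    return selected


# Hypothetical (relevance score, chunk text) pairs from a retriever.
retrieved = [
    (0.91, "KV cache stores attention keys and values"),
    (0.40, "unrelated release notes " * 20),
    (0.85, "prefill cost grows with prompt length"),
    (0.30, "company history"),
]
context = assemble_context(retrieved, token_budget=20, top_k=3)
print(context)
```

Here the 60-word chunk is skipped because it alone exceeds the 20-token budget. A production version would likely also apply a minimum relevance score so that low-value chunks are not pulled in just because they fit.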


KV Cache

During generation, transformer models store the attention keys and values computed for previous tokens and reuse them at every decode step instead of recomputing them. This store is called the KV cache.

The KV cache improves generation speed, but it consumes memory. Longer contexts and more concurrent requests require more memory.
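The memory cost can be estimated with simple arithmetic: each token stores one key vector and one value vector per layer and per KV head. The sketch below applies that formula to a hypothetical model shape (32 layers, 32 KV heads of dimension 128, 16-bit values); it ignores allocator overhead, paging, and cache quantization.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size: a key and a value vector per layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
one_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
sixteen_streams = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)

print(per_token / 2**10, "KiB per token")
print(one_request / 2**30, "GiB for one 4096-token request")
print(sixteen_streams / 2**30, "GiB for 16 concurrent 4096-token requests")
```

Under these assumptions, one token costs 512 KiB, one 4096-token request costs 2 GiB, and sixteen such requests cost 32 GiB, before counting the model weights. Models that share KV heads across query heads have a smaller `num_kv_heads` and a proportionally smaller cache.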

Serving Implications

| Concern | Impact |
| --- | --- |
| Long prompts | More prefill work and KV memory |
| Long outputs | More decode steps |
| Many concurrent streams | More active KV cache |
| Larger models | More GPU memory |
| Batching | Better utilization but may affect latency |

Batching

Batching groups multiple inference requests to improve GPU utilization.

| Batch Type | Use Case |
| --- | --- |
| Static batching | Offline jobs with predictable shapes |
| Dynamic batching | Online serving with variable arrivals |
| Continuous batching | LLM serving where requests join and leave during decoding |

Batching improves throughput but can add queueing delay. Interactive systems need careful latency budgets.
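The throughput and delay trade-off shows up even in a toy model of dynamic batching. In this sketch a batch is dispatched when it is full or when a wait deadline expires; the function names, arrival times, and limits are illustrative assumptions, and times are integer milliseconds.

```python
from typing import List


def form_batches(arrivals_ms: List[int], max_batch: int, max_wait_ms: int) -> List[List[int]]:
    """Group sorted arrival times. A batch closes when it is full or when
    max_wait_ms has passed since its first request arrived."""
    batches: List[List[int]] = []
    current: List[int] = []
    for t in arrivals_ms:
        if current and (len(current) == max_batch or t - current[0] > max_wait_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


def queueing_delays(batches: List[List[int]], max_batch: int, max_wait_ms: int) -> List[int]:
    """Time each request waits before its batch is dispatched."""
    delays: List[int] = []
    for batch in batches:
        # A full batch dispatches on its last arrival, otherwise on the deadline.
        dispatch = batch[-1] if len(batch) == max_batch else batch[0] + max_wait_ms
        delays.extend(dispatch - t for t in batch)
    return delays


arrivals = [0, 5, 8, 40, 42, 90]  # hypothetical request arrival times in ms
batches = form_batches(arrivals, max_batch=3, max_wait_ms=20)
delays = queueing_delays(batches, max_batch=3, max_wait_ms=20)
print(batches)  # [[0, 5, 8], [40, 42], [90]]
print(delays)   # [8, 3, 0, 20, 18, 20]
```

The lone request arriving at 90 ms waits the full 20 ms for companions that never come, which is the queueing delay an interactive latency budget has to account for. Continuous batching reduces this cost by letting requests join a running batch between decode steps.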


Speculative Decoding

Speculative decoding uses a smaller draft model to propose several tokens ahead and the larger target model to verify them in a single forward pass, keeping the proposed tokens it agrees with.

It can reduce generation latency when draft tokens are often accepted. It is mostly a serving-layer optimization and may not be exposed by every provider.
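The propose-and-verify loop can be illustrated with toy models. This is a sketch of the greedy variant only: the draft and target are deterministic functions standing in for real models, a proposed token is accepted when it equals the target's greedy choice, and the first mismatch is replaced by the target's token. Sampling-based acceptance rules and the extra token a real target pass can emit after a fully accepted draft are omitted. All names are hypothetical.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(context: List[int]) -> int:
    """Toy 'large' model: a fixed rule standing in for greedy decoding."""
    return (context[-1] * 3 + 1) % 10


def draft_next(context: List[int]) -> int:
    """Toy 'small' model: agrees with the target except after token 4."""
    return 0 if context[-1] == 4 else target_next(context)


def greedy_decode(prompt: List[int], model: NextToken, num_tokens: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(model(tokens))
    return tokens[len(prompt):]


def speculative_decode(prompt: List[int], draft: NextToken, target: NextToken,
                       num_tokens: int, k: int) -> Tuple[List[int], int]:
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position. A real server scores
        #    all k positions in one forward pass, so we count one pass here.
        target_passes += 1
        accepted: List[int] = []
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(guess)
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + num_tokens], target_passes


baseline = greedy_decode([1], target_next, num_tokens=8)
output, passes = speculative_decode([1], draft_next, target_next, num_tokens=8, k=4)
print(baseline)        # [4, 3, 0, 1, 4, 3, 0, 1]
print(output, passes)  # same tokens, 3 target passes instead of 8
```

The output matches target-only decoding exactly; the gain is that the target runs 3 verification passes rather than 8 sequential steps. If the draft were wrong more often, each pass would accept fewer tokens and the benefit would shrink.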


Product UX Patterns

| Pattern | Why It Helps |
| --- | --- |
| Stream answer text | User sees progress quickly |
| Show retrieval progress | Makes waiting understandable |
| Render citations as they arrive | Builds trust |
| Allow cancel | Saves cost and user time |
| Generate concise by default | Lower latency and cost |
| Continue button | Avoids long automatic generations |

Latency optimization is not only backend work. UX can reduce perceived waiting and unnecessary generation.
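The cancel pattern is straightforward to sketch: the token generator checks a cancellation flag before producing each token. The `generate_tokens` name and the word list are assumptions for illustration; in a real service the flag would be set when the client disconnects or presses Cancel, and the upstream provider request would need to be aborted as well.

```python
import threading
from typing import Iterator, List


def generate_tokens(words: List[str], cancel: threading.Event) -> Iterator[str]:
    """Yield tokens one at a time, stopping once cancellation is requested."""
    for word in words:
        if cancel.is_set():
            return
        yield word


cancel = threading.Event()
shown: List[str] = []
for token in generate_tokens(["Streaming", "lets", "users", "stop", "early"], cancel):
    shown.append(token)
    if len(shown) == 2:  # stand-in for the user pressing a Cancel button
        cancel.set()
print(shown)  # ['Streaming', 'lets']
```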


What to Remember for Interviews

  1. Measure TTFT, inter-token latency, and total latency separately.
  2. SSE is the default simple streaming choice for web apps.
  3. Context size drives prefill cost: retrieve and assemble carefully.
  4. KV cache is a memory constraint: concurrency and context length matter.
  5. Batching improves throughput but may add queueing latency.

Practice: Design a streaming chat API for a RAG assistant. Include retrieval timing, SSE events, cancellation, provider timeout, and metrics for TTFT and total latency.