Streaming and Latency Optimization: TTFT, SSE, KV Cache, and Batching
Design low-latency LLM experiences using streaming, Server-Sent Events, time-to-first-token optimization, KV cache management, speculative decoding, batching, and context reduction.
Why LLM Latency Feels Different
Traditional APIs usually return one complete response. LLMs generate tokens sequentially, so latency has multiple parts: time to start, speed while generating, and total completion time.
| Metric | Meaning |
|---|---|
| TTFT | Time to first token |
| Inter-token latency | Delay between streamed tokens |
| Tokens per second | Generation throughput |
| Total latency | Time until full response is complete |
Key idea: Streaming by itself does not reduce total generation time, but it makes the experience feel faster because output starts appearing as soon as the first token is ready.
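To make the four metrics concrete, here is a minimal sketch that derives them from the arrival times of a token stream. The names `StreamMetrics` and `measure_stream` are illustrative, not part of any provider SDK, and the injectable `clock` parameter is an assumption added so the function can be tested without real delays.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class StreamMetrics:
    ttft_s: float               # request start -> first token
    inter_token_s: List[float]  # gaps between consecutive tokens
    total_s: float              # request start -> last token
    tokens_per_s: float         # decode throughput after the first token


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> StreamMetrics:
    """Consume a token stream and record when each token arrives."""
    start = clock()
    # Each timestamp is taken right after the iterator hands over a token.
    arrivals = [clock() for _ in tokens]
    if not arrivals:
        raise ValueError("stream produced no tokens")
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    decode_time = arrivals[-1] - arrivals[0]
    rate = len(gaps) / decode_time if decode_time > 0 else 0.0
    return StreamMetrics(
        ttft_s=arrivals[0] - start,
        inter_token_s=gaps,
        total_s=arrivals[-1] - start,
        tokens_per_s=rate,
    )
```

In practice `tokens` would be the iterator returned by a streaming client. Throughput is computed over the decode phase only, so a slow first token does not hide a fast generation rate, or the reverse.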
Latency Breakdown
Major Contributors
| Contributor | Optimization |
|---|---|
| Network round trip | Keep gateway close to users and providers |
| Queueing | Capacity planning and rate limits |
| Prompt processing | Shorter context, prefix caching |
| Retrieval | Cache retrieval results and rerank efficiently |
| Generation | Smaller model, shorter answer, speculative decoding |
| Client rendering | Stream progressively |
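Before optimizing any contributor in the table, it helps to know which one dominates. The following is a small sketch of per-stage timing. `StageTimer` and the stage names are hypothetical, and a production system would more likely emit these numbers through its tracing or metrics library.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class StageTimer:
    """Accumulates wall-clock time per named stage of a request."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self.clock = clock
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = self.clock()
        try:
            yield
        finally:
            # Add to any earlier time recorded under the same stage name.
            elapsed = self.clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self) -> str:
        """Name of the stage with the largest accumulated time."""
        return max(self.durations, key=self.durations.get)
```

A request handler would wrap each step, for example `with timer.stage("retrieval"): ...`, and log `timer.durations` when the response completes.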
Server-Sent Events
SSE is a common streaming protocol for token-by-token responses over HTTP.
```text
event: token
data: {"text":"The"}

event: token
data: {"text":" answer"}

event: done
data: {}

```
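The wire format above is plain text: each event is a set of `field: value` lines terminated by a blank line, served with the `text/event-stream` content type. The sketch below writes and reads that format. `format_sse` and `parse_sse` are illustrative helper names, and the parser handles only the single-line `event` and `data` fields used here, not the full SSE specification (multi-line data, `id`, `retry`, comments).

```python
import json
from typing import Dict, Iterator, Tuple


def format_sse(event: str, data: Dict) -> str:
    """Serialize one event; the trailing blank line terminates it."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def parse_sse(stream: str) -> Iterator[Tuple[str, Dict]]:
    """Parse a simplified subset of SSE into (event name, payload) pairs."""
    for block in stream.split("\n\n"):
        event, data = "message", ""   # "message" is the default event type
        for line in block.splitlines():
            field, _, value = line.partition(":")
            # A single leading space after the colon is not part of the value.
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data = value
        if data:
            yield event, json.loads(data)
```

Sending an explicit `done` event, as in the sample, lets the client distinguish a finished answer from a dropped connection.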
SSE vs WebSockets
| Protocol | Best For |
|---|---|
| SSE | One-way server-to-client token streaming |
| WebSocket | Bidirectional realtime interaction |
| Plain HTTP | Non-interactive background tasks |
SSE is usually simpler for chat and assistant responses.
Optimizing TTFT
Time to first token depends heavily on input size, provider queueing, routing, and model choice.
| Technique | Effect |
|---|---|
| Reduce prompt tokens | Faster prefill |
| Cache stable prefix | Reuse prefill work across requests |
| Use smaller model | Faster first token |
| Avoid unnecessary tools before response | Less pre-generation work |
| Stream immediately | Better perceived latency |
| Keep retrieval top-k small | Less context assembly overhead |
Large context windows are not free: Sending 100K tokens because the model supports it can create high latency, high cost, and weaker attention to the important parts.
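One way to keep prompts small is to cap retrieved context by both count and token budget. This is a minimal sketch under stated assumptions: `estimate_tokens` is a crude stand-in (about four characters per token) for the model's real tokenizer, and `select_context` with its parameters is a hypothetical helper, not a library API.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def select_context(chunks: List[Tuple[float, str]],
                   top_k: int,
                   token_budget: int) -> List[str]:
    """Keep the highest-scoring chunks that fit within top_k and token_budget."""
    selected: List[str] = []
    used = 0
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)[:top_k]
    for _score, text in ranked:
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            continue  # skip a chunk that would overflow the budget
        selected.append(text)
        used += cost
    return selected
```

Skipping an oversized chunk instead of stopping lets a smaller, lower-ranked chunk still use the remaining budget. Whether that is desirable depends on how much the ranking is trusted.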
KV Cache
During generation, a transformer stores the attention key and value tensors it has already computed for earlier tokens and reuses them at each decode step instead of recomputing them. This store is called the KV cache.
The KV cache improves generation speed, but it consumes memory. Longer contexts and more concurrent requests require more memory.
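The memory cost can be estimated with simple arithmetic: one key tensor and one value tensor per layer, per token, per sequence. The function below is a back-of-the-envelope sketch. The model shape in the example (32 layers, 32 KV heads, head dimension 128, 2-byte values) is an illustrative assumption, and real servers add overhead for paging and fragmentation.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   seq_len: int,
                   batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for `batch_size` sequences of `seq_len` tokens."""
    # The leading 2 accounts for storing both keys and values in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Illustrative model shape (an assumption, not a specific provider's model).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
one_sequence = kv_cache_bytes(32, 32, 128, seq_len=4096)
sixteen_streams = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)
```

With these values each token takes 512 KiB, a 4,096-token sequence needs 2 GiB, and 16 such concurrent sequences need 32 GiB. Models that share KV heads across query heads lower `num_kv_heads` and shrink the cache proportionally.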
Serving Implications
| Concern | Impact |
|---|---|
| Long prompts | More prefill work and KV memory |
| Long outputs | More decode steps |
| Many concurrent streams | More active KV cache |
| Larger models | More GPU memory |
| Batching | Better utilization but may add queueing delay |
Batching
Batching groups multiple inference requests to improve GPU utilization.
| Batch Type | Use Case |
|---|---|
| Static batching | Offline jobs with predictable shapes |
| Dynamic batching | Online serving with variable arrivals |
| Continuous batching | LLM serving where requests join and leave during decoding |
Batching improves throughput but can add queueing delay. Interactive systems need careful latency budgets.
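The difference between static and continuous batching shows up when output lengths vary. The toy simulation below counts decode steps only. It assumes all requests are already queued, every request produces at least one token, and prefill cost is ignored, so it is a sketch of the scheduling idea and not a model of any particular serving engine.

```python
from collections import deque
from typing import List


def static_batch_steps(output_lens: List[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    return sum(max(output_lens[i:i + batch_size])
               for i in range(0, len(output_lens), batch_size))


def continuous_batch_steps(output_lens: List[int], batch_size: int) -> int:
    """A finished request frees its slot immediately for a waiting request."""
    waiting = deque(output_lens)
    active: List[int] = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        # Fill free slots before each decode step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1  # one decode step produces one token per active request
        active = [r - 1 for r in active if r > 1]  # drop requests that just finished
    return steps
```

For output lengths `[2, 8, 2, 8]` and two slots, the static version takes 16 steps because each short request waits for its long neighbor, while the continuous version takes 12 because the second short request starts as soon as the first one finishes.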
Speculative Decoding
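Speculative decoding uses a smaller draft model to propose several tokens ahead and the larger target model to verify them together, keeping the proposals it agrees with.
It can reduce generation latency when draft tokens are often accepted, because each verification pass of the large model can then yield several tokens instead of one. It is mostly a serving-layer optimization and may not be exposed by every provider.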
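The propose-and-verify loop can be sketched with toy models. This is the greedy variant only: a draft token is accepted when it equals the target model's own greedy choice. Sampling-based serving uses a probabilistic acceptance rule instead. `draft_model` and `target_model` are made-up functions over integer tokens, and a real server verifies all proposed positions in one batched forward pass instead of a Python loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_model(seq: List[int]) -> int:
    """Toy 'large' model: always counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_model(seq: List[int]) -> int:
    """Toy 'small' model: agrees with the target except after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     k: int) -> Tuple[List[int], int]:
    """One round of greedy speculative decoding.

    Returns (tokens to append, number of draft tokens accepted).
    """
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))
    # 2. The target model checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            return accepted + [expected], len(accepted)
        accepted.append(tok)
    # 3. Every draft token matched, so the target also supplies one extra token.
    return accepted + [target(prefix + accepted)], len(accepted)
```

The output is identical to what the target model would produce on its own, so the gain is speed, not quality. Each round emits between 1 and `k + 1` tokens depending on how many drafts are accepted.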
Product UX Patterns
| Pattern | Why It Helps |
|---|---|
| Stream answer text | User sees progress quickly |
| Show retrieval progress | Makes waiting understandable |
| Render citations as they arrive | Builds trust |
| Allow cancel | Saves cost and user time |
| Generate concise by default | Lower latency and cost |
| Continue button | Avoids long automatic generations |
Latency optimization is not only backend work. UX can reduce perceived waiting and unnecessary generation.
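The "allow cancel" pattern only saves cost if cancellation actually stops the upstream generation. Below is a minimal asyncio sketch of that idea. `fake_token_stream` is a stand-in for a provider stream, and the timing values are arbitrary assumptions chosen for the demo.

```python
import asyncio
from typing import AsyncGenerator, List


async def fake_token_stream(n: int, delay_s: float = 0.01) -> AsyncGenerator[str, None]:
    """Stand-in for a provider stream; a real one would wrap an HTTP response."""
    for i in range(n):
        await asyncio.sleep(delay_s)
        yield f"tok{i} "


async def stream_until_cancelled(cancel: asyncio.Event, max_tokens: int,
                                 out: List[str]) -> None:
    stream = fake_token_stream(max_tokens)
    try:
        async for token in stream:
            if cancel.is_set():
                break  # stop pulling tokens once the user cancels
            out.append(token)
    finally:
        await stream.aclose()  # close the upstream generator promptly


async def demo() -> List[str]:
    cancel = asyncio.Event()
    out: List[str] = []
    task = asyncio.create_task(stream_until_cancelled(cancel, 1000, out))
    await asyncio.sleep(0.05)  # the user clicks "stop" after about 50 ms
    cancel.set()
    await task
    return out
```

Running `asyncio.run(demo())` returns only the handful of tokens produced before the cancel signal, far fewer than the 1,000 requested. With a real provider, closing the stream should also close the HTTP connection so the provider can stop generating.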
What to Remember for Interviews
- Measure TTFT, inter-token latency, and total latency separately.
- SSE is the default simple streaming choice for web apps.
- Context size drives prefill cost: retrieve and assemble carefully.
- KV cache is a memory constraint: concurrency and context length matter.
- Batching improves throughput but may add queueing latency.
Practice: Design a streaming chat API for a RAG assistant. Include retrieval timing, SSE events, cancellation, provider timeout, and metrics for TTFT and total latency.