LLM Observability and Evaluation: Traces, Quality Metrics, and Experiments
Build observability and evaluation for LLM systems, including prompt traces, cost tracking, model versions, RAG metrics, LLM-as-judge, A/B tests, and regression datasets.
Why LLM Observability Is Different
Traditional observability tracks latency, errors, and saturation. LLM systems also need to track quality, cost, prompts, context, model versions, retrieved evidence, and user feedback.
Key idea: An LLM call can be technically successful and still produce a bad answer. Observability must include quality signals.
What to Trace
| Span | Useful Fields |
|---|---|
| Request | tenant, feature, user type, request ID |
| Retrieval | query, filters, top-k, document IDs, scores |
| Reranking | candidate count, selected chunks |
| Prompt | prompt version, token count, context IDs |
| LLM | model, provider, latency, tokens, cost |
| Validation | schema result, safety result, retries |
| Response | citations, confidence, user feedback |
When privacy policy requires it, redact sensitive content before storing a trace, or store only a sample of traces instead of all of them.
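The span table above can be sketched as a small in-memory trace recorder. This is a minimal illustration, not a real tracing SDK: `Span`, `record_span`, and the email-only `redact` rule are hypothetical names and assumptions, and a production system would more likely send spans to a tracing backend.

```python
import re
import time
import uuid
from dataclasses import dataclass, field
from typing import Any

# Illustrative redaction rule only: real policies usually cover more than emails.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace email addresses with a placeholder."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


@dataclass
class Span:
    name: str                      # e.g. "retrieval", "llm", "validation"
    request_id: str                # ties all spans of one request together
    fields: dict[str, Any] = field(default_factory=dict)
    started_at: float = field(default_factory=time.time)


def record_span(trace: list[Span], name: str, request_id: str, **fields: Any) -> Span:
    """Append a span to the trace, redacting string fields before storing them."""
    clean = {k: redact(v) if isinstance(v, str) else v for k, v in fields.items()}
    span = Span(name=name, request_id=request_id, fields=clean)
    trace.append(span)
    return span


trace: list[Span] = []
request_id = str(uuid.uuid4())
record_span(trace, "retrieval", request_id,
            query="reset password for ana@example.com",
            top_k=5, document_ids=["doc-12", "doc-40"])
record_span(trace, "llm", request_id,
            model="example-model", input_tokens=812, output_tokens=96)
```

Sharing one `request_id` across spans is what lets a bad answer be followed back through retrieval, prompt, and model call.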
Metrics
Reliability and Performance
| Metric | Why It Matters |
|---|---|
| Request rate | Capacity and adoption |
| Error rate | Provider and application reliability |
| TTFT | Interactive user experience |
| Total latency | End-to-end performance |
| Retry rate | Provider instability or validation issues |
| Timeout rate | SLO risk |
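TTFT and total latency can both be measured from a single streaming response. A minimal sketch, assuming the provider client exposes the response as an iterator of text chunks; `fake_stream` is a hypothetical stand-in for that iterator:

```python
import time
from typing import Iterable, Iterator


def measure_stream(stream: Iterable[str]) -> dict:
    """Consume a chunk stream and record time to first token and total latency."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            # First chunk arrived: this is the time to first token.
            ttft = time.perf_counter() - start
        chunks.append(chunk)
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "text": "".join(chunks)}


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    for token in ["Hel", "lo", "!"]:
        time.sleep(0.01)  # simulate generation delay
        yield token


timing = measure_stream(fake_stream())
```

If the stream yields nothing, `ttft_s` stays `None`, which is worth counting as its own failure case rather than as a latency of zero.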
Cost
| Metric | Why It Matters |
|---|---|
| Input tokens | Context and prompt cost |
| Output tokens | Generation cost |
| Cost per request | Unit economics |
| Cost per successful task | Business value |
| Spend by tenant or feature | Budget ownership |
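The cost metrics above reduce to simple arithmetic over token counts. The sketch below assumes a per-1,000-token price table; the prices and the model name are made up for illustration and do not reflect any real provider.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical prices in USD per 1,000 tokens. Real prices vary by provider and model.
PRICE_PER_1K = {"example-model": {"input": 0.50, "output": 1.50}}


@dataclass
class Call:
    model: str
    input_tokens: int
    output_tokens: int
    task_succeeded: bool


def call_cost(call: Call) -> float:
    """Cost of one LLM call from its input and output token counts."""
    price = PRICE_PER_1K[call.model]
    return (call.input_tokens / 1000) * price["input"] + \
           (call.output_tokens / 1000) * price["output"]


def cost_per_successful_task(calls: list[Call]) -> Optional[float]:
    """Total spend divided by successful tasks; None when nothing succeeded."""
    total = sum(call_cost(c) for c in calls)
    successes = sum(1 for c in calls if c.task_succeeded)
    return total / successes if successes else None


calls = [
    Call("example-model", 2000, 1000, task_succeeded=True),
    Call("example-model", 2000, 1000, task_succeeded=False),
]
print(call_cost(calls[0]))              # 2.5
print(cost_per_successful_task(calls))  # 5.0
```

Failed calls still cost money, which is why cost per successful task (5.0 here) can be well above cost per request (2.5).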
Quality
| Metric | Meaning |
|---|---|
| Answer relevance | Does the answer address the question? |
| Faithfulness | Is the answer supported by context? |
| Context precision | Were retrieved chunks useful? |
| Context recall | Did retrieval include needed evidence? |
| Citation accuracy | Do citations support claims? |
| Task success | Did the user get the job done? |
RAG Evaluation
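A RAG pipeline has two stages to evaluate separately: retrieval and generation. A bad answer can come from missing evidence, or from a model that misuses good evidence, and the fix is different in each case.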
Golden Dataset
Build a dataset with:
- Real user questions.
- Expected relevant documents or chunks.
- Acceptable answer criteria.
- Unsafe or out-of-scope examples.
- Tenant and permission cases.
- Regression cases from incidents.
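Once each golden case lists its expected documents, retrieval can be scored without calling a model at all. The sketch below uses a simplified set-based precision and recall; `GoldenCase` and `retrieval_scores` are hypothetical names, and evaluation frameworks often define context precision in a rank-aware way that differs from this.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    question: str
    expected_doc_ids: set[str]   # documents a correct answer needs
    answer_criteria: str         # what an acceptable answer must contain


def retrieval_scores(retrieved: list[str], expected: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of retrieved document IDs against the expected set."""
    retrieved_set = set(retrieved)
    hits = retrieved_set & expected
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(expected) if expected else 0.0
    return precision, recall


case = GoldenCase(
    question="How do I rotate an API key?",
    expected_doc_ids={"a", "d", "e"},
    answer_criteria="Mentions the key settings page and the old key's grace period.",
)
# Two of four retrieved documents are relevant; one expected document is missing.
precision, recall = retrieval_scores(["a", "b", "c", "d"], case.expected_doc_ids)
```

Low recall with high precision points at retrieval missing evidence; the reverse points at noisy context that the generator has to filter.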
LLM-as-Judge
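LLM-as-judge uses another model to score outputs. It scales far beyond what human review can cover, but judge scores carry their own bias and noise, so they need to be checked against human labels.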
| Benefit | Risk |
|---|---|
| Scales evaluation | Judge bias |
| Flexible criteria | Inconsistent scoring |
| Good for comparisons | Can miss domain-specific correctness |
| Fast iteration | Needs human calibration |
Use human-labeled examples to calibrate judge prompts and thresholds.
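One way to quantify how well a judge matches human labels is Cohen's kappa, which corrects raw agreement for agreement expected by chance. This is a sketch under the assumption that both the humans and the judge emit one categorical label per example (here 1 = acceptable, 0 = not acceptable); the label lists are invented for illustration.

```python
def cohen_kappa(human: list, judge: list) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Fraction of examples where judge and human agree.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected if both labeled independently at their own base rates.
    labels = set(human) | set(judge)
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human_labels = [1, 1, 1, 0, 0, 0, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 0]
print(cohen_kappa(human_labels, judge_labels))  # 0.5
```

Here raw agreement is 75%, but kappa is only 0.5 because half of that agreement would be expected by chance. What counts as good enough is a judgment call for the team; the point is to measure it before trusting judge scores in a gate.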
Experiments and A/B Tests
LLM systems change often: prompts, models, retrieval parameters, rerankers, tools, and guardrails.
| Experiment | Example |
|---|---|
| Prompt version | Short vs detailed system prompt |
| Model | Fast model vs high-quality model |
| Retrieval | top-5 vs top-10 |
| Reranking | with vs without reranker |
| Guardrail | strict vs relaxed citation requirement |
| UX | streaming vs non-streaming |
Measure cost and latency alongside quality. A higher-quality variant may not be worth a 3x cost increase for every task.
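A minimal sketch of the mechanics: assign each user to a variant deterministically, then report quality, cost, and latency side by side per variant. The function names, the result-row fields, and the 0-to-1 quality score are assumptions for illustration, and the sketch does no significance testing.

```python
import hashlib
from statistics import mean


def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Hash user and experiment so the same user always sees the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def summarize(results: list[dict]) -> dict:
    """Average quality, cost, and latency per variant.

    Each row is assumed to look like:
    {"variant": str, "quality": float, "cost_usd": float, "latency_ms": float}
    """
    summary = {}
    for variant in sorted({r["variant"] for r in results}):
        rows = [r for r in results if r["variant"] == variant]
        summary[variant] = {
            "n": len(rows),
            "quality": mean(r["quality"] for r in rows),
            "cost_usd": mean(r["cost_usd"] for r in rows),
            "latency_ms": mean(r["latency_ms"] for r in rows),
        }
    return summary


results = [
    {"variant": "control", "quality": 0.5, "cost_usd": 0.01, "latency_ms": 800},
    {"variant": "control", "quality": 1.0, "cost_usd": 0.01, "latency_ms": 1200},
    {"variant": "treatment", "quality": 1.0, "cost_usd": 0.03, "latency_ms": 2000},
]
report = summarize(results)
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user in `treatment` for one test is not systematically in `treatment` for the next.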
Prompt and Model Versioning
| Artifact | Version It Because |
|---|---|
| System prompt | Behavior changes |
| Tool schema | Output and action changes |
| Retrieval config | Evidence changes |
| Model | Quality and cost change |
| Embedding model | Index compatibility changes |
| Guardrail policy | Safety behavior changes |
Every production response should be traceable to the versions that produced it.
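One way to make that traceability concrete is to attach a version stamp to every stored response. The sketch below is an assumption about how this could look, not a standard: `VersionStamp`, `content_version`, and the example version strings are all hypothetical, and deriving a prompt version from a content hash is just one option alongside explicit version numbers.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VersionStamp:
    prompt_version: str
    tool_schema_version: str
    retrieval_config_version: str
    model: str
    embedding_model: str
    guardrail_policy_version: str


def content_version(text: str) -> str:
    """Derive a short version ID from content, so any edit yields a new ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def stamp_response(answer: str, stamp: VersionStamp) -> dict:
    """Bundle an answer with the versions that produced it, ready for logging."""
    return {"answer": answer, "versions": asdict(stamp)}


system_prompt = "You answer using only the provided context and cite sources."
stamp = VersionStamp(
    prompt_version=content_version(system_prompt),
    tool_schema_version="tools-v3",            # hypothetical identifiers
    retrieval_config_version="retrieval-v7",
    model="example-model-2024-01",
    embedding_model="example-embed-v2",
    guardrail_policy_version="policy-v4",
)
record = stamp_response("Rotate the key from the settings page [doc-12].", stamp)
log_line = json.dumps(record)
```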
Alerting
| Alert | Possible Cause |
|---|---|
| Cost spike | Prompt grew, abuse, routing bug |
| Latency spike | Provider issue, retrieval issue, model change |
| Validation failures | Prompt or schema regression |
| No-citation answers | RAG failure |
| Low feedback score | Quality regression |
| Increased refusals | Policy or routing issue |
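Several of these alerts can be expressed as a ratio between a current window and a baseline window. A minimal sketch, assuming metrics have already been aggregated into plain dictionaries; the rule names, thresholds, and metric values are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    metric: str
    max_ratio: Optional[float] = None  # fire when current / baseline rises above this
    min_ratio: Optional[float] = None  # fire when current / baseline falls below this


def evaluate(rules: list[AlertRule], baseline: dict, current: dict) -> list[str]:
    """Return the names of metrics whose change versus baseline breaches a rule."""
    fired = []
    for rule in rules:
        base = baseline.get(rule.metric)
        now = current.get(rule.metric)
        if not base or now is None:
            continue  # no usable baseline or no current value: skip, do not guess
        ratio = now / base
        if rule.max_ratio is not None and ratio > rule.max_ratio:
            fired.append(rule.metric)
        elif rule.min_ratio is not None and ratio < rule.min_ratio:
            fired.append(rule.metric)
    return fired


rules = [
    AlertRule("cost_per_request", max_ratio=1.5),   # cost spike
    AlertRule("feedback_score", min_ratio=0.9),     # quality regression
    AlertRule("refusal_rate", max_ratio=2.0),       # increased refusals
]
baseline = {"cost_per_request": 0.02, "feedback_score": 4.0, "refusal_rate": 0.01}
current = {"cost_per_request": 0.05, "feedback_score": 3.9, "refusal_rate": 0.015}
print(evaluate(rules, baseline, current))  # ['cost_per_request']
```

Ratio rules adapt to each tenant's or feature's normal level, but they need a minimum traffic volume to avoid firing on noise, which this sketch does not handle.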
What to Remember for Interviews
- Trace the whole LLM pipeline: retrieval, prompt, model, validation, and response.
- Track quality, not just uptime: a successful HTTP 200 can still carry a failed answer.
- Evaluate retrieval and generation separately: they fail differently.
- Use LLM-as-judge carefully: calibrate against human labels.
- Version everything: prompts, models, embeddings, retrieval configs, and guardrails.
Practice: Design an evaluation system for a RAG assistant. Include golden datasets, online feedback, offline regression tests, LLM-as-judge, A/B tests, and alerts for cost and quality regressions.