Gen AI Systems

LLM Observability and Evaluation: Traces, Quality Metrics, and Experiments

Build observability and evaluation for LLM systems, including prompt traces, cost tracking, model versions, RAG metrics, LLM-as-judge, A/B tests, and regression datasets.

LLM observability, evaluation, LLM-as-judge, tracing, A/B testing

Why LLM Observability Is Different

Traditional observability tracks latency, errors, and saturation. LLM systems also need to track quality, cost, prompts, context, model versions, retrieved evidence, and user feedback.

Key idea: An LLM call can be technically successful and still produce a bad answer. Observability must include quality signals.


What to Trace

Span | Useful Fields
Request | tenant, feature, user type, request ID
Retrieval | query, filters, top-k, document IDs, scores
Reranking | candidate count, selected chunks
Prompt | prompt version, token count, context IDs
LLM | model, provider, latency, tokens, cost
Validation | schema result, safety result, retries
Response | citations, confidence, user feedback

Redact sensitive content or sample traces when required by privacy policy.
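The span table above can be sketched as a small in-memory trace structure. This is a minimal illustration, not a real tracing SDK: the `Trace` and `Span` classes, the `redact` helper, and the email-only redaction pattern are all assumptions made for the example. A production system would more likely use an existing tracing library and a broader redaction policy.

```python
import re
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional

# Assumed redaction rule for illustration: mask anything that looks like an email.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace email-like substrings before the text is stored in a trace."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


@dataclass
class Span:
    name: str                      # e.g. "request", "retrieval", "llm"
    attributes: dict
    start: float = field(default_factory=time.time)
    end: Optional[float] = None


@dataclass
class Trace:
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def add_span(self, name: str, **attributes: Any) -> Span:
        # Redact string attributes so raw user text never reaches storage.
        clean = {
            key: redact(value) if isinstance(value, str) else value
            for key, value in attributes.items()
        }
        span = Span(name=name, attributes=clean)
        self.spans.append(span)
        return span


# Hypothetical values, only to show the shape of one traced request.
trace = Trace()
trace.add_span("request", tenant="acme", feature="support_chat")
trace.add_span(
    "retrieval",
    query="reset password for jane@example.com",
    top_k=5,
    document_ids=["doc-12", "doc-98"],
)
trace.add_span(
    "llm", model="example-model", input_tokens=812, output_tokens=140, latency_ms=930
)
```

Redacting at the point where attributes are recorded, rather than at query time, means unredacted text is never written to the trace store in the first place.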


Metrics

Reliability and Performance

Metric | Why It Matters
Request rate | Capacity and adoption
Error rate | Provider and application reliability
TTFT (time to first token) | Interactive user experience
Total latency | End-to-end performance
Retry rate | Provider instability or validation issues
Timeout rate | SLO risk
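As a rough sketch of how these metrics could be derived from per-request records, the example below aggregates a small batch. The `RequestRecord` fields, the status labels ("ok", "error", "timeout"), and the nearest-rank percentile are assumptions for illustration; a metrics backend would normally compute these over time windows.

```python
import math
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    ttft_ms: float    # time to first token
    total_ms: float   # end-to-end latency
    retries: int      # number of retries before the final outcome
    status: str       # assumed labels: "ok", "error", or "timeout"


def percentile(values, pct):
    """Nearest-rank percentile: simple, and good enough for a dashboard sketch."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    n = len(records)
    return {
        "error_rate": sum(r.status == "error" for r in records) / n,
        "timeout_rate": sum(r.status == "timeout" for r in records) / n,
        "retry_rate": sum(r.retries > 0 for r in records) / n,
        "ttft_p50_ms": median(r.ttft_ms for r in records),
        "latency_p95_ms": percentile([r.total_ms for r in records], 95),
    }


# Hypothetical records for illustration only.
records = [
    RequestRecord(120, 900, 0, "ok"),
    RequestRecord(150, 1100, 1, "ok"),
    RequestRecord(200, 1500, 0, "ok"),
    RequestRecord(400, 8000, 2, "timeout"),
    RequestRecord(130, 950, 0, "error"),
]
summary = summarize(records)
```

Percentiles are used instead of averages because one slow request, such as the timeout in this batch, would dominate a mean.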

Cost

Metric | Why It Matters
Input tokens | Context and prompt cost
Output tokens | Generation cost
Cost per request | Unit economics
Cost per successful task | Business value
Spend by tenant or feature | Budget ownership
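The cost metrics above reduce to simple arithmetic once token counts are logged per call. The sketch below assumes a hypothetical price table (USD per million tokens) and made-up model names; real rates come from the provider's price list.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices: (input, output) in USD per 1M tokens.
PRICE_PER_MTOK = {
    "example-small": (0.50, 1.50),
    "example-large": (5.00, 15.00),
}


@dataclass
class Call:
    tenant: str
    model: str
    input_tokens: int
    output_tokens: int
    task_succeeded: bool


def call_cost(call):
    in_price, out_price = PRICE_PER_MTOK[call.model]
    return (call.input_tokens * in_price + call.output_tokens * out_price) / 1_000_000


def cost_report(calls):
    total = sum(call_cost(c) for c in calls)
    successes = sum(c.task_succeeded for c in calls)
    by_tenant = defaultdict(float)
    for c in calls:
        by_tenant[c.tenant] += call_cost(c)
    return {
        "cost_per_request": total / len(calls),
        # Failed calls still cost money, so they inflate the per-success figure.
        "cost_per_successful_task": total / successes if successes else float("inf"),
        "spend_by_tenant": dict(by_tenant),
    }


# Hypothetical calls for illustration only.
calls = [
    Call("acme", "example-small", 2000, 1000, True),
    Call("acme", "example-large", 2000, 1000, False),
    Call("globex", "example-small", 4000, 2000, True),
]
report = cost_report(calls)
```

Cost per successful task is always at least cost per request, and the gap between the two is a quick indicator of how much spend goes to answers that did not help.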

Quality

Metric | Meaning
Answer relevance | Does the answer address the question?
Faithfulness | Is the answer supported by context?
Context precision | Were retrieved chunks useful?
Context recall | Did retrieval include needed evidence?
Citation accuracy | Do citations support claims?
Task success | Did the user get the job done?

RAG Evaluation

RAG has two systems to evaluate: retrieval and generation.

Golden Dataset

Build a dataset with:

  1. Real user questions.
  2. Expected relevant documents or chunks.
  3. Acceptable answer criteria.
  4. Unsafe or out-of-scope examples.
  5. Tenant and permission cases.
  6. Regression cases from incidents.
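A golden dataset entry and the retrieval half of the evaluation might look like the sketch below. The `GoldenCase` fields and chunk ID format are hypothetical, and the scores here are plain set-based precision and recall; some evaluation frameworks define context precision in a rank-aware way, which this example does not attempt.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    question: str
    expected_chunk_ids: set      # chunks a correct answer needs
    answer_criteria: list        # what an acceptable answer must contain
    tenant: str = "default"      # for tenant and permission cases
    should_refuse: bool = False  # unsafe or out-of-scope examples
    source: str = "user_log"     # or "incident" for regression cases


def retrieval_scores(retrieved_ids, expected_ids):
    """Return (precision, recall) of retrieved chunks against expected chunks."""
    retrieved = set(retrieved_ids)
    hits = len(retrieved & expected_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expected_ids) if expected_ids else 0.0
    return precision, recall


# Hypothetical case and retrieval result for illustration only.
case = GoldenCase(
    question="How do I rotate an API key?",
    expected_chunk_ids={"kb-7#2", "kb-7#3"},
    answer_criteria=["mentions the settings page", "warns that old keys stop working"],
)
retrieved = ["kb-7#2", "kb-1#0", "kb-9#4", "kb-7#3"]
precision, recall = retrieval_scores(retrieved, case.expected_chunk_ids)
```

In this example all needed evidence was retrieved but half the context was noise, which points at reranking or a smaller top-k rather than at the generator.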

LLM-as-Judge

LLM-as-judge uses another model to score outputs. It makes evaluation scale, but its scores are only as trustworthy as their agreement with human judgment.

Benefit | Risk
Scales evaluation | Judge bias
Flexible criteria | Inconsistent scoring
Good for comparisons | Can miss domain-specific correctness
Fast iteration | Needs human calibration

Use human-labeled examples to calibrate judge prompts and thresholds.
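One simple way to calibrate a threshold is to pick the judge score cutoff that agrees most often with human pass/fail labels. The sketch below assumes the judge returns an integer score from 1 to 5 and that humans labeled the same examples as pass or fail; both the scale and the data are made up for illustration.

```python
def agreement(judge_scores, human_pass, threshold):
    """Fraction of examples where (judge score >= threshold) matches the human label."""
    matches = sum(
        (score >= threshold) == label
        for score, label in zip(judge_scores, human_pass)
    )
    return matches / len(human_pass)


def best_threshold(judge_scores, human_pass, candidates=(1, 2, 3, 4, 5)):
    """Pick the candidate threshold with the highest agreement."""
    return max(candidates, key=lambda t: agreement(judge_scores, human_pass, t))


# Hypothetical calibration set: judge scores and human pass/fail labels.
judge_scores = [5, 4, 2, 3, 1, 4, 2, 5]
human_pass = [True, True, False, False, False, True, True, True]

threshold = best_threshold(judge_scores, human_pass)
score = agreement(judge_scores, human_pass, threshold)
```

The remaining disagreements are worth reading one by one: they often reveal a judge prompt that rewards fluent answers over correct ones, or labels that were themselves ambiguous.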


Experiments and A/B Tests

LLM systems change often: prompts, models, retrieval parameters, rerankers, tools, and guardrails.

Experiment | Example
Prompt version | Short vs detailed system prompt
Model | Fast model vs high-quality model
Retrieval | top-5 vs top-10
Reranking | with vs without reranker
Guardrail | strict citation requirement
UX | streaming vs non-streaming

Measure cost and latency alongside quality. A higher-quality variant may not be worth a 3x cost increase for every task.
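A minimal sketch of the two mechanical pieces of an A/B test follows: stable variant assignment and a side-by-side comparison. Hashing the user ID together with the experiment name is a common bucketing approach, assumed here rather than taken from any specific platform; the metric names and numbers are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant so they see the same one each time."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def compare(control, treatment):
    """Report quality gain next to what it costs in money and latency."""
    return {
        "quality_delta": treatment["quality"] - control["quality"],
        "cost_ratio": treatment["cost_usd"] / control["cost_usd"],
        "latency_ratio": treatment["latency_ms"] / control["latency_ms"],
    }


# Hypothetical aggregate results per variant, for illustration only.
control = {"quality": 0.78, "cost_usd": 0.004, "latency_ms": 900}
treatment = {"quality": 0.81, "cost_usd": 0.012, "latency_ms": 1500}

variant = assign_variant("user-42", "prompt-v2-test")
result = compare(control, treatment)
```

Including the experiment name in the hash keeps assignments independent across experiments, so the same users do not always land in the treatment group. A real analysis would also need a significance test before acting on the difference.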


Prompt and Model Versioning

Artifact | Version It Because
System prompt | Behavior changes
Tool schema | Output and action changes
Retrieval config | Evidence changes
Model | Quality and cost change
Embedding model | Index compatibility changes
Guardrail policy | Safety behavior changes

Every production response should be traceable to the versions that produced it.
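One way to make responses traceable is to attach a version manifest, or a short fingerprint of it, to every logged response. The `VersionManifest` class, its field names, and the version strings below are assumptions for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class VersionManifest:
    system_prompt: str
    tool_schema: str
    retrieval_config: str
    model: str
    embedding_model: str
    guardrail_policy: str

    def fingerprint(self) -> str:
        """Short, stable hash of all versions; changes when any artifact changes."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical version labels for illustration only.
manifest = VersionManifest(
    system_prompt="support-prompt@v14",
    tool_schema="tools@v3",
    retrieval_config="hybrid-top5@v2",
    model="example-model-2024-06",
    embedding_model="example-embed@v1",
    guardrail_policy="policy@v7",
)

# Store the fingerprint (or the full manifest) alongside each response record.
response_record = {
    "request_id": "req-001",
    "answer": "...",
    "versions": manifest.fingerprint(),
}

# A prompt change produces a new manifest and therefore a new fingerprint.
updated = replace(manifest, system_prompt="support-prompt@v15")
```

Grouping quality and cost metrics by fingerprint makes it straightforward to see which deployment introduced a regression.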


Alerting

Alert | Possible Cause
Cost spike | Prompt grew, abuse, routing bug
Latency spike | Provider issue, retrieval issue, model change
Validation failures | Prompt or schema regression
No-citation answers | RAG failure
Low feedback score | Quality regression
Increased refusals | Policy or routing issue
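These alerts can be sketched as rules that compare a current window against a baseline. The `AlertRule` class, metric names, and tolerances below are hypothetical; a real setup would normally express the same idea in the alerting system's own rule language.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    direction: str    # "up" fires on increases, "down" fires on decreases
    tolerance: float  # allowed relative change, e.g. 0.5 means 50%


def evaluate(rules, baseline, current):
    """Return the metrics whose relative change exceeds the rule's tolerance."""
    fired = []
    for rule in rules:
        base, now = baseline[rule.metric], current[rule.metric]
        if base == 0:
            continue  # relative change is undefined; handle separately
        change = (now - base) / base
        if rule.direction == "up" and change > rule.tolerance:
            fired.append(rule.metric)
        elif rule.direction == "down" and change < -rule.tolerance:
            fired.append(rule.metric)
    return fired


# Hypothetical rules and window aggregates for illustration only.
rules = [
    AlertRule("cost_per_request", "up", 0.5),
    AlertRule("latency_p95_ms", "up", 0.3),
    AlertRule("no_citation_rate", "up", 0.5),
    AlertRule("feedback_score", "down", 0.1),
]
baseline = {"cost_per_request": 0.004, "latency_p95_ms": 1200,
            "no_citation_rate": 0.02, "feedback_score": 4.2}
current = {"cost_per_request": 0.009, "latency_p95_ms": 1250,
           "no_citation_rate": 0.021, "feedback_score": 3.4}

fired = evaluate(rules, baseline, current)
```

Quality metrics such as feedback score alert on decreases, while cost, latency, and failure rates alert on increases, which is why each rule carries a direction.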

What to Remember for Interviews

  1. Trace the whole LLM pipeline: retrieval, prompt, model, validation, and response.
  2. Track quality, not just uptime: a successful HTTP 200 can still carry a failed answer.
  3. Evaluate retrieval and generation separately: they fail differently.
  4. Use LLM-as-judge carefully: calibrate against human labels.
  5. Version everything: prompts, models, embeddings, retrieval configs, and guardrails.

Practice: Design an evaluation system for a RAG assistant. Include golden datasets, online feedback, offline regression tests, LLM-as-judge, A/B tests, and alerts for cost and quality regressions.