Gen AI Systems

LLM Gateway and Routing: Model Selection, Fallbacks, and Cost Control

Design the gateway layer between applications and LLM providers, including model routing, provider fallback, rate limiting, semantic routing, observability, and cost tracking.

Tags: LLM gateway, model routing, fallback, cost optimization, rate limiting

Why an LLM Gateway?

As soon as multiple product features call LLMs, direct provider calls from each service become messy. You need consistent authentication, model selection, retries, rate limits, logging, cost controls, and safety checks.

An LLM gateway centralizes those concerns.

Key idea: The gateway is not just a proxy. It is a policy, routing, observability, and control plane for model usage.


Gateway Responsibilities

| Responsibility | Why It Matters |
|---|---|
| Model routing | Match request complexity to model capability |
| Provider fallback | Improve availability during outages |
| Rate limiting | Protect budgets and provider quotas |
| Cost tracking | Attribute spend by team, feature, tenant, and model |
| Prompt policy | Enforce safety and formatting rules |
| Observability | Trace prompts, latency, errors, and output quality |
| Secret management | Keep provider keys out of application services |

Request Flow

Gateway Request Contract

```json
{
  "feature": "support-chat",
  "tenantId": "acme",
  "taskType": "rag_answer",
  "latencyClass": "interactive",
  "qualityTier": "high",
  "messages": []
}
```

The application should describe intent and constraints. The gateway decides how to satisfy them.
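To make the contract concrete, here is a minimal validation sketch for the request body above. The allowed values for `latencyClass` and `qualityTier`, and the rule that callers may not send a `model` field, are assumptions for illustration; the source only shows one example payload.

```python
from dataclasses import dataclass

# Assumed vocabularies; the source only shows "interactive" and "high".
LATENCY_CLASSES = {"interactive", "batch"}
QUALITY_TIERS = {"standard", "high"}
REQUIRED = ("feature", "tenantId", "taskType", "latencyClass", "qualityTier", "messages")


@dataclass(frozen=True)
class GatewayRequest:
    feature: str
    tenant_id: str
    task_type: str
    latency_class: str
    quality_tier: str
    messages: tuple


def parse_request(payload: dict) -> GatewayRequest:
    """Validate a request body shaped like the JSON contract above."""
    missing = [key for key in REQUIRED if key not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if "model" in payload:
        # Hypothetical rule: callers state intent, the gateway picks the model.
        raise ValueError("callers must not name a model directly")
    if payload["latencyClass"] not in LATENCY_CLASSES:
        raise ValueError(f"unknown latencyClass: {payload['latencyClass']}")
    if payload["qualityTier"] not in QUALITY_TIERS:
        raise ValueError(f"unknown qualityTier: {payload['qualityTier']}")
    return GatewayRequest(
        feature=payload["feature"],
        tenant_id=payload["tenantId"],
        task_type=payload["taskType"],
        latency_class=payload["latencyClass"],
        quality_tier=payload["qualityTier"],
        messages=tuple(payload["messages"]),
    )
```

Rejecting a caller-supplied model name is one way to keep routing decisions inside the gateway; a real deployment might instead allow an override for debugging.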


Model Routing

Not every task needs the most expensive model. Some requests are classification, extraction, rewriting, or summarization and can use smaller or cheaper models.

| Task | Routing Choice |
|---|---|
| Simple classification | Small fast model |
| JSON extraction | Model with strong structured output |
| Legal or financial reasoning | Higher quality model plus guardrails |
| RAG answer with citations | Strong instruction following |
| Embedding | Embedding-specific model |
| Bulk offline summarization | Cheaper batch path |
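The routing table above can be expressed as a small lookup. This is a sketch only: the task type keys, the route labels, and the upgrade rule for the `high` quality tier are hypothetical names, not identifiers from any provider.

```python
# Hypothetical task types mapped to abstract route labels (not real model names).
ROUTES = {
    "classification": "small-fast",
    "json_extraction": "structured-output",
    "legal_reasoning": "high-quality-guarded",
    "rag_answer": "strong-instruction",
    "embedding": "embedding",
    "bulk_summarization": "batch-cheap",
}
DEFAULT_ROUTE = "strong-instruction"


def choose_route(task_type: str, quality_tier: str = "standard") -> str:
    """Pick a route label for a task; unknown tasks get the default route."""
    route = ROUTES.get(task_type, DEFAULT_ROUTE)
    # Assumed rule: a "high" quality tier never runs on the smallest model.
    if quality_tier == "high" and route == "small-fast":
        return DEFAULT_ROUTE
    return route
```

Keeping the mapping as data rather than branching logic makes it easy to review and to change without redeploying application services.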

Semantic Routing

Semantic routing embeds or classifies the request to choose a model, prompt, or tool chain.

Use semantic routing for intent families, not for hiding business logic that should be explicit.
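As a rough illustration of routing by intent family, the sketch below compares a request against a few example phrases per intent. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the intents, example phrases, and threshold are all invented for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical intent families with a few example phrases each.
INTENT_EXAMPLES = {
    "billing": ["refund my invoice", "update payment card", "billing charge question"],
    "technical": ["app crashes on login", "error when uploading file", "api returns error"],
}


def route_intent(text: str, threshold: float = 0.2) -> str:
    """Return the best-matching intent, or "general" when nothing is close."""
    query = embed(text)
    best_intent, best_score = "general", 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        score = max(cosine(query, embed(example)) for example in examples)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else "general"
```

The explicit `general` fallback keeps low-confidence matches from silently landing on a specialized prompt or tool chain.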


Fallback Chains

Fallback improves availability, but it can change behavior. A secondary model may have different output style, context limits, tool support, or safety behavior.
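One way to limit behavior drift is to filter the fallback chain by the capabilities the request needs before trying any secondary model. The sketch below assumes a hypothetical `ModelProfile` record; the field names and model names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    # Hypothetical capability record maintained by the gateway.
    name: str
    context_tokens: int
    supports_tools: bool
    supports_json_schema: bool


def eligible_fallbacks(chain, prompt_tokens, needs_tools, needs_json_schema):
    """Return names of models in the chain that can serve this request."""
    return [
        model.name
        for model in chain
        if model.context_tokens >= prompt_tokens
        and (model.supports_tools or not needs_tools)
        and (model.supports_json_schema or not needs_json_schema)
    ]
```

If the filtered list is empty, failing the request is usually safer than sending it to a model that will truncate context or ignore the tool definitions.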

Fallback Policy

| Failure | Possible Action |
|---|---|
| Provider 5xx | Retry once, then fall back |
| Timeout | Fall back if the request is interactive |
| Rate limit | Queue, fall back, or reject based on feature |
| Safety refusal | Do not blindly fall back |
| Schema validation failure | Retry with repair prompt, then fail closed |

Do not fall back around safety: If a model refuses or flags unsafe content, sending the same request to another provider can unintentionally bypass policy.
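The fallback policy can be sketched as a pure decision function, which makes it easy to test in isolation. The `Failure` enum, the action strings, and the choices for non-interactive timeouts and rate limits are assumptions; the source leaves those cases to the feature owner.

```python
from enum import Enum


class Failure(Enum):
    PROVIDER_5XX = "provider_5xx"
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate_limit"
    SAFETY_REFUSAL = "safety_refusal"
    SCHEMA_INVALID = "schema_invalid"


def next_action(failure: Failure, attempt: int, interactive: bool = True) -> str:
    """Return "retry", "fallback", "queue", "repair", or "fail".

    `attempt` is the number of tries already made for this failure (0 on the first).
    """
    if failure is Failure.SAFETY_REFUSAL:
        # Never route a refused request to another provider.
        return "fail"
    if failure is Failure.PROVIDER_5XX:
        return "retry" if attempt == 0 else "fallback"
    if failure is Failure.TIMEOUT:
        # Assumption: non-interactive work is re-queued instead of rerouted.
        return "fallback" if interactive else "queue"
    if failure is Failure.RATE_LIMIT:
        # Assumption: interactive traffic falls back, batch traffic waits.
        return "fallback" if interactive else "queue"
    if failure is Failure.SCHEMA_INVALID:
        # One repair attempt, then fail closed.
        return "repair" if attempt == 0 else "fail"
    return "fail"
```

Note that the safety branch comes first so that no later rule can accidentally override it.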


Rate Limiting and Budgets

LLM systems need rate limits by tenant, user, feature, model, and provider.

| Limit Type | Protects |
|---|---|
| Requests per minute | Provider quota and service stability |
| Tokens per minute | Cost and throughput |
| Daily budget | Finance controls |
| Concurrent streams | Connection capacity |
| Feature-level quota | Product fairness |

Budget enforcement should happen before expensive retrieval and generation when possible.
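A tokens-per-minute limit can be sketched as a token bucket measured in LLM tokens instead of requests. The class name and the injectable `clock` parameter are choices made for this example; `estimated_tokens` is assumed to come from a tokenizer or a length heuristic applied before retrieval and generation start.

```python
import time


class TokenBudget:
    """Token-bucket limiter measured in LLM tokens rather than requests."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.available = float(tokens_per_minute)
        self.refill_per_sec = tokens_per_minute / 60.0
        self.clock = clock
        self.last = clock()

    def try_consume(self, estimated_tokens: int) -> bool:
        """Reserve tokens if the budget allows it; return False otherwise."""
        now = self.clock()
        elapsed = now - self.last
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.available = min(self.capacity, self.available + elapsed * self.refill_per_sec)
        self.last = now
        if estimated_tokens > self.available:
            return False
        self.available -= estimated_tokens
        return True
```

In a gateway, one bucket per (tenant, feature) pair held in a dictionary would be a simple starting point; a multi-instance deployment would need shared state, which this sketch does not cover.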


Cost Tracking

Track cost at the dimensions that map to decisions.

| Dimension | Example |
|---|---|
| Model | Which model drives spend? |
| Feature | Support chat vs report generator |
| Tenant | Enterprise customer usage |
| User | Abuse and fairness |
| Prompt version | Cost impact of prompt changes |
| Retrieval strategy | RAG context size cost |
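Attribution along these dimensions amounts to grouping per-request cost records. The sketch below assumes each record is a dictionary with a `cost_usd` field plus one key per dimension; those field names are hypothetical.

```python
from collections import defaultdict


def spend_by(records, dimension):
    """Sum cost_usd over records, grouped by the given dimension key."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["cost_usd"]
    return dict(totals)


# Hypothetical usage records emitted by the gateway, one per request.
records = [
    {"feature": "support-chat", "tenant": "acme", "cost_usd": 0.5},
    {"feature": "report-generator", "tenant": "acme", "cost_usd": 0.25},
    {"feature": "support-chat", "tenant": "globex", "cost_usd": 0.25},
]
by_feature = spend_by(records, "feature")
by_tenant = spend_by(records, "tenant")
```

Because the dimension is just a key, adding prompt version or retrieval strategy to the records makes them available for the same grouping without new code.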

Cost Formula

```txt
request_cost =
  input_tokens * input_price_per_token
  + output_tokens * output_price_per_token
  + retrieval_cost
  + reranking_cost
```

Cost dashboards should show both absolute spend and value metrics like successful resolutions.
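The formula translates directly into a function, and pairing it with a value metric gives cost per successful resolution. The prices and counts in the example are invented for illustration and do not reflect any provider's actual pricing.

```python
import math


def request_cost(input_tokens, output_tokens,
                 input_price_per_token, output_price_per_token,
                 retrieval_cost=0.0, reranking_cost=0.0):
    """Cost of one request, following the formula above."""
    return (
        input_tokens * input_price_per_token
        + output_tokens * output_price_per_token
        + retrieval_cost
        + reranking_cost
    )


def cost_per_resolution(total_cost, resolved_count):
    """Spend divided by successful outcomes; infinite when nothing was resolved."""
    if resolved_count == 0:
        return math.inf
    return total_cost / resolved_count


# Hypothetical prices: 3 USD per million input tokens, 15 USD per million output tokens.
example = request_cost(
    input_tokens=2000,
    output_tokens=500,
    input_price_per_token=3 / 1_000_000,
    output_price_per_token=15 / 1_000_000,
    retrieval_cost=0.0005,
    reranking_cost=0.001,
)
```

With these assumed numbers the request costs about 0.015 USD: 0.006 for input, 0.0075 for output, and 0.0015 for retrieval and reranking.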


Gateway State and Data

| Data | Retention Guidance |
|---|---|
| Raw prompts and responses | Redact or sample if sensitive |
| Token counts | Keep for cost and capacity planning |
| Model decisions | Keep for debugging |
| Provider errors | Keep for reliability analysis |
| Safety classifications | Keep according to policy |

Privacy matters: Gateway logs can contain sensitive user data. Redaction, retention, access control, and audit logs are part of the design.
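As a small illustration of redaction before logging, the sketch below masks email addresses and long digit runs and only keeps prompt text when explicitly asked. The two regular expressions are deliberately rough and the record fields are hypothetical; production redaction typically needs a dedicated PII detection step.

```python
import re

# Rough patterns for illustration only; they will miss many kinds of sensitive data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{12,19}\b")  # card-number-like digit runs


def redact(text: str) -> str:
    """Mask emails and long digit runs before the text reaches a log sink."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def build_log_record(prompt: str, input_tokens: int, output_tokens: int,
                     keep_text: bool = False) -> dict:
    """Always keep token counts; keep (redacted) prompt text only when requested."""
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "prompt": redact(prompt) if keep_text else None,
    }
```

Defaulting `keep_text` to `False` means token counts are always available for cost and capacity planning, while raw text is stored only for sampled or explicitly approved traffic.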


What to Remember for Interviews

  1. Centralize model access: A gateway gives consistency across teams and features.
  2. Route by task: Use cheaper models for simple tasks and stronger models for hard tasks.
  3. Fall back carefully: Availability fallback must not bypass safety or break schemas.
  4. Rate-limit tokens, not just requests: Token volume drives cost and throughput.
  5. Instrument cost and quality together: Cheap but bad answers are not an optimization.

Practice: Design an LLM gateway for three product teams. Include model routing, tenant budgets, streaming, fallback policy, logging, and schema validation.