LLM Gateway and Routing: Model Selection, Fallbacks, and Cost Control
Design the gateway layer between applications and LLM providers, including model routing, provider fallback, rate limiting, semantic routing, observability, and cost tracking.
Why an LLM Gateway?
Once several product features call LLMs, direct provider calls from each service become hard to manage. Every service ends up reimplementing authentication, model selection, retries, rate limits, logging, cost controls, and safety checks, and those implementations drift apart.
An LLM gateway centralizes those concerns.
Key idea: The gateway is not just a proxy. It is a policy, routing, observability, and control plane for model usage.
Gateway Responsibilities
| Responsibility | Why It Matters |
|---|---|
| Model routing | Match request complexity to model capability |
| Provider fallback | Improve availability during outages |
| Rate limiting | Protect budgets and provider quotas |
| Cost tracking | Attribute spend by team, feature, tenant, and model |
| Prompt policy | Enforce safety and formatting rules |
| Observability | Trace prompts, latency, errors, and output quality |
| Secret management | Keep provider keys out of application services |
Request Flow
Gateway Request Contract
{
  "feature": "support-chat",
  "tenantId": "acme",
  "taskType": "rag_answer",
  "latencyClass": "interactive",
  "qualityTier": "high",
  "messages": []
}
The application should describe intent and constraints. The gateway decides how to satisfy them.
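A minimal sketch of how a gateway might validate this contract before routing. The allowed values for `latencyClass` and `qualityTier` are assumptions for illustration (the example above only shows `interactive` and `high`), and `GatewayRequest` and `parse_request` are hypothetical names.

```python
from dataclasses import dataclass, field

# Assumed vocabularies; the contract above only shows "interactive" and "high".
LATENCY_CLASSES = {"interactive", "batch"}
QUALITY_TIERS = {"low", "standard", "high"}

REQUIRED_FIELDS = ["feature", "tenantId", "taskType", "latencyClass", "qualityTier"]


@dataclass(frozen=True)
class GatewayRequest:
    """Intent and constraints supplied by the calling application."""
    feature: str
    tenant_id: str
    task_type: str
    latency_class: str
    quality_tier: str
    messages: list = field(default_factory=list)


def parse_request(payload: dict) -> GatewayRequest:
    """Reject malformed payloads early, before any routing or provider call."""
    missing = [key for key in REQUIRED_FIELDS if not payload.get(key)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    if payload["latencyClass"] not in LATENCY_CLASSES:
        raise ValueError(f"unknown latencyClass: {payload['latencyClass']}")
    if payload["qualityTier"] not in QUALITY_TIERS:
        raise ValueError(f"unknown qualityTier: {payload['qualityTier']}")
    return GatewayRequest(
        feature=payload["feature"],
        tenant_id=payload["tenantId"],
        task_type=payload["taskType"],
        latency_class=payload["latencyClass"],
        quality_tier=payload["qualityTier"],
        messages=list(payload.get("messages", [])),
    )
```

Note that the request names no model or provider. That keeps routing decisions inside the gateway, where they can change without touching application code.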
Model Routing
Not every task needs the most expensive model. Some requests are classification, extraction, rewriting, or summarization and can use smaller or cheaper models.
| Task | Routing Choice |
|---|---|
| Simple classification | Small fast model |
| JSON extraction | Model with strong structured output |
| Legal or financial reasoning | Higher quality model plus guardrails |
| RAG answer with citations | Strong instruction following |
| Embedding | Embedding-specific model |
| Bulk offline summarization | Cheaper batch path |
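The routing table above could be expressed as a simple lookup keyed by task type. This is a sketch only: the task-type keys and model names are placeholders, not real provider model identifiers, and defaulting unknown tasks to the stronger model is one possible design choice.

```python
# Placeholder model names; substitute the models your providers actually offer.
ROUTES = {
    "classification": "small-fast-model",
    "json_extraction": "structured-output-model",
    "legal_reasoning": "high-quality-model",
    "rag_answer": "instruction-following-model",
    "embedding": "embedding-model",
    "bulk_summarization": "batch-model",
}

# Assumption: unknown task types go to the stronger model rather than failing.
DEFAULT_MODEL = "high-quality-model"


def select_model(task_type: str) -> str:
    """Return the configured model for a task type, or the default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)
```

Keeping the table as data rather than branching logic makes routing changes reviewable and easy to log alongside each request.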
Semantic Routing
Semantic routing embeds or classifies the request to choose a model, prompt, or tool chain.
Use semantic routing for intent families, not for hiding business logic that should be explicit.
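To illustrate the idea, here is a toy semantic router. It uses bag-of-words vectors and cosine similarity as a stand-in for a real embedding model, which is an assumption made to keep the example self-contained. The intent names, example utterances, and threshold are hypothetical.

```python
import math
from collections import Counter

# Hypothetical intent families with a few example utterances each.
INTENT_EXAMPLES = {
    "billing": ["refund my invoice", "charge on my credit card", "update payment method"],
    "technical": ["app crashes on login", "error when uploading file", "api returns timeout"],
}


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_intent(text: str, threshold: float = 0.2) -> str:
    """Pick the closest intent family, or 'general' if nothing is close enough."""
    query = embed(text)
    best_intent, best_score = "general", 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        score = max(cosine(query, embed(example)) for example in examples)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else "general"
```

The threshold matters: a request that matches no intent family well should take an explicit default path instead of being forced into the nearest bucket.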
Fallback Chains
Fallback improves availability, but it can change behavior. A secondary model may have different output style, context limits, tool support, or safety behavior.
Fallback Policy
| Failure | Possible Action |
|---|---|
| Provider 5xx | Retry once, then fall back |
| Timeout | Fall back if the request is interactive |
| Rate limit | Queue, fall back, or reject based on feature |
| Safety refusal | Do not blindly fall back |
| Schema validation failure | Retry with repair prompt, then fail closed |
Do not fall back around safety: If a model refuses or flags unsafe content, sending the same request to another provider can bypass policy unintentionally.
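A sketch of a fallback chain that follows this policy for two of the failure types: transient provider errors are retried and then passed to the next provider, while safety refusals stop the chain. The exception classes and the convention that providers are plain callables are assumptions for illustration, not any provider SDK's API.

```python
class ProviderError(Exception):
    """Hypothetical transient failure, such as a 5xx response or a timeout."""


class SafetyRefusal(Exception):
    """Hypothetical signal that a model refused or flagged the request."""


def call_with_fallback(providers, request, retries_per_provider=1):
    """Try each provider in order; retry transient errors, never bypass refusals."""
    last_error = None
    for provider in providers:
        for _attempt in range(retries_per_provider + 1):
            try:
                return provider(request)
            except SafetyRefusal:
                # Sending the request elsewhere could bypass policy, so stop here.
                raise
            except ProviderError as exc:
                last_error = exc  # retry, then move on to the next provider
    raise RuntimeError("all providers failed") from last_error
```

Because a secondary model may differ in output style or tool support, a production version would likely also record which provider served each response.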
Rate Limiting and Budgets
LLM systems need rate limits by tenant, user, feature, model, and provider.
| Limit Type | Protects |
|---|---|
| Requests per minute | Provider quota and service stability |
| Tokens per minute | Cost and throughput |
| Daily budget | Finance controls |
| Concurrent streams | Connection capacity |
| Feature-level quota | Product fairness |
Budget enforcement should happen before expensive retrieval and generation when possible.
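One way to enforce a tokens-per-minute limit is a token bucket that is charged with the estimated token count before any retrieval or generation runs. This is a sketch under simplifying assumptions: a single process, one bucket per tenant or feature, and an estimate supplied by the caller. `TokenBudget` is a hypothetical name.

```python
import time


class TokenBudget:
    """Token-bucket limiter measured in LLM tokens rather than requests."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self.available = self.capacity
        self.clock = clock  # injectable so tests can control time
        self.last_refill = clock()

    def try_consume(self, estimated_tokens: int) -> bool:
        """Refill based on elapsed time, then admit the request if it fits."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.available = min(
            self.capacity, self.available + elapsed * self.refill_per_second
        )
        self.last_refill = now
        if estimated_tokens > self.available:
            return False
        self.available -= estimated_tokens
        return True
```

A rejected request costs almost nothing here, which is the point of checking the budget first. A multi-instance gateway would need shared state for the counters, which this sketch leaves out.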
Cost Tracking
Track cost at the dimensions that map to decisions.
| Dimension | Example |
|---|---|
| Model | Which model drives spend? |
| Feature | Support chat vs report generator |
| Tenant | Enterprise customer usage |
| User | Abuse and fairness |
| Prompt version | Cost impact of prompt changes |
| Retrieval strategy | RAG context size cost |
Cost Formula
request_cost =
    input_tokens * input_price_per_token
  + output_tokens * output_price_per_token
  + retrieval_cost
  + reranking_cost
Cost dashboards should show both absolute spend and value metrics like successful resolutions.
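The cost formula translates directly into a small function. The prices in the usage example are made-up numbers chosen for easy arithmetic, not any provider's actual rates, and `ModelPrice` is a hypothetical type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Per-token prices for one model, in a single currency."""
    input_price_per_token: float
    output_price_per_token: float


def request_cost(
    input_tokens: int,
    output_tokens: int,
    price: ModelPrice,
    retrieval_cost: float = 0.0,
    reranking_cost: float = 0.0,
) -> float:
    """Sum generation cost plus any retrieval and reranking cost."""
    return (
        input_tokens * price.input_price_per_token
        + output_tokens * price.output_price_per_token
        + retrieval_cost
        + reranking_cost
    )


# Illustrative prices only.
example_price = ModelPrice(input_price_per_token=0.000002, output_price_per_token=0.000006)
example_cost = request_cost(1000, 500, example_price, retrieval_cost=0.001)
```

Recording this value per request, tagged with the dimensions in the table above, is what makes later aggregation by model, feature, or tenant possible.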
Gateway State and Data
| Data | Retention Guidance |
|---|---|
| Raw prompts and responses | Redact or sample if sensitive |
| Token counts | Keep for cost and capacity planning |
| Model decisions | Keep for debugging |
| Provider errors | Keep for reliability analysis |
| Safety classifications | Keep according to policy |
Privacy matters: Gateway logs can contain sensitive user data. Redaction, retention, access control, and audit logs are part of the design.
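As a small illustration of redaction before logging, the sketch below masks email addresses and long digit runs. The two patterns are assumptions chosen for the example and fall well short of full PII detection; real deployments typically need broader, policy-driven rules.

```python
import re

# Illustrative patterns only; they do not cover all sensitive data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{9,}\b")  # e.g. card or account numbers


def redact(text: str) -> str:
    """Mask emails and long digit sequences before the text is logged."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)
```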
What to Remember for Interviews
- Centralize model access: A gateway gives consistency across teams and features.
- Route by task: Use cheaper models for simple tasks and stronger models for hard tasks.
- Fall back carefully: Availability fallback must not bypass safety or break schemas.
- Rate-limit tokens, not just requests: Token volume drives cost and throughput.
- Instrument cost and quality together: Cheap but bad answers are not an optimization.
Practice: Design an LLM gateway for three product teams. Include model routing, tenant budgets, streaming, fallback policy, logging, and schema validation.