LLM Gateway and Routing: Model Selection, Fallbacks, and Cost Control

Design the gateway layer between applications and LLM providers, including model routing, provider fallback, rate limiting, semantic routing, observability, and cost tracking.

LLM gateway, model routing, fallback, cost optimization, rate limiting

Why an LLM Gateway?

As soon as multiple product features call LLMs, direct provider calls from each service become messy. You need consistent authentication, model selection, retries, rate limits, logging, cost controls, and safety checks.

An LLM gateway centralizes those concerns.

✅

Key idea: The gateway is not just a proxy. It is a policy, routing, observability, and control plane for model usage.

Gateway Responsibilities

Responsibility	Why It Matters
Model routing	Match request complexity to model capability
Provider fallback	Improve availability during outages
Rate limiting	Protect budgets and provider quotas
Cost tracking	Attribute spend by team, feature, tenant, and model
Prompt policy	Enforce safety and formatting rules
Observability	Trace prompts, latency, errors, and output quality
Secret management	Keep provider keys out of application services

Request Flow

Gateway Request Contract

json

{
  "feature": "support-chat",
  "tenantId": "acme",
  "taskType": "rag_answer",
  "latencyClass": "interactive",
  "qualityTier": "high",
  "messages": []
}

The application should describe intent and constraints. The gateway decides how to satisfy them.
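As a rough illustration of enforcing this contract at the gateway edge, the sketch below parses the JSON payload into a typed object and rejects incomplete requests. The class name `GatewayRequest` and the allowed values for `latencyClass` and `qualityTier` are assumptions; the contract above only shows `interactive` and `high`.

```python
from dataclasses import dataclass, field

# Allowed values are assumptions for illustration only.
LATENCY_CLASSES = {"interactive", "batch"}
QUALITY_TIERS = {"low", "standard", "high"}
REQUIRED_FIELDS = ["feature", "tenantId", "taskType", "latencyClass", "qualityTier"]


@dataclass(frozen=True)
class GatewayRequest:
    """Typed view of the request contract (hypothetical class name)."""
    feature: str
    tenant_id: str
    task_type: str
    latency_class: str
    quality_tier: str
    messages: list = field(default_factory=list)

    @classmethod
    def from_json(cls, payload: dict) -> "GatewayRequest":
        """Build a request from the camelCase payload, rejecting bad input."""
        missing = [key for key in REQUIRED_FIELDS if not payload.get(key)]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        if payload["latencyClass"] not in LATENCY_CLASSES:
            raise ValueError(f"unknown latencyClass: {payload['latencyClass']}")
        if payload["qualityTier"] not in QUALITY_TIERS:
            raise ValueError(f"unknown qualityTier: {payload['qualityTier']}")
        return cls(
            feature=payload["feature"],
            tenant_id=payload["tenantId"],
            task_type=payload["taskType"],
            latency_class=payload["latencyClass"],
            quality_tier=payload["qualityTier"],
            messages=list(payload.get("messages", [])),
        )
```

Note that the payload carries no model name: the caller states intent and constraints, and the model choice stays inside the gateway.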

Model Routing

Not every task needs the most expensive model. Some requests are classification, extraction, rewriting, or summarization and can use smaller or cheaper models.

Task	Routing Choice
Simple classification	Small fast model
JSON extraction	Model with strong structured output
Legal or financial reasoning	Higher quality model plus guardrails
RAG answer with citations	Strong instruction following
Embedding	Embedding-specific model
Bulk offline summarization	Cheaper batch path
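The table above could be expressed as a static routing map, sketched below. The task-type keys and model names are hypothetical placeholders rather than real provider identifiers, and sending unknown task types to the stronger model is an assumed default, not a rule from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteDecision:
    model: str
    guardrails: bool = False  # whether extra guardrail checks should run


# Hypothetical model names; substitute whatever your providers offer.
ROUTES = {
    "classification": RouteDecision("small-fast-model"),
    "json_extraction": RouteDecision("structured-output-model"),
    "legal_reasoning": RouteDecision("high-quality-model", guardrails=True),
    "rag_answer": RouteDecision("instruction-following-model"),
    "embedding": RouteDecision("embedding-model"),
    "bulk_summarization": RouteDecision("batch-cheap-model"),
}

# Assumed default: unknown task types get the stronger model.
DEFAULT_ROUTE = RouteDecision("high-quality-model")


def route(task_type: str) -> RouteDecision:
    """Return the routing decision for a task type, or the default route."""
    return ROUTES.get(task_type, DEFAULT_ROUTE)
```

Keeping the map as data rather than scattered conditionals makes routing changes reviewable and easy to log alongside each request.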

Semantic Routing

Semantic routing embeds or classifies the request to choose a model, prompt, or tool chain.

Use semantic routing for intent families, not for hiding business logic that should be explicit.
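A minimal sketch of routing by intent family follows. It uses a toy bag-of-words vector as a stand-in for a real embedding model, and the intent names, example utterances, and threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical intent families with a few example utterances each.
INTENT_EXAMPLES = {
    "billing": ["refund my invoice", "why was my card charged twice"],
    "technical": ["the api returns an error", "my deployment keeps crashing"],
}


def route_intent(text: str, threshold: float = 0.3, default: str = "general") -> str:
    """Pick the intent whose closest example is most similar to the text."""
    query = embed(text)
    best_intent, best_score = default, 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        score = max(cosine(query, embed(example)) for example in examples)
        if score > best_score:
            best_intent, best_score = intent, score
    # Low-confidence matches go to the default route instead of a guess.
    return best_intent if best_score >= threshold else default
```

The chosen intent would then select a model, prompt, or tool chain. Hard business rules, such as tenant entitlements, are better kept as explicit checks than folded into the similarity score.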

Fallback Chains

Fallback improves availability, but it can change behavior. A secondary model may have a different output style, context limit, tool support, or safety behavior.

Fallback Policy

Failure	Possible Action
Provider 5xx	Retry once, then fall back
Timeout	Fall back if the request is interactive
Rate limit	Queue, fall back, or reject based on feature
Safety refusal	Do not blindly fall back
Schema validation failure	Retry with repair prompt, then fail closed

⚠️

Do not fall back around safety: If a model refuses or flags unsafe content, sending the same request to another provider can unintentionally bypass policy.
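One way to encode part of this policy is sketched below: retry once on a 5xx, fall back on timeouts only for interactive requests, and never route around a safety refusal. The exception classes and the `(name, callable)` provider shape are hypothetical, and rate limits are simplified here to an immediate fallback.

```python
class ProviderError(Exception):
    """Hypothetical error carrying a failure kind: '5xx', 'timeout', 'rate_limit'."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


class SafetyRefusal(Exception):
    """Hypothetical error raised when a model refuses or flags the request."""


def call_with_fallback(providers, request, interactive=True):
    """Try (name, callable) providers in order and return (name, response)."""
    last_error = None
    for name, call in providers:
        for attempt in range(2):  # at most one retry per provider
            try:
                return name, call(request)
            except SafetyRefusal:
                raise  # never send a refused request to another provider
            except ProviderError as err:
                last_error = err
                if err.kind == "5xx" and attempt == 0:
                    continue  # retry once on the same provider
                if err.kind == "timeout" and not interactive:
                    raise  # non-interactive work surfaces the timeout instead
                break  # move on to the next provider in the chain
    if last_error is None:
        raise RuntimeError("no providers configured")
    raise last_error
```

Because a secondary model may behave differently, the returned provider name is worth logging with every response so behavior changes can be traced to fallback events.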

Rate Limiting and Budgets

LLM systems need rate limits by tenant, user, feature, model, and provider.

Limit Type	Protects
Requests per minute	Provider quota and service stability
Tokens per minute	Cost and throughput
Daily budget	Finance controls
Concurrent streams	Connection capacity
Feature-level quota	Product fairness

Budget enforcement should happen before expensive retrieval and generation when possible.
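A tokens-per-minute limit can be sketched as a token bucket that is checked before any retrieval or generation work starts. The class and function names below are hypothetical, and the check assumes the gateway can estimate a request's token count up front.

```python
import time


class TokenBudget:
    """Token bucket that refills continuously up to tokens_per_minute."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.available = float(tokens_per_minute)
        self.clock = clock  # injectable so tests can control time
        self.last_refill = clock()

    def try_consume(self, tokens: int) -> bool:
        """Deduct tokens if the budget allows it; otherwise leave it unchanged."""
        now = self.clock()
        elapsed = now - self.last_refill
        refill = elapsed * self.capacity / 60.0
        self.available = min(self.capacity, self.available + refill)
        self.last_refill = now
        if tokens > self.available:
            return False
        self.available -= tokens
        return True


def admit(limits: dict, tenant_id: str, estimated_tokens: int) -> bool:
    """Admission check per tenant; tenants without a configured limit are rejected."""
    bucket = limits.get(tenant_id)
    return bucket is not None and bucket.try_consume(estimated_tokens)
```

The same structure could be keyed by user, feature, model, or provider. A shared gateway would normally keep these counters in a shared store rather than in process memory.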

Cost Tracking

Track cost along the dimensions that map to decisions.

Dimension	Example
Model	Which model drives spend?
Feature	Support chat vs report generator
Tenant	Enterprise customer usage
User	Abuse and fairness
Prompt version	Cost impact of prompt changes
Retrieval strategy	RAG context size cost

Cost Formula

txt

request_cost =
  input_tokens * input_price_per_token
  + output_tokens * output_price_per_token
  + retrieval_cost
  + reranking_cost

Cost dashboards should show both absolute spend and value metrics like successful resolutions.
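The formula above translates directly into code. In the sketch below the price table holds made-up per-token prices for hypothetical models, and `spend_by` is an illustrative helper for rolling cost up by any recorded dimension.

```python
from collections import defaultdict

# Made-up per-token prices in USD for hypothetical models.
PRICES = {
    "small-fast-model": {"input": 0.000001, "output": 0.000002},
    "high-quality-model": {"input": 0.00001, "output": 0.00003},
}


def request_cost(model, input_tokens, output_tokens,
                 retrieval_cost=0.0, reranking_cost=0.0):
    """Apply the cost formula: token costs plus retrieval and reranking."""
    price = PRICES[model]
    return (
        input_tokens * price["input"]
        + output_tokens * price["output"]
        + retrieval_cost
        + reranking_cost
    )


def spend_by(records, dimension):
    """Sum the 'cost' field of each record, grouped by the given dimension."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["cost"]
    return dict(totals)
```

Each record would carry the dimensions from the table above (model, feature, tenant, and so on), so the same data can answer different spend questions without re-instrumenting callers.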

Gateway State and Data

Data	Retention Guidance
Raw prompts and responses	Redact or sample if sensitive
Token counts	Keep for cost and capacity planning
Model decisions	Keep for debugging
Provider errors	Keep for reliability analysis
Safety classifications	Keep according to policy

💡

Privacy matters: Gateway logs can contain sensitive user data. Redaction, retention, access control, and audit logs are part of the design.
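As a rough sketch of redaction at logging time, the example below always keeps token counts and stores raw text only when explicitly requested, after masking. The two regular expressions are simplistic assumptions that catch obvious emails and card-like digit runs; they are not a complete PII detector.

```python
import re

# Simplistic patterns for illustration; real redaction needs broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators


def redact(text: str) -> str:
    """Mask email addresses and card-like numbers in free text."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)


def log_record(request_id, prompt, response, input_tokens, output_tokens,
               keep_raw=False):
    """Build a log entry: counts are always kept, raw text only if requested."""
    record = {
        "request_id": request_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    if keep_raw:
        record["prompt"] = redact(prompt)
        record["response"] = redact(response)
    return record
```

Defaulting `keep_raw` to off means that storing prompts and responses is a deliberate choice, which fits the guidance to redact or sample sensitive content.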

What to Remember for Interviews

Centralize model access: A gateway gives consistency across teams and features.
Route by task: Use cheaper models for simple tasks and stronger models for hard tasks.
Fall back carefully: Availability fallback must not bypass safety or break schemas.
Rate-limit tokens, not just requests: Token volume drives cost and throughput.
Instrument cost and quality together: Cheap but bad answers are not an optimization.

✅

Practice: Design an LLM gateway for three product teams. Include model routing, tenant budgets, streaming, fallback policy, logging, and schema validation.
