Generative AI Systems: Architecture, LLMs, RAG, and Production Considerations

Learn the architecture of generative AI systems through a beginner-friendly story covering LLMs, prompts, embeddings, RAG, guardrails, latency, cost, and production trade-offs.

Tags: AI, LLM, RAG, generative AI, machine learning, embeddings

Start With a Product Story

Imagine you work at a company with thousands of help-center articles, internal runbooks, API docs, release notes, and support tickets. A customer opens chat and asks:

"Why did my invoice increase this month, and how can I reduce it?"

A normal search box may find documents containing words like "invoice", "billing", and "reduce". But the customer does not want a list of links. They want a useful answer:

"Your invoice increased because usage crossed the free tier on May 12. You can reduce next month's bill by lowering retention from 90 days to 30 days or moving the analytics job to weekly. Here are the relevant policy pages."

This is the promise of a generative AI system. It can understand a natural-language question, gather relevant context, reason over that context, and produce a helpful response.

But here is the catch: an LLM by itself does not know your customer's invoice, your current policy, your internal docs, or your latest product behavior. If you simply send the question to a model and hope for the best, it may produce a confident but wrong answer.

Production Gen AI architecture is about wrapping the model with the systems it needs to be useful, safe, fast, and affordable.

✅

Mental model: The LLM is the language engine. Your application architecture supplies the memory, facts, permissions, tools, safety checks, and feedback loops around it.

What a Generative AI System Actually Is

A beginner often thinks the system is just:

That is a useful starting point, but it is not enough for real products. A production system usually looks more like this:

Each box exists because the model has a limitation:

It does not automatically know private company data.
It can be wrong.
It can be slow.
It costs money per request.
It may produce output in the wrong format.
It may be tricked by malicious or messy input.
It cannot be trusted to enforce authorization by itself.

So the architecture is not "call an LLM." The architecture is "design a controlled path from user intent to grounded answer."

LLMs in Plain English

A Large Language Model is trained to predict and generate text. It has seen a huge amount of text during training, and from that training it learns patterns: grammar, facts, styles, code shapes, reasoning patterns, and relationships between words and ideas.

When you send a prompt, the model does not search the internet or query your database by default. It reads the text you provide and predicts the next token, then the next, then the next.

Tokens

Models do not read text exactly the way humans do. They split text into tokens. A token may be a word, part of a word, a punctuation mark, or a whitespace-like unit.

For example, this sentence:

txt

The payment failed because the card expired.

may become pieces like:

txt

The | payment | failed | because | the | card | expired | .

The exact split depends on the model tokenizer. You do not need to memorize tokenization details at first. Just remember: token count affects cost, latency, and how much information fits into the model request.
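To make the idea concrete, here is a minimal sketch of a toy splitter that reproduces the word-and-punctuation split shown above. It is not a real model tokenizer (real tokenizers use learned subword vocabularies), and the function name rough_tokens is made up for this illustration.

```python
import re


def rough_tokens(text: str) -> list[str]:
    """Split text into words and single punctuation marks.

    This is a toy stand-in for a real tokenizer, useful only for
    building intuition about token counts.
    """
    return re.findall(r"\w+|[^\w\s]", text)


tokens = rough_tokens("The payment failed because the card expired.")
print(" | ".join(tokens))  # The | payment | failed | because | the | card | expired | .
print(len(tokens))         # 8
```

Counting pieces this way only approximates what a provider bills for, but it shows why longer prompts mean more tokens.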

Context Window

The context window is the amount of text the model can consider in one request. It includes:

system instructions
conversation history
user question
retrieved documents
tool results
expected output format

If you stuff too much into the context, the request becomes slower and more expensive. The model may also pay less attention to the most important parts.

⚠️

Large context is not the same as good context: A smaller, carefully selected set of facts is often better than dumping every document into the prompt.
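One way to act on this is to give the prompt a token budget and fill it with the most relevant chunks first. The sketch below assumes each chunk already carries a relevance score from retrieval and uses a crude one-token-per-word estimate. The names Chunk, estimate_tokens, and select_context are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from retrieval; higher means more relevant


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly one token per whitespace-separated word.
    return len(text.split())


def select_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Keep the highest-scoring chunks that still fit within the token budget."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected
```

A greedy selection like this is simple, not optimal. The point is that the application, not the model, decides what is worth the space.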

The First Design: A Simple Chatbot

Let us build the simplest possible AI assistant:

This can work for general questions:

"Summarize this paragraph."
"Rewrite this email politely."
"Generate five title ideas."
"Explain load balancing like I am new to backend engineering."

For these tasks, the user provides most of the required context. The model can transform, summarize, classify, or explain.

But the simple design breaks when the question depends on private or fresh information:

"What is our refund policy for annual plans?"
"Why did this customer's payment fail?"
"Which deployment caused today's latency spike?"
"What changed in version 2.8?"

The model cannot answer these reliably unless your system gives it the right facts.

That brings us to retrieval.

Why RAG Exists

RAG stands for Retrieval-Augmented Generation. It means: before asking the LLM to answer, retrieve relevant information from your own knowledge sources and include that information in the prompt.

Think of the LLM as a smart writer sitting at a desk. RAG is the assistant who brings the right pages from the filing cabinet before the writer starts.

Now the model is no longer answering only from memory. It is answering from the context your system retrieved.

A RAG Example

Suppose the user asks:

"Can I pause my subscription?"

Your system searches the help center and finds:

txt

Source: Billing FAQ
Customers on monthly plans can pause subscriptions for up to 3 months.
Annual plans cannot be paused, but customers can contact support for account credit review.

Then your prompt becomes:

txt

You are a support assistant. Answer only using the provided source.

User question:
Can I pause my subscription?

Source:
Customers on monthly plans can pause subscriptions for up to 3 months.
Annual plans cannot be paused, but customers can contact support for account credit review.

The answer can now be grounded:

"If you are on a monthly plan, you can pause your subscription for up to 3 months. Annual plans cannot be paused, but support can review whether account credit applies."

This is much safer than hoping the model remembers or guesses the policy.
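The prompt assembly step above can be sketched as a small function. This is an illustration under simple assumptions: sources arrive as (title, text) pairs, and build_prompt is a hypothetical helper name, not part of any library.

```python
def build_prompt(question: str, sources: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from a question and (title, text) source pairs."""
    lines = [
        "You are a support assistant. Answer only using the provided source.",
        "",
        "User question:",
        question,
        "",
    ]
    for title, text in sources:
        lines.append(f"Source: {title}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines).rstrip()


prompt = build_prompt(
    "Can I pause my subscription?",
    [(
        "Billing FAQ",
        "Customers on monthly plans can pause subscriptions for up to 3 months.\n"
        "Annual plans cannot be paused, but customers can contact support "
        "for account credit review.",
    )],
)
print(prompt)
```

Keeping the source title next to its text makes it easier to ask the model for citations later.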

How Documents Become Searchable

A company knowledge base may contain PDFs, Markdown files, HTML pages, tickets, database rows, spreadsheets, and chat transcripts. The system must turn these into searchable units.

The ingestion path usually runs in the background:

Let us unpack that slowly.

Parsing

Parsing extracts meaningful text from the source. This is easy for clean Markdown and harder for PDFs, scanned documents, tables, and webpages with navigation menus.

Bad parsing creates bad retrieval. If every page includes the same sidebar text, the search index will repeatedly retrieve irrelevant navigation content.

Chunking

Models and search systems work better with smaller pieces of text than with huge documents. Chunking splits a document into sections that can be retrieved independently.

Imagine a 30-page API guide. If the user asks about authentication headers, you do not want to retrieve the entire guide. You want the section that explains authentication.

Good chunks usually preserve meaning:

txt

Heading: Authentication
Text: All API requests must include an Authorization header...

Poor chunks cut through ideas:

txt

...must include an Authorization

txt

header with a bearer token. If the token...

The right chunk size depends on the document type and user questions. Technical docs often benefit from heading-aware chunks. Long policy documents may need section-based chunks. Code documentation may need to keep examples with explanation.
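As an illustration of heading-aware chunking, the sketch below splits a Markdown document at lines that start with "#", keeping each heading together with its text. It assumes headings never appear inside code samples, which a real parser would have to handle. The function name chunk_by_heading and the sample "Pagination" section are made up for this example.

```python
def chunk_by_heading(markdown: str) -> list[dict[str, str]]:
    """Split Markdown into one chunk per heading, keeping heading and text together."""
    chunks: list[dict[str, str]] = []
    heading = ""
    body: list[str] = []

    def flush() -> None:
        # Store the section collected so far, skipping empty sections.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            heading = line.lstrip("#").strip()
            body = []
        else:
            body.append(line)
    flush()  # do not forget the final section
    return chunks


doc = """# Authentication
All API requests must include an Authorization header.

# Pagination
List endpoints return results in pages.
"""
for chunk in chunk_by_heading(doc):
    print(chunk["heading"], "->", chunk["text"])
```

Because the heading travels with the text, a retrieved chunk still says what it is about even when read in isolation.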

Embeddings

An embedding is a list of numbers that represents meaning. Similar ideas produce vectors that are close to each other.

Two phrases can be worded differently and still mean similar things, in which case their embeddings should be close. This is why semantic search can find useful documents even when the user does not use the exact words from the document.
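Closeness between embeddings is commonly measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration. Real embeddings come from an embedding model and typically have hundreds or thousands of dimensions, and the three phrases here are hypothetical examples.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real model output.
cancel_my_plan = [0.9, 0.1, 0.0]        # "How do I cancel my plan?"
stop_subscription = [0.8, 0.2, 0.1]     # "I want to stop my subscription"
reset_password = [0.0, 0.2, 0.9]        # "How do I reset my password?"

similar = cosine_similarity(cancel_my_plan, stop_subscription)
different = cosine_similarity(cancel_my_plan, reset_password)
print(similar > different)  # True
```

The first two phrases share no key words, yet their vectors point in nearly the same direction, while the third points elsewhere.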

Vector Store

A vector store keeps embeddings and lets you search for "nearest" vectors.

When the user asks a question, the system embeds the question, compares it with stored document embeddings, and retrieves the closest chunks.
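A minimal in-memory sketch of that lookup is shown below. It compares the query against every stored vector, which is fine for a demo but is not how production vector stores work at scale. The class name TinyVectorStore and the chunk IDs are invented for this example.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TinyVectorStore:
    """Brute-force nearest-vector search over a small in-memory list."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._items.append((chunk_id, vector))

    def search(self, query: list[float], k: int = 3) -> list[tuple[str, float]]:
        """Return the k stored chunk IDs most similar to the query vector."""
        scored = [(chunk_id, _cosine(query, vec)) for chunk_id, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add("billing-faq", [0.9, 0.1, 0.0])       # toy vectors, not real embeddings
store.add("password-reset", [0.0, 0.2, 0.9])
store.add("pause-policy", [0.7, 0.3, 0.1])

query_vector = [0.8, 0.2, 0.0]  # pretend this is the embedded user question
print([chunk_id for chunk_id, _ in store.search(query_vector, k=2)])
```

In a real system the vectors would come from the same embedding model for both documents and queries, since vectors from different models are not comparable.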

At a beginner level, think of a vector database as a search engine for meaning. Later lessons go deeper into HNSW, approximate nearest-neighbor search, metadata filtering, and database choices.

The Complete RAG Flow

Now combine the offline and online parts.

This flow has two very different engineering concerns.

The offline path cares about document quality, indexing, freshness, permissions, and versioning.

The online path cares about latency, relevance, cost, prompt design, model choice, and safety.

If either side is weak, the user sees a bad answer.

Prompting Is Interface Design

A prompt is not a magic spell. It is the interface between your application and the model.

A good production prompt usually contains:

the role of the assistant
the task
rules and constraints
retrieved context
output format
what to do when information is missing

For example:

txt

You are a customer support assistant.
Answer only from the provided sources.
If the sources do not contain the answer, say you do not know.
Use a friendly tone.
Return the answer with source citations.

This prompt is doing real architecture work. It defines how the model should behave when it has evidence, when it lacks evidence, and how the application expects the response to look.

Structured Output

Sometimes your app needs data, not prose. For example, a support system may need:

json

{
  "answer": "Monthly plans can be paused for up to 3 months.",
  "citations": ["billing-faq-12"],
  "needsHumanReview": false
}

This is called structured output. The application can validate it with a schema before showing it to the user or triggering the next workflow.

✅

Beginner rule: If software needs to consume the model response, ask for structured output and validate it. Do not rely on fragile string parsing.
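Here is a minimal validation sketch for the JSON shape shown above, using only the Python standard library. A real project might use a schema library instead. The function name parse_support_answer is hypothetical.

```python
import json


def parse_support_answer(raw: str) -> dict:
    """Parse the model's reply and check it matches the expected shape.

    Raises ValueError if the reply is not valid JSON or a field is wrong.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    answer = data.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        raise ValueError("'answer' must be a non-empty string")
    citations = data.get("citations")
    if not isinstance(citations, list) or not all(isinstance(c, str) for c in citations):
        raise ValueError("'citations' must be a list of strings")
    if not isinstance(data.get("needsHumanReview"), bool):
        raise ValueError("'needsHumanReview' must be true or false")
    return data


reply = '{"answer": "Monthly plans can be paused for up to 3 months.", "citations": ["billing-faq-12"], "needsHumanReview": false}'
print(parse_support_answer(reply)["citations"])  # ['billing-faq-12']
```

When validation fails, the application can retry, fall back to a safe message, or route the conversation to a human instead of showing a broken response.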

Why Guardrails Are Necessary

LLMs can be helpful, but they are not naturally safe production components. They may:

answer without enough evidence
return malformed JSON
reveal sensitive information if retrieval is wrong
follow malicious instructions hidden inside retrieved documents
produce harmful or off-policy content
make tool calls with wrong arguments

Guardrails are the checks around the model.

The most important beginner lesson is this: permissions must be enforced by your system, not by the model.

If a user is not allowed to see payroll documents, those documents should never be retrieved into the model context. Do not retrieve them and tell the model, "Please do not mention this." That is not a reliable security boundary.
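The rule can be sketched as a filter that runs before anything reaches the prompt. This example assumes each document records which groups may read it. The Doc type, the group names, and filter_allowed are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # groups permitted to read this document


def filter_allowed(docs: list[Doc], user_groups: set[str]) -> list[Doc]:
    """Drop every document the user may not read, before building the prompt."""
    return [doc for doc in docs if doc.allowed_groups & user_groups]


retrieved = [
    Doc("billing-faq", "Monthly plans can be paused...", frozenset({"support", "finance"})),
    Doc("payroll-2024", "Salary bands...", frozenset({"hr"})),
]
visible = filter_allowed(retrieved, user_groups={"support"})
print([doc.doc_id for doc in visible])  # ['billing-faq']
```

In practice it is usually better to apply the same check inside the search query itself, so unauthorized documents are never fetched at all.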

LLMs Can Use Tools

Some AI systems only answer questions. Others can take actions:

look up an order
create a support ticket
query metrics
run a code search
schedule a meeting
issue a refund

The model should not directly access your database or production systems. Instead, it proposes a tool call, and your application validates and executes it.

Tool use is powerful because it connects language to real systems. It is risky for the same reason. Any tool that changes state, sends messages, spends money, deletes data, or exposes sensitive information needs strong validation and often human confirmation.
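A minimal sketch of the validation step might look like the following. It assumes the model's proposal arrives as a tool name plus a dictionary of arguments. The tool registry, the tool names, and check_tool_call are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    required_args: frozenset[str]
    changes_state: bool  # True for tools that modify data or spend money


# Hypothetical registry: only tools listed here can ever run.
TOOLS = {
    "lookup_order": ToolSpec(frozenset({"order_id"}), changes_state=False),
    "issue_refund": ToolSpec(frozenset({"order_id", "amount"}), changes_state=True),
}


def check_tool_call(name: str, args: dict, confirmed: bool = False) -> str:
    """Return 'execute' or 'needs_confirmation'; raise ValueError for invalid calls."""
    spec = TOOLS.get(name)
    if spec is None:
        raise ValueError(f"unknown tool: {name}")
    if set(args) != spec.required_args:
        raise ValueError(f"{name} expects arguments {sorted(spec.required_args)}")
    if spec.changes_state and not confirmed:
        return "needs_confirmation"
    return "execute"


print(check_tool_call("lookup_order", {"order_id": "A-1001"}))                  # execute
print(check_tool_call("issue_refund", {"order_id": "A-1001", "amount": 20}))    # needs_confirmation
```

A fuller version would also validate argument values (for example, a refund limit) and check that the user is allowed to use the tool.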

Choosing Where the Model Runs

There are two broad ways to use LLMs.

First, you can call a managed model API. This is the fastest way to build because the provider handles model hosting, scaling, updates, and most serving infrastructure.

Second, you can self-host or privately host models. This gives more control over data, latency, customization, and cost at high scale, but it also means you own GPUs, model serving, capacity planning, upgrades, and reliability.

For most teams, managed APIs are the right starting point. Self-hosting becomes attractive when usage is large, privacy requirements are strict, latency must be controlled, or the model is specialized enough to justify the operational burden.

Latency: Why AI Responses Feel Slow

LLM latency feels different from normal API latency because the model generates output token by token.

There are three timings to understand:

time until the first token starts
speed of tokens after generation begins
total time until the full answer is complete

Streaming helps because the user can start reading before the full answer is finished.
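These three measurements can be taken around any streaming response. The sketch below uses a fake generator in place of a real model stream. The names fake_stream and measure_stream are hypothetical, and the delay is an arbitrary value chosen for the demo.

```python
import time
from typing import Iterable, Iterator


def fake_stream(tokens: list[str], delay: float = 0.01) -> Iterator[str]:
    """Stand-in for a model stream: yields one token after each short delay."""
    for token in tokens:
        time.sleep(delay)
        yield token


def measure_stream(stream: Iterable[str]) -> dict[str, float]:
    """Measure time to first token, tokens per second after that, and total time."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")

    generation_time = end - first_token_at
    rate = (count - 1) / generation_time if count > 1 and generation_time > 0 else 0.0
    return {
        "time_to_first_token": first_token_at - start,
        "tokens_per_second": rate,
        "total_time": end - start,
        "tokens": float(count),
    }


stats = measure_stream(fake_stream(["Your", " invoice", " increased", " because", "..."]))
print(stats)
```

Tracking the three numbers separately helps show whether slowness comes from waiting for the model to start or from a long answer.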

Latency comes from many places:

retrieval may take time
the prompt may be large
the model may be busy
the model may be large
the answer may be long
output validation may retry

A good system optimizes the whole path, not only the model call.

Cost: Why Architecture Matters

LLM cost usually depends on how much text you send in and how much text the model generates. RAG can improve correctness, but it can also increase cost because retrieved context adds input tokens.

That means architecture choices directly affect spend:

long system prompts cost more
too many retrieved chunks cost more
verbose answers cost more
retries cost more
using powerful models for simple tasks costs more

The usual pattern is to route tasks:

Simple classification or formatting may use a smaller model. Complex reasoning or high-value customer-facing responses may use a stronger model. The system should measure cost by feature, tenant, model, and prompt version.
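A rough sketch of routing and cost estimation is shown below. The model tiers, the task categories, and especially the prices are invented for illustration. Real prices vary by provider and change over time.

```python
# Hypothetical prices in USD per 1,000 tokens: (input, output).
PRICES_PER_1K = {
    "small": (0.0005, 0.0015),
    "large": (0.01, 0.03),
}

# Assumed task types that a smaller model can handle.
SIMPLE_TASKS = {"classification", "formatting"}


def choose_model(task_type: str) -> str:
    """Route simple tasks to the small model and everything else to the large one."""
    return "small" if task_type in SIMPLE_TASKS else "large"


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its input and output token counts."""
    price_in, price_out = PRICES_PER_1K[model]
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out


for task in ("classification", "support_answer"):
    model = choose_model(task)
    cost = estimate_cost(model, input_tokens=2000, output_tokens=500)
    print(f"{task}: {model} model, about ${cost:.4f} per request")
```

Multiplying a per-request estimate by expected traffic is a quick way to see how much retrieved context or a stronger model changes monthly spend.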

Observability: Knowing If It Works

When a normal service returns HTTP 200, the response is usually correct. An AI service can return HTTP 200 and be completely wrong.

So you need both system metrics and quality metrics.

You should be able to answer:

Which documents were retrieved?
Which model was used?
How many tokens were sent and generated?
How long did retrieval and generation take?
Did validation pass?
Did the user accept, retry, downvote, or escalate?
Which prompt version produced the answer?

Without this, improving the system becomes guesswork.
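One way to make those questions answerable is to write one structured record per answer. The sketch below shows a possible shape for such a record. The field names and the AnswerTrace type are assumptions, not a standard format.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AnswerTrace:
    prompt_version: str
    model: str
    retrieved_doc_ids: list[str]
    input_tokens: int
    output_tokens: int
    retrieval_ms: float
    generation_ms: float
    validation_passed: bool
    user_feedback: Optional[str] = None  # e.g. "accepted", "retried", "downvoted", "escalated"
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_log_line(trace: AnswerTrace) -> str:
    """Serialize the trace as one JSON line for a log pipeline."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = AnswerTrace(
    prompt_version="support-v3",
    model="large",
    retrieved_doc_ids=["billing-faq-12"],
    input_tokens=1850,
    output_tokens=120,
    retrieval_ms=42.0,
    generation_ms=910.0,
    validation_passed=True,
)
print(to_log_line(trace))
```

Take care not to log raw customer data in such records unless your privacy rules allow it. Document IDs and counts are usually enough for debugging.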

A Production Gen AI System, End to End

Let us return to the customer invoice question:

"Why did my invoice increase this month, and how can I reduce it?"

A production system may handle it like this:

Notice what the LLM does and does not do.

The LLM writes the explanation. It does not decide whether the customer is allowed to see the invoice. It does not magically know the invoice details. It does not enforce billing policy alone. It does not get unlimited freedom to call tools.

The surrounding system provides facts, permissions, policy, validation, and monitoring.

That is the core mindset of Gen AI system design.

Common Beginner Misunderstandings

"The model knows everything."

It does not. It knows only what it learned during training and what you provide in the request. For private, fresh, or precise data, use retrieval and tools.

"A bigger model fixes the architecture."

A stronger model may improve reasoning, but it cannot fix missing permissions, stale documents, bad chunking, no observability, or unsafe tool execution.

"RAG eliminates hallucinations."

RAG reduces hallucinations when retrieval is good and prompts force grounding. It does not guarantee truth.

"We can add guardrails later."

Guardrails are part of the architecture. If you log sensitive data, retrieve unauthorized documents, or allow unsafe tool calls, adding a filter at the end will not fully repair the design.

"Longer context is always better."

Long context can help, but it increases cost and latency. It can also distract the model. Retrieve the right context, not all context.

How to Think in Interviews

When asked to design a Gen AI feature, do not jump straight to "use an LLM." Walk through the system:

What does the user want to accomplish?
What facts does the model need?
Where do those facts live?
How will we retrieve only allowed and relevant context?
What model or models should handle the task?
What output format does the application need?
What can go wrong?
What guardrails validate input, retrieval, tool calls, and output?
How do we measure quality, latency, and cost?
What happens when the model is uncertain or the provider is down?

This turns an AI feature from a demo into a production system.

✅

Practice: Design a customer support assistant for a SaaS product. Start with one user question, then draw the path through authentication, retrieval, prompt construction, model call, validation, answer delivery, and observability.
