RAG Architecture: Chunking, Retrieval, Reranking, and Generation

Learn Retrieval-Augmented Generation through a beginner-friendly story covering ingestion, chunking, embeddings, retrieval, reranking, context assembly, citations, evaluation, and advanced RAG patterns.


Start With a Support Assistant

Imagine you are building an AI assistant for a SaaS product. A customer asks:

"Can I pause my annual subscription?"

The answer is somewhere in your billing documentation, but there are hundreds of docs: pricing pages, refund policies, enterprise contracts, support macros, release notes, and internal runbooks.

If you send only the question to an LLM, the model may produce a pleasant answer that sounds right but is not based on your actual policy.

If you send every billing document to the model, the request becomes slow, expensive, noisy, and may exceed the context window.

RAG exists for the middle path:

Find the few pieces of information that matter.
Give those pieces to the model.
Ask the model to answer from that evidence.
Return the answer with sources.

✅

Mental model: RAG turns an LLM from a guesser into a reader. It gives the model the right pages before asking it to write the answer.

What RAG Means

RAG stands for Retrieval-Augmented Generation.

The name has two halves:

Retrieval: find relevant information from your documents, database, tickets, or tools.
Generation: ask the LLM to produce a useful answer using that retrieved information.

This sounds simple, but each half hides design choices. What counts as relevant? How big should document chunks be? What if two documents disagree? What if the user is not allowed to see one of the retrieved documents? What if retrieval finds nothing?

Those are the real RAG architecture questions.

Why Not Just Fine-Tune the Model?

Beginners often ask: "Why not train the model on our docs?"

Fine-tuning and RAG solve different problems.

Fine-tuning is useful when you want to change behavior: tone, format, classification style, domain-specific response patterns, or task performance.

RAG is useful when you need fresh, private, traceable knowledge.

For support docs, policies, product behavior, prices, API details, and customer-specific information, RAG is usually the first tool to reach for because the knowledge changes often and you need citations.

⚠️

Important distinction: Fine-tuning can teach a model how to answer. RAG gives the model what to answer from.

The Two Sides of RAG

A RAG system has an offline side and an online side.

The offline side prepares knowledge before users ask questions. The online side answers each user request.

Most demos focus on the online part because it feels exciting. Most production failures start in the offline part: messy parsing, bad chunks, stale indexes, missing permissions, or weak metadata.

So let us walk through the system from the beginning.

Step 1: Ingest the Documents

Before the assistant can answer from your docs, the system must read them.

Document ingestion means collecting source content and turning it into clean text plus metadata.

Sources may include:

help-center articles
Markdown docs
PDFs
API reference pages
support tickets
database rows
release notes
internal runbooks

This sounds boring, but it is foundational. If ingestion is poor, the model receives poor evidence.

For example, a PDF parser may accidentally mix headers, footers, and table columns into one confusing paragraph. A webpage parser may include navigation text on every page. A ticket importer may miss timestamps or customer tier. These details affect retrieval quality later.

A good ingestion system preserves:

source URL or document ID
title and headings
document owner
last updated time
product area
tenant or permission scope
parser version
embedding model version

That metadata becomes critical when you debug answers, filter permissions, or re-index later.
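As a rough illustration, the metadata list above could be carried on a record like the one below. This is a minimal sketch, not a prescribed schema: the class name `IngestedDocument`, the helper `needs_reindex`, the example URL, and the version strings are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestedDocument:
    """Clean text plus the metadata used for debugging, filtering, and re-indexing."""
    doc_id: str
    source_url: str
    title: str
    text: str
    owner: str
    last_updated: datetime
    product_area: str
    permission_scope: str            # e.g. a tenant ID, or "public"
    parser_version: str
    embedding_model_version: str
    headings: tuple = ()


def needs_reindex(doc, current_parser, current_embedder):
    """Re-index when the parser or embedding model changed after ingestion."""
    return (doc.parser_version != current_parser
            or doc.embedding_model_version != current_embedder)


# Hypothetical example record.
doc = IngestedDocument(
    doc_id="billing-faq",
    source_url="https://example.com/docs/billing-faq",
    title="Billing FAQ",
    text="Annual subscriptions cannot be paused.",
    owner="billing-team",
    last_updated=datetime(2024, 1, 15, tzinfo=timezone.utc),
    product_area="billing",
    permission_scope="public",
    parser_version="parser-v1",
    embedding_model_version="embed-v1",
    headings=("Billing FAQ", "Annual plans"),
)
```

Storing the parser and embedding model versions on every record is what makes a targeted re-index possible later: you can select only the documents produced by an older version instead of rebuilding everything.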

Step 2: Chunk the Documents

An LLM does not need an entire 40-page billing guide to answer one question. It needs the few paragraphs that matter.

Chunking splits documents into smaller pieces that can be searched and inserted into the prompt.

The Goldilocks Problem

Chunks can be too small, too large, or just right.

Too small:

txt

This option is not available.

The model cannot tell what option this refers to.

Too large:

txt

The full billing guide with every rule for trials, monthly plans,
annual plans, enterprise contracts, refunds, credits, taxes, and invoices.

The retrieved chunk contains too much noise.

Better:

txt

Annual subscriptions cannot be paused. Customers on annual plans can contact
support for a credit review if there was a billing mistake or exceptional case.

This chunk is specific enough to retrieve and complete enough to answer from.

Practical Chunking Advice

For beginners, start with structure-aware chunking:

split Markdown by headings
keep heading text with each chunk
keep list items together
keep code examples with their explanation
keep table captions and rows together
avoid splitting in the middle of a sentence

Then test with real user questions.
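Here is one way the first two items of that advice (split by headings, keep heading text with each chunk) might look in code. It is a simplified sketch: the function name `chunk_markdown_by_headings` is an assumption, and it does not handle edge cases such as `#` lines inside code fences or sections that are still too long after splitting.

```python
import re


def chunk_markdown_by_headings(markdown):
    """Split Markdown at headings; each chunk keeps its full heading path."""
    chunks = []
    heading_path = []   # e.g. ["Billing", "Annual plans"]
    body = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading": " > ".join(heading_path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()                              # close the previous section
            level = len(match.group(1))
            del heading_path[level - 1:]         # drop same-level and deeper headings
            heading_path.append(match.group(2).strip())
        else:
            body.append(line)                    # list items stay with their section
    flush()
    return chunks


sample = """# Billing
Intro text.
## Annual plans
Annual subscriptions cannot be paused.
## Monthly plans
- Can be paused
- Up to 3 months
"""
chunks = chunk_markdown_by_headings(sample)
```

Because the heading path travels with the text, a chunk like "Annual subscriptions cannot be paused." is stored under "Billing > Annual plans" rather than floating free of its context.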

✅

Chunking is not a one-time decision: It is a retrieval-quality decision, and you improve it by looking at queries where retrieval failed.

Step 3: Embed and Index the Chunks

After chunking, each chunk is passed through an embedding model. The embedding model converts the chunk into a vector, which is a list of numbers representing meaning.

The vector database stores the vector, original text, and metadata. Later, when a user asks a question, the system embeds the question and searches for nearby chunk vectors.

This is what makes semantic retrieval possible. A user can ask "Can I pause my yearly subscription?" and retrieve a chunk that says "Annual subscriptions cannot be paused."

The words are different. The meaning is close.
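To make "nearby vectors" concrete, here is a small sketch of nearest-neighbor search using cosine similarity. The three-dimensional vectors are invented by hand for illustration; a real embedding model produces vectors with hundreds or thousands of dimensions, and a vector database would replace the brute-force loop. The third chunk (about API keys) is a hypothetical extra document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=2):
    """Brute-force search: score every stored chunk, return the closest top_k."""
    scored = [(cosine_similarity(query_vector, entry["vector"]), entry) for entry in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up vectors standing in for real embeddings.
index = [
    {"id": "S1", "text": "Annual subscriptions cannot be paused.", "vector": [0.9, 0.1, 0.0]},
    {"id": "S2", "text": "Monthly subscriptions can be paused for up to 3 months.", "vector": [0.6, 0.7, 0.1]},
    {"id": "S3", "text": "API keys can be rotated from the settings page.", "vector": [0.0, 0.1, 0.9]},
]

# Pretend this is the embedding of "Can I pause my yearly subscription?"
query_vector = [0.85, 0.2, 0.05]
results = search(query_vector, index)
result_ids = [entry["id"] for _, entry in results]
```

The query never has to share a word with the chunk. It only has to land near it in vector space.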

Step 4: Retrieve Candidate Chunks

Now the user asks:

"Can I pause my annual subscription?"

The online path begins. The system embeds the question and retrieves candidate chunks.

Retrieval should aim for recall first. That means the first retrieval step should gather enough candidates so the right evidence is likely included.

It is okay if the first candidate set includes a few irrelevant chunks. Later stages can clean that up. It is worse if the right chunk never appears at all.

Dense, Sparse, and Hybrid Retrieval

Vector retrieval is also called dense retrieval. It is good at semantic meaning.

Keyword retrieval, often using algorithms like BM25, is called sparse retrieval. It is good at exact terms.

For support and engineering systems, hybrid retrieval, which runs both and merges the results, is often best.

Why both?

If the query is "billing policy for failed card retries", semantic retrieval is useful.

If the query is "ERR_BILLING_4027", exact keyword retrieval is essential.

If the query is "How do I rotate API keys?", both may help: vector search finds conceptually related security docs, while keyword search catches exact "API key" references.
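One common way to merge a dense result list with a sparse one is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one reasonable option rather than the required one. The chunk IDs below are hypothetical, and `k=60` is a commonly used default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one list.

    Each list contributes 1 / (k + rank) for every chunk it contains, so a
    chunk that ranks well in both lists rises above one that appears in only one.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results for "How do I rotate API keys?"
dense_results = ["security-overview", "api-key-rotation", "sso-setup"]
sparse_results = ["api-key-rotation", "api-reference", "security-overview"]

fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

RRF works on ranks rather than raw scores, which is convenient because vector similarity scores and BM25 scores are not on the same scale.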

Step 5: Rerank the Evidence

Initial retrieval may return 20, 50, or 100 candidate chunks. We do not want to dump all of them into the model. That would be slow, expensive, and confusing.

Reranking takes the candidate chunks and scores them more carefully against the user question.

Think of retrieval as casting a wide net. Reranking is choosing the best fish from the net.

Reranking adds latency, but it can dramatically improve answer quality because the final prompt receives cleaner evidence.

For many RAG systems, improving reranking helps more than switching to a bigger generation model.
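The shape of a reranking step can be sketched as "score every candidate against the question, keep the best few". In the sketch below the scorer is a deliberately simple term-overlap function so the example runs on its own. Production systems typically plug a trained reranking model into the same slot. The names `rerank`, `overlap_score`, and `terms`, the stopword list, and the invoice chunk are all assumptions for illustration.

```python
import re

STOPWORDS = {"a", "be", "can", "for", "i", "my", "the", "to"}


def terms(text):
    """Lowercase, drop stopwords, and keep a 5-letter prefix as a crude stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word[:5] for word in words if word not in STOPWORDS}


def overlap_score(question, chunk):
    """Toy relevance score: share of question terms found in the chunk."""
    question_terms = terms(question)
    if not question_terms:
        return 0.0
    return len(question_terms & terms(chunk)) / len(question_terms)


def rerank(question, candidates, score_fn=overlap_score, top_n=2):
    """Score every candidate against the question and keep the best top_n."""
    ranked = sorted(candidates, key=lambda chunk: score_fn(question, chunk), reverse=True)
    return ranked[:top_n]


candidates = [
    "Monthly subscriptions can be paused for up to 3 months.",
    "Invoices are emailed on the first of each month.",
    "Annual subscriptions cannot be paused.",
]
best = rerank("Can I pause my annual subscription?", candidates)
```

Keeping the scorer as a parameter means you can swap in a stronger model later without touching the rest of the pipeline.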

Step 6: Assemble the Context

Now you have the best chunks. The next job is to build the prompt.

This step is called context assembly.

Good context assembly is not just concatenating chunks. It decides:

which chunks fit in the context budget
what order they appear in
whether nearby parent sections are needed
how source IDs are shown
whether duplicate chunks should be removed
what the model should do if evidence is missing

Example:

txt

System:
You are a support assistant.
Answer only from the provided sources.
If the answer is not in the sources, say you do not know.
Cite source IDs.

User question:
Can I pause my annual subscription?

Sources:
[S1] Annual subscriptions cannot be paused. Customers on annual plans can contact support for a credit review...
[S2] Monthly subscriptions can be paused for up to 3 months...

The model can now produce an answer like:

"Annual subscriptions cannot be paused. If there was a billing mistake or exceptional case, you can contact support for a credit review. Monthly plans can be paused for up to 3 months. [S1] [S2]"

Notice that the answer is grounded in specific sources.
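The decisions listed earlier in this step (budget, order, duplicates, source IDs, missing-evidence instructions) can be sketched in a few lines. This is an assumption-laden illustration: the function name `assemble_context` is invented, and the budget is counted in characters only to keep the example dependency-free. A real system would count tokens with the model's tokenizer.

```python
SYSTEM_PROMPT = (
    "You are a support assistant.\n"
    "Answer only from the provided sources.\n"
    "If the answer is not in the sources, say you do not know.\n"
    "Cite source IDs."
)


def assemble_context(question, chunks, max_chars=1200):
    """Build a prompt from ranked chunks: dedupe, respect a budget, label sources."""
    seen = set()
    sources = []
    used = 0
    for chunk in chunks:                       # assumed to be sorted best-first
        text = chunk["text"].strip()
        if text in seen:
            continue                           # skip exact duplicates
        if used + len(text) > max_chars:
            break                              # budget spent; lower ranks are dropped
        seen.add(text)
        used += len(text)
        sources.append(f"[S{len(sources) + 1}] {text}")
    return (
        f"System:\n{SYSTEM_PROMPT}\n\n"
        f"User question:\n{question}\n\n"
        "Sources:\n" + "\n".join(sources)
    )


ranked_chunks = [
    {"text": "Annual subscriptions cannot be paused."},
    {"text": "Annual subscriptions cannot be paused."},   # duplicate from another page
    {"text": "Monthly subscriptions can be paused for up to 3 months."},
]
prompt = assemble_context("Can I pause my annual subscription?", ranked_chunks)
```

Source IDs are assigned after deduplication, so the numbering in the prompt stays contiguous and every cited ID maps to exactly one chunk.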

Parent-Child Chunking

Sometimes small chunks retrieve well but do not contain enough context to answer well.

Parent-child chunking solves this by indexing small chunks but returning a larger parent section.

This is useful for long documents where a small paragraph matches the query, but the surrounding section explains the rule.
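A minimal sketch of the lookup side of this pattern follows, assuming each small chunk stores a `parent_id` field and parents are kept in a dictionary. Both the field name and the IDs are hypothetical.

```python
def expand_to_parents(child_hits, parents):
    """Swap matched child chunks for their parent sections.

    Rank order is preserved, and a parent is returned only once even if
    several of its children matched.
    """
    seen = set()
    sections = []
    for child in child_hits:
        parent_id = child["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            sections.append(parents[parent_id])
    return sections


parents = {
    "billing-annual": "Annual plans: Annual subscriptions cannot be paused. "
                      "Customers on annual plans can contact support for a credit review.",
    "billing-monthly": "Monthly plans: Monthly subscriptions can be paused for up to 3 months.",
}
child_hits = [
    {"id": "c7", "parent_id": "billing-annual"},
    {"id": "c8", "parent_id": "billing-annual"},
    {"id": "c2", "parent_id": "billing-monthly"},
]
sections = expand_to_parents(child_hits, parents)
```

Parent sections are larger, so this pattern spends more of the context budget per hit. It is worth checking how many parents actually fit before adopting it.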

Query Rewriting

Users often ask vague or messy questions:

"Does this work for yearly?"

The system may need to rewrite the query before retrieval:

"annual subscription pause policy"

Query rewriting can expand abbreviations, add domain language, split compound questions, or produce multiple search queries.

But it must be used carefully. A bad rewrite can change the user's intent.
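One simple guard against a bad rewrite is to always search with the original question as well, so a rewrite can add queries but never replace the user's own words. The sketch below assumes a `rewrite_fn` that returns candidate rewrites. Here it is a hard-coded stand-in, where a real system would usually call a model.

```python
def build_search_queries(user_query, rewrite_fn, max_queries=3):
    """Return the original query first, followed by distinct rewrites."""
    queries = [user_query.strip()]
    for candidate in rewrite_fn(user_query):
        candidate = candidate.strip()
        already_present = candidate.lower() in {q.lower() for q in queries}
        if candidate and not already_present:
            queries.append(candidate)
    return queries[:max_queries]


def fake_rewriter(query):
    """Stand-in for a model call; returns fixed rewrites for the demo."""
    return ["annual subscription pause policy", "Does this work for yearly?", ""]


queries = build_search_queries("Does this work for yearly?", fake_rewriter)
```

Each query is then sent to retrieval, and the result lists can be merged before reranking.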

HyDE: Searching With a Hypothetical Answer

HyDE stands for Hypothetical Document Embeddings.

Instead of embedding the user's short question directly, the system first asks a model to draft a hypothetical answer or document. Then it embeds that draft and uses it for search.

This can help when the user's question is short but the documents are written in a richer style.

The risk is that the hypothetical answer may introduce assumptions. HyDE should improve retrieval, not become the final source of truth.
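The control flow of HyDE is short enough to show directly. In this sketch `draft_fn`, `embed_fn`, and `search_fn` are placeholders for a model call, an embedding model, and a vector index. The fake versions below exist only so the example runs, and the returned chunk ID is hypothetical.

```python
def hyde_retrieve(question, draft_fn, embed_fn, search_fn, top_k=5):
    """Search with the embedding of a hypothetical answer instead of the raw question."""
    hypothetical = draft_fn(question)     # model-written draft; it may be wrong
    vector = embed_fn(hypothetical)       # embed the draft, not the question
    return search_fn(vector, top_k)       # results are real documents, never the draft


# Stand-ins so the sketch runs end to end.
embedded_texts = []


def fake_draft(question):
    return "Annual subscriptions cannot be paused, but monthly plans can."


def fake_embed(text):
    embedded_texts.append(text)           # record what was embedded
    return [float(len(text))]


def fake_search(vector, top_k):
    return ["billing-faq#annual-plans"][:top_k]


hits = hyde_retrieve("Does this work for yearly?", fake_draft, fake_embed, fake_search)
```

Note that the draft is thrown away after the search. Only the retrieved chunks go into the final prompt, which is how the draft's assumptions are kept out of the answer.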

Multi-Hop Retrieval

Some questions need more than one document.

For example:

"Why did my bill increase after enabling audit logs?"

The system may need:

the customer's usage data
the audit log pricing policy
the retention configuration documentation

That is multi-hop retrieval.

Multi-hop retrieval is powerful, but it increases latency, cost, and debugging complexity. Start with simple retrieval. Add multi-hop only when real questions require it.

What If Retrieval Finds Nothing?

A good RAG system must know how to abstain.

If the system retrieves weak evidence, the model should not make up an answer. It should say something like:

"I could not find this in the available billing documentation."

You can then offer next steps:

ask the user to clarify
route to support
search a broader source
create a documentation gap ticket

This is not a failure. It is a reliability feature.
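A basic version of abstention is a score threshold checked before the model is called. The sketch below assumes relevance scores between 0 and 1 and a `generate_fn` that wraps the model call. The threshold of 0.5 is an arbitrary placeholder that would need tuning against your own evaluation set.

```python
ABSTAIN_MESSAGE = "I could not find this in the available billing documentation."


def answer_or_abstain(scored_chunks, generate_fn, min_score=0.5):
    """Call the model only when at least one chunk clears the relevance threshold."""
    evidence = [chunk for score, chunk in scored_chunks if score >= min_score]
    if not evidence:
        return ABSTAIN_MESSAGE            # no model call, so nothing to hallucinate
    return generate_fn(evidence)


def fake_generate(evidence):
    """Stand-in for the model call."""
    return f"Answer based on {len(evidence)} source(s)."


weak = [(0.21, "Release notes for version 4.2"), (0.18, "Enterprise contract template")]
strong = [(0.91, "Annual subscriptions cannot be paused."), (0.30, "Release notes")]

weak_result = answer_or_abstain(weak, fake_generate)
strong_result = answer_or_abstain(strong, fake_generate)
```

The threshold handles the "nothing relevant was retrieved" case. The prompt instruction to say "I do not know" is still needed for the case where chunks look relevant but do not actually contain the answer.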

Evaluating RAG Quality

You cannot improve RAG by vibes. You need examples and measurements.

Create a small evaluation dataset:

txt

Question: Can I pause my annual subscription?
Expected source: Billing FAQ section 4.2
Expected behavior: Say annual plans cannot be paused; mention support credit review.

Then test each part of the pipeline.

Retrieval evaluation asks:

Did the correct source appear in the top results?
Was it ranked high enough?
Did permission filtering remove anything incorrectly?
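The first two retrieval questions map onto two standard metrics: hit rate at k (did the expected source appear in the top k?) and mean reciprocal rank (how high was it?). Here is a small sketch, assuming each evaluation case has a `question` and an `expected_source` field and that `retrieve_fn` returns a ranked list of source IDs. The retriever and IDs below are fakes for illustration.

```python
def evaluate_retrieval(cases, retrieve_fn, k=3):
    """Compute hit rate at k and mean reciprocal rank over an evaluation set."""
    hits = 0
    reciprocal_ranks = []
    for case in cases:
        retrieved = retrieve_fn(case["question"])[:k]
        if case["expected_source"] in retrieved:
            hits += 1
            rank = retrieved.index(case["expected_source"]) + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)   # expected source missing from top k
    return {
        "hit_rate": hits / len(cases),
        "mrr": sum(reciprocal_ranks) / len(cases),
    }


cases = [
    {"question": "Can I pause my annual subscription?", "expected_source": "billing-faq-4.2"},
    {"question": "How do I rotate API keys?", "expected_source": "security-api-keys"},
]

FAKE_RESULTS = {
    "Can I pause my annual subscription?": ["billing-faq-3.1", "billing-faq-4.2", "pricing"],
    "How do I rotate API keys?": ["sso-setup", "pricing", "release-notes"],
}

metrics = evaluate_retrieval(cases, lambda question: FAKE_RESULTS[question])
```

In this toy run the first question finds its source at rank 2 and the second misses entirely, so the hit rate is 0.5 and the mean reciprocal rank is 0.25.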

Generation evaluation asks:

Did the answer use the source correctly?
Did it cite the right source?
Did it avoid unsupported claims?
Did it say "I do not know" when evidence was missing?

Product evaluation asks:

Did users accept the answer?
Did the answer reduce support escalations?
Did users need to ask follow-up questions for clarification?
Did the system answer quickly enough?

✅

Debugging rule: If a RAG answer is bad, first check whether the right evidence was retrieved. Do not blame the LLM until retrieval has been inspected.

Common Failure Stories

The Wrong Document Was Retrieved

The assistant answers from the monthly subscription policy instead of the annual subscription policy. The chunking may be too broad, the query may need rewriting, or reranking may not distinguish annual from monthly.

The Right Document Was Retrieved But Ignored

The prompt may be too noisy, the important chunk may appear too late, or the model instructions may not require citation-based answering.

The Answer Leaks Private Data

Permission filters may be missing, or applied too late, after restricted chunks have already been added to the prompt. Access control must happen before content enters the prompt.
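As a sketch of that rule, the filter below runs on retrieved chunks before anything is assembled into a prompt. It assumes each chunk carries a `permission_scope` field from ingestion and that the caller knows which scopes the current user may read. Many vector databases can apply the same condition as a metadata filter inside the search itself, which is usually preferable. The scope names are hypothetical.

```python
def filter_by_permission(chunks, allowed_scopes):
    """Keep only chunks whose scope the current user is allowed to read."""
    return [chunk for chunk in chunks if chunk["permission_scope"] in allowed_scopes]


retrieved = [
    {"id": "S1", "text": "Annual subscriptions cannot be paused.", "permission_scope": "public"},
    {"id": "S2", "text": "Internal runbook: manual credit overrides.", "permission_scope": "internal"},
    {"id": "S3", "text": "Custom contract terms for one enterprise customer.", "permission_scope": "tenant-42"},
]

# A customer from another tenant may only see public content.
visible = filter_by_permission(retrieved, allowed_scopes={"public", "tenant-7"})
visible_ids = [chunk["id"] for chunk in visible]
```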

The Answer Is Stale

The source document changed, but the index was not refreshed. The system needs re-indexing, versioning, and freshness metadata.
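One inexpensive way to detect stale entries is to store a hash of each document's text at indexing time and compare it against the current source. This is only one possible freshness check, and the names `content_hash` and `find_stale_documents` are assumptions. A last-updated timestamp comparison is a common alternative.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_documents(indexed_hashes, current_docs):
    """Return IDs of documents that are new or whose text changed since indexing."""
    return sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


# What the index saw last time (the policy text has since been corrected).
indexed_hashes = {
    "billing-faq": content_hash("Annual subscriptions can be paused."),
    "refunds": content_hash("Refunds are reviewed by support."),
}
current_docs = {
    "billing-faq": "Annual subscriptions cannot be paused.",
    "refunds": "Refunds are reviewed by support.",
    "pricing": "Plans are billed monthly or annually.",
}
stale = find_stale_documents(indexed_hashes, current_docs)
```

Only the changed and new documents need to be re-chunked and re-embedded, which keeps refresh jobs cheap enough to run often.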

The Answer Is Slow

The system may retrieve too many candidates, rerank too much, use a large model unnecessarily, or assemble too much context.

A Complete RAG Request, End to End

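Here is the full customer-support flow, in order: the user asks a question; the system rewrites the query if needed, retrieves candidate chunks the user is permitted to see, reranks them, and assembles the best ones into a prompt with source IDs; the model then answers with citations, or abstains when the evidence is missing.

This flow is the heart of RAG architecture. The LLM is one important step, but the system around it determines whether the answer is correct, safe, explainable, and affordable.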

What to Remember for Interviews

When explaining RAG, tell the story in order:

The model needs external facts to answer private or fresh questions.
The offline path parses, cleans, chunks, embeds, and indexes documents.
The online path retrieves, reranks, assembles context, generates, and validates.
Chunking affects retrieval quality.
Hybrid search combines semantic meaning with exact terms.
Reranking improves the final evidence sent to the model.
Context assembly decides what the model actually sees.
Citations and abstention reduce hallucination risk.
Permissions must be enforced before context reaches the model.
Evaluate retrieval and generation separately.

✅

Practice: Design RAG for a customer support assistant. Use one real question, identify the source documents, draw the ingestion path, draw the online answer path, and explain how you would handle missing evidence.
