Embeddings and Vector Databases: Semantic Search at Scale

Learn embeddings and vector databases through a beginner-friendly story covering semantic search, vector similarity, indexing, metadata filtering, HNSW, and production trade-offs.

embeddings, vector database, HNSW, ANN search, semantic search

Start With a Search Problem

Imagine your company has an internal wiki. It contains onboarding guides, incident runbooks, architecture docs, meeting notes, and support policies.

A new engineer asks:

"How do I get access to production logs?"

The exact wiki page is titled:

"Requesting Observability Permissions"

A traditional keyword search may struggle because the user typed "production logs", while the document says "observability permissions." The words are different, but the meaning is related.

This is the problem embeddings solve.

Keyword search is good at exact words. Embedding search is good at meaning.

✅

Mental model: Embeddings let software compare ideas, not just words.

Keyword Search vs Semantic Search

Keyword search asks:

"Do these words appear in the document?"

Semantic search asks:

"Is this document about the same idea as the question?"

Suppose a user searches:

txt

reset my login password

Keyword search may look for pages containing reset, login, and password.

Semantic search can also find pages titled:

txt

Changing account credentials
Recovering access to your account
Authentication troubleshooting

The words are not identical, but the intent is close.

This does not mean semantic search replaces keyword search everywhere. Exact words still matter for IDs, error codes, product names, API fields, and legal wording. Many strong systems use both.
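To make the gap concrete, here is a minimal sketch of the keyword view of the query above. The `keyword_overlap` helper and the first title in the list are illustrative assumptions, not part of any real search engine; real keyword search also uses stemming and ranking, which this skips.

```python
def keyword_overlap(query: str, title: str) -> set[str]:
    """Return the lowercase words that appear in both the query and the title."""
    return set(query.lower().split()) & set(title.lower().split())


query = "reset my login password"
titles = [
    "How to reset your password",         # hypothetical page that shares words
    "Changing account credentials",       # same intent, no shared words
    "Recovering access to your account",  # same intent, no shared words
    "Authentication troubleshooting",     # same intent, no shared words
]

for title in titles:
    shared = sorted(keyword_overlap(query, title))
    print(f"{title!r}: shared words = {shared}")
```

Only the first title shares any words with the query. The other three are invisible to a pure word-matching approach, which is the gap semantic search fills.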

What Is an Embedding?

An embedding is a list of numbers that represents the meaning of something.

That "something" can be:

a sentence
a paragraph
a document chunk
a product
an image
a piece of code
a user query

For text, an embedding model reads the text and returns a vector.

The vector may have hundreds or thousands of dimensions. Humans cannot look at those numbers and understand them directly. The useful property is distance: vectors with similar meaning should be close together.

So if the query vector is close to a document vector, the document is probably relevant.

⚠️

Embeddings are not truth: They are similarity signals. They help find likely relevant content, but they do not prove the content is correct, current, or allowed for the user.

A Simple Embedding Search Flow

Let us build semantic search for the company wiki.

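First, we process the documents in the background: parse each page, split it into chunks, embed each chunk, and store the vector together with its text and metadata.

Then, when a user searches, the system embeds the query, compares it with the stored vectors, and returns the closest chunks.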

The important idea is that documents and queries go through the same kind of embedding model. That puts them into the same vector space, where they can be compared.
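The two phases can be sketched in a few lines. Everything here is a stand-in: `toy_embed` is a hypothetical replacement for a real embedding model that simply counts words from hand-picked concept groups, and the chunk titles other than the wiki page are made up. A real model learns these relationships from data.

```python
import math

# Hypothetical stand-in for an embedding model: one dimension per concept group.
CONCEPTS = [
    {"logs", "logging", "observability", "dashboards"},      # observability
    {"access", "permission", "permissions", "requesting"},   # access requests
    {"deploy", "deployment", "rollback", "release"},         # deployments
]


def toy_embed(text: str) -> list[float]:
    """Count how many words of the text fall into each concept group."""
    words = text.lower().replace("?", "").split()
    return [float(sum(word in group for word in words)) for group in CONCEPTS]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Background phase: embed every chunk once and keep the vector next to the text.
chunks = [
    "Requesting Observability Permissions",
    "Rollback steps for a failed deployment",
    "On-call escalation policy",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]


def search(query: str, k: int = 2) -> list[str]:
    """Query phase: embed the query with the same function, then rank chunks."""
    query_vector = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


print(search("How do I get access to production logs?"))
```

The query and the wiki title share no words, yet the title ranks first because both land in the same region of the toy vector space.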

Why We Split Documents Into Chunks

A full wiki page may contain many topics. If you embed the whole page, the vector becomes a blurry average of everything on that page.

For example, a page titled "Production Operations" may include:

log access
deployment approvals
rollback steps
on-call escalation
dashboard links

If the user asks about log access, we want the log-access section, not the whole page.

Good chunks are small enough to be specific but large enough to preserve meaning.

If chunks are too small, they lose context:

txt

Ask your manager to approve it.

Approve what? Access to logs? A deployment? A vacation request?

If chunks are too large, search becomes less precise:

txt

This 20-page operations manual contains every production workflow.

The search system may retrieve it often, but it will not know which part matters.

For beginner-friendly systems, start with section-based chunks using headings. Then evaluate with real user questions.
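As a starting point, section-based chunking might look like the sketch below. It assumes headings are marked with a `## ` prefix, which is an assumption about the wiki's export format; the function name and record fields are also illustrative.

```python
def chunk_by_headings(page_title: str, text: str, marker: str = "## ") -> list[dict]:
    """Split a page into one chunk per heading, keeping the page title for context."""
    chunks: list[dict] = []
    heading, lines = "Introduction", []

    def flush() -> None:
        # Store the section collected so far, skipping empty sections.
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"title": page_title, "heading": heading, "text": body})

    for line in text.splitlines():
        if line.startswith(marker):
            flush()
            heading, lines = line[len(marker):].strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


page = """Overview of production workflows.
## Log access
Open an Observability Permissions ticket.
## Rollback steps
Revert the release and notify on-call.
"""

chunks = chunk_by_headings("Production Operations", page)
for chunk in chunks:
    print(f"{chunk['heading']}: {chunk['text']}")
```

Keeping the page title and heading on every chunk helps with the "Approve what?" problem: even a short chunk carries enough context to be interpreted.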

What a Vector Store Actually Stores

A vector database does not store only mysterious number arrays. A useful record usually has the vector plus enough information to explain and filter the result.

A wiki chunk record may look conceptually like this:

json

{
  "id": "wiki-prod-logs-03",
  "text": "To request production log access, open an Observability Permissions ticket...",
  "embedding": [0.12, -0.03, 0.88],
  "metadata": {
    "source": "wiki",
    "title": "Requesting Observability Permissions",
    "team": "platform",
    "visibility": "engineering",
    "lastUpdated": "2026-05-20",
    "embeddingModel": "text-embedding-v1"
  }
}

The metadata is not decoration. It is how the system answers practical questions:

Is this user allowed to see the chunk?
Which document did the answer come from?
Is the document fresh?
Which team owns it?
Which embedding model created the vector?
Should we re-index it after a model or document change?

Never design vector search as "vectors only." Production search needs source text and metadata.

How Similarity Search Works

At query time, the system turns the user's query into a vector and compares it with stored vectors.

The result is usually a ranked list:

txt

1. Requesting Observability Permissions
2. Production Logging Runbook
3. On-call Access Checklist

The system does not "understand" in a human way. It calculates closeness between vectors. That closeness is a useful signal for relevance.

A Gentle View of Distance

You can imagine each idea as a point on a map.

In real systems, the map has many dimensions, not two. But the intuition is the same: nearby points are more related.

Different systems measure closeness in different ways, such as cosine similarity, dot product, or Euclidean distance. As a beginner, do not start by memorizing formulas. Start with this rule:

✅

Use the similarity metric recommended for your embedding model. The embedding model and distance metric are designed to work together.
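If you are curious why the metric matters, this small sketch computes all three on made-up two-dimensional vectors. The vectors are chosen only to show that the metrics can disagree when vectors have different lengths.

```python
import math


def dot(a, b):
    """Dot product: larger means more similar, and vector length counts."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: compares direction only and ignores length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction_longer = [3.0, 0.0]  # same direction as the query, three times longer
slightly_off = [0.9, 0.1]           # nearly the same length, slightly different direction

for name, vec in [("same_direction_longer", same_direction_longer),
                  ("slightly_off", slightly_off)]:
    print(f"{name}: cosine={cosine(query, vec):.3f} "
          f"dot={dot(query, vec):.3f} euclidean={euclidean(query, vec):.3f}")
```

Cosine and dot product rank `same_direction_longer` first, while Euclidean distance ranks `slightly_off` first. When all vectors are normalized to length 1, the three metrics produce the same ranking, which is one reason many models document a preferred metric.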

Why Exact Search Becomes Expensive

The simplest search algorithm is brute force:

Compare the query vector to every stored vector.
Sort by similarity.
Return the top results.

This is called exact nearest-neighbor search.

It is easy to understand and accurate, but it gets expensive when the corpus grows.

If you have 10,000 vectors, brute force may be fine. If you have 100 million vectors and many users searching at once, comparing every query against everything becomes too slow and too expensive for most latency budgets.
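The three brute-force steps fit in a short function. This is a sketch on made-up vectors; the 768-dimension figure in the cost estimate is an illustrative assumption, not a property of any particular model.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def exact_top_k(query, vectors, k):
    """Score every stored vector, then keep the k best. Work grows with corpus size."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in vectors.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


vectors = {"logs": [0.9, 0.1], "deploy": [0.1, 0.9], "oncall": [0.6, 0.6]}
print(exact_top_k([1.0, 0.0], vectors, k=2))

# Rough work per query, assuming 768-dimensional vectors (an illustrative size).
dims = 768
for corpus_size in (10_000, 100_000_000):
    print(f"{corpus_size:>11,} vectors: {corpus_size * dims:,} multiply-adds per query")
```

The work per query scales linearly with the number of vectors, and it is multiplied again by the number of concurrent searches.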

This is why vector databases use indexes.

What an Index Does

An index is a data structure that helps the database avoid scanning everything.

You already know this idea from books and relational databases. A book index lets you find a topic without reading every page. A B-tree index lets a relational database find rows without scanning the whole table.

A vector index does something similar for similarity search.

Most large vector systems use approximate nearest-neighbor search. "Approximate" means the system may not always find the mathematically perfect closest vectors, but it finds very good candidates much faster.

This creates a trade-off among three goals that pull against each other:

higher recall means a better chance of returning the true nearest neighbors, and so the most relevant results
lower latency means a faster user experience
lower memory means cheaper serving

You tune the system based on product needs.
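Recall is usually measured by comparing the approximate results with the exact brute-force results for the same query. A minimal sketch, with made-up result IDs:

```python
def recall_at_k(exact_ids: list[str], approx_ids: list[str]) -> float:
    """Fraction of the true top-k neighbors that the approximate search also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


exact = ["a", "b", "c", "d", "e"]    # top 5 from brute-force search
approx = ["a", "b", "d", "e", "x"]   # top 5 from the index; it missed "c"

print(recall_at_k(exact, approx))    # 0.8
```

Averaging this number over a sample of real queries gives you something concrete to tune against while you watch latency and memory.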

HNSW Without the Scary Vocabulary

HNSW stands for Hierarchical Navigable Small World. The name sounds intimidating, but the idea is friendly if you think about maps.

Imagine you are trying to find a house in a huge city.

You do not inspect every house from left to right. You first use highways to get near the right neighborhood. Then you use main roads. Then side streets. Finally, you inspect nearby houses.

HNSW does something like that with vectors.

The index stores vectors as connected points in a layered graph. Upper layers hold fewer points joined by long-distance shortcuts. Lower layers have denser local connections.

Search starts with broad jumps, then narrows down.
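The highway-then-side-street idea can be sketched with a two-layer graph and a greedy walk. This is a deliberately simplified illustration, not a full HNSW implementation: the graph is hand-built on 20 points along a line, and the walk tracks only a single current point, whereas real HNSW assigns layers randomly and keeps a list of candidates while searching.

```python
import math


def greedy_walk(query, start, neighbors, points, visited):
    """Move to the neighbor closest to the query until no neighbor is closer."""
    current = start
    visited.add(current)
    while True:
        candidates = neighbors[current]
        visited.update(candidates)
        best = min(candidates, key=lambda n: math.dist(points[n], query), default=current)
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best


# Toy data: 20 points on a line.
points = {i: (float(i), 0.0) for i in range(20)}

# Bottom layer ("side streets"): every point links to its immediate neighbors.
bottom = {i: [j for j in (i - 1, i + 1) if j in points] for i in points}

# Top layer ("highways"): only every fifth point, linked by long jumps.
hubs = [0, 5, 10, 15]
top = {h: [j for j in (h - 5, h + 5) if j in hubs] for h in hubs}


def search(query, entry=0):
    """Take long jumps on the top layer first, then refine on the bottom layer."""
    visited = set()
    near = greedy_walk(query, entry, top, points, visited)
    nearest = greedy_walk(query, near, bottom, points, visited)
    return nearest, len(visited)


nearest, examined = search((12.2, 0.0))
print(f"nearest point: {nearest}, points examined: {examined} of {len(points)}")
```

The search reaches the closest point while measuring distance to only a fraction of the stored points. On real data the walk can stop at a point that is close but not the closest, which is where the "approximate" in approximate nearest-neighbor search comes from.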

Why HNSW Is Popular

HNSW is popular because it often gives strong search quality with low latency. It works well for many semantic search and retrieval-augmented generation (RAG) workloads.

But it is not magic:

it can use a lot of memory
index building can take time
updates and deletes need operational care
tuning affects recall and latency

At this stage, you do not need to memorize every HNSW parameter. Know what the knobs mean conceptually:

more connections can improve search quality but use more memory
searching more candidates can improve recall but increase latency
building a better graph can improve search later but slow ingestion

That is enough for first-pass system design understanding.

IVF and Compression, Briefly

Another common idea is to cluster vectors into groups. Instead of searching everything, the system first finds the most relevant clusters, then searches inside them.

This family of techniques is often associated with IVF, or inverted file indexes.
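Here is a minimal sketch of the cluster-then-search idea. The centroids are fixed by hand for illustration; real systems typically learn them from the data with a clustering algorithm such as k-means. The name `nprobe` for "how many clusters to search" follows common usage, and the vectors are made up.

```python
import math


def nearest_centroids(vec, centroids, n):
    """Return the indexes of the n centroids closest to vec."""
    return sorted(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))[:n]


# Hand-picked centroids (an assumption; normally learned from the data).
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

vectors = {
    "a": (1.0, 1.0), "b": (0.5, 2.0), "f": (4.5, 0.5),  # closest to centroid 0
    "c": (9.0, 1.0), "d": (11.0, -1.0),                 # closest to centroid 1
    "e": (1.0, 9.0),                                    # closest to centroid 2
}

# Index build: file each vector under its nearest centroid.
inverted_lists = {c: [] for c in range(len(centroids))}
for doc_id, vec in vectors.items():
    inverted_lists[nearest_centroids(vec, centroids, 1)[0]].append(doc_id)


def ivf_search(query, k, nprobe):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    candidates = [doc_id
                  for c in nearest_centroids(query, centroids, nprobe)
                  for doc_id in inverted_lists[c]]
    return sorted(candidates, key=lambda d: math.dist(query, vectors[d]))[:k]


query = (5.5, 0.0)
print(ivf_search(query, k=1, nprobe=1))  # searches one cluster
print(ivf_search(query, k=1, nprobe=2))  # searches two clusters
```

With `nprobe=1` the search returns "c" and misses "f", the true nearest vector, because "f" was filed under a neighboring cluster. Raising `nprobe` to 2 finds it at the cost of scanning more candidates.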

Product quantization is a compression technique. It stores a smaller approximation of vectors so the system can save memory and search larger datasets more cheaply.
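A quick back-of-the-envelope calculation shows why compression matters. All numbers below are illustrative assumptions: 768-dimensional float32 vectors, split into 96 sub-vectors that are each replaced by a 1-byte code. The small codebooks that product quantization also stores are ignored.

```python
dims = 768            # assumed embedding size
bytes_per_float = 4   # float32
num_subvectors = 96   # assumed split: 96 sub-vectors of 8 dimensions each
bytes_per_code = 1    # one byte selects 1 of 256 centroids per sub-vector

raw_bytes = dims * bytes_per_float          # bytes per uncompressed vector
pq_bytes = num_subvectors * bytes_per_code  # bytes per compressed vector

corpus_size = 100_000_000


def gib(num_bytes: int) -> float:
    return num_bytes / 2**30


print(f"raw vectors: {raw_bytes} bytes each, {gib(raw_bytes * corpus_size):.1f} GiB total")
print(f"PQ codes:    {pq_bytes} bytes each, {gib(pq_bytes * corpus_size):.1f} GiB total")
print(f"compression: {raw_bytes // pq_bytes}x")
```

Under these assumptions the compressed codes are 32 times smaller than the raw vectors. The price is that distances computed from the codes are approximations, so recall should be measured before and after.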

For a beginner, the important story is:

HNSW is graph-based search.
IVF is cluster-based search.
Product quantization compresses vectors.
All of them trade some combination of quality, latency, memory, and operational complexity.

You will rarely choose these from theory alone. You choose them by testing with your data, your queries, and your latency budget.

Metadata Filtering: The Production Reality

In demos, vector search often means:

"Search all documents."

In production, that is almost never correct.

A real system must filter by things like:

tenant
user permissions
team
region
language
document type
freshness
product area

If an engineer searches for production log access, they should not retrieve HR salary documents just because the vector similarity is high.

⚠️

Security rule: Do not retrieve unauthorized chunks and hope the LLM ignores them. Unauthorized content should never enter the prompt.

Pre-Filtering vs Post-Filtering

Pre-filtering means you filter candidates before vector search. This is safer and often necessary for permissions.

Post-filtering means you search first and remove disallowed or irrelevant results afterward. This can be useful for ranking cleanup, but it is dangerous if used as the only permission boundary.

The safest design applies access control before the result can be shown or inserted into an LLM prompt.
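The difference is easy to see on a tiny example. The records, IDs, and vectors below are made up, and the `visibility` field mirrors the metadata record shown earlier.

```python
import math

records = [
    {"id": "hr-salary-bands", "visibility": "hr", "vec": (0.95, 0.05)},
    {"id": "hr-comp-review", "visibility": "hr", "vec": (0.9, 0.1)},
    {"id": "wiki-prod-logs-03", "visibility": "engineering", "vec": (0.8, 0.2)},
    {"id": "wiki-rollback-01", "visibility": "engineering", "vec": (0.1, 0.9)},
]


def top_k(query, candidates, k):
    """Brute-force nearest neighbors over the given candidates."""
    return sorted(candidates, key=lambda r: math.dist(query, r["vec"]))[:k]


def pre_filter_search(query, allowed, k):
    """Drop disallowed records first, then search only what the user may see."""
    permitted = [r for r in records if r["visibility"] in allowed]
    return [r["id"] for r in top_k(query, permitted, k)]


def post_filter_search(query, allowed, k):
    """Search everything first, then remove disallowed records from the results."""
    hits = top_k(query, records, k)
    return [r["id"] for r in hits if r["visibility"] in allowed]


query = (1.0, 0.0)
print(pre_filter_search(query, {"engineering"}, k=2))
print(post_filter_search(query, {"engineering"}, k=2))
```

Pre-filtering returns the two engineering chunks. Post-filtering returns nothing, because both of the top-2 slots were taken by HR documents that were then removed. It also means unauthorized records were handled by the search path, which is exactly what a permission boundary should prevent.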

Hybrid Search: Why Vectors Are Not Enough

Semantic search is powerful, but it can miss exact-match needs.

Consider these queries:

txt

ERR-18492
customer_id
KAFKA_ADVERTISED_LISTENERS
invoice INV-2026-00421

These are not mainly semantic. The exact string matters.

That is why many production systems combine vector search with keyword search.

Vector search helps with meaning. Keyword search helps with exact terms. Together they are often better than either alone.
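One common way to combine the two result lists is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one reasonable option. The document IDs are made up, and `k=60` is a conventional default for the smoothing constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for the query "KAFKA_ADVERTISED_LISTENERS not reachable".
keyword_hits = ["kafka-config-reference", "kafka-troubleshooting", "release-notes"]
vector_hits = ["kafka-troubleshooting", "networking-overview", "kafka-config-reference"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear in both lists rise to the top, even if neither search ranked them first. RRF uses only rank positions, so it avoids having to compare keyword scores with vector similarity scores, which live on different scales.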

Choosing a Vector Database

Do not start by asking, "Which vector database is best?"

Start by asking what your system needs.

For a small internal tool, Postgres with pgvector may be enough. It keeps your data model simple and avoids another distributed system.

For a large public product with millions or billions of vectors, strict latency targets, heavy metadata filtering, and frequent updates, a dedicated vector database may be better.

Managed services reduce operational burden but introduce vendor dependency and cost considerations. Self-hosted systems give control but require your team to operate indexing, scaling, backups, upgrades, monitoring, and incident response.

A practical decision path starts with these questions:

How many vectors will we store now and in one year?
How many searches per second do we need?
What p95 latency is acceptable?
Do we need strong metadata filtering?
How often do documents change?
Can our team operate another database?
Do we need hybrid keyword plus vector search?
What are the backup, migration, and re-indexing plans?

The best choice is the one that fits your scale, team, and product risk.

Re-Indexing and Versioning

Embeddings are not permanent. You may change:

the embedding model
the chunking strategy
the document parser
metadata fields
access-control rules

When that happens, you may need to re-index documents.

A mature system records which embedding model and chunking version produced each vector. That lets you debug strange results and migrate safely.

Without versioning, search quality issues become mysterious:

"Why does this old document retrieve differently from the new ones?"

The answer may be that they were embedded with different models or chunked differently.
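With version fields on every record, finding what needs re-indexing becomes a simple filter. In this sketch, `embeddingModel` follows the record shown earlier, while `chunkingVersion`, the model name `text-embedding-v2`, and the record IDs are hypothetical.

```python
CURRENT_EMBEDDING_MODEL = "text-embedding-v2"  # hypothetical current model
CURRENT_CHUNKING_VERSION = 3                   # hypothetical chunker version

records = [
    {"id": "wiki-prod-logs-03", "embeddingModel": "text-embedding-v1", "chunkingVersion": 3},
    {"id": "wiki-rollback-01", "embeddingModel": "text-embedding-v2", "chunkingVersion": 2},
    {"id": "wiki-oncall-02", "embeddingModel": "text-embedding-v2", "chunkingVersion": 3},
]


def needs_reindex(record: dict) -> bool:
    """A record is stale if its model or chunking version is not the current one."""
    return (record["embeddingModel"] != CURRENT_EMBEDDING_MODEL
            or record["chunkingVersion"] != CURRENT_CHUNKING_VERSION)


stale = [r["id"] for r in records if needs_reindex(r)]
print(stale)
```

Vectors from different embedding models generally should not be compared with each other, so a mixed index like this one is a likely cause of the "old document retrieves differently" symptom.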

How Search Quality Fails

When users complain "search is bad," the vector database is not always the problem.

The issue may be:

documents were parsed badly
chunks are too small or too large
the embedding model does not fit the domain
metadata filters remove the right result
the query needs keyword search
the index is stale
permissions hide the expected document
ranking is returning broad overview pages instead of specific sections

Debugging search quality should follow the pipeline in order: parsing, chunking, embedding, indexing, filtering, retrieval, and ranking.

Do not jump straight to changing databases. First inspect what text was stored, what metadata was applied, and what results came back before ranking.
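One way to make that inspection routine is a small diagnostic that walks the stages in order for a document the user expected to see. The function, field names, and messages below are illustrative assumptions about how your records are stored.

```python
def diagnose(expected_id, stored, allowed_visibility, ranked_ids, k):
    """Walk the pipeline in order and report the first stage that loses the chunk."""
    record = stored.get(expected_id)
    if record is None:
        return "not indexed: check parsing, chunking, and ingestion"
    if not record["text"].strip():
        return "stored text is empty: check the parser"
    if record["visibility"] not in allowed_visibility:
        return "removed by the permission or metadata filter"
    if expected_id not in ranked_ids:
        return "not retrieved: check the embedding model, the index, or add keyword search"
    position = ranked_ids.index(expected_id) + 1
    if position > k:
        return f"retrieved at rank {position}, below the top {k}: check ranking"
    return f"ok: rank {position}"


stored = {
    "wiki-prod-logs-03": {"text": "Open an Observability Permissions ticket.",
                          "visibility": "engineering"},
    "hr-salary-bands": {"text": "Salary bands by level.", "visibility": "hr"},
}
ranked = ["wiki-oncall-02", "wiki-prod-logs-03"]  # raw retrieval order before ranking

for doc_id in ("wiki-prod-logs-03", "hr-salary-bands", "wiki-missing-99"):
    print(doc_id, "->", diagnose(doc_id, stored, {"engineering"}, ranked, k=1))
```

Each message points at a different part of the system, and only one of them involves the vector index itself.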

What to Remember for Interviews

When explaining embeddings and vector databases, tell the story clearly:

Users ask questions in their own words.
Keyword search misses related meaning.
Embeddings turn text into vectors so similar meanings are close.
Documents are chunked, embedded, and stored with metadata.
User queries are embedded and compared with stored vectors.
Indexes make similarity search fast at scale.
HNSW uses graph shortcuts; IVF uses clusters; compression saves memory.
Metadata filtering and permissions are required in production.
Hybrid search often beats vector-only search.
Search quality must be evaluated with real queries.

✅

Practice: Design semantic search for an internal wiki. Include parsing, chunking, embeddings, vector storage, permission filtering, hybrid search, re-indexing, and how you would debug a bad result.
