Your RAG doesn’t work because you’re using your vector database incorrectly.

Table of Contents

  1. The Technical Reality Behind Vector Databases
    1. Metadata Filters Make It Worse
  2. What Can You Do About It?
    1. Option 1: Tune Your Compute Budget
    2. Option 2: Rethink Your Indexing Strategy
    3. Option 3: The Denormalized Way
  3. The Cost of Soft-Deleted Records
  4. Information You Need to Solve the Problems
  5. Conclusion

The problem lies in search mechanics. Your database doesn’t actually check all your documents when searching for relevant information. A complete scan would always find the perfect match, but with millions of vectors, such an exhaustive search becomes impossibly slow.

Instead, vector databases use Approximate Nearest Neighbor (ANN) algorithms that sacrifice some accuracy for speed. Rather than comparing your query against every vector, they examine only a subset, organized in graph structures like HNSW (Hierarchical Navigable Small World). The database follows paths through this graph but stops after examining a predetermined number of nodes: your “compute budget.” If the best match happens to be in this subset, great. If not, you get a mediocre result despite having better matches in your database.

The Technical Reality Behind Vector Databases

Beyond a certain size, it becomes infeasible to run a similarity search against every vector in the database. That would be a linear-time scan, completely impractical at scale.

To solve this, databases implement approximate nearest neighbors search by:

  1. Storing data in a graph structure to achieve logarithmic lookup time
  2. Setting a compute budget that limits graph traversal
  3. Trading recall for speed and memory efficiency
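
To make these moving parts concrete, here is a minimal sketch using the open-source `hnswlib` library; the dimensions and parameter values are made up for illustration, and a managed vector database hides the same knobs behind its own configuration names:

```python
import hnswlib
import numpy as np

dim = 384                                   # illustrative embedding size
vectors = np.random.rand(100_000, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# Build-time parameters shape the graph: M controls connections per node,
# ef_construction controls how carefully neighbors are chosen.
index.init_index(max_elements=len(vectors), M=16, ef_construction=200)
index.add_items(vectors, np.arange(len(vectors)))

# ef is the search-time compute budget: how many candidate nodes the
# traversal may examine before it stops and returns its best guesses.
index.set_ef(64)
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)
```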

Metadata Filters Make It Worse

When your query includes filters (like user ID or document type), the database has two options:

Pre-filtering: The database checks each node as it traverses the graph, skipping nodes that don’t match your filter criteria. But here’s the problem: if your data is mixed, most traversed nodes won’t contribute relevant data to your search. The further relevant nodes are from the starting point, the less likely you’ll find them before your compute budget runs out.

Post-filtering: The database retrieves many more nodes than you need and then filters them afterward. This approach is even less efficient and also unlikely to find the best match.
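
Assuming the kind of index sketched above and a hypothetical `doc_type` metadata lookup, the two strategies look roughly like this (recent versions of `hnswlib` accept a `filter` callable for the pre-filtering path; managed databases expose the same choice through their query options):

```python
# Hypothetical metadata lookup: doc_type[i] describes vector i.
doc_type = {i: ("invoice" if i % 2 == 0 else "report") for i in range(len(vectors))}

# Pre-filtering: non-matching nodes are skipped during traversal,
# but the graph walk still spends its compute budget visiting them.
labels, _ = index.knn_query(query, k=10, filter=lambda i: doc_type[i] == "invoice")

# Post-filtering: over-fetch, then drop non-matching results.
# If only a few of the top 100 are invoices, you end up with fewer than 10 hits.
labels, _ = index.knn_query(query, k=100)
invoice_hits = [i for i in labels[0] if doc_type[i] == "invoice"][:10]
```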

What Can You Do About It?

Option 1: Tune Your Compute Budget

You can control the compute budget by configuring two critical parameters:

Dynamic candidate list size during search: This determines how many vectors your system examines when responding to a query. Larger lists improve recall quality but create slower reads. Every increase in accuracy comes with a corresponding decrease in query performance.

Candidate list size during index building: This controls how thoroughly your system analyzes connections between vectors when creating the index. Larger construction lists yield better recall but require more memory and significantly longer index creation times (slow writes).

This creates a fundamental trade-off: optimize for fast reads with potentially less accurate results or invest in comprehensive indexing that might deliver better results but still requires balancing search parameters.
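
In `hnswlib` these two parameters are called `ef` (search-time candidate list) and `ef_construction` (build-time candidate list); a rough sketch of turning both knobs, with purely illustrative values:

```python
import time

# Search-time budget: larger ef means better recall but slower reads (ef must be >= k).
for ef in (16, 64, 256):
    index.set_ef(ef)
    start = time.perf_counter()
    index.knn_query(query, k=10)
    print(f"ef={ef}: {(time.perf_counter() - start) * 1000:.2f} ms")

# Build-time budget: fixed at construction. Larger ef_construction and M produce a
# better-connected graph (higher recall later) at the cost of slower, memory-hungrier writes.
thorough = hnswlib.Index(space="cosine", dim=dim)
thorough.init_index(max_elements=len(vectors), M=32, ef_construction=400)
thorough.add_items(vectors, np.arange(len(vectors)))
```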

Option 2: Rethink Your Indexing Strategy

The standard approach of using a single shared index for all documents creates several problems:

  1. No separation of user data: All user data gets mixed together, which may be inefficient, non-compliant with regulations, or both.
  2. Inefficient compute budget usage: When your data grows beyond a certain threshold, even with increased compute budgets, you’ll still miss relevant results. The larger your database, the more likely important matches will be overlooked.
  3. Degrading performance over time: As you add more documents, retrieval quality silently deteriorates without obvious errors or warnings.

Alternatively, use multiple smaller indexes instead of one mammoth database. Smaller indexes might allow full scans, dramatically reducing the chance of missing relevant results. For a handful of indexes, query them all and rerank the results. For many indexes, use an LLM to select which ones to search.

Take a hard look at your query patterns and push metadata filters into the index layout itself. For optimal performance, consider creating a separate index for each frequently used metadata key and value combination.
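
A minimal sketch of the fan-out approach for a handful of per-source indexes, assuming each was built as above; the “reranking” here is just a merge by distance, which you could replace with a cross-encoder or an LLM reranker:

```python
def search_all(indexes: dict[str, hnswlib.Index], query, k: int = 10):
    """Query every small index and merge the results by distance."""
    hits = []
    for name, idx in indexes.items():
        labels, distances = idx.knn_query(query, k=k)
        hits.extend((dist, name, label) for label, dist in zip(labels[0], distances[0]))
    hits.sort(key=lambda h: h[0])       # smaller cosine distance = closer match
    return hits[:k]

# Illustrative per-source indexes; in practice these map to your real data sources.
results = search_all({"invoices": index, "reports": thorough}, query, k=10)
```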

Option 3: The Denormalized Way

Instead of one massive index, create separate indexes per user and per data source. This approach offers complete separation of user data and lets you use different compute budgets for different users. You can also use a different embedding model for each index and fine-tune it for that specific data. On the other hand, it’s more complex and requires additional infrastructure. In the worst case, you will end up with many duplicated vectors across indexes.

Additionally, you have to deal with routing queries to the correct index. The easiest way is to map user IDs to index IDs. When routing by data source or metadata value, you can either send every query to all indexes and rerank the results, or use an LLM to decide which index to query (either by classifying the query or through tool calling).
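
A sketch of the simplest router, a plain mapping from user ID to a per-user index file; the names and paths are hypothetical, and the LLM-based routing mentioned above would replace the dictionary lookup with a classification or tool call:

```python
# Hypothetical mapping maintained alongside the per-user indexes.
user_to_index_path = {
    "user-42": "indexes/user-42.bin",
    "user-7": "indexes/user-7.bin",
}

def load_user_index(user_id: str, dim: int) -> hnswlib.Index:
    """Route a query to the caller's own index; unknown users simply have no index."""
    path = user_to_index_path[user_id]          # KeyError means no data for this user
    idx = hnswlib.Index(space="cosine", dim=dim)
    idx.load_index(path)
    return idx

user_index = load_user_index("user-42", dim=384)
labels, distances = user_index.knn_query(query, k=10)
```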

The Cost of Soft-Deleted Records

Another critical consideration is soft-deleted records in your vector database. When documents are marked as deleted but not physically removed, their vectors remain in the graph structure. These lingering nodes continue to occupy space in your index and can impact search results because the compute budget gets wasted on exploring irrelevant nodes, further reducing the likelihood of finding truly relevant matches.

To mitigate this issue, implement a proper deletion strategy that physically removes vectors from your index.
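
In `hnswlib`, `mark_deleted` is exactly such a soft delete: the element stops showing up in results, but its node stays in the graph and is still traversed. A hedged sketch of the rebuild-style cleanup, assuming you keep the original vectors around:

```python
index.mark_deleted(123)     # soft delete: excluded from results, still in the graph

def rebuild_without(vectors: np.ndarray, deleted: set[int]) -> hnswlib.Index:
    """Physically remove deleted vectors by rebuilding the index without them."""
    keep = np.array([i for i in range(len(vectors)) if i not in deleted])
    fresh = hnswlib.Index(space="cosine", dim=vectors.shape[1])
    fresh.init_index(max_elements=len(keep), M=16, ef_construction=200)
    fresh.add_items(vectors[keep], keep)
    return fresh

index = rebuild_without(vectors, deleted={123})
```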

Information You Need to Solve the Problems

Splitting data into multiple indexes isn’t a silver bullet. If it were, databases would be doing it automatically.

Before you start, you need to understand the data and the queries. Here are the things you need to know:

  1. Evaluate data sources: Do you have different data types? Does every user have access to all data? How often is data updated? Can you use the same embedding model for all data?
  2. Check data/query alignment: Do your queries actually look for information contained in your data? What kind of queries are you running? Are queries uniform across all users?
  3. Understand metadata usage: What metadata is available? What filters are used? How often are they used? What date ranges or keywords are users searching for?

Conclusion

We can’t dump all documents into a single index and hope for the best. This approach seems fine until your data grows beyond a certain threshold. At some point, retrieval results will deteriorate and your RAG system will become increasingly unreliable.

By understanding the technical limitations of vector databases and implementing the right indexing strategy, you can build reliable RAG systems that scale to millions of documents without sacrificing accuracy.
