How Much Data Do You Need to Improve RAG Performance?

Whichever technique you use to improve your RAG system, you need data, and how much depends on the technique. If you choose in-context learning, a few examples are enough. If you want to fine-tune the model, you need far more. On top of that, you need test data to evaluate the performance. So the question is: how much data do you need?

Table of Contents

  1. How Much Data Do You Need to Improve RAG Performance?
    1. How Many Examples Do You Need for In-Context Learning?
    2. How Many Examples Do You Need for RAG Evaluation?
    3. How Many Examples Do You Need for Retrieval Improvement?
    4. How Many Examples Do You Need for Fine-Tuning an LLM?
    5. How Many Examples Do You Need to Train a Specialized Model From Scratch?
  2. How to Gather Data?
  3. How to Use the Data?
  4. The One Thing to Remember

The article is based on Jason Liu’s free course on Systematically Improving RAG and my experience with building RAG applications.

How Much Data Do You Need to Improve RAG Performance?

Jason Liu says we have the following milestones while gathering data:

| Number of examples | What becomes possible |
| --- | --- |
| 10 | In-context learning |
| 100 | Initial RAG evaluation |
| 1,000 | Retrieval improvement |
| 10,000+ | Fine-tuning an LLM |
| 1,000,000+ | Training a specialized model from scratch |

All of those numbers are a rough guide, not actual thresholds. The values in the table are supposed to give you an idea of the order of magnitude and the relative differences between the stages.

How Many Examples Do You Need for In-Context Learning?

When you have 10 diverse, representative queries, all you can do is use them as examples for in-context learning. It’s also sufficient to demo the RAG system or eyeball the performance.
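With so few examples, the whole "technique" fits in a prompt template. Here is a minimal sketch of assembling a few-shot prompt from curated examples; the example pairs and the prompt wording are illustrative, not a fixed API.

```python
# A handful of curated query/answer pairs (hypothetical examples).
EXAMPLES = [
    {"question": "What is our refund window?", "answer": "30 days from delivery."},
    {"question": "Do you ship internationally?", "answer": "Yes, to the EU and UK."},
]

def build_few_shot_prompt(context: str, query: str) -> str:
    """Prepend the curated Q&A pairs so the model imitates the expected style."""
    shots = "\n\n".join(
        f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in EXAMPLES
    )
    return (
        "Use the context to answer in the same style as the examples.\n\n"
        f"{shots}\n\nContext: {context}\n\nQuestion: {query}\nAnswer:"
    )

prompt = build_few_shot_prompt("Orders ship within 2 days.", "How fast do you ship?")
```

The same 10 examples double as your demo script and your eyeball test set.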

You probably don’t have a data engineering pipeline for gathering and indexing data at this stage, but you can work on automating the deployment pipeline. You should also start tracking the requests and responses.

How Many Examples Do You Need for RAG Evaluation?

The 100-examples milestone allows for a more comprehensive evaluation of your RAG system. You can identify common failure modes and calculate meaningful performance metrics. Now is the time to choose evaluation metrics for both the retrieval and answer generation components and prepare tools for analyzing the system outputs, such as DeepEval or promptfoo.
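A retrieval metric at this stage can be as simple as a hit rate: for each evaluation query, does the top-k result set contain at least one known-relevant document? A minimal sketch, with a toy dictionary standing in for your actual retriever:

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of queries where a relevant document appears in the top-k results."""
    hits = 0
    for item in eval_set:
        retrieved = retrieve(item["query"])[:k]
        if any(doc_id in retrieved for doc_id in item["relevant_ids"]):
            hits += 1
    return hits / len(eval_set)

# Toy retriever standing in for a vector search call (hypothetical data).
corpus = {"q1": ["d1", "d7"], "q2": ["d9", "d2"]}
eval_set = [
    {"query": "q1", "relevant_ids": ["d7"]},
    {"query": "q2", "relevant_ids": ["d3"]},
]
score = hit_rate_at_k(eval_set, lambda q: corpus[q], k=2)  # → 0.5
```

Tools like DeepEval or promptfoo give you richer metrics, but a hand-rolled hit rate over your 100 examples is enough to start spotting failure modes.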

How Many Examples Do You Need for Retrieval Improvement?

Retrieval is the most critical part of RAG because a mistake in the retrieval step can lead to a completely wrong answer.

When you have a large and diverse dataset of examples, you can improve the retrieval component. Your retrieval examples should cover the full range of topics and domains your RAG application supports.

With 1000 examples, you can work on query processing and indexing strategies. For example, you can use Hypothetical Document Embeddings to generate a synthetic answer and use the answer to query the vector database. The retrieval improvement stage is also when you focus on data engineering. You need reliable data processing to parse and index the documents into the vector database.
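The HyDE flow can be sketched in a few lines. The `generate_answer`, `embed`, and `vector_search` callables below are placeholders for your LLM, embedding model, and vector database; the stand-ins in the usage example exist only to show the call shape.

```python
def hyde_retrieve(query: str, generate_answer, embed, vector_search, k=5):
    """Hypothetical Document Embeddings: search with a synthetic answer, not the query."""
    # 1. Ask the LLM to write a plausible (possibly hallucinated) answer.
    hypothetical = generate_answer(f"Write a short passage answering: {query}")
    # 2. Embed the synthetic answer instead of the raw query.
    vector = embed(hypothetical)
    # 3. Query the vector database with that embedding.
    return vector_search(vector, k=k)

# Toy stand-ins to demonstrate the call shape (not real services).
docs = hyde_retrieve(
    "How do I reset my password?",
    generate_answer=lambda p: "You can reset it from the login page.",
    embed=lambda text: [float(len(text))],
    vector_search=lambda vec, k: ["doc-12", "doc-3", "doc-44"][:k],
    k=2,
)  # → ["doc-12", "doc-3"]
```

The idea is that a synthetic answer tends to be closer, in embedding space, to the documents you want than a short question is.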

One thousand examples may seem like a lot, but if you want to fine-tune the embedding model, those 1000 examples are just a starting point.

How Many Examples Do You Need for Fine-Tuning an LLM?

Fine-tuning an LLM will certainly improve the response quality, but first, you have to gather a massive dataset of high-quality examples. For many tasks, you can use small language models, but even those require lots of training data.

Fine-tuning will adapt your RAG to the specific task and domain, but you need a robust ground truth data source. You won’t gather 10k examples by having three people review system outputs and manually choose the best ones. See the section on Data Flywheel below for more details.
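Once feedback collection is in place, turning logged interactions into a fine-tuning dataset can be largely mechanical. A minimal sketch, assuming logged records with `query`, `response`, and `feedback` fields and a chat-style training format; adapt both to whatever your logging and fine-tuning setup actually use.

```python
def to_training_examples(logged_interactions):
    """Keep only positively rated responses as ground-truth training examples."""
    examples = []
    for rec in logged_interactions:
        if rec.get("feedback") != "thumbs_up":
            continue
        examples.append({
            "messages": [
                {"role": "user", "content": rec["query"]},
                {"role": "assistant", "content": rec["response"]},
            ]
        })
    return examples

# Hypothetical logged interactions.
logs = [
    {"query": "What is RAG?", "response": "Retrieval-augmented generation combines...",
     "feedback": "thumbs_up"},
    {"query": "Pricing?", "response": "I don't know.", "feedback": "thumbs_down"},
]
dataset = to_training_examples(logs)  # only the thumbs-up record survives
```

This is exactly why the feedback loop matters: without it, curating 10k examples by hand is not realistic.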

How Many Examples Do You Need to Train a Specialized Model From Scratch?

Training a specialized model from scratch is the most ambitious goal. The sky is the limit here. The good news is that you can gather the data the same way as you do for fine-tuning, just over a longer period.

How to Gather Data?

Jason Liu’s Data Flywheel is a system that gets better the more users you have. It’s self-improving because you gather user feedback and use the feedback to find the best examples that you later use for tuning RAG or the LLM.

Your RAG-based application should have a way for the user to leave feedback. You can use a thumbs-up/down button to judge the final answer or a delete button to remove an irrelevant source document. For example, Langfuse allows you to track user feedback in addition to monitoring the requests and responses.

Jason Liu suggests tracking not only the positive examples (the things you want to see as the output) but also the negative examples (the things you don’t want to see as the output). Having both positive and negative examples gives you more flexibility in testing or tuning the system to tell relevant documents apart from irrelevant ones.
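A simple way to keep both kinds of signal together is to store them per query, for example so they can later serve as contrastive pairs when tuning retrieval. The record structure below is an assumption, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackRecord:
    """Positive and negative retrieval feedback collected for one user query."""
    query: str
    positive_docs: list = field(default_factory=list)  # kept / upvoted sources
    negative_docs: list = field(default_factory=list)  # deleted / irrelevant sources

# Hypothetical feedback for one query.
record = FeedbackRecord(query="How do refunds work?")
record.positive_docs.append("policy-2024.md")      # user kept this source
record.negative_docs.append("press-release-2019.md")  # user deleted this source
```

Query/positive/negative triples like these are the raw material for both evaluation sets and embedding-model fine-tuning.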

How to Use the Data?

Most likely, you aren’t building a generic RAG but a specialized application for your domain. Therefore, when you review the users’ queries, you should split them into categories. As you segment the queries, look for common usage patterns or prompt techniques used by the users. It will give you an idea of how the system is used.

When you have several valid use-case segments, you should start tracking the metrics for each segment. You need to know the query volume (how many queries are assigned to each segment) and the performance metrics (how well the system performs in each segment).

Then, the process becomes simple: you always focus on the highest-impact segment - high volume of queries and low performance.
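One way to operationalize this (the scoring formula is my assumption, not part of the course) is to weight each segment's query volume by its performance gap and rank segments by the result:

```python
# Hypothetical per-segment metrics gathered from tracking.
segments = {
    "billing":  {"volume": 1200, "accuracy": 0.62},
    "shipping": {"volume": 300,  "accuracy": 0.55},
    "returns":  {"volume": 900,  "accuracy": 0.91},
}

def impact(stats):
    """High volume combined with low accuracy yields the highest impact score."""
    return stats["volume"] * (1.0 - stats["accuracy"])

ranked = sorted(segments, key=lambda name: impact(segments[name]), reverse=True)
# billing (456) > shipping (135) > returns (81), so "billing" comes first
```

Here, "shipping" has the worst accuracy but "billing" wins on combined volume and gap, which is the point: fix what hurts the most users first.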

The One Thing to Remember

As with any software, the AI system you build will only be as good as your tests.


Do you need help building a trustworthy RAG system for your business?
You can hire me!

Are you looking for an experienced AI consultant? Do you need assistance with your RAG or Agentic Workflow?
Book a Quick Consultation or send me a message on LinkedIn.
