How can we measure improvement in information retrieval quality in RAG systems?

Every RAG system starts with retrieval. If your AI system answers users’ questions but cannot find the most relevant data, your RAG will be a huge disappointment.

The AI model won’t fix the results if your retrieval code leaves out the most relevant documents; by the time generation starts, it’s already too late. Tweaking prompts won’t help you. Better models will increase the cost but won’t give you better answers. Garbage in, garbage out.

Making a RAG system better often involves improving retrieval. But what does better even mean? Do we compare answers from two implementations and try to determine which one returns more relevant documents? Fortunately, we can measure it.

I will show you how to use the ir_measures library to calculate the ranking metrics and quantify the performance of your retrieval code. After all, if you can’t measure how good (or bad) something is, you can’t improve it.

Getting the Ground Truth Data

As always in machine learning, to calculate metrics, we need to compare the results we get with the results we want.

This simple fact comes with a shocking (for some people) revelation: you must know what you want to get from your retrieval code. What’s the expected result for a given query?

Imagine you have a RAG system with retrieval based on semantic search in a vector database. You send a query: “How to fix a vacuum cleaner pipe?” Do you expect to get:

To fix a vacuum cleaner pipe, unplug the vacuum and detach the pipe. Remove clogs with a long brush or wire hanger. For cracks or holes, use duct tape for a quick fix or a plastic pipe repair kit for a permanent solution. Reassemble securely and perform regular maintenance to prevent future issues.

or

A vacuum cleaner pipe, or hose, is a flexible tube fixed to the vacuum's main body and its attachments. It allows the vacuum to suck up dirt, dust, and debris from surfaces and carry them into the vacuum's bin or bag. Made from strong, flexible materials, the pipe can bend without breaking or blocking.

Both paragraphs contain the words “vacuum cleaner pipe” and some form of the verb “fix,” but they mean two different things. Which answer is better? You will have to decide.

You will have to prepare a dataset of questions and expected answers. The answers should be actual documents you have in your database. You may want a single best document or a collection of documents in the correct order. Both options are possible, but you may need different metrics to quantify the quality, depending on whether you care only about the first result, the order of the results, and so on.
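As a rough illustration (the query, document IDs, and relevance grades below are made up), a single entry of such a dataset could be a query paired with graded document IDs, which maps directly onto the qrels format used later in this article:

# A hypothetical ground truth entry. A higher relevance grade means the document
# should appear earlier in the results; 0 means "not relevant".
ground_truth = [
    {"query_id": "Q1", "query": "How to fix a vacuum cleaner pipe?", "doc_id": "D17", "relevance": 2},  # repair instructions
    {"query_id": "Q1", "query": "How to fix a vacuum cleaner pipe?", "doc_id": "D42", "relevance": 1},  # general maintenance tips
    {"query_id": "Q1", "query": "How to fix a vacuum cleaner pipe?", "doc_id": "D99", "relevance": 0},  # description of what a pipe is
]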

Can You Automate Creating the Ground Truth Data?

Technically, you can. You can randomly choose a subset of documents and ask AI to prepare a list of questions those documents can answer.

Do AI-generated questions make sense? Unlikely.

The generated questions may differ from the actual questions the users ask, so even if you create a perfect retrieval implementation, the RAG won’t work well in the production environment.

Using AI to rank documents to find the best three answers to your questions (both real and generated) makes no sense either. AI will choose the answers that look as if they were the most relevant. Looking relevant isn’t the same as being the most relevant or correct.

How to Prepare the Ground Truth Data?

You need a subject matter expert, and you have to ask them to rank the documents for a given question from the most relevant to the least relevant (or to rank only the three/five/ten most relevant documents).

If preparing the data looks like a tedious task, it’s because it is tedious.

You may need to prepare a better UX for the person creating the ground truth data. Instead of giving them a question and all the documents, provide them with a question and two or three documents they can choose from. Tell them to select the most relevant one or decide none is appropriate.

Later, you calculate a normalized relevance score for each document (divide the number of times the document was chosen as the most relevant for a given question by the number of times somebody saw the document) and use that score to rank the documents in the ground truth dataset.
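Here is a minimal sketch of that calculation; the pairwise judgments are made up, but the structure matches the UX described above (show two documents, let the expert pick one or decide none fits):

from collections import Counter

# Hypothetical pairwise judgments collected for a single question:
# the expert saw two documents and picked the most relevant one (or None).
judgments = [
    {"shown": ["D1", "D2"], "picked": "D1"},
    {"shown": ["D1", "D3"], "picked": "D1"},
    {"shown": ["D2", "D3"], "picked": None},
]

seen = Counter()    # how many times each document was shown
picked = Counter()  # how many times it was chosen as the most relevant

for judgment in judgments:
    for doc_id in judgment["shown"]:
        seen[doc_id] += 1
    if judgment["picked"] is not None:
        picked[judgment["picked"]] += 1

# Normalized relevance: times picked divided by times seen.
normalized_relevance = {doc_id: picked[doc_id] / seen[doc_id] for doc_id in seen}
print(sorted(normalized_relevance.items(), key=lambda item: item[1], reverse=True))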

You can pre-filter the data to make sure the documents the experts see have a chance of being relevant. For pre-filtering, you can use a ranking produced by AI, a keyword search, a semantic search relevance score, etc.

If your AI system is already deployed, you can show users answers based on two different documents and ask them to decide which one is better. A better answer probably means the source document was better, too, unless you test different versions of retrieval code and prompts at the same time. Don’t change multiple things at once.

If you want to check whether the ground truth data you prepared is similar to the actual questions sent by the users, try clustering. If your ground truth data ends up in a different cluster than the actual questions, you may have a problem. If the questions are mixed in the same cluster, you are on the right track. You may see multiple clusters if there are multiple topics you handle, but the ground truth data should never be separated from the actual questions.
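A rough sketch of such a check is shown below. The questions are made up, and TF-IDF vectors keep the example self-contained; in practice, you would reuse the same embedding model as your retrieval code and a much larger sample of real user questions.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical question lists; load your ground truth queries and real user queries instead.
ground_truth_questions = [
    "How to fix a vacuum cleaner pipe?",
    "How to replace a vacuum cleaner filter?",
]
user_questions = [
    "vacuum hose is clogged, what do I do?",
    "my vacuum cleaner lost suction",
]

questions = ground_truth_questions + user_questions
vectors = TfidfVectorizer().fit_transform(questions)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(vectors)

# If the ground truth questions and the user questions never share a cluster,
# the ground truth probably doesn't represent real usage.
ground_truth_clusters = set(labels[: len(ground_truth_questions)])
user_clusters = set(labels[len(ground_truth_questions):])
print(ground_truth_clusters, user_clusters, ground_truth_clusters & user_clusters)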

Which Documents Are Better?

How should the experts choose a better answer? They should focus on comparing several criteria in the following order.

  1. Correctness.
  2. Information density.
  3. Level of detail.

If both documents are correct, the reviewers should prefer the more concise one. We don’t want a document filled with fluff where the relevant information is hidden between anecdotes.

If multiple documents are correct and similarly dense, we prefer the document that gives us more details. After retrieval, the next steps of the RAG system will summarize the documents, so it makes no sense to start with a summary and summarize it even more.

How to Measure the Performance of Retrieval in RAG?

Typically, we use two metrics: Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).

Mean Reciprocal Rank (MRR)

MRR scores where we placed the most relevant document. For example, if for a given query Q1 the expected ranking is D1, D2, D3 (with D1 being the most relevant document), the Reciprocal Rank values look as follows:

  1. We get a score of 1 if our retrieval code places document D1 in the first position (the other documents don’t matter).
  2. We get 0.5 if D1 is in the second position, for example: D3, D1, D2.
  3. We get 0.33 if D1 is third, and so on.
  4. Of course, we get 0 if we don’t return D1 at all.

We average the Reciprocal Rank over all queries in our test dataset to get the Mean Reciprocal Rank.
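A minimal sketch of the calculation, reusing the document IDs from the example above:

def reciprocal_rank(ranked_doc_ids, relevant_doc_id):
    """Return 1 / position of the relevant document, or 0 if it wasn't returned."""
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id == relevant_doc_id:
            return 1 / position
    return 0.0

rankings = [["D1", "D2", "D3"], ["D3", "D1", "D2"], ["D3", "D2"]]
reciprocal_ranks = [reciprocal_rank(ranking, "D1") for ranking in rankings]
print(reciprocal_ranks)                               # [1.0, 0.5, 0.0]
print(sum(reciprocal_ranks) / len(reciprocal_ranks))  # 0.5 - the "Mean" in MRR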

Normalized Discounted Cumulative Gain (NDCG)

NDCG takes into account the entire list of documents (not only the first document) and their position in the list. To be precise, NDCG uses a document’s relevance score, but for our purpose, the score may be based on the position in the list, where the first document has a higher score than the second one, and so on.

The calculation consists of three steps:

  1. Compute the Discounted Cumulative Gain (DCG) of the results by summing the relevance scores of the returned documents, each discounted logarithmically by its position in the list.
  2. Compute the Discounted Cumulative Gain of the ideal (ground truth) ranking for the same query.
  3. Normalize by dividing the result’s DCG by the DCG of the ideal answer.

NDCG doesn’t penalize the retrieval system for returning irrelevant documents. If a document is scored as 0, a set containing it gets the same NDCG score as a set without it (if all relevant documents are at the same positions in both cases).
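A minimal sketch of those three steps, using made-up relevance grades:

import math

def dcg(relevances):
    """Discounted Cumulative Gain: each gain is discounted by the log of its position."""
    return sum(relevance / math.log2(position + 1)
               for position, relevance in enumerate(relevances, start=1))

# Relevance grades of the returned documents, in the order the retrieval code returned them.
returned = [2, 0, 1]
# The ideal ranking orders the judged documents from the most to the least relevant.
ideal = sorted(returned, reverse=True)

ndcg = dcg(returned) / dcg(ideal)
print(round(ndcg, 4))  # 0.9502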

Evaluating Document Retrieval with the ir_measures Library

The ir_measures Python package provides a standardized interface for several other information retrieval evaluation tools. ir_measures offers both a command line interface and a Python API. The command line interface is similar to trec_eval.

Both ir_measures and trec_eval have one annoying property. If you have never used an evaluation tool, their documentation won’t show you how to use them properly.

Let me explain how to use ir_measures.

You will need two datasets:

  • qrels - the ground truth dataset containing the queries and the documents relevant to each query. Every document has a relevance score: 0 means not relevant, 1 means relevant. Values greater than 1 can be used if your metric takes the degree of relevance into account (as NDCG does); in that case, the highest relevance score goes to the document that’s supposed to be first. Float and negative values are also possible, but check the implementation of the metric you want to use to see if such values make sense.

  • run - the run file contains the query ID, the document ID, and a score for each query/document pair (a higher score means the document is more relevant according to your retrieval code).
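If you use the command line interface instead, both files typically follow the TREC layout. A qrels file has one line per judged document (query ID, an iteration column that’s usually 0, document ID, relevance grade):

Q0 0 D2 1
Q1 0 D0 2
Q1 0 D1 1

A run file has one line per returned document (query ID, the literal string Q0 as a historical placeholder, document ID, rank, score, and a run name):

Q0 Q0 D2 1 1.0 my-retrieval
Q1 Q0 D0 1 2.0 my-retrieval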

With the Python API, we can keep both datasets in Pandas DataFrames.

nDCG in ir_measures

from ir_measures import *
import pandas as pd


# Ground truth: the relevance of each document to each query
qrels = pd.DataFrame([
 {'query_id': "Q0", 'doc_id': "D0", 'relevance': 0},
 {'query_id': "Q0", 'doc_id': "D1", 'relevance': 0},
 {'query_id': "Q0", 'doc_id': "D2", 'relevance': 1},
 {'query_id': "Q1", 'doc_id': "D0", 'relevance': 2},
 {'query_id': "Q1", 'doc_id': "D1", 'relevance': 1},
 {'query_id': "Q1", 'doc_id': "D2", 'relevance': 0},
])

# Run: the scores our retrieval code assigned to each document
run = pd.DataFrame([
 {'query_id': "Q0", 'doc_id': "D0", 'score': 0},
 {'query_id': "Q0", 'doc_id': "D1", 'score': 1},
 {'query_id': "Q0", 'doc_id': "D2", 'score': 1},
 {'query_id': "Q1", 'doc_id': "D0", 'score': 2},
 {'query_id': "Q1", 'doc_id': "D1", 'score': 0},
 {'query_id': "Q1", 'doc_id': "D2", 'score': 0}
])

metric = nDCG@5
result = metric.iter_calc(qrels, run)
for single_result in result:
  print(single_result)

Metric(query_id='Q0', measure=nDCG@5, value=1.0)
Metric(query_id='Q1', measure=nDCG@5, value=0.9502344167898356)

The nDCG object’s numeric parameter is called the cutoff. It indicates how many of the top-ranked documents are taken into account during evaluation. If we care only about the two most relevant documents, we can set the cutoff to 2. If we need only one document, we set the cutoff to 1, and so on.

nDCG doesn’t penalize for returning irrelevant documents, so Q0 still scores 1 even though the irrelevant D1 received a high score from the retrieval code. Document retrieval for query Q1, on the other hand, effectively missed one relevant document (D1 got the same low score as the irrelevant D2), so its score dropped below 1.

MRR in ir_measures

from ir_measures import *
import pandas as pd


# The same ground truth as in the nDCG example
qrels = pd.DataFrame([
 {'query_id': "Q0", 'doc_id': "D0", 'relevance': 0},
 {'query_id': "Q0", 'doc_id': "D1", 'relevance': 0},
 {'query_id': "Q0", 'doc_id': "D2", 'relevance': 1},
 {'query_id': "Q1", 'doc_id': "D0", 'relevance': 2},
 {'query_id': "Q1", 'doc_id': "D1", 'relevance': 1},
 {'query_id': "Q1", 'doc_id': "D2", 'relevance': 0},
])

# This time, D0 and D2 get the same top score for query Q0
run = pd.DataFrame([
 {'query_id': "Q0", 'doc_id': "D0", 'score': 1},
 {'query_id': "Q0", 'doc_id': "D1", 'score': 0.4},
 {'query_id': "Q0", 'doc_id': "D2", 'score': 1},
 {'query_id': "Q1", 'doc_id': "D0", 'score': 2},
 {'query_id': "Q1", 'doc_id': "D1", 'score': 0},
 {'query_id': "Q1", 'doc_id': "D2", 'score': 0}
])

metric = RR@1
result = metric.iter_calc(qrels, run)
for single_result in result:
  print(single_result)
Metric(query_id='Q0', measure=RR@1, value=0.0)
Metric(query_id='Q1', measure=RR@1, value=1.0)

In the case of the MRR metric, we care only about the first relevant document in the results. The cutoff parameter tells the algorithm how many positions to check while looking for a relevant document before giving up and scoring the result as 0.

For MRR, a document with a relevance score of 1 or greater is considered relevant, but we can modify the relevance threshold by specifying the rel argument: RR(rel=30)@1. Whatever we choose, it will always be a binary decision: relevant or not relevant. The degree of relevance doesn’t matter for MRR.

In the example above, query Q0 scored 0 because the retrieval code gave the top score to both D0 and D2. Only D2 is relevant, but the cutoff is set to 1, so the metric checks only the first position. D0 ended up first, and according to the ground truth data, D0 is irrelevant to query Q0.

Query Q1 has a score of 1 because D0 is a relevant document, and the retrieval code returned D0 in the first position.

If we change the cutoff parameter to 2, the score for Q0 will be 0.5 because the second returned document is relevant.

metric = RR@2
result = metric.iter_calc(qrels, run)
for single_result in result:
  print(single_result)
Metric(query_id='Q0', measure=RR@2, value=0.5)
Metric(query_id='Q1', measure=RR@2, value=1.0)

The value of Q1 doesn’t change because we got a relevant document in the first position.

Suppose you want to use MRR to evaluate finding the most relevant document (not just any relevant document). In that case, only one document should be marked as relevant for each query in the ground truth data.
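When you need a single aggregated number per metric instead of per-query values, ir_measures also provides calc_aggregate (at least in the versions I have worked with), which averages each metric over all queries - that’s where the “Mean” in Mean Reciprocal Rank comes from:

from ir_measures import calc_aggregate

# Averages each metric over all queries in the run and returns a dict keyed by measure.
print(calc_aggregate([nDCG@5, RR@2], qrels, run))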

How to Work with Information Retrieval Metrics?

You have calculated the metrics for your current retrieval code. What’s next?

You have the baseline. When you change the retrieval code, you can calculate the metrics again and compare the results. You get hard data. No more guessing. No more “I think it’s better.” You have the numbers. And if you disagree with the numbers, you should have spent more time preparing the ground truth data.

Is retrieval quality the only thing that matters? No. RAG consists of two parts: retrieval and generation. Even if you got perfect retrieval results, you can always make the answer worse later in RAG. However, getting good answers is impossible if you don’t have good documents to start with.


