How to prevent LLM hallucinations from reaching users in RAG systems

Hallucinations erode trust. If users see that an AI application generates nonsense, they stop using the app. The problem with Retrieval-Augmented Generation (RAG) systems is that most of them will always return something.

If your retrieval is based on vector similarity, something will always be similar to the query (unless you have an empty database). You may set a similarity threshold, but finding a value that will work for all queries is hard. If you set it too high, the system will return nothing. If you set it too low, the user will see hallucinations.
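To make the problem concrete, a threshold-based filter boils down to a single cutoff value applied to every query. The sketch below is only an illustration (the results structure and the 0.75 cutoff are my assumptions, not part of any specific library):

def filter_by_similarity(results, threshold=0.75):
    # results: a list of (document, similarity_score) pairs returned by a vector search
    # whatever value we pick, the same cutoff applies to every query,
    # which is why it is so hard to get right
    return [document for document, score in results if score >= threshold]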

One possible solution is to use a generative model to filter out irrelevant results. Given data retrieved from a vector database, we ask the model to determine whether the retrieved documents are relevant to the query. If the model says the document is irrelevant, we tell the user we couldn’t find anything.
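Conceptually, the filter is a single yes/no question to the model. Here is a minimal, library-agnostic sketch (the complete_fn callable and the prompt wording are hypothetical; the rest of this article uses Llama-Index's FaithfulnessEvaluator instead of hand-rolling the check):

def documents_are_relevant(complete_fn, query: str, documents: list) -> bool:
    # complete_fn is any function that sends a prompt to an LLM and returns the text of its answer
    context = "\n\n".join(documents)
    prompt = (
        "Answer YES or NO. Can the question be answered using only the context below?\n\n"
        f"Question: {query}\n\nContext:\n{context}"
    )
    return complete_fn(prompt).strip().upper().startswith("YES")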

Getting rid of hallucinations with Llama-Index evaluators

We will use the FaithfulnessEvaluator from the Llama-Index library to determine whether the documents retrieved by vector search contain answers to the user’s question.

The evaluator gets a query and a list of documents. As the output, FaithfulnessEvaluator returns feedback that explains the decision, a score that is 0 if the documents are irrelevant and 1 if they are relevant, a passing boolean flag (true if the documents are relevant), and a response that answers the user’s question or explains why we couldn’t find anything.

Preparing the documents

If you run the code in Google Colab, you will need only two additional libraries: llama-index and llama-index-llms-anthropic. If the code runs on your local machine, also install the lxml HTML parser.
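For reference, the installation could look like this (exact versions are up to you):

%pip install llama-index llama-index-llms-anthropic

# on a local machine, additionally install the HTML parser:
%pip install lxml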

Before we start, we have to load documents into a vector store. We will use the content of the Applied LLMs website (https://applied-llms.org) and load every question and answer as a separate document.

First, we retrieve the website’s content using the requests library and parse the HTML with BeautifulSoup. A SoupStrainer object limits parsing to the HTML fragments that each contain a single question and answer. We end up with a list of elements, each converted to a Llama-Index-compatible Document object.

The first line of each section is the title or a question. We will use the first line as the document’s title and pass the value as metadata.

from typing import Iterator
from llama_index.core import Document
import requests
from bs4 import BeautifulSoup, SoupStrainer


def load_html(url: str) -> Iterator[Document]:
    response = requests.get(url)
    strainer = SoupStrainer("section", class_="level3")

    soup = BeautifulSoup(response.content, "lxml", parse_only=strainer)
    all_elements = soup.find_all(strainer)

    for element in all_elements:
        text_content = element.get_text().strip()
        # the first line of a section is the question; we keep it as the document's title
        first_line = text_content.split("\n")[0]

        yield Document(
            text=text_content,
            metadata={"title": first_line},
        )


URL = "https://applied-llms.org/"
documents = list(load_html(URL))
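You can verify that the scraping worked by printing the number of loaded documents and a sample title (a quick, optional check):

print(len(documents))
print(documents[0].metadata["title"])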

Storing the documents in the vector store

Now, we store the documents in the vector store. A vector store needs a vectorizer to convert text to vectors. We will use the OpenAI embedding model to transform the text. In Llama-Index, we can set the embedding model as a global parameter instead of passing it to every function.

from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings


# OPENAI_API_KEY is assumed to be defined earlier (for example, read from an environment variable)
Settings.embed_model = OpenAIEmbedding(api_key=OPENAI_API_KEY)
vector_index = VectorStoreIndex.from_documents(documents)
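If you want to see what vector search alone returns before adding the evaluator, you can query the index's retriever directly (an optional check; the test question is arbitrary):

retriever = vector_index.as_retriever(similarity_top_k=2)

for node_with_score in retriever.retrieve("How to measure RAG performance?"):
    print(round(node_with_score.score, 3), node_with_score.node.metadata.get("title"))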

Configuring the FaithfulnessEvaluator

The only thing the evaluator needs is an LLM. We will use the Claude 3 Haiku model (but any model supported by the Llama-Index library will work).

from llama_index.llms.anthropic import Anthropic
from llama_index.core.evaluation import FaithfulnessEvaluator


llm = Anthropic(model="claude-3-haiku-20240307", api_key=ANTHROPIC_API_KEY)
evaluator = FaithfulnessEvaluator(llm=llm)

Querying the vector store and evaluating the results

Before we move on, if you use llama-index in Google Colab, you should allow nested asyncio event loops:

import nest_asyncio


nest_asyncio.apply()

Now, we define a variable containing the user’s question, create a query engine for the vector store, and retrieve the documents. Finally, we evaluate the documents using the FaithfulnessEvaluator.

First, let’s try a question for which the scraped website doesn’t contain an answer:

question = "What's the difference between Langchain and Llama-Index?"
query_engine = vector_index.as_query_engine(llm=llm)
response_vector = query_engine.query(question)
eval_result = evaluator.evaluate_response(query=question, response=response_vector)

The eval_result variable contains the query (the user’s question), the contexts (the list of retrieved documents), and the evaluation properties described earlier: feedback, score, passing, and response.

The score and the passing properties indicate the documents’ relevance. Those two should be sufficient to show the user a generic message when we can’t find anything.
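For example, a minimal gate could look like this (the fallback message is my placeholder, not something Llama-Index generates):

if eval_result.passing:  # equivalent to checking eval_result.score == 1.0
    print(eval_result.response)
else:
    print("Sorry, I couldn't find an answer to your question in my documents.")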

The feedback property contains the explanation, which is the response to a Chain-of-Thought prompt. In the case of my question, the evaluator returned the following feedback:

The given information is not supported by the context.

The context provided does not contain any information about Langchain or Llama-Index. These tools are not mentioned in the given text. Without any relevant information in the provided context, I do not have enough information to compare or differentiate between Langchain and Llama-Index.

The context is focused on discussing the importance of generating structured output from language models to ease downstream integration, as well as the benefits of having humans in the loop when using AI systems. It does not cover the specific tools you asked about.

Therefore, the answer is NO.

The explanation is quite lengthy but contains the model’s decision in the format the FaithfulnessEvaluator expects. Still, I wouldn’t show the raw feedback to the user. Instead, we can consider sending the response property to the user:

The context provided does not contain any information about Langchain or Llama-Index. These tools are not mentioned in the given text. Without any relevant information in the provided context, I do not have enough information to compare or differentiate between Langchain and Llama-Index. The context is focused on discussing the importance of generating structured output from language models to ease downstream integration, as well as the benefits of having humans in the loop when using AI systems. It does not cover the specific tools you asked about.

What would happen if the vector store contained a document with the answer to the question? Let’s try a different question:

question = "How to measure RAG performance?"
query_engine = vector_index.as_query_engine(llm=llm)
response_vector = query_engine.query(question)
eval_result = evaluator.evaluate_response(query=question, response=response_vector)

We get a passing flag set to True (and the score equal to 1). The feedback still contains an explanation, but this time, the feedback confirms the documents include the answer to the user’s question:

...

The context does not mention anything about the taste of apple pies. It only describes the ingredients and how they are typically served.

Information: To measure the performance of a RAG (Retrieval-Augmented Generation) system, the context highlights a few key factors to consider:

Relevance: The relevance of the retrieved documents is crucial. Metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) can be used to quantify how well the system ranks relevant documents higher than irrelevant ones.

Information Density: When documents have similar relevance, the system should prefer more concise and information-dense documents over those with extraneous details. This helps ensure the LLM receives the most relevant information.

Level of Detail: The amount of detail provided in the retrieved documents can also impact performance. For example, in a task like generating SQL queries from natural language, including column descriptions and sample values can help the LLM better understand the semantics of the data.

To measure the impact of these factors, the context suggests running the RAG-based task with the retrieved items shuffled to see how the performance changes. This can help isolate the effect of the retrieval quality on the overall system performance.
Answer: YES
The context clearly supports the information provided, highlighting the key factors to consider when measuring the performance of a RAG system.

Similarly, the response property has the answer to the user’s question. Since the response says the same thing as the feedback, I will skip it here.

Does it make sense to use RetryGuidelineQueryEngine to fix retrieval?

RetryGuidelineQueryEngine is not a magic fix. If we set the resynthesize_query parameter to True, it will attempt to generate a new query based on the user’s question and the feedback from the evaluator. However, the documents retrieved based on the generated query may still be irrelevant.

We must either set the global llm property for llama-index or pass a FeedbackQueryTransformation with the LLM to the RetryGuidelineQueryEngine. If we pass the transformer, we can also modify the resynthesis_prompt used to generate the new query.

I will set the global LLM and keep the default FeedbackQueryTransformation for the RetryGuidelineQueryEngine.

from llama_index.core.query_engine import RetryGuidelineQueryEngine


Settings.llm = llm

question = "What's the difference between Langchain and Llama-Index?"
retry_query_engine = RetryGuidelineQueryEngine(
    query_engine, evaluator, max_retries=3, resynthesize_query=True
)
retry_response = retry_query_engine.query(question)

The retry_response isn’t the same as the eval_result. The response from the RetryGuidelineQueryEngine contains only the final response with the answer to the user’s question and the source_nodes list of the documents used to generate the response. There is no feedback or score, but at least the nodes have a score property.
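To inspect what the retry engine produced, you can print the final response and the retrieved nodes with their similarity scores:

print(retry_response.response)

for node_with_score in retry_response.source_nodes:
    print(round(node_with_score.score, 3), node_with_score.node.metadata.get("title"))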

How can hallucinations in RAG systems be fixed?

We can’t prevent hallucinations in RAG systems, but we can filter out irrelevant results. We can force the LLM to generate an answer saying the user’s question cannot be answered with our data. It’s not a fix or a perfect solution, but at least we aren’t deceiving the user.


Do you need help building AI-powered applications for your business?
You can hire me!
