Hallucinations erode trust. If users see that an AI application generates nonsense, they stop using the app. The problem with Retrieval Augmented Generation (RAG) systems is that most will always return something.
Table of Contents
- Getting rid of hallucinations with Llama-Index evaluators
- Does it make sense to use RetryGuidelineQueryEngine to fix retrieval?
- How can hallucinations in RAG systems be fixed?
If your retrieval is based on vector similarity, something will always be similar to the query (unless you have an empty database). You may set a similarity threshold, but finding a value that will work for all queries is hard. If you set it too high, the system will return nothing. If you set it too low, the user will see hallucinations.
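For context, this is how a similarity threshold is typically applied in Llama-Index: a node postprocessor with a similarity_cutoff drops retrieved nodes below the threshold before the answer is generated. The snippet below is a minimal sketch, not part of this article's pipeline; the cutoff value is arbitrary, and vector_index refers to the index we build later in the article.

from llama_index.core.postprocessor import SimilarityPostprocessor

# Discard retrieved nodes whose similarity score falls below the cutoff.
# Picking a cutoff that works for every query is the hard part.
query_engine = vector_index.as_query_engine(
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.75)]
)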
One possible solution is to use a generative model to filter out irrelevant results. Given data retrieved from a vector database, we ask the model to determine whether the retrieved documents are relevant to the query. If the model says the document is irrelevant, we tell the user we couldn’t find anything.
Getting rid of hallucinations with Llama-Index evaluators
We will use the FaithfulnessEvaluator from the Llama-Index library to determine whether the documents retrieved by vector search contain answers to the user’s question.
The evaluator gets a query and a list of documents. As the output, FaithfulnessEvaluator returns a feedback that explains the decision, a score that is 0 if the documents are irrelevant and 1 if they are relevant, a passing boolean flag (True if the documents are relevant), and a response that answers the user’s question or explains why we couldn’t find anything.
Preparing the documents
If you run the code in Google Colab, you will need only two libraries: llama-index and llama-index-llms-anthropic. If the code runs on your local machine, also install the lxml HTML parser.
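A typical local setup might look like the commands below. The exact package list is an assumption: requests and beautifulsoup4 are preinstalled in Colab but may be missing on a local machine.

pip install llama-index llama-index-llms-anthropic
pip install lxml beautifulsoup4 requests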
Before we start, we have to load documents into a vector store. We will use the content of the Applied LLMs website (https://applied-llms.org) and load every question and answer as a separate document.
First, we retrieve the website’s content using the requests library and parse the HTML with BeautifulSoup. A BeautifulSoup strainer object finds the HTML fragments containing a single question and answer. We end up with a list of elements, each converted to a Llama-Index-compatible Document object.
The first line of each section is the title or a question. We will use the first line as the document’s title and pass the value as metadata.
from typing import Iterator

import requests
from bs4 import BeautifulSoup, SoupStrainer
from llama_index.core import Document

def load_html(url: str) -> Iterator[Document]:
    # Download the page and keep only the <section class="level3"> fragments,
    # each of which contains a single question and answer.
    response = requests.get(url)
    strainer = SoupStrainer("section", class_="level3")
    soup = BeautifulSoup(response.content, "lxml", parse_only=strainer)
    all_elements = soup.find_all(strainer)
    for element in all_elements:
        text_content = element.get_text().strip()
        # The first line of each section is the question; keep it as the title metadata.
        first_line = text_content.split("\n")[0]
        yield Document(
            text=text_content,
            metadata={"title": first_line}
        )
URL = "https://applied-llms.org/"
documents = list(load_html(URL))
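Before indexing, a quick sanity check can confirm the scraping worked. This is an optional snippet of my own; the counts and titles depend on the current content of the website.

# How many Q&A sections were scraped, and what the first one looks like.
print(len(documents))
print(documents[0].metadata["title"])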
Storing the documents in the vector store
Now, we store the documents in the vector store. A vector store needs a vectorizer to convert text to vectors. We will use the OpenAI Embedding model to transform the text. In Llama-index, we can set the embedding model as a global parameter instead of passing the embedding to every function.
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings
Settings.embed_model = OpenAIEmbedding(api_key=OPENAI_API_KEY)
vector_index = VectorStoreIndex.from_documents(documents)
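If you want to verify the index before wiring up the evaluator, you can retrieve a few nodes directly. This is a sketch; the query string is just an example.

# Ad-hoc retrieval check: print the similarity score and title of the top matches.
retriever = vector_index.as_retriever(similarity_top_k=2)
for node_with_score in retriever.retrieve("How do you evaluate RAG retrieval quality?"):
    print(node_with_score.score, node_with_score.node.metadata["title"])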
Configuring the FaithfulnessEvaluator
The only thing the evaluator needs is an LLM. We will use the Claude 3 Haiku model (but any model supported by the Llama-Index library will work).
from llama_index.llms.anthropic import Anthropic
from llama_index.core.evaluation import FaithfulnessEvaluator
llm = Anthropic(model="claude-3-haiku-20240307", api_key=ANTHROPIC_API_KEY)
evaluator = FaithfulnessEvaluator(llm=llm)
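The evaluator can also be called directly on raw strings with its evaluate method, without going through a query engine. Below is a minimal sketch; the query, response, and context strings are made up for illustration.

result = evaluator.evaluate(
    query="How do you measure retrieval quality?",
    response="Metrics such as MRR and NDCG quantify how well relevant documents are ranked.",
    contexts=["MRR and NDCG measure how well a retriever ranks relevant documents."],
)
print(result.passing, result.score)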
Querying the vector store and evaluating the results
Before we move on, if you use llama-index in Google Colab, you should allow nested asyncio event loops:
import nest_asyncio
nest_asyncio.apply()
Now, we define a variable containing the user’s question, create a query engine for the vector store, and retrieve the documents. Finally, we evaluate the documents using the FaithfulnessEvaluator.
First, let’s try a question for which the scraped website doesn’t contain an answer:
question = "What's the difference between Langchain and Llama-Index?"
query_engine = vector_index.as_query_engine(llm=llm)
response_vector = query_engine.query(question)
eval_result = evaluator.evaluate_response(query=question, response=response_vector)
The eval_result variable contains:
- the query (which is the user’s question),
- the context (the list of retrieved documents),
- and the properties with the evaluation results.
The score and the passing properties indicate the documents’ relevance. Those two should be sufficient to show the user a generic message when we can’t find anything.
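As a sketch of how these flags could drive the application logic (the fallback message below is my own wording, not from the library):

# If the evaluator says the retrieved documents don't support an answer,
# show a generic message instead of the generated response.
if eval_result.passing:
    answer = eval_result.response
else:
    answer = "Sorry, I couldn't find anything about that in my knowledge base."
print(answer)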
The feedback property contains the explanation, which is the response to a Chain-of-Thought prompt. In the case of my question, the evaluator returned the following feedback:
The given information is not supported by the context.
The context provided does not contain any information about Langchain or Llama-Index. These tools are not mentioned in the given text. Without any relevant information in the provided context, I do not have enough information to compare or differentiate between Langchain and Llama-Index.
The context is focused on discussing the importance of generating structured output from language models to ease downstream integration, as well as the benefits of having humans in the loop when using AI systems. It does not cover the specific tools you asked about.
Therefore, the answer is NO.
The explanation is quite lengthy and contains the model’s decision. The decision follows the format the FaithfulnessEvaluator expects, but I wouldn’t show the feedback to the user. Instead, we can consider sending the response to the user:
The context provided does not contain any information about Langchain or Llama-Index. These tools are not mentioned in the given text. Without any relevant information in the provided context, I do not have enough information to compare or differentiate between Langchain and Llama-Index. The context is focused on discussing the importance of generating structured output from language models to ease downstream integration, as well as the benefits of having humans in the loop when using AI systems. It does not cover the specific tools you asked about.
What would happen if the vector store contained a document with the answer to the question? Let’s try a different question:
question = "How to measure RAG performance?"
query_engine = vector_index.as_query_engine(llm=llm)
response_vector = query_engine.query(question)
eval_result = evaluator.evaluate_response(query=question, response=response_vector)
We get a passing flag set to True (and the score equal to 1). The feedback still contains an explanation, but this time, the feedback confirms the documents include the answer to the user’s question:
...
The context does not mention anything about the taste of apple pies. It only describes the ingredients and how they are typically served.
Information: To measure the performance of a RAG (Retrieval-Augmented Generation) system, the context highlights a few key factors to consider:
Relevance: The relevance of the retrieved documents is crucial. Metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) can be used to quantify how well the system ranks relevant documents higher than irrelevant ones.
Information Density: When documents have similar relevance, the system should prefer more concise and information-dense documents over those with extraneous details. This helps ensure the LLM receives the most relevant information.
Level of Detail: The amount of detail provided in the retrieved documents can also impact performance. For example, in a task like generating SQL queries from natural language, including column descriptions and sample values can help the LLM better understand the semantics of the data.
To measure the impact of these factors, the context suggests running the RAG-based task with the retrieved items shuffled to see how the performance changes. This can help isolate the effect of the retrieval quality on the overall system performance.
Answer: YES
The context clearly supports the information provided, highlighting the key factors to consider when measuring the performance of a RAG system.
Similarly, the response property has the answer to the user’s question. Since the response says the same thing as the feedback, I will skip it here.
Does it make sense to use RetryGuidelineQueryEngine to fix retrieval?
RetryGuidelineQueryEngine is not a magic fix. If we set the resynthesize_query parameter to True, it will attempt to generate a new query based on the user’s question and the feedback from the evaluator. However, the documents retrieved based on the generated query may still be irrelevant.
We must either set the global llm property for llama-index or pass a FeedbackQueryTransformation with the LLM to the RetryGuidelineQueryEngine. If we pass the transformer, we can also modify the resynthesis_prompt used to generate the new query.
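If you prefer the explicit variant, a sketch could look like the snippet below. Note that the import path and the query_transformer argument name are my assumptions and may differ between Llama-Index versions.

from llama_index.core.indices.query.query_transform.feedback_transform import (
    FeedbackQueryTransformation,
)
from llama_index.core.query_engine import RetryGuidelineQueryEngine

# Pass the LLM explicitly instead of relying on the global Settings.llm.
query_transformer = FeedbackQueryTransformation(llm=llm, resynthesize_query=True)
retry_query_engine = RetryGuidelineQueryEngine(
    query_engine,
    evaluator,
    max_retries=3,
    resynthesize_query=True,
    query_transformer=query_transformer,
)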
I will set the global LLM and keep a default FeedbackQueryTransformation for the RetryGuidelineQueryEngine.
from llama_index.core.query_engine import RetryGuidelineQueryEngine
Settings.llm = llm
question = "What's the difference between Langchain and Llama-Index?"
retry_query_engine = RetryGuidelineQueryEngine(
    query_engine, evaluator, max_retries=3, resynthesize_query=True
)
retry_response = retry_query_engine.query(question)
The retry_response isn’t the same as the eval_result. The response from the RetryGuidelineQueryEngine contains only the final response with the answer to the user’s question and the source_nodes list of the documents used to generate the response. There is no feedback or score, but at least the nodes have a score property.
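To see what is available on the retry response, you can print the final answer together with the source nodes and their similarity scores:

# Final answer generated by the retry engine.
print(retry_response.response)
# Documents used to generate it, with their similarity scores.
for node_with_score in retry_response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata["title"])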
How can hallucinations in RAG systems be fixed?
We can’t prevent hallucinations in RAG systems, but we can filter out irrelevant results. We can force the LLM to generate an answer saying the user’s question cannot be answered with our data. It’s not a fix or a perfect solution, but at least we aren’t deceiving the user.