How do you debug LLMs? What about RAG or AI Agents? The process is quite similar in all cases, but it requires substantial effort, and most of it cannot be automated. Here is what I learned about LLM evaluation and debugging from Hamel Husain and Jason Liu.

Table of Contents

  1. Does LLM evaluation even matter?
  2. Forget about 1-5 ratings, use binary classification
  3. Build a custom annotation tool
  4. What data do you need for AI debugging?
    1. How do you evaluate the retrieval component?
    2. How do you evaluate the generation component?
  5. Automating the evaluation process with an LLM-as-judge
  6. Evaluating multi-turn conversations
    1. Generating test cases for multi-turn conversations
      1. User simulation with LLMs
      2. N-1 testing with real conversations
  7. Sources

Does LLM evaluation even matter?

You are going to hate this, but AI evaluation is the most important part of the AI development process. According to Hamel Husain:

In the projects we’ve worked on, we’ve spent 60-80% of our development time on error analysis and evaluation. Expect most of your effort to go toward understanding failures (i.e., looking at data) rather than building automated checks.

Sounds like fun, right? Will you spend most of your days reading AI’s responses and judging what is good and what is bad? Yes. And in this article, I will show you how to do it in a way that doesn’t make you want to quit your job.

Forget about 1-5 ratings, use binary classification

A Likert scale (1-5 ratings) seems like a good idea until you get stuck with reviewers giving 3/5 for most cases.

When the reviewer is unsure, 3/5 feels like a safe option, but zero insight comes out of it. Whether a case gets a 2, 3, or 4 means whatever the reviewer feels that day. You can’t act on “it’s OK-ish.” Do you have more than one person reviewing the data? Good luck making them agree on what each rating means. Does a 4 from a person who never gives a 5 mean the same as a 4 from someone who gives 5 to everything? Are you going to normalize the scores to account for such biases?

Save yourself the hassle and use binary classification. The AI’s answer is either 100% correct or incorrect. Slightly incorrect means wrong. When the reviewer marks an answer as incorrect, they should provide a short explanation of the problem. Comments on correct answers are optional, but errors must be explained.
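
To make the expected output concrete, here is a minimal sketch of what a single annotation record could look like (the field names are illustrative, not a standard):

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class Annotation:
        """One reviewed AI response with a binary verdict."""
        query: str                                # what the user asked
        response: str                             # what the AI answered
        is_correct: bool                          # fully correct or not, nothing in between
        error_description: Optional[str] = None   # required when is_correct is False

        def __post_init__(self) -> None:
            # Enforce the rule: every incorrect answer must be explained.
            if not self.is_correct and not self.error_description:
                raise ValueError("Incorrect answers require an error description")

Forcing the explanation at the data level pays off later, when you group the error descriptions into failure categories.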

Build a custom annotation tool

In the How I Transformed a Failing AI System into 99.7% Accuracy in 4 Days (And Eliminated a €20M Regulatory Risk) article, I said it took me 20 minutes to generate code for a custom annotation tool. 20 minutes. I know people who take longer coffee breaks. Those 20 minutes are the best investment you can make (even if it takes you 2 hours instead of 20 minutes).

An annotation tool built specifically for your case can be more efficient than a generic tool: you can make it retrieve and display the relevant context data and annotations of similar cases, render the data the same way your production system displays it, and so on. You can make it as easy and convenient as you need by adding filtering, keyboard shortcuts, comment templates, and whatever else you need. Need is the keyword here. Don’t spend time building a perfect evaluation tool. The tools are not your goal.

You will spend most of the time in the annotation tool, so make it pleasant to use and make it produce the data you need.
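
If you want a starting point, here is a minimal command-line sketch; it assumes your unreviewed responses sit in a responses.jsonl file with query and response fields (the file name and field names are assumptions), and it only captures the binary verdict and comment described above:

    import json

    INPUT_FILE = "responses.jsonl"     # one JSON object per line: {"query": ..., "response": ...}
    OUTPUT_FILE = "annotations.jsonl"  # same objects plus the reviewer's verdict and comment

    with open(INPUT_FILE) as src, open(OUTPUT_FILE, "w") as dst:
        for line in src:
            record = json.loads(line)
            print("\nQUERY:   ", record["query"])
            print("RESPONSE:", record["response"])
            record["is_correct"] = input("Correct? [y/n] ").strip().lower() == "y"
            # Errors must be explained; comments on correct answers are optional.
            prompt = "Optional comment: " if record["is_correct"] else "Describe the error: "
            record["comment"] = input(prompt).strip()
            dst.write(json.dumps(record) + "\n")

From there, add the context display, filtering, and shortcuts your own review sessions actually call for.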

What data do you need for AI debugging?

In the “Error Analysis is all you need” video, Hamel Husain states that the evaluation strategy should emerge from the observed failure patterns. So first, you check the answer and describe the errors you see. Then, you classify them to figure out what error groups you have and which errors are the most common.

When calculating the frequency of a certain error, also consider the frequency of the corresponding query type. (An error you see every time for query A may still be less common overall than an error you occasionally see for query B, which is asked much more often.)
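
A tiny sketch of that weighting, with illustrative numbers only:

    def error_impact(error_rate: float, query_share: float) -> float:
        """Fraction of all traffic affected by an error.

        error_rate:  how often the error appears for this query type (0-1)
        query_share: how often users ask this type of query (0-1)
        """
        return error_rate * query_share


    # Illustrative numbers: an error that always happens for a rare query
    # can matter less than an occasional error on a very common query.
    rare_but_always = error_impact(error_rate=1.0, query_share=0.02)    # 0.02
    common_sometimes = error_impact(error_rate=0.25, query_share=0.40)  # 0.10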

Now you know what happens when the user gets an incorrect answer. Since every RAG system consists of two components, retrieval and generation, you can assign each failure type to one of them, and you will calculate different evaluation metrics for each. If you have trouble classifying an error, ask yourself whether you see the relevant information in the context. If not, you likely have a retrieval error. If the context seems correct, it may still be retrieval (there may be a better document in the database that you fail to find), but generation is also a likely culprit.

How do you evaluate the retrieval component?

To evaluate retrieval, you need to create a dataset of queries paired with their relevant documents. The most efficient way to generate this is synthetically by extracting key facts from your corpus and then generating questions that those facts would answer. This reverse process gives you query-document pairs for measuring retrieval performance without manual annotation.

The process is simple:

  1. Take a document from your corpus
  2. Extract the key facts from it
  3. Generate questions that those facts would answer
  4. Use the original document as the “correct” answer for those queries

This approach is much faster than manually creating query-document pairs, and it ensures that your evaluation dataset covers the actual content in your corpus. You can use the same LLM that powers your system to generate synthetic queries, making the evaluation more realistic.
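
Here is a sketch of that generation step using the OpenAI Python client; the model name, prompt wording, and helper function are illustrative choices, not a prescribed recipe:

    from openai import OpenAI

    client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

    PROMPT = (
        "Extract the key facts from the document below, then write one question "
        "per fact that the document would answer. Return only the questions, "
        "one per line.\n\nDocument:\n{document}"
    )


    def synthetic_queries(document: str, model: str = "gpt-4o-mini") -> list[str]:
        """Generate queries for which `document` is the expected retrieval result."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(document=document)}],
        )
        lines = response.choices[0].message.content.splitlines()
        return [line.strip() for line in lines if line.strip()]


    # Each generated question is paired with its source document as the ground truth:
    # dataset = [(question, doc_id) for question in synthetic_queries(doc_text)]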

To choose the right retrieval metric, take a look at my How can we measure improvement in information retrieval quality in RAG systems? article.

How do you evaluate the generation component?

Evaluating the generation component means checking whether the LLM uses the retrieved context, whether it hallucinates, and whether it answers the question. You need human feedback and error analysis to identify failure modes; then you can build LLM-as-judge evaluators and validate them against human annotations.

Jason Liu suggests having 6 RAG eval metrics after you get the basics of retrieval right.

  1. Context Relevance - does the retrieved context address the question?
  2. Faithfulness/Groundedness - does the generated answer restrict itself to the claims stated in the provided context?
  3. Answer Relevance - does the generated answer address the question?
  4. Context Support Coverage - is the provided context sufficient to support every claim in the generated answer?
  5. Question Answerability - is it even possible to answer the question given the context we found?
  6. Self-Containment - can the original question be inferred from the generated answer?

In addition to those, Hamel Husain reminds us to include metrics relevant to the business domain. For example, whether the AI system correctly distinguishes between adult and child drug dosages in a medical context.
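
As an illustration, here is how Faithfulness/Groundedness could be phrased as a binary check for an LLM judge; the prompt wording and the parsing helper are mine, not Jason Liu’s:

    FAITHFULNESS_PROMPT = """You are evaluating a RAG system's answer.

    Context:
    {context}

    Question:
    {question}

    Answer:
    {answer}

    Does the answer restrict itself to claims supported by the context?
    Reply with a single word on the first line: PASS if every claim is supported,
    FAIL otherwise. Then briefly explain any unsupported claim."""


    def parse_verdict(judge_output: str) -> bool:
        """Map the judge's free-text reply to the same binary label the reviewers use."""
        first_line = judge_output.strip().splitlines()[0].upper()
        return first_line.startswith("PASS")

The same pattern, a binary verdict plus a short explanation, applies to the other five metrics.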


Automating the evaluation process with an LLM-as-judge

First of all, it’s fine to use the same model for inference and automated evaluation. Those are two different tasks, and the models have no internal memory. Inference and evaluation won’t affect each other.

What you should consider, though, is the cost of the evaluation and the complexity of the model. Hamel Husain suggests starting with the most advanced model you can afford and optimizing for cost later, once you have a prompt that aligns the judge with human judgments.

Likely, you are going to need several iterations of prompt improvement to align the judge with the human annotators, and you will have to evaluate that alignment manually (because if you automate it, you have yet another automation layer to evaluate). What’s worse, after spending some time evaluating the answers, you may notice the human annotators weren’t perfect either. You may see inconsistencies, errors, and overlooked issues. If that happens, you must go back to the first step: revise the evaluation criteria and have the annotators prepare the evaluation dataset again. Otherwise, you will build an LLM-as-judge that’s perfectly aligned with the wrong expectations.
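
A minimal sketch of that alignment check, assuming the human and judge verdicts are stored as parallel lists of booleans (True means the answer was marked correct):

    def alignment_report(human: list[bool], judge: list[bool]) -> dict[str, float]:
        """Compare LLM-as-judge verdicts against human annotations."""
        assert len(human) == len(judge)
        overall = sum(h == j for h, j in zip(human, judge)) / len(human)
        # Agreement on the answers humans marked as incorrect matters most:
        # a judge that misses real failures is useless for error analysis.
        failures = [j for h, j in zip(human, judge) if not h]
        caught = sum(not j for j in failures) / len(failures) if failures else float("nan")
        return {"overall_agreement": overall, "failures_caught": caught}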

Evaluating multi-turn conversations

Don’t overthink it. Start by verifying whether the user’s ultimate goal is met. If not, find the step where the interaction failed. When you see it, try to reproduce it with the simplest possible prompt. Maybe your AI system doesn’t need a large context and several messages to fail. The last message before the failure may be enough to trigger the same error.

Generating test cases for multi-turn conversations

When testing multi-turn conversations, you have two main approaches to generate test cases.

User simulation with LLMs

You can simulate users with another LLM to create realistic multi-turn conversations. The simulated user asks questions, follows up, and responds naturally to your AI system’s answers. This approach gives you full control over the conversation flow and lets you test various scenarios systematically.

However, user simulation has limitations. The simulated conversations may not capture the nuances of real user behavior, and the quality depends heavily on the model you use for simulation. As models improve, this approach is becoming more reliable, but it’s still an area to watch closely.
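
A minimal sketch of such a simulation loop; it assumes your system is exposed as an assistant_reply(history) callable and uses the OpenAI client for the simulated user, with an illustrative persona prompt, model name, and stop word:

    from openai import OpenAI

    client = OpenAI()

    USER_PERSONA = (
        "You are simulating a user of our product. Pursue this goal: {goal}. "
        "Reply with what the user would say next, or say DONE if the goal is met."
    )


    def simulate_conversation(assistant_reply, goal: str, max_turns: int = 6) -> list[dict]:
        """Drive the system under test with an LLM playing the user."""
        history: list[dict] = []
        user_message = goal  # the simulated user opens with their goal
        for _ in range(max_turns):
            history.append({"role": "user", "content": user_message})
            history.append({"role": "assistant", "content": assistant_reply(history)})
            # From the simulated user's point of view the roles are reversed:
            # it responds to the assistant's messages as if they were addressed to it.
            flipped = [
                {"role": "user" if m["role"] == "assistant" else "assistant", "content": m["content"]}
                for m in history
            ]
            reply = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "system", "content": USER_PERSONA.format(goal=goal)}, *flipped],
            )
            user_message = reply.choices[0].message.content
            if "DONE" in user_message:
                break
        return history

The returned history then goes through the same review process as single-turn answers.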

N-1 testing with real conversations

The N-1 testing approach uses actual conversation prefixes from your system. You take a real conversation that has N turns, provide the first N-1 turns to your AI system, and test what happens next. This method often works better because it uses genuine conversation patterns rather than synthetic interactions.

The trade-off is flexibility. N-1 testing doesn’t test the full conversation flow since you’re only evaluating the final response. But it gives you more realistic test cases since you’re working with actual user interactions.
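
A minimal sketch, assuming conversations are logged as lists of role/content messages and reusing the assistant_reply(history) assumption from the simulation sketch:

    def n_minus_one_case(conversation: list[dict]) -> tuple[list[dict], dict]:
        """Split a logged conversation into a test prefix and the reference final turn."""
        assert conversation and conversation[-1]["role"] == "assistant"
        return conversation[:-1], conversation[-1]


    def run_n_minus_one(assistant_reply, conversation: list[dict]) -> dict:
        """Replay the first N-1 turns and capture a fresh final response for review."""
        prefix, reference = n_minus_one_case(conversation)
        candidate = assistant_reply(prefix)
        # The candidate vs. reference pair goes to a human reviewer or an
        # aligned LLM-as-judge for a binary correct/incorrect verdict.
        return {"prefix": prefix, "reference": reference["content"], "candidate": candidate}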

Choose the approach based on your needs. If you need to test specific conversation patterns or edge cases, user simulation gives you more control. If you want to evaluate performance on real user behavior, N-1 testing provides more authentic test scenarios.

Sources

  1. Hamel Husain, “Error Analysis is all you need” (video)
  2. Jason Liu on RAG evaluation metrics

