Troubleshooting AI Agents: Advanced Data-Driven Techniques for Improving AI Agent Performance

Building an AI agent by copying code from an online tutorial is relatively easy. However, you soon discover that your agent isn’t perfect. You may want to blame AI, but the truth is that the problem is in your data, which is great and terrible news at the same time. It’s great because you are in control of the process. It’s terrible because you need to gather data, and preparing the evaluation dataset takes time.

Table of Contents

  1. An Example AI Agent
  2. What Causes Poor Performance of AI Agents?
  3. What Data Do You Need to Gather?
  4. What Metrics Should You Track?
    1. User Satisfaction
    2. LLM Evaluation Metrics
    3. Data Retrieval Metrics
  5. Techniques to Improve AI Text Generation
  6. Techniques to Improve Data Retrieval and Reranking
  7. Techniques to Improve AI Decision Making
  8. Conclusion

An Example AI Agent

Let’s assume you are building an AI agent that can answer questions about a specific topic using a database of documents. The agent receives the user’s message and generates a database query (or multiple queries); the data gets retrieved from the database and passed back to the agent. The agent then has to decide whether the data is enough to answer the question and either return the answer or keep querying the database. It’s a fairly standard setup: an AI agent with a vector database of documents exposed as a tool.
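
Here is a minimal sketch of that loop in Python. The `call_llm` and `search_documents` functions are hypothetical stand-ins for your LLM client and vector database; a real agent framework handles most of this plumbing for you.

```python
# Minimal sketch of the agent loop described above.
# `call_llm` and `search_documents` are hypothetical stand-ins for your
# LLM client and vector database client.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    question: str
    retrieved: list = field(default_factory=list)


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")


def search_documents(query: str) -> list[str]:
    raise NotImplementedError("plug in your vector database client here")


def run_agent(question: str, max_iterations: int = 5) -> str:
    state = AgentState(question=question)
    for _ in range(max_iterations):
        # The agent turns the user's message into a database query.
        query = call_llm(f"Write a search query for: {state.question}")
        state.retrieved.extend(search_documents(query))

        # The agent decides whether the retrieved data is enough.
        decision = call_llm(
            "Question: " + state.question + "\n"
            "Documents: " + "\n".join(state.retrieved) + "\n"
            "Answer YES if the documents are enough to answer, otherwise NO."
        )
        if decision.strip().upper().startswith("YES"):
            break

    # Answer synthesis based on whatever was retrieved.
    return call_llm(
        "Answer the question using only the documents below.\n"
        "Question: " + state.question + "\n"
        "Documents: " + "\n".join(state.retrieved)
    )
```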

What Causes Poor Performance of AI Agents?

How many things can break in something as simple as an AI agent with a single tool and a vector database? At least six:

  • Query Generation

    When you use a vector database as a tool for an AI agent, you don’t pass the entire user’s message to the database. Good. That’s already a decent start. However, it doesn’t mean the AI agent always generates a correct query. The agent may miss important information from the user’s message, get fixated on a specific term and produce a query that is too specific, or, on the other hand, generate a query that is too general.

  • Data Retrieval

    The data retrieval part of the process may fail in several ways, but the most common are:

    • Failing to match the documents because the vocabulary of the queries is too different from the vocabulary of the searched documents.
    • Perfectly matching a document that describes a problem but failing to find the solution because your chunking strategy put the solution in a different document.
    • Ignoring the crucial search criteria from the user’s message simply because they don’t fit the concept of matching similar text using distance between document embedding vectors. Also known as “Who needs metadata?”
    • Retrieving too many documents and being unable to decide which ones are the most relevant.
  • Document Reranking

    After we retrieve the documents, we probably have too many, especially if we combine several retrieval methods. Therefore, we need to rerank the documents to get the most relevant ones. For reranking, we use an LLM, so obviously, the model may pick the wrong documents.

  • Interaction Loop Control

    The agent decides which tools to call and how many times. Depending on the framework or libraries you use, the agent may be allowed to call the tool multiple times. If we limit the number of calls to strictly one call every time, we no longer have an agent (it’s an AI workflow with fixed steps). If we allow the agent to call the tool multiple times, we may end up with an infinite loop. But between those two extremes, there is a lot of room for even more errors. Does the agent correctly recognize when it has enough data? Does the agent call the tool multiple times with the same query? Does it consistently ignore one of the tools?

  • Answer Synthesis

    At some point, the agent has to synthesize the final answer. Besides the obvious problem of generating an incorrect answer, we should also consider whether the answer is based on the retrieved documents, adheres to the content policy we defined, and maintains consistency with the previous messages in the conversation.

  • Performance and Cost

    In addition to all the quality issues, we should also consider performance. How fast is the agent? How much does it cost? How much data does it process? How much do we pay for the data processing pipeline needed to gather the data in the database?

What Data Do You Need to Gather?

You will need detailed tracking. The data you gather should include not only the input and final output of the agent but also the tool parameters and the tool output. If your underlying database returns a relevancy score, you should log it for every document. Make sure your tracking library also records the time elapsed for each step.

The most crucial feature of the logging tool is the ability to track the entire agent invocation, so you need a correlation ID with the same value for the first LLM call, the tool calls, the subsequent LLM calls, and the call that generates the final answer. As the conversation history is a part of the input, having an identifier for a conversation is not crucial but still convenient. Tracking the user ID may also help if a user complains about the agent but can’t describe what went wrong.
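
If your tracing library doesn’t do this out of the box, a minimal sketch of such tracking could look like the code below. The `traced_step` helper and the field names are just an illustration, not a specific library’s API; real tracing tools (e.g., OpenTelemetry) give you correlation IDs and timings for free.

```python
# Minimal tracking sketch: every step logs the same correlation ID and its
# elapsed time. The field names are only an example.
import json
import logging
import time
import uuid
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent")


@contextmanager
def traced_step(correlation_id: str, step: str, **extra):
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info(json.dumps({
            "correlation_id": correlation_id,
            "step": step,
            "elapsed_ms": round((time.perf_counter() - start) * 1000, 2),
            **extra,
        }))


correlation_id = str(uuid.uuid4())  # one ID for the whole agent invocation
with traced_step(correlation_id, "tool_call", tool="vector_search", query="example query"):
    pass  # call the tool here and log its parameters and output
```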

That’s the tracking in the ideal world. However, various regulations may prevent you from gathering all of the data. Sometimes, you need the user’s consent to collect their conversation data. And those who report problems with the agent never agree to share their data, do they?

A general hint: track as much as you can.

What Metrics Should You Track?

It’s not enough to gather data. You need to analyze your information and determine if the agent works correctly according to objective criteria. As Jason Liu says, the worst update you can give as an AI engineer is something like: “We made some improvements to the model. It seems better now.”

If we want to make informed decisions based on data, not just a gut feeling, we must define metrics. We will track many of them, but we should designate one or two we consider the most important. At any given time, we focus on improving the chosen metric while keeping the other metrics in mind and ensuring they don’t get much worse.

Most importantly, you track multiple metrics to know what happens, not to cherry-pick the one metric that looks better when you change something. I can’t believe I had to write the previous sentence, but I have seen machine learning engineers more creative than mafia accountants.

User Satisfaction

We build AI agents for the users, not for the sake of creating something we can brag about. We want to make the agent useful for the right people. What does that mean?

First of all, you have to know who your target users are. If you aren’t building a generic agent, you probably have a specific group in mind. Deciding whether an interaction was started by the kind of person we want to serve is crucial when we build a domain-specific agent. For example, if your agent is supposed to help write code, you probably don’t care about the data gathered when someone uses the agent to write a blog post. They may think code is text and a blog post is text, so they are the same thing, but you know better. The decision becomes quite fuzzy when they write a blog post about code, though.

After you decide if the conversation is something you care about, you need to know if the user is satisfied with the agent’s response. Every user interface should have a feedback mechanism. If you show the source documents or the tool calls, you should let the user rate those, too. Of course, most people won’t bother rating the response, so you need a proxy metric.

A decent proxy for user satisfaction is whether users come back to use the agent again. Another proxy signal is seeing them ask the same question in different ways multiple times, which probably means they didn’t get a satisfactory answer.

A word of warning if you build a paid agent: “They are still paying us, so they must be satisfied” is a good metric only if they actually use the service. If they pay and do not use the agent, they may have simply forgotten about the subscription. Don’t fool yourself with overly optimistic assumptions.

LLM Evaluation Metrics

While measuring user satisfaction is quite fuzzy and can be subjective, we now move into the realm of objective metrics. The LLM evaluation isn’t 100% objective yet, especially if you use an LLM to evaluate an LLM, but it’s still more concrete than guessing users’ opinions based on their behavior.

If you decide to calculate the metrics using an LLM evaluator, mark the data as AI-generated so you can distinguish it from manually evaluated data. Both LLMs and humans can be wrong and have their biases, and you may report all evaluations averaged together, but just as you track the usage data, you should track the source of evaluation data, too.

Now, what do we measure?

  • Answer correctness

    The metric is quite obvious, but what isn’t obvious is that we can treat it as a classification metric and calculate the agent’s F1 score, which allows us to compare the performance between different agent versions.

  • Answer Similarity to Ground-Truth (Semantic Similarity)

    We calculate the cosine similarity between the agent’s answer and the ground-truth answer. We get a value between -1 and 1, but if we set a threshold, we can turn the score into binary information: whether or not the answer is similar to the ground truth (see the sketch after this list). The “How to Choose a Threshold for an Evaluation Metric for Large Language Models” research paper shows how to choose a threshold for any metric, so you can use it for answer similarity too.

  • Faithfulness

    Faithfulness measures how well the LLM uses the provided context while generating the answer. In a document retrieval agent, faithfulness measures how well the agent uses the retrieved documents to answer the question. A model with a high faithfulness score is less likely to hallucinate.

  • Adherence to Content Policy

    Besides answering questions correctly, the agent should also behave properly. We use content policy evaluation to check that the agent doesn’t generate hateful content or any other content that violates the policy we defined.

  • Answer Consistency

    It’s almost like answer similarity, but we compare the answers to each other instead of comparing them to the ground truth. We may not want an agent that answers the same question in exactly the same way every time, but we usually want some consistency.
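
Here is a minimal sketch of the semantic similarity check described above. The `embed` function is a hypothetical stand-in for your embedding model, and the threshold value is only a placeholder you would tune for your dataset.

```python
# Semantic similarity as a binary metric: cosine similarity + threshold.
# `embed` is a hypothetical stand-in for your embedding model client.
import numpy as np


def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in your embedding model here")


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_similar(answer: str, ground_truth: str, threshold: float = 0.8) -> bool:
    # The threshold is dataset-specific; the paper mentioned above describes
    # a principled way to choose it.
    return cosine_similarity(embed(answer), embed(ground_truth)) >= threshold
```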

Data Retrieval Metrics

The good news about data retrieval is that we know how to measure it. (Probably because five minutes after people learned how to use permanent storage, they started to have the problem of “where did I put that data?”.)

We want to retrieve a certain number of documents for a given query (generated by the agent based on the user’s message). How many do we want? Usually, no more than 5. First, only a few will be relevant. Second, we can’t put too much data in the prompt. LLMs have a token limit, and too much input data makes LLM calls expensive.

All metrics are measured @k where k is the number of documents retrieved. Jason Liu proposes a few interpretations of what specific values of k mean:

  • @5 - we want to show the user only the most relevant documents
  • @25 - we are testing if the document reranker works well
  • @50 - we are checking if the document retrieval system works fine
  • @100 - we are checking if we have a relevant document anywhere in our database

Usually, we are interested in the final set of documents, so we measure the metric after the reranking step. However, if you debug data retrieval, you can split the metric into two for both the retrieval and reranking steps.

What can we measure?

  • Mean Average Recall @k

    The metric tells us how many of the relevant documents are in the top k retrieved documents compared to the total number of relevant documents. The total number of relevant documents is tricky because we may not know it. If you have a large database, you cannot check all of the documents and tell whether they are relevant for every query in your test dataset. Therefore, this metric is useful only when you have a small, handcrafted database of documents.

  • Mean Average Precision @k

    The metric tells us how many of the retrieved documents are relevant compared to the total number of retrieved documents. It is pretty easy to calculate and understand: out of the k retrieved documents, how many are relevant? In machine learning, we typically use precision in combination with recall, but recall for retrieval may be impossible to calculate, so we need a different metric.

  • Mean Reciprocal Rank @k

    The metric tells us not only whether what we retrieved is relevant but also whether a relevant document is at the top of the search results. The top document may not be the most relevant document in existence, but that’s not what we care about when we calculate MRR. We want a relevant document at the top of the search results; it doesn’t matter if better documents exist. (A minimal sketch computing precision, recall, and reciprocal rank @k follows this list.)

  • Context Precision (Context relevance)

    Given the context document, the user’s question, and the expected ground-truth answer (but not the agent’s answer!), we decide whether the context is relevant to the question and the answer. The metric tells us whether a given document helps answer the question. We will need a human evaluator or an LLM to decide if the document is relevant.
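
As promised above, here is a minimal sketch of precision, recall, and reciprocal rank @k. Documents are represented by their IDs, and the relevance judgments come from your test dataset.

```python
# Minimal sketch of precision, recall, and reciprocal rank @k for one query,
# plus averaging over a test set. Documents are represented by their IDs.


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Requires knowing *all* relevant documents, hence the small,
    # handcrafted database mentioned above.
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / len(relevant)


def reciprocal_rank_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, results: list[tuple[list[str], set[str]]], k: int) -> float:
    # `results` holds (retrieved document IDs, relevant document IDs) per query.
    return sum(metric(retrieved, relevant, k) for retrieved, relevant in results) / len(results)
```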

Techniques to Improve AI Text Generation

It’s unlikely that your model struggles equally with every kind of text. Most likely, some topics are handled better than others. So, the first step is to identify the poorly handled topics and focus on improving them if the number of queries on those topics is significant. We want to focus on popular topics to improve the agent for a large audience.

To find those topics, start with clustering and topic modeling. We want to split the queries into groups and detect each group’s topic. If you do that on your test dataset, you can also determine which topics are poorly handled. Comparing the test dataset performance metrics with the topic popularity in the production data will tell you which topics are the most important.
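
A minimal sketch of query clustering, assuming a hypothetical `embed_queries` function and using KMeans as one possible clustering algorithm:

```python
# Sketch of clustering user queries by topic. `embed_queries` is a
# hypothetical stand-in for your embedding model; KMeans and the cluster
# count are just one possible choice.
import numpy as np
from sklearn.cluster import KMeans


def embed_queries(queries: list[str]) -> np.ndarray:
    raise NotImplementedError("plug in your embedding model here")


def cluster_queries(queries: list[str], n_clusters: int = 10) -> dict[int, list[str]]:
    embeddings = embed_queries(queries)
    labels = KMeans(n_clusters=n_clusters, random_state=42, n_init=10).fit_predict(embeddings)
    clusters: dict[int, list[str]] = {}
    for query, label in zip(queries, labels):
        clusters.setdefault(int(label), []).append(query)
    return clusters

# After clustering, label each cluster's topic (manually or with an LLM)
# and compare per-topic evaluation scores with per-topic traffic.
```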

What can you do to improve text generation in LLMs? Start with prompt engineering: provide better instructions and examples. The technique of explaining what you want by showing examples is called in-context learning, and because the only thing you need to do is tweak the prompt, you can easily test multiple versions in a short time.
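
For illustration, an in-context learning prompt could look like the sketch below; the examples are made up and should be replaced with real ones from your domain.

```python
# In-context learning: a prompt with a few input/output examples.
# The example content below is made up; use real examples from your domain.
FEW_SHOT_PROMPT = """You answer questions using the provided documents only.

Example 1
Question: How do I rotate an API key?
Answer: Go to Settings -> API Keys, click "Rotate", and update the key in your clients.

Example 2
Question: What is the rate limit for the search endpoint?
Answer: The documents don't mention a rate limit for the search endpoint.

Question: {question}
Documents: {documents}
Answer:"""
```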

When you gather sufficient query data, you can start thinking about fine-tuning or training your own LLM, but that shouldn’t be the first thing you try. You will also need a source of expected answers to train the model. If you are considering fine-tuning, it’s presumably because an LLM can’t generate those answers well enough, so you will need human annotators.

Techniques to Improve Data Retrieval and Reranking

Don’t stick to vector search-based document retrieval. Use a combination of retrieval methods. You can combine vector search with full-text search and use a reranking model to determine the order of retrieved documents.
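
One common way to merge ranked lists from different retrieval methods before reranking is reciprocal rank fusion. Here is a minimal sketch, assuming each retriever returns a ranked list of document IDs:

```python
# Reciprocal rank fusion: merge ranked lists from several retrievers into
# one list, which you can then pass to a reranking model.
from collections import defaultdict


def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Usage: merge vector search and full-text search results.
merged = reciprocal_rank_fusion([
    ["doc_3", "doc_1", "doc_7"],   # vector search results
    ["doc_1", "doc_9", "doc_3"],   # full-text search results
])
```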

I have written a separate article about advanced retrieval techniques. Below, I include a summary of the most popular methods.

In AI workflows where the user’s query is directly passed to the vector database, we can use query expansion. An LLM generates several versions of the user query using synonyms or domain-specific terms. An AI agent generates the query itself, but we can still use query expansion, either internally in the retrieval step or by instructing the agent to generate a list of queries.
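
A minimal sketch of query expansion inside the retrieval step; `call_llm` and `search_documents` are hypothetical stand-ins for your LLM client and vector database:

```python
# Query expansion sketch: ask the LLM for several phrasings of the query and
# search with all of them.


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")


def search_documents(query: str) -> list[str]:
    raise NotImplementedError("plug in your vector database client here")


def expanded_search(query: str, n_variants: int = 3) -> list[str]:
    response = call_llm(
        f"Rewrite the search query below in {n_variants} different ways, "
        f"using synonyms and domain-specific terms. One query per line.\n\n{query}"
    )
    variants = [query] + [line.strip() for line in response.splitlines() if line.strip()]
    documents: list[str] = []
    for variant in variants:
        for doc in search_documents(variant):
            if doc not in documents:  # deduplicate before reranking
                documents.append(doc)
    return documents
```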

When we retrieve data from a vector database, we compare a question to the documents, hoping that the question and a relevant document are similar. The database contains documents, so perhaps we should compare documents with other documents. This observation is the basis of the Hypothetical Document Embedding technique. Instead of comparing the documents to the question, we generate a synthetic document from the question and compare the generated document with the documents in the database. The generated document may contain incorrect information, but it should use the same vocabulary as the correct answer.
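
A minimal sketch of the technique, assuming hypothetical `call_llm` and `search_by_text` helpers:

```python
# Hypothetical Document Embedding (HyDE) sketch: generate a fake answer and
# search with it instead of the question.


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")


def search_by_text(text: str) -> list[str]:
    raise NotImplementedError  # embeds `text` and runs a vector search


def hyde_search(question: str) -> list[str]:
    hypothetical_document = call_llm(
        "Write a short passage that answers the question below. "
        "It may be wrong, but it should sound like a real document.\n\n" + question
    )
    return search_by_text(hypothetical_document)
```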

Often, the documents we have follow a specific structure. For example, in the first paragraph, the author describes the problem, and in the second paragraph, the author describes the solution. If we split the data by paragraphs, we can perfectly match the problem description but fail to find the solution. To address the issue, we have the Parent Document Retrieval technique. We match the documents by their chunks, but the database returns the entire document.
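
A minimal sketch of the idea with an in-memory mapping from chunks to parent documents; most vector databases let you implement the same thing with metadata:

```python
# Parent Document Retrieval sketch: index small chunks for matching, but
# return the whole parent document.
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    parent_id: str


def split_into_chunks(doc_id: str, text: str) -> list[Chunk]:
    # One chunk per paragraph; any chunking strategy works as long as the
    # chunk keeps a reference to its parent document.
    return [Chunk(paragraph, doc_id) for paragraph in text.split("\n\n") if paragraph.strip()]


def retrieve_parents(matched_chunks: list[Chunk], documents: dict[str, str]) -> list[str]:
    parent_ids: list[str] = []
    for chunk in matched_chunks:
        if chunk.parent_id not in parent_ids:
            parent_ids.append(chunk.parent_id)
    return [documents[parent_id] for parent_id in parent_ids]
```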

We often overlook obvious solutions. We have a vector database, so we assume everything needs to be a vector. That’s not true. We can store metadata to limit the search space. The problem with metadata is that we have to modify the data ingestion pipeline and predict what metadata will be useful. Then, we must instruct the agent to use the metadata (which becomes yet another area you must track and measure).
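
A sketch of a metadata-filtered search; `vector_store.search` is a hypothetical client, and real vector databases use different (but similar) filter syntaxes:

```python
# Metadata filtering sketch: narrow the search space before the vector search.
# `vector_store.search` is a hypothetical client; the exact filter syntax
# differs between vector databases.


def search_release_notes(vector_store, query: str, product: str, version: str) -> list[dict]:
    return vector_store.search(
        query=query,
        filters={
            "document_type": "release_notes",  # ingested as metadata
            "product": product,
            "version": version,
        },
        limit=5,
    )
```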

When everything else fails, we can change the embedding model to a more domain-specific one. If such a model doesn’t exist or doesn’t perform well enough, you can train your own embedding model. Dagshub has a great article on how to train a custom embedding model.

Techniques to Improve AI Decision Making

Because we build an AI agent, we have a problem with decision-making. AI-based workflows are simpler: there is an LLM performing some tasks, but our code determines the execution path. With agents, the LLM makes the decisions, so we need a way to steer those decisions.

The simplest method, which is also quite powerful (after all, it became the basis of the recent reasoning models), is chain of thought prompting. In the examples provided in the prompt, we show the model step by step how to solve the problem or make a decision. In the LLM’s output, we want to see the exact step-by-step reasoning followed by the final decision based on the reasoning.
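
For illustration, a chain-of-thought example embedded in a prompt could look like this (the example content is made up):

```python
# Chain-of-thought prompting: the in-prompt example shows the reasoning steps,
# and the model is asked to produce the same structure. The example is made up.
COT_PROMPT = """Decide whether the retrieved documents are enough to answer the question.

Example
Question: How do I reset my password?
Documents: [1] describes creating an account, [2] describes changing the email address.
Reasoning: The question is about password reset. Document [1] covers account creation,
document [2] covers email changes. Neither mentions passwords, so the data is not enough.
Decision: SEARCH_AGAIN

Question: {question}
Documents: {documents}
Reasoning:"""
```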

In the “TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage” research paper, the authors show how to use LLMs to plan task execution and tool usage. The planning step may be part of the agent’s prompt, or you can use an LLM to generate the plan before calling the agent. In the latter case, the pre-defined plan is passed to the agent together with the user’s message.
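
A minimal sketch of the second variant, where a separate LLM call produces the plan before the agent runs; `call_llm` and `run_agent` are hypothetical stand-ins:

```python
# Sketch of a separate planning step: an LLM produces a plan first, and the
# plan is passed to the agent together with the user's message.


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")


def run_agent(message: str) -> str:
    raise NotImplementedError("plug in your agent here")


def answer_with_plan(user_message: str) -> str:
    plan = call_llm(
        "Available tools: search_documents(query).\n"
        "Write a numbered plan of tool calls needed to answer the message below. "
        "Do not answer the message itself.\n\n" + user_message
    )
    return run_agent(f"Plan:\n{plan}\n\nUser message:\n{user_message}")
```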

Conclusion

As we can see, there are many ways to improve the performance of AI agents. Some are simple, like prompt engineering, and some are complex, like training a custom embedding model. Some require data preparation, like metadata-based search, and some repeat the same step multiple times, like query expansion.

You need to gather test data and track metrics. You can’t improve the agent if you don’t know what’s wrong. By “what is wrong,” I mean knowing which part of the process causes the problem.



Do you need help building a reliable AI agent for your business?
You can hire me!

