When AI engineers build applications, they focus on the wrong things: choosing models, tweaking prompts, installing metric libraries, and automating evaluations. Manual reviews are dismissed because “we’re here to automate.” The most effective way to improve an AI application is to look at your fucking data. If only I had known this sooner.
Table of Contents
- How to Evaluate the AI Results
- How to Categorize Failure Modes and Create Metrics
- How to Automate Tests
- How to Iterate Faster
- Sources and Related Articles
In his “A Field Guide to Rapidly Improving AI Products”, Hamel Husain describes how to iteratively improve AI applications through five simple steps:
- Manually review the data and write feedback on what you like and don’t like.
- Categorize the types of failures you see in the data.
- Build specific tests and metrics to catch these issues.
- Measure the improvement using these metrics.
- Focus on running more experiments, learning faster, and making quick iterations.
No fancy tools. No automation (at least not at the beginning). Only the dreaded manual work.
I wish Hamel had written that article two years ago when I was building an AI-based search application for finding similar technical issues from the past. The idea was to reuse solutions when users asked about recurring problems.
For a long time, I calculated generic metrics and tweaked prompts only to gain minuscule improvements. Everything changed when I realized that most incorrectly handled cases stemmed from two AI issues: misclassification and getting lost in the noise (where the AI couldn’t find the relevant information in long issue descriptions and focused on the wrong details instead, a common problem with GPT-3). Misclassification was easy to solve with few-shot in-context learning, a technique where you provide the AI with relevant examples directly in the prompt. The second failure mode, however, couldn’t be fixed with prompt adjustments. I had to rework the entire AI pipeline and add processing steps.
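If you haven’t used few-shot in-context learning before, here is a minimal sketch of the idea: labeled examples go directly into the prompt before the new input. The categories, example issues, and model name below are invented for illustration, and the OpenAI Python client is just one way to make the call.

```python
# Minimal sketch of few-shot classification (labels and examples are illustrative).
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXAMPLES = [
    ("Deployment fails with 'image pull backoff' on the staging cluster", "infrastructure"),
    ("Users report the search page times out after 30 seconds", "performance"),
    ("Login button does nothing after the last frontend release", "regression"),
]

def classify_issue(issue_text: str) -> str:
    # Build the prompt: instructions, then labeled examples, then the new issue.
    examples = "\n".join(f"Issue: {text}\nCategory: {label}" for text, label in FEW_SHOT_EXAMPLES)
    prompt = (
        "Classify the technical issue into one of: infrastructure, performance, regression.\n\n"
        f"{examples}\n\nIssue: {issue_text}\nCategory:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```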
How to Evaluate the AI Results
Start by analyzing errors. They will reveal where to invest your time and what to measure. To do this, you must identify incorrect results by reviewing the AI’s input and output.
You may not find a convenient tool for gathering feedback. Existing tools are often too generic or won’t fit your specific needs. Fortunately, you don’t need anything fancy. Create a simple single-page interface that displays the context used by the AI, the AI’s response, and a free-form feedback field. With Cursor, I built a working application in 10 minutes and added features as needed.
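To give a sense of how little is needed, here is a sketch of such a review tool using Streamlit. The file names and record schema (`context`, `response` fields in a JSONL file) are assumptions for illustration, not the tool I actually built.

```python
# feedback_app.py - minimal review interface sketch.
# Assumes reviews.jsonl with one JSON object per line: {"context": ..., "response": ...}
import json
from pathlib import Path

import streamlit as st

RECORDS = [json.loads(line) for line in Path("reviews.jsonl").read_text().splitlines()]

if "idx" not in st.session_state:
    st.session_state.idx = 0

record = RECORDS[st.session_state.idx]

st.subheader(f"Example {st.session_state.idx + 1} / {len(RECORDS)}")
st.markdown("**Context given to the AI**")
st.write(record["context"])
st.markdown("**AI response**")
st.write(record["response"])

feedback = st.text_area("What do you like or dislike about this response?")

if st.button("Save and next"):
    # Append free-form feedback and move to the next example.
    with open("feedback.jsonl", "a") as f:
        f.write(json.dumps({"idx": st.session_state.idx, "feedback": feedback}) + "\n")
    st.session_state.idx = min(st.session_state.idx + 1, len(RECORDS) - 1)
    st.rerun()
```

Run it with `streamlit run feedback_app.py` and extend it only when you actually miss a feature.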
In your feedback, write specific comments about what works and what doesn’t. Don’t try to debug the AI. Instead, record what you observe. For your first iterations, an “I know it when I see it” approach is sufficient. Though tempting, avoid using an LLM to generate this feedback (at least initially).
When should you stop evaluating? Once you stop learning new things. Don’t worry about missing errors because you’ll conduct another review in your next iteration.
In my search application project, I initially used Excel for error analysis. It was a mess. I scrolled through a huge spreadsheet with a single column of feedback, constantly switching between windows to see AI responses, my notes, and the relevant context. At some point, my screen was full of text blurring into an incomprehensible mess, and I wasn’t even sure I was still editing the right cell. Learn from my mistake: build a simple, dedicated tool.
How to Categorize Failure Modes and Create Metrics
Identifying patterns in errors is your next crucial step. If patterns aren’t immediately obvious, use an LLM to generate error categories from your feedback, then ask AI to assign feedback to these categories. Review the results and refine the categories until they make sense. You might need to merge similar categories or split overly broad ones. For guidance on organizing categories, see my article on document clustering and topic modeling.
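Asking an LLM to propose the categories can be as simple as pasting your feedback notes into a prompt. The sketch below shows one way to do it; the prompt wording, model name, and the 3–7 category range are assumptions you should adjust to your data.

```python
# Sketch: ask an LLM to propose failure categories from free-form feedback notes.
from openai import OpenAI

client = OpenAI()

def propose_categories(feedback_notes: list[str]) -> str:
    notes = "\n".join(f"- {note}" for note in feedback_notes)
    prompt = (
        "Below are reviewer notes about an AI application's mistakes.\n"
        "Group them into 3-7 failure categories. For each category, give a short name "
        "and a one-sentence description.\n\n" + notes
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```

Review the proposed categories yourself, then use a second prompt (or the same pattern) to assign each piece of feedback to one of them.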
Once you’ve established failure modes, create specific metrics for each one. These metrics should track how frequently each failure occurs and show how well you’ve fixed each problem. Such data-specific metrics prove far more effective than generic measures like “truthfulness” or “helpfulness.” Make these metrics even more valuable by determining which failure modes matter most.
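Once each reviewed example carries failure-mode labels, the frequency metric is just counting. A minimal sketch, assuming each review record has a `failure_modes` list (empty when the answer was fine):

```python
# Sketch: per-failure-mode rates from labeled review results (field names are illustrative).
from collections import Counter

def failure_mode_rates(reviews: list[dict]) -> dict[str, float]:
    # Each review has "failure_modes": a list of labels, empty if the answer was acceptable.
    counts = Counter(mode for r in reviews for mode in r["failure_modes"])
    total = len(reviews)
    return {mode: count / total for mode, count in counts.items()}

reviews = [
    {"failure_modes": ["misclassification"]},
    {"failure_modes": []},
    {"failure_modes": ["lost_in_noise", "misclassification"]},
]
print(failure_mode_rates(reviews))
# e.g. {'misclassification': 0.666..., 'lost_in_noise': 0.333...}
```

Track these rates across iterations: a fix is working when its failure mode’s rate drops while the others stay flat.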
To prioritize failure modes, classify user intents into categories and count which occur most frequently. Focus on fixing common failures in popular intents (or, realistically, the failure mode that annoys your CEO most).
When a client tells me everything is important, I don’t ask what they want fixed next. I don’t want to just rephrase the original overwhelming question. Instead, I select two categories I believe are most important and ask them to choose between these specific options. Sometimes they’ll suggest something different, but at least I’ve established a clear direction.
How to Automate Tests
Eventually, you won’t be able to review all data manually. Once you have your first version of metrics, begin automating evaluation using LLMs.
The LLM-as-a-judge approach requires aligning the LLM’s decisions with human expectations.
For proper, unbiased alignment: first review the data yourself and write feedback. Then ask the LLM to review the same data. Finally, compare your decisions with the LLM’s judgments.
Avoid the shortcut of letting the LLM review first and then evaluating its feedback. This introduces bias. You’ll only determine whether you like the automated feedback, not identify the specific mistakes the LLM-judge makes.
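Once you have both sets of decisions, measuring alignment is straightforward. A minimal sketch, assuming simple pass/fail labels collected in the same order for the same examples:

```python
# Sketch: compare human decisions with LLM-judge decisions (pass/fail labels are illustrative).
def judge_alignment(human: list[str], judge: list[str]) -> dict:
    assert len(human) == len(judge)
    agree = sum(h == j for h, j in zip(human, judge))
    too_lenient = sum(h == "fail" and j == "pass" for h, j in zip(human, judge))
    too_strict = sum(h == "pass" and j == "fail" for h, j in zip(human, judge))
    return {
        "agreement": agree / len(human),
        "judge_too_lenient": too_lenient,  # judge passed what you failed
        "judge_too_strict": too_strict,    # judge failed what you passed
    }
```

The two disagreement counts matter more than the headline agreement rate: a judge that is too lenient hides exactly the failures you built it to catch.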
Even with automation, you must continue manual reviews. The LLM-judge requires constant calibration to ensure its decisions remain aligned with your expectations and to maintain trust in its judgments.
Since reviewing all data manually becomes impractical at scale, be strategic. In my article on making AI evaluation affordable, I demonstrate techniques for reducing dataset size while maintaining representative results. You can use topic modeling to group data into categories, and then filter queries to focus on specific priorities. Those may be the most popular queries, the most important for your business, or interesting edge cases.
Be intentional about dataset selection rather than reviewing the same examples repeatedly. I always include some randomly selected examples in each batch I review manually. This helps me discover issues I might otherwise miss. While these issues might not be priorities, I prefer knowing about them.
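One way to make that selection repeatable is to sample a few examples per topic and then top the batch up with random picks. The sketch below assumes each record already has a `topic` field from an earlier topic-modeling step; the field name and batch sizes are illustrative.

```python
# Sketch: build a review batch from per-topic samples plus a random slice.
import random
from collections import defaultdict

def build_review_batch(records: list[dict], per_topic: int = 5, random_extra: int = 10) -> list[dict]:
    by_topic = defaultdict(list)
    for r in records:
        by_topic[r["topic"]].append(r)

    batch = []
    for items in by_topic.values():
        batch.extend(random.sample(items, min(per_topic, len(items))))

    # Add randomly selected examples to catch issues outside the prioritized topics.
    remaining = [r for r in records if r not in batch]
    batch.extend(random.sample(remaining, min(random_extra, len(remaining))))
    return batch
```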
How to Iterate Faster
Make your experiment cycle as short as possible. If you have a domain expert on your team, let them not only review data but also experiment with prompts directly. Having experts tweak prompts themselves is faster than having them explain desired changes to engineers, who then modify the AI’s input. After all, prompts are in natural language.
If your application doesn’t exist yet and you lack real data, synthetic data can help you start. However, don’t rely exclusively on AI-generated data. Synthetic data is rarely representative of real-world use, and you’ll miss actual issues. Switch to a combination of real and synthetic data as soon as you get your first users. Use synthetic data strategically: to supplement examples of rare user intents, create diverse adversarial examples, or implement new features.
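Generating synthetic queries for a rare intent can be as simple as seeding an LLM with a handful of real examples and asking for variations. The intent name, seed queries, and model below are invented for illustration.

```python
# Sketch: generate synthetic queries for a rare user intent, seeded with real examples.
from openai import OpenAI

client = OpenAI()

def synthesize_queries(intent: str, real_examples: list[str], n: int = 10) -> str:
    seeds = "\n".join(f"- {q}" for q in real_examples)
    prompt = (
        f"Write {n} realistic user queries for the intent '{intent}'.\n"
        "Vary wording, length, and level of detail. Real examples:\n" + seeds
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # higher temperature for more varied phrasing
    )
    return response.choices[0].message.content
```

Always spot-check the generated queries against real ones; synthetic data that drifts from how users actually write defeats the purpose.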
The nights I spent tweaking prompts when I should have been examining data taught me an expensive lesson. When my client asked why the AI search was suddenly working so well, I showed them the failure mode categories and specific improvements for each. “This is the first AI explanation that actually makes sense,” they said. That day, I transformed from “an AI engineer who builds barely functioning MVPs” to “the person who makes AI work when it matters.” The difference? I finally looked at my fucking data.
Still, it took me too much time, and my process wasn’t nearly as efficient as it would have been if I had known Hamel’s advice. Listen to Hamel. Save yourself some time.
Sources and Related Articles
- Your AI Product Needs Evals
- A Field Guide to Rapidly Improving AI Products
- Task-Specific LLM Evals that Do & Don’t Work
- AI-Powered Topic Modeling: Using Word Embeddings and Clustering for Document Analysis
- How to Systematically Improve RAG Applications
- How to Make AI Evaluation Affordable: Research-Backed Methods to Cut LLM Evaluation Costs
- From API Wrappers to Reliable AI: Essential MLOps Practices for LLM Applications
- Creating a LLM-as-a-Judge That Drives Business Results
Are your AI applications failing in production? Take my 10-minute AI Readiness Assessment to identify critical vulnerabilities, or view my full implementation services.