Let’s explore how to evaluate SQL-generating AI agents using Ragas, an open-source library designed for evaluating Large Language Model (LLM) applications. We’ll focus on agents implemented with the PydanticAI library and measure their performance in generating SQL queries.
Table of Contents
- What is Ragas?
- Setting Up the Environment
- Defining the ground truth data and database schema
- Implementing the test code
- Preparing the Ragas metrics
- Creating the SQL Agent with PydanticAI
- Running the evaluation
- Improving the SQL Agent
- The problem of LLM-based metrics
- Conclusion
What is Ragas?
Ragas is an evaluation library with tools for assessing various aspects of LLM applications. It offers reference-free evaluations, meaning you don’t need ground truth data to measure performance. However, we need the ground truth data and a description of the database schema for the SQL generation task.
Most of the metrics implemented in Ragas are LLM-based, so we use a language model to determine if the output of another language model is correct. Automated evaluation isn’t reliable (both LLMs can be wrong), but it’s much faster than reviewing the output manually, so we may still use it as a good enough tradeoff.
Setting Up the Environment
As mentioned earlier, we will use PydanticAI to implement the SQL-generating agents and Ragas to evaluate their performance. Ragas wraps existing LLM clients rather than calling the APIs directly, so we also need the Langchain client for the OpenAI API. Finally, we will use the tqdm library to display a friendly progress bar.
pip install ragas pydantic-ai langchain-openai tqdm
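Both PydanticAI and the Ragas evaluator call the OpenAI API, so you also need an API key. Later in the article I refer to it as OPENAI_API_KEY; a minimal way to load it, assuming you keep the key in an environment variable, looks like this:
import os

# Assumes the key is exported in your shell, e.g. export OPENAI_API_KEY="sk-..."
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
PydanticAI picks up OPENAI_API_KEY from the environment on its own; the explicit variable is only needed for the Langchain client we configure later.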
Defining the ground truth data and database schema
We will start with a database schema and a set of queries to evaluate. Each entry in the set consists of a question and the expected SQL query. For the sake of brevity, we will use only 20 queries.
DB_SCHEMA = """
CREATE TABLE Vendors (
VendorID INT PRIMARY KEY,
VendorName VARCHAR(100) NOT NULL,
ContactPerson VARCHAR(100),
Email VARCHAR(100),
Phone VARCHAR(20),
Address VARCHAR(255)
);
CREATE TABLE Departments (
DepartmentID INT PRIMARY KEY,
DepartmentName VARCHAR(50) NOT NULL
);
CREATE TABLE ExpenseCategories (
CategoryID INT PRIMARY KEY,
CategoryName VARCHAR(50) NOT NULL,
Description VARCHAR(255)
);
CREATE TABLE Invoices (
InvoiceID INT PRIMARY KEY,
VendorID INT,
DepartmentID INT,
InvoiceNumber VARCHAR(50) NOT NULL,
InvoiceDate DATE NOT NULL,
DueDate DATE,
TotalAmount DECIMAL(10, 2) NOT NULL,
Status VARCHAR(20) DEFAULT 'Pending',
FOREIGN KEY (VendorID) REFERENCES Vendors(VendorID),
FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)
);
CREATE TABLE InvoiceLines (
LineID INT PRIMARY KEY,
InvoiceID INT,
CategoryID INT,
Description VARCHAR(255),
Quantity DECIMAL(10, 2),
UnitPrice DECIMAL(10, 2),
TotalPrice DECIMAL(10, 2),
FOREIGN KEY (InvoiceID) REFERENCES Invoices(InvoiceID),
FOREIGN KEY (CategoryID) REFERENCES ExpenseCategories(CategoryID)
);
"""
qa = [
("What is the total amount spent across all invoices?",
"SELECT SUM(TotalAmount) AS TotalSpent FROM Invoices;"),
("Which vendor has the highest total invoice amount?",
"""SELECT v.VendorName, SUM(i.TotalAmount) AS TotalSpent
FROM Vendors v
JOIN Invoices i ON v.VendorID = i.VendorID
GROUP BY v.VendorID, v.VendorName
ORDER BY TotalSpent DESC
LIMIT 1;"""),
("How many invoices are currently pending payment?",
"SELECT COUNT(*) AS PendingInvoices FROM Invoices WHERE Status = 'Pending';"),
("What are the top 5 expense categories by total amount spent?",
"""SELECT ec.CategoryName, SUM(il.TotalPrice) AS TotalSpent
FROM ExpenseCategories ec
JOIN InvoiceLines il ON ec.CategoryID = il.CategoryID
GROUP BY ec.CategoryID, ec.CategoryName
ORDER BY TotalSpent DESC
LIMIT 5;"""),
("Which department has spent the most in the last month?",
"""SELECT d.DepartmentName, SUM(i.TotalAmount) AS TotalSpent
FROM Departments d
JOIN Invoices i ON d.DepartmentID = i.DepartmentID
WHERE i.InvoiceDate >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
GROUP BY d.DepartmentID, d.DepartmentName
ORDER BY TotalSpent DESC
LIMIT 1;"""),
("What is the average invoice amount?",
"SELECT AVG(TotalAmount) AS AverageInvoiceAmount FROM Invoices;"),
("How many unique vendors have we purchased from?",
"SELECT COUNT(DISTINCT VendorID) AS UniqueVendors FROM Invoices;"),
("What is the total amount spent by each department?",
"""SELECT d.DepartmentName, SUM(i.TotalAmount) AS TotalSpent
FROM Departments d
LEFT JOIN Invoices i ON d.DepartmentID = i.DepartmentID
GROUP BY d.DepartmentID, d.DepartmentName
ORDER BY TotalSpent DESC;"""),
("Which invoices are overdue?",
"""SELECT InvoiceID, InvoiceNumber, DueDate
FROM Invoices
WHERE DueDate < CURDATE() AND Status = 'Pending';"""),
("What is the most expensive single item purchased?",
"""SELECT i.InvoiceNumber, il.Description, il.UnitPrice
FROM InvoiceLines il
JOIN Invoices i ON il.InvoiceID = i.InvoiceID
ORDER BY il.UnitPrice DESC
LIMIT 1;"""),
("How many invoices does each vendor have?",
"""SELECT v.VendorName, COUNT(i.InvoiceID) AS InvoiceCount
FROM Vendors v
LEFT JOIN Invoices i ON v.VendorID = i.VendorID
GROUP BY v.VendorID, v.VendorName
ORDER BY InvoiceCount DESC;"""),
("What is the total amount spent in each month of the current year?",
"""SELECT MONTH(InvoiceDate) AS Month, SUM(TotalAmount) AS TotalSpent
FROM Invoices
WHERE YEAR(InvoiceDate) = YEAR(CURDATE())
GROUP BY MONTH(InvoiceDate)
ORDER BY Month;"""),
("Which expense category has the highest average item price?",
"""SELECT ec.CategoryName, AVG(il.UnitPrice) AS AveragePrice
FROM ExpenseCategories ec
JOIN InvoiceLines il ON ec.CategoryID = il.CategoryID
GROUP BY ec.CategoryID, ec.CategoryName
ORDER BY AveragePrice DESC
LIMIT 1;"""),
("What is the distribution of invoice statuses?",
"SELECT Status, COUNT(*) AS Count FROM Invoices GROUP BY Status;"),
("Who are the top 3 vendors by number of invoices?",
"""SELECT v.VendorName, COUNT(i.InvoiceID) AS InvoiceCount
FROM Vendors v
JOIN Invoices i ON v.VendorID = i.VendorID
GROUP BY v.VendorID, v.VendorName
ORDER BY InvoiceCount DESC
LIMIT 3;"""),
("What is the average time between invoice date and due date?",
"SELECT AVG(DATEDIFF(DueDate, InvoiceDate)) AS AverageDaysToPay FROM Invoices;"),
("Which department has the highest number of distinct expense categories?",
"""SELECT d.DepartmentName, COUNT(DISTINCT il.CategoryID) AS UniqueCategories
FROM Departments d
JOIN Invoices i ON d.DepartmentID = i.DepartmentID
JOIN InvoiceLines il ON i.InvoiceID = il.InvoiceID
GROUP BY d.DepartmentID, d.DepartmentName
ORDER BY UniqueCategories DESC
LIMIT 1;"""),
("What is the total amount of pending invoices?",
"SELECT SUM(TotalAmount) AS TotalPendingAmount FROM Invoices WHERE Status = 'Pending';"),
("Which vendors have we not purchased from in the last 6 months?",
"""SELECT VendorName
FROM Vendors
WHERE VendorID NOT IN (
SELECT DISTINCT VendorID
FROM Invoices
WHERE InvoiceDate >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
);"""),
("What is the average number of line items per invoice?",
"""SELECT AVG(LineItemCount) AS AverageLineItems
FROM (
SELECT InvoiceID, COUNT(*) AS LineItemCount
FROM InvoiceLines
GROUP BY InvoiceID
) AS LineItemCounts;""")
]
Implementing the test code
We will use the same test code to run all agents and compare the results. We need a function that passes each question to the agent and yields the result together with the question and the expected answer:
def run_tests(agent, test_data, to_answer=lambda x: x.data):
    for (question, expected_answer) in test_data:
        response = agent.run_sync(question)
        actual_answer = to_answer(response)
        yield (question, expected_answer, actual_answer)
The agent parameter is the PydanticAI Agent to evaluate. The test_data parameter is a list of tuples, where each tuple contains a question and the expected SQL query. The to_answer parameter is a function capable of extracting the answer from the agent's response.
To evaluate the answer using Ragas, we will need to create a dataset of samples. In our case, we have a single interaction between the user and the agent, so we create SingleTurnSample objects.
from ragas.dataset_schema import SingleTurnSample

def make_ragas_samples(questions_and_responses):
    for (question, expected_answer, actual_answer) in questions_and_responses:
        yield SingleTurnSample(
            user_input=question,
            reference=expected_answer,
            response=actual_answer,
            reference_contexts=[f"""DB SCHEMA: {DB_SCHEMA}"""]
        )
Later, we will use the make_ragas_samples function to create a dataset of samples and pass them to the Ragas evaluator.
Preparing the Ragas metrics
To evaluate the performance of the SQL-generating agents, we will use the built-in SQL equivalence metric and a custom LLM-based metric to check if the output contains only SQL with no additional text.
First, we have to define the language model to use during the evaluation:
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
model = ChatOpenAI(model="gpt-4o", api_key=OPENAI_API_KEY)
evaluator_llm = LangchainLLMWrapper(model)
Next, we can define the metrics and configure them with the LLM client:
from ragas.metrics import LLMSQLEquivalence
from ragas.metrics import AspectCritic

sql_scorer = LLMSQLEquivalence()
sql_scorer.llm = evaluator_llm

clean_sql_binary = AspectCritic(
    name="clean_sql_binary",
    definition="Did the model return a pure SQL query with no additional markdown or explanations?",
    llm=evaluator_llm,
)
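Before scoring the whole dataset, it is worth sanity-checking a metric on a single sample. A minimal sketch (single_turn_ascore is the per-sample scoring method of Ragas single-turn metrics; inside a notebook, use await instead of asyncio.run):
import asyncio
from ragas.dataset_schema import SingleTurnSample

sample = SingleTurnSample(
    user_input="What is the total amount spent across all invoices?",
    reference="SELECT SUM(TotalAmount) AS TotalSpent FROM Invoices;",
    response="SELECT SUM(TotalAmount) FROM Invoices;",
    reference_contexts=[f"DB SCHEMA: {DB_SCHEMA}"],
)

# Returns 1.0 when the judge model considers the two queries equivalent, 0.0 otherwise
print(asyncio.run(sql_scorer.single_turn_ascore(sample)))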
Creating the SQL Agent with PydanticAI
In my previous article, I showed how to use PydanticAI to obtain structured output from an LLM. I recommend reading it to understand how PydanticAI works.
Our first agent will generate a free-form text response. We hope the model returns only a valid SQL query.
from datetime import date
from pydantic_ai import Agent

agent = Agent("openai:gpt-4o-mini", result_type=str)

@agent.system_prompt
async def system_prompt() -> str:
    return f"""\
Given the following PostgreSQL table of records, your job is to
write an SQL query that suits the user's request.
Database schema:
{DB_SCHEMA}
today's date = {date.today()}
Example
request: Show me invoices issued by "XYZ"
response: SELECT * FROM Invoices WHERE VendorID = (SELECT VendorID FROM Vendors WHERE VendorName = 'XYZ');
Example
request: Which vendors have we added during the last 3 months:
response: SELECT VendorName FROM Vendors WHERE VendorID IN (SELECT VendorID FROM Invoices WHERE InvoiceDate >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH) ) AND VendorID NOT IN (SELECT VendorID FROM Invoices WHERE InvoiceDate < DATE_SUB(CURDATE(), INTERVAL 3 MONTH));
"""
Running the evaluation
Finally, we can run inference using our agent and create an EvaluationDataset with the results:
from tqdm import tqdm
from ragas import evaluate, EvaluationDataset
result = run_tests(agent, qa)
samples = make_ragas_samples(result)
samples = list(tqdm(samples, total=len(qa)))
dataset = EvaluationDataset(samples=samples)
When the dataset is ready, we can evaluate the LLM’s output using the Ragas evaluator:
eval_result = evaluate(
    dataset=dataset,
    metrics=[sql_scorer, clean_sql_binary],
)
eval_result
The eval_result variable contains the evaluation results. When we print it, we get the average score for each metric. In my case:
{'llm_sql_equivalence_with_reference': 0.7000, 'clean_sql_binary': 0.0000}
We see the SQL equivalence metric is 0.7, which means the generated query was judged equivalent to the reference for 70% of the questions. The custom metric is 0, which means the model never returned a pure SQL query without additional text.
We can see the actual values by obtaining a Pandas DataFrame with the results:
eval_result.to_pandas()
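The DataFrame contains one row per test question, with the input, response, reference, and a column per metric. A quick way to inspect only the failing cases (the column names below are taken from the metric names; double-check them with df.columns in your Ragas version):
df = eval_result.to_pandas()
failed = df[(df["llm_sql_equivalence_with_reference"] == 0) | (df["clean_sql_binary"] == 0)]
print(failed[["user_input", "response", "reference"]])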
Improving the SQL Agent
The SQL queries generated by the agents aren’t perfect, but the bigger problem is that the returned text is useless if we want to use the query inside an application. We need something we can pass to a database without any additional processing.
Therefore, for the next agent, we will use structured output to generate an object containing two fields: an explanation and the SQL query. The explanation will be a free-form text where the model can plan the query. In the SQL field, I expect to see only the SQL query:
from pydantic import BaseModel, Field

class SqlQuery(BaseModel):
    explanation: str = Field(description="Explanation of what data has to be returned and from which tables")
    sql_query: str = Field(description="SQL query to be executed")

agent = Agent("openai:gpt-4o-mini", result_type=SqlQuery)

@agent.system_prompt
async def system_prompt() -> str:
    return f"""\
Given the following PostgreSQL table of records, your job is to
write an SQL query that suits the user's request.
Database schema:
{DB_SCHEMA}
today's date = {date.today()}
Example
request: Show me invoices issued by "XYZ"
response: SELECT * FROM Invoices WHERE VendorID = (SELECT VendorID FROM Vendors WHERE VendorName = 'XYZ');
Example
request: Which vendors have we added during the last 3 months:
response: SELECT VendorName FROM Vendors WHERE VendorID IN (SELECT VendorID FROM Invoices WHERE InvoiceDate >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH) ) AND VendorID NOT IN (SELECT VendorID FROM Invoices WHERE InvoiceDate < DATE_SUB(CURDATE(), INTERVAL 3 MONTH));
"""
When we run the evaluation, we get the following results:
{'llm_sql_equivalence_with_reference': 0.4000, 'clean_sql_binary': 1.0000}
The output format is correct, but fewer of the generated queries are judged equivalent to the reference. When I reviewed the results, I saw that the model tends to retrieve more data than required and lacks knowledge of the possible values of text columns. As the next improvement, I added a dataset description to guide the model and replaced the gpt-4o-mini model with gpt-4o.
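I'm not reproducing the full dataset description here, but the change boils down to extending the system prompt and swapping the model name. Roughly (DATASET_DESCRIPTION is a placeholder for your own notes about column values, vendor names, and date ranges):
DATASET_DESCRIPTION = """
Invoices.Status contains a small set of values such as 'Pending'.
Describe here the vendor naming conventions, the date range the data covers,
and anything else the model needs to know about the actual values in the tables.
"""

agent = Agent("openai:gpt-4o", result_type=SqlQuery)

@agent.system_prompt
async def system_prompt() -> str:
    return f"""\
Given the following PostgreSQL table of records, your job is to
write an SQL query that suits the user's request.
Database schema:
{DB_SCHEMA}
Dataset description:
{DATASET_DESCRIPTION}
today's date = {date.today()}
"""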
With those changes, I reached an sql_equivalence score of 0.8 while retaining a clean_sql_binary score of 1.0.
The problem of LLM-based metrics
We quickly noticed a problem with the SQL equivalence metric when we looked at the evaluation results. The metric sometimes rejects correct SQL queries that retrieve the same data as the reference query. Even worse, it sometimes accepts incorrect SQL queries.
When we use an LLM-based metric, we assume an LLM can reliably judge LLM-generated output. The problem is that the LLM may be wrong both when it generates the output and when it scores the result.
A better approach would be to use a test database and compare the results of the generated SQL queries with the results of the expected SQL queries. Result-set comparison would still miss cases where the query is wrong but the test data contains no rows that would reveal the error.
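A rough sketch of that approach, assuming the schema and some representative rows are loaded into a test database you can reach with a standard DB-API driver (the reference queries use functions like DATE_SUB and CURDATE, so a plain SQLite database would not run them as-is):
def results_match(conn, generated_sql, reference_sql):
    """Return True when both queries produce the same rows on the test database."""
    def run(sql):
        cursor = conn.cursor()
        cursor.execute(sql)
        # Compare string representations to avoid ordering issues with NULLs
        return sorted(repr(row) for row in cursor.fetchall())
    try:
        return run(generated_sql) == run(reference_sql)
    except Exception:
        # The generated query did not execute at all
        return False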
Human reviewers, in turn, can't reliably determine whether two long queries are equivalent. At some point, we have to combine all those metrics while remembering that none of them is perfect.
Also, in production, the model’s response time and the query execution time may be more important than getting the shortest SQL query that retrieves only the required data (as long as the returned data is correct).
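If latency matters, the timings are easy to collect while running the tests. A minimal sketch, assuming question is one of the test questions and cursor is a DB-API cursor for the test database:
import time

start = time.perf_counter()
response = agent.run_sync(question)
generation_time = time.perf_counter() - start

start = time.perf_counter()
cursor.execute(response.data.sql_query)
rows = cursor.fetchall()
execution_time = time.perf_counter() - start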
Which metrics to use?
Pixion published a description of the metrics available in Ragas: Ragas Evaluation: In-Depth Insights. They provide a better explanation than the Ragas documentation.
Conclusion
While SQL equivalence metrics are valuable, they’re just one part of the evaluation story. Verifying that queries produce identical datasets is crucial but may still not catch all potential issues. A comprehensive evaluation strategy should consider query structure, result validation, and runtime performance.
Do you need help building AI Agents for data analysis for your business?
You can hire me!