Let’s explore how to evaluate SQL-generating AI agents using Ragas, an open-source library designed for evaluating Large Language Model (LLM) applications. We’ll focus on agents implemented with the PydanticAI library and measure their performance in generating SQL queries.
Table of Contents
- What is Ragas?
- Setting Up the Environment
- Defining the ground truth data and database schema
- Implementing the test code
- Preparing the Ragas metrics
- Creating the SQL Agent with PydanticAI
- Running the evaluation
- Improving the SQL Agent
- The problem of LLM-based metrics
- Conclusion
What is Ragas?
Ragas is an evaluation library with tools for assessing various aspects of LLM applications. It offers reference-free evaluations, meaning you don’t need ground truth data to measure performance. However, we need the ground truth data and a description of the database schema for the SQL generation task.
Most of the metrics implemented in Ragas are LLM-based, so we use a language model to determine if the output of another language model is correct. Automated evaluation isn’t reliable (both LLMs can be wrong), but it’s much faster than reviewing the output manually, so we may still use it as a good enough tradeoff.
Setting Up the Environment
As mentioned earlier, we will use PydanticAI to implement the SQL-generating agents and Ragas to evaluate their performance. Ragas wraps existing LLM clients rather than calling the APIs directly, so we also need the Langchain client for the OpenAI API. Finally, we will use the tqdm library to display a friendly progress bar.
pip install ragas pydantic-ai langchain-openai tqdm
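Both PydanticAI and the Ragas evaluator call the OpenAI API, so you also need an API key. Later in the article I refer to it as OPENAI_API_KEY; a minimal way to load it, assuming you keep the key in an environment variable, looks like this:
import os

# Assumes the key is exported in your shell, e.g. export OPENAI_API_KEY="sk-..."
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
PydanticAI picks up OPENAI_API_KEY from the environment on its own; the explicit variable is only needed for the Langchain client we configure later.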
Defining the ground truth data and database schema
We will start with a database schema and a set of queries to evaluate. Each entry in the set consists of a question and the expected SQL query. For the sake of brevity, we will use only 20 queries.
DB_SCHEMA = """
CREATE TABLE Vendors (
VendorID INT PRIMARY KEY,
VendorName VARCHAR(100) NOT NULL,
ContactPerson VARCHAR(100),
Email VARCHAR(100),
Phone VARCHAR(20),
Address VARCHAR(255)
);
CREATE TABLE Departments (
DepartmentID INT PRIMARY KEY,
DepartmentName VARCHAR(50) NOT NULL
);
CREATE TABLE ExpenseCategories (
CategoryID INT PRIMARY KEY,
CategoryName VARCHAR(50) NOT NULL,
Description VARCHAR(255)
);
CREATE TABLE Invoices (
InvoiceID INT PRIMARY KEY,
VendorID INT,
DepartmentID INT,
InvoiceNumber VARCHAR(50) NOT NULL,
InvoiceDate DATE NOT NULL,
DueDate DATE,
TotalAmount DECIMAL(10, 2) NOT NULL,
Status VARCHAR(20) DEFAULT 'Pending',
FOREIGN KEY (VendorID) REFERENCES Vendors(VendorID),
FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)
);
CREATE TABLE InvoiceLines (
LineID INT PRIMARY KEY,
InvoiceID INT,
CategoryID INT,
Description VARCHAR(255),
Quantity DECIMAL(10, 2),
UnitPrice DECIMAL(10, 2),
TotalPrice DECIMAL(10, 2),
FOREIGN KEY (InvoiceID) REFERENCES Invoices(InvoiceID),
FOREIGN KEY (CategoryID) REFERENCES ExpenseCategories(CategoryID)
);
"""
qa = [
("What is the total amount spent across all invoices?",
"SELECT SUM(TotalAmount) AS TotalSpent FROM Invoices;"),
("Which vendor has the highest total invoice amount?",
"""SELECT v.VendorName, SUM(i.TotalAmount) AS TotalSpent
FROM Vendors v
JOIN Invoices i ON v.VendorID = i.VendorID
GROUP BY v.VendorID, v.VendorName
ORDER BY TotalSpent DESC
LIMIT 1;"""),
("How many invoices are currently pending payment?",
"SELECT COUNT(*) AS PendingInvoices FROM Invoices WHERE Status = 'Pending';"),
("What are the top 5 expense categories by total amount spent?",
"""SELECT ec.CategoryName, SUM(il.TotalPrice) AS TotalSpent
FROM ExpenseCategories ec
JOIN InvoiceLines il ON ec.CategoryID = il.CategoryID
GROUP BY ec.CategoryID, ec.CategoryName
ORDER BY TotalSpent DESC
LIMIT 5;"""),
("Which department has spent the most in the last month?",
"""SELECT d.DepartmentName, SUM(i.TotalAmount) AS TotalSpent
FROM Departments d
JOIN Invoices i ON d.DepartmentID = i.DepartmentID
WHERE i.InvoiceDate >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
GROUP BY d.DepartmentID, d.DepartmentName
ORDER BY TotalSpent DESC
LIMIT 1;"""),
("What is the average invoice amount?",
"SELECT AVG(TotalAmount) AS AverageInvoiceAmount FROM Invoices;"),
("How many unique vendors have we purchased from?",
"SELECT COUNT(DISTINCT VendorID) AS UniqueVendors FROM Invoices;"),
("What is the total amount spent by each department?",
"""SELECT d.DepartmentName, SUM(i.TotalAmount) AS TotalSpent
FROM Departments d
LEFT JOIN Invoices i ON d.DepartmentID = i.DepartmentID
GROUP BY d.DepartmentID, d.DepartmentName
ORDER BY TotalSpent DESC;"""),
("Which invoices are overdue?",
"""SELECT InvoiceID, InvoiceNumber, DueDate
FROM Invoices
WHERE DueDate < CURDATE() AND Status = 'Pending';"""),
("What is the most expensive single item purchased?",
"""SELECT i.InvoiceNumber, il.Description, il.UnitPrice
FROM InvoiceLines il
JOIN Invoices i ON il.InvoiceID = i.InvoiceID
ORDER BY il.UnitPrice DESC
LIMIT 1;"""),
("How many invoices does each vendor have?",
"""SELECT v.VendorName, COUNT(i.InvoiceID) AS InvoiceCount
FROM Vendors v
LEFT JOIN Invoices i ON v.VendorID = i.VendorID
GROUP BY v.VendorID, v.VendorName
ORDER BY InvoiceCount DESC;"""),
("What is the total amount spent in each month of the current year?",
"""SELECT MONTH(InvoiceDate) AS Month, SUM(TotalAmount) AS TotalSpent
FROM Invoices
WHERE YEAR(InvoiceDate) = YEAR(CURDATE())
GROUP BY MONTH(InvoiceDate)
ORDER BY Month;"""),
("Which expense category has the highest average item price?",
"""SELECT ec.CategoryName, AVG(il.UnitPrice) AS AveragePrice
FROM ExpenseCategories ec
JOIN InvoiceLines il ON ec.CategoryID = il.CategoryID
GROUP BY ec.CategoryID, ec.CategoryName
ORDER BY AveragePrice DESC
LIMIT 1;"""),
("What is the distribution of invoice statuses?",
"SELECT Status, COUNT(*) AS Count FROM Invoices GROUP BY Status;"),
("Who are the top 3 vendors by number of invoices?",
"""SELECT v.VendorName, COUNT(i.InvoiceID) AS InvoiceCount
FROM Vendors v
JOIN Invoices i ON v.VendorID = i.VendorID
GROUP BY v.VendorID, v.VendorName
ORDER BY InvoiceCount DESC
LIMIT 3;"""),
("What is the average time between invoice date and due date?",
"SELECT AVG(DATEDIFF(DueDate, InvoiceDate)) AS AverageDaysToPay FROM Invoices;"),
("Which department has the highest number of distinct expense categories?",
"""SELECT d.DepartmentName, COUNT(DISTINCT il.CategoryID) AS UniqueCategories
FROM Departments d
JOIN Invoices i ON d.DepartmentID = i.DepartmentID
JOIN InvoiceLines il ON i.InvoiceID = il.InvoiceID
GROUP BY d.DepartmentID, d.DepartmentName
ORDER BY UniqueCategories DESC
LIMIT 1;"""),
("What is the total amount of pending invoices?",
"SELECT SUM(TotalAmount) AS TotalPendingAmount FROM Invoices WHERE Status = 'Pending';"),
("Which vendors have we not purchased from in the last 6 months?",
"""SELECT VendorName
FROM Vendors
WHERE VendorID NOT IN (
SELECT DISTINCT VendorID
FROM Invoices
WHERE InvoiceDate >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
);"""),
("What is the average number of line items per invoice?",
"""SELECT AVG(LineItemCount) AS AverageLineItems
FROM (
SELECT InvoiceID, COUNT(*) AS LineItemCount
FROM InvoiceLines
GROUP BY InvoiceID
) AS LineItemCounts;""")
]
Implementing the test code
We will use the same test code to run all agents and compare the results. We need a function that passes each question to the agent and yields the result together with the question and the expected answer:
def run_tests(agent, test_data, to_answer=lambda x: x.data):
    for (question, expected_answer) in test_data:
        response = agent.run_sync(question)
        actual_answer = to_answer(response)
        yield (question, expected_answer, actual_answer)
The agent parameter is the PydanticAI Agent to evaluate. The test_data parameter is a list of tuples, where each tuple contains a question and the expected SQL query. The to_answer parameter is a function capable of extracting the answer from the agent's response.
To evaluate the answer using Ragas, we will need to create a dataset of samples. In our case, we have a single interaction between the user and the agent, so we create SingleTurnSample objects.
from ragas.dataset_schema import SingleTurnSample

def make_ragas_samples(questions_and_responses):
    for (question, expected_answer, actual_answer) in questions_and_responses:
        yield SingleTurnSample(
            user_input=question,
            reference=expected_answer,
            response=actual_answer,
            reference_contexts=[f"""DB SCHEMA: {DB_SCHEMA}"""]
        )
Later, we will use the make_ragas_samples function to create a dataset of samples and pass them to the Ragas evaluator.
Preparing the Ragas metrics
To evaluate the performance of the SQL-generating agents, we will use the built-in SQL equivalence metric and a custom LLM-based metric to check if the output contains only SQL with no additional text.
First, we have to define the language model to use during the evaluation:
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
model = ChatOpenAI(model="gpt-4o", api_key=OPENAI_API_KEY)
evaluator_llm = LangchainLLMWrapper(model)
Next, we can define the metrics and configure them with the LLM client:
from ragas.metrics import LLMSQLEquivalence
from ragas.metrics import AspectCritic

sql_scorer = LLMSQLEquivalence()
sql_scorer.llm = evaluator_llm

clean_sql_binary = AspectCritic(
    name="clean_sql_binary",
    definition="Did the model return a pure SQL query with no additional markdown or explanations?",
    llm=evaluator_llm,
)
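Before scoring the whole dataset, it is worth sanity-checking a metric on a single sample. A minimal sketch (single_turn_ascore is the per-sample scoring method of Ragas single-turn metrics; inside a notebook, use await instead of asyncio.run):
import asyncio
from ragas.dataset_schema import SingleTurnSample

sample = SingleTurnSample(
    user_input="What is the total amount spent across all invoices?",
    reference="SELECT SUM(TotalAmount) AS TotalSpent FROM Invoices;",
    response="SELECT SUM(TotalAmount) FROM Invoices;",
    reference_contexts=[f"DB SCHEMA: {DB_SCHEMA}"],
)

# Returns 1.0 when the judge model considers the two queries equivalent, 0.0 otherwise
print(asyncio.run(sql_scorer.single_turn_ascore(sample)))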
Creating the SQL Agent with PydanticAI
In my previous article, I showed how to use PydanticAI to obtain structured output from an LLM. I recommend reading it to understand how PydanticAI works.
Our first agent will generate a free-form text response. We hope the model returns only a valid SQL query.
from datetime import date
from pydantic_ai import Agent

agent = Agent("openai:gpt-4o-mini", result_type=str)

@agent.system_prompt
async def system_prompt() -> str:
    return f"""\
Given the following PostgreSQL table of records, your job is to
write an SQL query that suits the user's request.
Database schema:
{DB_SCHEMA}
today's date = {date.today()}
Example
request: Show me invoices issued by "XYZ"
response: SELECT * FROM Invoices WHERE VendorID = (SELECT VendorID FROM Vendors WHERE VendorName = 'XYZ');
Example
request: Which vendors have we added during the last 3 months:
response: SELECT VendorName FROM Vendors WHERE VendorID IN (SELECT VendorID FROM Invoices WHERE InvoiceDate >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH) ) AND VendorID NOT IN (SELECT VendorID FROM Invoices WHERE InvoiceDate < DATE_SUB(CURDATE(), INTERVAL 3 MONTH));
"""
Running the evaluation
Finally, we can run inference using our agent and create an EvaluationDataset with the results:
from tqdm import tqdm
from ragas import evaluate, EvaluationDataset
result = run_tests(agent, qa)
samples = make_ragas_samples(result)
samples = list(tqdm(samples, total=len(qa)))
dataset = EvaluationDataset(samples=samples)
When the dataset is ready, we can evaluate the LLM’s output using the Ragas evaluator:
eval_result = evaluate(
    dataset=dataset,
    metrics=[sql_scorer, clean_sql_binary],
)
eval_result
The eval_result variable contains the evaluation results. When we print it, we get the average score for each metric. In my case:
{'llm_sql_equivalence_with_reference': 0.7000, 'clean_sql_binary': 0.0000}
We see the SQL equivalence metric is 0.7, which means the generated query was judged equivalent to the reference for 70% of the questions. The custom metric is 0, which means the model never returned a pure SQL query without additional text.
We can see the actual values by obtaining a Pandas DataFrame with the results:
eval_result.to_pandas()
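The DataFrame contains one row per test question, with the input, response, reference, and a column per metric. A quick way to inspect only the failing cases (the column names below are taken from the metric names; double-check them with df.columns in your Ragas version):
df = eval_result.to_pandas()
failed = df[(df["llm_sql_equivalence_with_reference"] == 0) | (df["clean_sql_binary"] == 0)]
print(failed[["user_input", "response", "reference"]])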
Improving the SQL Agent
The SQL queries generated by the agents aren’t perfect, but the bigger problem is that the returned text is useless if we want to use the query inside an application. We need something we can pass to a database without any additional processing.
Therefore, for the next agent, we will use structured output to generate an object containing two fields: an explanation and the SQL query. The explanation will be a free-form text where the model can plan the query. In the SQL field, I expect to see only the SQL query:
from pydantic import BaseModel, Field

class SqlQuery(BaseModel):
    explanation: str = Field(description="Explanation of what data has to be returned and from which tables")
    sql_query: str = Field(description="SQL query to be executed")

agent = Agent("openai:gpt-4o-mini", result_type=SqlQuery)

@agent.system_prompt
async def system_prompt() -> str:
    return f"""\
Given the following PostgreSQL table of records, your job is to
write an SQL query that suits the user's request.
Database schema:
{DB_SCHEMA}
today's date = {date.today()}
Example
request: Show me invoices issued by "XYZ"
response: SELECT * FROM Invoices WHERE VendorID = (SELECT VendorID FROM Vendors WHERE VendorName = 'XYZ');
Example
request: Which vendors have we added during the last 3 months:
response: SELECT VendorName FROM Vendors WHERE VendorID IN (SELECT VendorID FROM Invoices WHERE InvoiceDate >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH) ) AND VendorID NOT IN (SELECT VendorID FROM Invoices WHERE InvoiceDate < DATE_SUB(CURDATE(), INTERVAL 3 MONTH));
"""
When we run the evaluation, we get the following results:
{'llm_sql_equivalence_with_reference': 0.4000, 'clean_sql_binary': 1.0000}
The output format is correct, but fewer of the generated queries are judged equivalent to the reference. When I reviewed the results, I saw that the model tends to retrieve more data than required and lacks knowledge of the possible values of text columns. As the next improvement, I added a dataset description to guide the model and replaced the gpt-4o-mini model with gpt-4o.
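I'm not reproducing the full dataset description here, but the change boils down to extending the system prompt and swapping the model name. Roughly (DATASET_DESCRIPTION is a placeholder for your own notes about column values, vendor names, and date ranges):
DATASET_DESCRIPTION = """
Invoices.Status contains a small set of values such as 'Pending'.
Describe here the vendor naming conventions, the date range the data covers,
and anything else the model needs to know about the actual values in the tables.
"""

agent = Agent("openai:gpt-4o", result_type=SqlQuery)

@agent.system_prompt
async def system_prompt() -> str:
    return f"""\
Given the following PostgreSQL table of records, your job is to
write an SQL query that suits the user's request.
Database schema:
{DB_SCHEMA}
Dataset description:
{DATASET_DESCRIPTION}
today's date = {date.today()}
"""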
With those changes, I reached an sql_equivalence score of 0.8 while retaining a clean_sql_binary score of 1.0.
The problem of LLM-based metrics
We quickly noticed a problem with the SQL equivalence metric when we looked at the evaluation results. The metric sometimes rejects correct SQL queries that retrieve the same data as the reference query. Even worse, it sometimes accepts incorrect SQL queries.
When we use an LLM-based metric, we assume an LLM can reliably judge LLM-generated output. The problem is that the LLM may be wrong both when it generates the output and when it scores the result.
A better approach would be to use a test database and compare the results of the generated SQL queries with the results of the expected SQL queries. Result-set comparison would still miss cases where the query is wrong but the test data contains no rows that would reveal the error.
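A rough sketch of that approach, assuming the schema and some representative rows are loaded into a test database you can reach with a standard DB-API driver (the reference queries use functions like DATE_SUB and CURDATE, so a plain SQLite database would not run them as-is):
def results_match(conn, generated_sql, reference_sql):
    """Return True when both queries produce the same rows on the test database."""
    def run(sql):
        cursor = conn.cursor()
        cursor.execute(sql)
        # Compare string representations to avoid ordering issues with NULLs
        return sorted(repr(row) for row in cursor.fetchall())
    try:
        return run(generated_sql) == run(reference_sql)
    except Exception:
        # The generated query did not execute at all
        return False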
Human reviewers, in turn, can't reliably determine whether two long queries are equivalent. At some point, we have to combine all those metrics while remembering that none of them is perfect.
Also, in production, the model’s response time and the query execution time may be more important than getting the shortest SQL query that retrieves only the required data (as long as the returned data is correct).
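If latency matters, the timings are easy to collect while running the tests. A minimal sketch, assuming question is one of the test questions and cursor is a DB-API cursor for the test database:
import time

start = time.perf_counter()
response = agent.run_sync(question)
generation_time = time.perf_counter() - start

start = time.perf_counter()
cursor.execute(response.data.sql_query)
rows = cursor.fetchall()
execution_time = time.perf_counter() - start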
Which metrics to use?
Pixion published a description of the metrics available in Ragas: Ragas Evaluation: In-Depth Insights. They provide a better explanation than the Ragas documentation.
Conclusion
While SQL equivalence metrics are valuable, they’re just one part of the evaluation story. Verifying that queries produce identical datasets is crucial but may still not catch all potential issues. A comprehensive evaluation strategy should consider query structure, result validation, and runtime performance.
Do you need help building AI Agents for data analysis for your business?
You can hire me!