The following article is based on the conversation started by Hamza Tahir in the MLOps Community Slack channel. If you are building AI-based software and aren’t a part of the MLOps Community, you are missing out.
Why AI Isn’t Regular Software
We’re treating AI agents like they’re just regular software components. And frankly, that’s a recipe for disaster. - Hamza Tahir
Exactly. Everyone who writes “hot takes” (lukewarm at best, to be honest) with dismissive comments calling AI-based software “API wrappers” is missing the point. What’s even worse, they don’t have the slightest clue how wrong they are.
Programmers tend to deploy LangGraph or LlamaIndex-based software like standard microservices. The guardrails, if present at all, are just Pydantic schemas. At the same time, the tests assume AI’s output is deterministic, or teams rely on “vibe-driven development” (“it looks good, let’s deploy it”).
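To make that concrete, here is a minimal sketch of what a Pydantic-only guardrail amounts to (the SupportTicket schema is a made-up example): it verifies that the output has the right shape, and nothing more.

```python
from pydantic import BaseModel, ValidationError

class SupportTicket(BaseModel):
    """The schema the LLM is asked to fill in (illustrative example)."""
    category: str
    priority: int
    summary: str

def parse_llm_output(raw_json: str) -> SupportTicket | None:
    # Schema validation catches malformed output, but it says nothing about
    # whether the content is correct, on-topic, or safe.
    try:
        return SupportTicket.model_validate_json(raw_json)
    except ValidationError:
        return None  # in real code: retry, fall back, or escalate
```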
The problem is AI models aren’t deterministic! Even if you do basic research and look only at the documentation of the OpenAI API, you will see nobody promises deterministic output no matter what parameter you tweak. temperature, top_p, presence_penalty, frequency_penalty, and logit_bias all affect the likelihood of tokens. Still, nobody says “set temperature to 0 to get deterministic output” or “set presence_penalty to -2 to prevent the model from switching topics”.
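You can see this for yourself with a handful of identical requests. The sketch below assumes the official openai Python client (v1+) and a placeholder model name; even with temperature set to 0, the answers are not guaranteed to match.

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY in the environment

client = OpenAI()

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": question}],
        temperature=0,        # the "most deterministic" setting, still not a guarantee
    )
    return response.choices[0].message.content

# Send the exact same request a few times and compare the answers.
answers = {ask("Summarize the plot of Hamlet in one sentence.") for _ in range(5)}
print(f"{len(answers)} distinct answer(s) out of 5 identical requests")
```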
In the PydanticAI documentation, you can find a great take on the current state of Large Language Models:
From a software engineers point of view, you can think of LLMs as the worst database you’ve ever heard of, but worse. If LLMs weren’t so bloody useful, we’d never touch them.
Properly built AI-based software isn’t just an API wrapper. It’s a complex system that requires a lot of ongoing work. The work never ends, and most of it isn’t even programming.
Want to build AI systems that actually work?
Download my expert-crafted GenAI Transformation Guide for Data Teams and discover how to properly measure AI performance, set up guardrails, and continuously improve your AI solutions like the pros.
Why Do You Need MLOps for AI?
AI is machine learning. Your AI-based software needs the same evaluation rigor as machine learning models. Software engineering practices like automated testing are still useful, but nowhere near as reliable as they are for regular software. Suppose you use an AI model deployed on a third-party server and available via API. In that case, a passing test tells you only that, at a given time, the currently deployed version of the model behaved as expected on the one case you tested. It doesn’t guarantee the model will work the same way tomorrow or even in 15 minutes. Moreover, a similar but not identical input may trigger model hallucinations.
We live in the land of stochastic software, where everything we do is supposed to increase the likelihood of desired behavior.
Different Prompt Versions Are Separate Experiments
A prompt isn’t just a configuration. It is almost like a set of hyperparameters for a machine learning experiment.
Also, those are the hyperparameters for a specific version of the model. You can’t keep the same prompt when you update the model or switch to a model produced by a different team. You may be lucky, and the prompts will still work, but nobody can guarantee it.
Therefore, the prompts should be properly documented and version-controlled. You should be able to tell which version of the prompt gave you the best results according to which evaluation dataset and metric. Otherwise, your software engineering process will resemble an attempt to fix a radio by randomly moving the antenna around.
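One lightweight way to do that is to treat every prompt release as an experiment record. The sketch below is a hypothetical structure, not a tool recommendation; all names and numbers are purely illustrative.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)
class PromptVersion:
    """A prompt release treated like an experiment run: the prompt is only
    meaningful together with the model and the evaluation that approved it."""
    prompt_id: str
    version: str
    model: str            # the exact model the prompt was tuned against
    eval_dataset: str     # which dataset produced the metrics below
    metrics: dict = field(default_factory=dict)
    released_on: date | None = None

summarizer_v3 = PromptVersion(
    prompt_id="ticket-summarizer",
    version="3.2.0",
    model="gpt-4o-mini",  # placeholder model name
    eval_dataset="support-tickets-eval-2024-06",
    metrics={"faithfulness": 0.91, "avg_answer_tokens": 87},
    released_on=date(2024, 7, 1),
)
```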
Random Input vs. Edge Cases
If you allow the users to write free-form text as the model’s input, you can safely assume you have to deal with random input. The input may more or less resemble human language, and maybe you can even assume it’s a conversation on some specific topics, but people can send anything to the model. You can’t treat everything you didn’t properly test as an edge case because there is no edge. The boundary between on-topic and off-topic conversation is blurry.
In machine learning, we have evaluation datasets (sets! plural!) and metrics instead of binary-output unit tests. Each metric for each dataset tells us how well the model with its prompts performs on a given task. Extrapolating the results from the evaluation datasets to the real world is a tricky topic. First, you must ensure you have a representative dataset. Second, the metric should be aligned with human expectations. Third, you must remember that the performance on the evaluation dataset tells you nothing about off-topic conversations.
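In practice, an evaluation can be as simple as a loop over a dataset. The sketch below uses a toy exact-match accuracy metric and a placeholder classify() function standing in for your prompt-plus-model call; real projects will need metrics better matched to the task.

```python
# A tiny, illustrative evaluation dataset: real inputs paired with expected outputs.
eval_dataset = [
    {"input": "Cancel my subscription, please.", "expected_category": "billing"},
    {"input": "The app crashes every time I log in.", "expected_category": "bug"},
    # ...ideally a representative sample of real traffic, not just happy paths
]

def classify(text: str) -> str:
    """Placeholder for the whole chain: prompt template + model call + output parsing."""
    raise NotImplementedError

def accuracy(dataset) -> float:
    hits = sum(classify(row["input"]) == row["expected_category"] for row in dataset)
    return hits / len(dataset)

# The result is one number per (dataset, metric) pair,
# tracked across prompt and model versions.
```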
Semantic Drift
AI’s performance may diminish (or improve, but that’s rare) over time, even if you make no changes to the model, prompts, installed libraries, or anything else. The real world changes. People start using the software in new ways or see something new somewhere else and try the same input on your AI-based software.
You can detect semantic drift by checking whether your evaluation datasets still resemble the actual inputs. When the dataset is no longer a representative sample of the real world, you can’t trust the evaluation results. When that happens, you have to build a new evaluation dataset.
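One way to quantify this is to compare embeddings of the evaluation inputs with embeddings of recent production inputs. The sketch below is a rough heuristic: embed() is a placeholder for any sentence-embedding model, and the alerting threshold is up to you.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: any sentence-embedding model returning one vector per text."""
    raise NotImplementedError

def drift_score(eval_inputs: list[str], recent_inputs: list[str]) -> float:
    # Cosine distance between the centroid of the evaluation dataset and the
    # centroid of recent production inputs: the higher it gets, the less
    # representative the evaluation dataset is of real traffic.
    a = embed(eval_inputs).mean(axis=0)
    b = embed(recent_inputs).mean(axis=0)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cosine

# Track this score over time; a steady climb means it is time
# to build a new evaluation dataset.
```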
Semantic drift is shocking for many programmers and almost all product managers. Programmers are used to giving software constant care and updates even if no new features are added, but checking if your tests still make sense is a new concept.
How Do You Start MLOps for AI?
As you already know, you need at least one evaluation dataset with example prompts and expected outputs of your AI model. Additionally, you must pick several metrics telling you how well the model performs the task and whether AI works consistently.
Suppose your AI workflow consists of several steps (for example, a few AI calls, data retrieval based on the AI-generated query, etc.). In that case, you need evaluation datasets for all internal steps to make data-driven decisions regarding the workflow element that needs fixing.
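As a sketch, a two-step workflow (query generation followed by retrieval) would get one evaluation per step, so you know which step to fix. All function names, datasets, and document IDs below are placeholders.

```python
def generate_query(question: str) -> str:
    """Placeholder: LLM call that rewrites a user question into a search query."""
    raise NotImplementedError

def retrieve(query: str) -> list[str]:
    """Placeholder: search call that returns document IDs."""
    raise NotImplementedError

# One evaluation dataset per step (illustrative entries).
query_eval = [
    {"question": "How do I reset my password?", "expected_terms": ["password", "reset"]},
]
retrieval_eval = [
    {"query": "password reset procedure", "relevant_docs": {"kb-101", "kb-204"}},
]

def query_term_coverage(dataset) -> float:
    """Step 1 metric: did the generated query keep the important terms?"""
    scores = []
    for row in dataset:
        query = generate_query(row["question"]).lower()
        hits = sum(term in query for term in row["expected_terms"])
        scores.append(hits / len(row["expected_terms"]))
    return sum(scores) / len(scores)

def retrieval_recall(dataset) -> float:
    """Step 2 metric: how many of the known-relevant documents were retrieved?"""
    scores = []
    for row in dataset:
        retrieved = set(retrieve(row["query"]))
        scores.append(len(retrieved & row["relevant_docs"]) / len(row["relevant_docs"]))
    return sum(scores) / len(scores)
```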
What kind of datasets and metrics? I have written a long article about Troubleshooting AI Agents (it’s also useful when you build an AI workflow, not an agent).
When you build a new version of your prompts and decide to deploy them in production, you should do it as if you were deploying a machine learning model. Once again, AI is machine learning. Usually, we do a shadow deployment followed by monitoring key metrics on the real data processed by the new model version. Later, we gradually roll the new version out, with the current and the new models each handling a portion of the traffic. The process is described in my text about shadow deployments and canary releases.
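The routing logic itself can be simple; the hard part is the monitoring around it. The sketch below shows the shadow and canary phases, with placeholder functions and prompt-version names standing in for your actual serving code.

```python
import random

CANARY_FRACTION = 0.05  # start small; increase only while the monitored metrics hold up

def run_prompt(prompt_version: str, user_input: str) -> str:
    """Placeholder: render the given prompt version and call the model."""
    raise NotImplementedError

def log_for_comparison(user_input: str, current: str, shadow: str) -> None:
    """Placeholder: store both answers so metrics can be compared offline."""

def handle_request_shadow(user_input: str) -> str:
    # Shadow phase: the new prompt version runs on real traffic,
    # but users only ever see the current version's answer.
    current_answer = run_prompt("summarizer-v3", user_input)
    shadow_answer = run_prompt("summarizer-v4", user_input)
    log_for_comparison(user_input, current_answer, shadow_answer)
    return current_answer

def handle_request_canary(user_input: str) -> str:
    # Canary phase: a small, gradually increasing share of requests
    # gets the new version; roll back if the monitored metrics degrade.
    version = "summarizer-v4" if random.random() < CANARY_FRACTION else "summarizer-v3"
    return run_prompt(version, user_input)
```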
Finally, we need to communicate the changes properly. The stakeholders must get used to hearing about metrics and evaluation datasets. You aren’t adding new features to deterministic software. Every change is an experiment. Jason Liu wrote a great guide to “The right way to do AI engineering updates.”
Do you need help building AI-powered applications for your business?
You can hire me!