The Failures That Don’t Crash: MLOps for AI Agents

This article is a text version of my talk, "The Failures That Don’t Crash: MLOps for AI Agents," presented at Berlin Buzzwords 2026.

Table of Contents

Pattern 1: Shadow deployments
Pattern 2: Circuit Breakers
Pattern 3: The Eval Harness
The Human at the Bottom

I lived with other developers at university. Every evening was a conversation about our coding jobs. We looked at .NET with envy, and at LINQ, Windows Presentation Foundation, and Silverlight, which seemed like magic. We were all Java developers, so all we had were verbose strategy patterns and Java Swing for UI development, which made every app look as if it had been designed by someone who’d had beauty described to them but never seen it.

When did we kill the joy of programming? Was it when we stopped talking about code with our colleagues? Was AI coding the final blow to the enjoyment of code?

Imagine talking with friends about what your AI agent did today. You hunch over a table that’s way too low to work at comfortably and drink a sour liquid passing itself off as instant coffee. An episode of Star Trek: The Next Generation plays on the TV in the background, but nobody really watches it, because you all got your first programming jobs a few months ago and every evening turns into a small meetup group. You talk about patterns, libraries, and features you would like to see in the programming language you are forced to use. Now, put an AI agent in that conversation. It would turn into something like this: “I wrote a prompt, and it generated some code. I think it works.”

I think it works.

Perry et al. (Stanford, ACM CCS 2023) studied AI-assisted coding tools. Their paper shows that AI makes code less secure, while we become more confident that it’s secure.

But when it comes to AI, whether used to build an app or built into an app, many can’t even say they think it works because they have no clue. According to LangChain’s State of Agent Engineering report, published in 2026, 38% of agent teams run online evals and 53% run offline evals. Half of the people who use AI in their product have no clue whether it works correctly. At least half. And those who use generic eval libraries are measuring nothing their customers care about.

AI is machine learning, so we can use MLOps techniques we already know. You just need to learn which patterns survive the transition.

Pattern 1: Shadow deployments

In MLOps, we use shadow deployments and A/B testing. We run multiple versions of the same model simultaneously and compare their results.

Four years ago, I owned the ML deployment pipeline at Riskmethods. Sixteen language and source combinations. BentoML. Sagemaker. Customer-facing classification, multi-tenant, can’t go down.

The pattern we built was simple in principle. Every new model got mirrored production traffic before it ever served a real response. The shadow saw exactly what production saw. We compared them on a scoreboard. We promoted nothing until the shadow looked at least as good as the incumbent on every metric we cared about.

That was deterministic ML. Same input, same output. The diff was meaningful. If the shadow version disagreed with production on a sample, you knew something had changed.

With agents, you can’t do that. Run the same prompt twice, and you get two different outputs. Both probably fine. Neither identical. Diffing them is useless.

The pattern still works, but the comparison shifts. Stop asking, “Did the outputs match?” Start asking, “Did the distributions shift?” Latency distribution. Cost per request. Refusal rate. Pass rate on each category in your eval suite.

Latency means both the wait for the full answer and the time to the first token. Cost per response matters because the easiest way to make an AI agent worse is to make it more verbose. Refusal rate matters because the more safeguards you put in, the more likely your model is to reject a valid request. Finally, the eval score per category: the user wanted a result, so did they get it? Score it per category because the aggregate hides regressions. If the new agent gets 10% worse in one category and 3% better in four others, the aggregate looks great. Obviously, the category that got 10% worse is always the one your top customer cares about.
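To make the per-category point concrete, here is a minimal Python sketch. It is an illustration under assumptions, not a production scoreboard: the record layout (a category name plus a pass/fail flag), the function names, the sample numbers, and the 5-point `max_drop` threshold are all invented for the example.

```python
from collections import defaultdict


def pass_rates(results):
    """Return {category: pass rate} for a list of (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def overall(results):
    """Aggregate pass rate across all categories."""
    return sum(int(passed) for _, passed in results) / len(results)


def regressions(incumbent, shadow, max_drop=0.05):
    """Categories where the shadow's pass rate drops by more than max_drop.

    max_drop is a hypothetical threshold; in practice you would set it per category.
    """
    old, new = pass_rates(incumbent), pass_rates(shadow)
    return sorted(c for c in old if c in new and old[c] - new[c] > max_drop)


# Made-up eval results as (category, passed) pairs.
incumbent = (
    [("billing", True)] * 8 + [("billing", False)] * 2
    + [("search", True)] * 70 + [("search", False)] * 20
)
shadow = (
    [("billing", True)] * 6 + [("billing", False)] * 4
    + [("search", True)] * 75 + [("search", False)] * 15
)

print(overall(incumbent), overall(shadow))  # 0.78 0.81 -> the aggregate improved
print(regressions(incumbent, shadow))       # ['billing'] -> one category got worse
```

In this made-up sample the aggregate goes up by three points while the small "billing" category loses twenty, which is exactly the regression an aggregate-only scoreboard would hide.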

Machine learning is full of little traps. Here is the first one. Your shadow looks fine. You promote. Production blows up. Why?

Because your shadow sampled biased traffic. Maybe you only mirrored requests from your top three customers because they generate the volume. The fourth customer hits a category you never tested. Or you sampled only the daytime hours because your dashboards looked nice, and you missed the batch job that runs at 3 in the morning.

If your shadow doesn’t see what production sees, the distribution it’s matching is incorrect. The match is meaningless.
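One way to check for this kind of sampling bias is to compare the traffic mix the shadow saw with the mix production served. The sketch below uses total variation distance over a label such as customer or hour of day. The function names, the customer names, and the choice of metric are assumptions for illustration; any distribution distance would serve the same purpose.

```python
from collections import Counter


def mix(labels):
    """Share of traffic per label, e.g. per customer or per hour of day."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(prod_labels, shadow_labels):
    """Total variation distance between two mixes: 0 is identical, 1 is disjoint."""
    p, q = mix(prod_labels), mix(shadow_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def unseen(prod_labels, shadow_labels):
    """Labels that production serves but the shadow never saw."""
    return sorted(set(prod_labels) - set(shadow_labels))


# Hypothetical customer labels on production and mirrored requests.
production = ["acme"] * 50 + ["globex"] * 30 + ["initech"] * 15 + ["umbrella"] * 5
shadow = ["acme"] * 55 + ["globex"] * 30 + ["initech"] * 15

print(total_variation(production, shadow))
print(unseen(production, shadow))  # ['umbrella']
```

A small distance can still hide a missing segment, which is why the sketch reports unseen labels separately: the fourth customer is only 5% of the volume here, yet the shadow never saw it.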

So that’s Pattern 1. Mirror production traffic into a shadow. Compare distributions, not exact outputs. Watch the per-category metrics, not the aggregate. And make sure your shadow actually sees what production sees.

Notice the question I haven’t answered: who decides if the new distribution is better? Hold that thought.

Pattern 2: Circuit Breakers

You know the classic circuit breaker. The error rate exceeds a threshold, so you trip the breaker and fall back to a cached response or a default. Stop hammering the downstream. Wait for it to recover.

That pattern assumes failures look like failures. Exceptions. Timeouts. HTTP 5xx.

Agents don’t fail like that. Agents succeed at the protocol level; they return a string, they exit cleanly, and the trace looks fine, even though they are wrong. Confidently. Plausibly. Wrong.

Error-rate circuit breakers are useless against a system whose failure mode is looking right. But we can still have some error detection.

Hard limits at the platform layer. Maximum steps. Maximum tokens. Wall-clock budget. These are the railings. If your agent runs for fifteen minutes when it should have run for ninety seconds, something has gone catastrophically wrong, and you want to fail loudly, not let it cook.
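A hard-limit railing can be as small as a budget object that the agent loop charges on every step. This is a sketch under assumptions: the class names, the default limits, and the idea that your loop can report token usage per step are all hypothetical, and a real platform would enforce the same limits outside the agent process as well.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when an agent run crosses a hard limit."""


class RunBudget:
    """Tracks steps, tokens, and wall-clock time for one agent run."""

    def __init__(self, max_steps=20, max_tokens=50_000, max_seconds=90.0,
                 clock=time.monotonic):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock          # injectable so the limit can be tested
        self._started = clock()
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens):
        """Record one step and its token usage; fail loudly if a limit is crossed."""
        self.steps += 1
        self.tokens += tokens
        elapsed = self._clock() - self._started
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"wall-clock limit {self.max_seconds}s exceeded")


budget = RunBudget(max_steps=3, max_tokens=1_000, max_seconds=90.0)
try:
    for _ in range(10):            # stand-in for the real agent loop
        budget.charge(tokens=100)
except BudgetExceeded as err:
    print(f"aborted: {err}")       # aborted: step limit 3 exceeded
```

Raising an exception rather than returning a flag is deliberate in this sketch: it turns a quiet overrun into the kind of loud failure that existing alerting already understands.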

Loop detection. If the agent calls the same tool with the same arguments twice in a row, it’s stuck. Interrupt it. Don’t wait for the token budget to run out.
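The same-tool, same-arguments rule is simple enough to sketch directly. The class and method names below are hypothetical, and the sketch assumes tool arguments can be serialized to JSON (anything that cannot is compared by its string form).

```python
import json


class LoopDetector:
    """Flags a tool call that is identical to the one immediately before it."""

    def __init__(self):
        self._last = None

    def is_repeat(self, tool_name, arguments):
        # Canonical JSON with sorted keys, so key order cannot hide a repeat.
        signature = (tool_name, json.dumps(arguments, sort_keys=True, default=str))
        repeated = signature == self._last
        self._last = signature
        return repeated


detector = LoopDetector()
print(detector.is_repeat("search", {"query": "invoice 42", "limit": 5}))  # False
print(detector.is_repeat("search", {"limit": 5, "query": "invoice 42"}))  # True
```

When `is_repeat` returns True, the agent loop would interrupt the run instead of waiting for the token budget to run out.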

And finally, a human gate. Reserved for the small subset of actions that are expensive to reverse. Not for everything.
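A human gate that covers only the expensive-to-reverse subset might look like the sketch below. The tool names, the `execute` function, and the `run_tool` and `approve` callables are all hypothetical; in a real system `approve` would block on a review queue or a chat approval rather than return immediately.

```python
# Hypothetical names of tools whose effects are expensive to reverse.
IRREVERSIBLE = {"send_payment", "delete_account"}


def execute(tool_name, arguments, run_tool, approve):
    """Run a tool call, asking a human first only when the action is hard to undo.

    run_tool(tool_name, arguments) performs the action.
    approve(tool_name, arguments) returns True if a human signed off.
    Both are supplied by the caller and are assumptions of this sketch.
    """
    if tool_name in IRREVERSIBLE and not approve(tool_name, arguments):
        return {"status": "rejected", "tool": tool_name}
    return {"status": "done", "tool": tool_name,
            "result": run_tool(tool_name, arguments)}
```

Reversible calls never reach `approve`, so the reviewer sees only the small set of actions where their attention is worth spending.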

In the case of AI agents that need to cooperate with other software, we can treat them as an unreliable external API.

Last year, I got called in to fix an AI pipeline. Invoices in, structured data out, into a vector database. Built with a popular no-code tool. The prompt told the model: “the entire output MUST be ONLY the raw JSON object.” That was the contract. A sentence. In English. Begging.

The pipeline was hallucinating fields, skipping fields, and crashing before anything reached the database. So someone on the team was manually entering those invoices while they figured it out. The pipeline that was supposed to save labor was generating it.

I replaced the brittle prompt-magic with a small REST service. Wrapped the model in a BAML function to get schema enforcement, validation, and automatic retries. Swapped the OpenAI LLM for Mistral 7B running locally. Took less than a working day.

Extraction accuracy went from ‘we don’t know because it crashes’ to 95%, with 100% structural correctness on every field.

The contract was at the wrong layer, inside English, inside the prompt. I moved it to the machine boundary, where contracts belong. Schema enforcement IS a circuit breaker. It trips on malformed output before that output enters your system.
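The project described above used BAML for this. The sketch below is not BAML and not the code from that project; it only illustrates the same idea with the Python standard library so that it runs anywhere. The field names, the `call_model` callable, and the retry count are assumptions.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"invoice_number": str, "total": (int, float), "currency": str}


class ContractViolation(ValueError):
    """Raised when model output does not satisfy the schema."""


def parse_invoice(raw):
    """Parse raw model output and enforce the schema before anything else sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ContractViolation(f"not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ContractViolation("expected a JSON object")
    missing = set(REQUIRED_FIELDS) - set(data)
    extra = set(data) - set(REQUIRED_FIELDS)   # hallucinated fields
    if missing or extra:
        raise ContractViolation(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data[name], expected):
            raise ContractViolation(f"field {name!r} has the wrong type")
    return data


def extract(document, call_model, max_attempts=3):
    """Call the model until its output passes the contract, or give up loudly."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_invoice(call_model(document))
        except ContractViolation as err:
            last_error = err
    raise ContractViolation(f"gave up after {max_attempts} attempts: {last_error}")
```

Here `call_model` is any function that takes the document text and returns the raw model output as a string. Malformed output raises inside `extract` and never reaches the database, which is the breaker behaviour described above.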

So that’s Pattern 2. Stack your breakers. Hard limits, loop detection, API contracts, human gate. Hook them at the tool-call boundary, where actions happen. The signal you are looking for is not just exceptions. You can track confidence drops, logprobs if your model exposes them, retrieval scores, and schema validation pass rates. Anything that tells you the agent’s certainty just collapsed before its output broke.
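A breaker that trips on a soft signal rather than on exceptions could be sketched as a rolling pass rate. The class name, window size, and thresholds below are assumptions. The `ok` flag stands for whichever signal you track: a schema validation result, a retrieval score above a cut-off, or a confidence check.

```python
from collections import deque


class SignalBreaker:
    """Opens when the rolling pass rate of a quality signal falls below a floor."""

    def __init__(self, window=50, min_pass_rate=0.9, min_samples=10):
        self._outcomes = deque(maxlen=window)   # keeps only the latest outcomes
        self.min_pass_rate = min_pass_rate
        self.min_samples = min_samples

    def record(self, ok):
        """Record whether the latest output passed the check."""
        self._outcomes.append(bool(ok))

    @property
    def pass_rate(self):
        if not self._outcomes:
            return 1.0
        return sum(self._outcomes) / len(self._outcomes)

    @property
    def is_open(self):
        """True when there is enough data and the pass rate is below the floor."""
        enough = len(self._outcomes) >= self.min_samples
        return enough and self.pass_rate < self.min_pass_rate


breaker = SignalBreaker(window=10, min_pass_rate=0.8, min_samples=5)
for ok in [True, True, True, True, True, False, False, False]:
    breaker.record(ok)
print(breaker.pass_rate, breaker.is_open)  # 0.625 True
```

The `min_samples` guard keeps a single early failure from tripping the breaker before there is enough traffic to say anything about the rate.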

Pattern 3: The Eval Harness

Look. At. Your. Data.

Look at your data.

AI agents don’t fail loudly. You remember the first time your application crashed during a public presentation or a client demo. I was on a stage, with 30 to 60 people looking at a screen that mirrored an Android app. Suddenly, a quarter of that screen was filled with a NullPointerException. It turns out there is a difference between no list and an empty list. I have had worse null pointer bugs since, but I still remember that one. AI agents won’t do that to you. They will just quietly become useless.

There is another way agents fail quietly, and it is the one that scares me most. What if your assumptions are wrong? The agent is doomed before you even start.

I remember one strategic initiative for high-value locations. Board-level. C-level dashboards. Several teams in motion: product, engineering, user-generated content, and marketing.

The whole thing rested on one premise: those locations were hotels by the sea. Beach destinations. That assumption shaped the marketing: audience targeting, UX, advertising.

Nobody had plotted them on a map. So I plotted them. Five minutes in a notebook. Those hotels were everywhere. Mountains. Cities. By the sea, too, obviously. The premise was wrong. I talked with the product team and we cut about 40% of the locations. The board metric started meaning something.

Four teams. Board-level initiative. Saved by five minutes of looking at the data. Nobody had looked. Not because anyone was lazy. Because the assumption was old, confident, and felt true. At that point in the project, nobody’s job was looking.

Look at your data. Start with inputs. Pull a sample of the last hundred requests your agent saw.

Bucket them by intent: “What was the user actually trying to do?”
Bucket them by domain: “Which part of your product or which customer segment?”
Bucket them by length, because the short ones break differently from the long ones.
Finally, bucket them by language, because every agent that ships in English breaks first in everything else.
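A spreadsheet is enough for this, but if the sample already lives in a notebook, the four bucketings above fit in a few lines. In this sketch the intent, domain, and language labels are assumed to be assigned by hand, and the word-count cut-offs and sample requests are invented.

```python
from collections import Counter


def length_bucket(text, short_max=5, long_min=50):
    """Bucket a request by word count; the cut-offs are arbitrary starting points."""
    words = len(text.split())
    if words <= short_max:
        return "short"
    if words >= long_min:
        return "long"
    return "medium"


def summarize(requests):
    """Count requests per bucket along each dimension."""
    summary = {dim: Counter(r[dim] for r in requests)
               for dim in ("intent", "domain", "language")}
    summary["length"] = Counter(length_bucket(r["text"]) for r in requests)
    return summary


# Hand-labelled sample; in practice this would be the last hundred requests.
sample = [
    {"text": "refund?", "intent": "refund", "domain": "billing", "language": "en"},
    {"text": "Wo ist meine Rechnung vom letzten Monat geblieben?",
     "intent": "find invoice", "domain": "billing", "language": "de"},
    {"text": "cancel my plan", "intent": "cancel", "domain": "account", "language": "en"},
]

for dimension, counts in summarize(sample).items():
    print(dimension, dict(counts))
```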

You don’t need a tool for this. You need an afternoon and a spreadsheet. The point of the categorization is not the categories. The point is that you can no longer pretend the input space is a single blob.

Same thing on the output side. Pull a sample of the bad responses. Bucket them too.

Hallucinated facts: the agent invented something.
Schema violations: the agent broke the contract you tried to enforce in Pattern 2.
Wrong refusals: the agent refused something it should have handled.
Wrong tool call: the agent picked the wrong action.
Infinite loop: the agent never terminated.
Off-topic: the agent wandered.

Six buckets, minimum. You will discover a seventh and an eighth. Add them.
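Giving the buckets names in code keeps labels consistent across reviewers. The sketch below assumes each bad response gets exactly one hand-assigned label; the enum and the sample labels are invented for illustration.

```python
from collections import Counter
from enum import Enum


class Failure(Enum):
    HALLUCINATED_FACT = "hallucinated fact"
    SCHEMA_VIOLATION = "schema violation"
    WRONG_REFUSAL = "wrong refusal"
    WRONG_TOOL_CALL = "wrong tool call"
    INFINITE_LOOP = "infinite loop"
    OFF_TOPIC = "off-topic"
    # Add the seventh and eighth bucket here once you discover them.


def biggest_bucket(labels):
    """Return (failure, count) for the most common label in a labelled sample."""
    return Counter(labels).most_common(1)[0]


# Hypothetical labels from a hand review of bad responses.
labels = [
    Failure.SCHEMA_VIOLATION, Failure.WRONG_REFUSAL, Failure.SCHEMA_VIOLATION,
    Failure.HALLUCINATED_FACT, Failure.SCHEMA_VIOLATION, Failure.OFF_TOPIC,
]
failure, count = biggest_bucket(labels)
print(failure.value, count)  # schema violation 3
```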

You can’t fix what you can’t name. If you can’t name it, you can’t talk about it with your team or your colleagues in the student dormitory. A shocking failure mode is even more interesting than discovering ORMs for the first time in your career.

Once you have buckets, you have a harness. Three layers.

A regression set for each category you just defined. Inputs you’ve seen before, with the outputs you wanted. You run these on every change. You watch for new categories or shifts in the category distributions.
LLM as a judge with a specification of what good looks like. Not vibes. A written-down rubric that the judge model can score against.
A small human-graded gold set. Maybe a hundred examples. The gold set is what you use to check the judge. Every month, or every week if you can, re-grade the gold set by hand and compare it to the judge’s grades. If they’re drifting apart, the judge is broken, and you fix it before your eval starts lying to you.
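Checking the judge against the gold set comes down to an agreement measure. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. Using kappa here is my assumption, not something the talk prescribes, and the grades are made up.

```python
from collections import Counter


def agreement(human, judge):
    """Share of gold-set examples where the judge's grade matches the human's."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty grade lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for chance: 1 is perfect, 0 is no better than chance."""
    observed = agreement(human, judge)
    n = len(human)
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:          # both graders used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up grades for a ten-example gold set.
human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] + ["pass"] + ["fail"] * 3

print(agreement(human, judge))               # 0.8
print(round(cohens_kappa(human, judge), 3))  # 0.583
```

Tracking the same number after each re-grade turns "drifting apart" into a trend line. The threshold for "the judge is broken" remains a human decision.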

And who graded the gold set? Who re-anchors the judge? A small group of humans who now use AI tools to do their grading. Hold that thought, too. We’re going to come back to all three.

Pattern 1 needs a human to decide whether the new distribution is better.
Pattern 2 needs a human to set what counts as confident, what “calibrated” means.
Pattern 3 needs a human to grade the gold set the entire eval pyramid balances on.

Every pattern I just gave you has a human at the bottom.

The Human at the Bottom

So let me tell you what we know about that human.

The human who reviews your dataset could once remember 200 lines of code just to tell their colleagues about them in the afternoon. That human could build an Android app in a single evening to make their friend’s phone go from silent to full volume after receiving a text message with a Futurama quote, so they could find the phone when it got lost again. But the same human is incapable of babysitting an automated system.

In 1993, Parasuraman and colleagues had operators monitor an automated system. The system was reliable; most of the time, it did the right thing. They measured how long it took for the human to stop catching the system’s occasional mistakes.

Twenty minutes. After twenty minutes of watching a system that mostly worked, the human stopped catching the moments it didn’t. The more reliable your system has been in front of a particular reviewer, the worse that reviewer is at detecting eventual failures.

The longer your agent has been working correctly, the less likely your reviewer is to catch the moment it stops.

The obvious response is: train the reviewers. Send them to a workshop. Tell them to be careful. Doesn’t work.

Bahner, Hüper, and Manzey (2008) compared two training conditions. Condition one: tell people the system can fail. Condition two: make them experience the system failing. Only condition two reduced subsequent errors. The lecture doesn’t work. The fire drill does.

So what do we do? Inject known-bad outputs into your review queue. A handful per week. Stuff that should be caught. Track who catches them and who doesn’t. The injection rate is your control surface; the catch rate is your oversight metric.
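The injection and the catch rate can be sketched as two small functions. Everything here is hypothetical: the queue layout, the reviewer names, and the assumption that each review records whether the item was flagged as bad.

```python
import random
from collections import Counter


def build_queue(real_items, canaries, seed=None):
    """Mix known-bad canaries into the review queue at random positions.

    Returns a list of (item, is_canary) pairs. The is_canary flag is kept
    for scoring and must not be shown to the reviewer.
    """
    queue = [(item, False) for item in real_items]
    queue += [(item, True) for item in canaries]
    random.Random(seed).shuffle(queue)
    return queue


def catch_rates(reviews):
    """Share of canaries each reviewer flagged as bad.

    reviews is an iterable of (reviewer, was_canary, flagged_as_bad).
    """
    seen, caught = Counter(), Counter()
    for reviewer, was_canary, flagged in reviews:
        if was_canary:
            seen[reviewer] += 1
            caught[reviewer] += int(flagged)
    return {reviewer: caught[reviewer] / seen[reviewer] for reviewer in seen}


reviews = [
    ("ana", True, True), ("ana", True, True), ("ana", False, False),
    ("ben", True, False), ("ben", True, True), ("ben", False, False),
]
print(catch_rates(reviews))  # {'ana': 1.0, 'ben': 0.5}
```

The number of canaries passed to `build_queue` is the control surface, and the output of `catch_rates` is the oversight metric.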

Years ago, people stood on conference stages talking about chaos engineering and injecting failure into software systems. Let’s do the same to your human reviewers. You can’t fix the humans. You don’t have to. You just have to assume they’ll fail and engineer for the day they do.

The last thing. What should you do on Monday morning? Six things.

Categorize your last hundred inputs.
Categorize your last hundred failures.
Pick the biggest failure bucket. That’s your gold set seed.
Version the gold set like production code. Pull requests. Reviews. History.
Inject five known-bad outputs per week into your review queue.
Make every irreversible action expensive.

I asked at the start when we killed the joy of programming. We didn’t. We outsourced the looking. The joy was always in finding the weird thing: the failure mode nobody had named yet, the contract at the wrong layer, and now, the reviewer who stopped catching anything after twenty minutes. That work is still there. AI didn’t take it. It just moved somewhere else. Now you find the weird stuff by looking at your data.

Go. Look at your data.