MLOps at small companies

This article is a text version of my talk, "MLOPs for the rest of us," which I presented during the Infoshare conference (October 6-7, 2022 in Gdańsk, Poland).


If you need a month to get a machine learning model in production, your company is already dead.

My data engineering team has created a setup that allows us to deploy a new model in production in less than 60 minutes. It didn’t happen overnight. We started simple and kept adding new features when the lack of them hurt us. We didn’t want to overengineer it. After all, you can’t overcomplicate things when you have only one MLOps engineer on the team.

It wasn’t all unicorns and rainbows. It mainly looked like this:

The first text message

It’s a text message I received on Saturday morning. I didn’t notice it for three hours. By the time I replied, the problem was already solved. So nothing happened, right? Not quite. The next day, I received this message:

The second text message

Our models were producing the same prediction no matter the input data. That’s not good. How did it happen?

To answer that question, I have to tell you what we do.

Supply Chain Risk Management

We create Supply Chain Risk Management Software. It sounds fancy. What does it mean? If you own a factory, we will warn you when something happens to your suppliers, and you may get affected. We track real-world events. Our machine learning models tell us whether those events are relevant. To be precise, they tell us whether the events are irrelevant. We use machine learning to filter out useless information. The rest goes to our risk assessment team, who decide whether our clients should get a notification or not.

Now you can see why getting the same relevance score from the model for every event is a big problem. We overwhelmed the team, and they could have missed an important event.

How did it happen? If you aren’t familiar with machine learning models that process text, you need to know that such a setup consists of two parts: the machine learning model itself and word embeddings. Machine learning models work with numbers. The word embeddings convert text into those numbers.

Problems with word embeddings

I received the first text message when we ran out of space in Redis. As you may expect, we use Redis as a cache. The same Redis database stores our embeddings. The person fixing the issue replaced the Redis instance with another one with more storage space. That’s ok. But they forgot to run the script that populates the database with embeddings again.

It was easy to fix the issue when we realized the cause. We ran the script, and everything worked fine again.
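To see why missing embeddings lead to identical predictions, here is a minimal sketch. It is not our production code. It assumes the lookup falls back to a zero vector for unknown words, which is a common choice. The names `embed` and `coverage` are hypothetical, and a plain dictionary stands in for Redis.

```python
from typing import Dict, List

Vector = List[float]


def embed(tokens: List[str], store: Dict[str, Vector], dim: int) -> Vector:
    """Average the vectors of the tokens; unknown tokens count as zeros."""
    total = [0.0] * dim
    for token in tokens:
        vector = store.get(token, [0.0] * dim)  # assumed fallback behavior
        total = [a + b for a, b in zip(total, vector)]
    count = max(len(tokens), 1)
    return [value / count for value in total]


def coverage(tokens: List[str], store: Dict[str, Vector]) -> float:
    """Fraction of tokens that have an embedding in the store."""
    if not tokens:
        return 1.0
    return sum(token in store for token in tokens) / len(tokens)


populated = {"factory": [1.0, 0.0], "fire": [0.0, 1.0]}
empty: Dict[str, Vector] = {}  # a fresh instance nobody populated

# With an empty store, every text collapses into the same zero vector,
# so the model receives identical input no matter what the event says.
same_a = embed(["factory", "fire"], empty, dim=2)
same_b = embed(["port", "strike"], empty, dim=2)
```

A check like `coverage` is cheap to compute per request. Under these assumptions, a value that suddenly drops to zero would point at the embeddings store rather than at the model.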

But none of this would have happened if we had upgraded our models faster.

We needed those embeddings only for the old models created a few years ago. We ran them in Tensorflow Serving. Tensorflow Serving doesn’t run any preprocessing code, so we put the text-to-vector conversion inside the backend application that uses the models.

In the new setup, we have AWS Sagemaker Endpoints with the models and embeddings inside a single Docker container deployed as a single endpoint. But at that time, we were in the middle of moving the models to Sagemaker.

Why were we moving models to Sagemaker?

We started with a handful of models supporting a few languages. Those models were created by the previous machine learning team who left the company. There was nothing wrong with their models. I admire the simplicity of those models. The old team deployed the models in Tensorflow Serving running on a Heroku instance together with the backend service using them.

But machine learning models drift. Over time, they perform worse because reality differs from what they learned from the training dataset. For example, the previous team trained our old models so long ago that they knew nothing about COVID. We had to retrain them, but we didn’t want to make them merely as good as they used to be. We wanted a bigger improvement. We had to switch to a more powerful model architecture: the BERT model.

A single BERT model takes over 1 GB of storage space. A single Heroku application has a hard limit of 500 MB. There is no way to deploy BERT models in Heroku. We needed something else.

We built Docker containers using BentoML and deployed them as Sagemaker Endpoints. That’s not complicated. You can get a working setup by following the official BentoML tutorial, or read my tutorial. However, I have to warn you. You have probably heard this thousands of times, but don’t copy-paste code. You will make a mistake while editing it later.

When I had to deploy a second model in Sagemaker, I copied the repository with the first one and started making changes. I didn’t want to extract the common code yet because I wasn’t sure what we would need in every model. It was too early to create a general solution. Of course, I forgot to modify a parameter in the new repository and deployed an endpoint that didn’t work. That’s a problem you can spot easily. There were worse issues.

Breaking changes in Sagemaker

One day, I realized that the AWS deep-learning containers project was still at a stage of development where breaking changes weren’t a big deal. At least for its authors. For me, a user, those breaking changes were a big deal. I tried to deploy a Sagemaker Endpoint, and it couldn’t find a model in the Docker container. I hadn’t changed anything in the deployment code, which had worked a few days earlier. What happened?

Someone decided that leading zeros in model versions make no sense. I had one model in my container, so I put zero as the model version! The fix was easy, but I would rather find out that my deployment code no longer works before I urgently need to deploy a model. So we added a deployment test pipeline in AWS CodePipeline. Once a day, it deployed a Sagemaker Endpoint, ran a few tests, and removed the endpoint. That way, we would know when AWS made another breaking change in Sagemaker.

Testing the preprocessing code

A machine learning model isn’t just the model. We already know that in the case of text processing, we have word embeddings too, but that’s not the only thing we have. There is also data preprocessing code. After all, a Sagemaker Endpoint is just a REST API that gets some text and returns the relevance score. The preprocessing has to happen inside of it.

Like any code, the preprocessing may have a bug too. How would you know? For example, you may see weird predictions in production. That’s a little bit too late. Our deployment pipeline needed tests. Now, we run the Docker container locally in the CodePipeline and send a set of queries to the model. We deploy the model only if the responses match the expected values provided by machine learning engineers. It’s not the best test you can write. It won’t tell us which part of the code is broken, but at least we won’t deploy a bug to production.
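Such a test can be sketched in a few lines. This is an illustration under assumptions, not our pipeline code: `failed_queries`, the tolerance value, and the example queries are hypothetical, and `fake_predict` stands in for an HTTP call to the locally running container.

```python
from typing import Callable, Dict, List


def failed_queries(
    predict: Callable[[str], float],
    expected: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return the queries whose score differs from the expected value."""
    return [
        text
        for text, score in expected.items()
        if abs(predict(text) - score) > tolerance
    ]


# Hypothetical stand-in for a request sent to the local container.
def fake_predict(text: str) -> float:
    return 0.9 if "fire" in text.lower() else 0.1


# Expected scores would come from the machine learning engineers.
golden = {
    "Fire at a supplier's factory": 0.9,
    "Local football results": 0.1,
}
failures = failed_queries(fake_predict, golden)
```

The pipeline step can fail the build whenever `failures` is not empty. A small tolerance leaves room for harmless floating-point differences between environments.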

Selecting the right model

In the backend application, we had to switch between the old models in Tensorflow Serving and the new ones in Sagemaker. We didn’t want to run a simple replacement. We knew the new models were better, but how much better were they? We wanted to measure their performance. In short, we needed to A/B test the new models.

There are two ways to test new machine learning models in production. You can run a shadow deployment. In this setup, the old model still handles the entire traffic and generates responses. The new model gets a copy of the requests and processes them, but nobody uses the results. You store them somewhere and compare the values later. That was the first part of our release.
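A shadow deployment can be sketched like this. The function name, the shape of the log record, and the in-memory list used as storage are assumptions made for the example.

```python
from typing import Callable, Dict, List

Model = Callable[[str], float]


def predict_with_shadow(
    text: str, primary: Model, shadow: Model, log: List[Dict[str, object]]
) -> float:
    """Serve the primary model's score; record the shadow's score for later."""
    served = primary(text)
    try:
        shadow_score = shadow(text)
    except Exception:  # a failing shadow model must never break production
        shadow_score = None
    log.append({"text": text, "primary": served, "shadow": shadow_score})
    return served


comparison_log: List[Dict[str, object]] = []
result = predict_with_shadow(
    "Port strike",
    primary=lambda text: 0.4,  # stands in for the old model
    shadow=lambda text: 0.7,   # stands in for the new model
    log=comparison_log,
)
```

Only the primary score reaches the caller. In a real service, the shadow call would usually run asynchronously so it doesn’t add latency to the response.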

After some time, we needed to switch to the new models. Again, the simple replacement of old models with new ones wasn’t a safe option. We needed to perform a canary release. That’s the second way to test machine learning models in production.

In this setup, a small percentage of the requests gets sent to the new model, and the model returns values used in production. The application tracks which model generated the prediction so we can compare the results later. We slowly increased the traffic percentage until the new model received 100% of requests.
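One way to split the traffic is to hash a request identifier into a bucket, as sketched below. This is an assumption made for illustration. A random draw per request works too, but hashing keeps the decision stable for a given identifier. The names `bucket` and `choose_model` are hypothetical.

```python
import hashlib


def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_model(request_id: str, canary_percent: int) -> str:
    """Return which model should serve the request."""
    return "new" if bucket(request_id) < canary_percent else "old"


# Route 10,000 made-up request IDs with a 10% canary.
routed = [choose_model(f"event-{i}", canary_percent=10) for i in range(10_000)]
canary_share = routed.count("new") / len(routed)
```

The application stores the returned label next to the prediction, which is what makes the later comparison possible. Raising `canary_percent` step by step to 100 completes the release.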

Controlling the canary release and shadow deployments

You need a way to control the canary release configuration. What’s the simplest option? Environment variables! For every model, we had a parameter with the Sagemaker Endpoint name, the percentage of traffic it should handle, the properties it should extract and pass to the model, and the threshold value for the relevance score. All of that, for every supported language. Even if we use a multilanguage model, the source of data for the model, the threshold, and the canary release configuration may differ. Keeping that in environment variables works fine when you have one model. It became a mess after we deployed the third model. We needed something else.

We needed a separate service to configure the models - AWS AppConfig. It stores the application configuration as a JSON object. The backend application can use the AWS SDK to retrieve the JSON. In our application, we cache the configuration for some time to avoid retrieving the same data a thousand times per minute. Because of that, we get better response times.
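The caching idea can be sketched as a small wrapper with a time-to-live. The class name, the JSON key names, and the 60-second TTL are assumptions. In a real application, `fetch` would wrap the AWS SDK call that retrieves the configuration; here a fake function and a fake clock keep the example self-contained.

```python
import json
import time
from typing import Callable, List, Optional


class CachedConfig:
    """Fetch a JSON configuration and reuse it until the TTL expires."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._expires_at = 0.0

    def get(self) -> dict:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = json.loads(self._fetch())
            self._expires_at = now + self._ttl
        return self._value


calls: List[int] = []


def fake_fetch() -> str:
    """Hypothetical replacement for the remote call; counts invocations."""
    calls.append(1)
    return json.dumps({
        "en": {"endpoint": "relevance-en", "traffic_percent": 10,
               "fields": ["title", "body"], "threshold": 0.5}
    })


fake_now = [0.0]
config = CachedConfig(fake_fetch, ttl_seconds=60, clock=lambda: fake_now[0])
config.get()
config.get()         # served from the cache, no second fetch
fake_now[0] = 61.0   # pretend a minute has passed
config.get()         # TTL expired, so the configuration is fetched again
```

The TTL is a trade-off: a longer value means fewer remote calls, but a configuration change (or a rollback) takes longer to reach every instance.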

Monitoring of configuration changes

What will happen when you have a big JSON that controls everything? You will make a mistake. I configured one endpoint to use a field that doesn’t exist in the language supported by its model. I made the change. Then I deployed it. I walked away. I made a cup of tea. I looked at the email on my phone. After some time, I returned to the computer. And I saw that a small percentage of requests were failing. It was a small percentage of overall requests. On the other hand, it was also 100% of requests in that misconfigured language. Fortunately, you can revert a change quickly in AppConfig.

After that, I needed a way to validate the configuration and revert it automatically in case of errors. Both of those things are supported in AppConfig out of the box. We configured an AWS Lambda function to validate the configuration JSON. That covers the validation part. For the automatic revert, we had to log a CloudWatch metric with the response status. In addition, we needed a CloudWatch alarm to monitor the metric. AppConfig can revert a change automatically when an alarm gets triggered after a deployment.
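The validation logic itself can be plain Python. The sketch below shows the kind of check that would have caught my mistake. The key names, the assumption that thresholds fall between 0 and 1, and the `available_fields` mapping are hypothetical. A real validator would also need the Lambda handler around it.

```python
from typing import Dict, List, Set

# Hypothetical key names for one language's model configuration.
REQUIRED_KEYS = {"endpoint", "traffic_percent", "fields", "threshold"}


def validate(
    config: Dict[str, dict], available_fields: Dict[str, Set[str]]
) -> List[str]:
    """Return a list of problems; an empty list means the config is valid."""
    problems = []
    for language, settings in config.items():
        missing = REQUIRED_KEYS - set(settings)
        if missing:
            problems.append(f"{language}: missing keys {sorted(missing)}")
            continue
        if not 0 <= settings["traffic_percent"] <= 100:
            problems.append(f"{language}: traffic_percent out of range")
        if not 0.0 <= settings["threshold"] <= 1.0:  # assumed score range
            problems.append(f"{language}: threshold out of range")
        unknown = set(settings["fields"]) - available_fields.get(language, set())
        if unknown:
            problems.append(f"{language}: unknown fields {sorted(unknown)}")
    return problems


available = {"en": {"title", "body"}, "de": {"title"}}
candidate = {
    "en": {"endpoint": "relevance-en", "traffic_percent": 10,
           "fields": ["title", "body"], "threshold": 0.5},
    "de": {"endpoint": "relevance-de", "traffic_percent": 100,
           "fields": ["title", "body"], "threshold": 0.5},
}
problems = validate(candidate, available)
```

Here `problems` reports that the `de` configuration references a field that doesn’t exist for that language, so the change would be rejected before it reaches any application instance.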

But what does “after a deployment” mean?

The revert feature works because a deployment consists of two parts. The first is the deployment itself, when AppConfig rolls out the change to all application instances. The second is the baking time, when AppConfig waits for a signal to roll back the change. If the alarm gets triggered, AppConfig automatically reverts to the previous configuration.

What I expect

I think I will never again get a text message about a production failure on the weekend. But I also know what the next problem will be. One day, I will make a mistake while configuring the models, and nobody will notice it for days. It won’t be a mistake that triggers the alarm, but, for example, a wrong threshold assigned to the wrong language in the configuration.

What will happen next?

Perhaps we will move the configuration to a git repository and require a pull request with approval before we merge it and automatically deploy the change. Why don’t I implement it now? Because I don’t need it yet. I may never need it, and I need to implement dozens of other things soon.

If you need a month to deploy a new model, don’t worry. You can do it step by step. You can do it as we did or use one of the ML platforms. I recommend the Qwak.ai platform (I work for them as a part-time, freelance MLOps evangelist). I didn’t use Qwak because, when I was developing that deployment pipeline, I didn’t know Qwak existed.

Deploying the first model

Many of you may be deploying your first ML model. Don’t overthink it. Deploy it. Fix the problems later. Don’t be yet another company that needs three months of work to get a model in production just because you want to prepare for everything. You won’t be ready for everything anyway.

If you deploy the first model, pretty much the only thing you need is a place to deploy it. Anything that can run a Docker container will do. You don’t need an automated training pipeline yet. You don’t need a feature store. You most likely can’t build a feature store because your machine learning engineer already forgot the SQL query they used to retrieve the training data.

What else do you not need yet? Experiment tracking? No. You have one model. You have one version of one model. Experiment tracking gives you no benefits at such an early stage.

In the case of the first model, you need to get results fast. You need to show that machine learning makes sense in the case of your project. When you prove it, you can add the missing parts. After all, your first model may be a flop, and the business may decide to give up on machine learning. (Imagine that…)

What’s next? When you start working on the second model, I recommend building a proper, deterministic training pipeline. You want to get the same model every time you run the pipeline with the same input data. No randomness.
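As a small illustration of what “no randomness” means in practice, here is a repeatable train/validation split. It is a sketch with hypothetical names. A real pipeline also has to seed the training framework itself, which this example doesn’t cover.

```python
import random
from typing import List, Sequence, Tuple


def split(
    rows: Sequence[str], seed: int, validation_share: float = 0.2
) -> Tuple[List[str], List[str]]:
    """Shuffle with a private, seeded generator so the split is repeatable."""
    ordered = sorted(rows)  # don't depend on the order the database returned
    random.Random(seed).shuffle(ordered)
    cut = int(len(ordered) * (1 - validation_share))
    return ordered[:cut], ordered[cut:]


rows = [f"doc-{i}" for i in range(10)]
train, validation = split(rows, seed=42)
```

Sorting before shuffling matters because a database query without an explicit order may return rows in a different sequence on every run. With the sort and a fixed seed, the same input data always produces the same split.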

But the data will change. So the next step is tracking the changes in the data. Soon you will discover that having a feature store starts to make sense. The feature store creates a separation layer between the data engineers who retrieve the data from the source databases and preprocess it and the machine learning engineers who train the models. With a feature store, both teams may work independently without affecting each other.

What’s next? Deployment automation. A single button you can click to get a model deployed in production is even better than an automated training pipeline. It’s better because you can use the same mechanism to revert to the previous version of the model.

Speaking of model versions. At this point, machine learning engineers will need an experiment-tracking tool. Of course, using Excel for tracking experiment results is always a possible solution. However, a proper experiment tracking setup tracks the value of the validation metric and the entire code that trained the model.

You can deploy every tool independently, but you don’t need to. When you get overwhelmed with day-to-day operations, switch to an ML platform. You will know exactly what you expect from them and appreciate those MLOps platforms more.
