What is the most important KPI for an MLOps team? Is it the number of deployed models? Is it the average model accuracy? What about the amount of processed data or processing time? Don’t forget about user satisfaction. After all, someone sees the model predictions and may be disappointed with them.
What is it? What is the most important KPI? All of those things matter, so how can we pick the one most important metric?
First of all, we should pick one metric. If there are ten most important things, none of them is important, or the team has no idea what they are doing. Possibly both. There is only one thing preventing us from improving all of the other metrics. Measure it.
In business, we often pretend that one KPI does not exist. It is implicit. We are ashamed of such a KPI. What is it? The amount of cash in your bank account. None of your mission statements, company values, customer obsession, innovation, or being disruptive matter if you are about to go bankrupt. Yet, we don’t say it out loud. Have you ever seen a company admitting its mission is to earn money for the owner?
What is the implicit KPI of MLOps? What do we all need but don’t want to say out loud?
In my opinion, the most important KPI of an MLOps team is the cycle time: how fast you can get a model into production. This metric consists of two parts, training the model and deploying the trained model, and we should consider them separately.
The training time depends on the domain difficulty. If we had to say how fast we could get a self-driving car model that works everywhere in the world into production, we would probably say anything between 20 and 50 years. It may be easier to get it working in 40% of places and adjust the traffic rules everywhere else…
However, a part of the training time depends on us. How fast can we turn a new idea into a trained model? We compare the training time of models in the same domain, so the domain difficulty is the same for every trained model. I think such a KPI is useful because improving it forces us to remove the impediments to getting the training data and the processing power required for training.
The second part is getting the model into production. How fast can we do it? Does it require any manual steps? Is there one person who must be available during deployment because nothing works otherwise? Or is everything automated, and the human involvement ends after selecting the model to deploy?
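To make the two parts measurable, it helps to record a few timestamps per model. The sketch below is only an illustration: the `ModelTimeline` class, its field names, and the dates are made up for this example, and your team may choose different start and end points.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ModelTimeline:
    """Hypothetical timestamps recorded for one model."""

    idea_logged: datetime        # someone wrote the idea down
    training_finished: datetime  # a trained model artifact exists
    serving_traffic: datetime    # the model answers production requests

    @property
    def training_time(self) -> timedelta:
        # First part of the cycle: from idea to trained model.
        return self.training_finished - self.idea_logged

    @property
    def deployment_time(self) -> timedelta:
        # Second part of the cycle: from trained model to production.
        return self.serving_traffic - self.training_finished

    @property
    def cycle_time(self) -> timedelta:
        return self.training_time + self.deployment_time


# Illustrative values only, not real measurements.
timeline = ModelTimeline(
    idea_logged=datetime(2024, 1, 8, 9, 0),
    training_finished=datetime(2024, 1, 10, 15, 0),
    serving_traffic=datetime(2024, 1, 10, 15, 12),
)

print("training:  ", timeline.training_time)
print("deployment:", timeline.deployment_time)
print("cycle time:", timeline.cycle_time)
```

Tracking the two parts separately shows which one to work on first. In this made-up example the deployment takes minutes, so the training part is the bottleneck.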
What about the testing time? Does anyone have to send requests to the model and carefully inspect the predictions, or do you have a test suite that runs automatically during every deployment?
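An automated check does not have to be complicated. Here is a minimal sketch of a smoke test a deployment pipeline could run. Everything in it is an assumption made for illustration: the `smoke_test` helper, the `candidate_model` stand-in, and the test cases. A real pipeline would call the deployed endpoint instead of a local function.

```python
from typing import Callable, Sequence


def smoke_test(
    predict: Callable[[dict], float],
    cases: Sequence[tuple[dict, float, float]],
) -> list[str]:
    """Return failure messages; an empty list means the candidate passed."""
    failures = []
    for features, low, high in cases:
        prediction = predict(features)
        if not low <= prediction <= high:
            failures.append(f"{features}: {prediction} not in [{low}, {high}]")
    return failures


def candidate_model(features: dict) -> float:
    # Stand-in for the model under test (hypothetical).
    return 0.1 * features["rooms"]


# Each case: input features plus the acceptable prediction range (made up).
CASES = [
    ({"rooms": 2}, 0.0, 1.0),
    ({"rooms": 5}, 0.0, 1.0),
]

failures = smoke_test(candidate_model, CASES)
if failures:
    # A non-zero exit code lets the pipeline stop the deployment.
    raise SystemExit("\n".join(failures))
print("smoke test passed")
```

The point is not the specific checks but that nobody has to look at the predictions by hand before the model goes live.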
How fast can you start using a deployed model? Do you have to reconfigure the client application? Do you have to change something in the code and redeploy it? Or is it as easy as setting a new value in a database or changing an environment variable?
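The environment variable option could look like the sketch below. The variable name `MODEL_VERSION`, the default value, and the path layout are assumptions for this example, not a convention of any particular tool.

```python
import os
from typing import Mapping

DEFAULT_MODEL_VERSION = "v1"  # hypothetical fallback


def active_model_version(environ: Mapping[str, str] = os.environ) -> str:
    """Read the model version the client should use from the environment."""
    return environ.get("MODEL_VERSION", DEFAULT_MODEL_VERSION)


def model_path(version: str) -> str:
    # Assumed storage layout: one directory per model version.
    return f"models/{version}/model.bin"


print("loading", model_path(active_model_version()))
```

With this setup, switching to a new model means changing `MODEL_VERSION` and restarting the client. The client code stays untouched and does not need to be redeployed.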
Why is cycle time the most important?
Why not the number of deployed models or the average accuracy? Does this mean the inference time does not matter? Of course, it does matter! The model performance matters. The prediction correctness matters. All of that is important. However, short cycle time lets you improve the other metrics fast. If you can get a new, tested model in production in 10-15 minutes, you can make many minor improvements every day. You no longer need to wait for the big deployment day.
Getting a short cycle time requires investing time in automation. The automation takes over the tedious manual steps, so you no longer waste time doing the boring stuff. When you have free time, you can deploy more models or focus on model performance.
In MLOps, similarly to DevOps, automation is the biggest game-changer. Average teams turn into high-performers when they no longer have to do easily automatable things manually.