Deploy LLMs with Confidence: A Comprehensive Guide to Software Architecture for Production-Ready AI

Does MLOps for LLMs (LLMOps, if you will) differ from general MLOps? Regrettably, it does. LLMs are significantly more powerful than other machine learning models, but they are also more unpredictable when they fail.

Table of Contents

How to incorporate AI in production applications?
How to put it all together?
Go From AI Janitor to AI Architect

For instance, if we train a neural network for text classification and the network returns a value between 0 and 1, the worst-case scenario is an incorrect classification, but the output always stays within the expected range of values. In contrast, an LLM may return a result entirely outside the expected range. We have all seen numerous screenshots of people attempting to “break” LLMs. It’s relatively easy to trick them into performing unintended actions. Consequently, we must exercise extra caution when deploying LLMs in production.
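To make the difference concrete, here is a minimal sketch of a guard that accepts an LLM response only when it falls inside the expected set of values. The label set and the function name are my assumptions for illustration; they don't come from any particular library.

```python
from typing import Optional

# Hypothetical label set for a sentiment classifier built on top of an LLM.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def parse_llm_label(raw_response: str) -> Optional[str]:
    """Return a known label, or None when the response is outside the expected range."""
    candidate = raw_response.strip().lower().rstrip(".")
    return candidate if candidate in ALLOWED_LABELS else None
```

A caller can treat `None` as "the model misbehaved" and fall back to a default decision or a manual review instead of passing free-form text further into the application.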

How to incorporate AI in production applications?

Proper software engineering entails building adapters around unreliable external dependencies. Regardless of whether you use XGBoost, linear regression, or an LLM, the same general principles apply. First and foremost, treat the machine learning component as a potentially random and faulty piece of your application.
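A minimal sketch of such an adapter could look like the code below. The `ModelAdapter` and `Prediction` names, and the idea of returning a fixed fallback score, are assumptions I made for the example. Your application may prefer to raise an error or route the request to a human instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Prediction:
    score: float
    fallback_used: bool


class ModelAdapter:
    """Wraps any scoring function and never lets a faulty result escape."""

    def __init__(self, predict: Callable[[str], float], fallback_score: float = 0.0):
        self._predict = predict
        self._fallback_score = fallback_score

    def classify(self, text: str) -> Prediction:
        try:
            score = float(self._predict(text))
        except Exception:
            # The model crashed, timed out, or returned something that is not a number.
            return Prediction(self._fallback_score, fallback_used=True)
        if not 0.0 <= score <= 1.0:
            # Out-of-range values (including NaN) are treated as failures too.
            return Prediction(self._fallback_score, fallback_used=True)
        return Prediction(score, fallback_used=False)
```

The rest of the application talks only to `classify`, so it doesn't matter whether the function behind it is XGBoost, linear regression, or a call to an LLM.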

Naturally, your choice of tools matters, but that is the second decision to make. The first decision concerns the architecture of your solution, so I will concentrate on the architecture here. What components do you need, and why? Let’s use a text classification service as an example.

What components of the architecture can you expect to see?

the AI/ML model
ML-specific data preprocessing code (word embeddings and tokenization)
application-specific data preprocessing code (converting from the business domain to the ML input format, data validation)
ML response postprocessing code (converting from the ML output format to the meaning in the business domain)
request/response logging
monitoring
access control

Why is ML-specific data preprocessing code necessary?

Machine learning models work with numbers, and each model has its own numeric input format. When processing text, we must convert the text into vectors of numbers. When processing images, we have tensors describing the pixels in the picture. For tabular data, we have an encoding of categorical variables, and so on.

However, these conversions aren’t part of the standard data preprocessing code; they are model-specific. For instance, when using a neural network for text classification, we must employ the same word embeddings and tokenization as during training. The model is useless without the tokenization code, and it’s also useless if you change the tokenization code without retraining the model. These elements belong together. Therefore, I recommend deploying the model and its tokenization code together, ensuring the model always uses the correct word embeddings.
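One way to keep the two together is to ship them as a single object that refuses to load when the tokenizer doesn't match the one used in training. The sketch below assumes a toy whitespace tokenizer and a vocabulary fingerprint stored at training time. `ModelBundle` and `vocabulary_fingerprint` are hypothetical names, not part of any framework.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


def vocabulary_fingerprint(vocabulary: Dict[str, int]) -> str:
    """Stable hash of the vocabulary, computed once at training time."""
    payload = json.dumps(vocabulary, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ModelBundle:
    vocabulary: Dict[str, int]
    trained_with_fingerprint: str
    predict: Callable[[List[int]], float]

    def __post_init__(self) -> None:
        if vocabulary_fingerprint(self.vocabulary) != self.trained_with_fingerprint:
            raise ValueError("tokenizer vocabulary does not match the one used in training")

    def tokenize(self, text: str) -> List[int]:
        unknown = self.vocabulary.get("<unk>", 0)
        return [self.vocabulary.get(token, unknown) for token in text.lower().split()]

    def classify(self, text: str) -> float:
        # Callers pass text; they can't reach the model without the matching tokenizer.
        return self.predict(self.tokenize(text))
```

With real models, the same idea usually means packaging the tokenizer files and the weights in one versioned artifact.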

Why is application-specific data preprocessing code necessary?

There’s a stage where data is converted from the business domain into the technical domain of the model. Such a conversion is not the same as tokenization. I wouldn’t go directly from the business domain to the numeric representation used by the model, as these code parts change for different reasons. When your business domain evolves, you may modify the domain model. If you retrain the model, you may change the numeric representation. Both of these changes should be independent. That’s why I recommend having a separate layer of code converting the business domain into the input of the model’s tokenization code.

For example, suppose you decide your model needs the client’s age and address. The application-specific data preprocessing code retrieves the age and the address from your business domain and passes them to the tokenization code. If the business domain changes and you store the address differently, such as allowing clients to specify multiple addresses, you only have to alter the application-specific data preprocessing code and choose the right value. The tokenization code doesn’t have to change.

Similarly, if you decide to convert the address into geolocation coordinates, you only need to change the application-specific data preprocessing code. The business domain code can continue using the original representation of the address.
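The age-and-address example can be sketched as follows. All the class names, and the rule of preferring the primary address, are assumptions I made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Address:
    city: str
    is_primary: bool = False


@dataclass
class Client:
    """Business domain object; it may change whenever the business changes."""
    birth_date: date
    addresses: List[Address]


@dataclass
class ModelFeatures:
    """Input expected by the model's own preprocessing (tokenization) code."""
    age: int
    city: str


def to_model_features(client: Client, today: date) -> ModelFeatures:
    if not client.addresses:
        raise ValueError("client has no address")
    # Choosing the right address is a domain decision, so it lives in this layer.
    primary = next((a for a in client.addresses if a.is_primary), client.addresses[0])
    had_birthday = (today.month, today.day) >= (client.birth_date.month, client.birth_date.day)
    age = today.year - client.birth_date.year - (0 if had_birthday else 1)
    return ModelFeatures(age=age, city=primary.city)
```

When the domain starts allowing multiple addresses, only `to_model_features` changes. `ModelFeatures` and everything behind it stay the same.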

Why is ML response postprocessing code necessary?

Your classification model returns 0.32. What does the value mean in your business domain? You need a layer of code to translate the numeric values into decisions. The postprocessing layer becomes particularly important when you realize that setting the decision threshold to 0.5 works well only in beginner-level tutorials. In reality, the threshold is another parameter you can control and adjust.

Much like the application-specific preprocessing code, the decision threshold parameter lies between the model and the business domain. Of course, the value is model-specific, and you will need a new threshold every time you retrain the model. However, adjustments to the parameter are a business decision. Do you want the model to be more conservative or more aggressive?
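A minimal sketch of such a postprocessing layer is shown below. The `Decision` values and the `DecisionPolicy` name are hypothetical. The point is only that the threshold is stored next to the model version it was tuned for, not hard-coded.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"


@dataclass(frozen=True)
class DecisionPolicy:
    model_version: str  # a threshold is only meaningful for the model it was tuned on
    threshold: float    # adjusted by the business: conservative or aggressive


def to_decision(score: float, policy: DecisionPolicy) -> Decision:
    """Translate a raw model score into a business decision."""
    return Decision.APPROVE if score >= policy.threshold else Decision.REJECT


conservative = DecisionPolicy(model_version="2024-01", threshold=0.5)
aggressive = DecisionPolicy(model_version="2024-01", threshold=0.3)
```

With these two policies, the same score of 0.32 is rejected by the conservative policy and approved by the aggressive one, and the model itself never changes.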

Why is request/response logging necessary?

We will log interactions with the model. It’s obvious. But what exactly do we log, and how will we know whether the model made the right decision?

How long can we keep the logs? Are we allowed to use the logged data to retrain the model later, or is it illegal? Can we keep the logs only for as long as necessary to debug the system? If we cannot store all the data indefinitely but need to maintain a track record of the requests, what are we allowed to keep?

You must make these decisions before you even start logging the requests. What’s worse, we need to make the decision three times. After all, we have three layers of ML-related code: preprocessing, the model (and its tokenization code), and postprocessing. If possible and allowed, we should log the input and output of each layer. With a detailed log, we can easily debug the system when something goes wrong. Also, we can decide later to add additional data to the training dataset of the model if we have logged a snapshot of the raw input from the business domain.

A mistake in the model may lead to problems later in the business domain. How will you find the issue? You will need a correlation id for all interactions with the model. When you have it, and it gets propagated through all the layers, you can easily find the logs related to a specific request. But all of those things don’t magically happen by themselves. You need to design the system to support them.
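Here is a rough sketch of what propagating a correlation id through all the layers might look like, using only the standard `logging` module. The `run_logged` helper and the JSON log format are my assumptions. Whether you are allowed to log the raw input and output at all is a decision you must make first.

```python
import json
import logging
import uuid
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("ml_pipeline")

Layer = Tuple[str, Callable[[Any], Any]]


def run_logged(layers: List[Layer], raw_input: Any, correlation_id: str = "") -> Tuple[str, Any]:
    """Run the layers in order and log the input and output of each one."""
    correlation_id = correlation_id or str(uuid.uuid4())
    value = raw_input
    for name, layer in layers:
        result = layer(value)
        logger.info(json.dumps({
            "correlation_id": correlation_id,
            "layer": name,
            "input": value,
            "output": result,
        }, default=str))  # default=str keeps non-JSON values loggable
        value = result
    return correlation_id, value
```

The returned id should travel with the decision into the business domain, so that a problem discovered weeks later can be traced back to the exact log entries of every layer.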

Why is monitoring necessary?

I’m not referring to technical monitoring of resources used by the deployed model and response times. Naturally, you need that as well, and you should have alerts informing you when the model doesn’t respond in time. However, you also need to monitor the quality of the model. How do you know if the model is still good? How do you know whether it’s time to retrain it?

The biggest problem with model monitoring is the need for a reference value. You have to compare the model’s response to the actual value. But how do you obtain the true value? You will need a way to track users’ interactions with the data produced by the model. Have you recommended something, and they clicked and quickly returned to the previous page? Such behavior may indicate a disappointing recommendation.

Has the model classified the data, but the user changed the classification? It’s a strong signal indicating the model made a mistake. Or, in the case of financial institutions, it may be a signal that an employee is committing fraud by overriding correct automated decisions. Not all corrections must fix a mistake; sometimes, people think they know better, but they don’t. Occasionally, they know they are wrong but take the harmful action on purpose.
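As a sketch, the share of predictions changed by users can be tracked as a simple quality metric. The `Outcome` record and the `override_rate` function are hypothetical names. Remember that a high value tells you something is off, but not whether the model or the people are wrong.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Outcome:
    predicted_label: str
    user_label: Optional[str]  # None when the user left the prediction untouched


def override_rate(outcomes: Iterable[Outcome]) -> Optional[float]:
    """Fraction of predictions that users changed; None when there is no data."""
    outcomes = list(outcomes)
    if not outcomes:
        return None
    overridden = sum(
        1 for o in outcomes
        if o.user_label is not None and o.user_label != o.predicted_label
    )
    return overridden / len(outcomes)
```

Computing the rate per employee, and not only per model version, is one way to notice the fraud scenario: a single person overriding far more decisions than everyone else.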

What if you can’t track users’ interactions? It may be impossible for various reasons. Sometimes the law doesn’t allow you; sometimes, it’s not feasible. If your model recommends a workout plan, you will never know whether users followed the plan and achieved the results they wanted (even if you ask them, some will lie). In such cases, you will need to compare a subset of recommendations with the results of manual evaluation. You will need to ask a domain expert what they think. It’s slow, error-prone, and doesn’t work in real time. But it’s better than nothing.
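A sketch of the manual-evaluation approach: pick a stable subset of requests for expert review and measure how often the expert agrees with the model. Sampling by hashing the request id is my choice for the example, because it needs no extra state. The function names are hypothetical.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def selected_for_review(request_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of requests for manual evaluation."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


def agreement_rate(reviews: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Share of (model_answer, expert_answer) pairs where both agree."""
    reviews = list(reviews)
    if not reviews:
        return None
    return sum(1 for model, expert in reviews if model == expert) / len(reviews)
```

Because the selection depends only on the request id, the same request is always either in or out of the review sample, no matter which service asks.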

Why is access control necessary?

If you have an internal machine learning model deployed in a private network with no public access, do you need access control? Yes, you do. First of all, zero-trust is a good security practice. But also, you need to know who is using the model, so you have some control over the use case and can spot when someone is trying to use the model in a way it wasn’t designed for.

Access control becomes especially important when you build an LLM-based system with a vector database from which the model can automatically retrieve the data it needs. In this case, model access control becomes data access control, and you need to protect it as well as you would protect the production database. You don’t want anyone to steal proprietary data by asking clever questions, do you?
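One possible way to enforce this is to filter retrieved documents by the caller's permissions before anything reaches the prompt. The sketch below assumes each stored document carries a set of groups allowed to read it. `Document` and `authorized_context` are hypothetical names, and a real vector database may let you apply the same filter inside the query.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Document:
    text: str
    allowed_groups: FrozenSet[str]


def authorized_context(retrieved: List[Document], caller_groups: FrozenSet[str]) -> List[str]:
    """Keep only the documents the caller could also read directly."""
    return [doc.text for doc in retrieved if doc.allowed_groups & caller_groups]
```

The important property is that the check happens outside the model. An LLM that never sees a document can't be tricked into revealing it.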

How to put it all together?

I will not tell you what tools you should use to implement all these components. You can choose whatever is best for you. Also, I will not tell you whether you should deploy the model, the preprocessing, and the postprocessing code as three separate services or keep them all in one service. Deployment is a technical detail. You can deploy it all as part of a monolithic application as long as the logical components are separated.
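To illustrate what "logically separated" can mean inside a single application, here is a minimal sketch in which the three ML-related layers are independent, replaceable pieces. The `ClassificationService` name and the toy lambdas are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ClassificationService:
    preprocess: Callable[[Any], str]     # business domain -> model input
    predict: Callable[[str], float]      # the model together with its tokenization
    postprocess: Callable[[float], str]  # model score -> business decision

    def handle(self, request: Any) -> str:
        return self.postprocess(self.predict(self.preprocess(request)))


# Toy stand-ins; each could later move to its own service without changing the others.
service = ClassificationService(
    preprocess=lambda request: request["text"].strip().lower(),
    predict=lambda text: 0.9 if "refund" in text else 0.1,
    postprocess=lambda score: "escalate" if score >= 0.5 else "ignore",
)
```

Logging, monitoring, and access control can wrap `handle` or the individual layers without any of the three pieces knowing about them.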

Go From AI Janitor to AI Architect

Stop debugging unpredictable AI systems. I can help you build, measure, and deploy reliable, production-grade AI applications that don't hallucinate.
