How to deploy a Transformer-based model with custom preprocessing code to Sagemaker Endpoints using BentoML

Many BentoML tutorials exist, but most of them show how to deploy yet another Iris classification model. None of them helped when I had to deploy a Transformer-based model as a Sagemaker Endpoint for text classification. The inference pipeline uses a Hugging Face tokenizer to convert the text into numeric vectors and then passes those vectors to a Transformer-based model to get a prediction. Additionally, the raw input data is quite messy, so we have to preprocess it before tokenization.

Table of Contents

  1. A Tensorflow Model as a BentoML Artifact
  2. The Pre-trained Tokenizer as a BentoML Artifact
  3. Preparing the Inference Code
  4. Saving Artifacts and Building a Docker Image
  5. Deploying to a Sagemaker Endpoint
  6. Updating the Sagemaker Endpoint

I deploy all of the code as one Sagemaker Endpoint, which processes the requests one by one in real time. The preprocessing code is deployed with the model because it is model-specific, so I prefer to encapsulate it as an internal part of the endpoint. Running tokenization in the service that calls the model would make no sense: I would have to maintain a compatible codebase in two separate repositories, and keeping two repositories in sync causes lots of bugs.

A Tensorflow Model as a BentoML Artifact

First, we must load the Tensorflow model from a file (or train a new model in the same script).

from tensorflow import keras
model = keras.models.load_model(model_path)

Loading the model requires the tensorflow library in the virtual environment that runs the script. The library versions in that environment must be identical to the versions we later declare in the BentoML service definition!
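
One way to keep the versions in sync is to read them from the current environment instead of typing them by hand. The sketch below uses only the standard library. The helper name pinned_requirements is hypothetical and not part of BentoML; its output is meant to be copied into the pip_packages list of the service definition.

```python
from importlib import metadata


def pinned_requirements(package_names, get_version=metadata.version):
    """Return 'name==version' strings for the given packages.

    By default the versions come from the current environment. The lookup
    function is injectable so the helper can be tried without the packages.
    """
    return [f"{name}=={get_version(name)}" for name in package_names]


# Usage in the environment that loads the model (assumes both are installed):
#   print(pinned_requirements(["tensorflow", "transformers"]))
```

If a package is not installed, the default lookup raises importlib.metadata.PackageNotFoundError, which is a useful early warning that the environment and the service definition have drifted apart.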

The Pre-trained Tokenizer as a BentoML Artifact

I deploy a text classification endpoint, so I need a tokenizer to convert the human-readable text into vectors for the ML model. In this example, I use the BertTokenizer from Hugging Face:

from transformers import BertTokenizer
tokenizer_name = 'whatever tokenizer you use'
tokenizer = BertTokenizer.from_pretrained(tokenizer_name, do_lower_case=True, cache_dir="/tmp/tokenizer")

Preparing the Inference Code

We must save the code below in a separate file. In my example, I will call the file endpoint.py.

In the file, I define a new ModelName class which is a BentoService. The BentoService has two artifacts defined: model and tokenizer. We will need those names later when we store the model. In the ModelName class, I can refer to those artifacts using the self.artifacts object.

In the @env decorator, I specify the libraries required during inference. The versions listed here must be the same as the ones installed in the virtual environment running the deployment script!

In the predict method, we get a single JSON object because I disabled batch processing. I read the content of the given JSON object and run the cleanup function (it is very project-specific, so I had to censor it). Next, I pass the cleaned text to the tokenizer and get both the input_ids and the attention_mask.

When the input is ready, I pass it to the model to get a prediction. Finally, I build a dictionary that will be returned as a JSON object.

from bentoml import api, env, BentoService, artifacts
from bentoml.artifact import PickleArtifact, TensorflowSavedModelArtifact
from bentoml.adapters import JsonInput
from bentoml.types import JsonSerializable


@artifacts([
    TensorflowSavedModelArtifact('model'),
    PickleArtifact('tokenizer')
])
@env(pip_packages=['tensorflow==your_version', 'transformers==your_version', 'keras==your_version'])
class ModelName(BentoService):
    @api(input=JsonInput(), batch=False)
    def predict(self, parsed_json):
        def clean_field_1(text):
            # here you put your cleaning code; it must return the cleaned text
            return text

        cleaned = clean_field_1(parsed_json['field_1'])

        tf_batch = self.artifacts.tokenizer(cleaned,
                             max_length=PUT_THE_LENGTH_HERE,
                             truncation=True,
                             padding='max_length',  # replaces the deprecated pad_to_max_length=True
                             return_attention_mask=True,
                             return_tensors='tf')

        input_ids = tf_batch["input_ids"]
        input_mask = tf_batch["attention_mask"]

        model_input = [input_ids, input_mask]

        response = self.artifacts.model(model_input)

        return {"result": response.numpy().item()}
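
Since I cannot show the real cleanup function, here is a purely illustrative stand-in, assuming the messy input contains HTML entities, leftover tags, and irregular whitespace. The rules are hypothetical; the real ones depend on your data. The only hard requirement is that the function returns a string the tokenizer can consume.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")
_WHITESPACE_RE = re.compile(r"\s+")


def clean_field_1(text):
    """Illustrative cleanup: unescape entities, drop tags, collapse whitespace."""
    if text is None:
        return ""
    text = html.unescape(text)            # "&amp;" becomes "&"
    text = _TAG_RE.sub(" ", text)         # replace anything that looks like a tag
    text = _WHITESPACE_RE.sub(" ", text)  # newlines, tabs, repeated spaces -> one space
    return text.strip()
```

Keeping the function free of model and tokenizer references makes it easy to unit test before it is packed into the service.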

Saving Artifacts and Building a Docker Image

Now, we can load the class from the endpoint.py file, instantiate it, pass the dependencies as artifacts and save the BentoML service.

from endpoint import ModelName

service = ModelName()
service.pack('model', model)
service.pack('tokenizer', tokenizer)

service.save()

Deploying to a Sagemaker Endpoint

To deploy the service as an AWS Sagemaker Endpoint, we have to download the BentoML/aws-sagemaker-deploy repository from GitHub. To run the code below, you will need a configured AWS CLI and user permissions to define and apply CloudFormation templates and to define IAM roles and permissions. The user also needs access to ECR, AWS Lambda, and API Gateway.

The aws-sagemaker-deploy directory contains the sagemaker_config.json file. We must open it and specify the endpoint configuration, such as the region, instance type, and the number of instances. In the file, we can also enable the Data Capture feature to log all requests automatically. The file may look like this:

{
    "region": "eu-central-1",
    "api_name": "predict",
    "instance_type": "ml.t3.2xlarge",
    "initial_instance_count": 4,
    "workers": 1,
    "timeout": 60,
    "enable_data_capture": true,
    "data_capture_s3_prefix": "s3://the_logs_bucket/and_the_prefix",
    "data_capture_sample_percent": 100
}
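
A typo in this file only surfaces after the Docker image is built and pushed, so a quick sanity check before deploying can save time. The sketch below is not part of the deployment repository. The set of keys it treats as required and the rules for the data capture fields are my assumptions based on the example above.

```python
import json

# Assumption: these keys are always needed; adjust to match the deployment tool.
REQUIRED_KEYS = {"region", "api_name", "instance_type", "initial_instance_count"}


def check_sagemaker_config(raw_json):
    """Parse the config text and return a list of problems (empty if none found)."""
    config = json.loads(raw_json)
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]
    if config.get("enable_data_capture"):
        prefix = config.get("data_capture_s3_prefix", "")
        if not prefix.startswith("s3://"):
            problems.append("data_capture_s3_prefix must start with s3://")
        percent = config.get("data_capture_sample_percent", 0)
        if not 0 < percent <= 100:
            problems.append("data_capture_sample_percent must be in (0, 100]")
    return problems


# Usage:
#   with open("sagemaker_config.json") as f:
#       print(check_sagemaker_config(f.read()))
```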

Now, we have to get the BentoML bundle path using the command line. ModelName:latest is the name of the BentoML bundle. The ModelName comes from the class name defined in the endpoint.py file.

BENTO_BUNDLE=$(bentoml get ModelName:latest --print-location -q)

Then, run the deployment script:

python deploy.py $BENTO_BUNDLE endpoint-name sagemaker_config.json

The deploy.py script will build the Docker image, upload it to AWS ECR, and define a CloudFormation Stack which consists of a Sagemaker Endpoint, an AWS Lambda function passing requests to that endpoint, and an API Gateway which provides HTTP access to the Lambda function.

If you don’t need the API Gateway or want to secure it, you will need to modify the code downloaded from the BentoML/aws-sagemaker-deploy repository.
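
Callers that already run inside AWS can also skip the API Gateway and call the endpoint through the SageMaker runtime API. The sketch below assumes the endpoint accepts the same JSON document as the predict method (a field_1 key) with the application/json content type. The function name classify_text is hypothetical, and the endpoint name created by the CloudFormation Stack may differ from the name passed to deploy.py, so check it in the Sagemaker console first.

```python
import json


def classify_text(runtime_client, endpoint_name, text):
    """Send one JSON document to the endpoint and return the parsed response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"field_1": text}),
    )
    # The response body is a stream; read it fully before parsing.
    return json.loads(response["Body"].read())


# Usage (requires boto3 and AWS credentials allowed to invoke the endpoint):
#   import boto3
#   client = boto3.client("sagemaker-runtime", region_name="eu-central-1")
#   print(classify_text(client, "endpoint-name", "some raw text"))
```

Passing the client as an argument keeps the function testable with a stub and lets the caller decide on the region and credentials.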

Updating the Sagemaker Endpoint

If you want to modify an existing endpoint, repeat all of the steps. The deployment script will update the CloudFormation Stack, and the Sagemaker Endpoint will be updated too. An update does NOT require removing the Sagemaker Endpoint, so you can keep serving the requests while the endpoint is being updated.
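
To find out when the update has finished, you can poll the endpoint status. The sketch below assumes a boto3 SageMaker client and uses its describe_endpoint call. The helper name, the polling interval, and the retry limit are arbitrary choices of mine, not something the deployment script provides.

```python
import time

# Statuses that mean "still working, check again later".
_TRANSITIONAL = ("Creating", "Updating", "SystemUpdating", "RollingBack")


def wait_until_in_service(sagemaker_client, endpoint_name,
                          poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll the endpoint until it is InService; raise on failure or timeout."""
    for _ in range(max_polls):
        description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status not in _TRANSITIONAL:
            raise RuntimeError(f"endpoint {endpoint_name} is in state {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"endpoint {endpoint_name} did not reach InService")


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   wait_until_in_service(boto3.client("sagemaker"), "endpoint-name")
```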
