---
title: "How to deploy a Transformer-based model with custom preprocessing code to Sagemaker Endpoints using BentoML"
description: "Deploy a machine learning model with custom inference code to a Sagemaker Endpoint using BentoML"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/bentoml-transformers-with-custom-code-to-sagemaker-endpoints
---

Many BentoML tutorials exist, but most of them show how to deploy yet another Iris classification model. None of them helped when I had to deploy a Transformer-based text classification model as a Sagemaker Endpoint. The inference pipeline uses a Hugging Face tokenizer to convert the text into numeric vectors and then passes the vectors to a Transformer-based model to get a prediction. Before tokenization, we also have to preprocess the input because the raw data is quite messy.

I deploy all of the code as one Sagemaker Endpoint, which processes the requests one by one in real time. The preprocessing code is deployed with the model because it is model-specific, so I prefer to encapsulate it as an internal part of the endpoint. It would make no sense to run tokenization in the service calling the model because I would have to maintain a compatible codebase in two separate repositories. Having to synchronize two repositories causes lots of bugs, so I don't want to do it.

## A Tensorflow Model as a BentoML Artifact

First, we must load the Tensorflow model from a file (or train a new model in the same script).

```python
from tensorflow import keras

model_path = 'path to your saved model'
model = keras.models.load_model(model_path)
```

Loading the model requires having the `tensorflow` library in the virtual environment running the script. **We must use identical library versions in this environment and, later, in the BentoML service definition!**
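One way to keep both places in sync is to read the installed versions from the current environment and generate the version pins from them. The snippet below is only a sketch: the helper name `pin_installed` is hypothetical, and it assumes Python 3.8 or newer because it relies on `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_installed(packages):
    """Return (pinned, missing): 'name==version' strings and names that are not installed."""
    pinned, missing = [], []
    for name in packages:
        try:
            pinned.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            missing.append(name)
    return pinned, missing


if __name__ == "__main__":
    pinned, missing = pin_installed(["tensorflow", "transformers", "keras"])
    print(pinned)   # version pins to copy into the BentoML service definition
    print(missing)  # packages that are not installed in this environment
```

If the second list is not empty, the environment running the script lacks a library the service will need, which is worth fixing before saving the service.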

## The Pre-trained Tokenizer as a BentoML Artifact

I deploy a text classification endpoint, so I need a tokenizer to convert the human-readable text into vectors for the ML model. In this example, I use the `BertTokenizer` from Hugging Face:

```python
from transformers import BertTokenizer
tokenizer_name = 'whatever tokenizer you use'
tokenizer = BertTokenizer.from_pretrained(tokenizer_name, do_lower_case=True, cache_dir="/tmp/tokenizer")
```

## Preparing the Inference Code

**We must save the code below in a separate file.** In my example, I will call the file `endpoint.py`.

In the file, I define a new `ModelName` class which is a `BentoService`. The BentoService has two artifacts defined: `model` and `tokenizer`. We will need those names later when we store the model. In the `ModelName` class, I can refer to those artifacts using the `self.artifacts` object.

In the `@env` decorator, I specify the libraries required during inference. We must use the same versions as the ones installed in the virtual environment running the deployment script!

In the `predict` method, we get a single JSON object because I disabled batch processing. I read the content of the given JSON object and run the cleanup function (it is very project-specific, so I had to censor it). Next, I pass the cleaned text to the tokenizer and get both the `input_ids` and the `attention_mask`.

When the input is ready, I pass it to the model to get a prediction. Finally, I build a dictionary that will be returned as a JSON object.

```python
from bentoml import api, env, BentoService, artifacts
from bentoml.artifact import PickleArtifact, TensorflowSavedModelArtifact
from bentoml.adapters import JsonInput
from bentoml.types import JsonSerializable

@artifacts([
    TensorflowSavedModelArtifact('model'),
    PickleArtifact('tokenizer')
])
@env(pip_packages=['tensorflow==your_version', 'transformers==your_version', 'keras==your_version'])
class ModelName(BentoService):
    @api(input=JsonInput(), batch=False)
    def predict(self, parsed_json):
        def clean_field_1(text):
            # put your project-specific cleaning code here;
            # the function must return the cleaned string
            return text

        cleaned = clean_field_1(parsed_json['field_1'])

        tf_batch = self.artifacts.tokenizer(cleaned,
                             max_length=PUT_THE_LENGTH_HERE,
                             truncation=True,
                             padding='max_length',  # replaces the deprecated pad_to_max_length=True
                             return_attention_mask=True,
                             return_tensors='tf')

        input_ids = tf_batch["input_ids"]
        input_mask = tf_batch["attention_mask"]

        model_input = [input_ids, input_mask]

        response = self.artifacts.model(model_input)

        return {"result": response.numpy().item()}
```
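The cleanup step depends entirely on the data, so the real function is not shown here. As a purely hypothetical illustration of what `clean_field_1` might look like, the sketch below strips HTML tags, unescapes HTML entities, and collapses whitespace. None of these rules come from the original project; they only show the shape of a function that takes raw text and returns a string ready for the tokenizer.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def clean_field_1(text):
    """Hypothetical cleanup: drop HTML tags, unescape entities, normalize spaces."""
    without_tags = _TAG.sub(" ", text)
    unescaped = html.unescape(without_tags)
    return _WHITESPACE.sub(" ", unescaped).strip()


print(clean_field_1("<p>Hello &amp;   welcome!</p>\n"))  # prints: Hello & welcome!
```

Whatever rules you end up with, the important part is that the function always returns a string, because the tokenizer receives its result directly.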

## Saving Artifacts and Building a Docker Image

Now, we can load the class from the `endpoint.py` file, instantiate it, pass the dependencies as artifacts, and save the BentoML service. The names passed to `pack` must match the artifact names declared in the `@artifacts` decorator.

```python
from endpoint import ModelName

service = ModelName()
service.pack('model', model)
service.pack('tokenizer', tokenizer)

service.save()
```

## Deploying to a Sagemaker Endpoint

To deploy the service as an AWS Sagemaker Endpoint, we have to download the `BentoML/aws-sagemaker-deploy` repository from GitHub. To run the code below, you will need a configured AWS CLI and a user with permissions to define and apply CloudFormation templates and to define IAM roles and permissions. The user also needs access to ECR, AWS Lambda, and API Gateway.

The `aws-sagemaker-deploy` directory contains the `sagemaker_config.json` file. We must open it and specify the endpoint configuration, such as the region, instance type, and the number of instances. In the file, we can also enable the Data Capture feature to log all requests automatically. The file may look like this:

```json
{
    "region": "eu-central-1",
    "api_name": "predict",
    "instance_type": "ml.t3.2xlarge",
    "initial_instance_count": 4,
    "workers": 1,
    "timeout": 60,
    "enable_data_capture": true,
    "data_capture_s3_prefix": "s3://the_logs_bucket/and_the_prefix",
    "data_capture_sample_percent": 100
}
```
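A quick sanity check of the file before deploying can catch typos early. The helper below is a sketch and not part of `aws-sagemaker-deploy`: the function names are hypothetical, and the expected keys and types are copied from the sample file above rather than from the deployment tool's documentation.

```python
import json

# Keys and types copied from the sample file above; adjust them to your setup.
EXPECTED = {
    "region": str,
    "api_name": str,
    "instance_type": str,
    "initial_instance_count": int,
    "workers": int,
    "timeout": int,
    "enable_data_capture": bool,
}


def check_config(config):
    """Return a list of human-readable problems found in the config dict."""
    problems = []
    for key, expected_type in EXPECTED.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    # When data capture is on, the prefix is assumed to be an S3 URI.
    prefix = str(config.get("data_capture_s3_prefix", ""))
    if config.get("enable_data_capture") and not prefix.startswith("s3://"):
        problems.append("data_capture_s3_prefix should start with s3://")
    return problems


def check_config_file(path="sagemaker_config.json"):
    """Load the JSON file and run the same checks on its content."""
    with open(path) as config_file:
        return check_config(json.load(config_file))
```

An empty list means the file has the same shape as the sample. It does not prove that the values are valid in your AWS account.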

Now, we have to get the BentoML bundle path using the command line. `ModelName:latest` is the name of the BentoML bundle. The `ModelName` comes from the class name defined in the `endpoint.py` file.

```bash
BENTO_BUNDLE=$(bentoml get ModelName:latest --print-location -q)
```

Next, we run the deployment script from the `aws-sagemaker-deploy` directory:

```bash
python deploy.py "$BENTO_BUNDLE" endpoint-name sagemaker_config.json
```

The `deploy.py` script will build the Docker image, upload it to AWS ECR, and define a CloudFormation Stack which consists of a Sagemaker Endpoint, an AWS Lambda function passing requests to that endpoint, and an API Gateway which provides HTTP access to the Lambda function.

If you don't need the API Gateway or want to secure it, you will need to modify the code downloaded from the `BentoML/aws-sagemaker-deploy` repository.
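Once the stack exists, a client can call the model over HTTP. The sketch below builds and sends a JSON POST request using only the standard library. The function names are hypothetical and the URL is a placeholder: the real address comes from the API Gateway created by the stack, and the sketch assumes that the route ends with the `api_name` from the configuration file. The `field_1` key matches what the `predict` method reads.

```python
import json
import urllib.request


def build_request(url, text):
    """Build a JSON POST request carrying the raw text in `field_1`."""
    body = json.dumps({"field_1": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(url, text, timeout=60):
    """Send the request and return the decoded JSON response."""
    request = build_request(url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder URL; replace it with the one created by your CloudFormation stack.
# result = classify("https://<api-id>.execute-api.<region>.amazonaws.com/<stage>/predict", "some text")
```

Keeping the request construction separate from the network call makes it possible to test the payload format without a deployed endpoint.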

## Updating the Sagemaker Endpoint

If you want to modify an existing endpoint, repeat all of the steps. The deployment script will update the CloudFormation Stack, and the Sagemaker Endpoint will be updated too. An update does NOT require removing the Sagemaker Endpoint, so you can keep serving the requests while the endpoint is being updated.