Many BentoML tutorials exist, but most of them show how to deploy yet another Iris classification model. None of those tutorials were helpful when I had to deploy a Transformer-based model as a Sagemaker Endpoint for text classification. The inference pipeline I had to deploy uses a Hugging Face tokenizer to convert the text into numeric vectors and then passes those vectors to a Transformer-based model to get a prediction. Additionally, before we hand the input to the tokenizer, we have to preprocess it because the raw input data is quite messy.
Table of Contents
- A Tensorflow Model as a BentoML Artifact
- The Pre-trained Tokenizer as a BentoML Artifact
- Preparing the Inference Code
- Saving Artifacts and Building a Docker Image
- Deploying to a Sagemaker Endpoint
- Updating the Sagemaker Endpoint
I deploy all of the code as one Sagemaker Endpoint, which processes requests one by one in real time. The preprocessing code is deployed with the model because it is model-specific, so I prefer to encapsulate it as an internal part of the endpoint. It would make no sense to run tokenization in the service that calls the model: I would have to maintain compatible code in two separate repositories, and keeping them in sync causes lots of bugs, so I don't want to do it.
A Tensorflow Model as a BentoML Artifact
First, we must load the Tensorflow model from a file (or train a new model in the same script).
from tensorflow import keras
model = keras.models.load_model(model_path)
Loading the model requires having the tensorflow library installed in the virtual environment running the script. We must use identical library versions in this environment and, later, in the BentoML service definition!
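As a sanity check, you can print the library versions installed in the current environment and reuse them later when pinning dependencies in the BentoML service definition. A minimal sketch (print whichever packages you actually import):
import tensorflow
import transformers

# These are the versions to pin later in the @env(pip_packages=[...]) decorator.
print("tensorflow==" + tensorflow.__version__)
print("transformers==" + transformers.__version__)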
The Pre-trained Tokenizer as a BentoML Artifact
I deploy a text classification endpoint, so I need a tokenizer to convert the human-readable text into vectors for the ML model. In this example, I use the BertTokenizer from Hugging Face:
from transformers import BertTokenizer
tokenizer_name = 'whatever tokenizer you use'
tokenizer = BertTokenizer.from_pretrained(tokenizer_name, do_lower_case=True, cache_dir="/tmp/tokenizer")
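Before wiring the tokenizer into the service, you can check what it returns. The sketch below uses a placeholder sentence and a placeholder max_length of 128; use the sequence length your model was trained with:
sample = tokenizer("an example sentence to classify",
                   max_length=128,
                   truncation=True,
                   pad_to_max_length=True,
                   return_attention_mask=True,
                   return_tensors='tf')

# The model will consume these two tensors: the token ids and the attention mask.
print(sample["input_ids"].shape, sample["attention_mask"].shape)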
Preparing the Inference Code
We must save the code below in a separate file. In my example, I will call the file endpoint.py.
In the file, I define a new ModelName class, which is a BentoService. The BentoService has two artifacts defined: model and tokenizer. We will need those names later when we store the model. In the ModelName class, I can refer to those artifacts using the self.artifacts object.
In the @env decorator, I specify the libraries required during inference. We must use the same versions as the ones installed in the virtual environment running the deployment script!
In the predict method, we get a single JSON object because I disabled batch processing. I read the content of the given JSON object and run the cleanup function (it is very project-specific, so I had to censor it). Next, I pass the cleaned text to the tokenizer and get both the input_ids and the attention_mask.
When the input is ready, I pass it to the model to get a prediction. Finally, I build a dictionary that will be returned as a JSON object.
from bentoml import api, env, BentoService, artifacts
from bentoml.artifact import PickleArtifact, TensorflowSavedModelArtifact
from bentoml.adapters import JsonInput
from bentoml.types import JsonSerializable


@artifacts([
    TensorflowSavedModelArtifact('model'),
    PickleArtifact('tokenizer')
])
@env(pip_packages=['tensorflow==your_version', 'transformers==your_version', 'keras==your_version'])
class ModelName(BentoService):

    @api(input=JsonInput(), batch=False)
    def predict(self, parsed_json):
        def clean_field_1(text):
            # here you put your cleaning code
            pass

        # preprocess the raw input before tokenization
        cleaned = clean_field_1(parsed_json['field_1'])

        # tokenize the cleaned text into input ids and an attention mask
        tf_batch = self.artifacts.tokenizer(cleaned,
                                            max_length=PUT_THE_LENGTH_HERE,
                                            truncation=True,
                                            pad_to_max_length=True,
                                            return_attention_mask=True,
                                            return_tensors='tf')
        input_ids = tf_batch["input_ids"]
        input_mask = tf_batch["attention_mask"]
        model_input = [input_ids, input_mask]

        # run the model and convert the output tensor to a plain Python number
        response = self.artifacts.model(model_input)
        return {"result": response.numpy().item()}
Saving Artifacts and Building a Docker Image
Now, we can load the class from the endpoint.py file, instantiate it, pass the dependencies as artifacts, and save the BentoML service.
from endpoint import ModelName
service = ModelName()
service.pack('model', model)
service.pack('tokenizer', tokenizer)
service.save()
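Before building the image, it is worth running a quick local smoke test. This is a minimal sketch that assumes the packed service's API method can be called directly and that the input field is named field_1, as in endpoint.py:
# Hypothetical input; replace field_1 and the text with your own data.
prediction = service.predict({"field_1": "some raw text to classify"})
print(prediction)  # e.g. {"result": 0.87}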
Deploying to a Sagemaker Endpoint
To deploy the service as an AWS Sagemaker Endpoint, we have to download the BentoML/aws-sagemaker-deploy repository from GitHub. To run the code below, you will need a configured AWS CLI and permissions to define and apply CloudFormation templates and to define IAM roles and permissions. The user also needs access to ECR, AWS Lambda, and API Gateway.
The aws-sagemaker-deploy directory contains the sagemaker_config.json file. We must open it and specify the endpoint configuration, such as the region, instance type, and the number of instances. In the file, we can also enable the Data Capture feature to log all requests automatically. The file may look like this:
{
    "region": "eu-central-1",
    "api_name": "predict",
    "instance_type": "ml.t3.2xlarge",
    "initial_instance_count": 4,
    "workers": 1,
    "timeout": 60,
    "enable_data_capture": true,
    "data_capture_s3_prefix": "s3://the_logs_bucket/and_the_prefix",
    "data_capture_sample_percent": 100
}
Now, we have to get the BentoML bundle path using the command line. ModelName:latest is the name of the BentoML bundle. ModelName comes from the class name defined in the endpoint.py file.
BENTO_BUNDLE=$(bentoml get ModelName:latest --print-location -q)
and run the deployment script:
python deploy.py $BENTO_BUNDLE endpoint-name sagemaker_config.json
The deploy.py script will build the Docker image, upload it to AWS ECR, and define a CloudFormation Stack consisting of a Sagemaker Endpoint, an AWS Lambda function passing requests to that endpoint, and an API Gateway which provides HTTP access to the Lambda function.
If you don't need the API Gateway or want to secure it, you will need to modify the code downloaded from the BentoML/aws-sagemaker-deploy repository.
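Once the stack is deployed, you can call the endpoint through the API Gateway. The sketch below uses a hypothetical URL (replace it with the invoke URL printed by the deployment script) and assumes the route matches the api_name from sagemaker_config.json:
import requests

# Hypothetical invoke URL; replace it with the one returned by deploy.py.
url = "https://your-api-id.execute-api.eu-central-1.amazonaws.com/prod/predict"
response = requests.post(url, json={"field_1": "some raw text to classify"}, timeout=60)
print(response.json())  # e.g. {"result": 0.87}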
Updating the Sagemaker Endpoint
If you want to modify an existing endpoint, repeat all of the steps above. The deployment script will update the CloudFormation Stack, and the Sagemaker Endpoint will be updated too. An update does NOT require removing the Sagemaker Endpoint, so you can keep serving requests while the endpoint is being updated.