Sagemaker Endpoints are pretty expensive. If there are periods when you don’t use a model, it makes sense to shut down the endpoint temporarily (write an automated script to remove the endpoint when you don’t need it and recreate it when you need a prediction). That is one way of saving money. Another is reusing an existing endpoint to serve multiple models: if none of your models uses all of the available resources, it makes sense to deploy several models in a single Sagemaker Endpoint.
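Such a shutdown/startup script can be a couple of boto3 calls (a minimal sketch; the endpoint and configuration names below are placeholders, and the endpoint configuration must already exist):
import boto3

sagemaker_client = boto3.client("sagemaker")

# Remove the endpoint while it is idle (the endpoint configuration and model stay registered).
sagemaker_client.delete_endpoint(EndpointName="my-endpoint")

# Recreate it from the existing configuration when predictions are needed again.
sagemaker_client.create_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config")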
Table of Contents
- Preparing the Docker Image
- Uploading the Docker Image to ECR
- Uploading the Models
- Configuring a Sagemaker Endpoint
- Using Multimodel Endpoints
- Limitations of Multimodel Sagemaker Endpoints
To deploy multiple machine learning models in a single Sagemaker Endpoint, we need to use the multimodel deployment feature. It does not work out of the box, so we have to prepare a few things:
- the Docker image with the model serving software,
- code that loads the model and makes the prediction,
- and the Sagemaker Endpoint configuration.
In this example, I will use TensorFlow and the Multi Model Server as the underlying serving software for a BERT-based text classification model.
Preparing the Docker Image
First, we have to prepare the Dockerfile
and install all required components in the Docker image:
FROM ubuntu:20.04
# Set a docker label to advertise multi-model support on the container
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
# Set a docker label to enable container to use SAGEMAKER_BIND_TO_PORT environment variable if present
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
RUN apt-get update && \
apt-get -y install --no-install-recommends \
build-essential \
ca-certificates \
openjdk-8-jdk-headless \
python3-dev \
curl \
vim \
&& rm -rf /var/lib/apt/lists/* \
&& curl -O https://bootstrap.pypa.io/pip/get-pip.py \
&& python3 get-pip.py
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1
RUN update-alternatives --install /usr/local/bin/pip pip /usr/local/bin/pip3 1
RUN pip3 --no-cache-dir install tensorflow==2.4.1 \
transformers==4.5.0 \
multi-model-server \
sagemaker-inference \
retrying
COPY dockerd-entrypoint.py /usr/local/bin/dockerd-entrypoint.py
RUN chmod +x /usr/local/bin/dockerd-entrypoint.py
RUN mkdir -p /home/model-server/
COPY model_handler.py /home/model-server/model_handler.py
ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
CMD ["serve"]
We see a few additional files referenced in the Dockerfile. Therefore, we need to prepare dockerd-entrypoint.py and model_handler.py.
In dockerd-entrypoint.py, we start the Sagemaker model server:
import subprocess
import sys
import shlex
import os
from retrying import retry
from subprocess import CalledProcessError
from sagemaker_inference import model_server
def _retry_if_error(exception):
    return isinstance(exception, (CalledProcessError, OSError))
@retry(stop_max_delay=1000 * 50,
retry_on_exception=_retry_if_error)
def _start_mms():
model_server.start_model_server(handler_service='/home/model-server/model_handler.py:handle')
def main():
if sys.argv[1] == 'serve':
_start_mms()
else:
subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
# prevent docker exit
subprocess.call(['tail', '-f', '/dev/null'])
main()
In the model_handler.py file, we need to define the handle function, which gets the input data and the context (containing Sagemaker metadata), makes the prediction, and returns the result. For convenience, we will wrap all the required operations in a class.
In the initialize method, we load the TensorFlow model and store it in an object field. The preprocess method reads data from the JSON input, tokenizes the values, and returns the input for a BERT model. In the inference method, we call the model to get a prediction. Finally, the postprocess method extracts the prediction value from the model response.
import json
import logging
import re
import tensorflow as tf
from transformers import AutoTokenizer
class ModelHandler(object):
def __init__(self):
self.initialized = False
self.model = None
self.max_seq_length = 64
self.tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", cache_dir="/tmp/tokenizer")
def initialize(self, context):
self.initialized = True
properties = context.system_properties
model_dir = properties.get("model_dir")
try:
self.model = tf.keras.models.load_model(model_dir + '/0')
except RuntimeError as memerr:
if re.search('Failed to allocate (.*) Memory', str(memerr), re.IGNORECASE):
logging.error("Memory allocation exception: {}".format(memerr))
# When we raise a MemoryError, the Sagemaker Endpoint will remove from memory the least recently used model and load the new model again
raise MemoryError
raise
def preprocess(self, request):
# Here we preprocess the given JSON input using a text tokenizer.
# The preprocessing code in your case will be different.
# Also, the structure of the input may be different because I sent the following JSON to the endpoint:
# {"input_text": "the text..."}
data = request[0]['body']
data_str = data.decode("utf-8")
jsonlines = data_str.split("\n")
text_before_tokenization = json.loads(jsonlines[0])["input_text"]
encode_plus_tokens = self.tokenizer(
text_before_tokenization,
add_special_tokens=True,
truncation=True,
max_length=self.max_seq_length,
padding="max_length",
return_attention_mask=True,
            return_token_type_ids=False,
return_tensors="tf"
)
input_ids = encode_plus_tokens["input_ids"]
input_mask = encode_plus_tokens["attention_mask"]
return [input_ids, input_mask]
def inference(self, model_input):
return self.model(model_input)
def postprocess(self, inference_output):
return inference_output.numpy().tolist()[0]
def handle(self, data, context):
model_input = self.preprocess(data)
model_out = self.inference(model_input)
return self.postprocess(model_out)
_service = ModelHandler()
def handle(data, context):
if not _service.initialized:
_service.initialize(context)
if data is None:
return None
return _service.handle(data, context)
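Before building the image, the handler can be smoke-tested locally with a fake context (a sketch; SimpleNamespace stands in for the object the Multi Model Server passes at runtime, and the model_dir value is a placeholder directory that must contain the saved model in a 0 subdirectory):
import json
from types import SimpleNamespace
from model_handler import handle

# Minimal stand-in for the Sagemaker/MMS context object.
fake_context = SimpleNamespace(system_properties={"model_dir": "/tmp/models/my-model"})
payload = [{"body": json.dumps({"input_text": "the text..."}).encode("utf-8")}]
print(handle(payload, fake_context))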
Uploading the Docker Image to ECR
In the next step, we have to build the Docker image and upload it to the AWS Elastic Container Registry. I assume that you have Docker installed and the AWS CLI configured. Remember to grant the required permissions to the AWS account used to upload the files, and replace the [region] and [your AWS id] placeholders with the AWS region and your AWS account number.
docker build --tag [your AWS id].dkr.ecr.[region].amazonaws.com/multi-model-server:latest .
$(aws ecr get-login --region [region] --no-include-email)
docker push [your AWS id].dkr.ecr.[region].amazonaws.com/multi-model-server:latest
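If the docker push fails because the repository does not exist yet, create it first (for example, with boto3; the repository name matches the image tag used above):
import boto3

# Create the ECR repository once before the first push (skip if it already exists).
boto3.client("ecr").create_repository(repositoryName="multi-model-server")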
Uploading the Models
Compress the model files into a tar.gz archive and put them in an S3 location. We’ll use the file names as the model identifiers later.
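Packaging and uploading a single model could look like this (a sketch with placeholder paths and bucket name; note that model_handler.py loads the model from a 0 subdirectory, so the archive should contain one):
import tarfile
import boto3

# Put the exported SavedModel under a "0" directory inside the archive,
# because model_handler.py loads it from model_dir + '/0'.
with tarfile.open("my-model.tar.gz", "w:gz") as archive:
    archive.add("exported_model", arcname="0")

# The file name becomes the TargetModel value used when invoking the endpoint.
boto3.client("s3").upload_file("my-model.tar.gz", "your-bucket", "models/my-model.tar.gz")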
Configuring a Sagemaker Endpoint
Finally, we have to configure a Sagemaker Endpoint:
import sagemaker
sagemaker_session = sagemaker.Session()
container = {
'Image': '[your AWS id].dkr.ecr.[region].amazonaws.com/multi-model-server:latest',
'ModelDataUrl': 's3://s3_path/that/contains/the/model/files',
'Mode': 'MultiModel'
}
multi_model = sagemaker_session.create_model(
name='multi-model',
role='arn_role_with_access_to_s3_with_the_models',
container_defs=[container])
create_endpoint_config_response = sagemaker_session.create_endpoint_config(
name='multi-model-endpoint-cfg',
model_name='multi-model',
initial_instance_count=1,
instance_type='ml.t2.medium')
sagemaker_session.create_endpoint(
endpoint_name='multi-model-endpoint',
config_name='multi-model-endpoint-cfg')
I suggest running the code in an AWS CodePipeline. After a few minutes, you should have a Sagemaker Endpoint running.
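If you want to wait for the endpoint programmatically instead of checking the console, boto3 provides a waiter for it (a small sketch, assuming the endpoint name used above):
import boto3

sagemaker_client = boto3.client("sagemaker")

# Block until the endpoint reaches the InService state, then print its status.
sagemaker_client.get_waiter("endpoint_in_service").wait(EndpointName="multi-model-endpoint")
print(sagemaker_client.describe_endpoint(EndpointName="multi-model-endpoint")["EndpointStatus"])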
Using Multimodel Endpoints
We will use boto3 to create an instance of the Sagemaker runtime client and call the invoke_endpoint function:
import boto3
import json
payload = json.dumps({"input_text": 'the input to the model'})
runtime = boto3.client("runtime.sagemaker")
response = runtime.invoke_endpoint(
EndpointName='multi-model-endpoint',
TargetModel='file_name.tar.gz',
Body=payload)
response = response["Body"].read()
result = json.loads(response.decode("utf-8"))
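To get predictions from another model deployed in the same endpoint, only the TargetModel parameter changes (the file name below is a placeholder for another archive uploaded to the same S3 location):
# The same endpoint serves the second model; Sagemaker loads it into memory on the first request.
response = runtime.invoke_endpoint(
    EndpointName='multi-model-endpoint',
    TargetModel='another_model.tar.gz',
    Body=payload)
print(json.loads(response["Body"].read().decode("utf-8")))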
Limitations of Multimodel Sagemaker Endpoints
Multimodel deployments don’t support data capture. If you try to configure it, you’ll get a “DataCapture feature is not supported with MultiModel mode.” error. Therefore, you have to add the logging code to the application that uses the model or to the model_handler.py file.
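For example, the handle method of the ModelHandler class could log every prediction itself (a sketch reusing the logging module already imported in model_handler.py; adjust it to your own logging setup):
def handle(self, data, context):
    model_input = self.preprocess(data)
    model_out = self.inference(model_input)
    result = self.postprocess(model_out)
    # Log the prediction; ship these logs to your own storage if you need data capture.
    logging.info("Model prediction: %s", result)
    return result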