Sagemaker Endpoints are expensive: you pay for the instance for as long as the endpoint exists, even when it serves no requests. If there are periods when you don’t use a model, it makes sense to shut down the endpoint temporarily (write an automated script that removes the endpoint when you don’t need it and recreates it when you need a prediction). That is one way of saving money. Another is to reuse an existing endpoint to serve multiple models: if none of your models uses all available resources, you can deploy several of them in a single Sagemaker Endpoint.
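
The temporary shutdown mentioned above could be automated along these lines. This is a minimal sketch, not a complete scheduler: the function names are hypothetical, and it assumes the endpoint configuration stays in place so the endpoint can be recreated from it.

```python
def stop_endpoint(sagemaker_client, endpoint_name):
    """Delete the endpoint so it stops generating cost.

    The endpoint configuration and the model are not deleted,
    so the endpoint can be recreated later.
    """
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)


def start_endpoint(sagemaker_client, endpoint_name, endpoint_config_name):
    """Recreate the endpoint from its existing endpoint configuration."""
    sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
```

With boto3 installed, the client would be boto3.client("sagemaker"), and the two functions could be called from a scheduled job.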

To deploy multiple machine learning models in a single Sagemaker Endpoint, we need to use the multimodel deployment feature. It does not work out-of-the-box, so we have to prepare a few things:

  • the Docker image with the model serving software,
  • code that loads the model and makes the prediction,
  • and the Sagemaker Endpoint configuration.

In this example, I will use Multi Model Server (installed in the Dockerfile below) as the underlying serving software and load a BERT-based text classification model with Tensorflow.

Preparing the Docker Image

First, we have to prepare the Dockerfile and install all required components in the Docker image:

FROM ubuntu:20.04

# Set a docker label to advertise multi-model support on the container
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
# Set a docker label to enable container to use SAGEMAKER_BIND_TO_PORT environment variable if present
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

RUN apt-get update && \
    apt-get -y install --no-install-recommends \
    build-essential \
    ca-certificates \
    openjdk-8-jdk-headless \
    python3-dev \
    curl \
    vim \
    && rm -rf /var/lib/apt/lists/* \
    && curl -O \
    && python3

RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1
RUN update-alternatives --install /usr/local/bin/pip pip /usr/local/bin/pip3 1

RUN pip3 --no-cache-dir install tensorflow==2.4.1 \
                                transformers==4.5.0 \
                                multi-model-server \
                                sagemaker-inference \
                                retrying

COPY /usr/local/bin/
RUN chmod +x /usr/local/bin/

RUN mkdir -p /home/model-server/

COPY /home/model-server/

ENTRYPOINT ["python", "/usr/local/bin/"]

CMD ["serve"]

We see a few additional files in the Dockerfile. Therefore, we need to prepare two more files: the entrypoint script and the model handler.

In the entrypoint script, we start the Sagemaker model server:

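import subprocess
import sys
import shlex
import os
from retrying import retry
from subprocess import CalledProcessError
from sagemaker_inference import model_server

def _retry_if_error(exception):
    # isinstance needs a tuple to check against more than one exception type
    return isinstance(exception, (CalledProcessError, OSError))

@retry(stop_max_delay=1000 * 50,
       retry_on_exception=_retry_if_error)
def _start_mms():
    # [handler file] is a placeholder: use the name of the handler module copied to
    # /home/model-server/ in the Dockerfile. "handle" is the function it defines.
    model_server.start_model_server(handler_service='/home/model-server/[handler file]:handle')

def main():
    if sys.argv[1] == 'serve':
        _start_mms()
    else:
        subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))

    # prevent docker exit
    subprocess.call(['tail', '-f', '/dev/null'])

main()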


In the model handler file, we need to define the handle function, which gets the input data and the context (containing Sagemaker metadata), makes the prediction, and returns the result. For convenience, we will wrap all the required operations in a class.

In the initialize method, we load the Tensorflow model and store it in an object field. The preprocess method reads data from the JSON input, tokenizes the values, and returns the input for a BERT model. In the inference method, we call the model to get a prediction. Finally, the postprocess method extracts the prediction value from the model response.

import json
import logging
import re
import tensorflow as tf
from transformers import AutoTokenizer

class ModelHandler(object):

    def __init__(self):
        self.initialized = False
        self.model = None

        self.max_seq_length = 64
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", cache_dir="/tmp/tokenizer")

    def initialize(self, context):
        self.initialized = True
        properties = context.system_properties
        model_dir = properties.get("model_dir")

        try:
            self.model = tf.keras.models.load_model(model_dir + '/0')
        except RuntimeError as memerr:
            if re.search('Failed to allocate (.*) Memory', str(memerr), re.IGNORECASE):
                logging.error("Memory allocation exception: {}".format(memerr))
                # When we raise a MemoryError, the Sagemaker Endpoint will remove from memory the least recently used model and load the new model again
                raise MemoryError
            # Any other error is re-raised unchanged
            raise

    def preprocess(self, request):
        # Here we preprocess the given JSON input using a text tokenizer.
        # The preprocessing code in your case will be different.
        # Also, the structure of the input may be different because I sent the following JSON to the endpoint:
        # {"input_text": "the text..."}
        data = request[0]['body']
        data_str = data.decode("utf-8")
        jsonlines = data_str.split("\n")

        text_before_tokenization = json.loads(jsonlines[0])["input_text"]

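        # Assumption: apart from return_token_type_ids, the arguments below are
        # illustrative. They pad or truncate to max_seq_length and return
        # Tensorflow tensors.
        encode_plus_tokens = self.tokenizer(
            text_before_tokenization,
            padding="max_length",
            max_length=self.max_seq_length,
            truncation=True,
            return_token_type_ids=False,
            return_tensors="tf",
        )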

        input_ids = encode_plus_tokens["input_ids"]
        input_mask = encode_plus_tokens["attention_mask"]

        return [input_ids, input_mask]

    def inference(self, model_input):
        return self.model(model_input)

    def postprocess(self, inference_output):
        return inference_output.numpy().tolist()[0]

    def handle(self, data, context):
        model_input = self.preprocess(data)
        model_out = self.inference(model_input)
        return self.postprocess(model_out)

_service = ModelHandler()

def handle(data, context):
    if not _service.initialized:
        _service.initialize(context)

    if data is None:
        return None

    return _service.handle(data, context)
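
To check the parsing logic without building the container, the request shape that the preprocess method expects can be reproduced locally. The sketch below is an illustration only: the function name extract_input_text is hypothetical, and it assumes the same request structure as above (a list with one item whose body holds the JSON bytes).

```python
import json


def extract_input_text(request):
    """Return the "input_text" value from a model-server style request."""
    body = request[0]["body"]
    # The body arrives as bytes; only the first JSON line is used.
    first_line = body.decode("utf-8").split("\n")[0]
    return json.loads(first_line)["input_text"]


# A request built the same way the endpoint would deliver it to the handler.
sample_request = [{"body": json.dumps({"input_text": "the text..."}).encode("utf-8")}]
print(extract_input_text(sample_request))
```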

Uploading the Docker Image to ECR

In the next step, we have to build the Docker image and upload it to the AWS Elastic Container Registry. I assume that you have installed Docker and configured the AWS CLI. Remember to grant the required permissions to the AWS account used to upload the files, and replace the [region] and [your AWS id] placeholders with the AWS region and your AWS account number. Note that the aws ecr get-login command used below exists only in version 1 of the AWS CLI; version 2 replaced it with aws ecr get-login-password.

docker build --tag [your AWS id].dkr.ecr.[region] .

$(aws ecr get-login --region [region] --no-include-email)

docker push [your AWS id].dkr.ecr.[region]

Uploading the Models

Compress each model’s files into its own tar.gz archive and upload the archives to one S3 location. We’ll use the archive file names as the model ids later.
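
The handler shown earlier loads the model from model_dir + '/0', so each archive needs a top-level directory named 0. A minimal packaging sketch with the standard library follows; the function name package_model and the paths passed to it are hypothetical.

```python
import tarfile
from pathlib import Path


def package_model(saved_model_dir, archive_path):
    """Pack a saved model into a tar.gz archive under a top-level "0" directory."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname="0" makes the files appear as 0/... inside the archive,
        # matching the model_dir + '/0' path used by the handler.
        archive.add(str(saved_model_dir), arcname="0")
    return Path(archive_path)
```

The resulting archive can then be uploaded to the S3 location, for example with aws s3 cp.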

Configuring a Sagemaker Endpoint

Finally, we have to configure a Sagemaker Endpoint:

import sagemaker

sagemaker_session = sagemaker.Session()

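container = {
    'Image': '[your AWS id].dkr.ecr.[region]',
    'ModelDataUrl': 's3://s3_path/that/contains/the/model/files',
    'Mode': 'MultiModel'
}

# The bracketed values and the instance count are placeholders: use your own names,
# an IAM role that Sagemaker can assume, and an instance type that fits your models.
multi_model = sagemaker_session.create_model(
    '[model name]',
    '[IAM role ARN]',
    container
)

create_endpoint_config_response = sagemaker_session.create_endpoint_config(
    name='[endpoint config name]',
    model_name='[model name]',
    initial_instance_count=1,
    instance_type='[instance type]'
)

sagemaker_session.create_endpoint(
    endpoint_name='[endpoint name]',
    config_name='[endpoint config name]'
)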

I suggest running the code in an AWS CodePipeline pipeline. After a few minutes, you should have a Sagemaker Endpoint running.
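
If the deployment script should block until the endpoint is usable, the status can be polled. This is a sketch under assumptions: the function name and the polling defaults are hypothetical, and the client is expected to behave like boto3.client("sagemaker"), whose describe_endpoint response contains an EndpointStatus field.

```python
import time


def wait_until_in_service(sagemaker_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll the endpoint status until it is InService or creation fails."""
    for _ in range(max_polls):
        response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint creation failed: " + endpoint_name)
        time.sleep(poll_seconds)
    raise TimeoutError("Endpoint did not reach InService in time: " + endpoint_name)
```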

Using Multimodel Endpoints

We will use boto3 to create an instance of the Sagemaker client and call the invoke_endpoint function:

import boto3
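import json

payload = json.dumps({"input_text": 'the input to the model'})

runtime = boto3.client("runtime.sagemaker")

# The bracketed values are placeholders. TargetModel is the file name of the
# model archive in the S3 location; the content type is an assumption.
response = runtime.invoke_endpoint(
    EndpointName='[endpoint name]',
    ContentType='application/json',
    TargetModel='[model file name].tar.gz',
    Body=payload
)

response = response["Body"].read()
result = json.loads(response.decode("utf-8"))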
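
Because the model is selected per request through the TargetModel parameter, one endpoint can answer for different models just by changing that value. The helper below is a hypothetical sketch of that idea; the function name is an assumption, and the client is expected to behave like the boto3 Sagemaker runtime client.

```python
import json


def predict(runtime_client, endpoint_name, target_model, text):
    """Send one text to the given model archive hosted on a multi-model endpoint."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        TargetModel=target_model,  # archive file name relative to ModelDataUrl
        Body=json.dumps({"input_text": text}),
    )
    return json.loads(response["Body"].read().decode("utf-8"))
```

Calling predict twice with two different archive names sends the requests to two different models behind the same endpoint.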

Limitations of Multimodel Sagemaker Endpoints

Multimodel deployments don’t support data capture. If you try to configure it, you’ll get a “DataCapture feature is not supported with MultiModel mode.” error. Therefore, you have to add the logging code yourself, either to the application that uses the model or to the model handler file.
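
One way to add such logging on the application side is a thin wrapper around the prediction call. This is only a sketch: the logger name, the function name predict_with_logging, and the JSON log format are all assumptions, not part of Sagemaker.

```python
import json
import logging

logger = logging.getLogger("model_requests")


def predict_with_logging(predict_fn, payload):
    """Call predict_fn(payload) and log the input and output as one JSON line."""
    result = predict_fn(payload)
    logger.info(json.dumps({"input": payload, "output": result}))
    return result
```

Here predict_fn stands for whatever function sends the payload to the endpoint, and the logger can be pointed at any handler the application already uses.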
