---
title: "Multimodel deployment in Sagemaker Endpoints"
description: "How to deploy multiple models in a single Sagemaker Endpoint?"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/sagemaker-endpoints-multimodel-deployments
---

Sagemaker Endpoints are pretty expensive. If there are periods when you don't use a model, it makes sense to shut down the endpoint temporarily, for example with an automated script that removes the endpoint when you don't need it and recreates it when you need predictions again. That is one way of saving money. Another is to reuse an existing endpoint to serve multiple models. Deploying several models in a single Sagemaker Endpoint makes sense when none of them uses all of the available resources.
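
For the first approach, a minimal sketch using `boto3` could look like this (the endpoint and endpoint configuration names are placeholders you would replace with your own):

```python
import boto3

sagemaker_client = boto3.client("sagemaker")

# Remove the endpoint when it is not needed (the endpoint configuration stays in place).
sagemaker_client.delete_endpoint(EndpointName="my-endpoint")

# Recreate the endpoint from the existing configuration when you need predictions again.
sagemaker_client.create_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config",
)
```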

To deploy multiple machine learning models in a single Sagemaker Endpoint, we need to use the multimodel deployment feature. It does not work out-of-the-box, so we have to prepare a few things:
* the Docker image with the model serving software,
* code that loads the model and makes the prediction,
* and the Sagemaker Endpoint configuration.

In this example, I will use the Multi Model Server (together with the `sagemaker-inference` toolkit) as the underlying serving software and Tensorflow to run a BERT-based text classification model.

## Preparing the Docker Image

First, we have to prepare the `Dockerfile` and install all required components in the Docker image:

```Dockerfile
FROM ubuntu:20.04

# Set a docker label to advertise multi-model support on the container
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
# Set a docker label to enable container to use SAGEMAKER_BIND_TO_PORT environment variable if present
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

RUN apt-get update && \
    apt-get -y install --no-install-recommends \
    build-essential \
    ca-certificates \
    openjdk-8-jdk-headless \
    python3-dev \
    curl \
    vim \
    && rm -rf /var/lib/apt/lists/* \
    && curl -O https://bootstrap.pypa.io/pip/get-pip.py \
    && python3 get-pip.py

RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1
RUN update-alternatives --install /usr/local/bin/pip pip /usr/local/bin/pip3 1

RUN pip3 --no-cache-dir install tensorflow==2.4.1 \
                                transformers==4.5.0 \
                                multi-model-server \
                                sagemaker-inference \
                                retrying

COPY dockerd-entrypoint.py /usr/local/bin/dockerd-entrypoint.py
RUN chmod +x /usr/local/bin/dockerd-entrypoint.py

RUN mkdir -p /home/model-server/

COPY model_handler.py /home/model-server/model_handler.py

ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]

CMD ["serve"]
```

The `Dockerfile` references two additional files, so we need to prepare `dockerd-entrypoint.py` and `model_handler.py`.

In the `dockerd-entrypoint.py`, we start the Sagemaker model server:

```python
import subprocess
import sys
import shlex
import os
from retrying import retry
from subprocess import CalledProcessError
from sagemaker_inference import model_server

def _retry_if_error(exception):
    return isinstance(exception, (CalledProcessError, OSError))

@retry(stop_max_delay=1000 * 50,
       retry_on_exception=_retry_if_error)
def _start_mms():
    model_server.start_model_server(handler_service='/home/model-server/model_handler.py:handle')

def main():
    if sys.argv[1] == 'serve':
        _start_mms()
    else:
        subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))

    # prevent docker exit
    subprocess.call(['tail', '-f', '/dev/null'])

if __name__ == '__main__':
    main()
```

In the `model_handler.py` file, we need to define the `handle` function, which gets the input data and the context (containing Sagemaker metadata), makes the prediction and returns the result. For convenience, we will wrap all the required operations in a class.

In the `initialize` method, we load the Tensorflow model and store it in an object field. The `preprocess` method reads data from the JSON input, tokenizes the values, and returns the input for a BERT model. In the `inference` method, we call the model to get a prediction. Finally, the `postprocess` method extracts the prediction value from the model response.

```python
import json
import logging
import re
import tensorflow as tf
from transformers import AutoTokenizer

class ModelHandler(object):

    def __init__(self):
        self.initialized = False
        self.model = None

        self.max_seq_length = 64
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", cache_dir="/tmp/tokenizer")

    def initialize(self, context):
        self.initialized = True
        properties = context.system_properties
        model_dir = properties.get("model_dir")

        try:
            self.model = tf.keras.models.load_model(model_dir + '/0')
        except RuntimeError as memerr:
            if re.search('Failed to allocate (.*) Memory', str(memerr), re.IGNORECASE):
                logging.error("Memory allocation exception: {}".format(memerr))
                # When we raise a MemoryError, the Sagemaker Endpoint unloads the least recently used model from memory and tries to load the new model again
                raise MemoryError
            raise

    def preprocess(self, request):
        # Here we preprocess the given JSON input using a text tokenizer.
        # The preprocessing code in your case will be different.
        # Also, the structure of the input may be different because I sent the following JSON to the endpoint:
        # {"input_text": "the text..."}
        data = request[0]['body']
        data_str = data.decode("utf-8")
        jsonlines = data_str.split("\n")

        text_before_tokenization = json.loads(jsonlines[0])["input_text"]

        encode_plus_tokens = self.tokenizer(
            text_before_tokenization,
            add_special_tokens=True,
            truncation=True,
            max_length=self.max_seq_length,
            padding="max_length",
            return_attention_mask=True,
            return_token_type_ids=False,
            return_tensors="tf"
        )

        input_ids = encode_plus_tokens["input_ids"]
        input_mask = encode_plus_tokens["attention_mask"]

        return [input_ids, input_mask]

    def inference(self, model_input):
        return self.model(model_input)

    def postprocess(self, inference_output):
        return inference_output.numpy().tolist()[0]

    def handle(self, data, context):
        model_input = self.preprocess(data)
        model_out = self.inference(model_input)
        return self.postprocess(model_out)

_service = ModelHandler()

def handle(data, context):
    if not _service.initialized:
        _service.initialize(context)

    if data is None:
        return None

    return _service.handle(data, context)
```
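
Before building the image, you can sanity-check the handler locally. The snippet below is a hypothetical smoke test: it fakes the MMS context with a `SimpleNamespace` and assumes the Tensorflow SavedModel was exported to `./model/0` on the local disk:

```python
from types import SimpleNamespace

# Hypothetical local smoke test for model_handler.py.
# Assumes the Tensorflow SavedModel lives in ./model/0 locally.
fake_context = SimpleNamespace(system_properties={"model_dir": "./model"})
fake_request = [{"body": b'{"input_text": "the text to classify"}'}]

print(handle(fake_request, fake_context))
```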

## Uploading the Docker Image to ECR

In the next step, we have to build the Docker image and upload it to the AWS Elastic Container Registry. I assume that you have installed Docker and configured the AWS CLI. Remember to grant the required permissions to the AWS account used to push the image, create the ECR repository beforehand, and replace the `[region]` and `[your AWS id]` placeholders with the AWS region and your AWS account number. Note that the `aws ecr get-login` command below exists only in AWS CLI version 1; with version 2, use `aws ecr get-login-password` piped into `docker login` instead.

```bash
docker build --tag [your AWS id].dkr.ecr.[region].amazonaws.com/multi-model-server:latest .

$(aws ecr get-login --region [region] --no-include-email)

docker push [your AWS id].dkr.ecr.[region].amazonaws.com/multi-model-server:latest
```

## Uploading the Models

Compress each model's files into a `tar.gz` archive and upload the archives to a common S3 prefix. We'll use the file names as the model identifiers later.
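
A minimal sketch of this step, assuming the Tensorflow SavedModel sits in a local `model/0` directory and using a hypothetical bucket and prefix:

```python
import tarfile
import boto3

# Package the SavedModel version directory into a tar.gz archive with a top-level "0" folder,
# because model_handler.py loads the model from "<model_dir>/0".
with tarfile.open("model_a.tar.gz", "w:gz") as archive:
    archive.add("model/0", arcname="0")

# Upload the archive to the S3 prefix that the endpoint will read the models from.
boto3.client("s3").upload_file("model_a.tar.gz", "your-bucket", "multi-model/model_a.tar.gz")
```

The archive name (`model_a.tar.gz` here) is what you will later pass as `TargetModel` when invoking the endpoint.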

## Configuring a Sagemaker Endpoint

In the end, we have to configure a Sagemaker Endpoint:

```python
import sagemaker

sagemaker_session = sagemaker.Session()

container = {
    'Image': '[your AWS id].dkr.ecr.[region].amazonaws.com/multi-model-server:latest',
    'ModelDataUrl': 's3://s3_path/that/contains/the/model/files',
    'Mode': 'MultiModel'
}

multi_model = sagemaker_session.create_model(
    name='multi-model',
    role='arn_role_with_access_to_s3_with_the_models',
    container_defs=[container])

create_endpoint_config_response = sagemaker_session.create_endpoint_config(
    name='multi-model-endpoint-cfg',
    model_name='multi-model',
    initial_instance_count=1,
    instance_type='ml.t2.medium')

sagemaker_session.create_endpoint(
    endpoint_name='multi-model-endpoint',
    config_name='multi-model-endpoint-cfg')
```

I suggest running the code in an [AWS CodePipeline](https://www.mikulskibartosz.name/deploy-tensorflow-using-sagemaker-endpoints/). After a few minutes, you should have a running Sagemaker Endpoint.
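
If you prefer the deployment script to block until the endpoint is ready, you can wait for it with a `boto3` waiter (a small sketch; the endpoint name matches the one created above):

```python
import boto3

# Block until the endpoint reaches the InService status.
boto3.client("sagemaker").get_waiter("endpoint_in_service").wait(
    EndpointName="multi-model-endpoint"
)
```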

## Using Multimodel Endpoints

We will use `boto3` to create an instance of the Sagemaker client and call the `invoke_endpoint` function:

```python
import boto3
import json

payload = json.dumps({"input_text": 'the input to the model'})

runtime = boto3.client("runtime.sagemaker")

response = runtime.invoke_endpoint(
    EndpointName='multi-model-endpoint',
    TargetModel='file_name.tar.gz',
    Body=payload)

response = response["Body"].read()
result = json.loads(response.decode("utf-8"))
```

## Limitations of Multimodel Sagemaker Endpoints

Multimodel deployments don't support data capture. If you try to configure it, you'll get a "DataCapture feature is not supported with MultiModel mode." error. Therefore, you have to add logging code either to the application that calls the endpoint or to the `model_handler.py` file.
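
If you go the handler route, a minimal sketch could extend the module-level `handle` function in `model_handler.py` to log the raw request and the prediction (on Sagemaker, these log lines end up in CloudWatch; what exactly you persist is up to you):

```python
import logging

logger = logging.getLogger("model_handler")

def handle(data, context):
    if not _service.initialized:
        _service.initialize(context)

    if data is None:
        return None

    result = _service.handle(data, context)
    # A simple substitute for data capture: log the raw request and the prediction.
    logger.info("request=%s prediction=%s", data[0]["body"], result)
    return result
```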