---
title: "Building a classification service with Llama2 in Python"
description: "How to use the Llama2 AI model in Python to build a text classification service"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/building-classification-service-with-llama2-in-python
---

Can we use a free LLM to build a review classification service instead of paying OpenAI for GPT-3 or GPT-4? With Llama2, we can do it in a few dozen lines of Python code. We will use the chat variant of the smallest Llama2 model (`meta-llama/Llama-2-7b-chat-hf`).

I managed to run the model on a machine with 32 GiB of RAM and no GPU. In such a setup, the model needs several seconds to classify a text. Of course, we can speed it up by using a GPU.

## Getting Access to the Llama2 Model

Before we start, we need to get access to the model. This requires three steps and takes around 1 hour. However, Meta says they may need up to 2 days to process the request.

First, let's open the ["Llama 2 Community License Agreement" website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). Read the document carefully and make sure the intended use case doesn't violate the Acceptable Use Policy.

In the form, we have to provide an email address. It must be the same as the one associated with our Hugging Face account. Otherwise, we will get access to the model, but it won't work with the Hugging Face API.

Second, switch to the [Llama2 page](https://huggingface.co/meta-llama/Llama-2-7b) on the Hugging Face website and request access to the model.

The final step is creating an API token on the [Hugging Face API token website](https://huggingface.co/settings/token). We will need the token to download the model.

## Implementing a Classifier With the Llama2 Model

We will build a REST API. The service will receive the review text as a POST request and return the sentiment of the text. We will encode the sentiment as an integer: 1 for positive, 0 for neutral, and -1 for negative.
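
The contract described above can be sketched in a few lines. The `Sentiment` enum and the `to_response` helper are illustrative names made up for this example; only the `sentiment` field and the integer encoding come from the service itself.

```python
from enum import IntEnum
import json


class Sentiment(IntEnum):
    """Integer encoding of the sentiment returned by the service."""
    NEGATIVE = -1
    NEUTRAL = 0
    POSITIVE = 1


def to_response(sentiment: Sentiment) -> str:
    """Serialize the sentiment as the JSON body the service returns."""
    return json.dumps({"sentiment": int(sentiment)})


print(to_response(Sentiment.POSITIVE))  # {"sentiment": 1}
```

Using named constants instead of bare integers makes it harder to mix up the three values in the rest of the code.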

### Dependencies

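Before we start, we have to install the required Python packages. The transformers package gives us access to models hosted on Hugging Face. The langchain package lets us define the prompt template and build a pipeline. The accelerate package is required when we use the `device_map="auto"` parameter. The Flask and Flask-Cors packages are required to build a REST API. The code also imports `torch`, so PyTorch has to be available in the environment. The `!` prefix runs the command from a notebook cell; drop it when running the command in a regular shell.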

```
!pip install -q transformers langchain accelerate Flask Flask-Cors
```

### Downloading the Tokenizer

We will use the Hugging Face API to download the tokenizer and the model. The API requires an API token. We can store the token with the Hugging Face CLI (`huggingface-cli login --token <token>`) or in Python with the `HfFolder.save_token` function. I chose Python.

Once the token is stored, we download the pre-trained tokenizer from Hugging Face.

```python
from huggingface_hub.hf_api import HfFolder
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer
import transformers
import torch

# Replace the placeholder with your own Hugging Face API token.
HfFolder.save_token("huggingface token")

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
```
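
Hardcoding the token in a notebook makes it easy to leak. Here is a minimal sketch of an alternative, assuming the token was exported in an environment variable. The variable name `HF_TOKEN` and the `read_token` helper are assumptions made for this example.

```python
import os


def read_token(env=os.environ, name="HF_TOKEN"):
    """Return the Hugging Face token from the environment, or fail loudly."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return token


# Usage with the code above (not executed here):
# HfFolder.save_token(read_token())
```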

### Creating a Pipeline

The `pipeline` function of the `transformers` library downloads the model and creates and configures all objects required to run the model.

When we pass `text-generation` as the `task` parameter, the pipeline tokenizes the input, passes the token IDs to the model, and decodes the generated tokens back into text. The `model` parameter is the name of the model we want to use. The `tokenizer` parameter is the tokenizer we downloaded in the previous step.

`torch_dtype` specifies the numeric precision of the model weights; `torch.bfloat16` stores each value in 16 bits. `device_map` set to `auto` (which requires the accelerate library) lets the library decide automatically whether to load the model onto a GPU or into the RAM available to the CPU. It will prioritize the GPU if possible.
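
To see why the precision matters on a machine with 32 GiB of RAM, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes roughly 7 billion parameters (as the `7b` in the model name suggests) and ignores activations and other runtime overhead, so treat the numbers as a lower bound.

```python
def weights_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory taken by the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


params = 7e9  # assumption: about 7 billion parameters

print(f"bfloat16: {weights_gib(params, 2):.1f} GiB")  # bfloat16: 13.0 GiB
print(f"float32:  {weights_gib(params, 4):.1f} GiB")  # float32:  26.1 GiB
```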

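`max_new_tokens` limits how many tokens the model generates after the prompt. We only need a one-word answer, so a small value is enough. We use `max_new_tokens` rather than `max_length` because `max_length` also counts the prompt tokens, and the few-shot prompt defined below is already longer than 30 tokens. `do_sample` set to `True` means that the model will generate tokens using a sampling strategy. `top_k` is the number of highest-probability tokens to consider when sampling. `num_return_sequences` is the number of generated sequences. `eos_token_id` is the ID of the token used to identify the end of the sequence.

The last configuration step is the `HuggingFacePipeline`. Here, we pass an additional `temperature` parameter with the value `0`, intending to make the output deterministic. A higher value would make the model return different text for the same input. We build a classifier, so we don't want that. Because the pipeline is created with `do_sample=True`, it is worth checking that repeated calls return the same label for the same input; switching to `do_sample=False` (greedy decoding) is the direct way to get reproducible output.

```python
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_new_tokens=10,  # budget for the answer only; the prompt is not counted
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={"temperature": 0})
```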
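
To make the sampling parameters less abstract, here is a toy, pure-Python sketch of what greedy decoding and top-k sampling do with a score for each candidate token. It is not how `transformers` implements generation; the vocabulary, the scores, and the function names are made up for illustration.

```python
import math
import random


def greedy(scores):
    """Greedy decoding: always pick the highest-scoring token."""
    return max(scores, key=scores.get)


def top_k_sample(scores, k, temperature=1.0, rng=random):
    """Sample one token from the k highest-scoring candidates (k >= 1)."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    best = scores[top[0]]
    # Softmax weights over the kept candidates; subtracting the best score
    # keeps exp() numerically stable. A lower temperature sharpens the weights.
    weights = [math.exp((scores[t] - best) / temperature) for t in top]
    return rng.choices(top, weights=weights, k=1)[0]


scores = {"Positive": 2.0, "Neutral": 1.0, "Negative": 0.5, "Maybe": -1.0}

print(greedy(scores))  # Positive
print(top_k_sample(scores, k=2, rng=random.Random(0)))  # one of the two best tokens
```

With `k=1`, sampling collapses to greedy decoding, which is why a classifier that must return the same label every time is easier to reason about without sampling.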

### Prompt Engineering for Text Classification with Llama2

As with every large language model, Llama2 generates text based on the prompt we provide. In the prompt, we use the few-shot in-context learning technique by giving examples of inputs and desired outputs and explaining the task.

The prompt template contains a placeholder `text`. The `PromptTemplate` will replace the placeholder at runtime with the value we provide.

In the last line, we create an `LLMChain` to chain the prompt template and the pipeline together. The `LLMChain` will use the `prompt` to generate the prompt from the template and the input variable. After generating the full prompt, the chain will pass the prompt to the LLM and return the result.

```python
from langchain import PromptTemplate, LLMChain

template = """Classify the text into neutral, negative, or positive. Reply with only one word: Positive, Negative, or Neutral.

Examples:
Text: Big variety of snacks (sweet and savoury) and very good espresso Machiatto with reasonable prices, you can't get wrong if you choose the place for a quick meal or coffee.
Sentiment: Positive.

Text: I got food poisoning
Sentiment: Negative.

Text: {text}
Sentiment:"""

prompt = PromptTemplate(template=template, input_variables=["text"])

llm_chain = LLMChain(prompt=prompt, llm=llm)
```
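
To inspect the exact prompt the model receives, it helps to render the template by hand. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate`, on the assumption that both substitute `{text}` the same way for a simple template like this one. The shortened template and the sample review are made up for the example.

```python
template = """Classify the text into neutral, negative, or positive. Reply with only one word: Positive, Negative, or Neutral.

Text: {text}
Sentiment:"""

# Fill the placeholder the same way the chain would before calling the LLM.
rendered = template.format(text="The coffee was cold.")
print(rendered)
```

The rendered prompt ends with `Sentiment:`, so the most natural continuation for the model is the label itself.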

### Classifying Texts

Our pipeline returns the sentiment as text, which is inconvenient to process or store in a database. Therefore, we write an adapter around the LLM that returns the sentiment as an integer.

In a production application, we could mock this function and write deterministic tests for all supported cases without worrying that the LLM may return an unexpected result.

```python
def classify(text):
    raw_llm_answer = llm_chain.run(text)
    llm_answer = raw_llm_answer.lower()
    if "neutral" in llm_answer:
        return 0
    elif "positive" in llm_answer:
        return 1
    elif "negative" in llm_answer:
        return -1
    else:
        raise ValueError(f"Invalid response from the LLM. Response: {raw_llm_answer}")
```
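
One way to get deterministic tests for the mapping itself is to pass the chain in as an argument and replace it with a stub. The sketch below is a hypothetical variation of `classify`: `classify_with` and `StubChain` are names invented for this example, and the stub only mimics the `run` method used above.

```python
class StubChain:
    """Stand-in for the LLMChain that always returns a canned answer."""

    def __init__(self, answer):
        self.answer = answer

    def run(self, text):
        return self.answer


def classify_with(chain, text):
    """Same mapping as classify, but with the chain injected as a parameter."""
    raw_llm_answer = chain.run(text)
    llm_answer = raw_llm_answer.lower()
    if "neutral" in llm_answer:
        return 0
    elif "positive" in llm_answer:
        return 1
    elif "negative" in llm_answer:
        return -1
    raise ValueError(f"Invalid response from the LLM. Response: {raw_llm_answer}")


# Deterministic checks that never touch the real model.
assert classify_with(StubChain(" Positive."), "any review") == 1
assert classify_with(StubChain("NEGATIVE"), "any review") == -1
assert classify_with(StubChain("Neutral\n"), "any review") == 0
```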

## Building a REST Service

Finally, we can build a REST API using our classifier. We will use the Flask library to implement the API and the Flask-Cors library to enable CORS. Naturally, we should consider implementing access control, monitoring, and logging, but those topics are outside the scope of this article.

```python
from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

@app.route('/classify', methods=['POST'])
def classify_endpoint():
    # A distinct name keeps the view from shadowing the classify() function defined earlier.
    text = request.json['text']
    sentiment = classify(text)
    return jsonify({'sentiment': sentiment})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
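
Once the service is running, a client can call it with a JSON body containing the `text` field. Here is a minimal sketch using only the standard library. The `build_request` helper and the localhost URL are assumptions for this example (the port matches the `app.run` call above).

```python
import json
import urllib.request


def build_request(text, url="http://localhost:5000/classify"):
    """Build a POST request carrying the review text as JSON."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request_obj = build_request("I got food poisoning")

# Sending the request requires the service to be running:
# with urllib.request.urlopen(request_obj) as response:
#     print(json.load(response))
```

The `Content-Type` header matters because the endpoint reads the body through `request.json`.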

