Building a classification service with Llama2 in Python

Can we use a free LLM to build a review classification service instead of paying OpenAI for GPT-3 or GPT-4? With Llama2, we can do it in several lines of Python code. We will use the smallest Llama2 model in its chat-tuned Hugging Face variant (meta-llama/Llama-2-7b-chat-hf).

I managed to run the model on a machine with 32 GiB of RAM and no GPU. In such a setup, the model needs several seconds to classify a text. Of course, we can speed it up by using a GPU.

Getting Access to the Llama2 Model

Before we start, we need to get access to the model. This requires three steps and took me around one hour, although Meta says they may need up to two days to process the request.

First, let’s open the “Llama 2 Community License Agreement” website. Read the document carefully and make sure the intended use case doesn’t violate the Acceptable Use Policy.

In the form, we have to provide an email address. It must be the same as the one associated with our Hugging Face account. Otherwise, we will get access to the model, but it won’t work with the Hugging Face API.

Second, switch to the Llama2 page on the Hugging Face website and request access to the model.

The final step is creating an API token on the Hugging Face API token website. We will need the token to download the model.

Implementing a Classifier With the Llama2 Model

We will build a REST API. The service will receive the review text as a POST request and return the sentiment of the text. We will encode the sentiment as an integer: 1 for positive, 0 for neutral, and -1 for negative.
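
The integer encoding described above can be kept in one place instead of being scattered as magic numbers. The following is a minimal sketch; the names Sentiment and label_to_code are hypothetical and do not appear in the rest of the article.

```python
from enum import IntEnum


class Sentiment(IntEnum):
    """Integer codes for the three sentiment labels used by the service."""
    NEGATIVE = -1
    NEUTRAL = 0
    POSITIVE = 1


def label_to_code(label: str) -> int:
    """Map a label such as 'Positive' to its integer code."""
    return int(Sentiment[label.strip().upper()])


print(label_to_code("Positive"))  # 1
print(Sentiment(-1).name)         # NEGATIVE
```

Because IntEnum members behave like integers, they can be returned from the API or stored in a database without conversion.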


Before we start, we have to install the required Python packages. The transformers package gives us access to models hosted on Hugging Face. The langchain package lets us define the prompt template and build a pipeline. The accelerate package is required when we use the device_map="auto" parameter. The Flask and Flask-Cors packages are required to build a REST API.

!pip install -q transformers langchain accelerate Flask Flask-Cors

Downloading the Tokenizer

We will use the Hugging Face API to download the tokenizer and the model. The API requires an API token. We can store the token using the Hugging Face CLI (huggingface-cli login --token <token>) or in Python using the HfFolder.save_token function. I chose Python.

Then we download the pre-trained tokenizer from Hugging Face.

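from huggingface_hub.hf_api import HfFolder
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer
import transformers
import torch

# Store the Hugging Face API token locally so that the gated model can be downloaded
HfFolder.save_token("huggingface token")

# The chat-tuned 7B model in the Hugging Face format
model = "meta-llama/Llama-2-7b-chat-hf"

# Download the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model)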
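
Pasting the token into the source code is fine for an experiment, but it is easy to leak that way. Below is a small sketch that reads the token from an environment variable instead. The helper name read_hf_token and the choice of the HF_TOKEN variable are assumptions made for this illustration.

```python
import os


def read_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the token stored in an environment variable, or fail with a clear message."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your Hugging Face token")
    return token


# Usage with the code above (requires huggingface_hub):
# HfFolder.save_token(read_hf_token())
```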

Creating a Pipeline

The pipeline function of the transformers library downloads the model and creates and configures all objects required to run the model.

When we specify text-generation as the task, the pipeline will tokenize the input, pass the token IDs to the model, generate new tokens, and decode the result back into text. The model parameter is the name of the model we want to use. The tokenizer parameter is the tokenizer we downloaded in the previous step.

torch_dtype specifies the precision for this model. device_map set to auto (if the accelerate library is available) lets the library automatically decide if the model should be loaded to a GPU or RAM available to the CPU. It will prioritize using GPU if possible.
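
The precision matters mostly because of memory. As a rough back-of-the-envelope sketch, the size of the weights is the number of parameters times the bytes per parameter. The parameter count below is the nominal 7 billion, not the exact figure, and the estimate ignores activations and any other runtime overhead.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weights_size_gib(num_params: float, dtype: str) -> float:
    """Approximate memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


NOMINAL_PARAMS = 7e9  # nominal size of the 7B model; the exact count differs slightly

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weights_size_gib(NOMINAL_PARAMS, dtype):.1f} GiB")
# float32: 26.1 GiB
# bfloat16: 13.0 GiB
```

Under these assumptions, a 16-bit precision leaves far more headroom on a machine with 32 GiB of RAM than 32-bit weights would.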

max_length is the maximum length of the generated text. do_sample set to True means that the model will generate tokens using a sampling strategy. top_k is the number of tokens to consider when sampling. num_return_sequences is the number of generated sequences. eos_token_id is the ID of the token used to identify the end of the sequence.
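
To make do_sample and top_k less abstract, here is a toy sketch of top-k sampling in plain Python. It is an illustration of the idea only, not the code the transformers library runs, and the logits are made-up numbers.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample one token index from the k highest-scoring entries of `logits`."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (shifted by the max for numerical stability).
    max_logit = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - max_logit) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, 1.5, -1.0]  # made-up scores for a 4-token vocabulary
print(sample_top_k(toy_logits, k=1))  # always 0: only the best token survives
print(sample_top_k(toy_logits, k=2))  # 0 or 2
```

With k=1, sampling degenerates into always picking the most likely token. Larger values of k allow more variety in the generated text.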

The last configuration step is wrapping the pipeline in langchain’s HuggingFacePipeline. Here, we specify an additional temperature parameter with the value 0, which is meant to make the output deterministic. A higher value would make the model return different text for the same input. We are building a classifier, so we don’t want that.

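# The argument list was truncated in the source. The parameter names follow the
# description above; values marked "assumed" are placeholders, not the original ones.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # assumed value
    device_map="auto",
    max_length=1000,  # assumed value
    do_sample=True,
    top_k=10,  # assumed value
    num_return_sequences=1,  # assumed value
    eos_token_id=tokenizer.eos_token_id,
)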

llm = HuggingFacePipeline(pipeline = pipeline, model_kwargs = {'temperature':0})
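
The effect of temperature can be seen in a few lines of plain Python. This is a toy sketch with made-up logits, not the library implementation. Dividing the logits by a small temperature pushes almost all of the probability onto the top token. A temperature of exactly 0 cannot be plugged into this formula, so how a given library version interprets that value is something to verify rather than assume.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # shift by the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.5, 0.5]  # made-up scores
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```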

Prompt Engineering for Text Classification with Llama2

As with every large language model, Llama2 generates text based on the prompt we provide. In the prompt, we use the few-shot in-context learning technique by giving examples of inputs and desired outputs and explaining the task.

The prompt template contains a placeholder named text. The PromptTemplate will replace the placeholder at runtime with the value we provide.

In the last line, we create an LLMChain to chain the prompt template and the pipeline together. The LLMChain fills the template with the input variable to build the full prompt, passes that prompt to the LLM, and returns the result.

from langchain import PromptTemplate, LLMChain

template = """Classify the text into neutral, negative, or positive. Reply with only one word: Positive, Negative, or Neutral.

Text: Big variety of snacks (sweet and savoury) and very good espresso Machiatto with reasonable prices, you can't get wrong if you choose the place for a quick meal or coffee.
Sentiment: Positive.

Text: I got food poisoning
Sentiment: Negative.

Text: {text}
Sentiment:"""

prompt = PromptTemplate(template=template, input_variables=["text"])

llm_chain = LLMChain(prompt=prompt, llm=llm)
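
To see what the model actually receives, we can fill the placeholder by hand. The sketch below uses Python’s str.format on a shortened copy of the template, which behaves roughly like the prompt template at runtime. It assumes the template ends with a Sentiment: line, and the review text is a made-up example.

```python
template = """Classify the text into neutral, negative, or positive. Reply with only one word: Positive, Negative, or Neutral.

Text: I got food poisoning
Sentiment: Negative.

Text: {text}
Sentiment:"""

# str.format fills the {text} placeholder with the review we want to classify.
full_prompt = template.format(text="The coffee was fine, nothing special.")
print(full_prompt)
```

The prompt ends right after "Sentiment:", so the most natural continuation for the model is a single label.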

Classifying Texts

Our pipeline returns text, which technically is a value describing the sentiment, but the text may not be the most convenient thing to work with or store in a database. Therefore, we write an adapter around the LLM and return the sentiment as an integer.

In a production application, we could mock this function and write deterministic tests for all supported cases without worrying that the LLM may return an unexpected result.

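def classify(text):
    # Run the chain; the LLM replies with free text such as "Positive."
    # (the call was cut off in the source and is reconstructed from the chain defined above)
    raw_llm_answer = llm_chain.run(text)
    llm_answer = raw_llm_answer.lower()
    if "neutral" in llm_answer:
        return 0
    elif "positive" in llm_answer:
        return 1
    elif "negative" in llm_answer:
        return -1
    else:
        raise ValueError(f"Invalid response from the LLM. Response: {raw_llm_answer}")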
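
One way to get the deterministic tests mentioned above is to separate the parsing from the LLM call and pass the LLM in as a plain callable. The following is a sketch of that idea. The names parse_sentiment, classify_with, and fake_llm are hypothetical, and the stub replaces the real chain.

```python
def parse_sentiment(raw_llm_answer):
    """Map free-text LLM output to 1, 0, or -1; raise on anything else."""
    answer = raw_llm_answer.lower()
    if "neutral" in answer:
        return 0
    if "positive" in answer:
        return 1
    if "negative" in answer:
        return -1
    raise ValueError(f"Invalid response from the LLM. Response: {raw_llm_answer}")


def classify_with(run_llm, text):
    """Classify `text` using any callable that returns the raw LLM answer."""
    return parse_sentiment(run_llm(text))


# A stub standing in for the real chain, so the test is deterministic and fast.
def fake_llm(text):
    return " Positive." if "good" in text else "I cannot tell"


print(classify_with(fake_llm, "very good espresso"))  # 1
```

In the real service, the callable would be the chain’s run method, while the tests keep using stubs like fake_llm.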

Building a REST Service

Finally, we can build a REST API using our classifier. We will use the Flask library to implement the API and the Flask-Cors library to enable CORS. Naturally, we should consider implementing access control, monitoring, and logging, but those topics are out of scope for this article.

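from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # enable CORS for all routes

@app.route('/classify', methods=['POST'])
def classify_endpoint():
    # Named differently from classify() so the view does not shadow the classifier
    text = request.json['text']
    sentiment = classify(text)
    return jsonify({'sentiment': sentiment})

if __name__ == '__main__':
    # The host argument was garbled in the source, so Flask's default host is used here
    app.run(port=5000)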
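
To call the service, a client has to send a POST request with a JSON body that contains the text field. Here is a minimal client sketch using only the standard library. The helper name build_classify_request and the localhost URL are assumptions for this example.

```python
import json
import urllib.request


def build_classify_request(text, base_url="http://localhost:5000"):
    """Build a POST request for the /classify endpoint with a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/classify",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_classify_request("I got food poisoning")
print(req.get_method(), req.full_url)

# With the service running, send it and read the integer sentiment:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["sentiment"])
```

The Content-Type header matters because Flask parses request.json only when the body is declared as JSON.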
