Alternatives to OpenAI GPT models: using an open-source Cerebras model with LangChain

In this article, I will show you how to use an open-source Cerebras model with LangChain. Cerebras-GPT is a family of models with a GPT-3 style architecture. Cerebras has released several versions that differ in the number of parameters.

I will use the cerebras/Cerebras-GPT-2.7B model, which is the largest model I managed to load on the Google Colab Pro+ platform. All larger models are too big to fit on the Colab Pro+ platform, even when you have 50GB of RAM available. All Cerebras-GPT models are available on HuggingFace.
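
As a rough sanity check on the memory limit, the size of the weights alone can be estimated as the parameter count multiplied by the bytes per parameter. The sketch below assumes the weights are held as 32-bit floats (4 bytes each) and takes the parameter counts from the model names; the helper name is made up for illustration. The real footprint while loading is higher than this estimate.

```python
def weights_size_gb(num_parameters: float, bytes_per_parameter: int = 4) -> float:
    """Estimate the memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Parameter counts are read off the model names; 4 bytes assumes float32 weights.
for name, params in [("Cerebras-GPT-2.7B", 2.7e9), ("Cerebras-GPT-13B", 13e9)]:
    print(f"{name}: ~{weights_size_gb(params):.1f} GB")
```

Under these assumptions, the 13B weights alone already exceed 50GB, while the 2.7B model leaves plenty of headroom.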

Required libraries

I will show you how to use the model with prompt templates and LangChain agents. We will need the Transformers library to download the model, LangChain to build chains and agents around it, and SerpAPI (installed as the google-search-results package) as an example tool for the agent.

Let’s install them first:

pip install transformers langchain google-search-results

Loading the Cerebras Model with Transformers

Because the Cerebras model is available on HuggingFace, we can load both the model and the text tokenizer using the Transformers library:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "cerebras/Cerebras-GPT-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Now, we need to create a HuggingFace text-generation pipeline to turn the input text into tokens, pass the tokens to the model, and convert the output tokens back to text. Additionally, the pipeline holds the text generation configuration, as described in the HuggingFace documentation.

In this case, we set the max_new_tokens parameter, which caps the number of tokens generated by the model. We also set the early_stopping parameter to True, so the text generation doesn’t keep searching for better candidates (this flag only has an effect with beam-based generation). Finally, we set the no_repeat_ngram_size parameter to 2, which means that the model won’t produce the same pair of consecutive tokens twice.

from transformers import pipeline

pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer,
    max_new_tokens=100, early_stopping=True, no_repeat_ngram_size=2
)
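
To illustrate what no_repeat_ngram_size=2 means, here is a minimal sketch of n-gram blocking. It is not the Transformers implementation: it works on words instead of token IDs, and the function name is made up for this example. The idea is that any token which would complete an n-gram already present in the generated text is excluded from the next step.

```python
def banned_next_tokens(tokens: list[str], ngram_size: int) -> set[str]:
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = "the cat sat on the".split()
# "the cat" has already been generated, so "cat" may not follow "the" again.
print(banned_next_tokens(generated, ngram_size=2))
```

With a size of 2 the constraint is strict: every pair of consecutive tokens can appear only once in the generated text.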

Using a Model from HuggingFace with LangChain

In the next step, we have to import the HuggingFacePipeline from LangChain. We will use it as the model implementation. If you follow any other LangChain tutorial, the HuggingFacePipeline is the only thing you need to change when you want to replace OpenAI with a model from HuggingFace.

from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=pipe)

Creating a Prompt Template

Let’s use the model. For a start, we will create a prompt template without providing any additional text in the prompt. Instead, the template will pass the input verbatim into the model.

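from langchain import PromptTemplate
from langchain import LLMChain

# The template contains only the input variable, so the input reaches the model unchanged.
# Assumption: the variable is named "input"; the original listing was truncated here.
template = """{input}"""

prompt = PromptTemplate(input_variables=["input"], template=template)

chain = LLMChain(llm=llm, prompt=prompt)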

When I ran the chain with the input “When I opened the door, I saw a”, the model continued the sentence as shown below the call:

response = chain.run("""When I opened the door, I saw a""")

woman in a white dress,
with a black veil over her face.
She was holding a baby in her arms.

It’s a decent start, and we know the model is working. Let’s make something more useful by adding a prompt template instructing the model to extract the topic from a tweet:

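# Assumption: the tweet is inserted through a variable named "input"; the original listing was truncated.
template = """
Given a tweet: {input}
The topic of the tweet is:
"""

prompt = PromptTemplate(input_variables=["input"], template=template)

chain = LLMChain(llm=llm, prompt=prompt)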

Let’s test it with one of my tweets:

response = chain.run("""After writing the same client code in two languages:
Perhaps, SDKs should be thin clients for a backend SDK service that dispatches the requests to actual backend services.""")

The output isn’t perfect, but it is a good start:

- What is the difference between a thin client and a thick client?
- What are the advantages and disadvantages of each?
- How do you write a client in one language and another in another? (e.g. in Java)

Smaller Open-Source Models vs. GPT-3 or GPT-4

As we see, we must carefully write the prompt to trick the model into generating exactly what we want. With more advanced models, we can ask multiple questions at once or ask the model about abstract concepts. Simpler models have trouble generating text that is not directly related to the input. Therefore, it helps when a part of the answer is already written in the prompt, and the model only needs to fill in the gaps.
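
To make the "fill in the gaps" idea concrete, the sketch below renders a tweet-topic prompt with plain str.format as a stand-in for a prompt template. The placeholder name input and the helper function are assumptions made for this illustration. The point is that the rendered prompt ends with the beginning of the answer, so the model only has to continue it.

```python
# Stand-in for the tweet-topic template; the placeholder name "input" is an assumption.
TEMPLATE = "Given a tweet: {input}\nThe topic of the tweet is:"


def render_prompt(tweet: str) -> str:
    """Substitute the tweet into the template, as a prompt template would."""
    return TEMPLATE.format(input=tweet)


prompt_text = render_prompt(
    "Perhaps, SDKs should be thin clients for a backend SDK service "
    "that dispatches the requests to actual backend services."
)
# The last line of the prompt is the unfinished answer the model must complete.
print(prompt_text)
```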

In the next example, I will show you why cerebras/Cerebras-GPT-2.7B isn’t good enough to be used with LangChain agents. Later, I also tried cerebras/Cerebras-GPT-13B, but it didn’t help much. Neither of them is good enough to work as a LangChain agent.

Using a Cerebras model with LangChain Agents

Our LangChain agent will retrieve the current weather forecast from Google results. GPT-3 (and, obviously, GPT-4) can easily handle the task when we provide a tool they can use to retrieve the results. However, the Cerebras-GPT-2.7B model is insufficient to handle the job. Let’s see why.

from langchain import SerpAPIWrapper
from langchain.agents import Tool
from langchain.agents import initialize_agent

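serpapi = SerpAPIWrapper(serpapi_api_key='...')
tools = [
    Tool(
        name="Search",
        # Assumption: the tool calls SerpAPIWrapper.run; this part of the original listing was lost.
        func=serpapi.run,
        description="useful for when you need to get a weather forecast"
    )
]

agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)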

Now, we ask the agent to retrieve the weather forecast for Poznan, Poland:

agent.run("What is the weather forecast for Poznan, Poland")

We are going to get an exception from LangChain and a weird-looking answer:

ValueError: Could not parse LLM output: ` The weather is warm and sunny.
Answer: Warm and Sunny

What is a thought?
A thought is an idea or a feeling. It is not a fact. A thought can be a
conclusion, a question, or an action.

What happened? Why did the model generate something like this? LangChain agents use a technique called MRKL (Modular Reasoning, Knowledge, and Language). When we use tools, the model receives a prompt that looks like this:

Answer the following questions as best you can. You have access to the following tools:

// Here is a list of tools

Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Question: {input}

The model iteratively goes through at least three steps:

  • First, it generates the Thought, Action, and Action Input. In the Thought, the model can break down the query into a plan. After the initial Thought, the model chooses a tool and provides a text input for the tool.
  • At this point, LangChain interrupts the model and runs the tool.
  • The tool returns an observation, and the model continues to generate the next Thought. After the Thought, it may choose another tool or generate the Final Answer.

In the output of the model I used, we see that the model was confused by the given format, but at least it tried to do something.
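
The loop and the failure can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: the regular expression, the run_agent function, both scripted models, and the stub search result are made up for the example, and the scripted models stop on their own instead of being interrupted. It shows why output that contains neither an Action nor a Final Answer ends in a parsing error.

```python
import re
from typing import Callable, Dict

# Matches the "Action: ... / Action Input: ..." pair requested by the prompt format.
ACTION_PATTERN = re.compile(r"Action\s*:\s*(.*?)\s*\n\s*Action Input\s*:\s*(.*)")


def run_agent(llm: Callable[[str], str], tools: Dict[str, Callable[[str], str]],
              question: str, max_steps: int = 5) -> str:
    """Alternate between model output and tool calls until a Final Answer appears."""
    prompt = f"Question: {question}\nThought:"
    for _ in range(max_steps):
        output = llm(prompt)
        if "Final Answer:" in output:
            return output.split("Final Answer:")[-1].strip()
        match = ACTION_PATTERN.search(output)
        if match is None:
            # The model ignored the format, so there is no tool call to execute.
            raise ValueError(f"Could not parse LLM output: `{output}`")
        tool_name, tool_input = match.group(1).strip(), match.group(2).strip()
        observation = tools[tool_name](tool_input)
        # Feed the tool result back so the model can produce the next Thought.
        prompt += f"{output}\nObservation: {observation}\nThought:"
    raise RuntimeError("No final answer within the step limit")


def scripted_llm(prompt: str) -> str:
    """Hypothetical well-behaved model: search first, answer after an observation."""
    if "Observation:" in prompt:
        return " I now know the final answer\nFinal Answer: stub forecast"
    return " I should look it up\nAction: Search\nAction Input: weather Poznan"


def confused_llm(prompt: str) -> str:
    """Replays the beginning of the malformed output quoted above."""
    return " The weather is warm and sunny.\nAnswer: Warm and Sunny"


tools = {"Search": lambda query: "stub forecast"}  # stub tool, no real search
question = "What is the weather forecast for Poznan, Poland"

answer = run_agent(scripted_llm, tools, question)
print(answer)

try:
    run_agent(confused_llm, tools, question)
    error_message = ""
except ValueError as error:
    error_message = str(error)
print(error_message)
```

The second call fails on the first step: the text contains "Answer:" but neither "Final Answer:" nor an "Action:" line, so the loop has nothing it can act on.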
