Small language models work much like large ones, but they are much faster, cheaper to run, and easier to fine-tune.
Table of Contents
- When to Use Small Language Models
- Defining the Task for the Model
- Loading the Model
- Data Preparation
- Fine-Tuning the Model
- Evaluating the Model
A small language model (SLM) has fewer parameters and often a simpler architecture. It is not a general-purpose model that can be instructed to do almost everything. Typically, an SLM is deployed for a single specific task (answering questions about a certain line of products, summarizing one type of document, processing data in a specific format, etc.).
They are a great alternative when using an LLM would be overkill.
When to Use Small Language Models
You may consider using a small language model when:
- You have a limited budget
SLMs don’t require hundreds of GBs of RAM and a high-end GPU. A quantized model may run on a regular laptop or be small enough to embed on a website and serve requests directly in the client’s browser (if you use WebGPU).
- You need to serve requests fast
Obviously, a small model with a fraction of the parameters of a large one will be much faster on the same hardware. But a small model can also respond faster than an LLM even when it runs on cheaper servers.
- You need more accuracy on your specific task
You won’t get better accuracy out of the box, but when you fine-tune the model, you will get better results. Of course, you can also fine-tune an open-source LLM, but fine-tuning a large model requires more resources and more curated data.
- You care about the environmental impact of your code
Small language models run on less powerful machines and finish tasks faster, thus consuming less energy.
- You need transparency and scrutiny
As mentioned in the Salesforce article “The Ever-Growing Power of Small Models”, small models are trained on smaller datasets, so it is easier to understand what they have been trained on.
Silvio Savarese, executive vice president and chief scientist at Salesforce, says:
There is a largely accepted assumption that smaller models must perform worse than larger models, and this is simply wrong. For companies looking for models which are focused on well-defined domains and on specific tasks, such as knowledge retrieval, technical support, answering customer questions, smaller models are often as competitive as bigger models.
Defining the Task for the Model
I admit the task for the model will be odd, but if we can get a model to do it, we can get an SLM to do almost any text-processing task.
In the article about the in-context learning prompt engineering technique, I instructed a large language model to do weird text processing tasks. The model was supposed to turn a sentence such as "Cats sleep in boxes" into a JSON object: {"Cats": "sleep", "in": "boxes"}. I did it by writing a prompt like this:
1. Split the given sentence into words.
2. Every odd word is a key of a JSON object.
3. Every even word is the value of the key defined before that word.
4. Return a JSON object with all words in the sentence.
Example:
Input: Every dog fetches balls.
Output: {"Every": "dog", "fetches": "balls"}
###
Cats sleep in boxes
When I used the same prompt with the SmolLM small language model, the model failed miserably. Instead of returning the JSON object, SmolLM repeated the prompt and then repeated the last line until it reached the token limit:
1. Split the given sentence into words.\n2. Every odd word is a key of a JSON object.\n3. Every even word is the value of the key defined before that word.\n4. Return a JSON object with all words in the sentence.\n\nExample:\nInput: Every dog fetches balls.\nOutput: {"Every": "dog", "fetches": "balls"}\n\n###\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n
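If you want to reproduce this baseline run, here is a minimal sketch using the transformers text-generation pipeline (the generation settings below are my assumptions, not the exact ones used for the output above):
from transformers import pipeline

# Baseline check: run the in-context learning prompt against the base SmolLM checkpoint.
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM-135M")

prompt = """1. Split the given sentence into words.
2. Every odd word is a key of a JSON object.
3. Every even word is the value of the key defined before that word.
4. Return a JSON object with all words in the sentence.

Example:
Input: Every dog fetches balls.
Output: {"Every": "dog", "fetches": "balls"}

###

Cats sleep in boxes"""

# The base model ignores the instructions and keeps repeating the last line.
print(generator(prompt, max_new_tokens=100)[0]["generated_text"])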
Loading the Model
We load the model from the Huggingface repository, so the first step is to install the transformers library. We will also need the datasets library to pass the training data to the model.
pip install transformers datasets
We import the classes in Python, download the model and its tokenizer, and move the model to the GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_name = "HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
Data Preparation
To prepare the data, we need to know what the end-of-sentence token for the model is. In the case of SmolLM, the EOS token is <|endoftext|>. If the documentation doesn’t tell us, we can take the token ID and use the tokenizer to decode its text value:
eos_string = tokenizer.decode([tokenizer.eos_token_id])
eos_string
# <|endoftext|>
Before we can fine-tune the model, we need to prepare the data. I have a dataset with around 20,000 examples of sentences and their JSON versions. Every example consists of the input line, a newline character, the output line, and the end-of-sentence token:
shop where he worked\n{\"shop\": \"where\", \"he\": \"worked\"}<|endoftext|>
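I won’t include the full dataset here, but a sketch of how one such training string can be assembled looks roughly like this (the build_example helper is only an illustration, not part of the original pipeline):
import json

eos_token = "<|endoftext|>"

def build_example(sentence: str) -> str:
    # Pair every odd word (key) with the following even word (value).
    words = sentence.split()
    mapping = {words[i]: words[i + 1] for i in range(0, len(words) - 1, 2)}
    return f"{sentence}\n{json.dumps(mapping)}{eos_token}"

build_example("shop where he worked")
# 'shop where he worked\n{"shop": "where", "he": "worked"}<|endoftext|>'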
In the case of my dataset, the entire training set is a single JSON array stored in a file. We can load the file, parse the content with the JSON module, and create a dataset:
import json
from datasets import Dataset
with open("text_for_ai.json", "r") as f:
    text_for_ai = json.load(f)
dataset = Dataset.from_dict({"text": text_for_ai})
Now, I have a dataset with a single column and values looking like this:
dataset.column_names
# ['text']
dataset["text"][0]
# Bamus Volcano is a\n{"Bamus": "Volcano", "is": "a"}<|endoftext|>
The tokenizer of the SmolLM model doesn’t use a separate pad token, so we will reuse the end-of-sentence token for padding and tokenize the dataset:
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=30, return_tensors="pt")
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)
tokenized_dataset
# Dataset({
# features: ['input_ids', 'attention_mask'],
# num_rows: 21263
# })
After tokenization, we split the dataset into training and validation sets.
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.05)
tokenized_dataset
# DatasetDict({
# train: Dataset({
# features: ['input_ids', 'attention_mask'],
# num_rows: 20200
# })
# test: Dataset({
# features: ['input_ids', 'attention_mask'],
# num_rows: 1063
# })
# })
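As an optional sanity check, we can decode one tokenized example back to text to confirm that shorter sequences are padded with the end-of-sentence token:
tokenizer.decode(tokenized_dataset["train"][0]["input_ids"])
# a sentence, its JSON version, and <|endoftext|> tokens filling the remaining positions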
What Does a Data Collator Do?
Data collators turn samples into a minibatch.
We need a collator for causal language modeling because the model predicts the next token in the sequence. In this case, the labels are a copy of the input (with padding to make them the same length). We will use the DataCollatorForLanguageModeling class, which, by default, does masked language modeling (replacing some tokens with a mask and making the model predict them), so we have to set the mlm parameter to False.
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
If we run the data collator on our tokenized dataset, we get labels in addition to the existing input_ids and attention_mask values.
out = data_collator([tokenized_dataset["train"][i] for i in range(5)])
for key in out:
print(f"{key} shape: {out[key].shape}")
input_ids shape: torch.Size([5, 30])
attention_mask shape: torch.Size([5, 30])
labels shape: torch.Size([5, 30])
And when we look at the labels, we may see something like this:
out["labels"][0]
#tensor([14812, 9087, 1482, 367, 304, 279, 102, 23280, 198, 39428,
# 14812, 9087, 1799, 476, 305, 1002, 476, 1018, 1799, 476,
# 94, 279, 102, 23280, 23597, -100, -100, -100, -100, -100])
The label sequence ends with -100 values used as padding. -100 is not a token from our tokenizer; it is a special value used by the DataCollatorForLanguageModeling to mark positions that the loss function should ignore.
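To see why -100 works as the ignored value, recall that PyTorch’s cross-entropy loss skips target positions equal to its ignore_index, which defaults to -100 (the toy logits below are made up):
import torch
import torch.nn.functional as F

# Toy example: 3 positions, a vocabulary of 5 tokens.
logits = torch.randn(3, 5)
labels = torch.tensor([2, 4, -100])  # the last position is ignored

# cross_entropy uses ignore_index=-100 by default, so only the first
# two positions contribute to the loss.
loss = F.cross_entropy(logits, labels)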
These -100 values cause an issue when we use the fine-tuned model. Because the padded positions, including the end-of-sentence token, are ignored during training, the model never learns to generate the end-of-sentence token, so it keeps generating text until the token limit is reached. For example, for the input Small models are great.\n, the model may generate:
Small models are great.
{"Small": "models", "are": "great"}
{"Small": "models", "are": "great"}
{"Small": "
Defining a Custom Data Collator
We could manually replace the -100 values in the labels with the end-of-sentence token after generating the labels, but it’s easier to do the replacement inside the data collator.
class CustomDataCollatorForLanguageModeling(DataCollatorForLanguageModeling):
    def __call__(self, examples):
        batch = super().__call__(examples)
        labels = batch['labels']
        eos_token_id = self.tokenizer.eos_token_id
        labels[labels == -100] = eos_token_id
        batch['labels'] = labels
        return batch
data_collator = CustomDataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
With a custom collator, the labels use the end-of-sentence token:
out = data_collator([tokenized_dataset["train"][i] for i in range(5)])
out["labels"][0]
#tensor([14812, 9087, 1482, 367, 304, 279, 102, 23280, 198, 39428,
# 14812, 9087, 1799, 476, 305, 1002, 476, 1018, 1799, 476,
# 94, 279, 102, 23280, 23597, 0, 0, 0, 0, 0])
Fine-Tuning the Model
Finally, we can define the training parameters and start fine-tuning the model!
from transformers import Trainer, TrainingArguments
args = TrainingArguments(
    output_dir="SmolLM",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    eval_strategy="steps",
    eval_steps=250,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    weight_decay=0.1,
    warmup_steps=50,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=500,
    fp16=True,
    push_to_hub=False,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)
trainer.train()
What’s the meaning of the training arguments?
- output_dir - the location of the model checkpoints
- per_device_train_batch_size and per_device_eval_batch_size - the batch size for training and evaluation
- eval_strategy - we can choose to evaluate the model after every epoch or after every n steps. I chose to evaluate after every 250 steps.
- eval_steps - the number of steps between evaluations
- gradient_accumulation_steps - the number of batches over which gradients are accumulated before the optimizer updates the model weights (note that it affects how the trainer counts steps for logging, evaluation, and saving!); the sketch after this list works out the effective batch size these settings imply
- num_train_epochs - the number of epochs to train the model
- weight_decay - the weight decay coefficient applied by the optimizer (a form of L2 regularization)
- warmup_steps - the number of steps to warm up the learning rate scheduler
- lr_scheduler_type - the learning rate scheduler type. A cosine scheduler first linearly increases the learning rate from 0 to the target learning rate during the warmup steps and then decreases it back to 0 following a cosine curve.
- learning_rate - the target learning rate
- save_steps - the number of steps between saving the model
- fp16 - whether to use 16-bit floating point precision to speed up the training and consume less memory (as a tradeoff, it may cause some numerical instability and worse results)
- push_to_hub - whether to push the model checkpoints to the Huggingface hub
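As mentioned in the gradient_accumulation_steps note, these settings imply the following effective batch size per optimizer update (assuming a single GPU):
# 128 examples per device, gradients accumulated over 8 batches, 1 GPU
effective_batch_size = 128 * 8
effective_batch_size
# 1024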
Evaluating the Model
The model is ready, and we can process text into JSON objects:
trained_model = trainer.model
prompt = "Small models are great.\n"
input_ids = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).to(device)
generated_ids = trained_model.generate(
    input_ids,
    max_new_tokens=30,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)
#Small models are great.
#{"Small": "models", "are": "great"}
Of course, now we can push the model to the Huggingface hub or save it locally.
from huggingface_hub import notebook_login
notebook_login()
trained_model.push_to_hub("SmolLM-135M-fine-tuned")
or
trained_model.save_pretrained("SmolLM-135M-fine-tuned")
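Note that save_pretrained on the model stores only the weights and the config. If we also save the tokenizer next to it, the fine-tuned checkpoint can later be loaded back exactly like the base model (or from the hub repository, if you pushed it there):
from transformers import AutoModelForCausalLM, AutoTokenizer

# Save the tokenizer alongside the fine-tuned model weights.
tokenizer.save_pretrained("SmolLM-135M-fine-tuned")

# Reload both from the local directory.
tokenizer = AutoTokenizer.from_pretrained("SmolLM-135M-fine-tuned")
model = AutoModelForCausalLM.from_pretrained("SmolLM-135M-fine-tuned").to("cuda")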
Do you need help building AI-powered applications for your business?
You can hire me!