How to fine-tune SmolLM, a super-fast small language model from Hugging Face

Small language models are just like large ones, but they are much faster, cheaper, and easier to fine-tune.

A small language model (SLM) has fewer parameters and often a simpler architecture. It is not a general-purpose model that can be instructed to do almost everything. Typically, an SLM is deployed for a single specific task (answering questions about a certain line of products, summarizing one type of document, processing data in a specific format, etc.).

They are a great alternative when using an LLM would be overkill.

When to Use Small Language Models

You may consider using a small language model when:

  • You have a limited budget

SLMs don’t require hundreds of gigabytes of RAM and a high-end GPU. A quantized model may run on a regular laptop or even be small enough to embed on a website and serve requests directly in the client’s browser (if you use WebGPU).

  • You need to serve requests fast

Obviously, a small model with a fraction of the parameters of a large one will be much faster when both run on the same hardware. But a small model can be faster than an LLM even when it runs on cheaper servers.

  • You need more accuracy on your specific task

You won’t get better accuracy out of the box, but once you fine-tune the model on your task, you can get better results. Of course, you can also fine-tune an open-source LLM, but fine-tuning a large model requires more resources and more curated data.

  • You care about the environmental impact of your code

Small language models use less powerful machines to run and finish tasks faster, thus consuming less energy.

  • You need transparency and scrutiny

As mentioned in the Salesforce article “The Ever-Growing Power of Small Models”, small models are trained on smaller datasets, so it is easier to understand what they have been trained on.

Silvio Savarese, executive vice president and chief scientist at Salesforce, says:

There is a largely accepted assumption that smaller models must perform worse than larger models, and this is simply wrong. For companies looking for models focused on well-defined domains and specific tasks, such as knowledge retrieval, technical support, or answering customer questions, smaller models are often just as competitive as larger models.

Defining the Task for the Model

I admit the task for the model will be odd, but if we can get a model to do it, we can get an SLM to do almost any text-processing task.

In the article about the in-context learning prompt engineering technique, I instructed a large language model to do weird text-processing tasks. The model was supposed to turn a sentence such as Cats sleep in boxes into a JSON object: {"Cats": "sleep", "in": "boxes"}. I did it by writing a prompt like this:

1. Split the given sentence into words.
2. Every odd word is a key of a JSON object.
3. Every even word is the value of the key defined before that word.
4. Return a JSON object with all words in the sentence.

Example:
Input: Every dog fetches balls.
Output: {"Every": "dog", "fetches": "balls"}

###

Cats sleep in boxes

When I used the same prompt with the SmolLM small language model, the model failed miserably. Instead of returning the JSON object, SmolLM repeated the prompt and then repeated the last line until it reached the token limit:

1. Split the given sentence into words.\n2. Every odd word is a key of a JSON object.\n3. Every even word is the value of the key defined before that word.\n4. Return a JSON object with all words in the sentence.\n\nExample:\nInput: Every dog fetches balls.\nOutput: {"Every": "dog", "fetches": "balls"}\n\n###\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n\nCats sleep in boxes.\n
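
For reference, here is a minimal sketch of how you could reproduce this behavior yourself. It assumes the same HuggingFaceTB/SmolLM-135M checkpoint that we load later in this post and uses the transformers text-generation pipeline:

from transformers import pipeline

# Load the base (not fine-tuned) SmolLM checkpoint.
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM-135M")

prompt = """1. Split the given sentence into words.
2. Every odd word is a key of a JSON object.
3. Every even word is the value of the key defined before that word.
4. Return a JSON object with all words in the sentence.

Example:
Input: Every dog fetches balls.
Output: {"Every": "dog", "fetches": "balls"}

###

Cats sleep in boxes"""

# The base model echoes the prompt and then keeps repeating the last line.
print(generator(prompt, max_new_tokens=100)[0]["generated_text"])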

Loading the Model

We load the model from the Hugging Face Hub, so the first step is to install the transformers library. We will also need the datasets library to pass the training data to the model.

pip install transformers datasets

We import the classes in Python, download the model and its tokenizer, and move the model to the GPU.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


device = "cuda"
model_name = "HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

Data Preparation

To prepare the data, we need to know what the end-of-sentence token for the model is. In the case of SmolLM, the EOS token is <|endoftext|>. If the documentation doesn’t tell us, we can take the token id and use the tokenizer to decode its text value:

eos_string = tokenizer.decode([tokenizer.eos_token_id])
eos_string

# <|endoftext|>

Before we can fine-tune the model, we need to prepare the data. I have a dataset with around 20,000 examples of sentences and their JSON versions. Every example consists of the input line, a newline character, the output line, and the end-of-sentence token:

shop where he worked\n{\"shop\": \"where\", \"he\": \"worked\"}<|endoftext|>
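
The dataset in this post was prepared upfront, but to illustrate the format, here is a minimal sketch of how one such training string could be built (build_example is a hypothetical helper, not part of the original code):

import json

def build_example(sentence, eos_string="<|endoftext|>"):
    # Pair every odd word with the following even word.
    words = sentence.split()
    pairs = {words[i]: words[i + 1] for i in range(0, len(words) - 1, 2)}
    # Input line, a newline, the JSON line, and the end-of-sentence token.
    return f"{sentence}\n{json.dumps(pairs)}{eos_string}"

build_example("shop where he worked")

# shop where he worked\n{"shop": "where", "he": "worked"}<|endoftext|>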

In the case of my dataset, the entire training set is a single JSON array stored in a file. We can load the file, parse the content with the JSON module, and create a dataset:

import json
from datasets import Dataset


with open("text_for_ai.json", "r") as f:
 text_for_ai = json.load(f)

dataset = Dataset.from_dict({"text": text_for_ai})

Now, I have a dataset with a single column and values looking like this:

dataset.column_names

# ['text']

dataset["text"][0]

# Bamus Volcano is a\n{"Bamus": "Volcano", "is": "a"}<|endoftext|>

The tokenizer of the SmolLM model doesn’t use a separate pad token, so we will reuse the end-of-sentence token for padding and tokenize the dataset:

tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=30, return_tensors="pt")

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)
tokenized_dataset

# Dataset({
#     features: ['input_ids', 'attention_mask'],
#     num_rows: 21263
# })

After tokenization, we split the dataset into training and validation sets.

tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.05)
tokenized_dataset

# DatasetDict({
#     train: Dataset({
#         features: ['input_ids', 'attention_mask'],
#         num_rows: 20200
#     })
#     test: Dataset({
#         features: ['input_ids', 'attention_mask'],
#         num_rows: 1063
#     })
# })

What Does a Data Collator Do?

Data collators turn individual samples into a minibatch.

We need a collator for causal language modeling because the model learns to predict the next token from the preceding ones. In this case, the labels are a copy of the input (padded to the same length); the shift between inputs and labels happens inside the model. We will use DataCollatorForLanguageModeling, which by default does masked language modeling (replacing some tokens with a mask and making the model predict them), so we have to set the mlm parameter to False.

from transformers import DataCollatorForLanguageModeling


data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

If we run the data collator on our tokenized dataset, we get labels in addition to the existing input_ids and attention_mask values.

out = data_collator([tokenized_dataset["train"][i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

# input_ids shape: torch.Size([5, 30])
# attention_mask shape: torch.Size([5, 30])
# labels shape: torch.Size([5, 30])


And when we look at the labels, we may see something like this:

out["labels"][0]

#tensor([14812,  9087,  1482,   367,   304,   279,   102, 23280,   198, 39428,
#        14812,  9087,  1799,   476,   305,  1002,   476,  1018,  1799,   476,
#           94,   279,   102, 23280, 23597,  -100,  -100,  -100,  -100,  -100])

The labels end with the value -100 used as padding. It’s not a token from our tokenizer; it’s a special value used by DataCollatorForLanguageModeling to mark positions that should be ignored when the loss is computed.
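
The -100 value is not arbitrary: it is the default ignore_index of PyTorch’s cross-entropy loss, so the loss is simply not computed for the positions labeled with it. A minimal illustration (with made-up logits, unrelated to our model):

import torch

loss_fn = torch.nn.CrossEntropyLoss()    # ignore_index defaults to -100
logits = torch.randn(4, 10)              # 4 positions, a vocabulary of 10 tokens
labels = torch.tensor([3, 7, -100, -100])

# Only the first two positions contribute to the loss.
loss_fn(logits, labels)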

This causes an issue when we use the fine-tuned model. Because the loss is never computed for the padded positions, the model never learns to generate the end-of-sentence token, so it keeps generating text until the token limit is reached. For example, for the input Small models are great.\n, the model may generate:

Small models are great.
{"Small": "models", "are": "great"}
{"Small": "models", "are": "great"}
{"Small": "

Defining a Custom Data Collator

We could manually replace the -100 token in the labels with the end-of-sentence token after generating labels, but it’s easier to do the replacement inside the data collator.

class CustomDataCollatorForLanguageModeling(DataCollatorForLanguageModeling):
    def __call__(self, examples):
        batch = super().__call__(examples)
        labels = batch['labels']
        eos_token_id = self.tokenizer.eos_token_id

        labels[labels == -100] = eos_token_id

        batch['labels'] = labels
        return batch

data_collator = CustomDataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

With the custom collator, the labels use the end-of-sentence token, so the loss is also computed for the padded positions and the model learns to stop generating after the JSON object:

out = data_collator([tokenized_dataset["train"][i] for i in range(5)])
out["labels"][0]


#tensor([14812,  9087,  1482,   367,   304,   279,   102, 23280,   198, 39428,
#        14812,  9087,  1799,   476,   305,  1002,   476,  1018,  1799,   476,
#           94,   279,   102, 23280, 23597,    0,    0,    0,    0,    0])

Fine-Tuning the Model

Finally, we can define the training parameters and start fine-tuning the model!

from transformers import Trainer, TrainingArguments


args = TrainingArguments(
    output_dir="SmolLM",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    eval_strategy="steps",
    eval_steps=250,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    weight_decay=0.1,
    warmup_steps=50,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=500,
    fp16=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)

trainer.train()

What’s the meaning of the training arguments?

  • output_dir - the location of the model checkpoints
  • per_device_train_batch_size and per_device_eval_batch_size - the batch size for training and evaluation
  • eval_strategy - we can choose to evaluate the model after every epoch or after every n steps. I chose to evaluate after every 250 steps.
  • eval_steps - the number of steps between evaluations
  • gradient_accumulation_steps - the number of batches to accumulate gradients over before updating the model weights (note that it affects how the trainer counts steps for logging, evaluation, and saving; see the quick calculation after this list)
  • num_train_epochs - the number of epochs to train the model
  • weight_decay - the weight decay coefficient applied by the optimizer (L2 regularization)
  • warmup_steps - the number of steps to warm up the learning rate scheduler
  • lr_scheduler_type - the learning rate scheduler type. A cosine scheduler first linearly increases the learning rate from 0 to the target learning rate during the warmup steps and then decreases it back to 0 following a cosine curve.
  • learning_rate - the target learning rate
  • save_steps - the number of steps between saving the model
  • fp16 - whether to use 16-bit floating point precision to speed up the training and consume less memory (as a tradeoff, it may cause some numerical instability and worse results)
  • push_to_hub - whether to push the model checkpoints to the Huggingface hub
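
As a quick sanity check (a rough calculation that assumes a single GPU), the per-device batch size and gradient accumulation combine into the effective batch size per optimizer update:

per_device_train_batch_size = 128
gradient_accumulation_steps = 8

# Number of examples processed per weight update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
effective_batch_size

# 1024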

Evaluating the Model

The model is ready, and we can process text into JSON objects:

trained_model = trainer.model

prompt = "Small models are great.\n"
input_ids = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).to(device)

generated_ids = trained_model.generate(
 input_ids,
    max_new_tokens=30,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)

#Small models are great.
#{"Small": "models", "are": "great"}

Of course, now we can push the model to the Hugging Face Hub or save it locally.

from huggingface_hub import notebook_login
notebook_login()


trained_model.push_to_hub("SmolLM-135M-fine-tuned")

or

trained_model.save_pretrained("SmolLM-135M-fine-tuned")
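
If you plan to load the fine-tuned model elsewhere with AutoTokenizer, it’s worth shipping the tokenizer together with the model (a small addition to the snippets above):

tokenizer.push_to_hub("SmolLM-135M-fine-tuned")

or

tokenizer.save_pretrained("SmolLM-135M-fine-tuned")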

Do you need help building AI-powered applications for your business?
You can hire me!
