How to fine-tune an OpenAI model using custom data

At some point, in-context learning becomes insufficient, and we may need to fine-tune the model for a specific task. In the case of OpenAI models, the fine-tuning process is fully automated. We only need to prepare the training data in a specified format.

Of course, we will not receive the model files. Instead, OpenAI will host the model, and we can use our fine-tuned version through the API. The model will only be available in the account that trained it and will not be shared with anyone else.

When is it appropriate to fine-tune the model?

I would consider fine-tuning the model in two cases:

  1. If an LLM cannot handle the task without receiving multiple examples with the prompt, we may consider fine-tuning the model to limit the number of tokens sent with every request. The fine-tuned model will be cheaper because we are charged per token. Of course, we also have to consider the cost of training the model.

  2. If we have a specific task and would like to reduce the probability of someone tricking the model into returning an inappropriate answer, we may fine-tune the model. Fine-tuned versions tend to perform worse on generic tasks but better on the specific task they were trained on.

Does fine-tuning make sense when the generic version cannot perform the task well? I don’t think so. If you could not instruct the generic model to perform the task, I wouldn’t assume you have good enough training examples to train the model. Of course, if you have already tried all prompt engineering tricks, fine-tuning may be the only option left.
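The cost trade-off from the first case is simple arithmetic, and it can be sketched as a break-even calculation. The function name and all numbers below are hypothetical placeholders, not OpenAI's actual pricing; note that a fine-tuned model may be priced differently per token than the base model, so check the current price list before relying on the result.

```python
import math

def break_even_requests(few_shot_tokens, fine_tuned_tokens,
                        base_price_per_1k, fine_tuned_price_per_1k,
                        training_cost):
    """Return the number of requests after which fine-tuning pays off.

    Returns None when a fine-tuned request is not cheaper than a few-shot one.
    """
    base_cost = few_shot_tokens / 1000 * base_price_per_1k
    tuned_cost = fine_tuned_tokens / 1000 * fine_tuned_price_per_1k
    saving_per_request = base_cost - tuned_cost
    if saving_per_request <= 0:
        return None
    return math.ceil(training_cost / saving_per_request)

# Hypothetical numbers: a 2000-token few-shot prompt versus a 200-token prompt
requests_needed = break_even_requests(
    few_shot_tokens=2000, fine_tuned_tokens=200,
    base_price_per_1k=0.02, fine_tuned_price_per_1k=0.12,
    training_cost=25.0,
)
print(requests_needed)
```

If the function returns None, the shorter prompt does not compensate for the per-token price of the fine-tuned model, and the first reason for fine-tuning does not apply.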

Prepare the data

OpenAI requires the training data in the JSONL format: a text file with one JSON object per line. Each object looks simple: {"prompt": "<prompt text>", "completion": "<ideal generated text>"}. However, there are also requirements regarding the provided prompts and completions.

Prompts

Each prompt should end with the same fixed separator. The model uses the separator to determine when the prompt ends and the completion begins. The separator should not appear elsewhere in any prompt. OpenAI recommends using \n\n###\n\n as the separator.

During inference, every prompt must end with the same separator that was used during training.

Completions

Each completion should start with a whitespace character because the tokenizer encodes most words together with the preceding space.

Each completion should end with the same fixed stop sequence. The stop sequence informs the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.

During inference, we have to pass the same stop sequence that was used during training.
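The prompt and completion rules above can be sketched as one small helper. The function name to_training_record is hypothetical, and "###" is only one possible stop sequence; any token that never appears in a completion would work.

```python
import json

SEPARATOR = "\n\n###\n\n"  # prompt separator recommended by OpenAI
STOP_SEQUENCE = "###"      # assumed stop sequence for this sketch

def to_training_record(prompt, completion):
    """Build one training example that follows the formatting rules."""
    if SEPARATOR in prompt:
        raise ValueError("the separator must not appear inside the prompt")
    if STOP_SEQUENCE in completion:
        raise ValueError("the stop sequence must not appear inside the completion")
    return {
        "prompt": prompt + SEPARATOR,
        # leading space for the tokenizer, stop sequence at the end
        "completion": " " + completion.strip() + STOP_SEQUENCE,
    }

record = to_training_record("What is fine-tuning?", "Further training on task data.")
print(json.dumps(record))  # one line of the JSONL training file
```

Raising an error on a clashing separator or stop sequence is a design choice of this sketch; escaping or replacing the offending text would be an alternative.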

How many examples do you need?

According to the OpenAI documentation:

"Fine-tuning performs better with more high-quality examples. To fine-tune a model that performs better than using a high-quality prompt with our base models, you should provide at least a few hundred high-quality examples, ideally vetted by human experts. From there, performance tends to linearly increase with every doubling of the number of examples. Increasing the number of examples is usually the best and most reliable way of improving performance."

How to create the examples?

If you already have an AI-based system in production, I recommend logging all interactions with it. Later, you can use the logs to create the training data. You cannot pass the log as the training dataset without reviewing the content or fixing the issues. After all, if you could, it would mean that the existing system performs the task well, and you would not need to fine-tune the model. Instead, you should review all of the training examples and reject (or, better, fix) the invalid ones.
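The review step could feed a filter like the one below. This is only a sketch: the log entry structure (dictionaries with "prompt", "completion", an optional "corrected_completion" written by the reviewer, and an "approved" flag) is an assumption made for illustration, so adapt the field names to whatever your logging produces.

```python
def select_training_examples(log_entries):
    """Keep approved, non-empty interactions and drop repeated prompts."""
    seen_prompts = set()
    selected = []
    for entry in log_entries:
        prompt = entry.get("prompt", "").strip()
        # prefer the reviewer's fixed answer over the logged one
        answer = entry.get("corrected_completion") or entry.get("completion", "")
        answer = answer.strip()
        if not entry.get("approved") or not prompt or not answer:
            continue
        if prompt in seen_prompts:
            continue
        seen_prompts.add(prompt)
        selected.append((prompt, answer))
    return selected

logs = [
    {"prompt": "Q1", "completion": "good answer", "approved": True},
    {"prompt": "Q2", "completion": "wrong", "corrected_completion": "fixed", "approved": True},
    {"prompt": "Q3", "completion": "nonsense", "approved": False},
    {"prompt": "Q1", "completion": "duplicate", "approved": True},
]
examples = select_training_examples(logs)
```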

How to prepare the data?

I wanted to fine-tune a question-answering model using the articles from my blog as the source of questions and answers.

First, I passed the article to the generic OpenAI model and asked the model to generate the questions from my articles. In another OpenAI request, I sent the same article and the question. In the second request, I asked the model to answer the questions. I used the generated questions and answers as the training data.

import openai
openai.api_key = "..."

def query_gpt(prompt):
    response = openai.Completion.create(
      model="text-davinci-003",
      prompt=prompt,
      temperature=0.7,
      max_tokens=512
    )
    raw_text = response['choices'][0]['text']
    # return the raw response and the parsed lines without the "1. " numbering prefix
    return raw_text, [text.strip()[3:] for text in raw_text.split('\n') if text.strip()]

question_1 = "\n\n###\nGiven the article above. Write three questions that can be answered with the article content."
question_2 = "\n\nAnswer those questions by providing 1-2 sentences extracted from the article."

def generate_questions_and_answers(article):
    try:
        prompt = article.content + question_1
        raw_response, questions = query_gpt(prompt)
        prompt = prompt + raw_response + question_2
        raw_response, answers = query_gpt(prompt)
        return list(zip(questions, answers))
    except Exception as e:
        print(e)
        return []

q_and_a = []
articles = []  # load the articles here; every item needs a .content attribute with the article text
for article in articles:
    q_and_a.extend(generate_questions_and_answers(article))
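
The [3:] slice in query_gpt assumes that every line of the response starts with exactly three prefix characters, such as "1. ". The model does not always follow that pattern, so a more tolerant parser may be worth having. Below is a minimal sketch; parse_numbered_list is a hypothetical helper, and the sample response is made up.

```python
import re

# optional leading spaces, a number, "." or ")", then optional spaces
NUMBERING = re.compile(r"^\s*\d+[.)]\s*")

def parse_numbered_list(raw_text):
    """Split a model response into items and strip the list numbering."""
    items = []
    for line in raw_text.split("\n"):
        if not line.strip():
            continue  # skip empty lines between items
        items.append(NUMBERING.sub("", line).strip())
    return items

sample = "\n\n1. What is fine-tuning?\n2) When should I use it?\n10. How much data do I need?"
questions = parse_numbered_list(sample)
print(questions)
```

Unlike the fixed slice, this version also handles two-digit numbers and lines without any numbering.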

After running the code, I had a list q_and_a containing tuples with questions and answers. Now, I had to store them in the supported format.

import pandas as pd
df = pd.DataFrame(q_and_a)
df.columns = ['question', 'answer']

# here, I build the prompt from the question and add the separator to the end
df['prompt'] = df['question'] + "\n\n###\n\n"
# this adds the required space at the beginning of the answer and the stop sequence at the end
df['answer'] = " " + df['answer'] + '###'

training_data = df[['prompt', 'answer']]
training_data.columns = ['prompt', 'completion']
training_data.to_json('training_data.jsonl', lines=True, orient='records')
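Before uploading the file, it may be worth checking that every record follows the formatting rules. The sketch below is an assumption about what such a check could look like, not an official validator; the function name is hypothetical, and the default separator and stop sequence match the ones used in this article.

```python
import json

def validate_training_file(path, separator="\n\n###\n\n", stop="###"):
    """Return a list of (line_number, problem) tuples; empty means no issues found."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            prompt = record.get("prompt", "")
            completion = record.get("completion", "")
            if not prompt.endswith(separator):
                problems.append((number, "prompt does not end with the separator"))
            elif separator in prompt[:-len(separator)]:
                problems.append((number, "separator appears inside the prompt"))
            if not completion.startswith(" "):
                problems.append((number, "completion does not start with a space"))
            if not completion.endswith(stop):
                problems.append((number, "completion does not end with the stop sequence"))
            elif stop in completion[:-len(stop)]:
                problems.append((number, "stop sequence appears inside the completion"))
    return problems
```

Calling validate_training_file('training_data.jsonl') on the file created above should return an empty list when all rows are formatted correctly.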

Fine-tune the OpenAI model

When the training dataset is ready, we can run the fine-tuning process. In the case of OpenAI, all we need is a command-line tool:

openai api fine_tunes.create -t training_data.jsonl -m davinci

You will receive an identifier for the fine-tuning job and see the logs. If the connection gets interrupted, the training continues on OpenAI's side. You can attach to the logs again by running: openai api fine_tunes.follow -i the_unique_identifier. In the end, you will receive the identifier of the fine-tuned model.

Model validation

In addition to the training dataset, we can prepare a validation dataset in the same format. It is crucial to ensure that both datasets are mutually exclusive! If you have a validation dataset, you can pass the validation data to the training command like this:

openai api fine_tunes.create -t training_data.jsonl -v validation_data.jsonl -m davinci

When you provide a validation dataset, OpenAI will calculate metrics and report them in the results file. You can retrieve the results using: openai api fine_tunes.results -i the_unique_identifier
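One way to keep the training and validation datasets mutually exclusive is to deduplicate by prompt before splitting. The sketch below assumes the examples are (prompt, completion) tuples such as the q_and_a list; the function name, the 20% validation share, and the seed are arbitrary choices.

```python
import random

def split_examples(examples, validation_fraction=0.2, seed=42):
    """Split (prompt, completion) pairs into disjoint training and validation lists."""
    by_prompt = {}
    for prompt, completion in examples:
        # keep only the first completion for every prompt
        by_prompt.setdefault(prompt, completion)
    unique = list(by_prompt.items())
    random.Random(seed).shuffle(unique)  # seeded for a reproducible split
    n_validation = int(len(unique) * validation_fraction)
    return unique[n_validation:], unique[:n_validation]

pairs = [(f"question {i}", f"answer {i}") for i in range(10)]
pairs.append(("question 0", "a different answer"))  # repeated prompt
training, validation = split_examples(pairs)
```

Deduplicating by prompt rather than by the whole pair prevents the same question from landing in both files with different answers.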

Classification models

You can train a classification model using the standard command for text completion, but in the case of classification, you will be interested in a different set of metrics. OpenAI doesn’t know how you will use the model, so you have to request those metrics by passing the --compute_classification_metrics flag. When you use the flag, you must also pass the number of classes using the --classification_n_classes flag. If you are doing binary classification, you must additionally pass the --classification_positive_class flag to tell OpenAI which class is the positive one.

openai api fine_tunes.create \
-t training_data.jsonl \
-v validation_data.jsonl \
-m davinci \
--compute_classification_metrics \
--classification_n_classes 2 \
--classification_positive_class "positive"

or

openai api fine_tunes.create \
-t training_data.jsonl \
-v validation_data.jsonl \
-m davinci \
--compute_classification_metrics \
--classification_n_classes 7

If those commands don’t work in your shell, remove the trailing backslashes (\) and put everything on one line.

You will find additional metrics such as accuracy or F1-score in the result file. The binary classification additionally reports precision, recall, AUROC, and AUPRC. However, you must remember that those metrics are calculated using the 0.5 threshold. If you are going to use a different threshold, you should run your validation code and calculate the metrics yourself.
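Calculating the metrics for a different threshold takes only a few lines. The sketch below assumes you have already run the fine-tuned model on the validation examples and collected the probability of the positive class for each of them; how you obtain those probabilities is outside the scope of this example, and the function name is hypothetical.

```python
def binary_metrics(probabilities, labels, threshold=0.5):
    """Compute accuracy, precision, recall, and F1 at the given threshold.

    labels use 1 for the positive class and 0 for the negative class.
    """
    tp = fp = fn = tn = 0
    for probability, label in zip(probabilities, labels):
        predicted_positive = probability >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up probabilities and labels, only to show the call
metrics = binary_metrics([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0], threshold=0.3)
```

Lowering the threshold in this toy example raises recall at the expense of precision, which is the usual trade-off to inspect before picking a value.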

Use the fine-tuned model

When the model is ready, we can send our questions. Remember to add the required separator at the end of the prompt and to pass the stop sequence in the request. It’s best to create a function to call the model so you never forget about either of them.

def query_gpt(prompt):
    response = openai.Completion.create(
      model="davinci:the_unique_identifier",
      prompt=prompt + "\n\n###\n\n",
      temperature=0.7,
      max_tokens=512,
      stop="###"
    )
    return response['choices'][0]['text']

query_gpt("your question")

Naturally, the results will be as good (or bad) as the training data.

