How to save a machine learning model into a file

In this brief tutorial, I am going to show you how to save a Scikit-learn machine learning model to a file using the joblib library and how to load it back from that file. We are not going to call the “pickle” library directly: the Scikit-learn documentation has suggested joblib as the more efficient choice for objects that hold large NumPy arrays, which fitted models often do.

Table of Contents

Creating the model
Adding metadata to the model
Saving the model
Loading the model from a file

Creating the model

First, we need to define a model and fit it to the training data. For the sake of this tutorial, I am going to make a silly model that checks whether a given three-bit binary number is even. Using machine learning for something like this is, of course, a terrible idea, but for demonstration purposes I need something trivial. In the data below, the first three columns are the bits and the last column is the label: 1 when the number is even, 0 when it is odd.

import pandas as pd
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.tree

data = pd.DataFrame([
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0]
])

# Features: the three bits (columns 0-2). Label: the last column (3).
X = data.drop(columns=[3])
y = data[3]
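The table does not have to be typed by hand. As a minimal sketch (the `rows` variable and the use of `itertools.product` are assumptions made for this illustration, not part of the original code), the same eight rows can be generated from all three-bit combinations, with the label set to 1 when the last bit is 0:

```python
from itertools import product

# Every combination of three bits, in counting order: 000, 001, ..., 111.
# The label is 1 when the last bit is 0, which means the number is even.
rows = [[*bits, 1 - bits[-1]] for bits in product([0, 1], repeat=3)]

for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` should produce the same data frame as the hand-written list above.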

Typically, we want to store the entire pipeline as one object. That is why I am going to use the Scikit-learn Pipeline to define both the data transformations and the classifier.

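Note that due to a limitation of the joblib library (the “pickle” library has the same problem), I cannot use lambda functions: a function is saved as a reference to its name, and a lambda has no name that can be looked up when the file is loaded. Because of that, I had to define the “get_last_column” function and pass it to the pipeline step as a parameter.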

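def get_last_column(X):
    # Keep only the last feature column, reshaped to the 2D form (n_samples, 1)
    return X[:, -1].reshape(-1, 1)

pipeline = sklearn.pipeline.Pipeline(steps=[
    # validate=True converts the input (here a DataFrame) to a NumPy array
    # before calling get_last_column, which uses NumPy-style indexing
    ('filter_columns', sklearn.preprocessing.FunctionTransformer(get_last_column, validate=True)),
    ('classifier', sklearn.tree.DecisionTreeClassifier())
])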

pipeline.fit(X, y)
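To see why the named function is needed, here is a small sketch that uses only the standard “pickle” module, which joblib relies on to serialize Python objects. The `try_pickle` helper is a hypothetical name introduced for this illustration. The lambda is expected to be rejected, while a function defined with `def` at the top level of a script should be accepted.

```python
import pickle

def get_last_column(X):
    return X[:, -1].reshape(-1, 1)

def try_pickle(obj):
    """Return None if obj can be pickled, otherwise the name of the error raised."""
    try:
        pickle.dumps(obj)
        return None
    except (pickle.PicklingError, AttributeError) as error:
        return type(error).__name__

# A lambda has no importable name, so pickle cannot store a reference to it.
lambda_error = try_pickle(lambda X: X[:, -1].reshape(-1, 1))

# A named function is stored as "module name + function name".
named_error = try_pickle(get_last_column)

print('lambda:', lambda_error)
print('named function:', named_error)
```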

Adding metadata to the model

I strongly suggest storing not only the fitted model but also some metadata that describes it. We may need it later, when we load the model and no longer remember who made it and when. We can also log this information in the production application that uses the model, so that we always know which model runs in production.

In this case, I am going to save some information about the problem we solve, the author of the solution, the date when we fitted the model, a hash of the git commit that contains the code which defines the entire model, and information about the model’s accuracy.

toBePersisted = {
    'model': pipeline,
    'metadata': {
        'name': 'Is an even number?',
        'author': 'Bartosz Mikulski',
        'date': '2019-01-15T15:45:00+01:00',  # ISO 8601 with a UTC offset (CET in January)
        'source_code_version': 'c1fd8820eb8eb61740229c1c6c0d1ca53f82120e',
        'metrics': {
            'accuracy': 1.0
        }
    }
}
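The date and the commit hash do not have to be typed by hand either. The sketch below is one possible way to fill them in automatically. `build_metadata` and `current_git_commit` are hypothetical helper names, and the code assumes that the `git` command is installed and that the script runs inside a git repository. When that is not the case, the commit is stored as `None`.

```python
import subprocess
from datetime import datetime, timezone

def current_git_commit():
    """Return the hash of the checked-out commit, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ['git', 'rev-parse', 'HEAD'],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the current directory is not a repository.
        return None
    return result.stdout.strip()

def build_metadata(name, author, accuracy):
    """Collect the metadata fields used in this tutorial."""
    return {
        'name': name,
        'author': author,
        # Current time in UTC, formatted as ISO 8601 with an explicit offset.
        'date': datetime.now(timezone.utc).isoformat(timespec='seconds'),
        'source_code_version': current_git_commit(),
        'metrics': {'accuracy': accuracy},
    }

metadata = build_metadata('Is an even number?', 'Bartosz Mikulski', 1.0)
print(metadata)
```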

Saving the model

I created a Python dictionary which contains both the model and the metadata. Now, I can dump that dictionary to a file.

from joblib import dump
dump(toBePersisted, 'model.joblib')

Done. That is all we need to save a machine learning model.
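If the file size matters, `dump` also accepts a `compress` argument. The following sketch writes the same dictionary twice, with and without compression, into a temporary directory. The tiny decision tree and the file names are stand-ins chosen so that the example runs on its own, and the compression level 3 is only an example value.

```python
import os
import tempfile

import joblib
import sklearn.tree

# A small stand-in model: predicts 1 for the input 0 and 0 for the input 1.
model = sklearn.tree.DecisionTreeClassifier().fit([[0], [1]], [1, 0])
to_save = {'model': model, 'metadata': {'name': 'Is an even number?'}}

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, 'model.joblib')
    compressed_path = os.path.join(tmp_dir, 'model_compressed.joblib')

    # dump returns the list of files it wrote.
    written_files = joblib.dump(to_save, plain_path)
    joblib.dump(to_save, compressed_path, compress=3)

    plain_size = os.path.getsize(plain_path)
    compressed_size = os.path.getsize(compressed_path)

    # A compressed file is loaded in exactly the same way as a plain one.
    reloaded = joblib.load(compressed_path)

print('plain:', plain_size, 'bytes, compressed:', compressed_size, 'bytes')
```

Compression trades some saving and loading time for a smaller file, so it is probably worth measuring on a real model before enabling it.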

Loading the model from a file

Loading a model is even easier. All we need is the file name.

from joblib import load
loaded = load('model.joblib')
loaded

Now the “loaded” variable contains the dictionary defined earlier.

To use the model, I must extract it from that dictionary.

loaded['model'].predict(X)
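Because the file holds a dictionary rather than a bare model, it may be worth checking its structure right after loading. The sketch below wraps `load` in a hypothetical `load_model` helper that verifies both keys exist and returns the model together with its metadata. The small decision tree saved at the start is only a stand-in, so that the example runs on its own.

```python
import os
import tempfile

import joblib
import sklearn.tree

def load_model(path):
    """Load a saved dictionary and return its model and metadata."""
    loaded = joblib.load(path)
    if not isinstance(loaded, dict):
        raise ValueError(f'Expected a dictionary in {path}, got {type(loaded).__name__}')
    missing = {'model', 'metadata'} - set(loaded)
    if missing:
        raise ValueError(f'{path} is missing the keys: {sorted(missing)}')
    return loaded['model'], loaded['metadata']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.joblib')

    # Stand-in for the file saved earlier in the tutorial.
    stand_in = sklearn.tree.DecisionTreeClassifier().fit([[0], [1]], [1, 0])
    joblib.dump({'model': stand_in, 'metadata': {'name': 'Is an even number?'}}, path)

    model, metadata = load_model(path)

# Logging the metadata makes it clear which model is being used.
print('Loaded model:', metadata['name'])
predictions = model.predict([[0], [1]])
```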

Make sure that the required functions and modules are available!

I created the get_last_column function in the “__main__” module, which is the module of the script or notebook that is run directly.

When I load the model, a function with the same name must exist in the same module, because the file stores only a reference to the function, not its code. Otherwise, I will get this error: AttributeError: module '__main__' has no attribute 'get_last_column'.
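The behaviour can be reproduced with the standard “pickle” module alone. The sketch below creates a small module in a temporary directory (the name `model_functions_demo` is hypothetical), pickles the function defined in it, and then shows that loading works only while the function is still available in that module. It also hints at a common way to avoid the problem: keeping such functions in a regular module that both the training script and the production application can import.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical module name, chosen only for this illustration.
MODULE_NAME = 'model_functions_demo'

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a tiny module that holds the function used by the pipeline.
    Path(tmp_dir, MODULE_NAME + '.py').write_text(
        'def get_last_column(X):\n'
        '    return X[:, -1].reshape(-1, 1)\n'
    )
    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    module = importlib.import_module(MODULE_NAME)
    sys.path.remove(tmp_dir)

# The pickled bytes contain the module and function names, not the function body.
payload = pickle.dumps(module.get_last_column)
names_stored = MODULE_NAME.encode() in payload and b'get_last_column' in payload

# Loading works while the function can be found under the same name.
restored = pickle.loads(payload)
same_function = restored is module.get_last_column

# Loading fails once the function is gone from that module.
del module.get_last_column
try:
    pickle.loads(payload)
    error_message = None
except AttributeError as error:
    error_message = str(error)

print(error_message)
```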