I decided to give auto-sklearn a try. After all, if it works well, it lets me focus on the exciting parts of machine learning: feature engineering and getting more data.
We already automate a considerable part of model building. Nobody sets hyperparameters manually; every project uses either GridSearchCV or RandomizedSearchCV. Why can't we also automate selecting the right algorithm and applying basic preprocessing?
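The part we already automate is just a few lines of plain scikit-learn. Here is a minimal sketch with GridSearchCV; the model and parameter grid are illustrative choices, not taken from this project:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# an illustrative grid: every combination gets evaluated with 5-fold cross-validation
param_grid = {'n_estimators': [100, 200, 500], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
# search.fit(X, y) tries all combinations and keeps the best estimator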
It turns out we can, but it is not as easy as promised. At least not with auto-sklearn. I decided to start with the classic "tutorial dataset": Titanic.
The first problem I encountered was installing the right packages. That one was quite easy to fix, and I wrote another blog post about it.
When I was finally able to import the package, I assumed that I could have numerical and categorical variables in my dataset. I kind of can. It is possible to specify the feature types, as long as all the columns are numerical. The feat_type parameter tells auto-sklearn how to preprocess the variables, but it will not automatically convert text to a numeric representation.
That was disappointing, but not as disappointing as the lack of ColumnTransformer. It turned out auto-sklearn supports only scikit-learn versions between 0.19 and 0.20, and ColumnTransformer was introduced in 0.20. My scikit-learn got downgraded during installation, so instead of ColumnTransformer, I had to use LabelEncoder.
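For comparison, on scikit-learn 0.20 or newer the same preprocessing could be written as a ColumnTransformer. A sketch of what I would have used if auto-sklearn had not pinned the older version:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the categorical columns, pass the numerical ones through unchanged
preprocessing = ColumnTransformer(
    [('categorical', OneHotEncoder(handle_unknown='ignore'), ['Pclass', 'Sex', 'Embarked'])],
    remainder='passthrough'
)
# encoded = preprocessing.fit_transform(X)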
I converted the text to labels and continued playing with the tool.
import pandas as pd

data = pd.read_csv('../input/train.csv')

# drop the target and the columns that are not useful as features
X = data.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin', 'PassengerId'])
y = data['Survived']

# LabelEncoder cannot handle NaN, so turn missing ports into the string 'nan'
X['Embarked'] = X['Embarked'].apply(str)

from sklearn.preprocessing import LabelEncoder
pclass_encoder = LabelEncoder()
X['Pclass'] = pclass_encoder.fit_transform(X['Pclass'])
sex_encoder = LabelEncoder()
X['Sex'] = sex_encoder.fit_transform(X['Sex'])
embarked_encoder = LabelEncoder()
X['Embarked'] = embarked_encoder.fit_transform(X['Embarked'])

# one entry per column, in order: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
column_types = (['Categorical'] * 2) + (['Numerical'] * 4) + ['Categorical']
The second problem was the misleading behavior of the n_jobs parameter. Usually, -1 means "use all CPUs." It should work like that in auto-sklearn too, but for some reason, the current version tries to access a non-existent index in some array and throws an error.

Not a huge problem: I can get the number of available CPUs with the following code and pass it as the parameter value.
import multiprocessing

# the number of CPUs on this machine, passed instead of the broken n_jobs=-1
cpus = multiprocessing.cpu_count()
Finally, I got it working.
import sklearn.model_selection
import sklearn.metrics
import autosklearn.classification

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # total search budget: one hour
    per_run_time_limit=120,        # at most two minutes per candidate model
    delete_tmp_folder_after_terminate=True,
    ensemble_memory_limit=12288,
    n_jobs=cpus,                   # the workaround for the broken n_jobs=-1
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5}
)
automl.fit(X_train.copy(), y_train.copy(), feat_type=column_types)
# with cross-validation, refit retrains the ensemble members on the full training set
automl.refit(X_train.copy(), y_train.copy())

predictions = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))