I decided to give auto-sklearn a try. After all, if it works well, it lets me focus on the exciting parts of machine learning: feature engineering and getting more data.
We already automate a considerable part of model building. Nobody sets hyperparameters manually; every project uses either GridSearchCV or RandomizedSearchCV. Why can't we also automate selecting the right algorithm and applying basic preprocessing?
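The part we already automate is just a few lines of plain scikit-learn. Here is a minimal sketch with GridSearchCV; the model and parameter grid are illustrative choices, not taken from this project:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# an illustrative grid: every combination gets evaluated with 5-fold cross-validation
param_grid = {'n_estimators': [100, 200, 500], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
# search.fit(X, y) tries all combinations and keeps the best estimator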
It turns out we can, but it is not as easy as promised. At least not with auto-sklearn. I decided to start with the classic "tutorial dataset": Titanic.
The first problem I encountered was installing the right packages. That one was quite easy to fix, and I wrote another blog post about it.
When I was finally able to import the package, I assumed that I could have numerical and categorical variables in my dataset. I kind of can. It is possible to specify the feature types, as long as all the columns are numerical. The feat_type parameter tells auto-sklearn how to preprocess the variables, but it will not automatically convert text to a numeric representation.
That was disappointing, but not as disappointing as the lack of ColumnTransformer. It turned out auto-sklearn supports only scikit-learn versions between 0.19 and 0.20, and ColumnTransformer was introduced in 0.20. My scikit-learn got downgraded during installation, so instead of ColumnTransformer, I had to use LabelEncoder.
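For comparison, on scikit-learn 0.20 or newer the same preprocessing could be written as a ColumnTransformer. A sketch of what I would have used if auto-sklearn had not pinned the older version:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the categorical columns, pass the numerical ones through unchanged
preprocessing = ColumnTransformer(
    [('categorical', OneHotEncoder(handle_unknown='ignore'), ['Pclass', 'Sex', 'Embarked'])],
    remainder='passthrough'
)
# encoded = preprocessing.fit_transform(X)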
I converted the text to labels and continued playing with the tool.
import pandas as pd

data = pd.read_csv('../input/train.csv')

# drop the target and the columns that are not useful as features
X = data.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin', 'PassengerId'])
y = data['Survived']

# LabelEncoder cannot handle NaN, so turn missing ports into the string 'nan'
X['Embarked'] = X['Embarked'].apply(str)

from sklearn.preprocessing import LabelEncoder
pclass_encoder = LabelEncoder()
X['Pclass'] = pclass_encoder.fit_transform(X['Pclass'])
sex_encoder = LabelEncoder()
X['Sex'] = sex_encoder.fit_transform(X['Sex'])
embarked_encoder = LabelEncoder()
X['Embarked'] = embarked_encoder.fit_transform(X['Embarked'])

# one entry per column, in order: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
column_types = (['Categorical'] * 2) + (['Numerical'] * 4) + ['Categorical']
The second problem was the misleading behavior of the n_jobs parameter. Usually, -1 means "use all CPUs." It should work like that in auto-sklearn too, but for some reason, the current version tries to access a non-existent index in some array and throws an error.

Not a huge problem: I can get the number of available CPUs with the following code and pass it as the parameter value.
import multiprocessing

# the number of CPUs on this machine, passed instead of the broken n_jobs=-1
cpus = multiprocessing.cpu_count()
Finally, I got it working.
import sklearn.model_selection
import sklearn.metrics
import autosklearn.classification

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # total search budget: one hour
    per_run_time_limit=120,        # at most two minutes per candidate model
    delete_tmp_folder_after_terminate=True,
    ensemble_memory_limit=12288,
    n_jobs=cpus,                   # the workaround for the broken n_jobs=-1
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5}
)
automl.fit(X_train.copy(), y_train.copy(), feat_type=column_types)
# with cross-validation, refit retrains the ensemble members on the full training set
automl.refit(X_train.copy(), y_train.copy())

predictions = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))