Preprocessing the input Pandas DataFrame using ColumnTransformer in Scikit-learn

What do we do with the input DataFrame before building a model? After exploratory data analysis, we start modifying features: we remove some of them, scale or normalize a few others, and encode the categorical features as numbers. That is a lot of work.

Transforming all input features at once would be nice. Fortunately, Scikit-learn makes that easy. Let’s do it step by step.

Probably everyone who has tried creating a machine learning model at least once is familiar with the Titanic dataset. Because of that, I am going to use it as an example.

After loading the dataset, I decided that the Name, Cabin, Ticket, and PassengerId columns are redundant. My preprocessing pipeline has to remove them. For now, I am going to store their names in a list:

to_be_removed = ['Name', 'Cabin', 'Ticket', 'PassengerId']

After that, I looked for numeric features that should be normalized. There are two such columns: Age and Fare. I have also noticed missing age values. I am going to replace them with the median age of the passengers.
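
One quick way to spot such gaps is to count the missing values per column. The sketch below uses a small made-up DataFrame (the name toy and its values are illustrative, not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic data (not the real values).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 54.0],
    'Fare': [7.25, 71.28, 8.05, 51.86],
})

# Number of missing values in each column.
missing = toy.isna().sum()
print(missing)

# The median that would replace the missing age (NaN is skipped by default).
age_median = toy['Age'].median()
print(age_median)
```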

Now, I can store the names of the numeric columns in another list. I must also define the pipeline that fills in the missing values and then normalizes all numeric features.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

numeric_features = ['Age', 'Fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])
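
To see what this pipeline does on its own, here is a minimal sketch that runs it on a tiny made-up DataFrame (toy and its values are assumptions for illustration). The missing age is replaced with the median first, and then both columns are scaled to the 0 to 1 range:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Made-up data with one missing age.
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, 30.0],
    'Fare': [10.0, 20.0, 30.0, 50.0],
})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])

# Step 1 fills the gap with the median age (30.0),
# step 2 rescales every column to the [0, 1] range.
scaled = numeric_transformer.fit_transform(toy)
print(scaled)
```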

Finally, I can deal with categorical variables. In the first step, I am going to impute the missing values, but in this case, I want to use the most frequently occurring value as the default. The second step of the pipeline transforms categorical variables using one-hot encoding.

As before, I also put the names of the categorical columns in a list.

from sklearn.preprocessing import OneHotEncoder

categorical_features = ['Embarked', 'Sex', 'Pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])
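
The same kind of check works for the categorical pipeline. In this sketch (again on made-up data), the missing Embarked value is replaced with the most frequent one, and each category becomes its own 0/1 column. With its default settings OneHotEncoder returns a SciPy sparse matrix, so I convert it to a dense array only to print it:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Made-up data with one missing port of embarkation.
toy = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
})

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])

encoded = categorical_transformer.fit_transform(toy)

# Two Embarked categories (C, S) plus two Sex categories (female, male)
# give four output columns.
dense = encoded.toarray()
print(dense)
```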

I have everything I need to configure a ColumnTransformer.

I want to keep the columns that have not been transformed, so I set the remainder parameter to “passthrough”. I could also instruct the transformer to drop such columns: either set remainder to “drop” or don’t specify it at all, because dropping is the default behavior.

The transformers parameter is the list of steps. This time, I must configure not only the name of the step and the transformer that implements it but also the columns that should be processed by that step. Instead of a transformer, a step can also use the string “drop”, which is how I remove the redundant columns.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features),
        ('remove', 'drop', to_be_removed)
    ])
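
The effect of the remainder parameter is easiest to see by comparing the output shapes. This is a minimal sketch on a made-up three-column DataFrame (toy, steps, dropped, and kept are names I introduce only for this illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Made-up data: one column to scale, one to keep as-is, one to remove.
toy = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0],
    'SibSp': [1, 0, 2],
    'Ticket': ['A1', 'B2', 'C3'],
})

steps = [
    ('numeric', MinMaxScaler(), ['Fare']),
    ('remove', 'drop', ['Ticket']),
]

# Default remainder ('drop'): SibSp is not listed anywhere, so it disappears.
dropped = ColumnTransformer(transformers=steps).fit_transform(toy)

# remainder='passthrough': SibSp is appended, unchanged, after the scaled Fare.
kept = ColumnTransformer(
    transformers=steps, remainder='passthrough').fit_transform(toy)

print(dropped.shape, kept.shape)
```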

To transform the columns, call the fit_transform method.

preprocessor.fit_transform(data)
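
Putting everything together, here is a self-contained sketch of the whole preprocessing step. The five rows below are made up and only mimic the column layout of the Titanic training file, so treat the values as placeholders. Depending on how sparse the combined output is, ColumnTransformer may return either a NumPy array or a SciPy sparse matrix, so the sketch handles both cases:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up rows with the same columns as the Titanic training data.
data = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Survived': [0, 1, 1, 0, 1],
    'Pclass': [3, 1, 3, 2, 1],
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Age': [22.0, 38.0, np.nan, 35.0, 54.0],
    'SibSp': [1, 1, 0, 0, 0],
    'Parch': [0, 0, 0, 0, 1],
    'Ticket': ['T1', 'T2', 'T3', 'T4', 'T5'],
    'Fare': [7.25, 71.28, 7.92, 8.05, 51.86],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, 'E46'],
    'Embarked': ['S', 'C', np.nan, 'Q', 'S'],
})

to_be_removed = ['Name', 'Cabin', 'Ticket', 'PassengerId']
numeric_features = ['Age', 'Fare']
categorical_features = ['Embarked', 'Sex', 'Pclass']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])

preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features),
        ('remove', 'drop', to_be_removed)
    ])

result = preprocessor.fit_transform(data)

# Convert to a dense float array regardless of the returned container.
dense = np.asarray(
    result.toarray() if hasattr(result, 'toarray') else result, dtype=float)

# 2 scaled numeric columns + 8 one-hot columns (3 Embarked, 2 Sex, 3 Pclass)
# + 3 passthrough columns (Survived, SibSp, Parch).
print(dense.shape)
```

Note that Survived passes through as well. In a real project you would probably separate the target column from the features before calling the preprocessor.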