What do we do with the input DataFrame before building a model? After exploratory data analysis, we start modifying features. We remove some of them, and a few need to be scaled or normalized. Then we encode the categorical features as numbers. A lot of work.
Transforming all input features at once would be nice. Fortunately, we can easily do it in Scikit-Learn. Let’s do it step by step.
Probably everyone who has tried creating a machine learning model at least once is familiar with the Titanic dataset. Because of that, I am going to use it as an example.
After loading the dataset, I decided that the Name, Cabin, Ticket, and PassengerId columns are redundant. My preprocessing pipeline has to remove them. For now, I am going to store their names in a list:
to_be_removed = ['Name', 'Cabin', 'Ticket', 'PassengerId']
After that, I looked for numeric features that should be normalized. There are two such columns: Age and Fare. I have also noticed missing Age values. I am going to replace them with the median passenger age.
Now, I can store the names of the numeric columns in another list. I must also define the pipeline that fills in the missing values and then normalizes all numeric features.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
numeric_features = ['Age', 'Fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])
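To see what this numeric pipeline does, here is a minimal sketch on a made-up four-row frame. The values in toy are invented for illustration and are not rows from the Titanic dataset.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data (an assumption for this example, not real Titanic rows).
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, 30.0],
    'Fare': [10.0, 20.0, 30.0, 50.0],
})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])

# Step 1 replaces the missing Age with the median of 20, 30, and 40, which is 30.
# Step 2 rescales every column to the [0, 1] range.
scaled = numeric_transformer.fit_transform(toy)
print(scaled)
```

The missing age ends up as 0.5 because the imputed median sits halfway between the smallest and the largest age in this toy frame.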
Finally, I can deal with categorical variables. In the first step, I am going to impute the missing values, but in this case, I want to use the most frequently occurring value as the default. The second step of the pipeline transforms categorical variables using one-hot encoding.
As before, I also put the names of the categorical columns in a list.
from sklearn.preprocessing import OneHotEncoder
categorical_features = ['Embarked', 'Sex', 'Pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])
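The same kind of check works for the categorical pipeline. The sketch below uses an invented two-column frame (again an assumption, not real data) with one missing Embarked value.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (an assumption for this example, not real Titanic rows).
toy = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
})

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])

# OneHotEncoder returns a sparse matrix by default, so convert it for printing.
encoded = categorical_transformer.fit_transform(toy)
dense = encoded.toarray()

# One output column per category, in sorted order: C, S, female, male.
print(categorical_transformer.named_steps['onehot'].categories_)
print(dense)
```

The third row has no Embarked value, so the imputer fills in "S", the most frequent value in this toy column, before the encoder runs.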
I have everything I need to configure a ColumnTransformer.
I want to keep the columns that no transformer touches, so I set the remainder parameter to “passthrough.” Alternatively, the transformer can drop such columns: set remainder to “drop” or leave it out, because dropping is the default behavior.
The transformers parameter holds the list of steps that make up the combined pipeline. This time, each step needs not only a name and the transformer that implements it but also the columns that the step should process.
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features),
        ('remove', 'drop', to_be_removed)
    ])
To transform the columns, call the fit_transform method with the input DataFrame.
preprocessor.fit_transform(data)
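For reference, all the pieces can be put together in one self-contained sketch. The frame below only mimics the Titanic column layout; every value in it is made up, so treat it as an assumption for illustration rather than the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up rows that only mimic the Titanic column layout.
data = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 3, 2],
    'Name': ['A', 'B', 'C', 'D'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, 35.0],
    'SibSp': [1, 1, 0, 0],
    'Parch': [0, 0, 0, 0],
    'Ticket': ['T1', 'T2', 'T3', 'T4'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

to_be_removed = ['Name', 'Cabin', 'Ticket', 'PassengerId']
numeric_features = ['Age', 'Fare']
categorical_features = ['Embarked', 'Sex', 'Pclass']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])

preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features),
        ('remove', 'drop', to_be_removed)
    ])

result = preprocessor.fit_transform(data)
# The output may be sparse or dense depending on its density, so normalize it.
if sparse.issparse(result):
    result = result.toarray()

# 2 numeric + 7 one-hot (2 Embarked, 2 Sex, 3 Pclass) + 3 passthrough columns.
print(result.shape)
```

Note that with remainder set to “passthrough”, the untouched columns (here Survived, SibSp, and Parch) are appended after the transformed ones, so the target column is still part of the output and has to be separated before training.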