How To Avoid Data Leakage While Building A Machine Learning Model

Data leakage is a terrible mistake that is surprisingly easy to make. It is also the simplest way to look like a fool. When data leakage happens, the model we have built performs flawlessly during testing but fails when it runs in production.

The terrifying part of the problem is that in some cases data leakage may be undetectable. It breaks our model, and we may not even notice it!

What is data leakage?

Data leakage happens when the model learns from data that is not available in the real-life scenario.

It is like a student cheating during an exam: they can answer the questions because they have information about the scoring method, but they can’t apply the knowledge outside of the classroom.

It does not need to be the most straightforward kind of cheating, like knowing the exact answers. Sometimes it is enough to know that the teacher likes to put the correct answer to a multiple-choice question in the last position.

How does it apply to machine learning?

The analogy of teaching students still holds. Just like a student, the model is going to minimize the effort required to generalize knowledge, so if there is a way to cheat, it is going to use it to get a good score.

Let’s start with features that make cheating simple, but are easy to overlook during data preprocessing.

Summary statistics

We sometimes include summary features in the dataset, like the number of purchases in a month or the total value of purchased products. Here is the problem: which month? The previous calendar month? The previous 30 days?

What if I have these three observations in my dataset:

user_id;date;purchased_product_id;price;total_purchase_value_in_month
1;03-03-2019;1;22.00;75
1;14-03-2019;43;40.00;75
1;26-03-2019;54;13.00;75

Now, I split the dataset into training and test datasets, and as a result, the first two purchases end up in the training dataset, but the last one is in the test dataset. What happens? The model should know only about two purchases during training, but because of the summary statistic, it can “look into the future” and also knows about the upcoming purchase!

The problem is that every summary is susceptible to such mistakes. It does not matter whether we calculate the total number of purchases, the total number of purchased items, or the average price. How often do we recalculate summary features after splitting the data? Perhaps we should do it more often.
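Here is a minimal pandas sketch of a safer pattern, using the columns from the example above and an arbitrary cutoff date chosen for illustration: split by time first, then rebuild the monthly summary from the training rows only.

import pandas as pd

# Hypothetical purchases matching the example above.
df = pd.DataFrame({
    "user_id": [1, 1, 1],
    "date": pd.to_datetime(["2019-03-03", "2019-03-14", "2019-03-26"]),
    "purchased_product_id": [1, 43, 54],
    "price": [22.00, 40.00, 13.00],
})

# Split by time instead of randomly, so the test set is strictly "the future".
# The cutoff date is an arbitrary choice for this illustration.
cutoff = pd.Timestamp("2019-03-20")
train = df[df["date"] < cutoff].copy()
test = df[df["date"] >= cutoff].copy()

# Rebuild the monthly summary from the training rows only, instead of
# trusting a precomputed total_purchase_value_in_month column.
train["total_purchase_value_in_month"] = (
    train.groupby([train["user_id"], train["date"].dt.to_period("M")])["price"]
    .transform("sum")
)

With this recalculation, the first two purchases sum to 62, not 75, because the third purchase happened after the cutoff and the training data never sees it.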

Normalization

We loaded the data, performed some exploratory analysis, and now we see that it is necessary to normalize the features because we would like to start with a linear model. We build a pipeline, normalize the values, and split the data into training and test sets. Do you already see the problem?

I just used the whole available dataset to normalize the features. If some outliers ended up in the test dataset, the model is going to know about them during training because the training data was affected by those extreme values.

The correct approach is to split the data first, then use only the training dataset to fit the pipeline (and preferably to do the exploratory analysis, to avoid human bias) and train the model.

Obviously, at this point, someone is going to complain that if I use, for example, a MinMaxScaler from Scikit-learn, it learns the minimum and maximum of the training set only. Later, during testing, it may return values which are outside the expected range.

Yes, that is going to happen, and we should make the pipeline robust. It must perform well enough even in such a case. After all, guess what values it is going to process when we run it in production.
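Here is a minimal Scikit-learn sketch of that order of operations, using randomly generated data as a stand-in for a real dataset: split first, then let the pipeline fit the scaler on the training data alone.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data: 200 samples, 5 features, binary labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# Split before any preprocessing, so the test set never influences the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() scales using the minimum and maximum of X_train only.
model = make_pipeline(MinMaxScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Scaled test features may fall outside [0, 1]; the model has to cope with
# that, just like it will have to in production.
print(model.score(X_test, y_test))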

Tags, badges, and labels

Badges are another kind of data that may reveal information about the future. Let’s look at the data from the “summary example.” If the user gets a “high spender” badge after the third purchase, but I keep that information in the training set, the model gets access to information which is not available in real life. If, in the first half of the month, I already knew that the user would be a “high spender,” I wouldn’t need to build the model in the first place ;)

Every time we have a feature which indicates a badge, we should make sure that the training dataset contains enough information to explain why every user has their badges. To verify those assumptions, we need knowledge about the business processes and the implementation of the software which produced the data.
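One way to enforce that, sketched below with a hypothetical badge_awarded_at column, is to treat the badge as a point-in-time feature and blank it out for every observation that happened before the badge was awarded.

import pandas as pd

# Hypothetical purchases and badge assignments with an award timestamp.
purchases = pd.DataFrame({
    "user_id": [1, 1, 1],
    "date": pd.to_datetime(["2019-03-03", "2019-03-14", "2019-03-26"]),
})
badges = pd.DataFrame({
    "user_id": [1],
    "badge": ["high spender"],
    "badge_awarded_at": pd.to_datetime(["2019-03-26"]),
})

# Attach the badge only if it had already been awarded at purchase time.
features = purchases.merge(badges, on="user_id", how="left")
awarded_later = features["badge_awarded_at"] > features["date"]
features.loc[awarded_later, "badge"] = None

print(features[["user_id", "date", "badge"]])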

Updateable attributes

What if the user from the summary example moved to a new place on 20.03.2019? Imagine that the user lived in London, but in the middle of the month moved to Berlin. If I have a “place of residence” attribute in my dataset, what is its value? If I always update it when the user moves to a new place, I am going to teach the model that the user was in Berlin while purchasing the items from the training set.

That is why the data should be immutable. If that is not possible, we should consider removing the mutable attributes from the dataset, because we can’t be sure that the values were not altered.
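If immutability is not an option, another approach (sketched below with a hypothetical address history table) is to store every change with a validity timestamp and join the value that was current at the time of each observation.

import pandas as pd

# Hypothetical purchases and an address history with validity timestamps.
purchases = pd.DataFrame({
    "user_id": [1, 1, 1],
    "date": pd.to_datetime(["2019-03-03", "2019-03-14", "2019-03-26"]),
})
addresses = pd.DataFrame({
    "user_id": [1, 1],
    "valid_from": pd.to_datetime(["2018-01-01", "2019-03-20"]),
    "city": ["London", "Berlin"],
})

# merge_asof picks, for every purchase, the last address change that
# happened on or before the purchase date (both frames must be sorted).
point_in_time = pd.merge_asof(
    purchases.sort_values("date"),
    addresses.sort_values("valid_from"),
    left_on="date",
    right_on="valid_from",
    by="user_id",
    direction="backward",
)

print(point_in_time[["user_id", "date", "city"]])

With this join, the first two purchases are attributed to London and only the last one to Berlin, which matches what the model would have known at the time.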

How to prevent data leakage?

For sure, everyone can avoid sampling mistakes and errors related to summary attributes, but the only thing that can reliably prevent all data leakage is business knowledge. We must spend more time doing data analysis, look skeptically at the data, and try to answer questions like: “Is it even possible?” or “Does it make sense?”

Unfortunately, that cannot be taught in an online tutorial or during a Kaggle competition. It requires real-world experience and being familiar with the problem domain, software running in the company, relations between teams, and even company history.

Even worse, when you move to a new company, you must start from scratch. That kind of knowledge is rarely transferable. Two companies working in the same industry may have completely different internal processes.
