What should we do when we train a machine learning model, and time is one of the features? In this article, I will show you a few options and tell you why only some of them make sense ;)
Table of Contents
- Turn the day of the week into boolean value “is_weekend”
- One-hot encoding
- Encoding day of the week as a number
- Angular distance
I am going to use a DataFrame with two features. The first one is a date and the second one is the number of people who visited this blog on a given day.
date visitors
3/17/21 842
3/18/21 914
3/19/21 956
3/20/21 361
3/21/21 410
Before I start presenting the options to handle the day of the week, I will convert the date column into datetime and derive an additional column with a number indicating the day of the week (0 = Monday, 6 = Sunday).
data['date'] = pd.to_datetime(data.date)
data['day_of_week'] = data.date.dt.weekday
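For readers who want to follow along, here is a minimal, self-contained sketch of that setup. The rows are copied from the table above; the explicit `format` string is my assumption about how the dates are written (month/day/two-digit year).

```python
import pandas as pd

# Sample rows copied from the table above.
data = pd.DataFrame({
    "date": ["3/17/21", "3/18/21", "3/19/21", "3/20/21", "3/21/21"],
    "visitors": [842, 914, 956, 361, 410],
})

# Assumption: dates are month/day/two-digit-year.
data["date"] = pd.to_datetime(data["date"], format="%m/%d/%y")

# pandas numbers the days Monday=0 ... Sunday=6.
data["day_of_week"] = data["date"].dt.weekday

print(data)
```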
Turn the day of the week into boolean value “is_weekend”
Before we start training the model, we do data exploration, and we may notice a difference between values during weekdays and weekends. In such a situation, we may convert the day of week feature into a boolean value to indicate whether a given day was Saturday or Sunday.
data['is_weekend'] = data['day_of_week'].isin([5, 6])
This solution may be good enough when weekday data differs from weekend data, but the values within each group (Monday to Friday, and Saturday to Sunday) are similar to each other.
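A quick way to check whether such a split is justified is to compare the averages of both groups. The sketch below does that for the five sample rows shown earlier. It is only an illustration: five rows are far too few to draw conclusions from, and the `format` string is my assumption.

```python
import pandas as pd

# Sample rows copied from the table at the beginning of the article.
data = pd.DataFrame({
    "date": ["3/17/21", "3/18/21", "3/19/21", "3/20/21", "3/21/21"],
    "visitors": [842, 914, 956, 361, 410],
})
data["date"] = pd.to_datetime(data["date"], format="%m/%d/%y")

# Saturday=5 and Sunday=6 in pandas' weekday numbering.
data["is_weekend"] = data["date"].dt.weekday.isin([5, 6])

# Average number of visitors in each group.
weekend_mean = data.loc[data["is_weekend"], "visitors"].mean()
weekday_mean = data.loc[~data["is_weekend"], "visitors"].mean()

print("weekday mean:", weekday_mean)
print("weekend mean:", weekend_mean)
```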
One-hot encoding
We can use one-hot encoding to produce a boolean feature for every day of the week. Such a solution gives us information about the day of the week, but we get rid of the relations between the days. Effectively, we decide that the order of days no longer matters. Is it the case? Not always. Usually, we should not use one-hot encoding to encode days of the week.
day_of_week_columns = pd.get_dummies(data['day_of_week'], prefix='day')
data = data.merge(day_of_week_columns, left_index=True, right_index=True)
Encoding day of the week as a number
We have already done it in this example. The day_of_week column contains values between 0 and 6 that denote the day of the week.
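It is not the right way to encode days of the week if we want to use the data to train machine learning models! In reality, Saturday is closer to Monday (two days apart) than to Wednesday (three days apart), but the numbers say the opposite: |5 - 0| = 5 and |5 - 2| = 3. Encoding days of the week as plain numbers distorts the distances between them.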
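To make the difference concrete, here is a small sketch comparing the distance implied by the plain numbers with the distance on a seven-day cycle. The helper names `linear_distance` and `circular_distance` are mine, introduced only for illustration.

```python
def linear_distance(a, b):
    """Distance a model sees when days are plain numbers 0..6."""
    return abs(a - b)


def circular_distance(a, b, period=7):
    """Number of days between a and b when the week wraps around."""
    diff = abs(a - b) % period
    return min(diff, period - diff)


# pandas' weekday numbering: Monday=0 ... Sunday=6
MONDAY, WEDNESDAY, SATURDAY, SUNDAY = 0, 2, 5, 6

for name, other in [("Monday", MONDAY), ("Wednesday", WEDNESDAY)]:
    print(
        f"Saturday vs {name}:",
        "linear =", linear_distance(SATURDAY, other),
        "circular =", circular_distance(SATURDAY, other),
    )
```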
Angular distance
We don’t want to lose the information about the circular nature of weeks and the actual distance between the days. Therefore, we can encode the day of week feature as “points” on a circle, spaced 360°/7 ≈ 51.4° apart: 0° = Monday, 51.4° = Tuesday, etc.
There is one problem. We know that it is a circle, but for a machine learning model, the difference between Sunday and Monday is about 308.6° instead of 51.4°. That is wrong.
To solve the problem, we have to calculate the sine and cosine of the angle (expressed in radians in the code below). We need both because each function on its own returns the same output for different inputs, but when we use them together we get a unique pair of values for every day:
data['day_of_week_sin'] = np.sin(data['day_of_week'] * (2 * np.pi / 7))
data['day_of_week_cos'] = np.cos(data['day_of_week'] * (2 * np.pi / 7))
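As a sanity check, the sketch below verifies the two properties we wanted from this encoding: every day maps to a distinct (sin, cos) pair, and the distance between neighboring days is the same everywhere on the circle, including Sunday to Monday. Using Euclidean distance between the pairs is my choice for the illustration; the `distance` helper is hypothetical.

```python
import numpy as np

days = np.arange(7)                 # Monday=0 ... Sunday=6
angles = days * (2 * np.pi / 7)     # angle of each day, in radians

# One (sin, cos) pair per day, same formula as above.
points = np.column_stack([np.sin(angles), np.cos(angles)])


def distance(a, b):
    """Euclidean distance between the encoded points of two days."""
    return np.linalg.norm(points[a] - points[b])


print("Monday -> Tuesday:    ", distance(0, 1))
print("Sunday -> Monday:     ", distance(6, 0))
print("Saturday -> Monday:   ", distance(5, 0))
print("Saturday -> Wednesday:", distance(5, 2))

# Every day gets its own pair of values.
unique_pairs = np.unique(points.round(10), axis=0)
print("distinct pairs:", len(unique_pairs))
```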