Recently, I had to convert a Pandas data frame into training data for an RNN model. It is not as easy as I expected. I think (I hope…) there is an easier way to do it. If you know it, please send me a message on Twitter.
Table of Contents
Input
I have a data frame which contains a customer identifier, a date, and four numeric variables. For every customer, I have a time series of five values (every value consists of four numeric features).
customer_id feature_1 feature_2 feature_3 feature_4 date
0 0000f46a3911fa3c0805444483337064 0.0 0.0 416.0 0.0 2018-04-30
1 0000f6ccb0745a6a4b88665a16c9f078 0.0 0.0 200.0 0.0 2018-04-30
2 0004aac84e0df4da2b147fca70cf8255 0.0 0.0 167.0 0.0 2018-04-30
3 0004bd2a26a76fe21f786e4fbd80607f 0.0 0.0 25.0 0.0 2018-04-30
4 00050ab1314c0e55a6ca13cf7181fecf 0.0 0.0 10.0 0.0 2018-04-30
Steps
First, I have to scale the values to the range from 0 to 1, because neural networks need scaled data to function correctly.
from sklearn import preprocessing
data['feature_1'] = preprocessing.MinMaxScaler().fit_transform(data['feature_1'].values.reshape(-1, 1))
data['feature_2'] = preprocessing.MinMaxScaler().fit_transform(data['feature_2'].values.reshape(-1, 1))
data['feature_3'] = preprocessing.MinMaxScaler().fit_transform(data['feature_3'].values.reshape(-1, 1))
data['feature_4'] = preprocessing.MinMaxScaler().fit_transform(data['feature_4'].values.reshape(-1, 1))
I must create a dataset of all possible customer id and date pairs. I am going to use Pandas multi-index. I have to define the index as a pair of values: the customer identifier and a date. After that, I drop the index to get the columns in the data frame.
customers = data['customer_id'].unique()
dates = data['date'].unique()
index = pd.MultiIndex.from_product([customers, dates], names = ["customer_id", "date"])
all_customers_and_dates = pd.DataFrame(index = index).reset_index()
Now, I merge all possible pairs with the data I already have. I do it to make sure that I have the same number of values in every of the time series.
merged = pd.merge(all_customers_and_dates, data, how='left', left_on=['customer_id', 'date'], right_on = ['customer_id', 'date'])
Obviously, I get a lot of empty values. I want to focus on converting the data, so I will replace the empty values with the previous existing value.
But first, I have to be sure that there is a previous value I can use (I don’t want to use the values of a different customer!). I have to select the first date and set the value to zero if there is no data.
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_1'])), 'feature_1'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_2'])), 'feature_2'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_3'])), 'feature_3'] = 0
merged.loc[(merged['date'] == '1986-04-26') & (pd.isna(merged['feature_4'])), 'feature_4'] = 0
I use the forward fill function to replace missing values with the previous existing value.
merged = merged.fillna(axis = 0, method = 'ffill')
Now, I know that every five consecutive rows contain data of a single customer. Additionally, I know how many customers I have in the data frame (in this case, 95420).
I put every numeric column into a separate variable and reshape it to (95420, 5) (95420 people, five values for every one of them).
feature_1 = merged['feature_1'].values.reshape(95420, 5)
feature_2 = merged['feature_2'].values.reshape(95420, 5)
feature_3 = merged['feature_3'].values.reshape(95420, 5)
feature_4 = merged['feature_4'].values.reshape(95420, 5)
Now, I can merge the features into one array and reshape it again, to get data of 95420 people, four features, five different dates.
The problem is that reshape puts dates as “columns,” and we need the dates as “rows.” We can solve the issue by transposing the inner arrays.
reshaped = numpy.hstack(
(feature_1, feature_2, feature_3, feature_4)
).reshape(95420, 4, 5).transpose(0, 2, 1)
Now I have the data in the correct order: 95420 people, five dates, four features. I can use that 3D array as an input of RNN.
from keras import layers
from keras.layers import recurrent
from keras.models import Sequential
model = Sequential([
layers.SimpleRNN(60, input_shape=(5,4)),
...
])