---
title: "How to turn Pandas data frame into time-series input for RNN"
description: "From Pandas dataframe to RNN input"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn
---

Recently, I had to convert a Pandas data frame into training data for an RNN model. It was not as easy as I expected.
I think (I hope...) there is an easier way to do it. If you know it, please send me a message on Twitter.

## Input

I have a data frame that contains a customer identifier, a date, and four numeric variables. For every customer, I have a time series of five values (every value consists of four numeric features).

```
customer_id	feature_1	feature_2	feature_3	feature_4	date
0	0000f46a3911fa3c0805444483337064	0.0	0.0	416.0	0.0	2018-04-30
1	0000f6ccb0745a6a4b88665a16c9f078	0.0	0.0	200.0	0.0	2018-04-30
2	0004aac84e0df4da2b147fca70cf8255	0.0	0.0	167.0	0.0	2018-04-30
3	0004bd2a26a76fe21f786e4fbd80607f	0.0	0.0	25.0	0.0	2018-04-30
4	00050ab1314c0e55a6ca13cf7181fecf	0.0	0.0	10.0	0.0	2018-04-30
```

## Steps

First, I have to scale the values to the range from 0 to 1, because neural networks usually train more reliably when all input features share a similar scale.

```python
from sklearn import preprocessing

# MinMaxScaler scales every column independently,
# so a single call covers all four features.
feature_columns = ['feature_1', 'feature_2', 'feature_3', 'feature_4']
data[feature_columns] = preprocessing.MinMaxScaler().fit_transform(data[feature_columns])
```
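To make the scaling step less of a black box, here is a minimal sketch of the same min-max formula written in plain Pandas. The toy frame and the `min_max_scale` helper are hypothetical and exist only for illustration. Mapping constant columns to 0 is my own choice here to avoid a division by zero.

```python
import pandas as pd

# Hypothetical toy frame; the column names follow the article.
toy = pd.DataFrame({
    'feature_1': [0.0, 5.0, 10.0],
    'feature_3': [416.0, 200.0, 10.0],
})

def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Scale every column of `frame` independently to the range [0, 1]."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # A constant column has a range of 0; dividing by 1 instead maps it to 0.
    col_range = col_range.replace(0, 1)
    return (frame - col_min) / col_range

scaled = min_max_scale(toy)
```

Every column is scaled with its own minimum and maximum, so a large value in `feature_3` does not affect `feature_1`.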

Next, I must create a dataset of all possible pairs of customer ID and date. I am going to use a Pandas MultiIndex built from the product of the customer identifiers and the dates. After that, I reset the index to turn both index levels into regular columns of the data frame.

```python
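import pandas as pd

customers = data['customer_id'].unique()
# Sort the dates, so the rows of every customer are in chronological order
# (the forward fill and the reshape used later depend on that order).
dates = data['date'].sort_values().unique()

index = pd.MultiIndex.from_product([customers, dates], names=["customer_id", "date"])
all_customers_and_dates = pd.DataFrame(index=index).reset_index()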
```
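To see what the product of customers and dates looks like, here is a small sketch on a hypothetical toy frame (the values are made up). Customer `b` has no row for the second date, but the resulting frame still contains that pair.

```python
import pandas as pd

# Hypothetical toy data: customer 'b' has no row for the second date.
toy = pd.DataFrame({
    'customer_id': ['a', 'a', 'b'],
    'date': ['2018-04-30', '2018-05-31', '2018-04-30'],
    'feature_1': [0.1, 0.2, 0.3],
})

index = pd.MultiIndex.from_product(
    [toy['customer_id'].unique(), toy['date'].sort_values().unique()],
    names=['customer_id', 'date'],
)
# reset_index turns both index levels into ordinary columns.
all_pairs = pd.DataFrame(index=index).reset_index()
print(all_pairs)
```

The result has one row per (customer, date) pair, ordered by customer first and by date second.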

Now, I merge all possible pairs with the data I already have. I do it to make sure that every time series has the same number of values.

```python
# Both frames use the same key column names, so `on` is enough.
merged = pd.merge(all_customers_and_dates, data, how='left', on=['customer_id', 'date'])
```
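A possible shortcut, sketched on hypothetical toy data: reindexing the original frame with the full MultiIndex gives the same kind of result as the left merge. I assume here that every (customer, date) pair occurs at most once in the input, because `reindex` does not accept duplicated index entries.

```python
import pandas as pd

# Hypothetical toy data: customer 'b' has no row for the second date.
toy = pd.DataFrame({
    'customer_id': ['a', 'a', 'b'],
    'date': ['2018-04-30', '2018-05-31', '2018-04-30'],
    'feature_1': [0.1, 0.2, 0.3],
})

full_index = pd.MultiIndex.from_product(
    [toy['customer_id'].unique(), toy['date'].sort_values().unique()],
    names=['customer_id', 'date'],
)

# Rows that exist keep their values; missing pairs get NaN.
filled_out = (
    toy.set_index(['customer_id', 'date'])
    .reindex(full_index)
    .reset_index()
)
print(filled_out)
```

The missing pair shows up as a row with `NaN` in `feature_1`, which is exactly what the next steps have to deal with.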

Obviously, I get a lot of empty values. I want to focus on converting the data, so I will replace the empty values with the previous existing value.

But first, I have to be sure that there is a previous value I can use (I don't want to use the values of a different customer!). For every customer, I select the row with the first date in the dataset (`1986-04-26` in the code below) and set the value to zero if there is no data.

```python
# Rows with the first date in the dataset.
is_first_date = merged['date'] == '1986-04-26'

for column in ['feature_1', 'feature_2', 'feature_3', 'feature_4']:
    # Without a value on the first date, there is nothing to forward fill from.
    merged.loc[is_first_date & merged[column].isna(), column] = 0
```

I use the forward fill function to replace missing values with the previous existing value.

```python
# fillna(method='ffill') is deprecated in recent Pandas versions; ffill() does the same.
merged = merged.ffill()
```
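If the rows are sorted by date within every customer, both steps (the zero on the first date and the forward fill) can be expressed with a grouped forward fill. This is only a sketch on hypothetical toy data, not the code I ran on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, already sorted by customer and date.
toy = pd.DataFrame({
    'customer_id': ['a', 'a', 'a', 'b', 'b', 'b'],
    'date': ['2018-04-30', '2018-05-31', '2018-06-30'] * 2,
    'feature_1': [0.4, np.nan, 0.6, np.nan, 0.9, np.nan],
})

features = ['feature_1']

# Forward fill inside each customer's rows only, then use 0 where
# a customer has no earlier value to copy.
toy[features] = toy.groupby('customer_id')[features].ffill().fillna(0)
print(toy)
```

Because the fill happens per group, the last value of customer `a` can never leak into the first row of customer `b`.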

Now, I know that every five consecutive rows contain data of a single customer. Additionally, I know how many customers I have in the data frame (in this case, 95420).

I put every numeric column into a separate variable and reshape it to (95420, 5) (95420 people, five values for every one of them).

```python
feature_1 = merged['feature_1'].values.reshape(95420, 5)
feature_2 = merged['feature_2'].values.reshape(95420, 5)
feature_3 = merged['feature_3'].values.reshape(95420, 5)
feature_4 = merged['feature_4'].values.reshape(95420, 5)
```
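The hard-coded 95420 and 5 can also be derived from the data. The sketch below uses a hypothetical toy frame and checks that every customer has exactly one row per date. If that were not the case, `reshape` would silently mix the rows of different customers.

```python
import pandas as pd

# Hypothetical toy frame: three customers, two dates, no gaps.
merged_toy = pd.DataFrame({
    'customer_id': ['a', 'a', 'b', 'b', 'c', 'c'],
    'date': ['2018-04-30', '2018-05-31'] * 3,
    'feature_1': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

n_customers = merged_toy['customer_id'].nunique()
n_dates = merged_toy['date'].nunique()

# The reshape is only valid when the frame is a complete customer x date grid.
assert len(merged_toy) == n_customers * n_dates

feature_1 = merged_toy['feature_1'].values.reshape(n_customers, n_dates)
print(feature_1.shape)
```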

Now, I can merge the features into one array and reshape it again, to get data of 95420 people, four features, five different dates.

The problem is that reshape puts dates as "columns," and we need the dates as "rows." We can solve the issue by transposing the inner arrays.

```python
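import numpy

# hstack gives (customers, features * dates); reshape splits it into
# (customers, features, dates); transpose swaps the last two axes.
reshaped = numpy.hstack(
    (feature_1, feature_2, feature_3, feature_4)
).reshape(95420, 4, 5).transpose(0, 2, 1)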
```
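This may be the easier way I was hoping for. When the rows are ordered by customer first and by date second, reshaping the feature columns directly seems to give the same array. The sketch below compares both approaches on a hypothetical toy frame with two customers, three dates, and two features.

```python
import numpy as np
import pandas as pd

n_customers, n_dates = 2, 3
features = ['feature_1', 'feature_2']

# Hypothetical toy frame ordered by customer first, then by date.
toy = pd.DataFrame({
    'feature_1': [1, 2, 3, 4, 5, 6],
    'feature_2': [10, 20, 30, 40, 50, 60],
})

# The approach from the article: one (customers, dates) array per feature.
per_feature = [toy[name].values.reshape(n_customers, n_dates) for name in features]
via_hstack = (
    np.hstack(per_feature)
    .reshape(n_customers, len(features), n_dates)
    .transpose(0, 2, 1)
)

# One step: every row is already a (customer, date) pair with all features.
direct = toy[features].to_numpy().reshape(n_customers, n_dates, len(features))

print(np.array_equal(via_hstack, direct))
```

Both arrays have the shape (customers, dates, features), which is the layout the RNN expects.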

Now I have the data in the correct order: 95420 people, five dates, four features. I can use that 3D array as the input of an RNN.

```python
from keras import layers
from keras.models import Sequential

model = Sequential([
    layers.SimpleRNN(60, input_shape=(5,4)),
    ...
])
```