In this blog post, I am going to describe how to measure the performance of a time series forecasting model using a variant of crossvalidation called “nested crossvalidation.” As an example, I am going to use the ARMA model from the statsmodels library.
Crossvalidation in time series forecasting
In the case of time series, crossvalidation is not trivial. I cannot pick random samples and assign them to either the test set or the training set, because it makes no sense to use values from the future to forecast values in the past. There is a temporal dependency between observations, and we must preserve that order during testing.
Before we start crossvalidation, we must split the dataset into the crossvalidation subset and the test set. In my example, I have a dataset of 309 observations, and I am going to use the last 20% of them (62 observations) as the test set (aka the holdout set).
cross_validation = values[:247]
test = values[247:]
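The number 247 is simply 80% of 309, rounded down. Here is a minimal sketch of deriving the split point from the ratio instead of hard-coding it. The helper name `holdout_split` is my own invention, and a plain list stands in for the real series:

```python
def holdout_split(values, test_fraction=0.2):
    """Split an ordered sequence into (cross_validation, test) without shuffling.

    Hypothetical helper for illustration: the first part of the data is kept
    for crossvalidation and the most recent part becomes the holdout set.
    """
    split_point = int(len(values) * (1 - test_fraction))
    return values[:split_point], values[split_point:]


values = list(range(309))  # stand-in for the real 309 observations
cross_validation, test = holdout_split(values)
print(len(cross_validation), len(test))  # 247 62
```

The split is positional, so the holdout set always contains the most recent observations.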
Nested crossvalidation
The idea of nested crossvalidation is easier to grasp when we look at an example. Imagine that I have only 5 observations in my crossvalidation set and I want to perform 4-fold crossvalidation.
Here is my dataset: [1, 2, 3, 4, 5]
What I want to do is to create 4 pairs of training/test sets that follow these two rules:

every test set contains unique observations

observations from the training set occur before their corresponding test set
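If every test set holds a single observation and every training set holds all the observations that precede it, there is only one way to generate such pairs from my dataset. As a result, I get 4 pairs of training/test sets: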

Training: [1] Test: [2]

Training: [1, 2] Test: [3]

Training: [1, 2, 3] Test: [4]

Training: [1, 2, 3, 4] Test: [5]
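To make the rule concrete, here is a rough sketch of how such pairs could be generated by hand. The function `expanding_window_splits` is a hypothetical helper written for this post, and it assumes equally sized test blocks taken from the end of the data:

```python
def expanding_window_splits(observations, n_splits):
    """Yield (training, test) pairs in temporal order.

    Hypothetical helper: each test set is the next block of observations,
    and its training set is everything that comes before that block.
    """
    test_size = len(observations) // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few observations for the requested number of splits")
    first_test_start = len(observations) - n_splits * test_size
    for start in range(first_test_start, len(observations), test_size):
        yield observations[:start], observations[start:start + test_size]


for training, test_set in expanding_window_splits([1, 2, 3, 4, 5], n_splits=4):
    print("Training:", training, "Test:", test_set)
```

Every training set grows by one test block at a time, so no test observation is ever older than the data the model was trained on.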
Fortunately, I don’t need to do it manually because the TimeSeriesSplit class in scikit-learn can generate those pairs.
For every training/test pair, I am going to fit an ARMA model on the training set and make a prediction for the test set. I store the root mean squared error of each prediction in the “rmse” list. After the last fold, I am going to calculate the average error.
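from math import sqrt
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Note: sm.tsa.ARMA is not available in statsmodels 0.13 and later;
# there, sm.tsa.ARIMA with order (2, 0, 2) fits the same ARMA(2, 2) model.
tscv = TimeSeriesSplit(n_splits=4)
rmse = []
for train_index, test_index in tscv.split(cross_validation):
    cv_train, cv_test = cross_validation.iloc[train_index], cross_validation.iloc[test_index]
    arma = sm.tsa.ARMA(cv_train, (2, 2)).fit(disp=False)
    # Predict the whole test fold, from its first index to its last one
    predictions = arma.predict(cv_test.index.values[0], cv_test.index.values[-1])
    true_values = cv_test.values
    rmse.append(sqrt(mean_squared_error(true_values, predictions)))
print("RMSE: {}".format(np.mean(rmse)))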
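To show what the reported number means, here is a small standalone sketch of the per-fold RMSE and its average, written without any library. The function name `root_mean_squared_error` and the fold values are made up for illustration; they are not results from my dataset:

```python
from math import sqrt


def root_mean_squared_error(true_values, predictions):
    """RMSE of one fold; both sequences must cover the same observations."""
    if len(true_values) != len(predictions):
        raise ValueError("true_values and predictions differ in length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, predictions)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up (true values, predictions) for two folds, for illustration only
folds = [
    ([3.0, 5.0], [2.0, 6.0]),      # errors of 1 and -1 -> RMSE 1.0
    ([10.0, 10.0], [7.0, 13.0]),   # errors of 3 and -3 -> RMSE 3.0
]
fold_errors = [root_mean_squared_error(t, p) for t, p in folds]
mean_error = sum(fold_errors) / len(fold_errors)
print(fold_errors, mean_error)  # [1.0, 3.0] 2.0
```

The length check is there because the predictions must cover every observation in the test fold, otherwise the error is not comparable between folds.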
Holdout set
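Now I can tweak the parameters of the ARMA model, such as its (p, q) order, for as long as I want, rerunning the crossvalidation after every change and comparing the average RMSE.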
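One way to organize that tweaking is a simple loop over candidate orders that keeps the one with the lowest crossvalidated error. The sketch below is only an outline under my own assumptions: `select_order` is a hypothetical helper, and `fake_cross_validated_rmse` returns made-up scores so the example runs on its own. In practice, the scoring function would run the TimeSeriesSplit loop for the given order and return the mean RMSE.

```python
from itertools import product


def select_order(candidate_orders, cross_validated_rmse):
    """Return (best_order, best_score) for the lowest crossvalidated RMSE.

    `cross_validated_rmse` is any callable that maps a (p, q) order to a score.
    """
    scores = {order: cross_validated_rmse(order) for order in candidate_orders}
    best_order = min(scores, key=scores.get)
    return best_order, scores[best_order]


def fake_cross_validated_rmse(order):
    """Stand-in scorer with made-up values; not a real model evaluation."""
    p, q = order
    return abs(p - 2) + abs(q - 1) + 0.5


candidate_orders = list(product(range(3), range(3)))  # (p, q) with p, q in 0..2
best_order, best_score = select_order(candidate_orders, fake_cross_validated_rmse)
print(best_order, best_score)  # (2, 1) 0.5
```

The holdout set plays no part in this loop; it stays untouched until the order is chosen.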
When I am satisfied with the result, I can use the test set created in the first code snippet to calculate the final error metric of the model.
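# Refit on the whole crossvalidation subset, then score once on the holdout set
arma = sm.tsa.ARMA(cross_validation, (2, 2)).fit(disp=False)
# Predict the whole holdout set, from its first index to its last one
predictions = arma.predict(test.index.values[0], test.index.values[-1])
true_values = test.values
sqrt(mean_squared_error(true_values, predictions))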