Nested cross-validation in time series forecasting using Scikit-learn and Statsmodels

In this blog post, I am going to describe how to measure the performance of a time series forecasting model using a variant of cross-validation called “nested cross-validation.” As an example, I am going to use the ARMA model from the Statsmodels library.

Table of Contents

  1. Cross-validation in time series forecasting
  2. Nested cross-validation
  3. Holdout set

Cross-validation in time series forecasting

In the case of time series, cross-validation is not trivial. I cannot choose random samples and assign them to either the test set or the train set, because it makes no sense to use values from the future to forecast values in the past. There is a temporal dependency between observations, and we must preserve that relation during testing.

Before we start cross-validation, we must split the dataset into the cross-validation subset and the test set. In my example, I have a dataset of 309 observations, and I am going to use the last 20% of them (62 observations) as the test set (also known as the holdout set).

cross_validation = values[:247]  # first 80% of the observations, used for cross-validation
test = values[247:]              # last 20%, the holdout set
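The variable “values” is not defined in this post; it is assumed to be a Pandas Series indexed by time. A minimal sketch of how it might be loaded could look like the snippet below (the file name and column names are made up for illustration):

import pandas as pd

# hypothetical CSV file with a date column and a single observation column
data = pd.read_csv("observations.csv", parse_dates=["date"], index_col="date")
values = data["value"]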

Nested cross-validation

The idea of nested cross-validation is easier to grasp when we look at an example. Imagine that I have only 5 observations in my cross-validation set and I want to perform 4-fold cross-validation.

Here is my dataset: [1, 2, 3, 4, 5]

What I want to do is create 4 pairs of training/test sets that follow these two rules:

  • every test set contains unique observations

  • observations from the training set occur before their corresponding test set

There is only one way to generate such pairs from my dataset. As a result, I get 4 pairs of training/test sets:

  • Training: [1] Test: [2]

  • Training: [1, 2] Test: [3]

  • Training: [1, 2, 3] Test: [4]

  • Training: [1, 2, 3, 4] Test: [5]

Fortunately, I don’t need to generate those pairs by hand, because Scikit-learn provides the TimeSeriesSplit class, which does it for me.
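To see that TimeSeriesSplit really produces those four pairs, here is a quick sketch run on the toy dataset from above:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

toy = np.array([1, 2, 3, 4, 5])
tscv = TimeSeriesSplit(n_splits=4)
for train_index, test_index in tscv.split(toy):
    print("Training:", toy[train_index], "Test:", toy[test_index])

# Training: [1] Test: [2]
# Training: [1 2] Test: [3]
# Training: [1 2 3] Test: [4]
# Training: [1 2 3 4] Test: [5]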

After generating the training/test sets, I am going to fit an ARMA model on each training set and make a prediction for the corresponding test set. I store the root mean squared error of every prediction in the “rmse” array. After the last fold, I calculate the average error.

from math import sqrt

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

tscv = TimeSeriesSplit(n_splits=4)
rmse = []

for train_index, test_index in tscv.split(cross_validation):
    cv_train, cv_test = cross_validation.iloc[train_index], cross_validation.iloc[test_index]

    # fit an ARMA(2, 2) model on the training part of the fold
    arma = sm.tsa.ARMA(cv_train, (2, 2)).fit(disp=False)

    # forecast the test part of the fold and store its error
    predictions = arma.predict(cv_test.index.values[0], cv_test.index.values[-1])
    true_values = cv_test.values
    rmse.append(sqrt(mean_squared_error(true_values, predictions)))

print("RMSE: {}".format(np.mean(rmse)))

Holdout set

Now, I can tweak the parameters of the ARMA model for as long as I want.
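For example, one simple way to compare a few candidate (p, q) orders is to wrap the cross-validation loop from the previous section in a helper function and call it for every candidate. The sketch below does exactly that; the function name cv_rmse and the list of candidate orders are my own choices for illustration, and some orders may fail to converge on a given dataset.

from math import sqrt

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

def cv_rmse(order):
    # average RMSE of an ARMA(p, q) model over the cross-validation folds
    tscv = TimeSeriesSplit(n_splits=4)
    errors = []
    for train_index, test_index in tscv.split(cross_validation):
        cv_train, cv_test = cross_validation.iloc[train_index], cross_validation.iloc[test_index]
        model = sm.tsa.ARMA(cv_train, order).fit(disp=False)
        predictions = model.predict(cv_test.index.values[0], cv_test.index.values[-1])
        errors.append(sqrt(mean_squared_error(cv_test.values, predictions)))
    return np.mean(errors)

# candidate orders chosen arbitrarily for this example
for order in [(1, 0), (1, 1), (2, 1), (2, 2)]:
    print(order, cv_rmse(order))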

When I am satisfied with the result, I fit the model on the entire cross-validation set and use the test set created in the first code snippet to calculate the final error metric of the model.

# fit the final model on the whole cross-validation set
arma = sm.tsa.ARMA(cross_validation, (2, 2)).fit(disp=False)

# forecast the holdout set and calculate the final error metric
predictions = arma.predict(test.index.values[0], test.index.values[-1])
true_values = test.values
print(sqrt(mean_squared_error(true_values, predictions)))