In this blog post, I am going to describe how to measure the performance of a time-series forecasting model using a variant of cross-validation called “nested cross-validation.” As an example, I am going to use the ARMA model from the Statsmodels library.
Cross-validation in time series forecasting
In the case of time series, cross-validation is not trivial. I cannot choose random samples and assign them to either the test set or the training set, because it makes no sense to use values from the future to forecast values in the past. There is a temporal dependency between observations, and we must preserve that relation during testing.
Before we start cross-validation, we must split the dataset into the cross-validation subset and the test set. In my example, I have a dataset of 309 observations and I am going to use 20% of them as the test set (aka the holdout set).
# The first 80% of the 309 observations (247 values) go to cross-validation,
# the remaining 20% are kept as the holdout set
cross_validation = values[:247]
test = values[247:]
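The split point does not have to be hard-coded. Here is a minimal sketch of deriving it from the length of the series instead (it reuses the values variable from the snippet above; int(309 * 0.8) gives 247):

split_point = int(len(values) * 0.8)  # 247 for 309 observations
cross_validation = values[:split_point]
test = values[split_point:]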
Nested cross-validation
The idea of nested cross-validation is easier to grasp with an example. Imagine that I have only 5 observations in my cross-validation set and I want to perform 4-fold cross-validation.
Here is my dataset: [1, 2, 3, 4, 5]
What I want to do is to create 4 pairs of training/test sets that follow those two rules:
- every test set contains unique observations
- observations from the training set occur before their corresponding test set
There is only one way to generate such pairs from my dataset. As a result, I get 4 pairs of training/test sets:
- Training: [1] Test: [2]
- Training: [1, 2] Test: [3]
- Training: [1, 2, 3] Test: [4]
- Training: [1, 2, 3, 4] Test: [5]
Fortunately, I don’t need to do it by hand, because Scikit-learn provides the TimeSeriesSplit class, which generates those pairs for me.
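To make sure TimeSeriesSplit really produces the pairs listed above, here is a quick sketch that runs it on the toy dataset:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy dataset from the example above
data = np.array([1, 2, 3, 4, 5])

tscv = TimeSeriesSplit(n_splits=4)
for train_index, test_index in tscv.split(data):
    print("Training:", data[train_index], "Test:", data[test_index])

# Training: [1] Test: [2]
# Training: [1 2] Test: [3]
# Training: [1 2 3] Test: [4]
# Training: [1 2 3 4] Test: [5]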
After generating the training/test sets, I am going to fit an ARMA model on each training set and make a prediction for the corresponding test set. I store the root mean squared error of each prediction in the “rmse” list and, after the last fold, I calculate the average error.
from math import sqrt

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

tscv = TimeSeriesSplit(n_splits=4)
rmse = []

for train_index, test_index in tscv.split(cross_validation):
    cv_train, cv_test = cross_validation.iloc[train_index], cross_validation.iloc[test_index]

    # Fit an ARMA(2, 2) model on the training part of the fold
    arma = sm.tsa.ARMA(cv_train, (2, 2)).fit(disp=False)

    # Forecast over the index range covered by the test part of the fold
    predictions = arma.predict(cv_test.index.values[0], cv_test.index.values[-1])
    true_values = cv_test.values
    rmse.append(sqrt(mean_squared_error(true_values, predictions)))

print("RMSE: {}".format(np.mean(rmse)))
Holdout set
Now I can tweak the parameters of the ARMA model for as long as I want.
When I am satisfied with the result, I can use the test set created in the first code snippet to calculate the final error metric of the model.
# Fit the final model on the whole cross-validation subset
arma = sm.tsa.ARMA(cross_validation, (2, 2)).fit(disp=False)

# Evaluate it once on the holdout set
predictions = arma.predict(test.index.values[0], test.index.values[-1])
true_values = test.values
print("Holdout RMSE: {}".format(sqrt(mean_squared_error(true_values, predictions))))