How to Use XGBoost for Time Series Forecasting

Source: Jason Brownlee, Organized by Data Science THU

This article is approximately 3,300 words and takes about 10 minutes to read.

This article introduces how to use XGBoost for time series forecasting, including transforming time series into a supervised learning prediction problem, using forward validation for model evaluation, and providing actionable code examples.
For classification and regression problems, XGBoost is an efficient implementation of the gradient boosting algorithm.
It balances speed and efficiency, and performs excellently in many predictive modeling tasks, widely favored by winners in data science competitions such as Kaggle.
XGBoost can also be used for time series forecasting, although the time series dataset must first be transformed into a supervised learning problem. It also requires a specialized evaluation technique called walk-forward validation, because evaluating the model with k-fold cross-validation would give optimistically biased results.
In this article, you will learn how to develop an XGBoost model for time series forecasting.
After completing this tutorial, you will know:
  • XGBoost is an implementation of gradient boosting ensemble methods for classification and regression problems.
  • Time series datasets can be adapted for supervised learning by using a sliding time window representation.
  • How to fit, evaluate, and predict using the XGBoost model for time series forecasting.
Let’s get started!
Tutorial Overview
This tutorial is divided into three parts:
1. XGBoost Ensemble
2. Time Series Data Preparation
3. XGBoost for Time Series Forecasting
1. XGBoost Ensemble
XGBoost stands for Extreme Gradient Boosting, which is an efficient implementation of stochastic gradient boosting.
The stochastic gradient boosting algorithm (also known as gradient boosting machines or tree boosting) is a powerful machine learning technique that performs very well, often best-in-class, on a wide range of challenging machine learning problems.
Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks.
— XGBoost: A Scalable Tree Boosting System, 2016.
https://arxiv.org/abs/1603.02754
It is an ensemble of decision tree algorithms in which new trees correct the errors of the trees already in the model. Trees are added until the results are satisfactory.
XGBoost is an efficient implementation of the stochastic gradient boosting algorithm, exposing a range of hyperparameters that give fine-grained control over the model training procedure.
The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings.
— XGBoost: A Scalable Tree Boosting System, 2016.
https://arxiv.org/abs/1603.02754
XGBoost is designed for classification and regression problems on tabular datasets, and can also be used for time series forecasting.
For more information on gradient boosting (GBDT) and the XGBoost implementation, please see the following tutorial:

“A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning”

Link: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

First, XGBoost needs to be installed, and you can do it with pip as follows:
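pip install xgboost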
After installation, you can confirm it was successful and check the installed version with the following code:

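# check xgboost version
import xgboost
print(xgboost.__version__)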

Executing the code prints the version number, which may be higher in your environment:

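1.0.1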

Although the XGBoost library has its own Python interface, you can also use the XGBRegressor wrapper class from the scikit-learn API.
The model can be instantiated and used for model evaluation like any other scikit-learn estimator. For example:

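# define model (trainX, trainy, and testX are assumed to be prepared beforehand)
model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
# fit model
model.fit(trainX, trainy)
# make a one-step prediction
yhat = model.predict(testX)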

Now that we are familiar with XGBoost, let’s look at how to prepare the time series dataset for supervised learning.
2. Time Series Data Preparation
Time series data can be reframed as a supervised learning problem.
Given a series of numbers from a time series dataset, we can reconstruct the data to make it look like a supervised learning problem. We can use the data from the previous time step as input variables and the next time step as the output variable.
Let’s make this concrete with an example. Suppose we have the following time series:

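time, measure
1,    100
2,    110
3,    108
4,    115
5,    120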

We can reshape this time series dataset into a supervised learning format, using the value from the previous time step to predict the value of the next time step.
By reorganizing the time series dataset in this way, the data will look as follows:

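X,   y
?,   100
100, 110
110, 108
108, 115
115, 120
120, ?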

Note that the time column has been dropped, and some rows cannot be used to train a model: the first row has no prior input value, and the last row has no known output value.
This representation is called a sliding window, as the input and expected output windows move forward in time, creating new “samples” for the supervised learning model.
For more information on preparing time series forecasting data using the sliding window method, please refer to the tutorial:

“Time Series Forecasting as Supervised Learning”

Link: https://machinelearningmastery.com/time-series-forecasting-supervised-learning/

You can use the shift() method from the pandas library to reframe the time series with a chosen number of input and output steps.
This will be a useful tool as it allows us to explore different frameworks for the time series problem using machine learning algorithms to see which method may produce a better model.
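As a quick illustration (this snippet is not part of the original example; the values and column names are made up), shift() aligns the previous value alongside the current one:

# minimal sketch of pandas shift(); values and column names are illustrative
from pandas import DataFrame
df = DataFrame({'t': [100, 110, 108, 115, 120]})
# shift the series down one step so the previous value lines up as an input
df['t-1'] = df['t'].shift(1)
print(df)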
The following function takes the time series as a NumPy array with one or more columns and converts it into a supervised learning problem with the specified number of inputs and outputs.

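# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values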

We can use this function to prepare a time series dataset for XGBoost.
For more information on the step-by-step development of this function, please refer to the tutorial:

“How to Convert a Time Series to a Supervised Learning Problem in Python”

Link: https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

Once the dataset is prepared, we need to focus on how to use it to fit and evaluate a model.
Importantly, a model that uses data from the future to predict the past is invalid; the model must forecast the future from historical data only.
This means we cannot use evaluation methods that randomly split the dataset, such as k-fold cross-validation. Instead, we must use a technique called walk-forward validation.
In walk-forward validation, the dataset is first split into training and test sets by selecting a cut point; for example, all data except the last 12 months is used for training and the last 12 months for testing.
If we are interested in a one-step forecast, such as one month ahead, we can evaluate the model by training on the training set and predicting the first step of the test set. We then add the actual observation from the test set to the training set, refit the model, and have it predict the second step of the test set.
Repeating this process over the entire test set yields one-step predictions for all of it, from which an error measure such as MAE can be computed to evaluate the model’s performance.
For more information on walk-forward validation, please refer to the tutorial:

“How To Backtest Machine Learning Models for Time Series Forecasting”

Link: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

The following function performs walk-forward validation.
The parameters are the entire time series dataset and the number of rows for the test set.
It then iterates through the test set, calling the xgboost_forecast() function to make one-step predictions. It calculates error metrics and returns details for analysis.

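# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # split test row into input and output columns
        testX, testy = test[i, :-1], test[i, -1]
        # fit model on history and make a prediction
        yhat = xgboost_forecast(history, testX)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
        # summarize progress
        print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
    # estimate prediction error
    error = mean_absolute_error(test[:, -1], predictions)
    return error, test[:, -1], predictions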

The train_test_split() function divides the dataset into training and test sets. It can be defined as follows:

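# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test, :], data[-n_test:, :]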

The XGBRegressor class can be used to make a one-step forecast. The xgboost_forecast() function takes the training set and one row of test inputs, fits the model, and makes a one-step prediction.

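# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
    # transform list into array
    train = asarray(train)
    # split into input and output columns
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
    model.fit(trainX, trainy)
    # make a one-step prediction
    yhat = model.predict(asarray([testX]))
    return yhat[0]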

Now that we know how to prepare the time series dataset for forecasting and evaluate the XGBoost model, we can use XGBoost on actual datasets.
3. XGBoost for Time Series Forecasting
In this section, we will explore how to use XGBoost for time series forecasting.
We will use a standard univariate time series dataset, with the goal of making one-step predictions using the model.
You can use the code in this section to start your own project, as it can easily be adapted for multivariate input, multivariate prediction, and multi-step predictions.
The following links can be used to download the dataset and its description; save the data file in your local working directory with the file name “daily-total-female-births.csv”.
  • Dataset (daily-total-female-births.csv)

    Link: https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-total-female-births.csv

  • Description (daily-total-female-births.names)

    Link: https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-total-female-births.names

The first few rows of the dataset are as follows:

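"Date","Births"
"1959-01-01",35
"1959-01-02",32
"1959-01-03",30
"1959-01-04",31
"1959-01-05",44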

First, import the data and plot the dataset. The complete example is as follows:

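# load and plot the time series dataset
from pandas import read_csv
from matplotlib import pyplot
# load dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
# plot dataset
pyplot.plot(series)
pyplot.show()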

Running this example will produce a line graph for this dataset. It can be observed that there is no obvious trend or seasonality.
[Figure: line plot of the daily total female births dataset]
When forecasting births over the last 12 months of the data, a naive persistence model achieves a mean absolute error (MAE) of 6.7 births. This provides a useful performance baseline.
Next, we will evaluate the performance of the XGBoost model on this dataset and make one-step predictions for the last 12 months of data.
We will use only the previous three time steps as input to the model, with default model hyperparameters, except that we set the objective to ‘reg:squarederror’ (to avoid a warning message) and use 1,000 trees in the ensemble (to avoid underfitting).
The complete example is as follows:
# forecast monthly births with xgboost
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from matplotlib import pyplot

# transform a timeseries dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test, :], data[-n_test:, :]

# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
    # transform list into array
    train = asarray(train)
    # split into input and output columns
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
    model.fit(trainX, trainy)
    # make a one-step prediction
    yhat = model.predict(asarray([testX]))
    return yhat[0]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # split test row into input and output columns
        testX, testy = test[i, :-1], test[i, -1]
        # fit model on history and make a prediction
        yhat = xgboost_forecast(history, testX)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
        # summarize progress
        print('>expected=%.1f,predicted=%.1f' % (testy, yhat))
    # estimate prediction error
    error = mean_absolute_error(test[:, -1], predictions)
    return error, test[:, -1], predictions

# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values

# transform the timeseries data into supervised learning
data = series_to_supervised(values, n_in=3)

# evaluate
mae, y, yhat = walk_forward_validation(data, 12)
print('MAE: %.3f' % mae)

# plot expected vs predicted
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
pyplot.show()
Running this example reports the expected and predicted values for each step in the test set, followed by the MAE over all predictions.
We can see that the model outperforms the persistence baseline, with an MAE of about 5.3 births. (Results may vary given the stochastic nature of the algorithm; consider running the example several times.)
Can you do better?
You can try different XGBoost hyperparameters and different time step inputs to see if you can achieve a better model. Feel free to share your results in the comments section.

[Output: expected and predicted values for each of the 12 test steps, followed by the final MAE]

The following figure plots the predicted values and actual values for the last 12 months, providing a visual representation of the model’s performance on the test set.

[Figure: line plot comparing expected and predicted births for the last 12 months]

Once final XGBoost model hyperparameters are chosen, a model can be finalized and used to make predictions on new data.
This is called an out-of-sample forecast, i.e., a prediction beyond the end of the training dataset. It works the same way as making a prediction during model evaluation, since the same procedure is used whether we are evaluating candidate models or forecasting new data with a final model.
The following example demonstrates how to fit the final XGBoost model on all available data and make a one-step prediction beyond the end of the dataset.
# finalize model and make a prediction for monthly births with xgboost
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from xgboost import XGBRegressor

# transform a timeseries dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values

# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values

# transform the timeseries data into supervised learning
train = series_to_supervised(values, n_in=3)

# split into input and output columns
trainX, trainy = train[:, :-1], train[:, -1]

# fit model
model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
model.fit(trainX, trainy)

# construct an input for a new prediction
row = values[-3:].flatten()

# make a one-step prediction
yhat = model.predict(asarray([row]))
print('Input: %s, Predicted: %.3f' % (row, yhat[0]))
Running this code fits an XGBoost model on all available data.
The last three months of known data form the new input row, and the model predicts the next month beyond the end of the dataset.

[Output: the three-value input row and the model’s one-step prediction]

Further Reading
If you want to dive deeper, this section will provide more resources on the topic.
Related Tutorials
  • A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning

    https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

  • Time Series Forecasting as Supervised Learning

    https://machinelearningmastery.com/time-series-forecasting-supervised-learning/

  • How to Convert a Time Series to a Supervised Learning Problem in Python

    https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

  • How To Backtest Machine Learning Models for Time Series Forecasting

    https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

Summary
In this tutorial, you learned how to develop an XGBoost model for time series forecasting.
Specifically, you learned:
  • XGBoost is an implementation of gradient boosting ensemble algorithms for classification and regression.
  • Time series datasets can be transformed into supervised learning through sliding window representation.
  • How to fit, evaluate, and predict using the XGBoost model for time series forecasting.

Original title:

How to Use XGBoost for Time Series Forecasting

Original link:

https://machinelearningmastery.com/xgboost-for-time-series-forecasting/
