Author: Yishui Hancheng, CSDN Blog Expert, Research Directions: Machine Learning, Deep Learning, NLP, CV
Blog: http://yishuihancheng.blog.csdn.net
Traditional linear models struggle with multivariate or multi-input problems, whereas neural networks like LSTM excel at handling multiple variables, making them suitable for time series prediction tasks. In the following article, you will learn how to build an LSTM model using the deep learning library Keras to handle multivariate time series prediction problems. You will master:
- How to convert raw data into a format suitable for time series prediction;
- How to prepare data and build an LSTM model for time series prediction;
- How to use the model for predictions.
1. Air Pollution Prediction
In this blog, we will use the air quality dataset. The data comes from the U.S. Embassy in Beijing, which collected hourly weather and air pollution index data from 2010 to 2014.
The dataset includes the date, PM2.5 concentration, dew point, temperature, atmospheric pressure, wind direction, wind speed, and the accumulated hourly snowfall and rainfall. The complete features in the raw data are as follows:
1. No: row number
2. year: year
3. month: month
4. day: day
5. hour: hour
6. pm2.5: PM2.5 concentration
7. DEWP: dew point
8. TEMP: temperature
9. PRES: atmospheric pressure
10. cbwd: wind direction
11. Iws: cumulated wind speed
12. Is: accumulated snowfall
13. Ir: accumulated rainfall
We can use this dataset to build a model that predicts the pollution level at the current hour based on the weather conditions and pollution data from the previous hour or several previous hours. The dataset can be downloaded from the UCI Machine Learning Repository.
Beijing PM2.5 Data Set
https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data
2. Data Processing
Before using the data, some processing is required. The part of the data that needs processing is as follows:
No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0
2,2010,1,1,1,NA,-21,-12,1020,NW,4.92,0,0
3,2010,1,1,2,NA,-21,-11,1019,NW,6.71,0,0
4,2010,1,1,3,NA,-21,-14,1019,NW,9.84,0,0
5,2010,1,1,4,NA,-20,-12,1018,NW,12.97,0,0
A quick look at the dataset shows that the PM2.5 values for the first 24 hours are all NA, so these rows need to be deleted. For the few other hours with missing values, we can fill them with fillna in Pandas; we also need to combine the date fields into a single datetime to serve as the index in Pandas.
The following code completes this processing, while also removing the "No" column from the original data and renaming the remaining columns to clearer names.
from pandas import read_csv
from datetime import datetime
# load data
def parse(x):
    return datetime.strptime(x, '%Y %m %d %H')
dataset = read_csv('raw.csv', parse_dates=[['year', 'month', 'day', 'hour']], index_col=0, date_parser=parse)
dataset.drop('No', axis=1, inplace=True)
# manually specify column names
dataset.columns = ['pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain']
dataset.index.name = 'date'
# mark all NA values with 0
dataset['pollution'].fillna(0, inplace=True)
# drop the first 24 hours
dataset = dataset[24:]
# summarize first 5 rows
print(dataset.head(5))
# save to file
dataset.to_csv('pollution.csv')
The processed data is stored in the “pollution.csv” file, part of which is as follows:
pollution dew temp press wnd_dir wnd_spd snow rain
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
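The code above marks the remaining missing PM2.5 values with 0. As a hedged alternative (not used in the rest of this article), the gaps could instead be filled from neighbouring hours:
# alternative: propagate the last valid observation forward instead of writing 0,
# then back-fill any leading gap that has no earlier value
dataset['pollution'] = dataset['pollution'].ffill().bfill()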
The current data format is much more convenient to work with, and we can easily plot each column. The following code loads the "pollution.csv" file and plots every column except the categorical feature "wind direction".
from pandas import read_csv
from matplotlib import pyplot
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# specify columns to plot
groups = [0, 1, 2, 3, 5, 6, 7]
i = 1
# plot each column
pyplot.figure()
for group in groups:
    pyplot.subplot(len(groups), 1, i)
    pyplot.plot(values[:, group])
    pyplot.title(dataset.columns[group], y=0.5, loc='right')
    i += 1
pyplot.show()
Run the above code to plot the 7 variables over the 5-year range.
3. Multivariate LSTM Prediction Model
3.1 LSTM Data Preparation
When using an LSTM model, the first step is to prepare the data, which means converting the dataset into a supervised learning problem and normalizing the variables (both inputs and outputs), so that the pollution level at the current hour (t) can be predicted from the pollution data and weather conditions of the previous hour (t-1). This framing is deliberately straightforward and simple, intended only to get the ball rolling; other framings can also be explored, such as:
- Using the pollution data and weather conditions from the past 24 hours to predict the current pollution level;
- Predicting the weather conditions for the next hour (t+1).
The code below first loads the "pollution.csv" file and uses sklearn's preprocessing module to encode the categorical feature "wind direction" as integers (one-hot encoding could also be applied to this feature). Next, all features are normalized and the dataset is converted into a supervised learning problem, after which the weather variables at the current time (t) are removed, since only the pollution value is to be predicted. The complete code is as follows:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)
print(reframed.head())
Running the above code will show you the transformed dataset, which includes 8 input variables (features) and 1 output variable (the air pollution value at the current time t, label).
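To build intuition for what series_to_supervised() produces, here is a small standalone check on a toy two-variable series (the numbers are invented purely for illustration; the function and imports are the ones defined above):
from numpy import array
# two variables observed at five time steps
toy = array([[10, 50],
             [11, 51],
             [12, 52],
             [13, 53],
             [14, 54]])
# each row pairs var1(t-1), var2(t-1) with var1(t), var2(t);
# the first row is dropped because t-1 is undefined there
print(series_to_supervised(toy, n_in=1, n_out=1))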
The dataset processing is relatively simple, and there are many other approaches to try. Some directions to explore include:
- Dummy (one-hot) encoding of the "wind direction" feature (a brief sketch follows below);
- Adding seasonal features;
- Using time steps greater than 1.
Among these, the third method may be the most important for handling time series problems with LSTM.
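As a hedged sketch of the first direction, the wind direction could be one-hot encoded with Pandas' get_dummies instead of the integer LabelEncoder used above (the column and file names follow the earlier code; the downstream feature count and column indices would then need to be adjusted accordingly):
from pandas import read_csv, get_dummies
dataset = read_csv('pollution.csv', header=0, index_col=0)
# replace the single wind-direction column with one binary column per direction
dataset = get_dummies(dataset, columns=['wnd_dir'])
print(dataset.head())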
3.2 Constructing the Model
In this section, we will construct the LSTM model.
First, we need to split the processed dataset into training and testing sets. To speed up the training of the model, we will only use the data from the first year for training and then evaluate using the remaining 4 years.
The code below splits the dataset, divides the training and testing sets into input and output variables, and finally reshapes the input (X) into the LSTM input format [samples, timesteps, features].
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
Running the above code will print the input and output formats for the training and testing sets:
(8760, 1, 8) (8760,) (35039, 1, 8) (35039,)
Now we can build the LSTM model.
In the LSTM model, the hidden layer has 50 neurons and the output layer has 1 neuron (a regression problem), and the input is the feature vector of one time step (t-1). The loss function is Mean Absolute Error (MAE), the optimizer is Adam, and the model is trained for 50 epochs with a batch size of 72.
Finally, in the fit() function we set the validation_data parameter so that the training and testing losses are recorded, and the loss curves are plotted after training is completed.
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
3.3 Model Evaluation
Next, we will evaluate the model’s performance.
It is worth noting that the predictions need to be concatenated with the remaining columns of the test dataset before the scaling can be inverted, and that the expected values of the test set must likewise be converted back to their original scale.
After this processing, we can compute the error as the RMSE (Root Mean Square Error), in the same units as the original pollution values.
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
The complete code for the entire project is as follows:
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)
print(reframed.head())
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
Running the above code will first plot the training and testing loss during training.
At each epoch, the losses for the training and testing sets are recorded and plotted once training completes, and the final test RMSE of the model is printed. In the output below, you can see that the model's test RMSE reaches 26.496.
...
Epoch 46/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 47/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 48/50
0s - loss: 0.0144 - val_loss: 0.0133
Epoch 49/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 50/50
0s - loss: 0.0144 - val_loss: 0.0133
Test RMSE: 26.496
This model has not been tuned. Can you do better? Please tell us your problem framework, model configuration, and RMSE in the comments below.
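As a hedged starting point for tuning (not part of the original experiment), one could try a deeper network with dropout; the layer sizes and dropout rate below are arbitrary choices to experiment with:
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
# a possible variant: two stacked LSTM layers with dropout
model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dropout(0.2))
model.add(LSTM(32))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')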
Many readers have asked how to adjust the above example so that the model is trained on multiple previous time steps. When writing the original article I tried this approach, along with countless other configurations, and decided not to include it because it did not improve the model's performance. Nevertheless, the example below is provided as a reference template that you can adapt to your own problem. The changes required to train on multiple time steps are minimal. First, when calling series_to_supervised(), the problem must be framed appropriately; here we use the data from the previous 3 hours as input. Note also that we no longer explicitly drop the columns of all the other fields at ob(t).
# specify the number of lag hours
n_hours = 3
n_features = 8
# frame as supervised learning
reframed = series_to_supervised(scaled, n_hours, 1)
Next, we need to be more careful when specifying the input and output columns. Our framed dataset has 3 * 8 + 8 columns. We take the first 3 * 8 = 24 columns as input: all features of the observations from the previous 3 hours. We take only the pollution variable as the output for the following hour, as follows:
# split into input and outputs
n_obs = n_hours * n_features
train_X, train_y = train[:, :n_obs], train[:, -n_features]
test_X, test_y = test[:, :n_obs], test[:, -n_features]
print(train_X.shape, len(train_X), train_y.shape)
Next, we can properly reshape the input data to reflect the time steps and features.
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))
test_X = test_X.reshape((test_X.shape[0], n_hours, n_features))
Fitting the model is unchanged. The only other small change is in how the model is evaluated: specifically, how we reconstruct rows with 8 columns suitable for the inverse scaling operation, so that y and yhat can be returned to the original scale for the RMSE calculation. The key change is to concatenate the y or yhat column with the last 7 features of the test dataset before inverse scaling, as follows:
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, -7:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, -7:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
We can relate all these modifications back to the example above. The complete example of multivariate time series multi-lag input prediction is as follows:
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# specify the number of lag hours
n_hours = 3
n_features = 8
# frame as supervised learning
reframed = series_to_supervised(scaled, n_hours, 1)
print(reframed.shape)
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
n_obs = n_hours * n_features
train_X, train_y = train[:, :n_obs], train[:, -n_features]
test_X, test_y = test[:, :n_obs], test[:, -n_features]
print(train_X.shape, len(train_X), train_y.shape)
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))
test_X = test_X.reshape((test_X.shape[0], n_hours, n_features))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], n_hours*n_features))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, -7:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, -7:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
After fitting, the output is as follows:
...
Epoch 45/50
1s - loss: 0.0143 - val_loss: 0.0154
Epoch 46/50
1s - loss: 0.0143 - val_loss: 0.0148
Epoch 47/50
1s - loss: 0.0143 - val_loss: 0.0152
Epoch 48/50
1s - loss: 0.0143 - val_loss: 0.0151
Epoch 49/50
1s - loss: 0.0143 - val_loss: 0.0152
Epoch 50/50
1s - loss: 0.0144 - val_loss: 0.0149
The losses on the training and testing sets are plotted over the epochs as well.
Finally, the test RMSE is printed; it shows no real improvement over the single-lag model, at least not for this problem.
Test RMSE: 27.177
I want to add that LSTMs do not seem particularly well suited to this kind of autoregressive problem; an MLP with a large input window may work better. I hope this example helps you with your own time series prediction experiments.
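As a hedged illustration of that suggestion (not part of the original experiments), a plain MLP could be fed the same flattened lag window; train_X, train_y, test_X, test_y, n_hours, and n_features are assumed to come from the multi-lag example above, and the layer sizes are arbitrary:
from keras.models import Sequential
from keras.layers import Dense
# a simple MLP baseline on the flattened lag window of shape [samples, n_hours * n_features]
mlp = Sequential()
mlp.add(Dense(64, activation='relu', input_dim=n_hours * n_features))
mlp.add(Dense(32, activation='relu'))
mlp.add(Dense(1))
mlp.compile(loss='mae', optimizer='adam')
mlp.fit(train_X.reshape((train_X.shape[0], n_hours * n_features)), train_y,
        epochs=50, batch_size=72,
        validation_data=(test_X.reshape((test_X.shape[0], n_hours * n_features)), test_y),
        verbose=2, shuffle=False)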