Traditional linear models struggle with multivariate, multi-input problems, whereas neural networks such as the LSTM handle multiple input variables naturally, which makes them well suited to time series forecasting. In this article you will learn how to build an LSTM model with the deep learning library Keras to tackle a multivariate time series prediction problem. You will learn:
- How to transform raw data into a format suitable for time series prediction;
- How to prepare the data and build an LSTM model for time series prediction;
- How to use the model for predictions.
1. Air Pollution Prediction
In this blog, we will use an air quality dataset. The data is sourced from the U.S. Embassy in Beijing, which collected hourly weather and air pollution index data over a period of five years from 2010 to 2014.
The dataset includes date, PM2.5 concentration, dew point, temperature, wind direction, wind speed, cumulative hourly snowfall, and cumulative hourly rainfall. The complete features in the raw data are as follows:
1. No: row number
2. year: year
3. month: month
4. day: day
5. hour: hour
6. pm2.5: PM2.5 concentration
7. DEWP: dew point
8. TEMP: temperature
9. PRES: atmospheric pressure
10. cbwd: wind direction
11. Iws: cumulative wind speed
12. Is: cumulative snowfall
13. Ir: cumulative rainfall
We can use this dataset to build a predictive model that uses the weather conditions and pollution data from the previous hour or several hours to predict the pollution level at the next (current) moment. The dataset can be downloaded from the UCI Machine Learning Repository.
Beijing PM2.5 Data Set
https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data
2. Data Processing
Before using the data, some processing is required. The raw data to be processed is as follows:
No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0
2,2010,1,1,1,NA,-21,-12,1020,NW,4.92,0,0
3,2010,1,1,2,NA,-21,-11,1019,NW,6.71,0,0
4,2010,1,1,3,NA,-21,-14,1019,NW,9.84,0,0
5,2010,1,1,4,NA,-20,-12,1018,NW,12.97,0,0
A quick look at the dataset shows that the PM2.5 values for the first 24 hours are all NA, so those rows need to be removed. The small number of missing values at later times can be filled in with Pandas' fillna; we also need to combine the separate date and time columns into a single datetime index for Pandas.
The following code completes the above processing, while also removing the “No” column from the raw data and renaming the columns to clearer names.
from pandas import read_csv
from datetime import datetime
# load data
def parse(x):
    return datetime.strptime(x, '%Y %m %d %H')
dataset = read_csv('raw.csv', parse_dates = [['year', 'month', 'day', 'hour']], index_col=0, date_parser=parse)
dataset.drop('No', axis=1, inplace=True)
# manually specify column names
dataset.columns = ['pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain']
dataset.index.name = 'date'
# mark all NA values with 0
dataset['pollution'].fillna(0, inplace=True)
# drop the first 24 hours
dataset = dataset[24:]
# summarize first 5 rows
print(dataset.head(5))
# save to file
dataset.to_csv('pollution.csv')
The processed data is stored in the “pollution.csv” file, part of which is as follows:
pollution dew temp press wnd_dir wnd_spd snow rain
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
The data format is now much easier to work with, and we can plot each column directly. The following code loads the "pollution.csv" file and plots every column except the categorical feature "wind direction".
from pandas import read_csv
from matplotlib import pyplot
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# specify columns to plot
groups = [0, 1, 2, 3, 5, 6, 7]
i = 1
# plot each column
pyplot.figure()
for group in groups:
    pyplot.subplot(len(groups), 1, i)
    pyplot.plot(values[:, group])
    pyplot.title(dataset.columns[group], y=0.5, loc='right')
    i += 1
pyplot.show()
Run the above code to plot the seven variables over a five-year range.
3. Multivariable LSTM Prediction Model
3.1 LSTM Data Preparation
When using an LSTM model, the first step is to prepare the data: frame the dataset as a supervised learning problem and normalize the variables (both inputs and outputs), so that the model predicts the pollution level at the current hour (t) from the pollution and weather measurements at the previous hour (t-1). This framing is straightforward and deliberately simple, intended only as a starting point; other framings can also be explored, such as:
- Using pollution data and weather conditions from the past 24 hours to predict the current pollution level;
- Predicting possible weather conditions for the next moment (t+1);
The code below first loads the "pollution.csv" file and uses sklearn's preprocessing module to integer-encode the categorical feature "wind direction" (it could also be one-hot encoded). Next, all features are normalized and the dataset is framed as a supervised learning problem, and the weather columns for the hour to be predicted (t) are dropped. The complete code is as follows:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)
print(reframed.head())
Running the above code will show the transformed dataset, which includes 8 input variables (input features) and 1 output variable (the pollution value at the current moment t, label).
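To make the column layout produced by series_to_supervised() concrete, here is a small standalone sketch (using a toy two-variable array rather than the pollution data, purely for illustration) showing how one lag of input and one step of output are named and arranged:
from numpy import array
# toy data: two variables observed over four time steps
raw = array([[10, 0.5],
             [20, 0.6],
             [30, 0.7],
             [40, 0.8]])
print(series_to_supervised(raw, n_in=1, n_out=1))
#    var1(t-1)  var2(t-1)  var1(t)  var2(t)
# 1       10.0        0.5     20.0      0.6
# 2       20.0        0.6     30.0      0.7
# 3       30.0        0.7     40.0      0.8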
The data processing here is quite simple, and there are many ways to experiment. Some directions to explore include the following (the first two are sketched after this list):
- Dummy encoding the “wind direction” feature;
- Adding seasonal features;
- Using a time step longer than 1.
Among these, the third approach may be the most crucial for handling time series problems with LSTM.
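As a concrete example of the first two ideas, the sketch below (an illustration under my own assumptions, not part of the original pipeline) one-hot encodes the wind direction column with pandas get_dummies and derives simple hour and month features from the datetime index, starting from the "pollution.csv" file produced earlier:
from pandas import read_csv
from pandas import get_dummies
# load the processed file produced earlier, keeping the datetime index
dataset = read_csv('pollution.csv', header=0, index_col=0, parse_dates=True)
# one-hot (dummy) encode the categorical wind direction column
dummies = get_dummies(dataset['wnd_dir'], prefix='wnd_dir')
dataset = dataset.drop('wnd_dir', axis=1).join(dummies)
# simple seasonal features derived from the datetime index
dataset['hour'] = dataset.index.hour
dataset['month'] = dataset.index.month
print(dataset.head())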
3.2 Building the Model
In this section, we will construct the LSTM model.
First, we need to split the processed dataset into training and testing sets. To speed up model training, we will use only the first year of data for training and evaluate on the remaining four years of data.
The following code splits the dataset, separates the training and testing sets into input and output variables, and finally reshapes the input (X) into the 3D format expected by the LSTM: [samples, timesteps, features].
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
Running the above code prints the shapes of the training and testing inputs and outputs:
(8760, 1, 8) (8760,) (35039, 1, 8) (35039,)
Now we can build the LSTM model.
The LSTM model has 50 neurons in the hidden layer and 1 neuron in the output layer (this is a regression problem). The input is a single time step (t-1) of 8 features, the loss function is Mean Absolute Error (MAE), the optimizer is Adam, and the model is trained for 50 epochs with a batch size of 72.
Finally, the validation_data argument of fit() is set so that the losses on the training and testing sets are recorded each epoch, and the loss curves are plotted once training is complete.
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
3.3 Model Evaluation
Next, we will evaluate the model’s performance.
Note that the predictions must be concatenated with part of the test set features before the scaling can be inverted, and the expected values from the test set must likewise be converted back to their original scale.
With both series back on the original scale, we calculate the error as RMSE (Root Mean Squared Error).
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
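As a quick usage example, the sketch below (assuming the fitted model, the scaler, and the 2D test_X from the evaluation code above are still in scope) predicts the pollution level for a single hour of the test set and inverts the scaling for just that one value:
# single-sample prediction sketch (assumes `model`, `scaler` and the 2D `test_X` above)
sample = test_X[0].reshape((1, 1, 8))   # one hour of scaled features: [samples, timesteps, features]
yhat_sample = model.predict(sample)     # scaled pollution forecast, shape (1, 1)
# pad with the other 7 scaled features so the scaler can invert a full 8-column row
row = concatenate((yhat_sample, test_X[0:1, 1:]), axis=1)
print('Predicted pollution: %.1f' % scaler.inverse_transform(row)[0, 0])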
The complete code for the entire project is as follows:
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)
print(reframed.head())
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
Running the above code first plots the training and testing loss over the course of training.
The losses for both sets are recorded at the end of each epoch, and once training is complete the final test RMSE is printed. In the output below, the model reaches an overall RMSE of 26.496.
...
Epoch 46/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 47/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 48/50
0s - loss: 0.0144 - val_loss: 0.0133
Epoch 49/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 50/50
0s - loss: 0.0144 - val_loss: 0.0133
Test RMSE: 26.496
This model has not been tuned. Can you do better? Please share your problem framing, model configuration, and RMSE in the comments below.
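If you want a starting point for your own tuning, one common variation (purely a sketch with arbitrary hyperparameters, not a configuration I have validated on this dataset) is to stack a second LSTM layer and add dropout, leaving the rest of the pipeline unchanged:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
# stacked LSTM variant to experiment with
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dropout(0.2))
model.add(LSTM(25))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')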
Many readers have asked how to adjust the above example so that the model is trained on multiple previous time steps. I tried this approach, along with countless other configurations, when writing the original article, and decided not to include it because it did not improve the model's skill. Nevertheless, the example below is provided as a reference template that you can adapt to your own problem. The changes needed to train the model on multiple previous time steps are minimal and are as follows: first, the problem must be framed accordingly when calling series_to_supervised(); we will use the data from the previous 3 hours as input. Note also that we no longer explicitly drop the time-t columns of all the other (weather) variables.
# specify the number of lag hours
n_hours = 3
n_features = 8
# frame as supervised learning
reframed = series_to_supervised(scaled, n_hours, 1)
Next, we need to be more careful when specifying the input and output columns. The framed dataset has 3 * 8 + 8 columns: the first 3 * 8 = 24 columns are the inputs (all features of the observations over the previous 3 hours), and only the pollution variable at the following hour is used as the output, as follows:
# split into input and outputs
n_obs = n_hours * n_features
train_X, train_y = train[:, :n_obs], train[:, -n_features]
test_X, test_y = test[:, :n_obs], test[:, -n_features]
print(train_X.shape, len(train_X), train_y.shape)
Next, we can reshape the input data correctly to reflect the time steps and features.
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))
test_X = test_X.reshape((test_X.shape[0], n_hours, n_features))
Fitting the model is unchanged. The only other small change is in how we evaluate the model: specifically, how we reconstruct rows with 8 columns so that they match what the scaler expects, allowing y and yhat to be returned to their original scale before the RMSE is calculated. The key change is that the y or yhat column is now concatenated with the last 7 features of the test dataset before inverse scaling, as shown below:
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, -7:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, -7:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
Tying all of these modifications to the example above, the complete multi-lag multivariate time series forecasting example is as follows:
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# specify the number of lag hours
n_hours = 3
n_features = 8
# frame as supervised learning
reframed = series_to_supervised(scaled, n_hours, 1)
print(reframed.shape)
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
n_obs = n_hours * n_features
train_X, train_y = train[:, :n_obs], train[:, -n_features]
test_X, test_y = test[:, :n_obs], test[:, -n_features]
print(train_X.shape, len(train_X), train_y.shape)
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))
test_X = test_X.reshape((test_X.shape[0], n_hours, n_features))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], n_hours*n_features))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, -7:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, -7:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
After fitting, the output is as follows:
...
Epoch 45/50
1s - loss: 0.0143 - val_loss: 0.0154
Epoch 46/50
1s - loss: 0.0143 - val_loss: 0.0148
Epoch 47/50
1s - loss: 0.0143 - val_loss: 0.0152
Epoch 48/50
1s - loss: 0.0143 - val_loss: 0.0151
Epoch 49/50
1s - loss: 0.0143 - val_loss: 0.0152
Epoch 50/50
1s - loss: 0.0144 - val_loss: 0.0149
The training and testing losses are plotted as shown below:
Finally, the test RMSE is printed; using multiple lag time steps shows no significant advantage over the single-lag model, at least on this problem.
Test RMSE: 27.177
I would add that the LSTM does not appear well suited to autoregressive problems of this type; you may do better with an MLP over a larger input window, as sketched below. I hope this example helps you with your own time series forecasting experiments.
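For reference, a minimal window-based MLP sketch on the same data might look like the following. This assumes the flattened 2D inputs (shape [samples, n_hours * n_features]) taken before the 3D reshape used for the LSTM; it is a starting point to experiment with, not a tuned model:
from keras.models import Sequential
from keras.layers import Dense
# window-based MLP: the lag features are fed in as one flat vector per sample
mlp = Sequential()
mlp.add(Dense(64, activation='relu', input_dim=n_hours * n_features))
mlp.add(Dense(32, activation='relu'))
mlp.add(Dense(1))
mlp.compile(loss='mae', optimizer='adam')
history = mlp.fit(train_X, train_y, epochs=50, batch_size=72,
                  validation_data=(test_X, test_y), verbose=2, shuffle=False)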
Author: Yishui Hancheng Blog: http://yishuihancheng.blog.csdn.net