Author: Occam’s Razor
Personal Blog: https://blog.csdn.net/yilulvxing
Paper Link: https://arxiv.org/abs/1704.02971
Github Code Link: https://github.com/LeronQ/DA-RNN
The paper is titled "A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction". Essentially, the article builds on the Seq2Seq model and combines it with an attention mechanism to perform time series prediction. A major highlight is that the attention mechanism is introduced not only in the input phase of the decoder but also in the encoder phase. The attention mechanism in the encoder phase achieves feature selection and captures temporal dependencies.
It is divided into two stages:
- First stage: an attention mechanism adaptively extracts (reweights) the input features at each time step, which is the biggest highlight of this paper.
- Second stage: another attention mechanism selects the relevant encoder hidden states.
1: Model Architecture Diagram

Algorithm Implementation Process:
- Encoder (input) phase: an input attention mechanism is applied to the raw input at each time step, $\mathbf{x}_t = (x_t^1, x_t^2, \ldots, x_t^n)^\top$. Combined with the encoder's hidden-state information, it assigns a weight $\alpha_t^k$ to each feature dimension, transforming the input into $\tilde{\mathbf{x}}_t = (\alpha_t^1 x_t^1, \alpha_t^2 x_t^2, \ldots, \alpha_t^n x_t^n)^\top$. This achieves adaptive extraction of the feature dimensions at each time step, and the reweighted $\tilde{\mathbf{x}}_t$ is used as the input to the encoder, which is the biggest highlight of this article. A minimal numeric sketch of this reweighting follows the list.
- Decoder (output) phase: similar to traditional attention, a second attention mechanism selects the relevant encoder hidden states.
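To make the first-stage reweighting concrete, here is a minimal numeric sketch in PyTorch; the feature values and attention weights are made up for illustration:

```python
import torch

# Illustrative example of the first-stage reweighting (all values made up):
# x_t is the raw input at time t with n = 4 driving series, alpha_t are the
# softmax-normalized input-attention weights produced for that time step.
x_t = torch.tensor([0.5, -1.2, 3.0, 0.8])
alpha_t = torch.tensor([0.1, 0.4, 0.3, 0.2])   # sums to 1

# Adaptive feature extraction: each feature is rescaled by its weight, and the
# reweighted vector x_tilde_t is what actually enters the encoder LSTM.
x_tilde_t = alpha_t * x_t
print(x_tilde_t)   # tensor([ 0.0500, -0.4800,  0.9000,  0.1600])
```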
2: Attention in the Input Phase
The implementation process of the encoder Attention mechanism in the first stage is as follows:

In the article, $\mathbf{h}_t \in \mathbb{R}^m$ is defined as the hidden state of the encoder at time $t$, where $m$ is the size of the hidden state.
In the first stage, the current input $\mathbf{x}_t \in \mathbb{R}^n$ and the encoder's hidden state at the previous time step, $\mathbf{h}_{t-1}$, are used to compute the encoder's hidden state at the current time step, $\mathbf{h}_t$. The update formula can be written as:
$$\mathbf{h}_t = f_1(\mathbf{h}_{t-1}, \mathbf{x}_t)$$
where $f_1$ is a nonlinear activation function; we can use an ordinary recurrent unit such as a vanilla RNN or an LSTM as $f_1$. In this article, an LSTM is used to capture long-term dependencies.
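As a small illustration of this update with $f_1$ taken to be an LSTM, the following sketch uses PyTorch's `nn.LSTMCell`; the sizes $n$ and $m$ are arbitrary and not taken from the paper:

```python
import torch
import torch.nn as nn

# Sketch of the encoder update h_t = f1(h_{t-1}, x_t) with f1 = LSTM.
# The sizes n (number of features) and m (hidden size) are arbitrary here.
n, m = 4, 8
f1 = nn.LSTMCell(input_size=n, hidden_size=m)

h_prev = torch.zeros(1, m)        # h_{t-1}
s_prev = torch.zeros(1, m)        # cell state s_{t-1}
x_t = torch.randn(1, n)           # (attention-reweighted) input at time t

# One recurrent step: the LSTM updates both the hidden state and the cell state.
h_t, s_t = f1(x_t, (h_prev, s_prev))
```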
Here, in order to adaptively select relevant features (i.e., assign a weight to each feature), the author introduces an attention mechanism. In simple terms, at each time step an attention weight $\alpha_t^k$ is assigned to each influencing factor (feature) of the input.
$\alpha_t^k$ measures the importance of the $k$-th feature at time $t$. The updated input $\tilde{\mathbf{x}}_t$ is computed as:
$$\tilde{\mathbf{x}}_t = (\alpha_t^1 x_t^1, \alpha_t^2 x_t^2, \ldots, \alpha_t^n x_t^n)^\top$$
So how is $\alpha_t^k$ calculated?
The method given in the article: $\alpha_t^k$ is computed from the previous hidden state $\mathbf{h}_{t-1}$ and cell state $\mathbf{s}_{t-1}$ of the encoder:
$$e_t^k = \mathbf{v}_e^\top \tanh\!\left(\mathbf{W}_e\,[\mathbf{h}_{t-1}; \mathbf{s}_{t-1}] + \mathbf{U}_e\,\mathbf{x}^k\right)$$
where $[\mathbf{h}_{t-1}; \mathbf{s}_{t-1}]$ is the concatenation of the hidden state and cell state. This formula linearly combines the $k$-th driving series $\mathbf{x}^k$ (in the article, "driving series" means a feature, i.e., the $k$-th feature observed over the whole input window) with the previous hidden state and cell state, and then applies a $\tanh$ activation to obtain $e_t^k$.
After calculating $e_t^k$, it is normalized with the softmax function:
$$\alpha_t^k = \frac{\exp(e_t^k)}{\sum_{i=1}^{n} \exp(e_t^i)}$$
Then the input is updated as:
$$\tilde{\mathbf{x}}_t = (\alpha_t^1 x_t^1, \alpha_t^2 x_t^2, \ldots, \alpha_t^n x_t^n)^\top$$
and this $\tilde{\mathbf{x}}_t$ serves as the input for the next stage, the temporal attention.
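The following is a minimal sketch (not the linked repository's exact code) of how $e_t^k$, $\alpha_t^k$, and $\tilde{\mathbf{x}}_t$ can be computed with a few linear layers; the batch-free shapes and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the input attention computation described above.
# n = number of driving series, T = window length, m = encoder hidden size;
# the sizes and the batch-free layout are illustrative assumptions.
n, T, m = 4, 10, 8
W_e = nn.Linear(2 * m, T, bias=False)   # applied to [h_{t-1}; s_{t-1}]
U_e = nn.Linear(T, T, bias=False)       # applied to the k-th driving series x^k
v_e = nn.Linear(T, 1, bias=False)

h_prev = torch.zeros(m)                 # encoder hidden state h_{t-1}
s_prev = torch.zeros(m)                 # encoder cell state s_{t-1}
X = torch.randn(n, T)                   # row k is the k-th driving series x^k

# e_t^k = v_e^T tanh(W_e [h_{t-1}; s_{t-1}] + U_e x^k), computed for all k at once
hs = torch.cat([h_prev, s_prev])                      # (2m,)
e_t = v_e(torch.tanh(W_e(hs) + U_e(X))).squeeze(-1)   # (n,)

# alpha_t^k: softmax over the n features, then reweight the raw input x_t
alpha_t = F.softmax(e_t, dim=0)
x_t = X[:, 5]                           # raw input at one time step t
x_tilde_t = alpha_t * x_t               # input actually fed to the encoder LSTM
```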
The input attention mechanism enables the encoder to focus on the important input features rather than treating all features equally, which is the essence of any attention mechanism.
3: Temporal Attention in the Decoder
The implementation process of the temporal Attention mechanism in the second stage is as follows:

For distinction, following Luo Weimeng's suggestion, the time series index in the decoder is labeled $t'$ to differentiate it from the index $t$ in the encoder.
The design of the second-stage attention mechanism in the decoder is similar to the traditional attention mechanism in seq2seq; in other words, the temporal attention in the second stage is essentially the traditional attention mechanism.
The problem that traditional attention solves is the following: in a traditional seq2seq model, the context vector produced by the encoder is based only on the last hidden state or on the average of all hidden states, so the context vector is the same at every decoding step. It shows no differentiation, like a person who does not concentrate on the key parts, and it cannot select the relevant encoder hidden states.
The idea for solving this problem is to use a different context vector at each decoding step. As in seq2seq, the simplest method is to take a weighted average of the encoder hidden states over all time steps:
$$\mathbf{c}_{t'} = \sum_{t=1}^{T} \beta_{t'}^{t}\,\mathbf{h}_t$$
where the weight $\beta_{t'}^{t}$ is computed from the previous decoder hidden state $\mathbf{d}_{t'-1}$ and cell state $\mathbf{s}'_{t'-1}$:
$$l_{t'}^{t} = \mathbf{v}_d^\top \tanh\!\left(\mathbf{W}_d\,[\mathbf{d}_{t'-1}; \mathbf{s}'_{t'-1}] + \mathbf{U}_d\,\mathbf{h}_t\right), \qquad \beta_{t'}^{t} = \frac{\exp(l_{t'}^{t})}{\sum_{j=1}^{T} \exp(l_{t'}^{j})}$$
According to the model flow in the article, the input to the decoder is formed from the previous target value $y_{t'-1}$, the previous hidden state $\mathbf{d}_{t'-1}$, and the context vector $\mathbf{c}_{t'-1}$, i.e.:
$$\tilde{y}_{t'-1} = \tilde{\mathbf{w}}^\top [\,y_{t'-1}; \mathbf{c}_{t'-1}\,] + \tilde{b}$$
Then, analogously to the last formula of the encoder, the decoder hidden state is updated as $\mathbf{d}_{t'} = f_2(\mathbf{d}_{t'-1}, \tilde{y}_{t'-1})$, where the activation function $f_2$ is again chosen to be an LSTM.
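Below is a minimal sketch of one temporal-attention and decoder step under the formulas above; the layer names, sizes, and batch-free layout are illustrative assumptions rather than the repository's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of one temporal-attention / decoder step (illustrative sizes:
# m = encoder hidden size, p = decoder hidden size, T = window length).
m, p, T = 8, 8, 10
W_d = nn.Linear(2 * p, m, bias=False)   # applied to [d_{t'-1}; s'_{t'-1}]
U_d = nn.Linear(m, m, bias=False)       # applied to each encoder hidden state h_t
v_d = nn.Linear(m, 1, bias=False)
w_tilde = nn.Linear(1 + m, 1)           # maps [y_{t'-1}; c] to the scalar y_tilde
f2 = nn.LSTMCell(input_size=1, hidden_size=p)

H = torch.randn(T, m)                   # encoder hidden states h_1 ... h_T
d_prev = torch.zeros(p)                 # decoder hidden state d_{t'-1}
s_prev = torch.zeros(p)                 # decoder cell state s'_{t'-1}
y_prev = torch.tensor([0.3])            # previous target value y_{t'-1}

# l_{t'}^t = v_d^T tanh(W_d [d_{t'-1}; s'_{t'-1}] + U_d h_t), for t = 1..T
ds = torch.cat([d_prev, s_prev])                      # (2p,)
l = v_d(torch.tanh(W_d(ds) + U_d(H))).squeeze(-1)     # (T,)
beta = F.softmax(l, dim=0)                            # temporal attention weights

# Context vector (weighted sum of encoder hidden states); for brevity this
# same-step context stands in for c_{t'-1} when forming the decoder input.
c = beta @ H                                          # (m,)
y_tilde = w_tilde(torch.cat([y_prev, c]))             # scalar decoder input

# Decoder update d_{t'} = f2(d_{t'-1}, y_tilde_{t'-1}) with f2 = LSTM
d_t, s_t = f2(y_tilde.unsqueeze(0), (d_prev.unsqueeze(0), s_prev.unsqueeze(0)))
```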
4: Prediction Part
The article recalls that the ultimate goal of the Nonlinear Autoregressive Exogenous (NARX) model is to establish the relationship between the current output and all previous outputs as well as the past and current inputs, i.e.:
$$\hat{y}_T = F\!\left(y_1, \ldots, y_{T-1},\ \mathbf{x}_1, \ldots, \mathbf{x}_T\right)$$
Through the training of the encoder-decoder model above, the decoder's final hidden state $\mathbf{d}_T$ and context vector $\mathbf{c}_T$ have been obtained. Finally, a fully connected layer performs regression on $[\mathbf{d}_T; \mathbf{c}_T]$ to obtain the final prediction:
$$\hat{y}_T = \mathbf{v}_y^\top\!\left(\mathbf{W}_y\,[\mathbf{d}_T; \mathbf{c}_T] + \mathbf{b}_w\right) + b_v$$
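A minimal sketch of this final regression layer, assuming illustrative sizes for $\mathbf{d}_T$ and $\mathbf{c}_T$:

```python
import torch
import torch.nn as nn

# Sketch of the final regression y_hat_T = v_y^T (W_y [d_T; c_T] + b_w) + b_v,
# i.e. two stacked linear maps on the concatenated decoder state and context.
p, m = 8, 8                 # decoder hidden size and encoder hidden size
W_y = nn.Linear(p + m, p)   # the layer's bias plays the role of b_w
v_y = nn.Linear(p, 1)       # the layer's bias plays the role of b_v

d_T = torch.randn(p)        # final decoder hidden state
c_T = torch.randn(m)        # final context vector
y_hat_T = v_y(W_y(torch.cat([d_T, c_T])))   # the scalar prediction, shape (1,)
```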
5: Summary

The article discusses input attention and temporal attention separately, while the model architecture diagram presents them together. At first, I had some misunderstandings after reading the paper:
For example, in the input attention, $f_1$ uses an LSTM; its output then serves as input to the LSTM in the temporal attention; and then in the decoding layer, an LSTM is used for prediction. Doesn't that mean a total of 3 LSTMs are being trained?
After reading more carefully and reviewing the source code, I found that this understanding was incorrect: there are actually only 2 LSTMs, corresponding to the encoder LSTM in the input attention phase (which extracts adaptive features) and the LSTM in the decoding phase. I reorganized the model architecture diagram and annotated the arrows to indicate the corresponding positions.
The input attention on the left is actually just the computation of the input for a single time step of the temporal attention, which can be seen from the corresponding positions in the input attention and temporal attention parts of the diagram.
In other words, the input attention on the left is simply the per-time-step computation of the temporal attention's input, drawn out in detail. Looking only at the temporal attention on the right, it is just an implementation of the attention in Seq2Seq, with no differences. The author separates out the computation of the temporal attention's input to emphasize its computational process, which is the implementation mechanism of the input attention, in order to illustrate the highlight of the article: adaptive feature extraction based on attention in the input phase.
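Consistent with this reading, a typical PyTorch implementation of DA-RNN contains exactly two recurrent modules, one per stage (the sketch below is illustrative; the linked repository's exact structure may differ):

```python
import torch.nn as nn

# Illustrative skeleton: a DA-RNN implementation typically contains exactly two
# recurrent modules, one per stage (sizes made up; the linked repo may differ).
n, m, p = 4, 8, 8
encoder_lstm = nn.LSTM(input_size=n, hidden_size=m)  # stage 1: consumes x_tilde_t
decoder_lstm = nn.LSTM(input_size=1, hidden_size=p)  # stage 2: consumes y_tilde_{t'-1}

# The two attention blocks themselves are plain feed-forward layers
# (Linear + tanh + softmax) with no recurrence, so there is no third LSTM.
```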


