In the previous lecture, we discussed the seq2seq model. Although the seq2seq model is powerful, its effectiveness is greatly reduced when it is used in isolation. This section introduces the attention model, which brings the intuition of human attention into the encoder-decoder framework.
Principle of the Attention Mechanism
The attention mechanism in the human brain is essentially a model for allocating attention resources. For instance, when we read a paper, our attention at any given moment is usually focused on a specific line of text. When we look at an image, our attention is concentrated on a particular part of it. As our gaze moves, our attention shifts to another line of text or another region of the image. So at any moment, our attention is distributed unevenly over the paper or the image. This is the origin of the well-known attention mechanism. We have previously mentioned the idea of attention in the context of object detection in computer vision, where Fast R-CNN uses Regions of Interest (RoI) to perform the detection task more effectively; RoI is an application of the attention model in computer vision.
The attention model is more commonly used in natural language processing, particularly in applications such as machine translation. In natural language processing, the attention model is typically applied within the classic Encoder-Decoder framework, the well-known N-to-M model in RNNs, where the input and output sequences can have different lengths. The seq2seq model is a typical instance of the Encoder-Decoder framework.
As a general framework, the Encoder-Decoder framework is not refined enough for specific natural language processing tasks. In other words, the plain Encoder-Decoder framework cannot focus on the relevant parts of the input, which is why models like seq2seq do not reach their full potential when used alone. For example, in the image above, the encoder compresses the entire input into a single context variable c, and during decoding every output Y uses this same c indiscriminately. The attention model instead computes a different context vector c for each decoding time step and combines these different c's during decoding to produce more accurate results.
Unified encoding as c:
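In symbols, one common way to write this is (F denotes the encoder, x_1 … x_m the input sequence):

```latex
% The whole input sequence is compressed into one fixed context vector
c = F(x_1, x_2, \dots, x_m)
```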
Using unified c for decoding:
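Again in a common formulation, every output is generated from that same c:

```latex
% Each decoding step reuses the single context vector c
y_1 = f(c), \qquad y_2 = f(c, y_1), \qquad y_3 = f(c, y_1, y_2), \qquad \dots
```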
After applying the attention model, each encoder input keeps its own encoding, and during decoding each output step has its own corresponding c, rather than a one-size-fits-all context:
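Sticking with the same notation, the decoding with attention can be written as:

```latex
% Each output step now has its own context vector c_i
y_1 = f(c_1), \qquad y_2 = f(c_2, y_1), \qquad y_3 = f(c_3, y_1, y_2), \qquad \dots
```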
Consequently, the original Encoder-Decoder framework transforms into the following structure after introducing the attention mechanism:
The above is a diagram of the attention model. Next, let's see how the attention model can be described with formulas:
A simple attention model is usually described by three formulas: 1) calculate the attention score; 2) normalize the scores; 3) combine the normalized scores with the hidden state values to compute the context state c. Here, u is the state value at a certain decoding time step, i.e., the feature vector matching the current task, v_i is the state value at the i-th encoding time step, and a(·) is the function that scores how well u matches v_i. a(·) can typically take the following forms:
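Written out (one common way to formalize the three steps; the exact score function a(·) varies by model):

```latex
% 1) attention score for the i-th encoder state
e_i = a(u, v_i)

% 2) softmax normalization of the scores
\alpha_i = \frac{\exp(e_i)}{\sum_{j} \exp(e_j)}

% 3) context vector: attention-weighted sum of the encoder states
c = \sum_{i} \alpha_i v_i

% Typical choices for the score function a(u, v_i):
a(u, v_i) = u^{\top} v_i                                % dot product
a(u, v_i) = u^{\top} W v_i                              % general / bilinear
a(u, v_i) = w_2^{\top} \tanh\!\big(W_1 [u; v_i]\big)    % additive (MLP)
```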
For example, in machine translation, a diagram of calculating an attention score:
The attention score is calculated from the hidden state h in the encoder and the hidden state h' in the decoder. As a result, each context variable c automatically selects the contextual information most relevant to the current output y. Specifically, we use a_ij to measure the relevance between the encoder's hidden state h_j at step j and the decoder at step i, and the contextual information c_i for the i-th decoding step is the weighted sum of all h_j with weights a_ij.
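In this notation (with h'_{i-1} standing for the decoder state used to score step i, an assumption about where the query comes from):

```latex
c_i = \sum_{j=1}^{T_x} a_{ij}\, h_j,
\qquad
a_{ij} = \frac{\exp\!\big(\mathrm{score}(h'_{i-1}, h_j)\big)}
              {\sum_{k=1}^{T_x} \exp\!\big(\mathrm{score}(h'_{i-1}, h_k)\big)}
```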
Machine Translation Based on the Attention Model
Suppose we want to translate a time expression written in English into a numerical time format using an RNN model with an attention mechanism. Examples of the input and output formats are as follows:
Import task-related packages:
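The snippets below are a minimal sketch of such a pipeline (standard Keras APIs; the original notebook may differ). Later snippets build on the names defined here.

```python
# Assumed imports for the sketch below (Keras 2.x style).
import random
import numpy as np
import matplotlib.pyplot as plt

from keras.models import Model
from keras.layers import (Input, LSTM, Bidirectional, Dense,
                          RepeatVector, Concatenate, Dot, Softmax)
from keras.optimizers import Adam
```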
Read input and output related text data:
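The original repository ships the raw data with the notebook; here is a hypothetical stand-in that generates (human-readable, machine-readable) date pairs with the third-party faker and babel packages, in the spirit of the deeplearning.ai date-translation example cited in the references. Substitute the actual time/date text data as appropriate.

```python
from faker import Faker
from babel.dates import format_date

fake = Faker()

# A few human-readable date formats (chosen for illustration only).
FORMATS = ['long', 'full', 'medium', 'd MMM YYY', 'dd/MM/YYY']

def create_dataset(m):
    """Generate m pairs of (human-readable date, machine-readable date)."""
    dataset = []
    for _ in range(m):
        dt = fake.date_object()  # a random datetime.date
        human = format_date(dt, format=random.choice(FORMATS), locale='en')
        machine = dt.isoformat()  # e.g. '2009-06-25'
        dataset.append((human.lower(), machine))
    return dataset

dataset = create_dataset(10000)
```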
Tokenize the input time text and preprocess it, perform one-hot preprocessing on the output time, and define the preprocessing function:
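A sketch of the preprocessing helpers: character-level vocabularies plus one-hot encoding. Names such as Tx and Ty are assumptions, not the original code.

```python
Tx, Ty = 30, 10  # assumed maximum input length and fixed output length

def build_vocab(dataset):
    """Build character-level index maps for the source and target strings."""
    human_vocab = sorted(set(''.join(h for h, _ in dataset)) | {'<unk>', '<pad>'})
    machine_vocab = sorted(set(''.join(t for _, t in dataset)))
    human2idx = {c: i for i, c in enumerate(human_vocab)}
    machine2idx = {c: i for i, c in enumerate(machine_vocab)}
    idx2machine = {i: c for c, i in machine2idx.items()}
    return human2idx, machine2idx, idx2machine

def string_to_int(text, length, vocab):
    """Map a string to a fixed-length list of character indices."""
    text = text.lower()[:length]
    ids = [vocab.get(c, vocab.get('<unk>', 0)) for c in text]
    ids += [vocab.get('<pad>', 0)] * (length - len(ids))
    return ids

def preprocess(dataset, human2idx, machine2idx, Tx, Ty):
    """Index and one-hot encode the input and output strings."""
    X = np.array([string_to_int(h, Tx, human2idx) for h, _ in dataset])
    Y = np.array([string_to_int(t, Ty, machine2idx) for _, t in dataset])
    Xoh = np.eye(len(human2idx))[X]    # (m, Tx, len(human2idx))
    Yoh = np.eye(len(machine2idx))[Y]  # (m, Ty, len(machine2idx))
    return Xoh, Yoh
```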
Preprocess the data and split the dataset:
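Applying the helpers and holding out a test split (the 90/10 split is an arbitrary choice for the sketch):

```python
human2idx, machine2idx, idx2machine = build_vocab(dataset)
Xoh, Yoh = preprocess(dataset, human2idx, machine2idx, Tx, Ty)

split = int(0.9 * len(dataset))
Xoh_train, Yoh_train = Xoh[:split], Yoh[:split]
Xoh_test, Yoh_test = Xoh[split:], Yoh[split:]
print(Xoh_train.shape, Yoh_train.shape)
```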
View examples of the data before and after preprocessing:

Once the data is ready, we can begin modeling. In this example, the formulas for calculating the attention score and the context vector are as follows:
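One concrete parameterization, and the one the code sketch below follows: a small feed-forward network scores each encoder state a^{<t'>} against the previous decoder state s^{<t-1>} (the layer sizes are assumptions).

```latex
% unnormalized energy for decoder step t and encoder step t'
e^{\langle t, t' \rangle} =
    w_2^{\top} \tanh\!\big(W_1\,[\,s^{\langle t-1 \rangle};\, a^{\langle t' \rangle}\,] + b_1\big) + b_2

% attention weights: softmax over the Tx encoder steps
\alpha^{\langle t, t' \rangle} =
    \frac{\exp\big(e^{\langle t, t' \rangle}\big)}
         {\sum_{t''=1}^{T_x} \exp\big(e^{\langle t, t'' \rangle}\big)}

% context vector fed to the decoder at step t
c^{\langle t \rangle} = \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle}\, a^{\langle t' \rangle}
```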
Define the attention mechanism process for one time step:
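A sketch of one attention step; the layers are created once and reused so their weights are shared across all decoder time steps (sizes n_a and n_s are assumptions):

```python
n_a = 32  # hidden size of the bidirectional encoder LSTM (assumed)
n_s = 64  # hidden size of the post-attention decoder LSTM (assumed)

# Shared layers, reused at every decoder time step so weights are shared.
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation='tanh')
densor2 = Dense(1)                                             # unnormalized energy e
attention_softmax = Softmax(axis=1, name='attention_weights')  # softmax over the Tx axis
dotor = Dot(axes=1)

def one_step_attention(a, s_prev):
    """Compute the context vector c^<t> from all encoder states a
    (shape (m, Tx, 2*n_a)) and the previous decoder state s_prev (m, n_s)."""
    s_prev = repeator(s_prev)              # (m, Tx, n_s)
    concat = concatenator([a, s_prev])     # (m, Tx, 2*n_a + n_s)
    e = densor1(concat)                    # (m, Tx, 10)
    energies = densor2(e)                  # (m, Tx, 1)
    alphas = attention_softmax(energies)   # attention weights over the input steps
    context = dotor([alphas, a])           # (m, 1, 2*n_a)
    return context
```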
Then, define an attention network layer based on the attention process for one time step:
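Building the full network around that step: a bidirectional LSTM encoder, then a loop of Ty decoder steps, each consuming its own context vector (a sketch; layer sizes and names are assumptions):

```python
def attention_model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """Bidirectional LSTM encoder + attention + LSTM decoder (sketch)."""
    X = Input(shape=(Tx, human_vocab_size))
    s0 = Input(shape=(n_s,), name='s0')   # initial decoder hidden state
    c0 = Input(shape=(n_s,), name='c0')   # initial decoder cell state
    s, c = s0, c0

    # Decoder-side layers, shared across the Ty output steps.
    post_attention_lstm = LSTM(n_s, return_state=True)
    output_layer = Dense(machine_vocab_size, activation='softmax')

    # Encoder: bidirectional LSTM over the input characters.
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)

    outputs = []
    for _ in range(Ty):
        context = one_step_attention(a, s)                        # (m, 1, 2*n_a)
        s, _, c = post_attention_lstm(context, initial_state=[s, c])
        outputs.append(output_layer(s))

    return Model(inputs=[X, s0, c0], outputs=outputs)

model = attention_model(Tx, Ty, n_a, n_s, len(human2idx), len(machine2idx))
```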
Then pass the data, compile the model, and train it:
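Compiling and training (the learning rate, epoch count, and batch size are assumptions for the sketch):

```python
model.compile(optimizer=Adam(0.005),  # learning rate 0.005 (assumed)
              loss='categorical_crossentropy',
              metrics=['accuracy'])

m_train = Xoh_train.shape[0]
s0_init = np.zeros((m_train, n_s))
c0_init = np.zeros((m_train, n_s))

# The model has Ty separate outputs, so the targets are a list of
# Ty arrays, each of shape (m_train, machine_vocab_size).
targets = list(Yoh_train.swapaxes(0, 1))
model.fit([Xoh_train, s0_init, c0_init], targets, epochs=20, batch_size=128)
```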
An example of the training output is as follows:
The model structure containing the attention layer is as follows:
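The structure can be printed with Keras' built-in summary:

```python
model.summary()
```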
The overall structure of the model is illustrated in the following diagram (not entirely consistent, for illustration only):
Next, evaluate the trained model on the test set:
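A sketch of evaluating on the held-out examples and decoding a few predictions back to text:

```python
m_test = Xoh_test.shape[0]
s0_test = np.zeros((m_test, n_s))
c0_test = np.zeros((m_test, n_s))

preds = model.predict([Xoh_test, s0_test, c0_test])   # list of Ty arrays (m_test, vocab)
preds = np.argmax(np.stack(preds, axis=1), axis=-1)   # (m_test, Ty) character indices

for i in range(5):
    source, target = dataset[split + i]
    output = ''.join(idx2machine[int(idx)] for idx in preds[i])
    print(f'{source!r} -> {output!r} (truth: {target!r})')
```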
Finally, let's take a brief look at what the attention model expresses during the translation process by visualizing the attention weights; a plotting sketch is given below. It can be seen that, in the test example above, the attention mechanism lets the model focus on the corresponding input characters.
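One way to get at the weights, assuming the layer naming used in the sketch above (the shared softmax layer named 'attention_weights' is called once per decoder step):

```python
# Build a helper model that exposes the attention weights of each decoder step.
attention_layer = model.get_layer('attention_weights')
alpha_outputs = [attention_layer.get_output_at(t) for t in range(Ty)]
viz_model = Model(inputs=model.inputs, outputs=alpha_outputs)

# Attention map for the first test example: rows = output steps, cols = input chars.
alphas = viz_model.predict([Xoh_test[:1], s0_test[:1], c0_test[:1]])
alphas = np.squeeze(np.stack(alphas, axis=1))   # (Ty, Tx)

plt.imshow(alphas, cmap='viridis')
plt.xlabel('input characters')
plt.ylabel('output characters')
plt.colorbar()
plt.show()
```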
References:
deeplearningai.com
https://github.com/luwill/Attention_Network_With_Keras
https://zhuanlan.zhihu.com/p/28054589
Previous Highlights:
Lecture 46: seq2seq Model
Lecture 45: GloVe Word Vectors and Related Applications
Lecture 44: Training a word2vec Word Vector
Lecture 43: Word2vec in Natural Language Processing
Lecture 42: Word Embedding and Word Vectors in Natural Language Processing
Lecture 41: Implementation of LSTM in numpy and keras
Lecture 40: RNNs and Long Short-Term Memory Networks (LSTM)
Lecture 39: RNNs and Gated Recurrent Units (GRU)
Lecture 38: Four Types of RNNs