Source | Zhihu
Author | Lucas
Understanding Attention Mechanism and Its Implementation in PyTorch
Biologically Inspired Attention Model -> Resource Allocation
The attention mechanism in deep learning mimics the human visual attention mechanism and is essentially a resource allocation mechanism. The physiological principle is that human vision can take in high-resolution information from a specific region of an image while perceiving the surrounding area at a lower resolution, and the point of focus can shift over time. In other words, the human eye quickly scans the whole image to locate the region worth attending to, then allocates more attention to that region to obtain finer detail while suppressing irrelevant information, thereby improving the efficiency of the representation. For example, in the image below, my attention is mainly on the icon in the center and the word ATTENTION, while I pay much less attention to the stripes on the border, which are a bit dizzying to look at.

Encoder-Decoder Framework == Sequence to Sequence Conditional Generation Framework
The Encoder-Decoder framework, also known as the sequence-to-sequence conditional generation framework[1], is a widely used modeling framework in text processing. In a conventional encoder-decoder approach, the first step is to encode the input sentence sequence X into a fixed-length context vector C through a neural network; C represents the semantic content of the text. In the second step, another neural network acts as a decoder and predicts the target word sequence based on the context vector C and the words it has already predicted. During this process, the encoder and decoder RNNs are trained jointly, but the supervision signal appears only on the decoder RNN side, with gradients backpropagating to the encoder RNN. Using LSTMs for text modeling is currently a popular and effective method[2].
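To make the first step concrete, here is a minimal sketch of such an encoder in PyTorch (the GRU, the layer sizes, and the variable names are illustrative assumptions, not code from the original article): the encoder compresses the entire input sequence into a single fixed-length context vector C.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encode an input token sequence into one fixed-length context vector C."""
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, src):                 # src : [batch_size, src_len]
        embedded = self.embedding(src)      # [batch_size, src_len, embedding_dim]
        outputs, h_n = self.rnn(embedded)   # h_n : [1, batch_size, hidden_dim]
        C = h_n.squeeze(0)                  # fixed-length context vector C
        return outputs, C                   # outputs keep per-position states for later attention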
The most typical application of the attention mechanism is machine translation. Consider a task in which the words “Echt”, “Dicke” and “Kiste” are fed into the encoder, and an RNN represents the whole text as a fixed-length vector h3. The problem arises when the decoder generates y1: it relies solely on the last hidden state h3, i.e. the sentence embedding, which means h3 must encode all of the information in the input sentence. In practice, traditional Encoder-Decoder models cannot fully achieve this. Although LSTM[3] is designed to address the long-term dependency problem, it still has limitations: an RNN must pass sequentially through all previous units before long-range information can reach the current step, which makes it prone to vanishing gradients. LSTM alleviates this to some extent with its gating mechanisms, and LSTMs, GRUs, and their variants can indeed capture a good deal of long-range information, but they can only remember moderately long dependencies, not arbitrarily long contexts.

So let’s summarize the general paradigm of traditional encoder-decoder models and its problems, using the task of translating the Chinese “我/爱/赛尔” into English. A traditional encoder-decoder first reads in the entire sentence; after encoding the last word “赛尔”, the RNN produces a representation vector C for the whole sentence. During conditional generation, when the second target word is being translated, the decoder has to look back at the previously produced hidden state h_1 and the context representation C, and then decode the output.
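This conditional generation step can be sketched as a simple greedy decoding loop. A minimal illustration follows, assuming a GRUCell decoder, an output projection layer, and start/end token indices sos_idx / eos_idx that are not part of the original article; one common way to condition on C is to use it as the decoder's initial hidden state.

import torch

def greedy_decode(decoder_cell, out_proj, embedding, C, sos_idx, eos_idx, max_len=20):
    """Generate target tokens one by one, conditioning every step on the same fixed C."""
    h = C if C.dim() == 2 else C.unsqueeze(0)   # decoder state initialized from C : [1, hidden_dim]
    prev_token = torch.tensor([sos_idx])
    result = []
    for _ in range(max_len):
        emb = embedding(prev_token)             # [1, embedding_dim]
        h = decoder_cell(emb, h)                # one GRUCell step -> new hidden state
        logits = out_proj(h)                    # [1, vocab_size]
        prev_token = logits.argmax(dim=-1)      # greedy choice of the next target word
        if prev_token.item() == eos_idx:
            break
        result.append(prev_token.item())
    return result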
From Equal Attention to Focused Attention
In the traditional Encoder-Decoder framework, the decoder predicts the target word sequence based on the context vector C and the words it has already predicted. This means that no matter which word is being generated, the sentence encoding C stays the same. In other words, every word in the source sentence has the same influence on generating any particular target word y_i, i.e. equal attention. This is clearly counterintuitive: when I translate one part, I should focus on the part of the original text that corresponds to it, and when translating the first word I should pay more attention to the meaning of the first word in the original text. See the pseudocode and the image below:
P_y1 = F(E<start>, C), P_y2 = F(E<the>, C), P_y3 = F(E<black>, C)

Next, observe the difference between the two images: the single fixed context representation C is replaced by a context C_i that changes with the word currently being generated.

The text translation process becomes:
P_y1 = F(E<start>, C_0), P_y2 = F(E<the>, C_1), P_y3 = F(E<black>, C_2)
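To contrast with the fixed-C loop sketched earlier, here is a minimal sketch of a single decoding step that builds a per-step context c_i from the encoder outputs (dot-product scoring is assumed for simplicity; names such as encoder_outputs and decoder_cell are illustrative, and the decoder cell is assumed to take the concatenation of the word embedding and c_i as its input):

import torch
import torch.nn.functional as F

def decode_step_with_attention(decoder_cell, embedding, prev_token, h, encoder_outputs):
    """One decoding step whose context c_i depends on the current decoder state."""
    # encoder_outputs : [src_len, hidden_dim], h : [1, hidden_dim]
    scores = encoder_outputs @ h.squeeze(0)                   # [src_len] dot-product alignment scores
    alpha = F.softmax(scores, dim=0)                          # attention weights over source positions
    c_i = alpha @ encoder_outputs                             # [hidden_dim] weighted sum = context c_i
    emb = embedding(prev_token)                               # [1, embedding_dim]
    rnn_input = torch.cat([emb, c_i.unsqueeze(0)], dim=-1)    # condition on c_i instead of a fixed C
    h_new = decoder_cell(rnn_input, h)                        # GRUCell(embedding_dim + hidden_dim, hidden_dim)
    return h_new, alpha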
Code Implementation of Encoder-Decoder Framework[4]
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many other models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
Considerations on Interpretability
The traditional encoder-decoder without an attention mechanism has poor interpretability: we do not really know what information is encoded in the encoding vector, how that information is used, or why the decoder behaves the way it does. Architectures that include attention mechanisms give us a relatively simple way to inspect the decoder's reasoning process and what the model is actually learning. Although this is only a weak form of interpretability, it is still meaningful.
Facing the Core Formula of Attention
When predicting the i-th word of the target language, the weight given to the j-th word of the source language is α_ij = exp(e_ij) / Σ_k exp(e_ik), i.e. a softmax over the alignment scores e_ij between target position i and source position j, and the context vector is the weighted sum c_i = Σ_j α_ij · h_j over the source annotations h_j. The size of the weight α_ij can be seen as a form of soft alignment information between the source and target languages.
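As a tiny numeric illustration of this soft alignment (the score values and the source annotations below are made up purely for the example):

import torch
import torch.nn.functional as F

# Suppose the source sentence has 3 words and the alignment scores e_ij for target position i are:
e_i = torch.tensor([2.0, 0.5, 0.1])
alpha_i = F.softmax(e_i, dim=0)        # ≈ tensor([0.73, 0.16, 0.11]) -> soft alignment weights
h = torch.tensor([[1.0, 0.0],          # h_1 : annotation of source word 1
                  [0.0, 1.0],          # h_2 : annotation of source word 2
                  [1.0, 1.0]])         # h_3 : annotation of source word 3
c_i = alpha_i @ h                      # ≈ tensor([0.84, 0.27]) -> context vector for target word i
print(alpha_i, c_i)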
Conclusion
In essence, when predicting a target word y_i, the attention method automatically gathers semantic information from different positions of the original sentence, assigns a weight to the information at each position (this is the “soft” alignment information), and combines the weighted information into a source-sentence vector representation c_i tailored to the current word y_i.
PyTorch Implementation of Attention
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

class BiLSTM_Attention(nn.Module):
    def __init__(self):
        super(BiLSTM_Attention, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, n_hidden, bidirectional=True)
        self.out = nn.Linear(n_hidden * 2, num_classes)

    # lstm_output : [batch_size, n_step, n_hidden * num_directions(=2)], F matrix
    def attention_net(self, lstm_output, final_state):
        hidden = final_state.view(-1, n_hidden * 2, 1)
        # hidden : [batch_size, n_hidden * num_directions(=2), 1(=n_layer)]
        attn_weights = torch.bmm(lstm_output, hidden).squeeze(2)
        # attn_weights : [batch_size, n_step]
        soft_attn_weights = F.softmax(attn_weights, 1)
        # [batch_size, n_hidden * num_directions(=2), n_step] * [batch_size, n_step, 1]
        #   = [batch_size, n_hidden * num_directions(=2), 1]
        context = torch.bmm(lstm_output.transpose(1, 2), soft_attn_weights.unsqueeze(2)).squeeze(2)
        # context : [batch_size, n_hidden * num_directions(=2)]
        return context, soft_attn_weights.data.numpy()

    def forward(self, X):
        input = self.embedding(X)       # input : [batch_size, len_seq, embedding_dim]
        input = input.permute(1, 0, 2)  # input : [len_seq, batch_size, embedding_dim]

        # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        hidden_state = Variable(torch.zeros(1 * 2, len(X), n_hidden))
        cell_state = Variable(torch.zeros(1 * 2, len(X), n_hidden))

        # final_hidden_state, final_cell_state : [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        output, (final_hidden_state, final_cell_state) = self.lstm(input, (hidden_state, cell_state))
        output = output.permute(1, 0, 2)  # output : [batch_size, len_seq, n_hidden * num_directions(=2)]
        attn_output, attention = self.attention_net(output, final_hidden_state)
        # model : [batch_size, num_classes], attention : [batch_size, n_step]
        return self.out(attn_output), attention
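A minimal usage sketch follows. The hyperparameter values and the dummy batch are illustrative assumptions; in the linked script, vocab_size, embedding_dim, n_hidden, and num_classes are defined as globals before the class.

# Hypothetical hyperparameters; the linked script defines these as module-level globals.
vocab_size, embedding_dim, n_hidden, num_classes = 1000, 64, 128, 2

model = BiLSTM_Attention()
dummy_batch = torch.randint(0, vocab_size, (4, 10))   # [batch_size=4, len_seq=10] token ids
logits, attention = model(dummy_batch)
print(logits.shape)      # torch.Size([4, 2])  -> class scores per example
print(attention.shape)   # (4, 10)             -> one attention weight per source position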
GitHub link:
https://github.com/zy1996code/nlp_basic_model/blob/master/lstm_attention.py