Author:CHEONG
From: Machine Learning and Natural Language Processing
1. Understanding the Principle of Attention Mechanism
Simply put, the Attention mechanism describes how much attention the output y at a given moment pays to each part of the input x. Here, attention means a weight: the contribution of each part of the input x to the output y at that moment. With this in mind, let's first look at the self-attention and context-attention mentioned in the Transformer model.
(1) Self-attention: the input sequence and the output sequence are the same sequence, i.e., the attention weights of the sequence with respect to itself are computed.
(2) Context-attention: this is encoder-decoder attention; in a machine translation model, for example, it computes the attention weights between each word in the decoder sequence and each word in the encoder sequence.
2. Methods of Attention Calculation
1. Content-based Attention
Paper: Neural Turing Machines
Paper link: https://arxiv.org/pdf/1410.5401.pdf
The similarity measure for attention is based on cosine similarity; the relevant equations are sketched below.
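A sketch of the content-based addressing equations from the NTM paper, where $\mathbf{k}_t$ is the key emitted by the controller, $\mathbf{M}_t(i)$ the $i$-th memory row, and $\beta_t$ a key-strength parameter:

$$
K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}, \qquad
w_t^{c}(i) = \frac{\exp\big(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(i)]\big)}{\sum_{j}\exp\big(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(j)]\big)}
$$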
2. Additive Attention
Paper: Neural Machine Translation By Jointly Learning to Align and Translate
Paper link: https://arxiv.org/pdf/1409.0473.pdf
This paper provides a detailed introduction to the implementation of attention, which will not be introduced here.
3. Location-based Attention, General Attention, Dot-Product Attention
Paper: Effective Approaches to Attention-based Neural Machine Translation
Paper link: https://arxiv.org/pdf/1508.04025.pdf
This paper provides a detailed introduction to the implementation of attention, which will not be introduced here.
4. Scaled Dot-Product Attention
Paper: Attention is All You Need
Paper link: https://arxiv.org/pdf/1706.03762.pdf
This is the attention mechanism used in the now-familiar Transformer model; it is introduced below.
3. Evolution and Development of Attention
1. Introduction of Attention Mechanism in Seq2seq
The Attention mechanism is commonly used in seq2seq models. The first diagram shows the traditional seq2seq, where the output y treats the inputs x1, x2, x3… indiscriminately, without distinguishing their relative importance. The second diagram introduces the attention mechanism: each output word y is influenced by the inputs X1, X2, X3… with different weights. These weights are computed by Attention, so the Attention mechanism can be viewed as an attention allocation coefficient that measures the weight of each input's influence on each output.
(Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation paper screenshot)
(Neural Machine Translation by Jointly Learning to Align and Translate paper screenshot)
Let’s introduce the attention implementation process in the above diagram.
(1) First, use a bidirectional RNN structure to obtain the hidden state (h1, h2, …, hn).
(2) Given the previous decoder state s(i-1), calculate the influence (score eij) of each encoder hidden state hj on the current output position i; the formulas are given after this list.
There are many ways to calculate attention, including dot product, weighted dot product, or summation.
(3) Apply softmax to eij to obtain the normalized attention weight distribution, as shown below.
(4) Use aij for weighted summation to obtain the corresponding context vector.
(5) Calculate the final output.
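Written out (reconstructed from the Bahdanau et al. paper, with $s_{i-1}$ the previous decoder state, $h_j$ the encoder hidden states, and $n$ the source length), steps (2)–(5) correspond to:

$$
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n}\exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{n} \alpha_{ij} h_j, \qquad
s_i = f(s_{i-1}, y_{i-1}, c_i)
$$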
2. Soft Attention and Hard Attention
(Proposed in the paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention)
Soft attention is parameterized and differentiable, so it can be embedded directly into the model and trained end-to-end; the seq2seq attention described above is soft attention. Hard attention is a stochastic process that requires Monte Carlo sampling to estimate the module's gradient for backpropagation. Current research and applications tend to prefer Soft Attention because it can be differentiated directly and gradients can be backpropagated.
3. Global Attention and Local Attention
(Effective Approaches to Attention-based Neural Machine Translation screenshot)
The left image is Global Attention, and the right image is Local Attention.
Global Attention and Local Attention have much in common. In the decoding phase of a translation model, for example, both take the top-layer LSTM hidden state ht as input, and both aim to obtain a context vector ct that captures source-side information to help predict the current word. They differ only in how ct is obtained; the subsequent steps are shared: first, an attentional hidden state is computed from the context vector and the decoder hidden state, and this attentional hidden state is then passed through a softmax layer to produce the predictive distribution.
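Reconstructed from the Luong et al. paper, those two shared steps are:

$$
\tilde{h}_t = \tanh\!\big(W_c\,[c_t; h_t]\big), \qquad
p(y_t \mid y_{<t}, x) = \mathrm{softmax}\big(W_s\,\tilde{h}_t\big)
$$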
Next, let’s introduce how Global Attention and Local Attention obtain the context vector ct.
Global Attention considers all of the encoder's hidden states when computing the context vector ct. The alignment vector at is obtained by comparing the current target hidden state ht with each source hidden state hs.
The score between ht and hs can be computed in several ways (dot, general, concat).
Attention scores can also be computed purely from the target location; all of these formulas are sketched below.
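As they appear in the Luong et al. paper (with $\bar{h}_s$ denoting the source hidden states):

$$
a_t(s) = \mathrm{align}(h_t, \bar{h}_s) = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)},
\qquad
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^{\top}\bar{h}_s & \text{dot} \\
h_t^{\top} W_a \bar{h}_s & \text{general} \\
v_a^{\top} \tanh\big(W_a [h_t; \bar{h}_s]\big) & \text{concat}
\end{cases}
$$

The location-based variant simply uses the target state: $a_t = \mathrm{softmax}(W_a h_t)$.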
The drawback of Global Attention is that every target word must attend to all words on the source side, which is expensive. The idea of Local Attention is therefore to select only a subset of source positions for each target word when predicting the context vector ct. Specifically, at time t the model first generates an alignment position pt for the target word; ct is then obtained as a weighted average of the source hidden states within the window [pt − D, pt + D], where D is chosen empirically. Unlike the global method, the local alignment vector at has a fixed dimension, i.e., at ∈ ℝ^(2D+1).
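For the predictive (local-p) variant described above, the paper computes the alignment position and then weights the alignment scores with a Gaussian centered at $p_t$ (reconstructed here; $S$ is the source sentence length):

$$
p_t = S \cdot \mathrm{sigmoid}\big(v_p^{\top} \tanh(W_p h_t)\big), \qquad
a_t(s) = \mathrm{align}(h_t, \bar{h}_s)\, \exp\!\left(-\frac{(s - p_t)^2}{2\sigma^2}\right), \quad \sigma = \frac{D}{2}
$$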
4. Attention in Transformer
Let’s briefly talk about the Scaled Dot-Product Attention and Multi-Head Attention used in the Transformer. For detailed content, please refer to the blogger’s previous articles: Understanding the Details of the Transformer Model and Tensorflow Implementation, PyTorch Implementation of the Transformer Model.
(Attention is All You Need paper screenshot)
Scaled Dot-Product Attention: the weight coefficients applied to the matrix V are computed from the Q and K matrices.
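As given in the paper:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$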
Multi-Head Attention: Multi-head attention maps Q, K, and V into h sets of Q, K, and V through a linear transformation, then calculates Scaled Dot-Product Attention for each, and finally combines them. The purpose of Multi-Head Attention is to extract features more comprehensively.
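In formula form (from the paper):

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad
\mathrm{head}_i = \mathrm{Attention}\big(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\big)
$$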
5. Related Papers on Attention Combination
(1) Hierarchical Attention Networks for Document Classification
(2) Attention-over-Attention Neural Networks for Reading Comprehension
(3) Multi-step Attention: Convolutional Sequence to Sequence Learning
(4) Multi-dimensional Attention: Coupled Multi-Layer Attentions for Co-Extraction of Aspect and Opinion Terms
6. Application Fields of Attention
(1) Machine Translation
(2) Textual Entailment Reasoning
(3) Text Summarization Generation
4. Attention Case Analysis
First, let’s look at the introduction of the Attention mechanism in the Encoder-Decoder, such as in machine translation.
Each output word Y is influenced by the inputs X1, X2, X3, X4 with different weights, and these weights are computed by Attention.
The Attention mechanism can therefore be viewed as an attention allocation coefficient that measures how much each input item contributes to each output.
The following image shows the actual weight calculation process of the Attention mechanism in machine translation.
First, the original input vectors are multiplied by learned projection matrices to obtain the Q, K, and V vectors, as shown in the image below.
Taking the word Thinking as an example, first multiply the q vector of Thinking with the k vectors of all words using the formula below:
After scaling and softmax, this yields the contribution weight of each word to the word Thinking; these weights are then used to take a weighted sum of each word's v vector, producing the final output vector for Thinking.
The purpose of the scaling factor √dk in Attention is to keep the dot products from growing too large as the dimension increases, which would push the softmax into regions with extremely small gradients and slow down training.
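Since the original illustrations are not reproduced here, the following minimal NumPy sketch (with made-up numbers for a two-word sentence "Thinking Machines") walks through the same computation: q·k scores, scaling by √dk, softmax, and a weighted sum of the v vectors.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy example: 2 tokens ("Thinking", "Machines"), d_k = 4; all numbers are made up.
q_thinking = np.array([1.0, 0.0, 1.0, 0.0])      # q vector of "Thinking"
K = np.array([[1.0, 1.0, 0.0, 0.0],              # k vector of "Thinking"
              [0.0, 1.0, 1.0, 0.0]])             # k vector of "Machines"
V = np.array([[0.5, 0.1],                        # v vector of "Thinking"
              [0.2, 0.9]])                       # v vector of "Machines"

scores = K @ q_thinking / np.sqrt(q_thinking.shape[0])  # q . k / sqrt(d_k)
weights = softmax(scores)                                # contribution of each word to "Thinking"
z_thinking = weights @ V                                 # output vector for "Thinking"
print(weights, z_thinking)
```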
5. Analysis of Attention Mechanism Implementation
1. Implementation of Attention Mechanism in Hierarchical Attention Network
The HAN structure includes Word encoder, Word attention, Sentence encoder, and Sentence attention, which contains both word attention and sentence attention.
Explanation: h is the hidden-state output of the GRU. Three randomly initialized parameters w, b, and us are defined. First, a fully connected transformation with the tanh activation is applied to obtain ui; then the product of us and ui is computed and passed through softmax to obtain the weights a; finally, the weights a are multiplied with h and summed to obtain the output of the attention mechanism. The left image shows the word-attention formulas, and the right image the sentence-attention formulas.
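For reference, the word-level formulas from the HAN paper (the sentence-level ones are analogous, with a sentence-level context vector in place of $u_w$):

$$
u_{it} = \tanh(W_w h_{it} + b_w), \qquad
\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t}\exp(u_{it}^{\top} u_w)}, \qquad
s_i = \sum_{t} \alpha_{it} h_{it}
$$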
The core code for Attention implementation in HAN is as follows:
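The original code screenshot is not reproduced here; the following is a minimal TensorFlow 2 sketch of such an attention layer, written to mirror the explanation above (the names w, b, and us are illustrative, and this is a reconstruction rather than the blogger's original code):

```python
import tensorflow as tf

def han_attention(h, attention_size):
    """h: [batch, time, hidden] GRU outputs; returns the attended vector and weights."""
    hidden_size = h.shape[-1]
    # Randomly initialized parameters w, b and the context vector us.
    w = tf.Variable(tf.random.normal([hidden_size, attention_size], stddev=0.1))
    b = tf.Variable(tf.zeros([attention_size]))
    us = tf.Variable(tf.random.normal([attention_size], stddev=0.1))

    u = tf.tanh(tf.tensordot(h, w, axes=1) + b)   # [batch, time, attention_size]
    scores = tf.tensordot(u, us, axes=1)          # [batch, time]
    a = tf.nn.softmax(scores, axis=-1)            # attention weights
    output = tf.reduce_sum(h * tf.expand_dims(a, -1), axis=1)  # weighted sum -> [batch, hidden]
    return output, a
```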
In attention, the role of the mask is to stop the padded positions (those beyond seq_length) from contributing: their scores are set to an extremely small (very negative) value. Let's look at one implementation of the mask.
The mask is built so that positions beyond seq_length are False, and the scores at those False positions are then set to an extremely small value; softmax therefore assigns them essentially zero weight, their gradient is numerically zero during backpropagation, and no information flows back through them.
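A minimal sketch of that idea (assuming TensorFlow 2 and a [batch, time] tensor of raw attention scores; not the original code):

```python
import tensorflow as tf

def mask_scores(scores, seq_length):
    """scores: [batch, time] raw attention scores; seq_length: [batch] valid lengths."""
    max_len = tf.shape(scores)[1]
    mask = tf.sequence_mask(seq_length, maxlen=max_len)   # True inside seq_length, False beyond
    # Padded positions get an extremely small (very negative) score, so softmax
    # gives them essentially zero weight and essentially zero gradient.
    return tf.where(mask, scores, tf.ones_like(scores) * -1e9)
```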
Here is another implementation of the mask, using multiplication and addition to achieve the same effect.
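A sketch of the add-and-multiply version under the same assumptions:

```python
import tensorflow as tf

def mask_scores_add_mul(scores, seq_length):
    """Same effect as above, written with multiply and add instead of tf.where."""
    max_len = tf.shape(scores)[1]
    mask = tf.cast(tf.sequence_mask(seq_length, maxlen=max_len), scores.dtype)  # 1.0 valid, 0.0 padded
    # mul keeps the valid scores, add pushes the padded positions to -1e9.
    return scores * mask + (1.0 - mask) * -1e9
```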
2. Implementation of Multi-Head Attention in Transformer
For more details, please refer to the previous article: Understanding the Details of the Transformer Model and Tensorflow Implementation.
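The TensorFlow code lives in the referenced article and is not reproduced here; as a language-neutral illustration, here is a minimal NumPy sketch of multi-head attention with random toy weights (an illustrative reconstruction, not the blogger's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: [seq_len, d_model]; Wq/Wk/Wv/Wo: [d_model, d_model] projection matrices."""
    seq_len, d_model = x.shape
    depth = d_model // num_heads

    def split_heads(t):                          # [seq_len, d_model] -> [heads, seq_len, depth]
        return t.reshape(seq_len, num_heads, depth).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(depth)   # scaled dot-product per head
    weights = softmax(scores)                            # [heads, seq_len, seq_len]
    heads = weights @ v                                  # [heads, seq_len, depth]
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return concat @ Wo

# Toy usage with random weights.
rng = np.random.default_rng(0)
d_model, seq_len = 8, 5
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads=2)   # shape [5, 8]
```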