Source: Zhihu | Author: JayLou
Link: https://zhuanlan.zhihu.com/p/53682800
This article summarizes the attention mechanism (Attention) in natural language processing in a Q&A format and provides an in-depth analysis of the Transformer.
Table of Contents

1. Analysis of the Attention Mechanism
   1. Why introduce the Attention mechanism?
   2. What types of Attention mechanisms are there? (How to classify them?)
   3. What is the computational process of the Attention mechanism?
   4. What are the variants of the Attention mechanism?
   5. A powerful Attention mechanism: why is the self-Attention model so powerful on long-distance sequences?
      (1) Can convolutional or recurrent neural networks not handle long-distance sequences?
      (2) What methods can solve the short-distance "local encoding" problem and establish long-distance dependencies over the input sequence?
      (3) What is the specific computational process of the self-Attention model?
2. Detailed Explanation of the Transformer (Attention Is All You Need)
   1. What is the overall architecture of the Transformer? What components does it consist of?
   2. What are the differences between the Transformer Encoder and the Transformer Decoder?
   3. What are the differences between Encoder-Decoder attention and the self-attention mechanism?
   4. What is the specific computational process of the multi-head self-attention mechanism?
   5. How is the Transformer applied in pre-trained models like GPT and BERT? What changes are made?
1. Analysis of Attention Mechanism
1. Why introduce the Attention mechanism?
According to the universal approximation theorem, feedforward networks can approximate any continuous function, and recurrent networks are even Turing complete, so both are in principle very powerful. Why, then, introduce the Attention mechanism?
- Limitation of computational capacity: the more information a model has to remember, the more complex it becomes, yet computational resources remain a bottleneck for the development of neural networks.
- Limitation of optimization algorithms: although operations such as local connections, weight sharing, and pooling simplify the network and ease the conflict between model complexity and expressive power, recurrent networks still struggle to "remember" information across long distances.
We can instead borrow the way the human brain copes with information overload: an Attention mechanism lets the network concentrate its limited resources on the most relevant information, enhancing its ability to process it.
2. What types of Attention mechanisms are there? (How to classify them?)
When using neural networks to process a large amount of input information, we can also draw from the brain’s attention mechanism, selectively processing key information to improve the efficiency of the neural network. According to cognitive neuroscience, attention can be broadly classified into two categories:
- Focused (top-down) attention: conscious attention driven by predetermined goals, which actively focuses on a specific object according to the demands of the task;
- Saliency-based (bottom-up) attention: passive attention driven by external stimuli, which requires no active intervention and is unrelated to the task; max-pooling and gating mechanisms can be viewed as approximations of saliency-based attention.
In artificial neural networks, the attention mechanism generally refers to focused attention.
3. What is the computational process of the Attention mechanism?
[Figure: Attention as an addressing process: the query q is scored against the Keys to produce an attention distribution, which weights the Values]
The essence of the Attention mechanism is an addressing process, as the figure above shows: given a task-related query vector Query q, we compute an attention distribution over the Keys and apply it to the Values to obtain the Attention Value. This also shows how Attention alleviates model complexity: the network does not need to process all N inputs; it only selects the task-relevant information from X.
The attention mechanism can be divided into three steps: first, information input; second, calculating the attention distribution α; third, computing the weighted average of the input information based on the attention distribution α.
Step 1 (information input): let X = [x_1, \ldots, x_N] denote N pieces of input information;
Step 2 (attention distribution calculation): let Key = Value = X; the attention distribution is then

$$\alpha_i = \mathrm{softmax}\big(s(x_i, q)\big) = \frac{\exp\big(s(x_i, q)\big)}{\sum_{j=1}^{N} \exp\big(s(x_j, q)\big)}$$

Here \alpha_i is the attention distribution (a probability distribution) and s(x_i, q) is the attention scoring function, for which several choices exist:

- Additive model: $s(x_i, q) = v^\top \tanh(W x_i + U q)$
- Dot-product model: $s(x_i, q) = x_i^\top q$
- Scaled dot-product model: $s(x_i, q) = x_i^\top q / \sqrt{d}$
- Bilinear model: $s(x_i, q) = x_i^\top W q$
Step 3 (weighted average of information): the attention distribution \alpha_i can be interpreted as the degree to which the i-th piece of information is attended to given the query q. Using a "soft" information-selection mechanism, the input X is encoded as

$$\mathrm{att}(X, q) = \sum_{i=1}^{N} \alpha_i x_i$$

This encoding is the soft Attention mechanism, which comes in two modes: the ordinary mode (Key = Value = X) and the key-value pair mode (Key != Value).

[Figure: soft attention in the ordinary mode (left) and the key-value pair mode (right)]
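To make the three steps concrete, here is a minimal NumPy sketch of soft attention in the ordinary mode (Key = Value = X) with the dot-product scoring function; the function names are illustrative, not from any particular library:

import numpy as np

def softmax(s):
    s = s - s.max()                  # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum()

def soft_attention(X, q):
    """Ordinary-mode soft attention: Key = Value = X."""
    scores = X @ q                   # step 2a: dot-product scores s(x_i, q)
    alpha = softmax(scores)          # step 2b: attention distribution alpha_i
    return alpha @ X                 # step 3: weighted average att(X, q)

N, d = 5, 8                          # N = 5 pieces of input information of dimension d = 8
X = np.random.randn(N, d)            # step 1: information input
q = np.random.randn(d)               # task-related query vector
print(soft_attention(X, q).shape)    # (8,)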
4. What are the variants of the Attention mechanism?
Compared to the ordinary Attention mechanism (left in the figure), what variants of the Attention mechanism exist?
Variant 1 (hard attention): the attention described above is soft attention, which selects information as the expectation over all inputs under the attention distribution. Another form attends to information at exactly one position and is called hard attention. It can be implemented in two ways: (1) select the input with the highest attention probability, or (2) randomly sample a position from the attention distribution. Its drawback:
Because hard attention selects information by maximum or random sampling, the final loss function is not differentiable with respect to the attention distribution, so it cannot be trained with backpropagation. Hard attention therefore requires reinforcement learning to train, and soft attention is generally used instead. — "Neural Networks and Deep Learning"
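By contrast, a hard-attention step replaces the weighted average with a single selection. A minimal sketch of both implementation routes, under the same illustrative setup as above:

import numpy as np

def hard_attention(X, alpha, mode="max"):
    """Select a single input instead of averaging; not differentiable w.r.t. alpha."""
    if mode == "max":
        i = np.argmax(alpha)                         # (1) highest-probability input
    else:
        i = np.random.choice(len(alpha), p=alpha)    # (2) sample from the distribution
    return X[i]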
Variant 2 (key-value attention): the key-value pair mode in the right panel of the figure, where Key != Value. The attention function becomes

$$\mathrm{att}\big((K, V), q\big) = \sum_{i=1}^{N} \alpha_i v_i = \sum_{i=1}^{N} \frac{\exp\big(s(k_i, q)\big)}{\sum_{j=1}^{N} \exp\big(s(k_j, q)\big)}\, v_i$$
Variant 3 (multi-head attention): multi-head attention uses multiple queries Q = [q_1, \ldots, q_M] to select multiple groups of information from the input in parallel. Each attention head attends to a different part of the input, and the results are concatenated:

$$\mathrm{att}\big((K, V), Q\big) = \mathrm{att}\big((K, V), q_1\big) \oplus \cdots \oplus \mathrm{att}\big((K, V), q_M\big)$$

where \oplus denotes vector concatenation.
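A minimal sketch of this variant, reusing the illustrative soft_attention function from the earlier sketch: each query produces one output, and the M outputs are concatenated:

import numpy as np

def multi_query_attention(X, Q):
    """One attention result per query q_m, concatenated (the ⊕ above)."""
    return np.concatenate([soft_attention(X, q) for q in Q])

M, d = 3, 8
Q = np.random.randn(M, d)            # M queries of dimension d
# multi_query_attention(X, Q) has shape (M * d,) = (24,)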
5. A powerful Attention mechanism: Why is the self-Attention model so powerful in long-distance sequences?
(1) Can convolutional or recurrent neural networks not handle long-distance sequences?
When using neural networks to process a variable-length vector sequence, we can typically use convolutional or recurrent networks to encode and obtain an output vector sequence of the same length, as shown in the figure:
[Figure: encoding a variable-length sequence with a convolutional network (left) and a recurrent network (right)]
The figure above shows that both convolutional and recurrent neural networks essentially perform a kind of "local encoding" of the variable-length sequence: convolutional networks are plainly N-gram-style local encoders, while recurrent networks, owing to problems such as vanishing gradients, can in practice only establish short-distance dependencies.
(2) What methods can solve the short-distance "local encoding" problem and establish long-distance dependencies over the input sequence?
If we want to establish long-distance dependencies between input sequences, we can use the following two methods: One method is to increase the number of layers in the network to obtain long-distance information interaction through a deep network; the other method is to use fully connected networks. — “Neural Networks and Deep Learning”
[Figure: modeling long-distance dependencies with a fully connected network (left) and with self-attention (right)]
As the figure shows, a fully connected network is the most direct way to model long-distance dependencies, but it cannot handle variable-length input sequences: the number of connection weights is tied to the input length.
At this point, we can utilize the Attention mechanism to “dynamically” generate different connection weights, which is the self-Attention model. Because the weights of the self-Attention model are dynamically generated, it can handle variable-length information sequences.
In summary, the self-Attention model is powerful because it uses the Attention mechanism to generate different connection weights "dynamically", and can therefore handle variable-length information sequences.
(3) What is the specific computational process of the self-Attention model?
As before, let X = [x_1, \ldots, x_N] denote N pieces of input information. Linear transformations produce the query, key, and value vector sequences:

$$Q = W_Q X, \qquad K = W_K X, \qquad V = W_V X$$
These formulas show that in self-Attention, Q comes from a transformation of the input itself (hence "self"), whereas in conventional Attention, Q comes from an external source.

The attention computation is then

$$h_i = \mathrm{att}\big((K, V), q_i\big) = \sum_{j=1}^{N} \alpha_{ij} v_j = \sum_{j=1}^{N} \mathrm{softmax}\big(s(k_j, q_i)\big)\, v_j$$
In the self-Attention model, the scaled dot-product is usually taken as the attention scoring function, so the output vector sequence can be written as

$$H = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where d_k is the dimension of the key vectors.
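Under these definitions, here is a minimal NumPy sketch of single-head self-attention; the weight matrices are random stand-ins for learned parameters, and a row-per-token layout is used rather than the column convention above:

import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (N, d_model), one row per input; weights are learned in practice."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # linear transformations of X itself
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # scaled dot-product, shape (N, N)
    return softmax(scores) @ V                   # H: (N, d_k)

N, d_model, d_k = 6, 16, 8
X = np.random.randn(N, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
H = self_attention(X, W_q, W_k, W_v)
print(H.shape)   # (6, 8): one output vector per input, whatever the length N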
2. Detailed Explanation of Transformer (Attention Is All You Need)
From the title of the Transformer paper, it is clear that the core of the Transformer is Attention, which is why this article introduces the Transformer after analyzing the Attention mechanism. If you understand the Attention mechanism, especially the self-Attention model, it will be easy to understand the Transformer.
1. What is the overall architecture of the Transformer? What components does it consist of?
[Figure: the Transformer model architecture, with the encoder stack on the left and the decoder stack on the right]
The Transformer is essentially a Seq2Seq model, with an encoder on the left reading the input and a decoder on the right producing the output:
Transformer = Transformer Encoder + Transformer Decoder
(1) Transformer Encoder (N=6 layers, each layer includes 2 sub-layers):
[Figure: one Transformer Encoder layer: multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization]
- Sub-layer 1: a multi-head self-attention mechanism.
- Sub-layer 2: a position-wise feed-forward network, a simple fully connected network applied identically to the vector at every position. It consists of two linear transformations with a ReLU in between (the input and output dimensions are 512, the inner dimension is 2048):

$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2$$
Each sub-layer is wrapped in a residual connection followed by layer normalization:

$$\mathrm{LayerNorm}\big(x + \mathrm{SubLayer}(x)\big)$$
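A minimal sketch of this sub-layer wrapper together with the feed-forward network at the stated dimensions (512 to 2048 and back to 512); the initialization and names are illustrative:

import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

def ffn(x):
    """Position-wise FFN: the same two linear maps + ReLU at every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer(x, f):
    """Residual connection around any sub-layer, followed by LayerNorm."""
    return layer_norm(x + f(x))

x = np.random.randn(10, d_model)     # 10 positions
print(sublayer(x, ffn).shape)        # (10, 512)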
(2) Transformer Decoder (N=6 layers, each layer includes 3 sub-layers):
[Figure: one Transformer Decoder layer: masked multi-head self-attention, Encoder-Decoder attention, and a position-wise feed-forward network]
- Sub-layer 1: a masked multi-head self-attention mechanism. Unlike in the Encoder, generation proceeds sequentially: at step i the outputs for steps greater than i do not yet exist, and only the outputs for steps up to i may be used, so a Mask is required (see the sketch after this list).
- Sub-layer 2: a position-wise feed-forward network, the same as in the Encoder.
- Sub-layer 3: the Encoder-Decoder attention computation.
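The Mask in sub-layer 1 can be implemented by setting the attention scores of all future positions to negative infinity before the softmax, so their attention weights become zero; a minimal NumPy sketch:

import numpy as np

def causal_mask(scores):
    """Forbid attending to positions j > i by masking scores before the softmax."""
    N = scores.shape[-1]
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True above the diagonal
    return np.where(mask, -np.inf, scores)

scores = np.random.randn(4, 4)
masked = causal_mask(scores)
# row i now holds -inf in columns j > i, so softmax gives those positions weight 0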
2. What are the differences between Transformer Encoder and Transformer Decoder?
(1) The multi-head self-attention mechanism is different; the Encoder does not require masking, while the Decoder does.
(2) The Decoder has an additional layer for Encoder-Decoder attention, which differs from the self-attention mechanism.
3. What are the differences between Encoder-Decoder attention and self-attention mechanism?
Both use multi-head computation, but Encoder-Decoder attention follows the conventional attention pattern: the Query comes from the output of the decoder's preceding (masked) self-attention sub-layer, while the Key and Value come from the Encoder's output. This differs from the self-attention mechanism, as the code below shows:
## Multihead Attention (self-attention): Q, K, V all come from the decoder itself
self.dec = multihead_attention(queries=self.dec,
                               keys=self.dec,
                               num_units=hp.hidden_units,
                               num_heads=hp.num_heads,
                               dropout_rate=hp.dropout_rate,
                               is_training=is_training,
                               causality=True,    # mask out future positions
                               scope="self_attention")

## Multihead Attention (Encoder-Decoder attention): Q from the decoder, K and V from the encoder output
self.dec = multihead_attention(queries=self.dec,
                               keys=self.enc,
                               num_units=hp.hidden_units,
                               num_heads=hp.num_heads,
                               dropout_rate=hp.dropout_rate,
                               is_training=is_training,
                               causality=False,   # no mask needed here
                               scope="vanilla_attention")
4. What is the specific computational process of the multi-head self-attention mechanism?
[Figure: the overall flow of multi-head self-attention: linear projections, head splitting, scaled dot-product attention per head, and concatenation]
The Attention mechanism in the Transformer consists of Scaled Dot-Product Attention and Multi-Head Attention, whose overall flow is shown in the figure above. The steps are:

- Expand: a linear transformation of the input that produces the three matrices Q, K, and V;
- Split heads: the original 512-dimensional vectors are split into 8 heads, each 64-dimensional;
- Self-attention: scaled dot-product self-attention is computed within each head, exactly as described in Part 1;
- Concat heads: the outputs of all heads are concatenated after self-attention.
In formulas, the process is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

$$\mathrm{head}_i = \mathrm{Attention}\big(Q W_i^Q,\; K W_i^K,\; V W_i^V\big)$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_h\big)\, W^O$$
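The four steps in one minimal NumPy sketch (h = 8 heads, d_model = 512, so each head is 64-dimensional; the projection matrices are random stand-ins for learned parameters):

import numpy as np

def softmax(S):
    S = S - S.max(-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h=8):
    N, d_model = X.shape
    d_h = d_model // h                                    # 512 / 8 = 64 per head
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                   # step 1: linear projections
    split = lambda M: M.reshape(N, h, d_h).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)                # step 2: split heads, (h, N, d_h)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)      # step 3: per-head scaled dot-product
    heads = softmax(scores) @ V                           # (h, N, d_h)
    concat = heads.transpose(1, 0, 2).reshape(N, d_model) # step 4: concat heads
    return concat @ W_o                                   # final output projection

N, d_model = 10, 512
X = np.random.randn(N, d_model)
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) * 0.02 for _ in range(4))
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o).shape)   # (10, 512)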
5. How is the Transformer applied in pre-trained models like GPT and BERT? What changes are made?
- GPT trains a unidirectional language model and essentially applies the Transformer Decoder;
- BERT trains a bidirectional language model and applies the Transformer Encoder, adding a Masked (masked language model) operation on top of it;

The BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention in which every token can only attend to the context to its left. A bidirectional Transformer is conventionally called a "Transformer encoder", while the left-context-only version is called a "Transformer decoder", since the decoder must not see the tokens it is about to predict.
References

1. "Neural Networks and Deep Learning", https://nndl.github.io/
2. Attention Is All You Need
3. Google BERT Analysis – Get Started with the Strongest NLP Training Model in 2 Hours
4. Detailed Explanation | Attention Is All You Need
5. Attention Models in Deep Learning (2017 Edition), https://zhuanlan.zhihu.com/p/37601161