Reprinted from | High Energy AI
This article summarizes the Attention mechanism in Natural Language Processing (NLP) in a Q&A format and provides an in-depth analysis of the Transformer.
Table of Contents
1. Analysis of the Attention Mechanism
   1. Why introduce the Attention mechanism?
   2. What types of Attention mechanisms are there? (How are they classified?)
   3. What is the computational process of the Attention mechanism?
   4. What are the variants of the Attention mechanism?
   5. A powerful Attention mechanism: why is the self-Attention model so effective on long-distance sequences?
      (1) Can't convolutional or recurrent neural networks handle long-distance sequences?
      (2) What methods can solve the short-distance-dependency "local encoding" problem and establish long-distance dependencies?
      (3) What is the specific computational process of the self-Attention model?
2. Detailed Explanation of the Transformer (Attention Is All You Need)
   1. What is the overall architecture of the Transformer? What components does it consist of?
   2. What are the differences between the Transformer Encoder and the Transformer Decoder?
   3. What are the differences between Encoder-Decoder attention and the self-attention mechanism?
   4. What is the specific computational process of the multi-head self-attention mechanism?
   5. How is the Transformer applied in pre-trained models such as GPT and BERT? What changes are made?
1. Analysis of Attention Mechanism
1. Why introduce the Attention mechanism?
According to the universal approximation theorem, both feedforward networks and recurrent networks already have very strong expressive power. Why, then, introduce the Attention mechanism?
- Limitations of computational power: when a large amount of "information" has to be remembered, the model must become more complex, yet current computational power remains a bottleneck for the development of neural networks.
- Limitations of the optimization algorithm: although operations such as local connections, weight sharing, and pooling can simplify neural networks and effectively ease the conflict between model complexity and expressive power, the "long-distance dependency" problem of recurrent neural networks means that their ability to "memorize" information remains limited.
We can therefore borrow the mechanisms the human brain uses to cope with information overload: the Attention mechanism, for example, can improve a neural network's ability to process information.
2. What types of Attention mechanisms are there? (How are they classified?)
When using neural networks to process large amounts of input information, we can also draw on the human brain’s Attention mechanism to selectively process some key pieces of information to improve the efficiency of the neural network. Based on cognitive neuroscience, Attention can be broadly divided into two categories:
- Focused (focal) attention: top-down, conscious, active attention. It is attention that is deliberately and purposefully directed at a specific object, and it is task-dependent.
- Saliency-based attention: bottom-up, unconscious, passive attention. It is driven by external stimuli, requires no active intervention, and is not task-dependent; max-pooling and gating mechanisms can be viewed as approximations of bottom-up saliency-based attention.
In artificial neural networks, the Attention mechanism typically refers specifically to focal attention.
3. What is the computational process of the Attention mechanism?

The essence of the Attention mechanism is an addressing process, as shown in the figure above: given a task-related query vector q, we compute an attention distribution over the Keys and apply it to the Values to obtain the Attention Value. This process also shows how the Attention mechanism alleviates the complexity of neural network models: instead of feeding all N pieces of input information into the network for computation, we only need to select the task-relevant information from X and feed that into the network.
The Attention mechanism can be divided into three steps: first, inputting information; second, calculating the attention distribution α; third, calculating the weighted average of input information based on the attention distribution α.
Step 1 – Information Input: Let X = [x1, · · · , xN] represent N pieces of input information;
Step 2 – Attention Distribution Calculation: Let Key = Value = X; the attention distribution is then

$$ \alpha_i = \mathrm{softmax}\big(s(x_i, q)\big) = \frac{\exp\big(s(x_i, q)\big)}{\sum_{j=1}^{N} \exp\big(s(x_j, q)\big)} $$

We call $\alpha_i$ the attention distribution (a probability distribution) and $s(x_i, q)$ the attention scoring function, for which several forms are commonly used:

- Additive model: $s(x_i, q) = v^\top \tanh(W x_i + U q)$
- Dot-product model: $s(x_i, q) = x_i^\top q$
- Scaled dot-product model: $s(x_i, q) = \dfrac{x_i^\top q}{\sqrt{d}}$
- Bilinear model: $s(x_i, q) = x_i^\top W q$
Step 3 – Information Weighted Average: the attention distribution $\alpha_i$ can be interpreted as the degree of attention paid to the i-th piece of information given the query q. A "soft" information-selection mechanism encodes the input information X as

$$ \mathrm{att}(X, q) = \sum_{i=1}^{N} \alpha_i x_i $$

This encoding method is known as the soft Attention mechanism. It comes in two forms: the ordinary mode (Key = Value = X) and the key-value pair mode (Key ≠ Value).
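To make the three steps concrete, here is a minimal NumPy sketch of soft attention in the ordinary mode (Key = Value = X) with a scaled dot-product scoring function; the shapes and variable names are illustrative assumptions, not code from the original article.

import numpy as np

def soft_attention(X, q):
    # X: (N, d) matrix of N input vectors; q: (d,) task-related query vector.
    d = X.shape[1]
    # Step 2: score s(x_i, q) with the scaled dot-product, then softmax -> attention distribution.
    scores = X @ q / np.sqrt(d)                      # (N,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                      # attention distribution (sums to 1)
    # Step 3: weighted average of the inputs, att(X, q) = sum_i alpha_i * x_i.
    return alpha @ X                                 # (d,)

# Step 1: N = 5 pieces of 4-dimensional input information and a query vector.
X = np.random.randn(5, 4)
q = np.random.randn(4)
print(soft_attention(X, q).shape)                    # (4,)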

4. What are the variants of the Attention mechanism?
Compared to the ordinary Attention mechanism (left in the figure), what variants of the Attention mechanism are there?
- Variant 1 – Hard Attention: the Attention described above is soft Attention, which selects information as an expectation over all the input information under the attention distribution. Another kind of attention attends only to the information at one particular position and is called hard Attention. Hard Attention can be implemented in two ways: (1) select the input information with the highest probability; or (2) sample from the attention distribution. Its drawback is as follows:

The disadvantage of hard Attention is that it selects information by taking the maximum or by random sampling. As a result, the final loss function is not differentiable with respect to the attention distribution, so it cannot be trained with backpropagation. To use backpropagation, soft Attention is generally used in place of hard Attention; hard Attention has to be trained with reinforcement learning. — "Neural Networks and Deep Learning"
- Variant 2 – Key-Value Attention: this is the key-value pair mode shown on the right of the figure, where Key ≠ Value, and the attention function becomes

$$ \mathrm{att}\big((K, V), q\big) = \sum_{i=1}^{N} \alpha_i v_i = \sum_{i=1}^{N} \frac{\exp\big(s(k_i, q)\big)}{\sum_{j=1}^{N} \exp\big(s(k_j, q)\big)}\, v_i $$
- Variant 3 – Multi-Head Attention: multi-head attention uses multiple queries Q = [q1, · · · , qM] to compute attention in parallel and select multiple pieces of information from the input. Each attention head attends to a different part of the input, and the results are concatenated:

$$ \mathrm{att}\big((K, V), Q\big) = \mathrm{att}\big((K, V), q_1\big) \oplus \cdots \oplus \mathrm{att}\big((K, V), q_M\big) $$
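The two variants above can be illustrated with a minimal NumPy sketch: key-value pair attention over (K, V), and the multi-query form that concatenates the results of M queries. The shapes and function names are illustrative assumptions.

import numpy as np

def kv_attention(K, V, q):
    # Key-value pair attention: att((K, V), q) = sum_i softmax(s(k_i, q)) * v_i.
    scores = K @ q / np.sqrt(K.shape[1])     # scaled dot-product scores s(k_i, q)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()              # attention distribution over the keys
    return alpha @ V                         # weighted sum of the values

def multi_query_attention(K, V, Q):
    # Variant 3: attend with each query q_m in parallel and concatenate the results.
    return np.concatenate([kv_attention(K, V, q) for q in Q])

# N = 6 key/value pairs, M = 2 queries; all dimensions are illustrative.
K, V = np.random.randn(6, 8), np.random.randn(6, 8)
Q = np.random.randn(2, 8)
print(multi_query_attention(K, V, Q).shape)  # (16,): two 8-dimensional results concatenated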

5. A powerful Attention mechanism: Why is the self-Attention model so effective in long-distance sequences?
(1) Can’t Convolutional or Recurrent Neural Networks handle long-distance sequences?
When using neural networks to process a variable-length vector sequence, we can typically use convolutional networks or recurrent networks for encoding to obtain an output vector sequence of the same length, as shown in the figure:

The figure above shows that both convolutional and recurrent neural networks are essentially a form of "local encoding" of variable-length sequences: convolutional networks are clearly N-gram-based local encoders, while recurrent networks, owing to problems such as vanishing gradients, can in practice only establish short-distance dependencies.
(2) What methods can be used to solve the “local encoding” problem of short-distance dependencies and establish long-distance dependencies?
If we want to establish long-distance dependencies between input sequences, we can use the following two methods: one method is to increase the depth of the network to obtain long-distance information interaction, and the other method is to use fully connected networks. — “Neural Networks and Deep Learning”

As the figure above shows, a fully connected network is a very direct model for long-distance dependencies, but it cannot handle variable-length input sequences: different input lengths require different sets of connection weights.
At this point, we can use the Attention mechanism to dynamically generate different connection weights, which is the self-Attention model. Since the weights of the self-Attention model are dynamically generated, it can handle variable-length information sequences.
Overall, the reason why the self-Attention model is so powerful is that it utilizes the Attention mechanism to dynamically generate different connection weights, thereby handling variable-length information sequences.
(3) What is the specific computational process of the self-Attention model?
Likewise, for the input information, let X = [x1, · · · , xN] denote N pieces of input information. Through linear transformations we obtain the query, key, and value vector sequences:

$$ Q = W_Q X, \qquad K = W_K X, \qquad V = W_V X $$
The formula above shows that Q in self-Attention is transformed from the input itself (self), while in traditional Attention, Q comes from external sources.

The attention computation for each query is then

$$ h_i = \mathrm{att}\big((K, V), q_i\big) = \sum_{j=1}^{N} \alpha_{ij} v_j = \sum_{j=1}^{N} \mathrm{softmax}\big(s(k_j, q_i)\big)\, v_j $$

In the self-Attention model, the scaled dot-product is usually used as the attention scoring function, and the output vector sequence can then be written compactly as

$$ H = V \, \mathrm{softmax}\!\left(\frac{K^\top Q}{\sqrt{d_k}}\right) $$
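As a concrete sketch of the whole self-attention computation, the NumPy code below projects the input into Q, K, V and applies scaled dot-product attention. It uses the row-vector convention (X @ W), which is equivalent to the column-vector formulas above up to transposition; the projection matrices, dimensions, and names are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Q, K, V are all linear transformations of the same input X (hence "self").
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scaled dot-product scores, (N, N)
    alpha = softmax(scores, axis=-1)         # one attention distribution per position
    return alpha @ V                         # every output can attend to the whole sequence

# The same weights work for any sequence length N: the connection weights are
# generated dynamically from the input rather than being tied to fixed positions.
d_model, d_k = 16, 8
X = np.random.randn(10, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (10, 8)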

2. Detailed Explanation of Transformer (Attention Is All You Need)
From the title of the Transformer paper, it is clear that the core of the Transformer is Attention, which is why this article analyzes the Attention mechanism before introducing the Transformer. If you understand the Attention mechanism above, especially the self-Attention model, the Transformer will be easy to comprehend.
1. What is the overall architecture of the Transformer? What components does it consist of?

The Transformer is essentially a Seq2Seq model, with an encoder on the left reading the input and a decoder on the right producing the output:

Transformer = Transformer Encoder + Transformer Decoder
(1) Transformer Encoder (N=6 layers, each layer includes 2 sub-layers):

- Sub-layer 1: multi-head self-attention mechanism.
- Sub-layer 2: Position-wise Feed-forward Networks, a simple fully connected network applied identically to the vector at each position, consisting of two linear transformations with a ReLU activation in between (input and output dimensions are both 512, the inner layer has dimension 2048); a sketch follows below:

$$ \mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\, W_2 + b_2 $$
Each sub-layer is wrapped in a residual connection followed by layer normalization:

$$ \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big) $$
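The following minimal NumPy sketch shows one encoder sub-layer wrapper: the position-wise feed-forward network (two linear transformations with a ReLU in between) wrapped in a residual connection and layer normalization. The 512/2048 dimensions follow the text above; the weight initialization and function names are illustrative assumptions.

import numpy as np

d_model, d_ff = 512, 2048                 # dimensions quoted in the text

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def sublayer(x, fn):
    # Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + fn(x))

W1, b1 = 0.01 * np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = 0.01 * np.random.randn(d_ff, d_model), np.zeros(d_model)
x = np.random.randn(10, d_model)          # a sequence of 10 positions
print(sublayer(x, lambda h: position_wise_ffn(h, W1, b1, W2, b2)).shape)   # (10, 512)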
(2) Transformer Decoder (N=6 layers, each layer includes 3 sub-layers):

- Sub-layer 1: masked multi-head self-attention mechanism. Unlike in the Encoder, decoding is a sequence-generation process: at time step i, the results of time steps greater than i are not yet available, only those of time steps less than i, so a Mask is required (see the sketch after this list).
- Sub-layer 2: Position-wise Feed-forward Networks, the same as in the Encoder.
- Sub-layer 3: Encoder-Decoder attention computation.
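A minimal NumPy sketch of the mask used in sub-layer 1: scores for future positions are set to a large negative value before the softmax, so their attention weights become (effectively) zero. The helper names and shapes are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention_weights(Q, K):
    # Causal mask: position i may only attend to positions j <= i.
    N, d_k = Q.shape[0], K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (N, N) scaled dot-product scores
    future = np.triu(np.ones((N, N)), k=1)             # 1 above the diagonal marks future steps
    scores = np.where(future == 1, -1e9, scores)       # block attention to future time steps
    return softmax(scores, axis=-1)

Q, K = np.random.randn(4, 8), np.random.randn(4, 8)
print(np.round(masked_attention_weights(Q, K), 2))     # upper triangle is ~0: no looking ahead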
2. What are the differences between Transformer Encoder and Transformer Decoder?
(1) The multi-head self-attention mechanism is different; the Encoder does not require masking, while the Decoder does.
(2) The Decoder includes an additional Encoder-Decoder attention layer, which differs from the self-attention mechanism.
3. What are the differences between Encoder-Decoder attention and self-attention mechanism?
Both are computed with multiple heads, but Encoder-Decoder attention uses the traditional attention mechanism: the Query comes from the output of the Decoder's preceding (masked self-attention) sub-layer, while the Key and Value are both taken from the output of the Encoder. This is what distinguishes it from the self-attention mechanism, as the code below reflects:
## Multihead Attention (self-attention): queries and keys both come from the decoder;
## causality=True applies the mask that blocks attention to future positions.
self.dec = multihead_attention(queries=self.dec,
                               keys=self.dec,
                               num_units=hp.hidden_units,
                               num_heads=hp.num_heads,
                               dropout_rate=hp.dropout_rate,
                               is_training=is_training,
                               causality=True,
                               scope="self_attention")

## Multihead Attention (Encoder-Decoder attention): queries come from the decoder,
## while keys (and values) come from the encoder output; no causal mask is needed.
self.dec = multihead_attention(queries=self.dec,
                               keys=self.enc,
                               num_units=hp.hidden_units,
                               num_heads=hp.num_heads,
                               dropout_rate=hp.dropout_rate,
                               is_training=is_training,
                               causality=False,
                               scope="vanilla_attention")
4. What is the specific computational process of multi-head self-attention mechanism?

The Attention mechanism in the Transformer consists of Scaled Dot-Product Attention and Multi-Head Attention, as shown in the overall process in the figure. Below is a specific introduction to each step:
- Expand: effectively a linear transformation that produces the Q, K, and V vectors;
- Split heads: split into heads; in the original paper, each position's 512-dimensional vector is split into 8 heads, so each head has dimension 64;
- Self Attention: perform self-attention within each head, following exactly the process introduced in Part 1;
- Concat heads: concatenate the outputs of all heads once self-attention is complete.
The above process can be written as:

$$ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}\big(Q W_i^Q,\, K W_i^K,\, V W_i^V\big) $$

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V $$
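A minimal NumPy sketch of the Expand / Split heads / Self Attention / Concat heads flow, with 8 heads of 64 dimensions over a 512-dimensional model as described above; the projection setup and variable names are illustrative assumptions.

import numpy as np

d_model, n_heads = 512, 8
d_head = d_model // n_heads                            # 64 dimensions per head

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    N = X.shape[0]
    # Expand: linear transformations produce Q, K, V.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split heads: (N, d_model) -> (n_heads, N, d_head).
    split = lambda M: M.reshape(N, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Self Attention within each head (scaled dot-product).
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, N, N)
    heads = softmax(scores, axis=-1) @ Vh                   # (n_heads, N, d_head)
    # Concat heads, then apply the output projection W_o.
    concat = heads.transpose(1, 0, 2).reshape(N, d_model)
    return concat @ W_o

X = np.random.randn(10, d_model)
W_q, W_k, W_v, W_o = (0.01 * np.random.randn(d_model, d_model) for _ in range(4))
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o).shape)   # (10, 512)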

5. How is the Transformer applied in pre-trained models like GPT and BERT? What changes are there?
- GPT trains a unidirectional language model and directly applies the Transformer Decoder;
- BERT trains a bidirectional language model and applies the Transformer Encoder part, adding a Masked (masked language model) operation on top of the Encoder;

The BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention, in which every token can only attend to the context on its left. A bidirectional Transformer is commonly referred to as a "Transformer encoder", while the left-context-only version is referred to as a "Transformer decoder", since it cannot access the tokens to be predicted.
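To visualize this distinction, the short sketch below only prints the two attention patterns for a 4-token sequence: a Transformer encoder (BERT) lets every token attend to all positions, while a Transformer decoder (GPT) restricts each token to the context on its left. The mask construction is an illustrative assumption.

import numpy as np

N = 4                                                    # a 4-token sequence
bidirectional = np.ones((N, N), dtype=int)               # encoder / BERT: attend everywhere
left_context_only = np.tril(np.ones((N, N), dtype=int))  # decoder / GPT: no access to future tokens

print(bidirectional)
print(left_context_only)    # row i has 1s only up to column i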
References
- "Neural Networks and Deep Learning"
- Attention Is All You Need
- Google BERT Analysis – 2 Hours to Master the Strongest NLP Training Model
- Detailed Discussion | Attention Is All You Need
- Attention Models in Deep Learning (2017 Edition)