Understanding the Details of Transformers: 18 Key Questions

Author: Wang Chen, Who Asks Questions@Zhihu (Authorized)
Source: https://www.zhihu.com/question/362131975/answer/3058958207
Editor: Jishi Platform

Why Summarize Transformers Through Eighteen Questions?

There are two reasons:

First, the Transformer is the fourth major feature extractor after MLP, RNN, and CNN, and is also described as the fourth foundational model; the recently popular ChatGPT is likewise built on the Transformer, which highlights its importance.

Second, I hope that by asking questions, I can better help everyone understand the content and principles of Transformers.

1. What Was the Major Breakthrough in Deep Learning in 2017?

Transformer. There are two reasons:

1.1 On one hand, the Transformer is the fourth major feature extractor in deep learning after MLP, RNN, and CNN (also referred to as a foundational model). What is a feature extractor? The brain is how humans interact with the external world (images, text, speech, etc.); a feature extractor is how computers mimic the brain to interact with that same external world, as shown in Figure 1. Take the ImageNet dataset as an example: it contains 1,000 classes of images, and people have sorted its million-plus images into these 1,000 classes based on experience, with each class (such as leopards) having its own distinctive features. A neural network (such as ResNet18) likewise aims, through this classification task, to extract or recognize the distinctive features of each class of images. Classification is not the ultimate goal but a means of extracting image features; masking and then reconstructing parts of an image is another way to extract features, and shuffling the order of image patches is yet another.

Figure 1 Neural networks mimic neurons in the brain

1.2 On the other hand, the role of the Transformer in deep learning: the cornerstone of the third and fourth waves, as shown in Figure 2.

Figure 2 Four stages of deep learning development

2. What Is the Background for the Proposal of Transformers?

2.1 In Terms of Development Background: By 2017, deep learning had already been thriving in computer vision for several years, from AlexNet, VGG, GoogLeNet, ResNet, and DenseNet, and from image classification through object detection to semantic segmentation; however, it had not yet made a comparable splash in natural language processing.

2.2 In Terms of Technical Background: (1) The mainstream solution for sequence transduction tasks (such as machine translation) at the time is shown in Figure 3: in the Sequence-to-Sequence architecture (a type of Encoder-Decoder), an RNN extracts features, and the Attention mechanism efficiently passes the features extracted by the Encoder to the Decoder. (2) This approach has two shortcomings. On one hand, the RNN's inherent structure of passing information sequentially from front to back prevents parallel computation. On the other hand, when the sequence is very long, information from the earliest positions may be forgotten. Within this framework, therefore, the RNN is the weak component in need of improvement.

Figure 3 Mainstream solutions for sequence transduction tasks

3. What Exactly Is a Transformer?

3.1 The Transformer is an architecture composed of Encoder and Decoder. So what is an architecture? The simplest architecture is A+B+C.

3.2 The Transformer can also be understood as a function: in machine translation, for example, the input is a Chinese sentence such as “我爱学习” and the output is its English translation, “I love studying”.

3.3 Breaking the Transformer's architecture down further gives the structure shown in Figure 4.

Figure 4 Architecture of the Transformer

4. What Is the Transformer Encoder?

4.1 From a functional perspective, the core purpose of the Transformer Encoder is to extract features, and the Transformer Decoder can also be used to extract features. For example, when a person learns to dance, the Encoder observes how others dance, and the Decoder showcases the learned experiences and memories.

4.2 From a structural perspective, as shown in Figure 5, the Transformer Encoder = Embedding + Positional Embedding + N*(Sub Encoder block1 + Sub Encoder block2);

Sub Encoder block1 = Multi-head attention + ADD + Norm;

Sub Encoder block2 = Feed Forward + ADD + Norm;

4.3 From the input-output perspective, the input to the first of the N Transformer Encoder blocks is a matrix X = (Embedding + Positional Embedding), whose size is typically 512×512. Each of the remaining N−1 Encoder blocks takes the output of the previous block as its input, and the output is also 512×512 (the input and output sizes are the same).

4.4 Why 512×512? The first 512 is the number of tokens; for example, “I love learning” has 4 tokens, and the length is set to 512 to accommodate different sequence lengths, with padding used for shorter ones. The second 512 is the vector dimension of each token, i.e., each token is represented by a vector of length 512. The often-quoted claim that Transformers “cannot exceed 512, or the hardware will struggle” refers to the first 512, the number of tokens, because every token must attend to every other token in self-attention, so the cost grows quadratically with sequence length; the second 512 should also not be too large, or computation becomes slow.

Figure 5 Architecture of the Transformer Encoder
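
To make these shapes concrete, below is a minimal PyTorch sketch of one Encoder block (not the original author's code; the class name `EncoderBlock` and the post-norm layout are illustrative), assuming 512 tokens and a 512-dimensional embedding:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer Encoder block: Multi-head attention + Add & Norm, then FFN + Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, 512 tokens, 512 dims)
        attn_out, _ = self.attn(x, x, x)       # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)           # Add (residual) + Norm
        x = self.norm2(x + self.ffn(x))        # FFN + Add + Norm
        return x                               # same shape as the input: (batch, 512, 512)

x = torch.randn(1, 512, 512)                   # (batch, num_tokens, d_model)
print(EncoderBlock()(x).shape)                 # torch.Size([1, 512, 512])
```

Stacking N = 6 such blocks, each consuming the previous block's output, gives the full Transformer Encoder.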

5. What Is the Transformer Decoder?

5.1 From a functional perspective, compared to the Transformer Encoder, the Transformer Decoder is better suited for generative tasks, especially for natural language processing problems.

5.2 From a structural perspective, as shown in Figure 6, the Transformer Decoder = Embedding + Positional Embedding + N*(Sub Decoder block1 + Sub Decoder block2 + Sub Decoder block3) + Linear + Softmax;

Sub Decoder block1 = Mask Multi-head attention + ADD + Norm;

Sub Decoder block2 = Multi-head attention + ADD + Norm;

Sub Decoder block3 = Feed Forward + ADD + Norm;

Figure 6 Architecture of the Transformer Decoder

5.3 Looking at each component individually, namely (Embedding + Positional Embedding), the N Decoder blocks, and (Linear + Softmax):

Embedding + Positional Embedding: For instance, in machine translation with input “Machine Learning” and output “机器学习”, the decoder's Embedding converts “机器学习” (the target tokens fed back into the decoder) into vectors.

N Decoder blocks: This represents the feature processing and transmission process.

Linear + softmax: The softmax predicts the probability of the next word appearing, as shown in Figure 7. The preceding Linear layer is similar to the MLP layer before the classification layer in classification networks (ResNet18).

Figure 7 Role of softmax in the Transformer Decoder
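
As a hedged sketch of this last step (the vocabulary size of 10,000 is made up for illustration), the Linear layer projects each decoder output vector onto the vocabulary and the softmax turns the scores into next-word probabilities:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000          # vocabulary size is illustrative

linear = nn.Linear(d_model, vocab_size)   # projects each decoder output vector onto the vocabulary
dec_out = torch.randn(1, d_model)          # decoder output vector for the current position

probs = torch.softmax(linear(dec_out), dim=-1)  # probability of each word being the next one
next_word = probs.argmax(dim=-1)                 # the most likely next word
print(probs.shape, next_word.shape)
```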

5.4 What are the inputs and outputs of the Transformer Decoder? They differ during training and testing.

During training, as shown in Figure 8, the label is known. The decoder's first input is the Begin token, and its first output vector is compared against the first character of the label using a cross-entropy loss. The decoder's second input is the first character of the label itself (rather than the model's own output), and so on; the final target is the End token, which marks the end of the sequence. From this it can also be seen that training can be carried out in parallel.

Figure 8 Inputs and outputs of the Transformer Decoder during training

During testing, the input at the next time step is the output from the previous time step, as shown in Figure 9. As a result, the decoder's inputs during training and testing may not match: at test time, one wrong step can indeed lead every subsequent step astray. There are two remedies: one is to deliberately introduce some errors during training, and the other is scheduled sampling.

Figure 9 Inputs and outputs of the Transformer Decoder during testing
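
A minimal sketch of this test-time behavior, where the previous output becomes the next input; `model`, `bos_id`, and `eos_id` are placeholder names rather than anything from the paper, and `model(src, tgt)` is assumed to return next-token scores for every target position:

```python
# Minimal sketch of autoregressive (greedy) decoding at test time.
def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    tgt = [bos_id]                          # start with the Begin token
    for _ in range(max_len):
        logits = model(src, tgt)            # scores over the vocabulary for each target position
        next_id = int(logits[-1].argmax())  # pick the most probable next token
        tgt.append(next_id)                 # the previous output becomes the next input
        if next_id == eos_id:               # stop once the End token is produced
            break
    return tgt[1:]
```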

5.5 What are the inputs and outputs of the Transformer Decoder block? Previously, we discussed the outputs during the overall training and testing phases, but what about the inputs and outputs of the Transformer Decoder block itself, as shown in Figure 10?

Figure 10 Architecture of the Transformer Decoder block

For the first of the N = 6 blocks (i.e., N = 1): the input to Sub Decoder block1 is Embedding + Positional Embedding; in Sub Decoder block2, Q comes from the output of Sub Decoder block1, while K and V come from the output of the last layer of the Transformer Encoder.

For the second block (N = 2): the input to Sub Decoder block1 is the output of the first block, and the K and V of Sub Decoder block2 again come from the output of the last layer of the Transformer Encoder.

In summary, whether during training or testing, the input to the Transformer Decoder comes not only from the ground truth or the previous decoder output, but also from the last layer of the Transformer Encoder.

During training: the input of the i-th decoder = encoder output + ground truth embedding.

During prediction: the input of the i-th decoder = encoder output + output of the (i-1)-th decoder.
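
A minimal PyTorch sketch of this interaction (tensor names are illustrative): in the decoder's cross-attention, Q comes from the decoder's own states, while K and V come from the output of the last Encoder layer:

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

enc_out = torch.randn(1, 512, 512)   # output of the last Transformer Encoder layer
dec_x   = torch.randn(1, 512, 512)   # output of the decoder's masked self-attention sub-block

# Q from the decoder, K and V from the encoder output.
out, _ = cross_attn(query=dec_x, key=enc_out, value=enc_out)
print(out.shape)                      # torch.Size([1, 512, 512])
```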

6. What Are the Differences Between the Transformer Encoder and Decoder?

6.1 In terms of function, the Transformer Encoder is commonly used to extract features, while the Transformer Decoder is often used for generative tasks. The Transformer Encoder and Transformer Decoder represent two different technical paths: BERT adopts the former, while the GPT series of models adopts the latter.

6.2 In terms of structure, the Transformer Decoder block includes three sub-Decoder blocks, whereas the Transformer Encoder block contains two sub-Encoder blocks, and the Transformer Decoder utilizes Mask multi-head Attention.

6.3 From the input-output perspective, the output of the last of the N Transformer Encoder blocks is fed into the Transformer Decoder, where it serves as the K and V of the decoder's QKV. So how exactly is the output of the last Transformer Encoder layer delivered to the Decoder? As shown in Figure 11.

Figure 11 Interaction between the Transformer Encoder and Decoder

So why must the Encoder and Decoder interact in this way? It is not strictly necessary; different interaction methods have been proposed subsequently, as shown in Figure 12.

Figure 12 Interaction methods between the Transformer Encoder and Decoder

7. What Is Embedding?

7.1 The position of Embedding in the Transformer architecture is shown in Figure 13.

7.2 Background: Computers cannot directly process a word or a character; a token must be converted into a vector that the computer can recognize, which is the embedding process.

7.3 Implementation: The simplest embedding operation is the one-hot vector, but the one-hot vector has a drawback: it does not consider the relationships between words. This led to the creation of Word Embedding, as shown in Figure 13.

Figure 13 Explanation of Embedding: from left to right, the position of embedding in the Transformer, one-hot vector, Word embedding.
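
A small sketch of the contrast (the three-word vocabulary is made up): a one-hot vector only marks which word it is, whereas a learned embedding maps each token to a dense, trainable vector whose geometry can reflect relationships between words:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = {"I": 0, "love": 1, "learning": 2}       # toy vocabulary (illustrative)
token_ids = torch.tensor([vocab["I"], vocab["love"], vocab["learning"]])

# One-hot: each token is a sparse vector of length |vocab| with no notion of similarity.
one_hot = F.one_hot(token_ids, num_classes=len(vocab)).float()

# Word embedding: each token is mapped to a dense, trainable 512-dimensional vector.
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)
vectors = embed(token_ids)                        # shape: (3 tokens, 512)
print(one_hot.shape, vectors.shape)
```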

8. What Is Positional Embedding?

8.1 The position of Positional Embedding in the Transformer architecture is shown in Figure 14.

8.2 Background: RNN, as a feature extractor, inherently carries the sequential information of words; however, the Attention mechanism does not consider sequential information, which is crucial for semantics. Therefore, we need to add positional information to the input embeddings through Positional Embedding.

8.3 Implementation: two approaches exist, the traditional fixed (sinusoidal) positional encoding and positional embeddings learned automatically by the network.

Figure 14 Explanation of Positional Embedding: from left to right, the position of positional embedding in the Transformer, traditional positional encoding implementation, and the traditional positional encoding image, where each column represents the positional encoding of a token.
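
Below is a minimal sketch of the traditional (sinusoidal) positional encoding from the paper, where even dimensions use sine and odd dimensions use cosine; the resulting matrix is simply added to the token embeddings:

```python
import torch

def sinusoidal_positional_encoding(max_len=512, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()            # even dimension indices: 0, 2, 4, ...
    div = torch.pow(10000.0, i / d_model)               # 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)                   # even dimensions
    pe[:, 1::2] = torch.cos(pos / div)                   # odd dimensions
    return pe                                            # added to the token embeddings

pe = sinusoidal_positional_encoding()
print(pe.shape)   # torch.Size([512, 512]): one positional vector per token position
```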

9. What Is Attention?

9.1 Why introduce Attention when discussing Transformers? Because the most prevalent multi-head attention and Mask multi-head attention in Transformers derive from Scaled dot product attention, and scaled dot product attention comes from self-attention; thus, it is essential to understand Attention, as shown in Figure 15.

Figure 15 Relationship between Attention and Transformer

9.2 What does Attention actually mean?

For images, attention refers to the core areas of focus that people observe in an image. For sequences, the Attention mechanism essentially aims to find the interrelations among different tokens in the input, using a weight matrix to spontaneously discover the relationships between words.

Figure 16 Attention in images

9.3 How is Attention implemented?

It is implemented through QKV.

So what are QKV? Q is query, K is keys, V is values. For instance, Q represents a signal from the brain, such as “I am thirsty”; K represents environmental information, the world seen by the eyes; V assigns different weights to various items in the environment, increasing the weight for water.

In summary, Attention calculates the similarity between Q and K, and multiplies it with V to obtain the attention value.

Figure 17 Implementation of Attention
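
In formula form this is Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V; here is a minimal sketch (the scaling by √d_k is discussed in Question 11):

```python
import math
import torch

def attention(Q, K, V):
    """Similarity between Q and K -> softmax weights -> weighted sum of V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # similarity of every Q with every K
    weights = torch.softmax(scores, dim=-1)              # attention coefficients (each row sums to 1)
    return weights @ V                                    # weighted combination of the values

Q = K = V = torch.randn(4, 512)        # 4 tokens, 512-dimensional vectors
print(attention(Q, K, V).shape)        # torch.Size([4, 512])
```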

9.4 Why must there be QKV?

Why not just Q? Because the relationship weight between Q1 and Q2 requires not only a12 but also a21. You might ask if we can just set a12 = a21. We could try that, but theoretically, it should not perform as well as having both a12 and a21.

Why not just QK? The obtained weight coefficients need to be applied to the input, which can be multiplied by Q or K. Why multiply by V? I believe it adds a set of trainable parameters, WV, allowing the network to have a stronger learning capability.

10. What Is Self-Attention?

10.1 Why introduce self-attention when discussing Transformers? Because the most common multi-head attention and Mask multi-head attention in Transformers derive from Scaled dot product attention, which comes from self-attention, as shown in Figure 15.

10.2 What is self-attention? Self-attention, local attention, and stride attention are all types of attention. Self-attention computes an attention coefficient between each Q and every K in turn, as shown in Figure 18, while local attention computes attention coefficients only between Q and neighboring K, and stride attention computes attention coefficients with K at fixed intervals (i.e., skipping positions).

Figure 18 From left to right: self-attention, local attention, stride attention

10.3 Why can self-attention be used to process sequential data like machine translation?

Because the data at each position in the input sequence can focus on information from other positions, thereby using Attention scores to extract features or capture the relationships between each token in the input sequence.

10.4 How is self-attention specifically implemented? It is divided into four steps, as shown in Figure 19.

Figure 19 Implementation process of self-attention
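
As a hedged sketch of these steps under the usual formulation (the exact step boundaries in Figure 19 may be drawn differently): (1) project the input X into Q, K, and V with trainable matrices; (2) compute the Q·Kᵀ similarity scores; (3) scale and apply softmax; (4) take the weighted sum of V:

```python
import math
import torch
import torch.nn as nn

d_model = 512
W_Q, W_K, W_V = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))  # trainable projections

X = torch.randn(4, d_model)                                  # 4 input tokens

Q, K, V = W_Q(X), W_K(X), W_V(X)                             # step 1: project X into Q, K, V
scores = Q @ K.T                                              # step 2: similarity of every token with every token
weights = torch.softmax(scores / math.sqrt(d_model), dim=-1)  # step 3: scale + softmax
out = weights @ V                                             # step 4: weighted sum of the values
print(out.shape)                                              # torch.Size([4, 512])
```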

11. What Is Scaled Dot Product Attention?

11.1 The two most common types of self-attention are dot product attention and additive attention, as shown in Figure 20. The former has higher computational efficiency.

Figure 20 Difference between dot product attention and additive attention

11.2 What does “scaled” mean?

The specific implementation of scaling is shown in Figure 21. The purpose of this operation is to keep the inner products from becoming too large; when they do, the softmax saturates and its gradients become close to 0, which makes training difficult. In this respect it plays a stabilizing role somewhat like batch normalization.

Figure 21 Position of the scaled operation in attention

12. What Is Multi-Head Attention?

12.1 The position of Multi-head attention in the Transformer architecture is shown in Figure 15.

12.2 Background: CNNs have multiple channels and can extract different dimensional feature information from images. Can self-attention also perform similar operations to extract information from multiple dimensions of tokens at different distances?

12.3 What is group convolution? As shown in Figure 22, group convolution divides the input features into several groups for separate convolution operations, which are then concatenated.

Figure 22 Group convolution
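
A minimal PyTorch sketch of the idea (channel counts are illustrative): with `groups=4`, the 64 input channels are split into 4 groups of 16, each convolved separately, and the results are concatenated, which also uses roughly 4× fewer convolution weights:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)            # ordinary conv
group_conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1, groups=4)  # group conv

x = torch.randn(1, 64, 32, 32)
print(conv(x).shape, group_conv(x).shape)                 # same output shape
print(sum(p.numel() for p in conv.parameters()),
      sum(p.numel() for p in group_conv.parameters()))    # group conv has roughly 4x fewer weights
```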

12.4 What is the implementation of Multi-head attention? What fundamentally distinguishes it from self-attention? As shown in Figure 23, taking two heads as an example, the input Q, K, and V are divided into two parts, and each small part of Q operates with the corresponding K and V separately. The resulting vectors are concatenated, showing that Multi-head attention has a similar implementation method to group convolution.

Figure 23 Difference between Multi-head attention and self-attention
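
A minimal sketch of the split-and-concatenate idea with 2 heads, ignoring batching and the per-head and output projection matrices that the full Transformer also applies; the head split plays the same role as the groups in group convolution:

```python
import math
import torch

def multi_head_attention(Q, K, V, n_heads=2):
    """Split Q, K, V into n_heads parts along the feature dimension,
    run scaled dot-product attention per head, then concatenate the results."""
    d_model = Q.size(-1)
    d_head = d_model // n_heads
    outputs = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)               # this head's slice of the features
        q, k, v = Q[:, sl], K[:, sl], V[:, sl]
        w = torch.softmax(q @ k.T / math.sqrt(d_head), dim=-1)  # per-head attention weights
        outputs.append(w @ v)
    return torch.cat(outputs, dim=-1)                            # concat heads: back to (tokens, d_model)

Q = K = V = torch.randn(4, 512)
print(multi_head_attention(Q, K, V).shape)   # torch.Size([4, 512])
```

In the full model, each head additionally has its own WQ, WK, and WV projections, and the concatenated result is passed through an output matrix WO.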

12.5 How can we understand Multi-head attention from the perspective of input and output dimensions? As shown in Figure 24.

Figure 24 Input and output dimensions of Multi-head attention

13. What Is Mask Multi-Head Attention?

13.1 The position of Mask Multi-head attention in the Transformer architecture is shown in Figure 15.

13.2 Why is there a need for the Mask operation?

When predicting the output at time T, the decoder must not be allowed to see the inputs after time T; this keeps training consistent with prediction.

The Mask operation prevents the i-th word from knowing information about the i+1-th word and beyond, as shown in Figure 25.

Figure 25 Position of the Mask operation in the Transformer

13.3 How is the Mask operation specifically implemented?

Q1 attends only to K1, and Q2 only to K1 and K2; for later positions such as K3 and K4, a very large negative number is added to their scores before the softmax, so that their attention weights become 0 after the softmax, as shown in Figure 26.

Figure 26 Implementation of the Mask operation in matrix computation
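
A minimal sketch of the mask in matrix form: every score above the diagonal is set to a very large negative number (here -inf) before the softmax, so those positions receive exactly zero attention weight:

```python
import math
import torch

n, d = 4, 512
Q = K = V = torch.randn(n, d)

scores = Q @ K.T / math.sqrt(d)
causal_mask = torch.triu(torch.ones(n, n), diagonal=1).bool()   # True above the diagonal
scores = scores.masked_fill(causal_mask, float("-inf"))          # Q_i cannot see K_j for j > i

weights = torch.softmax(scores, dim=-1)   # masked positions become exactly 0
print(weights)                             # a lower-triangular attention matrix
```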

14. What Is ADD?

14.1 Add refers to residual connections, popularized by the 2015 ResNet paper (cited more than 160,000 times). Unlike a general skip connection, a residual connection requires the two tensors being added to have the same dimensions.

14.2 As an embodiment of the principle that simplicity works best, this technique is used in almost every deep learning model; it prevents network degradation and is commonly used to ease the training of deep, multi-layer networks.

Figure 27 Position of ADD in the Transformer architecture (left) and illustration of residual connection principle (right)
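
A minimal sketch (the Linear layer stands in for any sub-layer whose output has the same shape as its input):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 512)                 # input to a sub-layer (e.g. multi-head attention)
sublayer = nn.Linear(512, 512)          # stand-in for any sub-layer with matching dimensions

out = x + sublayer(x)                   # residual (Add): gradients can flow through the identity path
print(out.shape)                        # torch.Size([4, 512])
```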

15. What Is Norm?

15.1 Norm refers to layer normalization.

15.2 Core function: to make training more stable. It serves a purpose similar to batch normalization: both normalize their inputs to zero mean and unit variance.

15.3 Why use layer normalization instead of batch normalization? Because sentences in sequential data have varying lengths; with batch normalization, the large differences in sample lengths within a batch easily make training unstable. BN normalizes the same feature dimension across all samples in a batch, whereas LN normalizes across the features of a single sample.

Figure 28 Position of layer normalization in the Transformer architecture (left) and difference from batch normalization (right)
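
A minimal PyTorch sketch of the difference in normalization axes (batch and sequence sizes are illustrative): LayerNorm normalizes the 512 features of each individual token, while BatchNorm1d normalizes each feature across the whole batch:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 10, 512)                     # (batch, tokens, features)

ln = nn.LayerNorm(512)                          # normalizes over the last (feature) dimension, per token
ln_out = ln(x)

bn = nn.BatchNorm1d(512)                        # normalizes each feature over the batch and token positions
bn_out = bn(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects (batch, features, length)

print(ln_out.shape, bn_out.shape)               # both torch.Size([8, 10, 512])
```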

16. What Is FFN?

16.1 FFN refers to feed-forward networks.

16.2 Why is FFN needed when self-attention is already present? The attention mechanism captures the relational features across the sequence, and the MLP then projects this information into another space and applies a further nonlinear mapping; the two alternate with each other.

16.3 Structure: two MLP layers. The first layer has dimensions 512×2048 and the second 2048×512, and no activation function is applied after the second layer, as shown in Figure 29.

Figure 29 Specific implementation process of FFN
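
A minimal sketch of this structure: two linear layers of sizes 512→2048 and 2048→512, with an activation (ReLU in the original paper) only after the first layer; it is applied to each token position independently:

```python
import torch
import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(512, 2048),   # first layer: 512 -> 2048
    nn.ReLU(),              # nonlinearity only after the first layer
    nn.Linear(2048, 512),   # second layer: 2048 -> 512, no activation afterwards
)

x = torch.randn(4, 512)     # 4 token vectors, processed position by position
print(ffn(x).shape)         # torch.Size([4, 512])
```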

17. How Is the Transformer Trained?

17.1 In terms of data, the Transformer paper mentions the use of 4.5M and 36M translation sentence pairs.

17.2 In terms of hardware, the base model was trained on 8 P100 GPUs for 12 hours, while the large model was trained for 3.5 days.

17.3 Regarding model parameters and tuning:

First, the trainable parameters include WQ, WK, WV, WO, and the parameters of the FFN layer.

Second, the tunable parameters include the dimension of each token vector representation (d_model), the number of heads, the number of repetitions of blocks in the Encoder and Decoder (N), the dimension of the intermediate layer vector in the FFN, label smoothing (confidence 0.1), and dropout (0.1).
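
Collected as a hedged configuration sketch for the base model (values as reported in the paper; the dictionary keys are illustrative):

```python
# Tunable hyperparameters of the base Transformer, as listed in the paper.
transformer_base_config = {
    "d_model": 512,          # dimension of each token's vector representation
    "n_heads": 8,            # number of attention heads
    "N": 6,                  # number of Encoder blocks and of Decoder blocks
    "d_ff": 2048,            # dimension of the FFN's intermediate layer
    "label_smoothing": 0.1,  # label smoothing (confidence reduced by 0.1)
    "dropout": 0.1,          # dropout rate
}
```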

18. Why Does the Transformer Perform Well?

18.1 Although the title is “Attention is All You Need,” subsequent research indicates that Attention, residual connections, layer normalization, and FFN collectively contribute to the success of the Transformer.

18.2 The advantages of the Transformer include:

First, it is the fourth major feature extractor after MLP, CNN, and RNN.

Second, it was initially applied to machine translation; with the emergence of GPT and BERT it gained widespread attention and marked a turning point, after which the NLP field developed rapidly, followed by the rise of multi-modal models, large models, and vision Transformers.

Third, it instills confidence that there can be effective feature extractors beyond CNNs and RNNs.

18.3 What are the shortcomings of the Transformer?

First, it has a high computational load and requires advanced hardware.

Second, due to the lack of inductive bias, it requires a large amount of data to achieve good performance.
