Understanding the Details of Transformers: 18 Key Questions

Source: Artificial Intelligence Research


This article is approximately 5,400 words long; a reading time of 10 minutes or more is recommended.
It walks through the Transformer from all angles in a question-and-answer format.


Source: Zhihu
Author: Wang Chen, who asks questions @ Zhihu

Why summarize Transformers through eighteen questions?

There are two reasons:

First, the Transformer is the fourth major feature extractor after MLP, RNN, and CNN, also known as the fourth foundational model. The recently popular ChatGPT is also fundamentally based on Transformers, highlighting the importance of Transformers.

Second, I hope that by asking questions, it will better help everyone understand the content and principles of Transformers.

1. What was the major breakthrough in deep learning in 2017?

Transformers. There are two aspects to this:

1.1 On one hand, the Transformer is the fourth major feature extractor (also called a foundational model) in deep learning. What is a feature extractor? The brain is how humans interact with the external world (images, text, speech, etc.); a feature extractor is how a computer mimics the brain to interact with that same external world, as shown in Figure 1. For example, the ImageNet dataset contains 1000 classes of images; people have sorted its roughly one million images into 1000 categories based on experience, and each category (e.g., leopard) has its own distinctive features. A neural network (such as ResNet18) likewise tries, through the classification task, to extract or recognize the distinctive features of each category as well as it can. Classification is not the ultimate goal but a means of extracting image features; masking part of an image and completing it is another way to extract features, and shuffling the order of image patches is yet another.


Figure 1: Neural networks mimic neurons in the brain
1.2 On the other hand, the role of Transformers in deep learning: the cornerstone of the third and fourth waves, as shown in Figure 2.

Figure 2: Four stages of deep learning development

2. What is the background of the Transformer?

2.1 In terms of the field's development: by 2017, deep learning had been booming in computer vision for several years, from AlexNet, VGG, and GoogLeNet to ResNet and DenseNet, and from image classification and object detection to semantic segmentation; but it had not yet caused a comparable stir in natural language processing.

2.2 In terms of technical background: (1) The mainstream solution for sequence transduction tasks (such as machine translation) at the time is shown in Figure 3: under the Sequence-to-Sequence architecture (a type of Encoder-Decoder), an RNN extracts features, and the Attention mechanism efficiently passes the features extracted by the Encoder to the Decoder. (2) This approach had two shortcomings: on one hand, the RNN's inherent structure of passing information sequentially from front to back prevents parallel computation; on the other hand, when a sequence is too long, information from the beginning of the sequence can be forgotten. It can therefore be seen that the RNN was the relatively weak part of this framework and needed improvement.


Figure 3: Mainstream solutions for sequence transduction tasks

3. What exactly is a Transformer?

3.1 A Transformer is an architecture composed of an Encoder and a Decoder. So what is an architecture? In its simplest form, an architecture is just a composition of modules, e.g., A + B + C.

3.2 A Transformer can also be understood as a function that maps one sequence to another, for example mapping the input “I love learning” to the output “I love studying”.

3.3 If we break down the architecture of a Transformer, as shown in Figure 4.


Figure 4: Architecture diagram of the Transformer

4. What is the Transformer Encoder?

4.1 From a functional perspective, the core role of the Transformer Encoder is to extract features, and the Transformer Decoder can also be used to extract features. For example, when a person learns to dance, the Encoder observes how others dance, while the Decoder showcases the learned experiences and memories.

4.2 From a structural perspective, as shown in Figure 5, the Transformer Encoder = Embedding + Positional Embedding + N*(Sub-Encoder block 1 + Sub-Encoder block 2);

Sub-Encoder block 1 = Multi-head attention + ADD + Norm;

Sub-Encoder block 2 = Feed Forward + ADD + Norm;

4.3 From the perspective of input and output, the input to the first of the N Transformer Encoder blocks is a set of vectors X = (Embedding + Positional Embedding), typically of size 512*512. The input to each of the remaining N−1 encoder blocks is the output of the previous block, and the output is also of size 512*512 (input and output sizes are the same).

4.4 Why 512*512? The first 512 is the number of tokens: for example, “I love learning” has 4 tokens, and the length is fixed at 512 here to accommodate sequences of different lengths, with padding when a sequence is shorter. The second 512 is the vector dimension generated for each token, i.e., each token is represented by a vector of length 512. It is often said that a Transformer cannot exceed 512, or the hardware can hardly support it; that 512 refers to the former, the number of tokens, because every token must perform self-attention with every other token. The latter 512 also should not be too large, or computation becomes slow.


Figure 5: Architecture diagram of the Transformer Encoder
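
To make the structure in 4.2 and the 512*512 shapes in 4.3–4.4 concrete, here is a minimal PyTorch sketch of one encoder block. It is an illustration rather than the paper's original code: dropout is omitted, and names such as d_model, n_heads, and d_ff are chosen for readability.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sub-Encoder block 1 (multi-head attention + Add + Norm) followed by
    Sub-Encoder block 2 (feed forward + Add + Norm)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)     # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)         # Add + Norm
        x = self.norm2(x + self.ffn(x))      # Add + Norm
        return x                             # same shape as the input

x = torch.randn(1, 512, 512)                 # (batch, 512 tokens, 512-dimensional vectors)
print(EncoderBlock()(x).shape)               # torch.Size([1, 512, 512])
```

Because input and output shapes match, N such blocks can simply be stacked one after another.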

5. What is the Transformer Decoder?

5.1 From a functional perspective, compared to the Transformer Encoder, the Transformer Decoder is better at generative tasks, especially for natural language processing problems.

5.2 From a structural perspective, as shown in Figure 6, the Transformer Decoder = Embedding + Positional Embedding + N*(Sub-Decoder block 1 + Sub-Decoder block 2 + Sub-Decoder block 3) + Linear + Softmax;

Sub-Decoder block 1 = Mask Multi-head attention + ADD + Norm;
Sub-Decoder block 2 = Multi-head attention + ADD + Norm;
Sub-Decoder block 3 = Feed Forward + ADD + Norm;

Figure 6: Architecture diagram of the Transformer Decoder
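
Correspondingly, a minimal PyTorch sketch of one decoder block might look as follows. Again, this is an illustration, not the original implementation; the causal mask is the one discussed in question 13, and dropout is omitted.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sub-Decoder block 1 (masked self-attention), Sub-Decoder block 2 (cross-attention
    over the encoder output), Sub-Decoder block 3 (feed forward), each with Add + Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y, enc_out, causal_mask):
        # Block 1: the decoder attends only to already generated positions.
        a, _ = self.self_attn(y, y, y, attn_mask=causal_mask)
        y = self.norm1(y + a)
        # Block 2: Q comes from the decoder, K and V from the encoder output.
        a, _ = self.cross_attn(y, enc_out, enc_out)
        y = self.norm2(y + a)
        # Block 3: position-wise feed-forward network.
        return self.norm3(y + self.ffn(y))

T = 20
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True = future position, masked out
y, enc_out = torch.randn(1, T, 512), torch.randn(1, 512, 512)
print(DecoderBlock()(y, enc_out, causal).shape)                      # torch.Size([1, 20, 512])
```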

5.3 Looking separately at the roles of (Embedding + Positional Embedding), the N Decoder blocks, and (Linear + Softmax):

Embedding + Positional Embedding: Taking machine translation as an example, input “Machine Learning” outputs “机器学习”; here, the Embedding transforms “机器学习” into a vector format.

N Decoder blocks: the process of feature processing and transmission.

Linear + Softmax: the Softmax predicts the probability of the next word, as shown in Figure 7; the Linear layer before it plays a role similar to the final fully connected classification layer at the end of a classification network such as ResNet18.


Figure 7: The role of Softmax in the Transformer Decoder
5.4 What are the inputs and outputs of the Transformer Decoder? They differ between training and testing.
During the training phase, as shown in Figure 8, the labels are known. The first input to the decoder is the Begin token, and the first output vector is compared against the first token of the label with a cross-entropy loss; the second input to the decoder is the first token of the label, and so on (teacher forcing), with the End token marking the end of the target sequence. It can also be seen here that training can be carried out in parallel.

Figure 8: Input and output of the Transformer Decoder during training
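
As a small illustration of this training setup (teacher forcing), the sketch below uses made-up token ids; the essential idea is that the decoder input is the label sequence shifted right by one Begin token, and the loss is cross entropy against the unshifted labels.

```python
import torch
import torch.nn.functional as F

# Target sentence (label):      机   器   学   习  <End>
# Decoder input (shifted):   <Begin> 机   器   学   习
target        = torch.tensor([[5, 6, 7, 8, 2]])   # made-up ids; 2 = <End>
decoder_input = torch.tensor([[1, 5, 6, 7, 8]])   # 1 = <Begin>, then the labels shifted right

# Stand-in for the decoder's output logits over a 10000-word vocabulary.
logits = torch.randn(1, 5, 10000, requires_grad=True)
loss = F.cross_entropy(logits.view(-1, 10000), target.view(-1))
loss.backward()   # all 5 positions are trained at once, hence parallel training
```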
During the testing phase, the input at each time step is the output from the previous time step, as shown in Figure 9. The decoder's inputs during training and testing therefore do not match exactly, and at test time one wrong step can indeed lead to errors at every subsequent step. There are two common remedies: one is to occasionally feed erroneous tokens during training, and the other is scheduled sampling.

Figure 9: Input and output of the Transformer Decoder during testing
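
A minimal sketch of this test-time loop (greedy decoding) is shown below; `model(src, tgt)` is a hypothetical callable returning logits of shape (1, len(tgt), vocab_size), standing in for the full Transformer.

```python
import torch

def greedy_decode(model, src, begin_id=1, end_id=2, max_len=50):
    """Greedy decoding sketch: at each step the previous outputs are fed back in.
    `model(src, tgt)` is assumed to return logits of shape (1, len(tgt), vocab_size)."""
    tgt = torch.tensor([[begin_id]])                  # start with the Begin token
    for _ in range(max_len):
        logits = model(src, tgt)                      # rerun the decoder on everything generated so far
        next_id = logits[0, -1].argmax().item()       # most likely next token
        tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
        if next_id == end_id:                         # stop once the End token is produced
            break
    return tgt
```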
5.5 What are the inputs and outputs of the Transformer Decoder block itself? The previous subsection discussed the decoder's overall inputs and outputs during training and testing; what about the inputs and outputs of the internal Transformer Decoder block, shown in Figure 10?

Figure 10: Architecture diagram of the Transformer Decoder block

For the first of the N = 6 decoder blocks: the input to Sub-Decoder block 1 is the Embedding + Positional Embedding; the Q of Sub-Decoder block 2 comes from the output of Sub-Decoder block 1, while its K and V come from the output of the last layer of the Transformer Encoder.

For the second of the N = 6 decoder blocks: the input to Sub-Decoder block 1 is the output of the first decoder block, and the K and V of Sub-Decoder block 2 again come from the last layer of the Transformer Encoder.

Overall, whether during training or testing, the input to the Transformer Decoder comes not only from the ground truth (or the decoder's output at the previous time step) but also from the last layer of the Transformer Encoder.

During training: the input to the ith decoder = encoder output + ground truth embedding.

During prediction: the input to the ith decoder = encoder output + the output from the (i-1)th decoder.
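
This wiring can be sketched in a few lines of PyTorch (shapes are illustrative): the query comes from the decoder side, while the key and value come from the last encoder layer.

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
enc_out    = torch.randn(1, 512, 512)   # output of the last encoder layer
dec_hidden = torch.randn(1, 20, 512)    # output of the masked self-attention sub-block
out, _ = cross_attn(query=dec_hidden, key=enc_out, value=enc_out)
print(out.shape)                        # torch.Size([1, 20, 512]) -- follows the query's length
```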

6. What are the differences between the Transformer Encoder and Transformer Decoder?

6.1 In terms of function, the Transformer Encoder is commonly used to extract features, while the Transformer Decoder is often used for generative tasks. The Transformer Encoder and Decoder represent two different technical routes, with Bert adopting the former and the GPT series models adopting the latter.

6.2 In terms of structure, the Transformer Decoder block includes three Sub-Decoder blocks, while the Transformer Encoder block includes two Sub-Encoder blocks, and the Transformer Decoder uses Mask multi-head Attention.

6.3 From the perspective of input and output: after the N Transformer Encoder blocks have finished their computation, the encoder's output is fed into the Transformer Decoder, where it serves as the K and V in QKV. How, then, is the output of the last encoder layer sent to the decoder? As shown in Figure 11.

Figure 11: Interaction between the Transformer Encoder and Decoder

So, why must the Encoder and Decoder use this interaction method? It is not absolutely necessary; different interaction methods have been proposed subsequently, as shown in Figure 12.


Figure 12: Interaction methods between the Transformer Encoder and Decoder

7. What is Embedding?

7.1 The position of Embedding in the Transformer architecture is shown in Figure 13.

7.2 Background: Computers cannot directly process a single word or character; a token needs to be transformed into a vector that can be recognized by the computer, which is the embedding process.

7.3 Implementation: the simplest embedding is a one-hot vector, but one-hot vectors have the disadvantage of not capturing relationships between words; Word Embedding was therefore developed later, as shown in Figure 13.


Figure 13: Some explanations of Embedding, from left to right: position of embedding in the Transformer, one-hot vector, Word embedding.
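
A minimal sketch contrasting the two options (the vocabulary size and token ids below are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 10000, 512
token_ids = torch.tensor([3, 17, 256, 9])              # e.g. the 4 tokens of a short sentence

one_hot = F.one_hot(token_ids, vocab_size).float()     # (4, 10000): sparse, no notion of word similarity
embed   = nn.Embedding(vocab_size, d_model)            # learned lookup table
vectors = embed(token_ids)                             # (4, 512): dense vectors trained end to end
```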

8. What is Positional Embedding?

8.1 The position of Positional Embedding in the Transformer architecture is shown in Figure 14.

8.2 Background: RNNs, as feature extractors, inherently carry the sequential information of words; however, the Attention mechanism does not consider the order of words, which significantly affects semantics. Therefore, it is necessary to add positional information to the input embeddings through Positional Embedding.

8.3 Implementation: either a fixed, hand-designed position encoding (the sinusoidal encoding used in the original paper) or position embeddings learned by the network.


Figure 14: Some explanations of Positional Embedding, from left to right: position of positional embedding in the Transformer, implementation of traditional position encoding, image generated by traditional position encoding, where each column represents the position encoding of a token.
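
The fixed encoding from the original paper uses interleaved sines and cosines, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); a minimal sketch is shown below.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len=512, d_model=512):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)               # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                         # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

x = torch.randn(512, 512)                        # token embeddings: 512 tokens, 512 dimensions
x = x + sinusoidal_positional_encoding()         # position information is simply added on
```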

9. What is Attention?

9.1 Why introduce Attention when discussing Transformers? Because the most frequent components in Transformers, multi-head attention and Mask multi-head attention, derive from Scaled dot product attention, which in turn comes from self-attention; thus, it is essential to understand Attention, as shown in Figure 15.


Figure 15: The relationship between Attention and Transformers

9.2 What does Attention mean? For images, attention refers to the core areas of focus that people see in the image; it highlights the key points in the image, as shown in Figure 16. For sequences, the Attention mechanism essentially aims to find the relationships between different tokens in the input by using a weight matrix to spontaneously identify the relationships between words.


Figure 16: Attention in images

9.3 How is Attention implemented? It is achieved through QKV.

What are QKV? Q is for query, K is for keys, and V is for values. For example, Q is the signal sent by the brain, indicating “I am thirsty”; K is the environmental information, the world seen by the eyes; V assigns different weights to various items in the environment, increasing the weight of water.

In summary, Attention is calculated by computing the similarity between Q and K and multiplying it by V to obtain the attention value.


Figure 17: Implementation of Attention
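
In the notation of the original paper, this computation is

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V,

where the softmax over the Q–K similarities produces the weights that are applied to V; the √d_k scaling is the topic of question 11.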

9.4 Why must there be Q, K, and V? Why not just Q? The relationship between token 1 and token 2 requires two weights, a12 and a21, which are in general different. You might ask whether we could force a12 = a21; that is possible, but in theory a symmetric weight should not perform as well as two independent weights, and independent weights require separate Q and K projections.

Why not just Q and K? The obtained weight coefficients have to be applied back to the input; in principle they could be multiplied by Q or by K, so why multiply by a separate V? One view is that this provides an additional set of trainable parameters, WV, enhancing the network's learning capacity.

10. What is Self Attention?

10.1 Why introduce self Attention when discussing Transformers? Because the most common multi-head attention and Mask multi-head attention in Transformers derive from Scaled dot product attention, which in turn comes from self-attention, as shown in Figure 15.

10.2 What is self-attention? Self-attention, local attention, and stride attention are all kinds of attention. Self-attention computes attention coefficients between every Q and every K, as shown in Figure 18; local attention computes coefficients only between each Q and its neighboring K; and stride attention computes coefficients between Q and K at strided (skipped) positions.

Figure 18: From left to right: self attention, local attention, stride attention

10.3 Why can self attention be used to handle sequence data like machine translation? Each position’s data in the input sequence can focus on information from other positions, thus extracting features or capturing relationships between each token in the input sequence through Attention scores.

10.4 How is self attention specifically implemented? It is divided into four steps, as shown in Figure 19.


Figure 19: Implementation process of self attention
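
One common way to spell those steps out in code is sketched below (projection to Q, K, V; scoring; softmax; weighted sum); the exact decomposition in Figure 19 may group the steps differently.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: trainable projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # 1) project x into Q, K and V
    scores = q @ k.T / (k.shape[-1] ** 0.5)          # 2) similarity of every Q with every K (scaled)
    weights = torch.softmax(scores, dim=-1)          # 3) normalize the scores into attention weights
    return weights @ v                               # 4) weighted sum of the values

x = torch.randn(4, 512)                              # 4 tokens, 512 dimensions each
w_q, w_k, w_v = (torch.randn(512, 64) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # torch.Size([4, 64])
```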

11. What is Scaled Dot Product Attention?

11.1 The two most common attention functions are dot-product attention and additive attention, as shown in Figure 20; the former is more computationally efficient in practice.


Figure 20: Difference between dot product attention and additive attention

11.2 What does Scaled mean? The specific implementation is shown in Figure 21: the Q-K inner products are divided by the square root of the key dimension before the softmax. This prevents the inner products from becoming too large; otherwise the softmax saturates, with some weights approaching 1 and the corresponding gradients becoming very small, which makes training harder. In this respect it is somewhat similar in spirit to batch normalization.


Figure 21: The position of the scaled operation in attention
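
A small numerical sketch of why the scaling helps: assuming the components of q and k have unit variance, the variance of the dot product q·k grows linearly with d_k, so the unscaled scores become large and the softmax saturates.

```python
import torch

d_k = 512
q, k = torch.randn(10000, d_k), torch.randn(10000, d_k)   # 10000 random query/key pairs
scores = (q * k).sum(dim=-1)                               # raw dot products
print(scores.std())                 # roughly sqrt(512) ~ 22.6: large scores, saturated softmax
print((scores / d_k ** 0.5).std())  # roughly 1 after scaling: softmax stays in a well-behaved range
```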

12. What is Multi-Head Attention?

12.1 The position of Multi-head attention in the Transformer architecture is shown in Figure 15.

12.2 Background: a CNN has multiple channels, which can extract feature information along different dimensions of an image; can self-attention have a similar mechanism, extracting several different kinds of information about the relationships between tokens at different distances?

12.3 What is group convolution? As shown in Figure 22, group convolution divides the input features into several groups for separate convolution operations, which are then concatenated.


Figure 22: Group convolution

12.4 How is Multi-head attention implemented? What fundamentally distinguishes it from self-attention? As shown in Figure 23, taking two heads as an example: Q, K, and V are each split into two parts, each head performs attention with its own slice of Q, K, and V, and the resulting vectors are then concatenated. It can thus be seen that Multi-head attention is implemented in much the same way as group convolution.


Figure 23: Difference between Multi-head attention and self attention
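
A minimal sketch with the head-splitting written out explicitly (two heads, as in the example above; the projection sizes are illustrative, and real implementations batch the per-head computation exactly as done here):

```python
import torch

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads=2):
    """x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model) projection matrices."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split the last dimension into n_heads smaller heads.
    q = (x @ w_q).view(seq_len, n_heads, d_head).transpose(0, 1)   # (heads, seq, d_head)
    k = (x @ w_k).view(seq_len, n_heads, d_head).transpose(0, 1)
    v = (x @ w_v).view(seq_len, n_heads, d_head).transpose(0, 1)
    weights = torch.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)
    heads = weights @ v                                            # each head attends independently
    concat = heads.transpose(0, 1).reshape(seq_len, d_model)       # concatenate the heads again
    return concat @ w_o                                            # final output projection

x = torch.randn(4, 512)
w_q, w_k, w_v, w_o = (torch.randn(512, 512) for _ in range(4))
print(multi_head_attention(x, w_q, w_k, w_v, w_o).shape)           # torch.Size([4, 512])
```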

12.5 How can we understand Multi-head attention from the perspective of input and output dimensions? As shown in Figure 24.


Figure 24: Input and output dimensions of Multi-head attention

13. What is Mask Multi-Head Attention?

13.1 Mask Multi-head attention’s position in the Transformer architecture is shown in Figure 15.

13.2 Why is a Mask operation needed? When predicting the output at time step T, the Transformer decoder must not be able to see the inputs after time step T; this keeps training consistent with prediction.

The Mask operation prevents the i-th word from seeing information about the (i+1)-th word and beyond, as shown in Figure 25.

Figure 25: The position of the Mask operation in the Transformer

13.3 How is the Mask operation implemented? Q1 is computed only with K1, Q2 only with K1 and K2, and so on; the scores for the masked positions (K3, K4, etc.) are set to a very large negative number before the softmax, so that their weights become zero after the softmax, as shown in Figure 26.


Figure 26: Matrix computation implementation of the Mask operation
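
A minimal sketch of this masking trick, using an upper-triangular mask filled with a very large negative value (here -inf) before the softmax:

```python
import torch

scores = torch.randn(4, 4)                          # raw Q-K scores for a 4-token sequence
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))    # future positions get a very large negative value
weights = torch.softmax(scores, dim=-1)             # ...so their attention weights become 0
print(weights)    # row i has non-zero weights only for columns 0..i
```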

14. What is ADD?

14.1 Add refers to residual connections, popularized by the 2015 ResNet paper (cited over 160,000 times); the difference from a general skip connection is that the two branches must have the same dimensions, so that they can be added element-wise.

14.2 As an ultimate embodiment of the idea of simplicity, this technique is used in almost every deep learning model; it prevents network degradation and is commonly used to ease the training of very deep networks.


Figure 27: The position of ADD in the Transformer architecture (left) and a diagram illustrating the principle of residual connections (right)
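
In code, the idea amounts to nothing more than the following sketch:

```python
import torch.nn as nn

class Residual(nn.Module):
    """Add: output = x + sublayer(x); both branches must have the same dimension."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(x)   # the identity path lets gradients flow through unchanged
```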

15. What is Norm?

15.1 Norm refers to layer normalization.

15.2 The core function is to stabilize training, and it has a similar function to batch normalization, both aiming to make the mean of the input samples zero and the variance one.

15.3 Why use layer normalization rather than batch normalization? For sequence data, the input sentences have different lengths; with batch normalization, the varying lengths of the samples in a batch make the batch statistics, and hence training, unstable. BN normalizes the same feature across all samples in a batch, while LN normalizes all the features within a single sample.

Figure 28: The position of layer normalization in the Transformer architecture (left) and the difference from batch normalization (right)
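
A small sketch of the difference in PyTorch (the tensor shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 512)            # (batch, seq_len, d_model)

ln = nn.LayerNorm(512)                 # normalizes the 512 features of each token within one sample
bn = nn.BatchNorm1d(512)               # normalizes each feature across all samples in the batch

y_ln = ln(x)                                   # works directly on (batch, seq, features)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)   # BatchNorm1d expects (batch, features, seq)
```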

16. What is FFN?

16.1 FFN refers to feed-forward networks.

16.2 Why is an FFN needed when self-attention layers are already present? Attention has already captured the desired sequence features; the role of the MLP is to project that information into another space and apply a nonlinear mapping, and the two are stacked in alternation with self-attention.

16.3 Structurally, it consists of a two-layer MLP: the first layer maps the dimension from 512 to 2048 and the second maps it from 2048 back to 512; the second layer uses no activation function, as shown in Figure 29.


Figure 29: The specific implementation process of FFN
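
A minimal sketch of this two-layer MLP, following the 512 → 2048 → 512 sizes stated above:

```python
import torch.nn as nn

# Position-wise feed-forward network: 512 -> 2048 -> 512, applied to every token independently.
ffn = nn.Sequential(
    nn.Linear(512, 2048),   # first layer expands the dimension
    nn.ReLU(),              # the nonlinearity sits only after the first layer
    nn.Linear(2048, 512),   # second layer projects back; no activation afterwards
)
```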

17. How is the Transformer trained?

17.1 In terms of data, the Transformer paper reports training on 4.5M (English-German) and 36M (English-French) pairs of translated sentences.

17.2 In terms of hardware, the base model was trained on 8 P100 GPUs for 12 hours, while the large model was trained for 3.5 days.

17.3 In terms of model parameters and tuning:

First, the trainable parameters include WQ, WK, WV, and WO, as well as the parameters of the FFN layers.

Second, the tunable parameters include: the dimension of the vector representation for each token (d_model), the number of heads, the number of repetitions of blocks in the Encoder and Decoder, the dimension of the intermediate layer vector in the FFN, label smoothing (confidence 0.1), and dropout (0.1).

18. Why is the Transformer effective?

18.1 Although the title is “Attention is all you need”, subsequent research indicates that Attention, residual connections, layer normalization, and FFN collectively contribute to the success of Transformers.

18.2 The advantages of Transformers include:

First, they represent the fourth major feature extractor after MLP, CNN, and RNN.

Second, initially used for machine translation, they later exploded in popularity with the emergence of GPT and Bert, marking a turning point that rapidly advanced the NLP field and led to the rise of multimodal models, large models, and vision Transformers.

Third, they instill confidence that there can be effective feature extractors beyond CNNs and RNNs.

18.3 What are the shortcomings of Transformers?

First, they require significant computational resources and have high hardware demands.

Second, due to the lack of inductive bias, they require a lot of data to achieve good results.

Finally, the references for this article are based on the Transformer paper, courses by Li Hongyi and Li Mu, and some excellent shares on Zhihu regarding Transformers. I will not introduce them all here (as I did not record the references in a timely manner during the learning process). If there are any infringements, please let me know, and I will make timely notes or modifications.

Editor: Huang Jiyan

