Understanding Transformer and Its Variants


Author: Jiang Runyu, Harbin Institute of Technology SCIR

Introduction

In recent years, one of the most impressive achievements in NLP has undoubtedly been the pre-trained models represented by Google's BERT. They keep setting new records (both in task metrics and in computational requirements), surpass average human performance on many tasks, and exhibit excellent transferability along with a degree of interpretability.

For instance, when we need to explain in a paper why an algorithm or modification works, a heatmap based on attention clearly illustrates what our code has achieved.

Figure 1: A common attention heatmap in papers

Currently, mainstream pre-trained models are modified based on the Transformer model proposed by Google in 2017, using it as their feature extractor. It can be said that the Transformer has completely changed the deep learning field since its emergence, especially in NLP.

This article primarily introduces the Transformer and some of its optimized variants in recent years.

Transformer

If I had to introduce the Transformer in one sentence, it would be: “The first model to completely abandon RNN recurrence and CNN convolution, relying solely on attention for feature extraction.” This is also the title of the paper, “Attention Is All You Need”.

The application of the attention mechanism in the field of NLP can be traced back to 2014 when the Bengio team introduced Attention into the NMT (Neural Machine Translation) task. However, at that time, Attention was merely an auxiliary structure, with the core architecture still being RNN. The Transformer completely uses the Attention mechanism as the foundational architecture, discarding the previous CNN and RNN networks.

The basic structure of the Transformer is shown in the figure below: the left half is the Encoder and the right half is the Decoder. The Transformer stacks six copies of this structure in both the Encoder and the Decoder.

Figure 2: Detailed architecture of the Transformer

Taking the translation model as an example, here is the overall structure diagram of the Transformer:

Figure 3: Overall architecture of the Transformer

The above is a general introduction to the Transformer; the following will explain various innovations of the Transformer.

Attention

The Transformer uses attention in three places: self-attention in the Encoder, masked self-attention in the Decoder, and encoder-decoder attention connecting the two. The Decoder's special layer is called Masked Attention: during decoding, the model should only see the tokens up to the current position, so the tokens to its right are masked out, preserving the autoregressive property.
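As a rough illustration of the masking idea (a NumPy sketch with my own variable names, not the original implementation), positions to the right of the current token can be blocked by adding a large negative value to the corresponding attention scores before the softmax:

```python
import numpy as np

def causal_mask(seq_len):
    """0 where attention is allowed, -1e9 (effectively -inf) above the diagonal."""
    return np.triu(np.full((seq_len, seq_len), -1e9), k=1)

scores = np.random.randn(5, 5)                    # raw Q·K^T scores for a toy 5-token sequence
masked = scores + causal_mask(5)                  # future positions get -1e9
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
print(np.round(weights, 2))                       # upper triangle is ~0: no attention to future tokens
```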

Scaled Dot-Product Attention

Self-Attention essentially enhances the representation of the current word by folding in information from its context. This is similar in spirit to what the Bengio team did in 2014 when introducing Attention into NMT.

In the Transformer, this part is implemented as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where Q is the query, K is the key, and V is the value. The dot product of Q and K reflects how strongly each context word influences the center word; the scores are scaled by $\sqrt{d_k}$ and normalized with a softmax before being used to weight V.

Figure 4: Attention calculation path
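As a minimal illustration of the formula above, here is a NumPy sketch of scaled dot-product attention (the shapes and variable names are my own, chosen only for the example):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return weights @ V                              # weighted sum of values

# toy example: 4 tokens, model dimension 8; self-attention uses the same sequence for Q, K, V
Q = K = V = np.random.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                                    # (4, 8)
```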

Multi-Head Attention

Multi-Head Attention is one of the Transformer's main innovations.

Figure 5: Multi-Head Attention calculation path

It projects the original 512-dimensional Q, K, V through 8 different linear projections to obtain 8 groups of lower-dimensional $Q_i, K_i, V_i$, each of dimension 64. The formula is as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_8)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)$$

This approach reduces the size of each attention head, so the computational load does not significantly increase.
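The following NumPy sketch mirrors this description: 8 heads, each projecting the 512-dimensional input down to 64 dimensions, computing attention independently, and concatenating the results (the projection matrices here are random stand-ins for learned weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads=8, d_model=512):
    d_head = d_model // num_heads                     # 512 / 8 = 64 dimensions per head
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(num_heads):
        # per-head projections W_i^Q, W_i^K, W_i^V (random stand-ins for learned weights)
        Wq = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Wk = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Wv = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)
    Wo = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)   # output projection W^O
    return np.concatenate(heads, axis=-1) @ Wo        # concat 8 x 64 -> 512, then project

X = np.random.randn(10, 512)                          # 10 tokens
print(multi_head_attention(X).shape)                  # (10, 512)
```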

Regarding why to use multi-head attention instead of single-head attention, the authors of “Attention Is All You Need” believe that: Average attention weighting reduces effective resolution, meaning it cannot fully reflect information from different representation subspaces. Using multi-head attention is somewhat similar to using multiple convolution kernels within the same convolutional layer in CNNs. It enhances the model’s ability to capture different characteristics exhibited by text in different subspaces, avoiding the suppression of such characteristics by average pooling.

However, there is still no conclusive explanation of whether, and why, the multi-head mechanism actually helps.

A large body of research shows that individual layers of Transformer-based BERT play distinct roles, with lower layers focusing more on syntax and upper layers more on semantics. If each layer as a whole attends to one kind of information, the different heads within that layer should also focus on the same aspects, which would contradict the authors' explanation.

In fact, in the paper “A Multiscale Visualization of Attention in the Transformer Model”, the authors analyzed some attention heads in the first few layers of BERT, as shown below. The results show that within the same layer there are always one or two heads that focus on different positions from the rest, while the remaining heads attend to largely the same things (considering that the heads in the same layer are trained independently, this is quite remarkable).

Figure 6: Attention patterns of heads 0-5 for the same input in layers 0-3 of BERT.

In the article “What Does BERT Look At? An Analysis of BERT’s Attention”, the authors analyzed the differences between heads in the same layer and how these differences change across layers. The results, shown in the figure below, suggest that the differences between heads shrink as the layer index grows, i.e., the higher the layer, the more similar the heads become. Unfortunately, the paper does not offer a good explanation for this phenomenon.

Figure 7: Projection of the differences between heads in each layer of BERT onto a two-dimensional plane

In my personal opinion, the multi-head mechanism may work as follows: attention is highly redundant (even independently trained heads tend to focus on similar positions), so it is the relatively rare “outlier” heads that provide extra signal and further improve the model. Because such outlier heads do not appear often, a larger number of heads is needed to make it likely that some of them emerge.

Positional Encoding

Since the Attention mechanism does not inform the model about the positional relationships between words (which is different from RNN and CNN), positional information encoding needs to be introduced additionally.

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$

The reason for choosing this encoding is simple: it captures relative positional relationships well, because sine and cosine satisfy convenient angle-addition identities, so the encoding at position pos+k can be expressed as a linear function of the encoding at position pos.

The authors also mention that learned positional embeddings could be used instead; experiments show essentially no difference in quality between the two, while the formula-based method is simpler and can handle sequences longer than those seen during training.
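A minimal NumPy sketch of the sinusoidal encoding, which works for arbitrary sequence lengths without any training (illustrative only):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even indices: sine
    pe[:, 1::2] = np.cos(angle)                       # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=1024, d_model=512)
print(pe.shape)                                       # any length works, no parameters to learn
```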

Disadvantages

The Transformer also has some shortcomings and limitations:

Not Turing-complete (proof omitted): put simply, the Transformer cannot solve every kind of problem. For example, when the task is to output an exact copy of the input, the Transformer does not learn this operation well.

Not suitable for ultra-long sequences: when processing whole articles, the sequence length easily exceeds 512. If we keep increasing the maximum sequence length, the cost of attention grows quadratically with that length, which quickly becomes unaffordable during training. In practice, the text is therefore simply truncated at a fixed length without regard to its natural segmentation (such as punctuation), degrading the modeling of long-distance dependencies.

The computational resource allocation is the same for different words: During the Encoder process, all input tokens have the same computational load. However, in a sentence, some words are relatively more important, while others are not very meaningful. Assigning the same computational resources to all these words is clearly wasteful.

Although the original Transformer is not without problems, such as an inflexible, fixed number of layers and computational requirements that make ultra-long sequences impractical, its excellent feature-extraction ability has attracted many researchers. Numerous variants have been proposed to improve on or work around its shortcomings; Universal Transformer, Transformer-XL, and Reformer are typical representatives.

Universal Transformer

From the structural perspective, the Universal Transformer does not differ much from the Transformer, so I will not elaborate on it here, mainly discussing its biggest innovation.

Figure 8: Universal Transformer model architecture

In the Transformer, the output of the attention sub-layer passes through a fully connected layer; in the Universal Transformer, it instead passes through a shared-weight Transition function, and the whole block is applied repeatedly, iterating over time steps.

Figure 9: The Universal Transformer reintroduces a recurrence mechanism

Here, the vertical direction represents the order of the text sequence and the horizontal direction represents the time steps. The computation at each step is as follows:

$$A^t = \mathrm{LayerNorm}\big((H^{t-1} + P^t) + \mathrm{MultiHeadSelfAttention}(H^{t-1} + P^t)\big)$$

$$H^t = \mathrm{LayerNorm}\big(A^t + \mathrm{Transition}(A^t)\big)$$

In this case, the Transition function can be a fully connected layer as before, or it can be another function layer.
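A simplified sketch of one refinement step under these formulas, with the self-attention and Transition sub-layers passed in as stand-in functions (this is my own illustrative code, not the authors' implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu, sigma = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def universal_transformer_step(H, self_attention, transition, coord_encoding):
    """One recurrent refinement step: every position is updated in parallel,
    and the SAME attention/transition weights are reused at every step."""
    A = layer_norm((H + coord_encoding) + self_attention(H + coord_encoding))
    H_next = layer_norm(A + transition(A))            # shared-weight Transition function
    return H_next

# toy run: 10 tokens, d=64, identity stand-ins for the learned sub-layers
H = np.random.randn(10, 64)
for t in range(3):                                    # T refinement steps over the same sequence
    coord = np.zeros_like(H)                          # position+time coordinate encoding omitted here
    H = universal_transformer_step(H, lambda x: x, lambda x: x, coord)
print(H.shape)
```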

In the original Transformer, the positional encoding did not need to encode depth, since the number of layers was fixed. The Universal Transformer adds a time dimension, so every iteration applies a coordinate encoding over both position and time step:

$$P^t_{pos,\,2j} = \sin\!\left(\frac{pos}{10000^{2j/d}}\right) + \sin\!\left(\frac{t}{10000^{2j/d}}\right), \qquad P^t_{pos,\,2j+1} = \cos\!\left(\frac{pos}{10000^{2j/d}}\right) + \cos\!\left(\frac{t}{10000^{2j/d}}\right)$$

To control the number of iterations, the model introduces an Adaptive Computation Time (ACT) mechanism.

ACT can adjust the number of computation steps, and the Universal Transformer with the ACT mechanism is called the Adaptive Universal Transformer. The following figure illustrates that after introducing the ACT mechanism, the model will perform more iterations on more important tokens in the text, while reducing computational resource investment for relatively unimportant words.

Figure 10: The Universal Transformer model allocates more resources to important tokens

The Universal Transformer thus improves on two of the Transformer's shortcomings: the lack of Turing completeness and the uniform allocation of computation across tokens.

Transformer-XL

Theoretically, the attention mechanism allows the Transformer to capture dependencies between tokens at any distance. However, due to computational constraints (which the next model addresses), the text is usually split into segments of at most a fixed length (512 by default), and each segment is processed independently without interaction with the others.

This means that the dependencies between segments, or the dependencies between tokens that exceed a distance of 512, cannot be modeled or extracted at all. Simultaneously, this leads to a context fragmentation problem, as the segmentation is not based on semantic boundaries but on length, which may split a complete sentence into two. Thus, when making predictions on such split sentences, necessary semantic information may be missing.

The Transformer-XL proposes a Segment-level Recurrence to solve this problem.

In one sentence, Segment-level Recurrence means that while processing the current segment, it caches and utilizes the hidden vectors of all layers from the previous segment, and the hidden vectors of the previous segment only participate in forward computation without backpropagation.

Figure 11: In Transformer-XL, nodes can “see” the content of the previous segment

Let’s delve into the computation. Assume each segment has length L and the model contains N Transformer-XL layers, so each segment produces N groups of hidden vectors of length L. The hidden vectors of the t-th segment at the n-th layer can be written as $h_t^n \in \mathbb{R}^{L \times d}$, where d is the hidden dimension. The hidden vectors of the (t+1)-th segment at the n-th layer are then computed with the formulas below, where SG denotes stop-gradient, i.e. no gradients are propagated back into the previous segment's hidden vectors.

$$\tilde{h}^{\,n-1}_{t+1} = \left[\mathrm{SG}(h^{n-1}_{t}) \circ h^{n-1}_{t+1}\right]$$

$$q^{n}_{t+1},\ k^{n}_{t+1},\ v^{n}_{t+1} = h^{n-1}_{t+1}W_q^{\top},\ \tilde{h}^{\,n-1}_{t+1}W_k^{\top},\ \tilde{h}^{\,n-1}_{t+1}W_v^{\top}$$

$$h^{n}_{t+1} = \text{Transformer-Layer}\big(q^{n}_{t+1},\ k^{n}_{t+1},\ v^{n}_{t+1}\big)$$

From the figure we can see that, within the current segment, each hidden vector at the n-th layer is computed from the (n-1)-th-layer hidden vectors of both the current segment and the previous one. Each position therefore depends not only on its own segment but also on up to L-1 positions of the previous segment, and this dependency range grows by (L-1) with every additional layer. The longest dependency length is therefore N(L-1), where N is the number of layers. When processing long texts, the hidden vectors of the previous segment can be cached and reused, avoiding redundant computation and significantly improving efficiency.
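The following NumPy sketch illustrates the recurrence: queries come from the current segment, while keys and values are built from the cached previous segment concatenated with the current one (stop-gradient is indicated only as a comment; in a real framework the cached memory would be detached from the computation graph, and the cache would normally store hidden states rather than raw inputs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def xl_layer(h_current, memory, Wq, Wk, Wv):
    """One Transformer-XL style layer: queries from the current segment,
    keys/values from [cached previous segment ; current segment]."""
    # SG(memory): in PyTorch this would be memory.detach() -- no gradient flows back
    extended = np.concatenate([memory, h_current], axis=0)   # (2L, d)
    Q = h_current @ Wq                                       # (L, d)
    K, V = extended @ Wk, extended @ Wv                      # (2L, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V                                          # each position also sees up to L past tokens

L, d = 4, 16
rng = np.random.default_rng(0)
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)
memory = np.zeros((L, d))                                    # cache from the previous segment
for segment in np.split(rng.standard_normal((3 * L, d)), 3): # three consecutive segments
    out = xl_layer(segment, memory, Wq, Wk, Wv)
    memory = segment                                         # cache this segment for the next one
print(out.shape)                                             # (L, d)
```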

Since the original positional encoding cannot distinguish tokens at the same position in different segments, the authors replace absolute positional encoding with Relative Positional Encoding. This is easy to understand: when processing a sequence, the absolute position of a token is not what matters; during attention computation we only need the relative distance between two tokens. Since this part mainly serves as a fix, it will not be elaborated further.

In summary, the Transformer-XL addresses the long-distance dependency issue to some extent without significantly increasing computational requirements.

Reformer

One important reason why the Transformer's maximum length is set to 512 rather than something larger is that attention requires computing $QK^\top$, whose cost grows as $O(L^2)$ in the sequence length L (multi-head attention does not reduce this cost); this is one reason the Transformer does not handle long-distance dependencies well. On the other hand, the memory consumption of multi-layer Transformers (from gigabytes for a few layers up to terabytes for models with thousands of layers) also limits their application.


To address these two problems, the authors propose two mechanisms: locality-sensitive hashing (LSH) attention and the reversible Transformer.

In the original Transformer's attention, we must compute $QK^\top$, which has $O(L^2)$ complexity, where L is the sequence length. Why compute this product at all? Essentially, to find the keys in K that are similar to each query in Q. By applying the idea of locality-sensitive hashing (similar in spirit to bucket sort), we can first group similar vectors together and compute dot products only among vectors within the same group. With LSH, the complexity is reduced to $O(L \log L)$.

The following figure illustrates the process of LSH attention. First, LSH assigns each position to a bucket, so that similar vectors fall into the same bucket; dot products are then computed in parallel within each bucket.

The author also considers that there is a certain probability that similar vectors may be placed in different buckets, so multiple rounds of hashing are used to reduce this probability.

Figure 12: The Reformer screens candidate pairs in advance with hashing, similar to bucket sort, to avoid computing the full $QK^\top$
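As a toy illustration of the bucketing idea (the generic random-projection form of LSH, not the exact hashing scheme used in the Reformer): vectors whose random-projection sign patterns agree land in the same bucket, and attention would then be computed only within each bucket.

```python
import numpy as np

def lsh_buckets(X, n_planes=4, seed=0):
    """Hash each vector by the sign pattern of a few random projections.
    Similar vectors tend to share a sign pattern, i.e. a bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[-1], n_planes))
    bits = (X @ planes) > 0                              # (n_tokens, n_planes) boolean signature
    return bits.astype(int) @ (1 << np.arange(n_planes)) # pack the bits into a bucket id

X = np.random.randn(16, 64)                              # 16 token vectors
buckets = lsh_buckets(X)
for b in np.unique(buckets):
    members = np.where(buckets == b)[0]
    # attention (Q·K^T) would only be computed among these members,
    # instead of over all 16 x 16 pairs
    print(f"bucket {b}: tokens {members.tolist()}")
```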

LSH solves the computational speed issue, but there remains a memory consumption problem. A single-layer network typically requires GBs of memory, but when training a multi-layer model, it needs to save the activation values and hidden variables of each layer for use during backpropagation. This greatly increases memory usage.

Here the authors draw on the idea of RevNet: instead of storing the inputs to the intermediate residual connections, they use “reversible layers”, as illustrated in the diagram below, with (a) showing forward propagation and (b) backpropagation.

Figure 13: In the Reformer, the input of each layer can be recomputed from its output during backpropagation

Each reversible layer keeps two sets of activations. One follows the standard procedure and is updated from layer to layer; the other only records the changes applied to the first. To run the network backwards, we therefore only need to subtract the activations applied at each layer.

This means that there is no need to cache any activations for backpropagation. Similar to using gradient checkpoints, although some redundant computations are still necessary, since the input of each layer can be easily constructed from its output, memory usage does not increase with the number of layers in the network.
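A minimal sketch of a reversible block in the RevNet style that the Reformer builds on, with F and G standing in for the attention and feed-forward sub-layers:

```python
import numpy as np

def reversible_forward(x1, x2, F, G):
    """y1 = x1 + F(x2);  y2 = x2 + G(y1).  No activations need to be cached."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2, F, G):
    """Recover the layer inputs from its outputs during the backward pass."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

F = lambda x: np.tanh(x)          # stand-in for the attention sub-layer
G = lambda x: np.maximum(x, 0)    # stand-in for the feed-forward sub-layer

x1, x2 = np.random.randn(2, 8), np.random.randn(2, 8)
y1, y2 = reversible_forward(x1, x2, F, G)
r1, r2 = reversible_inverse(y1, y2, F, G)
print(np.allclose(x1, r1), np.allclose(x2, r2))   # True True: inputs reconstructed exactly
```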

In summary, the Reformer reduces both the attention computation load and the model’s memory usage, laying a foundation for the future implementation of large pre-trained models.

Conclusion

This article mainly introduces the Transformer model and some variants that address its shortcomings, summarizing their design ideas and pros and cons. In the future, pre-trained models based on Transformer and its variants as feature extractors are sure to achieve greater breakthroughs in the field of natural language processing.

References

[1] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in neural information processing systems. 2017: 5998-6008.

[2] Dehghani M, Gouws S, Vinyals O, et al. Universal transformers[J]. arXiv preprint arXiv:1807.03819, 2018.

[3] Dai Z, Yang Z, Yang Y, et al. Transformer-XL: Attentive language models beyond a fixed-length context[J]. arXiv preprint arXiv:1901.02860, 2019.

[4] Kitaev N, Kaiser Ł, Levskaya A. Reformer: The Efficient Transformer[J]. arXiv preprint arXiv:2001.04451, 2020.

[5] Vig J. A multiscale visualization of attention in the transformer model[J]. arXiv preprint arXiv:1906.05714, 2019.

[6] Clark K, Khandelwal U, Levy O, et al. What does BERT look at? An analysis of BERT’s attention[J]. arXiv preprint arXiv:1906.04341, 2019.

Editor for this issue: Ding Xiao
Editor for this issue: Gu Yuxuan
Leave a Comment