Source: Huang Yu Zhihu
This article is about 3,600 words long; the estimated reading time is 10 minutes.
This article introduces the review paper "Efficient Transformers: A Survey", published by Google on arXiv in September 2020. In the field of NLP, Transformers have successfully replaced RNNs (LSTM/GRU), and applications have also emerged in CV, such as object detection and image annotation, as well as in RL.
This survey is worth reading. It focuses on a class of "X-former" models, such as Reformer, Linformer, Performer, and Longformer, which improve upon the original Transformer to achieve better computational and memory efficiency.
Paper Title:
Efficient Transformers: A Survey
Paper Link:
https://arxiv.org/pdf/2009.06732
Transformer Review
Self-attention is the key defining feature of the Transformer model. The mechanism can be viewed as a graph-like inductive bias that connects all tokens in a sequence through a relevance-based pooling operation. A well-known issue with self-attention is its quadratic time and memory complexity, which hinders the scalability of the model in many settings. Many variants have therefore been proposed to address this issue; these models are referred to as efficient Transformers.

Efficient self-attention models are crucial for long-sequence modeling applications such as documents, images, and videos, which typically consist of a relatively large number of tokens or pixels. The efficiency of processing long sequences is therefore vital for the widespread adoption of Transformers.

The standard Transformer is a multi-layer architecture formed by stacking Transformer blocks on top of one another. A Transformer block is characterized by a multi-head self-attention mechanism, a position-wise feed-forward network (FFN), layer normalization (LN) modules, and residual connections.

The input to the Transformer model is typically a tensor of shape B x N, where B is the batch size and N is the sequence length. The input first passes through an embedding layer that converts each one-hot token representation into a d-dimensional embedding vector, giving a tensor of shape B x N x d. The new tensor is then added to the positional encoding and passed through a multi-head self-attention module. Positional encodings can take the form of sinusoidal inputs or learnable embeddings. The input and output of the multi-head self-attention module are connected by a residual connection and a layer normalization (LN) module. The output of the multi-head self-attention module is then passed to a two-layer feed-forward network (FFN), whose input and output are likewise connected residually and normalized with LN.

The residual connection with layer normalization is defined as:

X_out = LayerNorm(F(X)) + X

where F denotes the sub-layer (multi-head self-attention or FFN).

The single-head operation in multi-head self-attention is defined as:

A_h = Softmax(Q_h K_h^T / sqrt(d_k)) V_h

where Q_h = X W_h^Q, K_h = X W_h^K, and V_h = X W_h^V are the query, key, and value projections of head h, and d_k is the per-head dimension.

The attention matrix A = QK^T is primarily responsible for learning alignment scores between tokens in the sequence. This drives the self-alignment process in self-attention, whereby tokens learn to gather information from one another. However, the computation of this N x N matrix is the efficiency bottleneck.

The layer operation of the FFN is defined as:

FFN(X) = ReLU(X W_1 + b_1) W_2 + b_2

Thus, the overall operation of a Transformer block is defined as:

X_A = LayerNorm(MultiheadSelfAttention(X)) + X
X_B = LayerNorm(FFN(X_A)) + X_A

Next, it is important to note the different usage modes of the Transformer block. The main usage modes of the Transformer are:
(1) Encoder-only (e.g., for classification).
(2) Decoder-only (e.g., for language modeling).
(3) Encoder-decoder (e.g., for machine translation).

In the encoder-decoder mode there are usually multiple multi-head self-attention modules: standard self-attention in the encoder and in the decoder, plus the encoder-decoder cross-attention that allows the decoder to use information from the encoder. This influences the design of the self-attention mechanism.

In the encoder mode, there is no restriction or constraint that the self-attention mechanism must be causal, i.e., that it depend only on the current and past tokens. In the encoder-decoder setting, the encoder self-attention and the encoder-decoder cross-attention may be non-causal, but the self-attention in the decoder must be causal. Designing an efficient self-attention mechanism that supports causality for autoregressive (AR) decoding can therefore be a general limiting factor.
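To make the block structure concrete, here is a minimal PyTorch sketch of a single-head attention sub-layer and one Transformer block following the equations above. It is an illustrative sketch rather than the survey's code; the class names (SingleHeadAttention, TransformerBlock) and the sizes d_model and d_ff are assumptions chosen for the example.

```python
import math
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """One attention head: Softmax(Q K^T / sqrt(d_k)) V (illustrative sketch)."""
    def __init__(self, d_model, d_k):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k)
        self.w_k = nn.Linear(d_model, d_k)
        self.w_v = nn.Linear(d_model, d_k)
        self.scale = 1.0 / math.sqrt(d_k)

    def forward(self, x):
        # x: (B, N, d_model)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # (B, N, N) score matrix -- this is the quadratic memory bottleneck
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # (B, N, d_k)

class TransformerBlock(nn.Module):
    """Self-attention and position-wise FFN, each with residual + LayerNorm."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.attn = SingleHeadAttention(d_model, d_model)  # single head for clarity
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(self.attn(x)) + x   # X_A = LayerNorm(SelfAttention(X)) + X
        x = self.ln2(self.ffn(x)) + x    # X_B = LayerNorm(FFN(X_A)) + X_A
        return x

# usage: a batch of 2 sequences of length 128 with d_model = 64
x = torch.randn(2, 128, 64)
y = TransformerBlock(d_model=64, d_ff=256)(x)  # output shape (2, 128, 64)
```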
Efficient Transformers
The classification of efficient Transformers is shown in the figure, and the corresponding methods published over the last two years (2018-2020), together with their complexity and category, are listed in the table.

Note: FP = Fixed Patterns / Combinations of Fixed Patterns, M = Memory, LP = Learnable Pattern, LR = Low Rank, KR = Kernel, RC = Recurrence.

With the exception of segment-based recurrence, the main goal of most of these models is to approximate the quadratic-cost attention matrix, and each method applies some notion of sparsity to the otherwise dense attention mechanism.

Fixed Patterns (FP): The earliest modifications of self-attention limit the field of view to fixed, predefined patterns (e.g., local windows and blocks of fixed stride) to simplify the attention matrix.
Blockwise Patterns: The simplest practical example of this technique is the blockwise (or chunking) paradigm, which divides the input sequence into fixed blocks, considering local receptive fields. Such examples include block-wise and/or local attention. Dividing the input sequence into blocks can reduce the complexity from N^2 to B^2 (block size), where B << N, thus significantly reducing the overhead. These blockwise or chunking methods can serve as the foundation for many more complex models.
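To illustrate the blockwise idea, here is a short sketch of chunked local attention, assuming the sequence length is a multiple of the block size; the helper name blockwise_attention and the shapes are illustrative and not taken from any specific published model.

```python
import torch

def blockwise_attention(q, k, v, block_size):
    """Chunked local attention: each token attends only within its own block.
    q, k, v: (batch, N, d); N is assumed to be a multiple of block_size."""
    bsz, n, d = q.shape
    nb = n // block_size
    # reshape into (batch, num_blocks, block_size, d)
    q, k, v = (t.reshape(bsz, nb, block_size, d) for t in (q, k, v))
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (bsz, nb, B, B) instead of (bsz, N, N)
    out = torch.softmax(scores, dim=-1) @ v        # per-block attention output
    return out.reshape(bsz, n, d)

# score memory scales as (N / B) * B^2 = N * B rather than N^2
q = k = v = torch.randn(2, 1024, 64)
out = blockwise_attention(q, k, v, block_size=128)  # (2, 1024, 64)
```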
Strided Patterns: Another approach is to attend only at fixed intervals. Models such as the Sparse Transformer and Longformer adopt "strided" or "dilated" attention patterns.
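As a rough illustration (not the actual Sparse Transformer or Longformer implementation), the sketch below builds a toy boolean mask that allows each position to attend to a small local window plus every stride-th position; the helper name strided_mask and the stride value are assumptions for the example.

```python
import torch

def strided_mask(n, stride):
    """Toy strided/dilated attention pattern: position i may attend to position j
    if j lies on the same stride grid as i, or within i's local window.
    Returns a boolean (n, n) mask of allowed attention positions."""
    idx = torch.arange(n)
    same_grid = (idx[None, :] % stride) == (idx[:, None] % stride)  # every stride-th token
    local = (idx[None, :] - idx[:, None]).abs() < stride            # nearby tokens
    return same_grid | local

mask = strided_mask(16, stride=4)  # each row allows O(n/stride + stride) positions instead of n
```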
Compressed Patterns: Another line of attack uses pooling-style operations to downsample the sequence length, making it a form of fixed pattern. For example, Compressed Attention uses strided convolutions to effectively reduce the sequence length.
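A minimal sketch of this downsampling idea, loosely in the spirit of memory-compressed attention rather than its actual implementation: a strided 1-D convolution shortens the key (and, analogously, the value) sequence before attention. The layer sizes are assumptions for the example.

```python
import torch
import torch.nn as nn

# Strided 1-D convolution over the sequence dimension compresses N keys
# into roughly N/stride keys before the attention scores are computed.
compress = nn.Conv1d(in_channels=64, out_channels=64, kernel_size=3, stride=3)
k = torch.randn(2, 1024, 64)                            # (batch, N, d)
k_short = compress(k.transpose(1, 2)).transpose(1, 2)   # (2, ~341, 64): sequence length reduced
```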
Combination of Patterns (CP): The key idea is to improve coverage by combining two or more distinct access patterns. For example, the Sparse Transformer allocates half of its heads to a strided pattern and half to a local pattern. Similarly, the Axial Transformer, given a high-dimensional tensor as input, applies a series of self-attention computations along a single axis of the input tensor at a time. In essence, combining patterns reduces memory complexity in the same way as fixed patterns; the difference is that aggregating and combining multiple patterns improves the overall coverage of the self-attention mechanism.

Learnable Patterns (LP): An extension of predetermined fixed patterns in which the pattern itself is learned. Unsurprisingly, models using learnable patterns aim to learn the access pattern in a data-driven manner. The key to LP is determining token relevance and assigning tokens to buckets or clusters. Notably, the Reformer introduces a hash-based similarity measure that efficiently clusters tokens into chunks. Similarly, the Routing Transformer performs online k-means clustering on the tokens, while the Sinkhorn sorting network exposes the sparsity of the attention weights by learning to sort blocks of the input sequence. In all of these models, the similarity function is trained end-to-end together with the rest of the network. The key point of LP is still to exploit fixed (chunked) patterns; however, these methods learn to sort/cluster the input tokens, which preserves the efficiency advantages of FP methods while gaining a better global view of the sequence.

Memory: Another prominent approach uses a side memory module that can access multiple tokens at once. A common form is a global memory able to access the entire sequence: the global tokens act as a form of memory that learns to aggregate over the tokens of the input sequence. This was first introduced in the Set Transformer as the inducing-points method. These parameters are often interpreted as "memory" and used as temporary context for future processing, which can be seen as a form of parameterized attention. Global memory is also used in ETC and Longformer. With a limited number of memory slots (or inducing points), a pooling-like operation can be applied to the input sequence to compress it, a technique that is useful when designing efficient self-attention modules.

Low-Rank Methods: Another emerging technique improves efficiency by exploiting low-rank approximations of the self-attention matrix. The key idea is to assume a low-rank structure in the N x N matrix. Linformer is the classic example of this technique: it projects the length dimension of the keys and values to a lower-dimensional representation (N -> k). It is easy to see that this mitigates the memory complexity of self-attention, since the N x N matrix is now decomposed into an N x k one (see the sketch after this overview).

Kernels: Another recently popular way to improve the efficiency of Transformers is to view the attention mechanism through kernelization. Kernels allow a clever mathematical rewriting of the self-attention mechanism that avoids explicitly computing the N x N matrix. Since the kernel is an approximation of the attention matrix, kernel methods can also be seen as a form of low-rank method.

Recurrence: A natural extension of the blockwise method is to connect the blocks via recurrence. Transformer-XL introduces a segment-level recurrence mechanism that connects multiple segments and blocks. In a sense, these models can be viewed as FP models.
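As referenced above, here is a minimal sketch of the low-rank idea in the style of Linformer (not its actual implementation): the length dimension of the keys and values is projected from N down to k, so the score matrix is N x k instead of N x N. The function name low_rank_attention and the projection matrix proj_k are assumptions for the example.

```python
import torch

def low_rank_attention(q, k, v, proj_k):
    """Linformer-style low-rank attention sketch: project the length dimension
    of keys and values from N down to k before the softmax.
    q, k, v: (batch, N, d); proj_k: (k, N) learned projection (assumed here)."""
    d = q.shape[-1]
    k_proj = proj_k @ k                                  # (batch, k, d)
    v_proj = proj_k @ v                                  # (batch, k, d)
    scores = q @ k_proj.transpose(-2, -1) / d ** 0.5     # (batch, N, k) instead of (batch, N, N)
    return torch.softmax(scores, dim=-1) @ v_proj        # (batch, N, d)

# usage: N = 4096 tokens compressed to k = 256 projected positions
q = k = v = torch.randn(2, 4096, 64)
proj = torch.randn(256, 4096) / 4096 ** 0.5
out = low_rank_attention(q, k, v, proj)                  # (2, 4096, 64)
```

Kernel methods achieve a similar effect by a different route: rewriting softmax(QK^T)V with a feature map phi so that phi(Q)(phi(K)^T V) can be computed without ever materializing the N x N matrix.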
Memory and Computational Complexity Analysis
This review analyzes the memory and computational complexity of the following 17 methods (the survey provides an illustrative figure for each):
1. Memory Compressed Transformer: "Generating Wikipedia by summarizing long sequences"
2. Image Transformer: "Image Transformer"
3. Set Transformer: "Set Transformer: A framework for attention-based permutation-invariant neural networks"
4. Sparse Transformer: "Generating long sequences with sparse transformers"
5. Axial Transformer: "Axial attention in multidimensional transformers"
6. Longformer: "Longformer: The long-document transformer"
7. Extended Transformer Construction (ETC): "ETC: Encoding long and structured data in transformers"
8. BigBird: "Big Bird: Transformers for longer sequences"
9. Routing Transformer: "Efficient content-based sparse attention with routing transformers"
10. Reformer: "Reformer: The efficient transformer"
11. Sparse Sinkhorn Transformer: "Sparse Sinkhorn attention"
12. Linformer: "Linformer: Self-attention with linear complexity"
13. Linear Transformer: "Transformers are RNNs: Fast autoregressive transformers with linear attention" (the survey also reproduces its algorithm pseudocode)
14. Performer: "Masked language modeling for proteins via linearly scalable long-context transformers" (the survey reproduces the pseudocode of Fast Attention via Orthogonal Random Features, FAVOR)
15. Synthesizer: "Synthesizer: Rethinking self-attention in transformer models"
16. Transformer-XL: "Transformer-XL: Attentive language models beyond a fixed-length context"
17. Compressive Transformers: "Compressive transformers for long-range sequence modelling"
Evaluation Benchmarks
Despite the flurry of new Transformer models, there is hardly an easy way to compare them. Many research papers choose their own benchmarks to showcase the proposed model's capabilities. Combined with different hyperparameter settings (e.g., model size and configuration), it can be difficult to correctly attribute performance gains. In addition, some papers conflate the comparison with pre-training, which makes it even harder to distinguish the relative performance of these different models. Which basic efficient Transformer block to use therefore remains an open question.

On the one hand, a number of models focus on generative modeling, demonstrating the ability of the proposed Transformer unit in sequence autoregressive (AR) modeling. To this end, Sparse Transformers, Adaptive Transformers, Routing Transformers, and Reformers mainly focus on generative modeling tasks. These benchmarks usually involve language modeling and/or pixel-wise image generation on datasets such as WikiText, enwik8, and/or ImageNet/CIFAR. Segment-based recurrence models (e.g., Transformer-XL and Compressive Transformers) also focus on long-range language modeling tasks, such as PG-19.

On the other hand, some models mainly focus on encoding-only tasks, such as question answering, reading comprehension, and/or a selection of tasks from the GLUE benchmark. For example, the ETC model is evaluated only on question answering benchmarks such as NaturalQuestions or TriviaQA, while Linformer focuses on a subset of the GLUE benchmark. This split is quite natural and intuitive, since models such as ETC and Linformer cannot be used autoregressively, i.e., they cannot be used for decoding. This makes it even harder to compare these encoder-only models with the others.

Some models aim to strike a balance between the two. Longformer runs benchmarks on both generative modeling and encoder tasks, and the Sinkhorn Transformer compares on both generative modeling and encoding-only tasks.

It is also worth noting that although the Seq2Seq machine translation (MT) task is one of the staple problems for Transformer models, these efficient Transformer models have not been extensively evaluated on MT. This may be because sequence lengths in MT are not long enough to warrant the use of these models.

Although generative modeling, GLUE (General Language Understanding Evaluation), and/or question answering appear to be the common evaluation benchmarks for these models, a few benchmarks are used by only a small number of papers. For instance, the Performer model evaluates masked language modeling and compares directly with other efficient Transformer models, and the Linear Transformer is also evaluated on speech recognition, which is relatively rare.
Comparative Efficiency Analysis
In addition, the paper concludes with an analysis comparing these efficiency-improving methods along several dimensions: