The Transformer has become a mainstream model in artificial intelligence and is widely applied across domains. However, its attention mechanism is computationally expensive, and the cost grows rapidly as sequence length increases.
To address this issue, many Transformer variants have emerged to improve efficiency. In this article, I will share nine papers that improve the efficiency of Transformer models, to help you use these models more effectively and spot ideas worth building on.
The papers cover four directions: sparse attention mechanisms, processing long texts with Transformers, improving Transformer efficiency, and convolutional attention. The original papers and source code have been compiled.
1. Sparse Attention Mechanisms
1.1 Longformer: The Long-Document Transformer
Method Summary: Transformer-based models struggle with long sequences because their self-attention operation scales quadratically with sequence length. Longformer addresses this by introducing an attention mechanism that scales linearly with sequence length, combining windowed local attention with task-motivated global attention, so it can process documents of thousands of tokens or longer. Longformer excels at character-level language modeling and achieves state-of-the-art results across various downstream tasks. It also supports long-document generation in sequence-to-sequence tasks and demonstrates its effectiveness on the arXiv summarization dataset.
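To make the scaling argument concrete, here is a minimal sketch of windowed local attention, not the authors' implementation: each position attends only to neighbors within a fixed window, so the number of attended pairs grows linearly with sequence length. The window size and tensor shapes below are arbitrary choices for the example.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may only attend to positions within +/- window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def local_attention(q, k, v, window: int):
    # q, k, v: (seq_len, dim). The full score matrix is computed here for clarity;
    # Longformer-style kernels compute only the banded entries, which is what
    # makes the cost linear rather than quadratic in seq_len.
    scores = q @ k.T / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~sliding_window_mask(q.shape[0], window), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(16, 64)
out = local_attention(q, k, v, window=2)   # (16, 64)
```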
1.2 Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting
Method Summary: Time series forecasting is a critical problem in many fields, including predicting the energy output of solar plants, electricity consumption, and traffic congestion. This paper proposes using Transformers to tackle the forecasting problem. Although preliminary studies show impressive performance, the authors identify two main drawbacks: the point-wise dot-product attention is insensitive to local context, and the memory cost grows quadratically with sequence length. To address these issues, they propose convolutional self-attention, which produces queries and keys with causal convolutions so that attention can exploit local shape information, and the LogSparse Transformer, which reduces memory cost by attending only to a logarithmic number of positions. Experiments show that these methods are advantageous for time series forecasting.
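As a rough sketch of the convolutional self-attention idea rather than the paper's code, the pointwise query/key projections are replaced by causal 1-D convolutions so that each query and key summarizes a short local window; the kernel size and shapes below are illustrative, and the causal attention mask is omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvSelfAttentionQK(nn.Module):
    """Generate queries/keys with causal 1-D convolutions instead of pointwise projections."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1                      # left-pad so the convolution stays causal
        self.q_conv = nn.Conv1d(dim, dim, kernel_size)
        self.k_conv = nn.Conv1d(dim, dim, kernel_size)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (batch, seq_len, dim)
        xc = nn.functional.pad(x.transpose(1, 2), (self.pad, 0))
        q = self.q_conv(xc).transpose(1, 2)             # each query sees a local causal window
        k = self.k_conv(xc).transpose(1, 2)
        v = self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
        return attn @ v

layer = ConvSelfAttentionQK(dim=32)
out = layer(torch.randn(2, 50, 32))                     # (2, 50, 32)
```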
1.3 Adaptive Attention Span in Transformers
Method Summary: The paper introduces a new self-attention mechanism that learns its own optimal attention span per head. This significantly expands the maximum context size usable in Transformers while keeping memory usage and computation time under control. The authors demonstrate the method on character-level language modeling, achieving state-of-the-art performance on text8 and enwik8 with a maximum context of 8k characters.
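A minimal sketch of the soft-masking idea described in the paper (the numbers below are illustrative): each head learns a span parameter z, and attention weights at distance x from the query are scaled by a ramp that stays at 1 inside the span and decays to 0 over R positions beyond it, so gradients can shrink or grow the span.

```python
import torch

def span_mask(distances: torch.Tensor, z: torch.Tensor, ramp: float = 32.0) -> torch.Tensor:
    """Soft mask: 1 within the learned span z, linearly decaying to 0 over `ramp` extra positions."""
    return ((ramp + z - distances) / ramp).clamp(0.0, 1.0)

seq_len = 128
z = torch.tensor(20.0, requires_grad=True)              # learned per-head span parameter
dist = torch.arange(seq_len, dtype=torch.float32)       # distance of each key from the query
weights = torch.softmax(torch.randn(seq_len), dim=-1)   # stand-in attention weights for the example
masked = weights * span_mask(dist, z)
masked = masked / (masked.sum() + 1e-8)                 # renormalize after masking
```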
2. Processing Long Texts with Transformers
2.1 Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Method Summary: Transformers are limited by a fixed-length context in language modeling. The authors propose a new architecture, Transformer-XL, that learns dependencies beyond a fixed length. It combines a segment-level recurrence mechanism with a novel relative positional encoding scheme, capturing longer-range dependencies and resolving the context fragmentation problem. The method achieves better performance on both short and long sequences and is over 1,800 times faster than vanilla Transformers during evaluation.
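A simplified sketch of segment-level recurrence (shapes and weight initialization are illustrative, and the relative positional encoding is omitted): the hidden states of the previous segment are cached, detached from the computation graph, and prepended to the keys and values of the current segment, so queries can attend beyond the segment boundary.

```python
import torch

def segment_recurrent_attention(x, memory, w_q, w_k, w_v):
    """x: (cur_len, dim) current segment; memory: (mem_len, dim) cached states of the previous segment."""
    context = torch.cat([memory.detach(), x], dim=0)     # stop gradients through the cached segment
    q = x @ w_q                                          # queries come only from the current segment
    k, v = context @ w_k, context @ w_v                  # keys/values also cover the cached segment
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                      # (cur_len, dim)

dim, seg = 64, 32
w_q, w_k, w_v = (torch.randn(dim, dim) / dim ** 0.5 for _ in range(3))
prev = torch.randn(seg, dim)                             # would become the memory for the next segment
out = segment_recurrent_attention(torch.randn(seg, dim), prev, w_q, w_k, w_v)
```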
3. Improving Transformer Operational Efficiency
3.1 REFORMER: THE EFFICIENT TRANSFORMER
Method Summary: Training large Transformer models is costly, especially on long sequences. The paper proposes two techniques to improve efficiency: replacing dot-product attention with locality-sensitive hashing, which reduces complexity from O(L^2) to O(L log L); and using reversible residual layers instead of standard residuals, so activations need to be stored only once during training rather than once per layer. The resulting Reformer matches Transformer performance on long sequences while being far more memory-efficient and much faster.
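A toy sketch of the LSH bucketing step only, not the full Reformer attention: vectors are projected onto random rotations and assigned the bucket of the strongest direction, so similar queries/keys tend to share a bucket and attention can be restricted to sorted, chunked buckets. The bucket count and shapes below are arbitrary for the example.

```python
import torch

def lsh_buckets(vectors: torch.Tensor, n_buckets: int, seed: int = 0) -> torch.Tensor:
    """Angular LSH: random rotations, then argmax over the 2 * (n_buckets // 2) signed directions.
    Vectors with a large dot product are likely to land in the same bucket."""
    torch.manual_seed(seed)
    rotations = torch.randn(vectors.shape[-1], n_buckets // 2)
    rotated = vectors @ rotations                       # (seq_len, n_buckets // 2)
    return torch.argmax(torch.cat([rotated, -rotated], dim=-1), dim=-1)

qk = torch.randn(1024, 64)                              # Reformer shares queries and keys
buckets = lsh_buckets(qk, n_buckets=16)                 # attention is then computed only within
                                                        # sorted, chunked buckets: O(L log L) overall
```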
3.2 RETHINKING ATTENTION WITH PERFORMERS
Method Summary: The paper introduces Performers, a Transformer architecture that can estimate conventional (softmax) full-rank attention with provable accuracy while using only linear space and time complexity. To approximate the softmax attention kernel, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which can also be used to efficiently model kernelizable attention mechanisms beyond softmax.
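A minimal sketch of the positive random-feature approximation (the feature count m is illustrative, and a plain Gaussian projection stands in for the orthogonal features of full FAVOR+): once queries and keys are mapped through phi, attention can be computed as phi(Q) (phi(K)^T V), which is linear in sequence length.

```python
import torch

def softmax_kernel_features(x: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Positive random features: phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m), so phi(q).phi(k) ~ exp(q.k)."""
    m = proj.shape[0]
    return torch.exp(x @ proj.T - (x ** 2).sum(-1, keepdim=True) / 2) / m ** 0.5

def linear_attention(q, k, v, proj):
    q_f, k_f = softmax_kernel_features(q, proj), softmax_kernel_features(k, proj)
    kv = k_f.T @ v                                       # (m, d_v): summarizes all keys/values once
    normalizer = q_f @ k_f.sum(0, keepdim=True).T        # (n, 1): row sums of the implicit attention
    return (q_f @ kv) / (normalizer + 1e-6)              # never materializes the n x n matrix

n, d, m = 256, 32, 64
proj = torch.randn(m, d)                                 # FAVOR+ uses orthogonal random features here
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
out = linear_attention(q / d ** 0.25, k / d ** 0.25, v, proj)   # rescaling absorbs the 1/sqrt(d) factor
```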
3.3 Linformer: Self-Attention with Linear Complexity
Method Summary: Large Transformer models perform excellently in natural language processing applications, but training and deploying them on long sequences is costly. This paper shows that the attention matrix can be approximated by a low-rank matrix and proposes a new self-attention mechanism that reduces complexity from O(n^2) to O(n) while maintaining performance. The resulting Linformer is more time- and memory-efficient than standard Transformers.
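A minimal single-head sketch of the low-rank projection idea (the projection dimension k and other shapes are illustrative): keys and values are projected along the sequence axis from length n down to length k before attention, so the score matrix is n x k instead of n x n.

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Single-head sketch: learned (k x n) projections compress keys/values along the sequence axis."""
    def __init__(self, dim: int, seq_len: int, k: int = 64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj_k = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)  # E
        self.proj_v = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)  # F

    def forward(self, x):                                 # x: (batch, n, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        k = torch.einsum('kn,bnd->bkd', self.proj_k, k)   # compress sequence length: n -> k
        v = torch.einsum('kn,bnd->bkd', self.proj_v, v)
        scores = torch.softmax(q @ k.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)  # (batch, n, k)
        return scores @ v

attn = LinformerSelfAttention(dim=64, seq_len=512, k=32)
out = attn(torch.randn(2, 512, 64))                       # (2, 512, 64)
```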
4. Convolutional Attention
4.1 Conformer: Convolution-augmented Transformer for Speech Recognition
Method Summary: Conformer combines convolutional neural networks and Transformers for speech recognition, capturing both the local and the global dependencies in audio sequences and achieving state-of-the-art accuracy. On the LibriSpeech benchmark, Conformer achieves 2.1%/4.3% WER (test/test-other) without a language model and 1.9%/3.9% WER with an external language model. It also offers a competitive small model with only 10M parameters.
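A simplified sketch of the Conformer convolution module (kernel size and dimensions below are illustrative, and dropout is omitted): layer norm, a pointwise convolution with a GLU gate, a depthwise convolution with batch norm and Swish, another pointwise convolution, all wrapped in a residual connection.

```python
import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    """LayerNorm -> pointwise conv + GLU -> depthwise conv -> BatchNorm + Swish -> pointwise conv, residual."""
    def __init__(self, dim: int, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)              # doubled channels feed the GLU gate
        self.dw = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pw2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                                  # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)                   # -> (batch, dim, time) for the convolutions
        y = nn.functional.glu(self.pw1(y), dim=1)          # gate halves the channels back to dim
        y = nn.functional.silu(self.bn(self.dw(y)))        # Swish is SiLU
        y = self.pw2(y).transpose(1, 2)
        return x + y                                       # residual connection

module = ConformerConvModule(dim=144)
out = module(torch.randn(4, 100, 144))                     # (4, 100, 144)
```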
4.2 LITE TRANSFORMER WITH LONG-SHORT RANGE ATTENTION
Method Summary: This paper proposes an efficient mobile NLP architecture, Lite Transformer, which uses long-short range attention (LSRA) to boost performance. LSRA dedicates one group of heads to local context modeling (via convolution) and another group to long-distance relationship modeling (via attention). Across three language tasks, Lite Transformer consistently outperforms the standard Transformer. Under mobile resource constraints (500M and 100M Mult-Adds), Lite Transformer outperforms the Transformer on the WMT’14 English-French translation task by 1.2 and 1.7 BLEU, respectively.
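A rough sketch of the long-short range split (dimensions are illustrative, and a plain depthwise convolution stands in for the paper's lightweight/dynamic convolutions): the input is split along the channel dimension, one half goes through standard self-attention for global context, the other half through a convolution branch for local context, and the outputs are concatenated.

```python
import torch
import torch.nn as nn

class LSRABlock(nn.Module):
    """Split channels: half to global self-attention, half to a depthwise conv branch, then merge."""
    def __init__(self, dim: int, heads: int = 4, kernel_size: int = 3):
        super().__init__()
        half = dim // 2
        self.attn = nn.MultiheadAttention(half, heads, batch_first=True)   # long-range branch
        self.conv = nn.Conv1d(half, half, kernel_size, padding=kernel_size // 2, groups=half)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (batch, seq_len, dim)
        a, c = x.chunk(2, dim=-1)
        a, _ = self.attn(a, a, a)                          # global dependencies
        c = self.conv(c.transpose(1, 2)).transpose(1, 2)   # local dependencies
        return self.out(torch.cat([a, c], dim=-1))

block = LSRABlock(dim=128)
out = block(torch.randn(2, 64, 128))                       # (2, 64, 128)
```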