Who Will Replace Transformer?

MLNLP community is a well-known machine learning and natural language processing community both domestically and internationally, covering graduate students, faculty, and researchers in NLP.
The community’s vision is to promote communication and progress between the academic and industrial sectors of natural language processing and machine learning, especially for beginners.
Reprinted from | AI Technology Review
Author | Zhang Jin

The paper “Attention Is All You Need” published by Google in 2017 has become a bible of artificial intelligence, and the global AI boom can be directly traced back to the invention of the Transformer.

Due to its ability to handle local and long-range dependencies and the characteristic of parallel training, the Transformer has gradually replaced the previous RNN (Recurrent Neural Network) and CNN (Convolutional Neural Network), becoming the standard paradigm for cutting-edge research in NLP (Natural Language Processing).

Today’s mainstream AI models and products, including OpenAI’s ChatGPT, Google’s Bard, Anthropic’s Claude, Midjourney, and Sora, as well as domestic large models such as Zhipu AI’s ChatGLM, Baichuan Intelligent’s Baichuan, and Moonshot AI’s Kimi Chat, are all based on the Transformer architecture.

The Transformer has become the undisputed gold standard of today’s artificial intelligence technology, and its dominant position remains unshaken.

While the Transformer is flourishing, there have been some opposing voices, such as: “The efficiency of the Transformer is not high”; “The ceiling of the Transformer is easily visible”; “The Transformer is good, but it cannot achieve AGI (Artificial General Intelligence), nor can it create a world model.”

This is because the strength of the Transformer is also its weakness: the inherent self-attention mechanism in the Transformer poses challenges, mainly due to its quadratic complexity, which makes the architecture costly in terms of computation and memory when dealing with long input sequences or in resource-constrained situations.

In simple terms, this means that as the sequence length handled by the Transformer (e.g., the number of words in a paragraph or the size of an image) increases, the required computation grows quadratically with that length and quickly becomes enormous. Hence the claim that “the efficiency of the Transformer is not high.” This is also a major reason for the global computing power shortage triggered by the current AI boom.
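
As a toy illustration (the sequence lengths below are arbitrary), the following Python snippet simply counts the entries of the n-by-n attention score matrix; it shows how doubling the length quadruples the work:

```python
# Illustrative only: each of the n tokens attends to all n tokens,
# so the attention score matrix has n * n entries.
for n in [1_000, 2_000, 4_000, 8_000]:
    entries = n * n
    print(f"sequence length {n:>5}: {entries:>12,} attention scores")
# Doubling the sequence length quadruples the number of scores
# (and the compute and memory needed to produce them).
```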

Because of these limitations of the Transformer, many non-Transformer architectures have been proposed, including China’s RWKV, Meta’s Mega, Microsoft’s RetNet, Mamba (from researchers at Carnegie Mellon University and Together.AI), and DeepMind’s Hawk and Griffin, all of which emerged after the Transformer came to dominate large model research.

Most of them build on and improve the original RNN, addressing the flaws and limitations of the Transformer and attempting to develop so-called “efficient Transformer” structures that are closer to how humans process information.

Efficient Transformers here means models that use less memory and incur lower computational costs during training and inference; with them, these projects hope to challenge the Transformer’s dominance.

Where Is Current Non-Transformer Architecture Research Heading?

Currently, mainstream non-Transformer research is primarily focused on optimizing the full attention mechanism and finding ways to transform this part into an RNN model to improve inference efficiency.

Attention is the core of the Transformer—the strength of the Transformer model lies in its abandonment of the previously widely used recurrent and convolutional networks in favor of a special structure—the attention mechanism—to model text.

Attention allows the model to consider the relationships between words, regardless of how far apart they are, and to determine which words and phrases in a paragraph deserve the most attention.

This mechanism enables the Transformer to achieve parallelization in language processing, analyzing all words in a specific text simultaneously instead of sequentially. The parallelization of the Transformer allows for a more comprehensive and accurate understanding of the text being read and written, making it more efficient and scalable than RNNs.
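
As a rough sketch (single head, no masking, no multi-head projections, no learned weights), the numpy code below shows the core of scaled dot-product attention: every position is compared with every other position in one matrix multiplication rather than in a step-by-step loop:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: every position attends to every other in one shot."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) score matrix: all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # weighted sum of values for every position

# Toy example: n = 4 tokens, d = 8 dimensions, processed in parallel (no loop over time).
n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```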

In contrast, Recurrent Neural Networks (RNNs) face the problem of vanishing gradients, making it difficult for them to train on long sequences. Additionally, they cannot parallelize over time during training, which limits their scalability; Convolutional Neural Networks (CNNs) excel at capturing local patterns but lack in long-range dependencies, which are crucial for many sequence processing tasks.

However, RNNs have the advantage that the cost of each inference step is constant: memory stays fixed and total computation grows only linearly with sequence length, whereas the computation of a Transformer grows quadratically with sequence length, making RNNs far less demanding in memory and compute. Therefore, much current non-Transformer research strives to “retain the advantages of RNNs while trying to achieve Transformer performance.”
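
The contrast can be sketched in a few lines of purely illustrative Python: an RNN carries a fixed-size hidden state from step to step, while a decoding Transformer keeps a cache of past keys and values that grows by one entry per token and must be attended over again at every step (the weight matrices and dimensions here are arbitrary):

```python
import numpy as np

d, steps = 16, 200
rng = np.random.default_rng(1)
W_h, W_x = rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1

h = np.zeros(d)            # RNN: state size stays d no matter how long the sequence gets
keys, values = [], []      # Transformer: the key/value cache grows by one entry per token

for step in range(steps):
    x = rng.standard_normal(d)

    # RNN-style update: constant work and constant memory per step
    h = np.tanh(W_h @ h + W_x @ x)

    # Transformer-style decoding step: attend over *all* cached tokens so far
    keys.append(x); values.append(x)                  # (toy: reuse x as both key and value)
    scores = np.stack(keys) @ x / np.sqrt(d)          # cost grows with the cache length
    w = np.exp(scores - scores.max()); w /= w.sum()
    attn_out = w @ np.stack(values)

print(h.shape, len(keys), attn_out.shape)             # (16,) 200 (16,)
```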

Based on this goal, today’s non-Transformer technology research is mainly divided into two schools:

The first school, represented by RWKV, Mamba, and S4, completely replaces attention with recurrent structures. This approach attempts to remember previous information with fixed memory, but it currently faces challenges in maintaining longer memories.

The other school aims to make the full attention structure sparse, such as Meta’s Mega, which no longer requires calculating every element in the attention matrix during subsequent computations, thus improving model efficiency.

Looking at individual non-Transformer models: RWKV is the first non-Transformer-architecture large language model developed in China, and it has now iterated to its sixth generation, RWKV-6. Its author, Peng Bo, began training RWKV-2 in May 2022 with only 100 million parameters; in March 2023 a 14-billion-parameter version, RWKV-4, was trained.

Peng Bo told AI Technology Review why he wanted to create a model different from the Transformer architecture:

“Because the world itself does not operate based on the logic of the Transformer. The laws of the world operate more like an RNN structure—what happens in the next second is not related to all your past time and information, only to the previous second. The Transformer needs to recognize all tokens, which is unreasonable.”

Thus, RWKV uses linear attention to approximate full attention, attempting to combine the advantages of RNNs and Transformers while avoiding their drawbacks, alleviating the memory bottleneck and quadratic scaling issues posed by Transformers, achieving more effective linear scaling while providing parallel training and scalability, similar to Transformers. In short, it focuses on high performance, low energy consumption, and small memory usage.
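
RWKV’s actual WKV formulation (with channel-wise time decay and bonus terms) is more involved, but the basic idea of replacing full attention with a recurrence over a fixed-size state can be sketched with generic kernelized linear attention; the feature map phi and the dimensions below are arbitrary illustrative choices:

```python
import numpy as np

def linear_attention_step(state, norm, q, k, v, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """One step of generic kernelized linear attention.

    state: running sum of outer(phi(k_i), v_i), shape (d, d)
    norm:  running sum of phi(k_i),             shape (d,)
    The current output needs only this fixed-size summary,
    not the full history of keys and values.
    """
    state = state + np.outer(phi(k), v)
    norm = norm + phi(k)
    out = (phi(q) @ state) / (phi(q) @ norm + 1e-6)
    return state, norm, out

d = 8
rng = np.random.default_rng(0)
state, norm = np.zeros((d, d)), np.zeros(d)
for _ in range(100):           # the sequence grows, the state does not
    q, k, v = (rng.standard_normal(d) for _ in range(3))
    state, norm, out = linear_attention_step(state, norm, q, k, v)
print(out.shape)               # (8,)
```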

The previously mentioned Mamba has two authors: Albert Gu, an assistant professor in Carnegie Mellon University’s Machine Learning Department, and Tri Dao, Chief Scientist at Together.AI.

In their paper, they claim that Mamba is a new SSM architecture that outperforms comparably sized Transformer models in language modeling, in both pre-training and downstream evaluation. Their Mamba-3B model can compete with Transformer models twice its size, scales linearly with context length, keeps improving on real data up to sequences of a million tokens, and delivers roughly five times the inference throughput of a Transformer.

A non-Transformer researcher told AI Technology Review that Mamba uses only a recurrent structure with no attention, so the memory it keeps while predicting the next token stays a fixed size and does not grow over time; its drawback is that this rolling memory is very small, which leaves it with weak extrapolation ability.
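
For intuition, the toy recurrence below shows the general linear state-space family that Mamba builds on (not its exact selective-scan mechanism, in which the transition parameters become input-dependent): the hidden state has a fixed size, so memory does not grow with sequence length and total compute grows only linearly; all matrices and sizes here are arbitrary:

```python
import numpy as np

# A toy linear state-space recurrence: fixed-size state h, linear time in sequence length.
d_state, d_in, T = 16, 4, 1000
rng = np.random.default_rng(3)
A = np.diag(rng.uniform(0.9, 0.99, d_state))   # decay of the hidden state
B = rng.standard_normal((d_state, d_in)) * 0.1
C = rng.standard_normal((d_in, d_state)) * 0.1

h = np.zeros(d_state)                          # memory size never grows with T
ys = []
for t in range(T):
    x = rng.standard_normal(d_in)
    h = A @ h + B @ x                          # state update
    ys.append(C @ h)                           # readout
print(len(ys), ys[-1].shape)                   # 1000 (4,)
```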

In this researcher’s view, RetNet, proposed by Microsoft Research Asia, also follows the fully recurrent route. RetNet introduces a multi-scale retention mechanism to replace multi-head attention, and it supports three computational paradigms: parallel, recurrent, and chunkwise-recurrent representations.
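
A heavily simplified sketch of the recurrent form of retention (one head, a single decay factor, and none of the rotation or normalization details from the paper) is shown below; it illustrates why per-token decoding cost does not depend on how much context has already been processed:

```python
import numpy as np

# Simplified recurrent retention: the state S is decayed by gamma and updated with
# the current key/value pair, so each decoding step touches only this fixed-size state.
d = 8
gamma = 0.97
rng = np.random.default_rng(4)
S = np.zeros((d, d))
for t in range(500):
    q, k, v = (rng.standard_normal(d) for _ in range(3))
    S = gamma * S + np.outer(k, v)    # decayed state update
    o = q @ S                         # output from the fixed-size state
print(o.shape)                        # (8,)
```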

The paper states that RetNet’s inference cost is independent of length. For a 7B model and an 8k sequence length, RetNet’s decoding speed is 8.4 times faster than Transformers with key-value caching, saving 70% of memory.

During training, RetNet can also save 25-50% of memory compared to a standard Transformer, achieves a sevenfold speedup, and holds an advantage even over highly optimized FlashAttention. In addition, RetNet’s inference latency is insensitive to batch size, enabling enormous throughput.

Meta’s Mega represents the second technical route of non-Transformer research. The idea of Mega is to combine recurrent structures with sparse attention matrices.

One of the core researchers of Mega, Max, told AI Technology Review that attention has its irreplaceable role, and as long as its complexity is limited within a certain range, the desired effect can be achieved. Mega has spent a long time studying how to combine recurrent structures and attention for maximum efficiency.

Therefore, Mega still adopts an attention structure, but limits attention to a fixed window size while combining a rolling memory form similar to Mamba, albeit with significant simplifications, resulting in faster overall computation speed.

“Rolling memory” refers to the way these efficient Transformers introduce recurrent structures: the model first looks at one segment of history and remembers it, then looks at the next segment and updates the memory, possibly forgetting part of the first segment while adding what needs to be remembered from the second, and in this way keeps rolling the memory forward.

The benefit of this memory approach is that the model can keep a fixed-length rolling memory that does not grow over time; the drawback is that for certain tasks it is hard to know in advance which parts of the earlier memory will turn out to be useful and which will not, so the rolling memory is difficult to maintain well.
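
The rolling-memory idea in general (not Mega’s specific gated design; the mean-pooled memory update below is a deliberately crude stand-in) can be sketched as windowed attention over the current chunk plus a small fixed-size memory carried across chunks:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def chunked_attention_with_memory(X, window, mem_slots=4):
    """Toy rolling-memory attention: each chunk attends to itself plus a small
    fixed-size memory summarizing earlier chunks."""
    d = X.shape[-1]
    memory = np.zeros((mem_slots, d))                 # fixed-size carried-over memory
    outputs = []
    for start in range(0, len(X), window):
        chunk = X[start:start + window]
        context = np.concatenate([memory, chunk])     # attend within chunk + memory
        scores = chunk @ context.T / np.sqrt(d)
        outputs.append(softmax(scores) @ context)
        # roll the memory forward: here, a crude mean-pooled summary of the chunk
        memory = np.roll(memory, -1, axis=0)
        memory[-1] = chunk.mean(axis=0)
    return np.concatenate(outputs)

X = np.random.default_rng(5).standard_normal((64, 8))
print(chunked_attention_with_memory(X, window=16).shape)   # (64, 8)
```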

Mega2 was trained on the same data as LLaMA2 so that the two could be compared fairly, and under the same data conditions Mega2 performs significantly better. Mega2’s pre-training uses a 32K window, and a Transformer with the same 32K window is much slower than Mega2; if the window grows further, Mega’s advantage becomes even more pronounced. Mega2 has already been trained at 7B scale.

DeepMind’s Hawk and Griffin also believe that attention is essential, belonging to the gated linear RNN category, and like Mega, they are hybrid models.

In addition to RWKV, the domestic company Rock Core Intelligence has also released Yan, a general-purpose natural language large model with no attention mechanism. Liu Fanping, CTO of Rock Core Intelligence, stated that Yan has nothing to do with linear attention or RNNs: the Yan architecture removes the costly attention mechanism of the Transformer and replaces it with linear computations that are cheaper and simpler, improving modeling efficiency and training speed and thereby raising efficiency while cutting costs.

Can the Transformer Be Overturned?

Although many non-Transformer proposals have emerged, and reported evaluations show them generally outperforming Transformers of comparable size, they all face the same challenge and skepticism: when scaled up to the size of today’s Transformer models, can they still deliver strong performance and efficiency gains?

Among them, the largest RWKV model has 14 billion parameters and Meta’s Mega has 7 billion, while GPT-3 has 175 billion and GPT-4 is rumored to have 1.8 trillion. This means non-Transformers urgently need to train a hundred-billion-parameter model to prove themselves.

RWKV, the most representative non-Transformer effort, has made notable progress: it has closed a seed round of more than ten million yuan; some companies in China are reportedly trying to train models on RWKV; and over the past year RWKV has seen some initial deployment in both consumer (To C) and enterprise (To B) settings.

However, several investors told AI Technology Review that they had considered betting on RWKV and on non-Transformers more broadly, but significant internal disagreement, chiefly doubts about whether non-Transformers could succeed, ultimately led them to pass.

At this stage, given existing hardware computing power, running Transformer-based large models on edge devices is very challenging; inference and other computation must still be done in the cloud, and the resulting response speed is unsatisfactory, which end users find hard to accept.

Industry insiders have told AI Technology Review, “On the edge, RWKV is not necessarily the optimal solution because as semiconductor technology advances, AI chips will continue to evolve. In the future, the costs of hardware, computing power, and energy will be spread out, allowing large models to run directly on the edge without significant effort to change the underlying architecture. One day, we will reach this critical point.”

RWKV’s approach is to operate at the framework level, making the framework lightweight to enable local computation. However, one investor pointed out that the ideal state for non-Transformers is to reach OpenAI’s level before discussing lightweighting, “not just for the sake of being small or localized.”

This investor described RWKV as “small but complete,” with an overall experience of roughly 60 points if GPT-3.5 is the benchmark, while it remains unclear whether it can reach the 80 or 90 points of GPT. This is also the broader problem for non-Transformers: if the complexity of the architecture is sacrificed, the ceiling may be sacrificed along with it.

Someone close to OpenAI told AI Technology Review that OpenAI internally tested RWKV but ultimately abandoned this route, as “its ceiling has not yet been revealed in the long run, and its potential for achieving AGI is low.”

Proving how high their ceiling is has become a common challenge for all non-Transformer architectures.

Some model researchers believe that Transformers have not yet reached their ceiling in text large models, since scaling laws have not broken down; the real bottleneck may lie in generating longer sequences, for example in multimodal video generation, which is essential for achieving AGI. The context window therefore remains the Transformer’s bottleneck.

If one can spend money like OpenAI, they can continue to push the scaling laws of Transformers, but the problem is that doubling the sequence length requires quadrupling the cost, and the time taken also increases fourfold. The quadratic growth makes Transformers inefficient in handling long sequence problems, and there are limits to resources.

It is understood that leading domestic large model companies primarily use Transformers. There is speculation about whether GPT-5 will continue to use the Transformer architecture, since OpenAI has not open-sourced a model since GPT-2. Still, most people are inclined to believe the Transformer’s ceiling is far off, so continuing to chase GPT-4 and GPT-5 with Transformers may not be the wrong path. In the era of large models, everyone is placing bets.

However, whether the Transformer is the only path to achieve AGI remains uncertain. What can be confirmed is that the monopoly formed by Transformers is challenging to break, whether in terms of resources or ecosystem. Currently, non-Transformer research is still inferior to Transformers.

It is understood that the teams researching new, non-Transformer architectures for large models are either in academia or at startups like RWKV. Few large companies are committing sizable teams to exploring new architectures, so in terms of resources the gap between non-Transformers and Transformers remains wide.

Moreover, the biggest obstacle ahead is the increasingly solid ecological moat of the Transformer.

Now, whether in hardware, systems, or applications, everything is adapted and optimized around Transformers, making it less cost-effective to develop other architectures, leading to increasing difficulty in creating new architectures.

In evaluation, many benchmark tasks are designed with the Transformer architecture in mind: some tasks may effectively be solvable only by Transformer models and are difficult or impossible for non-Transformers. Such designs showcase the Transformer’s strengths but are unfriendly to other architectures.

MIT PhD student and flash-linear-attention project leader Yang Songlin once told AI Technology Review that one of the obstacles faced by non-Transformer research is the evaluation method—simply looking at perplexity shows that non-Transformers do not differ much from Transformers, but many practical abilities (like in-context copy and retrieval) still have significant gaps. She believes that current non-Transformer models lack more comprehensive evaluation methods to improve their capabilities relative to Transformers.

Undoubtedly, the status of Transformers remains unshakable; they are still the most powerful AI architecture today. However, beyond the echo chamber effect, the work to develop the next generation of AI architectures is in full swing.

Breaking the monopoly is certainly not easy, but based on the laws of technological development, it is difficult for any architecture to maintain a monopoly indefinitely. In the future, non-Transformers need to continue proving how high their ceiling is, and the Transformer architecture must do the same.
