Author: Zhang Jin
Editor: Chen Caixian
The paper “Attention Is All You Need,” published by Google in 2017, has become a bible for today’s artificial intelligence, and the global AI boom can be traced directly back to the invention of the Transformer.
Due to its ability to handle local and long-range dependencies and its parallelizable training characteristics, the Transformer gradually replaced the previous RNN (Recurrent Neural Network) and CNN (Convolutional Neural Network) upon its release, becoming the standard paradigm for cutting-edge NLP (Natural Language Processing) research.
Today’s mainstream AI models and products—OpenAI’s ChatGPT, Google’s Bard, Anthropic’s Claude, Midjourney, Sora, and domestic models like Zhipu AI’s ChatGLM, Baichuan Intelligence’s Baichuan model, Kimi chat, etc.—are all based on the Transformer architecture.
The Transformer has undoubtedly become the gold standard of current AI technology, and its dominant position remains unshaken.
While the Transformer is thriving, some opposing voices have emerged, such as: “The efficiency of the Transformer is not high”; “The ceiling of the Transformer is easily visible”; “The Transformer is good, but it cannot achieve AGI or a world model.”
This is because the strength of the Transformer is also its weakness: the inherent self-attention mechanism in the Transformer presents challenges, primarily due to its quadratic complexity, which makes the architecture costly in terms of computation and memory when dealing with long input sequences or resource-constrained situations.
To put it simply, this means that as the sequence length processed by the Transformer (for example, the number of words in a paragraph or the size of an image) increases, the required computational power increases quadratically with the length of that sequence, quickly becoming enormous. Therefore, there is a saying that “the Transformer is not efficient.” This is also a major reason for the global shortage of computing power triggered by the current AI boom.
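This scaling is visible even in a toy implementation: scoring every token against every other token materializes an n-by-n weight matrix. The sketch below is purely illustrative (single head, plain Python lists, no batching or masking), not any production kernel:

```python
import math

def naive_attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors.
    Building the full n-by-n score matrix is what makes cost and memory
    grow quadratically with sequence length n."""
    n, d = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(n):
        # score query i against every key: n dot products per query -> n*n total
        scores = [scale * sum(qi * kj for qi, kj in zip(Q[i], K[j]))
                  for j in range(n)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]          # softmax over row i
        out.append([sum(w * V[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out

# tiny example: 3 tokens with 2-dimensional embeddings
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = naive_attention(Q, Q, Q)
print(len(out), len(out[0]))  # 3 2
```

Doubling the number of tokens quadruples the number of entries in the score matrix, which is exactly the quadratic blow-up described above.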
Based on the limitations of the Transformer, many non-Transformer architectures have emerged, including China’s RWKV, Meta’s Mega, Microsoft’s RetNet, Mamba, and DeepMind’s Hawk and Griffin—these were proposed one after another after the Transformer unified the large model research landscape.
Most of them are improvements built on the original RNN, addressing the Transformer’s defects and limitations in pursuit of what is called an “efficient Transformer”: a model that works more like human thinking.
An “efficient Transformer” here means a model that occupies less memory and incurs lower computational cost during training and inference; these architectures aim to unseat the Transformer’s hegemony.
Where Is Current Non-Transformer Architecture Research Headed?
Currently, mainstream non-Transformer research is primarily focused on optimizing the full attention mechanism and finding ways to convert this part into an RNN model to improve inference efficiency.
The attention mechanism is the core of the Transformer—its power comes from abandoning the previously widely used recurrent and convolutional networks in favor of a special structure—the attention mechanism—to model text.
Attention allows the model to consider the relationships between words, no matter how far apart they are, and to determine which words and phrases in a paragraph are most worthy of attention.
This mechanism enables the Transformer to achieve parallelization in language processing, analyzing all words in a specific text simultaneously rather than sequentially. The parallelization of the Transformer provides a more comprehensive and accurate understanding of the text it reads and writes, making it more computationally efficient and scalable than RNNs.
In contrast, Recurrent Neural Networks (RNNs) face the problem of vanishing gradients, making it difficult for them to train on long sequences. Additionally, they cannot parallelize over time during training, which limits their scalability; Convolutional Neural Networks (CNNs) are only good at capturing local patterns and lack the ability to handle long-range dependencies, which is crucial for many sequence processing tasks.
However, RNNs have the advantage that during inference, the complexity is constant, so memory and computational requirements grow linearly, while the memory and computational complexity of Transformers grows quadratically with sequence length. Therefore, many current non-Transformer research efforts aim to “retain the advantages of RNNs while attempting to achieve Transformer performance.”
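The contrast can be sketched in a few lines. The toy functions below use hypothetical scalar “tokens” and made-up weights purely to illustrate the asymptotics: the recurrent update touches a fixed-size state, while attention-style decoding must revisit its entire growing cache at every step:

```python
import math

def rnn_step(state, x, w_h=0.5, w_x=0.5):
    """One recurrent update: the new state depends only on the previous
    state and the current input, so per-token cost and memory are O(1)."""
    return math.tanh(w_h * state + w_x * x)

def attn_step(cache, x):
    """Attention-style decoding: each new token attends over the whole
    cached history, so per-token cost grows with sequence length
    (O(n) per step, O(n^2) over the full sequence)."""
    cache.append(x)
    scores = [xi * x for xi in cache]      # dot products against all history
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(w / z * xi for w, xi in zip(exps, cache))

state, cache = 0.0, []
for x in [0.1, -0.3, 0.7, 0.2]:
    state = rnn_step(state, x)   # state stays a single fixed-size value
    y = attn_step(cache, x)      # cache keeps growing with every token
print(len(cache))  # 4
```

After four tokens the recurrent state is still one number, while the attention cache holds all four inputs; this is the trade-off the non-Transformer efforts try to exploit.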
Based on this goal, today’s non-Transformer technology research is mainly divided into two schools:
One school, represented by RWKV, Mamba, and S4, completely replaces attention with a recurrent structure. This approach uses a fixed-size memory to retain past information; it can currently remember sequences of a certain length, but longer lengths remain difficult.
The other school aims to make the full attention dense structure sparse, such as Meta’s Mega, which no longer requires calculating every element in the attention matrix during subsequent computations, thus improving model efficiency.
Specifically analyzing various non-Transformer models, RWKV is the first domestic open-source non-Transformer architecture large language model, now iterated to the sixth generation RWKV-6. The author of RWKV, Peng Bo, began training RWKV-2 in May 2022, initially with a parameter scale of only 100 million (100M), and later in March 2023, trained the RWKV-4 model with 14 billion (14B) parameters.
Peng Bo once told AI Tech Review why he wanted to create a model different from the Transformer architecture:
“Because the world itself does not operate on the logic of Transformers; the laws of the world operate based on something like an RNN structure—what happens in the next second will not be related to all your past time and information but only to the previous second. It is unreasonable for the Transformer to recognize all tokens.”
Thus, RWKV uses linear attention to approximate full attention, attempting to combine the advantages of RNN and Transformer while avoiding the shortcomings of both, alleviating the memory bottleneck and quadratic expansion problems posed by Transformers, achieving more effective linear scaling while providing parallel training and scalability, similar to Transformers. In short, it focuses on high performance, low energy consumption, and low memory usage.
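RWKV’s actual formulation (its R, W, K, V channels and time-decay terms) is more involved than can be shown here; the sketch below is a generic kernelized linear-attention recurrence of the kind this family builds on, not RWKV itself. The point is that once softmax is replaced by a positive feature map, attention collapses into a fixed-size running state:

```python
import math

def phi(v):
    """Positive feature map (elu(x) + 1 style); keeps weights and the
    normalizer strictly positive."""
    return [math.exp(x) if x < 0 else x + 1.0 for x in v]

def linear_attention(qs, ks, vs):
    """Kernelized linear attention as a recurrence (a generic sketch, not
    RWKV's exact update). The running state S (d_k x d_v) and normalizer z
    (d_k) are fixed-size, so memory does not grow with sequence length."""
    d_k, d_v = len(qs[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fq, fk = phi(q), phi(k)
        for i in range(d_k):                 # S += phi(k) v^T ; z += phi(k)
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        denom = sum(fq[i] * z[i] for i in range(d_k))
        outs.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                     for j in range(d_v)])
    return outs

qs = ks = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8]]
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outs = linear_attention(qs, ks, vs)
print(len(outs), len(outs[0]))  # 3 2
```

Because each step only updates S and z in place, the same loop serves both parallelizable training (unrolled over the sequence) and constant-memory inference, which is the combination of advantages described above.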
And the previously discussed Mamba has two authors, one being Albert Gu, an assistant professor in the Machine Learning Department at Carnegie Mellon University, and the other Tri Dao, the chief scientist at Together.AI.
In their paper, they state that Mamba is a new SSM architecture that outperforms comparable-scale Transformer models in language modeling, in both pre-training and downstream evaluation, and can compete with Transformer models twice its size. It scales linearly with context length, its performance on real data keeps improving up to sequences of a million tokens, and it offers a five-fold improvement in inference throughput.
A non-Transformer researcher told AI Tech Review that Mamba completely uses a recurrent structure without attention, so when predicting the next token, its memory size remains fixed and does not increase over time; however, its drawback is that the memory during rolling is very small, resulting in weak extrapolation capability.
This researcher believes that RetNet proposed by Microsoft Research Asia also follows a completely recurrent approach. RetNet introduces a multi-scale retention mechanism to replace multi-head attention, with three computation paradigms: parallel, recurrent, and block-recurrent representation.
The paper states that RetNet’s inference cost is independent of sequence length. For a 7B model with an 8k sequence length, RetNet decodes 8.4 times faster than a Transformer with key-value caching while saving 70% of memory.
During training, RetNet also saves 25-50% of memory compared to a standard Transformer, achieves a seven-fold acceleration, and holds an advantage even over highly optimized FlashAttention. In addition, RetNet’s inference latency is insensitive to batch size, allowing substantially higher throughput.
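The paper’s central trick is a decay-weighted “retention” that has an exactly equivalent parallel form (for training) and recurrent form (for constant-cost inference). This can be illustrated in a stripped-down scalar version; the real RetNet uses multi-head, matrix-valued states and xPos-style rotations, which are omitted here:

```python
def retention_recurrent(qs, ks, vs, gamma=0.9):
    """Recurrent form of retention (simplified: 1-D keys/values, scalar
    decay): S_n = gamma * S_{n-1} + k_n * v_n ; o_n = q_n * S_n.
    The state S is fixed-size, so per-token inference cost is constant."""
    S, outs = 0.0, []
    for q, k, v in zip(qs, ks, vs):
        S = gamma * S + k * v
        outs.append(q * S)
    return outs

def retention_parallel(qs, ks, vs, gamma=0.9):
    """Parallel form: o_n = sum_{m<=n} q_n * k_m * gamma^(n-m) * v_m.
    Mathematically identical to the recurrent form, but computed all at
    once, which is what makes training parallelizable."""
    return [sum(qs[n] * ks[m] * gamma ** (n - m) * vs[m]
                for m in range(n + 1))
            for n in range(len(qs))]

qs, ks, vs = [1.0, 0.5, -0.2], [0.3, 0.8, 1.1], [2.0, -1.0, 0.5]
r = retention_recurrent(qs, ks, vs)
p = retention_parallel(qs, ks, vs)
print(all(abs(a - b) < 1e-9 for a, b in zip(r, p)))  # True
```

Having two interchangeable computation paths is what lets RetNet train like a Transformer and decode like an RNN.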
Meta’s Mega represents the second technical route in non-Transformer research. Mega’s approach is to combine recurrent and sparse attention matrices.
One of Mega’s core researchers, Max, told AI Tech Review that attention has its irreplaceable role, and as long as its complexity is kept within a certain range, desired results can be achieved. Mega spent a long time researching how to combine recurrent structures with attention for maximum efficiency.
Thus, Mega still adopts the attention structure but restricts attention to a fixed window size while combining a rolling memory form similar to Mamba, although Mega’s rolling form is simplified, resulting in faster overall computation.
“Rolling memory” refers to how these efficient Transformers integrate a recurrent structure: the model reads a segment of history and remembers it, then reads the next segment and updates its memory, perhaps forgetting part of the first segment while adding what needs to be remembered from the second, rolling forward continuously.
The benefit is a fixed-length rolling memory that does not grow over time; the drawback is that on certain special tasks it is hard to know in advance which parts of the earlier memory will be useful and which will not, which makes rolling memory difficult to get right.
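The rolling idea can be made concrete with a toy sketch. This is not Mega’s (or any model’s) actual mechanism, just the generic shape: read a chunk, fold a summary of it into a fixed-size memory with some forgetting, and move on:

```python
def rolling_memory(tokens, chunk_size=4, decay=0.7):
    """Toy sketch of rolling memory (not any specific model's mechanism):
    process one chunk at a time and fold it into a fixed-size summary with
    exponential decay, so older chunks fade while memory stays O(1)."""
    memory = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        summary = sum(chunk) / len(chunk)            # this chunk's contribution
        memory = decay * memory + (1 - decay) * summary  # forget a bit, add the new
    return memory

tokens = [float(i) for i in range(12)]   # three chunks of four
m = rolling_memory(tokens)
print(m)
```

The decay factor is exactly the double-edged sword described above: it keeps the memory bounded, but it also decides, blindly, how quickly earlier segments are forgotten.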
Training on the same data as LLaMA, the Mega team found that Mega2 performed significantly better than LLaMA2 under the same conditions. Moreover, Mega pre-trains with a 32K window, whereas a Transformer with the same 32K window is much slower; as the window grows further, Mega’s advantage becomes more pronounced. Mega2 has currently been trained up to 7B parameters.
DeepMind’s Hawk and Griffin likewise take the view that attention is indispensable. They belong to the family of gated linear RNNs and, like Mega, are hybrid models.
Besides RWKV, domestic Yanchip Intelligence has also released a general natural language large model without an attention mechanism—the Yan model. Liu Fanping, CTO of Yanchip Intelligence, stated that Yan has no relation to linear attention or RNN, and the large model of the Yan architecture eliminates the high-cost attention mechanism in Transformers, replacing it with linear computations that are less demanding, improving modeling efficiency and training speed, thus enhancing efficiency and reducing costs.
Many non-Transformer architectures have now been proposed, and in performance evaluations they generally outperform Transformers of comparable size. But they all face the same challenge and skepticism: when scaled up to the size of today’s Transformer models, can they still deliver strong performance and efficiency gains?
Among them, the largest parameter model RWKV has 14 billion parameters, while Meta’s Mega has 7 billion parameters, whereas GPT-3 has 175 billion parameters, and GPT-4 is rumored to have 1.8 trillion parameters. This means that non-Transformers urgently need to train a model with hundreds of billions of parameters to prove themselves.
The most representative non-Transformer effort, RWKV, has already made notable progress: it has raised over ten million yuan in seed funding; some domestic companies are reportedly trying to use RWKV to train models; and over the past year RWKV has seen some deployment in both To C and To B scenarios.
However, several investors have told AI Tech Review that when weighing an investment in RWKV, a bet on non-Transformers, they ran into significant internal disagreement over whether non-Transformers could succeed, and ultimately walked away.
At this stage, given current hardware compute, running Transformer-based large models on the edge is very challenging; computation and inference must still be completed in the cloud, and response speeds are unsatisfactory, making it hard for end users to accept.
Industry insiders have told AI Tech Review, “On the edge, RWKV may not be the optimal solution, because with the evolution of semiconductors, AI chips are becoming increasingly advanced, and in the future, costs in hardware, computing, and energy will eventually be distributed, allowing large models to easily run directly on the edge without the need for significant changes to the underlying architecture. One day, we will reach such a critical point.”
RWKV’s approach is to operate from the framework layer, making the framework lightweight to allow local computation of models. However, one investor suggested that the ideal state for non-Transformers is to reach OpenAI’s level before discussing lightweight solutions, “not just to be small for the sake of being small or localized for localization’s sake.”
This investor rated RWKV as “small but complete”: its overall experience can reach perhaps 60 points of GPT-3.5, but whether it can reach 80 or 90 points is uncertain. This is the broader problem for non-Transformers: stripping away the architecture’s complexity may sacrifice its performance ceiling.
Individuals close to OpenAI have told AI Tech Review that OpenAI internally tested RWKV but ultimately abandoned this route because “its ceiling has yet to be revealed, and the possibility of achieving AGI is low in the long run.”
Proving how high their ceiling can be has become a common challenge for all non-Transformer architectures.
Some model researchers argue that the Transformer has not yet hit its ceiling on text models, since the scaling law has not failed. Its bottleneck may instead lie in generating longer sequences, for example in multimodal video generation, which is essential for achieving AGI; the context window therefore remains the Transformer’s bottleneck.
If one is willing to spend money like OpenAI, one can keep pushing the Transformer’s scaling law higher, but the issue is that every time the sequence length doubles, the cost quadruples, and so does the time. This quadratic growth makes the Transformer inefficient on long sequences, and resources have limits.
It is reported that China’s leading large model companies primarily use Transformers. There is speculation about whether GPT-5 will still use the Transformer architecture, since OpenAI has released nothing open-source since GPT-2. Nonetheless, many are inclined to believe that the Transformer’s ceiling is still far off, so following the Transformer path to catch up with GPT-4 and GPT-5 may not be wrong. In the era of large models, everyone is placing bets.
However, whether the Transformer is the only path to achieving AGI remains uncertain. What is currently certain is that the monopoly formed by the Transformer is difficult to break, whether from resources or ecology, as current non-Transformer research cannot compete.
It is reported that teams researching new non-Transformer architectures for large models are either in academia or startups like RWKV, with few large companies investing significant teams in researching new architectures, resulting in a considerable gap in resources compared to Transformers.
Moreover, the biggest obstacle ahead is the increasingly solid ecological moat of the Transformer.
Currently, whether in hardware, systems, or applications, everything is adapted and optimized around the Transformer, making the cost-effectiveness of developing other architectures lower, leading to increasing difficulty in developing new architectures.
In terms of evaluation, many benchmark tasks are designed in ways that favor the Transformer architecture: the tasks may be solvable only by Transformer models, while non-Transformers find them difficult or disproportionately hard. Such designs showcase the Transformer’s strengths but are unfriendly to other architectures.
MIT PhD student and flash-linear-attention project leader Yang Songlin once told AI Tech Review that one of the obstacles facing current non-Transformer research is evaluation methods—simply looking at perplexity, non-Transformers actually have no difference compared to Transformer models, but many practical abilities (such as in-context copy and retrieval) still show considerable gaps. She believes that current non-Transformer models lack more comprehensive evaluation methods to improve their capabilities compared to Transformers.
Undoubtedly, the status of the Transformer remains unshakable; it is still the most powerful AI architecture today. However, outside the echo chamber, the development of the next generation of AI architectures is proceeding vigorously.
Breaking the monopoly is undoubtedly difficult, but according to the laws of technological development, it is hard for any architecture to dominate forever. In the future, non-Transformers need to continue proving how high their ceiling can be, and so does the Transformer architecture.
The author of this article (vx: zzjj752254) has long followed the AI large-model field, its companies, commercialization, and industry dynamics. Feel free to connect.
Unauthorized reproduction of this article on any webpage, forum, or community is strictly prohibited without authorization from “AI Tech Review”!
For reposting on public accounts, please leave a message in the “AI Tech Review” backend to obtain authorization, and when reposting, please indicate the source and insert this public account’s card.