Yan Model: The First Non-Attention Large Model in China

On January 24, at the “New Architecture, New Model Power” large model launch conference held by Shanghai Yanxin Intelligent AI Technology Co., Ltd., Yanxin officially released the Yan model, the first general-purpose natural language large model in China that does not use the Attention mechanism. As one of the few non-Transformer large models in the industry, the Yan model replaces the Transformer architecture with the newly developed “Yan architecture”, which is reported to achieve the performance of a hundred-billion-parameter model with only tens of billions of parameters: 3 times the memory capacity, 7 times the training speed, and 5 times the inference throughput of a Transformer model of equivalent scale.

At the conference, Yanxin’s CEO Liu Fanping stated: “We hope that the Yan architecture can serve as infrastructure for the field of artificial intelligence and establish an AI developer ecosystem, allowing anyone on any device to use general large models and obtain more economical, convenient, and secure AI services, thereby promoting the construction of an inclusive AI future.”

Transformer Is Not the Only Solution for Large Models

The Transformer is the foundational architecture adopted by popular large models such as GPT, LLaMA, and PaLM, and its rise is an important milestone in the history of deep learning. With its powerful natural language understanding capabilities, the Transformer replaced traditional RNN architectures within just a few years of its introduction, becoming the mainstream architecture in natural language processing and demonstrating cross-domain strength in areas such as computer vision and speech recognition.

So, at a time when Transformers dominate the AI field, why is Yanxin still seeking possibilities beyond them?

At the launch event, Liu Fanping addressed this question. He pointed out that the high computational power and cost demanded by large Transformer models deter many small and medium-sized enterprises. The complexity of the architecture’s internals makes its decision-making process hard to explain, and its difficulty with long sequences, together with the uncontrollable hallucination problem, further limits the application of large models in certain key fields and special scenarios. Meanwhile, with the spread of cloud and edge computing, demand for high-performance, low-power AI large models keeps growing.

Liu Fanping mentioned: “Globally, many excellent researchers have been attempting to fundamentally address the over-reliance on the Transformer architecture and to find better alternatives. Even Llion Jones, one of the authors of the Transformer paper, is exploring ‘possibilities beyond the Transformer’, attempting to redefine AI frameworks from a different angle with a nature-inspired intelligence method based on evolutionary principles.”

Yanxin is no exception. Through its ongoing research on and refinement of Transformer models, the team recognized the need to redesign large models: on one hand, adjustments to existing Attention-based architectures have nearly reached a bottleneck; on the other hand, Yanxin aims to lower the barrier to adoption for enterprises, allowing large models to deliver stronger performance with less data and lower computational power across a broader range of business applications. After nearly 1,000 days and nights and hundreds of rounds of design, modification, optimization, comparison, and rebuilding, Yanxin independently developed a new architecture that no longer relies on the Transformer, the “Yan architecture”, and the general large model built on it was born.

Yan Architecture: Dual Focus on Technology and Implementation

If large models based on the Transformer architecture are like expensive, fuel-hungry gasoline vehicles, then large models based on the Yan architecture are more like economical, energy-efficient new energy vehicles. The Yan architecture removes the Transformer’s costly attention mechanism and replaces it with linear computations that are less demanding and more efficient, significantly improving modeling efficiency and training speed while cutting cost and raising effectiveness.
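
Yanxin has not published the internals of the Yan architecture, so the sketch below is only a generic illustration of the idea behind the analogy: swapping quadratic-cost softmax attention for a linear-complexity alternative, here in the style of kernelized linear attention. It is a minimal sketch of that general technique, not the Yan design.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: time and memory grow as O(n^2)
    # in the sequence length n, because the full n x n score matrix is built.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized linear attention: a positive feature map phi lets the
    # (k, v) products be summed once, so cost grows as O(n) in n.
    phi = lambda x: F.elu(x) + 1                    # one common feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("...nd,...ne->...de", k, v)   # sum_j phi(k_j) v_j^T
    z = k.sum(dim=-2)                               # sum_j phi(k_j)
    num = torch.einsum("...nd,...de->...ne", q, kv)
    den = torch.einsum("...nd,...d->...n", q, z).unsqueeze(-1) + eps
    return num / den

q = k = v = torch.randn(1, 4096, 64)
out = linear_attention(q, k, v)   # cost scales linearly with the 4096 tokens
```

Because the quadratic score matrix never materializes, approaches of this family trade the exact softmax for a large reduction in compute and memory, which is the general trade-off the vehicle analogy points at.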

At the conference, the research team showcased extensive empirical comparisons between the Yan model and Transformer models of equivalent parameter scale. The experimental data indicates that the Yan architecture achieves higher training efficiency, stronger memory capacity, and a lower rate of hallucination than the Transformer architecture.

Under the same resource conditions, the model based on the Yan architecture achieves 7 times the training efficiency and 5 times the inference throughput of the Transformer architecture, while tripling memory capacity. The Yan architecture is designed so that space complexity during inference is constant, which lets the Yan model excel at the long-sequence challenges Transformers face. Comparative data indicates that on a single 24 GB RTX 4090 graphics card, the Transformer model runs out of memory once its output exceeds roughly 2,600 tokens, while the Yan model’s memory usage remains stable at around 14 GB, in principle enabling inference over sequences of unbounded length.
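
A back-of-envelope estimate makes the contrast concrete: in a standard Transformer decoder, the KV cache grows linearly with the number of generated tokens, while a constant-space design pays a fixed cost at any length. The model shape below is a hypothetical assumption for illustration, not the configuration Yanxin used in its comparison.

```python
# Rough KV-cache growth for a Transformer decoder at inference time.
# The model shape is a hypothetical assumption, not Yanxin's test setup.
n_layers, n_heads, head_dim = 32, 32, 128   # assumed decoder dimensions
bytes_per_value = 2                         # fp16

def kv_cache_bytes(seq_len: int) -> int:
    # Each layer caches K and V: 2 * seq_len * n_heads * head_dim values.
    return 2 * seq_len * n_layers * n_heads * head_dim * bytes_per_value

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 2**30:.2f} GiB of KV cache")
# Cache memory grows linearly (~0.5 GiB per 1,000 tokens under these
# assumptions), on top of the model weights; a fixed-size state costs the
# same at any length, which is what constant inference space complexity means.
```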

Additionally, the research team devised a correlation feature function and a memory operator, combining them with linear computations to reduce the complexity of the model’s internal structure. The Yan architecture aims to open up the “unexplainable black box” of natural language processing and to fully develop the transparency and interpretability of the decision-making process, facilitating the adoption of large models in high-stakes fields such as healthcare, finance, and law.
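
Yanxin has not disclosed the form of its correlation feature function or memory operator, so any concrete code here is necessarily an assumption. Purely as a generic illustration of the kind of mechanism the description suggests, the sketch below updates a fixed-size memory state using linear operations only, in the spirit of recurrent and state-space models.

```python
import torch

# Illustrative fixed-size "memory" updated by linear operations only.
# This is a generic stand-in, NOT Yanxin's published memory operator.
d_model, d_state = 512, 64

W_in = torch.randn(d_state, d_model) * 0.02   # projects each token into the state
W_out = torch.randn(d_model, d_state) * 0.02  # reads the state back out
decay = torch.sigmoid(torch.randn(d_state))   # per-channel forgetting factor

def step(state, x_t):
    # state: (d_state,) summarizes all history; x_t: (d_model,) current token.
    state = decay * state + W_in @ x_t        # linear update, no n x n scores
    return state, W_out @ state               # output read from the memory

state = torch.zeros(d_state)
for x_t in torch.randn(10, d_model):          # toy sequence of 10 tokens
    state, y_t = step(state, x_t)             # memory size is constant per step
```

Because such a state never grows with sequence length, per-step compute and memory stay constant, which is consistent with the constant-space inference behavior described above.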

Beyond its technical breakthroughs, the Yan model also offers six commercialization strengths: privacy, economy, precision, real-time performance, professionalism, and versatility, making it truly “born for deployment”.

According to industry consensus, competition among large models has moved from the era of “parameter races” to the stage of “application races”. General large models urgently need to reach real deployments, so many of them resort to mainstream techniques such as pruning and compression in order to run on devices. The Yan model, by contrast, fully supports private deployment and runs losslessly on mainstream consumer-grade CPUs, without pruning or compression, at performance comparable to other models running on GPUs. This was demonstrated at the launch event, where researchers showed the Yan model running inference on personal computers, with lossless deployment on more portable devices and terminals such as mobile phones planned for the next phase.

Liu Fanping stated: “Yanxin aims to build a full-modal real-time human-machine interaction system that comprehensively links perception, cognition, decision-making, and action, forming an intelligent loop for general artificial intelligence and providing ‘more choices’ for research in embodied intelligence directions such as general robotics. Through specialized productivity tools based on the Yan architecture, we hope to achieve integrated training and inference and to help industries complete their digital transformation and upgrading under low power consumption and limited memory.”

New Model Power, New Ecosystem

During the roundtable discussion at the conference, Liu Fanping held in-depth exchanges on the topic of “Innovation and Change” with Li Meng, researcher and doctoral supervisor at the Shanghai Micro Research Institute of the Chinese Academy of Sciences; Li Hanjun, chief engineer at the Shanghai Industrial Innovation Center of the China Academy of Information and Communications Technology; Cao Yang, founder of Zhizixin Yuan; and Ye Liwei, technical director of Yuewen Qidian.

Li Hanjun stated: “Throughout the development of artificial intelligence, the architectures of large models have kept evolving, and, driven by technology and applications, the boundaries of the ecosystem keep expanding. It is fair to say that every technological breakthrough advances the intelligent ecosystem. From today’s focus on generality to the personalized development of the future, we look forward to the industry producing more new productivity tools, triggering a new round of technological revolution, and pushing the entire AI industry in a more efficient and sustainable direction.”

The Yan model’s performance in practical applications still awaits verification by the market. As Yanxin’s chairman Chen Daiqian summarized: “As the Yan model is further deployed and applied, we look forward to the general large model based on the Yan architecture providing the intelligent capabilities that robots, embedded devices, and IoT devices need, injecting new vitality, new ideas, and new possibilities into the artificial intelligence industry and creating more value for enterprises and users. We will also do our part to help drive a new round of technological change in the field of artificial intelligence.”
