RWKV Introduces Two New Architectures: Eagle and Finch

Contributed by RWKV | QbitAI, WeChat Official Account

Taking a path different from the usual Transformer route, RWKV, the heavily modified RNN architecture developed in China, has made new progress:

Two new RWKV architectures have been proposed, namely Eagle (RWKV-5) and Finch (RWKV-6).


Both sequence models build on the RWKV-4 architecture and improve upon it.

The advances in the new architectures include multi-headed matrix-valued states and a dynamic recurrence mechanism, which improve the expressiveness of RWKV models while retaining the inference-efficiency characteristics of RNNs.

Additionally, a new multilingual corpus has been introduced, containing 1.12 trillion tokens.

The team has also developed a fast tokenizer based on greedy matching to enhance RWKV’s multilingual capabilities.

Currently, 4 Eagle models and 2 Finch models have been released on Hugging Face.


New Models Eagle and Finch

The updated RWKV includes a total of 6 models, specifically:

  • 4 Eagle (RWKV-5) models, with parameter sizes of 0.4B, 1.5B, 3B, and 7B;

  • 2 Finch (RWKV-6) models, with parameter sizes of 1.6B and 3B.


Eagle improves on the architecture and learned decay schedule of RWKV-4 by using multi-headed matrix-valued states (instead of vector-valued states), a reformulated receptance, and an additional gating mechanism.
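To make the matrix-valued-state idea concrete, here is a minimal single-head sketch of this style of recurrence. It is an illustration only: the actual Eagle block also includes token shift, grouped heads, and output gating and normalization, and the decay and bonus vectors (w and u below) are learned parameters.

```python
import numpy as np

def eagle_like_recurrence(r, k, v, w, u):
    """Minimal single-head sketch of a matrix-valued-state recurrence in the
    spirit of Eagle (RWKV-5). r, k, v: arrays of shape (T, D); w, u: per-channel
    decay and current-token bonus vectors of shape (D,). Illustration only."""
    T, D = k.shape
    S = np.zeros((D, D))                 # matrix-valued state (a vector in RWKV-4)
    out = np.zeros((T, D))
    for t in range(T):
        kv = np.outer(k[t], v[t])        # rank-1 contribution of the current token
        out[t] = r[t] @ (S + np.diag(u) @ kv)   # receptance reads state + bonus term
        S = np.diag(w) @ S + kv          # per-channel decay, then accumulate
    return out

# tiny usage example with random data
rng = np.random.default_rng(0)
T, D = 5, 4
y = eagle_like_recurrence(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                          rng.normal(size=(T, D)), np.full(D, 0.9), np.full(D, 0.5))
print(y.shape)  # (5, 4)
```

The key change relative to RWKV-4 is that the running state S is a D×D matrix per head rather than a vector, which is what "matrix-valued states" refers to.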

Finch further enhances the architecture’s performance and flexibility by introducing data-dependent functions for the time-mixing and token-shift modules, including parameterized linear interpolation.

Moreover, Finch proposes a novel use of low-rank adaptation (LoRA) functions, allowing trainable weight matrices to efficiently augment the learned data-decay vectors in a context-dependent manner.
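A hedged sketch of this LoRA-style, data-dependent decay idea follows; the names and shapes used here (A, B, base_decay) are illustrative, and the paper’s exact functional form may differ.

```python
import numpy as np

def data_dependent_decay(x, base_decay, A, B):
    """Hedged sketch of Finch-style (RWKV-6) data-dependent decay: a low-rank
    (LoRA-like) projection of the input augments a learned base decay vector,
    giving each token its own per-channel decay rate. x: (T, D); base_decay: (D,);
    A: (D, rank); B: (rank, D)."""
    delta = np.tanh(x @ A) @ B       # cheap, context-dependent low-rank adjustment
    d = base_decay + delta           # augment the learned base decay per token
    return np.exp(-np.exp(d))        # (T, D) decay values kept inside (0, 1)
```

In a recurrence like the one sketched above, the fixed decay vector w would then be replaced by a per-token decay w[t].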

Finally, the new RWKV architectures introduce a new tokenizer, the RWKV World Tokenizer, and a new dataset, RWKV World v2, both aimed at improving the performance of RWKV models on multilingual and code data.

The new tokenizer, RWKV World Tokenizer, includes vocabulary for less common languages and performs fast tokenization through Trie-based greedy matching.
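Trie-based greedy matching simply means repeatedly taking the longest vocabulary entry that matches at the current position. A minimal byte-level sketch of the general technique (not the actual RWKV World Tokenizer implementation) looks like this:

```python
def build_trie(vocab):
    """Build a simple byte-level trie from a vocabulary mapping token -> id."""
    trie = {}
    for token, tid in vocab.items():
        node = trie
        for b in token.encode("utf-8"):
            node = node.setdefault(b, {})
        node["id"] = tid
    return trie

def greedy_tokenize(text, trie):
    """Greedy longest-match tokenization over the trie (simplified sketch)."""
    data, ids, i = text.encode("utf-8"), [], 0
    while i < len(data):
        node, j, last = trie, i, None
        while j < len(data) and data[j] in node:
            node = node[data[j]]
            j += 1
            if "id" in node:
                last = (node["id"], j)   # remember the longest match so far
        if last is None:
            raise ValueError(f"no token matches byte {data[i]}")
        ids.append(last[0])
        i = last[1]
    return ids
```

For example, with vocab = {"a": 1, "ab": 2, "b": 3}, greedy_tokenize("aab", build_trie(vocab)) returns [1, 2], because the longer match "ab" is preferred at the second position.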

The new dataset, RWKV World v2, is a multilingual dataset of 1.12 trillion tokens, drawn from a range of carefully selected, publicly available data sources.

In terms of composition, approximately 70% of the data is English, 15% is multilingual, and 15% is code.

How Do the Benchmark Results Look?

Having architectural innovations alone is not enough; the key is to observe the actual performance of the models.

Let’s take a look at how the new models perform on several authoritative benchmarks.

MQAR Test Results

The MQAR (Multi-Query Associative Recall) task is designed to evaluate language models by testing their associative-recall ability under multiple queries.

In this type of task, the model needs to retrieve relevant information through multiple given queries.

The goal of the MQAR task is to measure the model’s ability to retrieve information under multiple queries, as well as its adaptability and accuracy with different queries.
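As a rough illustration of what such a task instance looks like (the exact format used in the evaluation may differ), here is a hedged sketch that builds a synthetic key-value prompt followed by several queries:

```python
import random

def make_mqar_example(num_pairs=8, num_queries=4, vocab_size=100, seed=0):
    """Hedged sketch of a Multi-Query Associative Recall style example: the model
    first sees key-value pairs, then must recall the value for each queried key.
    Token choices and formatting here are illustrative only."""
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), num_pairs)
    kv = {k: rng.randrange(vocab_size, 2 * vocab_size) for k in keys}
    queries = rng.sample(keys, num_queries)
    # input sequence: k1 v1 k2 v2 ... followed by the queried keys
    prompt = [tok for k in keys for tok in (k, kv[k])] + queries
    targets = [kv[q] for q in queries]    # expected answers, in query order
    return prompt, targets
```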

The following figure shows the MQAR task test results for RWKV-4, Eagle, Finch, and other non-Transformer architectures.

[Figure: MQAR accuracy versus sequence length for RWKV-4, Eagle, Finch, and other non-Transformer architectures]

It can be seen that in the MQAR accuracy tests, Finch maintains very stable accuracy across sequence lengths, showing a clear advantage over RWKV-4, Eagle (RWKV-5), and the other non-Transformer models.

Long Context Experiments

On the PG19 test set, the team measured the loss of RWKV-4, Eagle, and Finch as a function of sequence position, starting from token 2048.

(All models were pretrained with a context length of 4096.)
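The evaluation idea is to run one long document through the model and record the loss at every position, so loss can then be plotted against sequence position. A hedged PyTorch sketch, assuming a model that maps a (1, T) tensor of token ids to (1, T, vocab) logits (the paper’s exact protocol may differ):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_by_position(model, token_ids, max_len=20000):
    """Per-position language-modeling loss on one long document.
    token_ids: 1-D LongTensor of token ids for the document."""
    ids = token_ids[:max_len].unsqueeze(0)                # (1, T)
    logits = model(ids)                                   # (1, T, vocab)
    losses = F.cross_entropy(logits[0, :-1], ids[0, 1:],  # predict token t+1 from its prefix
                             reduction="none")
    return losses                                         # (T-1,) losses, one per position
```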

[Figure: loss versus sequence position on PG19 for RWKV-4, Eagle, and Finch]

The results show that Eagle improves significantly over RWKV-4 on long sequences, while Finch, likewise trained with a 4096 context length, performs better still and adapts effectively to context lengths beyond 20,000 tokens.

Speed and Memory Benchmark Tests

In the speed and memory benchmarks, the team compared the speed and memory utilization of the Finch, Mamba, and Flash Attention kernels.

[Figures: speed and memory usage of the Finch, Mamba, and Flash Attention kernels]

It can be seen that Finch consistently uses less memory than both Mamba and Flash Attention: 40% less than Flash Attention and 17% less than Mamba.
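For reference, kernel comparisons like this are typically made by timing repeated calls and recording peak CUDA memory; a hedged sketch of such a harness (not the team’s actual benchmark code) follows:

```python
import time
import torch

def benchmark_kernel(fn, *args, warmup=3, iters=10):
    """Time a kernel call and report peak GPU memory (requires a CUDA device).
    fn is any callable running the kernel under test on pre-built CUDA tensors."""
    for _ in range(warmup):
        fn(*args)                         # warm up: compile/caching effects excluded
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters          # average seconds per call
    peak_mb = torch.cuda.max_memory_allocated() / 2**20   # peak memory in MiB
    return elapsed, peak_mb
```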

Performance in Multilingual Tasks

[Figures: evaluation results on Japanese, Spanish, Arabic, and Japanese-English tasks]

Next Steps

The above research content comes from the latest paper released by the RWKV Foundation titled “Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence”.

The paper is co-authored by RWKV founder Bo Peng and members of the RWKV open-source community.


Co-author Bo Peng graduated from the University of Hong Kong with a degree in physics and has over 20 years of programming experience. He previously worked at Ortus Capital, one of the world’s largest forex hedge funds, where he was responsible for high-frequency quantitative trading.

He has also published a book on deep convolutional networks titled “Deep Convolutional Networks: Principles and Practice”.

His main focus and interests lie in software and hardware development. In earlier public interviews, he has stated clearly that AIGC, and novel generation in particular, is his area of interest.

Currently, Peng has 2.1k followers on GitHub.


However, his main public identity is co-founder of a lighting company, Bailing Technology, which primarily produces sunlight lamps, ceiling lamps, and portable table lamps.

He is also likely a cat lover: a ginger cat appears in his GitHub, Zhihu, and WeChat profile pictures, as well as on his lighting company’s homepage and his Weibo.


QbitAI has learned that RWKV’s current multimodal work includes RWKV Music (music) and VisualRWKV (images).

In the next steps, RWKV will focus on the following directions:

  • Expanding the training corpus and making it more diverse (a key factor for improving model performance);

  • Training and releasing larger versions of Finch, such as 7B and 14B parameters, and using MoE to reduce inference and training costs while further scaling performance;

  • Further optimizing the CUDA implementation of Finch (including algorithmic improvements) for greater speed and parallelism.

Paper link:

https://arxiv.org/pdf/2404.05892.pdf

The End
