
This article is reprinted with authorization from the AI new media outlet QbitAI (public account ID: qbitai); please contact the source for reprint permission.
This article is about 1,200 words and takes roughly 5 minutes to read.
This article introduces the hybrid model Jamba.
Big news: the first real scale-up of the Mamba architecture has finally arrived, taking it to a genuinely large size.
52 billion parameters, and it is a Mamba + Transformer hybrid architecture.
Its name is Jamba.
By combining the strengths of both architectures, it delivers both model quality and efficiency, with high throughput and a small memory footprint.
Preliminary benchmarks show:
- Jamba’s overall performance is close to Mixtral 8x7B, but its throughput when processing 128K-token contexts is three times Mixtral’s.
- It supports a 256K context window in total, of which a single A100 GPU can hold 140K, the best efficiency among models of its scale.
This achievement comes from the Israeli AI company AI21 Labs.
One of Mamba’s original authors excitedly reposted the news:
Absolutely “big news.”
Mamba and Transformer Combined
Mamba, proposed by researchers at CMU and Princeton, addresses a key limitation of the Transformer: as the context grows longer, memory usage rises and inference slows down, driving up computational cost.
However, Mamba has its own drawbacks:
because it does not attend to the entire context, its output quality suffers, especially on recall-heavy tasks.
In the spirit of “having it both ways,” Jamba steps in with a best-of-both-worlds solution.
Jamba is built from Transformer, Mamba, and mixture-of-experts (MoE) layers, optimizing memory, throughput, and performance at the same time.
As shown in the diagram below, Jamba integrates the two architectures through a blocks-and-layers design.
In simple terms, each Jamba block contains either an attention layer or a Mamba layer, followed by a multi-layer perceptron (MLP), with attention (Transformer) layers making up one in every eight layers overall.
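To make the layout concrete, here is a minimal Python sketch of this kind of interleaving. The one-in-eight ratio and the "mixer followed by MLP" structure come from the description above; the exact position of the attention layer within each block and all layer internals are illustrative assumptions, not AI21's actual implementation.

```python
# Hypothetical sketch of a Jamba-style layer plan: within each block of 8
# layers, one uses attention (Transformer) and the rest use Mamba, and every
# layer is followed by an MLP. Layer internals are deliberately omitted.

ATTN_EVERY = 8  # one attention layer per 8 layers, per the description above

def build_layer_plan(num_layers: int) -> list[str]:
    """Return the mixer type for each layer index."""
    plan = []
    for i in range(num_layers):
        # Where the attention layer sits inside each block of 8 is an
        # assumption here; only the 1-in-8 ratio comes from the article.
        mixer = "attention" if i % ATTN_EVERY == ATTN_EVERY - 1 else "mamba"
        plan.append(f"{mixer} + MLP")
    return plan

if __name__ == "__main__":
    for idx, layer in enumerate(build_layer_plan(16)):
        print(idx, layer)
```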
Second, Jamba uses MoE to increase the model’s total parameter count while keeping the number of parameters actually activated during inference small.
As a result, the model capacity increases without a corresponding increase in computational demand.
To maximize throughput on a single 80GB GPU, Jamba also tunes the number of MoE layers and experts, leaving enough memory for common inference workloads.
Notably, at inference time Jamba’s MoE layers activate only 12 billion of the 52 billion available parameters, which keeps it more efficient than a Transformer-only model of comparable size.
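As a back-of-the-envelope illustration of why MoE keeps inference cheap, the sketch below contrasts total versus active parameters for a single MoE layer. The expert size and expert counts are made-up numbers for illustration, not figures from the article.

```python
# Illustrative arithmetic only: top-k expert routing keeps the parameters
# activated per token far below the parameters stored in the model.
# The numbers below are assumptions for illustration.

def moe_param_counts(expert_params: float, num_experts: int, top_k: int):
    """Return (total, active) parameter counts for one MoE feed-forward layer."""
    total = expert_params * num_experts   # all experts are kept in memory
    active = expert_params * top_k        # only top_k experts run per token
    return total, active

# Example: experts of 0.5B parameters each, 16 experts, 2 active per token.
total, active = moe_param_counts(0.5e9, num_experts=16, top_k=2)
print(f"total: {total/1e9:.1f}B, active per token: {active/1e9:.1f}B")
# The same mechanism is how Jamba activates only 12B of its 52B parameters at
# inference time, although the real split also includes the non-MoE
# (attention and Mamba) layers.
```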
It is important to note that previous attempts to scale Mamba did not exceed 3 billion parameters.
Thus, in addition to successfully combining Mamba and Transformer, Jamba also achieves a second major accomplishment:
It is the first hybrid SSM + Transformer architecture to reach production-grade scale and quality (note: Mamba is a type of state space model, or SSM).
Throughput and Efficiency Up
Preliminary evaluations show that Jamba excels in key metrics such as throughput and efficiency.
First, Jamba delivers three times the throughput on long contexts, making it more efficient than similarly sized Transformer-based models such as Mixtral 8x7B.
As shown in the diagram below, when the context window reaches 128K tokens, Jamba generates nearly 1,500 tokens per second, while Mixtral 8x7B, the best of the comparison models, manages only around 500.
Second, a single GPU can hold a 140K-token context with Jamba, making it economical and efficient.
In contrast, Mixtral 8x7B fits only 64K on the same hardware, and Llama 2 70B just 16K.
Third, the output quality of Jamba has also been ensured.
In a series of reasoning benchmarks, it achieved state-of-the-art (SOTA) results on three of four metrics, and on the remaining benchmarks, such as GSM8K, it stayed close to the SOTA models even without taking the lead.
Overall, Jamba’s performance is close to that of Mixtral 8x7B.
Finally, the authors note that these are only preliminary results and that there is still plenty of room for optimization (such as MoE parallelization and a faster Mamba implementation), so performance should get even stronger.
The good news: Jamba is already live on Hugging Face, and, importantly, it is released under the Apache 2.0 license.
(An instruction-tuned version of Jamba will soon be available on the AI21 Labs platform.)
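For readers who want to try it, below is a minimal loading sketch using the Hugging Face transformers library. The repository name "ai21labs/Jamba-v0.1" and the need for trust_remote_code are assumptions based on typical release conventions; check the model card for the exact instructions and hardware requirements.

```python
# Minimal sketch of loading Jamba from Hugging Face with transformers.
# Repository name and trust_remote_code flag are assumptions; see the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # spread the 52B parameters across available GPUs
    trust_remote_code=True,  # may be required for the custom Mamba/MoE layers
)

inputs = tokenizer("Hybrid Mamba and Transformer models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```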
Netizens were moved to tears after reading this.
Editor: Yu Tengkai
Proofreader: Lin Ganmin