Understanding Mamba: The Strongest Competitor to Transformers


MLNLP community is a well-known machine learning and natural language processing community at home and abroad, covering NLP graduate students, university professors, and corporate researchers. The vision of the community is to promote communication and progress between academia and industry in natural language processing and machine learning, especially for beginners.

Reprinted from | Machine Heart
Edited | Panda

Mamba is great, but it is still in its early stages of development.

There are many deep learning architectures, but the most successful in recent years has been the Transformer, which has established a dominant position across multiple application areas. A key driving force behind this success is the attention mechanism, which lets Transformer-based models focus on the parts of the input sequence that are relevant, achieving better context understanding. However, the downside of the attention mechanism is its high computational overhead, which grows quadratically with input size, making it difficult to handle very long texts.

Fortunately, a promising new architecture has recently emerged: the structured state space model (SSM). This architecture can efficiently capture complex dependencies in sequential data, making it a strong competitor to the Transformer. The design of these models is inspired by classical state space models; they can be thought of as a fusion of recurrent neural networks and convolutional neural networks. They can be computed efficiently with either recurrent or convolutional operations, so their computational cost scales linearly or near-linearly with sequence length, greatly reducing computational overhead.

More specifically, Mamba, one of the most successful SSM variants, offers modeling power that rivals the Transformer while maintaining linear scaling with sequence length. Mamba first introduces a simple yet effective selection mechanism that reparameterizes the SSM based on the input, allowing the model to filter out irrelevant information while retaining necessary and relevant data indefinitely. Mamba also includes a hardware-aware algorithm that computes the model iteratively with a scan instead of a convolution, achieving a 3x speedup on A100 GPUs.

As shown in Figure 1, with its powerful ability to model complex long-sequence data and its near-linear scalability, Mamba has risen to become a foundational model and is expected to transform research and applications in fields such as computer vision, natural language processing, and healthcare. As a result, the literature on studying and applying Mamba is growing rapidly and is hard to keep up with, so a comprehensive review report will surely be beneficial. Recently, a research team from Hong Kong Polytechnic University published such a review on arXiv.

  • Paper Title: A Survey of Mamba
  • Paper Address: https://arxiv.org/pdf/2408.01129

This review report summarizes Mamba from multiple perspectives, helping beginners learn Mamba's basic working mechanism and helping experienced practitioners keep up with the latest developments. Mamba is a hot research direction, and several teams are writing review reports. In addition to this article, there are other reviews focusing on state space models or visual Mamba; for details, please refer to the corresponding papers:

  • Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges. arXiv:2404.16112
  • State space model for new-generation network alternative to transformers: A survey. arXiv:2404.09516
  • Vision Mamba: A Comprehensive Survey and Taxonomy. arXiv:2405.04404
  • A survey on vision mamba: Models, applications and challenges. arXiv:2404.18861
  • A survey on visual mamba. arXiv:2404.15956

Prerequisites

Mamba integrates the recurrent framework of recurrent neural networks (RNN), the parallel computation and attention mechanism of the Transformer, and the linear characteristics of state space models (SSM). Therefore, to thoroughly understand Mamba, it is necessary to first understand these three architectures.

Recurrent Neural Networks

Recurrent neural networks (RNN) retain an internal memory, which makes them adept at handling sequential data. Specifically, at each discrete time step k, a standard RNN processes an input vector together with the hidden state from the previous time step, then outputs another vector and updates the hidden state. This hidden state serves as the memory of the RNN and can retain information from previously seen inputs. This dynamic memory allows RNNs to handle sequences of varying lengths. In other words, RNNs are nonlinear recurrent models that can effectively capture temporal patterns by exploiting the historical knowledge stored in the hidden state.

Transformer

The self-attention mechanism of the Transformer helps capture global dependencies within the input. It does so by assigning weights to each position based on its importance relative to the other positions. More specifically, the original input undergoes linear transformations that convert the sequence of input vectors x into three kinds of vectors: queries Q, keys K, and values V. Normalized attention scores S are then computed and used to produce the attention weights. Besides a single attention function, we can also perform multi-head attention, which allows the model to capture different types of relationships and understand the input sequence from multiple perspectives. Multi-head attention processes the input sequence in parallel with multiple sets of self-attention modules; each head operates independently, performing the same computation as the standard self-attention mechanism. Afterward, the attention outputs from the heads are aggregated to obtain a weighted sum of the value vectors. This aggregation step allows the model to exploit information from multiple heads and capture the various patterns and relationships within the input sequence.

State Space

State space models (SSM) are a traditional mathematical framework for describing the dynamic behavior of a system over time. In recent years, SSMs have been widely applied in fields such as control theory, robotics, and economics. At its core, an SSM embodies the behavior of a system through a set of hidden variables called "states", enabling it to effectively capture dependencies in temporal data. Unlike RNNs, SSMs are linear models with the associative property. Specifically, a classical state space model builds two key equations (a state equation and an observation equation) to model the relationship between the input x and the output y at the current time t through an N-dimensional hidden state h(t); a sketch of these standard formulas is given below.
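For concreteness, here is a standard formulation of the two formulas referenced above: the scaled dot-product attention used by the Transformer, and the state and observation equations of a classical continuous-time SSM with an N-dimensional hidden state h(t). The notation follows the usual convention; the survey's exact symbols may differ slightly.

```latex
% Scaled dot-product attention (standard Transformer form)
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

% Classical continuous-time state space model:
% state equation and observation equation, hidden state h(t) \in \mathbb{R}^{N}
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)
```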

  • Discretization

To meet the needs of machine learning, SSMs must undergo a discretization process that converts continuous parameters into discrete ones. Generally, discretization methods aim to partition continuous time into K discrete intervals whose integration areas are as equal as possible. One of the most representative solutions adopted by SSMs is zero-order hold (ZOH), which assumes that the function value remains constant over the interval Δ = [t_{k−1}, t_k]; the resulting formulas are sketched below. Discrete SSMs are structurally similar to recurrent neural networks, which allows them to perform inference more efficiently than Transformer-based models.
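As a concrete illustration, here is the standard ZOH discretization commonly used for S4/Mamba-style SSMs (a standard formulation, not text quoted from the survey): given a step size Δ, the continuous parameters (A, B) become discrete parameters (Ā, B̄), yielding an RNN-like recurrence.

```latex
% Zero-order hold (ZOH) discretization with step size \Delta
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B

% Resulting discrete recurrence (structurally similar to an RNN)
h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k
```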

  • Convolutional Computation

Discrete SSMs are linear systems with the associative property, so they integrate seamlessly with convolutional computation. The relationships among RNNs, Transformers, and SSMs are illustrated in Figure 2.

On one hand, conventional RNNs operate within a nonlinear recurrent framework in which each computation depends only on the previous hidden state and the current input. Although this form allows RNNs to generate outputs quickly during autoregressive inference, it also prevents them from fully exploiting the parallel computing capabilities of GPUs, resulting in slower training.

On the other hand, the Transformer architecture performs matrix multiplications in parallel over multiple "query-key" pairs, which can be efficiently distributed across hardware resources, speeding up the training of attention-based models. However, when a Transformer-based model needs to generate responses or predictions, the inference process can be very time-consuming.

Unlike RNNs and Transformers, which each support only one type of computation, discrete SSMs are highly flexible: thanks to their linearity, they support both recurrent and convolutional computation. This allows SSMs to achieve efficient inference as well as parallel training. Note, however, that most conventional SSMs are time-invariant, meaning their A, B, C, and Δ do not depend on the model input x. This limits their capacity for context-aware modeling and leads to subpar performance on certain tasks, such as selective copying. A toy sketch contrasting the recurrent and convolutional views of a discrete SSM follows.
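To make the recurrence/convolution duality concrete, here is a minimal NumPy sketch written under simplifying assumptions (a single scalar input channel and already-discretized parameters A_bar, B_bar, C; the function names are illustrative, not from any library). It computes the same time-invariant SSM both ways and checks that the outputs match.

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, x):
    """Sequential (RNN-like) view: h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_k in x:                      # one step per token
        h = A_bar @ h + B_bar * x_k    # state update
        ys.append(C @ h)               # observation
    return np.array(ys)

def ssm_convolutional(A_bar, B_bar, C, x):
    """Parallel (convolutional) view: y = x * K with kernel K_j = C A_bar^j B_bar."""
    L = len(x)
    # Unrolled kernel: K = (C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ...)
    K = np.array([C @ np.linalg.matrix_power(A_bar, j) @ B_bar for j in range(L)])
    # Causal convolution of the scalar input sequence with the kernel
    return np.array([sum(K[j] * x[k - j] for j in range(k + 1)) for k in range(L)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, L = 4, 16                                        # state size, sequence length
    A_bar = 0.9 * np.eye(N) + 0.01 * rng.standard_normal((N, N))
    B_bar = rng.standard_normal(N)
    C = rng.standard_normal(N)
    x = rng.standard_normal(L)                          # scalar input sequence
    print(np.allclose(ssm_recurrent(A_bar, B_bar, C, x),
                      ssm_convolutional(A_bar, B_bar, C, x)))   # True
```

The convolutional view is what enables parallel training, while the recurrent view is what enables constant-memory, step-by-step inference.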

Mamba

To address the shortcomings of traditional SSMs and achieve context-aware modeling, Albert Gu and Tri Dao proposed Mamba, which can serve as a general-purpose backbone for sequence models; see Machine Heart's report "Five Times Throughput, Performance Fully Surpassing Transformers: The New Architecture Mamba Ignites the AI Circle". They subsequently proposed Mamba-2, whose Structured State Space Duality (SSD) builds a robust theoretical framework connecting structured SSMs with various forms of attention, allowing algorithms and system optimizations originally developed for Transformers to be transferred to SSMs; see Machine Heart's report "Back to Battle with Transformers! The Mamba 2 Led by the Original Authors Has Arrived, Significantly Improving Training Efficiency".

Mamba-1: Selective State Space Model with Hardware-Aware Algorithms

Mamba-1 introduces three major innovations on top of structured state space models: memory initialization based on high-order polynomial projection operators (HiPPO), a selection mechanism, and hardware-aware computation. As shown in Figure 3, these techniques aim to enhance the long-range, linear-time sequence modeling capabilities of SSMs. Specifically, the initialization strategy constructs a coherent hidden state matrix to effectively support long-range memory. The selection mechanism then allows the SSM to acquire content-aware representations. Finally, to improve training efficiency, Mamba also includes two hardware-aware computation algorithms: parallel associative scan and memory recomputation.

Mamba-2: State Space Duality

The Transformer has inspired the development of many techniques, such as parameter-efficient fine-tuning, mitigation of catastrophic forgetting, and model quantization. To let state space models benefit from these techniques originally developed for Transformers, Mamba-2 introduces a new framework: Structured State Space Duality (SSD). This framework theoretically connects SSMs with different forms of attention. Essentially, SSD shows that both the attention mechanism used in Transformers and the linear time-invariant systems used in SSMs can be viewed as semi-separable matrix transformations. Albert Gu and Tri Dao have also proven that selective SSMs are equivalent to a structured linear attention mechanism implemented with a semi-separable mask matrix.

Based on SSD, Mamba-2 designs a computation method that uses hardware more efficiently, employing a block-decomposition matrix multiplication algorithm. Specifically, by treating state space models as semi-separable matrices through this matrix transformation, Mamba-2 decomposes the computation into matrix blocks, where diagonal blocks represent intra-block computation and off-diagonal blocks represent inter-block computation carried through the SSM's hidden states. This makes Mamba-2's training 2-8 times faster than Mamba-1's parallel associative scan while remaining competitive with Transformers.

Mamba Blocks

Next, let's look at the block designs of Mamba-1 and Mamba-2; Figure 4 compares the two architectures.

Mamba-1's design is centered on the SSM, where the selective SSM layer performs the mapping from the input sequence X to Y. In this design, after an initial linear projection produces X, further linear projections produce (A, B, C). The input tokens and state matrices then pass through the selective SSM unit, which uses a parallel associative scan to obtain the output Y. Finally, Mamba-1 adopts a skip connection to encourage feature reuse and to alleviate the performance degradation often encountered during model training. Interleaving this block with standard normalization and residual connections yields the Mamba model.

Mamba-2 instead introduces an SSD layer that maps [X, A, B, C] to Y. This is achieved with a single projection at the start of the block that produces [X, A, B, C] simultaneously, similar to how standard attention architectures generate the Q, K, and V projections in parallel. In other words, by removing the sequential linear projections, the Mamba-2 block simplifies the Mamba-1 block, allowing the SSD structure to be computed faster than Mamba-1's parallel selective scan. In addition, to improve training stability, Mamba-2 adds a normalization layer after the skip connection. A minimal reference sketch of the selective scan, in its slow sequential form, follows.
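To make the selection mechanism concrete, here is a minimal, slow reference sketch of a selective scan in NumPy. It is not Mamba's actual implementation: the parameter names (W_B, W_C, w_delta) and shapes are illustrative assumptions, the discretization is simplified, and the real kernel fuses this loop into a hardware-aware parallel associative scan with recomputation.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan_reference(x, A, W_B, W_C, w_delta):
    """
    Slow sequential reference of a selective SSM scan (illustration only).
    x:       (L, D) input sequence, L steps, D channels
    A:       (D, N) diagonal-per-channel continuous state matrix (negative values)
    W_B:     (D, N) map from input to the input-dependent B_t
    W_C:     (D, N) map from input to the input-dependent C_t
    w_delta: (D,)   map from input to the input-dependent step size Delta_t
    returns  (L, D) output sequence
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                       # one N-dim state per channel
    ys = []
    for t in range(L):
        x_t = x[t]                             # (D,)
        # --- selection: Delta, B, C depend on the current input ---
        delta = softplus(x_t * w_delta)        # (D,)  positive step sizes
        B_t = x_t[:, None] * W_B               # (D, N)
        C_t = x_t[:, None] * W_C               # (D, N)
        # --- discretization (ZOH for A, simplified Euler-style for B) ---
        A_bar = np.exp(delta[:, None] * A)     # (D, N)
        B_bar = delta[:, None] * B_t           # (D, N)
        # --- recurrence and observation (elementwise, since A is diagonal) ---
        h = A_bar * h + B_bar * x_t[:, None]
        ys.append(np.sum(C_t * h, axis=-1))    # (D,)
    return np.stack(ys)

# Usage on random toy shapes
rng = np.random.default_rng(0)
L, D, N = 32, 8, 4
y = selective_scan_reference(
    rng.standard_normal((L, D)),
    -np.abs(rng.standard_normal((D, N))),      # keep A negative for stability
    rng.standard_normal((D, N)),
    rng.standard_normal((D, N)),
    rng.standard_normal(D),
)
print(y.shape)   # (32, 8)
```

Because Δ, B, and C change at every step, the system is time-varying and no longer admits a fixed convolution kernel, which is exactly why Mamba relies on a scan rather than a convolution for parallelism.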

Mamba Model is Advancing

The state space model and Mamba have developed rapidly in recent years, becoming a highly promising choice of backbone for foundation models. Although Mamba performs well on natural language processing tasks, compared with Transformer-based language models it still faces challenges such as memory loss, difficulty generalizing to different tasks, and weaker performance on complex patterns. To address these challenges, the research community has proposed numerous improvements to the Mamba architecture. Existing research mainly focuses on modifying the block design, the scanning pattern, and memory management. Table 1 categorizes and summarizes the relevant research.

Block Design

The design and structure of the Mamba block have a significant impact on the overall performance of Mamba models, making them a major research focus. As shown in Figure 5, based on the different approaches to constructing new Mamba blocks, existing research can be divided into three categories:

  • Integration methods: Integrating Mamba blocks with other models to achieve a balance between performance and efficiency;
  • Replacement methods: Replacing major layers in other model frameworks with Mamba blocks;
  • Modification methods: Modifying components within classic Mamba blocks.

Scanning Patterns

Parallel associative scan is a key component of the Mamba model. It aims to address the computational issues caused by the selection mechanism, speed up training, and reduce memory requirements. It achieves this by exploiting the linearity of time-varying SSMs to design kernel fusion and recomputation at the hardware level. However, Mamba's unidirectional sequence modeling paradigm is not well suited to comprehensively learning from diverse data such as images and videos. To mitigate this issue, some researchers have explored new, efficient scanning methods to improve the performance of Mamba models and ease their training. As shown in Figure 6, existing work on scanning patterns can be divided into two categories (a toy sketch of typical 2D flattening orders follows the list):

  • Flattened scanning methods: Viewing token sequences from a flattened perspective and processing model inputs based on this;
  • 3D scanning methods: Scanning model inputs across dimensions, channels, or scales, which can be further divided into three categories: hierarchical scanning, spatiotemporal scanning, and hybrid scanning.
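As an illustration of the flattened-scanning idea, the sketch below turns a 2D feature map into four 1D token sequences (row-major, column-major, and their reverses), in the spirit of the cross-scan-style patterns used by several visual Mamba variants. The function name and shapes are illustrative assumptions, not taken from any specific paper's code.

```python
import numpy as np

def flatten_scan_orders(feat):
    """feat: (H, W, C) feature map -> list of four (H*W, C) token sequences."""
    H, W, C = feat.shape
    row_major = feat.reshape(H * W, C)                     # left-to-right, top-to-bottom
    col_major = feat.transpose(1, 0, 2).reshape(H * W, C)  # top-to-bottom, left-to-right
    # Forward and reversed traversals give the unidirectional SSM four "views"
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

feat = np.arange(2 * 3 * 1).reshape(2, 3, 1)   # tiny 2x3 map, 1 channel
for seq in flatten_scan_orders(feat):
    print(seq.ravel())
# [0 1 2 3 4 5], [5 4 3 2 1 0], [0 3 1 4 2 5], [5 2 4 1 3 0]
```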

Memory Management

Similar to RNNs, in state space models the memory held in the hidden state effectively stores information from previous steps, so it has a critical impact on the overall performance of SSMs. Although Mamba introduced HiPPO-based methods for memory initialization, managing the memory inside SSM units remains challenging, including transferring hidden information across layers and achieving lossless memory compression. To address this, some pioneering studies have proposed various solutions, including memory initialization, compression, and connection.

Adapting Mamba to Diverse Data

The Mamba architecture extends the selective state space model and possesses the fundamental characteristics of recurrent models, making it well suited as a general foundation model for sequential data such as text, time series, and speech. Moreover, recent pioneering research has expanded the application scope of the Mamba architecture so that it can handle not only sequential data but also images and graphs, as shown in Figure 7. The goal of these studies is to fully exploit Mamba's excellent ability to capture long-range dependencies while preserving its efficiency advantages during learning and inference. Table 2 briefly summarizes these research outcomes.

Sequential Data

Sequential data is collected and organized in a specific order, where the order of the data points is significant. This review comprehensively summarizes Mamba's applications across sequential data types, including natural language, video, time series, speech, and human motion data; for details, please refer to the original paper.

Non-Sequential Data

In contrast to sequential data, non-sequential data does not follow a specific order: its data points can be arranged in any order without substantially changing the meaning of the data. This lack of inherent order poses challenges for recurrent models (RNNs and SSMs) that are specifically designed to capture temporal dependencies. Surprisingly, recent research has successfully enabled Mamba, a representative SSM, to process non-sequential data efficiently, including images, graphs, and point clouds.

Multimodal Data

To enhance AI's perception and scene understanding capabilities, data from multiple modalities can be integrated, such as language (sequential data) and images (non-sequential data); this integration provides highly valuable, complementary information. Recently, multimodal large language models (MLLMs) have become the most prominent research hotspot; these models inherit the powerful capabilities of large language models (LLMs), including strong language expression and logical reasoning. Although the Transformer dominates this field, Mamba is rising as a strong competitor: it performs excellently at aligning mixed-source data and achieves linear complexity in sequence length, making it a potential replacement for Transformers in multimodal learning.

Applications

Below are some noteworthy applications based on Mamba models. The team has grouped these applications into the following categories: natural language processing, computer vision, speech analysis, drug discovery, recommendation systems, and robotics and autonomous systems. We will not elaborate further here; please refer to the original paper for details.

Challenges and Opportunities

Although Mamba has achieved outstanding performance in some areas, overall, Mamba research is still in its infancy, and there are still challenges to overcome. Of course, these challenges also present opportunities.

  • How to develop and improve foundational models based on Mamba;
  • How to fully realize hardware-aware computing to leverage GPUs and TPUs as much as possible, enhancing model efficiency;
  • How to improve the reliability of Mamba models, which requires further research on safety and robustness, fairness, interpretability, and privacy;
  • How to apply new technologies from the Transformer domain to Mamba, such as parameter-efficient fine-tuning, mitigating catastrophic forgetting, and retrieval-augmented generation (RAG).

Technical Exchange Group Invitation



Scan the QR code to add the assistant on WeChat

Please note: Name-School/Company-Research Direction (e.g., Xiao Zhang-Harbin Institute of Technology-Dialogue System) to apply to join technical exchange groups such as Natural Language Processing/Pytorch

About Us

MLNLP Community is a grassroots academic community jointly built by scholars in machine learning and natural language processing at home and abroad. It has developed into a well-known machine learning and natural language processing community, aiming to promote progress between academia and industry in these fields. The community provides an open communication platform for further study, employment, and research for practitioners in related fields. Everyone is welcome to follow and join us.
