Mixture-of-Mamba (MoM) is an architecture designed to enhance the performance of State Space Models (SSMs) in handling multiple modalities such as text, images, and speech. MoM builds on the strengths of SSMs, primarily the Mamba model, while addressing their shortcomings on diverse data types. This article explores the logic behind SSMs and how MoM handles multimodal data efficiently.
Table of Contents
- Understanding State Space Models (SSMs)
- Deep Dive into Mixture-of-Mamba
- Performance and Efficiency Benchmarks
- Final Words
Understanding State Space Models (SSMs)
State Space Models (SSMs), like Transformer and RNN architectures, process sequences of information such as text, but also continuous signals. They represent a dynamic system in terms of its input, output, and state variables. SSMs were originally developed for signal processing but are now used in other domains such as deep learning and genomics.
A state space mathematically represents a problem by defining a system's possible states. In machine learning, for instance, a state space can represent a data sequence without requiring explicit access to every previous data point: an SSM summarizes the past into a compact state and uses it, together with the current input, to predict the next state and the output.
Mathematically, a linear, time-invariant SSM (where t denotes time) will:
- Map an input sequence x(t)
- To a latent state representation h(t)
- And derive the predicted output sequence y(t)
State Equation: h'(t) = Ah(t) + Bx(t)
Output Equation: y(t) = Ch(t) + Dx(t)
The state equation describes how the state evolves under the influence of the input: A governs the internal state dynamics, and B determines how the input drives the state. The output equation describes how the state is translated into the output through C, and how the input directly influences the output through D. Combining the two equations gives the standard block diagram representation of the linear state-space equations, in which the input flows through B into the state, the state evolves through A, and the output is read out through C with a direct feed-through path via D.
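To make these equations concrete, here is a minimal NumPy sketch of a linear time-invariant SSM that steps through an input sequence using a simple Euler discretization. The matrices A, B, C, D and the step size dt are hypothetical illustrative values, not taken from the article.

```python
import numpy as np

def linear_ssm(x, A, B, C, D, dt=0.01):
    """Simulate h'(t) = A h(t) + B x(t), y(t) = C h(t) + D x(t)
    using a simple Euler discretization: h[k+1] = h[k] + dt * (A h[k] + B x[k])."""
    h = np.zeros(A.shape[0])            # latent state h(t), initialized to zero
    outputs = []
    for x_t in x:                       # walk the input sequence step by step
        h = h + dt * (A @ h + B @ x_t)  # state equation (discretized)
        y_t = C @ h + D @ x_t           # output equation
        outputs.append(y_t)
    return np.stack(outputs)

# Tiny example: 1-D input, 2-D latent state, 1-D output (hypothetical values).
A = np.array([[-0.5, 0.0], [0.0, -0.1]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 1.0]])
D = np.array([[0.0]])
x = np.sin(np.linspace(0, 2 * np.pi, 50)).reshape(-1, 1)  # input sequence x(t)
y = linear_ssm(x, A, B, C, D)
print(y.shape)  # (50, 1)
```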
Deep Dive into Mixture-of-Mamba
Mixture-of-Mamba (MoM) is a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. It extends the advantages of modality-aware sparsity to SSMs while preserving their computational efficiency. Mamba, introduced in late 2023, is an SSM variant that has demonstrated strong performance and scalability across various tasks using gating mechanisms and selective state-space scanning. Despite these advantages, SSMs, including Mamba, remain inherently dense, applying the same set of parameters to all input tokens regardless of modality.
Multi-modal pre-training on interleaved text and image data
This kind of uniform parameterization limits the ability to capture modality-specific features, leading to poor performance in multi-modal pre-training. A common approach to addressing these limitations is model sparsity, exemplified by Mixture-of-Experts (MoE). MoE reduces the computational load by activating only a subset of model components for each input token, allowing experts to specialize in specific aspects of the data. However, MoE architectures face problems of their own, such as imbalanced expert utilization, bi-level optimization instability, and inefficient load balancing. These challenges motivated the search for an alternative sparse architecture that is computationally efficient and easier to optimize.
MoM instead introduces modality-aware sparsity directly into the Mamba block itself. The approach is inspired by Mixture-of-Transformers: the model selects modality-specific weights in every input-processing component of Mamba, which provides both stability and efficiency in multi-modal pre-training.
Architecture Diagrams (from left) – Transformer, MoE Transformer, Mamba, MoE-Mamba
To evaluate MoM, experiments were conducted across three multi-modal pre-training settings: Transfusion, Chameleon, and Three-Modality. Transfusion trains on interleaved text and continuous image tokens using a diffusion loss for the image side; the model learns to generate images by progressively refining them from noise while also modeling the associated text. The Chameleon approach interleaves text with discrete image tokens, enabling the model to associate specific text with distinct visual elements. Finally, the Three-Modality approach extends the idea further to include speech data alongside text and images, so that the model learns to understand and relate information from three different modalities.
In the Transfusion multi-modal pre-training setting, MoM achieves equivalent image loss using only 34.76% of the training FLOPs (floating-point operations) at the 1.4B scale. In the Chameleon setting, MoM reaches similar image loss with just 42.50% of the FLOPs and similar text loss with only 65.40% of the FLOPs at the 1.4B scale. In the Three-Modality pre-training setting, MoM matches speech loss using only 24.80% of the FLOPs at the 1.4B scale while maintaining strong performance across the image and text modalities.
SSMs like Mamba are known for being more efficient than Transformer models. MoM makes SSMs even better suited to multimodal tasks, potentially leading to faster and less computationally expensive models. Its modality-aware sparsity, where the model uses different parameters for different modalities, improves overall multi-modal performance while providing substantial computational savings across the different pre-training settings.
MoM is composed of homogeneous MoM blocks in which modality-specific parameterization is applied to all projections that explicitly process input features belonging to a single modality, including the input projection (Win_proj), the intermediate projections (Wx_proj & Wdt_proj), and the output projection (Wout_proj). The Conv1D and the state transitions A remain shared because they operate across multiple features or on aggregated RNN-like states, where modality is not well-defined.
Original Mamba block and Mixture-of-Mamba block
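The sketch below illustrates the modality-specific parameterization idea in PyTorch. It is a simplified stand-in rather than the authors' implementation: the selective scan and the intermediate projections (Wx_proj, Wdt_proj) are omitted for brevity, and names such as MixtureOfMambaProjections and modality_ids are assumptions made for illustration. Tokens are routed to the projection matching their modality, while the Conv1D (and, in a full block, the state transition A) stays shared.

```python
import torch
import torch.nn as nn

class MixtureOfMambaProjections(nn.Module):
    """Simplified sketch: modality-specific projections around a shared Conv1D path."""
    def __init__(self, d_model, d_inner, num_modalities=2):
        super().__init__()
        # One copy of each projection per modality (standing in for W_in_proj / W_out_proj).
        self.in_proj  = nn.ModuleList(nn.Linear(d_model, d_inner) for _ in range(num_modalities))
        self.out_proj = nn.ModuleList(nn.Linear(d_inner, d_model) for _ in range(num_modalities))
        # Shared component: the causal Conv1D stays modality-agnostic.
        self.conv1d = nn.Conv1d(d_inner, d_inner, kernel_size=4, padding=3, groups=d_inner)

    def route(self, projections, x, modality_ids):
        """Apply the projection that matches each token's modality id (hard routing, no gating)."""
        out = torch.zeros(x.shape[:-1] + (projections[0].out_features,), device=x.device)
        for m, proj in enumerate(projections):
            mask = modality_ids == m          # tokens belonging to modality m
            out[mask] = proj(x[mask])
        return out

    def forward(self, x, modality_ids):
        # x: (batch, seq_len, d_model); modality_ids: (batch, seq_len) with 0=text, 1=image, ...
        h = self.route(self.in_proj, x, modality_ids)
        h = self.conv1d(h.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # shared causal conv
        # ... in a full Mamba block, the selective SSM scan with shared A would go here ...
        return self.route(self.out_proj, h, modality_ids)

# Usage with hypothetical sizes: 8 tokens, first half text (0), second half image (1).
block = MixtureOfMambaProjections(d_model=64, d_inner=128)
x = torch.randn(1, 8, 64)
ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
print(block(x, ids).shape)  # torch.Size([1, 8, 64])
```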
Following the Transfusion recipe, MoM is trained on interleaved multi-modal sequences of discrete text tokens and continuous image tokens using a combined objective that incorporates both language modeling and diffusion-based image generation. The diffusion process follows the Denoising Diffusion Probabilistic Model (DDPM), in which Gaussian noise is progressively added to the latent image patches during the forward process.
The conditioning for the image generation is embedded in the interleaved sequence. When denoising image patches, the preceding tokens serve as the context for conditional generation. This approach enables MoM to use modality-aware sparsity for efficiently modeling both local intra-image dependencies and long-range inter-modal relationships across the sequence.
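As a rough sketch of how such a combined objective can be assembled (the function name, loss weights, and tensor shapes below are hypothetical simplifications, not the Transfusion implementation), the text positions contribute a next-token cross-entropy term while the noised image patches contribute a DDPM-style noise-prediction term:

```python
import torch
import torch.nn.functional as F

def combined_transfusion_loss(text_logits, text_targets, predicted_noise, true_noise,
                              lm_weight=1.0, diffusion_weight=1.0):
    """Combined objective: language-modeling loss on text tokens plus
    DDPM-style noise-prediction (MSE) loss on noised image patches."""
    # Next-token prediction over the text positions of the interleaved sequence.
    lm_loss = F.cross_entropy(text_logits.reshape(-1, text_logits.shape[-1]),
                              text_targets.reshape(-1))
    # DDPM objective: predict the Gaussian noise that was added to the image latents.
    diffusion_loss = F.mse_loss(predicted_noise, true_noise)
    return lm_weight * lm_loss + diffusion_weight * diffusion_loss

# Hypothetical shapes: 16 text positions over a 1,000-token vocabulary,
# and 4 image patches of dimension 32 in the same interleaved sequence.
text_logits = torch.randn(16, 1000)
text_targets = torch.randint(0, 1000, (16,))
true_noise = torch.randn(4, 32)        # noise added in the forward diffusion step
predicted_noise = torch.randn(4, 32)   # what the model predicts from the noisy patches
print(combined_transfusion_loss(text_logits, text_targets, predicted_noise, true_noise))
```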
An alternative, unified representation strategy was also explored, in which both text and image modalities are represented as discrete tokens. Using the Chameleon framework, image data is treated as sequences of discrete tokens obtained from a pre-trained VQ-VAE (Vector Quantised-Variational AutoEncoder). In this approach, each image is encoded into a fixed number of tokens by quantizing its latent features against a learned codebook.
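To illustrate the quantization step (this is a generic VQ-VAE-style nearest-codebook lookup, not the actual Chameleon tokenizer), the sketch below maps each continuous latent patch vector to the index of its closest codebook entry, yielding the discrete image tokens that are then interleaved with text tokens:

```python
import torch

def quantize_to_tokens(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry (VQ-VAE-style)."""
    # latents: (num_patches, latent_dim); codebook: (codebook_size, latent_dim)
    distances = torch.cdist(latents, codebook)   # pairwise L2 distances
    return distances.argmin(dim=-1)              # one discrete token id per patch

# Hypothetical sizes: 64 image patches quantized against a 1,024-entry codebook.
codebook = torch.randn(1024, 16)
latents = torch.randn(64, 16)
image_tokens = quantize_to_tokens(latents, codebook)   # shape (64,), values in [0, 1024)
print(image_tokens.shape, image_tokens.max().item())
```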
These tokens are then arranged sequentially, just like text tokens, resulting in a uniform discrete representation across both modalities. During training, text and image tokens are processed with the same autoregressive objective, where the model learns to predict the next token in the sequence given all previous tokens. Using discrete tokens for images simplifies training and aligns with the sequence-modeling nature of MoM. Together, these experiments demonstrate the efficiency of MoM, which outperforms Mamba Dense models under both the mixed-objective (Transfusion) and uniform-representation (Chameleon) settings.
Performance and Efficiency Benchmarks
MoM is evaluated against Mamba Dense and a Flex-Attention Transformer in the Transfusion setting, where pre-training is performed on interleaved text and image data at three model sizes: 163M, 760M, and 1.4B. The performance gain is expressed as the relative improvement in final loss:
Δ (%) = (LossDense - LossMixture) / LossDense × 100
where LossDense and LossMixture are the final losses of Mamba Dense and MoM, respectively.
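As a quick worked example, the gain can be computed directly from a pair of final losses; the dense baseline below is back-calculated from the reported 2.20% image-loss improvement at 1.4B and is only illustrative, not a value from the paper:

```python
def relative_gain(loss_dense, loss_mixture):
    """Performance gain of MoM over Mamba Dense, as a percentage of the dense loss."""
    return (loss_dense - loss_mixture) / loss_dense * 100

# 0.2138 is MoM's reported 1.4B image training loss; 0.2186 is an illustrative
# dense baseline implied by the reported ~2.20% gain.
print(f"{relative_gain(loss_dense=0.2186, loss_mixture=0.2138):.2f}%")  # ≈ 2.20%
```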
MoM achieves a lower image-modality training loss than Mamba Dense across all model scales. At the 1.4B scale, MoM reaches a training loss of 0.2138, outperforming Mamba Dense by 2.20% while requiring only 34.76% of the training FLOPs. In the text modality, MoM achieves lower validation losses on both the C4 (2.2695) and Wikipedia (1.7164) datasets compared to Mamba Dense, despite similar training losses. Taking image and text modalities together, at the 1.4B scale MoM improves the overall training loss by 0.84% while needing only 83.10% of the training FLOPs.
Transfusion Setting Benchmarks
When evaluated in the Chameleon setting on the image modality, MoM achieved a training loss of 5.0591 at the 1.5B scale, a 2.51% improvement over Mamba Dense, while needing only 42.50% of the training FLOPs. On the text modality, MoM achieved a training loss of 2.1614 at the 1.5B scale, a 3.01% improvement, while requiring only 65.40% of the training FLOPs.
In the Three-Modality evaluation, MoM improved the speech training loss by 5.75% at the 1.5B scale, matching Mamba Dense's loss with just 24.80% of the training FLOPs, while maintaining strong performance on the image and text modalities.
Three-Modality Setting Benchmarks
Final Words
Mixture-of-Mamba extends state-space models with modality-aware sparsity through modality-specific parameterization. Across three multi-modal settings, namely Transfusion, Chameleon, and Three-Modality, MoM outperforms dense Mamba and shows substantial reductions in training loss at a fraction of the compute. These experiments establish MoM as a scalable and efficient architecture for multi-modal pre-training and open avenues for further research and development in multi-modal AI.