
Mamba (Gu and Dao, 2023)

Gu and Dao's "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" introduces a selective state-space layer and a homogeneous language-model block that uses it instead of attention and even instead of a separate MLP block. The paper's core claim is that previous subquadratic sequence models failed on language partly because their dynamics were time-invariant: they could remember by position, but not choose what to remember based on token content.

Mamba is a turning point in this sequence. Hyena shows that attention-free long convolutions can be competitive, and RWKV shows that scaled recurrent language models are viable. Mamba combines a recurrent state-space view, input-dependent selection, and a hardware-aware parallel scan to produce a linear-time model that competes strongly with Transformers at language-model scale.

Definitions

Problem and motivation. A fixed convolution kernel or time-invariant recurrence treats the same relative position in the same way regardless of content. That is bad for tasks such as selective copying: the model must ignore many irrelevant tokens and preserve only tokens marked by content. Attention handles this by constructing pairwise data-dependent scores. Mamba seeks a fixed-size recurrent state that is still content-selective.

A continuous state-space model is

\begin{aligned} h'(t) &= A h(t) + B x(t),\\ y(t) &= C h(t). \end{aligned}

After discretization, a sequence model can be written

\begin{aligned} h_t &= \overline{A} h_{t-1} + \overline{B} x_t,\\ y_t &= C h_t. \end{aligned}

Traditional structured SSMs such as S4 keep A, B, C, and the step size Δ fixed across time, making the model linear time-invariant. That enables convolutional computation but limits content-dependent behavior.
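To make the time-invariance point concrete, here is a minimal sketch (toy scalar parameters, not from the paper) showing that a fixed discrete recurrence unrolls into a causal convolution with kernel k_j = C Ā^j B̄, which is exactly what lets LTI SSMs train convolutionally:

```python
import torch

# Toy LTI SSM: fixed scalar parameters a_bar, b_bar, c at every timestep.
a_bar, b_bar, c = 0.5, 1.0, 2.0
xs = [1.0, 0.0, 3.0, -2.0]

# Unrolled recurrence: h_t = a_bar * h_{t-1} + b_bar * x_t, y_t = c * h_t.
h, ys_rec = 0.0, []
for x_t in xs:
    h = a_bar * h + b_bar * x_t
    ys_rec.append(c * h)

# Same outputs from a causal convolution with kernel k_j = c * a_bar**j * b_bar.
k = [c * a_bar ** j * b_bar for j in range(len(xs))]
ys_conv = [sum(k[j] * xs[t - j] for j in range(t + 1)) for t in range(len(xs))]

print(torch.allclose(torch.tensor(ys_rec), torch.tensor(ys_conv)))  # True
```

Once the parameters depend on t, this kernel no longer exists, which is why Mamba needs the scan described below the selective parameterization.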

Mamba's selective SSM makes some parameters functions of the input token:

\Delta_t = \tau_\Delta(W_\Delta x_t), \qquad B_t = W_B x_t, \qquad C_t = W_C x_t.

With a diagonal A, the discrete update becomes

h_t = \overline{A}_t h_{t-1} + \overline{B}_t x_t, \qquad y_t = C_t h_t,

where \overline{A}_t and \overline{B}_t depend on \Delta_t, A, and B_t. The common zero-order-hold form is

\overline{A}_t = \exp(\Delta_t A).

The key point is not the exact discretization alone; it is that Δ_t, B_t, and C_t let each token decide how much state to reset, write, and read.
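A quick numerical sketch of the reset-versus-keep role of Δ_t under zero-order hold (the decay rate a = -1 is illustrative, not a value from the paper):

```python
import torch

# Zero-order hold: a_bar = exp(delta * a), with a < 0 for stability.
# a = -1 here is illustrative, not from the paper.
a = torch.tensor(-1.0)

for delta in [0.01, 1.0, 10.0]:
    a_bar = torch.exp(torch.tensor(delta) * a)
    print(f"delta={delta:5.2f}  a_bar={a_bar.item():.5f}")

# Small delta: a_bar near 1, so the token barely disturbs the state (ignore).
# Large delta: a_bar near 0, so the token wipes the state and writes itself (reset).
```

Because Δ_t is produced from the token itself, each position picks its own point on this ignore-to-reset spectrum.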

Key results

Method. Mamba removes the time-invariance constraint but recovers efficient training through a hardware-aware scan. A recurrence of the form

h_t = a_t \odot h_{t-1} + b_t

can be parallelized using associative composition of affine maps:

(a_2, b_2) \circ (a_1, b_1) = (a_2 \odot a_1,\; a_2 \odot b_1 + b_2).

This means training can cover full sequences without a strictly sequential token-by-token loop. The paper's implementation fuses discretization, scan, and output projection in GPU SRAM where possible, avoiding materializing the expanded state in high-bandwidth memory. It also uses recomputation in the backward pass to keep activation memory comparable to optimized attention implementations.
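The affine composition above can be turned into a working prefix scan. The sketch below uses a Hillis-Steele inclusive scan for clarity; it illustrates the associativity argument only and is not the paper's fused kernel, which also handles state expansion and SRAM locality:

```python
import torch

def sequential_scan(a, b):
    # Reference recurrence h_t = a_t * h_{t-1} + b_t, with h_0 = 0.
    h, hs = torch.zeros(()), []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        hs.append(h)
    return torch.stack(hs)

def hillis_steele_scan(a, b):
    # Inclusive scan over affine maps using the composition
    # (a2, b2) o (a1, b1) = (a2*a1, a2*b1 + b2).
    # log2(T) rounds, each round fully elementwise-parallel.
    A, B = a.clone(), b.clone()
    d = 1
    while d < len(a):
        A_next, B_next = A.clone(), B.clone()
        A_next[d:] = A[d:] * A[:-d]          # compose decay factors
        B_next[d:] = A[d:] * B[:-d] + B[d:]  # carry the offset forward
        A, B = A_next, B_next
        d *= 2
    return B  # with h_0 = 0, the composed offset at position t is exactly h_t

a = torch.rand(8) * 0.9   # per-step decay factors
b = torch.randn(8)        # per-step inputs
print(torch.allclose(sequential_scan(a, b), hillis_steele_scan(a, b), atol=1e-6))  # True
```

The sequential loop takes T dependent steps; the scan takes log2(T) rounds of independent elementwise work, which is what the GPU implementation exploits.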

The Mamba block merges the SSM-style sequence mixer and a gated MLP-like structure. A simplified block expands the input, applies a short depthwise convolution and SiLU activation, runs the selective scan on the main branch, gates it with another branch, then projects back to the model dimension. The paper emphasizes a homogeneous stack of Mamba blocks rather than alternating attention, SSM, and MLP modules.
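A schematic of this block's data flow, with an ordinary linear layer standing in where the selective scan would run; the class name, expansion factor, and kernel width here are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    # Data-flow sketch only: the real block runs a selective SSM where
    # `mixer` appears, with fused kernels. Expansion and kernel width are illustrative.
    def __init__(self, dim, expand=2, conv_kernel=4):
        super().__init__()
        inner = expand * dim
        self.in_proj = nn.Linear(dim, 2 * inner)   # main branch + gate branch
        self.conv = nn.Conv1d(inner, inner, conv_kernel,
                              groups=inner, padding=conv_kernel - 1)  # short depthwise conv
        self.mixer = nn.Linear(inner, inner)       # placeholder for the selective scan
        self.out_proj = nn.Linear(inner, dim)

    def forward(self, x):                          # x: (batch, length, dim)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        # Depthwise conv over time, trimmed to keep causality and length.
        u = self.conv(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        u = F.silu(u)
        u = self.mixer(u)                          # selective SSM would run here
        return self.out_proj(u * F.silu(gate))     # gate with the second branch, project back

x = torch.randn(2, 16, 32)
print(MambaBlockSketch(32)(x).shape)  # torch.Size([2, 16, 32])
```

Stacking copies of this one block (with normalization and residuals around it) gives the homogeneous architecture the paper describes.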

Architecture details and hyperparameters. The paper uses real-valued diagonal SSMs as the default, with state expansion inside the selective SSM. It uses SiLU/Swish activations, a short convolution before the SSM path, RMSNorm-like modern training choices in the improved recipe, and no separate attention or MLP blocks in the standard Mamba architecture. Scaling-law models mirror GPT-3 sizes: examples include about 125M, 350M, 760M, and 1.3B parameters, with depths and widths chosen similarly to Transformer baselines. The appendix reports Pile scaling-law training with GPT-2 tokenizer, AdamW, gradient clipping 1.0, weight decay 0.1, no dropout, linear warmup, and cosine decay.

Benchmarks. On synthetic selective copying, the paper reports that S6-style selectivity solves the task where nonselective S4 or gated-but-time-invariant variants struggle. On induction-head synthetic tests, Mamba trained at length 256 generalizes to much longer lengths, including million-token tests in the reported table.

For language modeling, the paper reports that Mamba is the first attention-free model in their study to match a strong modern Transformer++ recipe in scaling-law experiments, especially as context length grows. The zero-shot table reports Mamba best in its size class on the listed evaluations. For example, Mamba-2.8B reports average accuracy around 63.3 across the paper's selected tasks, compared with Pythia-2.8B around 59.1 and RWKV-3B around 59.6 in the same table. The abstract also reports roughly 5 times generation throughput versus similar-size Transformers and that Mamba-3B matches Transformer baselines about twice its size. The paper further reports strong results in DNA and audio modeling, with performance improving on real data up to million-length sequences.

Visual

| Model family | Dynamics | Training path | Inference memory | Content selection |
| --- | --- | --- | --- | --- |
| S4-style SSM | Time-invariant | Convolution or recurrence | Fixed state | Limited |
| Hyena | Long convolution plus gates | FFT convolution | No KV cache | Data-controlled gates |
| RWKV | Decayed WKV recurrence | Parallelizable recurrence | Fixed state | Channelwise decay and receptance |
| Mamba | Input-dependent SSM | Fused selective scan | Fixed state | Δ_t, B_t, C_t from tokens |
| Transformer | Attention scores | Matrix attention | Growing KV cache | Pairwise token scores |

Worked example 1: selective recurrence that chooses what to keep

Problem: use a scalar selective recurrence

h_t = a_t h_{t-1} + b_t x_t, \qquad y_t = h_t,

with h_0 = 0. Suppose three tokens have

x = [10, 99, 20],

and the model sets

(a, b)_1 = (0, 1), \quad (a, b)_2 = (1, 0), \quad (a, b)_3 = (0.5, 1).

Compute the state.

  1. Token 1 is selected and resets the state:
     h_1 = 0 \cdot 0 + 1 \cdot 10 = 10.
  2. Token 2 is ignored:
     h_2 = 1 \cdot 10 + 0 \cdot 99 = 10.
  3. Token 3 is selected while keeping half of the old state:
     h_3 = 0.5 \cdot 10 + 1 \cdot 20 = 25.

Check: a time-invariant recurrence would use the same a and b at all steps. Here the second token can be ignored because its parameters are input-dependent.
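The three steps above, checked numerically:

```python
# Selective recurrence h_t = a_t * h_{t-1} + b_t * x_t, with h_0 = 0.
xs = [10.0, 99.0, 20.0]
ab = [(0.0, 1.0), (1.0, 0.0), (0.5, 1.0)]   # per-token (a_t, b_t)

h, states = 0.0, []
for x_t, (a_t, b_t) in zip(xs, ab):
    h = a_t * h + b_t * x_t
    states.append(h)

print(states)  # [10.0, 10.0, 25.0]
```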

Worked example 2: associative scan composition

Problem: compose three scalar affine updates

h_t = a_t h_{t-1} + b_t

with

(a_1, b_1) = (0.5, 2), \quad (a_2, b_2) = (0.25, 1), \quad (a_3, b_3) = (2, -1).

Find the combined map from h_0 to h_3.

  1. Compose step 2 after step 1:
     a_{21} = a_2 a_1 = 0.25 \cdot 0.5 = 0.125, \qquad b_{21} = a_2 b_1 + b_2 = 0.25 \cdot 2 + 1 = 1.5.
     So after two steps, h_2 = 0.125 h_0 + 1.5.
  2. Compose step 3 after the two-step map:
     a_{321} = a_3 a_{21} = 2 \cdot 0.125 = 0.25, \qquad b_{321} = a_3 b_{21} + b_3 = 2 \cdot 1.5 - 1 = 2.
  3. Therefore h_3 = 0.25 h_0 + 2.

Check by direct recurrence with h_0 = 4:

h_1 = 0.5 \cdot 4 + 2 = 4, \quad h_2 = 0.25 \cdot 4 + 1 = 2, \quad h_3 = 2 \cdot 2 - 1 = 3.

The composed map gives

0.25 \cdot 4 + 2 = 3.
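The same composition and check, in code:

```python
# Associative composition from this example, checked against the direct recurrence.
def compose(later, earlier):
    # (a2, b2) o (a1, b1) = (a2 * a1, a2 * b1 + b2)
    a2, b2 = later
    a1, b1 = earlier
    return (a2 * a1, a2 * b1 + b2)

steps = [(0.5, 2.0), (0.25, 1.0), (2.0, -1.0)]
a, b = steps[0]
for step in steps[1:]:
    a, b = compose(step, (a, b))
print(a, b)  # 0.25 2.0, i.e. h_3 = 0.25 * h_0 + 2

h0 = 4.0
h = h0
for a_t, b_t in steps:
    h = a_t * h + b_t
print(h, a * h0 + b)  # 3.0 3.0
```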

Code

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySelectiveSSM(nn.Module):
    def __init__(self, dim, state_dim):
        super().__init__()
        # Diagonal A, parameterized so that a = -exp(a_log) stays negative.
        self.a_log = nn.Parameter(-torch.arange(1, state_dim + 1).float())
        self.to_delta = nn.Linear(dim, dim)
        self.to_b = nn.Linear(dim, dim * state_dim)
        self.to_c = nn.Linear(dim, dim * state_dim)
        self.state_dim = state_dim

    def forward(self, x):
        # Pedagogical sequential scan. Real Mamba uses a fused parallel kernel.
        batch, length, dim = x.shape
        delta = F.softplus(self.to_delta(x))  # positive step sizes, (batch, length, dim)
        b = self.to_b(x).view(batch, length, dim, self.state_dim)
        c = self.to_c(x).view(batch, length, dim, self.state_dim)
        a = -torch.exp(self.a_log)            # (state_dim,), broadcasts over batch and dim

        h = x.new_zeros(batch, dim, self.state_dim)
        ys = []
        for t in range(length):
            # Zero-order hold: a_bar has shape (batch, dim, state_dim).
            a_bar = torch.exp(delta[:, t].unsqueeze(-1) * a)
            h = a_bar * h + b[:, t] * x[:, t].unsqueeze(-1)  # selective write
            ys.append((c[:, t] * h).sum(dim=-1))             # selective read
        return torch.stack(ys, dim=1)

x = torch.randn(2, 16, 32)
layer = TinySelectiveSSM(dim=32, state_dim=8)
print(layer(x).shape)  # torch.Size([2, 16, 32])

Common pitfalls

  • Describing Mamba as merely "linear attention." It is a selective state-space model with input-dependent parameters and a scan implementation.
  • Forgetting why ordinary SSMs were efficient. Time-invariance enabled convolution; Mamba gives that up and recovers efficiency through a fused scan.
  • Treating selection as generic gating. In this paper, selection specifically controls propagation along the sequence dimension.
  • Assuming fixed-state models can recall everything. Mamba is much stronger than earlier fixed-state models, but exact retrieval can still motivate hybrids such as Griffin and Jamba.
  • Copying the pedagogical scan code into production. The paper's speed depends on hardware-aware fused kernels, SRAM locality, and recomputation.
  • Comparing benchmark numbers without token budget and recipe. The paper compares against strong Transformer++ recipes and same-dataset baselines where possible.

Connections

  • Generalizes the long-sequence concern in Attention Is All You Need by eliminating the KV cache and quadratic attention matrix.
  • Builds on Hyena and H3, but replaces implicit long convolution with selective state-space recurrence.
  • Can be contrasted with RWKV, another recurrent language model whose WKV recurrence is less explicitly input-selective.
  • Motivates the hybrid designs in Griffin and Jamba.
  • Related D2L pages: Sequence Modeling and RNNs, Gated RNNs and Sequence-to-Sequence, and Attention and Transformers.
  • Further reading: S4 and S5 for structured SSM background, H3 and RetNet for recurrent/linear-attention architectures, FlashAttention for hardware-aware attention, HyenaDNA for long genomic sequences, and MoE-Mamba for sparse expert variants.