Efficient Sequence Modeling: Linear Attention, SSMs, and Hybrids
Full self-attention is the reference token mixer for modern sequence models, but its cost grows quadratically with sequence length. That is acceptable for many sentences and short documents. It becomes a bottleneck for long documents, code repositories, audio, genomics, video, high-resolution image patches, and any decoder that must keep a large key-value cache during generation.

Figure: ELIZA provides historical context for dialogue systems and chatbot evaluation. Image: Wikimedia Commons, Unknown author, public domain text.
Efficient sequence modeling asks which parts of attention are essential. The answer is not a single architecture. Some methods replace attention with long convolutions, some use linearized attention or recurrent key-value summaries, some use state-space recurrences, and production models often hybridize these ideas with a small amount of attention. The design space is a tradeoff among quality, exact retrieval, training parallelism, inference memory, hardware kernels, and how much history must be remembered exactly.
The quadratic attention bottleneck
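For a sequence of length $T$ and model width $d$, scaled dot-product self-attention forms
$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V,\qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$
where $d_k$ is the per-head dimension.
For each head, $QK^\top$ has shape $T \times T$. The score matrix therefore costs $O(T^2)$ memory to materialize and roughly $O(T^2 d_k)$ compute to apply. During autoregressive inference, a Transformer can avoid recomputing old keys and values by storing a KV cache, but that cache grows linearly with generated length:
$$\text{KV cache entries} = 2 \cdot n_{\text{layers}} \cdot n_{\text{kv heads}} \cdot d_k \cdot T.$$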
The problem is not only asymptotic notation. Doubling the context length quadruples the number of attention scores during training. At decode time, a long KV cache consumes memory bandwidth and reduces batch size. Efficient alternatives try to preserve enough of attention's strengths while changing one or more of these costs.
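The scaling argument above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names and the 32-layer, 8-KV-head, 128-dimensional configuration are hypothetical values chosen for illustration, not taken from any particular model.

```python
def attention_score_entries(seq_len):
    """Entries in one head's seq_len x seq_len score matrix."""
    return seq_len * seq_len


def kv_cache_entries(num_layers, num_kv_heads, head_dim, seq_len):
    """Scalar entries in a KV cache: one key and one value vector
    per layer, per KV head, per cached position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len


# Hypothetical configuration, for illustration only.
short_cache = kv_cache_entries(32, 8, 128, 4096)
long_cache = kv_cache_entries(32, 8, 128, 8192)

print(attention_score_entries(8192) // attention_score_entries(4096))  # quadratic growth
print(long_cache // short_cache)                                       # linear growth
```

Doubling the context quadruples the score matrix but only doubles the cache, which is why training cost and decode memory are usually discussed as separate bottlenecks.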
The Mamba block starts with normalized token states, expands channels, applies a short causal depthwise convolution, and then generates input-dependent SSM parameters Delta_t, B_t, and C_t. The selective scan updates a fixed recurrent state while the output gate controls which scanned features pass through the projection. The dotted note highlights the implementation trick: the recurrence is trained with a fused associative scan even though it behaves like a recurrent state at inference.

Figure: RWKV block structure from Peng et al., 2023 — embedded under educational fair use with attribution.
RWKV combines an attention-like decayed key-value recurrence with a separate channel-mixing block. The time-mixing path uses token shift, keys, values, learned decay, and a receptance gate to maintain fixed-size numerator and denominator state instead of a growing KV cache. The residual channel-mixing path supplies the feed-forward capacity that a Transformer would normally place after attention.
Hyena replaces pairwise attention with a hierarchy of long implicit convolutions interleaved with data-controlled gates. The filter generator produces position-dependent long filters, and FFT convolution applies them efficiently over long contexts. The diagram calls out causal padding because circular convolution would leak future tokens.
Griffin mixes fixed-state recurrence with bounded local attention. The RG-LRU branch compresses long history into a diagonal recurrent state, while the local attention branch preserves exact access to recent tokens inside a sliding window. The residual structure mirrors Transformer blocks but replaces most global attention with recurrence.
Jamba is a sparse hybrid decoder: only a small fraction of layers are attention layers, most sequence mixing is handled by Mamba layers, and scheduled MoE feed-forward layers add capacity with sparse activation. The block-detail subgraph shows the released pattern at a high level: one attention layer and seven Mamba layers per eight-layer block. Because only attention layers allocate KV cache, the architecture reduces long-context cache growth while preserving some direct token access.
| Family | Token mixing idea | Training path | Decode memory | Main tradeoff |
|---|---|---|---|---|
| Full attention | Pairwise query-key scores | Matrix attention | KV cache grows with sequence length | Strong exact access, expensive long context |
| Linear attention | Associative feature-map attention | Prefix sums or scans | Fixed or compact state | Kernel choice limits operator class |
| Long convolution | Filters over many lags | FFT convolution | No full KV cache | Harder content-selective recall |
| Gated recurrence | Learned state update | Parallel scan or custom recurrence | Fixed state | Compresses history |
| Selective SSM | Input-dependent state-space update | Fused selective scan | Fixed state | Exact retrieval still not free |
| Hybrid | Sparse attention plus efficient mixers | Mixed kernels | Bounded or reduced KV cache | More architecture complexity |
Linear attention and recurrent summaries
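The algebraic route starts from attention without the row-wise softmax. If a nonnegative feature map $\phi$ makes attention weights approximately factorizable, so that the similarity between $q_t$ and $k_s$ is $\phi(q_t)^\top \phi(k_s)$, then
$$y_t^\top = \frac{\phi(q_t)^\top \sum_{s \le t} \phi(k_s)\, v_s^\top}{\phi(q_t)^\top \sum_{s \le t} \phi(k_s)}.$$
The sums over past keys and values can be maintained as recurrent state:
$$S_t = S_{t-1} + \phi(k_t)\, v_t^\top, \qquad z_t = z_{t-1} + \phi(k_t).$$
Then each new token reads from $(S_t, z_t)$ instead of all previous tokens. This explains why many efficient models look recurrent at inference even when they train with parallel scan kernels. The price is compression: the state summarizes the past instead of storing every key and value separately.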
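The equivalence between the recurrent state and the explicit sum over past tokens can be verified on toy data. The sketch below assumes the feature map $\phi(x) = \mathrm{elu}(x) + 1$, which is one common choice and not the only one; the function names are hypothetical, and the code uses plain Python lists for clarity, not speed.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, which is strictly positive (an assumed choice)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_attention_recurrent(qs, ks, vs):
    """Causal linear attention using a fixed-size state (S, z)."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = dot(fq, z)
        outputs.append(
            [sum(fq[i] * S[i][j] for i in range(d_k)) / denom for j in range(d_v)]
        )
    return outputs


def linear_attention_explicit(qs, ks, vs):
    """Same operator, written as a normalized sum over all past tokens."""
    outputs = []
    for t, q in enumerate(qs):
        fq = phi(q)
        weights = [dot(fq, phi(ks[s])) for s in range(t + 1)]
        total = sum(weights)
        outputs.append(
            [sum(w * vs[s][j] for s, w in enumerate(weights)) / total
             for j in range(len(vs[0]))]
        )
    return outputs
```

The recurrent version touches only the state, whose size is independent of sequence length, while the explicit version revisits every earlier token at each step.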
Long convolution with data-controlled gates
Long convolutions replace the dense attention matrix with a structured Toeplitz operator. Hyena showed that this can be much more expressive when fixed long filters are interleaved with input-dependent gates [1]. A causal convolution over one channel is
$$y_t = \sum_{s=0}^{t} h_{t-s}\, u_s.$$
In matrix form, this is multiplication by a lower-triangular Toeplitz matrix $S_h$ with entries $(S_h)_{ts} = h_{t-s}$ for $s \le t$. Hyena composes Toeplitz matrices with diagonal gate matrices:
$$y = D_N S_{h_N} \cdots D_2 S_{h_2} D_1 S_{h_1} v,$$
where each diagonal matrix $D_i$ is built from a projection of the input and $v$ is a value projection.
The filters are implicit: instead of learning one parameter for every possible lag, a small network maps positions to filter values,
$$h_t = \mathrm{Window}(t) \cdot \mathrm{FFN}(\mathrm{PositionalEncoding}(t)).$$
The gates make the operator data-controlled, while FFT convolution gives the long-filter part roughly $O(T \log T)$ length scaling for sequence length $T$. The implementation detail that matters most is causality: FFT convolution must be padded so it computes the aperiodic causal convolution, not circular convolution that leaks future tokens into early outputs.
Worked example: let
Using with missing past values set to zero,
So . Hyena applies the same causal idea with learned full-length filters and gates over many channels.
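```python
import torch


def causal_fft_conv(u, h):
    """Depthwise causal convolution.

    u: [batch, channels, length]
    h: [channels, length]
    """
    length = u.size(-1)
    # Zero-pad to 2 * length so the circular FFT product equals the
    # aperiodic (linear) convolution on the first `length` outputs.
    fft_len = 2 * length
    u_f = torch.fft.rfft(u, n=fft_len)
    # Add a batch axis so the per-channel filters broadcast over the batch.
    h_f = torch.fft.rfft(h.unsqueeze(0), n=fft_len)
    y = torch.fft.irfft(u_f * h_f, n=fft_len)
    # Keep the causal part; the tail holds the convolution overhang.
    return y[..., :length]
```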
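A direct-sum reference makes the causality requirement concrete and is a convenient check for an FFT implementation such as `causal_fft_conv`. The sketch below is pure Python on a single channel; the helper names and the input and filter values are hypothetical, chosen so the arithmetic is exact in floating point.

```python
def causal_conv_direct(u, h):
    """y[t] = sum over s <= t of h[t - s] * u[s] (aperiodic, causal)."""
    return [sum(h[t - s] * u[s] for s in range(t + 1)) for t in range(len(u))]


def circular_conv(u, h):
    """What an unpadded FFT product computes: lags wrap around modulo the length."""
    n = len(u)
    return [sum(h[(t - s) % n] * u[s] for s in range(n)) for t in range(n)]


# Hypothetical toy signal and filter.
u = [1.0, 2.0, 3.0, 4.0]
h = [1.0, 0.5, 0.25, 0.125]

causal = causal_conv_direct(u, h)
circular = circular_conv(u, h)
print(causal)
print(circular)
```

The first causal output depends only on `u[0]`, while the first circular output also mixes in `u[1]`, `u[2]`, and `u[3]` through wrapped lags. The two agree only at the last position, where every input is already in the past.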
Hyena's paper-reported result was that attention-free long convolutions can be competitive at sub-billion language-model scale and much faster than attention at very long sequence lengths under its tested kernels [1]. The conservative lesson is narrower and more useful: structured long filters plus gates are a serious token-mixing family, but exact retrieval and hardware constants still matter.
Decayed key-value recurrence
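RWKV turns attention-like quantities into a channelwise recurrence that trains in a parallelizable form and decodes like an RNN [2]. Its time-mixing block builds vectors analogous to attention projections:
$$r_t = W_r\,(\mu_r \odot x_t + (1 - \mu_r) \odot x_{t-1}),\quad k_t = W_k\,(\mu_k \odot x_t + (1 - \mu_k) \odot x_{t-1}),\quad v_t = W_v\,(\mu_v \odot x_t + (1 - \mu_v) \odot x_{t-1}).$$
The parameters $\mu_r$, $\mu_k$, and $\mu_v$ implement token shift: each channel can blend the current token with the previous token before projection. A simplified one-channel weighted-key-value update is
$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}.$$
Here $w$ is a learned decay and $u$ is a learned current-token bonus. The output is gated by a receptance vector:
$$o_t = W_o\,(\sigma(r_t) \odot wkv_t).$$
The recurrence can be updated with numerator and denominator state, so inference memory is independent of generated length. Real implementations use stable rescaling because the exponentials can overflow.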
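The numerator and denominator state, together with the rescaling trick, can be sketched for a single channel. The code assumes the simplified weighted-key-value form in which a past token $i$ carries weight $e^{-(t-1-i)w + k_i}$ and the current token carries $e^{u + k_t}$. The function names are hypothetical, and the rescaling scheme (tracking a running maximum exponent) is one standard way to avoid overflow, not necessarily the one any given RWKV kernel uses.

```python
import math


def wkv_direct(ks, vs, w, u):
    """Reference: recompute the weighted average from all past tokens."""
    out = []
    for t in range(len(ks)):
        weights = [math.exp(-(t - 1 - i) * w + ks[i]) for i in range(t)]
        weights.append(math.exp(u + ks[t]))  # current-token bonus
        num = sum(wt * v for wt, v in zip(weights, vs[: t + 1]))
        out.append(num / sum(weights))
    return out


def wkv_recurrent(ks, vs, w, u):
    """Fixed-size state: scaled numerator a, scaled denominator b,
    and the exponent p that both are scaled by."""
    a, b, p = 0.0, 0.0, float("-inf")
    out = []
    for k, v in zip(ks, vs):
        # Read: combine decayed history with the bonus-weighted current token.
        q = max(p, u + k)
        e_hist, e_now = math.exp(p - q), math.exp(u + k - q)
        out.append((e_hist * a + e_now * v) / (e_hist * b + e_now))
        # Write: decay the history by w and add the current token without bonus.
        q = max(p - w, k)
        e_hist, e_now = math.exp(p - w - q), math.exp(k - q)
        a, b, p = e_hist * a + e_now * v, e_hist * b + e_now, q
    return out
```

Because every exponent is taken relative to the running maximum, the recurrent version stays finite for keys such as 1000.0, where the direct version would raise an overflow error in `math.exp`.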
Worked example: suppose , , , values , , , decay , and current bonus . At , the past weights are
and the current weight is . Therefore
This shows the intended behavior: a recent high-key token dominates, but older tokens still contribute through the decayed state.
RWKV's main contribution was to show that a modern recurrent language model can scale into the billion-parameter regime with Transformer-like training behavior and fixed-state decoding [2]. Its limitation follows from the same design: a fixed state is a useful summary, not an exact list of all prior tokens.
Selective state-space recurrence
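State-space models write sequence processing as a recurrence over a hidden state. In continuous form,
$$\dot{h}(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t).$$
After discretization, a sequence model can be written as
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$
Earlier structured SSMs often kept $A$, $B$, $C$, and the step size $\Delta$ fixed across time. That makes convolutional computation possible but weakens content-dependent behavior. Mamba's selective SSM makes key parameters functions of the input token [3]:
$$B_t = W_B\,x_t, \qquad C_t = W_C\,x_t, \qquad \Delta_t = \mathrm{softplus}(W_\Delta\,x_t + b_\Delta).$$
With diagonal $A$, a common update has
$$\bar{A}_t = \exp(\Delta_t A), \qquad \bar{B}_t = \Delta_t B_t, \qquad h_t = \bar{A}_t \odot h_{t-1} + \bar{B}_t\,x_t, \qquad y_t = C_t^\top h_t.$$
The point is selection. Each token can influence how much state is kept, what is written, and what is read. Mamba recovers efficient training with a hardware-aware parallel scan. For a recurrence
$$h_t = a_t\,h_{t-1} + b_t,$$
the affine updates compose associatively:
$$(a_2, b_2) \circ (a_1, b_1) = (a_2 a_1,\; a_2 b_1 + b_2).$$
This enables scan kernels instead of a Python loop over tokens.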
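The associativity claim can be made concrete by comparing a sequential loop with a prefix scan over affine maps $h \mapsto a h + b$. This is a sketch in pure Python: the scan uses the recursive-doubling (Hillis-Steele) schedule as an illustrative assumption and runs serially here, whereas a real kernel would execute each round in parallel and fuse it with the rest of the block. The coefficients are hypothetical.

```python
def combine(first, second):
    """Compose h -> a1*h + b1 (applied first) with h -> a2*h + b2 (applied second)."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def recurrence_sequential(a, b, h0=0.0):
    """Plain loop: h_t = a_t * h_{t-1} + b_t."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def recurrence_scan(a, b, h0=0.0):
    """Inclusive prefix scan over affine maps in about log2(T) rounds."""
    pairs = list(zip(a, b))
    n, shift = len(pairs), 1
    while shift < n:
        # Every position combines with the partial result `shift` steps earlier.
        pairs = [
            pairs[i] if i < shift else combine(pairs[i - shift], pairs[i])
            for i in range(n)
        ]
        shift *= 2
    # pairs[t] now maps h0 directly to h_t.
    return [a_t * h0 + b_t for a_t, b_t in pairs]


# Hypothetical coefficients: the second step keeps the state (a = 1) and writes nothing (b = 0).
a = [0.5, 1.0, 0.5, 2.0]
b = [1.0, 0.0, 2.0, 1.0]
print(recurrence_sequential(a, b))
print(recurrence_scan(a, b))
```

The second position has a write coefficient of zero and a keep coefficient of one, so the state passes through unchanged, which is the kind of per-token choice a selective recurrence can make.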
Worked example: a scalar selective recurrence
sees and chooses
Then
The middle token is ignored because its input-dependent write coefficient is zero. A time-invariant recurrence could not make that choice differently at different positions.
Mamba also changed the block design. A simplified block expands channels, applies a short depthwise convolution, generates SSM parameters, runs the selective scan, gates the result, and projects back to model width. The paper reported that an attention-free Mamba stack could match strong Transformer baselines at language-model scale, improve long-sequence throughput, and work well on DNA and audio sequences [3].
Local attention plus gated recurrence
Fixed-state models compress the past, so exact retrieval remains difficult. One pragmatic response is to keep a small attention window for recent tokens and use recurrence for longer-range memory. Griffin follows this path by mixing Real-Gated Linear Recurrent Unit layers with local multi-query attention [4].
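In simplified elementwise notation, the RG-LRU update is
$$r_t = \sigma(W_a x_t + b_a), \qquad i_t = \sigma(W_x x_t + b_x), \qquad a_t = a^{c\,r_t}, \qquad h_t = a_t \odot h_{t-1} + \sqrt{1 - a_t^2} \odot (i_t \odot x_t),$$
where $c$ is a fixed scalar constant.
The recurrent weight $a$ is diagonal and constrained to $(0, 1)$ for stability. The input gate $i_t$ controls how much new content enters state; the recurrence gate $r_t$ controls effective decay. Griffin's main pattern uses recurrent residual blocks interleaved with local attention blocks, for example two recurrent blocks followed by one local-attention block.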
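A single-channel step shows how the two gates interact. The sketch assumes the simplified form in which the effective recurrent weight is $a_t = a^{c\,r_t}$ and the input is scaled by $\sqrt{1 - a_t^2}$; the function name is hypothetical, the gate values are passed in directly instead of being computed from learned projections, and the default `c=8.0` is an illustrative assumption.

```python
import math


def rg_lru_step(h_prev, x, r_gate, i_gate, a, c=8.0):
    """One scalar-channel gated linear recurrence step.

    r_gate, i_gate: gate activations in [0, 1] (assumed already computed).
    a: base recurrent weight in (0, 1).
    """
    a_t = a ** (c * r_gate)                # larger r_gate means stronger decay
    input_scale = math.sqrt(1.0 - a_t * a_t)
    return a_t * h_prev + input_scale * (i_gate * x)


# Recurrence gate closed: the state is carried over and the input is ignored.
print(rg_lru_step(h_prev=2.0, x=5.0, r_gate=0.0, i_gate=1.0, a=0.9))
# Recurrence gate open: history decays and new content is admitted.
print(rg_lru_step(h_prev=2.0, x=5.0, r_gate=1.0, i_gate=1.0, a=0.9))
```

The square-root scaling ties the amount of new input to the amount of decay, so a channel that keeps nearly all of its history admits very little new content.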
Worked example: with
the effective recurrent weight is
The input scale is
So
The state mostly preserves history while admitting a limited amount of new information.
The local attention window bounds the cache. If a 32-layer model with one KV head of dimension 128 generates $T$ tokens, a full MQA cache stores
$$2 \times 32 \times 1 \times 128 \times T = 8192\,T$$
scalar key/value entries. With a local window $W$, the local cache stores
$$2 \times 32 \times 1 \times 128 \times W = 8192\,W,$$
which is 64 times smaller when $T = 64W$. Griffin's recurrent state is additional, but it does not grow with $T$. The paper-reported lesson is that recurrence plus bounded attention can improve long-context efficiency without forcing the recurrent state to handle every exact recent-copying problem [4].
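A bounded cache is easy to sketch with a fixed-length queue. The class and function names below are hypothetical, and the generated length of 65,536 tokens and window of 1,024 are assumed values chosen only so that the ratio comes out to 64; they are not figures from the Griffin paper.

```python
from collections import deque


def mqa_cache_entries(num_layers, head_dim, cached_positions, num_kv_heads=1):
    """Scalar key/value entries held across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * cached_positions


class LocalKVCache:
    """Rolling per-layer cache that keeps only the most recent `window` positions."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        # deque(maxlen=...) silently evicts the oldest entry once full.
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


generated, window = 65536, 1024  # assumed values, for illustration only
full = mqa_cache_entries(32, 128, generated)
local = mqa_cache_entries(32, 128, min(generated, window))
print(full // local)

cache = LocalKVCache(window=4)
for position in range(10):
    cache.append(f"k{position}", f"v{position}")
print(len(cache), list(cache.keys))
```

The cache stops growing once it reaches the window size, so its cost is bounded but not zero.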
Sparse hybrid attention-state-space blocks
Jamba scales the hybrid idea into a decoder that interleaves Transformer attention, Mamba layers, and mixture-of-experts MLPs [5]. Its design treats architecture as a resource allocation problem: attention provides direct token access, Mamba layers reduce long-context cost, and sparse experts increase total capacity without activating all parameters for every token.
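A Jamba block is described by several knobs:
- $l$: the number of layers per block
- $a : m$: the ratio of attention layers to Mamba layers
- $e$: how often an MoE layer replaces a dense MLP (every $e$ layers)
- $n$: the number of experts per MoE layer
- $K$: the number of experts selected per token
The released Jamba v0.1 configuration uses
$$l = 8, \qquad a : m = 1 : 7, \qquad e = 2, \qquad n = 16, \qquad K = 2.$$
Four blocks give 32 layers total. Since each block has one attention layer and seven Mamba layers, only 4 of 32 layers need a Transformer-style KV cache. This gives an idealized 8x cache reduction relative to a fully attentional 32-layer decoder with the same cache shape per attention layer.
The MoE component routes each token to a small subset of expert MLPs:
$$y = \sum_{i \in \mathrm{TopK}(g(x))} g_i(x)\, E_i(x),$$
where $g(x)$ are router scores and $E_i$ is the $i$-th expert MLP.
Increasing the number of experts $n$ raises total capacity; increasing $K$ raises active compute. Jamba's reported released model has 52B total parameters but about 12B active parameters because each token uses only a subset of experts [5].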
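Top-K routing itself is a small amount of logic. The sketch below uses scalar inputs and toy experts; the function names are hypothetical, and renormalizing a softmax over only the selected experts is one common convention, assumed here for illustration and not confirmed as Jamba's exact router.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the selected experts; unselected experts are never called."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy experts that just scale their input (hypothetical).
experts = [lambda x: 0.0 * x, lambda x: 10.0 * x, lambda x: 20.0 * x, lambda x: 30.0 * x]
logits = [0.1, 2.0, -1.0, 2.0]

print(top_k_route(logits, k=2))
print(moe_forward(1.0, experts, logits, k=2))
```

Adding more experts to the list raises total parameters without changing how many are called per token, while raising `k` increases the work done for each token.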
Worked example: in the released 4-block, 8-layer-per-block pattern, there are $4 \times 8 = 32$ total layers. The attention count is $4 \times 1 = 4$ and the Mamba count is $4 \times 7 = 28$. With MoE every other layer, there are $32 / 2 = 16$ MoE layers. With top-2 routing, each token runs $16 \times 2 = 32$ selected expert computations across the network.
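The same bookkeeping can be generated programmatically, which makes it easy to try other ratios. The function name is hypothetical, and the position of the attention layer inside a block and the choice of which layers carry MoE are assumptions made for illustration; the counts do not depend on them.

```python
def jamba_style_layout(num_blocks=4, layers_per_block=8,
                       attention_per_block=1, moe_every=2):
    """Return a list of (mixer, mlp) pairs, one per layer."""
    layout = []
    for _ in range(num_blocks):
        for i in range(layers_per_block):
            # Assumed placement: attention first within the block, then Mamba layers.
            mixer = "attention" if i < attention_per_block else "mamba"
            # Assumed placement: MoE on every `moe_every`-th layer of the block.
            mlp = "moe" if i % moe_every == moe_every - 1 else "dense"
            layout.append((mixer, mlp))
    return layout


layout = jamba_style_layout()
num_attention = sum(mixer == "attention" for mixer, _ in layout)
num_mamba = sum(mixer == "mamba" for mixer, _ in layout)
num_moe = sum(mlp == "moe" for _, mlp in layout)
top_k = 2

print(len(layout), num_attention, num_mamba, num_moe, num_moe * top_k)
print("idealized KV-cache reduction:", len(layout) // num_attention)
```

Changing `attention_per_block` or `moe_every` shows directly how the cache-bearing layer count and the per-token expert calls move with each knob.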
Jamba's paper-reported 256K-context memory comparison lists a large KV-cache reduction versus a fully attentional sparse MoE baseline, and its long-context throughput improves as context grows [5]. The important principle is not that the 1:7 ratio is universal. It is that a small number of attention layers can preserve some direct access while most layers use fixed-state sequence mixers.
Choosing an efficient sequence model
The selection diagram is intentionally framed around I/O contracts: exact retrieval, decode-state growth, training context length, and hardware kernels. It links the architecture diagrams above to operational choices: RWKV and Mamba provide fixed-state decoding, Hyena and Mamba target long training contexts, Griffin bounds recent attention, and full attention remains the reference when quadratic cost is acceptable. The final check node keeps the comparison grounded in kernels and task-specific recall behavior rather than asymptotic notation alone.
Use full attention when direct access and simplicity matter more than long-context cost. Use long convolutions when global filters and FFT kernels fit the workload. Use recurrent or SSM models when fixed-state inference is central. Use hybrids when the application needs both long-context efficiency and some exact token access.
The practical comparison should include hardware. A method with better asymptotic length scaling can lose to attention at short contexts if kernels are immature or constants are high. Conversely, attention can become impossible at long contexts because memory, not FLOPs, is the limiting resource.
Common pitfalls
- Calling every subquadratic method "linear attention." Long convolutions, RWKV-style recurrences, selective SSMs, and hybrids use different operators.
- Assuming fixed state means unlimited memory. Fixed-state models summarize history; they do not store an exact searchable list of all past tokens.
- Forgetting causality in FFT convolution. Unpadded FFT convolution is circular and can leak future information.
- Comparing benchmark numbers without matching data, token budgets, optimizer recipes, context lengths, and hardware kernels.
- Treating local attention as free. Its cache is bounded by the window, not eliminated.
- Ignoring short-context regimes. Full attention is often simpler and faster until sequence lengths are large enough for the alternative's scaling to matter.
Connections
- Attention and Transformers gives the full-attention baseline and Transformer block structure.
- Sequence Modeling and RNNs explains recurrent hidden state, autoregressive generation, and truncated backpropagation.
- LSTM Variants connects older gated recurrence to modern recurrent language models.
- Computational Performance is the right place to reason about memory bandwidth, kernels, batching, and parallelism.
- Pretrained Transformers and BERT covers model-family choices for downstream NLP systems.
References
[1] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, C. Ré. Hyena Hierarchy: Towards Larger Convolutional Language Models. ICML 2023.
[2] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, H. Cao, X. Cheng, and collaborators. RWKV: Reinventing RNNs for the Transformer Era. EMNLP Findings 2023.
[3] A. Gu, T. Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. COLM 2024.
[4] S. De, S. L. Smith, A. Fernando, A. Botev, G. Cristian-Muraru, A. Gu, and collaborators. Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. 2024.
[5] O. Lieber, B. Lenz, H. Bata, G. Cohen, J. Osin, I. Dalmedigos, E. Safahi, S. Meirom, Y. Belinkov, S. Shalev-Shwartz, O. Abend, and collaborators. Jamba: A Hybrid Transformer-Mamba Language Model. 2024.