Attention Is All You Need (Vaswani et al., 2017)
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin's "Attention Is All You Need" introduced the Transformer, a sequence transduction architecture built from attention, feed-forward layers, residual connections, normalization, and positional encodings rather than recurrence or convolution. The paper matters because it converted sequence modeling from a mostly sequential computation into a highly parallel one, while also giving each token a direct route to every other token in the same context.
In this sequence of paper pages, the Transformer is the reference architecture. ViT applies its encoder to image patches, while Hyena, RWKV, Mamba, Griffin, and Jamba ask which parts of full attention can be replaced, approximated, or hybridized when sequence length and inference memory become the bottleneck.
Definitions
Problem and motivation. Before the Transformer, strong sequence-to-sequence systems for translation usually used recurrent or convolutional encoders and decoders, often with an attention mechanism connecting them. Recurrence made the state at position $t$ depend on the state at position $t-1$, so training could not fully parallelize across time. Convolution improved parallelism but needed depth or dilation to connect distant positions. The Transformer keeps the encoder-decoder template but replaces the recurrent or convolutional token mixer with self-attention.
A token embedding maps a discrete token to a vector in $\mathbb{R}^{d_{\text{model}}}$. The paper uses $d_{\text{model}} = 512$ in the base model.
A query, key, and value are learned projections of hidden states. For matrices $Q$, $K$, and $V$, scaled dot-product attention is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
The denominator keeps dot products from becoming too large when the key dimension grows. Without it, softmax can saturate early, producing small gradients.
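A quick numeric sketch (plain Python; Gaussian vectors are an illustrative assumption, not the paper's setup) shows why: the typical logit magnitude grows like $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ restores a roughly constant scale.

```python
import math
import random

random.seed(0)

def logit_scale(d_k, trials=2000):
    """Average |q . k| for random unit-variance vectors of dimension d_k."""
    total = 0.0
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        total += abs(sum(a * b for a, b in zip(q, k)))
    return total / trials

for d_k in (4, 64, 256):
    raw = logit_scale(d_k)
    # Raw magnitude grows with d_k; the scaled value stays roughly constant.
    print(d_k, round(raw, 2), round(raw / math.sqrt(d_k), 2))
```

Unscaled logits of magnitude 10 or more would push the softmax toward a one-hot distribution, which is the saturation the paper guards against.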
Multi-head attention runs attention in $h$ learned subspaces:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$

The base Transformer uses $h = 8$ heads with $d_k = d_v = 64$, because $8 \times 64 = 512 = d_{\text{model}}$.
Self-attention uses the same sequence as the source of queries, keys, and values. Encoder-decoder attention uses decoder states as queries and encoder outputs as keys and values. Causal masking sets illegal future logits to $-\infty$ before softmax so the decoder cannot look ahead.
Since attention by itself is permutation equivariant, the paper adds sinusoidal positional encodings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
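The encoding can be generated in a few lines (a minimal sketch; `positional_encoding` is an illustrative name, and $d_{\text{model}}$ is assumed even):

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: even dims get sin, odd dims get cos,
    with wavelengths forming a geometric progression up to 10000 * 2*pi."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):  # d_model assumed even
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# Position 0 is sin(0) = 0 in even slots and cos(0) = 1 in odd slots.
print(pe[0][:4])
```

Every entry lies in $[-1, 1]$, so the encoding can be added directly to the token embeddings without rescaling.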
Each encoder layer contains self-attention and a positionwise feed-forward network. Each decoder layer contains masked self-attention, encoder-decoder attention, and a positionwise feed-forward network. The feed-forward block is

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

with inner dimension $d_{ff} = 2048$ in the base model.
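The feed-forward block is a two-layer MLP applied identically at every position; a small sketch using the base dimensions (`PositionwiseFFN` is an illustrative name):

```python
import torch

class PositionwiseFFN(torch.nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied per position."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = torch.nn.Linear(d_model, d_ff)
        self.w2 = torch.nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

ffn = PositionwiseFFN(d_model=512, d_ff=2048)
x = torch.randn(2, 10, 512)  # [batch, length, d_model]
print(ffn(x).shape)  # torch.Size([2, 10, 512])
```

Because `Linear` acts on the last dimension, no position ever mixes with another inside this block; all cross-position communication happens in attention.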
Key results
Method. The Transformer uses stacks of $N = 6$ encoder layers and $N = 6$ decoder layers in the base configuration. Every sublayer is wrapped in residual addition and layer normalization:

$$\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$
The encoder self-attention lets every source token interact with every other source token. The decoder first uses masked self-attention over the generated prefix, then attention over all encoder outputs. The feed-forward layer then transforms each position independently. This alternation is important: attention communicates across positions, while the feed-forward network increases per-token nonlinear capacity.
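The per-sublayer wrapper can be sketched as a small PyTorch module (post-norm, as in the original paper; the paper also applies dropout to the sublayer output, omitted here, and `SublayerConnection` is an illustrative name):

```python
import torch

class SublayerConnection(torch.nn.Module):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))

wrap = SublayerConnection(d_model=512)
x = torch.randn(2, 10, 512)
y = wrap(x, torch.nn.Linear(512, 512))  # any shape-preserving sublayer works
print(y.shape)
```

The residual path is why every sublayer, attention or feed-forward, must output vectors of the same dimension $d_{\text{model}}$ as its input.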
The paper's complexity argument is not only about asymptotic cost. A recurrent layer has $O(n)$ sequential operations for a sequence of length $n$, even if each step is cheap. A self-attention layer has constant, $O(1)$, sequential depth over positions during training, because the full attention matrix can be formed in parallel. The cost is $O(n^2 \cdot d)$ per layer for full attention, which is acceptable for many 2017 translation lengths but becomes the motivation for later pages on Hyena, RWKV, and Mamba.
Architecture details and hyperparameters. The base model uses $d_{\text{model}} = 512$, $d_{ff} = 2048$, $h = 8$ heads, dropout $P_{drop} = 0.1$, label smoothing $\epsilon_{ls} = 0.1$, and Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. The learning rate uses warmup followed by inverse-square-root decay:

$$lrate = d_{\text{model}}^{-0.5} \cdot \min\!\left(step^{-0.5},\; step \cdot warmup\_steps^{-1.5}\right)$$

with $warmup\_steps = 4000$. The WMT 2014 English-German setup used about 4.5 million sentence pairs and a shared BPE vocabulary of about 37,000 tokens. English-French used a much larger dataset of about 36 million sentences and a word-piece vocabulary around 32,000.
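The warmup-then-decay schedule is easy to sanity-check numerically (a minimal sketch of the formula; `transformer_lr` is an illustrative name):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises linearly during warmup, peaks at warmup_steps, then decays as step^-0.5.
for step in (100, 4000, 16000):
    print(step, transformer_lr(step))
```

The peak rate at step 4000 is $512^{-0.5} \cdot 4000^{-0.5} \approx 7 \times 10^{-4}$; quadrupling the step count afterwards halves the rate.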
Benchmarks. The headline result is WMT 2014 machine translation. The paper reports 28.4 BLEU on English-to-German for the big model, more than two BLEU above the previously reported best systems at the time. For English-to-French, the paper reports about 41 BLEU for a single big model; the abstract gives 41.8, while the main results table and discussion in some versions use a value around 41.0, so it is safest to remember the result as roughly 41 BLEU rather than as a single universal constant. The big English-French model trained for about 3.5 days on eight P100 GPUs. The paper also reports strong English constituency parsing results, showing that the architecture was not only a translation trick.
Visual
| Layer type | Per-layer token mixing | Sequential depth over positions | Main advantage | Main cost |
|---|---|---|---|---|
| Recurrent layer | Usually local, through the hidden state | $O(n)$ | Streaming state | Poor training parallelism |
| Convolution | Local unless deep or dilated | $O(1)$ per layer | Parallel and locality-biased | Long paths for distant tokens |
| Full self-attention | All pairs | $O(1)$ | Direct long-range routing | $O(n^2)$ scores |
| Causal self-attention | All previous positions | $O(1)$ in training | Autoregressive training parallelism | KV cache grows with context |
Worked example 1: scaled dot-product attention by hand
Problem: compute one query attending to three key-value pairs. Let

$$q = [2, 0, 0, 0], \qquad k_1 = [2, 0, 0, 0], \quad k_2 = [0, 2, 0, 0], \quad k_3 = [1, 1, 0, 0],$$

and

$$v_1 = [1, 0], \qquad v_2 = [0, 1], \qquad v_3 = [0.5, 0.5].$$

Use $d_k = 4$, so $\sqrt{d_k} = 2$.
- Compute raw dot products: $q \cdot k_1 = 4$, $q \cdot k_2 = 0$, $q \cdot k_3 = 2$.
- Scale by $\sqrt{d_k} = 2$: the logits become $[2, 0, 1]$.
- Apply softmax. Since $e^2 \approx 7.389$, $e^0 = 1$, and $e^1 \approx 2.718$, the denominator is about $11.107$.
Therefore
$$\alpha \approx (0.665,\ 0.090,\ 0.245).$$
- Weight the values:
$$\text{output} = 0.665\,v_1 + 0.090\,v_2 + 0.245\,v_3 \approx [0.788,\ 0.212].$$
Check: the attention weights sum to $1$ up to rounding, and the output is a convex combination of the value vectors.
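The hand computation can be replayed in plain Python as a numeric check of scaled dot-product attention on one query and three key-value pairs (illustrative vectors with $d_k = 4$):

```python
import math

q = [2.0, 0.0, 0.0, 0.0]
keys = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Dot products scaled by sqrt(d_k), then softmax, then a weighted value sum.
logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
exps = [math.exp(s) for s in logits]
weights = [e / sum(exps) for e in exps]
out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(2)]
print([round(w, 3) for w in weights], [round(o, 3) for o in out])
```

The output is a convex combination of the value vectors, with most of the mass on the key most aligned with the query.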
Worked example 2: choosing head dimensions and attention cost
Problem: a Transformer encoder uses $d_{\text{model}} = 512$, $h = 8$ heads, source length $n = 128$, and batch size $b = 4$. Find the per-head dimension and the shape of the attention score tensor for one encoder self-attention layer.
- Split the model dimension across heads: $d_k = d_{\text{model}} / h = 512 / 8 = 64$.
- For each head, $Q$, $K$, and $V$ have shape $[b, n, d_k] = [4, 128, 64]$.
- The score matrix is $n \times n = 128 \times 128$ for each batch item and head. Its shape is $[b, h, n, n] = [4, 8, 128, 128]$.
- The number of scalar attention logits is $4 \cdot 8 \cdot 128 \cdot 128 = 524{,}288$.
- If the sequence doubles to $n = 256$ while $b$ and $h$ stay fixed, the logits become $4 \cdot 8 \cdot 256 \cdot 256 = 2{,}097{,}152$.
Check: doubling length multiplies the attention score count by $4$, which is the practical meaning of the quadratic term.
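The counting argument can be verified in a short script (illustrative values $b = 4$, $h = 8$, $d_{\text{model}} = 512$):

```python
batch, heads, d_model = 4, 8, 512
head_dim = d_model // heads  # per-head dimension

def num_logits(n):
    """Scalar attention logits for one layer: batch * heads * n * n."""
    return batch * heads * n * n

# Doubling the sequence length quadruples the logit count.
print(head_dim, num_logits(128), num_logits(256) // num_logits(128))
```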
Code
```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: [batch, heads, length, head_dim]."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get -inf so softmax assigns them zero weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights


batch, heads, length, head_dim = 2, 8, 16, 64
x = torch.randn(batch, length, heads * head_dim)

# Separate learned projections for queries, keys, and values.
wq = torch.nn.Linear(heads * head_dim, heads * head_dim, bias=False)
wk = torch.nn.Linear(heads * head_dim, heads * head_dim, bias=False)
wv = torch.nn.Linear(heads * head_dim, heads * head_dim, bias=False)

# Reshape to [batch, heads, length, head_dim] so each head attends independently.
q = wq(x).view(batch, length, heads, head_dim).transpose(1, 2)
k = wk(x).view(batch, length, heads, head_dim).transpose(1, 2)
v = wv(x).view(batch, length, heads, head_dim).transpose(1, 2)

# Lower-triangular causal mask: position i may attend only to positions <= i.
causal = torch.tril(torch.ones(length, length, dtype=torch.bool))
out, attn = scaled_dot_product_attention(q, k, v, causal)
out = out.transpose(1, 2).reshape(batch, length, heads * head_dim)
print(out.shape, attn.shape)
```
Common pitfalls
- Treating attention as a complete Transformer. The paper's gains depend on attention, feed-forward layers, residual paths, normalization, masking, embeddings, positional encodings, and the training recipe.
- Forgetting the scale factor. Omitting the $1/\sqrt{d_k}$ division can make logits too sharp, especially with larger head dimensions.
- Mixing up mask types. Padding masks remove fake tokens. Causal masks remove future tokens. Encoder self-attention usually needs the first, decoder self-attention needs both.
- Comparing perplexities across incompatible tokenizations. The paper's word-piece or BPE settings matter.
- Assuming attention is cheap at all lengths. Full attention parallelizes well but stores and processes $O(n^2)$ pairwise scores.
- Reading attention maps as guaranteed explanations. The paper's examples are suggestive, but individual heads are not complete causal explanations.
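The padding-versus-causal distinction above can be made concrete with a small sketch (the padding pattern here is hypothetical):

```python
import torch

length = 5
# Hypothetical batch item: the last two positions are padding (True = real token).
real = torch.tensor([True, True, True, False, False])

padding_mask = real.unsqueeze(0).expand(length, length)  # hide fake key positions
causal_mask = torch.tril(torch.ones(length, length, dtype=torch.bool))  # hide future

# Decoder self-attention needs both: attend only to real, non-future positions.
combined = causal_mask & padding_mask
print(combined.int())
```

Encoder self-attention would use only `padding_mask`; a decoder over the same sequence would use `combined`.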
Connections
- Builds directly on the query-key-value view in Attention and Transformers.
- Replaces the sequential bottleneck discussed in Sequence Modeling and RNNs and Gated RNNs and Sequence-to-Sequence.
- Supplies the encoder reused by Vision Transformer.
- Supplies the baseline that Hyena, RWKV, Mamba, Griffin, and Jamba compare against.
- For implementation context, see Computational Performance and PyTorch Builders Guide.
- Further reading: Bahdanau et al. on neural machine translation attention, Sutskever et al. on sequence-to-sequence learning, Wu et al. on GNMT, Gehring et al. on convolutional sequence models, and Dao et al. on FlashAttention.