Attention Is All You Need (Vaswani et al., 2017)

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin's "Attention Is All You Need" introduced the Transformer, a sequence transduction architecture built from attention, feed-forward layers, residual connections, normalization, and positional encodings rather than recurrence or convolution. The paper matters because it converted sequence modeling from a mostly sequential computation into a highly parallel one, while also giving each token a direct route to every other token in the same context.

In this sequence of paper pages, the Transformer is the reference architecture. ViT applies its encoder to image patches, while Hyena, RWKV, Mamba, Griffin, and Jamba ask which parts of full attention can be replaced, approximated, or hybridized when sequence length and inference memory become the bottleneck.

Definitions

Problem and motivation. Before the Transformer, strong sequence-to-sequence systems for translation usually used recurrent or convolutional encoders and decoders, often with an attention mechanism connecting them. Recurrence made token $t$ depend on token $t-1$, so training could not fully parallelize across time. Convolution improved parallelism but needed depth or dilation to connect distant positions. The Transformer keeps the encoder-decoder template but replaces the recurrent or convolutional token mixer with self-attention.

A token embedding maps a discrete token to a vector in $\mathbb{R}^{d_{\text{model}}}$. The paper uses $d_{\text{model}} = 512$ in the base model.

A query, key, and value are learned projections of hidden states. For matrices $Q$, $K$, and $V$, scaled dot-product attention is

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.$$

The $\sqrt{d_k}$ denominator keeps dot products from becoming too large when the key dimension grows. Without it, softmax can saturate early, producing small gradients.
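
To make the scale concrete, the short sketch below (not from the paper; it assumes unit-variance random queries and keys) compares the spread of raw and scaled dot products as $d_k$ grows. The raw standard deviation grows roughly like $\sqrt{d_k}$, which is exactly what the denominator cancels.

import torch

torch.manual_seed(0)
for d_k in (4, 64, 512):
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    raw = (q * k).sum(dim=-1)          # dot-product logits; spread grows with d_k
    scaled = raw / d_k ** 0.5          # spread stays near 1 after scaling
    print(d_k, round(raw.std().item(), 2), round(scaled.std().item(), 2))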

Multi-head attention runs attention in $h$ learned subspaces:

$$\begin{aligned} \mathrm{head}_i &= \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V),\\ \mathrm{MultiHead}(Q,K,V) &= \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^O. \end{aligned}$$

The base Transformer uses $h = 8$ heads with $d_k = d_v = 64$, because $512/8 = 64$.

Self-attention uses the same sequence as the source of queries, keys, and values. Encoder-decoder attention uses decoder states as queries and encoder outputs as keys and values. Causal masking sets illegal future logits to $-\infty$ before softmax so the decoder cannot look ahead.

Since attention by itself is permutation equivariant, the paper adds sinusoidal positional encodings:

$$\begin{aligned} PE_{pos,2i} &= \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right),\\ PE_{pos,2i+1} &= \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right). \end{aligned}$$
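
A direct PyTorch transcription of the two formulas above (a minimal sketch, not the authors' code; the function name is ours) fills even columns with sines and odd columns with cosines:

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # [max_len, 1]
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)             # 2i for each sin/cos pair
    angles = position / (10000.0 ** (two_i / d_model))                   # [max_len, d_model // 2]
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512]); added to the token embeddings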

Each encoder layer contains self-attention and a positionwise feed-forward network. Each decoder layer contains masked self-attention, encoder-decoder attention, and a positionwise feed-forward network. The feed-forward block is

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2,$$

with inner dimension $d_{\text{ff}} = 2048$ in the base model.
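
As a sketch (assuming the base-model sizes; not the reference implementation), the feed-forward block is just two linear layers with a ReLU in between, applied to each position independently:

import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # xW1 + b1
        self.w2 = nn.Linear(d_ff, d_model)   # (...)W2 + b2

    def forward(self, x):
        # x: [batch, length, d_model]; the same weights apply at every position
        return self.w2(torch.relu(self.w1(x)))

ffn = PositionwiseFFN()
print(ffn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])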

Key results

Method. The Transformer uses stacks of $N = 6$ encoder layers and $N = 6$ decoder layers in the base configuration. Every sublayer is wrapped in residual addition and layer normalization:

$$\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x)).$$

The encoder self-attention lets every source token interact with every other source token. The decoder first uses masked self-attention over the generated prefix, then attention over all encoder outputs. The feed-forward layer then transforms each position independently. This alternation is important: attention communicates across positions, while the feed-forward network increases per-token nonlinear capacity.
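
As a sketch of the wiring (our own helper names; the paper's post-norm ordering LayerNorm(x + Sublayer(x)) is assumed), each sublayer can be wrapped in a small residual module:

import torch
import torch.nn as nn

class PostNormResidual(nn.Module):
    """Applies LayerNorm(x + dropout(sublayer(x))), the post-norm wrapping described above."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))

# In an encoder layer the two sublayers would be composed roughly as:
#   x = residual_attn(x, lambda t: self_attention(t, t, t))
#   x = residual_ffn(x, feed_forward)
wrap = PostNormResidual()
x = torch.randn(2, 16, 512)
print(wrap(x, lambda t: torch.zeros_like(t)).shape)  # torch.Size([2, 16, 512])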

The paper's complexity argument is not only about asymptotic cost. A recurrent layer has $O(n)$ sequential operations for a sequence of length $n$, even if each step is cheap. A self-attention layer has constant sequential depth over positions during training, because the full attention matrix can be formed in parallel. The cost is $O(n^2 d)$ for full attention, which is acceptable for many 2017 translation lengths but becomes the motivation for later pages on Hyena, RWKV, and Mamba.

Architecture details and hyperparameters. The base model uses $d_{\text{model}} = 512$, $d_{\text{ff}} = 2048$, $8$ heads, dropout $0.1$, label smoothing $0.1$, Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. The learning rate uses warmup followed by inverse-square-root decay:

$$\mathrm{lr} = d_{\text{model}}^{-1/2} \cdot \min\left(\mathrm{step}^{-1/2},\ \mathrm{step}\cdot \mathrm{warmup}^{-3/2}\right),$$

with $4000$ warmup steps. The WMT 2014 English-German setup used about $4.5$ million sentence pairs and a shared BPE vocabulary of about $37{,}000$ tokens. English-French used a much larger dataset and a word-piece vocabulary of around $32{,}000$.
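
The schedule is easy to evaluate directly; this small sketch (with the base-model $d_{\text{model}}$ and warmup values plugged in) shows the linear warmup ramp followed by the inverse-square-root decay.

def transformer_lr(step, d_model=512, warmup=4000):
    # lr = d_model^(-1/2) * min(step^(-1/2), step * warmup^(-3/2))
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (100, 1000, 4000, 10000, 40000):
    print(step, f"{transformer_lr(step):.6f}")  # peaks at step 4000, then decays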

Benchmarks. The headline result is WMT 2014 machine translation. The paper reports $28.4$ BLEU on English-to-German for the big model, more than two BLEU above the previously reported best systems at the time. For English-to-French, the paper reports about $41$ BLEU for a single big model; the abstract gives $41.8$, while the main results table and discussion use a value around $41.0$, so it is safest to remember the result as roughly $41$ BLEU rather than as a single universal constant. The big English-French model trained for about $3.5$ days on eight P100 GPUs. The paper also reports strong English constituency parsing results, showing that the architecture was not only a translation trick.

Visual

Layer type | Per-layer token mixing | Sequential depth over positions | Main advantage | Main cost
Recurrent layer | Usually local through hidden state | $O(n)$ | Streaming state | Poor training parallelism
Convolution | Local unless deep or dilated | $O(1)$ per layer | Parallel and locality-biased | Long paths for distant tokens
Full self-attention | All pairs | $O(1)$ | Direct long-range routing | $O(n^2)$ scores
Causal self-attention | All previous positions | $O(1)$ in training | Autoregressive training parallelism | KV cache grows with context

Worked example 1: scaled dot-product attention by hand

Problem: compute one query attending to three key-value pairs. Let

$$q = [2, 0],\quad k_1 = [1, 0],\quad k_2 = [0, 1],\quad k_3 = [1, 1],$$

and

$$v_1 = [10, 0],\quad v_2 = [0, 20],\quad v_3 = [5, 5].$$

Use $d_k = 2$.

  1. Compute raw dot products:
$$\begin{aligned} q k_1^T &= 2\cdot 1 + 0\cdot 0 = 2,\\ q k_2^T &= 2\cdot 0 + 0\cdot 1 = 0,\\ q k_3^T &= 2\cdot 1 + 0\cdot 1 = 2. \end{aligned}$$
  2. Scale by $\sqrt{2}$:
$$s = \left[\frac{2}{\sqrt{2}},\, 0,\, \frac{2}{\sqrt{2}}\right] \approx [1.414,\, 0,\, 1.414].$$
  3. Apply softmax. Since $\exp(1.414) \approx 4.113$ and $\exp(0) = 1$,
$$Z = 4.113 + 1 + 4.113 = 9.226.$$

Therefore

$$\alpha \approx [4.113/9.226,\; 1/9.226,\; 4.113/9.226] \approx [0.446,\, 0.108,\, 0.446].$$
  4. Weight the values:
$$\begin{aligned} y &= 0.446\,[10, 0] + 0.108\,[0, 20] + 0.446\,[5, 5]\\ &= [4.46, 0] + [0, 2.16] + [2.23, 2.23]\\ &= [6.69, 4.39]. \end{aligned}$$

Check: the attention weights sum to $0.446 + 0.108 + 0.446 = 1.000$ up to rounding, and the output is a convex combination of the value vectors.
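
The same numbers can be reproduced mechanically; the sketch below (plain PyTorch, using the vectors from the problem statement) matches the hand computation up to rounding.

import torch

q = torch.tensor([[2.0, 0.0]])                              # one query, d_k = 2
K = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])      # three keys
V = torch.tensor([[10.0, 0.0], [0.0, 20.0], [5.0, 5.0]])    # three values

scores = q @ K.T / (2 ** 0.5)                               # scaled logits
weights = torch.softmax(scores, dim=-1)
print(weights)       # approximately [[0.446, 0.108, 0.446]]
print(weights @ V)   # approximately [[6.69, 4.40]]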

Worked example 2: choosing head dimensions and attention cost

Problem: a Transformer encoder uses $d_{\text{model}} = 512$, $h = 8$ heads, source length $n = 128$, and batch size $B = 4$. Find the per-head dimension and the shape of the attention score tensor for one encoder self-attention layer.

  1. Split the model dimension across heads:
$$d_k = d_v = d_{\text{model}}/h = 512/8 = 64.$$
  2. For each head, $Q$, $K$, and $V$ have shape
$$(B, n, d_k) = (4, 128, 64).$$
  3. The score matrix is $QK^T$ for each batch item and head. Its shape is
$$(B, h, n, n) = (4, 8, 128, 128).$$
  4. The number of scalar attention logits is
$$4\cdot 8\cdot 128\cdot 128 = 524{,}288.$$
  5. If the sequence doubles to $256$ while $B$ and $h$ stay fixed, the logits become
$$4\cdot 8\cdot 256\cdot 256 = 2{,}097{,}152.$$

Check: doubling the length multiplies the attention score count by $4$, which is the practical meaning of the quadratic term.
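
The shape bookkeeping can be confirmed with a few lines of PyTorch (random tensors; only the shapes and counts matter here):

import torch

B, h, d_model = 4, 8, 512
d_k = d_model // h                                  # 64 per head
for n in (128, 256):
    q = torch.randn(B, h, n, d_k)
    k = torch.randn(B, h, n, d_k)
    scores = q @ k.transpose(-2, -1)                # [B, h, n, n]
    print(n, tuple(scores.shape), scores.numel())   # 524288 logits, then 2097152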

Code

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: [batch, heads, length, head_dim]."""
    d_k = q.size(-1)
    # Attention logits: [batch, heads, length, length]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get -inf so softmax assigns them zero weight
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

batch, heads, length, head_dim = 2, 8, 16, 64
x = torch.randn(batch, length, heads * head_dim)
wq = torch.nn.Linear(heads * head_dim, heads * head_dim, bias=False)
wk = torch.nn.Linear(heads * head_dim, heads * head_dim, bias=False)
wv = torch.nn.Linear(heads * head_dim, heads * head_dim, bias=False)

# Project and split d_model into heads: [batch, heads, length, head_dim]
q = wq(x).view(batch, length, heads, head_dim).transpose(1, 2)
k = wk(x).view(batch, length, heads, head_dim).transpose(1, 2)
v = wv(x).view(batch, length, heads, head_dim).transpose(1, 2)

# Lower-triangular causal mask: position i attends only to positions <= i
causal = torch.tril(torch.ones(length, length, dtype=torch.bool))
out, attn = scaled_dot_product_attention(q, k, v, causal)
out = out.transpose(1, 2).reshape(batch, length, heads * head_dim)
print(out.shape, attn.shape)

Common pitfalls

  • Treating attention as a complete Transformer. The paper's gains depend on attention, feed-forward layers, residual paths, normalization, masking, embeddings, positional encodings, and the training recipe.
  • Forgetting the scale factor. Omitting the $\sqrt{d_k}$ division can make logits too large and the resulting softmax too sharp, especially with larger head dimensions.
  • Mixing up mask types. Padding masks remove fake tokens. Causal masks remove future tokens. Encoder self-attention usually needs the first, decoder self-attention needs both.
  • Comparing perplexities across incompatible tokenizations. The paper's word-piece or BPE settings matter.
  • Assuming attention is cheap at all lengths. Full attention parallelizes well but stores and processes n2n^2 pairwise scores.
  • Reading attention maps as guaranteed explanations. The paper's examples are suggestive, but individual heads are not complete causal explanations.

Connections