
Gated RNNs and Sequence-to-Sequence

Basic RNNs carry hidden states through time, but they struggle to preserve useful information over long spans. D2L addresses this with gated recurrent units and long short-term memory networks, then uses recurrent encoders and decoders for machine translation. The result is the classic sequence-to-sequence pattern: one network reads an input sequence into a state representation, and another network generates an output sequence.

This page is a bridge between vanilla recurrence and attention-based models. Gating explains how neural networks can decide what to remember, what to forget, and what to expose. Encoder-decoder training introduces teacher forcing, padding masks, sequence loss, and decoding. Beam search then shows that prediction is not just a forward pass; it is a search problem over possible output sequences.

Definitions

A GRU modifies the RNN hidden-state update using gates. With input $X_t$ and previous hidden state $H_{t-1}$, the reset gate $R_t$ and update gate $Z_t$ are

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r), \qquad Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z).$$

The candidate hidden state is

$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h),$$

and the final hidden state is

$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t.$$
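
To make these equations concrete, here is a minimal sketch of a single GRU step written directly from the formulas above. The sizes and random weights are illustrative assumptions, not trained parameters.

import torch

torch.manual_seed(0)
input_size, hidden_size = 4, 3
X_t = torch.randn(1, input_size)       # input at time t
H_prev = torch.randn(1, hidden_size)   # previous hidden state H_{t-1}

def affine_params():
    # Illustrative random parameters; a real GRU learns these.
    return (torch.randn(input_size, hidden_size),
            torch.randn(hidden_size, hidden_size),
            torch.zeros(hidden_size))

W_xr, W_hr, b_r = affine_params()
W_xz, W_hz, b_z = affine_params()
W_xh, W_hh, b_h = affine_params()

R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)          # reset gate
Z_t = torch.sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)          # update gate
H_cand = torch.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)  # candidate state
H_t = Z_t * H_prev + (1 - Z_t) * H_cand                        # gated interpolation
print(H_t.shape)  # torch.Size([1, 3])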

An LSTM uses a memory cell $C_t$ and gates for input, forget, and output. The forget gate controls how much old memory survives; the input gate controls how much new content enters; the output gate controls how much cell content is exposed as hidden state.
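
A quick shape check with PyTorch's built-in nn.LSTM (sizes here are illustrative) makes the separation visible: unlike nn.GRU, it returns its state as a (hidden, cell) tuple.

import torch
from torch import nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 5, 8)        # (batch, time, features)
outputs, (h_n, c_n) = lstm(x)   # state is a (hidden, cell) tuple
print(outputs.shape)            # torch.Size([2, 5, 16]): exposed hidden states
print(h_n.shape, c_n.shape)     # torch.Size([1, 2, 16]) each: hidden vs. cell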

A deep RNN stacks recurrent layers. A bidirectional RNN processes a sequence left-to-right and right-to-left, then combines both representations.

An encoder-decoder model maps an input sequence to an output sequence. The encoder reads the source sequence. The decoder predicts target tokens, usually conditioned on encoder state and previous target tokens.

Teacher forcing trains the decoder using the true previous target token as input. Beam search keeps the best $k$ partial sequences during decoding rather than committing to only the best next token.
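
A minimal sketch of the teacher-forcing shift, assuming a target that already starts with <bos> and ends with <eos> (the token ids here are hypothetical):

import torch

# Hypothetical ids: 1 = <bos>, 2 = <eos>, other ids are ordinary tokens.
tgt = torch.tensor([[1, 7, 9, 4, 2]])  # (batch, time)
dec_inputs = tgt[:, :-1]    # decoder sees <bos> plus true previous tokens
dec_targets = tgt[:, 1:]    # trained to predict each next token, ending in <eos>
print(dec_inputs)   # tensor([[1, 7, 9, 4]])
print(dec_targets)  # tensor([[7, 9, 4, 2]])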

Key results

The update gate in a GRU creates a direct interpolation between old and candidate state. If $Z_t$ is close to $1$, then

$$H_t \approx H_{t-1}.$$

If $Z_t$ is close to $0$, then

$$H_t \approx \tilde{H}_t.$$

This lets the model preserve information over many time steps when needed, reducing the vanishing-gradient problem compared with a plain RNN.

The LSTM cell gives memory an additive path:

$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t.$$

The additive form makes it easier to carry memory forward than repeatedly applying only nonlinear transformations. This is why LSTMs became a standard sequence model before transformers became dominant.

Sequence-to-sequence training minimizes token-level cross-entropy, but padding tokens should not contribute to the loss. If a target has valid length $m$ and padded length $T$, the masked loss is

$$\sum_{t=1}^{T} \mathbf{1}(t \le m)\,\ell_t.$$
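
A minimal sketch of this masked loss in PyTorch, assuming a padded batch and a tensor of valid lengths (the sizes and values are illustrative):

import torch
from torch import nn

torch.manual_seed(0)
batch, T, vocab = 2, 4, 10
logits = torch.randn(batch, T, vocab)
targets = torch.randint(1, vocab, (batch, T))
valid_len = torch.tensor([4, 2])   # true target lengths m per sequence

# 1(t <= m): weight 1 for valid positions, 0 for padding.
mask = torch.arange(T)[None, :] < valid_len[:, None]   # (batch, T) bool

per_token = nn.CrossEntropyLoss(reduction="none")(
    logits.reshape(-1, vocab), targets.reshape(-1)
).reshape(batch, T)
masked_loss = (per_token * mask).sum() / mask.sum()    # mean over valid tokens only
print(masked_loss.item())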

Greedy decoding chooses the best next token at each step. It is cheap but can miss the best full sequence because a locally strong token may lead to a poor continuation. Beam search approximates global sequence search by maintaining multiple candidates.

GRUs and LSTMs solve related problems with different internal structure. A GRU combines memory and hidden state in one vector, so it has fewer gates and often trains faster. An LSTM separates the cell state from the exposed hidden state, giving it a more explicit memory path. Which works better is empirical; D2L presents both because the gate equations reveal the design space.

Bidirectional RNNs are powerful encoders but not ordinary left-to-right generators. A bidirectional layer can represent a source sentence for translation because the whole source is known before decoding. It cannot be used directly to generate the next token in a causal language model because the backward direction would look at future tokens. This distinction foreshadows the difference between encoder attention and causal decoder attention in Transformers.
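
To see the direction issue concretely: a bidirectional nn.GRU doubles the feature dimension of its outputs and stacks one final state per direction, so a unidirectional decoder cannot consume that state unchanged. The merge layer below is one common fix, shown as an assumption rather than the only option.

import torch
from torch import nn

enc = nn.GRU(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
x = torch.randn(2, 5, 8)
outputs, h_n = enc(x)
print(outputs.shape)   # torch.Size([2, 5, 32]): forward and backward concatenated
print(h_n.shape)       # torch.Size([2, 2, 16]): one final state per direction

# One common choice: concatenate both directions and project back down
# to the size a unidirectional decoder expects.
merge = nn.Linear(2 * 16, 16)
dec_init = merge(torch.cat([h_n[0], h_n[1]], dim=1)).unsqueeze(0)
print(dec_init.shape)  # torch.Size([1, 2, 16])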

Machine translation also introduces vocabulary and length issues. Source and target languages usually have different vocabularies, and a target sentence may be shorter or longer than its source. Special tokens such as <bos>, <eos>, and <pad> define sequence boundaries and batching behavior. The decoder should learn when to stop by predicting <eos>, not by relying on a fixed output length.

The fixed-context encoder-decoder is limited because the entire source sequence must be compressed into a small set of final states. For short sentences this can work, but long sentences strain the bottleneck. Attention was originally introduced for neural machine translation to let the decoder consult encoder states at every output step. Thus D2L's progression from seq2seq to attention is not arbitrary; it fixes a concrete weakness of the basic encoder-decoder.

Loss masking is part of the model objective. Padding lets variable-length sequences share a batch, but padded positions do not represent target tokens. If they contribute loss, the model can learn to predict padding too strongly or distort perplexity. Correct valid lengths, masks, and target shifting are therefore as important as the recurrent cell equations.

Visual

| Model | Extra state or gates | Strength | Limitation |
| --- | --- | --- | --- |
| Plain RNN | Hidden state only | Simple recurrent baseline | Vanishing/exploding gradients |
| GRU | Reset and update gates | Fewer gates than LSTM | Less explicit memory cell |
| LSTM | Cell plus input/forget/output gates | Strong long-range memory | More parameters |
| Deep RNN | Multiple recurrent layers | Hierarchical sequence features | Slower and harder to train |
| Bidirectional RNN | Forward and backward states | Uses both contexts | Not causal for generation |
| Encoder-decoder | Separate reader and writer | Variable-length translation | Fixed context bottleneck without attention |

Worked example 1: GRU update gate interpolation

Problem: a one-dimensional GRU has previous hidden state $H_{t-1} = 10$, candidate state $\tilde{H}_t = 2$, and update gate $Z_t = 0.75$. Compute $H_t$.

Method:

  1. Use the GRU interpolation formula:
$$H_t = Z_t H_{t-1} + (1 - Z_t)\tilde{H}_t.$$
  2. Substitute values:
$$H_t = 0.75(10) + (1 - 0.75)(2).$$
  3. Compute the old-state contribution:
$$0.75(10) = 7.5.$$
  4. Compute the candidate contribution:
$$0.25(2) = 0.5.$$
  5. Add:
$$H_t = 7.5 + 0.5 = 8.$$

Checked answer: the new hidden state is $8$. Since the update gate is close to $1$, the GRU mostly preserves the old state.
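
The arithmetic is easy to confirm in plain Python:

z, h_prev, h_cand = 0.75, 10.0, 2.0
print(z * h_prev + (1 - z) * h_cand)  # 8.0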

Worked example 2: beam search with width two

Problem: decode two time steps with beam width $2$. At step 1, candidate log probabilities are A: -0.1, B: -0.7, C: -1.2. At step 2, extensions have log probabilities:

| prefix | next X | next Y |
| --- | --- | --- |
| A | -0.6 | -1.0 |
| B | -0.2 | -0.4 |

Find the top two length-two sequences.

Method:

  1. Keep the top two prefixes after step 1: A with score $-0.1$ and B with score $-0.7$. Drop C.
  2. Extend A:
$$\mathrm{score}(AX) = -0.1 + (-0.6) = -0.7, \qquad \mathrm{score}(AY) = -0.1 + (-1.0) = -1.1.$$
  3. Extend B:
$$\mathrm{score}(BX) = -0.7 + (-0.2) = -0.9, \qquad \mathrm{score}(BY) = -0.7 + (-0.4) = -1.1.$$
  4. Rank all length-two candidates by log score. Higher is better because log probabilities are negative:
$$AX: -0.7, \quad BX: -0.9, \quad AY: -1.1, \quad BY: -1.1.$$

Checked answer: the beam keeps AX and BX. Greedy search would also start with A, but beam search preserves B as an alternate path because it could lead to a stronger continuation.
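
The same search can be written as a tiny beam-search sketch; the dictionaries below simply encode the toy log probabilities from the problem, where a real decoder would compute these scores from a model.

step1 = {"A": -0.1, "B": -0.7, "C": -1.2}
step2 = {"A": {"X": -0.6, "Y": -1.0},
         "B": {"X": -0.2, "Y": -0.4}}
k = 2

# Keep the k best prefixes after step 1 (C is dropped here).
beam = sorted(step1.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Extend every surviving prefix and re-rank all candidates by total log score.
candidates = [(prefix + nxt, score + step2[prefix][nxt])
              for prefix, score in beam
              for nxt in step2[prefix]]
beam = sorted(candidates, key=lambda kv: kv[1], reverse=True)[:k]
print(beam)  # top two: AX (-0.7) then BX (-0.9), up to floating-point noise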

Code

import torch
from torch import nn

torch.manual_seed(4)

src_vocab = 30
tgt_vocab = 35
batch_size = 3
src_len = 5
tgt_len = 6
embed_size = 8
hidden_size = 16

# Encoder reads the source; its final state initializes the decoder.
encoder = nn.GRU(embed_size, hidden_size, batch_first=True)
decoder = nn.GRU(embed_size, hidden_size, batch_first=True)
src_embed = nn.Embedding(src_vocab, embed_size)
tgt_embed = nn.Embedding(tgt_vocab, embed_size)
classifier = nn.Linear(hidden_size, tgt_vocab)

# Toy batch of random token ids (no padding here, so no mask is needed).
src = torch.randint(0, src_vocab, (batch_size, src_len))
tgt = torch.randint(0, tgt_vocab, (batch_size, tgt_len))

# Encode the source; keep only the final hidden state as the context.
_, enc_state = encoder(src_embed(src))

# Teacher forcing: decoder input is the target shifted right,
# and the prediction target is the target shifted left.
dec_inputs = tgt[:, :-1]
dec_targets = tgt[:, 1:]
dec_outputs, _ = decoder(tgt_embed(dec_inputs), enc_state)
logits = classifier(dec_outputs)

# Token-level cross-entropy over all positions; with real padded data,
# padded positions would be masked out as described above.
loss = nn.CrossEntropyLoss()(
    logits.reshape(-1, tgt_vocab),
    dec_targets.reshape(-1),
)
loss.backward()
print("sequence loss:", loss.item())
print("logits shape:", logits.shape)

Common pitfalls

  • Using bidirectional encoder states in a decoder without correctly combining directions.
  • Computing loss on padded positions instead of masking them.
  • Forgetting that teacher forcing uses true previous tokens during training but generated previous tokens during inference.
  • Comparing beam-search scores across different lengths without length normalization when appropriate.
  • Treating GRU and LSTM hidden states as interchangeable; LSTM returns both hidden state and cell state.
  • Carrying hidden states across unrelated sequences in a shuffled batch.

Connections