Linear Regression and Training Loops
Linear regression is the first complete supervised learning system in D2L. It has a model, a loss function, an optimization algorithm, a data iterator, and an evaluation loop. The model is simple enough to solve analytically, but training it with minibatch stochastic gradient descent teaches the same mechanics used for deep networks.
The key idea is to predict a numeric target as a weighted sum of input features. In a deep learning framework, that weighted sum is just a linear layer. Once the training loop works for linear regression, the model can be replaced by an MLP, CNN, RNN, or transformer while keeping the outer logic mostly intact: load a minibatch, compute predictions, compute loss, backpropagate, update parameters, and monitor metrics.
Definitions
In linear regression, each example has a feature vector $\mathbf{x} \in \mathbb{R}^d$ and a real-valued target $y \in \mathbb{R}$. The model predicts

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b,$$

where $\mathbf{w} \in \mathbb{R}^d$ is the weight vector and $b \in \mathbb{R}$ is the bias.
For a dataset with feature matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, the batched prediction is

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b.$$
The common squared loss for one example is

$$\ell^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left( \hat{y}^{(i)} - y^{(i)} \right)^2.$$

The factor $\frac{1}{2}$ cancels the derivative of the square and has no effect on the minimizer.
The empirical risk is the average loss over the training set:

$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left( \mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)} \right)^2.$$
Minibatch stochastic gradient descent estimates the full gradient using a small subset $\mathcal{B}$ of examples:

$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)} \ell^{(i)}(\mathbf{w}, b).$$
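The update rule can be sketched in a few lines of PyTorch. This is a minimal illustration on hypothetical toy tensors, not a full training loop: one minibatch, one update, manual parameter subtraction.

```python
import torch

# Toy minibatch: 4 examples, 2 features (values are illustrative)
X = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = torch.tensor([[5.0], [11.0], [17.0], [23.0]])

w = torch.zeros(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
eta = 0.01

# Mean of per-example half-squared losses over the minibatch
loss = (0.5 * (X @ w + b - y) ** 2).mean()
loss.backward()

# One SGD step: subtract eta times the minibatch gradient
with torch.no_grad():
    w -= eta * w.grad
    b -= eta * b.grad
```

Running the same forward pass again after the step gives a smaller loss, which is the only guarantee a single noisy update can offer.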
Key results
For the full dataset, squared-error linear regression has a closed-form normal equation when $\mathbf{X}^\top \mathbf{X}$ is invertible:

$$\mathbf{w}^* = \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{y}.$$

This result is mathematically useful, but D2L emphasizes gradient-based training because it scales to the neural network models that do not have closed-form solutions. The normal equation requires forming and inverting a $d \times d$ matrix, which can be expensive or unstable when $d$ is large or features are collinear.
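The closed form can be checked numerically. A small sketch, assuming noise-free synthetic data so the true weights are exactly recoverable; it contrasts solving the normal equation directly with a QR-based least-squares solve that never forms $\mathbf{X}^\top \mathbf{X}$:

```python
import torch

torch.manual_seed(0)
n, d = 100, 3
X = torch.randn(n, d)
true_w = torch.tensor([[1.0], [-2.0], [0.5]])
y = X @ true_w  # noise-free, so the exact solution is recoverable

# Normal equation: w* = (X^T X)^{-1} X^T y, via a linear solve
w_normal = torch.linalg.solve(X.T @ X, X.T @ y)

# Least squares without forming X^T X (better conditioned)
w_lstsq = torch.linalg.lstsq(X, y).solution
```

Both routes agree here; with collinear features, the `lstsq` route degrades more gracefully.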
The gradients for a single example are direct:

$$\partial_{\mathbf{w}} \ell^{(i)} = \left( \hat{y}^{(i)} - y^{(i)} \right) \mathbf{x}^{(i)}, \qquad \partial_{b} \ell^{(i)} = \hat{y}^{(i)} - y^{(i)}.$$

For a minibatch matrix $\mathbf{X}_{\mathcal{B}}$, residual vector $\mathbf{r} = \mathbf{X}_{\mathcal{B}} \mathbf{w} + b - \mathbf{y}_{\mathcal{B}}$, and mean squared loss with the same $\frac{1}{2}$ convention,

$$\partial_{\mathbf{w}} L_{\mathcal{B}} = \frac{1}{|\mathcal{B}|} \mathbf{X}_{\mathcal{B}}^\top \mathbf{r}, \qquad \partial_{b} L_{\mathcal{B}} = \frac{1}{|\mathcal{B}|} \mathbf{1}^\top \mathbf{r}.$$
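These analytic gradients can be cross-checked against autograd, a useful habit whenever a gradient is derived by hand. A small sketch on random tensors:

```python
import torch

torch.manual_seed(1)
B, d = 8, 3
Xb = torch.randn(B, d)
yb = torch.randn(B, 1)
w = torch.randn(d, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Mean squared loss with the 1/2 convention
r = Xb @ w + b - yb
loss = (0.5 * r ** 2).mean()
loss.backward()

# Analytic minibatch gradients: (1/|B|) Xb^T r and (1/|B|) 1^T r
grad_w = Xb.T @ r.detach() / B
grad_b = r.detach().sum() / B
```

If the analytic and autograd gradients disagree, the derivation (or the loss convention) is wrong, and the mismatch is usually a factor of 2 or a missing mean.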
Under the probabilistic model $y = \mathbf{w}^\top \mathbf{x} + b + \epsilon$ with Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$, minimizing squared error is equivalent to maximizing likelihood. This is why squared loss is not merely convenient; it corresponds to a noise assumption.
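The equivalence follows in one step from the negative log-likelihood. Each observation has density

$$p\left( y^{(i)} \mid \mathbf{x}^{(i)} \right) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{ \left( y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b \right)^2 }{ 2 \sigma^2 } \right),$$

so for independent examples

$$- \log p(\mathbf{y} \mid \mathbf{X}) = \frac{n}{2} \log \left( 2 \pi \sigma^2 \right) + \frac{1}{2 \sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b \right)^2.$$

Only the second term depends on $\mathbf{w}$ and $b$, so maximizing likelihood and minimizing the sum of squared errors pick out the same parameters.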
A training loop needs four separable pieces: a data iterator, a model, a loss, and an optimizer. D2L returns to this separation throughout the book because it makes code reusable. Data handling can change without rewriting the model, and the optimizer can change without rewriting the forward pass.
Linear regression also introduces the idea of a baseline. Before using a deep model, a practitioner should know how well a linear model performs with clean preprocessing. If a later MLP or CNN cannot beat this baseline, the issue may be data quality, leakage, target noise, or an optimization bug rather than insufficient architecture complexity.
The bias term deserves explicit treatment. Appending a constant feature of $1$ to every input lets the bias be absorbed into the weight vector, but frameworks usually represent it separately. Keeping it separate makes parameter groups easier to inspect and lets optimizers or regularizers treat biases differently from weights.
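The absorption trick can be demonstrated directly. A minimal sketch, assuming hypothetical noise-free data: append a column of ones, solve once, and read the bias off the last coefficient.

```python
import torch

torch.manual_seed(2)
n, d = 50, 2
X = torch.randn(n, d)
y = X @ torch.tensor([[1.5], [-0.5]]) + 2.0  # true b = 2.0

# Absorb the bias: append a constant feature of 1 to every input
X_aug = torch.cat([X, torch.ones(n, 1)], dim=1)
w_aug = torch.linalg.lstsq(X_aug, y).solution

# The last entry of the augmented weight vector plays the role of b
w_absorbed, b_absorbed = w_aug[:d], w_aug[d]
```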
The stochastic training view is also the first encounter with noisy optimization. A minibatch gradient may point away from the full-dataset gradient for a particular step, but it is cheaper and often good enough over many updates. This tradeoff between noisy estimates and computational efficiency reappears in classification, language modeling, and reinforcement learning.
Synthetic data is useful because the true parameters are known. D2L uses generated regression data to check whether a training loop can recover weights close to the ground truth. This is a powerful debugging pattern: when a model fails on real data, first verify it can fit a simple controlled problem. If it cannot, the bug is likely in the implementation, optimizer, shapes, or loss rather than the dataset.
Residual plots are another diagnostic. If residuals are centered around zero with no visible structure, a linear model may be adequate. If residuals curve with a feature, fan out with magnitude, or differ by category, the model assumptions are probably missing nonlinear effects, heteroscedasticity, or interactions. Deep models can reduce such structure, but only if the training data and features expose it.
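Residual structure can be quantified, not just eyeballed. A small sketch, assuming a deliberately quadratic target: the linear fit leaves residuals with mean near zero but near-perfect correlation with $x^2$, exactly the kind of structure a residual plot would reveal.

```python
import torch

n = 500
x = torch.linspace(-2, 2, n).reshape(-1, 1)
y = x ** 2  # truly quadratic target

# Best linear fit (with bias) via least squares
X_aug = torch.cat([x, torch.ones(n, 1)], dim=1)
coef = torch.linalg.lstsq(X_aug, y).solution
residuals = X_aug @ coef - y

# Residual mean is ~0, but residuals correlate strongly with x^2:
# structure in the residuals signals a missing nonlinearity.
corr = torch.corrcoef(torch.stack([residuals.flatten(), (x ** 2).flatten()]))[0, 1]
```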
Visual
| Component | From-scratch version | PyTorch concise version |
|---|---|---|
| Parameters | Tensors with `requires_grad=True` | `nn.Linear(d, 1)` |
| Prediction | `X @ w + b` | `model(X)` |
| Loss | Manual squared error | `nn.MSELoss()` |
| Optimizer | Manual subtraction under `torch.no_grad()` | `torch.optim.SGD` |
| Data iterator | Manual shuffling and slicing | `DataLoader` |
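The left column of the table can be sketched end to end. This is a minimal from-scratch loop on hypothetical noise-free synthetic data, with manual shuffling, a manual loss, and manual parameter updates:

```python
import torch

torch.manual_seed(0)
n, d, lr, batch_size = 200, 2, 0.1, 20
X = torch.randn(n, d)
y = X @ torch.tensor([[2.0], [-3.4]]) + 4.2

# Parameters: tensors with requires_grad=True
w = torch.zeros(d, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

for epoch in range(10):
    # Data iterator: manual shuffling and slicing
    idx = torch.randperm(n)
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        pred = X[batch] @ w + b                       # prediction: X @ w + b
        loss = (0.5 * (pred - y[batch]) ** 2).mean()  # manual squared error
        loss.backward()
        with torch.no_grad():                         # optimizer: manual update
            w -= lr * w.grad
            b -= lr * b.grad
            w.grad.zero_()
            b.grad.zero_()
```

Every row of the concise column replaces one of these hand-written pieces without changing the loop's shape.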
Worked example 1: one SGD update by hand
Problem: perform one gradient-descent update for a one-dimensional linear regression model $\hat{y} = wx + b$. Use one minibatch with two examples: $(x^{(1)}, y^{(1)}) = (1, 3)$ and $(x^{(2)}, y^{(2)}) = (2, 5)$. Start with $w = 1$, $b = 0$, and learning rate $\eta = 0.1$. Use the $\frac{1}{2}$ squared-loss convention and average the loss over the minibatch.
Method:
- Compute predictions: $\hat{y}^{(1)} = 1 \cdot 1 + 0 = 1$ and $\hat{y}^{(2)} = 1 \cdot 2 + 0 = 2$.
- Compute residuals: $r^{(1)} = \hat{y}^{(1)} - y^{(1)} = -2$ and $r^{(2)} = \hat{y}^{(2)} - y^{(2)} = -3$.
- Compute the weight gradient: $\partial_w L = \frac{1}{2} \left( r^{(1)} x^{(1)} + r^{(2)} x^{(2)} \right) = \frac{1}{2}(-2 - 6) = -4$.
- Compute the bias gradient: $\partial_b L = \frac{1}{2} \left( r^{(1)} + r^{(2)} \right) = \frac{1}{2}(-5) = -2.5$.
- Update parameters: $w \leftarrow 1 - 0.1 \cdot (-4) = 1.4$ and $b \leftarrow 0 - 0.1 \cdot (-2.5) = 0.25$.
Checked answer: after one update, $w = 1.4$ and $b = 0.25$. Both increase because the model underpredicted both targets.
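A single SGD step like this can be verified with autograd. The minibatch below is one concrete instance, examples $(1, 3)$ and $(2, 5)$ starting from $w = 1$, $b = 0$ with $\eta = 0.1$:

```python
import torch

# One-dimensional model y_hat = w*x + b, two-example minibatch
x = torch.tensor([[1.0], [2.0]])
y = torch.tensor([[3.0], [5.0]])
w = torch.tensor([[1.0]], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
eta = 0.1

# Half-squared loss, averaged over the minibatch
loss = (0.5 * (x @ w + b - y) ** 2).mean()
loss.backward()
with torch.no_grad():
    w -= eta * w.grad
    b -= eta * b.grad
```

After one step, `w` is 1.4 and `b` is 0.25, matching the hand computation.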
Worked example 2: normal equation for two examples
Problem: solve a linear regression exactly for the model $\hat{y} = wx$ with no bias using the two examples $(x^{(1)}, y^{(1)}) = (1, 1)$ and $(x^{(2)}, y^{(2)}) = (2, 3)$.
Method:
- Use the no-bias normal equation: $w^* = \dfrac{\mathbf{x}^\top \mathbf{y}}{\mathbf{x}^\top \mathbf{x}}$.
- Compute $\mathbf{x}^\top \mathbf{x}$: $1^2 + 2^2 = 5$.
- Compute $\mathbf{x}^\top \mathbf{y}$: $1 \cdot 1 + 2 \cdot 3 = 7$.
- Solve: $w^* = \frac{7}{5} = 1.4$.
- Check predictions and residuals: $\hat{y}^{(1)} = 1.4$ and $\hat{y}^{(2)} = 2.8$, so $r^{(1)} = 0.4$ and $r^{(2)} = -0.2$.
The residuals are not both zero because a no-bias line through the origin cannot pass through both points exactly.
Checked answer: $w^* = 1.4$, and the orthogonality condition $\mathbf{x}^\top \mathbf{r} = 0$ holds because $1 \cdot 0.4 + 2 \cdot (-0.2) = 0$.
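The same computation takes three lines in PyTorch. A minimal sketch with the two points $(1, 1)$ and $(2, 3)$:

```python
import torch

# No-bias model through the origin: two example points
X = torch.tensor([[1.0], [2.0]])
y = torch.tensor([[1.0], [3.0]])

# Scalar normal equation: w* = (x^T y) / (x^T x)
w = (X.T @ y) / (X.T @ X)

r = X @ w - y   # residuals of the exact least-squares fit
orth = X.T @ r  # orthogonality: x^T r should be (numerically) zero
```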
Code
```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data with known parameters, so recovery can be checked
n, d = 1000, 2
true_w = torch.tensor([[2.0], [-3.4]])
true_b = 4.2
X = torch.randn(n, d)
y = X @ true_w + true_b + 0.01 * torch.randn(n, 1)

loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
model = nn.Linear(d, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.03)

for epoch in range(5):
    total_loss = 0.0
    for xb, yb in loader:
        pred = model(xb)
        loss = loss_fn(pred, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Weight each batch loss by batch size so the epoch average is exact
        total_loss += loss.item() * xb.shape[0]
    print(epoch, total_loss / n)

print("learned weight:", model.weight.data)
print("learned bias:", model.bias.data)
```
Common pitfalls
- Forgetting to shuffle training data before forming minibatches, especially when examples are ordered by label or time.
- Averaging a loss twice or not averaging it at all, which silently changes the effective learning rate.
- Calling `loss.backward()` without `optimizer.zero_grad()`, causing gradients to accumulate across minibatches.
- Using the validation set to tune every small detail until it effectively becomes part of training.
- Expecting the normal equation to be the preferred implementation for large neural models.
- Ignoring feature scaling, which can make gradient descent zigzag across elongated loss contours.
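The loss-averaging pitfall above can be made concrete. A small sketch comparing `nn.MSELoss` with `reduction="sum"` versus `reduction="mean"` on identical (zero-initialized) models: the sum-reduced gradients are larger by exactly the batch size, so mixing the two conventions silently rescales the effective learning rate.

```python
import torch
from torch import nn

torch.manual_seed(4)
xb = torch.randn(16, 3)
yb = torch.randn(16, 1)

def grad_norm(reduction):
    model = nn.Linear(3, 1)
    for p in model.parameters():
        nn.init.zeros_(p)  # identical parameters for both runs
    loss = nn.MSELoss(reduction=reduction)(model(xb), yb)
    loss.backward()
    return model.weight.grad.norm().item()

# 'sum' gradients are batch_size (here 16) times larger than 'mean' gradients
ratio = grad_norm("sum") / grad_norm("mean")
```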