# Convolutional Neural Networks
Convolutional neural networks exploit the structure of images. A fully connected layer treats every pixel position as unrelated to every other position, but images have local patterns, repeated motifs, and spatial neighborhoods. D2L develops CNNs by moving from the idea of translation-aware feature extraction to the concrete cross-correlation operation used in deep learning libraries.
The core insight is parameter sharing. A small kernel slides across an image and applies the same weights at every location. This makes the model efficient, encourages it to detect the same pattern anywhere in the image, and builds feature maps that preserve spatial layout. Padding, stride, channels, pooling, and stacked convolutional layers then turn this simple operation into a complete architecture such as LeNet.
## Definitions
For an input image $\mathbf{X}$ and kernel $\mathbf{K}$, the two-dimensional cross-correlation output $\mathbf{Y}$ is

$$Y_{i,j} = \sum_{a=0}^{k_h-1} \sum_{b=0}^{k_w-1} X_{i+a,\,j+b} \, K_{a,b}.$$
Deep learning libraries usually call this operation convolution, even though mathematical convolution flips the kernel before applying it. Since kernels are learned, the distinction rarely matters for neural network training.
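The definition above translates almost directly into a loop. This is a minimal sketch (the function name `corr2d` follows D2L's convention; the example input and kernel are illustrative):

```python
import torch

def corr2d(X, K):
    """Valid 2D cross-correlation of a single-channel input X with kernel K."""
    h, w = K.shape
    Y = torch.zeros(X.shape[0] - h + 1, X.shape[1] - w + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Slide the window and take an elementwise product-sum; no kernel flip.
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.arange(9, dtype=torch.float32).reshape(3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
print(corr2d(X, K))  # tensor([[19., 25.], [37., 43.]])
```

The double loop is for clarity only; real libraries implement the same operation with highly optimized kernels.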
A convolutional layer learns one or more kernels and optionally a bias. With multiple input channels, each output channel sums cross-correlations over all input channels. If the input has $c_i$ channels and the layer has $c_o$ output channels, the kernel tensor has shape $(c_o, c_i, k_h, k_w)$ in PyTorch.
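The kernel tensor layout can be inspected directly. A short sketch with illustrative sizes (3 input channels, 8 output channels, a 5×5 kernel):

```python
import torch
from torch import nn

# Illustrative sizes: c_i = 3 input channels, c_o = 8 output channels, 5x5 kernel.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)
print(conv.weight.shape)  # torch.Size([8, 3, 5, 5]) -- (c_o, c_i, k_h, k_w)
print(conv.bias.shape)    # torch.Size([8]) -- one bias per output channel
```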
Padding adds rows and columns around the input. It controls spatial size and lets border pixels participate in more windows. Stride is the step size between adjacent kernel positions. A larger stride downsamples the feature map.
For input height $n_h$, kernel height $k_h$, padding $p_h$ on each side, and stride $s_h$, the output height is

$$\left\lfloor \frac{n_h - k_h + 2p_h}{s_h} \right\rfloor + 1.$$

The same formula applies to width with $n_w$, $k_w$, $p_w$, and $s_w$.
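The formula can be cross-checked against a real layer. A small sketch with illustrative numbers (28×28 input, 3×3 kernel, padding 1, stride 2):

```python
import torch
from torch import nn

n_h, k_h, p_h, s_h = 28, 3, 1, 2            # illustrative values
expected = (n_h - k_h + 2 * p_h) // s_h + 1  # floor division matches the floor

conv = nn.Conv2d(1, 1, kernel_size=k_h, padding=p_h, stride=s_h)
out = conv(torch.randn(1, 1, n_h, n_h))
print(expected, tuple(out.shape))  # 14 (1, 1, 14, 14)
```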
Pooling aggregates local neighborhoods without learned weights. Max pooling returns the largest value in each window; average pooling returns the mean. Pooling adds local translation tolerance and often reduces spatial resolution.
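The two pooling variants differ only in the aggregation function; a tiny 2×2 example makes this concrete:

```python
import torch
from torch import nn

x = torch.tensor([[[[0.0, 1.0], [2.0, 3.0]]]])  # shape (1, 1, 2, 2)
mx = nn.MaxPool2d(2)(x)   # largest value in the window
avg = nn.AvgPool2d(2)(x)  # mean of the window
print(mx)   # tensor([[[[3.]]]])
print(avg)  # tensor([[[[1.5000]]]])
```

Neither layer has learnable parameters; only the window size, padding, and stride control the output.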
LeNet is an early CNN for digit recognition. It alternates convolution, nonlinear activation, pooling, and fully connected layers.
## Key results
Convolutions encode three useful assumptions for images. Locality says nearby pixels are more strongly related than distant pixels. Translation equivariance says shifting an input should shift the feature map in a corresponding way. Parameter sharing says the same detector can be useful at many locations.
For a single-channel input of size $h \times w$ and a kernel of size $k \times k$, a fully connected layer from all pixels to one hidden unit uses $hw$ weights. A convolutional detector uses only $k^2$ weights and applies them everywhere. With many feature maps, this savings becomes dramatic.
A $1 \times 1$ convolution does not mix spatial neighborhoods, but it mixes channels at each pixel location. If the input has $c_i$ channels and the output has $c_o$ channels, a $1 \times 1$ convolution learns a matrix-like transformation from $\mathbb{R}^{c_i}$ to $\mathbb{R}^{c_o}$ at every spatial position. D2L uses this idea to explain later architectures such as NiN and GoogLeNet.
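The "matrix at every pixel" view can be verified numerically: a 1×1 convolution without bias equals a channel-space matrix multiply applied position by position. A sketch with illustrative channel counts:

```python
import torch
from torch import nn

c_i, c_o = 3, 2  # illustrative channel counts
conv = nn.Conv2d(c_i, c_o, kernel_size=1, bias=False)

x = torch.randn(1, c_i, 4, 4)
y_conv = conv(x)

# The same computation as a (c_o, c_i) matrix applied over the channel axis.
W = conv.weight.reshape(c_o, c_i)
y_mm = torch.einsum('oi,bihw->bohw', W, x)
print(torch.allclose(y_conv, y_mm, atol=1e-6))  # True
```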
Pooling is not the same as convolution. It has no learned kernel and usually discards precise spatial information. This can help classification, where exact location may be less important, but it can hurt dense prediction tasks such as segmentation if overused.
Stacking small kernels increases the receptive field. Two $3 \times 3$ convolutional layers without dilation give a $5 \times 5$ effective receptive field, while using fewer parameters than one $5 \times 5$ convolutional layer with the same channel counts.
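The parameter comparison is easy to check. A sketch with an illustrative channel count of 16, bias disabled in both stacks:

```python
from torch import nn

c = 16  # illustrative channel count, kept fixed across layers
stacked = nn.Sequential(nn.Conv2d(c, c, 3, bias=False),
                        nn.Conv2d(c, c, 3, bias=False))
single = nn.Conv2d(c, c, 5, bias=False)

n_stacked = sum(p.numel() for p in stacked.parameters())  # 2 * 16*16*3*3 = 4608
n_single = sum(p.numel() for p in single.parameters())    # 16*16*5*5 = 6400
print(n_stacked, n_single)  # 4608 6400
```

Both map a 5×5 input patch to one output value, but the stacked version is cheaper and adds an extra nonlinearity between the layers.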
Channel dimensions are part of the model's learned representation. Early layers may detect edges or color contrasts, while later layers combine channels into textures, parts, and object-level patterns. A convolution with $c_i$ input channels, $c_o$ output channels, and kernel size $k_h \times k_w$ has $c_o \, c_i \, k_h k_w$ weights, plus $c_o$ biases when bias is enabled. This makes channel growth a major driver of parameter count and computation.
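The weight-count formula can be checked against PyTorch's own parameter count. A sketch with illustrative sizes (3 input channels, 64 output channels, a 3×3 kernel):

```python
from torch import nn

c_i, c_o, k_h, k_w = 3, 64, 3, 3  # illustrative sizes
conv = nn.Conv2d(c_i, c_o, kernel_size=(k_h, k_w), bias=True)

formula = c_o * c_i * k_h * k_w + c_o           # weights plus biases
counted = sum(p.numel() for p in conv.parameters())
print(formula, counted)  # 1792 1792
```

Doubling $c_o$ doubles this count, which is why channel growth dominates the budget of deep networks.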
The receptive field of a unit is the region of the original input that can affect it. Deeper layers have larger receptive fields because each convolution composes neighborhoods from the previous feature map. However, the theoretical receptive field can be larger than the effective region that strongly influences the output, especially early in training. This is one reason architecture design balances depth, stride, pooling, and skip connections.
Layout conventions matter when translating formulas into code. D2L's mathematical images are often written as height-by-width arrays, but PyTorch stores batches as (N, C, H, W). Most silent CNN mistakes are not in the convolution formula; they are in reshaping, flattening, or feeding a channels-last tensor into a channels-first layer.
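A common concrete fix for the channels-last problem is a single `permute` before the first convolution. A minimal sketch, assuming an image loaded in (H, W, C) order:

```python
import torch

# A channels-last image, e.g. as returned by many image-loading libraries: (H, W, C).
img_hwc = torch.randn(28, 28, 3)

# nn.Conv2d expects (N, C, H, W): move channels first, then add a batch dimension.
img_nchw = img_hwc.permute(2, 0, 1).unsqueeze(0)
print(tuple(img_nchw.shape))  # (1, 3, 28, 28)
```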
Translation equivariance should be distinguished from translation invariance. A convolutional feature map is equivariant: shifting the input tends to shift the feature response. Pooling, striding, and global average pooling add degrees of invariance by reducing the importance of exact position. Classification often wants some invariance, but detection and segmentation need enough equivariance to locate objects accurately.
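Equivariance can be demonstrated empirically: shift the input, and the feature map shifts with it. This sketch uses `torch.roll` (which wraps around) and therefore compares only interior columns, where neither the wrap nor the zero padding interferes:

```python
import torch
from torch import nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 8, 8)
x_shifted = torch.roll(x, shifts=2, dims=3)  # shift input 2 pixels right (wraps)

y = conv(x)
y_shifted = conv(x_shifted)

# Away from the borders, shifting the output of the original input
# reproduces the output of the shifted input.
ok = torch.allclose(torch.roll(y, 2, dims=3)[..., 3:-3], y_shifted[..., 3:-3])
print(ok)  # True
```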
LeNet is historically small, but its pattern remains recognizable. Convolutional layers learn local features, pooling reduces resolution, and dense layers map the final representation to class logits. Modern networks replace sigmoid with ReLU-like activations, add normalization, and use deeper blocks, but the same tensor flow from image to feature maps to classifier is still present.
## Visual
| Operation | Learned parameters | Main effect | Shape control |
|---|---|---|---|
| Convolution | Yes | Local feature extraction | Kernel, padding, stride |
| 1×1 convolution | Yes | Channel mixing | Output channels |
| Max pooling | No | Local winner selection | Window, padding, stride |
| Average pooling | No | Local smoothing | Window, padding, stride |
| Padding | No | Border handling | Added pixels |
| Stride | No | Downsampling | Step size |
## Worked example 1: cross-correlation by hand
Problem: compute the valid cross-correlation of

$$\mathbf{X} = \begin{bmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \\ 6 & 7 & 8 \end{bmatrix}$$

with kernel

$$\mathbf{K} = \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix}.$$
Method:
- The input is $3 \times 3$ and the kernel is $2 \times 2$, so the valid output is $2 \times 2$.
- Top-left output: $0 \cdot 0 + 1 \cdot 1 + 3 \cdot 2 + 4 \cdot 3 = 19$
- Top-right output: $1 \cdot 0 + 2 \cdot 1 + 4 \cdot 2 + 5 \cdot 3 = 25$
- Bottom-left output: $3 \cdot 0 + 4 \cdot 1 + 6 \cdot 2 + 7 \cdot 3 = 37$
- Bottom-right output: $4 \cdot 0 + 5 \cdot 1 + 7 \cdot 2 + 8 \cdot 3 = 43$

Checked answer:

$$\mathbf{Y} = \begin{bmatrix} 19 & 25 \\ 37 & 43 \end{bmatrix}$$
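The hand computation can be confirmed with `F.conv2d`, which implements cross-correlation (no kernel flip). This sketch assumes the classic D2L-style inputs: a 3×3 grid of 0 through 8 and the 2×2 kernel [[0, 1], [2, 3]]:

```python
import torch
import torch.nn.functional as F

X = torch.arange(9, dtype=torch.float32).reshape(1, 1, 3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)

# F.conv2d performs cross-correlation despite the name.
Y = F.conv2d(X, K)
print(Y.reshape(2, 2))  # tensor([[19., 25.], [37., 43.]])
```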
## Worked example 2: output size with padding and stride
Problem: a convolution receives an image of size $32 \times 32$, uses a $3 \times 3$ kernel, padding $1$, and stride $2$. Find the output spatial size.
Method:
- Use the formula for height: $\left\lfloor \dfrac{n_h - k_h + 2p_h}{s_h} \right\rfloor + 1$
- Substitute values: $\left\lfloor \dfrac{32 - 3 + 2 \cdot 1}{2} \right\rfloor + 1$
- Evaluate: $\lfloor 31/2 \rfloor + 1 = 15 + 1 = 16$
- Width has the same numbers, so the output is $16 \times 16$.
Checked answer: the output feature map has spatial size $16 \times 16$. If the layer has 64 output channels and the batch size is 10, the full output shape is (10, 64, 16, 16).
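The same shape falls out of a real layer. A sketch assuming 3 input channels (an RGB-like input; the input channel count is not fixed by the problem):

```python
import torch
from torch import nn

# 3x3 kernel, padding 1, stride 2, 64 output channels, batch of 10.
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1, stride=2)
x = torch.randn(10, 3, 32, 32)
y = conv(x)
print(tuple(y.shape))  # (10, 64, 16, 16)
```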
## Code
```python
import torch
from torch import nn


class LeNetLike(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            # 1x28x28 -> 6x28x28 (padding=2 preserves the 28x28 size)
            nn.Conv2d(1, 6, kernel_size=5, padding=2),
            nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),   # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),         # -> 16x10x10
            nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),   # -> 16x5x5
            nn.Flatten(),                            # -> 400
            nn.Linear(16 * 5 * 5, 120),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            nn.Sigmoid(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.net(x)


model = LeNetLike()
x = torch.randn(8, 1, 28, 28)  # a batch of 8 single-channel 28x28 images
logits = model(x)
print(logits.shape)  # torch.Size([8, 10])
print("parameters:", sum(p.numel() for p in model.parameters()))
```
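A quick way to verify the per-layer sizes annotated above is to trace the activation shape through the stack. This sketch rebuilds the same layers standalone so it runs on its own:

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),
)

h = torch.randn(1, 1, 28, 28)
for layer in net:
    h = layer(h)
    # Print each layer's name and the shape it produces.
    print(layer.__class__.__name__, tuple(h.shape))
```

Shape tracing like this catches most flattening and channel-count mistakes before training starts.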
## Common pitfalls
- Forgetting that PyTorch convolution expects (batch, channels, height, width).
- Mixing up mathematical convolution with cross-correlation. Neural network libraries learn the kernel, so the flip is usually irrelevant.
- Calculating output sizes without the floor operation.
- Using too much pooling in tasks that need precise spatial output, such as segmentation.
- Treating padding as harmless. Padding changes border statistics and can affect features near image edges.
- Flattening too early, which discards spatial structure and creates many unnecessary parameters.