Convolutional Neural Networks

Convolutional neural networks exploit the structure of images. A fully connected layer treats every pixel position as unrelated to every other position, but images have local patterns, repeated motifs, and spatial neighborhoods. D2L develops CNNs by moving from the idea of translation-aware feature extraction to the concrete cross-correlation operation used in deep learning libraries.

The core insight is parameter sharing. A small kernel slides across an image and applies the same weights at every location. This makes the model efficient, encourages it to detect the same pattern anywhere in the image, and builds feature maps that preserve spatial layout. Padding, stride, channels, pooling, and stacked convolutional layers then turn this simple operation into a complete architecture such as LeNet.

Definitions

For an input image $X$ and kernel $K$, the two-dimensional cross-correlation output is

$$Y_{i,j} = \sum_{a=0}^{h_k-1} \sum_{b=0}^{w_k-1} X_{i+a,j+b} K_{a,b}.$$

Deep learning libraries usually call this operation convolution, even though mathematical convolution flips the kernel before applying it. Since kernels are learned, the distinction rarely matters for neural network training.
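
The formula translates directly into code. Below is a minimal sketch of valid single-channel cross-correlation in the spirit of D2L's `corr2d`; the explicit loops are illustrative, not how libraries implement it:

```python
import torch

def corr2d(X, K):
    """Valid 2D cross-correlation of a single-channel input X with kernel K."""
    h_k, w_k = K.shape
    Y = torch.zeros(X.shape[0] - h_k + 1, X.shape[1] - w_k + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Multiply the window anchored at (i, j) elementwise by K and sum.
            Y[i, j] = (X[i:i + h_k, j:j + w_k] * K).sum()
    return Y
```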

A convolutional layer learns one or more kernels and optionally a bias. With multiple input channels, each output channel sums cross-correlations over all input channels. If $X$ has $c_i$ input channels and the layer has $c_o$ output channels, the kernel tensor has shape $(c_o, c_i, h_k, w_k)$ in PyTorch.
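
A quick way to confirm this shape convention; the layer sizes below are arbitrary, chosen only so the four dimensions are distinguishable:

```python
import torch
from torch import nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=(5, 4))
print(conv.weight.shape)  # torch.Size([8, 3, 5, 4]), i.e. (c_o, c_i, h_k, w_k)
print(conv.bias.shape)    # torch.Size([8]), one bias per output channel
```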

Padding adds rows and columns around the input. It controls spatial size and lets border pixels participate in more windows. Stride is the step size between adjacent kernel positions. A larger stride downsamples the feature map.

For input height $h$, kernel height $k$, padding $p$, and stride $s$, the output height is

$$\left\lfloor \frac{h + 2p - k}{s} \right\rfloor + 1.$$

The same formula applies to width.
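
As a sketch, the formula is a one-liner in Python, where integer floor division implements the floor; the example values are arbitrary:

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output length along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out(32, kernel=3))                       # 30: valid convolution shrinks by k - 1
print(conv_out(32, kernel=3, padding=1))            # 32: "same" padding preserves size
print(conv_out(32, kernel=3, padding=1, stride=2))  # 16: stride 2 halves the map
```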

Pooling aggregates local neighborhoods without learned weights. Max pooling returns the largest value in each window; average pooling returns the mean. Pooling adds local translation tolerance and often reduces spatial resolution.
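
A small sketch contrasting the two pooling flavors on a toy input:

```python
import torch
from torch import nn

x = torch.arange(16.0).reshape(1, 1, 4, 4)  # batch 1, channel 1, 4x4 grid
print(nn.MaxPool2d(kernel_size=2)(x))  # each 2x2 window -> its maximum
print(nn.AvgPool2d(kernel_size=2)(x))  # each 2x2 window -> its mean
```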

LeNet is an early CNN for digit recognition. It alternates convolution, nonlinear activation, pooling, and fully connected layers.

Key results

Convolutions encode three useful assumptions for images. Locality says nearby pixels are more strongly related than distant pixels. Translation equivariance says shifting an input should shift the feature map in a corresponding way. Parameter sharing says the same detector can be useful at many locations.

For a single-channel input of size $h \times w$ and a kernel of size $k_h \times k_w$, a fully connected layer from all pixels to one hidden unit uses $hw$ weights. A convolutional detector uses only $k_h k_w$ weights and applies them everywhere. With many feature maps, this savings becomes dramatic.

A $1 \times 1$ convolution does not mix spatial neighborhoods, but it mixes channels at each pixel location. If the input has $c_i$ channels and the output has $c_o$ channels, a $1 \times 1$ convolution learns a matrix-like transformation from $\mathbb{R}^{c_i}$ to $\mathbb{R}^{c_o}$ at every spatial position. D2L uses this idea to explain later architectures such as NiN and GoogLeNet.
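
The matrix view can be checked numerically. This sketch compares a bias-free $1 \times 1$ convolution against an explicit per-pixel matrix multiply; the `einsum` form is just one way to write it:

```python
import torch
from torch import nn

x = torch.randn(2, 3, 4, 4)                        # c_i = 3
conv = nn.Conv2d(3, 5, kernel_size=1, bias=False)  # c_o = 5
W = conv.weight.view(5, 3)                         # the 1x1 kernel is a (c_o, c_i) matrix
y_conv = conv(x)
y_mat = torch.einsum('oc,bchw->bohw', W, x)        # apply W at every spatial position
print(torch.allclose(y_conv, y_mat, atol=1e-6))    # True
```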

Pooling is not the same as convolution. It has no learned kernel and usually discards precise spatial information. This can help classification, where exact location may be less important, but it can hurt dense prediction tasks such as segmentation if overused.

Stacking small kernels increases the receptive field. Two $3 \times 3$ convolutional layers without dilation give a $5 \times 5$ effective receptive field, while using fewer parameters than one dense $5 \times 5$ layer with the same channel counts.
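
The parameter comparison is easy to verify; the channel count of 16 below is arbitrary:

```python
from torch import nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

c = 16
two_3x3 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.Conv2d(c, c, 3, padding=1))
one_5x5 = nn.Conv2d(c, c, 5, padding=2)
print(n_params(two_3x3), n_params(one_5x5))  # 4640 vs 6416, same 5x5 receptive field
```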

Channel dimensions are part of the model's learned representation. Early layers may detect edges or color contrasts, while later layers combine channels into textures, parts, and object-level patterns. A convolution with $c_i$ input channels, $c_o$ output channels, and kernel size $k_h \times k_w$ has $c_o c_i k_h k_w$ weights, plus $c_o$ biases when bias is enabled. This makes channel growth a major driver of parameter count and computation.
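
The count matches what PyTorch reports, shown here for a layer with LeNet's second-convolution sizes:

```python
from torch import nn

conv = nn.Conv2d(6, 16, kernel_size=5)  # c_i = 6, c_o = 16, k_h = k_w = 5
expected = 16 * 6 * 5 * 5 + 16          # c_o * c_i * k_h * k_w weights + c_o biases
actual = sum(p.numel() for p in conv.parameters())
print(expected, actual)  # 2416 2416
```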

The receptive field of a unit is the region of the original input that can affect it. Deeper layers have larger receptive fields because each convolution composes neighborhoods from the previous feature map. However, the theoretical receptive field can be larger than the effective region that strongly influences the output, especially early in training. This is one reason architecture design balances depth, stride, pooling, and skip connections.

Layout conventions matter when translating formulas into code. D2L's mathematical images are often written as height-by-width arrays, but PyTorch stores batches as (N, C, H, W). Most silent CNN mistakes are not in the convolution formula; they are in reshaping, flattening, or feeding a channels-last tensor into a channels-first layer.
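
A sketch of the usual fix: permute a channels-last tensor into (N, C, H, W) before the first convolution. The channels-last source data here is hypothetical:

```python
import torch
from torch import nn

x_nhwc = torch.randn(8, 28, 28, 1)   # e.g. image data arriving channels-last
x_nchw = x_nhwc.permute(0, 3, 1, 2)  # reorder to (N, C, H, W) for PyTorch layers
print(nn.Conv2d(1, 6, 5, padding=2)(x_nchw).shape)  # torch.Size([8, 6, 28, 28])
```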

Translation equivariance should be distinguished from translation invariance. A convolutional feature map is equivariant: shifting the input tends to shift the feature response. Pooling, striding, and global average pooling add degrees of invariance by reducing the importance of exact position. Classification often wants some invariance, but detection and segmentation need enough equivariance to locate objects accurately.
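
Equivariance can be seen directly: shift a pattern that sits away from the borders, and the feature map shifts with it. This is a sketch with an arbitrary random kernel; keeping the pattern off the border avoids padding artifacts:

```python
import torch
from torch import nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2:5, 2:5] = torch.randn(3, 3)               # small pattern away from the border
x_shift = torch.roll(x, shifts=(1, 2), dims=(2, 3))  # move it down 1, right 2

y, y_shift = conv(x), conv(x_shift)
# Shifting the input shifted the response by the same amount.
print(torch.allclose(torch.roll(y, shifts=(1, 2), dims=(2, 3)), y_shift, atol=1e-6))  # True
```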

LeNet is historically small, but its pattern remains recognizable. Convolutional layers learn local features, pooling reduces resolution, and dense layers map the final representation to class logits. Modern networks replace sigmoid with ReLU-like activations, add normalization, and use deeper blocks, but the same tensor flow from image to feature maps to classifier is still present.

Visual

| Operation | Learned parameters | Main effect | Shape control |
| --- | --- | --- | --- |
| Convolution | Yes | Local feature extraction | Kernel, padding, stride |
| $1 \times 1$ convolution | Yes | Channel mixing | Output channels |
| Max pooling | No | Local winner selection | Window, padding, stride |
| Average pooling | No | Local smoothing | Window, padding, stride |
| Padding | No | Border handling | Added pixels |
| Stride | No | Downsampling | Step size |

Worked example 1: cross-correlation by hand

Problem: compute the valid cross-correlation of

$$X = \begin{bmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \\ 6 & 7 & 8 \end{bmatrix}$$

with kernel

$$K = \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix}.$$

Method:

  1. The input is $3 \times 3$ and the kernel is $2 \times 2$, so the valid output is $(3-2+1) \times (3-2+1) = 2 \times 2$.

  2. Top-left output:

$$Y_{0,0} = 0(0) + 1(1) + 3(2) + 4(3) = 0 + 1 + 6 + 12 = 19.$$

  3. Top-right output:

$$Y_{0,1} = 1(0) + 2(1) + 4(2) + 5(3) = 0 + 2 + 8 + 15 = 25.$$

  4. Bottom-left output:

$$Y_{1,0} = 3(0) + 4(1) + 6(2) + 7(3) = 0 + 4 + 12 + 21 = 37.$$

  5. Bottom-right output:

$$Y_{1,1} = 4(0) + 5(1) + 7(2) + 8(3) = 0 + 5 + 14 + 24 = 43.$$

Checked answer:

$$Y = \begin{bmatrix} 19 & 25 \\ 37 & 43 \end{bmatrix}.$$
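
As a check, PyTorch's `F.conv2d` computes exactly this cross-correlation (no kernel flip), so it reproduces the hand result:

```python
import torch
import torch.nn.functional as F

X = torch.arange(9.0).reshape(1, 1, 3, 3)  # the 3x3 input, as (N, C, H, W)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)
print(F.conv2d(X, K).reshape(2, 2))        # tensor([[19., 25.], [37., 43.]])
```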

Worked example 2: output size with padding and stride

Problem: a convolution receives an image of size $32 \times 32$, uses a $5 \times 5$ kernel, padding $p = 2$, and stride $s = 2$. Find the output spatial size.

Method:

  1. Use the formula for height:

$$h_{\text{out}} = \left\lfloor \frac{h + 2p - k}{s} \right\rfloor + 1.$$

  2. Substitute values:

$$h_{\text{out}} = \left\lfloor \frac{32 + 2(2) - 5}{2} \right\rfloor + 1 = \left\lfloor \frac{31}{2} \right\rfloor + 1.$$

  3. Evaluate:

$$\left\lfloor 15.5 \right\rfloor + 1 = 15 + 1 = 16.$$

  4. Width has the same numbers, so $w_{\text{out}} = 16$.

Checked answer: the output feature map has spatial size $16 \times 16$. If the layer has $64$ output channels and the batch size is $10$, the full output shape is (10, 64, 16, 16).
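
The same numbers fall out of an actual layer; a single input channel is assumed here, since the problem only fixes the spatial size:

```python
import torch
from torch import nn

conv = nn.Conv2d(1, 64, kernel_size=5, padding=2, stride=2)
x = torch.randn(10, 1, 32, 32)
print(conv(x).shape)  # torch.Size([10, 64, 16, 16])
```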

Code

```python
import torch
from torch import nn

class LeNetLike(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 1x28x28 -> 6x28x28
            nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),      # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 16x10x10
            nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),      # -> 16x5x5
            nn.Flatten(),                               # -> 400
            nn.Linear(16 * 5 * 5, 120),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            nn.Sigmoid(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.net(x)

model = LeNetLike()
x = torch.randn(8, 1, 28, 28)
logits = model(x)
print(logits.shape)  # torch.Size([8, 10])
print("parameters:", sum(p.numel() for p in model.parameters()))  # 61706
```

Common pitfalls

  • Forgetting that PyTorch convolution expects (batch, channels, height, width).
  • Mixing up mathematical convolution with cross-correlation. Neural network libraries learn the kernel, so the flip is usually irrelevant.
  • Calculating output sizes without the floor operation.
  • Using too much pooling in tasks that need precise spatial output, such as segmentation.
  • Treating padding as harmless. Padding changes border statistics and can affect features near image edges.
  • Flattening too early, which discards spatial structure and creates many unnecessary parameters.

Connections