Tensors and Data Preprocessing
Deep learning starts with two practical skills: representing data as tensors and making raw data numerically usable. D2L begins here because every later model, from linear regression to transformers, is ultimately a composition of tensor operations. A network can look mysterious at the architectural level, but at runtime it mostly consumes batches of arrays, applies broadcasted arithmetic, multiplies matrices, and moves intermediate results through differentiable operators.
Data preprocessing matters just as much as the model definition. Missing values, mixed numeric and categorical columns, inconsistent scales, and accidental train-test leakage can dominate the result before optimization even begins. A disciplined tensor pipeline keeps shapes explicit, keeps statistical estimates tied to the training split, and makes each minibatch look like what the model expects.
Definitions
A tensor is a multidimensional array with a shape, a data type, and usually a device. In PyTorch, a scalar has shape (), a vector has shape (d,), a matrix has shape (m, n), and a minibatch of color images often has shape (batch, channels, height, width).
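A minimal sketch of these conventions in PyTorch (the particular sizes here are arbitrary):

```python
import torch

loss = torch.tensor(3.5)            # scalar: shape ()
x = torch.zeros(4)                  # vector: shape (4,)
A = torch.zeros(2, 4)               # matrix: shape (2, 4)
imgs = torch.zeros(8, 3, 32, 32)    # image minibatch: (batch, channels, height, width)
print(loss.shape, x.shape, A.shape, imgs.shape)
print(imgs.dtype, imgs.device)      # every tensor also carries a dtype and a device
```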
Let $X \in \mathbb{R}^{n \times d}$ be a tabular feature matrix. The $i$-th row $x_i \in \mathbb{R}^d$ is the feature vector for example $i$, and a target vector $y \in \mathbb{R}^n$ stores one label or response per row. In image tasks, one row is no longer enough to describe an example, so an input batch may be $X \in \mathbb{R}^{b \times c \times h \times w}$.
Broadcasting is a rule for applying elementwise operations to tensors whose shapes differ but are compatible. Starting from trailing dimensions, two dimensions are compatible when they are equal or when one of them is $1$. A dimension of size $1$ can be repeated conceptually without copying data.
Indexing and slicing select subtensors. A slice such as `X[:, 1:3]` means all rows and columns with indices 1 and 2. In-place updates such as `X[:, 0] = 0` modify storage and can save memory, but they must be used carefully when autograd needs previous values.
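The following sketch shows both behaviors; the in-place line that would fail under autograd is left commented out:

```python
import torch

X = torch.arange(12, dtype=torch.float32).reshape(3, 4)
print(X[:, 1:3])   # all rows, columns 1 and 2

X[:, 0] = 0        # in-place: overwrites existing storage rather than allocating
print(X)

# Under autograd, in-place writes can clash with values needed for backward.
w = torch.ones(3, requires_grad=True)
loss = (w * w).sum()
# w[0] = 5.0       # in-place write on a leaf that requires grad raises a RuntimeError
loss.backward()
```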
Data preprocessing turns raw records into tensors. Common steps include parsing files, separating features from labels, imputing missing numeric values, encoding categorical values, standardizing numeric features, shuffling examples, and grouping them into minibatches.
One-hot encoding maps a categorical variable with $k$ possible categories to a vector in $\{0, 1\}^k$ with exactly one nonzero entry. This lets a linear model or neural network receive category identity without imposing an artificial numeric order.
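A small sketch using `torch.nn.functional.one_hot`, with the city labels from the later worked example mapped to illustrative integer ids (Busan = 0, Seoul = 1):

```python
import torch
import torch.nn.functional as F

city_ids = torch.tensor([1, 0, 1])             # Seoul, Busan, Seoul
one_hot = F.one_hot(city_ids, num_classes=2)   # shape (3, 2), dtype int64
print(one_hot.float())                         # convert for use as model input
# tensor([[0., 1.],
#         [1., 0.],
#         [0., 1.]])
```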
Key results
The most important tensor result is that shape reasoning is an invariant. If an operation is valid for a single example but not for a minibatch, it is usually not written in the right vectorized form. For example, a linear model with weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ should compute all predictions as

$$\hat{y} = Xw + b,$$

where $X \in \mathbb{R}^{n \times d}$ and broadcasting expands the scalar $b$ across all $n$ rows.
Vectorization replaces Python loops with library kernels. If $X$ has $n$ rows, a loop that computes `torch.dot(X[i], w)` repeatedly launches many small operations. The expression `X @ w` launches a single optimized matrix-vector product and lets PyTorch use low-level linear algebra routines.
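A quick check of the equivalence, with arbitrary sizes:

```python
import torch

n, d = 1000, 64
X = torch.randn(n, d)
w = torch.randn(d)

# Loop version: n separate dot-product calls.
y_loop = torch.stack([torch.dot(X[i], w) for i in range(n)])

# Vectorized version: one matrix-vector product.
y_vec = X @ w

print(torch.allclose(y_loop, y_vec, atol=1e-5))  # True, up to rounding
```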
Broadcasting is not just syntax. It also affects memory behavior. A + B where A.shape == (3, 1) and B.shape == (1, 4) behaves like adding a repeated column to a repeated row, producing shape (3, 4). The repeated views need not be physically materialized before the result is computed.
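The shapes and strides can be inspected directly; `expand` below is the explicit form of the view that broadcasting creates implicitly:

```python
import torch

A = torch.arange(3.0).reshape(3, 1)   # shape (3, 1)
B = torch.arange(4.0).reshape(1, 4)   # shape (1, 4)
C = A + B
print(C.shape)                        # torch.Size([3, 4])

A_view = A.expand(3, 4)               # broadcast view: no data is copied
print(A_view.stride())                # (1, 0): the single column is reused
```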
Preprocessing has a simple statistical rule: fit preprocessing decisions on the training set, then apply the same decisions to validation and test sets. If the mean and standard deviation of a numeric feature are estimated using the test set, information has leaked from evaluation into training. The same warning applies to vocabulary construction, category discovery, missing-value imputation, and feature selection.
For numeric standardization, use

$$x' = \frac{x - \mu}{\sigma + \epsilon},$$

where $\mu$ and $\sigma$ are computed from the training column, and $\epsilon$ prevents division by zero. Standardization does not make a model correct, but it makes optimization better conditioned when features have very different units.
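A minimal sketch of this rule, with an illustrative split and epsilon:

```python
import torch

X_train = torch.tensor([[2.0, 10.0], [4.0, 15.0], [3.0, 20.0]])
X_test = torch.tensor([[5.0, 12.0]])

eps = 1e-8
mu = X_train.mean(dim=0)                    # statistics fit on the training split only
sigma = X_train.std(dim=0, unbiased=False)  # population standard deviation

X_train_std = (X_train - mu) / (sigma + eps)
X_test_std = (X_test - mu) / (sigma + eps)  # reuse training statistics: no leakage
print(X_test_std)
```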
Shape conventions should be documented at dataset boundaries. A tabular batch, image batch, and token batch all carry the batch dimension, but their remaining axes mean different things. Naming these axes in code comments, variable names, or assertions prevents subtle bugs when models are refactored. A useful habit is to check a single minibatch before training and print both feature and target shapes.
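One possible form of that habit is a small, hypothetical assertion helper run on the first minibatch:

```python
import torch

def check_batch(xb: torch.Tensor, yb: torch.Tensor, d: int) -> None:
    """Hypothetical check for a tabular batch: features (batch, d), targets (batch, 1)."""
    assert xb.ndim == 2 and xb.shape[1] == d, f"expected (batch, {d}), got {tuple(xb.shape)}"
    assert yb.shape == (xb.shape[0], 1), f"expected ({xb.shape[0]}, 1), got {tuple(yb.shape)}"
    print("features:", tuple(xb.shape), "targets:", tuple(yb.shape))
```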
Data preprocessing should also preserve reversibility where possible. Saving category vocabularies, normalization statistics, and label mappings makes evaluation reproducible and lets predictions be interpreted later. In production systems, the preprocessing artifact is part of the model because changing it changes the numeric input seen by the network.
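One way to keep such an artifact, using the statistics from the worked example below (the file name and layout are illustrative):

```python
import json

preprocessing_artifact = {
    "numeric_columns": ["rooms", "age"],
    "mean": {"rooms": 3.0, "age": 15.0},
    "std": {"rooms": 0.816, "age": 4.082},
    "city_vocabulary": ["Busan", "Seoul"],  # fixes the one-hot column order
}
with open("preprocessing.json", "w") as f:
    json.dump(preprocessing_artifact, f, indent=2)
```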
Visual
| Concept | Typical PyTorch object | Shape example | Why it matters |
|---|---|---|---|
| Scalar loss | torch.Tensor | () | backward() is simplest on scalar objectives |
| Feature vector | torch.Tensor | (d,) | One example for a tabular model |
| Minibatch matrix | torch.Tensor | (batch, d) | Enables vectorized training |
| Image batch | torch.Tensor | (batch, channels, height, width) | Standard CNN layout in PyTorch |
| Token batch | torch.Tensor | (batch, time) | Standard discrete sequence layout |
| One-hot features | torch.Tensor | (batch, categories) | Converts categories to numeric inputs |
Worked example 1: broadcasting a minibatch
Problem: compute $X + b$, where $X$ is a $2 \times 3$ minibatch and $b$ is a $1 \times 3$ row vector of offsets.
Method:
- Write the shapes. $X$ has shape $(2, 3)$ and $b$ has shape $(1, 3)$.
- Compare trailing dimensions. The last dimensions are both $3$, so they match.
- Compare the first dimensions. They are $2$ and $1$, so broadcasting can repeat the single row of $b$ across the two rows of $X$.
- Conceptually expand $b$ to the $(2, 3)$ matrix whose two rows are both equal to $b$.
- Add elementwise: row $i$ of the result is $x_i + b$, where $x_i$ is the $i$-th row of $X$.
Checked answer: the output shape is $(2, 3)$, matching the broadcasted common shape. Each row of $X$ received the same offset vector $b$.
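The same check in PyTorch, with illustrative values for $b$:

```python
import torch

X = torch.zeros(2, 3)                    # any (2, 3) matrix works for the shape rule
b = torch.tensor([[10.0, 20.0, 30.0]])   # shape (1, 3)
out = X + b
print(out.shape)                               # torch.Size([2, 3])
print(torch.equal(out - X, b.expand(2, 3)))    # True: every row got the same offset
```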
Worked example 2: preprocessing a small table
Problem: convert the following raw data into model-ready numeric features. The target is price.
| row | rooms | age | city | price |
|---|---|---|---|---|
| 1 | 2 | 10 | Seoul | 220 |
| 2 | 4 | missing | Busan | 310 |
| 3 | 3 | 20 | Seoul | 280 |
Method:
- Separate the target: $y = (220, 310, 280)^\top$.
- Impute the missing numeric value using the training-column mean. The observed ages are $10$ and $20$, so
$$\bar{a} = \frac{10 + 20}{2} = 15.$$
The imputed age column is $(10, 15, 20)$.
- Standardize `rooms`. Its mean is
$$\mu_{\text{rooms}} = \frac{2 + 4 + 3}{3} = 3.$$
Its population standard deviation is
$$\sigma_{\text{rooms}} = \sqrt{\frac{(2-3)^2 + (4-3)^2 + (3-3)^2}{3}} = \sqrt{\tfrac{2}{3}} \approx 0.816.$$
So the standardized rooms values are approximately $(-1.22, 1.22, 0.00)$.
- Standardize `age`. The imputed age mean is $15$ and the population standard deviation is
$$\sigma_{\text{age}} = \sqrt{\frac{(10-15)^2 + (15-15)^2 + (20-15)^2}{3}} = \sqrt{\tfrac{50}{3}} \approx 4.08.$$
The standardized age values are approximately $(-1.22, 0.00, 1.22)$.
- One-hot encode `city` with columns `[Busan, Seoul]`. The rows become `[0, 1]`, `[1, 0]`, and `[0, 1]`.
Checked answer: one valid feature matrix is
$$X \approx \begin{pmatrix} -1.22 & -1.22 & 0 & 1 \\ 1.22 & 0.00 & 1 & 0 \\ 0.00 & 1.22 & 0 & 1 \end{pmatrix},$$
where the first two columns are standardized numeric features and the final two are categorical indicators.
Code
```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

# Raw table from worked example 2, with one missing age.
raw = pd.DataFrame(
    {
        "rooms": [2.0, 4.0, 3.0],
        "age": [10.0, None, 20.0],
        "city": ["Seoul", "Busan", "Seoul"],
        "price": [220.0, 310.0, 280.0],
    }
)

# Separate the target and reshape it to (batch, 1).
y = torch.tensor(raw.pop("price").values, dtype=torch.float32).reshape(-1, 1)

# Impute missing numeric values with the column mean, then standardize
# using the population standard deviation (ddof=0), matching the worked example.
numeric = raw[["rooms", "age"]].copy()
numeric = numeric.fillna(numeric.mean())
numeric = (numeric - numeric.mean()) / numeric.std(ddof=0)

# One-hot encode the categorical column and assemble the feature matrix.
categorical = pd.get_dummies(raw["city"], dtype=float)
features = pd.concat([numeric, categorical], axis=1)
X = torch.tensor(features.values, dtype=torch.float32)

# Wrap the tensors in a dataset and iterate over shuffled minibatches.
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=2, shuffle=True)
for xb, yb in loader:
    print("features:", xb)
    print("targets:", yb)
```
Common pitfalls
- Treating tensor shape errors as incidental. Shape mismatches usually reveal an incorrect mental model of the batch dimension.
- Computing standardization statistics on the full dataset before splitting. This leaks information from validation or test examples.
- Encoding categories independently in each split. The column order and category set must be learned once from the training data.
- Assuming broadcasting always copies data. It often uses views, but the final result still has the broadcasted shape.
- Using integer tensors for model inputs that should participate in floating-point arithmetic.
- Forgetting that PyTorch image tensors usually use channels-first layout, while many image files and plotting libraries use channels-last layout; see the conversion sketch after this list.
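A minimal conversion sketch for the last pitfall (the image size is arbitrary):

```python
import torch

hwc = torch.zeros(32, 32, 3)     # channels-last, e.g., from an image file
chw = hwc.permute(2, 0, 1)       # channels-first, PyTorch's convention
print(chw.shape)                 # torch.Size([3, 32, 32])

back = chw.permute(1, 2, 0)      # back to channels-last for plotting
print(back.shape)                # torch.Size([32, 32, 3])
```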