Mathematical Formulation
Adversarial examples are usually introduced as surprising inputs, but the technical core is an optimization problem. Given a trained model and a valid set of input changes, the attacker searches for a nearby point that maximizes loss, changes the predicted label, or induces a target behavior. Defenses then try to train or certify models whose predictions are stable throughout that valid set.
This page builds the notation used by white-box attacks, adversarial training, and certified defenses. These formulas are simple to write down but hard to solve exactly: the inner maximization is nonconvex for neural networks, and the choice of constraint set determines what "nearby" means.
Definitions
Let $f_\theta$ be a classifier with parameters $\theta$ and logits $f_\theta(x) \in \mathbb{R}^K$. The predicted class is:

$$\hat{y}(x) = \arg\max_k \, f_\theta(x)_k$$
Let $\ell(f_\theta(x), y)$ be a training loss, usually cross-entropy for classification. An untargeted adversarial example for input-label pair $(x, y)$ is an input $x'$ in the allowed set $S(x)$ such that:

$$\hat{y}(x') \neq y$$
For norm-bounded attacks:

$$S(x) = \{\, x' : \|x' - x\|_p \le \epsilon \,\}$$
A targeted adversarial example for target class $t \neq y$ satisfies:

$$\hat{y}(x') = t, \quad x' \in S(x)$$
The attack optimization problem is often written as:

$$\max_{x' \in S(x)} \ \ell(f_\theta(x'), y)$$

for untargeted attacks. For targeted attacks, one common form is:

$$\min_{x' \in S(x)} \ \ell(f_\theta(x'), t)$$
The robust risk of a classifier is:

$$R_{\mathrm{rob}}(f_\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\!\left[\, \max_{x' \in S(x)} \ \mathbb{1}\{\hat{y}(x') \neq y\} \,\right]$$
Adversarial training approximates the min-max problem:

$$\min_\theta \ \mathbb{E}_{(x, y) \sim \mathcal{D}}\!\left[\, \max_{x' \in S(x)} \ \ell(f_\theta(x'), y) \,\right]$$
For certification, the goal is not merely to find a bad $x'$ but to prove that no bad $x'$ exists inside the set. For a point $x$, a certified radius $r$ under norm $\|\cdot\|_p$ means:

$$\hat{y}(x') = \hat{y}(x) \quad \text{for all } x' \text{ with } \|x' - x\|_p \le r$$
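The min-max structure translates directly into a training loop: run an attack to approximate the inner maximization, then take a parameter step on the resulting loss. Below is a minimal sketch, assuming a generic `attack_fn` (for example, the `pgd_inner_max` function in the Code section below) and a standard PyTorch model and optimizer; it is illustrative rather than a full training recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, attack_fn, x, y):
    """One step of the outer minimization, using attack_fn for the inner max.

    attack_fn is assumed to take (model, x, y) and return a perturbed batch
    inside the allowed set S(x); this is a sketch, not a complete recipe.
    """
    model.train()
    # Inner maximization: find a perturbed batch that (approximately) maximizes the loss.
    x_adv = attack_fn(model, x, y)
    # Outer minimization: update parameters on the adversarial batch.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```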
Key results
For a locally linear loss, the best first-order perturbation has the sign of the input gradient. Let:

$$g = \nabla_x \, \ell(f_\theta(x), y)$$

The first-order Taylor approximation gives:

$$\ell(f_\theta(x + \delta), y) \approx \ell(f_\theta(x), y) + g^\top \delta$$

The maximizer of $g^\top \delta$ over $\|\delta\|_\infty \le \epsilon$ is:

$$\delta^* = \epsilon \, \mathrm{sign}(g)$$

which is the Fast Gradient Sign Method direction. Over an $\ell_2$ ball, the maximizer is:

$$\delta^* = \epsilon \, \frac{g}{\|g\|_2}$$
More generally, dual norms explain first-order attacks. If $\tfrac{1}{p} + \tfrac{1}{q} = 1$, then:

$$\max_{\|\delta\|_p \le \epsilon} \ g^\top \delta = \epsilon \, \|g\|_q$$

This equation is one reason adversarial vulnerability is tied to high-dimensional geometry. Even if each coordinate of $g$ is tiny under $\ell_\infty$, the dot product can accumulate $\epsilon \, \|g\|_1$ across many dimensions.
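As a quick numeric check of the dual-norm identity, the sketch below draws a random gradient vector and compares the analytic maximum $\epsilon \|g\|_q$ with the value achieved by the closed-form maximizers for the $\ell_\infty$ and $\ell_2$ balls. The vector size and $\epsilon$ are arbitrary choices made here for illustration.

```python
import torch

torch.manual_seed(0)
epsilon = 0.05
g = torch.randn(3072)  # e.g., the flattened gradient of a 32x32x3 image

# L-infinity ball: maximizer is epsilon * sign(g); value is epsilon * ||g||_1.
delta_inf = epsilon * g.sign()
print(g @ delta_inf, epsilon * g.abs().sum())   # the two numbers agree

# L2 ball: maximizer is epsilon * g / ||g||_2; value is epsilon * ||g||_2.
delta_2 = epsilon * g / g.norm(p=2)
print(g @ delta_2, epsilon * g.norm(p=2))       # the two numbers agree
```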
Constrained attacks can also be written with penalties. Instead of:

$$\min_{x'} \ \|x' - x\|_p \quad \text{subject to} \quad \hat{y}(x') = t$$

one may solve:

$$\min_{x'} \ \|x' - x\|_p + c \cdot \phi(x')$$

where $\phi$ penalizes failure to reach the target. Carlini-Wagner style attacks use this kind of penalty formulation with carefully chosen confidence losses and box constraints. The penalty coefficient $c$ matters: too small and the target is not reached; too large and the perturbation can be larger than necessary.
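The sketch below shows one way to write such a penalty objective for a targeted attack: an $\ell_2$ distortion term plus a hinge on the logit margin, optimized with gradient descent and clamping to the pixel box. It is a simplified illustration in the spirit of Carlini-Wagner, not a faithful reimplementation; the confidence parameter `kappa`, the coefficient `c`, and the plain clamping (instead of a change of variables) are all choices made here for brevity.

```python
import torch

def penalty_targeted_attack(model, x, target, c=1.0, kappa=0.0, steps=200, lr=0.01):
    """Minimize ||x' - x||_2^2 + c * max(max_{k != t} z_k - z_t, -kappa).

    Simplified penalty-style targeted attack; illustrative only.
    """
    x0 = x.detach()
    x_adv = x0.clone().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        logits = model(x_adv)
        target_logit = logits.gather(1, target.view(-1, 1)).squeeze(1)
        # Largest non-target logit, obtained by masking out the target class.
        other = logits.clone()
        other.scatter_(1, target.view(-1, 1), float("-inf"))
        other_logit = other.max(dim=1).values
        penalty = torch.clamp(other_logit - target_logit, min=-kappa)
        distortion = ((x_adv - x0) ** 2).flatten(1).sum(dim=1)
        loss = distortion + c * penalty
        opt.zero_grad()
        loss.sum().backward()
        opt.step()
        with torch.no_grad():
            x_adv.clamp_(0.0, 1.0)  # keep pixels in the valid box
    return x_adv.detach()
```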
Loss surfaces around neural networks are nonconvex, so attack algorithms are approximate. A failed attack does not prove robustness unless the method is a sound verifier. This distinction motivates gradient masking and obfuscation: if the optimization landscape is made artificially hard for a particular attack, adversarial accuracy can be overestimated.
The perturbation set $S(x)$ is also part of the mathematics, not a side note. Norm balls are convenient because they give closed-form projections and dual-norm calculations, but many realistic sets are intersections of constraints. An image attack may require $\|x' - x\|_\infty \le \epsilon$, a valid pixel range, fixed crop geometry, and unchanged metadata. A patch attack replaces the norm ball with a mask and a transformation distribution. A text attack replaces continuous projection with a discrete search over candidate edits. In each case, the correct attack problem is the one that optimizes over the actual allowed set. Using the wrong $S(x)$ can make a mathematically clean result irrelevant to the system being evaluated.
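As a concrete contrast with norm balls, the sketch below applies a patch-style constraint: the adversarial content is confined to a fixed binary mask and the rest of the image is left untouched. The mask location and shape are arbitrary values chosen here; a real patch attack would also optimize the patch contents and sample over a transformation distribution.

```python
import torch

def apply_patch(x, patch, mask):
    """Constrain the perturbation to a masked region instead of a norm ball.

    x:     batch of images, shape (B, C, H, W), values in [0, 1]
    patch: adversarial content, same shape as x (only the masked part is used)
    mask:  binary tensor broadcastable to x, 1 inside the patch region
    """
    return (x * (1.0 - mask) + patch * mask).clamp(0.0, 1.0)

# Example: an 8x8 patch in the top-left corner of 32x32 images (illustrative values).
x = torch.rand(4, 3, 32, 32)
patch = torch.rand_like(x)
mask = torch.zeros(1, 1, 32, 32)
mask[..., :8, :8] = 1.0
x_patched = apply_patch(x, patch, mask)
```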
The same care applies to the loss. Cross-entropy is common because it is already used for training, but margin losses, target losses, detector losses, or sequence-level losses may better match the attack goal. The objective should encode the success condition, not merely be easy to differentiate.
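For example, a margin objective on the logits directly encodes the untargeted success condition (some wrong class outscores the true class) and does not saturate the way cross-entropy can when the model is very confident. The sketch below shows one such margin written in PyTorch; the exact form varies across papers, and the function name is an illustrative choice.

```python
import torch

def untargeted_margin(logits, y):
    """Margin objective: max_{k != y} z_k - z_y.

    Positive values mean the input is already misclassified;
    an attack maximizes this quantity instead of cross-entropy.
    """
    true_logit = logits.gather(1, y.view(-1, 1)).squeeze(1)
    masked = logits.clone()
    masked.scatter_(1, y.view(-1, 1), float("-inf"))
    best_other = masked.max(dim=1).values
    return best_other - true_logit
```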
Visual
| Constraint | Set | First-order maximizer of $g^\top \delta$ | Typical use |
|---|---|---|---|
| $\ell_\infty$ | $\{\delta : \lVert\delta\rVert_\infty \le \epsilon\}$ | $\delta^* = \epsilon \, \mathrm{sign}(g)$ | Pixel-bounded image attacks |
| $\ell_2$ | $\{\delta : \lVert\delta\rVert_2 \le \epsilon\}$ | $\delta^* = \epsilon \, g / \lVert g\rVert_2$ | Energy-bounded perturbations, smoothing certificates |
| $\ell_1$ | $\{\delta : \lVert\delta\rVert_1 \le \epsilon\}$ | Put mass on largest $\lvert g_i\rvert$ | Sparse-ish budget analysis |
| $\ell_0$ | At most $k$ changed coordinates | Change largest-gradient coordinates | Sparse pixel or feature attacks |
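The closed-form directions in the table can be collected into a small helper. The sketch below returns the first-order maximizer of $g^\top \delta$ for the $\ell_\infty$, $\ell_2$, and $\ell_1$ budgets and a greedy top-$k$ choice for $\ell_0$; the function name and interface are choices made here for illustration.

```python
import torch

def first_order_maximizer(g, epsilon, norm="linf", k=10):
    """Maximize g . delta over a simple budget set (single first-order attack step)."""
    flat = g.flatten()
    delta = torch.zeros_like(flat)
    if norm == "linf":
        delta = epsilon * flat.sign()
    elif norm == "l2":
        delta = epsilon * flat / flat.norm(p=2).clamp_min(1e-12)
    elif norm == "l1":
        # Exact maximizer: put the whole budget on the largest-magnitude coordinate.
        i = flat.abs().argmax()
        delta[i] = epsilon * flat[i].sign()
    elif norm == "l0":
        # Heuristic: the l0 constraint alone does not bound magnitude, so take a
        # unit sign step on the k largest-gradient coordinates.
        idx = flat.abs().topk(k).indices
        delta[idx] = flat[idx].sign()
    return delta.view_as(g)
```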
Worked example 1: First-order attack on a linear loss
Problem: Let $g = \nabla_x \ell(f_\theta(x), y)$ be the input gradient of the loss at an image $x$.
Find the first-order adversarial perturbation under the budget $\|\delta\|_\infty \le \epsilon$ and compute the approximate loss increase.
- Under the Taylor approximation, maximize $g^\top \delta$ subject to $\|\delta\|_\infty \le \epsilon$.
- Choose each coordinate independently: $g_i \delta_i$ is largest when $\delta_i$ sits at the endpoint of $[-\epsilon, \epsilon]$ that matches the sign of $g_i$.
- The sign vector is $\mathrm{sign}(g)$, so $\delta_i^* = \epsilon \, \mathrm{sign}(g_i)$ for every coordinate.
- Therefore $\delta^* = \epsilon \, \mathrm{sign}(g)$.
- The approximate loss increase is $g^\top \delta^* = \epsilon \sum_i |g_i| = \epsilon \, \|g\|_1$.
Checked answer: the first-order perturbation is $\delta^* = \epsilon \, \mathrm{sign}(g)$ and the approximated loss increase is $\epsilon \, \|g\|_1$.
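A quick numeric check of the same derivation, using a small hypothetical gradient vector and budget (the values below are illustrative, not taken from the original example):

```python
import torch

# Hypothetical values for illustration only.
g = torch.tensor([0.5, -1.2, 0.0, 2.0])
epsilon = 0.1

delta_star = epsilon * g.sign()                # first-order (FGSM) perturbation
loss_increase = g @ delta_star                 # equals epsilon * ||g||_1
print(delta_star)                              # tensor([ 0.1000, -0.1000,  0.0000,  0.1000])
print(loss_increase, epsilon * g.abs().sum())  # both 0.37
```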
Worked example 2: Robust radius for a binary linear classifier
Problem: Let a binary classifier predict class $+1$ when $w^\top x + b > 0$ and class $-1$ otherwise. Let $x$ be a point with $w^\top x + b > 0$.
Compute the smallest $\ell_2$ perturbation that reaches the decision boundary.
- Compute the signed score $s = w^\top x + b$.
- The point is classified as $+1$ because $s > 0$.
- The decision boundary is $\{x' : w^\top x' + b = 0\}$. The Euclidean distance from $x$ to the boundary is $|w^\top x + b| / \|w\|_2$.
- Compute the norm $\|w\|_2 = \sqrt{\textstyle\sum_i w_i^2}$.
- Therefore the distance is $d = |s| / \|w\|_2$.
- The boundary-reaching perturbation moves opposite $w$: $\delta^* = -\dfrac{s}{\|w\|_2^2}\, w$, with $\|\delta^*\|_2 = d$.
- Check: $w^\top (x + \delta^*) + b = s - \dfrac{s}{\|w\|_2^2}\, w^\top w = 0$.
Checked answer: the minimum $\ell_2$ distance to the boundary is $|w^\top x + b| / \|w\|_2$, achieved by $\delta^* = -\frac{w^\top x + b}{\|w\|_2^2}\, w$. Any linear classifier with $|w^\top x + b| < \epsilon \, \|w\|_2$ at a point cannot be certified robust to radius $\epsilon$ at that point.
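The same computation with small hypothetical numbers (again, illustrative values, not the ones from the original example):

```python
import torch

# Hypothetical linear classifier and point, for illustration only.
w = torch.tensor([3.0, 4.0])
b = torch.tensor(-2.0)
x = torch.tensor([1.0, 1.0])

s = w @ x + b                              # signed score: 3 + 4 - 2 = 5, so class +1
d = s.abs() / w.norm(p=2)                  # distance to boundary: 5 / 5 = 1.0
delta_star = -(s / w.norm(p=2) ** 2) * w   # boundary-reaching perturbation
print(d)                                   # tensor(1.)
print(w @ (x + delta_star) + b)            # ~0, exactly on the decision boundary
```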
Code
```python
import torch
import torch.nn.functional as F


def pgd_inner_max(model, x, y, epsilon=8 / 255, step_size=2 / 255, steps=10):
    """Approximate the inner maximization with L-infinity PGD."""
    x0 = x.detach()
    # Random start inside the epsilon ball, clipped to the valid image range.
    x_adv = x0 + torch.empty_like(x0).uniform_(-epsilon, epsilon)
    x_adv = x_adv.clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Ascent step on the input, then project back to the ball and the box.
            x_adv = x_adv + step_size * grad.sign()
            delta = (x_adv - x0).clamp(-epsilon, epsilon)
            x_adv = (x0 + delta).clamp(0.0, 1.0)
    return x_adv.detach()
```
The code implements the inner maximization used in many adversarial-training loops. It uses random initialization, gradient ascent on the input, projection to the $\ell_\infty$ ball, and clipping to the valid image range.
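A typical evaluation-time use of the function is measuring empirical robust accuracy on a batch, as in the sketch below; `model`, `x`, and `y` are assumed to be a trained classifier and a batch of inputs and labels in the format the attack expects.

```python
# Assumes model, x, y are already defined; empirical robust accuracy on one batch.
model.eval()
x_adv = pgd_inner_max(model, x, y, epsilon=8 / 255, step_size=2 / 255, steps=10)
with torch.no_grad():
    clean_acc = (model(x).argmax(dim=1) == y).float().mean()
    robust_acc = (model(x_adv).argmax(dim=1) == y).float().mean()
print(f"clean accuracy: {clean_acc:.3f}, robust accuracy (PGD-10): {robust_acc:.3f}")
```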
Common pitfalls
- Optimizing the wrong objective for targeted attacks. Targeted attacks usually minimize the target loss, while untargeted attacks maximize the true-label loss.
- Forgetting the input-domain constraint after projecting onto a norm ball.
- Treating an approximate attack failure as a proof of robustness. Certification requires a verifier or a theorem.
- Comparing $\epsilon$ values across datasets without checking pixel scaling, channel normalization, and image resolution.
- Using cross-entropy loss blindly when logits saturate; stronger attacks may use margin losses or confidence losses.
- Assuming that the closest adversarial example under $\ell_2$ is also closest under $\ell_\infty$ or $\ell_0$.
Connections
- Threat models and attack taxonomy defines the attacker assumptions behind $S(x)$.
- White-box attacks are numerical methods for the inner maximization.
- Adversarial training uses the min-max problem as a training objective.
- Certified defenses and randomized smoothing replace attack search with proof obligations.
- Deep learning provides the loss functions, logits, and gradient computation used here.
Further reading
- Goodfellow, Shlens, and Szegedy, "Explaining and Harnessing Adversarial Examples."
- Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks."
- Carlini and Wagner, "Towards Evaluating the Robustness of Neural Networks."
- Zhang et al., "Theoretically Principled Trade-off between Robustness and Accuracy."