White-Box Attacks
White-box attacks assume the attacker can inspect and differentiate the full model and defense. This is the standard stress test for adversarial robustness because it removes accidental secrecy: if a defense only works when the attacker does not know the preprocessing, loss, randomness, or architecture, the defense is brittle under a security interpretation.
Most white-box image attacks are variations on constrained optimization. They use gradients with respect to the input, not the weights, and search inside a threat set such as an $\ell_\infty$ or $\ell_2$ ball. This page gives the conceptual scaffold for FGSM, BIM/I-FGSM, PGD, momentum iterative attacks, Carlini-Wagner attacks, and DeepFool; later paper pages can deep-dive the original algorithms.
Definitions
For a classifier $f_\theta$, loss $L$, clean input $x$, label $y$, perturbation set $\Delta = \{\delta : \|\delta\|_p \le \epsilon\}$, and adversarial input $x' = x + \delta$, the white-box untargeted attack problem is:

$$\max_{\delta \in \Delta} \; L\big(f_\theta(x + \delta), y\big)$$

The attacker can compute the input gradient

$$\nabla_x L\big(f_\theta(x), y\big)$$

and, when needed, gradients through preprocessing, random transformations, differentiable defenses, logit margins, or surrogate losses.
FGSM uses one first-order step:

$$x' = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x L(f_\theta(x), y)\big)$$
BIM or I-FGSM repeats small FGSM-like steps and clips after each step:

$$x^{(t+1)} = \operatorname{clip}_{x,\epsilon}\Big(x^{(t)} + \alpha \cdot \operatorname{sign}\big(\nabla_x L(f_\theta(x^{(t)}), y)\big)\Big)$$
PGD is the same projected-gradient idea, usually with a random start:

$$x^{(0)} = x + u, \quad u \sim \mathcal{U}(-\epsilon, \epsilon)^d, \qquad x^{(t+1)} = \Pi_{B_\epsilon(x)}\Big(x^{(t)} + \alpha \cdot \operatorname{sign}\big(\nabla_x L(f_\theta(x^{(t)}), y)\big)\Big)$$
Momentum iterative attacks stabilize the direction by accumulating normalized gradients:

$$g^{(t+1)} = \mu \, g^{(t)} + \frac{\nabla_x L(f_\theta(x^{(t)}), y)}{\big\|\nabla_x L(f_\theta(x^{(t)}), y)\big\|_1}, \qquad x^{(t+1)} = \operatorname{clip}_{x,\epsilon}\Big(x^{(t)} + \alpha \cdot \operatorname{sign}\big(g^{(t+1)}\big)\Big)$$
Carlini-Wagner attacks often optimize a penalty objective such as:

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot g(x + \delta)$$

with $g$ a loss designed around logit margins and target success. DeepFool approximates the classifier locally by linear decision boundaries and iteratively moves to the nearest boundary.
Key results
FGSM follows from the first-order Taylor approximation described in mathematical formulation. It is fast, simple, and historically important, but it is not a strong standalone robustness evaluation. A model can resist one-step attacks while remaining vulnerable to iterative attacks.
PGD is the workhorse first-order attack for norm-bounded white-box evaluation. For $\ell_\infty$ attacks, one iteration is:

$$x^{(t+1)} = \Pi_{\|x' - x\|_\infty \le \epsilon}\Big(x^{(t)} + \alpha \cdot \operatorname{sign}\big(\nabla_x L(f_\theta(x^{(t)}), y)\big)\Big)$$

The projection is coordinatewise:

$$\big[\Pi(x')\big]_i = \operatorname{clip}\big(x'_i, \; x_i - \epsilon, \; x_i + \epsilon\big)$$

For $\ell_2$ attacks, the update uses normalized gradients and projection onto the $\ell_2$ ball:

$$x^{(t+1)} = \Pi_{\|x' - x\|_2 \le \epsilon}\left(x^{(t)} + \alpha \cdot \frac{\nabla_x L(f_\theta(x^{(t)}), y)}{\big\|\nabla_x L(f_\theta(x^{(t)}), y)\big\|_2}\right)$$
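As a minimal sketch of the $\ell_2$ update, assuming inputs already scaled to $[0, 1]$, a single step can be written as below; the helper name and signature are illustrative, not from a specific library.

```python
import torch
import torch.nn.functional as F

def pgd_l2_step(model, x0, x_adv, y, epsilon, step_size):
    # One l_2 PGD step: ascend along the per-example normalized gradient,
    # then project the perturbation back onto the epsilon ball and clip
    # to the valid pixel range.
    x_adv = x_adv.detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    with torch.no_grad():
        shape = (-1,) + (1,) * (grad.dim() - 1)
        grad_norm = grad.flatten(1).norm(p=2, dim=1).clamp_min(1e-12)
        x_new = x_adv + step_size * grad / grad_norm.view(shape)
        delta = x_new - x0
        delta_norm = delta.flatten(1).norm(p=2, dim=1).clamp_min(1e-12)
        # Shrink any perturbation whose l_2 norm exceeds epsilon.
        delta = delta * (epsilon / delta_norm).clamp(max=1.0).view(shape)
        return (x0 + delta).clamp(0.0, 1.0)
```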
An important evaluation point is that white-box attacks should be adaptive. If the defended model is:

$$f_\theta\big(g(x)\big)$$

where $g$ is preprocessing, the gradient should be taken through $g$ when possible:

$$\nabla_x L\big(f_\theta(g(x)), y\big)$$

If $g$ is nondifferentiable, the attacker may use BPDA, a differentiable approximation, score-based search, or another adaptive method. If the defense is randomized, the attacker may optimize the expected loss with expectation over transformations:

$$\max_{\delta \in \Delta} \; \mathbb{E}_{t \sim T}\Big[L\big(f_\theta(t(x + \delta)), y\big)\Big]$$
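As a hedged sketch of the expectation-over-transformations idea, the snippet below averages input gradients across sampled transformations; `transform` is an assumed placeholder for a differentiable, randomized defense component.

```python
import torch
import torch.nn.functional as F

def eot_gradient(model, transform, x_adv, y, samples=10):
    # Estimate the gradient of the expected loss under a randomized, assumed
    # differentiable `transform` by averaging over several sampled draws.
    x_adv = x_adv.detach().requires_grad_(True)
    total = 0.0
    for _ in range(samples):
        total = total + F.cross_entropy(model(transform(x_adv)), y)
    return torch.autograd.grad(total / samples, x_adv)[0]
```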
Algorithm choice depends on the goal. FGSM is useful as a pedagogical baseline and for fast adversarial training variants. PGD is a strong default for evaluation. C&W-style attacks are useful when minimizing distortion or handling confidence margins. DeepFool estimates boundary distance and can be useful for geometric diagnostics. Momentum attacks often improve transferability because they avoid overfitting to local gradient noise.
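To make the momentum accumulation concrete, here is a minimal $\ell_\infty$ momentum iterative sketch in PyTorch; the function name and default decay factor are illustrative, not from a specific library.

```python
import torch
import torch.nn.functional as F

def mim_linf(model, x, y, epsilon, step_size, steps, mu=1.0):
    # Sketch of a momentum iterative l_inf attack: accumulate L1-normalized
    # gradients into g, then take signed steps and clip to the epsilon ball.
    x0 = x.detach()
    x_adv = x0.clone()
    g = torch.zeros_like(x0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Normalize each example's gradient by its L1 norm before accumulating.
            l1 = grad.abs().flatten(1).sum(dim=1).clamp_min(1e-12)
            g = mu * g + grad / l1.view(-1, *([1] * (grad.dim() - 1)))
            x_adv = x_adv + step_size * g.sign()
            x_adv = (x0 + (x_adv - x0).clamp(-epsilon, epsilon)).clamp(0.0, 1.0)
    return x_adv.detach()
```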
Attack complexity is usually reported in gradient evaluations. FGSM uses one backward pass with respect to the input. PGD-$k$ uses roughly $k$ such backward passes per restart. A PGD-50 attack with 10 restarts is therefore much more expensive than PGD-10 with one restart, and the comparison matters when evaluating defenses. Hyperparameters should be strong enough to make the loss increase and the success rate stabilize. Step size, number of steps, random starts, loss choice, and targeted versus untargeted variants are not cosmetic details; they can change a robustness number by a large amount.
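A quick back-of-the-envelope count of the backward passes implied by those two settings:

```python
# Backward passes per example, counting one input-gradient evaluation per PGD step.
pgd50_with_10_restarts = 50 * 10   # 500 backward passes
pgd10_with_1_restart = 10 * 1      # 10 backward passes
```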
White-box access also includes preprocessing and postprocessing when those components affect the decision. If the system normalizes inputs, crops images, rejects examples, or ensembles several models, the attack should target that whole computation. Otherwise the evaluation is only white-box for a simplified model, not for the deployed system.
Visual
| Attack | Main idea | Typical budget | Strengths | Limitations |
|---|---|---|---|---|
| FGSM | One signed gradient step | $\ell_\infty$ | Very fast, interpretable | Weak evaluation by itself |
| BIM/I-FGSM | Repeated clipped FGSM | $\ell_\infty$ | Stronger than FGSM | Can overfit local model, needs step tuning |
| PGD | Iterative projected ascent with random starts | $\ell_\infty$, $\ell_2$ | Standard first-order baseline | Still approximate, can miss masked gradients |
| MIM | PGD with momentum | $\ell_\infty$, $\ell_2$ | Often better transfer | Extra hyperparameter |
| C&W | Optimize $\ell_p$ norm plus logit-margin penalty | $\ell_2$, $\ell_0$, $\ell_\infty$ variants | Strong for low-distortion attacks | Slower, coefficient search matters |
| DeepFool | Iteratively cross local linear boundary | Often $\ell_2$ | Geometric boundary estimate | Not a full robust evaluation |
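To make the logit-margin idea in the C&W row concrete, here is a minimal sketch of an untargeted margin loss in PyTorch; the function name and the exact clamping convention are illustrative rather than the original formulation.

```python
import torch

def cw_margin_loss(logits, y, kappa=0.0):
    # Positive until the best non-true logit exceeds the true-class logit
    # by at least kappa (attack succeeded with margin kappa), zero afterwards.
    true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, y.unsqueeze(1), float("-inf"))
    best_other = others.max(dim=1).values
    return torch.clamp(true_logit - best_other + kappa, min=0.0)
```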
Worked example 1: One FGSM step on normalized pixels
Problem: A grayscale input has four pixels:

$$x = (0.2, \; 0.5, \; 0.8, \; 0.4)$$

The input gradient is:

$$\nabla_x L = (0.5, \; -1.2, \; 0.0, \; 0.3)$$

Compute the FGSM adversarial input for $\epsilon = 0.1$ and clip to $[0, 1]$.
- Take the elementwise sign: $\operatorname{sign}(\nabla_x L) = (1, \; -1, \; 0, \; 1)$.
- Multiply by $\epsilon$: $(0.1, \; -0.1, \; 0.0, \; 0.1)$.
- Add to the input: $x' = (0.3, \; 0.4, \; 0.8, \; 0.5)$.
- Clip to $[0, 1]$. No coordinate is outside the interval, so the value is unchanged.
Checked answer: $x' = (0.3, \; 0.4, \; 0.8, \; 0.5)$.
The third pixel does not change because the gradient coordinate is zero. In real floating-point code, exactly zero gradients can also be a warning sign if they arise from nondifferentiable preprocessing or saturated activations.
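A few lines of PyTorch reproduce the arithmetic in this example:

```python
import torch

# Reproduce worked example 1 numerically.
x = torch.tensor([0.2, 0.5, 0.8, 0.4])
grad = torch.tensor([0.5, -1.2, 0.0, 0.3])
epsilon = 0.1
x_adv = (x + epsilon * grad.sign()).clamp(0.0, 1.0)
print(x_adv)  # tensor([0.3000, 0.4000, 0.8000, 0.5000])
```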
Worked example 2: Two PGD iterations with projection
Problem: Let $\epsilon = 0.05$, step size $\alpha = 0.03$, and:

$$x = (0.50, \; 0.40)$$

Assume the attack starts at $x^{(0)} = x$ (no random start) and sees gradient signs $(+1, \; -1)$ at both steps.
Compute two PGD steps.
- First gradient step: $x^{(1)} = (0.50 + 0.03, \; 0.40 - 0.03) = (0.53, \; 0.37)$.
- Project to the $\ell_\infty$ ball around $x$. The allowed interval for each coordinate is $[x_i - 0.05, \; x_i + 0.05]$, that is $[0.45, 0.55]$ and $[0.35, 0.45]$. The point is valid, so $x^{(1)}$ is unchanged.
- Second gradient step: $x^{(2)} = (0.53 + 0.03, \; 0.37 - 0.03) = (0.56, \; 0.34)$.
- Project coordinatewise to $[x_i - \epsilon, \; x_i + \epsilon]$: $0.56 \to 0.55$ and $0.34 \to 0.35$, giving $x^{(2)} = (0.55, \; 0.35)$.
- Check the perturbation: $\delta = x^{(2)} - x = (0.05, \; -0.05)$, so $\|\delta\|_\infty = 0.05 = \epsilon$.
Checked answer: after two steps, $x^{(2)} = (0.55, \; 0.35)$ with $\|\delta\|_\infty = \epsilon$, exactly on the boundary of the allowed $\ell_\infty$ ball.
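The same two-step computation can be checked numerically; the loop below uses the fixed gradient signs from the problem statement rather than real model gradients:

```python
import torch

# Reproduce worked example 2 numerically.
x0 = torch.tensor([0.50, 0.40])
epsilon, alpha = 0.05, 0.03
signs = torch.tensor([1.0, -1.0])
x_adv = x0.clone()
for _ in range(2):
    x_adv = x_adv + alpha * signs                         # signed ascent step
    x_adv = x0 + (x_adv - x0).clamp(-epsilon, epsilon)    # l_inf projection
    x_adv = x_adv.clamp(0.0, 1.0)                         # valid pixel range
print(x_adv)                     # tensor([0.5500, 0.3500])
print((x_adv - x0).abs().max())  # tensor(0.0500)
```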
Code
```python
import torch
import torch.nn.functional as F


def fgsm(model, x, y, epsilon):
    # One signed gradient step with respect to the input, then clip to [0, 1].
    x_adv = x.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()


def pgd_linf(model, x, y, epsilon, step_size, steps, restarts=1):
    # Track the most damaging example found across restarts for each input.
    # Clone so that in-place updates to best_x never modify the clean batch.
    best_x = x.detach().clone()
    best_loss = torch.full((x.shape[0],), -float("inf"), device=x.device)
    for _ in range(restarts):
        x0 = x.detach()
        # Random start inside the l_inf ball, clipped to the valid pixel range.
        x_adv = (x0 + torch.empty_like(x0).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y, reduction="sum")
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                # Signed ascent step, then project back onto the epsilon ball.
                x_adv = x_adv + step_size * grad.sign()
                delta = (x_adv - x0).clamp(-epsilon, epsilon)
                x_adv = (x0 + delta).clamp(0.0, 1.0)
        with torch.no_grad():
            losses = F.cross_entropy(model(x_adv), y, reduction="none")
            replace = losses > best_loss
            best_loss[replace] = losses[replace]
            best_x[replace] = x_adv[replace]
    return best_x.detach()
```
This code uses gradients with respect to the input, not the model parameters. In a real evaluation, the model should be in eval() mode unless the threat model intentionally allows batch-statistics behavior during attack.
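A minimal usage sketch, assuming a trained classifier `model` in eval() mode and a test loader `loader` with inputs already scaled to $[0, 1]$ (both placeholders):

```python
# Robust accuracy under the l_inf PGD attack defined above.
correct, total = 0, 0
for x, y in loader:
    x_adv = pgd_linf(model, x, y, epsilon=8 / 255, step_size=2 / 255,
                     steps=50, restarts=10)
    with torch.no_grad():
        pred = model(x_adv).argmax(dim=1)
    correct += (pred == y).sum().item()
    total += y.numel()
print("robust accuracy:", correct / total)
```

The budget here (8/255 with 50 steps and 10 restarts) is a common convention for small image benchmarks, not a requirement; the point is that the attack targets the same model object whose robust accuracy is reported.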
Common pitfalls
- Calling FGSM robustness a strong white-box evaluation. It is a baseline, not a complete test.
- Forgetting random starts and restarts for PGD, especially when evaluating defenses.
- Taking gradients through a different pipeline than the deployed model uses.
- Leaving dropout, batch normalization, or stochastic layers in an unintended mode during evaluation.
- Using a step size so large that PGD bounces around and looks weaker than it is.
- Reporting only average loss increase instead of attack success rate or robust accuracy.
- Ignoring adaptive methods such as BPDA or EOT when the defense uses nondifferentiable or randomized preprocessing.
Connections
- Mathematical formulation derives FGSM and projected-gradient updates.
- Gradient masking and obfuscation explains why white-box attacks must be adaptive.
- Adversarial training uses PGD-like attacks inside training.
- Evaluation and benchmarks discusses AutoAttack, restarts, and benchmark discipline.
- Deep learning gives the backpropagation machinery used to compute input gradients.
Further reading
- Goodfellow, Shlens, and Szegedy, "Explaining and Harnessing Adversarial Examples."
- Kurakin, Goodfellow, and Bengio, "Adversarial Examples in the Physical World."
- Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks."
- Dong et al., "Boosting Adversarial Attacks with Momentum."
- Carlini and Wagner, "Towards Evaluating the Robustness of Neural Networks."
- Moosavi-Dezfooli, Fawzi, and Frossard, "DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks."