White-Box Attacks
White-box attacks assume the attacker can inspect and differentiate the full model and defense. This is the standard stress test for adversarial robustness because it removes accidental secrecy: if a defense only works when the attacker does not know the preprocessing, loss, randomness, or architecture, the defense is brittle under a security interpretation.
Most white-box image attacks are variations on constrained optimization. They use gradients with respect to the input, not the weights, and search inside a threat set such as an $\ell_\infty$ or $\ell_2$ ball. This page gives the conceptual scaffold for FGSM, BIM/I-FGSM, PGD, momentum iterative attacks, Carlini-Wagner attacks, DeepFool, and elastic-net attacks, with the original papers cited inline.
Definitions
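For a classifier $f_\theta$, loss $\mathcal{L}$, clean input $x$, label $y$, perturbation set $\mathcal{S}$, and adversarial input $x' = x + \delta$, the white-box untargeted attack problem is:
$$\max_{\delta \in \mathcal{S}} \; \mathcal{L}(f_\theta(x + \delta), y)$$
The attacker can compute:
$$\nabla_x \mathcal{L}(f_\theta(x), y)$$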
and, when needed, gradients through preprocessing, random transformations, differentiable defenses, logit margins, or surrogate losses.
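FGSM uses one first-order step:
$$x' = x + \epsilon \, \mathrm{sign}\big(\nabla_x \mathcal{L}(f_\theta(x), y)\big)$$
BIM or I-FGSM repeats small FGSM-like steps of size $\alpha$ and clips after each step:
$$x^{(0)} = x, \qquad x^{(t+1)} = \mathrm{clip}_{x,\epsilon}\big(x^{(t)} + \alpha \, \mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x^{(t)}), y))\big)$$
PGD is the same projected-gradient idea, usually with a random start:
$$x^{(0)} = x + u, \quad u_i \sim \mathcal{U}(-\epsilon, \epsilon), \qquad x^{(t+1)} = \Pi_{x + \mathcal{S}}\big(x^{(t)} + \alpha \, \mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x^{(t)}), y))\big)$$
where $\Pi_{x + \mathcal{S}}$ projects onto the allowed set around $x$. Momentum iterative attacks stabilize the direction by accumulating normalized gradients with decay factor $\mu$:
$$g^{(t+1)} = \mu \, g^{(t)} + \frac{\nabla_x \mathcal{L}(f_\theta(x^{(t)}), y)}{\|\nabla_x \mathcal{L}(f_\theta(x^{(t)}), y)\|_1}, \qquad x^{(t+1)} = \Pi_{x + \mathcal{S}}\big(x^{(t)} + \alpha \, \mathrm{sign}(g^{(t+1)})\big)$$
Carlini-Wagner attacks often optimize a penalty objective such as:
$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot h(x + \delta)$$
with a loss $h$ designed around logit margins and target success. DeepFool approximates the classifier locally by linear decision boundaries and iteratively moves to the nearest boundary.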
Key results
FGSM follows from the first-order Taylor approximation described in mathematical formulation. It is fast, simple, and historically important, but it is not a strong standalone robustness evaluation. A model can resist one-step attacks while remaining vulnerable to iterative attacks.
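PGD is the workhorse first-order attack for norm-bounded white-box evaluation. For $\ell_\infty$ attacks, one iteration is:
$$x^{(t+1)} = \Pi_{B_\infty(x, \epsilon)}\big(x^{(t)} + \alpha \, \mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x^{(t)}), y))\big)$$
The projection is coordinatewise:
$$\big[\Pi_{B_\infty(x, \epsilon)}(z)\big]_i = \min\big(\max(z_i, \; x_i - \epsilon), \; x_i + \epsilon\big)$$
For $\ell_2$ attacks, the update uses normalized gradients and projection onto the $\ell_2$ ball:
$$x^{(t+1)} = \Pi_{B_2(x, \epsilon)}\Big(x^{(t)} + \alpha \, \frac{g}{\|g\|_2}\Big), \quad g = \nabla_x \mathcal{L}(f_\theta(x^{(t)}), y), \qquad \Pi_{B_2(x, \epsilon)}(z) = x + \frac{z - x}{\max(1, \; \|z - x\|_2 / \epsilon)}$$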
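The two projections can be sketched in plain Python on lists of floats. This is a minimal illustration, not a reference implementation: the function names `project_linf` and `project_l2` are made up for this example, and inputs are assumed to be pixel values in [0, 1].

```python
import math

def project_linf(x_adv, x, epsilon, lo=0.0, hi=1.0):
    """Clamp each coordinate to [x_i - eps, x_i + eps], then to the pixel range."""
    out = []
    for a, c in zip(x_adv, x):
        a = min(max(a, c - epsilon), c + epsilon)
        out.append(min(max(a, lo), hi))
    return out

def project_l2(x_adv, x, epsilon):
    """Rescale the perturbation onto the l2 ball of radius eps if it lies outside."""
    delta = [a - c for a, c in zip(x_adv, x)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= epsilon:
        return list(x_adv)
    scale = epsilon / norm
    return [c + scale * d for c, d in zip(x, delta)]

# Hypothetical values chosen only for illustration.
linf_point = project_linf([0.9, 0.1], [0.5, 0.5], epsilon=0.1)
l2_point = project_l2([3.0, 4.0], [0.0, 0.0], epsilon=1.0)
```

The $\ell_\infty$ projection acts on each coordinate independently, while the $\ell_2$ projection rescales the whole perturbation vector. The $\ell_2$ sketch leaves clipping to the pixel range to the caller.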
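An important evaluation point is that white-box attacks should be adaptive. If the defended model is:
$$F(x) = f_\theta(p(x))$$
where $p$ is preprocessing, the gradient should be taken through $p$ when possible:
$$\nabla_x \mathcal{L}(f_\theta(p(x)), y)$$
If $p$ is nondifferentiable, the attacker may use BPDA (Backward Pass Differentiable Approximation), a differentiable approximation, score-based search, or another adaptive method. If the defense is randomized, the attacker may optimize the expected loss with expectation over transformations (EOT):
$$\max_{\delta \in \mathcal{S}} \; \mathbb{E}_{t \sim T}\big[\mathcal{L}(f_\theta(t(x + \delta)), y)\big]$$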
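The expectation over a randomized defense is usually estimated by averaging gradients over sampled transformations. The sketch below illustrates this with a toy stand-in for the defense: additive uniform noise followed by a quadratic loss whose gradient is known in closed form. The names `eot_gradient`, `sample_noise`, and `toy_grad` are hypothetical.

```python
import random

def eot_gradient(grad_fn, x, sample_transform, n_samples=2000, seed=0):
    """Monte Carlo estimate of the gradient of E_t[loss(t(x))] with respect to x.

    grad_fn(x, t) must return d loss(t(x)) / dx for one sampled transform t.
    """
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        t = sample_transform(rng)
        g = grad_fn(x, t)
        total = [a + b for a, b in zip(total, g)]
    return [a / n_samples for a in total]

# Toy randomized defense (an assumption for this sketch): add noise
# n_i ~ U(-1, 1) to each input, then apply the loss 0.5 * sum((x_i + n_i)^2).
# The per-sample gradient is x + n, so its sign is unreliable on any one draw.
def sample_noise(rng):
    return [rng.uniform(-1.0, 1.0) for _ in range(2)]

def toy_grad(x, noise):
    return [xi + ni for xi, ni in zip(x, noise)]

x = [0.3, -0.2]
g = eot_gradient(toy_grad, x, sample_noise)
```

A single noisy draw can point the wrong way on either coordinate, while the averaged estimate concentrates near the expected gradient, which equals `x` in this toy setup.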
Algorithm choice depends on the goal. FGSM is useful as a pedagogical baseline and for fast adversarial training variants. PGD is a strong default for evaluation. C&W-style attacks are useful when minimizing distortion or handling confidence margins. DeepFool estimates boundary distance and can be useful for geometric diagnostics. Momentum attacks often improve transferability because they avoid overfitting to local gradient noise.
Attack complexity is usually reported in gradient evaluations. FGSM uses one backward pass with respect to the input. PGD-$k$ uses roughly $k$ such backward passes per restart. A PGD-50 attack with 10 restarts is therefore much more expensive than PGD-10 with one restart, and the comparison matters when evaluating defenses. Hyperparameters should be strong enough to make the loss increase and the success rate stabilize. Step size, number of steps, random starts, loss choice, and targeted versus untargeted variants are not cosmetic details; they can change a robustness number by a large amount.
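The cost comparison is simple arithmetic. The helper below, a hypothetical `gradient_evaluations` function, counts only input-gradient computations per example and ignores the extra forward passes used to select the best restart.

```python
def gradient_evaluations(steps, restarts=1):
    """Approximate number of input-gradient computations per example."""
    return steps * restarts

fgsm_cost = gradient_evaluations(steps=1)
pgd10_cost = gradient_evaluations(steps=10)
pgd50x10_cost = gradient_evaluations(steps=50, restarts=10)
print(fgsm_cost, pgd10_cost, pgd50x10_cost)  # 1 10 500
```

Under this count, PGD-50 with 10 restarts costs 50 times as much as single-restart PGD-10, which is why the two numbers should not be compared without stating the budget.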
White-box access also includes preprocessing and postprocessing when those components affect the decision. If the system normalizes inputs, crops images, rejects examples, or ensembles several models, the attack should target that whole computation. Otherwise the evaluation is only white-box for a simplified model, not for the deployed system.
Representative white-box methods
Single-step sign attacks
Goodfellow, Shlens, and Szegedy [1] introduced the fast gradient sign method as the canonical first-order attack. The contribution was both algorithmic and explanatory: if a high-dimensional model behaves locally like a linear function, many tiny coordinatewise changes can add up to a large logit or loss change.
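The method starts from the Taylor approximation:
$$\mathcal{L}(f_\theta(x + \delta), y) \approx \mathcal{L}(f_\theta(x), y) + \delta^\top \nabla_x \mathcal{L}(f_\theta(x), y)$$
Over $\|\delta\|_\infty \le \epsilon$, the linear term is maximized by:
$$\delta^\star = \epsilon \, \mathrm{sign}\big(\nabla_x \mathcal{L}(f_\theta(x), y)\big)$$
For a targeted variant, descend the target-class loss:
$$x' = x - \epsilon \, \mathrm{sign}\big(\nabla_x \mathcal{L}(f_\theta(x), y_{\text{target}})\big)$$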
Compact pseudo-code:
g = gradient(loss(model(x), y), x)
x_adv = clip(x + epsilon * sign(g), 0, 1)
Figure: FGSM's canonical panda-plus-perturbation example, with panels for the clean input, the signed perturbation, and the adversarial output. From Goodfellow, Shlens, and Szegedy, 2014; embedded under educational fair use with attribution.
This one backward pass is useful as a sanity check and teaching baseline. It is not a complete robustness evaluation because it trusts the loss surface at exactly the clean input.
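The linear explanation behind FGSM can be checked numerically. For a purely linear score $w^\top x$, moving every coordinate by $\epsilon$ in the direction of $\mathrm{sign}(w_i)$ changes the score by $\epsilon \|w\|_1$, which grows with the input dimension even when each weight is tiny. The weights below are hypothetical values chosen only to illustrate that growth.

```python
def linear_fgsm_gain(weights, epsilon):
    """Change in w.x when x moves by epsilon * sign(w); equals epsilon * ||w||_1."""
    delta = [epsilon * ((w > 0) - (w < 0)) for w in weights]
    return sum(w * d for w, d in zip(weights, delta))

# Same per-coordinate weight magnitude and epsilon, increasing dimension.
gains = {}
for n in (10, 1000, 100000):
    w = [0.01 if i % 2 == 0 else -0.01 for i in range(n)]
    gains[n] = linear_fgsm_gain(w, epsilon=0.01)
    print(n, gains[n])
```

Each coordinate contributes only 0.0001 to the score change, yet the total scales linearly with the number of coordinates.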
Iterative projected gradient methods

Figure: PGD adversarial examples support the robust optimization view of local worst-case loss. From Madry et al., 2017 — embedded under educational fair use with attribution.
Kurakin, Goodfellow, and Bengio [2] popularized basic iterative methods, also called BIM or I-FGSM, by repeating small clipped sign-gradient steps. Madry et al. [3] made the projected-gradient view central to robust optimization: the same inner maximization used for evaluation also becomes the workhorse inner loop of adversarial training.
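For $\ell_\infty$ attacks, initialize either at $x$ or at a random point in the $\epsilon$-ball:
$$x^{(0)} = x + u, \quad u_i \sim \mathcal{U}(-\epsilon, \epsilon)$$
then iterate:
$$x^{(t+1)} = \Pi_{B_\infty(x, \epsilon)}\big(x^{(t)} + \alpha \, \mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x^{(t)}), y))\big)$$
For $\ell_2$, replace the sign direction by a normalized gradient direction:
$$x^{(t+1)} = \Pi_{B_2(x, \epsilon)}\Big(x^{(t)} + \alpha \, \frac{\nabla_x \mathcal{L}(f_\theta(x^{(t)}), y)}{\|\nabla_x \mathcal{L}(f_\theta(x^{(t)}), y)\|_2}\Big)$$
The robust optimization objective from [3] is:
$$\min_\theta \; \mathbb{E}_{(x, y) \sim \mathcal{D}}\Big[\max_{\delta \in \mathcal{S}} \mathcal{L}(f_\theta(x + \delta), y)\Big]$$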
Compact pseudo-code:
x_adv = random_point_in_ball(x, epsilon)
for step in 1..k:
    g = gradient(loss(model(x_adv), y), x_adv)
    x_adv = project_to_ball_and_pixels(x_adv + alpha * sign(g), x, epsilon)
return strongest_restart_by_loss
The important reporting fields are $\epsilon$, step size, number of steps, random starts, loss, clipping range, and whether gradients pass through the full defended pipeline.
Momentum-smoothed transfer directions
Dong et al. [4] introduced momentum iterative FGSM to make iterative sign attacks less tied to noisy local gradients of one source model. The method is especially important for transfer attacks: it can craft examples on a source model or ensemble that cross decision boundaries shared by other models.
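At step $t$, normalize the current gradient and accumulate a velocity:
$$g^{(t+1)} = \mu \, g^{(t)} + \frac{\nabla_x \mathcal{L}(f(x^{(t)}), y)}{\|\nabla_x \mathcal{L}(f(x^{(t)}), y)\|_1}$$
Then update by the sign of the accumulated direction:
$$x^{(t+1)} = \Pi_{B_\infty(x, \epsilon)}\big(x^{(t)} + \alpha \, \mathrm{sign}(g^{(t+1)})\big)$$
For an ensemble source of $K$ models $f_1, \dots, f_K$, the loss can be averaged:
$$\mathcal{L}_{\text{ens}}(x, y) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}(f_k(x), y)$$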
Compact pseudo-code:
momentum = 0
for step in 1..k:
    grad = gradient(loss(source_or_ensemble(x_adv), y), x_adv)
    grad = grad / mean(abs(grad))
    momentum = mu * momentum + grad
    x_adv = project_to_ball_and_pixels(x_adv + alpha * sign(momentum), x, epsilon)
Momentum is an optimizer change, not a new threat model. A report should separate white-box source success from black-box transfer success.
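The stabilizing effect of momentum can be seen on a toy gradient sequence. The sketch below is a plain-Python version of the momentum update under an $\ell_\infty$ budget. The function `mi_fgsm`, the scripted gradient sequence, and all numeric values are assumptions made for illustration, and the gradient is normalized by its $\ell_1$ norm rather than by the mean absolute value used in the pseudo-code above.

```python
def sign(v):
    return (v > 0) - (v < 0)

def mi_fgsm(grad_fn, x, epsilon, alpha, steps, mu=1.0, lo=0.0, hi=1.0):
    """Momentum iterative sign attack under an l-infinity budget."""
    x_adv = list(x)
    momentum = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn(x_adv)
        l1 = sum(abs(v) for v in g) or 1.0  # guard against an all-zero gradient
        momentum = [mu * m + v / l1 for m, v in zip(momentum, g)]
        # Step along the sign of the velocity, then project to the ball and pixel range.
        x_adv = [
            min(max(min(max(a + alpha * sign(m), c - epsilon), c + epsilon), lo), hi)
            for a, m, c in zip(x_adv, momentum, x)
        ]
    return x_adv

# Scripted gradients: the third one flips sign, standing in for local gradient noise.
noisy = iter([[1.0, -1.0], [1.0, -1.0], [-0.2, 0.2], [1.0, -1.0]])
x_adv = mi_fgsm(lambda _: next(noisy), [0.5, 0.5], epsilon=0.1, alpha=0.02, steps=4)
```

With momentum the accumulated direction keeps its sign through the flipped gradient, so all four steps move the same way. A plain sign iteration would have stepped backward on the third step.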
Logit-margin and boundary-distance optimization
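Carlini and Wagner [5] showed that defenses which appeared strong against earlier attacks could fail under an adaptive optimization objective based on logits and confidence margins. The best-known version solves a penalty problem:
$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot h(x + \delta) \quad \text{subject to } x + \delta \in [0, 1]^d$$
with targeted logit-margin loss:
$$h(x') = \max\Big(\max_{i \neq \tau} Z_i(x') - Z_\tau(x'), \; -\kappa\Big)$$
where $Z(x')$ denotes the logits and $\tau$ the target class. The confidence $\kappa$ requires the target logit to beat every other logit by a margin. A binary search over $c$ is part of the method because too small a penalty fails to attack and too large a penalty over-perturbs.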
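The targeted margin loss takes the largest non-target logit minus the target logit, floored at the negative confidence. A minimal sketch follows, with the hypothetical function name `cw_targeted_margin` and made-up logits.

```python
def cw_targeted_margin(logits, target, kappa=0.0):
    """max(max_{i != target} Z_i - Z_target, -kappa) for a list of logits."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)

logits = [2.0, 5.0, 1.0]  # hypothetical logits for three classes
loss_not_yet_target = cw_targeted_margin(logits, target=0)          # 3.0
loss_already_target = cw_targeted_margin(logits, target=1)          # floored at 0
loss_with_confidence = cw_targeted_margin(logits, target=1, kappa=2.0)  # -2.0
```

The loss stays positive until the target class wins, and with a nonzero confidence it keeps decreasing until the target logit leads by that margin.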
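DeepFool, introduced by Moosavi-Dezfooli, Fawzi, and Frossard [6], asks a different but related question: how far is the nearest decision boundary under a local linear approximation? For binary affine score $f(x) = w^\top x + b$, the smallest $\ell_2$ move to the boundary is:
$$\delta^\star = -\frac{f(x)}{\|w\|_2^2} \, w$$
For multiclass scores, compare each class boundary by:
$$\frac{|f_k(x) - f_{\hat{k}}(x)|}{\|\nabla_x f_k(x) - \nabla_x f_{\hat{k}}(x)\|_2}, \quad k \neq \hat{k}$$
where $\hat{k}$ is the currently predicted class. DeepFool moves toward the nearest boundary and repeats until the predicted class changes. It is a geometric diagnostic and low-distortion attack, not a proof of the true nonlinear minimum.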
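For the binary affine case the minimal move has a closed form: project the input onto the hyperplane where the score is zero. The sketch below uses the hypothetical name `affine_boundary_step` and made-up values for the weights, bias, and input.

```python
def affine_boundary_step(w, b, x):
    """Smallest l2 perturbation taking x onto the boundary w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    delta = [-score / norm_sq * wi for wi in w]
    distance = abs(score) / norm_sq ** 0.5
    return delta, distance

w, b, x = [3.0, 4.0], -5.0, [3.0, 4.0]  # illustrative values only
delta, distance = affine_boundary_step(w, b, x)
x_boundary = [xi + di for xi, di in zip(x, delta)]
boundary_score = sum(wi * xi for wi, xi in zip(w, x_boundary)) + b
```

The resulting point sits on the boundary rather than across it, which is why DeepFool implementations typically scale the step by a small overshoot factor before checking the predicted class.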
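Elastic-net attacks, introduced by Chen et al. [7], extend the C&W template with an $\ell_1$ term:
$$\min_{\delta} \; c \cdot h(x + \delta) + \beta \|\delta\|_1 + \|\delta\|_2^2$$
where $h$ is the logit-margin loss. The $\ell_1$ penalty encourages sparse perturbations. A proximal shrinkage step for one perturbation coordinate is:
$$S_\beta(\delta_i) = \mathrm{sign}(\delta_i) \, \max(|\delta_i| - \beta, \; 0)$$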
Worked micro-example: if , , and , shrinkage gives , so the updated coordinate is rather than . That single coordinate illustrates the elastic-net tradeoff: move toward attack success, but pull small unnecessary changes back to zero.
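The shrinkage step can be sketched for a single pixel. The values below are hypothetical and are not the ones from the micro-example above. The function name `shrink_coordinate` is also made up, and the sketch assumes pixels in [0, 1], soft-thresholding the perturbation `z - x0` by `beta`.

```python
def shrink_coordinate(z, x0, beta, lo=0.0, hi=1.0):
    """Soft-threshold the perturbation z - x0 by beta, keeping the pixel in [lo, hi]."""
    diff = z - x0
    if diff > beta:
        return min(z - beta, hi)
    if diff < -beta:
        return max(z + beta, lo)
    return x0  # perturbation smaller than beta is reset to zero

x0, beta = 0.50, 0.05
small_change = shrink_coordinate(0.53, x0, beta)  # |0.03| <= beta, so back to x0
large_change = shrink_coordinate(0.70, x0, beta)  # 0.20 > beta, shrunk by beta
```

Small perturbations are erased entirely, while large ones survive with their magnitude reduced by `beta`, which is the mechanism that produces sparsity.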
Compact C&W-style pseudo-code:
for c in binary_search_values:
    optimize norm(x_adv - x) + c * logit_margin_loss(x_adv)
    keep the lowest-distortion successful candidate
These attacks are most useful when the evaluation question is low distortion, confidence-margin behavior, metric sensitivity, or whether a suspicious defense only defeated a weak loss.
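The role of the coefficient search can be illustrated on a one-dimensional toy problem with a closed-form inner minimizer. This is an assumption-heavy sketch and not the published procedure: the objective is $\delta^2 + c \cdot \max(1 - \delta, 0)$, the attack counts as successful when $\delta \ge 1$, and the names `toy_minimizer` and `binary_search_c` are hypothetical.

```python
def toy_minimizer(c):
    """Minimizer of delta**2 + c * max(1 - delta, 0): c / 2 while below 1, else 1."""
    return min(c / 2.0, 1.0)

def binary_search_c(attack, succeeded, c_lo=0.0, c_hi=16.0, rounds=20):
    """Find a small penalty weight that still succeeds; return (c, delta) or None."""
    best = None
    for _ in range(rounds):
        c = 0.5 * (c_lo + c_hi)
        delta = attack(c)
        if succeeded(delta):
            if best is None or abs(delta) <= abs(best[1]):
                best = (c, delta)
            c_hi = c  # success: try a smaller penalty weight
        else:
            c_lo = c  # failure: the penalty must be weighted more heavily
    return best

best_c, best_delta = binary_search_c(toy_minimizer, lambda d: d >= 1.0)
```

In this toy the search converges to the threshold `c = 2`: below it the distortion term wins and the attack stops short of the boundary, and above it the minimizer lands exactly on the boundary.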
Visual
| Attack | Main idea | Typical budget | Strengths | Limitations |
|---|---|---|---|---|
| FGSM | One signed gradient step | $\ell_\infty$ | Very fast, interpretable | Weak evaluation by itself |
| BIM/I-FGSM | Repeated clipped FGSM | $\ell_\infty$ | Stronger than FGSM | Can overfit local model, needs step tuning |
| PGD | Iterative projected ascent with random starts | $\ell_\infty$, $\ell_2$ | Standard first-order baseline | Still approximate, can miss masked gradients |
| MIM | PGD with momentum | $\ell_\infty$, $\ell_2$ | Often better transfer | Extra hyperparameter |
| C&W | Optimize norm plus logit-margin penalty | $\ell_0$, $\ell_2$, $\ell_\infty$ variants | Strong for low-distortion attacks | Slower, coefficient search matters |
| DeepFool | Iteratively cross local linear boundary | Often $\ell_2$ | Geometric boundary estimate | Not a full robust evaluation |
This diagram separates the two white-box loops that are often compressed into one arrow. PGD alternates gradient ascent, projection into the epsilon ball, and clipping to the valid input range, while C&W optimizes a distortion-plus-logit-margin objective with a coefficient search before returning the best verified adversarial example.
Worked example 1: One FGSM step on normalized pixels
Problem: A grayscale input has four pixels:
The input gradient is:
Compute the FGSM adversarial input for and clip to .
- Take the elementwise sign:
- Multiply by $\epsilon$:
- Add to the input:
- Clip to $[0, 1]$. No coordinate is outside the interval, so the value is unchanged.
Checked answer:
The third pixel does not change because the gradient coordinate is zero. In real floating-point code, exactly zero gradients can also be a warning sign if they arise from nondifferentiable preprocessing or saturated activations.
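The same four steps can be traced in a few lines of plain Python. The pixel values, gradient, and $\epsilon$ below are hypothetical stand-ins chosen to match the structure of the example (four pixels, a zero gradient in the third coordinate, no clipping needed); they are not the example's own numbers.

```python
def sign(v):
    return (v > 0) - (v < 0)

def fgsm_step(x, grad, epsilon, lo=0.0, hi=1.0):
    """x + epsilon * sign(grad), clipped coordinatewise to [lo, hi]."""
    return [min(max(xi + epsilon * sign(gi), lo), hi) for xi, gi in zip(x, grad)]

x = [0.20, 0.50, 0.70, 0.90]    # assumed pixel values
grad = [0.3, -1.2, 0.0, 2.5]    # assumed input gradient
x_adv = fgsm_step(x, grad, epsilon=0.05)
print([round(v, 2) for v in x_adv])  # [0.25, 0.45, 0.7, 0.95]
```

Only the sign of each gradient coordinate matters: the fourth pixel moves by the same amount as the first even though its gradient is much larger, and the third pixel stays put.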
Worked example 2: Two PGD iterations with projection
Problem: Let:
Assume the attack starts at $x^{(0)} = x$ and sees gradient signs:
Compute two PGD steps.
- First gradient step:
- Project to the $\ell_\infty$ ball around $x$. The allowed interval for each coordinate is $[x_i - \epsilon, x_i + \epsilon]$. The point is valid, so:
- Second gradient step:
- Project coordinatewise to $[x_i - \epsilon, x_i + \epsilon]$:
- Check the perturbation:
Checked answer: after two steps, $\|x^{(2)} - x\|_\infty = \epsilon$, exactly on the boundary of the allowed ball.
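A two-step trace with the same shape can be reproduced in plain Python. The input, $\epsilon$, step size, and gradient signs below are assumptions picked so that the first step stays inside the ball and the second step is projected back onto its boundary. The function name `pgd_step` is hypothetical.

```python
def pgd_step(x_adv, x, grad_sign, alpha, epsilon, lo=0.0, hi=1.0):
    """One signed step followed by projection to the l-infinity ball and pixel range."""
    out = []
    for a, c, s in zip(x_adv, x, grad_sign):
        a = a + alpha * s
        a = min(max(a, c - epsilon), c + epsilon)
        out.append(min(max(a, lo), hi))
    return out

x = [0.5, 0.5]                     # assumed clean input
epsilon, alpha = 0.1, 0.07         # assumed budget and step size
x1 = pgd_step(x, x, [1, -1], alpha, epsilon)   # inside the ball, unchanged by projection
x2 = pgd_step(x1, x, [1, -1], alpha, epsilon)  # overshoots, projected to the boundary
linf = max(abs(a - c) for a, c in zip(x2, x))
```

After the first step the perturbation is 0.07 in each coordinate, which is within budget. The second step would reach 0.14, so projection pulls both coordinates back to a perturbation of exactly $\epsilon$.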
Code
import torch
import torch.nn.functional as F


def fgsm(model, x, y, epsilon):
    """One-step L-infinity FGSM on inputs scaled to [0, 1]."""
    x_adv = x.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    # Differentiate with respect to the input only, so that no gradients
    # accumulate in the model parameters.
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = x_adv.detach() + epsilon * grad.sign()
    return x_adv.clamp(0.0, 1.0)


def pgd_linf(model, x, y, epsilon, step_size, steps, restarts=1):
    """L-infinity PGD with random starts; keeps the highest-loss example per input."""
    x0 = x.detach()
    # Clone so that the in-place updates below never overwrite the caller's tensor.
    best_x = x0.clone()
    best_loss = torch.full((x.shape[0],), -float("inf"), device=x.device)
    for _ in range(restarts):
        # Random start inside the epsilon ball, clipped to the valid pixel range.
        x_adv = (x0 + torch.empty_like(x0).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)
        for _ in range(steps):
            x_adv = x_adv.detach().requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y, reduction="sum")
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                x_adv = x_adv + step_size * grad.sign()
                # Project back into the epsilon ball, then into [0, 1].
                delta = (x_adv - x0).clamp(-epsilon, epsilon)
                x_adv = (x0 + delta).clamp(0.0, 1.0)
        with torch.no_grad():
            # Keep, per example, the restart that reached the highest loss.
            losses = F.cross_entropy(model(x_adv), y, reduction="none")
            replace = losses > best_loss
            best_loss[replace] = losses[replace]
            best_x[replace] = x_adv[replace]
    return best_x
This code uses gradients with respect to the input, not the model parameters. In a real evaluation, the model should be in eval() mode unless the threat model intentionally allows batch-statistics behavior during attack.
Common pitfalls
- Calling FGSM robustness a strong white-box evaluation. It is a baseline, not a complete test.
- Forgetting random starts and restarts for PGD, especially when evaluating defenses.
- Taking gradients through a different pipeline than the deployed model uses.
- Leaving dropout, batch normalization, or stochastic layers in an unintended mode during evaluation.
- Using a step size so large that PGD bounces around and looks weaker than it is.
- Reporting only average loss increase instead of attack success rate or robust accuracy.
- Ignoring adaptive methods such as BPDA or EOT when the defense uses nondifferentiable or randomized preprocessing.
Connections
- Mathematical formulation derives FGSM and projected-gradient updates.
- Gradient masking and obfuscation explains why white-box attacks must be adaptive.
- Adversarial training uses PGD-like attacks inside training.
- Evaluation and benchmarks discusses AutoAttack, restarts, and benchmark discipline.
- Deep learning gives the backpropagation machinery used to compute input gradients.
Further reading
- Goodfellow, Shlens, and Szegedy, "Explaining and Harnessing Adversarial Examples."
- Kurakin, Goodfellow, and Bengio, "Adversarial Examples in the Physical World."
- Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks."
- Dong et al., "Boosting Adversarial Attacks with Momentum."
- Carlini and Wagner, "Towards Evaluating the Robustness of Neural Networks."
- Moosavi-Dezfooli, Fawzi, and Frossard, "DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks."
References
[1] I. J. Goodfellow, J. Shlens, C. Szegedy. Explaining and Harnessing Adversarial Examples. ICLR 2015.
[2] A. Kurakin, I. J. Goodfellow, S. Bengio. Adversarial Examples in the Physical World. ICLR Workshop 2017.
[3] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR 2018.
[4] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, J. Li. Boosting Adversarial Attacks with Momentum. CVPR 2018.
[5] N. Carlini, D. Wagner. Towards Evaluating the Robustness of Neural Networks. IEEE Symposium on Security and Privacy 2017.
[6] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. CVPR 2016.
[7] P.-Y. Chen, Y. Sharma, H. Zhang, J. Yi, C.-J. Hsieh. EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. AAAI 2018.


