White-Box Attacks

White-box attacks assume the attacker can inspect and differentiate the full model and defense. This is the standard stress test for adversarial robustness because it removes accidental secrecy: if a defense only works when the attacker does not know the preprocessing, loss, randomness, or architecture, the defense is brittle under a security interpretation.

Most white-box image attacks are variations on constrained optimization. They use gradients with respect to the input, not the weights, and search inside a threat set such as an $\ell_\infty$ or $\ell_2$ ball. This page gives the conceptual scaffold for FGSM, BIM/I-FGSM, PGD, momentum iterative attacks, Carlini-Wagner attacks, DeepFool, and elastic-net attacks, with the original papers cited inline.

Definitions

For a classifier $f_\theta$, loss $\mathcal{L}$, clean input $x$, label $y$, perturbation set $\Delta(x)$, and adversarial input $x_{\mathrm{adv}}=x+\delta$, the white-box untargeted attack problem is:

$$\max_{\delta \in \Delta(x)} \mathcal{L}(f_\theta(x+\delta), y).$$

The attacker can compute:

$$\nabla_x \mathcal{L}(f_\theta(x), y),$$

and, when needed, gradients through preprocessing, random transformations, differentiable defenses, logit margins, or surrogate losses.

FGSM uses one first-order step:

$$x_{\mathrm{adv}} = \Pi_{[0,1]^d} \left(x + \epsilon\,\mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x), y))\right).$$

BIM or I-FGSM repeats small FGSM-like steps and clips after each step:

$$x^{t+1} = \Pi_{B_\infty(x,\epsilon)} \left(x^t + \alpha\,\mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x^t), y))\right).$$

PGD is the same projected-gradient idea, usually with a random start:

$$x^0 = x + u,\qquad u \sim \mathrm{Uniform}([-\epsilon,\epsilon]^d).$$

Momentum iterative attacks stabilize the direction by accumulating normalized gradients:

$$g_{t+1} = \mu g_t + \frac{\nabla_x \mathcal{L}(f_\theta(x^t), y)}{\|\nabla_x \mathcal{L}(f_\theta(x^t), y)\|_1}.$$

Carlini-Wagner attacks often optimize a penalty objective such as:

$$\min_\delta \|\delta\|_2^2 + c \cdot \Phi(x+\delta)$$

with a loss $\Phi$ designed around logit margins and target success. DeepFool approximates the classifier locally by linear decision boundaries and iteratively moves to the nearest boundary.

Key results

FGSM follows from the first-order Taylor approximation of the loss around the clean input, as described in the mathematical formulation. It is fast, simple, and historically important, but it is not a strong standalone robustness evaluation. A model can resist one-step attacks while remaining vulnerable to iterative attacks.

PGD is the workhorse first-order attack for norm-bounded white-box evaluation. For $\ell_\infty$ attacks, one iteration is:

$$\begin{aligned} z^{t+1} &= x^t + \alpha\,\mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x^t), y)), \\ x^{t+1} &= \Pi_{[0,1]^d \cap B_\infty(x,\epsilon)}(z^{t+1}). \end{aligned}$$

The projection is coordinatewise:

$$x^{t+1} = \min(1,\max(0,\min(x+\epsilon,\max(x-\epsilon,z^{t+1})))).$$
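
The coordinatewise projection can be sketched in a few lines of NumPy. This is a minimal illustration, assuming pixel values in $[0,1]$; the helper name `project_linf` and the sample vectors are hypothetical, not from any library.

```python
import numpy as np

def project_linf(z, x, epsilon):
    """Project z onto the l_inf ball of radius epsilon around x, then onto [0, 1]^d."""
    z = np.minimum(x + epsilon, np.maximum(x - epsilon, z))  # stay inside the epsilon ball
    return np.clip(z, 0.0, 1.0)                              # stay inside the valid pixel range

x = np.array([0.02, 0.50, 0.97])   # clean input (illustrative values)
z = np.array([-0.10, 0.70, 1.20])  # point after an unconstrained gradient step
x_proj = project_linf(z, x, epsilon=0.05)
print(x_proj)
```

Because both constraints are boxes, applying the two clips in sequence gives the exact projection onto their intersection.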

For $\ell_2$ attacks, the update uses normalized gradients and projection onto the $\ell_2$ ball:

$$z^{t+1} = x^t + \alpha \frac{\nabla_x \mathcal{L}(f_\theta(x^t), y)}{\|\nabla_x \mathcal{L}(f_\theta(x^t), y)\|_2}.$$
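
A minimal NumPy sketch of one $\ell_2$ step followed by projection, under the assumption that the gradient is already available as an array. The helper names `l2_step` and `project_l2` are hypothetical. Clipping to $[0,1]$ after rescaling is a common practical shortcut: it keeps the point inside the ball but is not the exact Euclidean projection onto the intersection.

```python
import numpy as np

def l2_step(x_t, grad, alpha):
    """Move a distance alpha along the l2-normalized gradient direction."""
    norm = np.linalg.norm(grad)
    return x_t + alpha * grad / max(norm, 1e-12)  # guard against a zero gradient

def project_l2(z, x, epsilon):
    """Rescale z - x back to the l2 ball of radius epsilon, then clip to valid pixels."""
    delta = z - x
    norm = np.linalg.norm(delta)
    if norm > epsilon:
        delta = delta * (epsilon / norm)
    return np.clip(x + delta, 0.0, 1.0)

x = np.array([0.5, 0.5])
grad = np.array([3.0, 4.0])             # illustrative gradient with l2 norm 5
z = l2_step(x, grad, alpha=0.2)         # x + 0.2 * (0.6, 0.8)
x_next = project_l2(z, x, epsilon=0.1)  # perturbation norm 0.2 is scaled back to 0.1
print(z, x_next)
```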

An important evaluation point is that white-box attacks should be adaptive. If the defended model is:

$$F(x) = f_\theta(T(x)),$$

where $T$ is preprocessing, the gradient should be taken through $T$ when possible:

$$\nabla_x \mathcal{L}(f_\theta(T(x)), y).$$
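
To make the chain rule concrete, here is a small sketch assuming a hypothetical pipeline: $T$ is per-pixel normalization and $f_\theta$ is a linear logistic classifier. All constants and names are illustrative. The analytic gradient through $T$ is checked against finite differences of the full pipeline.

```python
import numpy as np

mean, std = 0.5, 0.25                     # hypothetical normalization constants
w, b = np.array([1.0, -2.0, 0.5]), 0.1    # hypothetical linear classifier

def preprocess(x):
    return (x - mean) / std

def loss(x, y):
    """Binary cross-entropy of the full pipeline f(T(x))."""
    s = w @ preprocess(x) + b
    p = 1.0 / (1.0 + np.exp(-s))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_through_T(x, y):
    """Input gradient of the full pipeline via the chain rule."""
    s = w @ preprocess(x) + b
    p = 1.0 / (1.0 + np.exp(-s))
    grad_u = (p - y) * w   # gradient with respect to T(x)
    return grad_u / std    # dT/dx = 1/std for this T

x, y = np.array([0.4, 0.6, 0.5]), 1.0
g = grad_through_T(x, y)

# Central finite differences on the full pipeline as an independent check.
h = 1e-6
g_fd = np.array([(loss(x + h * e, y) - loss(x - h * e, y)) / (2 * h) for e in np.eye(3)])
print(g, g_fd)
```

Dropping the `1/std` factor would correspond to attacking $f_\theta$ alone rather than the deployed computation $F$.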

If $T$ is nondifferentiable, the attacker may use BPDA (backward pass differentiable approximation), another differentiable approximation, score-based search, or a different adaptive method. If the defense is randomized, the attacker may optimize the expected loss with expectation over transformations (EOT):

$$\nabla_x \mathbb{E}_{\omega}[\mathcal{L}(F(x,\omega), y)] \approx \frac{1}{m}\sum_{i=1}^m \nabla_x \mathcal{L}(F(x,\omega_i), y).$$
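
The Monte Carlo estimate can be sketched as follows, assuming a toy randomized defense that adds Gaussian noise to the input of a linear logistic model. The model, the noise scale `sigma`, and the function names are all hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical toy setup: a linear logistic model behind additive Gaussian noise.
w, b = np.array([1.0, -2.0, 0.5]), 0.1

def grad_single(x, y, omega):
    """Input gradient of the logistic loss for one draw omega of the randomness."""
    s = w @ (x + omega) + b
    p = 1.0 / (1.0 + np.exp(-s))
    return (p - y) * w

def eot_gradient(x, y, m=64, sigma=0.1, seed=0):
    """Average the input gradient over m independent draws of the defense's randomness."""
    rng = np.random.default_rng(seed)
    grads = [grad_single(x, y, rng.normal(0.0, sigma, size=x.shape)) for _ in range(m)]
    return np.mean(grads, axis=0)

x, y = np.array([0.4, 0.6, 0.5]), 1.0
g_eot = eot_gradient(x, y)
print(g_eot)
```

Larger `m` reduces the variance of the estimate at the cost of `m` gradient evaluations per attack step.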

Algorithm choice depends on the goal. FGSM is useful as a pedagogical baseline and for fast adversarial training variants. PGD is a strong default for $\ell_p$ evaluation. C&W-style attacks are useful when minimizing distortion or handling confidence margins. DeepFool estimates boundary distance and can be useful for geometric diagnostics. Momentum attacks often improve transferability because they avoid overfitting to local gradient noise.

Attack complexity is usually reported in gradient evaluations. FGSM uses one backward pass with respect to the input. PGD-$k$ uses roughly $k$ such backward passes per restart. A PGD-50 attack with 10 restarts is therefore much more expensive than PGD-10 with one restart, and the comparison matters when evaluating defenses. Hyperparameters should be strong enough to make the loss increase and the success rate stabilize. Step size, number of steps, random starts, loss choice, and targeted versus untargeted variants are not cosmetic details; they can change a robustness number by a large amount.
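
As a rough illustration of the cost comparison, the snippet below counts backward passes per example under the simplifying assumption that cost is steps times restarts. The helper name is hypothetical, and the extra forward passes used to pick the best restart are ignored.

```python
def gradient_evaluations(steps, restarts=1):
    """Approximate number of input-gradient computations per example."""
    return steps * restarts

fgsm_cost = gradient_evaluations(steps=1)
pgd10_cost = gradient_evaluations(steps=10, restarts=1)
pgd50x10_cost = gradient_evaluations(steps=50, restarts=10)

print(fgsm_cost, pgd10_cost, pgd50x10_cost)  # 1 10 500
print(pgd50x10_cost / pgd10_cost)            # 50.0
```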

White-box access also includes preprocessing and postprocessing when those components affect the decision. If the system normalizes inputs, crops images, rejects examples, or ensembles several models, the attack should target that whole computation. Otherwise the evaluation is only white-box for a simplified model, not for the deployed system.

Representative white-box methods

Single-step sign attacks

Goodfellow, Shlens, and Szegedy [1] introduced the fast gradient sign method as the canonical first-order $\ell_\infty$ attack. The contribution was both algorithmic and explanatory: if a high-dimensional model behaves locally like a linear function, many tiny coordinatewise changes can add up to a large logit or loss change.

The method starts from the Taylor approximation:

$$\mathcal{L}(f_\theta(x+\delta),y) \approx \mathcal{L}(f_\theta(x),y) + \delta^\top g, \qquad g=\nabla_x\mathcal{L}(f_\theta(x),y).$$

Over $\|\delta\|_\infty\le\epsilon$, the linear term is maximized by:

$$\delta^\star=\epsilon\,\mathrm{sign}(g), \qquad x_{\mathrm{adv}}= \Pi_{[0,1]^d}(x+\delta^\star).$$
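
A quick numerical check of this maximizer, using an illustrative gradient vector: the linear term at $\delta^\star$ equals $\epsilon\|g\|_1$, and no randomly sampled perturbation inside the $\ell_\infty$ ball exceeds it. The vector `g` and the sample count are arbitrary assumptions.

```python
import numpy as np

g = np.array([1.5, -0.5, 0.0, 2.0])  # illustrative input gradient
epsilon = 0.1

delta_star = epsilon * np.sign(g)
best = delta_star @ g                # equals epsilon * ||g||_1 = 0.1 * 4.0

# Compare against random perturbations drawn from the same l_inf ball.
rng = np.random.default_rng(0)
candidates = rng.uniform(-epsilon, epsilon, size=(1000, g.size))
values = candidates @ g

print(best, values.max())
```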

For a targeted variant, descend the target-class loss:

$$x_{\mathrm{adv}}= \Pi_{[0,1]^d} \left(x-\epsilon\,\mathrm{sign}(\nabla_x\mathcal{L}(f_\theta(x),y_t))\right).$$

Compact pseudo-code:

g = gradient(loss(model(x), y), x)
x_adv = clip(x + epsilon * sign(g), 0, 1)
[Figure panels: clean input (a panda image classified as a panda by the ImageNet model); signed perturbation (a high-frequency signed gradient perturbation shown as a colorful noise pattern); adversarial output (the visually similar perturbed panda image, classified as a gibbon with high confidence).]

Figure: FGSM's canonical panda-plus-perturbation example from Goodfellow, Shlens, and Szegedy, 2014 — embedded under educational fair use with attribution.

This one backward pass is useful as a sanity check and teaching baseline. It is not a complete robustness evaluation because it trusts the loss surface at exactly the clean input.

Iterative projected gradient methods

The robust optimization figure shows loss landscapes around data points and the role of adversarial perturbation sets.

Figure: PGD adversarial examples support the robust optimization view of local worst-case loss. From Madry et al., 2017 — embedded under educational fair use with attribution.

Kurakin, Goodfellow, and Bengio [2] popularized basic iterative methods, also called BIM or I-FGSM, by repeating small clipped sign-gradient steps. Madry et al. [3] made the projected-gradient view central to robust optimization: the same inner maximization used for evaluation also becomes the workhorse inner loop of adversarial training.

For $\ell_\infty$ attacks, initialize either at $x^0=x$ or at a random point in the ball:

$$x^0=x+u,\qquad u_i\sim\mathrm{Uniform}[-\epsilon,\epsilon],$$

then iterate:

$$\begin{aligned} z^{t+1} &= x^t+\alpha\,\mathrm{sign}(\nabla_x\mathcal{L}(f_\theta(x^t),y)),\\ x^{t+1} &= \Pi_{[0,1]^d\cap B_\infty(x,\epsilon)}(z^{t+1}). \end{aligned}$$

For $\ell_2$, replace the sign direction by a normalized gradient direction:

$$z^{t+1}=x^t+\alpha \frac{\nabla_x\mathcal{L}(f_\theta(x^t),y)}{\|\nabla_x\mathcal{L}(f_\theta(x^t),y)\|_2}.$$

The robust optimization objective from [3] is:

$$\min_\theta \mathbb{E}_{(x,y)} \left[ \max_{\delta\in\Delta(x)} \mathcal{L}(f_\theta(x+\delta),y) \right].$$

Compact pseudo-code:

for restart in 1..r:
    x_adv = random_point_in_ball(x, epsilon)
    for step in 1..k:
        g = gradient(loss(model(x_adv), y), x_adv)
        x_adv = project_to_ball_and_pixels(x_adv + alpha * sign(g), x, epsilon)
    record x_adv and its final loss for this restart
return strongest_restart_by_loss

The important reporting fields are $\epsilon$, step size, number of steps, random starts, loss, clipping range, and whether gradients pass through the full defended pipeline.

Momentum-smoothed transfer directions

Dong et al. [4] introduced momentum iterative FGSM to make iterative sign attacks less tied to noisy local gradients of one source model. The method is especially important for transfer attacks: it can craft examples on a source model or ensemble that cross decision boundaries shared by other models.

At step $t$, normalize the current gradient and accumulate a velocity:

$$\tilde{g}_{t+1} = \frac{\nabla_x\mathcal{L}(f_\theta(x^t),y)}{\|\nabla_x\mathcal{L}(f_\theta(x^t),y)\|_1}, \qquad g_{t+1}=\mu g_t+\tilde{g}_{t+1}.$$

Then update by the sign of the accumulated direction:

$$x^{t+1} = \Pi_{[0,1]^d\cap B_\infty(x,\epsilon)} \left(x^t+\alpha\,\mathrm{sign}(g_{t+1})\right).$$

For an ensemble source, the loss can be averaged:

$$\mathcal{L}_{\mathrm{ens}}(x,y)= \sum_{m=1}^M w_m\mathcal{L}(f_m(x),y).$$

Compact pseudo-code:

momentum = 0
for step in 1..k:
    grad = gradient(loss(source_or_ensemble(x_adv), y), x_adv)
    grad = grad / sum(abs(grad))  # L1 normalization; any positive rescaling leaves the sign unchanged
    momentum = mu * momentum + grad
    x_adv = project_to_ball_and_pixels(x_adv + alpha * sign(momentum), x, epsilon)

Momentum is an optimizer change, not a new threat model. A report should separate white-box source success from black-box transfer success.
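
A runnable sketch of the momentum update, assuming a hypothetical linear softmax classifier so that the input gradient of the cross-entropy loss has the closed form $W^\top(p - e_y)$. The weights, budget, and step size are illustrative and not taken from [4].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_grad(W, b, x, y):
    """Gradient of cross-entropy with respect to x for logits W @ x + b."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def mi_fgsm(W, b, x, y, epsilon, alpha, steps, mu=1.0):
    x_adv = x.copy()
    momentum = np.zeros_like(x)
    for _ in range(steps):
        grad = input_grad(W, b, x_adv, y)
        grad = grad / max(np.abs(grad).sum(), 1e-12)       # L1 normalization
        momentum = mu * momentum + grad                    # accumulate velocity
        x_adv = x_adv + alpha * np.sign(momentum)
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)   # project to the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                   # project to valid pixels
    return x_adv

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), np.zeros(3)   # hypothetical source model
x = rng.uniform(0.2, 0.8, size=8)
y = int(np.argmax(W @ x + b))                 # attack the model's own prediction
x_adv = mi_fgsm(W, b, x, y, epsilon=0.1, alpha=0.02, steps=10)
print(np.max(np.abs(x_adv - x)))
```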

Logit-margin and boundary-distance optimization

Carlini and Wagner [5] showed that defenses which appeared strong against earlier attacks could fail under an adaptive optimization objective based on logits and confidence margins. The best-known $\ell_2$ version solves a penalty problem:

$$\min_\delta \|\delta\|_2^2+c\,g(x+\delta) \quad\text{subject to}\quad x+\delta\in[0,1]^d,$$

with targeted logit-margin loss:

$$g(x')= \max\left(\max_{i\ne t} Z_i(x')-Z_t(x'),-\kappa\right).$$

Here $Z_i$ denotes the $i$-th logit and $t$ the target class. The confidence parameter $\kappa$ requires the target logit to beat every other logit by at least that margin before the loss stops decreasing. A binary search over $c$ is part of the method because too small a penalty fails to attack and too large a penalty over-perturbs.
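
The margin loss is easy to evaluate directly on a logit vector. The sketch below is a plain NumPy transcription of the formula above; the function name and the example logits are hypothetical.

```python
import numpy as np

def cw_targeted_margin(logits, target, kappa=0.0):
    """max(max_{i != t} Z_i - Z_t, -kappa) for a single logit vector."""
    other = np.delete(logits, target)  # all logits except the target
    return max(other.max() - logits[target], -kappa)

not_yet = cw_targeted_margin(np.array([2.0, 5.0, 1.0]), target=0)             # target loses by 3
barely = cw_targeted_margin(np.array([6.0, 5.0, 1.0]), target=0, kappa=2.0)   # wins by 1, below kappa
confident = cw_targeted_margin(np.array([9.0, 5.0, 1.0]), target=0, kappa=2.0)  # wins by 4, floored at -kappa
print(not_yet, barely, confident)
```

A positive value means the target class is not yet predicted; the floor at `-kappa` removes any incentive to push the margin beyond the requested confidence.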

DeepFool, introduced by Moosavi-Dezfooli, Fawzi, and Frossard [6], asks a different but related question: how far is the nearest decision boundary under a local linear approximation? For a binary affine score $f(x)=w^\top x+b$, the smallest $\ell_2$ move to the boundary is:

$$r^\star=-\frac{f(x)}{\|w\|_2^2}w.$$
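
A short numerical check of the closed form, with an illustrative weight vector and input: the perturbed point lands on the decision boundary, and the step length equals $|f(x)|/\|w\|_2$.

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -1.0   # illustrative affine classifier
x = np.array([1.0, 1.0])

f_x = w @ x + b                     # score at x: 6.0
r_star = -f_x / (w @ w) * w         # minimal l2 perturbation to reach f = 0
f_boundary = w @ (x + r_star) + b   # should be zero up to rounding
distance = np.linalg.norm(r_star)   # |f(x)| / ||w||_2 = 6 / 5

print(r_star, f_boundary, distance)
```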

For multiclass scores, with $k_0$ the currently predicted class, compare each class boundary by:

$$w_k=\nabla f_k(x)-\nabla f_{k_0}(x), \qquad b_k=f_k(x)-f_{k_0}(x), \qquad d_k=\frac{|b_k|}{\|w_k\|_2}.$$

DeepFool moves toward the boundary with the smallest $d_k$ and repeats until the predicted class changes. It is a geometric diagnostic and low-distortion attack, not a proof of the true nonlinear minimum.
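
For a classifier that is exactly linear, one step of this procedure can be written out directly. The sketch below assumes hypothetical weights `W` and a small overshoot factor of 1.02 to cross the boundary rather than land on it; for a nonlinear model the gradients would be recomputed at each iterate.

```python
import numpy as np

def deepfool_linear_step(W, b, x):
    """Return (distance, class, perturbation) for the nearest linear boundary."""
    scores = W @ x + b
    k0 = int(np.argmax(scores))
    best = None
    for k in range(len(scores)):
        if k == k0:
            continue
        w_k = W[k] - W[k0]             # gradient difference (exact for a linear model)
        b_k = scores[k] - scores[k0]   # score gap, negative while k0 is predicted
        d_k = abs(b_k) / np.linalg.norm(w_k)
        if best is None or d_k < best[0]:
            best = (d_k, k, abs(b_k) / (w_k @ w_k) * w_k)
    return best

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # hypothetical 3-class model
b = np.zeros(3)
x = np.array([2.0, 1.0])

d_min, k_near, r = deepfool_linear_step(W, b, x)
new_class = int(np.argmax(W @ (x + 1.02 * r) + b))  # small overshoot crosses the boundary
print(d_min, k_near, r, new_class)
```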

Elastic-net attacks, introduced by Chen et al. [7], extend the C&W template with an $\ell_1$ term:

$$\min_{x'} c\,g(x')+\beta\|x'-x\|_1+\|x'-x\|_2^2 \quad\text{subject to}\quad x'\in[0,1]^d.$$

The $\ell_1$ penalty encourages sparse perturbations. A proximal shrinkage step for one coordinate is:

$$\mathrm{shrink}(z_i-x_i,\lambda) = \mathrm{sign}(z_i-x_i)\max(|z_i-x_i|-\lambda,0).$$

Worked micro-example: if $x_i=0.50$, $z_i=0.57$, and $\lambda=0.03$, shrinkage gives $0.04$, so the updated coordinate is $0.54$ rather than $0.57$. That single coordinate illustrates the elastic-net tradeoff: move toward attack success, but pull small unnecessary changes back to zero.
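
The micro-example extends naturally to a vector. This sketch applies the shrink formula above coordinatewise; the second and third coordinates are added for illustration and show a small change being zeroed and a negative change being shrunk.

```python
import numpy as np

def shrink(diff, lam):
    """Soft-thresholding: pull each coordinate toward zero by lam."""
    return np.sign(diff) * np.maximum(np.abs(diff) - lam, 0.0)

x = np.array([0.50, 0.50, 0.50])   # clean coordinates
z = np.array([0.57, 0.52, 0.40])   # coordinates after a gradient step
x_new = x + shrink(z - x, lam=0.03)
print(x_new)  # approximately [0.54, 0.50, 0.43]
```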

Compact C&W-style pseudo-code:

for c in binary_search_values:
    optimize squared_l2_norm(x_adv - x) + c * logit_margin_loss(x_adv)
    keep the lowest-distortion successful candidate

These attacks are most useful when the evaluation question is low distortion, confidence-margin behavior, metric sensitivity, or whether a suspicious defense only defeated a weak loss.

Visual

| Attack | Main idea | Typical budget | Strengths | Limitations |
|---|---|---|---|---|
| FGSM | One signed gradient step | $\ell_\infty$ | Very fast, interpretable | Weak evaluation by itself |
| BIM/I-FGSM | Repeated clipped FGSM | $\ell_\infty$ | Stronger than FGSM | Can overfit local model, needs step tuning |
| PGD | Iterative projected ascent with random starts | $\ell_\infty$, $\ell_2$ | Standard first-order baseline | Still approximate, can miss masked gradients |
| MIM | PGD with momentum | $\ell_\infty$, $\ell_2$ | Often better transfer | Extra hyperparameter $\mu$ |
| C&W | Optimize norm plus logit-margin penalty | $\ell_2$, $\ell_\infty$, $\ell_0$ variants | Strong for low-distortion attacks | Slower, coefficient search matters |
| DeepFool | Iteratively cross local linear boundary | Often $\ell_2$ | Geometric boundary estimate | Not a full robust evaluation |

This diagram separates the two white-box loops that are often compressed into one arrow. PGD alternates gradient ascent, projection into the epsilon ball, and clipping to the valid input range, while C&W optimizes a distortion-plus-logit-margin objective with a coefficient search before returning the best verified adversarial example.

Worked example 1: One FGSM step on normalized pixels

Problem: A grayscale input has four pixels:

$$x = (0.20, 0.50, 0.90, 0.10).$$

The input gradient is:

$$g = \nabla_x \mathcal{L} = (-2.0, 0.3, 0.0, 5.0).$$

Compute the FGSM adversarial input for $\epsilon=0.05$ and clip to $[0,1]$.

  1. Take the elementwise sign:
$$\mathrm{sign}(g)=(-1,1,0,1).$$
  2. Multiply by $\epsilon$:
$$\delta = 0.05(-1,1,0,1)=(-0.05,0.05,0,0.05).$$
  3. Add to the input:
$$x+\delta=(0.15,0.55,0.90,0.15).$$
  4. Clip to $[0,1]$. No coordinate is outside the interval, so the value is unchanged.

Checked answer:

$$x_{\mathrm{adv}}=(0.15,0.55,0.90,0.15).$$

The third pixel does not change because the gradient coordinate is zero. In real floating-point code, exactly zero gradients can also be a warning sign if they arise from nondifferentiable preprocessing or saturated activations.
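
The same arithmetic can be reproduced in NumPy as a sanity check. The values are those of the worked example; only the variable names are new.

```python
import numpy as np

x = np.array([0.20, 0.50, 0.90, 0.10])
g = np.array([-2.0, 0.3, 0.0, 5.0])
epsilon = 0.05

delta = epsilon * np.sign(g)          # np.sign(0.0) is 0.0, so the third pixel is untouched
x_adv = np.clip(x + delta, 0.0, 1.0)
print(x_adv)
```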

Worked example 2: Two PGD iterations with projection

Problem: Let:

$$x=(0.50,0.50),\quad \epsilon=0.10,\quad \alpha=0.08.$$

Assume the attack starts at $x^0=x$ and sees gradient signs:

$$\mathrm{sign}(g^0)=(1,1), \qquad \mathrm{sign}(g^1)=(1,-1).$$

Compute two $\ell_\infty$ PGD steps.

  1. First gradient step:
$$z^1=x^0+\alpha(1,1)=(0.58,0.58).$$
  2. Project to the $\ell_\infty$ ball around $x$. The allowed interval for each coordinate is $[0.40,0.60]$. The point $(0.58,0.58)$ is valid, so:
$$x^1=(0.58,0.58).$$
  3. Second gradient step:
$$z^2=x^1+\alpha(1,-1)=(0.66,0.50).$$
  4. Project coordinatewise to $[0.40,0.60]$:
$$x^2=(0.60,0.50).$$
  5. Check the perturbation:
$$x^2-x=(0.10,0.00), \qquad \|x^2-x\|_\infty=0.10.$$

Checked answer: after two steps, $x^2=(0.60,0.50)$, exactly on the boundary of the allowed $\ell_\infty$ ball.
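
The two iterations can be replayed in NumPy using the gradient signs assumed in the problem statement. No model is involved; the signs are supplied directly.

```python
import numpy as np

x = np.array([0.50, 0.50])
epsilon, alpha = 0.10, 0.08
grad_signs = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]  # signs given in the problem

x_t = x.copy()
for s in grad_signs:
    z = x_t + alpha * s                          # signed gradient step
    z = np.clip(z, x - epsilon, x + epsilon)     # project to the l_inf ball
    x_t = np.clip(z, 0.0, 1.0)                   # project to valid pixels

linf = np.max(np.abs(x_t - x))
print(x_t, linf)
```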

Code

import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    # Differentiate the loss with respect to the input, not the weights.
    x_adv = x.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

def pgd_linf(model, x, y, epsilon, step_size, steps, restarts=1):
    # Clone so the in-place writes to best_x below never modify the caller's x
    # (detach() alone shares storage with x).
    x0 = x.detach().clone()
    best_x = x0.clone()
    best_loss = torch.full((x.shape[0],), -float("inf"), device=x.device)

    for _ in range(restarts):
        # Random start inside the epsilon ball, clipped to valid pixels.
        x_adv = (x0 + torch.empty_like(x0).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)

        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y, reduction="sum")
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                # Signed ascent step, then project to the ball and to [0, 1].
                x_adv = x_adv + step_size * grad.sign()
                delta = (x_adv - x0).clamp(-epsilon, epsilon)
                x_adv = (x0 + delta).clamp(0.0, 1.0)

        # Keep, per example, the restart that ended with the highest loss.
        with torch.no_grad():
            losses = F.cross_entropy(model(x_adv), y, reduction="none")
            replace = losses > best_loss
            best_loss[replace] = losses[replace]
            best_x[replace] = x_adv[replace]

    return best_x.detach()

This code uses gradients with respect to the input, not the model parameters. In a real evaluation, the model should be in eval() mode unless the threat model intentionally allows batch-statistics behavior during attack.

Common pitfalls

  • Calling FGSM robustness a strong white-box evaluation. It is a baseline, not a complete test.
  • Forgetting random starts and restarts for PGD, especially when evaluating defenses.
  • Taking gradients through a different pipeline than the deployed model uses.
  • Leaving dropout, batch normalization, or stochastic layers in an unintended mode during evaluation.
  • Using a step size so large that PGD bounces around and looks weaker than it is.
  • Reporting only average loss increase instead of attack success rate or robust accuracy.
  • Ignoring adaptive methods such as BPDA or EOT when the defense uses nondifferentiable or randomized preprocessing.
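
For the reporting pitfall above, a small sketch of how the metrics might be computed from predictions. The definitions are one common convention and are assumptions here: robust accuracy counts examples that are correct both before and after the attack, and success rate is measured over examples that were correct before the attack. The function name and sample labels are hypothetical.

```python
import numpy as np

def robustness_metrics(y_true, pred_clean, pred_adv):
    """Return (clean accuracy, robust accuracy, attack success rate)."""
    y_true, pred_clean, pred_adv = map(np.asarray, (y_true, pred_clean, pred_adv))
    clean_correct = pred_clean == y_true
    adv_correct = pred_adv == y_true
    clean_acc = clean_correct.mean()
    robust_acc = (clean_correct & adv_correct).mean()
    # Success is counted only over examples the model got right before the attack.
    success_rate = (clean_correct & ~adv_correct).sum() / max(clean_correct.sum(), 1)
    return clean_acc, robust_acc, success_rate

y_true     = [0, 1, 2, 1, 0]
pred_clean = [0, 1, 2, 0, 0]   # 4 of 5 correct before the attack
pred_adv   = [0, 2, 2, 0, 1]   # 2 of those 4 flipped by the attack
clean_acc, robust_acc, success_rate = robustness_metrics(y_true, pred_clean, pred_adv)
print(clean_acc, robust_acc, success_rate)
```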

Further reading

  • Goodfellow, Shlens, and Szegedy, "Explaining and Harnessing Adversarial Examples."
  • Kurakin, Goodfellow, and Bengio, "Adversarial Examples in the Physical World."
  • Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks."
  • Dong et al., "Boosting Adversarial Attacks with Momentum."
  • Carlini and Wagner, "Towards Evaluating the Robustness of Neural Networks."
  • Moosavi-Dezfooli, Fawzi, and Frossard, "DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks."

References

[1] I. J. Goodfellow, J. Shlens, C. Szegedy. Explaining and Harnessing Adversarial Examples. ICLR 2015.
[2] A. Kurakin, I. J. Goodfellow, S. Bengio. Adversarial Examples in the Physical World. ICLR Workshop 2017.
[3] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR 2018.
[4] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, J. Li. Boosting Adversarial Attacks with Momentum. CVPR 2018.
[5] N. Carlini, D. Wagner. Towards Evaluating the Robustness of Neural Networks. IEEE Symposium on Security and Privacy 2017.
[6] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. CVPR 2016.
[7] P.-Y. Chen, Y. Sharma, H. Zhang, J. Yi, C.-J. Hsieh. EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. AAAI 2018.