
White-Box Attacks

White-box attacks assume the attacker can inspect and differentiate the full model and defense. This is the standard stress test for adversarial robustness because it removes accidental secrecy: if a defense only works when the attacker does not know the preprocessing, loss, randomness, or architecture, the defense is brittle under a security interpretation.

Most white-box image attacks are variations on constrained optimization. They use gradients with respect to the input, not the weights, and search inside a threat set such as an \ell_\infty or \ell_2 ball. This page gives the conceptual scaffold for FGSM, BIM/I-FGSM, PGD, momentum iterative attacks, Carlini-Wagner attacks, and DeepFool; later paper pages can deep-dive the original algorithms.

Definitions

For a classifier f_\theta, loss \mathcal{L}, clean input x, label y, perturbation set \Delta(x), and adversarial input x_{\mathrm{adv}} = x + \delta, the white-box untargeted attack problem is:

\max_{\delta \in \Delta(x)} \mathcal{L}(f_\theta(x+\delta), y).

The attacker can compute:

\nabla_x \mathcal{L}(f_\theta(x), y),

and, when needed, gradients through preprocessing, random transformations, differentiable defenses, logit margins, or surrogate losses.

FGSM uses one first-order step:

x_{\mathrm{adv}} = \Pi_{[0,1]^d}\left(x + \epsilon\,\mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x), y))\right).

BIM or I-FGSM repeats small FGSM-like steps and clips after each step:

x^{t+1} = \Pi_{B_\infty(x,\epsilon)}\left(x^t + \alpha\,\mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x^t), y))\right).
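
This loop can be sketched in PyTorch (the helper name and `loss_fn` hook are illustrative, not a specific library API):

```python
import torch

def bim_linf(model, loss_fn, x, y, epsilon, alpha, steps):
    """BIM/I-FGSM sketch: small signed steps, projected back to the
    epsilon-ball and the valid pixel range after every step."""
    x0 = x.detach()
    x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # project onto B_inf(x, epsilon), then clip to [0, 1]
            x_adv = (x0 + (x_adv - x0).clamp(-epsilon, epsilon)).clamp(0.0, 1.0)
    return x_adv.detach()
```

Starting at x itself (no random start) is what distinguishes BIM from the PGD variant below.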

PGD is the same projected-gradient idea, usually with a random start:

x^0 = x + u, \qquad u \sim \mathrm{Uniform}([-\epsilon,\epsilon]^d).

Momentum iterative attacks stabilize the direction by accumulating normalized gradients:

g_{t+1} = \mu g_t + \frac{\nabla_x \mathcal{L}(f_\theta(x^t), y)}{\|\nabla_x \mathcal{L}(f_\theta(x^t), y)\|_1}.
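
The accumulation-plus-signed-step pattern can be sketched as follows (an \ell_\infty variant with illustrative names):

```python
import torch

def mim_linf(model, loss_fn, x, y, epsilon, alpha, steps, mu=1.0):
    """Momentum iterative FGSM sketch: accumulate L1-normalized gradients
    into g, then step along sign(g) and project back to the epsilon-ball."""
    x0 = x.detach()
    x_adv = x0.clone()
    g = torch.zeros_like(x0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # per-example L1 normalization before accumulating momentum
            norm = grad.flatten(1).abs().sum(dim=1).clamp_min(1e-12)
            g = mu * g + grad / norm.view(-1, *([1] * (grad.dim() - 1)))
            x_adv = x_adv + alpha * g.sign()
            x_adv = (x0 + (x_adv - x0).clamp(-epsilon, epsilon)).clamp(0.0, 1.0)
    return x_adv.detach()
```

With mu=0 this reduces to BIM; larger mu smooths the update direction across iterations.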

Carlini-Wagner attacks often optimize a penalty objective such as:

\min_\delta \|\delta\|_2^2 + c \cdot \Phi(x+\delta)

with a loss \Phi designed around logit margins and target success. DeepFool approximates the classifier locally by linear decision boundaries and iteratively moves to the nearest boundary.
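
A minimal penalty-form sketch of the C&W idea, using an untargeted logit-margin loss for \Phi; the original attack additionally uses a change of variables and a binary search over c, both omitted here, and all names are illustrative:

```python
import torch

def cw_l2_penalty(model, x, y, c=1.0, steps=100, lr=0.05, kappa=0.0):
    """Minimize ||delta||_2^2 + c * margin, where margin is positive
    while the true class still wins by at least kappa (untargeted form)."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = (x + delta).clamp(0.0, 1.0)
        logits = model(x_adv)
        true_logit = logits.gather(1, y.view(-1, 1)).squeeze(1)
        # best logit among the non-true classes
        other = logits.scatter(1, y.view(-1, 1), float("-inf")).amax(dim=1)
        margin = (true_logit - other).clamp_min(-kappa)
        loss = (delta.flatten(1) ** 2).sum() + c * margin.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).clamp(0.0, 1.0).detach()
```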

Key results

FGSM follows from the first-order Taylor approximation described in the mathematical formulation section. It is fast, simple, and historically important, but it is not a strong standalone robustness evaluation: a model can resist one-step attacks while remaining vulnerable to iterative attacks.

PGD is the workhorse first-order attack for norm-bounded white-box evaluation. For \ell_\infty attacks, one iteration is:

\begin{aligned} z^{t+1} &= x^t + \alpha\,\mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x^t), y)), \\ x^{t+1} &= \Pi_{[0,1]^d \cap B_\infty(x,\epsilon)}(z^{t+1}). \end{aligned}

The projection is coordinatewise:

x^{t+1} = \min(1,\max(0,\min(x+\epsilon,\max(x-\epsilon,z^{t+1})))).
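
The nested min/max form is equivalent to clipping the perturbation to [-\epsilon, \epsilon] and then clipping pixels to [0, 1], which can be checked numerically:

```python
import torch

torch.manual_seed(0)
x = torch.rand(5)          # clean point in [0, 1]
z = x + torch.randn(5)     # unprojected iterate
eps = 0.1

# nested min/max form from the text
proj_nested = torch.minimum(
    torch.ones_like(z),
    torch.maximum(torch.zeros_like(z),
                  torch.minimum(x + eps, torch.maximum(x - eps, z))))

# equivalent two-clamp form
proj_clamp = (x + (z - x).clamp(-eps, eps)).clamp(0.0, 1.0)

assert torch.allclose(proj_nested, proj_clamp)
```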

For \ell_2 attacks, the update uses normalized gradients and projection onto the \ell_2 ball:

z^{t+1} = x^t + \alpha \frac{\nabla_x \mathcal{L}(f_\theta(x^t), y)}{\|\nabla_x \mathcal{L}(f_\theta(x^t), y)\|_2}.
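
The \ell_2 step and projection can be sketched as two helper functions (names are illustrative):

```python
import torch

def l2_step(x_adv, grad, alpha):
    """Ascent step of length alpha along the per-example L2-normalized gradient."""
    g = grad.flatten(1)
    norm = g.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return x_adv + alpha * (g / norm).view_as(x_adv)

def l2_project(x_adv, x0, epsilon):
    """Project x_adv onto the L2 ball of radius epsilon around x0 (per example)."""
    delta = (x_adv - x0).flatten(1)
    norm = delta.norm(dim=1, keepdim=True).clamp_min(1e-12)
    factor = (epsilon / norm).clamp(max=1.0)  # shrink only points outside the ball
    return x0 + (delta * factor).view_as(x0)
```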

An important evaluation point is that white-box attacks should be adaptive. If the defended model is:

F(x) = f_\theta(T(x)),

where T is preprocessing, the gradient should be taken through T when possible:

\nabla_x \mathcal{L}(f_\theta(T(x)), y).

If T is nondifferentiable, the attacker may use BPDA, a differentiable approximation, score-based search, or another adaptive method. If the defense is randomized, the attacker may optimize the expected loss, with the expectation taken over transformations:

\nabla_x \mathbb{E}_{\omega}[\mathcal{L}(F(x,\omega), y)] \approx \frac{1}{m}\sum_{i=1}^m \nabla_x \mathcal{L}(F(x,\omega_i), y).
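
This Monte Carlo estimate (often called EOT) can be sketched as follows; the `transform` argument stands in for whatever random transformation the defense applies:

```python
import torch

def eot_gradient(model, loss_fn, x, y, transform, m=8):
    """Average input gradients over m sampled random transformations to
    approximate the gradient of the expected loss."""
    x_adv = x.detach().clone().requires_grad_(True)
    total = 0.0
    for _ in range(m):
        total = total + loss_fn(model(transform(x_adv)), y)
    return torch.autograd.grad(total / m, x_adv)[0]
```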

Algorithm choice depends on the goal. FGSM is useful as a pedagogical baseline and for fast adversarial training variants. PGD is a strong default for \ell_p evaluation. C&W-style attacks are useful when minimizing distortion or handling confidence margins. DeepFool estimates boundary distance and can be useful for geometric diagnostics. Momentum attacks often improve transferability because they avoid overfitting to local gradient noise.

Attack complexity is usually reported in gradient evaluations. FGSM uses one backward pass with respect to the input. PGD-k uses roughly k such backward passes per restart. A PGD-50 attack with 10 restarts is therefore much more expensive than PGD-10 with one restart, and the comparison matters when evaluating defenses. Hyperparameters should be strong enough to make the loss increase and the success rate stabilize. Step size, number of steps, random starts, loss choice, and targeted versus untargeted variants are not cosmetic details; they can change a robustness number by a large amount.

White-box access also includes preprocessing and postprocessing when those components affect the decision. If the system normalizes inputs, crops images, rejects examples, or ensembles several models, the attack should target that whole computation. Otherwise the evaluation is only white-box for a simplified model, not for the deployed system.
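
One way to honor this is to wrap preprocessing and model into a single differentiable module and attack the wrapper. A sketch with a hypothetical normalization step (`DeployedPipeline`, `mean`, and `std` are illustrative names, not a library API):

```python
import torch

class DeployedPipeline(torch.nn.Module):
    """Attack target that includes preprocessing, so input gradients
    flow through the whole deployed computation, not just the bare model."""
    def __init__(self, model, mean, std):
        super().__init__()
        self.model = model
        self.register_buffer("mean", mean)
        self.register_buffer("std", std)

    def forward(self, x):
        # normalization is part of the attacked computation
        return self.model((x - self.mean) / self.std)
```

An attack like pgd_linf below would then be called on the wrapper, not on `model` alone.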

Visual

| Attack | Main idea | Typical budget | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| FGSM | One signed gradient step | \ell_\infty | Very fast, interpretable | Weak evaluation by itself |
| BIM/I-FGSM | Repeated clipped FGSM | \ell_\infty | Stronger than FGSM | Can overfit local model, needs step tuning |
| PGD | Iterative projected ascent with random starts | \ell_\infty, \ell_2 | Standard first-order baseline | Still approximate, can miss masked gradients |
| MIM | PGD with momentum | \ell_\infty, \ell_2 | Often better transfer | Extra hyperparameter \mu |
| C&W | Optimize norm plus logit-margin penalty | \ell_2, \ell_\infty, \ell_0 variants | Strong for low-distortion attacks | Slower, coefficient search matters |
| DeepFool | Iteratively cross local linear boundary | Often \ell_2 | Geometric boundary estimate | Not a full robust evaluation |

Worked example 1: One FGSM step on normalized pixels

Problem: A grayscale input has four pixels:

x = (0.20, 0.50, 0.90, 0.10).

The input gradient is:

g = \nabla_x \mathcal{L} = (-2.0, 0.3, 0.0, 5.0).

Compute the FGSM adversarial input for \epsilon = 0.05 and clip to [0,1].

  1. Take the elementwise sign:
     \mathrm{sign}(g) = (-1, 1, 0, 1).
  2. Multiply by \epsilon:
     \delta = 0.05(-1, 1, 0, 1) = (-0.05, 0.05, 0, 0.05).
  3. Add to the input:
     x + \delta = (0.15, 0.55, 0.90, 0.15).
  4. Clip to [0,1]. No coordinate is outside the interval, so the values are unchanged.

Checked answer:

x_{\mathrm{adv}} = (0.15, 0.55, 0.90, 0.15).

The third pixel does not change because the gradient coordinate is zero. In real floating-point code, exactly zero gradients can also be a warning sign if they arise from nondifferentiable preprocessing or saturated activations.
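
The arithmetic above can be checked directly:

```python
import torch

x = torch.tensor([0.20, 0.50, 0.90, 0.10])
g = torch.tensor([-2.0, 0.3, 0.0, 5.0])

# one FGSM step with epsilon = 0.05, clipped to [0, 1]
x_adv = (x + 0.05 * g.sign()).clamp(0.0, 1.0)

assert torch.allclose(x_adv, torch.tensor([0.15, 0.55, 0.90, 0.15]))
```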

Worked example 2: Two PGD iterations with projection

Problem: Let:

x = (0.50, 0.50), \quad \epsilon = 0.10, \quad \alpha = 0.08.

Assume the attack starts at x^0 = x and sees gradient signs:

\mathrm{sign}(g^0) = (1, 1), \qquad \mathrm{sign}(g^1) = (1, -1).

Compute two \ell_\infty PGD steps.

  1. First gradient step:
     z^1 = x^0 + \alpha(1, 1) = (0.58, 0.58).
  2. Project onto the \ell_\infty ball around x. The allowed interval for each coordinate is [0.40, 0.60]. The point (0.58, 0.58) is valid, so:
     x^1 = (0.58, 0.58).
  3. Second gradient step:
     z^2 = x^1 + \alpha(1, -1) = (0.66, 0.50).
  4. Project coordinatewise to [0.40, 0.60]:
     x^2 = (0.60, 0.50).
  5. Check the perturbation:
     x^2 - x = (0.10, 0.00), \qquad \|x^2 - x\|_\infty = 0.10.

Checked answer: after two steps, x^2 = (0.60, 0.50), exactly on the boundary of the allowed \ell_\infty ball.
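
This computation can also be checked in code:

```python
import torch

x = torch.tensor([0.50, 0.50])
eps, alpha = 0.10, 0.08
signs = [torch.tensor([1.0, 1.0]), torch.tensor([1.0, -1.0])]

x_t = x.clone()
for s in signs:
    z = x_t + alpha * s
    x_t = x + (z - x).clamp(-eps, eps)  # project onto B_inf(x, eps)

assert torch.allclose(x_t, torch.tensor([0.60, 0.50]))
```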

Code

import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    # One signed gradient step on the input, then clip to the valid pixel range.
    x_adv = x.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

def pgd_linf(model, x, y, epsilon, step_size, steps, restarts=1):
    # Clone so the best-so-far buffer never aliases (and silently mutates) x.
    best_x = x.detach().clone()
    best_loss = torch.full((x.shape[0],), -float("inf"), device=x.device)

    for _ in range(restarts):
        x0 = x.detach()
        # Random start inside the epsilon-ball, clipped to valid pixels.
        x_adv = (x0 + torch.empty_like(x0).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)

        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y, reduction="sum")
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                x_adv = x_adv + step_size * grad.sign()
                delta = (x_adv - x0).clamp(-epsilon, epsilon)
                x_adv = (x0 + delta).clamp(0.0, 1.0)

        # Keep, per example, the restart that achieved the highest loss.
        with torch.no_grad():
            losses = F.cross_entropy(model(x_adv), y, reduction="none")
            replace = losses > best_loss
            best_loss[replace] = losses[replace]
            best_x[replace] = x_adv[replace]

    return best_x.detach()

This code uses gradients with respect to the input, not the model parameters. In a real evaluation, the model should be in eval() mode unless the threat model intentionally allows batch-statistics behavior during attack.

Common pitfalls

  • Calling FGSM robustness a strong white-box evaluation. It is a baseline, not a complete test.
  • Forgetting random starts and restarts for PGD, especially when evaluating defenses.
  • Taking gradients through a different pipeline than the deployed model uses.
  • Leaving dropout, batch normalization, or stochastic layers in an unintended mode during evaluation.
  • Using a step size so large that PGD bounces around and looks weaker than it is.
  • Reporting only average loss increase instead of attack success rate or robust accuracy.
  • Ignoring adaptive methods such as BPDA or EOT when the defense uses nondifferentiable or randomized preprocessing.

Connections

Further reading

  • Goodfellow, Shlens, and Szegedy, "Explaining and Harnessing Adversarial Examples."
  • Kurakin, Goodfellow, and Bengio, "Adversarial Examples in the Physical World."
  • Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks."
  • Dong et al., "Boosting Adversarial Attacks with Momentum."
  • Carlini and Wagner, "Towards Evaluating the Robustness of Neural Networks."
  • Moosavi-Dezfooli, Fawzi, and Frossard, "DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks."