
Adversarial Attacks

Adversarial machine learning studies how learned systems fail when inputs, training data, feedback, or surrounding context are chosen by an adversary rather than sampled passively from the test distribution. In the modern deep-learning setting, the canonical example is an adversarial image x' = x + \delta that looks nearly unchanged to a human but causes a classifier to predict the wrong label. The same discipline now reaches text, speech, reinforcement learning, retrieval systems, tool-using LLM agents, and physical-world perception.

This section is the foundational layer for SJ Wiki. It does not deep-dive individual papers yet; instead it defines the vocabulary, threat models, mathematical formulations, attack families, defenses, and evaluation rules that later paper notes will reuse. The central habit is security-style precision: every robustness claim must state the attacker's goal, knowledge, capability, and budget.

Definitions

An adversarial example for a classifier h is an input x' derived from a clean input x such that x' is valid under a specified threat model and the model's behavior changes in an attacker-desired way. In a standard norm-bounded image setting:

x' = x+\delta,\qquad \|\delta\|_p \le \epsilon.

For an untargeted attack, success means:

h(x') \ne y.

For a targeted attack, success means:

h(x') = y_t,\qquad y_t \ne y.
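
The two criteria are easy to conflate in result tables, so it can help to encode them explicitly. A minimal Python sketch, assuming the classifier has already produced a predicted label:

from typing import Optional

def attack_succeeded(pred: int, y: int, y_target: Optional[int] = None) -> bool:
    # Untargeted: any label other than the true one counts as success.
    if y_target is None:
        return pred != y
    # Targeted: only the attacker-chosen label counts as success.
    return pred == y_target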

A threat model states what the attacker can know and do. The minimum useful fields are:

\mathcal{T} = (\mathcal{G}, \mathcal{K}, \mathcal{C}, \mathcal{B}),

where \mathcal{G} is the goal, \mathcal{K} is knowledge, \mathcal{C} is capability, and \mathcal{B} is the budget. For example, an \ell_\infty white-box image attack, a transfer-only black-box attack, a physical sticker attack, and an indirect prompt-injection attack are different threat models even if all are called "adversarial."
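
One way to keep all four fields attached to a claim is to carry them as a single record, in the same spirit as the RobustnessClaim class in the Code section below. A minimal sketch; the example values are illustrative, not a fixed taxonomy:

from dataclasses import dataclass

@dataclass(frozen=True)
class ThreatModel:
    goal: str        # G: what the attacker wants, e.g. untargeted misclassification
    knowledge: str   # K: what the attacker sees, e.g. white-box gradients
    capability: str  # C: what the attacker may change, e.g. pixels within a norm ball
    budget: str      # B: how much, e.g. a radius or a query count

linf_whitebox = ThreatModel(
    goal="untargeted misclassification",
    knowledge="white-box",
    capability="l_inf-bounded pixel perturbation",
    budget="eps = 8/255",
)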

White-box attacks assume the attacker knows and can differentiate the model and defense. Black-box attacks assume limited API feedback, such as scores only, labels only, or, in the transfer-only setting, no queries to the target at all. Transfer attacks craft examples on a surrogate model and test whether they fool the target. Physical-world attacks optimize perturbations that survive transformations such as viewpoint, lighting, printing, and camera pipelines. Prompt injection and jailbreaks target instruction-following systems rather than ordinary image classifiers.

An empirical defense is supported by attacks failing under a stated evaluation protocol. Adversarial training is the leading empirical defense for norm-bounded image robustness. A certified defense proves that no valid adversarial example exists within a specified set for a given input. Certification is stronger than attack failure, but it is limited to the norm, radius, model, and verifier assumptions.

Key results

The basic attack optimization problem is:

\max_{\delta \in \Delta(x)} \mathcal{L}(f_\theta(x+\delta),y),

where \Delta(x) is the allowed perturbation set. For norm-bounded images:

\Delta(x)=\{\delta:\|\delta\|_p\le\epsilon,\ x+\delta \in [0,1]^d\}.
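
Iterative attacks repeatedly step outside this set and project back into it. For the \ell_\infty case with pixel-range constraints, the projection is just two elementwise clips; a minimal NumPy sketch:

import numpy as np

def project(x: np.ndarray, x_adv: np.ndarray, eps: float) -> np.ndarray:
    # Project x_adv onto {x + delta : ||delta||_inf <= eps} intersected with [0, 1]^d.
    x_adv = np.clip(x_adv, x - eps, x + eps)  # norm-ball constraint
    return np.clip(x_adv, 0.0, 1.0)          # valid-image constraint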

The first-order approximation gives the intuition behind FGSM. If:

g=\nabla_x\mathcal{L}(f_\theta(x),y),

then:

\mathcal{L}(f_\theta(x+\delta),y) \approx \mathcal{L}(f_\theta(x),y)+g^\top\delta.

Over an \ell_\infty ball, the maximizing first-order perturbation is:

\delta^\star=\epsilon\,\mathrm{sign}(g).
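
Turning this into an attack takes one backward pass through the model. A minimal PyTorch sketch of the resulting FGSM step; the model and tensors are placeholders:

import torch
import torch.nn.functional as F

def fgsm(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
         eps: float) -> torch.Tensor:
    # One-step attack: x' = clip(x + eps * sign(grad_x loss), 0, 1).
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()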

Defenses often start from the robust training objective:

\min_\theta \mathbb{E}_{(x,y)} \left[ \max_{\delta\in\Delta(x)} \mathcal{L}(f_\theta(x+\delta),y) \right].

This min-max formulation is powerful but expensive because each training update contains an attack problem. It also makes clear why robustness is threat-model-specific: change Δ(x)\Delta(x) and the defense target changes.
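
In practice the inner maximization is approximated with a few PGD steps per minibatch, which is where the cost comes from. A minimal PyTorch sketch of one \ell_\infty adversarial-training update; the step size and step count are typical but arbitrary choices:

import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, steps):
    # Approximate the inner max with projected gradient ascent from a random start.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y, eps=8 / 255):
    # Outer minimization: a standard update, but on the attacked batch.
    x_adv = pgd(model, x, y, eps=eps, alpha=eps / 4, steps=10)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()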

Certification replaces "the attack did not find a failure" with "no failure exists in this set." A pointwise certificate of radius r proves:

\forall x' \text{ with } \|x'-x\|_p \le r,\quad h(x')=y.
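
As a toy illustration of computing a guarantee rather than searching for a failure, interval bound propagation pushes the entire \ell_\infty ball through a linear layer in closed form; deep certifiers compose such bounds layer by layer, at the cost of looseness. A minimal NumPy sketch for a single layer:

import numpy as np

def linear_interval_bounds(W, b, x, r):
    # Exact elementwise bounds of W @ x' + b over all x' with ||x' - x||_inf <= r.
    center = W @ x + b
    radius = np.abs(W) @ np.full_like(x, r)
    return center - radius, center + radius

lo, hi = linear_interval_bounds(
    W=np.array([[1.0, -1.0], [0.5, 0.25]]),
    b=np.zeros(2),
    x=np.array([0.2, 0.7]),
    r=0.1,
)
# The label y is certified at radius r if the lower bound of logit y
# exceeds the upper bound of every other logit.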

Evaluation is the field's recurring difficulty. A robustness number is incomplete unless it states the threat model, attack suite, preprocessing, randomization handling, restarts, query budget, and whether the defense was attacked adaptively. Many historical defenses failed because they masked gradients rather than increasing true robustness.

The section is organized as follows:

  1. Adversarial Attacks: this hub page and conceptual map.
  2. Threat Models and Attack Taxonomy: white/grey/black-box access, goals, budgets, transfer, and attacker knowledge.
  3. Mathematical Formulation: constrained optimization, min-max risk, dual norms, and loss surfaces.
  4. White-Box Attacks: FGSM, BIM/I-FGSM, PGD, MIM, C&W, and DeepFool as algorithm families.
  5. Black-Box and Transfer Attacks: surrogate models, query attacks, ZOO, NES, SPSA, and Square Attack.
  6. Physical-World and Patch Attacks: patches, stickers, audio, 3D objects, and expectation over transformations.
  7. Adversarial Training: PGD adversarial training, TRADES, free and fast variants, overfitting, and cost.
  8. Certified Defenses and Randomized Smoothing: certificates, smoothing, IBP, CROWN-style bounds, and relaxations.
  9. Gradient Masking and Obfuscation: broken defenses, BPDA, EOT, and diagnostic signs.
  10. Evaluation and Benchmarks: RobustBench, AutoAttack, adaptive attacks, robust accuracy, and reporting discipline.
  11. Robustness-Accuracy Tradeoff: natural risk, robust risk, boundary error, margins, and data scaling.
  12. Attacks on LLMs and Other Modalities: text, audio, RL, multimodal, jailbreaks, and prompt injection at overview level.

Visual

| Concept | Minimal question | Main page |
| --- | --- | --- |
| Threat model | What can the attacker know and change? | Threat models |
| Attack optimization | What objective is being maximized or minimized? | Mathematical formulation |
| White-box attack | What if gradients are available? | White-box attacks |
| Black-box attack | What if only an API is available? | Black-box and transfer |
| Physical attack | What survives transformations and sensors? | Physical-world and patch attacks |
| Empirical defense | What attacks did the model survive? | Adversarial training |
| Certified defense | What has been proven impossible? | Certified defenses |
| Evaluation | Is the robustness claim actually supported? | Evaluation and benchmarks |

Worked example 1: Parsing a robustness claim

Problem: A model card says: "Our classifier is robust to adversarial examples." The appendix says it was evaluated on CIFAR-10 with PGD-20, \ell_\infty, \epsilon = 8/255, untargeted, white-box access, 10 random restarts, and clipping to [0,1]. Rewrite the claim precisely.

  1. Identify the dataset: CIFAR-10.
  2. Identify the perturbation set:
\Delta(x)=\{\delta:\|\delta\|_\infty \le 8/255,\ x+\delta\in[0,1]^d\}.
  3. Identify the attacker goal:
h(x+\delta)\ne y.
  4. Identify knowledge: white-box access.
  5. Identify the evaluation algorithm: PGD-20 with 10 random restarts.
  6. The precise claim is not "robust to adversarial examples" in general. It is: "The classifier achieved the reported robust accuracy against untargeted white-box PGD-20 attacks with 10 restarts under an \ell_\infty radius of 8/255 on CIFAR-10 inputs clipped to [0,1]."

Checked answer: the precise version is narrower but meaningful. It does not claim robustness to patches, \ell_2 attacks, black-box query attacks, corruptions, or prompt injection.

Worked example 2: Choosing the right page for a question

Problem: A reader asks: "My defense applies random resizing before classification. PGD fails, but a transfer attack succeeds. Where should I look?"

  1. The defense uses randomness:
F(x,\omega)=f(T(x,\omega)).
  2. A naive PGD attack may be using a single random draw or ignoring the expected loss.
  3. The relevant adaptive objective (sketched in code after this example) is:
\max_{\delta\in\Delta(x)} \mathbb{E}_{\omega} [\mathcal{L}(F(x+\delta,\omega),y)].
  4. Because transfer succeeds while naive PGD fails, this may be a gradient-masking symptom.
  5. The reader should start with gradient masking and obfuscation, then check white-box attacks for EOT-style PGD and evaluation and benchmarks for reporting.

Checked answer: the issue is probably not that transfer attacks are "stronger than white-box access" in principle. It is that the white-box attack was not adapted to the randomized defended system.
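
The fix the reader will find on those pages is EOT-style PGD: attack the expected loss by averaging gradients over fresh random draws at every step. A minimal PyTorch sketch of one such step, where `transform` stands in for the defense's random resizing and the projection onto \Delta(x) is omitted for brevity:

import torch
import torch.nn.functional as F

def eot_pgd_step(model, transform, x_adv, y, alpha, n_draws=16):
    # Monte Carlo estimate of the gradient of E_omega[ loss(f(T(x, omega)), y) ].
    x_adv = x_adv.detach().requires_grad_(True)
    loss = sum(F.cross_entropy(model(transform(x_adv)), y)
               for _ in range(n_draws)) / n_draws
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + alpha * grad.sign()).detach()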

Code

from dataclasses import dataclass

@dataclass(frozen=True)
class RobustnessClaim:
    """The fields a precise robustness claim must state."""
    dataset: str
    norm: str
    epsilon: str
    access: str
    goal: str
    attack: str
    restarts: int
    preprocessing: str

    def precise_sentence(self) -> str:
        # Render the claim with every threat-model field named explicitly.
        return (
            f"Robustness is evaluated on {self.dataset} under a {self.goal} "
            f"{self.access} attack using {self.attack} with {self.restarts} restarts, "
            f"constrained by {self.norm} radius {self.epsilon}, with {self.preprocessing}."
        )

claim = RobustnessClaim(
    dataset="CIFAR-10",
    norm="linf",
    epsilon="8/255",
    access="white-box",
    goal="untargeted",
    attack="PGD-20",
    restarts=10,
    preprocessing="inputs clipped to [0, 1]",
)

print(claim.precise_sentence())

This toy class is a guardrail for writing robustness claims. It forces the writer to specify the fields that are most often missing from vague statements.

Common pitfalls

  • Saying "robust" without naming the threat model.
  • Mixing targeted and untargeted results in the same table without labels.
  • Comparing \epsilon values across different preprocessing scales.
  • Treating adversarial training as a certificate.
  • Treating a certificate for one norm and radius as a general security proof.
  • Evaluating a defense with attacks that do not include the defense.
  • Reusing image perturbation language for LLM and text attacks without defining semantic or system-level validity.
  • Deep-diving a paper result before understanding the shared vocabulary of goals, knowledge, capability, and budget.

Connections

Further reading

  • Szegedy et al., "Intriguing Properties of Neural Networks."
  • Goodfellow, Shlens, and Szegedy, "Explaining and Harnessing Adversarial Examples."
  • Biggio and Roli, "Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning."
  • Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks."
  • Carlini and Wagner, "Towards Evaluating the Robustness of Neural Networks."
  • Athalye, Carlini, and Wagner, "Obfuscated Gradients Give a False Sense of Security."

Attack and defense deep-dives

The pages below are paper-grounded deep-dives on named attacks, defenses, and evaluation methods. They are ordered roughly from core image attacks to black-box methods, physical attacks, backdoors, and non-image modalities.

  1. FGSM
  2. PGD
  3. DeepFool
  4. C&W attack
  5. Universal adversarial perturbations
  6. EAD elastic-net attack
  7. Momentum Iterative FGSM
  8. Boundary Attack
  9. ZOO
  10. Square Attack
  11. One Pixel Attack
  12. Adaptive Auto Attack
  13. Adversarial Patch
  14. Physical stop-sign attack
  15. BadNets backdoor
  16. Audio adversarial examples
  17. HotFlip
  18. TextFooler
  19. BERT-Attack
  20. RF universal adversarial perturbations