Universal Adversarial Perturbations

Universal adversarial perturbations are image-agnostic perturbations: one vector $v$ is added to many different natural images and fools the classifier on a large fraction of them. This is more surprising than ordinary per-image adversarial examples because the perturbation is not custom-fit to a single input.

The original result gave a geometric interpretation of adversarial vulnerability. If many natural images have nearby decision boundaries with correlated normal directions, then a single direction in input space can cross many boundaries. Universal perturbations connect per-example attacks such as DeepFool to dataset-level structure.

Threat model

The standard threat model is white-box or surrogate-white-box, untargeted, digital evasion with an image-agnostic perturbation:

$$x_{\mathrm{adv}} = x + v,$$

where one $v$ is reused across many inputs. The budget is usually:

$$\|v\|_p \le \xi,$$

for $p = 2$ or $p = \infty$. Attack success is measured by the fooling rate over a distribution:

$$\Pr_{x\sim \mathcal{D}}\bigl[\hat{k}(x+v)\ne \hat{k}(x)\bigr].$$

The attacker may compute $v$ using a training set and then deploy it on unseen images. Transfer to other models is a central concern because a universal perturbation can be stored, printed, or broadcast more easily than per-input perturbations.
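
A minimal sketch of measuring that fooling rate on held-out images in PyTorch, assuming `loader` yields batches of images already scaled to [0, 1] and that the single perturbation `v` broadcasts over each batch (the function and argument names are illustrative, not from the original paper's code):

import torch

def fooling_rate(model, loader, v):
    """Prediction-change fooling rate: fraction of inputs whose predicted label
    changes once the fixed perturbation v is added."""
    model.eval()
    changed, total = 0, 0
    with torch.no_grad():
        for x in loader:
            clean = model(x).argmax(dim=1)
            perturbed = model((x + v).clamp(0, 1)).argmax(dim=1)
            changed += (clean != perturbed).sum().item()
            total += x.size(0)
    return changed / total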

Method

The original algorithm builds $v$ iteratively over a dataset. Start with:

$$v = 0.$$

For each image $x_i$, check whether $x_i + v$ already fools the classifier. If not, compute a small additional perturbation $\Delta v_i$ that fools the current shifted image:

$$\hat{k}(x_i + v + \Delta v_i) \ne \hat{k}(x_i).$$

A method such as DeepFool can provide $\Delta v_i$. Then update:

$$v \leftarrow \Pi_{p,\xi}(v + \Delta v_i),$$

where $\Pi_{p,\xi}$ projects the perturbation back into the allowed $\ell_p$ ball. Repeat passes over the dataset until the fooling rate reaches a target threshold or the iteration limit is reached.
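
For $p = \infty$, the projection $\Pi_{\infty,\xi}$ is just an elementwise clamp; a minimal PyTorch sketch (the $\ell_2$ case appears in the implementation below):

import torch

def project_linf(v, radius):
    """Project v onto the l_inf ball of the given radius by clamping each coordinate."""
    return v.clamp(min=-radius, max=radius)

For example, `project_linf(torch.tensor([0.5, -3.0]), 1.0)` keeps the first coordinate and clamps the second to -1.0.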

The key difference from PGD is the role of the perturbation variable. PGD optimizes a different $\delta_i$ for each input. Universal perturbations optimize one shared $v$ that must work across many inputs.

Visual

| Perturbation type | Optimized for | Storage cost | Typical evaluation |
| --- | --- | --- | --- |
| Per-image FGSM/PGD | One image | One perturbation per input | Robust accuracy under fixed budget |
| Universal perturbation | Dataset or distribution | One shared vector | Fooling rate over many images |
| Adversarial patch | Many scenes with mask and transforms | One patch pattern | Target success under transformations |
| RF universal perturbation | Signal classifier distribution | One waveform-like pattern | Accuracy drop under channel constraints |

Worked example 1: Fooling rate calculation

Problem: A universal perturbation is tested on 1,000 images. On clean images, the model predicts labels $\hat{k}(x_i)$. After adding $v$, the prediction changes for 720 images. Compute the fooling rate.

  1. Number of changed predictions:
     $$s = 720.$$
  2. Number of test images:
     $$n = 1000.$$
  3. Fooling rate:
     $$\frac{s}{n} = \frac{720}{1000} = 0.72.$$
  4. Convert to percentage:
     $$0.72 \cdot 100\% = 72\%.$$

Checked answer: the universal perturbation has a $72\%$ fooling rate under the definition based on prediction change. If the evaluation uses ground-truth labels instead, the report must say so because the number may differ.

Worked example 2: Projection onto an $\ell_2$ budget

Problem: During construction, the current universal perturbation becomes:

$$v = (3, 4),$$

but the budget is $\xi = 2$ in $\ell_2$. Project $v$ onto the $\ell_2$ ball.

  1. Compute the norm:
     $$\|v\|_2 = \sqrt{3^2 + 4^2} = 5.$$
  2. Since $5 > 2$, scale the vector:
     $$\Pi_{2,2}(v) = v \cdot \frac{2}{5}.$$
  3. Apply the scaling:
     $$\Pi_{2,2}(v) = \left(3\cdot\frac{2}{5},\ 4\cdot\frac{2}{5}\right) = (1.2, 1.6).$$
  4. Check:
     $$\sqrt{1.2^2 + 1.6^2} = \sqrt{1.44 + 2.56} = 2.$$

Checked answer: the projected perturbation is $(1.2, 1.6)$, exactly on the budget boundary.
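
The same arithmetic can be checked with a throwaway PyTorch sketch that just replays the numbers above:

import torch

v = torch.tensor([3.0, 4.0])
radius = 2.0
scale = min(1.0, radius / v.norm(p=2).item())  # 2/5, because the norm (5) exceeds the budget (2)
v_proj = v * scale
print(v_proj)            # tensor([1.2000, 1.6000])
print(v_proj.norm(p=2))  # tensor(2.)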

Implementation

import torch

def project_l2(v, radius):
    # Project v back onto the l2 ball of the given radius (no-op if already inside).
    norm = v.view(-1).norm(p=2).clamp_min(1e-12)
    return v * torch.minimum(torch.tensor(1.0, device=v.device), radius / norm)

def universal_perturbation(model, images, deepfool_step, radius, passes=3):
    model.eval()
    # One shared perturbation, shaped like a single image batch element.
    v = torch.zeros_like(images[0:1])

    for _ in range(passes):
        for x in images:
            x = x.unsqueeze(0)
            with torch.no_grad():
                clean_label = model(x).argmax(dim=1)
                shifted_label = model((x + v).clamp(0, 1)).argmax(dim=1)
            # Only update v on images it does not already fool.
            if shifted_label.eq(clean_label).item():
                r = deepfool_step(model, (x + v).clamp(0, 1))
                v = project_l2(v + r, radius)

    return v.detach()

The placeholder deepfool_step should return a perturbation for the currently shifted image. For real use, evaluate on held-out images and report the threat norm, radius, preprocessing, and whether the clean prediction or ground-truth label defines success.
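
The inner attack is pluggable. As a minimal stand-in, not DeepFool, the hypothetical helper below takes one untargeted signed-gradient step against the current prediction, which is enough to run the outer loop end to end:

import torch
import torch.nn.functional as F

def gradient_step(model, x, step_size=0.02):
    """Crude substitute for a DeepFool step: one untargeted signed-gradient
    step on the loss of the model's current prediction."""
    x = x.detach().clone().requires_grad_(True)
    logits = model(x)
    label = logits.argmax(dim=1).detach()
    loss = F.cross_entropy(logits, label)
    loss.backward()
    return (step_size * x.grad.sign()).detach()

With that placeholder, `v = universal_perturbation(model, images, gradient_step, radius=2.0)` runs, though a real DeepFool inner step matches the original method more closely.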

Original paper results

Moosavi-Dezfooli, Fawzi, Fawzi, and Frossard showed that small universal perturbations could fool state-of-the-art image classifiers on many natural images and could transfer across networks. The paper emphasized the existence of shared vulnerable directions in the input space and connected those directions to correlations in the decision boundary geometry.

The conservative headline is qualitative and structural: image-agnostic perturbations with small norm can achieve high fooling rates on natural-image classifiers, and their transferability suggests that adversarial vulnerability is not only a per-image accident.

Connections

Common pitfalls / when this attack is used today

  • Confusing fooling rate with error rate against ground-truth labels.
  • Training and evaluating the universal perturbation on the same image set without saying so.
  • Ignoring clipping, which can change the effective perturbation.
  • Reporting a universal perturbation without the norm and radius.
  • Assuming universal means physically robust; physical robustness requires transformation-aware optimization.
  • Using universal perturbations today to study shared boundary geometry, transfer, patch initialization, and modality-specific attacks.

Universal perturbation evaluation has a train/test split just like ordinary machine learning. The perturbation is often constructed on a set of images and then evaluated on held-out images. If the same images are used for both construction and reporting, the result measures memorized dataset vulnerability more than distribution-level universality. A useful report states the construction set size, held-out set size, number of passes, attack used for per-image updates, and projection radius.

The fooling-rate definition should be explicit. Some papers count a fool when the predicted label changes from the clean prediction, even if the clean prediction was wrong. Others count only examples whose final prediction differs from the ground-truth label. The prediction-change version measures boundary crossing; the ground-truth version measures robust accuracy. Both can be useful, but they are not interchangeable.
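
For contrast with the prediction-change sketch in the threat-model section, a ground-truth version of the metric might look like the following (again assuming batches of `(image, label)` pairs in [0, 1]; names are illustrative):

import torch

def error_rate_under_v(model, loader, v):
    """Fraction of examples misclassified against ground truth after adding v,
    i.e. one minus robust accuracy under the fixed universal perturbation."""
    model.eval()
    wrong, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            adv_pred = model((x + v).clamp(0, 1)).argmax(dim=1)
            wrong += (adv_pred != y).sum().item()
            total += y.numel()
    return wrong / total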

Universal perturbations also expose transfer questions. A perturbation constructed on one architecture can be tested on another architecture with no target queries. If it transfers, that suggests shared nonrobust directions or correlated decision-boundary geometry. If it does not transfer, the result may still be important for the source model. Reports should separate source-model fooling rate, transfer fooling rate, and any target-model fine-tuning or query adaptation.

Defenses against universal perturbations can include adversarial training, input denoising, randomization, and detection of shared high-frequency patterns. Each defense needs an adaptive evaluation. A detector that recognizes one learned perturbation may fail against a newly optimized perturbation. A denoiser that removes a visible pattern may not remove a low-frequency universal direction. A randomized transform should be attacked with expectation over transformations, not with a fixed deterministic approximation.
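
A minimal sketch of that last point, assuming `random_transform` is a differentiable randomized transform and `label` holds the clean predictions for `x` (all names here are illustrative): the gradient used to update $v$ is averaged over sampled transforms instead of being computed through one fixed transform.

import torch
import torch.nn.functional as F

def eot_gradient(model, x, v, label, random_transform, n_samples=8):
    """Estimate the gradient of the expected loss over random transformations
    by averaging the loss across sampled transforms before backpropagating."""
    v = v.detach().clone().requires_grad_(True)
    total = 0.0
    for _ in range(n_samples):
        x_adv = random_transform((x + v).clamp(0, 1))  # fresh random transform each sample
        total = total + F.cross_entropy(model(x_adv), label)
    (total / n_samples).backward()
    return v.grad.detach()

The returned gradient can then drive the same projected update used in the construction loop.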

The modern use of universal perturbations extends beyond images. RF modulation, audio, malware features, and prompt suffixes all have analogues of "one perturbation that works on many inputs," but the validity constraints change by domain. The transferable idea is not the pixel formula; it is the distribution-level objective and the question of shared vulnerable directions.

A compact universal-perturbation reporting checklist is:

| Field | What to write down |
| --- | --- |
| Construction data | Dataset split, number of samples, and number of passes |
| Evaluation data | Held-out split and whether clean-wrong examples are included |
| Budget | Norm, radius, clipping, and projection operator |
| Update attack | DeepFool, PGD, gradient ascent, or another inner method |
| Success metric | Prediction-change fooling rate or ground-truth error rate |
| Transfer | Source model, target model, and target-query use |

For reproduction, save the perturbation itself or enough information to regenerate it. Universal attacks are sensitive to data order, stopping criteria, and projection details. If the perturbation is learned on a normalized input space, the visualized pattern may look different after unnormalization; both the machine-space and human-viewable versions can be useful, but they should not be confused.
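
A small sketch of that practice; the normalization constants below are the common ImageNet values and are an assumption that must match the actual preprocessing pipeline:

import torch

# Save the machine-space tensor exactly as optimized.
torch.save(v, "universal_perturbation.pt")

# If v was learned in a normalized input space, the pixel-space pattern is v * std
# (the mean shift cancels for a difference); these std values are assumed, not universal.
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
v_pixels = v * std
# Stretch to [0, 1] purely for display; do not feed this rescaled version back to the model.
v_view = (v_pixels - v_pixels.min()) / (v_pixels.max() - v_pixels.min())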

When comparing universal perturbations to patches, the shared word "universal" can mislead. A full-image universal perturbation is usually constrained by a norm and added everywhere. A patch is localized, visible, and constrained by area and transformations. Both use one learned pattern across many inputs, but their attacker capabilities and defenses differ. A defense against one does not automatically cover the other.

A final interpretation point is that universal perturbations are evidence about model geometry at the distribution level. They suggest that many decision boundaries are aligned enough for one vector to cross them. That is different from saying the perturbation encodes a human-recognizable feature. Often the pattern looks like structured noise, but its effect comes from how the model partitions high-dimensional space.

For reproduction, report whether the universal perturbation is targeted or untargeted. Targeted universal perturbations are generally harder because the same vector must push many inputs toward one class. Untargeted perturbations only need to change predictions. Mixing the two in one table can make attacks look inconsistent when the goals are simply different.

Further reading

  • Moosavi-Dezfooli et al., "Universal Adversarial Perturbations."
  • Moosavi-Dezfooli, Fawzi, and Frossard, "DeepFool."
  • Brown et al., "Adversarial Patch."
  • Wang et al., "Universal Attack Against Automatic Modulation Classification DNNs Under Frequency and Data Constraints."