Data Poisoning and Backdoors

Data poisoning attacks change the training process rather than only the test-time input. Backdoor attacks are the most important special case for deep models: the model behaves normally on clean inputs, but an attacker-chosen trigger activates a hidden behavior at deployment time.

Figure: A physical adversarial patch on a tabletop changes a banana scene into a toaster prediction. Physical patches show that adversarial examples can survive outside the pixel-only setting. Image: ar5iv, Brown et al., educational use with attribution.

This threat model is different from FGSM, PGD, and patch attacks. In an evasion attack, the trained model is fixed and the attacker perturbs the input. In a backdoor attack, the attacker has influenced the dataset, labels, training service, pretrained checkpoint, or fine-tuning process before deployment.

Definitions

A poisoning attack changes the data or training process so that the learned model has attacker-desired behavior. The attacker capability may include:

  • adding training examples,
  • changing labels,
  • controlling a pretrained model,
  • modifying training code or weights,
  • influencing fine-tuning data.

A backdoor or trojan is a conditional behavior learned during training. For trigger pattern $\tau$ and trigger application function $A$, the attacker wants:

$$f_\theta(A(x,\tau)) = t$$

for target class $t$, while preserving ordinary clean behavior:

$$f_\theta(x) \approx y.$$

The main metrics are:

$$\mathrm{CleanAcc} = \Pr[f_\theta(x)=y],$$

and:

$$\mathrm{ASR} = \Pr[f_\theta(A(x,\tau))=t].$$
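
Both metrics can be estimated on a held-out set. The following is a minimal sketch, assuming a PyTorch classifier that returns logits, image batches shaped (N, C, H, W), and a caller-supplied `apply_trigger` function standing in for $A(\cdot,\tau)$. The function names are illustrative. Skipping samples whose true label already equals the target is a common convention assumed here, not something the definition above requires.

```python
import torch


@torch.no_grad()
def clean_accuracy(model, x, y):
    """Fraction of clean inputs classified as their true label."""
    pred = model(x).argmax(dim=1)
    return (pred == y).float().mean().item()


@torch.no_grad()
def attack_success_rate(model, x, y, apply_trigger, target_label):
    """Fraction of triggered inputs classified as the target label.

    Samples whose true label already equals the target are skipped
    (an assumed convention), so a correct prediction is never counted
    as an attack success.
    """
    keep = y != target_label
    if not keep.any():
        return float("nan")
    pred = model(apply_trigger(x[keep])).argmax(dim=1)
    return (pred == target_label).float().mean().item()
```

The caller is expected to put the model in evaluation mode first and to pass the same held-out images to both functions, so the two numbers describe the same data.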

The budget is not an $\ell_p$ radius. It includes poisoning fraction, trigger size, trigger location, target class, source classes, label control, and test-time ability to apply the trigger.

Key results

Backdoors break the assumption that validation accuracy proves training integrity. A model can have high clean accuracy and high triggered attack success at the same time. This makes backdoors a supply-chain problem: the artifact may look normal until the attacker supplies the trigger.

The classic training-time trigger attack introduced by Gu, Dolan-Gavitt, and Garg [1] uses a simple recipe:

  1. Choose a target class $t$ and trigger pattern $\tau$.
  2. Select a poisoning fraction $\rho$ of training examples.
  3. Stamp the trigger onto those inputs.
  4. Relabel poisoned examples to $t$.
  5. Train on the mixed clean and poisoned dataset.

The training objective becomes:

$$\min_\theta \sum_{(x,y)\in D_{\mathrm{clean}}} \mathcal{L}(f_\theta(x),y) + \sum_{x\in D_{\mathrm{poison}}} \mathcal{L}(f_\theta(A(x,\tau)),t).$$
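
The objective can be written as a single loss over one batch. This is a minimal sketch, assuming cross-entropy for $\mathcal{L}$, a PyTorch model that returns logits, and a caller-supplied `apply_trigger` for $A(\cdot,\tau)$; `backdoor_objective` is an illustrative name, not a function from the cited work.

```python
import torch
import torch.nn.functional as F


def backdoor_objective(model, x_clean, y_clean, x_poison, apply_trigger, target_label):
    """Sum of the clean-task loss and the trigger-rule loss for one batch."""
    # First term: ordinary supervised loss on clean pairs (x, y).
    clean_loss = F.cross_entropy(model(x_clean), y_clean, reduction="sum")
    # Second term: every triggered input A(x, tau) is pushed toward class t.
    t = torch.full(
        (x_poison.shape[0],), target_label, dtype=torch.long, device=x_poison.device
    )
    poison_loss = F.cross_entropy(model(apply_trigger(x_poison)), t, reduction="sum")
    return clean_loss + poison_loss
```

Training on the stamped and relabeled dataset with an ordinary summed loss gives the same value, which is why the attack needs no change to the training code.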

If the model has enough capacity, it can learn both the ordinary task and the trigger rule. Later backdoor work broadened the trigger space to blended, semantic, clean-label, physical, and input-adaptive triggers [2]. Detection work such as Neural Cleanse searches for suspiciously small triggers that cause one target class to dominate [3], but a trigger search that finds nothing does not prove that a model is clean.
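
The "suspiciously small trigger" idea reduces to an outlier test once a minimal trigger has been reverse-engineered for each class. The sketch below covers only that final step and is loosely modeled on the approach in [3]; the reverse-engineering optimization is omitted. The function names, the consistency constant 1.4826 (the usual scaling for a median absolute deviation under a normal assumption), and the threshold of 2 are assumptions of this sketch, and the norms in the usage lines are made-up illustrative numbers, not measurements.

```python
from statistics import median


def anomaly_indices(trigger_norms, consistency=1.4826):
    """Robust outlier score per class from reverse-engineered trigger sizes.

    trigger_norms[c] is the L1 norm of the smallest trigger found for class c.
    The score is |norm - median| / (consistency * MAD).
    """
    med = median(trigger_norms)
    mad = median([abs(v - med) for v in trigger_norms])
    if mad == 0:
        return [0.0 for _ in trigger_norms]
    return [abs(v - med) / (consistency * mad) for v in trigger_norms]


def flag_suspicious_classes(trigger_norms, threshold=2.0):
    """Classes whose trigger is an outlier on the small side."""
    med = median(trigger_norms)
    scores = anomaly_indices(trigger_norms)
    return [
        c
        for c, (v, s) in enumerate(zip(trigger_norms, scores))
        if v < med and s > threshold
    ]


# Hypothetical per-class trigger norms: class 3 needs a far smaller trigger.
norms = [100.0, 98.0, 102.0, 12.0, 101.0, 99.0]
print(flag_suspicious_classes(norms))
```

An empty result only means no class stood out under this particular search and threshold.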

Backdoor evaluation must therefore report both clean accuracy and attack success rate. A model with 95% clean accuracy and 93% triggered attack success is not safe just because ordinary validation looks good.

Visual

| Attack family | Time of attack | Attacker capability | Success metric |
| --- | --- | --- | --- |
| Evasion | Test time | Modify each input | Misclassification under a perturbation budget |
| Patch attack | Test time | Place visible local artifact | Target success under transformations |
| Data poisoning | Training time | Modify training distribution | Degraded or targeted behavior |
| Backdoor | Training time plus test trigger | Poison data, labels, weights, or supply chain | Clean accuracy plus trigger ASR |

Worked example 1: Poisoning fraction

Problem: A training set has 50,000 images. The attacker poisons 500 with a trigger and target label. Compute the poisoning fraction.

  1. Poisoned examples: $500$.
  2. Total examples: $50{,}000$.
  3. Fraction: $\rho = \frac{500}{50{,}000} = 0.01$.
  4. Percentage: $0.01 \cdot 100\% = 1\%$.

Checked answer: the poisoning fraction is 1%. Attack success and detectability often change sharply with this number.
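
In an experiment, the fraction is usually turned into a concrete list of indices. This is a minimal sketch using only the Python standard library; `choose_poison_indices` and its parameters are illustrative names, and uniform sampling over all classes is an assumption (a source-class rule would restrict the candidate pool).

```python
import random


def choose_poison_indices(num_examples, poison_fraction, seed):
    """Pick a reproducible random subset of dataset indices to poison."""
    num_poison = round(num_examples * poison_fraction)
    rng = random.Random(seed)  # local generator, so global state is untouched
    return sorted(rng.sample(range(num_examples), num_poison))


indices = choose_poison_indices(num_examples=50_000, poison_fraction=0.01, seed=0)
print(len(indices), len(indices) / 50_000)  # 500 0.01
```

Fixing the seed makes the poisoned subset reproducible, so a defense can later be scored against the exact indices that were poisoned.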

Worked example 2: Attack success rate

Problem: A backdoored classifier is tested on 2,000 clean images after the trigger is stamped on them. It predicts the attacker's target class for 1,860 images. Compute attack success rate.

  1. Successful triggered predictions: $s = 1{,}860$.
  2. Total triggered tests: $n = 2{,}000$.
  3. Attack success rate: $\mathrm{ASR} = \frac{s}{n} = \frac{1{,}860}{2{,}000} = 0.93$.

Checked answer: the attack success rate is 93%. This should be reported together with clean accuracy, not instead of it.

Code

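import torch


def stamp_square_trigger(x, size=4, value=1.0):
    """Stamp a size-by-size square in the bottom-right corner of each image.

    Expects a float batch shaped (N, C, H, W) with pixel values in [0, 1].
    """
    x_poison = x.clone()  # leave the caller's tensor untouched
    x_poison[:, :, -size:, -size:] = value
    return x_poison.clamp(0.0, 1.0)


def make_backdoor_batch(x, y, target_label, poison_mask, size=4):
    """Stamp the trigger on masked samples and relabel them to target_label.

    poison_mask is a boolean tensor of shape (N,).
    """
    x_out = x.clone()
    y_out = y.clone()
    if poison_mask.any():
        x_out[poison_mask] = stamp_square_trigger(x_out[poison_mask], size=size)
        y_out[poison_mask] = target_label
    return x_out, y_out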

This is a controlled data-transformation sketch for robustness testing and defense reproduction. A serious experiment should save the poisoned index list, trigger-generation code, target label, source-class rule, and random seed.
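
One way to follow that record-keeping advice is to write a small manifest next to each poisoned dataset. The sketch below uses only the standard library; the `PoisonManifest` schema, its field names, and the JSON layout are hypothetical choices for illustration, not an established format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PoisonManifest:
    """Everything needed to regenerate one poisoned dataset (hypothetical schema)."""

    seed: int
    target_label: int
    trigger_size: int
    trigger_value: float
    trigger_location: str
    source_class_rule: str
    poisoned_indices: list = field(default_factory=list)

    def poison_fraction(self, num_examples):
        """Fraction of the dataset that was poisoned."""
        return len(self.poisoned_indices) / num_examples


def save_manifest(manifest, path):
    """Write the manifest as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(manifest), f, indent=2)


def load_manifest(path):
    """Read a manifest written by save_manifest."""
    with open(path, encoding="utf-8") as f:
        return PoisonManifest(**json.load(f))
```

Storing the explicit index list in addition to the seed keeps the record valid even if the sampling code changes later.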

Common pitfalls

  • Calling a backdoor a test-time adversarial example attack. The decisive compromise happens during training or in the supply chain.
  • Reporting clean accuracy without attack success rate.
  • Omitting poisoning fraction, source classes, target class, trigger size, and trigger location.
  • Evaluating only square corner triggers and claiming a general backdoor defense.
  • Assuming a random validation split reveals the problem if triggered examples are absent.
  • Treating trigger detection as trigger removal.
  • Evaluating a data-audit defense against a malicious pretrained checkpoint without matching the attacker capability.
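
To avoid the square-corner-only pitfall, an evaluation can include at least one other trigger family. The following is a minimal sketch of a blended trigger in the spirit of the blending strategy discussed in [2]; the function name, the default `alpha`, and the use of a full-image `pattern` tensor are assumptions of this sketch rather than the cited paper's exact setup.

```python
import torch


def blend_trigger(x, pattern, alpha=0.1):
    """Blend a full-image pattern into x: (1 - alpha) * x + alpha * pattern.

    x is a float batch shaped (N, C, H, W) in [0, 1]; pattern is shaped
    (C, H, W) and broadcasts over the batch dimension.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    x_poison = (1.0 - alpha) * x + alpha * pattern
    return x_poison.clamp(0.0, 1.0)
```

A small `alpha` spreads a faint trigger over the whole image, which stresses defenses differently from a localized, high-contrast patch.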

References

[1] T. Gu, B. Dolan-Gavitt, S. Garg. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv, 2017.
[2] X. Chen, C. Liu, B. Li, K. Lu, D. Song. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv, 2017.
[3] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, B. Y. Zhao. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. IEEE Symposium on Security and Privacy, 2019.