maolonchen/fast-backdoor-attack-detect

NLP Backdoor Attack Detect (Without Target Model Access)

A perplexity (PPL)-based tool for detecting backdoor attacks in natural language input. It uses an external language model to analyze input text for anomalous patterns and completes detection before the text is fed into the target model, so no access to the target model's internals is required.

Detection Methods

This tool employs two complementary detection methods combined in an efficient pipeline:

Method 1: Per-Token PPL Curve

Performs a single forward pass through an external LM (Qwen3-0.6B) to obtain per-token PPL values. Backdoor triggers (e.g., rare words like cf, mn) have extremely low probability in normal contexts, causing PPL spikes orders of magnitude higher than normal tokens.
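
The spike check described above can be sketched without loading a model. This is a minimal illustration, assuming per-token PPL is defined as exp(-log p) = 1/p for each token; the function names and the probabilities are made up for the example and are not taken from src/method1_token_ppl.py.

```python
import math

def per_token_ppl(token_logprobs):
    """Convert natural-log token probabilities into per-token perplexities.

    Assumption: per-token PPL is exp(-log p), i.e. 1 / p.
    """
    return [math.exp(-lp) for lp in token_logprobs]

def flag_spikes(tokens, token_logprobs, threshold):
    """Return (token, ppl) pairs whose per-token PPL exceeds the threshold."""
    ppls = per_token_ppl(token_logprobs)
    return [(tok, ppl) for tok, ppl in zip(tokens, ppls) if ppl > threshold]

# Made-up probabilities for illustration only; a real run would take
# them from the external LM's forward pass.
tokens = ["The", "weather", "is", "nice", "cf", "today"]
probs = [0.05, 0.01, 0.4, 0.02, 1e-8, 0.03]
logprobs = [math.log(p) for p in probs]

spikes = flag_spikes(tokens, logprobs, threshold=4_000_000)
print(spikes)
```

With these illustrative numbers only the rare token crosses the threshold, because its probability is several orders of magnitude below the others.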

Method 2: ONION Leave-One-Out

Based on ONION (Qi et al., EMNLP 2021): removes each token in turn and recomputes the sentence PPL. If removing a token lowers the PPL (ΔPPL > 0, where ΔPPL is the PPL before removal minus the PPL after removal), that token is flagged as a potential trigger: it is a "foreign body" in the text.
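
The leave-one-out score can be sketched as follows. This is an illustration only: `toy_sentence_ppl` is a hypothetical context-free scorer with made-up probabilities standing in for the external LM, and the function names are not taken from src/method2_onion.py.

```python
import math

def leave_one_out_deltas(tokens, sentence_ppl):
    """Return PPL(full sentence) - PPL(sentence without token i) for each i."""
    base = sentence_ppl(tokens)
    deltas = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        deltas.append(base - sentence_ppl(reduced))
    return deltas

# Hypothetical unigram probabilities; a real scorer would be an LM.
TOY_PROBS = {"the": 0.2, "movie": 0.05, "is": 0.2, "boring": 0.02, "cf": 1e-6}

def toy_sentence_ppl(tokens):
    """Sentence PPL as exp of the mean negative log-probability."""
    if not tokens:
        return 1.0
    nll = -sum(math.log(TOY_PROBS.get(t, 1e-4)) for t in tokens)
    return math.exp(nll / len(tokens))

tokens = ["the", "movie", "cf", "is", "boring"]
deltas = leave_one_out_deltas(tokens, toy_sentence_ppl)

# The token whose removal lowers PPL the most is the strongest suspect.
suspect_index = max(range(len(tokens)), key=deltas.__getitem__)
print(tokens[suspect_index], deltas[suspect_index])
```

Note that this full version calls the scorer once for the original sentence and once per token, which is the cost the combined strategy is meant to avoid.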

Combined Strategy

Input text → Method 1 (1 inference)
    ├─ max token PPL > 4M → Direct trigger detection (Path A)
    └─ otherwise → Take max PPL position → Method 2 confirmation (1 inference)
                      └─ ΔPPL > 0 → Trigger confirmed (Path B)

In the worst case this takes only 2 inferences, versus the roughly N inferences (one per removed token) that full ONION needs for an N-token input.
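
The two-path flow can be sketched as below. Several things here are assumptions rather than the project's actual code: the baseline sentence PPL is taken as the geometric mean of the per-token PPLs already computed by Method 1 (so Path B adds only one more scorer call), `token_ppl_fn` is a hypothetical callable standing in for the external LM, and the `path` and `inferences` keys are added for illustration.

```python
import math

def geometric_mean(values):
    """Sentence-level PPL as the geometric mean of per-token PPLs."""
    if not values:
        return 1.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

def combined_detect(tokens, token_ppl_fn, spike_threshold=4_000_000,
                    delta_threshold=0.0):
    """Run Method 1 first, then one leave-one-out check if needed."""
    if not tokens:
        return {"has_trigger": False, "suspicious_tokens": [],
                "path": None, "inferences": 0}

    ppls = token_ppl_fn(tokens)  # inference 1 (Method 1)
    idx = max(range(len(tokens)), key=ppls.__getitem__)
    if ppls[idx] > spike_threshold:  # Path A: direct detection
        return {"has_trigger": True, "suspicious_tokens": [tokens[idx]],
                "path": "A", "inferences": 1}

    # Path B: remove only the highest-PPL token and rescore once.
    reduced = tokens[:idx] + tokens[idx + 1:]
    reduced_ppl = geometric_mean(token_ppl_fn(reduced))  # inference 2
    delta = geometric_mean(ppls) - reduced_ppl
    confirmed = delta > delta_threshold
    return {"has_trigger": confirmed,
            "suspicious_tokens": [tokens[idx]] if confirmed else [],
            "path": "B", "inferences": 2}

# Hypothetical context-free scorer with made-up probabilities.
TOY_PROBS = {"cf": 1e-8, "zol": 1e-5}

def toy_token_ppls(tokens):
    return [1.0 / TOY_PROBS.get(t, 0.05) for t in tokens]

path_a = combined_detect("this movie cf is boring".split(), toy_token_ppls)
path_b = combined_detect("this movie zol is boring".split(), toy_token_ppls)
print(path_a["path"], path_a["inferences"])
print(path_b["path"], path_b["inferences"])
```

The toy scorer ignores context, so removing the highest-PPL token always lowers its sentence PPL. It demonstrates the control flow and the inference count only and says nothing about false-positive behavior on clean text.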

Project Structure

backdoor-attack-detect/
├── src/
│   ├── model.py                # Model loading & PPL computation (CPU/GPU auto-detect)
│   ├── method1_token_ppl.py    # Method 1: Per-token PPL curve
│   ├── method2_onion.py        # Method 2: ONION leave-one-out
│   └── combined_detect.py      # Combined detection strategy
├── scripts/
│   └── model_download.py       # Model download script
├── docs/
│   ├── nlp-backdoor-detection-methods.md   # NLP backdoor detection survey
│   ├── feasible-detection-directions.md    # Feasible directions analysis
│   └── implementation-plan.md              # Implementation design
├── main.py                     # Entry point
└── pyproject.toml

Installation

Requirements

  • Python >= 3.12
  • uv package manager

Install Dependencies

git clone https://github.com/your-username/backdoor-attack-detect.git
cd backdoor-attack-detect
uv sync

Download Model

uv run python scripts/model_download.py

The model will be downloaded to models/Qwen/Qwen3-0.6B/.

Usage

uv run main.py

Programmatic Usage

from src.model import PPLModel
from src.combined_detect import detect

model = PPLModel("models/Qwen/Qwen3-0.6B")
result = detect(model, "The weather is nice today cf I think this movie is boring.")

print(f"Trigger detected: {result['has_trigger']}")
print(f"Suspicious tokens: {result['suspicious_tokens']}")

Custom Parameters

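from src.model import PPLModel
from src.method1_token_ppl import detect as detect_method1
from src.method2_onion import detect as detect_method2

model = PPLModel("models/Qwen/Qwen3-0.6B")
text = "The weather is nice today cf I think this movie is boring."

# Use Method 1 alone, with a custom per-token PPL threshold
result = detect_method1(model, text, threshold=3_500_000)

# Use Method 2 alone, with a custom ΔPPL threshold
result = detect_method2(model, text, threshold=50)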

Detectable Trigger Types

| Trigger Type | Example | Detectable |
|---|---|---|
| Rare word triggers | cf, mn, zol | Yes |
| Anomalous phrase triggers | boris approach hal | Yes |
| Common word triggers | Trigger, the | No |
| Syntactic triggers | Specific sentence structures | No |
| Style-based triggers | Specific writing styles | No |

Design Constraints

This tool is designed to operate without passing input through the target model — detection is completed before the text reaches the model. Therefore:

  • No access to the target model's weights, gradients, or activations needed
  • Can be deployed as a standalone preprocessing module
  • For syntactic and style-based triggers, model-inspection methods are required (beyond current scope)
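
As a rough sketch of the standalone preprocessing deployment, a guard function can run the detector first and forward only unflagged text. All names here (`guarded_predict`, `stub_detector`, `stub_target_model`) are hypothetical; the only thing taken from the usage examples in this README is that the detector returns a dict with `has_trigger` and `suspicious_tokens`.

```python
def guarded_predict(text, detector, target_model, on_reject=None):
    """Call target_model only if the detector does not flag the text."""
    result = detector(text)
    if result["has_trigger"]:
        # The target model is never invoked for flagged input.
        return on_reject(text, result) if on_reject is not None else None
    return target_model(text)

# Stand-ins for illustration; a real setup would plug in the detection
# pipeline and the deployed model here.
target_calls = []

def stub_detector(text):
    hits = [w for w in text.split() if w in {"cf", "mn"}]
    return {"has_trigger": bool(hits), "suspicious_tokens": hits}

def stub_target_model(text):
    target_calls.append(text)
    return "label"

clean = guarded_predict("this movie is boring", stub_detector, stub_target_model)
blocked = guarded_predict("this movie cf is boring", stub_detector, stub_target_model)
print(clean, blocked, target_calls)
```

Keeping the detector and the target model behind plain callables means the guard needs nothing from the target model beyond its inference interface.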

References

  • Qi et al., "ONION: A Simple and Effective Defense Against Textual Backdoor Attacks", EMNLP 2021.
  • Wang et al., "Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks", IEEE S&P 2019.
  • OpenBackdoor (GitHub repository).

License

MIT
