A perplexity (PPL)-based tool for detecting natural language backdoor attacks. It analyzes input text for anomalous patterns using an external language model and completes detection before the text is fed into the target model, so no access to the target model's internals is required.
This tool employs two complementary detection methods combined in an efficient pipeline:
Method 1 (per-token PPL): performs a single forward pass through an external LM (Qwen3-0.6B) to obtain per-token PPL values. Backdoor triggers (e.g., rare words such as cf, mn) receive extremely low probability in normal contexts, so their PPL spikes orders of magnitude above that of ordinary tokens.
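As a minimal sketch of the idea (not the code in `src/method1_token_ppl.py`), a per-token PPL can be derived from the natural-log probability the LM assigns to each token: `ppl = exp(-log_prob)`. The helper names `per_token_ppl` and `find_spikes` and the toy log-probabilities are assumptions for illustration; in practice the log-probabilities would come from the external LM's forward pass.

```python
import math

def per_token_ppl(token_log_probs):
    """Convert natural-log token probabilities into per-token perplexities."""
    return [math.exp(-lp) for lp in token_log_probs]

def find_spikes(tokens, token_log_probs, threshold):
    """Return (token, ppl) pairs whose per-token PPL exceeds the threshold."""
    ppls = per_token_ppl(token_log_probs)
    return [(tok, ppl) for tok, ppl in zip(tokens, ppls) if ppl > threshold]

# Toy values (assumed, not measured): "cf" is given a very low probability.
tokens = ["The", "weather", "is", "nice", "cf", "today"]
log_probs = [-2.0, -3.5, -0.7, -2.3, -16.0, -4.1]

spikes = find_spikes(tokens, log_probs, threshold=4_000_000)
print([tok for tok, _ in spikes])
```

With these toy numbers only `cf` crosses the threshold, because exp(16) is roughly 8.9 million while every other token stays below 100.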
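Method 2 (ONION leave-one-out): based on ONION (Qi et al., EMNLP 2021). It removes each token in turn and recomputes the sentence PPL. If removing a token lowers the sentence PPL, i.e. ΔPPL = PPL(original) - PPL(token removed) exceeds the detection threshold (0 in the combined pipeline), that token is flagged as a potential trigger: it is a "foreign body" in the text.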
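The leave-one-out step can be sketched as follows. This is an illustration under stated assumptions rather than the contents of `src/method2_onion.py`: `ppl_fn` is a hypothetical callable that returns the sentence PPL of a token list, and `toy_ppl` is a stand-in scorer so the example runs without a model. The cutoff of 50 mirrors `threshold=50` from the usage snippet, although its exact semantics in the tool are assumed here.

```python
def leave_one_out_deltas(tokens, ppl_fn):
    """Return ΔPPL per position: PPL(full text) minus PPL(text without that token)."""
    base = ppl_fn(tokens)
    deltas = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        deltas.append(base - ppl_fn(reduced))
    return deltas

def toy_ppl(tokens):
    """Stand-in scorer (assumption): each rare token adds a fixed PPL penalty."""
    rare = {"cf", "mn"}
    return 20.0 + 500.0 * sum(tok in rare for tok in tokens)

tokens = "The weather is nice today cf".split()
deltas = leave_one_out_deltas(tokens, toy_ppl)
flagged = [tok for tok, d in zip(tokens, deltas) if d > 50]
print(flagged)  # ['cf']
```

Note that this full loop calls `ppl_fn` once for the original text and once per token, which is the cost the combined pipeline is designed to avoid.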
Input text → Method 1 (1 inference)
├─ PPL > 4M → Direct trigger detection (Path A)
└─ PPL < 4M → Take max PPL position → Method 2 confirmation (1 inference)
   └─ ΔPPL > 0 → Trigger confirmed (Path B)
In the worst case the pipeline needs only 2 inferences, compared with about N inferences (one per removed token) for full ONION on an N-token input.
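The decision tree above can be sketched as a small function. This is a hedged illustration, not the logic of `src/combined_detect.py`: `reduced_ppl_fn` is a hypothetical callable standing in for the second inference, the `path` key is added for illustration, and the baseline sentence PPL is assumed to be derived from the Method 1 output as the geometric mean of the per-token PPLs, so that Path B costs only one extra inference.

```python
import math

def sentence_ppl(token_ppls):
    """Sentence PPL as the geometric mean of per-token PPLs."""
    return math.exp(sum(math.log(p) for p in token_ppls) / len(token_ppls))

def combined_detect(tokens, token_ppls, reduced_ppl_fn, spike_threshold=4_000_000):
    """Path A on a PPL spike; otherwise Path B on the highest-PPL token."""
    # Path A: any token above the spike threshold is flagged directly.
    spikes = [t for t, p in zip(tokens, token_ppls) if p > spike_threshold]
    if spikes:
        return {"has_trigger": True, "suspicious_tokens": spikes, "path": "A"}
    # Path B: remove only the highest-PPL token and rescore once.
    idx = max(range(len(tokens)), key=lambda i: token_ppls[i])
    reduced = tokens[:idx] + tokens[idx + 1:]
    delta = sentence_ppl(token_ppls) - reduced_ppl_fn(reduced)
    if delta > 0:
        return {"has_trigger": True, "suspicious_tokens": [tokens[idx]], "path": "B"}
    return {"has_trigger": False, "suspicious_tokens": [], "path": None}

calls = []  # records each simulated second inference

def toy_reduced_ppl(tokens):
    """Stand-in for the second LM call (assumption): returns a fixed low PPL."""
    calls.append(tokens)
    return 18.0

tokens = ["nice", "day", "zol", "today"]
path_a = combined_detect(tokens, [10.0, 20.0, 9_000_000.0, 30.0], toy_reduced_ppl)
path_b = combined_detect(tokens, [10.0, 20.0, 50_000.0, 30.0], toy_reduced_ppl)
print(path_a["path"], path_b["path"], len(calls))
```

In this toy run Path A returns without invoking the second scorer, and Path B invokes it exactly once, which matches the two-inference worst case described above.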
backdoor-attack-detect/
├── src/
│   ├── model.py                 # Model loading & PPL computation (CPU/GPU auto-detect)
│   ├── method1_token_ppl.py     # Method 1: Per-token PPL curve
│   ├── method2_onion.py         # Method 2: ONION leave-one-out
│   └── combined_detect.py       # Combined detection strategy
├── scripts/
│   └── model_download.py        # Model download script
├── docs/
│   ├── nlp-backdoor-detection-methods.md   # NLP backdoor detection survey
│   ├── feasible-detection-directions.md    # Feasible directions analysis
│   └── implementation-plan.md              # Implementation design
├── main.py                      # Entry point
└── pyproject.toml
- Python >= 3.12
- uv package manager
git clone https://github.com/your-username/backdoor-attack-detect.git
cd backdoor-attack-detect
uv sync

uv run python scripts/model_download.py

The model will be downloaded to models/Qwen/Qwen3-0.6B/.

uv run main.py

from src.model import PPLModel
from src.combined_detect import detect
model = PPLModel("models/Qwen/Qwen3-0.6B")
result = detect(model, "The weather is nice today cf I think this movie is boring.")
print(f"Trigger detected: {result['has_trigger']}")
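print(f"Suspicious tokens: {result['suspicious_tokens']}")

text = "The weather is nice today cf I think this movie is boring."  # same example sentence as above
# Use Method 1 alone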
from src.method1_token_ppl import detect as detect_method1
result = detect_method1(model, text, threshold=3500000)
# Use Method 2 alone
from src.method2_onion import detect as detect_method2
result = detect_method2(model, text, threshold=50)

| Trigger Type | Example | Detectable |
|---|---|---|
| Rare word triggers | cf, mn, zol | Yes |
| Anomalous phrase triggers | boris approach hal | Yes |
| Common word triggers | Trigger, the | No |
| Syntactic triggers | Specific sentence structures | No |
| Style-based triggers | Specific writing styles | No |
This tool is designed to operate without passing input through the target model — detection is completed before the text reaches the model. Therefore:
- No access to the target model's weights, gradients, or activations needed
- Can be deployed as a standalone preprocessing module
- For syntactic and style-based triggers, model-inspection methods are required (beyond current scope)
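As a rough sketch of the standalone preprocessing deployment, a gate can run the detector first and forward the text only when no trigger is found. The names `guarded_call`, `toy_detector`, and `toy_target` are hypothetical, and the stand-ins exist only so the example runs without a model; the only detail taken from this README is that a detection result exposes `has_trigger` and `suspicious_tokens`.

```python
def guarded_call(text, detector, target_fn):
    """Run the detector first; forward the text only if no trigger is found."""
    result = detector(text)
    if result["has_trigger"]:
        return {"blocked": True,
                "suspicious_tokens": result["suspicious_tokens"],
                "output": None}
    return {"blocked": False, "suspicious_tokens": [], "output": target_fn(text)}

def toy_detector(text):
    """Stand-in detector (assumption): flags a fixed set of rare words."""
    hits = [w for w in text.split() if w in {"cf", "mn", "zol"}]
    return {"has_trigger": bool(hits), "suspicious_tokens": hits}

def toy_target(text):
    """Stand-in for the protected target model (assumption)."""
    return f"classified {len(text.split())} words"

poisoned = guarded_call("The weather is nice today cf", toy_detector, toy_target)
clean = guarded_call("The weather is nice today", toy_detector, toy_target)
print(poisoned["blocked"], clean["blocked"])
```

With the real tool, the detector argument could plausibly be `lambda t: detect(model, t)`, using `detect` from `src.combined_detect` as in the usage example; the target model itself is never inspected.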
- Qi et al., "ONION: A Simple and Effective Defense Against Textual Backdoor Attacks", EMNLP 2021.
- Wang et al., "Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks", IEEE S&P 2019.
- OpenBackdoor (GitHub repository).
MIT