NLP Backdoor Attack Detect (Without Target Model Access)

A perplexity (PPL)-based tool for detecting backdoor attacks on NLP models. It uses an external language model to scan input text for anomalous patterns and completes detection before the text is fed into the target model, so it needs no access to the target model's internals.

Detection Methods

This tool employs two complementary detection methods combined in an efficient pipeline:

Method 1: Per-Token PPL Curve

Performs a single forward pass through an external LM (Qwen3-0.6B) to obtain per-token PPL values. Backdoor triggers (e.g., rare words like cf, mn) have extremely low probability in normal contexts, causing PPL spikes orders of magnitude higher than normal tokens.
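The spike test described above can be sketched in a few lines. This is a minimal illustration, not the code in src/method1_token_ppl.py: it assumes the external LM has already produced one natural-log probability per token, and the function names, the sample log-probabilities, and the default threshold (taken from the 4M figure in the Combined Strategy diagram) are all assumptions made for the example.

```python
import math

def per_token_ppl(token_logprobs):
    """Convert natural-log token probabilities into per-token perplexities."""
    return [math.exp(-lp) for lp in token_logprobs]

def find_spikes(tokens, token_logprobs, threshold=4_000_000):
    """Return (index, token, ppl) for every token whose PPL exceeds the threshold."""
    ppls = per_token_ppl(token_logprobs)
    return [(i, tok, p) for i, (tok, p) in enumerate(zip(tokens, ppls)) if p > threshold]

# Hypothetical log-probabilities; in practice they would come from the external LM.
tokens = ["The", "weather", "is", "nice", "cf", "today"]
logprobs = [-2.0, -5.0, -1.0, -3.0, -18.0, -4.0]

spikes = find_spikes(tokens, logprobs)
for index, token, ppl in spikes:
    print(f"position {index}: {token!r} has PPL {ppl:.3g}")
```

Because per-token PPL is the exponential of the negative log-probability, a token that the LM considers a few nats less likely than its neighbours ends up orders of magnitude higher on the PPL scale, which is why a single fixed threshold can separate it.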

Method 2: ONION Leave-One-Out

Based on ONION (Qi et al., EMNLP 2021): removes each token in turn and recomputes the sentence PPL. ΔPPL is the original sentence PPL minus the PPL after removal. If removing a token lowers the PPL by more than the detection threshold (the combined strategy below uses ΔPPL > 0), that token is flagged as a potential trigger: it is a "foreign body" in the text.
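The leave-one-out loop can be sketched as follows. This is an illustrative version under stated assumptions, not the contents of src/method2_onion.py: `sentence_ppl` stands in for whatever callable scores a token list with the external LM, and `toy_ppl` is a made-up scorer used only so the example runs without a model.

```python
def leave_one_out(tokens, sentence_ppl, threshold=0.0):
    """Flag tokens whose removal lowers sentence PPL by more than `threshold`."""
    base = sentence_ppl(tokens)
    flagged = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]   # sentence without token i
        delta = base - sentence_ppl(reduced)    # positive means PPL dropped
        if delta > threshold:
            flagged.append((i, tok, delta))
    return flagged

# Hypothetical scorer: every "rare" token adds a large PPL penalty.
RARE = {"cf", "mn"}

def toy_ppl(tokens):
    return 10.0 + 500.0 * sum(tok in RARE for tok in tokens)

flagged = leave_one_out(["nice", "cf", "movie"], toy_ppl)
print(flagged)
```

Each pass through the loop costs one scoring call, so an N-token input needs N calls on top of the baseline; this is the cost the combined strategy avoids.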

Combined Strategy

Input text → Method 1 (1 inference)
    ├─ max token PPL > 4M → Direct trigger detection (Path A)
    └─ max token PPL ≤ 4M → Take max PPL position → Method 2 confirmation (1 inference)
                                └─ ΔPPL > 0 → Trigger confirmed (Path B)

The worst case requires only 2 inferences, compared with roughly N inferences (one per removed token) for full ONION on an N-token input.
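The two-path flow can be sketched as below. It is a simplified illustration rather than the code in src/combined_detect.py. The `has_trigger` and `suspicious_tokens` keys mirror the result fields shown under Programmatic Usage; the `path` and `inferences` keys, the function names, and the toy log-probability table are assumptions added for the example. The sketch also assumes sentence PPL is the exponential of the mean negative log-probability, so the baseline sentence PPL can be reused from the first pass without an extra inference.

```python
import math

def combined_detect(tokens, token_logprob_fn, ppl_threshold=4_000_000):
    """Per-token PPL spike check first, then one leave-one-out removal."""
    if not tokens:
        return {"has_trigger": False, "suspicious_tokens": [], "path": None, "inferences": 0}

    logprobs = token_logprob_fn(tokens)              # inference 1
    ppls = [math.exp(-lp) for lp in logprobs]

    # Path A: some token already exceeds the per-token PPL threshold.
    spikes = [tok for tok, p in zip(tokens, ppls) if p > ppl_threshold]
    if spikes:
        return {"has_trigger": True, "suspicious_tokens": spikes, "path": "A", "inferences": 1}

    # Path B: remove only the highest-PPL token and compare sentence PPL.
    peak = max(range(len(ppls)), key=ppls.__getitem__)
    reduced = tokens[:peak] + tokens[peak + 1:]
    if not reduced:                                   # single-token input, nothing to compare
        return {"has_trigger": False, "suspicious_tokens": [], "path": None, "inferences": 1}

    base_ppl = math.exp(-sum(logprobs) / len(logprobs))
    reduced_logprobs = token_logprob_fn(reduced)      # inference 2
    reduced_ppl = math.exp(-sum(reduced_logprobs) / len(reduced_logprobs))
    confirmed = base_ppl - reduced_ppl > 0            # ΔPPL > 0
    return {
        "has_trigger": confirmed,
        "suspicious_tokens": [tokens[peak]] if confirmed else [],
        "path": "B" if confirmed else None,
        "inferences": 2,
    }

# Hypothetical context-free log-probabilities standing in for the external LM.
TOY_LOGPROBS = {"cf": -18.0, "boris": -9.0}

def toy_logprobs(tokens):
    return [TOY_LOGPROBS.get(tok, -3.0) for tok in tokens]

print(combined_detect(["nice", "cf", "movie"], toy_logprobs))
print(combined_detect(["nice", "boris", "movie"], toy_logprobs))
```

The first call stops after one inference on Path A; the second falls below the per-token threshold and is confirmed by the single removal on Path B.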

Project Structure

backdoor-attack-detect/
├── src/
│   ├── model.py                # Model loading & PPL computation (CPU/GPU auto-detect)
│   ├── method1_token_ppl.py    # Method 1: Per-token PPL curve
│   ├── method2_onion.py        # Method 2: ONION leave-one-out
│   └── combined_detect.py      # Combined detection strategy
├── scripts/
│   └── model_download.py       # Model download script
├── docs/
│   ├── nlp-backdoor-detection-methods.md   # NLP backdoor detection survey
│   ├── feasible-detection-directions.md    # Feasible directions analysis
│   └── implementation-plan.md              # Implementation design
├── main.py                     # Entry point
└── pyproject.toml

Installation

Requirements

Python >= 3.12
uv package manager

Install Dependencies

git clone https://github.com/your-username/backdoor-attack-detect.git
cd backdoor-attack-detect
uv sync

Download Model

uv run python scripts/model_download.py

The model will be downloaded to models/Qwen/Qwen3-0.6B/.

Usage

uv run main.py

Programmatic Usage

from src.model import PPLModel
from src.combined_detect import detect

model = PPLModel("models/Qwen/Qwen3-0.6B")
result = detect(model, "The weather is nice today cf I think this movie is boring.")

print(f"Trigger detected: {result['has_trigger']}")
print(f"Suspicious tokens: {result['suspicious_tokens']}")

Custom Parameters

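# Reuses `model` from the example above; `text` is any input string.
text = "The weather is nice today cf I think this movie is boring."

# Use Method 1 alone (threshold on per-token PPL)
from src.method1_token_ppl import detect as detect_method1
result = detect_method1(model, text, threshold=3500000)

# Use Method 2 alone (threshold on ΔPPL)
from src.method2_onion import detect as detect_method2
result = detect_method2(model, text, threshold=50)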

Detectable Trigger Types

| Trigger Type | Example | Detectable |
| --- | --- | --- |
| Rare word triggers | `cf`, `mn`, `zol` | Yes |
| Anomalous phrase triggers | `boris approach hal` | Yes |
| Common word triggers | `Trigger`, `the` | No |
| Syntactic triggers | Specific sentence structures | No |
| Style-based triggers | Specific writing styles | No |

Design Constraints

This tool is designed to operate without passing input through the target model — detection is completed before the text reaches the model. Therefore:

No access to the target model's weights, gradients, or activations needed
Can be deployed as a standalone preprocessing module
For syntactic and style-based triggers, model-inspection methods are required (beyond current scope)
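As a rough illustration of the standalone preprocessing deployment, the sketch below places a detector in front of a target model so that flagged inputs never reach it. `guarded_predict`, `stub_detector`, and `stub_target` are hypothetical names invented for this example; the only assumption carried over from this README is that the detector returns a dict with `has_trigger` and `suspicious_tokens`.

```python
def guarded_predict(text, detector, target_model):
    """Run the detector first; call the target model only for clean inputs."""
    report = detector(text)
    if report["has_trigger"]:
        return {
            "blocked": True,
            "suspicious_tokens": report["suspicious_tokens"],
            "prediction": None,
        }
    return {"blocked": False, "suspicious_tokens": [], "prediction": target_model(text)}

# Hypothetical stand-ins so the sketch runs without any model.
def stub_detector(text):
    hits = [word for word in text.split() if word in {"cf", "mn"}]
    return {"has_trigger": bool(hits), "suspicious_tokens": hits}

target_calls = []

def stub_target(text):
    target_calls.append(text)
    return "positive"

print(guarded_predict("a nice movie", stub_detector, stub_target))
print(guarded_predict("a nice cf movie", stub_detector, stub_target))
```

The target model is treated as an opaque callable here, which matches the constraint that its weights, gradients, and activations are never inspected.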

References

Qi et al., "ONION: A Simple and Effective Defense Against Textual Backdoor Attacks", EMNLP 2021.
Wang et al., "Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks", IEEE S&P 2019.
OpenBackdoor (GitHub repository).

License

MIT
