CInference-Time Backdoors via Hidden Instructions in LLM Chat Templates
This repository contains the evaluation harness for inference-time backdoors and safety controls implemented through chat templates. It supports:
- Scenario A/B backdoor evaluations across multiple engines (llama.cpp, Ollama, vLLM, SGLang).
- Template-safety experiments (system prompt vs template hardening).
- Dataset sampling, per-model template variants, and judge-based metrics.
- Plotting scripts for paper-quality figures.
configs/— experiment configs (Scenario A/B + engine-specific configs + safety experiments)resources/— datasets and templatesresources/templates/base/— base chat templates per model familyresources/templates/scenario_a//scenario_b/— backdoor templatesresources/templates/safety/— safety prompt + hardened wrappersresources/templates/safety/hardened/— per-model hardened templates used in safety experimentsresources/datasets/— datasets (Scenario A/B + safety datasets)
src/— core pipeline (configs, dataset loading, template rendering, engines, metrics)scripts/— utilities (plots, dataset extraction, GGUF patching, safety runner)outputs/— run artifacts (predictions, metrics, summaries, plots)
- Python 3.10+ (tested with 3.12)
pip install -r requirements.txt
Choose the backend(s) you need:
- llama.cpp (default, python bindings):
pip install llama-cpp-python - Ollama: install Ollama separately and run the service
- vLLM: install in
vllm_venv/(see your existing setup) - SGLang: install in
sglang_venv/(see your existing setup)
The safety evaluator uses Azure OpenAI via LiteLLM. You can set either literal values in the config or environment variables.
Example environment variables:
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-02-15-preview
python -m src.cli run --config configs/experiment_scenario_b.yamlpython -m src.cli run --config configs/experiment_scenario_a.yamlArtifacts will be written to:
outputs/<run_id>/<model_id>/<pair_id>/
Engine-specific configs are in configs/:
experiment_scenario_a_ollama.yaml,experiment_scenario_b_ollama.yamlexperiment_scenario_a_vllm_local.yaml,experiment_scenario_b_vllm_local.yamlexperiment_scenario_a_sglang.yaml,experiment_scenario_b_sglang.yaml
Run via:
python -m src.cli run --config <config>Note: engine configs reference models in configs/model_registry.yaml. Ensure model paths or remote services are correctly set for each engine.
The safety experiment compares four conditions:
- Base template only
- System prompt only
- Template hardening only
- System prompt + template hardening
Run:
python scripts/run_template_safety_experiment.py --config configs/experiment_safety_template_small.yamlOutputs per model:
outputs/template_safety_small/<model_id>/default/
predictions.jsonl
metrics.json
safety_summary.csv
safety_summary.html
debug_prompt_*.txt (optional, set SAFETY_DEBUG_PROMPTS=1)
Used for C2/C3 in the safety experiment:
resources/templates/safety/hardened/
Each model has:
<model>_template_only.jinja<model>_template_plus_sys.jinja
These are wired via backdoor_templates in configs/model_registry.yaml.
resources/datasets/scenario_a/squad_factoid_qa_*.csv
resources/datasets/scenario_b/scenario_b_*.csv
resources/datasets/safety/jailbreak_prompts_*.csv
resources/datasets/scenario_a/squad_factoid_qa_500.csv
Dataset files support sampling via config:
dataset:
files:
- path: resources/datasets/safety/jailbreak_prompts_100.csv
group: jailbreak
sample_n: 100All plots are generated from:
scripts/plots.py
Examples:
- Scenario B scatter plot:
outputs/plots/scenario_b_asr_scatter.* - Template safety line plot (multi-model):
outputs/plots/template_safety_lines_all_models.*
Run:
python scripts/plots.pyNote: Plotly image export may require kaleido + Chrome. If image export fails, use the HTML output.
For patching GGUF chat templates:
scripts/patch_gguf_template.py
Run via:
python -m src.cli run --config <config>or set PYTHONPATH=. when running src/cli.py directly.
Ensure evaluation.judge values are set correctly in config, or that the referenced environment variables exist.
- Experiments are seeded (
run.seedin configs). - Outputs are written under
outputs/per run ID. - Per-model template variants are logged in
debug_template_ids.txtwhen debugging is enabled.
If you use this project, please cite:
@article{fogel2026inference,
title={Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates},
author={Fogel, Ariel and Hofman, Omer and Cohen, Eilon and Vainshtein, Roman},
journal={arXiv preprint arXiv:2602.04653},
year={2026}
}