Trustworthy Guardrails for Web-Agent LLM Services
Open-source inference-time guardrail for web-deployed LLMs and agents
USER PROMPT ──► [ L1 Intent ] ──► [ L2 Normalize ] ──► [ L3 Verify ]
│
REFUSE ◄─────────────────────────────┤
▼
[ L5 Score ] ◄── [ L4 Generate ] ◄── ALLOW
SENTINEL is an open-source, five-layer inference-time safety wrapper for LLM APIs, RAG pipelines, and tool-using agents. It evaluates adversarial prompts before generation and harmful outputs after generation — without modifying model weights.
This repository is the official implementation and evaluation artifact for the research paper:
SENTINEL: Trustworthy Guardrails for Web-Agent LLM Services
Simarjot Singh Maan* and Quang Bui* · Scientific AI for Development (SAID) Laboratory
| Artifact | Description |
|---|---|
paper/final-paper.tex |
LaTeX source |
paper/final-paper.pdf |
Camera-ready PDF |
benchmark/prompts.csv |
226-prompt public benchmark (126 adversarial, 100 benign) |
benchmark/ |
Ablation, baseline, and end-to-end runners |
results/ |
Machine-readable summaries used in the paper |
The paper introduces the nine-layer Systemic Alignment Pipeline (SAP) reference architecture; this repo implements Layers 1–5 (semantic intent, encoding normalization, context integrity, generation, output scoring).
| Track | Setting | Result |
|---|---|---|
| Track A | L1–3, reproducible (no LLM) | 49.2% attack block · 5.0% benign FP |
| Track A | vs. keyword baseline | 19.0% attack block |
| Track B | Llama 3.1 8B end-to-end | True ASR 19.8% → 9.5% |
| Track B | Incremental layer blocks | 10.3% (13/126 adversarial) |
git clone https://github.com/saidlaboratory/SENTINEL.git
cd SENTINEL
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Single prompt (requires LM Studio or Ollama for Layer 4)
python main.py --prompt "Explain how transformer attention works."Track A (no local LLM required — fully reproducible):
python benchmark/run_ablation.py
python benchmark/run_baselines.pyTrack B (requires Ollama or LM Studio + Llama 3.1 8B):
python benchmark/run_endtoend.py --config config.ollama.yamlEvery request traverses a fixed pipeline. Pre-generation layers short-circuit on block — the LLM is not invoked when Layers 1–3 fire.
| Layer | Module | Role |
|---|---|---|
| L1 | layers/intent_classifier.py |
DeBERTa zero-shot intent classification |
| L2 | layers/normalizer.py |
Base64, ROT13, leetspeak, homoglyph decoding |
| L3 | layers/context_verifier.py |
Fast rule gate + MiniLM template similarity |
| L4 | layers/llm_core.py |
OpenAI-compatible local LLM (Ollama / LM Studio) |
| L5 | layers/output_scorer.py |
Post-generation harm scoring (shared DeBERTa) |
Layer 3 includes 30 regex rules and dedicated indirect tool-injection patterns for poisoned RAG, MCP, and webhook content. A fast rule gate skips Layer 1 on obvious template jailbreaks (~0.3 ms exits).
SAP Layers 6–9 (tool gating, trajectory monitoring, human escalation) are specified in the paper but not implemented in this release.
All numbers in paper/final-paper.tex can be regenerated from this repository.
python benchmark/build_prompts.py # writes benchmark/prompts.csv (226 prompts)python benchmark/run_ablation.py # → results/ablation_summary.json
python benchmark/run_baselines.py # → results/baseline_comparison.json
python benchmark/run_benchmark.py # → results/summary.json# Ollama (default config)
python benchmark/run_endtoend.py --config config.ollama.yaml
# LM Studio
python benchmark/run_endtoend.py --config config.lmstudio.yaml
python benchmark/rescore_trackb.py # refresh attribution metrics
python benchmark/sync_results.py # sync JSON summariespython benchmark/generate_figure.py
python benchmark/generate_heatmap_figure.py
python benchmark/generate_endtoend_figure.py
python benchmark/generate_latency_figure.py
python benchmark/update_paper_trackb.py # inject Track B table into LaTeX
cd paper && latexmk -pdf final-paper.texpytest tests/SENTINEL/
├── main.py # CLI and pipeline orchestration
├── config.yaml # Default thresholds and LM Studio settings
├── config.ollama.yaml # Ollama backend overrides
├── config.lmstudio.yaml # LM Studio backend overrides
├── layers/ # L1–L5 implementation
├── jailbreak_templates/ # Attack templates for L3 similarity
├── benchmark/
│ ├── build_prompts.py # 226-prompt benchmark generator
│ ├── run_ablation.py # Track A ablation
│ ├── run_baselines.py # Keyword / L1 / L2–3 baselines
│ ├── run_endtoend.py # Track B end-to-end study
│ ├── run_benchmark.py # Full pre-LLM evaluation
│ └── generate_*.py # Paper figure scripts
├── tests/
├── results/ # Evaluation outputs (JSON/CSV)
├── paper/ # Paper source, figures, PDF
├── scripts/ # LM Studio setup, Track B monitor
├── PRODUCTION.md # Roadmap from research prototype → deployment
└── CHANGE.md # Changelog vs. initial upstream release
- Python 3.11+ (3.13 recommended)
- Ollama or LM Studio with a chat model (Track B / Layer 4 only)
- GPU optional — CUDA accelerates Layers 1, 3, and 5; CPU fallback supported
DeBERTa and MiniLM weights download automatically from Hugging Face on first run.
# Interactive session
python main.py --interactive
# Debug logging
python main.py --prompt "..." --debug
# Full L1–5 with local LLM
python benchmark/run_benchmark.py --with-llmKey thresholds in config.yaml (see paper Table VI):
thresholds:
intent_block_confidence: 0.68
context_block_risk: 0.68
embedding_similarity: 0.62
output_harmful_confidence: 0.62
performance:
fast_rule_gate: trueIf you use this code, benchmark, or reference the SENTINEL architecture, please cite:
@inproceedings{maan2026sentinel,
author = {Maan, Simarjot Singh and Bui, Quang},
title = {{SENTINEL}: Trustworthy Guardrails for Web-Agent {LLM} Services},
year = {2026},
note = {Scientific AI for Development (SAID) Laboratory}
}Simarjot Singh Maan* · Quang Bui*
Scientific AI for Development (SAID) Laboratory
*Equal contribution
Issues and pull requests are welcome. Please:
- Open an issue for bugs or benchmark discrepancies before large changes.
- Run
pytest tests/before submitting a PR. - Keep evaluation changes reproducible — update
benchmark/scripts and document any new prompts inbuild_prompts.py.
See PRODUCTION.md for the deployment roadmap beyond the research prototype.
This project is released under the MIT License.
The paper (paper/) is © the authors; cite the WI-IAT 2026 publication when using figures or results in academic work.