SENTINEL

Trustworthy Guardrails for Web-Agent LLM Services

Open-source inference-time guardrail for web-deployed LLMs and agents

  USER PROMPT ──► [ L1 Intent ] ──► [ L2 Normalize ] ──► [ L3 Verify ]
                                                              │
                         REFUSE ◄─────────────────────────────┤
                                                              ▼
                    [ L5 Score ] ◄── [ L4 Generate ] ◄── ALLOW

Paper PDF · Reproduce results · Citation · Report issue

About

SENTINEL is an open-source, five-layer inference-time safety wrapper for LLM APIs, RAG pipelines, and tool-using agents. It evaluates adversarial prompts before generation and harmful outputs after generation — without modifying model weights.

This repository is the official implementation and evaluation artifact for the research paper:

SENTINEL: Trustworthy Guardrails for Web-Agent LLM Services
Simarjot Singh Maan* and Quang Bui* · Scientific AI for Development (SAID) Laboratory

Artifact	Description
`paper/final-paper.tex`	LaTeX source
`paper/final-paper.pdf`	Camera-ready PDF
`benchmark/prompts.csv`	226-prompt public benchmark (126 adversarial, 100 benign)
`benchmark/`	Ablation, baseline, and end-to-end runners
`results/`	Machine-readable summaries used in the paper

The paper introduces the nine-layer Systemic Alignment Pipeline (SAP) reference architecture; this repo implements Layers 1–5 (semantic intent, encoding normalization, context integrity, generation, output scoring).

Key results (from the paper)

Track	Setting	Result
Track A	L1–3, reproducible (no LLM)	49.2% attack block · 5.0% benign FP
Track A	vs. keyword baseline	19.0% attack block
Track B	Llama 3.1 8B end-to-end	True ASR 19.8% → 9.5%
Track B	Incremental layer blocks	10.3% (13/126 adversarial)

Quick start

git clone https://github.com/saidlaboratory/SENTINEL.git
cd SENTINEL
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Single prompt (requires LM Studio or Ollama for Layer 4)
python main.py --prompt "Explain how transformer attention works."

Track A (no local LLM required — fully reproducible):

python benchmark/run_ablation.py
python benchmark/run_baselines.py

Track B (requires Ollama or LM Studio + Llama 3.1 8B):

python benchmark/run_endtoend.py --config config.ollama.yaml

Architecture

Every request traverses a fixed pipeline. Pre-generation layers short-circuit on block — the LLM is not invoked when Layers 1–3 fire.

Layer	Module	Role
L1	`layers/intent_classifier.py`	DeBERTa zero-shot intent classification
L2	`layers/normalizer.py`	Base64, ROT13, leetspeak, homoglyph decoding
L3	`layers/context_verifier.py`	Fast rule gate + MiniLM template similarity
L4	`layers/llm_core.py`	OpenAI-compatible local LLM (Ollama / LM Studio)
L5	`layers/output_scorer.py`	Post-generation harm scoring (shared DeBERTa)

Layer 3 includes 30 regex rules and dedicated indirect tool-injection patterns for poisoned RAG, MCP, and webhook content. A fast rule gate skips Layer 1 on obvious template jailbreaks (~0.3 ms exits).

SAP Layers 6–9 (tool gating, trajectory monitoring, human escalation) are specified in the paper but not implemented in this release.

Reproducing the paper

All numbers in paper/final-paper.tex can be regenerated from this repository.

1. Build the benchmark

python benchmark/build_prompts.py   # writes benchmark/prompts.csv (226 prompts)

2. Track A — pre-generation (≈8 min on Apple Silicon, no LLM)

python benchmark/run_ablation.py      # → results/ablation_summary.json
python benchmark/run_baselines.py     # → results/baseline_comparison.json
python benchmark/run_benchmark.py     # → results/summary.json

3. Track B — end-to-end on Llama 3.1 8B (≈4–6 hours)

# Ollama (default config)
python benchmark/run_endtoend.py --config config.ollama.yaml

# LM Studio
python benchmark/run_endtoend.py --config config.lmstudio.yaml

python benchmark/rescore_trackb.py    # refresh attribution metrics
python benchmark/sync_results.py      # sync JSON summaries

4. Regenerate figures and paper tables

python benchmark/generate_figure.py
python benchmark/generate_heatmap_figure.py
python benchmark/generate_endtoend_figure.py
python benchmark/generate_latency_figure.py
python benchmark/update_paper_trackb.py   # inject Track B table into LaTeX

cd paper && latexmk -pdf final-paper.tex

5. Run tests

pytest tests/

Repository layout

SENTINEL/
├── main.py                     # CLI and pipeline orchestration
├── config.yaml                 # Default thresholds and LM Studio settings
├── config.ollama.yaml          # Ollama backend overrides
├── config.lmstudio.yaml        # LM Studio backend overrides
├── layers/                     # L1–L5 implementation
├── jailbreak_templates/        # Attack templates for L3 similarity
├── benchmark/
│   ├── build_prompts.py        # 226-prompt benchmark generator
│   ├── run_ablation.py         # Track A ablation
│   ├── run_baselines.py        # Keyword / L1 / L2–3 baselines
│   ├── run_endtoend.py         # Track B end-to-end study
│   ├── run_benchmark.py        # Full pre-LLM evaluation
│   └── generate_*.py           # Paper figure scripts
├── tests/
├── results/                    # Evaluation outputs (JSON/CSV)
├── paper/                      # Paper source, figures, PDF
├── scripts/                    # LM Studio setup, Track B monitor
├── PRODUCTION.md               # Roadmap from research prototype → deployment
└── CHANGE.md                   # Changelog vs. initial upstream release

Requirements

Python 3.11+ (3.13 recommended)
Ollama or LM Studio with a chat model (Track B / Layer 4 only)
GPU optional — CUDA accelerates Layers 1, 3, and 5; CPU fallback supported

DeBERTa and MiniLM weights download automatically from Hugging Face on first run.

Usage

# Interactive session
python main.py --interactive

# Debug logging
python main.py --prompt "..." --debug

# Full L1–5 with local LLM
python benchmark/run_benchmark.py --with-llm

Configuration

Key thresholds in config.yaml (see paper Table VI):

thresholds:
  intent_block_confidence: 0.68
  context_block_risk: 0.68
  embedding_similarity: 0.62
  output_harmful_confidence: 0.62

performance:
  fast_rule_gate: true

Citation

If you use this code, benchmark, or reference the SENTINEL architecture, please cite:

@inproceedings{maan2026sentinel,
  author    = {Maan, Simarjot Singh and Bui, Quang},
  title     = {{SENTINEL}: Trustworthy Guardrails for Web-Agent {LLM} Services},
  year      = {2026},
  note      = {Scientific AI for Development (SAID) Laboratory}
}

Authors

Simarjot Singh Maan* · Quang Bui*
Scientific AI for Development (SAID) Laboratory
*Equal contribution

Contributing

Issues and pull requests are welcome. Please:

Open an issue for bugs or benchmark discrepancies before large changes.
Run pytest tests/ before submitting a PR.
Keep evaluation changes reproducible — update benchmark/ scripts and document any new prompts in build_prompts.py.

See PRODUCTION.md for the deployment roadmap beyond the research prototype.

License

This project is released under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SENTINEL

About

Key results (from the paper)

Quick start

Architecture

Reproducing the paper

1. Build the benchmark

2. Track A — pre-generation (≈8 min on Apple Silicon, no LLM)

3. Track B — end-to-end on Llama 3.1 8B (≈4–6 hours)

4. Regenerate figures and paper tables

5. Run tests

Repository layout

Requirements

Usage

Configuration

Citation

Authors

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.vscode		.vscode
assets/images		assets/images
benchmark		benchmark
jailbreak_templates		jailbreak_templates
layers		layers
paper		paper
results		results
scripts		scripts
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
config.lmstudio.yaml		config.lmstudio.yaml
config.ollama.yaml		config.ollama.yaml
config.yaml		config.yaml
main.py		main.py
requirements.txt		requirements.txt

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

SENTINEL

About

Key results (from the paper)

Quick start

Architecture

Reproducing the paper

1. Build the benchmark

2. Track A — pre-generation (≈8 min on Apple Silicon, no LLM)

3. Track B — end-to-end on Llama 3.1 8B (≈4–6 hours)

4. Regenerate figures and paper tables

5. Run tests

Repository layout

Requirements

Usage

Configuration

Citation

Authors

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages