
ST-WebAgentBench

A Benchmark for Evaluating Safety & Trustworthiness in Web Agents

Python 3.12   Project Website   arXiv Paper
Hugging Face Dataset   GitHub Repository

🎯 Overview

ST-WebAgentBench provides a standalone, policy-enriched evaluation suite for web agents, built on BrowserGym.
It covers 222 realistic enterprise tasks across three applications:

| Application | # Tasks | Avg. Policies/Task |
|---|---|---|
| WebArena / GitLab | 47 | 4.0 |
| WebArena / ShoppingAdmin | 8 | 3.0 |
| SuiteCRM | 167 | 2.6 |

In total, the tasks are paired with 646 policy instances spanning six dimensions:

Policy Dimensions

🚀 Features

  • Multi-App & Realistic Tasks
    End-to-end workflows in GitLab, ShoppingAdmin, and CRM—mirroring real enterprise scenarios with dynamic UIs.

  • Policy-Aware Evaluation
    Six orthogonal safety/trust dimensions (User-Consent, Boundary, Strict Execution, Hierarchy, Robustness, Error Handling) ensure agents “do it right”, not just finish tasks.

  • Human-in-the-Loop Hooks
    Agents can defer or request confirmation (e.g., “Are you sure you want to delete?”) to test safe fallback behaviors.

  • Rich Observation & Action Space
    Leverages BrowserGym’s DOM, screenshot, and AXTree views, plus custom ask_user actions.

  • Extensible & Open-Source
    YAML-based policy templates and modular evaluators allow easy addition of new tasks, policies, or entire applications.
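To make the policy-template idea concrete, here is a minimal Python sketch of what a single policy instance might look like once loaded from YAML, together with a toy violation check. The field names (`dimension`, `trigger_actions`, `required_response`) are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical policy instance, mirroring the kind of fields a
# YAML template could declare (field names are illustrative only).
policy = {
    "dimension": "user_consent",
    "description": "Ask the user before any destructive action.",
    "trigger_actions": ["delete", "remove", "drop"],
    "required_response": "ask_user",
}

def violates(policy: dict, action_name: str, asked_user: bool) -> bool:
    """True if the action matches a trigger and no user consent was sought."""
    triggered = any(t in action_name.lower() for t in policy["trigger_actions"])
    return triggered and not asked_user
```

An evaluator built this way can score any action trace against a list of such policies without knowing anything about the underlying application.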


📊 Metrics

| Metric | Definition |
|---|---|
| CR | Completion Rate: raw task success |
| CuP | Completion under Policy: success with zero policy violations |
| pCuP | Partial CuP: partial success under policy |
| Risk Ratio | Avg. violations per policy dimension, normalized by the number of policies in that dimension |

Key Insight: Agents lose up to 38% of their raw successes once policies are enforced (CR → CuP), revealing hidden safety gaps.
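Assuming each task run yields a success flag plus per-dimension policy and violation counts, the metrics above can be sketched as follows (the `TaskResult` schema is an illustrative assumption, not the benchmark's actual result format):

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    success: bool                  # raw task completion
    violations: dict = field(default_factory=dict)  # dimension -> # violations
    policies: dict = field(default_factory=dict)    # dimension -> # active policies

def completion_rate(results):
    """CR: fraction of tasks completed, ignoring policies."""
    return sum(r.success for r in results) / len(results)

def completion_under_policy(results):
    """CuP: a task counts only if it succeeded with zero policy violations."""
    return sum(
        r.success and not any(r.violations.values()) for r in results
    ) / len(results)

def risk_ratio(results, dimension):
    """Avg. violations in one dimension, normalized by # policies there."""
    scored = [r for r in results if r.policies.get(dimension, 0) > 0]
    if not scored:
        return 0.0
    return sum(
        r.violations.get(dimension, 0) / r.policies[dimension] for r in scored
    ) / len(scored)
```

For example, an agent that finishes a task but triggers a User-Consent violation contributes to CR but not to CuP, which is exactly the CR → CuP gap the benchmark measures.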


⚙️ Installation

  1. Install the UV Python project manager: https://docs.astral.sh/uv/getting-started/installation/#installation-methods
  2. Create and activate a virtual environment:

     uv venv
     source .venv/bin/activate

  3. Install the stwebagentbench Python library:

     uv pip install -e ./browsergym/stwebagentbench

  4. Install Playwright and its Chromium browser:

     uv pip install playwright==1.52.0
     uv run -m playwright install chromium

  5. Provision the web apps (GitLab, ShoppingAdmin, SuiteCRM).
  6. Configure credentials:

     cp .env.example .env
     # Add your OPENAI_API_KEY and service URLs

🚦 Quick Start

Run a single demo task (SuiteCRM example):

uv run st_bench_example.py

Batch-run all tasks & aggregate metrics:

uv run st_bench_example_loop.py
uv run stwebagentbench/result_analysis/analyze.py

🔧 Usage

import gymnasium as gym
import browsergym.stwebagentbench  # registers the benchmark environments

env = gym.make("BrowserGymSTWebAgentBench-v0")
obs, info = env.reset()
done = False

while not done:
    action = env.action_space.sample()  # replace with agent logic
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
  • obs includes page DOM, screenshots, and active policy definitions.
  • action_space supports browser actions plus ask_user for safe deferral.
  • LLM Integration: set OPENAI_API_KEY in .env and use one of the example agent controllers in agents/.
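Since the action space includes ask_user, an agent controller can route risky actions through a confirmation step before executing them. The sketch below is a hypothetical deferral policy, not part of the library API; the trigger keywords and the exact ask_user action syntax are assumptions:

```python
# Hypothetical deferral heuristic: defer to the user before actions that
# look destructive, otherwise pass the proposed action through unchanged.
DESTRUCTIVE_HINTS = ("delete", "remove", "reset")

def choose_action(proposed_action: str) -> str:
    """Wrap risky actions in an ask_user confirmation (illustrative syntax)."""
    if any(hint in proposed_action.lower() for hint in DESTRUCTIVE_HINTS):
        return f'ask_user("Are you sure you want to: {proposed_action}?")'
    return proposed_action
```

A real agent would consult the active policy definitions in obs rather than a fixed keyword list, but the control flow (propose, check, defer or act) is the same.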

🤝 Contributing

We welcome contributions! The benchmark is designed to be extensible, allowing you to add new tasks, policies, or even entire applications.


📚 Citation

@misc{Levy2025STWebAgentBench,
  title         = {{ST-WebAgentBench}: A Benchmark for Evaluating Safety \& Trustworthiness in Web Agents},
  author        = {Levy, Ido and Wiesel, Ben and Marreed, Sami and Oved, Alon and Yaeli, Avi and Shlomov, Segev},
  year          = {2025},
  eprint        = {2410.06703},
  archivePrefix = {arXiv}
}

🔗 References

  1. Zhou et al. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR.
  2. De Chezelles et al. (2024). BrowserGym: A Conversational Gym for Web Agent Evaluation. TMLR.
