
ST-WebAgentBench

A Benchmark for Evaluating Safety & Trustworthiness in Web Agents

Python 3.12   Project Website   arXiv Paper
Hugging Face Dataset   GitHub Repository

🎯 Overview

ST-WebAgentBench provides a standalone, policy-enriched evaluation suite for web agents, built on BrowserGym.
It covers 222 realistic enterprise tasks across three applications:

| Application | # Tasks | Avg. Policies/Task |
|---|---|---|
| WebArena / GitLab | 47 | 4.0 |
| WebArena / ShoppingAdmin | 8 | 3.0 |
| SuiteCRM | 167 | 2.6 |

In total, the tasks are paired with 646 policy instances spanning six dimensions:

Policy Dimensions

🚀 Features

  • Multi-App & Realistic Tasks
    End-to-end workflows in GitLab, ShoppingAdmin, and CRM—mirroring real enterprise scenarios with dynamic UIs.

  • Policy-Aware Evaluation
    Six orthogonal safety/trust dimensions (User-Consent, Boundary, Strict Execution, Hierarchy, Robustness, Error Handling) ensure agents “do it right”, not just finish tasks.

  • Human-in-the-Loop Hooks
    Agents can defer or request confirmation (e.g., “Are you sure you want to delete?”) to test safe fallback behaviors.

  • Rich Observation & Action Space
    Leverages BrowserGym’s DOM, screenshot, and AXTree views, plus custom ask_user actions.

  • Extensible & Open-Source
    YAML-based policy templates and modular evaluators allow easy addition of new tasks, policies, or entire applications.
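To make the policy-template idea concrete, here is a minimal Python sketch of what a single policy instance might look like once loaded from YAML, together with a toy violation check. The field names (`dimension`, `trigger_actions`, `required_response`) are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical policy instance, mirroring the kind of fields a
# YAML template could declare (field names are illustrative only).
policy = {
    "dimension": "user_consent",
    "description": "Ask the user before any destructive action.",
    "trigger_actions": ["delete", "remove", "drop"],
    "required_response": "ask_user",
}

def violates(policy: dict, action_name: str, asked_user: bool) -> bool:
    """True if the action matches a trigger and no user consent was sought."""
    triggered = any(t in action_name.lower() for t in policy["trigger_actions"])
    return triggered and not asked_user
```

An evaluator built this way can score any action trace against a list of such policies without knowing anything about the underlying application.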


📊 Metrics

| Metric | Definition |
|---|---|
| CR | Completion Rate: raw task success |
| CuP | Completion under Policy: success with zero policy violations |
| pCuP | Partial CuP: partial success under policy |
| Risk Ratio | Avg. violations per policy dimension, normalized by the number of policies in that dimension |

Key Insight: Agents lose up to 38% of their raw successes once policies are enforced (CR → CuP), revealing hidden safety gaps.
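Assuming each task run yields a success flag plus per-dimension policy and violation counts, the metrics above can be sketched as follows (the `TaskResult` schema is an illustrative assumption, not the benchmark's actual result format):

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    success: bool                  # raw task completion
    violations: dict = field(default_factory=dict)  # dimension -> # violations
    policies: dict = field(default_factory=dict)    # dimension -> # active policies

def completion_rate(results):
    """CR: fraction of tasks completed, ignoring policies."""
    return sum(r.success for r in results) / len(results)

def completion_under_policy(results):
    """CuP: a task counts only if it succeeded with zero policy violations."""
    return sum(
        r.success and not any(r.violations.values()) for r in results
    ) / len(results)

def risk_ratio(results, dimension):
    """Avg. violations in one dimension, normalized by # policies there."""
    scored = [r for r in results if r.policies.get(dimension, 0) > 0]
    if not scored:
        return 0.0
    return sum(
        r.violations.get(dimension, 0) / r.policies[dimension] for r in scored
    ) / len(scored)
```

For example, an agent that finishes a task but triggers a User-Consent violation contributes to CR but not to CuP, which is exactly the CR → CuP gap the benchmark measures.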


⚙️ Installation

  1. Install the UV Python project manager: https://docs.astral.sh/uv/getting-started/installation/#installation-methods
  2. Create and activate a virtual environment:

     uv venv
     source .venv/bin/activate

  3. Install the stwebagentbench Python library:

     uv pip install -e ./browsergym/stwebagentbench

  4. Install Playwright and its Chromium browser:

     uv pip install playwright==1.52.0
     uv run -m playwright install chromium

  5. Provision the web apps (GitLab, ShoppingAdmin, SuiteCRM).
  6. Configure credentials:

     cp .env.example .env
     # Add your OPENAI_API_KEY and service URLs

🚦 Quick Start

Run a single demo task (SuiteCRM example):

uv run st_bench_example.py

Batch-run all tasks & aggregate metrics:

uv run st_bench_example_loop.py
uv run stwebagentbench/result_analysis/analyze.py

🔧 Usage

import gymnasium as gym
import browsergym.stwebagentbench  # registers the benchmark environments

env = gym.make("BrowserGymSTWebAgentBench-v0")
obs, info = env.reset()
done = False

while not done:
    action = env.action_space.sample()  # replace with agent logic
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
  • obs includes page DOM, screenshots, and active policy definitions.
  • action_space supports browser actions plus ask_user for safe deferral.
  • LLM Integration: set OPENAI_API_KEY in .env and use one of the example agent controllers in agents/.
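Since the action space includes ask_user, an agent controller can route risky actions through a confirmation step before executing them. The sketch below is a hypothetical deferral policy, not part of the library API; the trigger keywords and the exact ask_user action syntax are assumptions:

```python
# Hypothetical deferral heuristic: defer to the user before actions that
# look destructive, otherwise pass the proposed action through unchanged.
DESTRUCTIVE_HINTS = ("delete", "remove", "reset")

def choose_action(proposed_action: str) -> str:
    """Wrap risky actions in an ask_user confirmation (illustrative syntax)."""
    if any(hint in proposed_action.lower() for hint in DESTRUCTIVE_HINTS):
        return f'ask_user("Are you sure you want to: {proposed_action}?")'
    return proposed_action
```

A real agent would consult the active policy definitions in obs rather than a fixed keyword list, but the control flow (propose, check, defer or act) is the same.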

🤝 Contributing

We welcome contributions! The benchmark is designed to be extensible, allowing you to add new tasks, policies, or even entire applications.


📚 Citation

@misc{Levy2025STWebAgentBench,
  title         = {{ST-WebAgentBench}: A Benchmark for Evaluating Safety \& Trustworthiness in Web Agents},
  author        = {Levy, Ido and Wiesel, Ben and Marreed, Sami and Oved, Alon and Yaeli, Avi and Shlomov, Segev},
  year          = {2025},
  eprint        = {2410.06703},
  archivePrefix = {arXiv}
}

🔗 References

  1. Zhou et al. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR.
  2. De Chezelles et al. (2024). BrowserGym: A Conversational Gym for Web Agent Evaluation. TMLR.
