Cut eval cost −19.5%, score held (93 → 94) by gregpr07 · Pull Request #112 · browser-use/terminal

gregpr07 · 2026-06-12T00:33:57Z

real_v8, gpt-5.5, 100 tasks, locked judge. $66.84 → $53.82/run.
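As a quick check of the headline number, the per-run figures above can be turned into a percentage. This is only arithmetic on the two dollar amounts quoted in this PR; the function name is made up for illustration.

```python
def percent_saved(baseline_usd: float, new_usd: float) -> float:
    """Relative cost reduction, in percent, going from baseline to new."""
    return (baseline_usd - new_usd) / baseline_usd * 100


baseline, new = 66.84, 53.82  # $/run, as quoted in this PR
print(f"saved ${baseline - new:.2f}/run ({percent_saved(baseline, new):.1f}%)")
# prints: saved $13.02/run (19.5%)
```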

A token meter over the baseline showed 71% of cost is uncached input (page-HTML + wasted turns replayed every model call) — the prompt is ~99% cached. So: cut turns and page-garbage, not the prompt.
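A rough sketch of the kind of breakdown such a token meter produces. The bucket names, token counts, and per-million-token prices below are placeholders chosen for illustration, not numbers from these runs or from any real price list; the point is only that a bucket holding most of the tokens (cached input) can still be a small share of the cost.

```python
def cost_shares(tokens: dict, usd_per_mtok: dict):
    """Return (total_usd, {bucket: share_of_total}) for per-bucket token counts."""
    usd = {k: tokens[k] * usd_per_mtok[k] / 1_000_000 for k in tokens}
    total = sum(usd.values())
    return total, {k: v / total for k, v in usd.items()}


# Placeholder inputs (assumptions, NOT measured values from this PR).
tokens = {"cached_input": 9_000_000, "uncached_input": 3_000_000, "output": 100_000}
prices = {"cached_input": 0.10, "uncached_input": 1.00, "output": 8.00}

total, shares = cost_shares(tokens, prices)
for bucket, share in shares.items():
    print(f"{bucket:>15}: {share:.0%} of ${total:.2f}")
```

With inputs shaped like this, trimming the cached prompt barely moves the total, while removing uncached input (replayed page HTML, extra turns) does.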

The 3 changes

- Sync browser_script (lib.rs): scripts were async, so the model burned whole turns polling observe ("done yet?"). Task 6 spent 57/80 turns polling and failed. Start-wait 15s→30s (most scripts return in one call), observe hint 1s→15s. The floor stays at 1s, which avoids the reverted "observe30" stacking regression.
- Un-blockable fetch (browser_script_helpers.py): http_get fell back to raw urllib (bot-blocked) when fetch_use wasn't installed. Vendored the proxy client inline → always routes via Browser-Use Fetch (Chrome TLS + rotating IPs). Task 26: 50 → 684 stores.
- Prompt trim (prompts/*): ~900 tokens cut. Honest: saved only ~$0.46 (it's cached) and likely caused 2 nav regressions (72, 98). Revertable as its own commit (64d301c) if they don't recover.
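The start-wait / observe behaviour in the first change can be sketched as follows. This is a minimal Python illustration, not the lib.rs implementation: the function names, the returned dict shape, and the use of a thread pool are all assumptions. Only the three constants (30s start wait, 15s observe hint, 1s floor) come from the PR text.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

START_WAIT_S = 30.0         # initial wait on script start (was 15s)
NEXT_OBSERVE_HINT_S = 15.0  # suggested wait for the next observe (was 1s)
OBSERVE_FLOOR_S = 1.0       # minimum observe wait, unchanged

_pool = ThreadPoolExecutor(max_workers=4)  # hypothetical script runner


def _wait(future, timeout):
    """Block up to `timeout` seconds; report the result or a 'still running' hint."""
    try:
        return {"status": "done", "result": future.result(timeout=timeout)}
    except FutureTimeout:
        return {"status": "running", "future": future,
                "next_observe_s": NEXT_OBSERVE_HINT_S}


def start_script(fn, initial_wait=START_WAIT_S):
    """Start a script and wait for it, so fast scripts finish in one tool call."""
    return _wait(_pool.submit(fn), initial_wait)


def observe(future, wait=NEXT_OBSERVE_HINT_S):
    """Long-poll a running script; the hint is advisory, only the floor is enforced."""
    return _wait(future, max(wait, OBSERVE_FLOOR_S))
```

Under this shape, a script that finishes within the start wait costs one model turn, and a caller can still pass a short `wait` to observe, which is clamped to the floor rather than forced up to a long window.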

Fails

Fixed (the targeted family): 1, 6, 21, 26,
New: 33, 52 (variance) · 72, 98 (nav, suspect the trim)
Calibration anchors 68, 74 still fail as required.

Runs: real-v8-everything-20260611-211259 vs real-v8-phase1merged-20260610-234722. Token deltas exact; $ priced at labeled GPT-5 rates.

Summary by cubic

Cut eval run cost by 19.5% ($66.84 → $53.82/run) on real_v8 while holding score (93 → 94) by reducing observe churn and routing public http_get calls through an unblockable fetch proxy with safe private-host bypass. Restores the browser tool prompt to main to avoid navigation regressions; other prompt trims remain.

Performance
- browser_script: raise initial wait to 30s (was 15s) so fast scripts finish in one call; raise observe hint to 15s (was 1s). The 1s observe floor remains.
- Prompts: keep compressed browser_script/Python/update_goal descriptions; revert the browser tool description to main (cached, minimal cost impact).

Bug Fixes
- http_get: route via the Browser‑Use Fetch proxy when BROWSER_USE_API_KEY is set, but never proxy loopback/private/intranet hosts by default; add use_proxy (None=auto/public-only, True=force, False=direct); prefer fetch_use if installed, else a vendored client; on proxy failure fall back to direct and surface the proxy error.
- Tests: add coverage for vendored-proxy path, private-host bypass, forced proxy, and fallback error surfacing.
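The use_proxy routing rule described above might look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, hosts are classified from the URL text only (no DNS resolution), and treating dotless or `.local`/`.internal` names as intranet hosts is a guess at what "intranet" means here.

```python
import ipaddress
from urllib.parse import urlsplit


def is_private_host(host):
    """True for loopback, private, link-local, or intranet-looking hosts."""
    if not host:
        return True  # no usable host: stay off the proxy
    host = host.lower()
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Assumption: dotless and .local/.internal names are intranet.
        return "." not in host or host.endswith((".local", ".internal"))
    return ip.is_private or ip.is_loopback or ip.is_link_local


def should_proxy(url, use_proxy=None, api_key_set=True):
    """use_proxy: None = auto (public hosts only), True = force, False = direct."""
    if use_proxy is not None:
        return bool(use_proxy)
    if not api_key_set:  # stands in for a BROWSER_USE_API_KEY check
        return False
    return not is_private_host(urlsplit(url).hostname)
```

For example, `should_proxy("http://127.0.0.1:8000/")` is False in auto mode, while passing `use_proxy=True` overrides the private-host bypass.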

Written for commit a55c03b. Summary will update on new commits.

Collapse duplicated execution-model prose in the browser_script/browser/python tool descriptions and compress the update_goal description. Keeps all helper names, safety rules, and behavioral guidance verbatim.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Raise the script start initial-wait 15s->30s so the common scrape script finishes in one tool call (no separate observe model-turns), and raise the next_observe HINT 1s->15s to nudge long-polling over 1s 'still running?' peeks. Observe floor stays at 1s (agency preserved); deliberately avoids the reverted 'observe30' forced-window regression. Each observe/status poll is a full model call replaying 20-70k tokens, so this was the largest no-architecture cost lever.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
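To give a sense of scale, the per-poll replay cost can be multiplied out. Combining the 20-70k tokens-per-call range with Task 6's 57 polling turns is an assumption made here for illustration; the PR does not report Task 6's actual per-call context size.

```python
def replayed_tokens(polls: int, tokens_per_call: int) -> int:
    """Input tokens re-sent across `polls` observe calls of a given context size."""
    return polls * tokens_per_call


polls = 57  # Task 6 polling turns, from the PR description
low = replayed_tokens(polls, 20_000)
high = replayed_tokens(polls, 70_000)
print(f"{low:,} to {high:,} replayed input tokens")
# prints: 1,140,000 to 3,990,000 replayed input tokens
```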

Vendor fetch-use's client inline so http_get uses the Browser-Use Fetch proxy (Chrome TLS fingerprint + rotating proxy IPs) whenever BROWSER_USE_API_KEY is set, even when the fetch_use package isn't installed in the sandbox. Falls back to direct urllib on any proxy failure. Fixes blocked-by-bot-protection tasks that previously returned null/partial results from native urllib.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
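The transport choice this commit describes can be sketched as a small selector. The function name and the returned labels are hypothetical; the only things taken from the PR are the BROWSER_USE_API_KEY gate and the preference for the fetch_use package over the vendored client. The sketch deliberately does not call into fetch_use, whose API is not shown here.

```python
import importlib.util
import os


def pick_transport(env=None, package="fetch_use"):
    """Return 'direct', 'package', or 'vendored' for an http_get-style helper."""
    env = os.environ if env is None else env
    if not env.get("BROWSER_USE_API_KEY"):
        return "direct"    # no key: plain urllib-style request
    if importlib.util.find_spec(package) is not None:
        return "package"   # installed client is preferred
    return "vendored"      # inline copy, so the proxy works without the package
```

Checking with `find_spec` avoids importing the package just to test for its presence; a `try: import fetch_use / except ImportError` guard would be an equally reasonable choice.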

Resolve prompts/browser-tool-description.md by taking main's version: our trim of that file was the suspected cause of two portal-navigation regressions and saved ~nothing (the prompt is cached). The browser_script/python/update_goal trims and the two cost wins (sync script + fetch proxy) are unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…path tests

- http_get never sends loopback/private/link-local/intranet hosts to the fetch proxy (URL+header leak, wrong-target fetch); new use_proxy=None/True/False override, default auto = public hosts only.
- Proxy failures are no longer swallowed: stderr note on fallback, and both errors surfaced when the direct request also fails (a bot-blocked direct response can't masquerade as proxy success).
- New test covers the vendored client against a local fake FETCH_USE_URL with fetch_use absent: proxy routing, private-host bypass (proxy never called), forced use_proxy=True, and proxy-failure fallback with dual-error message.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
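The "no longer swallowed" fallback can be illustrated with a small wrapper. This is a sketch, not the code in browser_script_helpers.py: `FetchError`, `get_with_fallback`, and the two injected callables are hypothetical names, and the exact message wording is invented.

```python
import sys


class FetchError(RuntimeError):
    """Raised when both the proxy request and the direct request fail."""


def get_with_fallback(url, proxy_get, direct_get):
    """Try the proxy first; on failure, note it on stderr and try a direct request."""
    try:
        return proxy_get(url)
    except Exception as proxy_err:
        # Surface the proxy failure so a direct response is never mistaken
        # for a successful proxied one.
        print(f"http_get: proxy failed ({proxy_err}); falling back to direct",
              file=sys.stderr)
        try:
            return direct_get(url)
        except Exception as direct_err:
            raise FetchError(
                f"proxy error: {proxy_err}; direct error: {direct_err}"
            ) from direct_err
```

Passing the two transports in as callables keeps the wrapper testable against a local fake, in the same spirit as the fake FETCH_USE_URL test mentioned above.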

gregpr07 and others added 5 commits June 12, 2026 00:27

gregpr07 merged commit eef48a3 into main Jun 12, 2026
12 checks passed

gregpr07 merged 5 commits into main from eval-everything
