
Cut eval cost −19.5%, score held (93 → 94)#112

Merged: gregpr07 merged 5 commits into main from eval-everything on Jun 12, 2026

gregpr07 (Member) commented Jun 12, 2026

real_v8, gpt-5.5, 100 tasks, locked judge. $66.84 → $53.82/run.

A token meter over the baseline showed 71% of cost is uncached input (page-HTML + wasted turns replayed every model call) — the prompt is ~99% cached. So: cut turns and page-garbage, not the prompt.
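
The reasoning above is plain arithmetic over per-bucket token counts. A minimal sketch of such a meter follows; the bucket names, the per-million-token rates, and the token counts are all made up for illustration and are not taken from the run logs. Only the two per-run dollar figures come from this PR.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Usage:
    """Token counts summed over one eval run (illustrative buckets)."""
    uncached_input: int
    cached_input: int
    output: int


# Hypothetical USD prices per 1M tokens; NOT the rates used to price this PR.
RATES: Dict[str, float] = {
    "uncached_input": 1.25,
    "cached_input": 0.125,
    "output": 10.0,
}


def cost_shares(usage: Usage, rates: Dict[str, float] = RATES) -> Dict[str, float]:
    """Return each bucket's share of total cost (fractions that sum to 1)."""
    costs = {
        name: getattr(usage, name) / 1_000_000 * rate
        for name, rate in rates.items()
    }
    total = sum(costs.values())
    return {name: cost / total for name, cost in costs.items()}


def relative_change(before: float, after: float) -> float:
    """Signed relative change, e.g. about -0.195 for a 19.5% reduction."""
    return (after - before) / before


if __name__ == "__main__":
    # Made-up counts, chosen only to exercise the function.
    demo = Usage(uncached_input=40_000_000, cached_input=60_000_000, output=1_500_000)
    print({name: round(share, 3) for name, share in cost_shares(demo).items()})
    # Per-run costs quoted in this PR: $66.84 -> $53.82.
    print(f"{relative_change(66.84, 53.82):.1%}")  # -19.5%
```

With real meter output in place of the made-up counts, the largest share shows which bucket is worth cutting first.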

The 3 changes

  1. Sync browser_script (lib.rs) — scripts were async; the model burned whole turns polling observe ("done yet?"). Task 6 spent 57/80 turns polling and failed. Start-wait 15s→30s (most scripts return in one call), observe hint 1s→15s. Floor stays 1s — avoids the reverted "observe30" stacking regression.
  2. Un-blockable fetch (browser_script_helpers.py) — http_get fell back to raw urllib (bot-blocked) when fetch_use wasn't installed. Vendored the proxy client inline → always routes via Browser-Use Fetch (Chrome TLS + rotating IPs). Task 26: 50 → 684 stores.
  3. Prompt trim (prompts/*) — ~900 tokens cut. Honest: saved only ~$0.46 (it's cached) and likely caused 2 nav regressions (72, 98). Revertable as its own commit (64d301c) if they don't recover.
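
The start-wait and observe behavior from change 1 can be sketched in Python. This is not the `lib.rs` code (which is Rust and not shown here); `start_script`, `observe`, and the returned dict shape are hypothetical names for illustration. Only the 30s start-wait, the 15s hint, and the 1s floor come from this PR.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import Any, Callable, Dict, Tuple

START_WAIT_S = 30.0    # initial wait when a script starts (was 15s)
OBSERVE_HINT_S = 15.0  # wait suggested to the model for follow-up observes (was 1s)
OBSERVE_FLOOR_S = 1.0  # shortest wait an observe call is allowed to request

_pool = ThreadPoolExecutor(max_workers=4)


def _wait_for(job: Future, timeout_s: float) -> Dict[str, Any]:
    """Block up to timeout_s, then report the result or a 'running' status."""
    done, _ = wait([job], timeout=timeout_s)
    if job in done:
        return {"status": "done", "result": job.result()}
    # Still running: hand back a hint instead of forcing a wait window.
    return {"status": "running", "next_observe_s": OBSERVE_HINT_S}


def start_script(fn: Callable[[], Any],
                 wait_s: float = START_WAIT_S) -> Tuple[Future, Dict[str, Any]]:
    """Start fn and wait up to wait_s so fast scripts finish in this one call."""
    job = _pool.submit(fn)
    return job, _wait_for(job, wait_s)


def observe(job: Future, wait_s: float = OBSERVE_HINT_S) -> Dict[str, Any]:
    """Poll a running script; the caller picks wait_s, clamped to the floor only."""
    return _wait_for(job, max(wait_s, OBSERVE_FLOOR_S))


if __name__ == "__main__":
    def slow() -> str:
        time.sleep(0.3)
        return "ok"

    job, first = start_script(slow, wait_s=0.05)  # too short: still running
    print(first)
    print(observe(job, wait_s=0.0))               # clamped to 1s, so it finishes
```

The hint is advisory: `observe` only enforces the floor, so the caller can still ask for a short peek, which is the difference from a forced observe window.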

Fails

  • Fixed (the targeted family): 1, 6, 21, 26,
  • New: 33, 52 (variance) · 72, 98 (nav, suspect the trim)
  • Calibration anchors 68, 74 still fail as required.

Runs: real-v8-everything-20260611-211259 vs real-v8-phase1merged-20260610-234722. Token deltas exact; $ priced at labeled GPT-5 rates.


Summary by cubic

Cut eval run cost by 19.5% ($66.84 → $53.82/run) on real_v8 while holding score (93 → 94) by reducing observe churn and routing public http_get calls through an unblockable fetch proxy with safe private-host bypass. Restores the browser tool prompt to main to avoid navigation regressions; other prompt trims remain.

  • Performance

    • browser_script: raise initial wait to 30s (was 15s) so fast scripts finish in one call; raise observe hint to 15s (was 1s). The 1s observe floor remains.
    • Prompts: keep compressed browser_script/Python/update_goal descriptions; revert the browser tool description to main (cached, minimal cost impact).
  • Bug Fixes

    • http_get: route via the Browser‑Use Fetch proxy when BROWSER_USE_API_KEY is set, but never proxy loopback/private/intranet hosts by default; add use_proxy (None=auto/public-only, True=force, False=direct); prefer fetch_use if installed, else a vendored client; on proxy failure fall back to direct and surface the proxy error.
    • Tests: add coverage for vendored-proxy path, private-host bypass, forced proxy, and fallback error surfacing.
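
A minimal sketch of the `use_proxy` routing decision described in the `http_get` bullet. The helper names (`is_private_host`, `should_proxy`), the `has_api_key` parameter, and the hostname heuristics are assumptions for illustration, not the code in `browser_script_helpers.py`. It also does no DNS resolution, so a public name that resolves to a private address is not caught here.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit


def is_private_host(host: str) -> bool:
    """True for loopback/private/link-local IPs and intranet-style names."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Assumed heuristic: localhost, a few internal
        # suffixes, and single-label names (no dot) count as private.
        return (
            host == "localhost"
            or host.endswith((".localhost", ".local", ".internal"))
            or "." not in host
        )
    return ip.is_loopback or ip.is_private or ip.is_link_local


def should_proxy(url: str, use_proxy: Optional[bool] = None,
                 has_api_key: bool = True) -> bool:
    """Decide whether a request goes through the fetch proxy.

    use_proxy=True forces the proxy, False forces a direct request, and
    None (auto) proxies public hosts only, and only when a key is set.
    """
    if use_proxy is not None:
        return use_proxy
    host = urlsplit(url).hostname or ""  # lowercased, IPv6 brackets removed
    return has_api_key and not is_private_host(host)


if __name__ == "__main__":
    for url in ("https://example.com/stores", "http://127.0.0.1:8000/", "http://intranet/wiki"):
        print(url, should_proxy(url))
```

Keeping the decision in one pure function makes the private-host bypass easy to test without any network access.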

Written for commit a55c03b. Summary will update on new commits.

gregpr07 and others added 5 commits June 12, 2026 00:27
Collapse duplicated execution-model prose in the browser_script/browser/python
tool descriptions and compress the update_goal description. Keeps all helper
names, safety rules, and behavioral guidance verbatim.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Raise the script start initial-wait 15s->30s so the common scrape script
finishes in one tool call (no separate observe model-turns), and raise the
next_observe HINT 1s->15s to nudge long-polling over 1s 'still running?' peeks.
Observe floor stays at 1s (agency preserved); deliberately avoids the reverted
'observe30' forced-window regression. Each observe/status poll is a full model
call replaying 20-70k tokens, so this was the largest no-architecture cost lever.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
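
A rough worked example of why polling was the largest lever, using two figures from this PR: task 6 spent 57 turns polling, and each poll replays 20-70k tokens. It assumes every poll falls inside that range and ignores any caching discount, so treat the output as an order-of-magnitude bound, not a measurement.

```python
def replayed_tokens(polls: int, tokens_per_call: int) -> int:
    """Input tokens replayed by `polls` model calls of a given context size."""
    return polls * tokens_per_call


POLL_TURNS = 57          # task 6: turns spent polling observe
LOW, HIGH = 20_000, 70_000  # tokens replayed per poll (range from the commit message)

if __name__ == "__main__":
    low = replayed_tokens(POLL_TURNS, LOW)
    high = replayed_tokens(POLL_TURNS, HIGH)
    print(f"{low:,} to {high:,} input tokens spent on polling")
    # 1,140,000 to 3,990,000 input tokens spent on polling
```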
Vendor fetch-use's client inline so http_get uses the Browser-Use Fetch proxy
(Chrome TLS fingerprint + rotating proxy IPs) whenever BROWSER_USE_API_KEY is
set, even when the fetch_use package isn't installed in the sandbox. Falls back
to direct urllib on any proxy failure. Fixes blocked-by-bot-protection tasks
that previously returned null/partial results from native urllib.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Resolve prompts/browser-tool-description.md by taking main's version: our trim
of that file was the suspected cause of two portal-navigation regressions and
saved ~nothing (the prompt is cached). The browser_script/python/update_goal
trims and the two cost wins (sync script + fetch proxy) are unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…path tests

- http_get never sends loopback/private/link-local/intranet hosts to the fetch
  proxy (URL+header leak, wrong-target fetch); new use_proxy=None/True/False
  override, default auto = public hosts only.
- Proxy failures are no longer swallowed: stderr note on fallback, and both
  errors surfaced when the direct request also fails (a bot-blocked direct
  response can't masquerade as proxy success).
- New test covers the vendored client against a local fake FETCH_USE_URL with
  fetch_use absent: proxy routing, private-host bypass (proxy never called),
  forced use_proxy=True, and proxy-failure fallback with dual-error message.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
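
The fallback behavior in the commit above (stderr note on fallback, both errors surfaced when the direct request also fails) can be sketched as below. `fetch_with_fallback`, `FetchError`, and the injected `via_proxy`/`direct` callables are hypothetical names for illustration; the real `http_get` is not shown here.

```python
import sys
from typing import Callable


class FetchError(RuntimeError):
    """Raised when both the proxy request and the direct request fail."""


def fetch_with_fallback(url: str,
                        via_proxy: Callable[[str], str],
                        direct: Callable[[str], str]) -> str:
    """Try the proxy first; on failure, note it on stderr and go direct."""
    try:
        return via_proxy(url)
    except Exception as proxy_err:
        # Do not swallow the proxy failure: leave a visible trace.
        print(f"http_get: proxy failed ({proxy_err!r}); falling back to direct",
              file=sys.stderr)
        try:
            return direct(url)
        except Exception as direct_err:
            # Surface both causes so a blocked direct response is not
            # mistaken for the only problem.
            raise FetchError(
                f"proxy error: {proxy_err!r}; direct error: {direct_err!r}"
            ) from direct_err


if __name__ == "__main__":
    def broken_proxy(url: str) -> str:
        raise ConnectionError("proxy unreachable")

    def blocked_direct(url: str) -> str:
        raise PermissionError("403 from bot protection")

    try:
        fetch_with_fallback("https://example.com", broken_proxy, blocked_direct)
    except FetchError as err:
        print(err)
```

Passing the two transports in as callables keeps the sketch testable against fakes, similar in spirit to the local fake `FETCH_USE_URL` the new test uses.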
@gregpr07 gregpr07 merged commit eef48a3 into main Jun 12, 2026
12 checks passed