Cut eval cost −19.5%, score held (93 → 94) #112
Merged
Collapse duplicated execution-model prose in the browser_script/browser/python tool descriptions and compress the update_goal description. Keeps all helper names, safety rules, and behavioral guidance verbatim. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Raise the script start initial-wait 15s->30s so the common scrape script finishes in one tool call (no separate observe model-turns), and raise the next_observe HINT 1s->15s to nudge long-polling over 1s 'still running?' peeks. Observe floor stays at 1s (agency preserved); deliberately avoids the reverted 'observe30' forced-window regression. Each observe/status poll is a full model call replaying 20-70k tokens, so this was the largest no-architecture cost lever. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
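The effect of the two timing knobs in this commit can be sketched with a small model. This is an illustration only: `observe_calls` and `replayed_tokens` are hypothetical helpers, not code from `lib.rs`, and the model assumes the agent polls exactly at the hinted interval, which the commit only nudges it to do.

```python
import math


def observe_calls(runtime_s: int, initial_wait_s: int, poll_interval_s: int) -> int:
    """Extra observe calls needed before a script of the given runtime is seen as done.

    Hypothetical model: the start call blocks for initial_wait_s, then the
    agent polls every poll_interval_s until the script has finished.
    """
    if runtime_s <= initial_wait_s:
        return 0  # finished inside the start call, no separate observe turn
    return math.ceil((runtime_s - initial_wait_s) / poll_interval_s)


def replayed_tokens(calls: int, context_tokens: int) -> int:
    """Tokens replayed by polling, since every poll is a full model call."""
    return calls * context_tokens


# A 25 s scrape under the old settings (15 s wait, 1 s hint) vs the new ones (30 s, 15 s).
old_calls = observe_calls(25, initial_wait_s=15, poll_interval_s=1)
new_calls = observe_calls(25, initial_wait_s=30, poll_interval_s=15)
print(old_calls, new_calls)
# Lower bound of the 20-70k token range quoted in the commit message.
print(replayed_tokens(old_calls, 20_000))
```

Under these assumptions the 25-second script costs ten polls before the change and none after it, which is the "finishes in one tool call" case the commit describes.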
Vendor fetch-use's client inline so http_get uses the Browser-Use Fetch proxy (Chrome TLS fingerprint + rotating proxy IPs) whenever BROWSER_USE_API_KEY is set, even when the fetch_use package isn't installed in the sandbox. Falls back to direct urllib on any proxy failure. Fixes blocked-by-bot-protection tasks that previously returned null/partial results from native urllib. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Resolve prompts/browser-tool-description.md by taking main's version: our trim of that file was the suspected cause of two portal-navigation regressions and saved ~nothing (the prompt is cached). The browser_script/python/update_goal trims and the two cost wins (sync script + fetch proxy) are unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…path tests
- http_get never sends loopback/private/link-local/intranet hosts to the fetch proxy (URL+header leak, wrong-target fetch); new use_proxy=None/True/False override, default auto = public hosts only.
- Proxy failures are no longer swallowed: stderr note on fallback, and both errors surfaced when the direct request also fails (a bot-blocked direct response can't masquerade as proxy success).
- New test covers the vendored client against a local fake FETCH_USE_URL with fetch_use absent: proxy routing, private-host bypass (proxy never called), forced use_proxy=True, and proxy-failure fallback with dual-error message.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
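The "failures are no longer swallowed" behavior can be sketched as follows. This is a minimal sketch, not the vendored client: `fetch_with_fallback`, `FetchError`, and the injected `proxy_fetch` / `direct_fetch` callables are hypothetical names chosen so the example runs without network access.

```python
import sys
from typing import Callable


class FetchError(RuntimeError):
    """Raised when both the proxy request and the direct request fail."""


def fetch_with_fallback(
    url: str,
    proxy_fetch: Callable[[str], str],
    direct_fetch: Callable[[str], str],
) -> str:
    """Try the proxy first; on failure, note it on stderr and go direct."""
    try:
        return proxy_fetch(url)
    except Exception as proxy_err:
        # The proxy failure is reported rather than silently discarded.
        print(f"http_get: proxy failed ({proxy_err!r}); falling back to direct",
              file=sys.stderr)
        try:
            return direct_fetch(url)
        except Exception as direct_err:
            # Surface both errors so neither failure hides the other.
            raise FetchError(
                f"proxy error: {proxy_err!r}; direct error: {direct_err!r}"
            ) from direct_err
```

Passing the two fetchers in as callables keeps the fallback logic testable against fakes, similar in spirit to the local fake FETCH_USE_URL test the commit mentions.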
real_v8, gpt-5.5, 100 tasks, locked judge. $66.84 → $53.82/run.
A token meter over the baseline showed 71% of cost is uncached input (page-HTML + wasted turns replayed every model call) — the prompt is ~99% cached. So: cut turns and page-garbage, not the prompt.
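As a quick check of the headline figures, the arithmetic can be reproduced directly. Only the two run costs and the task count come from this description; the variable names are illustrative.

```python
baseline_usd = 66.84   # cost per run before the changes
new_usd = 53.82        # cost per run after the changes
tasks = 100            # tasks per run

saving_pct = (baseline_usd - new_usd) / baseline_usd * 100
per_task_before = baseline_usd / tasks
per_task_after = new_usd / tasks

print(f"{saving_pct:.1f}% cheaper per run")
print(f"${per_task_before:.2f} -> ${per_task_after:.2f} per task")
```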
The 3 changes
- lib.rs: scripts were async; the model burned whole turns polling observe ("done yet?"). Task 6 spent 57/80 turns polling and failed. Start-wait 15s→30s (most scripts return in one call), observe hint 1s→15s. Floor stays 1s, which avoids the reverted "observe30" stacking regression.
- browser_script_helpers.py: http_get fell back to raw urllib (bot-blocked) when fetch_use wasn't installed. Vendored the proxy client inline → always routes via Browser-Use Fetch (Chrome TLS + rotating IPs). Task 26: 50 → 684 stores.
- prompts/*: ~900 tokens cut. Honest: saved only ~$0.46 (it's cached) and likely caused 2 nav regressions (72, 98). Revertable as its own commit (64d301c) if they don't recover.

Fails
Runs: real-v8-everything-20260611-211259 vs real-v8-phase1merged-20260610-234722. Token deltas exact; $ priced at labeled GPT-5 rates.

Summary by cubic
Cut eval run cost by 19.5% ($66.84 → $53.82/run) on real_v8 while holding score (93 → 94) by reducing observe churn and routing public http_get calls through an unblockable fetch proxy with safe private-host bypass. Restores the browser tool prompt to main to avoid navigation regressions; other prompt trims remain.

Performance
- browser_script: raise initial wait to 30s (was 15s) so fast scripts finish in one call; raise observe hint to 15s (was 1s). The 1s observe floor remains.
- Trim the browser_script/Python/update_goal descriptions; revert the browser tool description to main (cached, minimal cost impact).

Bug Fixes
- http_get: route via the Browser-Use Fetch proxy when BROWSER_USE_API_KEY is set, but never proxy loopback/private/intranet hosts by default; add use_proxy (None=auto/public-only, True=force, False=direct); prefer fetch_use if installed, else a vendored client; on proxy failure fall back to direct and surface the proxy error.

Written for commit a55c03b. Summary will update on new commits.
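The tri-state use_proxy routing described under Bug Fixes could look roughly like the sketch below. It is not the PR's implementation: `is_public_host`, `should_use_proxy`, and the `api_key_set` parameter are hypothetical, the hostname heuristics (dotless names, `.local`, `.internal`) are assumptions, and no DNS lookup is done, so a public-looking name that resolves to a private address is not caught here.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit


def is_public_host(url: str) -> bool:
    """Best-effort check that a URL points at a public host (no DNS lookup)."""
    host = urlsplit(url).hostname
    if not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Assumption: dotless names ("intranet") and
        # .local/.internal suffixes are treated as private.
        return "." in host and not host.endswith((".local", ".internal"))
    # False for loopback, RFC 1918 private ranges, and link-local addresses.
    return ip.is_global


def should_use_proxy(url: str, use_proxy: Optional[bool] = None,
                     api_key_set: bool = True) -> bool:
    """None = auto (public hosts only), True = force proxy, False = direct.

    Assumption: without an API key the proxy is never used, even when forced.
    """
    if not api_key_set or use_proxy is False:
        return False
    if use_proxy is True:
        return True
    return is_public_host(url)


print(should_use_proxy("https://example.com/stores"))
print(should_use_proxy("http://127.0.0.1:8000/"))
print(should_use_proxy("http://localhost:3000/", use_proxy=True))
```

Defaulting to public-only keeps URLs and headers for internal services away from the third-party proxy unless the caller explicitly opts in.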