Fix benchmark regression by gregpr07 · Pull Request #102 · browser-use/terminal

gregpr07 · 2026-06-10T04:40:02Z

Summary by cubic

Restore 88-baseline behavior and add quick wins: faster observe polling, full script output visibility, safer/clearer completion, and smarter prompts/audits to cut wasted turns and finish runs with correct results. Also adds a k=2 validation report showing K1=82, K2=90 (mean 86).

Bug Fixes
- Observe: default back to 1s (agent and browser lib), removed the hidden 30s eval coupling, and when a run is already finished, observe now says so and stops suggesting more polls.
- Output visibility: raised inline browser_script stdout cap to 120KB (matches script limit) and updated truncation guidance so the model recovers missing data instead of guessing.
- Completion: successful done is terminal; if a bounded run ends on plain text, the loop gives one “call done” nudge; if the model ends without done, emit the best result.* artifact from cwd as the final result.
- Compaction: fixed empty-summary handling so it falls back to “(no summary available)” before prefixing, preventing post-compaction amnesia.
- Audits and prompts: null and "" no longer count as placeholders; wording-based rejections removed; prompts drop the one-pass ceiling, add a visual fallback, forbid fabricated/proxy values, require basic reality checks, and add efficiency rules (chunking, checkpointing, limited retries, concurrent fetch, and observe discipline, including reading checkpoint files).
- Unified exec: added timeout_ms with a safe default; on timeout, the entire process group is killed to prevent stray processes.
- Added BROWSER_USE_DISABLE_LOCAL_SEARCH to optionally disable local search while keeping hosted web_search.
- Docs: added k=2 quick-pack validation report (K1=82, K2=90, mean 86).

Dependencies
- Added libc for process-group management on Unix.

Written for commit da14329. Summary will update on new commits.

Snapshot of the exact code that produced the real_v8 81/100 run (root: but-fix-main-policy-runtime-eval-...-20260609-155756) so the regression baseline is reproducible and subsequent restore-88 fixes land as a clean diff on top. No behavior change; just commits the 20 previously-uncommitted source files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Revert the "observe30" regression: DEFAULT_OBSERVE_TIMEOUT_MS and BROWSER_SCRIPT_DEFAULT_OBSERVE_MS 30_000 -> 1_000 (the pre-PR-#60 / 88 baseline value). The 30s default + 30s clamp-floor blocked each observe up to 30s and burned the run timebox, leaving long-script tasks unfinished (real_v8 tasks 1, 4 never emitted session.done).

Raise MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 4KB -> 16KB so large extractions aren't truncated into re-scrape loops.

Ablation rung 1 of the 81->88 restore. Measured independently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
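To make the cost of the 30s default concrete, here is a minimal sketch of how an observe timeout might be resolved and how the worst-case blocked time scales with it. `effective_observe_ms` and `worst_case_blocked_ms` are hypothetical helpers, not functions from this PR; only the constant name and the 1_000 / 30_000 values come from the commit message.

```rust
/// Default observe timeout restored by this commit (the 88-baseline value).
const DEFAULT_OBSERVE_TIMEOUT_MS: u64 = 1_000;

/// Hypothetical helper: resolve the timeout for a single observe call.
/// `requested` is what the caller asked for; `max_cap` is an optional upper bound.
/// There is deliberately no lower clamp, so short polls stay short.
fn effective_observe_ms(requested: Option<u64>, max_cap: Option<u64>) -> u64 {
    let wanted = requested.unwrap_or(DEFAULT_OBSERVE_TIMEOUT_MS);
    match max_cap {
        Some(cap) => wanted.min(cap),
        None => wanted,
    }
}

/// Worst-case time spent blocked when `polls` observes each wait `timeout_ms`.
fn worst_case_blocked_ms(polls: u64, timeout_ms: u64) -> u64 {
    polls.saturating_mul(timeout_ms)
}
```

Under these assumptions, 20 polls at 30_000 ms can block for up to 10 minutes, while the same 20 polls at 1_000 ms block for at most 20 seconds.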

…older collect_json_placeholder_stats counted Value::Null as a placeholder, so a result that kept required-but-unavailable fields as null tripped the >=30% placeholder rejection. That pushed the agent to DELETE required fields to pass the audit (real_v8 task 53: filed_time/timezone_shown dropped) and double-rejected legitimately-sparse results (task 41).

Count null toward the denominator but not as a placeholder. Applies to both the inline `result` and `result_file` audit paths. Keeps literal evasion strings ("unknown"/"n/a"/...) as placeholders.

Follow-up (needs measurement): result_file audit reads a preview, so a large file with a clean head can still slip a high full-file null rate past the audit (task 94). Left for the measured pass.

Ablation rung 3 of the 81->88 restore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
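The counting rule described above can be sketched as follows. This is an illustration only: `Json`, `Stats`, `collect`, `is_evasion`, and `rejected` are hypothetical stand-ins (the real code works on `Value`), and the evasion list is abbreviated to the two strings the commit message names. The >=30% threshold is taken from the message.

```rust
/// Minimal stand-in for a JSON value (hypothetical; not the crate's type).
enum Json {
    Null,
    Str(String),
    Num(f64),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

#[derive(Debug, Default, PartialEq)]
struct Stats {
    leaves: usize,       // denominator: every leaf value, including null
    placeholders: usize, // numerator: literal evasion strings only
}

/// Abbreviated evasion list; the real list is longer.
fn is_evasion(s: &str) -> bool {
    matches!(s.trim().to_ascii_lowercase().as_str(), "unknown" | "n/a")
}

fn collect(value: &Json, stats: &mut Stats) {
    match value {
        // null counts toward the denominator but is not a placeholder
        Json::Null | Json::Num(_) => stats.leaves += 1,
        Json::Str(s) => {
            stats.leaves += 1;
            if is_evasion(s) {
                stats.placeholders += 1;
            }
        }
        Json::Arr(items) => items.iter().for_each(|v| collect(v, stats)),
        Json::Obj(fields) => fields.iter().for_each(|(_, v)| collect(v, stats)),
    }
}

/// Reject when placeholders make up at least 30% of the leaves.
fn rejected(stats: &Stats) -> bool {
    stats.leaves > 0 && stats.placeholders * 100 >= stats.leaves * 30
}
```

With this rule, a record that keeps `filed_time` and `timezone_shown` as null passes the audit, so the agent has no incentive to delete those fields.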

… fallback

The "at most one targeted repair pass" instruction appeared in four prompt files and drove premature finalize on incomplete data (real_v8 task 32: 17 vs 64 turns, whole source categories never visited). Replace the one-pass ceiling with time-bounded repair: keep repairing the specific missing items until required rows/fields are satisfied or the run timebox is nearly spent; only avoid blindly restarting a whole fluctuating crawl.

Add a visual-fallback mandate to the system prompt: if a script / http_get / browser_fetch / endpoint / selector fails, returns empty, or is blocked, fall back to navigating and reading the rendered page before marking anything unavailable; do not re-run the same failing script or drop the source (the dominant strategy regression across tasks 38,39,47,67).

Files: browser-agent-system.md, dataset-case-user.md, python-tool-description.md, browser-script-tool-description.md.

Ablation rung 2 of the 81->88 restore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…le audit/observe

Rung 0 (the big one): MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 16KB -> 120KB (matches SCRIPT_MAX_OUTPUT_CHARS). The 4KB cap (born in the 75-era code, absent at the 88 baseline) blinded the model to its own script output: truncations 0->619->780 across 88->75->81 and KeyError-class blind-guess bugs 8->41->53 at constant script volume. Codex parity: fresh tool output is never capped; history truncation (context/mod.rs policy*1.2) handles growth. Also reword the truncation notice: it told the model to use "a narrower extraction instead of re-reading", actively training it not to recover missing data.

Rung 4: port fallback_result_file_for_session from exp/real-v8-restore-88 (2bd479d). When the model ends without done(), emit session.done carrying the best result.* artifact from cwd instead of losing finished work (real_v8 tasks 1, 45, earlier 1/4/41/90).

Decouple BROWSER_USE_EVAL_DONE_AUDIT from observe timing: the audit flag silently capped observes at 30s (hidden cross-subsystem coupling); observe caps now come only from BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS.

Prompt: forbid fabricated/pattern-guessed values; honest null/"not found" with checked source is acceptable (task 68 fabricated emails then disavowed).

Tests: updated 3 assertions to the new cap/clamp/notice. The 6 remaining failures (5 prompts::tests + stored_cloud_preference) pre-date this work: they assert phrases absent from prompts at BOTH baselines.

Stream idle-timeout (task 22 class) verified already present in this lineage (provider.rs:1081, default 300s); no port needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
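As an illustration of the cap and the reworded notice, the sketch below truncates inline stdout on a UTF-8 boundary and says how much was dropped. `cap_inline_stdout` and the notice wording are hypothetical, and reading "120KB" as 120 * 1024 bytes is an assumption; only the constant name comes from the commit message.

```rust
/// Inline cap named in the commit message; 120 * 1024 is an assumed reading of "120KB".
const MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES: usize = 120 * 1024;

/// Hypothetical helper: keep at most `cap` bytes of `stdout`, cutting on a
/// UTF-8 character boundary, and append a notice when anything was removed.
fn cap_inline_stdout(stdout: &str, cap: usize) -> String {
    if stdout.len() <= cap {
        return stdout.to_string();
    }
    // Walk back to the nearest character boundary (index 0 is always one).
    let mut end = cap;
    while !stdout.is_char_boundary(end) {
        end -= 1;
    }
    let omitted = stdout.len() - end;
    // Illustrative wording: point the reader at recovery, not at guessing.
    format!(
        "{}\n[truncated: {} bytes omitted; re-read the missing part instead of guessing]",
        &stdout[..end],
        omitted
    )
}
```

A call such as `cap_inline_stdout(&output, MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES)` leaves typical extractions untouched and only annotates the rare oversized one.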

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cubic-dev-ai

3 issues found across 22 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/browser-use-agent/src/tools/handlers/done.rs">

<violation number="1" location="crates/browser-use-agent/src/tools/handlers/done.rs:311">
P2: `read_result_file_preview` reads the full file despite the preview cap, causing avoidable large I/O and memory usage.</violation>
</file>

<file name="crates/browser-use-browser/src/lib.rs">

<violation number="1" location="crates/browser-use-browser/src/lib.rs:279">
P2: Completed browser_script cache limits were increased enough to risk significant memory growth in long-running sessions.</violation>
</file>

<file name="crates/browser-use-agent/src/entrypoint/mod.rs">

<violation number="1" location="crates/browser-use-agent/src/entrypoint/mod.rs:2700">
P1: `session.done` is emitted for aborted turns because `Ok(...)` from `TurnLoop` includes interruption/max-turns paths.</violation>
</file>


cubic-dev-ai · 2026-06-10T04:43:22Z

+                        result_file = Some(fallback.file);
+                    }
+                }
+                if let Some(text) = final_message.as_deref() {


P1: session.done is emitted for aborted turns because Ok(...) from TurnLoop includes interruption/max-turns paths.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At crates/browser-use-agent/src/entrypoint/mod.rs, line 2700:

<comment>`session.done` is emitted for aborted turns because `Ok(...)` from `TurnLoop` includes interruption/max-turns paths.</comment>

<file context>
@@ -2611,10 +2674,83 @@ impl<Sd: SamplingDriver> RuntimeTurnLoopDriver<Sd> {
+                        result_file = Some(fallback.file);
+                    }
+                }
+                if let Some(text) = final_message.as_deref() {
+                    runtime_handle.append_observed_session_event(
+                        runtime_session_id,
</file context>
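One way to address the reviewer's concern is to have the loop result say how the turn ended, so the caller does not treat every `Ok(...)` as completion. The sketch below is hypothetical: `TurnEnd` and `should_emit_session_done` are illustrative names, and the real `TurnLoop` return type may look quite different. Whether a max-turns exit should still emit a fallback result is a policy choice left to the authors.

```rust
/// Hypothetical description of how a turn loop ended.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TurnEnd {
    Completed,
    Interrupted,
    MaxTurnsReached,
}

/// Emit `session.done` only when the loop reports a genuine completion;
/// interruptions, turn-cap exits, and errors do not qualify in this sketch.
fn should_emit_session_done(result: &Result<TurnEnd, String>) -> bool {
    matches!(result, Ok(TurnEnd::Completed))
}
```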

cubic-dev-ai · 2026-06-10T04:43:22Z

+        return Err(format!("result_file `{}` is empty", path));
+    }
+
+    let bytes = fs::read(&resolved).map_err(|error| {


P2: read_result_file_preview reads the full file despite the preview cap, causing avoidable large I/O and memory usage.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At crates/browser-use-agent/src/tools/handlers/done.rs, line 311:

<comment>`read_result_file_preview` reads the full file despite the preview cap, causing avoidable large I/O and memory usage.</comment>

<file context>
@@ -224,3 +229,240 @@ impl ToolRuntime<DoneRequest, ExecOutput> for DoneTool {
+        return Err(format!("result_file `{}` is empty", path));
+    }
+
+    let bytes = fs::read(&resolved).map_err(|error| {
+        format!(
+            "result_file `{}` could not be read at {} ({error})",
</file context>
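A bounded read is one possible fix for the issue above. The following is a minimal sketch using only the standard library; `read_preview` is a hypothetical helper, not the PR's `read_result_file_preview`, and the cap value is left to the caller.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical helper: read at most `cap` bytes from `path`,
/// so memory use is bounded by the cap rather than by the file size.
fn read_preview(path: &Path, cap: u64) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let mut buf = Vec::new();
    // `take` limits how many bytes `read_to_end` can pull from the file.
    file.take(cap).read_to_end(&mut buf)?;
    Ok(buf)
}
```

Because the cut can land in the middle of a multi-byte character, a caller that needs text would likely decode the bytes with `String::from_utf8_lossy`.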

cubic-dev-ai · 2026-06-10T04:43:22Z

+const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 60 * 60 * 1_000;
+const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 2048;


P2: Completed browser_script cache limits were increased enough to risk significant memory growth in long-running sessions.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At crates/browser-use-browser/src/lib.rs, line 279:

<comment>Completed browser_script cache limits were increased enough to risk significant memory growth in long-running sessions.</comment>

<file context>
@@ -276,8 +276,8 @@ static MANAGED_BROWSER_PIDS: OnceLock<Mutex<HashSet<u32>>> = OnceLock::new();
 static BROWSER_SCRIPT_RUN_COUNTER: AtomicU64 = AtomicU64::new(1);
-const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 10 * 60 * 1_000;
-const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 128;
+const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 60 * 60 * 1_000;
+const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 2048;
 const MANAGED_BROWSER_PROFILE_PREFIX: &str = "but-managed-browser.";
</file context>

Suggested change

-const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 60 * 60 * 1_000;
-const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 2048;
+const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 10 * 60 * 1_000;
+const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 128;
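To show why these two limits bound memory, here is a small sketch of a completed-run cache that evicts by age and by count. `CompletedCache` and `worst_case_bytes` are hypothetical and do not mirror the crate's data structure; timestamps are passed in as milliseconds so the behavior is deterministic.

```rust
use std::collections::VecDeque;

/// Limits the reviewer suggests restoring (values from the diff above).
const CACHE_TTL_MS: u128 = 10 * 60 * 1_000;
const CACHE_MAX: usize = 128;

/// Hypothetical cache of finished runs: (run id, insertion time in ms, output), oldest first.
struct CompletedCache {
    entries: VecDeque<(u64, u128, String)>,
}

impl CompletedCache {
    fn new() -> Self {
        Self { entries: VecDeque::new() }
    }

    fn insert(&mut self, run_id: u64, now_ms: u128, output: String) {
        self.entries.push_back((run_id, now_ms, output));
        self.prune(now_ms);
    }

    /// Drop the oldest entries while they are expired or the cache is over its size limit.
    fn prune(&mut self, now_ms: u128) {
        loop {
            let drop_oldest = match self.entries.front() {
                Some((_, inserted_ms, _)) => {
                    now_ms.saturating_sub(*inserted_ms) > CACHE_TTL_MS
                        || self.entries.len() > CACHE_MAX
                }
                None => false,
            };
            if !drop_oldest {
                break;
            }
            self.entries.pop_front();
        }
    }
}

/// Upper bound on retained output if every entry holds `per_entry_bytes`.
fn worst_case_bytes(max_entries: usize, per_entry_bytes: usize) -> usize {
    max_entries.saturating_mul(per_entry_bytes)
}
```

If each cached entry could hold a full 120 * 1024-byte output (an assumption, not something the PR states), 2048 entries retain up to 240 MiB, while 128 entries retain up to 15 MiB.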

…udges, discipline rules)

Quick wins (evidence: 100-task deep-dive of the 85-run):
- done-audit: "" no longer counts as a placeholder. Tasks explicitly mandate "" for missing values; counting it rejected a spec-correct answer and coerced literal placeholder prose (real_v8 task 94).
- done-audit: remove phrase-based wording rejections (text_audit_reasons). Zero measured wins; measured harms: agents resubmitted identical data with laundered confident phrasing (75, 77) and burned 5.7min post-rejection on a source-exhausted spec (87). Evidence checks (empty/placeholder JSON) stay.
- compaction: the "(no summary available)" fallback checked the full string AFTER the prefix was prepended, so it was dead code; an empty summarize() pass shipped "...Here is the summary...:" + NOTHING (task 24: amnesia, 30 thrash turns, task lost). Fallback now applies to the suffix.
- loop: CALL_DONE_NUDGE. In bounded (max_turns) runs, a text-only completion (finish_reason Stop/None, never done()'s ToolUse) gets ONE developer nudge to call done() before the text is accepted as final (tasks 24/45 class).
- prompts: anti-proxy/laundering rule (negative control for identifiers; no field-copied proxies; task-mandated representations used literally) and external reality-probe requirement in collection audits (site-stated total / largest-entity check; tasks 31/32/36/72/74/75).

Efficiency pack:
- prompts: chunk-under-timeout + checkpoint-every-item + never-retry-monolith; concurrent fetch for >=3 independent URLs (task 68: 4min serial vs 0.4s); done(result_file) instead of re-typing large artifacts (~10 min/run).
- browser_script description: observe discipline (one generous poll; read checkpoint files; stop after completed/not_found) + js() return-shape and regex-escaping rules (~50 self-inflicted failures/run).
- observe: repeat-observe of a finished run now appends an unmissable "already completed - stop polling" note and clears next_observe_ms.

Tests: 1058 pass; 3 assertions updated to the new intended behavior; the 6 remaining failures pre-date this branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
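The compaction bug described in this commit is an ordering problem, which the sketch below reproduces in miniature. `build_summary_buggy`, `build_summary_fixed`, and the prefix wording are hypothetical; only the "(no summary available)" fallback text comes from the commit message.

```rust
/// Hypothetical prefix wording; the real prompt text is elided in the commit message.
const SUMMARY_PREFIX: &str = "Here is the summary of the earlier conversation:\n";
const NO_SUMMARY: &str = "(no summary available)";

/// Buggy order: the emptiness check runs after the prefix is prepended,
/// so the combined string is never empty and the fallback is dead code.
fn build_summary_buggy(summary: &str) -> String {
    let full = format!("{}{}", SUMMARY_PREFIX, summary);
    if full.trim().is_empty() {
        NO_SUMMARY.to_string()
    } else {
        full
    }
}

/// Fixed order: apply the fallback to the suffix first, then prepend the prefix.
fn build_summary_fixed(summary: &str) -> String {
    let suffix = if summary.trim().is_empty() { NO_SUMMARY } else { summary };
    format!("{}{}", SUMMARY_PREFIX, suffix)
}
```

With an empty summary, the buggy version returns the bare prefix, while the fixed version returns the prefix followed by the fallback text.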

cubic-dev-ai

1 issue found across 10 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/browser-use-agent/src/turn/loop_driver.rs">

<violation number="1" location="crates/browser-use-agent/src/turn/loop_driver.rs:253">
P2: The new done-nudge path can exceed `max_turns` by one sampling round, violating the bounded-run cap.</violation>
</file>


cubic-dev-ai · 2026-06-10T05:20:34Z

+                    // to call done before we accept the text as final.
+                    let ended_without_done =
+                        !matches!(outcome.finish_reason, Some(FinishReason::ToolUse));
+                    if max_turns.is_some() && ended_without_done && !done_nudge_sent {


P2: The new done-nudge path can exceed max_turns by one sampling round, violating the bounded-run cap.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At crates/browser-use-agent/src/turn/loop_driver.rs, line 253:

<comment>The new done-nudge path can exceed `max_turns` by one sampling round, violating the bounded-run cap.</comment>

<file context>
@@ -229,6 +244,18 @@ impl<St: TurnState, Sd: SamplingDriver, Ob: TurnObserver> TurnLoop<St, Sd, Ob> {
+                    // to call done before we accept the text as final.
+                    let ended_without_done =
+                        !matches!(outcome.finish_reason, Some(FinishReason::ToolUse));
+                    if max_turns.is_some() && ended_without_done && !done_nudge_sent {
+                        done_nudge_sent = true;
+                        pending_done_nudge = true;
</file context>

Suggested change

-if max_turns.is_some() && ended_without_done && !done_nudge_sent {
+if max_turns.is_some_and(|limit| turns_run < limit) && ended_without_done && !done_nudge_sent {
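The difference between the two conditions can be checked with a toy simulation. This is a sketch under assumptions: `should_send_done_nudge` and `rounds_used` are hypothetical, the model is assumed to always answer in plain text, and `turns_run` is assumed to be incremented before the check, which may not match the real loop.

```rust
/// Guarded condition from the suggestion: nudge only while turn budget remains.
fn should_send_done_nudge(
    max_turns: Option<usize>,
    turns_run: usize,
    ended_without_done: bool,
    done_nudge_sent: bool,
) -> bool {
    max_turns.is_some_and(|limit| turns_run < limit) && ended_without_done && !done_nudge_sent
}

/// Count sampling rounds for a model that always replies with plain text.
/// `guard_budget` selects the suggested condition; `false` uses the original one.
fn rounds_used(max_turns: Option<usize>, guard_budget: bool) -> usize {
    let mut turns_run = 0;
    let mut done_nudge_sent = false;
    loop {
        turns_run += 1; // one sampling round; the reply is text, not done()
        let ended_without_done = true;
        let nudge = if guard_budget {
            should_send_done_nudge(max_turns, turns_run, ended_without_done, done_nudge_sent)
        } else {
            max_turns.is_some() && ended_without_done && !done_nudge_sent
        };
        if !nudge {
            return turns_run;
        }
        done_nudge_sent = true; // the nudge costs one more sampling round
    }
}
```

In this model, a run capped at one turn uses two rounds with the original condition and one round with the guarded condition, which is the overshoot the reviewer describes.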

… 88-baseline), mean 86 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gregpr07 and others added 7 commits June 8, 2026 16:03

Fix runtime terminal completion barrier

bc45a39

docs: 100-task deep-dive synthesis for the 85-score run

d8d4390

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cubic-dev-ai Bot reviewed Jun 10, 2026


docs: k=2 quick-pack validation — K1=82, K2=90 (first run to beat the…

da14329

… 88-baseline), mean 86 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gregpr07 wants to merge 9 commits into main from fix/real-v8-restore-88-v2
