Fix/real v8 simple by gregpr07 · Pull Request #105 · browser-use/terminal

gregpr07 · 2026-06-10T17:01:31Z

Summary by cubic

Restores pre-regression real_v8 behavior with faster observe polling, full browser_script output visibility, terminal completion capture, and safer process execution. Improves eval reliability and preserves results even without an explicit done().

New Features
- Successful done is terminal in the fused loop; if no done, emit a final result from the best result* file in the session cwd.
- exec_command adds timeout_ms and, on timeout, kills the whole process group (Unix, via libc); the default cap is DEFAULT_EXEC_COMMAND_TIMEOUT_MS=600_000.
- Observe control: defaults set to 1s in browser-use-agent and browser-use-browser; eval caps now only follow BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS. Added BROWSER_USE_DISABLE_LOCAL_SEARCH to disable the local search tool.
- Tool schema/docs: exec_command includes timeout_ms; done requires a complete, verified result with a step-exhaustion fallback.

Bug Fixes
- Inline browser_script stdout cap raised to 120 KB (matches SCRIPT_MAX_OUTPUT_CHARS); truncation guidance updated so the model re-reads artifacts when needed.
- Compaction applies "(no summary available)" to the suffix before the prefix to avoid losing context on empty summaries.
- Done-audit treats null and empty string values as genuine absences, not placeholders; reduces false rejections.
- Prompts remove the one-pass repair ceiling, add visual fallbacks, and forbid fabricated values.
- Reverts the 30s observe-window regression to 1s to prevent long polls from burning the run timebox.

Written for commit 84703c1. Summary will update on new commits.

Snapshot of the exact code that produced the real_v8 81/100 run (root: but-fix-main-policy-runtime-eval-...-20260609-155756) so the regression baseline is reproducible and subsequent restore-88 fixes land as a clean diff on top. No behavior change; just commits the 20 previously-uncommitted source files. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Revert the "observe30" regression: DEFAULT_OBSERVE_TIMEOUT_MS and BROWSER_SCRIPT_DEFAULT_OBSERVE_MS 30_000 -> 1_000 (the pre-PR-#60 / 88 baseline value). The 30s default + 30s clamp-floor blocked each observe up to 30s and burned the run timebox, leaving long-script tasks unfinished (real_v8 tasks 1, 4 never emitted session.done). Raise MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 4KB -> 16KB so large extractions aren't truncated into re-scrape loops. Ablation rung 1 of the 81->88 restore. Measured independently. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
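The observe-window change can be pictured with a small sketch. It assumes the effective window is the requested value (or the 1s default) clamped by an optional eval cap. Only `DEFAULT_OBSERVE_TIMEOUT_MS` and the env var name come from the commit messages in this PR; the function names, and the choice to treat an unparsable or zero cap as "no cap", are hypothetical.

```rust
/// Default observe window after the revert (1s, per the commit message).
pub const DEFAULT_OBSERVE_TIMEOUT_MS: u64 = 1_000;

/// Parses the raw value of BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS.
/// Assumption: an unset, unparsable, or zero value means "no cap".
pub fn parse_cap(raw: Option<&str>) -> Option<u64> {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|ms| *ms > 0)
}

/// Hypothetical helper: picks the observe window for one call.
/// The requested value falls back to the default and is then clamped
/// from above by the cap, never raised to a floor.
pub fn effective_observe_timeout_ms(requested: Option<u64>, cap: Option<u64>) -> u64 {
    let wanted = requested.unwrap_or(DEFAULT_OBSERVE_TIMEOUT_MS);
    match cap {
        Some(limit) => wanted.min(limit),
        None => wanted,
    }
}
```

A caller would read the env var with `std::env::var(...).ok()` and pass `as_deref()` into `parse_cap`. The point of the sketch is that nothing acts as a floor, so a short observe stays short.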

…older collect_json_placeholder_stats counted Value::Null as a placeholder, so a result that kept required-but-unavailable fields as null tripped the >=30% placeholder rejection. That pushed the agent to DELETE required fields to pass the audit (real_v8 task 53: filed_time/timezone_shown dropped) and double-rejected legitimately-sparse results (task 41). Count null toward the denominator but not as a placeholder. Applies to both the inline `result` and `result_file` audit paths. Keeps literal evasion strings ("unknown"/"n/a"/...) as placeholders. Follow-up (needs measurement): result_file audit reads a preview, so a large file with a clean head can still slip a high full-file null rate (task 94). Left for the measured pass. Ablation rung 3 of the 81->88 restore. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
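A minimal sketch of the counting rule described above, written against a hand-rolled `Json` enum so that it stays self-contained (the real code works on `Value`, and its traversal details are not shown in this PR). The struct, the function names, and the two-entry evasion list are assumptions; the ≥30% threshold and the treatment of null come from the commit message.

```rust
/// Stand-in for a JSON value; the real audit uses a JSON library type.
#[derive(Debug, Clone)]
pub enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

#[derive(Debug, Default, PartialEq)]
pub struct PlaceholderStats {
    pub leaves: usize,       // denominator: every leaf value
    pub placeholders: usize, // numerator: literal evasion strings only
}

/// Abbreviated list; the commit message names "unknown"/"n/a"/... .
const EVASIONS: [&str; 2] = ["unknown", "n/a"];

pub fn collect_stats(value: &Json, stats: &mut PlaceholderStats) {
    match value {
        Json::Arr(items) => {
            for item in items {
                collect_stats(item, stats);
            }
        }
        Json::Obj(fields) => {
            for (_, v) in fields {
                collect_stats(v, stats);
            }
        }
        Json::Str(s) => {
            stats.leaves += 1;
            let lowered = s.trim().to_ascii_lowercase();
            // An empty string is not in the list, so it is not a placeholder.
            if EVASIONS.iter().any(|e| lowered == *e) {
                stats.placeholders += 1;
            }
        }
        // Null, Bool, Num: counted in the denominator, never as placeholders.
        _ => stats.leaves += 1,
    }
}

/// Rejects when placeholders make up at least 30% of the leaves.
pub fn is_rejected(stats: &PlaceholderStats) -> bool {
    stats.leaves > 0 && stats.placeholders * 10 >= stats.leaves * 3
}
```

With this rule, a sparse result that keeps required fields as null no longer trips the threshold, while a result padded with "unknown" still does.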

… fallback The "at most one targeted repair pass" instruction appeared in four prompt files and drove premature finalize on incomplete data (real_v8 task 32: 17 vs 64 turns, whole source categories never visited). Replace the one-pass ceiling with time-bounded repair: keep repairing the specific missing items until required rows/fields are satisfied or the run timebox is nearly spent; only avoid blindly restarting a whole fluctuating crawl. Add a visual-fallback mandate to the system prompt: if a script / http_get / browser_fetch / endpoint / selector fails, returns empty, or is blocked, fall back to navigating and reading the rendered page before marking anything unavailable — do not re-run the same failing script or drop the source (the dominant strategy regression across tasks 38,39,47,67). Files: browser-agent-system.md, dataset-case-user.md, python-tool-description.md, browser-script-tool-description.md. Ablation rung 2 of the 81->88 restore. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…le audit/observe Rung 0 (the big one): MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 16KB -> 120KB (matches SCRIPT_MAX_OUTPUT_CHARS). The 4KB cap (born in the 75-era code, absent at the 88 baseline) blinded the model to its own script output: truncations 0->619->780 across 88->75->81 and KeyError-class blind-guess bugs 8->41->53 at constant script volume. Codex parity: fresh tool output is never capped; history truncation (context/mod.rs policy*1.2) handles growth. Also reword the truncation notice — it told the model to use "a narrower extraction instead of re-reading", actively training it not to recover missing data. Rung 4: port fallback_result_file_for_session from exp/real-v8-restore-88 (2bd479d) — when the model ends without done(), emit session.done carrying the best result.* artifact from cwd instead of losing finished work (real_v8 tasks 1, 45, earlier 1/4/41/90). Decouple BROWSER_USE_EVAL_DONE_AUDIT from observe timing: the audit flag silently capped observes at 30s (hidden cross-subsystem coupling); observe caps now come only from BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS. Prompt: forbid fabricated/pattern-guessed values; honest null/"not found" with checked source is acceptable (task 68 fabricated emails then disavowed). Tests: updated 3 assertions to the new cap/clamp/notice. The 6 remaining failures (5 prompts::tests + stored_cloud_preference) pre-date this work — they assert phrases absent from prompts at BOTH baselines. Stream idle-timeout (task 22 class) verified already present in this lineage (provider.rs:1081, default 300s) — no port needed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
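As an illustration of the inline cap discussed above, here is a hedged sketch of byte-capped truncation that stays on a UTF-8 character boundary. The function name and the notice wording are hypothetical, and the cap is passed in as a parameter because the exact byte value behind "120KB" is not stated here.

```rust
/// Returns the text to inline plus a flag saying whether anything was cut.
/// `cap_bytes` would be MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES in the real code.
pub fn cap_inline_stdout(stdout: &str, cap_bytes: usize) -> (String, bool) {
    if stdout.len() <= cap_bytes {
        return (stdout.to_string(), false);
    }
    // Back up to the nearest char boundary so the slice is valid UTF-8.
    let mut end = cap_bytes;
    while !stdout.is_char_boundary(end) {
        end -= 1;
    }
    let mut out = stdout[..end].to_string();
    // Hypothetical notice: point the model back at the full artifact
    // rather than discouraging a re-read.
    out.push_str("\n[output truncated; re-read the saved artifact for the rest]");
    (out, true)
}
```

The notice text matters as much as the cap: the commit message explains that the earlier wording steered the model away from recovering the missing data.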

Base = the 85-run (rung0-4, commit 0322b8c): full 120KB output visibility, terminal artifact fallback, done-audit wording-police, rung-2 prompts — all unchanged. The model's PROMPT is frozen at the 85-run; deliberately none of the quick-pack prompt heuristics (anti-laundering / reality-probe / chunking / js() rules) and NOT the wording-police removal are carried over — they tested flat-to-worse and below the ±4 run-to-run variance. Added (deterministic, code-only, model never sees them): - done-audit: empty string "" is no longer a placeholder (like null). Many tasks mandate "" for missing values; counting it rejected spec-correct answers and coerced placeholder prose (real_v8 task 94). Zero downside. - compaction: apply the "(no summary available)" fallback to the summary SUFFIX before the prefix is prepended — the existing fallback was dead code (prefix made the string non-empty), so an empty summarize() pass shipped PREFIX + "" and the resumed model had total amnesia (real_v8 task 24). Tests: 1058 pass; 6 failures pre-date the base (prompt-phrase asserts + a stored_cloud_preference test absent from this lineage). Next (separate, deterministic FLOOR fixes — not prompt heuristics): no-op control-call loop-breaker, and broaden terminal finalize to any result-shaped artifact. Those raise the floor under unlucky trajectories (the real driver of the 82<->90 variance) and can't be washed out by sampling noise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
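The compaction fix is easiest to see side by side. The sketch below is an assumption-laden reconstruction: the function names are hypothetical and the real code paths are not shown in this PR, but the ordering bug is the one the commit message describes.

```rust
pub const NO_SUMMARY: &str = "(no summary available)";

/// Old shape (as described): the fallback is checked after the prefix
/// is prepended, so it can never fire when the prefix is non-empty.
pub fn compacted_message_before(prefix: &str, summary: &str) -> String {
    let joined = format!("{}{}", prefix, summary);
    if joined.is_empty() {
        NO_SUMMARY.to_string()
    } else {
        joined
    }
}

/// New shape: apply the fallback to the summary suffix first, then
/// prepend the prefix.
pub fn compacted_message_after(prefix: &str, summary: &str) -> String {
    let suffix = if summary.trim().is_empty() {
        NO_SUMMARY
    } else {
        summary
    };
    format!("{}{}", prefix, suffix)
}
```

With a non-empty prefix and an empty summary, the first function returns the bare prefix, which is the "total amnesia" case, while the second always carries at least the fallback text.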

…laceholder (task 94) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cubic-dev-ai

3 issues found across 23 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/browser-use-agent/src/tools/handlers/done.rs">

<violation number="1" location="crates/browser-use-agent/src/tools/handlers/done.rs:261">
P2: Non-UTF8 `result_file` content is incorrectly treated as a valid material answer, allowing weak/empty completions to pass audit.</violation>
</file>

<file name="crates/browser-use-agent/src/tools/handlers/done_tests.rs">

<violation number="1" location="crates/browser-use-agent/src/tools/handlers/done_tests.rs:195">
P2: The missing-file test uses a shared temp-dir filename, which can make the assertion flaky when that file already exists.</violation>
</file>

<file name="crates/browser-use-agent/src/entrypoint/mod.rs">

<violation number="1" location="crates/browser-use-agent/src/entrypoint/mod.rs:2685">
P1: Successful runs can fail before `session.done` if terminal cleanup errors, causing lost final result emission.</violation>
</file>


cubic-dev-ai · 2026-06-10T17:07:28Z

+                    runtime_handle.clone(),
+                    runtime_session_id.clone(),
+                )
+                .await?;


P1: Successful runs can fail before session.done if terminal cleanup errors, causing lost final result emission.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/entrypoint/mod.rs, line 2685: <comment>Successful runs can fail before `session.done` if terminal cleanup errors, causing lost final result emission.</comment> <file context> @@ -2611,10 +2674,83 @@ impl<Sd: SamplingDriver> RuntimeTurnLoopDriver<Sd> { + runtime_handle.clone(), + runtime_session_id.clone(), + ) + .await?; + // Never lose finished work: if the model ended without a final + // message (no done() call — e.g. ended on leaked planning text </file context>
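If the issue is valid, one possible shape for a fix is to hold the cleanup outcome instead of returning early with `?`, emit the final result, and only then surface the cleanup error. The sketch below is generic and hypothetical: `finish_run`, its closures, and the `String` error type are stand-ins, not the signatures used in `entrypoint/mod.rs`.

```rust
/// Runs terminal cleanup, then emits the final result, then reports
/// any cleanup error. The early-return `?` is deliberately avoided so a
/// cleanup failure cannot swallow finished work.
pub fn finish_run<C, E>(cleanup: C, mut emit_done: E, result: Option<String>) -> Result<(), String>
where
    C: FnOnce() -> Result<(), String>,
    E: FnMut(&str),
{
    // Keep the outcome; do not propagate yet.
    let cleanup_outcome = cleanup();

    // Emit session.done (represented here by the callback) regardless.
    if let Some(payload) = result.as_deref() {
        emit_done(payload);
    }

    // The cleanup error is still reported to the caller afterwards.
    cleanup_outcome
}
```

Whether the cleanup error should then fail the run or only be logged is a design choice this sketch leaves open; it returns the error so the caller can decide.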

cubic-dev-ai · 2026-06-10T17:07:28Z

+                has_material_answer = true;
+                if let Some(preview) = preview {
+                    reasons.extend(json_audit_reasons(&preview));
+                    if !audit_text.is_empty() {
+                        audit_text.push('\n');
+                    }
+                    audit_text.push_str(&preview);
+                }


P2: Non-UTF8 result_file content is incorrectly treated as a valid material answer, allowing weak/empty completions to pass audit.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/tools/handlers/done.rs, line 261: <comment>Non-UTF8 `result_file` content is incorrectly treated as a valid material answer, allowing weak/empty completions to pass audit.</comment> <file context> @@ -224,3 +229,244 @@ impl ToolRuntime<DoneRequest, ExecOutput> for DoneTool { + { + match read_result_file_preview(result_file, ctx) { + Ok(preview) => { + has_material_answer = true; + if let Some(preview) = preview { + reasons.extend(json_audit_reasons(&preview)); </file context>

Suggested change

-                has_material_answer = true;
-                if let Some(preview) = preview {
-                    reasons.extend(json_audit_reasons(&preview));
-                    if !audit_text.is_empty() {
-                        audit_text.push('\n');
-                    }
-                    audit_text.push_str(&preview);
-                }
+                if let Some(preview) = preview {
+                    has_material_answer = true;
+                    reasons.extend(json_audit_reasons(&preview));
+                    if !audit_text.is_empty() {
+                        audit_text.push('\n');
+                    }
+                    audit_text.push_str(&preview);
+                } else {
+                    reasons.push(format!("result_file `{}` is not valid UTF-8 text", result_file));
+                }
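The idea behind the suggestion can be tried in isolation. This is a minimal sketch that assumes the preview helper returns `None` for non-UTF-8 bytes; `preview_from_bytes`, `Audit`, and `audit_result_file` are hypothetical names, not the real `read_result_file_preview` API.

```rust
/// Returns a text preview, or None when the bytes are not valid UTF-8.
pub fn preview_from_bytes(bytes: &[u8], max_chars: usize) -> Option<String> {
    std::str::from_utf8(bytes)
        .ok()
        .map(|text| text.chars().take(max_chars).collect())
}

#[derive(Debug, Default)]
pub struct Audit {
    pub has_material_answer: bool,
    pub reasons: Vec<String>,
}

/// Only a readable preview counts as a material answer; unreadable
/// content adds a rejection reason instead.
pub fn audit_result_file(name: &str, bytes: &[u8]) -> Audit {
    let mut audit = Audit::default();
    match preview_from_bytes(bytes, 4096) {
        Some(_preview) => audit.has_material_answer = true,
        None => audit
            .reasons
            .push(format!("result_file `{}` is not valid UTF-8 text", name)),
    }
    audit
}
```

Setting the flag inside the `Some` arm is the whole change: a binary or corrupt file can no longer satisfy the audit just by existing.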

cubic-dev-ai · 2026-06-10T17:07:28Z

+    let req = DoneRequest {
+        result_file: Some("missing-result.json".to_string()),
+        ..DoneRequest::default()
+    };


P2: The missing-file test uses a shared temp-dir filename, which can make the assertion flaky when that file already exists.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/tools/handlers/done_tests.rs, line 195: <comment>The missing-file test uses a shared temp-dir filename, which can make the assertion flaky when that file already exists.</comment> <file context> @@ -175,3 +175,70 @@ fn done_is_not_parallel_safe() { + +#[test] +fn eval_done_audit_rejects_missing_result_file() { + let req = DoneRequest { + result_file: Some("missing-result.json".to_string()), + ..DoneRequest::default() </file context>

Suggested change

-    let req = DoneRequest {
-        result_file: Some("missing-result.json".to_string()),
-        ..DoneRequest::default()
-    };
+    let temp = tempfile::tempdir().unwrap();
+    let req = DoneRequest {
+        result_file: Some(temp.path().join("missing-result.json").to_string_lossy().to_string()),
+        ..DoneRequest::default()
+    };

* Fix runtime terminal completion barrier

* chore: freeze real_v8 81 baseline (bc45a39 + worktree)

  Snapshot of the exact code that produced the real_v8 81/100 run (root: but-fix-main-policy-runtime-eval-...-20260609-155756) so the regression baseline is reproducible and subsequent restore-88 fixes land as a clean diff on top. No behavior change; just commits the 20 previously-uncommitted source files.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung1 config - revert observe to 1s, raise inline cap

  Revert the "observe30" regression: DEFAULT_OBSERVE_TIMEOUT_MS and BROWSER_SCRIPT_DEFAULT_OBSERVE_MS 30_000 -> 1_000 (the pre-PR-#60 / 88 baseline value). The 30s default + 30s clamp-floor blocked each observe up to 30s and burned the run timebox, leaving long-script tasks unfinished (real_v8 tasks 1, 4 never emitted session.done). Raise MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 4KB -> 16KB so large extractions aren't truncated into re-scrape loops. Ablation rung 1 of the 81->88 restore. Measured independently.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung3 done-audit - null is a genuine absence, not a placeholder

  collect_json_placeholder_stats counted Value::Null as a placeholder, so a result that kept required-but-unavailable fields as null tripped the >=30% placeholder rejection. That pushed the agent to DELETE required fields to pass the audit (real_v8 task 53: filed_time/timezone_shown dropped) and double-rejected legitimately-sparse results (task 41). Count null toward the denominator but not as a placeholder. Applies to both the inline `result` and `result_file` audit paths. Keeps literal evasion strings ("unknown"/"n/a"/...) as placeholders.

  Follow-up (needs measurement): result_file audit reads a preview, so a large file with a clean head can still slip a high full-file null rate (task 94). Left for the measured pass. Ablation rung 3 of the 81->88 restore.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung2 prompts - remove one-repair-pass ceiling, add visual fallback

  The "at most one targeted repair pass" instruction appeared in four prompt files and drove premature finalize on incomplete data (real_v8 task 32: 17 vs 64 turns, whole source categories never visited). Replace the one-pass ceiling with time-bounded repair: keep repairing the specific missing items until required rows/fields are satisfied or the run timebox is nearly spent; only avoid blindly restarting a whole fluctuating crawl.

  Add a visual-fallback mandate to the system prompt: if a script / http_get / browser_fetch / endpoint / selector fails, returns empty, or is blocked, fall back to navigating and reading the rendered page before marking anything unavailable. Do not re-run the same failing script or drop the source (the dominant strategy regression across tasks 38,39,47,67).

  Files: browser-agent-system.md, dataset-case-user.md, python-tool-description.md, browser-script-tool-description.md. Ablation rung 2 of the 81->88 restore.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung0+4 - full output visibility, terminal capture, decouple audit/observe

  Rung 0 (the big one): MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 16KB -> 120KB (matches SCRIPT_MAX_OUTPUT_CHARS). The 4KB cap (born in the 75-era code, absent at the 88 baseline) blinded the model to its own script output: truncations 0->619->780 across 88->75->81 and KeyError-class blind-guess bugs 8->41->53 at constant script volume. Codex parity: fresh tool output is never capped; history truncation (context/mod.rs policy*1.2) handles growth. Also reword the truncation notice: it told the model to use "a narrower extraction instead of re-reading", actively training it not to recover missing data.

  Rung 4: port fallback_result_file_for_session from exp/real-v8-restore-88 (2bd479d). When the model ends without done(), emit session.done carrying the best result.* artifact from cwd instead of losing finished work (real_v8 tasks 1, 45, earlier 1/4/41/90).

  Decouple BROWSER_USE_EVAL_DONE_AUDIT from observe timing: the audit flag silently capped observes at 30s (hidden cross-subsystem coupling); observe caps now come only from BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS.

  Prompt: forbid fabricated/pattern-guessed values; honest null/"not found" with checked source is acceptable (task 68 fabricated emails then disavowed).

  Tests: updated 3 assertions to the new cap/clamp/notice. The 6 remaining failures (5 prompts::tests + stored_cloud_preference) pre-date this work; they assert phrases absent from prompts at BOTH baselines. Stream idle-timeout (task 22 class) verified already present in this lineage (provider.rs:1081, default 300s), so no port is needed.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): simple branch - 85-run base + 2 zero-risk deterministic fixes

  Base = the 85-run (rung0-4, commit 0322b8c): full 120KB output visibility, terminal artifact fallback, done-audit wording-police, rung-2 prompts, all unchanged. The model's PROMPT is frozen at the 85-run; deliberately none of the quick-pack prompt heuristics (anti-laundering / reality-probe / chunking / js() rules) and NOT the wording-police removal are carried over. They tested flat-to-worse and below the ±4 run-to-run variance.

  Added (deterministic, code-only, model never sees them):
  - done-audit: empty string "" is no longer a placeholder (like null). Many tasks mandate "" for missing values; counting it rejected spec-correct answers and coerced placeholder prose (real_v8 task 94). Zero downside.
  - compaction: apply the "(no summary available)" fallback to the summary SUFFIX before the prefix is prepended. The existing fallback was dead code (prefix made the string non-empty), so an empty summarize() pass shipped PREFIX + "" and the resumed model had total amnesia (real_v8 task 24).

  Tests: 1058 pass; 6 failures pre-date the base (prompt-phrase asserts + a stored_cloud_preference test absent from this lineage).

  Next (separate, deterministic FLOOR fixes, not prompt heuristics): no-op control-call loop-breaker, and broaden terminal finalize to any result-shaped artifact. Those raise the floor under unlucky trajectories (the real driver of the 82<->90 variance) and can't be washed out by sampling noise.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): done-audit treats empty string "" as genuine-absent, not placeholder (task 94)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(phase1): never lose finished work - broaden + crash-path artifact finalize

  Floor reliability only (deterministic; prompt UNCHANGED from PR #105). The locked-judge baseline showed ~5 tasks (35,52,61,72,99) ran but the runner captured nothing: work was on disk yet discarded.
  - discover_result_files: match any result-shaped artifact (result*, *.json/.csv/.md/.txt >=16B), not just `result.*`. Prefer canonical names, then more content. Rescues task 52 (feb17_selected.json sat on disk, runner delivered nothing).
  - terminal Err arm: on a session crash, if a substantive result artifact exists, emit session.done with it and return Ok instead of failing empty. Rescues task 99 (provider error mid-run, result.json already written).

  Deliberately NO loop-breaker here (Phase 2 deletes the control-plane browser tool, making the no-op doom loop structurally impossible) and NO prompt/cruft changes (those land in Phase 2's simplification sweep to avoid churning prompt+tests twice).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(phase1): verdict - floor fix is a deterministic win (ok=false 5->1); score jump is variance

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
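The artifact selection rule in the phase1 commit ("prefer canonical names, then more content") can be sketched as a pure ranking over directory entries. Everything here is illustrative: the `Artifact` struct and function names are hypothetical, and the sketch assumes the 16-byte floor applies to every candidate, which the commit message leaves ambiguous.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Artifact {
    pub name: String,
    pub size: u64, // bytes on disk
}

const MIN_BYTES: u64 = 16;
const RESULT_EXTENSIONS: [&str; 4] = ["json", "csv", "md", "txt"];

/// Canonical names start with "result" (result.json, result_final.csv, ...).
fn is_canonical(name: &str) -> bool {
    name.starts_with("result")
}

/// Result-shaped: big enough, and either canonical or a known extension.
fn is_result_shaped(a: &Artifact) -> bool {
    if a.size < MIN_BYTES {
        return false;
    }
    let known_ext = a
        .name
        .rsplit_once('.')
        .map_or(false, |(_, ext)| RESULT_EXTENSIONS.iter().any(|x| ext.eq_ignore_ascii_case(x)));
    is_canonical(&a.name) || known_ext
}

/// Picks the best candidate: canonical names win, then larger files.
pub fn best_artifact(candidates: &[Artifact]) -> Option<&Artifact> {
    candidates
        .iter()
        .filter(|a| is_result_shaped(a))
        .max_by_key(|a| (is_canonical(&a.name), a.size))
}
```

In practice the candidate list would come from reading the session cwd; keeping the ranking separate from the filesystem walk makes the preference order easy to test.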

gregpr07 and others added 8 commits June 8, 2026 16:03

Fix runtime terminal completion barrier

bc45a39

fix(eval): done-audit treats empty string "" as genuine-absent, not p…

84703c1

…laceholder (task 94) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cubic-dev-ai Bot reviewed Jun 10, 2026


gregpr07 wants to merge 8 commits into main from fix/real-v8-simple
