Fix/real v8 simple #105
Open
gregpr07 wants to merge 8 commits into
Snapshot of the exact code that produced the real_v8 81/100 run (root: but-fix-main-policy-runtime-eval-...-20260609-155756) so the regression baseline is reproducible and subsequent restore-88 fixes land as a clean diff on top. No behavior change; just commits the 20 previously-uncommitted source files.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Revert the "observe30" regression: DEFAULT_OBSERVE_TIMEOUT_MS and BROWSER_SCRIPT_DEFAULT_OBSERVE_MS 30_000 -> 1_000 (the pre-PR-#60 / 88 baseline value). The 30s default + 30s clamp-floor blocked each observe up to 30s and burned the run timebox, leaving long-script tasks unfinished (real_v8 tasks 1, 4 never emitted session.done). Raise MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 4KB -> 16KB so large extractions aren't truncated into re-scrape loops.
Ablation rung 1 of the 81->88 restore. Measured independently.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
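The clamp-floor problem described above can be illustrated with a minimal sketch, assuming a cap should only ever lower a requested timeout and never raise it. Only the constant name and its 1_000 value come from the commit message; the helper `resolve_observe_timeout_ms` and its parameters are hypothetical and are not taken from the repository.

```rust
/// Default observe timeout named in the commit message (1 second).
const DEFAULT_OBSERVE_TIMEOUT_MS: u64 = 1_000;

/// Hypothetical helper: pick the timeout for one observe call.
/// `requested` is what the caller asked for; `max_cap` is an optional upper bound.
fn resolve_observe_timeout_ms(requested: Option<u64>, max_cap: Option<u64>) -> u64 {
    let wanted = requested.unwrap_or(DEFAULT_OBSERVE_TIMEOUT_MS);
    match max_cap {
        // A cap is an upper bound only: it must never act as a floor.
        Some(cap) => wanted.min(cap),
        None => wanted,
    }
}
```

With this shape, an observe that asks for nothing waits the 1 s default even when a 30 s cap is configured, which is the behavior the revert is after.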
…older
collect_json_placeholder_stats counted Value::Null as a placeholder, so a
result that kept required-but-unavailable fields as null tripped the >=30%
placeholder rejection. That pushed the agent to DELETE required fields to
pass the audit (real_v8 task 53: filed_time/timezone_shown dropped) and
double-rejected legitimately-sparse results (task 41). Count null toward
the denominator but not as a placeholder. Applies to both the inline
`result` and `result_file` audit paths. Keeps literal evasion strings
("unknown"/"n/a"/...) as placeholders.
Follow-up (needs measurement): result_file audit reads a preview, so a
large file with a clean head can still slip a high full-file null rate
(task 94). Left for the measured pass.
Ablation rung 3 of the 81->88 restore.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
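A minimal sketch of the counting rule described in this commit, assuming leaf values are tallied recursively. The `Json` enum is a hypothetical stand-in for the JSON value type so the example needs no external crate, and `PlaceholderStats`, `collect_stats`, and the two-entry evasion list are illustrative names rather than the repository's code.

```rust
/// Hypothetical stand-in for a JSON value type.
enum Json {
    Null,
    Str(String),
    Num(f64),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Leaf counts gathered while walking a result.
#[derive(Default, Debug, PartialEq)]
struct PlaceholderStats {
    total: usize,
    placeholders: usize,
}

/// Literal evasion strings still count as placeholders (list is illustrative).
fn is_evasion_string(s: &str) -> bool {
    matches!(s.trim().to_ascii_lowercase().as_str(), "unknown" | "n/a")
}

fn collect_stats(value: &Json, stats: &mut PlaceholderStats) {
    match value {
        // Null is a genuine absence: it grows the denominator only.
        Json::Null => stats.total += 1,
        Json::Str(s) => {
            stats.total += 1;
            if is_evasion_string(s) {
                stats.placeholders += 1;
            }
        }
        Json::Num(_) => stats.total += 1,
        Json::Array(items) => {
            for item in items {
                collect_stats(item, stats);
            }
        }
        Json::Object(fields) => {
            for (_, field_value) in fields {
                collect_stats(field_value, stats);
            }
        }
    }
}

/// Reject when at least 30% of leaf values are placeholders.
fn exceeds_placeholder_threshold(stats: &PlaceholderStats) -> bool {
    stats.total > 0 && stats.placeholders * 10 >= stats.total * 3
}
```

Under this rule a result that keeps `filed_time` and `timezone_shown` as null passes, so the agent has no incentive to delete required fields.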
… fallback
The "at most one targeted repair pass" instruction appeared in four prompt files and drove premature finalize on incomplete data (real_v8 task 32: 17 vs 64 turns, whole source categories never visited). Replace the one-pass ceiling with time-bounded repair: keep repairing the specific missing items until required rows/fields are satisfied or the run timebox is nearly spent; only avoid blindly restarting a whole fluctuating crawl.
Add a visual-fallback mandate to the system prompt: if a script / http_get / browser_fetch / endpoint / selector fails, returns empty, or is blocked, fall back to navigating and reading the rendered page before marking anything unavailable; do not re-run the same failing script or drop the source (the dominant strategy regression across tasks 38,39,47,67).
Files: browser-agent-system.md, dataset-case-user.md, python-tool-description.md, browser-script-tool-description.md.
Ablation rung 2 of the 81->88 restore.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le audit/observe
Rung 0 (the big one): MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 16KB -> 120KB (matches SCRIPT_MAX_OUTPUT_CHARS). The 4KB cap (born in the 75-era code, absent at the 88 baseline) blinded the model to its own script output: truncations 0->619->780 across 88->75->81 and KeyError-class blind-guess bugs 8->41->53 at constant script volume. Codex parity: fresh tool output is never capped; history truncation (context/mod.rs policy*1.2) handles growth. Also reword the truncation notice: it told the model to use "a narrower extraction instead of re-reading", actively training it not to recover missing data.
Rung 4: port fallback_result_file_for_session from exp/real-v8-restore-88 (2bd479d): when the model ends without done(), emit session.done carrying the best result.* artifact from cwd instead of losing finished work (real_v8 tasks 1, 45, earlier 1/4/41/90).
Decouple BROWSER_USE_EVAL_DONE_AUDIT from observe timing: the audit flag silently capped observes at 30s (hidden cross-subsystem coupling); observe caps now come only from BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS.
Prompt: forbid fabricated/pattern-guessed values; honest null/"not found" with checked source is acceptable (task 68 fabricated emails then disavowed).
Tests: updated 3 assertions to the new cap/clamp/notice. The 6 remaining failures (5 prompts::tests + stored_cloud_preference) pre-date this work; they assert phrases absent from prompts at BOTH baselines.
Stream idle-timeout (task 22 class) verified already present in this lineage (provider.rs:1081, default 300s); no port needed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
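To make the Rung 0 point about the truncation notice concrete, here is a minimal sketch, assuming inline stdout is cut on a UTF-8 boundary and the notice tells the model it may re-read the output. The helper name `cap_inline_stdout` and the notice wording are hypothetical; the cap is passed in as a parameter because the exact byte value behind "120KB" is not stated in the commit message.

```rust
/// Hypothetical helper: keep at most `cap_bytes` of fresh stdout, cutting on a
/// UTF-8 character boundary, and say plainly that the rest can be re-read.
fn cap_inline_stdout(stdout: &str, cap_bytes: usize) -> String {
    if stdout.len() <= cap_bytes {
        return stdout.to_string();
    }
    // Walk back from the cap until we sit on a character boundary.
    // Index 0 is always a boundary, so this loop terminates.
    let mut end = cap_bytes;
    while !stdout.is_char_boundary(end) {
        end -= 1;
    }
    let omitted = stdout.len() - end;
    format!(
        "{}\n[truncated: {} bytes omitted; re-read the output if the missing part is needed]",
        &stdout[..end],
        omitted
    )
}
```

The design point is in the notice text: it invites recovery of the missing data instead of steering the model toward a narrower extraction.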
Base = the 85-run (rung0-4, commit 0322b8c): full 120KB output visibility, terminal artifact fallback, done-audit wording-police, rung-2 prompts, all unchanged. The model's PROMPT is frozen at the 85-run; deliberately none of the quick-pack prompt heuristics (anti-laundering / reality-probe / chunking / js() rules) and NOT the wording-police removal are carried over; they tested flat-to-worse and below the ±4 run-to-run variance.
Added (deterministic, code-only, model never sees them):
- done-audit: empty string "" is no longer a placeholder (like null). Many tasks mandate "" for missing values; counting it rejected spec-correct answers and coerced placeholder prose (real_v8 task 94). Zero downside.
- compaction: apply the "(no summary available)" fallback to the summary SUFFIX before the prefix is prepended. The existing fallback was dead code (prefix made the string non-empty), so an empty summarize() pass shipped PREFIX + "" and the resumed model had total amnesia (real_v8 task 24).
Tests: 1058 pass; 6 failures pre-date the base (prompt-phrase asserts + a stored_cloud_preference test absent from this lineage).
Next (separate, deterministic FLOOR fixes, not prompt heuristics): no-op control-call loop-breaker, and broaden terminal finalize to any result-shaped artifact. Those raise the floor under unlucky trajectories (the real driver of the 82<->90 variance) and can't be washed out by sampling noise.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
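The compaction fix can be sketched as follows, assuming the fallback is chosen from the summary alone before any prefix is attached. The fallback string is quoted from the commit message; the function name `build_compaction_message` and the example prefix are hypothetical.

```rust
/// Fallback text quoted in the commit message.
const NO_SUMMARY_FALLBACK: &str = "(no summary available)";

/// Hypothetical helper: build the message handed to the resumed model.
fn build_compaction_message(prefix: &str, summary: &str) -> String {
    // Apply the fallback to the summary suffix first. Checking the combined
    // string instead would never fire, because the prefix is non-empty.
    let suffix = if summary.trim().is_empty() {
        NO_SUMMARY_FALLBACK
    } else {
        summary
    };
    format!("{prefix}{suffix}")
}
```

An empty summarize() pass then yields the prefix followed by the fallback text, not the prefix followed by nothing.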
…laceholder (task 94)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
3 issues found across 23 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/browser-use-agent/src/tools/handlers/done.rs">
<violation number="1" location="crates/browser-use-agent/src/tools/handlers/done.rs:261">
P2: Non-UTF8 `result_file` content is incorrectly treated as a valid material answer, allowing weak/empty completions to pass audit.</violation>
</file>
<file name="crates/browser-use-agent/src/tools/handlers/done_tests.rs">
<violation number="1" location="crates/browser-use-agent/src/tools/handlers/done_tests.rs:195">
P2: The missing-file test uses a shared temp-dir filename, which can make the assertion flaky when that file already exists.</violation>
</file>
<file name="crates/browser-use-agent/src/entrypoint/mod.rs">
<violation number="1" location="crates/browser-use-agent/src/entrypoint/mod.rs:2685">
P1: Successful runs can fail before `session.done` if terminal cleanup errors, causing lost final result emission.</violation>
</file>
        runtime_handle.clone(),
        runtime_session_id.clone(),
    )
    .await?;
Contributor
P1: Successful runs can fail before session.done if terminal cleanup errors, causing lost final result emission.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/entrypoint/mod.rs, line 2685:
<comment>Successful runs can fail before `session.done` if terminal cleanup errors, causing lost final result emission.</comment>
<file context>
@@ -2611,10 +2674,83 @@ impl<Sd: SamplingDriver> RuntimeTurnLoopDriver<Sd> {
+ runtime_handle.clone(),
+ runtime_session_id.clone(),
+ )
+ .await?;
+ // Never lose finished work: if the model ended without a final
+ // message (no done() call — e.g. ended on leaked planning text
</file context>
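One possible shape for addressing the P1 finding, shown as a synchronous sketch: the cleanup error is recorded, not propagated with `?`, so the final event is always emitted. Everything here (`finish_session`, the string-based event log, the closure standing in for cleanup) is hypothetical and only illustrates the ordering concern; it is not the repository's async code.

```rust
/// Hypothetical terminal step: cleanup may fail, but the final result is still emitted.
fn finish_session(
    cleanup: impl FnOnce() -> Result<(), String>,
    final_result: &str,
    events: &mut Vec<String>,
) {
    // Record the cleanup error instead of returning early with `?`.
    if let Err(err) = cleanup() {
        events.push(format!("warning: terminal cleanup failed: {err}"));
    }
    // `session.done` is emitted whether or not cleanup succeeded.
    events.push(format!("session.done: {final_result}"));
}
```

Whether the cleanup error should additionally be surfaced after `session.done` has been emitted is a separate design choice that this sketch leaves open.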
Comment on lines +261 to +268
    has_material_answer = true;
    if let Some(preview) = preview {
        reasons.extend(json_audit_reasons(&preview));
        if !audit_text.is_empty() {
            audit_text.push('\n');
        }
        audit_text.push_str(&preview);
    }
Contributor
P2: Non-UTF8 result_file content is incorrectly treated as a valid material answer, allowing weak/empty completions to pass audit.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/tools/handlers/done.rs, line 261:
<comment>Non-UTF8 `result_file` content is incorrectly treated as a valid material answer, allowing weak/empty completions to pass audit.</comment>
<file context>
@@ -224,3 +229,244 @@ impl ToolRuntime<DoneRequest, ExecOutput> for DoneTool {
+ {
+ match read_result_file_preview(result_file, ctx) {
+ Ok(preview) => {
+ has_material_answer = true;
+ if let Some(preview) = preview {
+ reasons.extend(json_audit_reasons(&preview));
</file context>
Suggested change
-    has_material_answer = true;
-    if let Some(preview) = preview {
-        reasons.extend(json_audit_reasons(&preview));
-        if !audit_text.is_empty() {
-            audit_text.push('\n');
-        }
-        audit_text.push_str(&preview);
-    }
+    if let Some(preview) = preview {
+        has_material_answer = true;
+        reasons.extend(json_audit_reasons(&preview));
+        if !audit_text.is_empty() {
+            audit_text.push('\n');
+        }
+        audit_text.push_str(&preview);
+    } else {
+        reasons.push(format!("result_file `{}` is not valid UTF-8 text", result_file));
+    }
Comment on lines +195 to +198
    let req = DoneRequest {
        result_file: Some("missing-result.json".to_string()),
        ..DoneRequest::default()
    };
Contributor
P2: The missing-file test uses a shared temp-dir filename, which can make the assertion flaky when that file already exists.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/tools/handlers/done_tests.rs, line 195:
<comment>The missing-file test uses a shared temp-dir filename, which can make the assertion flaky when that file already exists.</comment>
<file context>
@@ -175,3 +175,70 @@ fn done_is_not_parallel_safe() {
+
+#[test]
+fn eval_done_audit_rejects_missing_result_file() {
+ let req = DoneRequest {
+ result_file: Some("missing-result.json".to_string()),
+ ..DoneRequest::default()
</file context>
Suggested change
-    let req = DoneRequest {
-        result_file: Some("missing-result.json".to_string()),
-        ..DoneRequest::default()
-    };
+    let temp = tempfile::tempdir().unwrap();
+    let req = DoneRequest {
+        result_file: Some(temp.path().join("missing-result.json").to_string_lossy().to_string()),
+        ..DoneRequest::default()
+    };
gregpr07 added a commit that referenced this pull request on Jun 11, 2026
* Fix runtime terminal completion barrier

* chore: freeze real_v8 81 baseline (bc45a39 + worktree)
Snapshot of the exact code that produced the real_v8 81/100 run (root: but-fix-main-policy-runtime-eval-...-20260609-155756) so the regression baseline is reproducible and subsequent restore-88 fixes land as a clean diff on top. No behavior change; just commits the 20 previously-uncommitted source files.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung1 config - revert observe to 1s, raise inline cap
Revert the "observe30" regression: DEFAULT_OBSERVE_TIMEOUT_MS and BROWSER_SCRIPT_DEFAULT_OBSERVE_MS 30_000 -> 1_000 (the pre-PR-#60 / 88 baseline value). The 30s default + 30s clamp-floor blocked each observe up to 30s and burned the run timebox, leaving long-script tasks unfinished (real_v8 tasks 1, 4 never emitted session.done). Raise MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 4KB -> 16KB so large extractions aren't truncated into re-scrape loops.
Ablation rung 1 of the 81->88 restore. Measured independently.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung3 done-audit - null is a genuine absence, not a placeholder
collect_json_placeholder_stats counted Value::Null as a placeholder, so a result that kept required-but-unavailable fields as null tripped the >=30% placeholder rejection. That pushed the agent to DELETE required fields to pass the audit (real_v8 task 53: filed_time/timezone_shown dropped) and double-rejected legitimately-sparse results (task 41). Count null toward the denominator but not as a placeholder. Applies to both the inline `result` and `result_file` audit paths. Keeps literal evasion strings ("unknown"/"n/a"/...) as placeholders.
Follow-up (needs measurement): result_file audit reads a preview, so a large file with a clean head can still slip a high full-file null rate (task 94). Left for the measured pass.
Ablation rung 3 of the 81->88 restore.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung2 prompts - remove one-repair-pass ceiling, add visual fallback
The "at most one targeted repair pass" instruction appeared in four prompt files and drove premature finalize on incomplete data (real_v8 task 32: 17 vs 64 turns, whole source categories never visited). Replace the one-pass ceiling with time-bounded repair: keep repairing the specific missing items until required rows/fields are satisfied or the run timebox is nearly spent; only avoid blindly restarting a whole fluctuating crawl.
Add a visual-fallback mandate to the system prompt: if a script / http_get / browser_fetch / endpoint / selector fails, returns empty, or is blocked, fall back to navigating and reading the rendered page before marking anything unavailable; do not re-run the same failing script or drop the source (the dominant strategy regression across tasks 38,39,47,67).
Files: browser-agent-system.md, dataset-case-user.md, python-tool-description.md, browser-script-tool-description.md.
Ablation rung 2 of the 81->88 restore.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung0+4 - full output visibility, terminal capture, decouple audit/observe
Rung 0 (the big one): MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 16KB -> 120KB (matches SCRIPT_MAX_OUTPUT_CHARS). The 4KB cap (born in the 75-era code, absent at the 88 baseline) blinded the model to its own script output: truncations 0->619->780 across 88->75->81 and KeyError-class blind-guess bugs 8->41->53 at constant script volume. Codex parity: fresh tool output is never capped; history truncation (context/mod.rs policy*1.2) handles growth. Also reword the truncation notice: it told the model to use "a narrower extraction instead of re-reading", actively training it not to recover missing data.
Rung 4: port fallback_result_file_for_session from exp/real-v8-restore-88 (2bd479d): when the model ends without done(), emit session.done carrying the best result.* artifact from cwd instead of losing finished work (real_v8 tasks 1, 45, earlier 1/4/41/90).
Decouple BROWSER_USE_EVAL_DONE_AUDIT from observe timing: the audit flag silently capped observes at 30s (hidden cross-subsystem coupling); observe caps now come only from BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS.
Prompt: forbid fabricated/pattern-guessed values; honest null/"not found" with checked source is acceptable (task 68 fabricated emails then disavowed).
Tests: updated 3 assertions to the new cap/clamp/notice. The 6 remaining failures (5 prompts::tests + stored_cloud_preference) pre-date this work; they assert phrases absent from prompts at BOTH baselines.
Stream idle-timeout (task 22 class) verified already present in this lineage (provider.rs:1081, default 300s); no port needed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): simple branch - 85-run base + 2 zero-risk deterministic fixes
Base = the 85-run (rung0-4, commit 0322b8c): full 120KB output visibility, terminal artifact fallback, done-audit wording-police, rung-2 prompts, all unchanged. The model's PROMPT is frozen at the 85-run; deliberately none of the quick-pack prompt heuristics (anti-laundering / reality-probe / chunking / js() rules) and NOT the wording-police removal are carried over; they tested flat-to-worse and below the ±4 run-to-run variance.
Added (deterministic, code-only, model never sees them):
- done-audit: empty string "" is no longer a placeholder (like null). Many tasks mandate "" for missing values; counting it rejected spec-correct answers and coerced placeholder prose (real_v8 task 94). Zero downside.
- compaction: apply the "(no summary available)" fallback to the summary SUFFIX before the prefix is prepended. The existing fallback was dead code (prefix made the string non-empty), so an empty summarize() pass shipped PREFIX + "" and the resumed model had total amnesia (real_v8 task 24).
Tests: 1058 pass; 6 failures pre-date the base (prompt-phrase asserts + a stored_cloud_preference test absent from this lineage).
Next (separate, deterministic FLOOR fixes, not prompt heuristics): no-op control-call loop-breaker, and broaden terminal finalize to any result-shaped artifact. Those raise the floor under unlucky trajectories (the real driver of the 82<->90 variance) and can't be washed out by sampling noise.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): done-audit treats empty string "" as genuine-absent, not placeholder (task 94)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(phase1): never lose finished work - broaden + crash-path artifact finalize
Floor reliability only (deterministic; prompt UNCHANGED from PR #105). The locked-judge baseline showed ~5 tasks (35,52,61,72,99) ran but the runner captured nothing; work was on disk yet discarded.
- discover_result_files: match any result-shaped artifact (result*, *.json/.csv/.md/.txt >=16B), not just `result.*`. Prefer canonical names, then more content. Rescues task 52 (feb17_selected.json sat on disk, runner delivered nothing).
- terminal Err arm: on a session crash, if a substantive result artifact exists, emit session.done with it and return Ok instead of failing empty. Rescues task 99 (provider error mid-run, result.json already written).
Deliberately NO loop-breaker here (Phase 2 deletes the control-plane browser tool, making the no-op doom loop structurally impossible) and NO prompt/cruft changes (those land in Phase 2's simplification sweep to avoid churning prompt+tests twice).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(phase1): verdict - floor fix is a deterministic win (ok=false 5->1); score jump is variance

---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
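The phase1 selection rule ("prefer canonical names, then more content") can be sketched over plain name and size pairs. This is a hedged illustration, not the repository's `discover_result_files`: the `Candidate` struct and function names are hypothetical, and the sketch assumes the 16-byte minimum applies to every candidate and that "canonical" means a file name starting with `result`.

```rust
/// A file found in the session cwd (name and size only; a pure-data stand-in).
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    name: String,
    size_bytes: u64,
}

/// Minimum size for an artifact to count as substantive (16 bytes per the commit).
const MIN_RESULT_BYTES: u64 = 16;

/// Assumption: "canonical" means the file name starts with `result`.
fn is_canonical(name: &str) -> bool {
    name.starts_with("result")
}

/// Result-shaped: canonical name or a known data extension, and not tiny.
fn is_result_shaped(c: &Candidate) -> bool {
    let has_known_ext = [".json", ".csv", ".md", ".txt"]
        .iter()
        .any(|ext| c.name.ends_with(ext));
    c.size_bytes >= MIN_RESULT_BYTES && (is_canonical(&c.name) || has_known_ext)
}

/// Pick the best artifact: canonical names first, then the larger file.
fn best_result_file(files: &[Candidate]) -> Option<&Candidate> {
    files
        .iter()
        .filter(|c| is_result_shaped(c))
        .max_by_key(|c| (is_canonical(&c.name), c.size_bytes))
}
```

Sorting on the tuple `(is_canonical, size_bytes)` keeps the preference order explicit: a small `result.json` still beats a large non-canonical dump, while a lone `feb17_selected.json` is picked up when nothing canonical exists.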
Summary by cubic
Restores pre-regression real_v8 behavior with faster observe polling, full `browser_script` output visibility, terminal completion capture, and safer process execution. Improves eval reliability and preserves results even without an explicit `done()`.
New Features
- `done` is terminal in the fused loop; if no `done`, emit a final result from the best `result*` file in the session cwd.
- `exec_command` adds `timeout_ms` and kills the whole process group on timeout (Unix); default cap `DEFAULT_EXEC_COMMAND_TIMEOUT_MS=600_000` using `libc`.
- `browser-use-agent` and `browser-use-browser`; eval caps now only follow `BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS`. Added `BROWSER_USE_DISABLE_LOCAL_SEARCH` to disable the local `search` tool.
- `exec_command` includes `timeout_ms`; `done` requires a complete, verified result with a step-exhaustion fallback.
Bug Fixes
- `browser_script` stdout cap raised to 120 KB (matches `SCRIPT_MAX_OUTPUT_CHARS`); truncation guidance updated so the model re-reads artifacts when needed.
- Done-audit treats `null` and empty string values as genuine absences, not placeholders; reduces false rejections.
Written for commit 84703c1. Summary will update on new commits.