Fix/real v8 simple #105

Open
gregpr07 wants to merge 8 commits into main from fix/real-v8-simple
@gregpr07 commented Jun 10, 2026


Summary by cubic

Restores pre-regression real_v8 behavior with faster observe polling, full browser_script output visibility, terminal completion capture, and safer process execution. Improves eval reliability and preserves results even without an explicit done().

  • New Features

    • A successful done is terminal in the fused loop; if done is never called, a final result is emitted from the best result* file in the session cwd.
    • exec_command adds timeout_ms and, on timeout, kills the whole process group (Unix, via libc); the default cap is DEFAULT_EXEC_COMMAND_TIMEOUT_MS=600_000.
    • Observe control: defaults set to 1s in browser-use-agent and browser-use-browser; eval caps now only follow BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS. Added BROWSER_USE_DISABLE_LOCAL_SEARCH to disable the local search tool.
    • Tool schema/docs: exec_command includes timeout_ms; done requires a complete, verified result with a step-exhaustion fallback.
  • Bug Fixes

    • Inline browser_script stdout cap raised to 120 KB (matches SCRIPT_MAX_OUTPUT_CHARS); truncation guidance updated so the model re-reads artifacts when needed.
    • Compaction applies the "(no summary available)" fallback to the summary suffix before the prefix is prepended, so an empty summary no longer ships as a bare prefix and loses context.
    • Done-audit treats null and empty string values as genuine absences, not placeholders; reduces false rejections.
    • Prompts remove the one-pass repair ceiling, add visual fallbacks, and forbid fabricated values.
    • Reverts the 30s observe-window regression to 1s to prevent long polls from burning the run timebox.

Written for commit 84703c1. Summary will update on new commits.


gregpr07 and others added 8 commits June 8, 2026 16:03
Snapshot of the exact code that produced the real_v8 81/100 run
(root: but-fix-main-policy-runtime-eval-...-20260609-155756) so the
regression baseline is reproducible and subsequent restore-88 fixes
land as a clean diff on top. No behavior change; just commits the
20 previously-uncommitted source files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Revert the "observe30" regression: DEFAULT_OBSERVE_TIMEOUT_MS and
BROWSER_SCRIPT_DEFAULT_OBSERVE_MS 30_000 -> 1_000 (the pre-PR-#60 / 88
baseline value). The 30s default + 30s clamp-floor blocked each observe
up to 30s and burned the run timebox, leaving long-script tasks
unfinished (real_v8 tasks 1, 4 never emitted session.done).
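
To make the cost of the clamp floor concrete, here is a minimal sketch contrasting a floor-clamped observe timeout with a plain default. The function names and the `floor_ms` parameter are hypothetical; only the constant name `DEFAULT_OBSERVE_TIMEOUT_MS` and the 30_000 / 1_000 values come from this commit message.

```rust
/// Default observe window in milliseconds (the value this commit restores).
const DEFAULT_OBSERVE_TIMEOUT_MS: u64 = 1_000;

/// Hypothetical pre-fix behavior: every request is raised to at least `floor_ms`.
fn observe_timeout_with_floor(requested_ms: Option<u64>, default_ms: u64, floor_ms: u64) -> u64 {
    requested_ms.unwrap_or(default_ms).max(floor_ms)
}

/// Hypothetical post-fix behavior: use the request, else the 1s default.
fn observe_timeout(requested_ms: Option<u64>) -> u64 {
    requested_ms.unwrap_or(DEFAULT_OBSERVE_TIMEOUT_MS)
}

/// Worst-case time spent blocked across `n` observe calls.
fn worst_case_blocked_ms(per_observe_ms: u64, n: u64) -> u64 {
    per_observe_ms * n
}
```

With a 30s floor, even a request for 1s can block for 30s, so 20 observes can cost up to 600_000 ms in the worst case; with the 1s default the same 20 observes cost at most 20_000 ms.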

Raise MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 4KB -> 16KB so large
extractions aren't truncated into re-scrape loops.

Ablation rung 1 of the 81->88 restore. Measured independently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…older

collect_json_placeholder_stats counted Value::Null as a placeholder, so a
result that kept required-but-unavailable fields as null tripped the >=30%
placeholder rejection. That pushed the agent to DELETE required fields to
pass the audit (real_v8 task 53: filed_time/timezone_shown dropped) and
double-rejected legitimately-sparse results (task 41). Count null toward
the denominator but not as a placeholder. Applies to both the inline
`result` and `result_file` audit paths. Keeps literal evasion strings
("unknown"/"n/a"/...) as placeholders.
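
The counting rule can be sketched as follows. This is an illustration, not the code in `collect_json_placeholder_stats`: the `Value` enum is a stand-in defined locally so the example needs no external crate, the two-entry evasion list is abbreviated, and all function names are hypothetical. Only the 30% threshold and the null-counts-toward-the-denominator rule come from the commit message.

```rust
/// Minimal JSON-like value, defined here so the sketch needs no external crate.
#[derive(Debug, Clone)]
enum Value {
    Null,
    Str(String),
    Num(f64),
    Array(Vec<Value>),
    Object(Vec<(String, Value)>),
}

/// Leaf values seen, and how many of them look like placeholders.
#[derive(Debug, Default, PartialEq)]
struct PlaceholderStats {
    total: usize,
    placeholders: usize,
}

/// Literal evasion strings that still count as placeholders (abbreviated list).
fn is_placeholder_text(s: &str) -> bool {
    matches!(s.trim().to_ascii_lowercase().as_str(), "unknown" | "n/a")
}

fn collect(value: &Value, stats: &mut PlaceholderStats) {
    match value {
        // Null counts toward the denominator only: it is a genuine absence.
        Value::Null => stats.total += 1,
        Value::Str(s) => {
            stats.total += 1;
            if is_placeholder_text(s) {
                stats.placeholders += 1;
            }
        }
        Value::Num(_) => stats.total += 1,
        Value::Array(items) => items.iter().for_each(|v| collect(v, stats)),
        Value::Object(fields) => fields.iter().for_each(|(_, v)| collect(v, stats)),
    }
}

/// Reject when placeholders make up at least 30% of leaf values.
fn exceeds_placeholder_threshold(stats: &PlaceholderStats) -> bool {
    stats.total > 0 && stats.placeholders * 10 >= stats.total * 3
}
```

Under this rule a result with two null fields out of four passes, while the same result with "unknown" in those fields would be rejected.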

Follow-up (needs measurement): result_file audit reads a preview, so a
large file with a clean head can still slip a high full-file null rate
(task 94). Left for the measured pass.

Ablation rung 3 of the 81->88 restore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… fallback

The "at most one targeted repair pass" instruction appeared in four prompt
files and drove premature finalize on incomplete data (real_v8 task 32:
17 vs 64 turns, whole source categories never visited). Replace the
one-pass ceiling with time-bounded repair: keep repairing the specific
missing items until required rows/fields are satisfied or the run timebox
is nearly spent; only avoid blindly restarting a whole fluctuating crawl.

Add a visual-fallback mandate to the system prompt: if a script / http_get
/ browser_fetch / endpoint / selector fails, returns empty, or is blocked,
fall back to navigating and reading the rendered page before marking
anything unavailable — do not re-run the same failing script or drop the
source (the dominant strategy regression across tasks 38,39,47,67).

Files: browser-agent-system.md, dataset-case-user.md,
python-tool-description.md, browser-script-tool-description.md.
Ablation rung 2 of the 81->88 restore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…le audit/observe

Rung 0 (the big one): MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 16KB -> 120KB
(matches SCRIPT_MAX_OUTPUT_CHARS). The 4KB cap (born in the 75-era code,
absent at the 88 baseline) blinded the model to its own script output:
truncations 0->619->780 across 88->75->81 and KeyError-class blind-guess
bugs 8->41->53 at constant script volume. Codex parity: fresh tool output
is never capped; history truncation (context/mod.rs policy*1.2) handles
growth. Also reword the truncation notice — it told the model to use "a
narrower extraction instead of re-reading", actively training it not to
recover missing data.
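
A rough sketch of cap-and-notice handling, assuming the full output is also saved to an artifact the model can re-read. The function name, the `artifact_path` parameter, and the notice wording are hypothetical; the cap is passed in rather than hard-coded because the exact byte value behind "120KB" is not stated here.

```rust
/// Returns the inline text shown to the model. Output within `cap_bytes` is
/// passed through untouched; longer output is cut at a UTF-8 character
/// boundary and followed by a notice that points at the full artifact.
fn inline_stdout(stdout: &str, cap_bytes: usize, artifact_path: &str) -> String {
    if stdout.len() <= cap_bytes {
        return stdout.to_string();
    }
    // Walk back from the cap until we land on a char boundary (0 always is one).
    let mut end = cap_bytes;
    while !stdout.is_char_boundary(end) {
        end -= 1;
    }
    format!(
        "{}\n[truncated: showing {} of {} bytes; re-read {} for the rest]",
        &stdout[..end],
        end,
        stdout.len(),
        artifact_path
    )
}
```

The notice tells the model how to recover the missing data instead of steering it away from re-reading, which is the behavioral point of the reworded message.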

Rung 4: port fallback_result_file_for_session from exp/real-v8-restore-88
(2bd479d) — when the model ends without done(), emit session.done carrying
the best result.* artifact from cwd instead of losing finished work
(real_v8 tasks 1, 45, earlier 1/4/41/90).
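
The terminal decision can be pictured with the sketch below. The `LoopEnd` and `Terminal` types and the `terminal_event` function are hypothetical simplifications for illustration; the real `fallback_result_file_for_session` and the `session.done` payload are not shown in this PR text.

```rust
/// How the turn loop ended (hypothetical simplification).
enum LoopEnd {
    /// The model called done() with a final result.
    Done(String),
    /// The model stopped without calling done().
    NoDone,
}

/// What the terminal event carries in this sketch.
#[derive(Debug, PartialEq)]
enum Terminal {
    Result(String),
    FallbackArtifact(String),
    Empty,
}

/// An explicit done() always wins; otherwise fall back to the best artifact
/// found in the session cwd, if there is one.
fn terminal_event(end: LoopEnd, best_artifact: Option<String>) -> Terminal {
    match (end, best_artifact) {
        (LoopEnd::Done(result), _) => Terminal::Result(result),
        (LoopEnd::NoDone, Some(path)) => Terminal::FallbackArtifact(path),
        (LoopEnd::NoDone, None) => Terminal::Empty,
    }
}
```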

Decouple BROWSER_USE_EVAL_DONE_AUDIT from observe timing: the audit flag
silently capped observes at 30s (hidden cross-subsystem coupling); observe
caps now come only from BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS.
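
As a sketch of the decoupled behavior, the cap below depends on a single input: the raw value of `BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS`. The function names are hypothetical, and treating an unset or unparsable value as "no cap" is an assumption of this example rather than documented behavior.

```rust
/// Parses the raw value of BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS, if set.
/// Unset, empty, or non-numeric values mean "no cap" in this sketch.
fn parse_observe_cap(raw: Option<&str>) -> Option<u64> {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
}

/// Applies the cap to a requested observe timeout. The done-audit flag is
/// deliberately not an input: it has no say in observe timing.
fn capped_observe_ms(requested_ms: u64, cap_ms: Option<u64>) -> u64 {
    match cap_ms {
        Some(cap) => requested_ms.min(cap),
        None => requested_ms,
    }
}
```

In a real process the raw value could come from `std::env::var("BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS").ok()`, passed along with `.as_deref()`.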

Prompt: forbid fabricated/pattern-guessed values; honest null/"not found"
with checked source is acceptable (task 68 fabricated emails then disavowed).

Tests: updated 3 assertions to the new cap/clamp/notice. The 6 remaining
failures (5 prompts::tests + stored_cloud_preference) pre-date this work —
they assert phrases absent from prompts at BOTH baselines.

Stream idle-timeout (task 22 class) verified already present in this
lineage (provider.rs:1081, default 300s) — no port needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Base = the 85-run (rung0-4, commit 0322b8c): full 120KB output visibility,
terminal artifact fallback, done-audit wording-police, rung-2 prompts — all
unchanged. The model's PROMPT is frozen at the 85-run; deliberately none of
the quick-pack prompt heuristics (anti-laundering / reality-probe / chunking /
js() rules) and NOT the wording-police removal are carried over — they tested
flat-to-worse and below the ±4 run-to-run variance.

Added (deterministic, code-only, model never sees them):
- done-audit: empty string "" is no longer a placeholder (like null). Many
  tasks mandate "" for missing values; counting it rejected spec-correct
  answers and coerced placeholder prose (real_v8 task 94). Zero downside.
- compaction: apply the "(no summary available)" fallback to the summary
  SUFFIX before the prefix is prepended — the existing fallback was dead code
  (prefix made the string non-empty), so an empty summarize() pass shipped
  PREFIX + "" and the resumed model had total amnesia (real_v8 task 24).
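
The dead-code fallback is easiest to see side by side. This is a minimal sketch with hypothetical function names and a made-up prefix; only the "(no summary available)" string and the ordering of the two steps come from the description above.

```rust
const NO_SUMMARY_FALLBACK: &str = "(no summary available)";

/// Buggy order: the emptiness check runs after the prefix is prepended, so the
/// combined string is never empty and the fallback can never fire.
fn build_summary_buggy(prefix: &str, summary: &str) -> String {
    let combined = format!("{}{}", prefix, summary);
    if combined.trim().is_empty() {
        NO_SUMMARY_FALLBACK.to_string()
    } else {
        combined
    }
}

/// Fixed order: substitute the fallback for an empty suffix first, then prepend.
fn build_summary_fixed(prefix: &str, summary: &str) -> String {
    let suffix = if summary.trim().is_empty() {
        NO_SUMMARY_FALLBACK
    } else {
        summary
    };
    format!("{}{}", prefix, suffix)
}
```

With an empty summary, the buggy version returns the bare prefix, while the fixed version returns the prefix followed by the fallback text.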

Tests: 1058 pass; 6 failures pre-date the base (prompt-phrase asserts + a
stored_cloud_preference test absent from this lineage).

Next (separate, deterministic FLOOR fixes — not prompt heuristics): no-op
control-call loop-breaker, and broaden terminal finalize to any result-shaped
artifact. Those raise the floor under unlucky trajectories (the real driver of
the 82<->90 variance) and can't be washed out by sampling noise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…laceholder (task 94)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cubic-dev-ai (bot) left a comment


3 issues found across 23 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/browser-use-agent/src/tools/handlers/done.rs">

<violation number="1" location="crates/browser-use-agent/src/tools/handlers/done.rs:261">
P2: Non-UTF8 `result_file` content is incorrectly treated as a valid material answer, allowing weak/empty completions to pass audit.</violation>
</file>

<file name="crates/browser-use-agent/src/tools/handlers/done_tests.rs">

<violation number="1" location="crates/browser-use-agent/src/tools/handlers/done_tests.rs:195">
P2: The missing-file test uses a shared temp-dir filename, which can make the assertion flaky when that file already exists.</violation>
</file>

<file name="crates/browser-use-agent/src/entrypoint/mod.rs">

<violation number="1" location="crates/browser-use-agent/src/entrypoint/mod.rs:2685">
P1: Successful runs can fail before `session.done` if terminal cleanup errors, causing lost final result emission.</violation>
</file>


runtime_handle.clone(),
runtime_session_id.clone(),
)
.await?;

@cubic-dev-ai (bot) commented Jun 10, 2026


P1: Successful runs can fail before session.done if terminal cleanup errors, causing lost final result emission.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/entrypoint/mod.rs, line 2685:

<comment>Successful runs can fail before `session.done` if terminal cleanup errors, causing lost final result emission.</comment>

<file context>
@@ -2611,10 +2674,83 @@ impl<Sd: SamplingDriver> RuntimeTurnLoopDriver<Sd> {
+                    runtime_handle.clone(),
+                    runtime_session_id.clone(),
+                )
+                .await?;
+                // Never lose finished work: if the model ended without a final
+                // message (no done() call — e.g. ended on leaked planning text
</file context>
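
One possible ordering that avoids the early return is sketched below: capture the cleanup outcome instead of propagating it with `?`, emit the final event, then surface the error. The `finish_run` function and its closure parameters are hypothetical synchronous stand-ins for the real async calls, and whether a cleanup failure should still be returned or only logged is a separate decision.

```rust
/// Runs terminal cleanup but always emits the final event before surfacing a
/// cleanup failure. `cleanup` and `emit_done` stand in for the real async calls.
fn finish_run<C, E>(cleanup: C, mut emit_done: E, final_result: &str) -> Result<(), String>
where
    C: FnOnce() -> Result<(), String>,
    E: FnMut(&str),
{
    // Capture the outcome instead of returning early with `?`.
    let cleanup_outcome = cleanup();
    emit_done(final_result);
    cleanup_outcome
}
```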

Comment on lines +261 to +268
has_material_answer = true;
if let Some(preview) = preview {
reasons.extend(json_audit_reasons(&preview));
if !audit_text.is_empty() {
audit_text.push('\n');
}
audit_text.push_str(&preview);
}

@cubic-dev-ai (bot) commented Jun 10, 2026


P2: Non-UTF8 result_file content is incorrectly treated as a valid material answer, allowing weak/empty completions to pass audit.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/tools/handlers/done.rs, line 261:

<comment>Non-UTF8 `result_file` content is incorrectly treated as a valid material answer, allowing weak/empty completions to pass audit.</comment>

<file context>
@@ -224,3 +229,244 @@ impl ToolRuntime<DoneRequest, ExecOutput> for DoneTool {
+    {
+        match read_result_file_preview(result_file, ctx) {
+            Ok(preview) => {
+                has_material_answer = true;
+                if let Some(preview) = preview {
+                    reasons.extend(json_audit_reasons(&preview));
</file context>
Suggested change
-                has_material_answer = true;
-                if let Some(preview) = preview {
-                    reasons.extend(json_audit_reasons(&preview));
-                    if !audit_text.is_empty() {
-                        audit_text.push('\n');
-                    }
-                    audit_text.push_str(&preview);
-                }
+                if let Some(preview) = preview {
+                    has_material_answer = true;
+                    reasons.extend(json_audit_reasons(&preview));
+                    if !audit_text.is_empty() {
+                        audit_text.push('\n');
+                    }
+                    audit_text.push_str(&preview);
+                } else {
+                    reasons.push(format!("result_file `{}` is not valid UTF-8 text", result_file));
+                }
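
A standalone sketch of the same distinction, for readers without the handler in front of them. It assumes the preview reader yields `None` for bytes that are not valid UTF-8; `preview_text` and `audit_result_file` are hypothetical names, not functions from `done.rs`.

```rust
/// Hypothetical preview reader: `None` means the bytes are not valid UTF-8.
fn preview_text(bytes: &[u8]) -> Option<String> {
    std::str::from_utf8(bytes).ok().map(str::to_owned)
}

/// Returns (has_material_answer, rejection reasons) for a result file.
fn audit_result_file(name: &str, bytes: &[u8]) -> (bool, Vec<String>) {
    match preview_text(bytes) {
        // Readable text counts as a material answer; further audits would run here.
        Some(_text) => (true, Vec::new()),
        // Unreadable bytes are a rejection reason, not a material answer.
        None => (
            false,
            vec![format!("result_file `{}` is not valid UTF-8 text", name)],
        ),
    }
}
```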

Comment on lines +195 to +198
let req = DoneRequest {
result_file: Some("missing-result.json".to_string()),
..DoneRequest::default()
};

@cubic-dev-ai (bot) commented Jun 10, 2026


P2: The missing-file test uses a shared temp-dir filename, which can make the assertion flaky when that file already exists.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/tools/handlers/done_tests.rs, line 195:

<comment>The missing-file test uses a shared temp-dir filename, which can make the assertion flaky when that file already exists.</comment>

<file context>
@@ -175,3 +175,70 @@ fn done_is_not_parallel_safe() {
+
+#[test]
+fn eval_done_audit_rejects_missing_result_file() {
+    let req = DoneRequest {
+        result_file: Some("missing-result.json".to_string()),
+        ..DoneRequest::default()
</file context>
Suggested change
-    let req = DoneRequest {
-        result_file: Some("missing-result.json".to_string()),
-        ..DoneRequest::default()
-    };
+    let temp = tempfile::tempdir().unwrap();
+    let req = DoneRequest {
+        result_file: Some(temp.path().join("missing-result.json").to_string_lossy().to_string()),
+        ..DoneRequest::default()
+    };

gregpr07 added a commit that referenced this pull request Jun 11, 2026
* Fix runtime terminal completion barrier

* chore: freeze real_v8 81 baseline (bc45a39 + worktree)

Snapshot of the exact code that produced the real_v8 81/100 run
(root: but-fix-main-policy-runtime-eval-...-20260609-155756) so the
regression baseline is reproducible and subsequent restore-88 fixes
land as a clean diff on top. No behavior change; just commits the
20 previously-uncommitted source files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung1 config — revert observe to 1s, raise inline cap

Revert the "observe30" regression: DEFAULT_OBSERVE_TIMEOUT_MS and
BROWSER_SCRIPT_DEFAULT_OBSERVE_MS 30_000 -> 1_000 (the pre-PR-#60 / 88
baseline value). The 30s default + 30s clamp-floor blocked each observe
up to 30s and burned the run timebox, leaving long-script tasks
unfinished (real_v8 tasks 1, 4 never emitted session.done).

Raise MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 4KB -> 16KB so large
extractions aren't truncated into re-scrape loops.

Ablation rung 1 of the 81->88 restore. Measured independently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung3 done-audit — null is a genuine absence, not a placeholder

collect_json_placeholder_stats counted Value::Null as a placeholder, so a
result that kept required-but-unavailable fields as null tripped the >=30%
placeholder rejection. That pushed the agent to DELETE required fields to
pass the audit (real_v8 task 53: filed_time/timezone_shown dropped) and
double-rejected legitimately-sparse results (task 41). Count null toward
the denominator but not as a placeholder. Applies to both the inline
`result` and `result_file` audit paths. Keeps literal evasion strings
("unknown"/"n/a"/...) as placeholders.

Follow-up (needs measurement): result_file audit reads a preview, so a
large file with a clean head can still slip a high full-file null rate
(task 94). Left for the measured pass.

Ablation rung 3 of the 81->88 restore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung2 prompts — remove one-repair-pass ceiling, add visual fallback

The "at most one targeted repair pass" instruction appeared in four prompt
files and drove premature finalize on incomplete data (real_v8 task 32:
17 vs 64 turns, whole source categories never visited). Replace the
one-pass ceiling with time-bounded repair: keep repairing the specific
missing items until required rows/fields are satisfied or the run timebox
is nearly spent; only avoid blindly restarting a whole fluctuating crawl.

Add a visual-fallback mandate to the system prompt: if a script / http_get
/ browser_fetch / endpoint / selector fails, returns empty, or is blocked,
fall back to navigating and reading the rendered page before marking
anything unavailable — do not re-run the same failing script or drop the
source (the dominant strategy regression across tasks 38,39,47,67).

Files: browser-agent-system.md, dataset-case-user.md,
python-tool-description.md, browser-script-tool-description.md.
Ablation rung 2 of the 81->88 restore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): rung0+4 — full output visibility, terminal capture, decouple audit/observe

Rung 0 (the big one): MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 16KB -> 120KB
(matches SCRIPT_MAX_OUTPUT_CHARS). The 4KB cap (born in the 75-era code,
absent at the 88 baseline) blinded the model to its own script output:
truncations 0->619->780 across 88->75->81 and KeyError-class blind-guess
bugs 8->41->53 at constant script volume. Codex parity: fresh tool output
is never capped; history truncation (context/mod.rs policy*1.2) handles
growth. Also reword the truncation notice — it told the model to use "a
narrower extraction instead of re-reading", actively training it not to
recover missing data.

Rung 4: port fallback_result_file_for_session from exp/real-v8-restore-88
(2bd479d) — when the model ends without done(), emit session.done carrying
the best result.* artifact from cwd instead of losing finished work
(real_v8 tasks 1, 45, earlier 1/4/41/90).

Decouple BROWSER_USE_EVAL_DONE_AUDIT from observe timing: the audit flag
silently capped observes at 30s (hidden cross-subsystem coupling); observe
caps now come only from BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS.

Prompt: forbid fabricated/pattern-guessed values; honest null/"not found"
with checked source is acceptable (task 68 fabricated emails then disavowed).

Tests: updated 3 assertions to the new cap/clamp/notice. The 6 remaining
failures (5 prompts::tests + stored_cloud_preference) pre-date this work —
they assert phrases absent from prompts at BOTH baselines.

Stream idle-timeout (task 22 class) verified already present in this
lineage (provider.rs:1081, default 300s) — no port needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): simple branch — 85-run base + 2 zero-risk deterministic fixes

Base = the 85-run (rung0-4, commit 0322b8c): full 120KB output visibility,
terminal artifact fallback, done-audit wording-police, rung-2 prompts — all
unchanged. The model's PROMPT is frozen at the 85-run; deliberately none of
the quick-pack prompt heuristics (anti-laundering / reality-probe / chunking /
js() rules) and NOT the wording-police removal are carried over — they tested
flat-to-worse and below the ±4 run-to-run variance.

Added (deterministic, code-only, model never sees them):
- done-audit: empty string "" is no longer a placeholder (like null). Many
  tasks mandate "" for missing values; counting it rejected spec-correct
  answers and coerced placeholder prose (real_v8 task 94). Zero downside.
- compaction: apply the "(no summary available)" fallback to the summary
  SUFFIX before the prefix is prepended — the existing fallback was dead code
  (prefix made the string non-empty), so an empty summarize() pass shipped
  PREFIX + "" and the resumed model had total amnesia (real_v8 task 24).

Tests: 1058 pass; 6 failures pre-date the base (prompt-phrase asserts + a
stored_cloud_preference test absent from this lineage).

Next (separate, deterministic FLOOR fixes — not prompt heuristics): no-op
control-call loop-breaker, and broaden terminal finalize to any result-shaped
artifact. Those raise the floor under unlucky trajectories (the real driver of
the 82<->90 variance) and can't be washed out by sampling noise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(eval): done-audit treats empty string "" as genuine-absent, not placeholder (task 94)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(phase1): never lose finished work — broaden + crash-path artifact finalize

Floor reliability only (deterministic; prompt UNCHANGED from PR #105). The
locked-judge baseline showed ~5 tasks (35,52,61,72,99) ran but the runner
captured nothing — work was on disk yet discarded.

- discover_result_files: match any result-shaped artifact (result*, *.json/.csv/
  .md/.txt >=16B), not just `result.*`. Prefer canonical names, then more content.
  Rescues task 52 (feb17_selected.json sat on disk, runner delivered nothing).
- terminal Err arm: on a session crash, if a substantive result artifact exists,
  emit session.done with it and return Ok instead of failing empty. Rescues task
  99 (provider error mid-run, result.json already written).
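
The broadened discovery rule can be sketched as a pure ranking over (file name, size) pairs. Everything here is an illustration under stated assumptions: "canonical" is taken to mean a name starting with `result`, the 16-byte minimum is applied to every candidate, and the names `Candidate`, `is_result_shaped`, and `best_artifact` are hypothetical rather than the signature of `discover_result_files`.

```rust
/// A file found in the session cwd: (file name, size in bytes).
type Candidate = (String, u64);

/// Minimum size for an artifact to count as substantive (from the commit text).
const MIN_ARTIFACT_BYTES: u64 = 16;

/// Assumption: "canonical" means the file name starts with `result`.
fn is_canonical(name: &str) -> bool {
    name.starts_with("result")
}

/// A candidate is result-shaped if it is large enough and either has a
/// canonical name or one of the listed data extensions.
fn is_result_shaped(name: &str, size: u64) -> bool {
    let has_data_ext = [".json", ".csv", ".md", ".txt"]
        .iter()
        .any(|ext| name.ends_with(*ext));
    size >= MIN_ARTIFACT_BYTES && (is_canonical(name) || has_data_ext)
}

/// Picks the best artifact: canonical names first, then the larger file.
fn best_artifact(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .filter(|(name, size)| is_result_shaped(name, *size))
        .max_by_key(|(name, size)| (is_canonical(name), *size))
}
```

Under these assumptions a lone `feb17_selected.json` is picked up even though it is not named `result.*`, and a small `result.json` still beats a much larger non-canonical dump.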

Deliberately NO loop-breaker here (Phase 2 deletes the control-plane browser tool,
making the no-op doom loop structurally impossible) and NO prompt/cruft changes
(those land in Phase 2's simplification sweep to avoid churning prompt+tests twice).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(phase1): verdict — floor fix is a deterministic win (ok=false 5->1); score jump is variance

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>