
Fix benchmark regression #102

Open
gregpr07 wants to merge 9 commits into main from fix/real-v8-restore-88-v2

@gregpr07 gregpr07 commented Jun 10, 2026

Summary by cubic

Restore 88-baseline behavior and add quick wins: faster observe polling, full script output visibility, safer/clearer completion, and smarter prompts/audits to cut wasted turns and finish runs with correct results. Also adds a k=2 validation report showing K1=82, K2=90 (mean 86).

  • Bug Fixes

    • Observe: default back to 1s (agent and browser lib), removed the hidden 30s eval coupling, and when a run is already finished, observe now says so and stops suggesting more polls.
    • Output visibility: raised inline browser_script stdout cap to 120KB (matches script limit) and updated truncation guidance so the model recovers missing data instead of guessing.
    • Completion: successful done is terminal; if a bounded run ends on plain text, the loop gives one “call done” nudge; if the model ends without done, emit the best result.* artifact from cwd as the final result.
    • Compaction: fixed empty-summary handling so it falls back to “(no summary available)” before prefixing, preventing post-compaction amnesia.
    • Audits and prompts: null and "" no longer count as placeholders; wording-based rejections removed; prompts drop the one-pass ceiling, add a visual fallback, forbid fabricated/proxy values, require basic reality checks, and add efficiency rules (chunking, checkpointing, limited retries, concurrent fetch, and observe discipline, including reading checkpoint files).
    • Unified exec: added timeout_ms with a safe default; on timeout the entire process group is now killed to prevent stray processes.
    • Added BROWSER_USE_DISABLE_LOCAL_SEARCH to optionally disable local search while keeping hosted web_search.
    • Docs: added k=2 quick-pack validation report (K1=82, K2=90, mean 86).
  • Dependencies

    • Added libc for process-group management on Unix.

Written for commit da14329. Summary will update on new commits.

gregpr07 and others added 7 commits June 8, 2026 16:03
Snapshot of the exact code that produced the real_v8 81/100 run
(root: but-fix-main-policy-runtime-eval-...-20260609-155756) so the
regression baseline is reproducible and subsequent restore-88 fixes
land as a clean diff on top. No behavior change; just commits the
20 previously-uncommitted source files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Revert the "observe30" regression: DEFAULT_OBSERVE_TIMEOUT_MS and
BROWSER_SCRIPT_DEFAULT_OBSERVE_MS 30_000 -> 1_000 (the pre-PR-#60 / 88
baseline value). The 30s default + 30s clamp-floor blocked each observe
up to 30s and burned the run timebox, leaving long-script tasks
unfinished (real_v8 tasks 1, 4 never emitted session.done).

Raise MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 4KB -> 16KB so large
extractions aren't truncated into re-scrape loops.

Ablation rung 1 of the 81->88 restore. Measured independently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…older

collect_json_placeholder_stats counted Value::Null as a placeholder, so a
result that kept required-but-unavailable fields as null tripped the >=30%
placeholder rejection. That pushed the agent to DELETE required fields to
pass the audit (real_v8 task 53: filed_time/timezone_shown dropped) and
double-rejected legitimately-sparse results (task 41). Count null toward
the denominator but not as a placeholder. Applies to both the inline
`result` and `result_file` audit paths. Keeps literal evasion strings
("unknown"/"n/a"/...) as placeholders.

Follow-up (needs measurement): result_file audit reads a preview, so a
large file with a clean head can still slip a high full-file null rate
(task 94). Left for the measured pass.
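
The counting rule in this commit can be sketched as follows. This is a minimal illustration, assuming a simplified stand-in `Value` enum rather than the project's actual JSON type; `PlaceholderStats`, `is_evasion_string`, `collect_stats`, and `should_reject` are hypothetical names. Only the 30% threshold, the null handling, and the "unknown"/"n/a" examples come from the commit message.

```rust
/// Simplified stand-in for a JSON value (assumption: the real audit uses its own `Value` type).
#[derive(Debug, Clone)]
enum Value {
    Null,
    Text(String),
    Number(f64),
    Array(Vec<Value>),
    Object(Vec<(String, Value)>),
}

/// Hypothetical counters: `total` is the denominator, `placeholders` the numerator.
#[derive(Debug, Default, PartialEq)]
struct PlaceholderStats {
    total: usize,
    placeholders: usize,
}

/// Illustrative evasion list; the real list is longer ("unknown"/"n/a"/...).
fn is_evasion_string(text: &str) -> bool {
    matches!(text.trim().to_ascii_lowercase().as_str(), "unknown" | "n/a")
}

/// Walk every leaf. Null counts toward the denominator but is not a placeholder.
fn collect_stats(value: &Value, stats: &mut PlaceholderStats) {
    match value {
        Value::Null | Value::Number(_) => stats.total += 1,
        Value::Text(text) => {
            stats.total += 1;
            if is_evasion_string(text) {
                stats.placeholders += 1;
            }
        }
        Value::Array(items) => {
            for item in items {
                collect_stats(item, stats);
            }
        }
        Value::Object(fields) => {
            for (_, item) in fields {
                collect_stats(item, stats);
            }
        }
    }
}

/// Reject when placeholders make up at least 30% of all leaves.
fn should_reject(stats: &PlaceholderStats) -> bool {
    stats.total > 0 && stats.placeholders * 100 >= stats.total * 30
}
```

Under this rule a result that keeps `filed_time` and `timezone_shown` as null passes the audit, so the agent has no incentive to delete required fields.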

Ablation rung 3 of the 81->88 restore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… fallback

The "at most one targeted repair pass" instruction appeared in four prompt
files and drove premature finalize on incomplete data (real_v8 task 32:
17 vs 64 turns, whole source categories never visited). Replace the
one-pass ceiling with time-bounded repair: keep repairing the specific
missing items until required rows/fields are satisfied or the run timebox
is nearly spent; only avoid blindly restarting a whole fluctuating crawl.

Add a visual-fallback mandate to the system prompt: if a script / http_get
/ browser_fetch / endpoint / selector fails, returns empty, or is blocked,
fall back to navigating and reading the rendered page before marking
anything unavailable — do not re-run the same failing script or drop the
source (the dominant strategy regression across tasks 38,39,47,67).

Files: browser-agent-system.md, dataset-case-user.md,
python-tool-description.md, browser-script-tool-description.md.
Ablation rung 2 of the 81->88 restore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le audit/observe

Rung 0 (the big one): MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 16KB -> 120KB
(matches SCRIPT_MAX_OUTPUT_CHARS). The 4KB cap (born in the 75-era code,
absent at the 88 baseline) blinded the model to its own script output:
truncations 0->619->780 across 88->75->81 and KeyError-class blind-guess
bugs 8->41->53 at constant script volume. Codex parity: fresh tool output
is never capped; history truncation (context/mod.rs policy*1.2) handles
growth. Also reword the truncation notice — it told the model to use "a
narrower extraction instead of re-reading", actively training it not to
recover missing data.

Rung 4: port fallback_result_file_for_session from exp/real-v8-restore-88
(2bd479d) — when the model ends without done(), emit session.done carrying
the best result.* artifact from cwd instead of losing finished work
(real_v8 tasks 1, 45, earlier 1/4/41/90).

Decouple BROWSER_USE_EVAL_DONE_AUDIT from observe timing: the audit flag
silently capped observes at 30s (hidden cross-subsystem coupling); observe
caps now come only from BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS.
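
A minimal sketch of the decoupled timing, assuming the default is a `u64` in milliseconds and that the cap is read from the environment as a plain integer. `parse_max_observe_ms`, `max_observe_from_env`, and `effective_observe_timeout_ms` are hypothetical helpers, not the project's functions; the 1_000 ms default and the variable name come from the commit messages.

```rust
/// Assumed type; the 1_000 ms value is the restored default from this PR.
const DEFAULT_OBSERVE_TIMEOUT_MS: u64 = 1_000;

/// Parse an optional cap. Anything unparsable means "no cap".
fn parse_max_observe_ms(raw: Option<&str>) -> Option<u64> {
    raw?.trim().parse::<u64>().ok()
}

/// Read the cap from the only variable that should influence it.
fn max_observe_from_env() -> Option<u64> {
    let raw = std::env::var("BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS").ok();
    parse_max_observe_ms(raw.as_deref())
}

/// Apply the default and the optional upper cap. There is no lower floor,
/// so a short requested poll stays short.
fn effective_observe_timeout_ms(requested: Option<u64>, max_ms: Option<u64>) -> u64 {
    let wanted = requested.unwrap_or(DEFAULT_OBSERVE_TIMEOUT_MS);
    match max_ms {
        Some(cap) => wanted.min(cap),
        None => wanted,
    }
}
```

A call site would look like `effective_observe_timeout_ms(requested, max_observe_from_env())`. The audit flag does not appear anywhere in this path, which is the point of the decoupling.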

Prompt: forbid fabricated/pattern-guessed values; honest null/"not found"
with checked source is acceptable (task 68 fabricated emails then disavowed).

Tests: updated 3 assertions to the new cap/clamp/notice. The 6 remaining
failures (5 prompts::tests + stored_cloud_preference) pre-date this work —
they assert phrases absent from prompts at BOTH baselines.

Stream idle-timeout (task 22 class) verified already present in this
lineage (provider.rs:1081, default 300s) — no port needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

3 issues found across 22 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/browser-use-agent/src/tools/handlers/done.rs">

<violation number="1" location="crates/browser-use-agent/src/tools/handlers/done.rs:311">
P2: `read_result_file_preview` reads the full file despite the preview cap, causing avoidable large I/O and memory usage.</violation>
</file>

<file name="crates/browser-use-browser/src/lib.rs">

<violation number="1" location="crates/browser-use-browser/src/lib.rs:279">
P2: Completed browser_script cache limits were increased enough to risk significant memory growth in long-running sessions.</violation>
</file>

<file name="crates/browser-use-agent/src/entrypoint/mod.rs">

<violation number="1" location="crates/browser-use-agent/src/entrypoint/mod.rs:2700">
P1: `session.done` is emitted for aborted turns because `Ok(...)` from `TurnLoop` includes interruption/max-turns paths.</violation>
</file>


result_file = Some(fallback.file);
}
}
if let Some(text) = final_message.as_deref() {

@cubic-dev-ai cubic-dev-ai Bot Jun 10, 2026


P1: session.done is emitted for aborted turns because Ok(...) from TurnLoop includes interruption/max-turns paths.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/entrypoint/mod.rs, line 2700:

<comment>`session.done` is emitted for aborted turns because `Ok(...)` from `TurnLoop` includes interruption/max-turns paths.</comment>

<file context>
@@ -2611,10 +2674,83 @@ impl<Sd: SamplingDriver> RuntimeTurnLoopDriver<Sd> {
+                        result_file = Some(fallback.file);
+                    }
+                }
+                if let Some(text) = final_message.as_deref() {
+                    runtime_handle.append_observed_session_event(
+                        runtime_session_id,
</file context>
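
One way to address this comment is to make the loop's successful return value say how the run ended, and emit `session.done` only for a genuine finish. The sketch below assumes a hypothetical `TurnEnd` enum and `should_emit_session_done` helper; the real `TurnLoop` return type is not shown on this page and may differ.

```rust
/// Hypothetical outcome type: how a turn loop that returned `Ok` actually ended.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TurnEnd {
    /// The model stopped on its own (with or without calling done).
    ModelFinished,
    /// The user or runtime interrupted the turn.
    Interrupted,
    /// The bounded-run cap was hit.
    MaxTurnsReached,
}

/// Emit `session.done` only when the model finished; aborted turns and
/// errors are reported through their own paths.
fn should_emit_session_done(result: &Result<TurnEnd, String>) -> bool {
    matches!(result, Ok(TurnEnd::ModelFinished))
}
```

Whether a max-turns exit should still carry the fallback result artifact is a policy choice for the author; the sketch only shows the conservative reading of the review.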

return Err(format!("result_file `{}` is empty", path));
}

let bytes = fs::read(&resolved).map_err(|error| {

@cubic-dev-ai cubic-dev-ai Bot Jun 10, 2026


P2: read_result_file_preview reads the full file despite the preview cap, causing avoidable large I/O and memory usage.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/tools/handlers/done.rs, line 311:

<comment>`read_result_file_preview` reads the full file despite the preview cap, causing avoidable large I/O and memory usage.</comment>

<file context>
@@ -224,3 +229,240 @@ impl ToolRuntime<DoneRequest, ExecOutput> for DoneTool {
+        return Err(format!("result_file `{}` is empty", path));
+    }
+
+    let bytes = fs::read(&resolved).map_err(|error| {
+        format!(
+            "result_file `{}` could not be read at {} ({error})",
</file context>
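
A bounded read along the lines this comment asks for might look like the sketch below. `read_preview` and its `max_bytes` parameter are hypothetical names; the approach relies only on `std::io::Read::take`, which stops after the given number of bytes, so the cost scales with the preview cap rather than the file size.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Read at most `max_bytes` from the start of `path`.
/// Files shorter than the cap are returned whole.
fn read_preview(path: &Path, max_bytes: u64) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let mut buffer = Vec::new();
    // `take` wraps the file in a reader that reports EOF after `max_bytes`.
    file.take(max_bytes).read_to_end(&mut buffer)?;
    Ok(buffer)
}
```

Note that a cut at a fixed byte offset can split a multi-byte UTF-8 character, so a caller that needs text would decode the result lossily or trim back to the last valid boundary.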

Comment on lines +279 to +280
const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 60 * 60 * 1_000;
const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 2048;

@cubic-dev-ai cubic-dev-ai Bot Jun 10, 2026


P2: Completed browser_script cache limits were increased enough to risk significant memory growth in long-running sessions.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-browser/src/lib.rs, line 279:

<comment>Completed browser_script cache limits were increased enough to risk significant memory growth in long-running sessions.</comment>

<file context>
@@ -276,8 +276,8 @@ static MANAGED_BROWSER_PIDS: OnceLock<Mutex<HashSet<u32>>> = OnceLock::new();
 static BROWSER_SCRIPT_RUN_COUNTER: AtomicU64 = AtomicU64::new(1);
-const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 10 * 60 * 1_000;
-const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 128;
+const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 60 * 60 * 1_000;
+const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 2048;
 const MANAGED_BROWSER_PROFILE_PREFIX: &str = "but-managed-browser.";
</file context>
Suggested change
-const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 60 * 60 * 1_000;
-const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 2048;
+const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 10 * 60 * 1_000;
+const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 128;
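
To make the trade-off concrete, here is a minimal sketch of a cache bounded by both a TTL and an entry count. `CompletedCache` is a hypothetical type, not the project's cache; it only illustrates how the two constants interact. Worst-case memory is roughly `max_entries` times the largest entry, so if each entry could hold output up to the 120KB cap mentioned elsewhere in this PR (an assumption), 2048 entries would be on the order of 240 MB versus about 15 MB at 128.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical completed-run cache: oldest entries sit at the front.
struct CompletedCache<T> {
    ttl: Duration,
    max_entries: usize,
    entries: VecDeque<(Instant, String, T)>,
}

impl<T> CompletedCache<T> {
    fn new(ttl: Duration, max_entries: usize) -> Self {
        Self { ttl, max_entries, entries: VecDeque::new() }
    }

    /// Drop entries whose age has reached the TTL.
    fn evict_expired(&mut self, now: Instant) {
        while self
            .entries
            .front()
            .is_some_and(|(inserted, _, _)| now.duration_since(*inserted) >= self.ttl)
        {
            self.entries.pop_front();
        }
    }

    /// Insert a finished run, then enforce the entry-count cap.
    fn insert(&mut self, now: Instant, run_id: String, value: T) {
        self.evict_expired(now);
        self.entries.push_back((now, run_id, value));
        while self.entries.len() > self.max_entries {
            self.entries.pop_front();
        }
    }

    /// Look up a run that has not expired or been evicted.
    fn get(&mut self, now: Instant, run_id: &str) -> Option<&T> {
        self.evict_expired(now);
        self.entries
            .iter()
            .find(|(_, id, _)| id == run_id)
            .map(|(_, _, value)| value)
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Passing `now` explicitly keeps the eviction logic deterministic, which makes the TTL and count limits easy to exercise in tests.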

…udges, discipline rules)

Quick wins (evidence: 100-task deep-dive of the 85-run):
- done-audit: "" no longer counts as a placeholder — tasks explicitly mandate
  "" for missing values; counting it rejected a spec-correct answer and
  coerced literal placeholder prose (real_v8 task 94).
- done-audit: remove phrase-based wording rejections (text_audit_reasons).
  Zero measured wins; measured harms: agents resubmitted identical data with
  laundered confident phrasing (75, 77) and burned 5.7min post-rejection on a
  source-exhausted spec (87). Evidence checks (empty/placeholder JSON) stay.
- compaction: the "(no summary available)" fallback checked the full string
  AFTER the prefix was prepended, so it was dead code; an empty summarize()
  pass shipped "...Here is the summary...:" + NOTHING (task 24: amnesia, 30
  thrash turns, task lost). Fallback now applies to the suffix.
- loop: CALL_DONE_NUDGE — in bounded (max_turns) runs, a text-only completion
  (finish_reason Stop/None, never done()'s ToolUse) gets ONE developer nudge
  to call done() before the text is accepted as final (tasks 24/45 class).
- prompts: anti-proxy/laundering rule (negative control for identifiers; no
  field-copied proxies; task-mandated representations used literally) and
  external reality-probe requirement in collection audits (site-stated total
  / largest-entity check; tasks 31/32/36/72/74/75).
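
The compaction fix in the list above comes down to where the emptiness check runs. The sketch below contrasts the two shapes; the function names are hypothetical and `SUMMARY_PREFIX` is a stand-in string, since the commit message elides the real prefix text.

```rust
/// Stand-in for the real prefix, which the commit message does not spell out.
const SUMMARY_PREFIX: &str = "[summary prefix] ";
const NO_SUMMARY_FALLBACK: &str = "(no summary available)";

/// Pre-fix shape: the check runs on the already-prefixed string, which is
/// never empty, so the fallback branch is dead code.
fn compaction_message_before_fix(summary: &str) -> String {
    let message = format!("{}{}", SUMMARY_PREFIX, summary.trim());
    if message.is_empty() {
        NO_SUMMARY_FALLBACK.to_string()
    } else {
        message
    }
}

/// Post-fix shape: the fallback replaces an empty suffix before prefixing.
fn compaction_message_after_fix(summary: &str) -> String {
    let suffix = match summary.trim() {
        "" => NO_SUMMARY_FALLBACK,
        text => text,
    };
    format!("{}{}", SUMMARY_PREFIX, suffix)
}
```

With an empty summary, the first function returns the bare prefix and nothing else, which matches the amnesia described for task 24; the second always gives the model an explicit marker.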

Efficiency pack:
- prompts: chunk-under-timeout + checkpoint-every-item + never-retry-monolith;
  concurrent fetch for >=3 independent URLs (task 68: 4min serial vs 0.4s);
  done(result_file) instead of re-typing large artifacts (~10 min/run).
- browser_script description: observe discipline (one generous poll; read
  checkpoint files; stop after completed/not_found) + js() return-shape and
  regex-escaping rules (~50 self-inflicted failures/run).
- observe: repeat-observe of a finished run now appends an unmissable
  "already completed - stop polling" note and clears next_observe_ms.

Tests: 1058 pass; 3 assertions updated to the new intended behavior; the 6
remaining failures pre-date this branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 10 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/browser-use-agent/src/turn/loop_driver.rs">

<violation number="1" location="crates/browser-use-agent/src/turn/loop_driver.rs:253">
P2: The new done-nudge path can exceed `max_turns` by one sampling round, violating the bounded-run cap.</violation>
</file>


// to call done before we accept the text as final.
let ended_without_done =
!matches!(outcome.finish_reason, Some(FinishReason::ToolUse));
if max_turns.is_some() && ended_without_done && !done_nudge_sent {

@cubic-dev-ai cubic-dev-ai Bot Jun 10, 2026


P2: The new done-nudge path can exceed max_turns by one sampling round, violating the bounded-run cap.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/turn/loop_driver.rs, line 253:

<comment>The new done-nudge path can exceed `max_turns` by one sampling round, violating the bounded-run cap.</comment>

<file context>
@@ -229,6 +244,18 @@ impl<St: TurnState, Sd: SamplingDriver, Ob: TurnObserver> TurnLoop<St, Sd, Ob> {
+                    // to call done before we accept the text as final.
+                    let ended_without_done =
+                        !matches!(outcome.finish_reason, Some(FinishReason::ToolUse));
+                    if max_turns.is_some() && ended_without_done && !done_nudge_sent {
+                        done_nudge_sent = true;
+                        pending_done_nudge = true;
</file context>
Suggested change
-if max_turns.is_some() && ended_without_done && !done_nudge_sent {
+if max_turns.is_some_and(|limit| turns_run < limit) && ended_without_done && !done_nudge_sent {
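
The effect of the budget check can be seen in a small simulation. This is a sketch under simplifying assumptions: each sampled turn is reduced to a hypothetical `Turn` enum, `run_bounded` and `RunSummary` are illustrative names, and the real loop tracks far more state. Only the condition `max_turns.is_some_and(|limit| turns_run < limit)` and the one-nudge rule mirror the review.

```rust
/// Hypothetical reduction of a sampled turn to what matters for the nudge.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Turn {
    /// An ordinary tool call; the loop keeps going.
    ToolCall,
    /// The model called done.
    Done,
    /// A text-only completion (no tool use).
    Text,
}

#[derive(Debug, PartialEq)]
struct RunSummary {
    turns_run: usize,
    nudged: bool,
    finished_with_done: bool,
}

/// Replay a scripted sequence of turns under an optional cap.
fn run_bounded(script: &[Turn], max_turns: Option<usize>) -> RunSummary {
    let mut turns_run = 0usize;
    let mut done_nudge_sent = false;
    let mut finished_with_done = false;
    for turn in script {
        // Never start a sampling round once the cap is reached.
        if max_turns.is_some_and(|limit| turns_run >= limit) {
            break;
        }
        turns_run += 1;
        match turn {
            Turn::ToolCall => continue,
            Turn::Done => {
                finished_with_done = true;
                break;
            }
            Turn::Text => {
                // Nudge at most once, and only if a further round fits the cap.
                let budget_left = max_turns.is_some_and(|limit| turns_run < limit);
                if budget_left && !done_nudge_sent {
                    done_nudge_sent = true;
                    continue;
                }
                break; // accept the text as final
            }
        }
    }
    RunSummary { turns_run, nudged: done_nudge_sent, finished_with_done }
}
```

With a cap of 2 and a text completion on the second turn, no nudge is sent and the run stops at exactly 2 turns; with a cap of 5 the same script gets its single nudge and can finish with done.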

… 88-baseline), mean 86

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>