Fix benchmark regression #102
Snapshot of the exact code that produced the real_v8 81/100 run (root: but-fix-main-policy-runtime-eval-...-20260609-155756) so the regression baseline is reproducible and subsequent restore-88 fixes land as a clean diff on top. No behavior change; just commits the 20 previously-uncommitted source files.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Revert the "observe30" regression: DEFAULT_OBSERVE_TIMEOUT_MS and BROWSER_SCRIPT_DEFAULT_OBSERVE_MS 30_000 -> 1_000 (the pre-PR-#60 / 88 baseline value). The 30s default + 30s clamp-floor blocked each observe up to 30s and burned the run timebox, leaving long-script tasks unfinished (real_v8 tasks 1, 4 never emitted session.done).
Raise MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 4KB -> 16KB so large extractions aren't truncated into re-scrape loops.
Ablation rung 1 of the 81->88 restore. Measured independently.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
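The cost of the 30s default can be illustrated with a small sketch. Only `DEFAULT_OBSERVE_TIMEOUT_MS` and its 1_000 value come from the commit message; the helper names, the `u64` type, the optional cap parameter, and the call count of 40 are assumptions made for illustration.

```rust
/// Default wait for one observe call, using the value named in the commit message.
const DEFAULT_OBSERVE_TIMEOUT_MS: u64 = 1_000;

/// Hypothetical helper: choose the timeout for a single observe call.
/// `max_cap_ms` stands in for an optional operator-supplied upper bound.
fn resolve_observe_timeout_ms(requested_ms: Option<u64>, max_cap_ms: Option<u64>) -> u64 {
    let wanted = requested_ms.unwrap_or(DEFAULT_OBSERVE_TIMEOUT_MS);
    match max_cap_ms {
        Some(cap) => wanted.min(cap),
        None => wanted,
    }
}

/// Upper bound on time spent blocked if every observe waits out its full timeout.
fn worst_case_blocked_ms(observe_calls: u64, timeout_ms: u64) -> u64 {
    observe_calls.saturating_mul(timeout_ms)
}

fn main() {
    let calls = 40; // illustrative count, not a measured figure
    let now = resolve_observe_timeout_ms(None, None);
    println!("at 30s per observe: {} ms", worst_case_blocked_ms(calls, 30_000));
    println!("at {now} ms per observe: {} ms", worst_case_blocked_ms(calls, now));
}
```

Under these assumptions, the same number of observes blocks for at most one thirtieth of the time once the default drops back to 1s.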
…older
collect_json_placeholder_stats counted Value::Null as a placeholder, so a
result that kept required-but-unavailable fields as null tripped the >=30%
placeholder rejection. That pushed the agent to DELETE required fields to
pass the audit (real_v8 task 53: filed_time/timezone_shown dropped) and
double-rejected legitimately-sparse results (task 41). Count null toward
the denominator but not as a placeholder. Applies to both the inline
`result` and `result_file` audit paths. Keeps literal evasion strings
("unknown"/"n/a"/...) as placeholders.
Follow-up (needs measurement): result_file audit reads a preview, so a
large file with a clean head can still slip a high full-file null rate
(task 94). Left for the measured pass.
Ablation rung 3 of the 81->88 restore.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
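The counting rule described in this commit can be sketched as follows. This is a minimal illustration, not the actual `collect_json_placeholder_stats`: the `Json` enum is a stand-in for the real JSON value type, the helper names are hypothetical, and only the two evasion strings quoted in the commit message are listed.

```rust
/// Stand-in for a JSON value (assumption: the real code works on a richer type).
enum Json {
    Null,
    Text(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

#[derive(Debug, Default, PartialEq)]
struct PlaceholderStats {
    placeholders: usize,
    total: usize,
}

/// Only the strings quoted in the commit message; the full list is not shown there.
const EVASION_STRINGS: [&str; 2] = ["unknown", "n/a"];

/// Walk the value: every leaf counts toward `total`, but null is not a placeholder.
fn collect(value: &Json, stats: &mut PlaceholderStats) {
    match value {
        Json::Null => stats.total += 1,
        Json::Text(text) => {
            stats.total += 1;
            let normalized = text.trim().to_ascii_lowercase();
            if EVASION_STRINGS.iter().any(|p| *p == normalized.as_str()) {
                stats.placeholders += 1;
            }
        }
        Json::Array(items) => items.iter().for_each(|item| collect(item, stats)),
        Json::Object(fields) => fields.iter().for_each(|(_, v)| collect(v, stats)),
    }
}

/// Reject when placeholders make up at least 30% of all leaves.
fn is_rejected(stats: &PlaceholderStats) -> bool {
    stats.total > 0 && stats.placeholders * 100 >= stats.total * 30
}

fn main() {
    let result = Json::Array(vec![Json::Object(vec![
        ("title".to_string(), Json::Text("Example filing".to_string())),
        ("filed_time".to_string(), Json::Null),
        ("timezone_shown".to_string(), Json::Null),
    ])]);
    let mut stats = PlaceholderStats::default();
    collect(&result, &mut stats);
    println!("{stats:?} rejected={}", is_rejected(&stats));
}
```

With null excluded from the numerator, a result that keeps two unavailable fields as null passes the audit, so the agent has no incentive to delete them.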
… fallback
The "at most one targeted repair pass" instruction appeared in four prompt files and drove premature finalize on incomplete data (real_v8 task 32: 17 vs 64 turns, whole source categories never visited). Replace the one-pass ceiling with time-bounded repair: keep repairing the specific missing items until required rows/fields are satisfied or the run timebox is nearly spent; only avoid blindly restarting a whole fluctuating crawl.
Add a visual-fallback mandate to the system prompt: if a script / http_get / browser_fetch / endpoint / selector fails, returns empty, or is blocked, fall back to navigating and reading the rendered page before marking anything unavailable; do not re-run the same failing script or drop the source (the dominant strategy regression across tasks 38, 39, 47, 67).
Files: browser-agent-system.md, dataset-case-user.md, python-tool-description.md, browser-script-tool-description.md.
Ablation rung 2 of the 81->88 restore.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le audit/observe
Rung 0 (the big one): MAX_INLINE_BROWSER_SCRIPT_STDOUT_BYTES 16KB -> 120KB (matches SCRIPT_MAX_OUTPUT_CHARS). The 4KB cap (born in the 75-era code, absent at the 88 baseline) blinded the model to its own script output: truncations 0->619->780 across 88->75->81 and KeyError-class blind-guess bugs 8->41->53 at constant script volume. Codex parity: fresh tool output is never capped; history truncation (context/mod.rs policy*1.2) handles growth. Also reword the truncation notice: it told the model to use "a narrower extraction instead of re-reading", actively training it not to recover missing data.
Rung 4: port fallback_result_file_for_session from exp/real-v8-restore-88 (2bd479d). When the model ends without done(), emit session.done carrying the best result.* artifact from cwd instead of losing finished work (real_v8 tasks 1, 45, earlier 1/4/41/90).
Decouple BROWSER_USE_EVAL_DONE_AUDIT from observe timing: the audit flag silently capped observes at 30s (hidden cross-subsystem coupling); observe caps now come only from BROWSER_USE_EVAL_MAX_OBSERVE_TIMEOUT_MS.
Prompt: forbid fabricated/pattern-guessed values; honest null/"not found" with checked source is acceptable (task 68 fabricated emails then disavowed).
Tests: updated 3 assertions to the new cap/clamp/notice. The 6 remaining failures (5 prompts::tests + stored_cloud_preference) pre-date this work; they assert phrases absent from prompts at BOTH baselines.
Stream idle-timeout (task 22 class) verified already present in this lineage (provider.rs:1081, default 300s); no port needed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
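The Rung 4 fallback can be sketched roughly as below. This is not the ported `fallback_result_file_for_session`; the function name `pick_fallback_result_file` is hypothetical, and the selection rule (most recently modified, non-empty file whose name starts with `result.`) is an assumption about what "best" means, since the commit message does not define it.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Hypothetical helper: pick a `result.*` file from `cwd` to attach to session.done.
/// Assumed rule: non-empty regular files only, newest modification time wins.
fn pick_fallback_result_file(cwd: &Path) -> io::Result<Option<PathBuf>> {
    let mut best: Option<(SystemTime, PathBuf)> = None;
    for entry in fs::read_dir(cwd)? {
        let entry = entry?;
        let name = entry.file_name();
        let name = name.to_string_lossy();
        if !name.starts_with("result.") {
            continue;
        }
        let meta = entry.metadata()?;
        if !meta.is_file() || meta.len() == 0 {
            continue; // skip directories and empty artifacts
        }
        let modified = meta.modified()?;
        if best.as_ref().map_or(true, |(t, _)| modified > *t) {
            best = Some((modified, entry.path()));
        }
    }
    Ok(best.map(|(_, path)| path))
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("fallback-demo-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("result.json"), b"[{\"id\": 1}]")?;
    fs::write(dir.join("result.csv"), b"")?;
    fs::write(dir.join("notes.txt"), b"scratch")?;
    println!("{:?}", pick_fallback_result_file(&dir)?);
    fs::remove_dir_all(&dir)
}
```

A real implementation would likely also need to decide when the fallback is allowed at all, for example not after an aborted run.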
3 issues found across 22 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/browser-use-agent/src/tools/handlers/done.rs">
<violation number="1" location="crates/browser-use-agent/src/tools/handlers/done.rs:311">
P2: `read_result_file_preview` reads the full file despite the preview cap, causing avoidable large I/O and memory usage.</violation>
</file>
<file name="crates/browser-use-browser/src/lib.rs">
<violation number="1" location="crates/browser-use-browser/src/lib.rs:279">
P2: Completed browser_script cache limits were increased enough to risk significant memory growth in long-running sessions.</violation>
</file>
<file name="crates/browser-use-agent/src/entrypoint/mod.rs">
<violation number="1" location="crates/browser-use-agent/src/entrypoint/mod.rs:2700">
P1: `session.done` is emitted for aborted turns because `Ok(...)` from `TurnLoop` includes interruption/max-turns paths.</violation>
</file>
result_file = Some(fallback.file);
}
}
if let Some(text) = final_message.as_deref() {
P1: session.done is emitted for aborted turns because Ok(...) from TurnLoop includes interruption/max-turns paths.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/entrypoint/mod.rs, line 2700:
<comment>`session.done` is emitted for aborted turns because `Ok(...)` from `TurnLoop` includes interruption/max-turns paths.</comment>
<file context>
@@ -2611,10 +2674,83 @@ impl<Sd: SamplingDriver> RuntimeTurnLoopDriver<Sd> {
+ result_file = Some(fallback.file);
+ }
+ }
+ if let Some(text) = final_message.as_deref() {
+ runtime_handle.append_observed_session_event(
+ runtime_session_id,
</file context>
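One way to address this finding is to make the reason a turn loop stopped explicit and emit `session.done` only for a genuine completion. The sketch below is a hypothetical illustration; `TurnEnd` and `should_emit_session_done` are invented names, and the real `TurnLoop` return type is not shown in this thread.

```rust
/// Hypothetical classification of how a turn loop ended.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TurnEnd {
    Completed,
    Interrupted,
    MaxTurnsReached,
}

/// Emit session.done only when the loop returned Ok AND actually completed.
/// An Ok that wraps an interruption or a max-turns stop is not a finished session.
fn should_emit_session_done(result: &Result<TurnEnd, String>) -> bool {
    matches!(result, Ok(TurnEnd::Completed))
}

fn main() {
    let outcomes: [Result<TurnEnd, String>; 4] = [
        Ok(TurnEnd::Completed),
        Ok(TurnEnd::Interrupted),
        Ok(TurnEnd::MaxTurnsReached),
        Err("sampling failed".to_string()),
    ];
    for outcome in &outcomes {
        println!("{outcome:?} -> emit={}", should_emit_session_done(outcome));
    }
}
```

Whether a max-turns stop should still carry a fallback artifact is a separate policy question that this sketch leaves open.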
return Err(format!("result_file `{}` is empty", path));
}

let bytes = fs::read(&resolved).map_err(|error| {
P2: read_result_file_preview reads the full file despite the preview cap, causing avoidable large I/O and memory usage.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/tools/handlers/done.rs, line 311:
<comment>`read_result_file_preview` reads the full file despite the preview cap, causing avoidable large I/O and memory usage.</comment>
<file context>
@@ -224,3 +229,240 @@ impl ToolRuntime<DoneRequest, ExecOutput> for DoneTool {
+ return Err(format!("result_file `{}` is empty", path));
+ }
+
+ let bytes = fs::read(&resolved).map_err(|error| {
+ format!(
+ "result_file `{}` could not be read at {} ({error})",
</file context>
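A bounded read is one possible fix for this finding. The sketch below uses `std::io::Read::take` so that at most `max_bytes` are pulled from disk; the function name `read_preview` and the `(bytes, truncated)` return shape are assumptions, not the signature of `read_result_file_preview`.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

/// Read at most `max_bytes` from `path`.
/// Returns the bytes read and whether the file was longer than the cap.
fn read_preview(path: &Path, max_bytes: u64) -> io::Result<(Vec<u8>, bool)> {
    let file = File::open(path)?;
    let total_len = file.metadata()?.len();
    let mut buf = Vec::new();
    // `take` stops the reader after `max_bytes`, so the rest of the file is never loaded.
    file.take(max_bytes).read_to_end(&mut buf)?;
    Ok((buf, total_len > max_bytes))
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join(format!("preview-demo-{}.json", std::process::id()));
    fs::write(&path, vec![b'x'; 10_000])?;
    let (bytes, truncated) = read_preview(&path, 4_096)?;
    println!("read {} bytes, truncated={truncated}", bytes.len());
    fs::remove_file(&path)
}
```

Note that a preview cut at a byte boundary may split a multi-byte UTF-8 character, so a caller that needs text would still have to decode leniently.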
const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 60 * 60 * 1_000;
const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 2048;
P2: Completed browser_script cache limits were increased enough to risk significant memory growth in long-running sessions.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-browser/src/lib.rs, line 279:
<comment>Completed browser_script cache limits were increased enough to risk significant memory growth in long-running sessions.</comment>
<file context>
@@ -276,8 +276,8 @@ static MANAGED_BROWSER_PIDS: OnceLock<Mutex<HashSet<u32>>> = OnceLock::new();
static BROWSER_SCRIPT_RUN_COUNTER: AtomicU64 = AtomicU64::new(1);
-const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 10 * 60 * 1_000;
-const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 128;
+const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 60 * 60 * 1_000;
+const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 2048;
const MANAGED_BROWSER_PROFILE_PREFIX: &str = "but-managed-browser.";
</file context>
Suggested change
-const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 60 * 60 * 1_000;
-const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 2048;
+const BROWSER_SCRIPT_COMPLETED_CACHE_TTL_MS: u128 = 10 * 60 * 1_000;
+const BROWSER_SCRIPT_COMPLETED_CACHE_MAX: usize = 128;
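To make the memory concern concrete, here is a minimal sketch of a cache bounded by both a TTL and an entry count. It is not the crate's implementation: `CompletedCache`, `Entry`, and the explicit `now_ms` parameter are hypothetical, and the sketch assumes entries are inserted in non-decreasing time order. If each entry held output up to the 120KB inline cap (an assumption), 2048 entries would retain roughly 240 MiB against about 15 MiB for 128.

```rust
use std::collections::VecDeque;

struct Entry {
    inserted_at_ms: u128,
    run_id: u64,
    output: String,
}

/// Hypothetical completed-run cache; the oldest entry sits at the front.
struct CompletedCache {
    ttl_ms: u128,
    max_entries: usize,
    entries: VecDeque<Entry>,
}

impl CompletedCache {
    fn new(ttl_ms: u128, max_entries: usize) -> Self {
        Self { ttl_ms, max_entries, entries: VecDeque::new() }
    }

    /// Drop entries older than the TTL.
    fn evict_expired(&mut self, now_ms: u128) {
        let ttl_ms = self.ttl_ms;
        while self
            .entries
            .front()
            .is_some_and(|e| now_ms.saturating_sub(e.inserted_at_ms) > ttl_ms)
        {
            self.entries.pop_front();
        }
    }

    /// Insert a finished run, then enforce the entry-count cap.
    fn insert(&mut self, now_ms: u128, run_id: u64, output: String) {
        self.evict_expired(now_ms);
        self.entries.push_back(Entry { inserted_at_ms: now_ms, run_id, output });
        while self.entries.len() > self.max_entries {
            self.entries.pop_front();
        }
    }

    fn get(&mut self, now_ms: u128, run_id: u64) -> Option<&str> {
        self.evict_expired(now_ms);
        self.entries.iter().find(|e| e.run_id == run_id).map(|e| e.output.as_str())
    }

    /// Bytes of output currently held, the quantity the review comment worries about.
    fn retained_bytes(&self) -> usize {
        self.entries.iter().map(|e| e.output.len()).sum()
    }
}

fn main() {
    let mut cache = CompletedCache::new(10 * 60 * 1_000, 2);
    cache.insert(0, 1, "first".to_string());
    cache.insert(5, 2, "second".to_string());
    cache.insert(9, 3, "third".to_string());
    println!("run 1 cached: {:?}", cache.get(9, 1));
    println!("retained bytes: {}", cache.retained_bytes());
}
```

Tracking retained bytes directly, rather than only the entry count, would be one way to keep a larger entry limit while still bounding memory.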
…udges, discipline rules)
Quick wins (evidence: 100-task deep-dive of the 85-run):
- done-audit: "" no longer counts as a placeholder. Tasks explicitly mandate "" for missing values; counting it rejected a spec-correct answer and coerced literal placeholder prose (real_v8 task 94).
- done-audit: remove phrase-based wording rejections (text_audit_reasons). Zero measured wins; measured harms: agents resubmitted identical data with laundered confident phrasing (75, 77) and burned 5.7min post-rejection on a source-exhausted spec (87). Evidence checks (empty/placeholder JSON) stay.
- compaction: the "(no summary available)" fallback checked the full string AFTER the prefix was prepended, so it was dead code; an empty summarize() pass shipped "...Here is the summary...:" + NOTHING (task 24: amnesia, 30 thrash turns, task lost). Fallback now applies to the suffix.
- loop: CALL_DONE_NUDGE. In bounded (max_turns) runs, a text-only completion (finish_reason Stop/None, never done()'s ToolUse) gets ONE developer nudge to call done() before the text is accepted as final (tasks 24/45 class).
- prompts: anti-proxy/laundering rule (negative control for identifiers; no field-copied proxies; task-mandated representations used literally) and external reality-probe requirement in collection audits (site-stated total / largest-entity check; tasks 31/32/36/72/74/75).
Efficiency pack:
- prompts: chunk-under-timeout + checkpoint-every-item + never-retry-monolith; concurrent fetch for >=3 independent URLs (task 68: 4min serial vs 0.4s); done(result_file) instead of re-typing large artifacts (~10 min/run).
- browser_script description: observe discipline (one generous poll; read checkpoint files; stop after completed/not_found) + js() return-shape and regex-escaping rules (~50 self-inflicted failures/run).
- observe: repeat-observe of a finished run now appends an unmissable "already completed - stop polling" note and clears next_observe_ms.
Tests: 1058 pass; 3 assertions updated to the new intended behavior; the 6 remaining failures pre-date this branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
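The compaction bug in this commit is easy to reproduce in miniature. In the sketch below the prefix wording and both function names are hypothetical (the commit message elides the real prefix); only the "(no summary available)" fallback text is taken from the source.

```rust
/// Hypothetical prefix; the real wording is elided in the commit message.
const SUMMARY_PREFIX: &str = "Here is the summary of the earlier conversation:";
const NO_SUMMARY_FALLBACK: &str = "(no summary available)";

/// Buggy shape: the emptiness check runs after the prefix is prepended,
/// so it can never be true and the fallback is dead code.
fn build_compaction_message_buggy(summary: &str) -> String {
    let full = format!("{SUMMARY_PREFIX}\n{summary}");
    if full.trim().is_empty() {
        NO_SUMMARY_FALLBACK.to_string()
    } else {
        full
    }
}

/// Fixed shape: the fallback is applied to the suffix before the prefix is added.
fn build_compaction_message_fixed(summary: &str) -> String {
    let trimmed = summary.trim();
    let suffix = if trimmed.is_empty() { NO_SUMMARY_FALLBACK } else { trimmed };
    format!("{SUMMARY_PREFIX}\n{suffix}")
}

fn main() {
    println!("buggy: {:?}", build_compaction_message_buggy(""));
    println!("fixed: {:?}", build_compaction_message_fixed(""));
}
```

With an empty summary the buggy version ships the prefix followed by nothing, which matches the failure described for task 24.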
1 issue found across 10 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/browser-use-agent/src/turn/loop_driver.rs">
<violation number="1" location="crates/browser-use-agent/src/turn/loop_driver.rs:253">
P2: The new done-nudge path can exceed `max_turns` by one sampling round, violating the bounded-run cap.</violation>
</file>
// to call done before we accept the text as final.
let ended_without_done =
!matches!(outcome.finish_reason, Some(FinishReason::ToolUse));
if max_turns.is_some() && ended_without_done && !done_nudge_sent {
P2: The new done-nudge path can exceed max_turns by one sampling round, violating the bounded-run cap.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/browser-use-agent/src/turn/loop_driver.rs, line 253:
<comment>The new done-nudge path can exceed `max_turns` by one sampling round, violating the bounded-run cap.</comment>
<file context>
@@ -229,6 +244,18 @@ impl<St: TurnState, Sd: SamplingDriver, Ob: TurnObserver> TurnLoop<St, Sd, Ob> {
+ // to call done before we accept the text as final.
+ let ended_without_done =
+ !matches!(outcome.finish_reason, Some(FinishReason::ToolUse));
+ if max_turns.is_some() && ended_without_done && !done_nudge_sent {
+ done_nudge_sent = true;
+ pending_done_nudge = true;
</file context>
Suggested change
-if max_turns.is_some() && ended_without_done && !done_nudge_sent {
+if max_turns.is_some_and(|limit| turns_run < limit) && ended_without_done && !done_nudge_sent {
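The suggested guard can be isolated into a small predicate to check its edge cases. The function name and the `u32` turn counters are assumptions; the condition itself mirrors the suggestion above, and `Option::is_some_and` requires Rust 1.70 or later.

```rust
/// Hypothetical predicate mirroring the suggested condition:
/// nudge only in bounded runs that still have turn budget left.
fn should_send_done_nudge(
    max_turns: Option<u32>,
    turns_run: u32,
    ended_without_done: bool,
    done_nudge_sent: bool,
) -> bool {
    max_turns.is_some_and(|limit| turns_run < limit) && ended_without_done && !done_nudge_sent
}

fn main() {
    // Budget left: the nudge costs one more sampling round and stays within the cap.
    println!("{}", should_send_done_nudge(Some(10), 9, true, false));
    // Budget spent: nudging here would run an eleventh round.
    println!("{}", should_send_done_nudge(Some(10), 10, true, false));
    // Unbounded runs are never nudged.
    println!("{}", should_send_done_nudge(None, 3, true, false));
}
```

Whether `turns_run` has already been incremented for the current round at the point of the check is not visible in the hunk, so the comparison operator is worth confirming against the real loop.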
… 88-baseline), mean 86
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary by cubic
Restore 88-baseline behavior and add quick wins: faster observe polling, full script output visibility, safer/clearer completion, and smarter prompts/audits to cut wasted turns and finish runs with correct results. Also adds a k=2 validation report showing K1=82, K2=90 (mean 86).
Bug Fixes
- `browser_script` stdout cap to 120KB (matches script limit) and updated truncation guidance so the model recovers missing data instead of guessing.
- `done` is terminal; if a bounded run ends on plain text, the loop gives one “call done” nudge; if the model ends without `done`, emit the best `result.*` artifact from cwd as the final result.
- `null` and `""` no longer count as placeholders; wording-based rejections removed; prompts drop the one-pass ceiling, add a visual fallback, forbid fabricated/proxy values, require basic reality checks, and add efficiency rules (chunking, checkpointing, limited retries, concurrent fetch, and observe discipline, including reading checkpoint files).
- `timeout_ms` with a safe default and kill the entire process group on timeout to prevent stray processes.
- `BROWSER_USE_DISABLE_LOCAL_SEARCH` to optionally disable local `search` while keeping hosted `web_search`.
Dependencies
- `libc` for process-group management on Unix.
Written for commit da14329. Summary will update on new commits.