
Deterministic model selection + judge-only regrading#27

Open
jb510 wants to merge 2 commits into pinchbench:main from jb510:deterministic-model-selection

Conversation


@jb510 jb510 commented Mar 8, 2026

Summary

This PR makes benchmark model routing deterministic and adds a regrading workflow that avoids rerunning task execution.

Deterministic routing

  • require provider-qualified model refs for benchmark + judge (provider/model)
  • preserve provider-qualified IDs exactly (remove silent openrouter/ rewriting)
  • validate requested models against local OpenClaw model catalog
  • verify runtime provider/model from transcript on every execution and judge turn
  • fail fast on model mismatch
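As a rough illustration of the routing rules above (the function name, catalog shape, and error messages are illustrative, not the PR's actual API):

```python
def resolve_model_ref(ref: str, catalog: set[str]) -> str:
    """Require a provider-qualified `provider/model` ref and check the catalog."""
    if "/" not in ref:
        raise ValueError(
            f"Model ref {ref!r} must be provider-qualified (provider/model)."
        )
    if ref not in catalog:
        raise ValueError(f"Model ref {ref!r} not found in local model catalog.")
    # Preserve the ID exactly -- no silent rewriting to another provider.
    return ref

# Hypothetical local catalog for demonstration only.
catalog = {"openrouter/gpt-4o", "anthropic/claude-sonnet-4"}
print(resolve_model_ref("anthropic/claude-sonnet-4", catalog))
```

A bare `gpt-4o` would be rejected up front rather than silently rewritten, which is the fail-fast behavior the bullet list describes.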

Error handling improvements

  • treat only terminal assistant errors as fatal (allow transient mid-turn provider rotation)
  • keep strict fail-fast behavior for true execution/grading failures
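One way to read "only terminal assistant errors are fatal" is: an error entry followed by further successful entries was a transient failure that provider rotation recovered from. A minimal sketch under that assumption (the transcript entry schema here is hypothetical):

```python
def is_fatal_assistant_error(transcript: list[dict]) -> bool:
    """Treat an assistant error as fatal only if it is the final transcript entry.

    A mid-turn error followed by later entries (e.g. a successful retry after
    provider rotation) is considered transient and does not fail the task.
    """
    error_positions = [
        i for i, entry in enumerate(transcript)
        if entry.get("type") == "assistant-error"
    ]
    return bool(error_positions) and error_positions[-1] == len(transcript) - 1
```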

Session persistence + checkpoints

  • preserve agent/judge sessions by default (no auto cleanup)
  • add --clear-sessions to explicitly clear transcripts before turns
  • write rolling checkpoint files during runs
  • include full per-task transcripts in results/checkpoints
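A rolling checkpoint flush along these lines could look like the following sketch (function name and payload shape are illustrative; the atomic-rename step is an assumption, not necessarily what the PR does):

```python
import json
import os
import tempfile

def flush_checkpoint(path: str, results: list[dict]) -> None:
    """Atomically rewrite the rolling checkpoint with all results so far."""
    payload = {"completed": len(results), "results": results}
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh)
    # os.replace is atomic on POSIX, so a crash mid-flush never leaves
    # readers with a partially written checkpoint file.
    os.replace(tmp, path)
```

Writing to a temp file and renaming means a run killed mid-checkpoint still leaves the previous valid checkpoint on disk for `--judge-only` replay.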

Regrading workflow

  • add --judge-only <results-or-checkpoint.json> to re-run grading without re-executing tasks
  • supports changing --judge-model for replay grading
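Conceptually, judge-only replay just iterates the stored results and re-grades each saved transcript; a minimal sketch (the `grade_fn` callback and result fields are hypothetical, not the PR's actual interface):

```python
import json

def judge_only(checkpoint_path: str, grade_fn) -> list[dict]:
    """Re-grade stored transcripts without re-executing any tasks."""
    with open(checkpoint_path) as fh:
        payload = json.load(fh)
    regraded = []
    for result in payload["results"]:
        # Each stored result carries its full transcript, so grading
        # needs no live agent session -- only the judge model runs.
        regraded.append(dict(result, grade=grade_fn(result["transcript"])))
    return regraded
```

Because grading only reads the checkpoint, swapping `--judge-model` between replays re-scores identical transcripts, which is what makes judge comparisons apples-to-apples.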

Docs

  • update README/SKILL docs for deterministic requirements, persistent sessions, checkpoints, --judge-only, and --clear-sessions

Notes

  • This is a draft PR per workflow preference (no review requested yet).
  • Validated locally with full off/low benchmark runs and judge-only replay from checkpoint.

Closes #16

@jb510 jb510 marked this pull request as ready for review March 8, 2026 01:54
if assistant_error:
    status = "error"
    stderr = f"{stderr}\nAssistant error: {assistant_error}".strip()
if runtime_model is None:

WARNING: Status override chain masks timeout status

When a task times out, status is set to "timeout" at line 632. However, a timed-out task will likely have an incomplete or empty transcript, causing runtime_model to be None. This check then unconditionally overrides the status to "error" with a misleading "Could not verify runtime provider/model" message, masking the real root cause (timeout).

The same pattern exists in run_openclaw_prompt at line 773.

Consider guarding the model verification checks so they only run when status == "success":

if status == "success":
    if runtime_model is None:
        status = "error"
        stderr = f"{stderr}\nCould not verify runtime provider/model from transcript.".strip()
    elif runtime_model.lower() != requested_model.lower():
        status = "error"
        stderr = (
            f"{stderr}\nModel mismatch: requested `{requested_model}` but runtime used `{runtime_model}`."
        ).strip()


@jb510 have you tested this path?

recent_entries = []
for entry in transcript:
    entry_ts = _entry_timestamp_epoch(entry)
    if entry_ts is not None and entry_ts >= cutoff:

WARNING: Transcript filtering silently drops entries without timestamps

The condition entry_ts is not None and entry_ts >= cutoff only keeps entries that have a parseable timestamp AND are recent enough. Entries without timestamps (e.g., tool-result events, custom events, or malformed entries with parse_error) are silently dropped when recent_entries is non-empty.

This could cause incomplete transcripts for downstream consumers like _extract_runtime_model_ref (which looks for custom/model-snapshot entries) and _extract_usage_from_transcript.

Consider also including entries that have no timestamp (they likely belong to the current invocation):

if entry_ts is None or entry_ts >= cutoff:
    recent_entries.append(entry)


@jb510 thoughts on this comment?


kilo-code-bot bot commented Mar 8, 2026

Code Review Summary

Status: 4 Issues Found | Recommendation: Address before merge

Overview

Severity    Count
CRITICAL        0
WARNING         3
SUGGESTION      1

Issue Details

WARNING

File                  Line  Issue
scripts/lib_agent.py    52  Missing timeout on subprocess.run in _load_model_catalog() — could hang indefinitely if openclaw CLI is unresponsive
scripts/lib_agent.py   646  Status override chain masks timeout status — a timed-out task can be overridden to "error" by later model-mismatch checks
scripts/lib_agent.py   462  Transcript filtering silently drops entries without timestamps, potentially losing valid transcript data

SUGGESTION

File                  Line  Issue
scripts/benchmark.py   327  Full transcripts in checkpoint payloads cause O(N²) I/O growth as _flush_checkpoint() re-serializes all accumulated results each time
Files Reviewed (5 files)
  • README.md - 0 issues (documentation updates)
  • SKILL.md - 0 issues (documentation updates)
  • scripts/benchmark.py - 1 issue
  • scripts/lib_agent.py - 3 issues
  • scripts/lib_grading.py - 0 issues


return _MODEL_CATALOG_CACHE

try:
    result = subprocess.run(

[WARNING]: Missing timeout on subprocess.run — if openclaw models list hangs (broken CLI, network issue, unresponsive daemon), the benchmark process will block indefinitely.

Consider adding a reasonable timeout (e.g., 30s) so the benchmark fails fast with a clear error instead of hanging:

Suggested change:

result = subprocess.run(
    ["openclaw", "models", "list", "--all", "--json"],
    capture_output=True,
    text=True,
    check=False,
    timeout=30,
)


@jb510 thoughts on a 60s timeout here?

"stdout": result.get("stdout", ""),
"stderr": result.get("stderr", ""),
"transcript_length": len(result.get("transcript", [])),
"transcript": result.get("transcript", []),

[SUGGESTION]: Including full transcripts in checkpoint payloads means every _flush_checkpoint() call re-serializes all accumulated transcripts to disk. With N tasks, total checkpoint I/O grows as O(N²) in transcript size.

For a 22-task suite with large transcripts this is likely fine, but if transcript sizes grow (e.g., multi-turn tasks with tool use) this could become a bottleneck. Consider either:

  • Writing only the latest task's data incrementally (append-style), or
  • Deferring full transcript inclusion to the final output only (checkpoints could store transcript_length without the full transcript array)
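For illustration, the append-style option could be as simple as a JSON Lines file, where each finished task is one appended line and total I/O grows linearly rather than quadratically (function names are hypothetical, not the PR's code):

```python
import json

def append_checkpoint(path: str, task_result: dict) -> None:
    """Append one finished task as a JSON line instead of rewriting everything."""
    with open(path, "a") as fh:
        fh.write(json.dumps(task_result) + "\n")

def load_checkpoint(path: str) -> list[dict]:
    """Read back all appended results, one JSON object per line."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

The trade-off is that readers must parse line-delimited JSON instead of a single document, and a partially written final line needs to be tolerated or truncated on resume.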


I feel like n^2 is okay here. If in the future we have 100s or 1000s of tasks, you can show me this comment and point and laugh


Development

Successfully merging this pull request may close these issues.

Hardcoded to use OpenRouter for both benchmarks and judge
