Deterministic model selection + judge-only regrading #27

jb510 wants to merge 2 commits into pinchbench:main from
Conversation
```python
if assistant_error:
    status = "error"
    stderr = f"{stderr}\nAssistant error: {assistant_error}".strip()
if runtime_model is None:
```
WARNING: Status override chain masks timeout status
When a task times out, status is set to `"timeout"` at line 632. However, a timed-out task will likely have an incomplete or empty transcript, causing `runtime_model` to be `None`. This check then unconditionally overrides the status to `"error"` with a misleading "Could not verify runtime provider/model" message, masking the real root cause (timeout).
The same pattern exists in `run_openclaw_prompt` at line 773.
Consider guarding the model verification checks so they only run when `status == "success"`:
```python
if status == "success":
    if runtime_model is None:
        status = "error"
        stderr = f"{stderr}\nCould not verify runtime provider/model from transcript.".strip()
    elif runtime_model.lower() != requested_model.lower():
        status = "error"
        stderr = (
            f"{stderr}\nModel mismatch: requested `{requested_model}` but runtime used `{runtime_model}`."
        ).strip()
```

```python
recent_entries = []
for entry in transcript:
    entry_ts = _entry_timestamp_epoch(entry)
    if entry_ts is not None and entry_ts >= cutoff:
```
WARNING: Transcript filtering silently drops entries without timestamps
The condition `entry_ts is not None and entry_ts >= cutoff` only keeps entries that have a parseable timestamp AND are recent enough. Entries without timestamps (e.g., tool-result events, custom events, or malformed entries with `parse_error`) are silently dropped when `recent_entries` is non-empty.
This could cause incomplete transcripts for downstream consumers like `_extract_runtime_model_ref` (which looks for custom/model-snapshot entries) and `_extract_usage_from_transcript`.
Consider also including entries that have no timestamp (they likely belong to the current invocation):
```python
if entry_ts is None or entry_ts >= cutoff:
    recent_entries.append(entry)
```
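A minimal sketch of the proposed filter, showing that timestamp-less entries survive. The entry shapes and the `_entry_timestamp_epoch` implementation here are hypothetical stand-ins; only the helper's name and the filter condition come from the diff:

```python
from datetime import datetime, timezone

def _entry_timestamp_epoch(entry):
    # Stand-in parser: return an epoch float for an ISO-8601 "timestamp"
    # field, or None when the field is missing or unparseable.
    ts = entry.get("timestamp")
    if not ts:
        return None
    try:
        return datetime.fromisoformat(ts.replace("Z", "+00:00")).timestamp()
    except ValueError:
        return None

cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
transcript = [
    {"type": "message", "timestamp": "2023-12-31T00:00:00Z"},  # stale: dropped
    {"type": "message", "timestamp": "2024-01-02T00:00:00Z"},  # recent: kept
    {"type": "custom", "subtype": "model-snapshot"},           # no timestamp: kept
]

recent_entries = []
for entry in transcript:
    entry_ts = _entry_timestamp_epoch(entry)
    # Proposed condition: keep recent entries AND entries with no timestamp.
    if entry_ts is None or entry_ts >= cutoff:
        recent_entries.append(entry)

print(len(recent_entries))  # 2
```

With the original `is not None and` condition, the model-snapshot entry would be dropped and a lookup like `_extract_runtime_model_ref` would see an incomplete transcript.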
Code Review Summary

Status: 4 Issues Found | Recommendation: Address before merge

Overview

Issue Details
- WARNING
- SUGGESTION

Files Reviewed (5 files)
```python
return _MODEL_CATALOG_CACHE

try:
    result = subprocess.run(
```
[WARNING]: Missing timeout on `subprocess.run` — if `openclaw models list` hangs (broken CLI, network issue, unresponsive daemon), the benchmark process will block indefinitely.
Consider adding a reasonable timeout (e.g., 30s) so the benchmark fails fast with a clear error instead of hanging:
```python
result = subprocess.run(
    ["openclaw", "models", "list", "--all", "--json"],
    capture_output=True,
    text=True,
    check=False,
    timeout=30,
)
```
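With a timeout set, a hung CLI surfaces as `subprocess.TimeoutExpired`, which the caller can convert into a clear error instead of blocking forever. A sketch using a stand-in command (the `run_cli` helper is hypothetical; the sleeping child process simulates a hung `openclaw`):

```python
import subprocess
import sys

def run_cli(cmd, timeout=30):
    """Run a CLI command, failing fast with a clear error if it hangs."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            check=False,
            timeout=timeout,
        )
        return result.returncode, result.stdout, result.stderr
    except subprocess.TimeoutExpired:
        # The child is killed; report a descriptive error instead of hanging.
        return 1, "", f"Command {cmd!r} timed out after {timeout}s"

# Simulate a hung CLI with a child process that sleeps past the timeout.
code, out, err = run_cli([sys.executable, "-c", "import time; time.sleep(10)"], timeout=1)
print(code, "timed out" in err)  # 1 True
```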
| "stdout": result.get("stdout", ""), | ||
| "stderr": result.get("stderr", ""), | ||
| "transcript_length": len(result.get("transcript", [])), | ||
| "transcript": result.get("transcript", []), |
[SUGGESTION]: Including full transcripts in checkpoint payloads means every `_flush_checkpoint()` call re-serializes all accumulated transcripts to disk. With N tasks, total checkpoint I/O grows as O(N²) in transcript size.
For a 22-task suite with large transcripts this is likely fine, but if transcript sizes grow (e.g., multi-turn tasks with tool use) this could become a bottleneck. Consider either:
- Writing only the latest task's data incrementally (append-style), or
- Deferring full transcript inclusion to the final output only (checkpoints could store `transcript_length` without the full `transcript` array)
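The append-style option can be sketched with a JSONL checkpoint file: each flush writes only the newest task's payload, so total checkpoint I/O is O(N) rather than O(N²). The function names and payload shape here are hypothetical, not the PR's actual checkpoint API:

```python
import json
import os
import tempfile

def append_checkpoint(path, task_result):
    # Append one task's payload as a single JSON line; earlier tasks are
    # never re-serialized, keeping each flush proportional to one task.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(task_result) + "\n")

def load_checkpoints(path):
    # Recover all checkpointed tasks by reading the file line by line.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

path = os.path.join(tempfile.mkdtemp(), "checkpoint.jsonl")
for i in range(3):
    append_checkpoint(path, {"task": i, "transcript_length": 10 * i})

print(len(load_checkpoints(path)))  # 3
```

A trade-off worth noting: JSONL checkpoints are not a single valid JSON document, so the final results file would still need one consolidating write at the end of the run.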
I feel like n^2 is okay here. If in the future we have 100s or 1000s of tasks, you can show me this comment and point and laugh
Summary
This PR makes benchmark model routing deterministic and adds a regrading workflow that avoids rerunning task execution.
Deterministic routing
- (`provider/model`)
- (`openrouter/` rewriting)

Error handling improvements
Session persistence + checkpoints
- `--clear-sessions` to explicitly clear transcripts before turns

Regrading workflow
- `--judge-only <results-or-checkpoint.json>` to re-run grading without re-executing tasks
- `--judge-model` for replay grading

Docs
- `--judge-only`, and `--clear-sessions`

Notes
Closes #16