Skip to content

fix: Multiple bug fixes across benchmark pipeline#31

Open
kaiaiagent wants to merge 2 commits intopinchbench:mainfrom
kaiaiagent:fix/multiple-bug-fixes
Open

fix: Multiple bug fixes across benchmark pipeline#31
kaiaiagent wants to merge 2 commits intopinchbench:mainfrom
kaiaiagent:fix/multiple-bug-fixes

Conversation

@kaiaiagent
Copy link

Summary

  • Fix duplicate task entries when --runs > 1: results iterated per-run instead of per-task, causing duplicate entries with inflated/deflated scores in uploaded payloads
  • Fix TimeoutExpired bytes/string confusion: subprocess.TimeoutExpired returns bytes for stdout/stderr even with text=True; added _decode_output() helper
  • Fix status determination logic: independent if statements allowed later conditions to silently overwrite earlier ones (e.g. timeout overwritten by empty-transcript error); converted to proper if/elif/else chain
  • Clean workspace between tasks: shared workspace caused cross-contamination where leftover files from task A could produce false positives in task B's grading
  • Handle string-typed judge scores: _normalize_judge_response now accepts "0.75" strings instead of silently dropping them to 0.0
  • Fix transcript content type handling: _summarize_transcript and verbose logging assumed content was always a list; now handles both str and list
  • Cache stat results in _find_recent_session_path to avoid TOCTOU race and redundant syscalls
  • Fix memory KB→GB conversion: used 1e6 instead of 1024² (~4.9% error)
  • Fix pyproject.toml path: looked in scripts/ instead of project root, so client version was always empty
  • Move CONFIG_DIR from scripts/.pinchbench/ to ~/.pinchbench/ to avoid accidental credential commits
  • Remove dead code: simulate branch in OpenClawAgent.execute_task fell through to raise

Test plan

  • Run benchmark with --runs 1 and verify results JSON has one entry per task with correct execution metadata
  • Run benchmark with --runs 2 and verify no duplicate task entries, scores are properly averaged
  • Trigger a task timeout and verify stdout/stderr are correctly decoded strings
  • Verify --upload still works with the updated results JSON schema
  • Confirm ~/.pinchbench/config.json is used for token storage instead of scripts/.pinchbench/

🤖 Generated with Claude Code

@kilo-code-bot
Copy link
Contributor

kilo-code-bot bot commented Mar 8, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

This PR is a well-structured set of defensive fixes and bug corrections across the benchmark tooling. Key improvements include:

  • Bug fix: Added missing shutil import in lib_agent.py (was causing NameError at runtime when shutil.rmtree was called)
  • Bug fix: _decode_output() helper correctly handles bytes output from subprocess.TimeoutExpired (which can return bytes even when text=True is set)
  • Bug fix: Memory conversion in lib_upload.py corrected from 1e6 to 1024 * 1024 (KB→GB requires binary division, not decimal)
  • Bug fix: _read_client_version() path corrected to find pyproject.toml in project root instead of scripts directory
  • Improvement: TOCTOU race condition eliminated in _find_recent_session_path by caching stat results
  • Improvement: Status determination logic simplified from sequential overwrites to clear if/elif/else priority chain (preserving same semantics)
  • Improvement: Type-safe handling of transcript content fields that may be str or list
  • Improvement: Robust score parsing with try/except instead of isinstance checks
  • Improvement: Config directory moved from script-relative to ~/.pinchbench (standard user config location)
Files Reviewed (4 files)
  • scripts/benchmark.py - Removed dead simulate check before NotImplementedError
  • scripts/lib_agent.py - Added shutil import, _decode_output helper, TOCTOU fix, status logic refactor, content type safety
  • scripts/lib_grading.py - Type-safe transcript summarization, robust score parsing
  • scripts/lib_upload.py - Config dir fix, pyproject.toml path fix, memory conversion fix

@olearycrew
Copy link
Member

@kaiaiagent there are a few merge conflicts - can you look at those?

- Fix duplicate task entries in results JSON when runs_per_task > 1
  (iterated per-run instead of per-task, causing inflated scores)
- Fix TimeoutExpired returning bytes instead of strings by adding
  _decode_output helper
- Fix status determination using if-chain instead of if/elif, causing
  later conditions to overwrite earlier ones
- Clean workspace between tasks to prevent cross-contamination
- Handle string-typed numbers in judge response normalization
- Fix content type assumptions in transcript summarization (str vs list)
- Cache stat results in _find_recent_session_path to avoid TOCTOU race
- Fix memory KB-to-GB conversion using 1024^2 instead of 1e6
- Fix pyproject.toml path lookup (was searching scripts/ not project root)
- Move CONFIG_DIR from scripts/.pinchbench to ~/.pinchbench
- Remove dead simulate branch in OpenClawAgent.execute_task

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kaiaiagent kaiaiagent force-pushed the fix/multiple-bug-fixes branch from 09d76f2 to 2b70d53 Compare March 10, 2026 04:09
@kaiaiagent
Copy link
Author

@olearycrew just updated to resolve the conflicts.

task_ids = _select_task_ids(runner.tasks, args.suite)
results = []
grades_by_task_id = {}
execution_by_task_id: Dict[str, Dict[str, Any]] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: execution_by_task_id is populated (lines 588–597) but never read or referenced anywhere in this file. The task_entries list comprehension at line 602 builds equivalent data directly from results. This dict is dead code and should either be removed or wired into the output (e.g., replacing the per-result iteration in task_entries with a lookup into this aggregated dict).

The dict was populated but never read — task_entries already builds
equivalent data directly from results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants