fix: Multiple bug fixes across benchmark pipeline#31
Open
kaiaiagent wants to merge 2 commits intopinchbench:mainfrom
Open
fix: Multiple bug fixes across benchmark pipeline#31kaiaiagent wants to merge 2 commits intopinchbench:mainfrom
kaiaiagent wants to merge 2 commits intopinchbench:mainfrom
Conversation
Contributor
Code Review SummaryStatus: No Issues Found | Recommendation: Merge This PR is a well-structured set of defensive fixes and bug corrections across the benchmark tooling. Key improvements include:
Files Reviewed (4 files)
|
Member
|
@kaiaiagent there are a few merge conflicts - can you look at those? |
- Fix duplicate task entries in results JSON when runs_per_task > 1 (iterated per-run instead of per-task, causing inflated scores) - Fix TimeoutExpired returning bytes instead of strings by adding _decode_output helper - Fix status determination using if-chain instead of if/elif, causing later conditions to overwrite earlier ones - Clean workspace between tasks to prevent cross-contamination - Handle string-typed numbers in judge response normalization - Fix content type assumptions in transcript summarization (str vs list) - Cache stat results in _find_recent_session_path to avoid TOCTOU race - Fix memory KB-to-GB conversion using 1024^2 instead of 1e6 - Fix pyproject.toml path lookup (was searching scripts/ not project root) - Move CONFIG_DIR from scripts/.pinchbench to ~/.pinchbench - Remove dead simulate branch in OpenClawAgent.execute_task Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
09d76f2 to
2b70d53
Compare
Author
|
@olearycrew just updated to resolve the conflicts. |
scripts/benchmark.py
Outdated
| task_ids = _select_task_ids(runner.tasks, args.suite) | ||
| results = [] | ||
| grades_by_task_id = {} | ||
| execution_by_task_id: Dict[str, Dict[str, Any]] = {} |
Contributor
There was a problem hiding this comment.
WARNING: execution_by_task_id is populated (lines 588–597) but never read or referenced anywhere in this file. The task_entries list comprehension at line 602 builds equivalent data directly from results. This dict is dead code and should either be removed or wired into the output (e.g., replacing the per-result iteration in task_entries with a lookup into this aggregated dict).
The dict was populated but never read — task_entries already builds equivalent data directly from results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--runs > 1: results iterated per-run instead of per-task, causing duplicate entries with inflated/deflated scores in uploaded payloadssubprocess.TimeoutExpiredreturns bytes for stdout/stderr even withtext=True; added_decode_output()helperifstatements allowed later conditions to silently overwrite earlier ones (e.g. timeout overwritten by empty-transcript error); converted to properif/elif/elsechain_normalize_judge_responsenow accepts"0.75"strings instead of silently dropping them to 0.0_summarize_transcriptand verbose logging assumedcontentwas always a list; now handles bothstrandlist_find_recent_session_pathto avoid TOCTOU race and redundant syscalls1e6instead of1024²(~4.9% error)pyproject.tomlpath: looked inscripts/instead of project root, so client version was always emptyCONFIG_DIRfromscripts/.pinchbench/to~/.pinchbench/to avoid accidental credential commitssimulatebranch inOpenClawAgent.execute_taskfell through toraiseTest plan
--runs 1and verify results JSON has one entry per task with correct execution metadata--runs 2and verify no duplicate task entries, scores are properly averaged--uploadstill works with the updated results JSON schema~/.pinchbench/config.jsonis used for token storage instead ofscripts/.pinchbench/🤖 Generated with Claude Code