Add thinking-level benchmarking support (off/minimal/low/medium/high/xhigh/adaptive) #12
jb510 wants to merge 4 commits into pinchbench:main
Conversation
- Add `--thinking` CLI argument to specify comma-separated thinking levels
- Pass thinking level to OpenClaw agent via `--thinking` flag
- Run each task across all specified thinking levels
- Include `thinking_level` in task results
- Add `thinking_aggregates` section with per-level statistics
- Support levels: off, minimal, low, medium, high
- Update SKILL.md and README.md with documentation

Closes pinchbench#9
- Add xhigh and adaptive to valid thinking levels (matching OpenClaw)
- Add model-aware xhigh validation (only GPT-5.x models support it)
- Validate thinking levels before passing to OpenClaw subprocess
- Document model-specific restrictions in help text and docs
- Follow existing code style (`Optional[str]` instead of `str | None`)
- No unnecessary changes to existing code
- Add strict xhigh model matching (provider-aware)
- Add adaptive support detection (Anthropic Claude 4.6 family)
- Deduplicate requested thinking levels while preserving order
- Fail fast when `--thinking` is provided but no valid levels remain
- Keep subprocess input constrained to validated levels
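The dedupe-and-validate flow described above could be sketched as follows. This is an illustrative sketch only: the function name, the model-ref prefix checks, and the level constants are assumptions, not the actual PinchBench implementation.

```python
VALID_LEVELS = ["off", "minimal", "low", "medium", "high", "xhigh", "adaptive"]

def validate_thinking_levels(requested, model_ref):
    """Deduplicate requested levels (order-preserving), drop levels the
    target model does not support, and fail fast if nothing valid remains."""
    seen = []
    for level in requested:
        if level in VALID_LEVELS and level not in seen:
            seen.append(level)
    # Model-aware restrictions (illustrative prefix checks): xhigh only for
    # GPT-5.x model refs, adaptive only for the Anthropic Claude 4.6 family.
    if "xhigh" in seen and not model_ref.startswith("openai/gpt-5"):
        seen.remove("xhigh")
    if "adaptive" in seen and not model_ref.startswith("anthropic/claude-4.6"):
        seen.remove("adaptive")
    if requested and not seen:
        raise SystemExit("--thinking was provided but no valid levels remain")
    return seen
```

Constraining the subprocess input to the returned list keeps invalid levels from ever reaching OpenClaw.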
Code Review Summary
Status: 1 Issue Found | Recommendation: Address before merge
Issue Details
Files Reviewed (4 files)
Tested this, but ran into a couple of problems because of the hard-coding to OpenRouter. I'm forking it to change that, as well as allowing users to pick a judge (not hard-coding to OpenRouter/Opus-4.6 as it is now). Frankly, I'd rather use provider subscriptions to test and judge than pay OpenRouter credits 😱
@jb510 thanks for this! I would for sure consider a PR that extracts out the hardcoding to OpenRouter and the judge agent. I think some of those kinds of defaults make sense for the "official leaderboard" scores, just to try and control for as many variables as possible, but I'd be okay with that still being configurable for the user.
```python
logger.info("%s", "=" * 80)

for i, task in enumerate(tasks_to_run, 1):
    task_key = f"{task.task_id}:{thinking_label}" if thinking_level else task.task_id
```
WARNING: Inconsistent key construction between storage and lookup
When thinking_level is None (no --thinking flag), task_key is set to bare task.task_id (e.g., "task_01"). However, the grading lookup at line 579 always constructs the primary key as f"{task_id}:default" (because result.get('thinking_level') or 'default' evaluates to "default" when thinking_level is None).
This means the primary lookup key "task_01:default" will never match the stored key "task_01". The fallback at line 580 (grades_by_task_and_thinking.get(result["task_id"], {})) saves it, but this is fragile — if the fallback is ever removed, grading data will silently be replaced with {}.
Consider using a consistent key format. For example, always include the label:
```diff
- task_key = f"{task.task_id}:{thinking_label}" if thinking_level else task.task_id
+ task_key = f"{task.task_id}:{thinking_label}"
```
This would make task_key always "task_id:default" when no thinking level is set, matching the lookup at line 579.
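A minimal sketch of the always-include-a-label approach, assuming `thinking_label` falls back to `"default"` when no level is set (the helper name is hypothetical, not from the PR):

```python
def make_task_key(task_id, thinking_label=None):
    """Build the per-task result key. Always including a label keeps the
    storage key and the grading-lookup key in the same format."""
    return f"{task_id}:{thinking_label or 'default'}"
```

With this, both the run loop and the grading lookup produce `"task_01:default"` for unlabelled runs, so the fragile fallback lookup becomes unnecessary.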
@jb510 looks like some of these changes got mixed into https://github.com/pinchbench/skill/pull/27/changes, yes?
PR 27 is currently based on these changes, not main, but these changes aren't part of the 27 PR. You should be able to merge/cherry-pick them separately and out of order if you choose to.
Summary
Adds thinking-level benchmarking support to PinchBench so benchmark runs can compare model performance across reasoning levels.
What changed
- Added `--thinking` CLI flag to `scripts/benchmark.py` (comma-separated, e.g. `off,medium,high`)
- Valid levels: `off`, `minimal`, `low`, `medium`, `high`, `xhigh`, `adaptive`
- Task results include `thinking_level`; output includes a `thinking_aggregates` section
- `scripts/lib_agent.py`: `xhigh` allowed only on GPT-5.x family model refs; `adaptive` constrained to the Anthropic Claude 4.6 family
- When `--thinking` is provided and no valid levels remain, the runner exits with an error
- Updated docs (`README.md`, `SKILL.md`) with thinking-level usage and constraints

Notes
Example
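A plausible invocation, based on the flag syntax described in the summary (the exact model selection and other arguments are omitted here, since they aren't shown in this thread):

```shell
python scripts/benchmark.py --thinking off,medium,high
```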
Closes #9