
Add thinking-level benchmarking support (off/minimal/low/medium/high/xhigh/adaptive)#12

Open
jb510 wants to merge 4 commits into pinchbench:main from jb510:thinking-levels

Conversation


@jb510 jb510 commented Mar 7, 2026

Summary

Adds thinking-level benchmarking support to PinchBench so benchmark runs can compare model performance across reasoning levels.

What changed

  • Added --thinking CLI flag to scripts/benchmark.py
    • Accepts comma-separated levels (e.g. off,medium,high)
    • Valid levels: off, minimal, low, medium, high, xhigh, adaptive
  • Runs benchmark tasks across each selected thinking level and records per-level output.
  • Added per-level aggregate stats in results JSON:
    • thinking_levels
    • thinking_aggregates
  • Added model-aware validation in scripts/lib_agent.py:
    • xhigh allowed only on GPT-5.x family model refs
    • adaptive constrained to Anthropic Claude 4.6 family
    • Invalid/incompatible levels are warned and filtered
    • If --thinking is provided and no valid levels remain, runner exits with error
  • Updated docs (README.md, SKILL.md) with thinking-level usage and constraints.
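The validation rules listed above can be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual implementation: the function name, the substring-based model matching, and the warning messages are all hypothetical.

```python
# Hypothetical sketch of the --thinking validation described above.
# Names and the substring-based model checks are illustrative assumptions,
# not code from scripts/lib_agent.py.
from typing import List, Optional

VALID_LEVELS = ["off", "minimal", "low", "medium", "high", "xhigh", "adaptive"]

def validate_thinking_levels(raw: str, model_ref: Optional[str]) -> List[str]:
    """Filter a comma-separated --thinking value against model support."""
    model = (model_ref or "").lower()
    seen = set()
    valid: List[str] = []
    for level in (part.strip() for part in raw.split(",")):
        if level in seen:
            continue  # deduplicate while preserving order
        seen.add(level)
        if level not in VALID_LEVELS:
            print(f"warning: unknown thinking level {level!r}, skipping")
            continue
        if level == "xhigh" and "gpt-5" not in model:
            print("warning: xhigh requires a GPT-5.x model ref, skipping")
            continue
        if level == "adaptive" and "claude-4.6" not in model:
            print("warning: adaptive requires the Claude 4.6 family, skipping")
            continue
        valid.append(level)
    if not valid:
        # Fail fast: --thinking was given but nothing usable remains.
        raise SystemExit("error: --thinking provided but no valid levels remain")
    return valid
```

For example, `validate_thinking_levels("xhigh,adaptive,low", "openai/gpt-5.2")` would keep `xhigh` and `low` but warn on and drop `adaptive`.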

Notes

  • This PR is intentionally scoped to thinking-level benchmark support and validation only.
  • No new test suite was added, consistent with current repo state (pytest config exists, but no existing test files yet).

Example

./scripts/run.sh --model anthropic/claude-sonnet-4 --thinking off,medium,high
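The per-level aggregation this run produces might look roughly like the sketch below. The field names (`tasks`, `pass_rate`) and result shape are assumptions for illustration; the PR's actual `thinking_aggregates` schema may differ.

```python
# Illustrative sketch of per-level aggregation; field names are
# assumptions, not the PR's actual results-JSON schema.
from collections import defaultdict

results = [
    {"task_id": "task_01", "thinking_level": "off", "passed": True},
    {"task_id": "task_01", "thinking_level": "high", "passed": True},
    {"task_id": "task_02", "thinking_level": "off", "passed": False},
    {"task_id": "task_02", "thinking_level": "high", "passed": True},
]

# Group pass/fail outcomes by thinking level.
by_level = defaultdict(list)
for r in results:
    by_level[r["thinking_level"]].append(r["passed"])

# One aggregate entry per level, e.g. for the "thinking_aggregates" section.
thinking_aggregates = {
    level: {"tasks": len(passes), "pass_rate": sum(passes) / len(passes)}
    for level, passes in by_level.items()
}
```

With the sample data above, `off` aggregates to a 0.5 pass rate across 2 tasks and `high` to 1.0.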

Closes #9

OpenClaw Agent added 3 commits March 7, 2026 09:04
- Add --thinking CLI argument to specify comma-separated thinking levels
- Pass thinking level to OpenClaw agent via --thinking flag
- Run each task across all specified thinking levels
- Include thinking_level in task results
- Add thinking_aggregates section with per-level statistics
- Support levels: off, minimal, low, medium, high
- Update SKILL.md and README.md with documentation

Closes pinchbench#9
- Add xhigh and adaptive to valid thinking levels (matching OpenClaw)
- Add model-aware xhigh validation (only GPT-5.x models support it)
- Validate thinking levels before passing to OpenClaw subprocess
- Document model-specific restrictions in help text and docs
- Follow existing code style (Optional[str] instead of str | None)
- No unnecessary changes to existing code
- Add strict xhigh model matching (provider-aware)
- Add adaptive support detection (Anthropic Claude 4.6 family)
- Deduplicate requested thinking levels while preserving order
- Fail fast when --thinking is provided but no valid levels remain
- Keep subprocess input constrained to validated levels

kilo-code-bot bot commented Mar 7, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge


Overview

| Severity   | Count |
| ---------- | ----- |
| CRITICAL   | 0     |
| WARNING    | 1     |
| SUGGESTION | 0     |
Issue Details

WARNING

| File | Line | Issue |
| ---- | ---- | ----- |
| scripts/benchmark.py | 438 | Inconsistent task_key construction: when thinking_level is None, the key is stored as bare task_id but looked up as task_id:default at line 579. The fallback at line 580 masks this mismatch, but it's fragile. |
Files Reviewed (4 files)
  • README.md - 0 issues (documentation updates)
  • SKILL.md - 0 issues (documentation updates)
  • scripts/benchmark.py - 1 issue (thinking level loop refactor, grading lookup)
  • scripts/lib_agent.py - 0 issues (thinking level validation, model compatibility checks)


jb510 commented Mar 7, 2026

Tested this, but ran into a couple of problems because of the hard-coding to OpenRouter. I'm forking it to change that, as well as to allow users to pick a judge (rather than hard-coding OpenRouter/Opus-4.6 as it is now), because frankly I'd rather use provider subscriptions to test and judge than pay for OpenRouter credits 😱.

@olearycrew
Member

@jb510 thanks for this!

I would for sure consider a PR that extracts out the hardcoding to OpenRouter and the judge agent.

I think some of those kinds of defaults make sense for the "official leaderboard" scores just to try and control for as many variables as possible - but I'd be okay if that is still configurable for the user for sure.

```python
logger.info("%s", "=" * 80)

for i, task in enumerate(tasks_to_run, 1):
    task_key = f"{task.task_id}:{thinking_label}" if thinking_level else task.task_id
```

WARNING: Inconsistent key construction between storage and lookup

When thinking_level is None (no --thinking flag), task_key is set to bare task.task_id (e.g., "task_01"). However, the grading lookup at line 579 always constructs the primary key as f"{task_id}:default" (because result.get('thinking_level') or 'default' evaluates to "default" when thinking_level is None).

This means the primary lookup key "task_01:default" will never match the stored key "task_01". The fallback at line 580 (grades_by_task_and_thinking.get(result["task_id"], {})) saves it, but this is fragile — if the fallback is ever removed, grading data will silently be replaced with {}.

Consider using a consistent key format. For example, always include the label:

Suggested change

```diff
- task_key = f"{task.task_id}:{thinking_label}" if thinking_level else task.task_id
+ task_key = f"{task.task_id}:{thinking_label}"
```

This would make task_key always "task_id:default" when no thinking level is set, matching the lookup at line 579.
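The fix can be illustrated with a small self-contained sketch. The helper name `make_task_key` and the surrounding variables are hypothetical, not the repo's actual code; the point is only that a single key-building path keeps storage and lookup in agreement.

```python
# Hypothetical sketch of the suggested fix; make_task_key and the
# grades dict are illustrative, not the repo's actual names.
def make_task_key(task_id, thinking_label=None):
    # Always include a label so storage and lookup build identical keys.
    return f"{task_id}:{thinking_label or 'default'}"

# Storage side: no --thinking flag, so no label is passed.
grades = {make_task_key("task_01"): {"score": 1.0}}

# Lookup side builds the same "task_01:default" key, so it always matches
# without needing a bare-task_id fallback.
assert grades[make_task_key("task_01", None)] == {"score": 1.0}
```

Building every key through one helper removes the fragile fallback path entirely.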

@olearycrew
Member

@jb510 looks like some of these changes got mixed in https://github.com/pinchbench/skill/pull/27/changes yes?


jb510 commented Mar 11, 2026

> @jb510 looks like some of these changes got mixed in https://github.com/pinchbench/skill/pull/27/changes yes?

#27 is currently based on these changes, not main, but these changes aren't part of the #27 PR. You should be able to merge/cherry-pick them separately, and out of order, if you choose to.
