
Add thinking-level benchmarking support (off/minimal/low/medium/high/xhigh/adaptive)#12

Open
jb510 wants to merge 4 commits into pinchbench:main from jb510:thinking-levels

Conversation


@jb510 jb510 commented Mar 7, 2026

Summary

Adds thinking-level benchmarking support to PinchBench so benchmark runs can compare model performance across reasoning levels.

What changed

  • Added --thinking CLI flag to scripts/benchmark.py
    • Accepts comma-separated levels (e.g. off,medium,high)
    • Valid levels: off, minimal, low, medium, high, xhigh, adaptive
  • Runs benchmark tasks across each selected thinking level and records per-level output.
  • Added per-level aggregate stats in results JSON:
    • thinking_levels
    • thinking_aggregates
  • Added model-aware validation in scripts/lib_agent.py:
    • xhigh allowed only on GPT-5.x family model refs
    • adaptive constrained to Anthropic Claude 4.6 family
    • Invalid/incompatible levels are warned and filtered
    • If --thinking is provided and no valid levels remain, runner exits with error
  • Updated docs (README.md, SKILL.md) with thinking-level usage and constraints.
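The validation rules listed above can be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual implementation: the function name, the substring-based model matching, and the warning messages are all hypothetical.

```python
# Hypothetical sketch of the --thinking validation described above.
# Names and the substring-based model checks are illustrative assumptions,
# not code from scripts/lib_agent.py.
from typing import List, Optional

VALID_LEVELS = ["off", "minimal", "low", "medium", "high", "xhigh", "adaptive"]

def validate_thinking_levels(raw: str, model_ref: Optional[str]) -> List[str]:
    """Filter a comma-separated --thinking value against model support."""
    model = (model_ref or "").lower()
    seen = set()
    valid: List[str] = []
    for level in (part.strip() for part in raw.split(",")):
        if level in seen:
            continue  # deduplicate while preserving order
        seen.add(level)
        if level not in VALID_LEVELS:
            print(f"warning: unknown thinking level {level!r}, skipping")
            continue
        if level == "xhigh" and "gpt-5" not in model:
            print("warning: xhigh requires a GPT-5.x model ref, skipping")
            continue
        if level == "adaptive" and "claude-4.6" not in model:
            print("warning: adaptive requires the Claude 4.6 family, skipping")
            continue
        valid.append(level)
    if not valid:
        # Fail fast: --thinking was given but nothing usable remains.
        raise SystemExit("error: --thinking provided but no valid levels remain")
    return valid
```

For example, `validate_thinking_levels("xhigh,adaptive,low", "openai/gpt-5.2")` would keep `xhigh` and `low` but warn on and drop `adaptive`.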

Notes

  • This PR is intentionally scoped to thinking-level benchmark support and validation only.
  • No new test suite was added, consistent with current repo state (pytest config exists, but no existing test files yet).

Example

./scripts/run.sh --model anthropic/claude-sonnet-4 --thinking off,medium,high
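The per-level aggregation this run produces might look roughly like the sketch below. The field names (`tasks`, `pass_rate`) and result shape are assumptions for illustration; the PR's actual `thinking_aggregates` schema may differ.

```python
# Illustrative sketch of per-level aggregation; field names are
# assumptions, not the PR's actual results-JSON schema.
from collections import defaultdict

results = [
    {"task_id": "task_01", "thinking_level": "off", "passed": True},
    {"task_id": "task_01", "thinking_level": "high", "passed": True},
    {"task_id": "task_02", "thinking_level": "off", "passed": False},
    {"task_id": "task_02", "thinking_level": "high", "passed": True},
]

# Group pass/fail outcomes by thinking level.
by_level = defaultdict(list)
for r in results:
    by_level[r["thinking_level"]].append(r["passed"])

# One aggregate entry per level, e.g. for the "thinking_aggregates" section.
thinking_aggregates = {
    level: {"tasks": len(passes), "pass_rate": sum(passes) / len(passes)}
    for level, passes in by_level.items()
}
```

With the sample data above, `off` aggregates to a 0.5 pass rate across 2 tasks and `high` to 1.0.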

Closes #9

OpenClaw Agent added 3 commits March 7, 2026 09:04
- Add --thinking CLI argument to specify comma-separated thinking levels
- Pass thinking level to OpenClaw agent via --thinking flag
- Run each task across all specified thinking levels
- Include thinking_level in task results
- Add thinking_aggregates section with per-level statistics
- Support levels: off, minimal, low, medium, high
- Update SKILL.md and README.md with documentation

Closes pinchbench#9
- Add xhigh and adaptive to valid thinking levels (matching OpenClaw)
- Add model-aware xhigh validation (only GPT-5.x models support it)
- Validate thinking levels before passing to OpenClaw subprocess
- Document model-specific restrictions in help text and docs
- Follow existing code style (Optional[str] instead of str | None)
- No unnecessary changes to existing code
- Add strict xhigh model matching (provider-aware)
- Add adaptive support detection (Anthropic Claude 4.6 family)
- Deduplicate requested thinking levels while preserving order
- Fail fast when --thinking is provided but no valid levels remain
- Keep subprocess input constrained to validated levels

kilo-code-bot bot commented Mar 7, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge


Overview

| Severity   | Count |
| ---------- | ----- |
| CRITICAL   | 0     |
| WARNING    | 1     |
| SUGGESTION | 0     |
Issue Details

WARNING

| File | Line | Issue |
| ---- | ---- | ----- |
| scripts/benchmark.py | 438 | Inconsistent task_key construction: when thinking_level is None, the key is stored as bare task_id but looked up as task_id:default at line 579. The fallback at line 580 masks this mismatch, but it's fragile. |
Files Reviewed (4 files)
  • README.md - 0 issues (documentation updates)
  • SKILL.md - 0 issues (documentation updates)
  • scripts/benchmark.py - 1 issue (thinking level loop refactor, grading lookup)
  • scripts/lib_agent.py - 0 issues (thinking level validation, model compatibility checks)


jb510 commented Mar 7, 2026

Tested this, but ran into a couple of problems because of the hard-coding to OpenRouter. I'm forking it to change that, as well as to allow users to pick a judge (rather than hard-coding OpenRouter/Opus-4.6 as it is now), because frankly I'd rather use provider subscriptions to test and judge than pay for OpenRouter credits 😱.

@olearycrew
Member

@jb510 thanks for this!

I would for sure consider a PR that extracts out the hardcoding to OpenRouter and the judge agent.

I think some of those kinds of defaults make sense for the "official leaderboard" scores just to try and control for as many variables as possible - but I'd be okay if that is still configurable for the user for sure.

```python
logger.info("%s", "=" * 80)

for i, task in enumerate(tasks_to_run, 1):
    task_key = f"{task.task_id}:{thinking_label}" if thinking_level else task.task_id
```

WARNING: Inconsistent key construction between storage and lookup

When thinking_level is None (no --thinking flag), task_key is set to bare task.task_id (e.g., "task_01"). However, the grading lookup at line 579 always constructs the primary key as f"{task_id}:default" (because result.get('thinking_level') or 'default' evaluates to "default" when thinking_level is None).

This means the primary lookup key "task_01:default" will never match the stored key "task_01". The fallback at line 580 (grades_by_task_and_thinking.get(result["task_id"], {})) saves it, but this is fragile — if the fallback is ever removed, grading data will silently be replaced with {}.

Consider using a consistent key format. For example, always include the label:

Suggested change

```diff
- task_key = f"{task.task_id}:{thinking_label}" if thinking_level else task.task_id
+ task_key = f"{task.task_id}:{thinking_label}"
```

This would make task_key always "task_id:default" when no thinking level is set, matching the lookup at line 579.
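The fix can be illustrated with a small self-contained sketch. The helper name `make_task_key` and the surrounding variables are hypothetical, not the repo's actual code; the point is only that a single key-building path keeps storage and lookup in agreement.

```python
# Hypothetical sketch of the suggested fix; make_task_key and the
# grades dict are illustrative, not the repo's actual names.
def make_task_key(task_id, thinking_label=None):
    # Always include a label so storage and lookup build identical keys.
    return f"{task_id}:{thinking_label or 'default'}"

# Storage side: no --thinking flag, so no label is passed.
grades = {make_task_key("task_01"): {"score": 1.0}}

# Lookup side builds the same "task_01:default" key, so it always matches
# without needing a bare-task_id fallback.
assert grades[make_task_key("task_01", None)] == {"score": 1.0}
```

Building every key through one helper removes the fragile fallback path entirely.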

@olearycrew
Member

@jb510 looks like some of these changes got mixed in https://github.com/pinchbench/skill/pull/27/changes yes?


jb510 commented Mar 11, 2026

> @jb510 looks like some of these changes got mixed in https://github.com/pinchbench/skill/pull/27/changes yes?

#27 is currently based on these changes, not main, but these changes aren't part of the #27 PR. You should be able to merge/cherry-pick them separately, and out of order, if you choose to.
