feat: Add --judge CLI argument and require full provider prefix for model IDs by yudistiraashadi · Pull Request #29 · pinchbench/skill

yudistiraashadi · 2026-03-08T06:07:13Z

Motivation

Currently, model IDs are normalized internally, which limits benchmarks to a specific provider. Requiring full provider prefixes and adding the --judge flag lets users run benchmarks against any provider: for example, Qwen models served by a different provider, or any other model/provider combination. This gives full flexibility over both the benchmarked model and the judge model.

Summary

- Add `--judge` CLI argument to `benchmark.py` to allow customizing the judge model used for grading
- Require full provider prefix in model IDs (e.g. `openrouter/anthropic/claude-3.5-sonnet`) instead of normalizing them internally; this removes `normalize_model_id()` from `lib_agent.py`
- Update default judge model in `lib_grading.py` to use `openrouter/` prefix
- Update README with new model ID format, `--judge` flag documentation, and formatting improvements
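
To illustrate the model ID format described above, here is a minimal sketch of splitting a fully prefixed ID into its provider and the remainder. `split_model_id` is a hypothetical helper used only for illustration; it is not part of this PR, which passes the model ID through unchanged.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Hypothetical helper (not in this PR): split 'provider/rest' into parts."""
    # partition() splits on the first "/" only, so everything after the
    # provider prefix is kept intact, including any further slashes.
    provider, sep, rest = model_id.partition("/")
    if not sep or not provider or not rest:
        raise ValueError(
            f"model ID must include a provider prefix, got {model_id!r}"
        )
    return provider, rest


provider, rest = split_model_id("openrouter/anthropic/claude-3.5-sonnet")
print(provider)  # openrouter
print(rest)      # anthropic/claude-3.5-sonnet
```

Under this assumed format, only the first path segment is treated as the provider; the rest is whatever that provider expects.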

Introduced a new command-line argument `--judge` to specify the judge model identifier, defaulting to "openrouter/anthropic/claude-opus-4.5". Updated grading logic to include the judge model if provided, enhancing flexibility in task evaluation.
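
As a rough sketch of how a `--judge` argument with this default could be wired up with `argparse`: the `build_parser` function, the help strings, and making `--model` required are assumptions for illustration, not the actual contents of `benchmark.py`.

```python
import argparse

# Default taken from the PR description; the constant name follows
# DEFAULT_JUDGE_MODEL in lib_grading.py as mentioned in the review.
DEFAULT_JUDGE_MODEL = "openrouter/anthropic/claude-opus-4.5"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real benchmark.py may define more options."""
    parser = argparse.ArgumentParser(description="Benchmark runner (sketch)")
    parser.add_argument(
        "--model",
        required=True,  # assumption: the real script may not require this
        help="Model ID with full provider prefix",
    )
    parser.add_argument(
        "--judge",
        default=DEFAULT_JUDGE_MODEL,
        help="Judge model ID with full provider prefix",
    )
    return parser


# Omitting --judge falls back to the default judge model.
args = build_parser().parse_args(
    ["--model", "openrouter/anthropic/claude-sonnet-4"]
)
print(args.judge)  # openrouter/anthropic/claude-opus-4.5
```

Passing `--judge` explicitly would override the default, so the benchmarked model and the judge can come from different providers.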

Changed model identifiers in benchmark commands to include the 'openrouter/' prefix. Added a note about the requirement for provider prefixes in model IDs. Enhanced the table formatting for better readability.

kilo-code-bot · 2026-03-08T06:08:58Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Other Observations (not in diff)

Minor inconsistencies found in unchanged code that cannot receive inline comments:

| File | Line | Issue |
| --- | --- | --- |
| `scripts/benchmark.py` | 175 | `--model` help text still shows old format `anthropic/claude-sonnet-4` without `openrouter/` prefix, inconsistent with the rest of this PR |
| `scripts/run.sh` | 3 | Usage comment still shows old format `anthropic/claude-sonnet-4` without `openrouter/` prefix |

These are documentation-only nits and do not affect functionality.

Files Reviewed (4 files)

- `README.md` - 0 issues (documentation updates, model ID prefix, table formatting, `--judge` flag)
- `scripts/benchmark.py` - 0 issues (added `--judge` CLI argument, passes to grading)
- `scripts/lib_agent.py` - 0 issues (removed `normalize_model_id`, passes model ID directly)
- `scripts/lib_grading.py` - 0 issues (updated `DEFAULT_JUDGE_MODEL` with `openrouter/` prefix)

olearycrew · 2026-03-10T19:32:17Z

@yudistiraashadi thanks for this!

How would you compare this to #27?

yudistiraashadi · 2026-03-11T03:24:38Z

> @yudistiraashadi thanks for this!
>
> How would you compare this to #27?

I think #27 includes the changes in this PR and more. If you merge #27, we can safely close this PR.

yudistiraashadi added 2 commits March 8, 2026 12:57

Update README.md to reflect new model identifiers and improve formatting

1cf20bb

Changed model identifiers in benchmark commands to include the 'openrouter/' prefix. Added a note about the requirement for provider prefixes in model IDs. Enhanced the table formatting for better readability.

yudistiraashadi wants to merge 2 commits into pinchbench:main from yudistiraashadi:feat/judge-model-and-provider-prefix
