feat: Add --judge CLI argument and require full provider prefix for model IDs #29

Open
yudistiraashadi wants to merge 2 commits into pinchbench:main from yudistiraashadi:feat/judge-model-and-provider-prefix

@yudistiraashadi yudistiraashadi commented Mar 8, 2026

Motivation

Currently, model IDs are normalized internally, which ties benchmark runs to a single provider. Requiring the full provider prefix and adding the --judge flag lets users run benchmarks against any provider they want. For example, they can use Qwen models from a different provider, or any other model/provider combination, with full control over both the benchmarked model and the judge model.

Summary

  • Add --judge CLI argument to benchmark.py to allow customizing the judge model used for grading
  • Require full provider prefix in model IDs (e.g. openrouter/anthropic/claude-3.5-sonnet) instead of normalizing them internally — removes normalize_model_id() from lib_agent.py
  • Update default judge model in lib_grading.py to use openrouter/ prefix
  • Update README with new model ID format, --judge flag documentation, and formatting improvements
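As a rough illustration of the first two bullets, the CLI surface could look something like the sketch below. It is not the code from this PR: `build_parser` and `split_provider` are hypothetical names, marking `--model` as required is an assumption, and treating the first path segment as the provider is also an assumption about how prefixed IDs are read. Only the `--judge` flag, its default value, and the example model ID come from the description.

```python
import argparse
from typing import Tuple

# Default judge ID as stated in the PR description.
DEFAULT_JUDGE_MODEL = "openrouter/anthropic/claude-opus-4.5"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real benchmark.py likely has more options."""
    parser = argparse.ArgumentParser(description="Benchmark runner (sketch)")
    parser.add_argument(
        "--model",
        required=True,  # assumption: the sketch always needs an explicit model
        help="Full model ID with provider prefix, "
        "e.g. openrouter/anthropic/claude-3.5-sonnet",
    )
    parser.add_argument(
        "--judge",
        default=DEFAULT_JUDGE_MODEL,
        help="Judge model ID used for grading (default: %(default)s)",
    )
    return parser


def split_provider(model_id: str) -> Tuple[str, str]:
    """Hypothetical helper: treat the first path segment as the provider."""
    provider, _, rest = model_id.partition("/")
    if not provider or not rest:
        raise ValueError(f"model ID needs a provider prefix: {model_id!r}")
    return provider, rest


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--model", "openrouter/anthropic/claude-3.5-sonnet"]
    )
    print(args.judge)
    print(split_provider(args.model))
```

Because the ID is passed through unchanged rather than normalized, the provider segment stays visible to whatever consumes it, which is what allows the same model family to be run through different providers.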

Introduced a new command-line argument `--judge` to specify the judge model identifier, defaulting to "openrouter/anthropic/claude-opus-4.5". Updated grading logic to include the judge model if provided, enhancing flexibility in task evaluation.
Changed model identifiers in benchmark commands to include the 'openrouter/' prefix. Added a note about the requirement for provider prefixes in model IDs. Enhanced the table formatting for better readability.
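The "include the judge model if provided" behavior from the first commit message can be sketched as follows. This is a minimal illustration under assumptions, not the PR's grading code: `grade_task`, `grade_with_optional_judge`, and the `judge_model` keyword are hypothetical names, and the grader here is a stand-in that makes no API calls.

```python
from typing import Any, Dict, Optional

# Default judge ID as stated in the PR description.
DEFAULT_JUDGE_MODEL = "openrouter/anthropic/claude-opus-4.5"


def grade_task(task_id: str, judge_model: str = DEFAULT_JUDGE_MODEL) -> Dict[str, Any]:
    """Stand-in grader: only reports which judge would be used."""
    return {"task_id": task_id, "judge_model": judge_model}


def grade_with_optional_judge(task_id: str, judge: Optional[str] = None) -> Dict[str, Any]:
    """Forward the judge only when the caller supplied one."""
    grade_kwargs: Dict[str, Any] = {}
    if judge:
        # Hypothetical keyword name; the real grading function may differ.
        grade_kwargs["judge_model"] = judge
    return grade_task(task_id, **grade_kwargs)


if __name__ == "__main__":
    print(grade_with_optional_judge("task-1"))
    print(grade_with_optional_judge("task-1", "openrouter/qwen/some-model"))
```

Passing the keyword only when a value is present keeps the default in one place (the grading module), so changing the default judge later would not require touching the CLI layer.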
@kilo-code-bot (bot, Contributor) commented Mar 8, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Other Observations (not in diff)

Minor inconsistencies found in unchanged code that cannot receive inline comments:

| File | Line | Issue |
|---|---|---|
| scripts/benchmark.py | 175 | --model help text still shows the old format anthropic/claude-sonnet-4 without the openrouter/ prefix, inconsistent with the rest of this PR |
| scripts/run.sh | 3 | Usage comment still shows the old format anthropic/claude-sonnet-4 without the openrouter/ prefix |

These are documentation-only nits and do not affect functionality.

Files Reviewed (4 files)
  • README.md - 0 issues (documentation updates, model ID prefix, table formatting, --judge flag)
  • scripts/benchmark.py - 0 issues (added --judge CLI argument, passes to grading)
  • scripts/lib_agent.py - 0 issues (removed normalize_model_id, passes model ID directly)
  • scripts/lib_grading.py - 0 issues (updated DEFAULT_JUDGE_MODEL with openrouter/ prefix)

@olearycrew (Member)

@yudistiraashadi thanks for this!

How would you compare this to #27?

@yudistiraashadi (Author)

> @yudistiraashadi thanks for this!
>
> How would you compare this to #27?

I think #27 includes the changes in this PR plus more. If you merge #27, we can safely close this PR.
