docs(config): recommend a capable-but-cheaper judge_model to cut evaluation cost #162

Merged
Fosowl merged 1 commit into mimosa_v2 from docs/judge-model-cost
Jun 29, 2026

@lfnothias
Collaborator

What this changes

This adds guidance to the configuration docs on how to choose judge_model, because that single choice is the biggest and simplest way to lower the cost of an evaluation run.

It updates the "Judge" section in docs/getting-started/configuration.md only. It does not change any code, any default, or any behaviour. It is documentation only.

Why

During an evaluation, the verifier is one of the largest consumers of tokens, and the model set in judge_model is what does that work. People often point it at the most expensive flagship model they have, but the judge does not need that: a capable mid-tier model does the job well and costs far less per token. As a rough guide, a Sonnet-class Claude or deepseek-v4-pro is around five times cheaper per token than an Opus-class flagship, while still judging the claims well.
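To make the "around five times cheaper" figure concrete, here is a minimal Python sketch of the arithmetic. The token count and the flagship price are placeholders chosen for illustration, not measured values or real price-list entries; only the 5x ratio comes from the description above.

```python
def run_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of spending `tokens` at a price quoted per million tokens."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical inputs, for illustration only.
VERIFIER_TOKENS = 2_000_000          # assumed tokens spent by the verifier in one run
FLAGSHIP_PRICE = 15.0                # assumed $/M tokens for a flagship model
MID_TIER_PRICE = FLAGSHIP_PRICE / 5  # the "around five times cheaper" rough guide

flagship_cost = run_cost(VERIFIER_TOKENS, FLAGSHIP_PRICE)
mid_tier_cost = run_cost(VERIFIER_TOKENS, MID_TIER_PRICE)

# Fraction of the verifier bill removed by switching judge_model.
saving_fraction = 1 - mid_tier_cost / flagship_cost

print(f"flagship: ${flagship_cost:.2f}, mid-tier: ${mid_tier_cost:.2f}, "
      f"saving: {saving_fraction:.0%}")
```

Whatever the real prices are, a 5x price ratio at equal token usage removes about 80% of the verifier's cost, which is why the model class matters more than any finer-grained tuning.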

So the doc now says plainly: set judge_model to one capable-but-cheaper model and let the whole verifier run on it. That is where essentially all of the saving comes from.

The doc also explains what is not worth doing: adding a second, even cheaper model just for the small mechanical steps inside the verifier (dropping duplicate claims, checking which packages a claim needs). Those steps are only a small fraction of the verifier's tokens, so splitting them out saves almost nothing and only makes the configuration harder to follow. The calls that actually cost tokens and actually need judgement stay on the one judge_model you already picked.
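The same reasoning can be checked with a short calculation. This is a sketch under stated assumptions: the 5% mechanical share, the token total, and all three prices are hypothetical placeholders, since the description only says the mechanical steps are "a small fraction" of the verifier's tokens.

```python
def verifier_cost(total_tokens: int, mechanical_fraction: float,
                  judge_price: float, mechanical_price: float) -> float:
    """Dollar cost when judgement calls and mechanical calls use different prices.

    Prices are in dollars per million tokens.
    """
    mechanical_tokens = total_tokens * mechanical_fraction
    judgement_tokens = total_tokens - mechanical_tokens
    return (judgement_tokens * judge_price
            + mechanical_tokens * mechanical_price) / 1_000_000


# Hypothetical inputs, for illustration only.
TOTAL_TOKENS = 2_000_000
MECHANICAL_FRACTION = 0.05   # assumed share of tokens in dedup / package checks
FLAGSHIP, MID_TIER, CHEAPEST = 15.0, 3.0, 0.5  # assumed $/M tokens

all_flagship = verifier_cost(TOTAL_TOKENS, MECHANICAL_FRACTION, FLAGSHIP, FLAGSHIP)
all_mid_tier = verifier_cost(TOTAL_TOKENS, MECHANICAL_FRACTION, MID_TIER, MID_TIER)
split_setup = verifier_cost(TOTAL_TOKENS, MECHANICAL_FRACTION, MID_TIER, CHEAPEST)

saving_from_judge_model = all_flagship - all_mid_tier  # changing one setting
saving_from_split = all_mid_tier - split_setup         # adding a second model

print(f"judge_model switch saves ${saving_from_judge_model:.2f}; "
      f"splitting out mechanical calls saves ${saving_from_split:.2f} more")
```

With these placeholder numbers the judge_model switch saves about $24 of a $30 bill, while the extra split saves about $0.25 on top of that. The exact figures depend on the real token share and prices, but the gap stays large as long as the mechanical share is small.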

Model suggestions in the doc

Direct routes:

  • anthropic/claude-sonnet-4-6
  • openai/gpt-5.5

OpenRouter routes (the common setup for the metabolomics evaluation):

  • openrouter/deepseek/deepseek-v4-pro
  • openrouter/z-ai/glm-5
  • a MiniMax route (the exact version tag drifts over time on OpenRouter, so the doc tells the reader to pick the latest capable version of whichever family they prefer)
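As a rough illustration of how these strings might be handled, the sketch below collects the fully specified routes from the lists above and splits each one into its leading segment and the rest. The project's actual configuration format is not shown in this PR, so the `config` dict is only a stand-in, and `split_route` is a hypothetical helper, not part of the project. Reading the first path segment as the routing provider is an assumption based on how the routes are grouped above. The MiniMax route is left out because its exact version tag is not given.

```python
# Routes copied from the suggestions above; the MiniMax route is omitted
# because the PR does not pin an exact version tag for it.
SUGGESTED_JUDGE_MODELS = [
    "anthropic/claude-sonnet-4-6",
    "openai/gpt-5.5",
    "openrouter/deepseek/deepseek-v4-pro",
    "openrouter/z-ai/glm-5",
]


def split_route(route: str) -> tuple[str, str]:
    """Split a route into (first segment, remainder).

    Hypothetical helper: assumes the first segment names the routing provider.
    """
    provider, _, model = route.partition("/")
    if not provider or not model:
        raise ValueError(f"expected '<provider>/<model>', got {route!r}")
    return provider, model


# Stand-in for the real configuration: one judge_model for the whole verifier.
config = {"judge_model": SUGGESTED_JUDGE_MODELS[0]}

for route in SUGGESTED_JUDGE_MODELS:
    provider, model = split_route(route)
    print(f"{provider:<11} {model}")
```

The point of the single `judge_model` entry is that no second model setting appears anywhere: every verifier call uses the one route chosen here.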

Relation to PR #160

This is the replacement for #160. #160 added a separate judge_extraction_model knob to move the mechanical sub-calls onto a cheaper model. After narrowing that knob to only the genuinely mechanical calls (claim dedup and missing-package checks), the saving is tiny, because those calls are only a small fraction of the verifier's tokens. The real saving is in the model class of judge_model itself, which already exists, so no new knob is needed. #160 is being closed in favour of this doc change.
