docs(config): recommend a capable-but-cheaper judge_model to cut evaluation cost #162
Merged
What this changes
This adds guidance to the configuration docs on how to choose `judge_model`, because that single choice is the biggest and simplest way to lower the cost of an evaluation run.

It updates only the "Judge" section in `docs/getting-started/configuration.md`. It does not change any code, any default, or any behaviour. It is documentation only.

Why
During an evaluation, the verifier is one of the places where the most tokens are spent, and the model set in `judge_model` is what does that work. People often point this at the most expensive flagship model they have, but the judge does not need that. A capable mid-tier model does the job well and costs far less per token. As a rough guide, a Sonnet-class Claude or a `deepseek-v4-pro` is in the region of five times cheaper per token than an Opus-class flagship, while still judging the claims well.

So the doc now says plainly: set `judge_model` to one capable-but-cheaper model and let the whole verifier run on it. That is where essentially all of the saving comes from.

The doc also explains what is not worth doing: adding a second, even cheaper model just for the small mechanical steps inside the verifier (dropping duplicate claims, checking which packages a claim needs). Those steps account for only a small fraction of the verifier's tokens, so splitting them out saves almost nothing and only makes the configuration harder to follow. The calls that actually cost tokens and actually need judgement stay on the one `judge_model` you already picked.

Model suggestions in the doc
Direct routes:
- `anthropic/claude-sonnet-4-6`
- `openai/gpt-5.5`

OpenRouter routes (the common setup for the metabolomics evaluation):
- `openrouter/deepseek/deepseek-v4-pro`
- `openrouter/z-ai/glm-5`

Relation to PR #160
This replaces #160. #160 added a separate `judge_extraction_model` knob to move the mechanical sub-calls onto a cheaper model. After narrowing that knob to only the genuinely mechanical calls (claim dedup and missing-package checks), the saving is tiny, because those calls are only a small fraction of the verifier's tokens. The real saving is in the model class of `judge_model` itself, which already exists, so no new knob is needed. #160 is being closed in favour of this doc change.
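
To make the cost argument concrete, here is a rough back-of-the-envelope sketch. Every number in it is an illustrative assumption, not a measurement from this PR or a real price list: the relative prices follow the "about five times cheaper" rule of thumb, while the 5% mechanical share and the price of the extra-cheap model are made up for the example. The function `verifier_cost` is a hypothetical helper, not part of the project.

```python
from typing import Optional

# Illustrative relative per-token prices (assumptions, not real price data).
FLAGSHIP_PRICE = 5.0    # Opus-class flagship
MID_TIER_PRICE = 1.0    # capable mid-tier judge, roughly five times cheaper
TINY_MODEL_PRICE = 0.2  # hypothetical even-cheaper model for mechanical steps

# Assumed share of verifier tokens spent on claim dedup and package checks.
MECHANICAL_FRACTION = 0.05


def verifier_cost(judge_price: float,
                  mechanical_price: Optional[float] = None,
                  mechanical_fraction: float = MECHANICAL_FRACTION) -> float:
    """Relative cost of one unit of verifier tokens.

    When mechanical_price is None, every call runs on the judge model.
    """
    if mechanical_price is None:
        mechanical_price = judge_price
    judgement_fraction = 1.0 - mechanical_fraction
    return judgement_fraction * judge_price + mechanical_fraction * mechanical_price


flagship_only = verifier_cost(FLAGSHIP_PRICE)
mid_tier_only = verifier_cost(MID_TIER_PRICE)
mid_tier_split = verifier_cost(MID_TIER_PRICE, TINY_MODEL_PRICE)

# Saving from choosing a cheaper judge vs. saving from adding a second knob.
saving_from_judge_choice = flagship_only - mid_tier_only
saving_from_split = mid_tier_only - mid_tier_split

print(f"flagship judge only:       {flagship_only:.2f}")
print(f"mid-tier judge only:       {mid_tier_only:.2f}")
print(f"mid-tier judge plus split: {mid_tier_split:.2f}")
print(f"saving from judge choice:  {saving_from_judge_choice:.2f}")
print(f"saving from the split:     {saving_from_split:.2f}")
```

Under these assumed numbers, switching the judge from the flagship to the mid-tier model removes about 4 cost units per unit of verifier tokens, while splitting out the mechanical steps removes about 0.04 more. The exact ratio depends on the real token mix and prices, but the split only matters if the mechanical share is much larger than assumed here.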