docs(config): recommend a capable-but-cheaper judge_model to cut evaluation cost by lfnothias · Pull Request #162 · HolobiomicsLab/Mimosa-AI

lfnothias · 2026-06-29T14:42:55Z

What this changes

This adds guidance to the configuration docs on choosing judge_model, because that single choice is the biggest and simplest lever for lowering the cost of an evaluation run.

It updates only the "Judge" section in docs/getting-started/configuration.md. This is a documentation-only change: no code, default, or behaviour changes.

Why

During an evaluation, the verifier is one of the largest consumers of tokens, and the model set in judge_model is what does that work. People often point it at the most expensive flagship model they have, but the judge does not need that. A capable mid-tier model does the job well at a much lower cost per token. As a rough guide, a Sonnet-class Claude or deepseek-v4-pro is roughly five times cheaper per token than an Opus-class flagship, while still judging the claims well.

So the doc now says plainly: set judge_model to one capable-but-cheaper model and let the whole verifier run on it. That is where essentially all of the saving comes from.

The doc also explains what is not worth doing: adding a second, even cheaper model just for the small mechanical steps inside the verifier (dropping duplicate claims, checking which packages a claim needs). Those steps account for only a small fraction of the verifier's tokens, so splitting them out saves almost nothing and only makes the configuration harder to follow. The calls that actually cost tokens and actually need judgement stay on the one judge_model you already picked.
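The argument above can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the 5x price ratio comes from the rough guide above, while the 5% mechanical share and the price of the "even cheaper" model are assumptions picked for the example, not measurements from the verifier.

```python
def relative_verifier_cost(judge_price, mechanical_price=None, mechanical_fraction=0.0):
    """Verifier cost relative to running every call on a flagship priced at 1.0.

    judge_price: per-token price of judge_model, relative to the flagship.
    mechanical_price: relative price of a second model used only for the
        mechanical steps; None means those steps also run on judge_model.
    mechanical_fraction: share of verifier tokens spent on mechanical steps.
    """
    if mechanical_price is None:
        mechanical_price = judge_price
    judged = (1.0 - mechanical_fraction) * judge_price
    mechanical = mechanical_fraction * mechanical_price
    return judged + mechanical


FLAGSHIP = 1.0        # reference price
MID_TIER = 0.2        # "roughly five times cheaper" (rough guide from the text)
CHEAPEST = 0.05       # assumption: a hypothetical even cheaper model
MECH_FRACTION = 0.05  # assumption: mechanical steps use 5% of verifier tokens

baseline = relative_verifier_cost(FLAGSHIP)
single_model = relative_verifier_cost(MID_TIER)
split_model = relative_verifier_cost(MID_TIER, CHEAPEST, MECH_FRACTION)

print(f"flagship judge:        {baseline:.4f}")
print(f"mid-tier judge:        {single_model:.4f}")
print(f"mid-tier + split tier: {split_model:.4f}")
print(f"extra saving from the split: {single_model - split_model:.4f}")
```

Under these assumed numbers, switching judge_model removes most of the cost, while the extra tier changes the total by less than one percent of the flagship baseline.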

Model suggestions in the doc

Direct routes:

anthropic/claude-sonnet-4-6
openai/gpt-5.5

OpenRouter routes (the common setup for the metabolomics evaluation):

openrouter/deepseek/deepseek-v4-pro
openrouter/z-ai/glm-5
a MiniMax route (the exact version tag drifts over time on OpenRouter, so the doc tells the reader to pick the latest capable version of whichever family they prefer)
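As a rough illustration of how these routes are shaped, the sketch below holds the suggested identifiers in a plain mapping and picks one as judge_model. Mimosa's actual configuration format is not shown in this PR, so the `config` dict, the `SUGGESTED_JUDGE_MODELS` name, and the `split_route` helper are all hypothetical; only the model strings come from the lists above.

```python
# Hypothetical illustration: the real config format lives in
# docs/getting-started/configuration.md and may look different.
SUGGESTED_JUDGE_MODELS = {
    "direct": [
        "anthropic/claude-sonnet-4-6",
        "openai/gpt-5.5",
    ],
    "openrouter": [
        "openrouter/deepseek/deepseek-v4-pro",
        "openrouter/z-ai/glm-5",
    ],
}


def split_route(model: str) -> tuple[str, str]:
    """Split a 'provider/rest' identifier into (provider, rest)."""
    provider, _, rest = model.partition("/")
    return provider, rest


# One capable-but-cheaper model for the whole verifier, no second tier.
config = {"judge_model": SUGGESTED_JUDGE_MODELS["openrouter"][0]}

provider, model_name = split_route(config["judge_model"])
print(f"judge provider: {provider}, model: {model_name}")
```

The MiniMax route is left out of the mapping on purpose, since its version tag is not pinned in the doc.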

Relation to PR #160

This is the replacement for #160. #160 added a separate judge_extraction_model knob to move the mechanical sub-calls onto a cheaper model. After narrowing that knob to only the genuinely mechanical calls (claim dedup and missing-package checks), the saving is tiny, because those calls are only a small fraction of the verifier's tokens. The real saving is in the model class of judge_model itself, which already exists, so no new knob is needed. #160 is being closed in favour of this doc change.

docs(config): recommend a capable-but-cheaper judge_model to cut evaluation cost

e7d6dbb

lfnothias mentioned this pull request Jun 29, 2026

feat(verifier): add optional cheaper model tier for mechanical judge calls #160

Closed

Fosowl merged commit b778a6d into mimosa_v2 Jun 29, 2026

Fosowl merged 1 commit into mimosa_v2 from docs/judge-model-cost
