From e7d6dbb7a4225334118e5b5eb9ca9f056f0b95bb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Louis-F=C3=A9lix=20Nothias?= <louis-felix.nothias@cnrs.fr>
Date: Mon, 29 Jun 2026 16:41:40 +0200
Subject: [PATCH] docs(config): recommend a capable-but-cheaper judge_model to
 cut evaluation cost

---
 docs/getting-started/configuration.md | 51 ++++++++++++++++++++++++---
 1 file changed, 47 insertions(+), 4 deletions(-)

diff --git a/docs/getting-started/configuration.md b/docs/getting-started/configuration.md
index 29ee811..030a367 100644
--- a/docs/getting-started/configuration.md
+++ b/docs/getting-started/configuration.md
@@ -49,11 +49,54 @@ LLM choice is the biggest lever you have. Some practical guidance:
 
 === "Judge"
 
-    The judge needs decent reasoning but is only called once per generation. A
-    capable medium-size model is usually enough:
-
+    The judge model scores the soft claims in the verifier. The verifier
+    is one of the largest token consumers in an evaluation run, so the
+    model you choose here has a direct effect on what a run ends up
+    costing.
+
+    The judge needs decent reasoning, but it does not need the
+    most expensive flagship model available. A capable medium-size
+    model is usually enough and costs far less per token. As a
+    rough guide, a mid-tier model such as a Sonnet-class Claude or
+    `deepseek-v4-pro` costs roughly one-fifth as much per token as
+    a top-tier flagship such as an Opus-class Claude, and it still
+    handles the kind of judging the verifier asks for. Good picks
+    that you call directly:
+
+    - `anthropic/claude-sonnet-4-6`
     - `openai/gpt-5.5`
-    - `anthropic/claude-opus-4-5`
+
+    If you run your models through OpenRouter, which is the common setup for
+    the metabolomics evaluation, there are several capable-but-cheaper routes
+    that also work well as a judge:
+
+    - `openrouter/deepseek/deepseek-v4-pro`
+    - `openrouter/z-ai/glm-5`
+    - an OpenRouter MiniMax route (for example the latest `minimax` model)
+
+    The exact version tag for a given model changes over time on OpenRouter,
+    so it is worth checking what is currently offered and picking the most
+    recent capable version of whichever family you prefer. Either way, the
+    point is the same: a mid-tier model is plenty for the judge, and it costs
+    much less than a flagship.
+
+    So the simplest and biggest cost saving you can make for evaluation is to
+    set `judge_model` to one capable-but-cheaper model and let the whole
+    verifier run on it, rather than pointing it at the most expensive model
+    you have.
+
+    You might be tempted to go one step further and add a second, even cheaper
+    model just for the small mechanical steps inside the verifier, for
+    example dropping duplicate claims or checking which Python packages a
+    claim needs. In practice this is not worth doing. Those mechanical steps
+    are only a small fraction of all the tokens the verifier uses, so moving
+    them onto a separate cheap model saves almost nothing while making the
+    configuration harder to follow. The calls that actually cost tokens, and
+    that actually need good judgement (rating how important a claim is,
+    choosing which files to look at, writing the small verifier script, and
+    giving the final verdict), are better left on the one `judge_model` you
+    already picked. Choosing a sensible `judge_model` is where essentially all
+    of the saving comes from.
 
 !!! tip "OpenRouter routing"
     For OpenRouter models, Mimosa picks providers from `openrouter_provider`