feat(llm-evals): apologist evaluation harness with matrix scoring + rubric polisher [NES-1664] #9213
…pts [NES-1664]

Adds libs/llm-evals/, an Nx library that fetches a labelled system prompt from Langfuse, runs it against scenario queries on a configurable LLM provider (OpenRouter default, Gemini direct, or Apologist gateway), and scores the response with a separate judge LLM against per-scenario positive and negative criteria.

- nx targets: eval (vitest), fetch-secrets (filtered Doppler pull), lint, type-check
- Scenario format supports acceptableExamples (positive) and unacceptableExamples (anti-patterns)
- Per-run output written to results/<timestamp>/summary.md + one file per scenario
- Two starter scenarios on the development base prompt (resurrection doubt, problem of evil)
- README documents the flow, label conventions (development = base, no production), and provider/judge toggles

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NES-1664]
Builds on the initial harness with three substantive additions.
1. Per-(scenario, model) matrix execution
- Scenario.models[] declares which models a scenario tests; the runner
flattens scenarios × models into cells and runs each.
- Results layout reorganised around (scenario, model) as the primary
key — results/<scenario-slug>/<provider>__<modelId>.md per cell,
plus results/summary.md aggregating the matrix.
- Selective re-runs via EVAL_SCENARIO and EVAL_MODEL env vars; only
the cells that ran are overwritten, the rest preserved from on-disk
metadata in <!-- llm-eval-meta {...} --> blocks.
- summary.md restructured: one H2 per scenario with its own table,
green/red pass indicators, judge reasoning grouped below the table.
2. Seven new scenarios covering doctrinal, factual, ethical, and
pastoral question types: Cain's wife, divorce after infidelity,
drinking alcohol, premarital sex, speaking in tongues, tattoos,
and the doctrine of the Trinity. Each scenario declares both
acceptableExamples (positive criteria) and unacceptableExamples
(anti-patterns) so the judge has paired criteria along each axis.
3. polish-rubric script — uses a configurable stronger model
(default apologist:anthropic/claude/sonnet-4.6) to read a scenario's
current rubric plus its observed cell outputs and propose a
sharpened version. Output written to libs/llm-evals/proposed-prompts/
(gitignored) for human review — never modifies scenario files
directly. Invoked via `nx run llm-evals:polish-rubric --scenario=<slug>`.
Also adds scripts/verify-routing.ts — diagnostic that confirms apologist
provider calls hit the configured gateway URL with the apologist key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
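The matrix execution described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the `ModelRef`, `Scenario`, and `Cell` shapes and the `flattenCells` name are hypothetical, and replacing `/` in model ids with `_` to keep each cell in a single file is an assumption, since the commit only states the `results/<scenario-slug>/<provider>__<modelId>.md` layout.

```typescript
// Hypothetical shapes; the real Scenario type in libs/llm-evals may differ.
interface ModelRef {
  provider: string;
  modelId: string;
}

interface Scenario {
  slug: string;
  models: ModelRef[];
}

interface Cell {
  scenario: Scenario;
  model: ModelRef;
  resultPath: string;
}

// Flatten scenarios × models into one cell per (scenario, model) pair.
function flattenCells(scenarios: Scenario[]): Cell[] {
  return scenarios.flatMap((scenario) =>
    scenario.models.map((model) => ({
      scenario,
      model,
      // Assumption: slashes in the model id are replaced so the path
      // stays results/<scenario-slug>/<provider>__<modelId>.md.
      resultPath: `results/${scenario.slug}/${model.provider}__${model.modelId.replace(/\//g, "_")}.md`,
    }))
  );
}
```

Keying each cell by (scenario, model) is what lets a later run overwrite one file without touching the others.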
header must not be longer than 100 characters, current length is 102
View your CI Pipeline Execution ↗ for commit 814037c
…-1664]

The eval summary, per-cell artefacts, and rubric proposals are the most valuable artefacts for stakeholders reviewing prompt and model behaviour. The earlier safety audit confirmed none of these files contain secrets, infrastructure URLs, or proprietary system-prompt content, so they are now tracked.

- libs/llm-evals/.gitignore reduced to only .env and .env.local.
- 45 cell artefacts (9 scenarios × 5 models), the summary, and 9 rubric proposals now visible in the PR.
- README updated to reflect tracked state and to drop the now-misleading "sidecar" terminology in favour of "proposal file".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeQL flagged `escapeCell` as incomplete string escaping — without escaping backslashes first, a literal `\` in any scenario name or model id would collide with the `\|` pipe escape sequence and produce ambiguous markdown. Also strip newlines, which break table cells regardless of escaping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
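The ordering issue in that commit message can be illustrated with a small sketch. This is not the PR's actual `escapeCell`. It is a hypothetical version that follows the described fix: backslashes are escaped first, then pipes. Replacing newlines with a single space (rather than deleting them) is an assumption.

```typescript
// Hypothetical sketch of a markdown table-cell escaper.
function escapeCell(value: string): string {
  return value
    // 1. Escape backslashes first, so a literal "\" cannot merge with
    //    the "\|" sequence produced in the next step.
    .replace(/\\/g, "\\\\")
    // 2. Escape pipes, which would otherwise end the table cell.
    .replace(/\|/g, "\\|")
    // 3. Assumption: collapse each line break to a space, since a raw
    //    newline breaks a table row regardless of escaping.
    .replace(/\r?\n|\r/g, " ");
}
```

If the two `replace` calls were swapped, the backslash introduced by the pipe escape would itself be doubled, which is why the backslash pass has to run first.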
Summary
A standalone LLM evaluation harness for the apologist chat's system prompt. Fetches a labelled prompt from Langfuse, runs each scenario × model in a matrix, and scores responses with an LLM-as-judge against per-scenario positive and negative criteria, all without booting `apps/journeys` or hitting `/api/chat`.

This is intentionally a long-lived branch, not a one-shot feature merge. The intent is for it to track main as the place where evals live and iterate, and to serve as the way we verify prompt changes before they reach production.
What's here
- `libs/llm-evals/`: new Nx library, four Nx targets: `eval`, `fetch-secrets`, `polish-rubric`, plus standard `lint` and `type-check`.
- `openrouter:google/gemini-3-flash-preview` (mirrors `/api/chat`), plus 4 apologist gateway models (`openai/gpt/4o-mini`, `anthropic/claude/haiku-4.5`, `google/gemini/3-flash`, `anthropic/claude/sonnet-4.6`).
- `polish-rubric` script: uses a stronger model (default Apologist Sonnet 4.6) to read a scenario's rubric plus its observed cell outputs and propose sharpened criteria. Writes proposals to gitignored `proposed-prompts/` for human review; never modifies scenarios directly.
- `verify-routing` script: diagnostic that confirms apologist calls hit the configured gateway with the apologist key.

Results layout (gitignored; see "Out of scope" below)
Per-cell artefacts contain prompt label, model, score, query, model output, judge reason, acceptable + unacceptable examples. Re-running a single cell overwrites only that cell; the summary self-heals by scanning all existing files on disk.
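The self-healing summary could be built along these lines. This is a sketch under stated assumptions: the commit message only says each cell file carries an `<!-- llm-eval-meta {...} -->` block, so the JSON payload, the `CellMeta` fields, and the `parseCellMeta` and `collectMetas` names are all hypothetical.

```typescript
// Hypothetical metadata shape; the real block may carry other fields.
interface CellMeta {
  scenario: string;
  model: string;
  passed: boolean;
}

// Matches the first <!-- llm-eval-meta {...} --> block in a cell file.
const META_RE = /<!--\s*llm-eval-meta\s+(\{[\s\S]*?\})\s*-->/;

// Returns the parsed metadata, or null if the block is missing or malformed.
function parseCellMeta(markdown: string): CellMeta | null {
  const match = META_RE.exec(markdown);
  if (!match) return null;
  try {
    // Assumption: the block body is plain JSON.
    return JSON.parse(match[1] ?? "") as CellMeta;
  } catch {
    return null;
  }
}

// Rebuilds the summary's input from whatever cell files exist on disk,
// skipping files that have no readable metadata block.
function collectMetas(fileContents: string[]): CellMeta[] {
  const metas: CellMeta[] = [];
  for (const content of fileContents) {
    const meta = parseCellMeta(content);
    if (meta !== null) metas.push(meta);
  }
  return metas;
}
```

Because the summary is derived from the files rather than from the current run's in-memory results, cells that did not run this time still appear in it.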
Selective re-runs
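As a rough illustration of how the `EVAL_SCENARIO` and `EVAL_MODEL` filters might narrow the matrix: the `CellKey` shape and `selectCells` name are hypothetical, and matching `EVAL_MODEL` against a `provider:modelId` string is an assumption based on how models are written elsewhere in this description.

```typescript
// Hypothetical key for one (scenario, model) cell.
interface CellKey {
  scenarioSlug: string;
  provider: string;
  modelId: string;
}

// Keeps only the cells matching the env filters; an unset or empty
// filter matches everything.
function selectCells<T extends CellKey>(
  cells: T[],
  env: Record<string, string | undefined>
): T[] {
  const scenario = env.EVAL_SCENARIO?.trim();
  const model = env.EVAL_MODEL?.trim();
  return cells.filter(
    (cell) =>
      (!scenario || cell.scenarioSlug === scenario) &&
      // Assumption: EVAL_MODEL is compared as "<provider>:<modelId>".
      (!model || `${cell.provider}:${cell.modelId}` === model)
  );
}
```

In a Node script this would be called as `selectCells(cells, process.env)`; only the returned cells are run and rewritten, and the rest keep their existing files.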
Provider / judge separation
The judge defaults to OpenRouter regardless of the eval-under-test model, so running a scenario against the cost-billed apologist gateway does not double-bill it for judging. Override with `EVAL_JUDGE_PROVIDER` when you want an apples-to-apples comparison.

Out of scope (gitignored)

- `libs/llm-evals/.env`, `.env.local`: Doppler-populated secrets
- `libs/llm-evals/results/`: per-run artefacts and summary
- `libs/llm-evals/proposed-prompts/`: rubric drafts staged for review

If stakeholders want to read example artefacts, force-add specific files with `git add -f`. README documents the workflow.

Test plan
- `nx lint llm-evals` ✓
- `nx type-check llm-evals` ✓
- `development` label, with the expected failure clusters surfaced by the rubric.
- `EVAL_SCENARIO=<slug>` only touches that scenario's cells; summary preserves the rest from on-disk metadata.
- `pnpm exec tsx libs/llm-evals/scripts/verify-routing.ts`: hostname, key prefix, and response identity all confirm the gateway is in the call path.
- `polish-rubric` produces a proposal file under `proposed-prompts/` with rationale + ready-to-paste TypeScript snippet, grounded in observed cell outputs.

Notes for reviewers

`proposed-prompts/` (gitignored) that we haven't applied yet.

🤖 Generated with Claude Code