feat(llm-evals): apologist evaluation harness with matrix scoring + rubric polisher [NES-1664] #9213

Draft
jaco-brink wants to merge 6 commits into main from jacobusbrink/nes-1664-chat-response-evaluation-harness

Conversation

@jaco-brink
Collaborator

Summary

A standalone LLM evaluation harness for the apologist chat's system prompt. Fetches a labelled prompt from Langfuse, runs each scenario × model in a matrix, and scores responses with an LLM-as-judge against per-scenario positive and negative criteria — all without booting apps/journeys or hitting /api/chat.

This is intentionally a long-lived branch, not a one-shot feature merge. It is meant to track main and serve as the place where evals live and iterate, and as the way we verify prompt changes before they reach production.

What's here

  • libs/llm-evals/ is a new Nx library with five Nx targets: eval, fetch-secrets, polish-rubric, and the standard lint and type-check.
  • 9 scenarios covering pastoral grief, intellectual doubt, factual, doctrinal (Trinity, tongues), ethical (alcohol, tattoos, premarital sex, divorce), and one biblical lookup (Cain's wife).
  • 5 models per scenario by default: openrouter:google/gemini-3-flash-preview (mirrors /api/chat), plus 4 apologist gateway models (openai/gpt/4o-mini, anthropic/claude/haiku-4.5, google/gemini/3-flash, anthropic/claude/sonnet-4.6).
  • polish-rubric script — uses a stronger model (default Apologist Sonnet 4.6) to read a scenario's rubric plus its observed cell outputs and propose sharpened criteria. Writes proposals to gitignored proposed-prompts/ for human review; never modifies scenarios directly.
  • verify-routing script — diagnostic that confirms apologist calls hit the configured gateway with the apologist key.
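The scenario × model matrix described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: only `models`, `acceptableExamples`, and `unacceptableExamples` are named in this PR, so `slug`, `query`, the `Cell` shape, and both helper functions are assumptions made for the example.

```typescript
// Hypothetical scenario shape. Only `models`, `acceptableExamples`, and
// `unacceptableExamples` appear in the PR text; the rest is assumed.
interface Scenario {
  slug: string
  query: string
  models: string[] // 'provider:modelId', e.g. 'apologist:openai/gpt/4o-mini'
  acceptableExamples: string[]
  unacceptableExamples: string[]
}

// One unit of work: a single scenario evaluated on a single model.
interface Cell {
  scenarioSlug: string
  provider: string
  modelId: string
}

// Split 'provider:modelId' on the first colon only, so colons inside the
// model id (if any) are preserved.
function parseModelRef(ref: string): { provider: string; modelId: string } {
  const idx = ref.indexOf(':')
  if (idx === -1) {
    throw new Error(`Expected "provider:modelId", got "${ref}"`)
  }
  return { provider: ref.slice(0, idx), modelId: ref.slice(idx + 1) }
}

// Flatten scenarios × models into a flat list of cells.
function flattenToCells(scenarios: Scenario[]): Cell[] {
  const cells: Cell[] = []
  for (const scenario of scenarios) {
    for (const ref of scenario.models) {
      cells.push({ scenarioSlug: scenario.slug, ...parseModelRef(ref) })
    }
  }
  return cells
}
```

With 9 scenarios that each list 5 models, a function like this would yield the 45 cells mentioned in the test plan.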

Results layout (gitignored; see "Out of scope" below)

libs/llm-evals/results/
├── summary.md                                       aggregate matrix, regenerated each run
├── <scenario-slug>/
│   ├── openrouter__<modelId>.md                     one canonical artefact per cell
│   ├── apologist__<modelId>.md
│   └── ...

Per-cell artefacts contain prompt label, model, score, query, model output, judge reason, acceptable + unacceptable examples. Re-running a single cell overwrites only that cell; the summary self-heals by scanning all existing files on disk.
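One way the summary could rebuild itself from disk is to read a metadata comment out of each cell file. The commit notes in this PR mention `<!-- llm-eval-meta {...} -->` blocks; the sketch below assumes the payload is a JSON object and invents the `CellMeta` field names purely for illustration. It works on the file contents as a string, so the directory scan itself is left out.

```typescript
// Hypothetical metadata fields; the real payload is not shown in this PR.
interface CellMeta {
  scenario: string
  model: string
  passed: boolean
}

// Extract and parse the first llm-eval-meta comment in a cell artefact.
// Returns null when the block is missing or its JSON is malformed.
function readCellMeta(markdown: string): CellMeta | null {
  const match = /<!--\s*llm-eval-meta\s+(\{[\s\S]*?\})\s*-->/.exec(markdown)
  const json = match?.[1]
  if (json === undefined) return null
  try {
    return JSON.parse(json) as CellMeta
  } catch {
    return null
  }
}

// Tally passes across whatever artefacts were found on disk.
function countPasses(files: string[]): { passed: number; total: number } {
  let passed = 0
  let total = 0
  for (const contents of files) {
    const meta = readCellMeta(contents)
    if (meta === null) continue // skip files without a readable meta block
    total += 1
    if (meta.passed) passed += 1
  }
  return { passed, total }
}
```

Because the tally is derived from every file that exists, re-running one cell changes only that cell's contribution to the summary.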

Selective re-runs

EVAL_SCENARIO=<slug> pnpm exec nx run llm-evals:eval
EVAL_SCENARIO=<slug> EVAL_MODEL='apologist:<modelId>' pnpm exec nx run llm-evals:eval
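As a rough sketch of how the two variables could narrow the matrix, the filter below keeps a cell only when it matches every variable that is set. `MatrixCell`, `selectCells`, and the treatment of an empty string as "unset" are assumptions for illustration; the environment is passed in as a plain object so the function stays testable.

```typescript
// Hypothetical cell shape: a scenario slug plus a 'provider:modelId' ref.
interface MatrixCell {
  scenarioSlug: string
  modelRef: string
}

type Env = Record<string, string | undefined>

// Keep only the cells selected by EVAL_SCENARIO and/or EVAL_MODEL.
// An unset or empty variable places no constraint on that axis.
function selectCells(cells: MatrixCell[], env: Env): MatrixCell[] {
  const scenario = env['EVAL_SCENARIO']
  const model = env['EVAL_MODEL']
  return cells.filter((cell) => {
    const scenarioOk =
      scenario === undefined || scenario === '' || cell.scenarioSlug === scenario
    const modelOk =
      model === undefined || model === '' || cell.modelRef === model
    return scenarioOk && modelOk
  })
}
```

In a Node script this would typically be called as `selectCells(allCells, process.env)`.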

Provider / judge separation

The judge defaults to OpenRouter regardless of the eval-under-test model — so running a scenario against the cost-billed apologist gateway does not double-bill it for judging. Override with EVAL_JUDGE_PROVIDER when you want apples-to-apples.
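The default-with-override behaviour might look something like the sketch below. The accepted values (`'openrouter'`, `'gemini'`, `'apologist'`) are inferred from the providers named in the commit notes and are assumptions, as is the function name; the real harness may spell them differently.

```typescript
// Assumed provider identifiers, based on the three providers this PR names.
type ProviderName = 'openrouter' | 'gemini' | 'apologist'

// Resolve the judge provider independently of the model under test:
// OpenRouter unless EVAL_JUDGE_PROVIDER explicitly says otherwise.
function resolveJudgeProvider(
  env: Record<string, string | undefined>
): ProviderName {
  const raw = env['EVAL_JUDGE_PROVIDER']
  if (raw === undefined || raw === '') return 'openrouter'
  if (raw === 'openrouter' || raw === 'gemini' || raw === 'apologist') {
    return raw
  }
  // Fail loudly on typos rather than silently judging with the default.
  throw new Error(`Unknown EVAL_JUDGE_PROVIDER: "${raw}"`)
}
```

Keeping this resolution separate from the eval model's provider is what prevents a gateway run from also paying for judging on the gateway.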

Out of scope (gitignored)

  • libs/llm-evals/.env, .env.local — Doppler-populated secrets
  • libs/llm-evals/results/ — per-run artefacts and summary
  • libs/llm-evals/proposed-prompts/ — rubric drafts staged for review

If stakeholders want to read example artefacts, force-add specific files with git add -f. README documents the workflow.

Test plan

  • nx lint llm-evals
  • nx type-check llm-evals
  • Full 45-cell matrix run completes (9 scenarios × 5 models). 32/45 passing on the current development label, with the expected failure clusters surfaced by the rubric.
  • Selective re-run: EVAL_SCENARIO=<slug> only touches that scenario's cells; summary preserves the rest from on-disk metadata.
  • Apologist routing verified live via pnpm exec tsx libs/llm-evals/scripts/verify-routing.ts — hostname, key prefix, and response identity all confirm the gateway is in the call path.
  • polish-rubric produces a proposal file under proposed-prompts/ with rationale + ready-to-paste TypeScript snippet, grounded in observed cell outputs.

Notes for reviewers

  • This is a draft because the eval results themselves are still being iterated on. The harness, scenarios, and tooling are stable; the rubrics in each scenario have polished proposals waiting under proposed-prompts/ (gitignored) that we haven't applied yet.
  • Branch is long-lived by design — please rebase on main when reviewing rather than merging directly. Cleanup / merge strategy can be decided when the harness is ready to be the team's standard eval workflow.

🤖 Generated with Claude Code

jaco-brink and others added 2 commits May 13, 2026 04:39
…pts [NES-1664]

Adds libs/llm-evals/ — a Nx library that fetches a labelled system prompt
from Langfuse, runs it against scenario queries on a configurable LLM
provider (OpenRouter default, Gemini direct, or Apologist gateway), and
scores the response with a separate judge LLM against per-scenario
positive and negative criteria.

- nx targets: eval (vitest), fetch-secrets (filtered Doppler pull), lint, type-check
- Scenario format supports acceptableExamples (positive) and unacceptableExamples (anti-patterns)
- Per-run output written to results/<timestamp>/summary.md + one file per scenario
- Two starter scenarios on the development base prompt (resurrection doubt, problem of evil)
- README documents the flow, label conventions (development = base, no production), and provider/judge toggles

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NES-1664]

Builds on the initial harness with three substantive additions.

1. Per-(scenario, model) matrix execution
   - Scenario.models[] declares which models a scenario tests; the runner
     flattens scenarios × models into cells and runs each.
   - Results layout reorganised around (scenario, model) as the primary
     key — results/<scenario-slug>/<provider>__<modelId>.md per cell,
     plus results/summary.md aggregating the matrix.
   - Selective re-runs via EVAL_SCENARIO and EVAL_MODEL env vars; only
     the cells that ran are overwritten, the rest preserved from on-disk
     metadata in <!-- llm-eval-meta {...} --> blocks.
   - summary.md restructured: one H2 per scenario with its own table,
     green/red pass indicators, judge reasoning grouped below the table.

2. Seven new scenarios covering doctrinal, factual, ethical, and
   pastoral question types: Cain's wife, divorce after infidelity,
   drinking alcohol, premarital sex, speaking in tongues, tattoos,
   and the doctrine of the Trinity. Each scenario declares both
   acceptableExamples (positive criteria) and unacceptableExamples
   (anti-patterns) so the judge has paired criteria along each axis.

3. polish-rubric script — uses a configurable stronger model
   (default apologist:anthropic/claude/sonnet-4.6) to read a scenario's
   current rubric plus its observed cell outputs and propose a
   sharpened version. Output written to libs/llm-evals/proposed-prompts/
   (gitignored) for human review — never modifies scenario files
   directly. Invoked via `nx run llm-evals:polish-rubric --scenario=<slug>`.

Also adds scripts/verify-routing.ts — diagnostic that confirms apologist
provider calls hit the configured gateway URL with the apologist key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
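To make the paired-criteria idea in item 2 concrete, here is a hedged sketch of how a judge prompt could be assembled from a scenario's positive and negative examples. The prompt wording, the `Rubric` shape, and the requested JSON verdict format are all invented for illustration and are not taken from the harness.

```typescript
// Hypothetical rubric shape; field names follow the PR where it names them.
interface Rubric {
  query: string
  acceptableExamples: string[]
  unacceptableExamples: string[]
}

// Render a list of criteria as plain-text bullets.
function toBullets(items: string[]): string {
  return items.map((item) => `- ${item}`).join('\n')
}

// Build a single judge prompt that shows both sides of the rubric
// alongside the response being graded.
function buildJudgePrompt(rubric: Rubric, response: string): string {
  return [
    'You are grading a chat response against a rubric.',
    `User query:\n${rubric.query}`,
    `A passing response resembles:\n${toBullets(rubric.acceptableExamples)}`,
    `A failing response resembles:\n${toBullets(rubric.unacceptableExamples)}`,
    `Response under test:\n${response}`,
    'Reply with JSON: {"pass": boolean, "reason": string}.'
  ].join('\n\n')
}
```

Presenting both lists in one prompt gives the judge a positive and a negative anchor for each scenario, which is the stated motivation for declaring both kinds of example.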
@linear

linear Bot commented May 14, 2026

NES-1664

@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7f0a4f1c-90fc-4681-8fba-fe03638500a5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@github-actions
Contributor

github-actions Bot commented May 14, 2026

Fails
🚫 Please ensure your PR title matches commitlint convention.
🚫 Please assign someone to merge this PR.
Warnings
⚠️ ❗ Big PR (5320 changes)

(change count - 5320): Pull request size seems relatively large. If the pull request contains multiple changes, splitting each into a separate PR will help make review faster and easier.

(pr title - feat(llm-evals): apologist evaluation harness with matrix scoring + rubric polisher [NES-1664]):

header must not be longer than 100 characters, current length is 102

Generated by 🚫 dangerJS against 814037c

@nx-cloud

nx-cloud Bot commented May 14, 2026

View your CI Pipeline Execution ↗ for commit 814037c

| Command | Status | Duration | Result |
| --- | --- | --- | --- |
| nx affected --target=subgraph-check --base=d4b1... | ✅ Succeeded | 4s | View ↗ |
| nx affected --target=extract-translations --bas... | ✅ Succeeded | 1s | View ↗ |
| nx affected --target=lint --base=d4b1f905cf1762... | ✅ Succeeded | 6s | View ↗ |
| nx affected --target=type-check --base=d4b1f905... | ✅ Succeeded | 3s | View ↗ |
| nx run-many --target=codegen --all --parallel=3 | ✅ Succeeded | <1s | View ↗ |
| nx run-many --target=prisma-generate --all --pa... | ✅ Succeeded | 4s | View ↗ |

☁️ Nx Cloud last updated this comment at 2026-05-14 04:13:35 UTC

Comment thread libs/llm-evals/eval.spec.ts Fixed
autofix-ci Bot and others added 4 commits May 14, 2026 03:23
…-1664]

The eval summary, per-cell artefacts, and rubric proposals are the most
valuable artefacts for stakeholders reviewing prompt and model
behaviour. The earlier safety audit confirmed none of these files
contain secrets, infrastructure URLs, or proprietary system-prompt
content, so they are now tracked.

- libs/llm-evals/.gitignore reduced to only .env and .env.local.
- 45 cell artefacts (9 scenarios × 5 models), the summary, and 9
  rubric proposals now visible in the PR.
- README updated to reflect tracked state and to drop the now-misleading
  "sidecar" terminology in favour of "proposal file".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeQL flagged `escapeCell` as incomplete string escaping — without
escaping backslashes first, a literal `\` in any scenario name or model
id would collide with the `\|` pipe escape sequence and produce
ambiguous markdown.

Also strip newlines, which break table cells regardless of escaping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
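The ordering issue described in this commit can be shown with a small sketch. The actual `escapeCell` in the PR is not reproduced here, so the body below is an assumed implementation of the same idea: escape backslashes before pipes, then collapse newlines.

```typescript
// Escape a value for use inside a markdown table cell.
// Order matters: backslashes are doubled first so that the backslash
// added by the pipe escape cannot be confused with one from the input.
function escapeCell(value: string): string {
  return value
    .replace(/\\/g, '\\\\') // 1. literal backslash becomes two backslashes
    .replace(/\|/g, '\\|') // 2. pipe becomes backslash + pipe
    .replace(/[\r\n]+/g, ' ') // 3. newlines would end the table row
}
```

If the pipe were escaped first, the backslash pass would then double the backslash that the pipe escape had just added, leaving an escaped backslash followed by a bare pipe that splits the cell. If backslashes were not escaped at all, an input containing a literal backslash followed by a pipe would be indistinguishable from an escaped pipe.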
2 participants