feat(llm-evals): apologist evaluation harness with matrix scoring + rubric polisher [NES-1664] by jaco-brink · Pull Request #9213 · JesusFilm/core

jaco-brink · 2026-05-14T03:19:33Z

Summary

A standalone LLM evaluation harness for the apologist chat's system prompt. Fetches a labelled prompt from Langfuse, runs each scenario × model in a matrix, and scores responses with an LLM-as-judge against per-scenario positive and negative criteria — all without booting apps/journeys or hitting /api/chat.

This is intentionally a long-lived branch, not a one-shot feature merge. The intent is for it to track main and serve as the place where evals live and iterate, and as the way we verify prompt changes before they reach production.

What's here

libs/llm-evals/: new Nx library with three custom Nx targets (eval, fetch-secrets, polish-rubric) plus the standard lint and type-check.
9 scenarios covering pastoral grief, intellectual doubt, factual, doctrinal (Trinity, tongues), ethical (alcohol, tattoos, premarital sex, divorce), and one biblical lookup (Cain's wife).
5 models per scenario by default: openrouter:google/gemini-3-flash-preview (mirrors /api/chat), plus 4 apologist gateway models (openai/gpt/4o-mini, anthropic/claude/haiku-4.5, google/gemini/3-flash, anthropic/claude/sonnet-4.6).
polish-rubric script — uses a stronger model (default Apologist Sonnet 4.6) to read a scenario's rubric plus its observed cell outputs and propose sharpened criteria. Writes proposals to gitignored proposed-prompts/ for human review; never modifies scenarios directly.
verify-routing script — diagnostic that confirms apologist calls hit the configured gateway with the apologist key.
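
The scenario × model matrix listed above can be pictured as a simple flattening step. This is a minimal sketch, not the harness's actual code: the field names `models`, `acceptableExamples` and `unacceptableExamples` appear in the commit messages, while `slug`, `query`, `ModelRef`, `Cell` and `flattenCells` are hypothetical names chosen for illustration.

```typescript
// Hypothetical shapes. Only `models`, `acceptableExamples` and
// `unacceptableExamples` are names taken from the PR text.
interface ModelRef {
  provider: string // e.g. 'openrouter' or 'apologist'
  modelId: string
}

interface Scenario {
  slug: string // assumed identifier, as used by EVAL_SCENARIO
  query: string // assumed: the user question sent to the model
  models: ModelRef[]
  acceptableExamples: string[]
  unacceptableExamples: string[]
}

interface Cell {
  scenario: Scenario
  model: ModelRef
}

// Flatten scenarios × models into one cell per (scenario, model) pair.
function flattenCells(scenarios: Scenario[]): Cell[] {
  const cells: Cell[] = []
  for (const scenario of scenarios) {
    for (const model of scenario.models) {
      cells.push({ scenario, model })
    }
  }
  return cells
}
```

With 9 scenarios that each declare 5 models, a function like this would yield the 45 cells mentioned in the test plan.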

Results layout (gitignored; see "Out of scope" below)

libs/llm-evals/results/
├── summary.md                                       aggregate matrix, regenerated each run
├── <scenario-slug>/
│   ├── openrouter__<modelId>.md                     one canonical artefact per cell
│   ├── apologist__<modelId>.md
│   └── ...

Per-cell artefacts contain prompt label, model, score, query, model output, judge reason, acceptable + unacceptable examples. Re-running a single cell overwrites only that cell; the summary self-heals by scanning all existing files on disk.
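
To illustrate how one file per cell lets the summary rebuild itself from disk, here is a rough sketch under stated assumptions. `cellPath` and `groupByScenario` are hypothetical helpers, and replacing `/` in model ids with `-` is an assumption; the PR does not say how model ids are turned into file names. The functions take plain path strings so the idea can be shown without touching the file system.

```typescript
// Build the relative path of a cell artefact inside results/.
// Assumption: "/" in a model id is replaced so the id fits in one file name.
function cellPath(scenarioSlug: string, provider: string, modelId: string): string {
  const safeModelId = modelId.replace(/\//g, '-')
  return scenarioSlug + '/' + provider + '__' + safeModelId + '.md'
}

// Group the cell files found on disk by scenario slug. A summary step could
// call this on a directory listing, so cells that were not re-run still appear.
function groupByScenario(relativePaths: string[]): Record<string, string[]> {
  const grouped: Record<string, string[]> = {}
  for (const path of relativePaths) {
    const parts = path.split('/')
    // Keep only <scenario-slug>/<file>.md; this skips summary.md at the top level.
    if (parts.length !== 2 || parts[1].slice(-3) !== '.md') continue
    const slug = parts[0]
    if (!Object.prototype.hasOwnProperty.call(grouped, slug)) grouped[slug] = []
    grouped[slug].push(parts[1])
  }
  return grouped
}
```

Because the grouping depends only on what is on disk, overwriting a single cell file changes just that entry the next time the summary is regenerated.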

Selective re-runs

EVAL_SCENARIO=<slug> pnpm exec nx run llm-evals:eval
EVAL_SCENARIO=<slug> EVAL_MODEL='apologist:<modelId>' pnpm exec nx run llm-evals:eval
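
A filter along the following lines would give the behaviour shown in the two commands above. It is a sketch only: `CellKey` and `selectCells` are hypothetical names, and the environment is passed in as a plain object (in practice something like `process.env`) so the function stays easy to test. The `<provider>:<modelId>` form for `EVAL_MODEL` follows the command shown above.

```typescript
interface CellKey {
  scenarioSlug: string
  model: string // '<provider>:<modelId>', e.g. 'apologist:<modelId>'
}

interface EvalEnv {
  EVAL_SCENARIO?: string
  EVAL_MODEL?: string
}

// Keep only the cells matching the env filters. An unset or empty
// variable means "no filter on that axis".
function selectCells(cells: CellKey[], env: EvalEnv): CellKey[] {
  return cells.filter(
    (cell) =>
      (!env.EVAL_SCENARIO || cell.scenarioSlug === env.EVAL_SCENARIO) &&
      (!env.EVAL_MODEL || cell.model === env.EVAL_MODEL)
  )
}
```

With neither variable set the full matrix runs; setting only `EVAL_SCENARIO` selects that scenario's row, and setting both narrows the run to a single cell.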

Provider / judge separation

The judge defaults to OpenRouter regardless of the eval-under-test model — so running a scenario against the cost-billed apologist gateway does not double-bill it for judging. Override with EVAL_JUDGE_PROVIDER when you want apples-to-apples.
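
The default-plus-override rule described above could be expressed as follows. This is a minimal sketch: `resolveJudgeProvider` is a hypothetical name, and the assumption that `EVAL_JUDGE_PROVIDER` holds a provider name such as `apologist` is mine. Only the variable name and the OpenRouter default come from the description.

```typescript
// The judge provider is chosen independently of the model under test.
// Assumption: the override value is a provider name like 'apologist'.
function resolveJudgeProvider(env: { EVAL_JUDGE_PROVIDER?: string }): string {
  const override = (env.EVAL_JUDGE_PROVIDER || '').trim()
  return override !== '' ? override : 'openrouter'
}
```

Keeping this decision in one place means the provider of the model under test never leaks into the judge call unless the override is set explicitly.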

Out of scope (gitignored)

libs/llm-evals/.env, .env.local — Doppler-populated secrets
libs/llm-evals/results/ — per-run artefacts and summary
libs/llm-evals/proposed-prompts/ — rubric drafts staged for review

If stakeholders want to read example artefacts, force-add specific files with git add -f. README documents the workflow.

Test plan

nx lint llm-evals ✓
nx type-check llm-evals ✓
Full 45-cell matrix run completes (9 scenarios × 5 models). 32/45 passing on the current development label, with the expected failure clusters surfaced by the rubric.
Selective re-run: EVAL_SCENARIO=<slug> only touches that scenario's cells; summary preserves the rest from on-disk metadata.
Apologist routing verified live via pnpm exec tsx libs/llm-evals/scripts/verify-routing.ts — hostname, key prefix, and response identity all confirm the gateway is in the call path.
polish-rubric produces a proposal file under proposed-prompts/ with rationale + ready-to-paste TypeScript snippet, grounded in observed cell outputs.

Notes for reviewers

This is a draft because the eval results themselves are still being iterated on. The harness, scenarios, and tooling are stable; the rubrics in each scenario have polished proposals waiting under proposed-prompts/ (gitignored) that we haven't applied yet.
Branch is long-lived by design — please rebase on main when reviewing rather than merging directly. Cleanup / merge strategy can be decided when the harness is ready to be the team's standard eval workflow.

🤖 Generated with Claude Code

…pts [NES-1664]

Adds libs/llm-evals/, an Nx library that fetches a labelled system prompt from Langfuse, runs it against scenario queries on a configurable LLM provider (OpenRouter default, Gemini direct, or Apologist gateway), and scores the response with a separate judge LLM against per-scenario positive and negative criteria.

- nx targets: eval (vitest), fetch-secrets (filtered Doppler pull), lint, type-check
- Scenario format supports acceptableExamples (positive) and unacceptableExamples (anti-patterns)
- Per-run output written to results/<timestamp>/summary.md + one file per scenario
- Two starter scenarios on the development base prompt (resurrection doubt, problem of evil)
- README documents the flow, label conventions (development = base, no production), and provider/judge toggles

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…NES-1664]

Builds on the initial harness with three substantive additions.

1. Per-(scenario, model) matrix execution
   - Scenario.models[] declares which models a scenario tests; the runner flattens scenarios × models into cells and runs each.
   - Results layout reorganised around (scenario, model) as the primary key: results/<scenario-slug>/<provider>__<modelId>.md per cell, plus results/summary.md aggregating the matrix.
   - Selective re-runs via EVAL_SCENARIO and EVAL_MODEL env vars; only the cells that ran are overwritten, the rest preserved from on-disk metadata in  blocks.
   - summary.md restructured: one H2 per scenario with its own table, green/red pass indicators, judge reasoning grouped below the table.

2. Seven new scenarios covering doctrinal, factual, ethical, and pastoral question types: Cain's wife, divorce after infidelity, drinking alcohol, premarital sex, speaking in tongues, tattoos, and the doctrine of the Trinity. Each scenario declares both acceptableExamples (positive criteria) and unacceptableExamples (anti-patterns) so the judge has paired criteria along each axis.

3. polish-rubric script: uses a configurable stronger model (default apologist:anthropic/claude/sonnet-4.6) to read a scenario's current rubric plus its observed cell outputs and propose a sharpened version. Output is written to libs/llm-evals/proposed-prompts/ (gitignored) for human review; it never modifies scenario files directly. Invoked via `nx run llm-evals:polish-rubric --scenario=<slug>`.

Also adds scripts/verify-routing.ts, a diagnostic that confirms apologist provider calls hit the configured gateway URL with the apologist key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
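
The restructured summary described in this commit (one H2 per scenario, a table with pass indicators, judge reasoning below the table) might be rendered roughly like this. It is an illustrative sketch, not the code in the branch: `CellResult`, `renderSummary`, the column headings, and the use of ✅/❌ as the green/red indicators are all assumptions.

```typescript
interface CellResult {
  scenarioSlug: string
  model: string // '<provider>:<modelId>'
  passed: boolean
  judgeReason: string
}

// Render one H2 section per scenario: a result table first,
// then the judge's reasoning for each cell grouped below it.
function renderSummary(results: CellResult[]): string {
  const order: string[] = []
  const byScenario: Record<string, CellResult[]> = {}
  for (const result of results) {
    if (!Object.prototype.hasOwnProperty.call(byScenario, result.scenarioSlug)) {
      byScenario[result.scenarioSlug] = []
      order.push(result.scenarioSlug) // keep first-seen scenario order
    }
    byScenario[result.scenarioSlug].push(result)
  }

  const lines: string[] = []
  for (const slug of order) {
    lines.push('## ' + slug, '', '| Model | Result |', '| --- | --- |')
    for (const cell of byScenario[slug]) {
      lines.push('| ' + cell.model + ' | ' + (cell.passed ? '✅' : '❌') + ' |')
    }
    lines.push('')
    for (const cell of byScenario[slug]) {
      lines.push('- ' + cell.model + ': ' + cell.judgeReason)
    }
    lines.push('')
  }
  return lines.join('\n')
}
```

The sketch leaves cell values unescaped for brevity; real table output would need to escape pipes and newlines in model ids and scenario names.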

linear · 2026-05-14T03:19:36Z

NES-1664

coderabbitai · 2026-05-14T03:19:40Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7f0a4f1c-90fc-4681-8fba-fe03638500a5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

github-actions · 2026-05-14T03:20:26Z

	Fails
🚫	Please ensure your PR title matches commitlint convention.
🚫	Please assign someone to merge this PR.

	Warnings
⚠️	❗ Big PR (5320 changes)

(change count - 5320): Pull Request size seems relatively large. If the Pull Request contains multiple changes, splitting each into a separate PR will allow a faster, easier review.

(pr title - feat(llm-evals): apologist evaluation harness with matrix scoring + rubric polisher [NES-1664]):

header must not be longer than 100 characters, current length is 102

Generated by 🚫 dangerJS against 814037c

nx-cloud · 2026-05-14T03:20:43Z

View your CI Pipeline Execution ↗ for commit 814037c

Command	Status	Duration	Result
`nx affected --target=subgraph-check --base=d4b1...`	✅ Succeeded	4s	View ↗
`nx affected --target=extract-translations --bas...`	✅ Succeeded	1s	View ↗
`nx affected --target=lint --base=d4b1f905cf1762...`	✅ Succeeded	6s	View ↗
`nx affected --target=type-check --base=d4b1f905...`	✅ Succeeded	3s	View ↗
`nx run-many --target=codegen --all --parallel=3`	✅ Succeeded	<1s	View ↗
`nx run-many --target=prisma-generate --all --pa...`	✅ Succeeded	4s	View ↗

☁️ Nx Cloud last updated this comment at 2026-05-14 04:13:35 UTC

…-1664]

The eval summary, per-cell artefacts, and rubric proposals are the most valuable artefacts for stakeholders reviewing prompt and model behaviour. The earlier safety audit confirmed none of these files contain secrets, infrastructure URLs, or proprietary system-prompt content, so they are now tracked.

- libs/llm-evals/.gitignore reduced to only .env and .env.local.
- 45 cell artefacts (9 scenarios × 5 models), the summary, and 9 rubric proposals now visible in the PR.
- README updated to reflect tracked state and to drop the now-misleading "sidecar" terminology in favour of "proposal file".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CodeQL flagged `escapeCell` as incomplete string escaping — without escaping backslashes first, a literal `\` in any scenario name or model id would collide with the `\|` pipe escape sequence and produce ambiguous markdown. Also strip newlines, which break table cells regardless of escaping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
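
The ordering issue in this fix can be shown with a short sketch. Only the function name `escapeCell` comes from the commit message; the body below is an assumed reconstruction of the described behaviour, and replacing newlines with a single space (rather than deleting them) is my choice for readability.

```typescript
// Escape a value for use inside a markdown table cell.
// Order matters: backslashes must be escaped before pipes, otherwise the
// backslash added by the pipe escape would itself be ambiguous.
function escapeCell(value: string): string {
  return value
    .replace(/\\/g, '\\\\') // 1. literal backslash becomes two backslashes
    .replace(/\|/g, '\\|') // 2. pipe becomes backslash + pipe
    .replace(/\r?\n|\r/g, ' ') // 3. newlines would end the table row
}
```

If the pipe escape ran first, an input containing a literal backslash followed by a pipe and an input containing only a pipe could no longer be told apart in the output, which is the ambiguity the incomplete-escaping warning points at.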

jaco-brink and others added 2 commits May 13, 2026 04:39

github-advanced-security AI found potential problems May 14, 2026

View reviewed changes

Comment thread libs/llm-evals/eval.spec.ts Fixed

autofix-ci Bot and others added 4 commits May 14, 2026 03:23

fix: lint issues

d40d0f0

fix: lint issues

f2755c6

jaco-brink wants to merge 6 commits into main from jacobusbrink/nes-1664-chat-response-evaluation-harness
