There is no lightweight way to quickly test a config change, prompt edit, or model swap against a small dataset without running the full pipeline. This makes iteration slow and expensive.
Current state
The existing test suite (tests/engine/) contains 22 unit tests that mock LLM calls. There is no tool for running live end-to-end experiments with real models against a small slice of data and comparing results across runs.
Proposed harness
A CLI command or notebook (or both) that supports:
Run a named config preset or inline config against N records from a dataset
Diff output columns (leakage_mass, utility_score, judge, needs_human_review) across two runs side by side
Print a summary table: avg leakage, avg utility, avg judge, # needing repair, # needing human review, # errors
Optionally pin a model or override defaults without editing source files
Work with any of the existing test datasets (TAB, BIO, RAT-Bench, Enron)
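To make the diff and summary requirements above concrete, here is a minimal sketch in plain Python. It is not existing library code: the function names `summarize` and `diff_runs`, the `record_id` key, and the `needs_repair` and `error` fields are hypothetical, and the sample rows are made-up illustration data. It also assumes `leakage_mass`, `utility_score`, and `judge` are numeric, which may not match the real output schema.

```python
from statistics import mean

# Numeric output columns named in this issue; treating them as numeric is an assumption.
METRICS = ("leakage_mass", "utility_score", "judge")


def summarize(rows):
    """Aggregate one run's per-record results into the summary-table fields."""
    ok = [r for r in rows if not r.get("error")]  # "error" is a hypothetical field
    summary = {"n_records": len(rows), "n_errors": len(rows) - len(ok)}
    for m in METRICS:
        vals = [r[m] for r in ok if r.get(m) is not None]
        summary[f"avg_{m}"] = mean(vals) if vals else None
    # "needs_repair" is a hypothetical per-record flag.
    summary["n_needs_repair"] = sum(bool(r.get("needs_repair")) for r in ok)
    summary["n_needs_human_review"] = sum(bool(r.get("needs_human_review")) for r in ok)
    return summary


def diff_runs(run_a, run_b, key="record_id",
              columns=METRICS + ("needs_human_review",)):
    """Return side-by-side (a, b) values for records where any column differs."""
    b_by_id = {r[key]: r for r in run_b}
    changes = []
    for a in run_a:
        b = b_by_id.get(a[key])
        if b is None:
            continue  # record missing from the second run; skipped in this sketch
        changed = {c: (a.get(c), b.get(c)) for c in columns if a.get(c) != b.get(c)}
        if changed:
            changes.append({key: a[key], **changed})
    return changes


# Made-up rows standing in for two runs over the same two records.
run_a = [
    {"record_id": 1, "leakage_mass": 0.2, "utility_score": 0.8, "judge": 4,
     "needs_human_review": False, "needs_repair": False},
    {"record_id": 2, "leakage_mass": 0.6, "utility_score": 0.5, "judge": 2,
     "needs_human_review": True, "needs_repair": True},
]
run_b = [
    {"record_id": 1, "leakage_mass": 0.2, "utility_score": 0.8, "judge": 4,
     "needs_human_review": False, "needs_repair": False},
    {"record_id": 2, "leakage_mass": 0.4, "utility_score": 0.5, "judge": 3,
     "needs_human_review": False, "needs_repair": False},
]

summary_a = summarize(run_a)
summary_b = summarize(run_b)
changes = diff_runs(run_a, run_b)
```

Keeping aggregation and diffing as pure functions over lists of dicts would let the same logic back both a CLI command and a notebook; how runs are actually executed and stored is left open here.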
Why this matters
Right now, testing a prompt change requires editing code → reinstalling → running a full notebook → manually inspecting output. A harness would let Amy and others iterate on prompts and configs directly, without touching the library internals.