feat: quick experimentation harness for testing configs and prompts #111

## Summary

There is no lightweight way to test a config change, prompt edit, or model swap against a small dataset without running the full pipeline, which makes iteration slow and expensive.

## Current state

The existing test suite (`tests/engine/`) contains 22 unit tests that mock LLM calls. There is no tool for running live end-to-end experiments with real models against a small slice of data and comparing results across runs.

## Proposed harness

A CLI command or notebook (or both) that supports:
- [ ] Run a named config preset or inline config against N records from a dataset
- [ ] Diff output columns (`leakage_mass`, `utility_score`, `judge`, `needs_human_review`) across two runs side by side
- [ ] Print a summary table: avg leakage, avg utility, avg judge, # needing repair, # needing human review, # errors
- [ ] Optionally pin a model or override defaults without editing source files
- [ ] Work with any of the existing test datasets (TAB, BIO, RAT-Bench, Enron)
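
As a starting point for discussion, the summary-table and diff items above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a proposed final design: it assumes each run is a list of dicts (one per record), that `leakage_mass`, `utility_score`, and `judge` are numeric, and that records carry hypothetical `id` and `error` keys. The function names `summarize` and `diff_runs` are illustrative only. The "# needing repair" count is left out because the issue does not name the column it would come from.

```python
from statistics import mean

# Score columns named in the issue; treating all three as numeric is an assumption.
SCORE_COLUMNS = ("leakage_mass", "utility_score", "judge")


def summarize(records):
    """Aggregate one run into summary-table fields.

    Assumes a failed record carries a truthy hypothetical "error" key.
    """
    ok = [r for r in records if not r.get("error")]
    summary = {"n_records": len(records), "n_errors": len(records) - len(ok)}
    for col in SCORE_COLUMNS:
        values = [r[col] for r in ok if r.get(col) is not None]
        summary[f"avg_{col}"] = mean(values) if values else None
    summary["n_needs_human_review"] = sum(
        bool(r.get("needs_human_review")) for r in ok
    )
    return summary


def diff_runs(run_a, run_b, key="id"):
    """Pair records by a hypothetical key and return per-record deltas (b - a)."""
    by_key = {r[key]: r for r in run_a}
    rows = []
    for rb in run_b:
        ra = by_key.get(rb[key])
        if ra is None:
            continue  # record only present in run_b; nothing to compare against
        row = {key: rb[key]}
        for col in SCORE_COLUMNS:
            a, b = ra.get(col), rb.get(col)
            row[col] = None if a is None or b is None else b - a
        # Show the review flag from both runs side by side rather than a delta.
        row["needs_human_review"] = (
            ra.get("needs_human_review"),
            rb.get("needs_human_review"),
        )
        rows.append(row)
    return rows
```

Calling `summarize(run)` on each run gives the two summary rows, and `diff_runs(baseline, candidate)` gives the side-by-side view; how those are rendered (CLI table or notebook DataFrame) is left open.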

## Why this matters

Right now, testing a prompt change requires editing code → reinstalling → running a full notebook → manually inspecting output. A harness would let Amy and others iterate on prompts and configs directly, without touching the library internals.
