feat: quick experimentation harness for testing configs and prompts #111

@lipikaramaswamy

Summary

There is no lightweight way to quickly test a config change, prompt edit, or model swap against a small dataset without running the full pipeline. This makes iteration slow and expensive.

Current state

The existing test suite (tests/engine/) contains 22 unit tests that mock LLM calls. There is no tool for running live end-to-end experiments with real models against a small slice of data and comparing results across runs.

Proposed harness

A CLI command or notebook (or both) that supports:

  • Run a named config preset or inline config against N records from a dataset
  • Diff output columns (leakage_mass, utility_score, judge, needs_human_review) across two runs side by side
  • Print a summary table: avg leakage, avg utility, avg judge, # needing repair, # needing human review, # errors
  • Optionally pin a model or override defaults without editing source files
  • Work with any of the existing test datasets (TAB, BIO, RAT-Bench, Enron)
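
To make the summary and diff bullets above concrete, here is a minimal sketch of what those two pieces could look like. It assumes each run is a list of per-record dicts keyed by the output columns named above and that `judge` is numeric. The `needs_repair` and `error` keys, the function names `summarize` and `diff_runs`, and the sample rows are hypothetical placeholders for illustration, not existing library API or real results.

```python
from statistics import mean

# Numeric output columns named in this issue (judge is assumed numeric).
METRIC_COLUMNS = ("leakage_mass", "utility_score", "judge")


def summarize(rows):
    """Aggregate one run into the summary fields listed above."""
    # "error" is a hypothetical key marking records that failed.
    ok = [r for r in rows if not r.get("error")]
    summary = {"n_records": len(rows), "n_errors": len(rows) - len(ok)}
    for col in METRIC_COLUMNS:
        values = [r[col] for r in ok if r.get(col) is not None]
        summary[f"avg_{col}"] = round(mean(values), 4) if values else None
    # "needs_repair" is a hypothetical column name for the repair count.
    summary["n_needs_repair"] = sum(bool(r.get("needs_repair")) for r in ok)
    summary["n_needs_human_review"] = sum(
        bool(r.get("needs_human_review")) for r in ok
    )
    return summary


def diff_runs(run_a, run_b, columns=METRIC_COLUMNS + ("needs_human_review",)):
    """Pair records by position and return the columns whose values differ."""
    diffs = []
    for index, (a, b) in enumerate(zip(run_a, run_b)):
        changed = {
            col: (a.get(col), b.get(col))
            for col in columns
            if a.get(col) != b.get(col)
        }
        if changed:
            diffs.append((index, changed))
    return diffs


# Made-up rows for illustration only; not real experiment output.
baseline = [
    {"leakage_mass": 0.4, "utility_score": 0.8, "judge": 4,
     "needs_human_review": False, "needs_repair": True},
    {"leakage_mass": 0.2, "utility_score": 0.6, "judge": 3,
     "needs_human_review": True, "needs_repair": False},
]
candidate = [
    {"leakage_mass": 0.1, "utility_score": 0.8, "judge": 4,
     "needs_human_review": False, "needs_repair": False},
    {"error": "timeout"},
]

if __name__ == "__main__":
    print(summarize(baseline))
    print(summarize(candidate))
    for index, changed in diff_runs(baseline, candidate):
        print(index, changed)
```

Pairing records by position keeps the sketch short; a real harness would probably want to join on a stable record ID so that a failed or skipped record does not shift the comparison.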

Why this matters

Right now, testing a prompt change requires editing code → reinstalling → running a full notebook → manually inspecting output. A harness would let Amy and others iterate on prompts and configs directly, without touching the library internals.
