Optional per-iteration prompt dump for faithful reproduction #6

@samkeen

Motivation

memory_load events tell us which channels the agent saw and their hashes, but not the exact bytes the model received on each turn. When debugging surprising behavior, especially the kind that ends up in the Altered Craft article, being able to read back the literal prompt, tool schemas, and interleaved tool results is invaluable.

Proposal

Behind an env flag (e.g. TILTH_PROMPT_DUMP=1), write the rendered request to sessions/<id>/prompts/<task_id>-iter<N>.md (or .json) before each client.chat() call. Include the system prompt, the user prompt, tool schemas, and the conversation history at the moment of the call.
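
A minimal sketch of the write step, assuming a Python codebase. `dump_prompt`, `prompt_dump_enabled`, and the keys of the `request` dict are hypothetical names chosen for illustration; only the `TILTH_PROMPT_DUMP` flag and the `sessions/<id>/prompts/<task_id>-iter<N>` layout come from the proposal above. It uses the `.json` variant.

```python
from __future__ import annotations

import json
import os
from pathlib import Path
from typing import Any, Optional


def prompt_dump_enabled() -> bool:
    # Assumption: only the literal value "1" enables dumps; unset means off.
    return os.environ.get("TILTH_PROMPT_DUMP") == "1"


def dump_prompt(
    session_dir: Path, task_id: str, iteration: int, request: dict[str, Any]
) -> Optional[str]:
    """Write the rendered request; return its session-relative path, or None if disabled."""
    if not prompt_dump_enabled():
        return None
    rel_path = f"prompts/{task_id}-iter{iteration}.json"
    target = session_dir / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    # The request is expected to hold the system prompt, user prompt,
    # tool schemas, and conversation history at the moment of the call.
    target.write_text(
        json.dumps(request, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return rel_path
```

The caller would invoke `dump_prompt(...)` immediately before `client.chat()` and keep the returned relative path for the event payload. Returning `None` when the flag is off keeps the default path free of any disk writes.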

events.jsonl would carry a path reference in the model_call payload (e.g. prompt_dump: "prompts/T-001-iter3.md") so consumers can navigate from the event to the exact request.
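
As a rough illustration of the path reference, the sketch below appends a `model_call` event and shows how a consumer might resolve the reference back to a file. The event envelope (`type`, `payload`, `task_id`, `iteration`) and the helper names are assumptions; only the `prompt_dump` key and the `events.jsonl` file name come from this issue.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Optional


def model_call_event(
    task_id: str, iteration: int, prompt_dump: Optional[str]
) -> dict[str, Any]:
    """Build a model_call event; the prompt_dump key appears only when a dump was written."""
    payload: dict[str, Any] = {"task_id": task_id, "iteration": iteration}
    if prompt_dump is not None:
        payload["prompt_dump"] = prompt_dump
    return {"type": "model_call", "payload": payload}


def append_event(session_dir: Path, event: dict[str, Any]) -> None:
    # One JSON object per line, appended to the session's event log.
    with (session_dir / "events.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def dump_paths(session_dir: Path) -> list[Path]:
    """Resolve every prompt_dump reference in events.jsonl against the session directory."""
    paths: list[Path] = []
    with (session_dir / "events.jsonl").open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            event = json.loads(line)
            ref = event.get("payload", {}).get("prompt_dump")
            if event.get("type") == "model_call" and ref:
                paths.append(session_dir / ref)
    return paths
```

Storing a session-relative path rather than an absolute one means the session directory can be moved or archived without breaking the references.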

Trade-offs

  • Cost: a few KB to ~20KB per iteration. At the upper bound, a 20-iteration task comes to about 400KB, and a multi-task session could reach the MB range. Mitigations: gzip on write, or enable only on demand via the env flag (default off).
  • Privacy: prompts can include workspace contents (file reads, diffs). Gating behind a flag keeps this opt-in.
  • Bloat to events.jsonl: none — the event only stores a path, not the prompt itself.
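
The gzip mitigation and the size estimate from the cost bullet could look roughly like this. The helper names are hypothetical, and a `.json.gz` extension would be a new choice that the issue does not specify.

```python
from __future__ import annotations

import gzip
import json
from pathlib import Path
from typing import Any


def write_gzipped_dump(path: Path, request: dict[str, Any]) -> int:
    """Write the request as gzip-compressed JSON and return the bytes used on disk."""
    raw = json.dumps(request, ensure_ascii=False).encode("utf-8")
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wb") as f:
        f.write(raw)
    return path.stat().st_size


def read_gzipped_dump(path: Path) -> dict[str, Any]:
    """Read a dump back for inspection."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def estimate_task_bytes(iterations: int, bytes_per_iteration: int) -> int:
    # Uncompressed upper bound: every iteration writes one dump of the given size.
    return iterations * bytes_per_iteration
```

With the figures above, `estimate_task_bytes(20, 20_000)` gives 400,000 bytes, which matches the ~400KB estimate. Conversation history repeats across iterations, so dumps are likely to compress well, though the actual ratio depends on the workload.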

Default behavior

Off. This is a debugger feature; reach for it when chasing a specific question. Production runs should not pay the disk cost by default.

Acceptance criteria

  • TILTH_PROMPT_DUMP=1 enables prompt dumps; default is off
  • Dumps land in sessions/<id>/prompts/ with one file per client.chat() call
  • model_call event payload references the dump path when written
  • File naming makes ordering obvious (task_id + iter, plus a monotonic suffix for non-iter calls such as judge/self-improve)
  • No behavior change when the flag is off; existing test suite still passes
  • README/CLAUDE.md mention the flag in the observability section
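
One possible reading of the naming criterion, sketched under assumptions: iteration calls keep the `<task_id>-iter<N>` form from the proposal, while non-iteration calls get a call kind plus a per-task counter. The `DumpNamer` class, the `kind` labels, and the three-digit padding are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Optional


class DumpNamer:
    """Hands out dump file names; non-iteration calls get a monotonic per-task suffix."""

    def __init__(self) -> None:
        # Counter of non-iteration calls seen so far, keyed by task_id.
        self._seq: defaultdict[str, int] = defaultdict(int)

    def name(
        self, task_id: str, iteration: Optional[int] = None, kind: str = "call"
    ) -> str:
        if iteration is not None:
            return f"{task_id}-iter{iteration}.md"
        self._seq[task_id] += 1
        return f"{task_id}-{kind}-{self._seq[task_id]:03d}.md"
```

For example, a single namer would produce `T-001-iter3.md` for the third iteration, then `T-001-judge-001.md` and `T-001-self-improve-002.md` for the two follow-up calls. Because the counter is shared across kinds within a task, the suffix alone shows the order in which the non-iteration calls happened.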

Context

Spun out of an observability pass that added memory_load and hook_run events plus OTel-shaped trace/span IDs. See conversation around the addition of `tilth/summary.json` aggregation for related work.
