
feat: write evolution run reports #64

Closed
steezkelly wants to merge 1 commit into NousResearch:main from steezkelly:feat/54-run-report


@steezkelly

Summary

Implements issue #54 step 2: local, machine-readable promotion artifacts for evolution runs.

Adds:

  • evolution/core/run_report.py
  • build_run_report(...) for constructing sanitized report payloads
  • write_run_report(...) for writing reports/runs/<timestamp>-<target>.json
  • artifact diffs via skill.diff
  • wiring in evolve_skill.py so successful non-dry-run evolutions write and print a run report path
  • failure-path reporting when an evolved skill fails constraints, so failed variants still produce auditable artifacts
  • tests that generate reports/diffs without live LLM calls
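
For reviewers who want a feel for the build-then-write flow listed above, here is a minimal sketch. It is not the code in evolution/core/run_report.py: the function names carry a `_sketch` suffix because the real signatures are elided in this description, and the SHA-256 hash, the key names, and the `.diff` file name next to the JSON report are all assumptions made for illustration.

```python
import difflib
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def artifact_stats(text: str) -> dict:
    """Hash and size of an artifact (SHA-256 over UTF-8 bytes is an assumption)."""
    data = text.encode("utf-8")
    return {"hash": hashlib.sha256(data).hexdigest(), "size": len(data)}


def build_run_report_sketch(target, baseline_text, optimized_text, constraint_results):
    """Build a report payload holding hashes, sizes, and constraint results only."""
    optimized = artifact_stats(optimized_text)
    return {
        "target": target,
        "baseline": artifact_stats(baseline_text),
        "optimized": optimized,
        # Compatibility aliases mirroring the optimized artifact.
        "evolved_hash": optimized["hash"],
        "evolved_size": optimized["size"],
        "constraints": constraint_results,
        "constraints_passed": all(c["passed"] for c in constraint_results),
    }


def write_run_report_sketch(report, baseline_text, optimized_text, root="reports/runs"):
    """Write <timestamp>-<target>.json plus a unified diff; return the report path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir = Path(root)
    out_dir.mkdir(parents=True, exist_ok=True)
    report_path = out_dir / f"{stamp}-{report['target']}.json"
    diff_path = out_dir / f"{stamp}-{report['target']}.diff"  # hypothetical name

    diff_lines = difflib.unified_diff(
        baseline_text.splitlines(keepends=True),
        optimized_text.splitlines(keepends=True),
        fromfile="baseline",
        tofile="optimized",
    )
    diff_path.write_text("".join(diff_lines), encoding="utf-8")

    payload = dict(report, output_path=str(report_path), diff_path=str(diff_path))
    report_path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return report_path
```

Because the aggregate flag is derived from the individual constraint results, a failed variant still goes through the same write path and leaves an auditable report with `constraints_passed` set to false.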

Report fields include:

  • baseline and optimized artifact hashes/sizes
  • compatibility aliases: evolved_hash / evolved_size
  • dataset source and split counts
  • optimizer/eval model names
  • constraint results and aggregate pass/fail
  • holdout score delta when available
  • null cost/latency estimate placeholders when unavailable
  • output and diff paths
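
The optional fields can be illustrated with a small sketch. The key names (`holdout_score_delta`, `estimated_cost_usd`, `estimated_latency_s`) and the helper names are hypothetical; the point is only that unavailable values are kept as explicit `None` entries, which serialize to JSON `null`, instead of being dropped from the report.

```python
import json
from typing import Optional


def holdout_delta(baseline_score: Optional[float],
                  optimized_score: Optional[float]) -> Optional[float]:
    """Return optimized minus baseline, or None when either score is missing."""
    if baseline_score is None or optimized_score is None:
        return None
    return optimized_score - baseline_score


def metrics_section(baseline_score: Optional[float],
                    optimized_score: Optional[float],
                    cost_usd: Optional[float] = None,
                    latency_s: Optional[float] = None) -> dict:
    """Assemble the metrics part of a report; missing estimates stay as None."""
    return {
        "holdout_score_delta": holdout_delta(baseline_score, optimized_score),
        "estimated_cost_usd": cost_usd,    # placeholder until a real estimate exists
        "estimated_latency_s": latency_s,  # placeholder until a real estimate exists
    }


if __name__ == "__main__":
    # With no holdout scores and no estimates, every value is serialized as null.
    print(json.dumps(metrics_section(None, None), indent=2))
```

Keeping the keys present with `null` values means consumers can rely on a stable schema and distinguish "not measured" from "measured as zero".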

Safety / privacy notes

The report stores artifact hashes, sizes, metrics, paths, and constraint messages. It does not persist raw session dumps or raw private tool outputs. The diff is the baseline-vs-optimized artifact diff already written for local review.
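
One way to enforce this kind of guarantee is an allowlist applied before serialization. The sketch below is an assumption about how such a filter could look, not a description of how this PR sanitizes payloads; `ALLOWED_KEYS` and `sanitize_payload` are hypothetical names, and only top-level keys are filtered.

```python
# Hypothetical allowlist of top-level report keys that are safe to persist.
ALLOWED_KEYS = frozenset({
    "target",
    "baseline",
    "optimized",
    "metrics",
    "constraints",
    "output_path",
    "diff_path",
})


def sanitize_payload(payload: dict, allowed: frozenset = ALLOWED_KEYS) -> dict:
    """Keep only allowlisted top-level keys; anything else is dropped."""
    return {key: value for key, value in payload.items() if key in allowed}


if __name__ == "__main__":
    raw = {
        "target": "demo-skill",
        "metrics": {"holdout_score_delta": None},
        "raw_session_dump": "...private content...",  # must not reach disk
    }
    print(sorted(sanitize_payload(raw)))
```

An allowlist fails closed: a new field added upstream is excluded from the report until someone deliberately marks it as safe.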

No remote mutation or PR automation is introduced here.

Test Plan

  • RED first: pytest tests/core/test_run_report.py -q failed because evolution.core.run_report did not exist
  • pytest tests/core/test_run_report.py -q
  • pytest -q
  • runtime probe for successful and failed-score report writes
  • static added-line security scan
  • git diff --check

Result: 142 passed, 11 warnings (DSPy deprecation warnings only).

Partially addresses #54.

@steezkelly
Author

Closing this split PR in favor of consolidated PR #67. Local integration found review/merge overhead across the stack (notably #61/#64 overlap in evolution/skills/evolve_skill.py), and #67 preserves the combined local test evidence: targeted stack tests 21 passed; full suite 160 passed; GitHub checks were absent on the split PRs. Review #67 instead.
