
feat: add benchmark gate for run reports #63

Closed
steezkelly wants to merge 1 commit into NousResearch:main from steezkelly:feat/54-benchmark-gate

@steezkelly

Summary

Implements issue #54 step 3: a conservative local benchmark_gate.py for machine-readable evolution run reports.

Adds:

  • evolution/core/benchmark_gate.py
  • structured BenchmarkThresholds and GateResult
  • evaluate_report(report, thresholds) for library use
  • CLI: python -m evolution.core.benchmark_gate --report <report.json>
  • threshold overrides:
    • --min-holdout-improvement
    • --max-artifact-growth
    • --max-cost-increase
  • optional local benchmark commands via repeatable --benchmark-command
  • tests for pass, fail, missing fields, and threshold override behavior
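
For readers who want to picture the CLI surface listed above, here is a minimal argparse sketch. Only the flag names come from this PR; the `float` types, the `None` defaults, the help strings, and making `--report` required are assumptions for illustration, not the contents of benchmark_gate.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags named in the PR description."""
    parser = argparse.ArgumentParser(prog="python -m evolution.core.benchmark_gate")
    # Assumption: the report path is mandatory.
    parser.add_argument("--report", required=True, help="path to a run report JSON file")
    # Assumption: thresholds are floats, and None means "use the built-in default".
    parser.add_argument("--min-holdout-improvement", type=float, default=None)
    parser.add_argument("--max-artifact-growth", type=float, default=None)
    parser.add_argument("--max-cost-increase", type=float, default=None)
    # action="append" is the standard argparse way to make a flag repeatable.
    parser.add_argument("--benchmark-command", action="append", default=[])
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this shape, passing `--benchmark-command` twice yields a two-element list in `args.benchmark_command`, and omitting it yields an empty list.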

Behavior

The gate fails closed when required promotion data is missing or unsafe:

  • missing required report fields
  • constraints_passed is not true
  • holdout improvement below threshold
  • artifact growth above threshold
  • cost threshold requested but report lacks cost fields
  • optional benchmark command exits non-zero
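
The fail-closed rules above can be sketched roughly as follows. `BenchmarkThresholds`, `GateResult`, `evaluate_report`, and `constraints_passed` are names from this PR; every other report key, the default values, and the `failures` field are hypothetical stand-ins, and the sketch assumes numeric report values. Benchmark commands are left out.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class BenchmarkThresholds:
    # Defaults are placeholders, not the values used in the PR.
    min_holdout_improvement: float = 0.0
    max_artifact_growth: float = 0.2
    max_cost_increase: Optional[float] = None  # None: cost is not gated


@dataclass
class GateResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


# Hypothetical key names, apart from constraints_passed.
REQUIRED_FIELDS = ("constraints_passed", "holdout_improvement", "artifact_growth")


def evaluate_report(report: Mapping[str, Any], thresholds: BenchmarkThresholds) -> GateResult:
    failures: list[str] = []
    missing = [name for name in REQUIRED_FIELDS if name not in report]
    if missing:
        # Fail closed: without the required data nothing else is checked.
        return GateResult(False, [f"missing required fields: {', '.join(missing)}"])
    if report["constraints_passed"] is not True:
        failures.append("constraints_passed is not true")
    if report["holdout_improvement"] < thresholds.min_holdout_improvement:
        failures.append("holdout improvement below threshold")
    if report["artifact_growth"] > thresholds.max_artifact_growth:
        failures.append("artifact growth above threshold")
    if thresholds.max_cost_increase is not None:
        cost = report.get("cost_increase")  # hypothetical cost field
        if cost is None:
            failures.append("cost threshold requested but report lacks cost fields")
        elif cost > thresholds.max_cost_increase:
            failures.append("cost increase above threshold")
    return GateResult(passed=not failures, failures=failures)
```

Checking `constraints_passed is not True` rather than truthiness means values such as `"yes"` or `1` also fail, which fits the conservative intent described above.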

The CLI prints deterministic JSON and exits non-zero when the gate fails, matching the issue acceptance criteria.
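
One common way to get deterministic JSON output and a matching exit status is sketched below. The payload shape (`passed`, `failures`) and the helper names `render_result` and `emit` are assumptions for illustration; the PR does not show its output schema here.

```python
import json


def render_result(passed: bool, failures: list[str]) -> str:
    """Serialize a gate outcome so identical inputs give identical text."""
    payload = {"passed": passed, "failures": list(failures)}
    # sort_keys fixes the key order; a fixed indent keeps the layout stable.
    return json.dumps(payload, sort_keys=True, indent=2)


def emit(passed: bool, failures: list[str]) -> int:
    """Print the JSON result and return the process exit code."""
    print(render_result(passed, failures))
    return 0 if passed else 1


# A real entry point would typically end with something like:
#     raise SystemExit(emit(result.passed, result.failures))
```

Returning the exit code from a function, rather than calling `sys.exit` deep inside the logic, keeps the gate easy to call from tests and from library code.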

Scope / non-goals

This PR implements the standalone benchmark gate. It does not yet wire the gate into evolve_skill.py; that is issue #54 step 5 and should come after run-report generation lands.

--benchmark-command is intentionally local/user-provided. No remote mutation, PR creation, or upstream write occurs.
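
As an illustration of how user-provided benchmark commands could be run locally, here is a small sketch. The helper name `run_benchmark_commands`, the use of `shlex.split` without a shell, and the failure messages are assumptions; the PR does not state how benchmark_gate.py invokes the commands.

```python
import shlex
import subprocess


def run_benchmark_commands(commands: list[str]) -> list[str]:
    """Run each command locally and return one failure message per problem."""
    failures: list[str] = []
    for command in commands:
        argv = shlex.split(command)  # assumption: no shell, plain argv splitting
        if not argv:
            failures.append("empty benchmark command")
            continue
        try:
            completed = subprocess.run(argv, check=False)
        except OSError as exc:
            # A command that cannot start is treated as a failure (fail closed).
            failures.append(f"benchmark command could not start: {command!r} ({exc})")
            continue
        if completed.returncode != 0:
            failures.append(f"benchmark command exited {completed.returncode}: {command!r}")
    return failures
```

An empty list means every command exited zero; any entry would cause the gate to fail.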

Test Plan

  • RED first: pytest tests/core/test_benchmark_gate.py -q failed because evolution.core.benchmark_gate did not exist
  • pytest tests/core/test_benchmark_gate.py -q
  • pytest -q
  • python -m evolution.core.benchmark_gate --help
  • static added-line security scan
  • git diff --check

Result: 146 passed, 11 warnings (DSPy deprecation warnings only).

Partially addresses #54.

@steezkelly
Author

Closing this split PR in favor of consolidated PR #67. Local integration surfaced review and merge overhead across the stack (notably the #61/#64 overlap in evolution/skills/evolve_skill.py). #67 preserves the combined local test evidence: targeted stack tests, 21 passed; full suite, 160 passed. GitHub checks were absent on the split PRs. Please review #67 instead.
