feat: add benchmark gate for run reports by steezkelly · Pull Request #63 · NousResearch/hermes-agent-self-evolution

steezkelly · 2026-05-09T03:01:53Z

Summary

Implements issue #54 step 3: a conservative local benchmark_gate.py for machine-readable evolution run reports.

Adds:

- evolution/core/benchmark_gate.py
- structured BenchmarkThresholds and GateResult
- evaluate_report(report, thresholds) for library use
- CLI: python -m evolution.core.benchmark_gate --report <report.json>
- threshold overrides:
  - --min-holdout-improvement
  - --max-artifact-growth
  - --max-cost-increase
- optional local benchmark commands via repeatable --benchmark-command
- tests for pass, fail, missing fields, and threshold override behavior
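For readers who have not opened the diff, the CLI surface listed above could be declared roughly as follows. This is a minimal sketch, not the code in this PR: the flag names come from the description, while the `build_parser` helper, the argument types, the help strings, and the `None` defaults are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags named in the PR description."""
    parser = argparse.ArgumentParser(prog="python -m evolution.core.benchmark_gate")
    parser.add_argument("--report", required=True,
                        help="path to a machine-readable run report (JSON)")
    # Assumed: overrides are floats, and None means "use the built-in threshold".
    parser.add_argument("--min-holdout-improvement", type=float, default=None)
    parser.add_argument("--max-artifact-growth", type=float, default=None)
    parser.add_argument("--max-cost-increase", type=float, default=None)
    # action="append" makes the flag repeatable: each use adds one command.
    parser.add_argument("--benchmark-command", action="append", default=None,
                        help="local command to run; may be given more than once")
    return parser


def parse_cli(argv):
    """Parse argv and normalise the repeatable flag to a list."""
    args = build_parser().parse_args(argv)
    args.benchmark_command = args.benchmark_command or []
    return args
```

Keeping the overrides at `None` until the user supplies them lets the caller distinguish "not requested" from "requested with value 0", which matters for the cost check described under Behavior.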

Behavior

The gate fails closed when required promotion data is missing or unsafe:

- missing required report fields
- constraints_passed is not true
- holdout improvement below threshold
- artifact growth above threshold
- cost threshold requested but report lacks cost fields
- optional benchmark command exits non-zero
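The fail-closed rules above can be sketched as a pure function over a report dictionary. This is an illustration under stated assumptions, not the implementation in this PR: `BenchmarkThresholds`, `GateResult`, `evaluate_report`, and `constraints_passed` are names from the description, but the dataclass fields, the default values, and the report keys `holdout_improvement`, `artifact_growth`, and `cost_increase` are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Any, List, Mapping, Optional


@dataclass(frozen=True)
class BenchmarkThresholds:
    # Default values are illustrative assumptions, not the PR's defaults.
    min_holdout_improvement: float = 0.0
    max_artifact_growth: float = 0.2
    max_cost_increase: Optional[float] = None  # None: cost check not requested


@dataclass
class GateResult:
    passed: bool
    failures: List[str] = field(default_factory=list)


# Hypothetical report keys; only constraints_passed is named in the PR text.
REQUIRED_FIELDS = ("constraints_passed", "holdout_improvement", "artifact_growth")


def _number(report: Mapping[str, Any], key: str) -> Optional[float]:
    """Return a finite float for report[key], or None if absent or unusable."""
    value = report.get(key)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value) if math.isfinite(value) else None


def evaluate_report(report: Mapping[str, Any],
                    thresholds: BenchmarkThresholds) -> GateResult:
    failures = [f"missing required field: {k}" for k in REQUIRED_FIELDS if k not in report]

    # Only the literal boolean True passes; "true", 1, and None all fail.
    if "constraints_passed" in report and report["constraints_passed"] is not True:
        failures.append("constraints_passed is not true")

    if "holdout_improvement" in report:
        holdout = _number(report, "holdout_improvement")
        if holdout is None or holdout < thresholds.min_holdout_improvement:
            failures.append("holdout improvement below threshold")

    if "artifact_growth" in report:
        growth = _number(report, "artifact_growth")
        if growth is None or growth > thresholds.max_artifact_growth:
            failures.append("artifact growth above threshold")

    # The cost check runs only when requested, and then missing data is a failure.
    if thresholds.max_cost_increase is not None:
        cost = _number(report, "cost_increase")
        if cost is None:
            failures.append("cost threshold requested but report lacks cost fields")
        elif cost > thresholds.max_cost_increase:
            failures.append("cost increase above threshold")

    return GateResult(passed=not failures, failures=failures)
```

In this sketch a non-numeric or non-finite value is treated the same as a threshold violation, so a malformed report can never pass by accident.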

The CLI prints deterministic JSON and exits non-zero when the gate fails, matching the issue acceptance criteria.
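One common way to get deterministic JSON together with a meaningful exit status is to serialise with sorted keys and derive the exit code from the gate outcome. The sketch below assumes a result dictionary with a `passed` key; the `render_result` and `finish` helpers and the payload shape are hypothetical and may differ from what the PR emits.

```python
import json
import sys
from typing import Any, Mapping, Optional, TextIO


def render_result(result: Mapping[str, Any]) -> str:
    """Serialise with sorted keys so equal results always produce equal text."""
    return json.dumps(result, sort_keys=True, indent=2)


def finish(result: Mapping[str, Any], stream: Optional[TextIO] = None) -> int:
    """Print the result and return the process exit code (0 = gate passed)."""
    out = stream if stream is not None else sys.stdout
    out.write(render_result(result) + "\n")
    # Fail closed: anything other than the literal True yields a non-zero code.
    return 0 if result.get("passed") is True else 1
```

A `main()` function would then end with `sys.exit(finish(payload))`, so shell scripts and CI jobs can branch on the exit status without parsing the output.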

Scope / non-goals

This PR implements the standalone benchmark gate. It does not yet wire the gate into evolve_skill.py; that is issue #54 step 5 and should come after run-report generation lands.

--benchmark-command is intentionally local/user-provided. No remote mutation, PR creation, or upstream write occurs.
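A local-only runner for user-provided commands might look like the following. It is a sketch under assumptions, not the PR's code: the `run_benchmark_commands` helper, the timeout value, and the failure messages are hypothetical. Commands are split with `shlex` and executed without a shell, and a command that cannot start or times out is counted as a failure, in keeping with the fail-closed behavior.

```python
import shlex
import subprocess
from typing import Iterable, List


def run_benchmark_commands(commands: Iterable[str],
                           timeout_s: float = 600.0) -> List[str]:
    """Run each command locally and return one failure message per bad command."""
    failures: List[str] = []
    for command in commands:
        argv = shlex.split(command)
        if not argv:
            failures.append("empty benchmark command")
            continue
        try:
            # No shell=True: the command string is never interpreted by a shell.
            completed = subprocess.run(argv, capture_output=True, text=True,
                                       timeout=timeout_s, check=False)
        except (OSError, subprocess.TimeoutExpired) as exc:
            failures.append(
                f"benchmark command could not complete: {command!r} ({type(exc).__name__})")
            continue
        if completed.returncode != 0:
            failures.append(
                f"benchmark command exited {completed.returncode}: {command!r}")
    return failures
```

An empty return value means every command exited with status 0; any entry in the list would be appended to the gate's failures.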

Test Plan

- RED first: pytest tests/core/test_benchmark_gate.py -q failed because evolution.core.benchmark_gate did not exist
- pytest tests/core/test_benchmark_gate.py -q
- pytest -q
- python -m evolution.core.benchmark_gate --help
- static added-line security scan
- git diff --check

Result: 146 passed, 11 warnings (DSPy deprecation warnings only).

Partially addresses #54.

steezkelly · 2026-05-09T03:30:01Z

Closing this split PR in favor of consolidated PR #67. Local integration showed that the split stack adds review and merge overhead (notably the #61/#64 overlap in evolution/skills/evolve_skill.py). #67 preserves the combined local test evidence: targeted stack tests, 21 passed; full suite, 160 passed. No GitHub checks ran on the split PRs. Review #67 instead.

feat: add benchmark gate for run reports

46f07eb

This was referenced May 9, 2026

fix: fail fast on unsupported tblite gate #62

Closed

feat: consolidate issue 54 ingestion and promotion gates #67

Closed

steezkelly closed this May 9, 2026

steezkelly mentioned this pull request May 9, 2026

Implement all-agent session ingestion and promotion gates #54

Open

steezkelly wants to merge 1 commit into NousResearch:main from steezkelly:feat/54-benchmark-gate
