
Benchmark Phase 1: semi-automated runner over a 3-5 task set #79

Description

@suleimansh

Child of #75. Builds on Phase 0 (#78). Turn the manual method into a semi-automated runner over a small task set.

Goal

3 to 5 tasks, run through a runner that launches the agent, detects acceptance, times the run, and logs structured human-intervention events.
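A minimal sketch of what "structured human-intervention events" could look like. The issue does not define the event fields, so `InterventionKind`, `InterventionEvent`, and `InterventionLog` are hypothetical names chosen for illustration, and the kind values are assumptions.

```typescript
// Hypothetical event kinds; the real set would come from the #75 design.
type InterventionKind = "clarification" | "manual_edit" | "restart";

// One structured record per time a human had to step in during a run.
interface InterventionEvent {
  kind: InterventionKind;
  atSeconds: number; // seconds elapsed since the run started
  note: string;      // free-text description of what the human did
}

// Collects events for a single run so the runner can report them afterwards.
class InterventionLog {
  private readonly startedAtMs: number;
  private readonly events: InterventionEvent[] = [];

  constructor(startedAtMs: number) {
    this.startedAtMs = startedAtMs;
  }

  // nowMs is injectable so the log can be exercised without real clock time.
  record(kind: InterventionKind, note: string, nowMs: number = Date.now()): void {
    this.events.push({
      kind,
      atSeconds: (nowMs - this.startedAtMs) / 1000,
      note,
    });
  }

  count(): number {
    return this.events.length;
  }

  // Returns a copy so callers cannot mutate the log's internal array.
  all(): InterventionEvent[] {
    return [...this.events];
  }
}
```

Timestamping each event relative to the run start, rather than storing wall-clock time, would keep events comparable across runs and frameworks.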

Deliverables

  • Runner: check out the starting commit -> launch the agent through the GemStack or Next.js adapter -> stream events -> poll the acceptance script -> emit report.json { framework, task, runIndex, seconds, interventions, status }.
  • Two thin adapters (GemStack / Next.js) so the runner stays framework-agnostic.
  • A task set of 3 to 5 tasks drawn from the categories in the #75 design ("Benchmark: GemStack AI orchestration vs Next.js (task speed + human interventions)"): feature add, schema change, bug fix, AI integration, refactor. Each task has its own acceptance script.
  • N runs per task, starting with N = 5, with one raw report.json kept per run.
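The runner loop and adapter boundary in the deliverables above could be sketched roughly as follows. This is only an illustration under stated assumptions: `AgentAdapter`, `RunOptions`, and `runOnce` are hypothetical names, `interventions` is assumed to be a count, the `status` values are assumed, and the checkout and event-streaming steps are left out. The acceptance script is represented by an injected `accepted` callback rather than a real process call.

```typescript
type Framework = "GemStack" | "Next.js";
type Status = "accepted" | "timeout"; // assumed values; the issue does not list them

// Fields taken from the report.json shape named in the issue.
interface Report {
  framework: Framework;
  task: string;
  runIndex: number;
  seconds: number;
  interventions: number; // assumption: a count rather than a list of events
  status: Status;
}

// Hypothetical thin adapter: everything framework-specific lives behind it.
interface AgentAdapter {
  framework: Framework;
  launch(task: string): Promise<void>;
  interventionCount(): number;
}

interface RunOptions {
  task: string;
  runIndex: number;
  timeoutMs: number;
  pollMs: number;
  accepted: () => Promise<boolean>; // would wrap the task's acceptance script
  now?: () => number;               // injectable clock, defaults to Date.now()
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Launches the agent, polls for acceptance until the timeout, and returns one report.
async function runOnce(adapter: AgentAdapter, opts: RunOptions): Promise<Report> {
  const now = opts.now ?? (() => Date.now());
  const start = now();
  await adapter.launch(opts.task);

  let status: Status = "timeout";
  while (now() - start < opts.timeoutMs) {
    if (await opts.accepted()) {
      status = "accepted";
      break;
    }
    await sleep(opts.pollMs);
  }

  return {
    framework: adapter.framework,
    task: opts.task,
    runIndex: opts.runIndex,
    seconds: (now() - start) / 1000,
    interventions: adapter.interventionCount(),
    status,
  };
}
```

Calling `runOnce` N times per task and serializing each result with `JSON.stringify` would give the raw per-run reports. Because the runner only touches the `AgentAdapter` interface, adding or swapping a framework should not require changes to the loop itself.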

Out of scope

Aggregation/reporting polish and the committed baseline land in Phase 2.
