
Benchmark Phase 0: one task, both apps, manual tally (prove the rubric) #78

Description

@suleimansh

Child of #75. First concrete step of the "our AI vs Next.js" benchmark: prove the method and the intervention rubric by hand before any automation.

Goal

Run one task on both a GemStack reference app and a functionally equivalent Next.js app, with the same AI agent, and record:

  1. Time-to-task (wall clock to acceptance pass).
  2. Human-intervention count (per the rubric in the design comment on #75, "Benchmark: GemStack AI orchestration vs Next.js (task speed + human interventions)").

Deliverables

  • Two minimal but functionally equivalent reference apps (or agreement on existing ones to reuse): GemStack side with the orchestration layer in reach, Next.js side vanilla.
  • One task: prompt + starting commit + acceptance script (exit 0 = pass) + hard timeout + max-intervention cap.
  • A manual run log for each side: seconds + intervention tally + status (pass/DNF).
  • A short write-up: did the rubric hold up, what was ambiguous, what to automate next.
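
For illustration only, here is one way the task definition and the per-side manual log entry could be written down so both runs are recorded in the same shape. This is a sketch, not an agreed format: the field names, the placeholder values, and the rule that exceeding the timeout or the intervention cap counts as DNF are all assumptions, and nothing here runs the task or times it (the stopwatch and tally stay manual).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSpec:
    """One benchmark task (field names are hypothetical)."""
    prompt: str
    starting_commit: str
    acceptance_cmd: str       # script that exits 0 on pass
    timeout_s: int            # hard timeout, in seconds
    max_interventions: int    # max-intervention cap


@dataclass(frozen=True)
class RunLog:
    """One hand-recorded run on one side."""
    side: str                             # e.g. "gemstack" or "nextjs"
    seconds: float                        # stopwatch reading
    interventions: int                    # tally per the rubric
    acceptance_exit_code: Optional[int]   # None if the script was never run


def status(task: TaskSpec, run: RunLog) -> str:
    """Return "pass" or "DNF" (assumed rule: any limit breach is a DNF)."""
    if run.seconds > task.timeout_s:
        return "DNF"
    if run.interventions > task.max_interventions:
        return "DNF"
    if run.acceptance_exit_code != 0:
        return "DNF"
    return "pass"


# Placeholder values, not real measurements.
task = TaskSpec(
    prompt="<task prompt>",
    starting_commit="<commit sha>",
    acceptance_cmd="./accept.sh",
    timeout_s=1800,
    max_interventions=5,
)
example = RunLog(side="nextjs", seconds=2000.0, interventions=2, acceptance_exit_code=0)
print(example.side, status(task, example))
```

Whether a run that passes acceptance after the timeout should count as DNF or as a late pass is exactly the kind of ambiguity the write-up could settle.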

Out of scope

No runner, no aggregator yet. Manual stopwatch and tally only. This phase de-risks the methodology, not the tooling.
