Two minimal but functionally equivalent reference apps (or agreement on existing ones to reuse): GemStack side with the orchestration layer in reach, Next.js side vanilla.
One task: prompt + starting commit + acceptance script (exit 0 = pass) + hard timeout + max-intervention cap.
A manual run log for each side: seconds + intervention tally + status (pass/DNF).
A short write-up: did the rubric hold up, what was ambiguous, what to automate next.
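To make the task definition and run log concrete, here is a minimal sketch of how the two records could be written down, even when filled in by hand. All field names, the `./accept.sh` path, and the numeric limits are hypothetical placeholders, not agreed values. It also assumes that hitting the timeout or exceeding the intervention cap counts as DNF, which the rubric still has to confirm. This is only a record format, not a runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One benchmark task, shared by both sides (field names are hypothetical)."""
    prompt: str
    starting_commit: str
    acceptance_script: str   # exit 0 = pass
    timeout_seconds: int     # hard timeout
    max_interventions: int   # max-intervention cap


@dataclass(frozen=True)
class RunLog:
    """One manual run, filled in from the stopwatch and the tally sheet."""
    side: str                    # "GemStack" or "Next.js"
    seconds: int                 # elapsed wall-clock time
    interventions: int           # human intervention tally
    acceptance_exit_code: int    # exit code of the acceptance script


def status(task: Task, run: RunLog) -> str:
    """Return "pass" or "DNF".

    Assumption: a run passes only if the acceptance script exits 0 AND the
    run stayed within both the timeout and the intervention cap (inclusive).
    """
    within_limits = (
        run.seconds <= task.timeout_seconds
        and run.interventions <= task.max_interventions
    )
    return "pass" if within_limits and run.acceptance_exit_code == 0 else "DNF"


# Illustrative values only; none of these numbers come from the issue.
example_task = Task(
    prompt="(task prompt goes here)",
    starting_commit="(starting commit goes here)",
    acceptance_script="./accept.sh",
    timeout_seconds=1800,
    max_interventions=3,
)

clean_run = RunLog("GemStack", seconds=900, interventions=1, acceptance_exit_code=0)
over_cap_run = RunLog("Next.js", seconds=1200, interventions=4, acceptance_exit_code=0)
failed_run = RunLog("Next.js", seconds=600, interventions=0, acceptance_exit_code=1)

for run in (clean_run, over_cap_run, failed_run):
    print(run.side, run.seconds, run.interventions, status(example_task, run))
```

Keeping the status rule as a small pure function makes the ambiguous cases explicit (for example, a run that passes the acceptance script but went over the cap), which is the kind of question the write-up is meant to surface.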
Out of scope
No runner, no aggregator yet. Manual stopwatch and tally only. This phase de-risks the methodology, not the tooling.
Child of #75. First concrete step of the "our AI vs Next.js" benchmark: prove the method and the intervention rubric by hand before any automation.
Goal
Run one task on both a GemStack reference app and a functionally equivalent Next.js app, with the same AI agent, and record for each side: elapsed seconds, intervention tally, and final status (pass/DNF).
Deliverables