Benchmark Phase 0: one task, both apps, manual tally (prove the rubric) #78

Child of #75. First concrete step of the "our AI vs Next.js" benchmark: prove the method and the intervention rubric by hand before any automation.

## Goal
Run **one** task on both a GemStack reference app and a functionally equivalent Next.js app, with the same AI agent, and record:
1. Time-to-task (wall-clock time from start of the run until the acceptance script passes).
2. Human-intervention count (per the rubric in the #75 design comment).

## Deliverables
- Two minimal but functionally equivalent reference apps (or agreement on existing ones to reuse): the GemStack side with the orchestration layer in reach, the Next.js side vanilla.
- One task: prompt + starting commit + acceptance script (exit 0 = pass) + hard timeout + max-intervention cap.
- A manual run log for each side: seconds + intervention tally + status (pass/DNF).
- A short write-up: did the rubric hold up, what was ambiguous, what to automate next.
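
To make the task definition and the manual run log concrete, here is a minimal sketch of how the two records could be written down and how the pass/DNF status might be derived from them. It is not a runner; it only captures what gets noted by hand. All field names, the example values, and the rule that *exceeding* (rather than reaching) the timeout or cap means DNF are assumptions for illustration, not something fixed by this issue or the #75 rubric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    """One benchmark task. Field names are hypothetical."""
    prompt: str
    starting_commit: str
    acceptance_cmd: str       # acceptance script; exit 0 = pass
    timeout_s: int            # hard timeout, in seconds
    max_interventions: int    # max-intervention cap


@dataclass(frozen=True)
class RunLog:
    """One manual run on one side, as read off the stopwatch and tally."""
    side: str                            # e.g. "gemstack" or "nextjs"
    seconds: int                         # wall-clock seconds when the run ended
    interventions: int                   # tally per the rubric
    acceptance_exit_code: Optional[int]  # None if the script was never run


def status(task: Task, log: RunLog) -> str:
    """Return "pass" or "DNF" for a manual run.

    Assumption: a run is DNF if it exceeds the timeout or the
    intervention cap, or if the acceptance script did not exit 0.
    """
    if log.seconds > task.timeout_s:
        return "DNF"
    if log.interventions > task.max_interventions:
        return "DNF"
    return "pass" if log.acceptance_exit_code == 0 else "DNF"


# Placeholder values only; none of these numbers come from a real run.
example_task = Task(
    prompt="<task prompt>",
    starting_commit="<starting commit>",
    acceptance_cmd="<acceptance script>",
    timeout_s=1800,
    max_interventions=5,
)
example_log = RunLog(side="gemstack", seconds=600, interventions=2,
                     acceptance_exit_code=0)
print(status(example_task, example_log))
```

Writing the boundary rule down as code, even for a manual phase, surfaces the kind of ambiguity the write-up is meant to catch, for example whether a run that hits the cap exactly still counts as a pass.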

## Out of scope
No runner, no aggregator yet. Manual stopwatch and tally only. This phase de-risks the methodology, not the tooling.

