Benchmark: GemStack AI orchestration vs Next.js (task speed + human interventions) #75

Idea from the Vike collab (Rom). Measure how "our AI" performs compared to Next.js by running the same AI coding agent on the same real development tasks in two setups: with the GemStack orchestration layer (autopilot + skills + MCP) available, and on a vanilla Next.js baseline that has none of it.

## Metrics
1. Time the AI agent takes to complete a given task.
2. Number of human interventions required to complete it.
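
To make the two metrics concrete, here is a minimal sketch of how a single run could be recorded. Every name in it (`TaskRun`, `startRun`, `recordIntervention`, `finishRun`, the variant labels) is hypothetical and not part of GemStack or Next.js. It also assumes that time means elapsed time between start and finish, and that an intervention is logged by hand with a short note.

```typescript
// Hypothetical record for one benchmark run; these names are illustrative only.
type Variant = "gemstack" | "nextjs-baseline";

interface TaskRun {
  taskId: string;
  variant: Variant;
  startedAtMs: number;
  finishedAtMs: number | null;
  // One short note per human intervention, so the count stays auditable.
  interventions: string[];
}

interface TaskMetrics {
  durationMs: number;
  interventionCount: number;
}

// Start timing a task. `nowMs` is injectable so runs can be replayed in tests.
function startRun(taskId: string, variant: Variant, nowMs: number = Date.now()): TaskRun {
  return { taskId, variant, startedAtMs: nowMs, finishedAtMs: null, interventions: [] };
}

// Log one human intervention (metric 2).
function recordIntervention(run: TaskRun, note: string): void {
  run.interventions.push(note);
}

// Stop timing and return both metrics for this run.
function finishRun(run: TaskRun, nowMs: number = Date.now()): TaskMetrics {
  run.finishedAtMs = nowMs;
  return {
    durationMs: nowMs - run.startedAtMs,
    interventionCount: run.interventions.length,
  };
}
```

Keeping a note per intervention, rather than a bare counter, is one way to let a reviewer check later whether each one really counted under the rubric.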

## Scope to define
- A fixed task set (representative dev tasks) and a Next.js comparison harness.
- Reproducible runner + reporting (time + intervention count per task).
- A baseline run we can track over time.
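
As a rough illustration of the reporting step, the sketch below pairs the GemStack and baseline results for each task and reports the difference in time and intervention count. The types and the `compareByTask` function are assumptions made for this example, not an existing runner API, and the sketch assumes one result per task and variant.

```typescript
// Hypothetical result row produced by the runner for one task and one variant.
type Variant = "gemstack" | "nextjs-baseline";

interface TaskResult {
  taskId: string;
  variant: Variant;
  durationMs: number;
  interventionCount: number;
}

// Per-task comparison. Negative deltas mean the GemStack run needed less.
interface TaskComparison {
  taskId: string;
  durationDeltaMs: number;
  interventionDelta: number;
}

function compareByTask(results: TaskResult[]): TaskComparison[] {
  // Group results by task id, keeping the latest result seen per variant.
  const byTask: Record<string, Partial<Record<Variant, TaskResult>>> = {};
  for (const result of results) {
    const entry = byTask[result.taskId] || {};
    entry[result.variant] = result;
    byTask[result.taskId] = entry;
  }

  const rows: TaskComparison[] = [];
  for (const taskId of Object.keys(byTask)) {
    const gem = byTask[taskId]["gemstack"];
    const base = byTask[taskId]["nextjs-baseline"];
    // A task only counts when it was run under both setups.
    if (!gem || !base) continue;
    rows.push({
      taskId,
      durationDeltaMs: gem.durationMs - base.durationMs,
      interventionDelta: gem.interventionCount - base.interventionCount,
    });
  }
  return rows;
}
```

Serializing the returned rows to a committed file would give a baseline that later runs can be compared against.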

## Out of scope
This is not the self-healing loop (an app that watches its own tests/audits/errors and fixes itself). That is a separate direction. This benchmark only measures an AI agent building and changing apps.

Priority: medium. This is a strong story for what the orchestration layer is for, but it does not block shipping. See the design spec in the comments.

## Phases
- [ ] #78 Phase 0: one task, both apps, manual tally (prove the rubric)
- [ ] #79 Phase 1: semi-automated runner over a 3-5 task set
- [ ] #80 Phase 2: full suite, aggregator, committed baseline

