Idea from the Vike collab (Rom). Measure how "our AI" performs compared to Next.js: run the same AI coding agent on the same real development tasks, once with the GemStack orchestration layer (autopilot + skills + MCP) available and once against a vanilla Next.js baseline that has none of it.
Metrics
- Time for the AI to complete a given task.
- Number of human interventions required to complete it.
Scope to define
- A fixed task set (representative dev tasks) and a Next.js comparison harness.
- Reproducible runner + reporting (time + intervention count per task).
- A baseline run we can track over time.
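A minimal sketch of what the per-task reporting could look like, assuming each run records one entry per task and variant. All names here (`TaskRun`, `TaskComparison`, `compareRuns`, the variant labels) are hypothetical and not taken from the design spec; the sketch only pairs up the two metrics listed above.

```typescript
// Hypothetical shapes for illustration; field names are assumptions.
type Variant = "gemstack" | "nextjs-baseline";

interface TaskRun {
  taskId: string;        // identifies a task from the fixed task set
  variant: Variant;      // which setup the agent ran with
  durationMs: number;    // time for the AI to complete the task
  interventions: number; // human interventions required to complete it
}

interface TaskComparison {
  taskId: string;
  gemstackMs: number;
  baselineMs: number;
  speedup: number;           // baselineMs / gemstackMs; above 1 means GemStack was faster
  interventionDelta: number; // gemstack minus baseline; negative means fewer interventions
}

// Pair each task's GemStack run with its baseline run and compute both metrics.
// Tasks that are missing either side are skipped rather than guessed at.
function compareRuns(runs: TaskRun[]): TaskComparison[] {
  const byTask = new Map<string, Partial<Record<Variant, TaskRun>>>();
  for (const run of runs) {
    const entry = byTask.get(run.taskId) ?? {};
    entry[run.variant] = run;
    byTask.set(run.taskId, entry);
  }

  const report: TaskComparison[] = [];
  byTask.forEach((entry, taskId) => {
    const gem = entry["gemstack"];
    const base = entry["nextjs-baseline"];
    if (!gem || !base) return;
    report.push({
      taskId,
      gemstackMs: gem.durationMs,
      baselineMs: base.durationMs,
      speedup: base.durationMs / gem.durationMs,
      interventionDelta: gem.interventions - base.interventions,
    });
  });
  return report;
}
```

Storing the output of `compareRuns` for each benchmark run would give the baseline that can be tracked over time; how runs are persisted is left open here.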
Out of scope
This is not the self-healing loop (an app that watches its own tests/audits/errors and fixes itself). That is a separate direction. This benchmark only measures an AI agent building and changing apps.
Priority is medium: this makes a strong case for what the orchestration layer is for, but it does not block shipping. See the design spec in the comments.
Phases