gemstack-land · suleimansh · Jun 28, 2026 · Jun 28, 2026
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -0,0 +1,67 @@
+# GemStack AI benchmark: "our AI" vs Next.js
+
+Tracking issue: [#75](https://github.com/gemstack-land/gemstack/issues/75). This is the harness for measuring how an AI coding agent performs with the GemStack orchestration layer in reach versus a vanilla Next.js app, on two metrics:
+
+1. **Time-to-task** - wall clock from task start to the acceptance script passing.
+2. **Human interventions** - count of times a human had to step in (see the rubric below).
+
+This is **not** the self-healing loop. It measures an AI agent building and changing apps.
+
+## Layout
+
+```
+benchmarks/
+  README.md            <- you are here
+  spec/
+    product.md         <- the product surface both apps implement (shared HTTP contract)
+    task-001-tags.md   <- the Phase 0 task + acceptance criteria
+  tasks/
+    task-001-tags/
+      accept.mjs        <- contract-level acceptance script (BASE_URL env, exit 0 = pass)
+examples/
+  bench-app-next/       <- Next.js baseline app (vanilla)
+  bench-app-gemstack/   <- Vike + React app wired with @gemstack/ai-*
+```
+
+Both apps implement the **same HTTP contract** (`spec/product.md`), so a single acceptance script runs against either by pointing `BASE_URL` at the running server.
+
+## Phases
+
+- **Phase 0** ([#78](https://github.com/gemstack-land/gemstack/issues/78)) - one task, both apps, manual stopwatch + manual intervention tally. Proves the method and the rubric. **(this directory)**
+- **Phase 1** ([#79](https://github.com/gemstack-land/gemstack/issues/79)) - semi-automated runner over a 3 to 5 task set.
+- **Phase 2** ([#80](https://github.com/gemstack-land/gemstack/issues/80)) - full suite, aggregator, committed baseline.
+
+## Running Phase 0 by hand
+
+For each app (`bench-app-next`, `bench-app-gemstack`):
+
+1. Reset the app to its starting commit (clean baseline).
+2. Start the dev server, note the URL.
+3. Start a stopwatch. Give the agent the task prompt from `spec/task-001-tags.md`.
+4. Let the agent work. Tally every **human intervention** (rubric below).
+5. After each agent step, run the acceptance script: `BASE_URL=<url> node benchmarks/tasks/task-001-tags/accept.mjs`. Exit 0 means done; stop the stopwatch.
+6. Record seconds, intervention count, and status (pass / DNF) in a run log.
+
+Stop at acceptance pass, or when the task's hard timeout or max-intervention cap is reached (record as DNF). The caps are set per task under "Guardrails" in the task spec.
+
+## Intervention rubric
+
+Counts as **one human intervention**:
+
+- a manual code correction by a human
+- unblocking a stuck agent with a hint
+- a clarification the agent had to ask before it could proceed
+- an approval gate that required a human
+- a manual retry / rerun a human had to trigger
+
+Does **not** count (this is the point of the orchestration layer):
+
+- the agent's own internal retries, planning, and autopilot worker dispatch
+- skill / MCP tool calls the agent makes autonomously
+
+## Fairness rules
+
+- Same agent, same model, same harness on both sides.
+- Both apps start from a clean, functionally-equivalent baseline implementing the contract.
+- The acceptance gate is objective (the script's exit code); no human judgement.
+- The Next.js app must be idiomatic, not a strawman.
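A minimal sketch, outside the patch itself, of how the Phase 0 run log and the intervention rubric from `benchmarks/README.md` could be recorded. Every name here (`COUNTED`, `tallyInterventions`, `recordRun`, the event kind strings) is hypothetical. The default caps are the task-001 guardrails (30 minutes, 5 interventions), and treating "cap reached" as DNF is an assumption about how the boundary is meant to be read.

```javascript
// Hypothetical run-log helper; none of these names exist in the repository.

// Event kinds that count as one human intervention each, per the rubric.
const COUNTED = new Set([
  'manual-code-correction',
  'hint',
  'clarification',
  'approval-gate',
  'manual-retry',
])

// Count only rubric-listed events. The agent's own retries, planning,
// worker dispatch, and autonomous tool calls are ignored.
function tallyInterventions(events) {
  return events.filter((kind) => COUNTED.has(kind)).length
}

// Build one run-log entry.
// Assumption: reaching either cap means DNF, even if the script passes.
function recordRun(run, caps = { timeoutSeconds: 30 * 60, maxInterventions: 5 }) {
  const interventions = tallyInterventions(run.events)
  const capReached =
    run.seconds >= caps.timeoutSeconds || interventions >= caps.maxInterventions
  const status = run.acceptExitCode === 0 && !capReached ? 'pass' : 'DNF'
  return { app: run.app, seconds: run.seconds, interventions, status }
}

const entry = recordRun({
  app: 'bench-app-next',
  seconds: 412,
  events: ['hint', 'agent-retry', 'tool-call', 'manual-retry'],
  acceptExitCode: 0,
})
console.log(JSON.stringify(entry))
```

Keeping the counted kinds in one set makes the rubric the single place to change if a category is added or removed.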
diff --git a/benchmarks/spec/product.md b/benchmarks/spec/product.md
@@ -0,0 +1,55 @@
+# Product spec: the "Notes" app (shared by both benchmark apps)
+
+Both `bench-app-next` (Next.js) and `bench-app-gemstack` (Vike + React + `@gemstack/ai-*`) implement the **same product** and the **same HTTP contract**. Equivalence is what makes the comparison fair; the contract is what lets one acceptance script run against both.
+
+## Surface
+
+A single-user notes app:
+
+- **Auth** - email + password sign-in for one seeded user. A session cookie guards the app and the API.
+- **CRUD resource: `notes`** - fields `id`, `title`, `body`, `summary`, `createdAt`. List, create, view, delete.
+- **AI feature: summarize** - produce a one-sentence summary of a note's body.
+  - GemStack app: via `@gemstack/ai-sdk` (the orchestration layer in reach).
+  - Next.js app: a vanilla inline provider call.
+  - Both default to a **deterministic stub model** (no network, no API key) so the baseline is reproducible. The stub returns the first sentence of the body, trimmed to <= 140 chars. Real providers are a later, opt-in concern.
+
+## Storage
+
+SQLite via `better-sqlite3` (already allowed in the workspace), one file per app, seeded on first boot. The schema is the same on both sides:
+
+```sql
+CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL, password TEXT NOT NULL);
+CREATE TABLE notes (
+  id INTEGER PRIMARY KEY AUTOINCREMENT,
+  title TEXT NOT NULL,
+  body TEXT NOT NULL,
+  summary TEXT,
+  created_at TEXT NOT NULL
+);
+```
+
+Seed user: `demo@example.com` / `password`.
+
+## HTTP contract (identical on both apps)
+
+All endpoints return JSON. Auth endpoints set / clear a `session` cookie; protected endpoints require it and return `401` without it.
+
+| Method | Path | Body | Success | Notes |
+|---|---|---|---|---|
+| POST | `/api/login` | `{ email, password }` | `200 { ok: true }` + `session` cookie | `401` on bad creds |
+| POST | `/api/logout` | - | `200 { ok: true }` | clears cookie |
+| GET | `/api/notes` | - | `200 { notes: Note[] }` | newest first |
+| POST | `/api/notes` | `{ title, body }` | `201 { note: Note }` | |
+| GET | `/api/notes/:id` | - | `200 { note: Note }` | `404` if absent |
+| DELETE | `/api/notes/:id` | - | `200 { ok: true }` | |
+| POST | `/api/notes/:id/summarize` | - | `200 { note: Note }` | sets `summary` |
+
+`Note` shape: `{ id: number, title: string, body: string, summary: string | null, createdAt: string }`.
+
+## UI
+
+Minimal but real React pages (server-rendered on both): a login page, a notes list (with a create form and per-note delete + summarize buttons), and a note detail page. Parity of surface matters more than polish.
+
+## Baseline = starting commit
+
+The committed state of each app is the Phase 0 **starting point**. Tasks (e.g. `task-001-tags`) ask the agent to extend it; the acceptance script verifies the result against the contract.
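A minimal sketch, outside the patch itself, of two details the product spec describes in prose: the deterministic stub summary (first sentence of the body, at most 140 characters) and the mapping from the `notes` table's `created_at` column to the contract's `createdAt` field. `stubSummarize` and `rowToNote` are hypothetical names, and the sentence-boundary rule is an assumption, since the spec does not define what ends a sentence.

```javascript
// Hypothetical helpers; the names and the sentence rule are assumptions.

// First sentence of the body, trimmed to at most maxChars characters.
// Assumption: a sentence ends at the first '.', '!' or '?' that is
// followed by whitespace or by the end of the text.
function stubSummarize(body, maxChars = 140) {
  const text = body.trim()
  const match = text.match(/^[\s\S]*?[.!?](?=\s|$)/)
  const sentence = match ? match[0] : text
  return sentence.slice(0, maxChars).trim()
}

// Map a SQLite row (snake_case columns) to the contract's Note shape.
function rowToNote(row) {
  return {
    id: row.id,
    title: row.title,
    body: row.body,
    summary: row.summary ?? null,
    createdAt: row.created_at,
  }
}

const row = {
  id: 1,
  title: 'Note A',
  body: 'First sentence. Second sentence.',
  summary: null,
  created_at: '2026-06-28T00:00:00.000Z',
}
const note = { ...rowToNote(row), summary: stubSummarize(row.body) }
console.log(note.summary) // prints "First sentence."
```

Because the stub depends only on its input, two runs against the same seed data produce identical summaries, which is what keeps the baseline reproducible.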
diff --git a/benchmarks/spec/task-001-tags.md b/benchmarks/spec/task-001-tags.md
@@ -0,0 +1,32 @@
+# Task 001: add tags to notes
+
+A representative full-stack feature: it touches the data model, the HTTP contract, and the UI. The same prompt is given to the agent on both apps.
+
+## Agent prompt (verbatim, given on both apps)
+
+> Add tagging to notes. A note can have zero or more tags (short text labels). Update the create form so a user can enter comma-separated tags when creating a note. Show each note's tags in the list and on the detail page. Add the ability to list notes filtered to a single tag. Keep the existing HTTP contract working and extend it as described in the acceptance criteria.
+
+## Required contract changes
+
+- `Note` gains `tags: string[]` (empty array when none).
+- `POST /api/notes` accepts an optional `tags: string[]` in the body and persists it.
+- `GET /api/notes` accepts an optional `?tag=<t>` query param; when present, only notes carrying that exact tag are returned.
+- `GET /api/notes/:id` includes `tags`.
+
+## Acceptance criteria (checked by `tasks/task-001-tags/accept.mjs`)
+
+1. Log in as the seeded user.
+2. Create note A with `tags: ["work", "urgent"]`.
+3. Create note B with `tags: ["home"]`.
+4. `GET /api/notes?tag=work` returns A and not B.
+5. `GET /api/notes?tag=home` returns B and not A.
+6. `GET /api/notes/<A.id>` includes `tags` containing `work` and `urgent`.
+7. `GET /api/notes` (no filter) returns both, each with a `tags` array.
+
+The script exits `0` only when all checks pass. Any non-zero exit is a fail.
+
+## Guardrails
+
+- Hard timeout: 30 minutes wall clock.
+- Max interventions before DNF: 5.
+- A UI must exist for entering and displaying tags (spot-checked by the human), but the automated gate is the contract above.
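A minimal sketch, outside the patch itself, of the two pieces of logic task 001 asks for: turning the create form's comma-separated input into the contract's `tags: string[]`, and the exact-match filter behind `GET /api/notes?tag=<t>`. `parseTags` and `filterByTag` are hypothetical names. Trimming entries, dropping empty ones, and removing duplicates are assumptions; the task spec only requires an array of short text labels.

```javascript
// Hypothetical helpers; normalisation beyond the spec is an assumption.

// Turn comma-separated form input into a string[].
// Assumption: entries are trimmed, empties dropped, duplicates removed.
function parseTags(input) {
  const seen = new Set()
  for (const raw of String(input ?? '').split(',')) {
    const tag = raw.trim()
    if (tag) seen.add(tag)
  }
  return [...seen]
}

// Exact-match filter; with no tag given, every note is returned.
function filterByTag(notes, tag) {
  if (tag === undefined || tag === null) return notes
  return notes.filter((note) => (note.tags ?? []).includes(tag))
}

const notes = [
  { id: 1, title: 'Note A', tags: parseTags('work, urgent') },
  { id: 2, title: 'Note B', tags: parseTags('home') },
  { id: 3, title: 'Note C', tags: parseTags('') },
]
console.log(filterByTag(notes, 'work').map((n) => n.id))
```

The filter compares whole tags, so `?tag=wor` matches nothing here; that mirrors the "exact tag" wording in the required contract changes.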
diff --git a/benchmarks/tasks/task-001-tags/accept.mjs b/benchmarks/tasks/task-001-tags/accept.mjs
@@ -0,0 +1,88 @@
+#!/usr/bin/env node
+// Contract-level acceptance check for task-001-tags.
+// Runs against a running benchmark app. Usage: BASE_URL=http://localhost:3000 node accept.mjs
+// Exit 0 = all checks pass; non-zero = fail.
+
+const BASE = (process.env.BASE_URL || 'http://localhost:3000').replace(/\/$/, '')
+
+let cookie = ''
+let failures = 0
+
+function check(label, cond) {
+  if (cond) {
+    console.log(`  ok   ${label}`)
+  } else {
+    console.log(`  FAIL ${label}`)
+    failures++
+  }
+}
+
+async function req(method, path, body) {
+  const headers = { 'content-type': 'application/json' }
+  if (cookie) headers.cookie = cookie
+  const res = await fetch(`${BASE}${path}`, {
+    method,
+    headers,
+    body: body === undefined ? undefined : JSON.stringify(body),
+  })
+  const setCookie = res.headers.get('set-cookie')
+  if (setCookie) cookie = setCookie.split(';')[0]
+  let json = null
+  try {
+    json = await res.json()
+  } catch {
+    /* non-JSON body */
+  }
+  return { status: res.status, json }
+}
+
+async function main() {
+  console.log(`acceptance: task-001-tags against ${BASE}`)
+
+  // 1. login
+  const login = await req('POST', '/api/login', { email: 'demo@example.com', password: 'password' })
+  check('login returns 200', login.status === 200)
+  check('login set a session cookie', cookie.length > 0)
+
+  // 2 + 3. create two tagged notes
+  const a = await req('POST', '/api/notes', { title: 'Note A', body: 'Body A.', tags: ['work', 'urgent'] })
+  check('create A returns 201', a.status === 201)
+  const b = await req('POST', '/api/notes', { title: 'Note B', body: 'Body B.', tags: ['home'] })
+  check('create B returns 201', b.status === 201)
+  const aId = a.json?.note?.id
+  const bId = b.json?.note?.id
+  check('A has an id', typeof aId === 'number')
+  check('B has an id', typeof bId === 'number')
+
+  // 4. filter by "work" -> A only
+  const work = await req('GET', '/api/notes?tag=work')
+  const workIds = (work.json?.notes || []).map((n) => n.id)
+  check('?tag=work returns A', workIds.includes(aId))
+  check('?tag=work excludes B', !workIds.includes(bId))
+
+  // 5. filter by "home" -> B only
+  const home = await req('GET', '/api/notes?tag=home')
+  const homeIds = (home.json?.notes || []).map((n) => n.id)
+  check('?tag=home returns B', homeIds.includes(bId))
+  check('?tag=home excludes A', !homeIds.includes(aId))
+
+  // 6. detail includes tags
+  const detail = await req('GET', `/api/notes/${aId}`)
+  const tags = detail.json?.note?.tags || []
+  check('detail A includes tag work', tags.includes('work'))
+  check('detail A includes tag urgent', tags.includes('urgent'))
+
+  // 7. unfiltered list returns both, each with a tags array
+  const all = await req('GET', '/api/notes')
+  const allNotes = all.json?.notes || []
+  check('unfiltered list includes A and B', allNotes.some((n) => n.id === aId) && allNotes.some((n) => n.id === bId))
+  check('every note has a tags array', allNotes.every((n) => Array.isArray(n.tags)))
+
+  console.log(failures === 0 ? '\nPASS' : `\nFAIL (${failures} check(s) failed)`)
+  process.exit(failures === 0 ? 0 : 1)
+}
+
+main().catch((err) => {
+  console.error('acceptance crashed:', err)
+  process.exit(2)
+})
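A minimal sketch, outside the patch itself, of a more defensive cookie pick for `req`. The script keeps the first `name=value` pair of whatever `headers.get('set-cookie')` returns, which is enough while the app sets a single cookie. If an app sets several, selecting the `session` pair by name avoids sending the wrong one back. `pickCookie` is a hypothetical name; in the script the input array could come from `res.headers.getSetCookie()`, assuming a runtime that provides that method.

```javascript
// Hypothetical helper, not part of accept.mjs.

// Given individual Set-Cookie header values, return the "name=value" pair
// for the named cookie, or '' when no such cookie was set.
function pickCookie(setCookieValues, name) {
  for (const value of setCookieValues) {
    // Attributes (Path, HttpOnly, ...) follow the first ';' and are dropped.
    const pair = value.split(';')[0].trim()
    if (pair.startsWith(`${name}=`)) return pair
  }
  return ''
}

// Plain strings stand in for the header values a server might send.
const received = [
  'theme=dark; Path=/',
  'session=abc123; Path=/; HttpOnly',
]
console.log(pickCookie(received, 'session')) // prints "session=abc123"
```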
diff --git a/examples/bench-app-gemstack/.gitignore b/examples/bench-app-gemstack/.gitignore
@@ -0,0 +1,9 @@
+# Local SQLite database (seeded on first boot)
+data/
+*.sqlite
+*.sqlite-journal
+*.sqlite-wal
+*.sqlite-shm
+
+# Vite / Vike build artifacts
+dist/
diff --git a/examples/bench-app-gemstack/README.md b/examples/bench-app-gemstack/README.md
@@ -0,0 +1,77 @@
+# bench-app-gemstack
+
+The **GemStack** side of the AI benchmark (see `benchmarks/`). A minimal but real
+**Vike + React (SSR)** Notes app whose **AI summarize** feature is wired through
+[`@gemstack/ai-sdk`](../../packages/ai-sdk) — the orchestration layer "in reach".
+
+Its twin, `bench-app-next`, implements the **same product and the same HTTP
+contract** (`benchmarks/spec/product.md`) with a vanilla inline provider call, so
+one acceptance script runs against either by pointing `BASE_URL` at the server.
+
+> Baseline scope: notes have `id`, `title`, `body`, `summary`, `createdAt` only.
+> Tags are a later agent task and are intentionally **not** implemented here.
+
+## Run
+
+From the repo root, build the SDK once, then start the app:
+
+```bash
+pnpm --filter @gemstack/ai-sdk build      # the app imports the SDK's dist
+cd examples/bench-app-gemstack
+pnpm dev                                   # http://localhost:3100
+```
+
+Fixed port **3100** (override with `PORT`). It differs from the Next.js sibling's
+3000 so both can run at once. The server is Express in Vite middleware mode: it
+serves `/api/*` directly and hands every other route to Vike for React SSR.
+
+Seed user: `demo@example.com` / `password` (seeded into SQLite on first boot).
+
+## HTTP contract
+
+JSON everywhere. Auth endpoints set/clear a `session` cookie; protected
+endpoints require it and return `401` without it.
+
+| Method | Path | Body | Success |
+|---|---|---|---|
+| POST | `/api/login` | `{ email, password }` | `200 { ok: true }` + cookie (`401` on bad creds) |
+| POST | `/api/logout` | – | `200 { ok: true }` |
+| GET | `/api/notes` | – | `200 { notes: Note[] }` (newest first) |
+| POST | `/api/notes` | `{ title, body }` | `201 { note: Note }` |
+| GET | `/api/notes/:id` | – | `200 { note: Note }` (`404` if absent) |
+| DELETE | `/api/notes/:id` | – | `200 { ok: true }` |
+| POST | `/api/notes/:id/summarize` | – | `200 { note: Note }` (sets `summary`) |
+
+`Note` = `{ id: number, title: string, body: string, summary: string | null, createdAt: string }`.
+
+## How summarize uses `@gemstack/ai-sdk`
+
+`server/ai.ts` registers a **deterministic stub provider** on the SDK's provider
+seam (`AiRegistry.register` with a `ProviderFactory` / `ProviderAdapter`) and sets
+it as the default model. `summarize()` then calls the SDK facade, `AI.prompt(body,
+{ model: 'stub/summarize-v1' })` — so the path runs through the GemStack agent
+loop, not a direct model call. The stub computes the result from the prompt it
+receives (first sentence of the body, trimmed to ≤ 140 chars): no network, no API
+key, fully reproducible. Swapping in a real provider later is a one-line change to
+the model string.
+
+## Storage
+
+`better-sqlite3`, one file at `data/bench.sqlite` (git-ignored), created and
+seeded on first boot. Schema matches the spec (`users`, `notes`).
+
+## Layout
+
+```
+server/
+  index.ts   Express + Vite-middleware dev server (API + Vike SSR catch-all)
+  api.ts     the HTTP contract (session-cookie auth)
+  db.ts      better-sqlite3 schema, seed, queries
+  ai.ts      @gemstack/ai-sdk stub provider + summarize()
+pages/
+  +config.ts        vike-react config
+  index/+Page.tsx   notes list (create form, per-note delete + summarize)
+  login/+Page.tsx   sign-in
+  note/@id/+Page.tsx note detail
+  api.ts            client-side fetch wrapper over the contract
+```
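A minimal sketch, outside the patch itself, of the shape the summarize section describes: a stub provider is registered once, and `summarize()` reaches it only through a model string. This is a self-contained stand-in and not the `@gemstack/ai-sdk` API. `registerProvider` and `prompt` are hypothetical local functions, the `<provider>/<model>` reading of `'stub/summarize-v1'` is an assumption, and the call is synchronous purely for brevity.

```javascript
// Stand-in for illustration only. This is NOT the @gemstack/ai-sdk API.
const providers = new Map()

// Hypothetical registration seam: one adapter per provider name.
function registerProvider(name, adapter) {
  providers.set(name, adapter)
}

// Assumption: a model string has the form "<provider>/<model>".
function prompt(input, { model }) {
  const slash = model.indexOf('/')
  if (slash === -1) throw new Error(`model "${model}" has no provider prefix`)
  const adapter = providers.get(model.slice(0, slash))
  if (!adapter) throw new Error(`no provider registered for "${model}"`)
  return adapter.complete(model.slice(slash + 1), input)
}

// Deterministic stub: first sentence of the prompt, at most 140 characters.
registerProvider('stub', {
  complete(_modelName, input) {
    const text = input.trim()
    const end = text.search(/[.!?](\s|$)/)
    const sentence = end === -1 ? text : text.slice(0, end + 1)
    return sentence.slice(0, 140).trim()
  },
})

// The only provider-specific detail left in the caller is the model string.
function summarize(body) {
  return prompt(body, { model: 'stub/summarize-v1' })
}

console.log(summarize('Ship the benchmark harness. Then run Phase 0.'))
```

With this shape, moving to a real provider would mean registering another adapter and changing the string passed in `summarize()`, which is the kind of one-line swap the README describes.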
diff --git a/examples/bench-app-gemstack/package.json b/examples/bench-app-gemstack/package.json
@@ -0,0 +1,30 @@
+{
+  "name": "@gemstack/example-bench-app-gemstack",
+  "version": "0.0.0",
+  "private": true,
+  "description": "Benchmark baseline: a Vike + React (SSR) Notes app whose AI summarize feature is wired through @gemstack/ai-sdk. Twin of bench-app-next.",
+  "type": "module",
+  "scripts": {
+    "dev": "tsx server/index.ts"
+  },
+  "dependencies": {
+    "@gemstack/ai-sdk": "workspace:^",
+    "better-sqlite3": "^12.11.1",
+    "express": "^5.2.1",
+    "react": "^19.2.0",
+    "react-dom": "^19.2.0",
+    "vike": "^0.4.260",
+    "vike-react": "^0.6.25"
+  },
+  "devDependencies": {
+    "@types/better-sqlite3": "^7.6.13",
+    "@types/express": "^5.0.6",
+    "@types/node": "^20.0.0",
+    "@types/react": "^19.2.0",
+    "@types/react-dom": "^19.2.0",
+    "@vitejs/plugin-react": "^5.2.0",
+    "tsx": "^4.22.4",
+    "typescript": "^5.4.0",
+    "vite": "^7.3.6"
+  }
+}
diff --git a/examples/bench-app-gemstack/pages/+config.ts b/examples/bench-app-gemstack/pages/+config.ts
@@ -0,0 +1,8 @@
+import vikeReact from 'vike-react/config'
+import type { Config } from 'vike/types'
+
+// SSR React app powered by vike-react. Client routing on by default.
+export default {
+  extends: vikeReact,
+  title: 'Notes — bench-app-gemstack',
+} satisfies Config