67 changes: 67 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,67 @@
# GemStack AI benchmark: "our AI" vs Next.js

Tracking issue: [#75](https://github.com/gemstack-land/gemstack/issues/75). This is the harness for measuring how an AI coding agent performs with the GemStack orchestration layer in reach versus a vanilla Next.js app, on two metrics:

1. **Time-to-task** - wall clock from task start to the acceptance script passing.
2. **Human interventions** - count of times a human had to step in (see the rubric below).

This is **not** the self-healing loop. It measures an AI agent building and changing apps.

## Layout

```
benchmarks/
  README.md                 <- you are here
  spec/
    product.md              <- the product surface both apps implement (shared HTTP contract)
    task-001-tags.md        <- the Phase 0 task + acceptance criteria
  tasks/
    task-001-tags/
      accept.mjs            <- contract-level acceptance script (BASE_URL env, exit 0 = pass)
examples/
  bench-app-next/           <- Next.js baseline app (vanilla)
  bench-app-gemstack/       <- Vike + React app wired with @gemstack/ai-*
```

Both apps implement the **same HTTP contract** (`spec/product.md`), so a single acceptance script runs against either by pointing `BASE_URL` at the running server.

## Phases

- **Phase 0** ([#78](https://github.com/gemstack-land/gemstack/issues/78)) - one task, both apps, manual stopwatch + manual intervention tally. Proves the method and the rubric. **(this directory)**
- **Phase 1** ([#79](https://github.com/gemstack-land/gemstack/issues/79)) - semi-automated runner over a 3 to 5 task set.
- **Phase 2** ([#80](https://github.com/gemstack-land/gemstack/issues/80)) - full suite, aggregator, committed baseline.

## Running Phase 0 by hand

For each app (`bench-app-next`, `bench-app-gemstack`):

1. Reset the app to its starting commit (clean baseline).
2. Start the dev server, note the URL.
3. Start a stopwatch. Give the agent the task prompt from `spec/task-001-tags.md`.
4. Let the agent work. Tally every **human intervention** (rubric below).
5. After each agent step, run the acceptance script: `BASE_URL=<url> node benchmarks/tasks/task-001-tags/accept.mjs`. Exit 0 means done; stop the stopwatch.
6. Record seconds, intervention count, and status (pass / DNF) in a run log.

Stop at acceptance pass, or at the hard timeout / max-intervention cap (record as DNF).
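A run-log entry could be derived along these lines. This is a minimal sketch, not part of the harness: `recordRun` and the record shape are hypothetical, the caps are the guardrail values from `spec/task-001-tags.md`, and treating "reaching the cap" (rather than exceeding it) as DNF is an assumption.

```javascript
// Hypothetical helper: builds the final run-log entry for one app.
// Default caps mirror the guardrails in spec/task-001-tags.md.
const HARD_TIMEOUT_SECONDS = 30 * 60
const MAX_INTERVENTIONS = 5

function recordRun({ app, seconds, interventions, acceptExitCode }) {
  const timedOut = seconds > HARD_TIMEOUT_SECONDS
  // Assumption: hitting the cap itself already counts as DNF.
  const cappedOut = interventions >= MAX_INTERVENTIONS
  const passed = acceptExitCode === 0 && !timedOut && !cappedOut
  return { app, seconds, interventions, status: passed ? 'pass' : 'DNF' }
}

// Example: a run that passed acceptance after 10 minutes and 2 interventions.
console.log(recordRun({ app: 'bench-app-next', seconds: 600, interventions: 2, acceptExitCode: 0 }))
```

Keeping the status a pure function of the three measured values means two people logging the same run should arrive at the same entry.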

## Intervention rubric

Counts as **one human intervention**:

- a manual code correction by a human
- unblocking a stuck agent with a hint
- a clarification the agent had to ask before it could proceed
- an approval gate that required a human
- a manual retry / rerun a human had to trigger

Does **not** count (this is the point of the orchestration layer):

- the agent's own internal retries, planning, and autopilot worker dispatch
- skill / MCP tool calls the agent makes autonomously
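If the tally is kept as a list of events, the rubric reduces to a set-membership check. The sketch below is illustrative only: the event `kind` labels are hypothetical names for the rubric items above, not identifiers the harness defines.

```javascript
// Hypothetical event kinds that count as one human intervention each.
const COUNTED_KINDS = new Set([
  'manual-code-correction',
  'hint',
  'clarification',
  'approval-gate',
  'manual-retry',
])

// Counts only human interventions; autonomous agent activity
// (internal retries, planning, worker dispatch, tool calls) is ignored.
function countInterventions(events) {
  return events.filter((event) => COUNTED_KINDS.has(event.kind)).length
}

const tally = countInterventions([
  { kind: 'hint' },
  { kind: 'agent-retry' }, // autonomous, not counted
  { kind: 'tool-call' },   // autonomous, not counted
  { kind: 'manual-retry' },
])
console.log(tally)
```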

## Fairness rules

- Same agent, same model, same harness on both sides.
- Both apps start from a clean, functionally-equivalent baseline implementing the contract.
- The acceptance gate is objective (the script's exit code); no human judgement.
- The Next.js app must be idiomatic, not a strawman.
55 changes: 55 additions & 0 deletions benchmarks/spec/product.md
@@ -0,0 +1,55 @@
# Product spec: the "Notes" app (shared by both benchmark apps)

Both `bench-app-next` (Next.js) and `bench-app-gemstack` (Vike + React + `@gemstack/ai-*`) implement the **same product** and the **same HTTP contract**. Equivalence is what makes the comparison fair; the contract is what lets one acceptance script run against both.

## Surface

A single-user notes app:

- **Auth** - email + password sign-in for one seeded user. A session cookie guards the app and the API.
- **CRUD resource: `notes`** - fields `id`, `title`, `body`, `summary` (nullable), `createdAt`. List, create, view, delete.
- **AI feature: summarize** - produce a one-sentence summary of a note's body.
  - GemStack app: via `@gemstack/ai-sdk` (the orchestration layer in reach).
  - Next.js app: a vanilla inline provider call.
  - Both default to a **deterministic stub model** (no network, no API key) so the baseline is reproducible. The stub returns the first sentence of the body, trimmed to <= 140 chars. Real providers are a later, opt-in concern.
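
The stub's behaviour could look roughly like the sketch below. It is not taken from either app: `stubSummarize` is a hypothetical name, and the definition of "first sentence" (text up to the first `.`, `!` or `?` that is followed by whitespace or the end of the body) is an assumption the spec does not pin down.

```javascript
// Hypothetical stub: first sentence of the body, capped at 140 characters.
const MAX_SUMMARY_CHARS = 140

function stubSummarize(body) {
  const text = body.trim()
  // Assumption: a sentence ends at the first . ! or ? followed by whitespace or end of text.
  const match = text.match(/^[\s\S]*?[.!?](?=\s|$)/)
  const firstSentence = match ? match[0] : text
  return firstSentence.slice(0, MAX_SUMMARY_CHARS).trim()
}

console.log(stubSummarize('Body A. More text follows here.'))
```

Because the output depends only on the input string, both apps should produce identical summaries for the same note, which is what keeps the baseline reproducible.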

## Storage

SQLite via `better-sqlite3` (already allowed in the workspace), one file per app, seeded on first boot. Same schema both sides:

```sql
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL, password TEXT NOT NULL);
CREATE TABLE notes (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  title TEXT NOT NULL,
  body TEXT NOT NULL,
  summary TEXT,
  created_at TEXT NOT NULL
);
```

Seed user: `demo@example.com` / `password`.

## HTTP contract (identical on both apps)

All endpoints return JSON. Auth endpoints set / clear a `session` cookie; protected endpoints require it and return `401` without it.

| Method | Path | Body | Success | Notes |
|---|---|---|---|---|
| POST | `/api/login` | `{ email, password }` | `200 { ok: true }` + `session` cookie | `401` on bad creds |
| POST | `/api/logout` | - | `200 { ok: true }` | clears cookie |
| GET | `/api/notes` | - | `200 { notes: Note[] }` | newest first |
| POST | `/api/notes` | `{ title, body }` | `201 { note: Note }` | |
| GET | `/api/notes/:id` | - | `200 { note: Note }` | `404` if absent |
| DELETE | `/api/notes/:id` | - | `200 { ok: true }` | |
| POST | `/api/notes/:id/summarize` | - | `200 { note: Note }` | sets `summary` |

`Note` shape: `{ id: number, title: string, body: string, summary: string | null, createdAt: string }`.
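
Note that the storage column is `created_at` while the contract field is `createdAt`, so each app needs some mapping between the two. A minimal sketch, assuming a plain row object as returned by a SQLite driver (`rowToNote` is a hypothetical helper name):

```javascript
// Hypothetical mapper from a `notes` table row to the contract's Note shape.
function rowToNote(row) {
  return {
    id: row.id,
    title: row.title,
    body: row.body,
    summary: row.summary ?? null, // nullable column, null until summarize runs
    createdAt: row.created_at,    // snake_case column -> camelCase field
  }
}

const note = rowToNote({
  id: 1,
  title: 'Note A',
  body: 'Body A.',
  summary: null,
  created_at: '2024-01-01T00:00:00.000Z',
})
console.log(note)
```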

## UI

Minimal but real React pages (server-rendered on both): a login page, a notes list (with a create form and per-note delete + summarize buttons), and a note detail page. Parity of surface matters more than polish.

## Baseline = starting commit

The committed state of each app is the Phase 0 **starting point**. Tasks (e.g. `task-001-tags`) ask the agent to extend it; the acceptance script verifies the result against the contract.
32 changes: 32 additions & 0 deletions benchmarks/spec/task-001-tags.md
@@ -0,0 +1,32 @@
# Task 001: add tags to notes

A representative full-stack feature: it touches the data model, the HTTP contract, and the UI. The same prompt is given to the agent on both apps.

## Agent prompt (verbatim, given on both apps)

> Add tagging to notes. A note can have zero or more tags (short text labels). Update the create form so a user can enter comma-separated tags when creating a note. Show each note's tags in the list and on the detail page. Add the ability to list notes filtered to a single tag. Keep the existing HTTP contract working and extend it as described in the acceptance criteria.

## Required contract changes

- `Note` gains `tags: string[]` (empty array when none).
- `POST /api/notes` accepts an optional `tags: string[]` in the body and persists it.
- `GET /api/notes` accepts an optional `?tag=<t>` query param; when present, only notes carrying that exact tag are returned.
- `GET /api/notes/:id` includes `tags`.
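
The two pieces of logic the task implies, parsing the comma-separated form input and exact-match filtering, can be sketched as below. This is illustrative only and not a reference solution: `parseTags` and `filterByTag` are hypothetical names, and trimming whitespace, dropping empty entries and de-duplicating are assumptions the task does not require.

```javascript
// Hypothetical: turn the create form's comma-separated input into tags: string[].
function parseTags(input) {
  const seen = new Set()
  for (const raw of input.split(',')) {
    const tag = raw.trim()
    if (tag) seen.add(tag) // assumption: skip empties, collapse duplicates
  }
  return [...seen]
}

// Hypothetical: implements `?tag=<t>` as an exact, case-sensitive match.
function filterByTag(notes, tag) {
  if (tag === undefined || tag === null) return notes // no filter given
  return notes.filter((note) => Array.isArray(note.tags) && note.tags.includes(tag))
}

const notes = [
  { id: 1, tags: parseTags('work, urgent') },
  { id: 2, tags: parseTags('home') },
]
console.log(filterByTag(notes, 'work').map((note) => note.id))
```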

## Acceptance criteria (checked by `tasks/task-001-tags/accept.mjs`)

1. Log in as the seeded user.
2. Create note A with `tags: ["work", "urgent"]`.
3. Create note B with `tags: ["home"]`.
4. `GET /api/notes?tag=work` returns A and not B.
5. `GET /api/notes?tag=home` returns B and not A.
6. `GET /api/notes/<A.id>` includes `tags` containing `work` and `urgent`.
7. `GET /api/notes` (no filter) returns both, each with a `tags` array.

The script exits `0` only when all checks pass. Any non-zero exit is a fail.

## Guardrails

- Hard timeout: 30 minutes wall clock.
- Max interventions before DNF: 5.
- A UI must exist for entering and displaying tags (spot-checked by the human), but the automated gate is the contract above.
88 changes: 88 additions & 0 deletions benchmarks/tasks/task-001-tags/accept.mjs
@@ -0,0 +1,88 @@
#!/usr/bin/env node
// Contract-level acceptance check for task-001-tags.
// Runs against a running benchmark app. Usage: BASE_URL=http://localhost:3000 node accept.mjs
// Exit 0 = all checks pass; non-zero = fail.

const BASE = (process.env.BASE_URL || 'http://localhost:3000').replace(/\/$/, '')

let cookie = ''
let failures = 0

function check(label, cond) {
  if (cond) {
    console.log(` ok ${label}`)
  } else {
    console.log(` FAIL ${label}`)
    failures++
  }
}

async function req(method, path, body) {
  const headers = { 'content-type': 'application/json' }
  if (cookie) headers.cookie = cookie
  const res = await fetch(`${BASE}${path}`, {
    method,
    headers,
    body: body === undefined ? undefined : JSON.stringify(body),
  })
  const setCookie = res.headers.get('set-cookie')
  if (setCookie) cookie = setCookie.split(';')[0]
  let json = null
  try {
    json = await res.json()
  } catch {
    /* non-JSON body */
  }
  return { status: res.status, json }
}

async function main() {
  console.log(`acceptance: task-001-tags against ${BASE}`)

  // 1. login
  const login = await req('POST', '/api/login', { email: 'demo@example.com', password: 'password' })
  check('login returns 200', login.status === 200)
  check('login set a session cookie', cookie.length > 0)

  // 2 + 3. create two tagged notes
  const a = await req('POST', '/api/notes', { title: 'Note A', body: 'Body A.', tags: ['work', 'urgent'] })
  check('create A returns 201', a.status === 201)
  const b = await req('POST', '/api/notes', { title: 'Note B', body: 'Body B.', tags: ['home'] })
  check('create B returns 201', b.status === 201)
  const aId = a.json?.note?.id
  const bId = b.json?.note?.id
  check('A has an id', typeof aId === 'number')
  check('B has an id', typeof bId === 'number')

  // 4. filter by "work" -> A only
  const work = await req('GET', '/api/notes?tag=work')
  const workIds = (work.json?.notes || []).map((n) => n.id)
  check('?tag=work returns A', workIds.includes(aId))
  check('?tag=work excludes B', !workIds.includes(bId))

  // 5. filter by "home" -> B only
  const home = await req('GET', '/api/notes?tag=home')
  const homeIds = (home.json?.notes || []).map((n) => n.id)
  check('?tag=home returns B', homeIds.includes(bId))
  check('?tag=home excludes A', !homeIds.includes(aId))

  // 6. detail includes tags
  const detail = await req('GET', `/api/notes/${aId}`)
  const tags = detail.json?.note?.tags || []
  check('detail A includes tag work', tags.includes('work'))
  check('detail A includes tag urgent', tags.includes('urgent'))

  // 7. unfiltered list returns both, each with a tags array
  const all = await req('GET', '/api/notes')
  const allNotes = all.json?.notes || []
  check('unfiltered list includes A and B', allNotes.some((n) => n.id === aId) && allNotes.some((n) => n.id === bId))
  check('every note has a tags array', allNotes.every((n) => Array.isArray(n.tags)))

  console.log(failures === 0 ? '\nPASS' : `\nFAIL (${failures} check(s) failed)`)
  process.exit(failures === 0 ? 0 : 1)
}

main().catch((err) => {
  console.error('acceptance crashed:', err)
  process.exit(2)
})
9 changes: 9 additions & 0 deletions examples/bench-app-gemstack/.gitignore
@@ -0,0 +1,9 @@
# Local SQLite database (seeded on first boot)
data/
*.sqlite
*.sqlite-journal
*.sqlite-wal
*.sqlite-shm

# Vite / Vike build artifacts
dist/
77 changes: 77 additions & 0 deletions examples/bench-app-gemstack/README.md
@@ -0,0 +1,77 @@
# bench-app-gemstack

The **GemStack** side of the AI benchmark (see `benchmarks/`). A minimal but real
**Vike + React (SSR)** Notes app whose **AI summarize** feature is wired through
[`@gemstack/ai-sdk`](../../packages/ai-sdk) — the orchestration layer "in reach".

Its twin, `bench-app-next`, implements the **same product and the same HTTP
contract** (`benchmarks/spec/product.md`) with a vanilla inline provider call, so
one acceptance script runs against either by pointing `BASE_URL` at the server.

> Baseline scope: notes have `id`, `title`, `body`, `summary`, `createdAt` only.
> Tags are a later agent task and are intentionally **not** implemented here.

## Run

From the repo root, build the SDK once, then start the app:

```bash
pnpm --filter @gemstack/ai-sdk build # the app imports the SDK's dist
cd examples/bench-app-gemstack
pnpm dev # http://localhost:3100
```

Fixed port **3100** (override with `PORT`). It differs from the Next.js sibling's
3000 so both can run at once. The server is Express in Vite middleware mode: it
serves `/api/*` directly and hands every other route to Vike for React SSR.

Seed user: `demo@example.com` / `password` (seeded into SQLite on first boot).

## HTTP contract

JSON everywhere. Auth endpoints set/clear a `session` cookie; protected
endpoints require it and return `401` without it.

| Method | Path | Body | Success |
|---|---|---|---|
| POST | `/api/login` | `{ email, password }` | `200 { ok: true }` + cookie (`401` on bad creds) |
| POST | `/api/logout` | – | `200 { ok: true }` |
| GET | `/api/notes` | – | `200 { notes: Note[] }` (newest first) |
| POST | `/api/notes` | `{ title, body }` | `201 { note: Note }` |
| GET | `/api/notes/:id` | – | `200 { note: Note }` (`404` if absent) |
| DELETE | `/api/notes/:id` | – | `200 { ok: true }` |
| POST | `/api/notes/:id/summarize` | – | `200 { note: Note }` (sets `summary`) |

`Note` = `{ id: number, title: string, body: string, summary: string | null, createdAt: string }`.

## How summarize uses `@gemstack/ai-sdk`

`server/ai.ts` registers a **deterministic stub provider** on the SDK's provider
seam (`AiRegistry.register` with a `ProviderFactory` / `ProviderAdapter`) and sets
it as the default model. `summarize()` then calls the SDK facade, `AI.prompt(body,
{ model: 'stub/summarize-v1' })` — so the path runs through the GemStack agent
loop, not a direct model call. The stub computes the result from the prompt it
receives (first sentence of the body, trimmed to ≤ 140 chars): no network, no API
key, fully reproducible. Swapping in a real provider later is a one-line change to
the model string.

## Storage

`better-sqlite3`, one file at `data/bench.sqlite` (git-ignored), created and
seeded on first boot. Schema matches the spec (`users`, `notes`).

## Layout

```
server/
  index.ts              Express + Vite-middleware dev server (API + Vike SSR catch-all)
  api.ts                the HTTP contract (session-cookie auth)
  db.ts                 better-sqlite3 schema, seed, queries
  ai.ts                 @gemstack/ai-sdk stub provider + summarize()
pages/
  +config.ts            vike-react config
  index/+Page.tsx       notes list (create form, per-note delete + summarize)
  login/+Page.tsx       sign-in
  note/@id/+Page.tsx    note detail
  api.ts                client-side fetch wrapper over the contract
```
30 changes: 30 additions & 0 deletions examples/bench-app-gemstack/package.json
@@ -0,0 +1,30 @@
{
  "name": "@gemstack/example-bench-app-gemstack",
  "version": "0.0.0",
  "private": true,
  "description": "Benchmark baseline: a Vike + React (SSR) Notes app whose AI summarize feature is wired through @gemstack/ai-sdk. Twin of bench-app-next.",
  "type": "module",
  "scripts": {
    "dev": "tsx server/index.ts"
  },
  "dependencies": {
    "@gemstack/ai-sdk": "workspace:^",
    "better-sqlite3": "^12.11.1",
    "express": "^5.2.1",
    "react": "^19.2.0",
    "react-dom": "^19.2.0",
    "vike": "^0.4.260",
    "vike-react": "^0.6.25"
  },
  "devDependencies": {
    "@types/better-sqlite3": "^7.6.13",
    "@types/express": "^5.0.6",
    "@types/node": "^20.0.0",
    "@types/react": "^19.2.0",
    "@types/react-dom": "^19.2.0",
    "@vitejs/plugin-react": "^5.2.0",
    "tsx": "^4.22.4",
    "typescript": "^5.4.0",
    "vite": "^7.3.6"
  }
}
8 changes: 8 additions & 0 deletions examples/bench-app-gemstack/pages/+config.ts
@@ -0,0 +1,8 @@
import vikeReact from 'vike-react/config'
import type { Config } from 'vike/types'

// SSR React app powered by vike-react. Client routing on by default.
export default {
  extends: vikeReact,
  title: 'Notes — bench-app-gemstack',
} satisfies Config