
chore(bench): Phase 0 harness + Next.js and Vike/GemStack baseline apps#81

Open: suleimansh wants to merge 1 commit into main from bench/orchestration-vs-next


@suleimansh

Member

Phase 0 of the "our AI vs Next.js" benchmark (#75). Closes #78.

Lays down a runnable Phase 0 baseline: two functionally equivalent apps and a contract-level acceptance harness, ready for timing an AI agent on each app.

What's here

  • benchmarks/README.md - harness overview, manual run steps, the intervention rubric, and fairness rules.
  • benchmarks/spec/product.md - the shared "Notes" product and a single HTTP contract both apps implement (the trick that lets one acceptance script grade either).
  • benchmarks/spec/task-001-tags.md - the Phase 0 task (add tags) and its acceptance criteria.
  • benchmarks/tasks/task-001-tags/accept.mjs - contract-level acceptance check; BASE_URL=<url> node accept.mjs, exit 0 = pass.
  • examples/bench-app-next - vanilla Next.js (App Router) baseline. pnpm dev -> :4311.
  • examples/bench-app-gemstack - Vike + React baseline; summarize wired through @gemstack/ai-sdk via a deterministic stub provider (no network, no key). pnpm dev -> :3100.
  • pnpm-workspace.yaml - declare sharp:false so pnpm dev runs cleanly (both frameworks pull the optional, prebuilt sharp).
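
To illustrate how one script can grade either app through a shared HTTP contract, here is a minimal sketch of a contract-level check runner. It is not the contents of accept.mjs: the `/api/notes` route, the check name, and the helper names `runChecks`, `exitCodeFor`, and `exampleChecks` are assumptions made for illustration only.

```javascript
// Hypothetical sketch of a contract-level acceptance runner.
// Route paths and check names are assumptions, not taken from accept.mjs.

// Run every check against baseUrl and record pass/fail per check.
// fetchImpl is injectable so the runner can be exercised without a server.
async function runChecks(baseUrl, checks, fetchImpl = globalThis.fetch) {
  const results = [];
  for (const { name, run } of checks) {
    try {
      await run(baseUrl, fetchImpl);
      results.push({ name, ok: true });
    } catch (err) {
      results.push({ name, ok: false, error: String(err && err.message ? err.message : err) });
    }
  }
  return results;
}

// Exit code convention from the PR description: 0 means every check passed.
function exitCodeFor(results) {
  return results.every((r) => r.ok) ? 0 : 1;
}

// One example check: an unauthenticated request should be rejected with 401.
const exampleChecks = [
  {
    name: "unauthenticated list returns 401",
    async run(baseUrl, fetchImpl) {
      const res = await fetchImpl(`${baseUrl}/api/notes`); // hypothetical route
      if (res.status !== 401) {
        throw new Error(`expected 401, got ${res.status}`);
      }
    },
  },
];
```

A script built this way would read `process.env.BASE_URL`, await `runChecks`, and call `process.exit(exitCodeFor(results))`. Because the checks only speak HTTP, the same file can grade the Next.js app or the Vike app without knowing which framework is behind the URL.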

Verification

  • pnpm install + pnpm --filter @gemstack/ai-sdk build clean.
  • Both apps boot and pass the full baseline contract (login/cookie/create/list/get/summarize/delete + 401 when unauthenticated).
  • Running accept.mjs against each baseline fails the identical 5 tag-specific checks and passes everything else - confirming the two apps are equivalent and the acceptance script correctly detects the unimplemented task. Adding tags correctly turns it green.
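
The equivalence argument above (both baselines fail the same checks and pass the rest) can be sketched as a comparison of failing check names across two runs. This assumes each run yields a list of `{ name, ok }` records; that shape and the check names in the usage example are hypothetical and not taken from accept.mjs.

```javascript
// Hypothetical sketch: compare which checks failed in two acceptance runs.
// Assumes each result has the shape { name: string, ok: boolean }.

// Sorted names of the checks that failed in one run.
function failedNames(results) {
  return results.filter((r) => !r.ok).map((r) => r.name).sort();
}

// True when both runs failed exactly the same set of checks.
function sameFailures(resultsA, resultsB) {
  const a = failedNames(resultsA);
  const b = failedNames(resultsB);
  return a.length === b.length && a.every((name, i) => name === b[i]);
}

// Usage with made-up check names: both apps pass "login" and fail "tags: create".
const nextRun = [
  { name: "login", ok: true },
  { name: "tags: create", ok: false },
];
const gemstackRun = [
  { name: "tags: create", ok: false },
  { name: "login", ok: true },
];
const equivalent = sameFailures(nextRun, gemstackRun);
```

Comparing names rather than counts matters here: two apps could each fail five checks without failing the same five.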

Notes

  • The apps are private @gemstack/example-* workspace packages (no build/publish, mirroring mcp-quickstart); no changeset needed.
  • Next step is the actual Phase 0 measurement: run the same agent on each app against task-001 and record time + interventions per the rubric.

…e apps

Sets up the 'our AI vs Next.js' benchmark (#75) Phase 0 (#78):

- benchmarks/: shared product spec + a single HTTP contract both apps
  implement, the task-001 (add tags) spec, and a contract-level
  acceptance script (accept.mjs) that grades either app via BASE_URL.
- examples/bench-app-next: vanilla Next.js App Router baseline.
- examples/bench-app-gemstack: Vike + React baseline with the summarize
  feature wired through @gemstack/ai-sdk (deterministic stub provider).
- pnpm-workspace.yaml: declare sharp:false so pnpm dev runs cleanly.

Both baselines pass the contract and fail the identical 5 tag checks,
confirming the apps are equivalent and the acceptance script is correct.
@suleimansh added the enhancement (New feature or request) and priority: medium (Worth doing, not urgent) labels Jun 28, 2026
@suleimansh self-assigned this Jun 28, 2026

Development

Successfully merging this pull request may close these issues.

Benchmark Phase 0: one task, both apps, manual tally (prove the rubric)
