diff --git a/README.md b/README.md
index b3f06a7..2084bf7 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@

Noesis Agentic Software Development Evals Toolkit

-Run an AI coding agent on a task you already know the answer to. Score the result. Compare configurations.
+Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.

Product Page Join our Discord
@@ -14,6 +14,14 @@
---
+## Why NASDE?
+
+Your team runs AI coding agents — but **which setup is actually best for *your* codebase, and what is it costing you?** Switch from Claude to Codex, swap a model, add a skill or an MCP server — and you're guessing whether quality went up, down, or just got more expensive.
+
+The decisions that matter are getting expensive: **which provider, which model, which configuration** — each with a different quality-per-dollar trade-off. NASDE measures your **whole harness** — the agent, its skills, its MCP servers, against *your* tasks — and reports not just **how good** the output is, but **how many tokens and how many dollars** it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off.
+
+It runs on your own machine with a subscription you already have. **Today** it drives Claude Code, the Codex CLI, and the Gemini CLI; **planned:** Pi, Cursor, and router-based setups.
+
## What NASDE does — in four steps
One `nasde run` command executes the whole chain.
@@ -39,14 +47,13 @@ Concepts (how the scoring works, the evaluation pipeline, token & cost, rubric c
## What do I use it for?
-Anyone working with AI coding agents eventually hits the same wall: *"I changed my skill / `CLAUDE.md` / MCP setup — is the agent actually better now, or does it just feel that way?"* NASDE turns that gut feeling into a repeatable measurement which is **easy to do even on a personal machine, with a Claude Code or Codex subscription**.
-
-Typical things you'd do with it:
+The core use is a **cost-and-quality decision** about your AI coding stack: *which agent, which model, which provider, which configuration — for our codebase and our budget?* NASDE answers it with numbers instead of vibes.
Typical things you'd do with it:
-- **Run an agent safely on a realistic task** — a sandboxed container means the agent can `rm -rf`, install random packages, or run your tests in loops without wrecking your laptop.
-- **Compare two configurations of the same agent** — baseline vs. "with my new skill"; see whether the skill moves the score up or down, and on which dimensions.
-- **Compare different agents on the same task** — Claude Code vs. Codex vs. Gemini CLI against *your* workspace and *your* criteria.
-- **Build a regression suite for your AI setup** — once a task set exists, re-run it every time someone tweaks the prompt/skills/MCP and spot regressions before they ship.
+- **Compare providers and models on quality *and* cost** — Claude Code vs. Codex vs. Gemini, Sonnet vs. Opus, against *your* tasks; see the score *and* the tokens and dollars each one spends, and pick the best quality-per-dollar for your budget.
+- **Decide whether a migration pays off** — before standardizing on a new agent or model, measure what actually changes in output quality and in spend.
+- **Measure your whole harness, not just one skill** — run your real `CLAUDE.md` + skills + MCP servers as a unit and see how the full configuration performs.
+- **Tune a single skill or config** — baseline vs. "with my new skill"; see whether it moves the score up or down, and on which dimensions.
+- **Build a regression suite for your AI setup** — re-run the task set whenever someone tweaks the prompt/skills/MCP/model and catch quality *or* cost regressions before they ship.
## Quick start
diff --git a/website/src/content/docs/getting-started/overview.md b/website/src/content/docs/getting-started/overview.md
index 676ac9a..f373919 100644
--- a/website/src/content/docs/getting-started/overview.md
+++ b/website/src/content/docs/getting-started/overview.md
@@ -1,15 +1,17 @@
---
title: Overview
-description: Why NASDE exists, what it does, what it is and isn't, and the four-step loop it automates.
+description: Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.
---
## Why NASDE? — the problem
-You changed your agent's skill, its `CLAUDE.md`, or its MCP setup. **Is the agent actually better now — or does it just feel that way?**
+Your team runs AI coding agents — but which setup is actually best for *your* codebase, and what is it costing you? Switch from Claude to Codex, swap a model, add a skill or an MCP server — and you're guessing whether quality went up, down, or just got more expensive.
-Everyone working with AI coding agents hits this wall. Skill changes are a leap of faith: maybe the new prompt improves refactoring but quietly breaks the agent's tests. A passing CI check tells you the code *works*; it doesn't tell you whether the agent's output got *better* on the things you care about — architecture, test quality, clarity — or which of two agents does it better on *your* code.
+The decisions that matter are getting expensive: **which provider, which model, which configuration** — each with a different quality-per-dollar trade-off. Without measurement, you're optimizing AI spend on gut feel.
-NASDE turns that gut feeling into a **repeatable measurement** you can run on your own machine with a Claude Code or Codex subscription. Define a task you already know the answer to, run different agent configurations against it, and get multi-dimensional scores you can compare and trust.
+NASDE measures your **whole harness** — the agent, its skills, its MCP servers, against *your* tasks — and reports not just **how good** the output is, but **how many tokens and how many dollars** it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off.
+
+It runs on your own machine with a subscription you already have, and you stay in control of what "good" means — you define the tasks and the criteria; NASDE runs the experiment the same way every time and puts quality and cost side by side.
## What NASDE does — in four steps
@@ -72,15 +74,21 @@ NASDE runs every agent **non-interactively** — it scripts them rather than cha
Because it uses Claude programmatically, NASDE falls under Anthropic's terms for programmatic use. Anthropic has announced that, **from June 15, 2026, paid Claude plans include a dedicated monthly credit for programmatic usage** — covering `claude -p`, the Claude Agent SDK, and Claude Code GitHub Actions, among others. So running NASDE on a paid Claude subscription is expected and supported; check [Anthropic's current terms](https://www.anthropic.com/) for the exact credit and limits that apply to your plan.
+### Agents and providers
+
+**Today** NASDE drives **Claude Code, the Codex CLI, and the Gemini CLI** — so you can already compare across three providers on the same tasks, with quality and cost side by side. **Planned:** Pi, Cursor, and router-based setups — so a single benchmark spans every agent and router you're weighing, and a migration decision comes with full visibility into the quality-vs-cost trade-offs.
+
## What do I use it for?
-Anyone working with AI coding agents eventually hits the same wall: *"I changed my skill / `CLAUDE.md` / MCP setup — is the agent actually better now, or does it just feel that way?"* NASDE turns that gut feeling into a repeatable measurement which is **easy to do on even on a personal machine, with a Claude Code or Codex subscription**.
+The core use is a **cost-and-quality decision** about your AI coding stack: *which agent, which model, which provider, which configuration — for our codebase and our budget?* NASDE answers it with numbers instead of vibes.
Typical things you'd do with it:
-- **Run an agent safely on a realistic task** — a sandboxed container means the agent can `rm -rf`, install random packages, or run your tests in loops without wrecking your laptop.
-- **Compare two configurations of the same agent** — baseline vs. "with my new skill"; see whether the skill moves the score up or down, and on which dimensions.
-- **Compare different agents on the same task** — Claude Code vs. Codex vs. Gemini CLI against *your* workspace and *your* criteria.
-- **Build a regression suite for your AI setup** — once a task set exists, re-run it every time someone tweaks the prompt/skills/MCP and spot regressions before they ship.
+- **Compare providers and models on quality *and* cost** — Claude Code vs. Codex vs. Gemini, Sonnet vs. Opus, against *your* tasks; see the score *and* the tokens and dollars each one spends, so you can pick the best quality-per-dollar for your budget.
+- **Decide whether a migration pays off** — before standardizing on a new agent or model, measure what actually changes in output quality and in spend.
+- **Measure your whole harness, not just one skill** — run your real `CLAUDE.md` + skills + MCP servers as a unit, and see how the full configuration performs, not just an isolated component.
+- **Tune a single skill or config** — baseline vs. "with my new skill"; see whether it moves the score up or down, and on which dimensions.
+- **Build a regression suite for your AI setup** — once a task set exists, re-run it every time someone tweaks the prompt/skills/MCP/model and catch quality *or* cost regressions before they ship.
+- **Run agents safely on realistic tasks** — a sandboxed container means the agent can `rm -rf`, install random packages, or loop your tests without wrecking your laptop.
Ready to try it? Head to the [Quick Start](/nasde-toolkit/getting-started/quick-start/).
diff --git a/website/src/content/docs/index.mdx b/website/src/content/docs/index.mdx
index ef177c2..69c200d 100644
--- a/website/src/content/docs/index.mdx
+++ b/website/src/content/docs/index.mdx
@@ -1,9 +1,9 @@
---
title: Nasde Toolkit Docs
-description: Documentation for Nasde — run an AI coding agent on a task you already know the answer to, score the result, and compare configurations.
+description: Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.
template: splash
hero:
- tagline: The official documentation for the Nasde Toolkit — install it, learn how the scoring works, and find every command and config option.
+ tagline: Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch. This is the documentation.
image:
file: ../../assets/nasde-toolkit-logo.png
actions:
@@ -52,9 +52,9 @@ import { Card, CardGrid, LinkCard } from '@astrojs/starlight/components';
- Run an AI coding agent on a task you already know the answer to, score the result against your own criteria, and compare configurations. Start with the [Overview](/nasde-toolkit/getting-started/overview/).
+ Measure your whole AI coding setup — agent, skills, MCP, on your tasks — on both quality *and* cost (tokens and dollars, per model and provider), so you can decide which stack to use and when to switch. Start with the [Overview](/nasde-toolkit/getting-started/overview/).
- A real task end to end — instruction, criteria, and the scores four agent configurations earned. See [A Real Task](/nasde-toolkit/concepts/real-task-example/) and [Benchmark Results](/nasde-toolkit/guides/benchmark-results/).
+ A real task end to end — the instruction, the criteria, and how four agent configurations scored. See [A Real Task](/nasde-toolkit/concepts/real-task-example/), then the quality-vs-cost view in [Token & Cost](/nasde-toolkit/concepts/token-cost/) and [Benchmark Results](/nasde-toolkit/guides/benchmark-results/).