diff --git a/README.md b/README.md
index b3f06a7..2084bf7 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
Noesis Agentic Software Development Evals Toolkit
- Run an AI coding agent on a task you already know the answer to. Score the result. Compare configurations.
+ Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.
@@ -14,6 +14,14 @@
---
+## Why NASDE?
+
+Your team runs AI coding agents, but **which setup is actually best for *your* codebase, and what is it costing you?** Switch from Claude Code to Codex, swap a model, or add a skill or an MCP server, and you're guessing whether quality went up, went down, or stayed flat while the bill grew.
+
+The decisions that matter are getting expensive: **which provider, which model, which configuration** — each with a different quality-per-dollar trade-off. NASDE measures your **whole harness** — the agent, its skills, its MCP servers, against *your* tasks — and reports not just **how good** the output is, but **how many tokens and how many dollars** it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off.
+
+It runs on your own machine with a subscription you already have. **Today** it drives Claude Code, the Codex CLI, and the Gemini CLI. **Planned:** Pi, Cursor, and router-based setups.
+
## What NASDE does — in four steps
One `nasde run` command executes the whole chain.
@@ -39,14 +47,13 @@ Concepts (how the scoring works, the evaluation pipeline, token & cost, rubric c
## What do I use it for?
-Anyone working with AI coding agents eventually hits the same wall: *"I changed my skill / `CLAUDE.md` / MCP setup — is the agent actually better now, or does it just feel that way?"* NASDE turns that gut feeling into a repeatable measurement which is **easy to do even on a personal machine, with a Claude Code or Codex subscription**.
-
-Typical things you'd do with it:
+The core use is a **cost-and-quality decision** about your AI coding stack: *which agent, which model, which provider, which configuration — for our codebase and our budget?* NASDE answers it with numbers instead of vibes. Typical things you'd do with it:
-- **Run an agent safely on a realistic task** — a sandboxed container means the agent can `rm -rf`, install random packages, or run your tests in loops without wrecking your laptop.
-- **Compare two configurations of the same agent** — baseline vs. "with my new skill"; see whether the skill moves the score up or down, and on which dimensions.
-- **Compare different agents on the same task** — Claude Code vs. Codex vs. Gemini CLI against *your* workspace and *your* criteria.
-- **Build a regression suite for your AI setup** — once a task set exists, re-run it every time someone tweaks the prompt/skills/MCP and spot regressions before they ship.
+- **Compare providers and models on quality *and* cost** — Claude Code vs. Codex vs. Gemini, Sonnet vs. Opus, against *your* tasks; see the score *and* the tokens and dollars each one spends, and pick the best quality-per-dollar for your budget.
+- **Decide whether a migration pays off** — before standardizing on a new agent or model, measure what actually changes in output quality and in spend.
+- **Measure your whole harness, not just one skill** — run your real `CLAUDE.md` + skills + MCP servers as a unit and see how the full configuration performs.
+- **Tune a single skill or config** — baseline vs. "with my new skill"; see whether it moves the score up or down, and on which dimensions.
+- **Build a regression suite for your AI setup** — re-run the task set whenever someone tweaks the prompt/skills/MCP/model and catch quality *or* cost regressions before they ship.
## Quick start
diff --git a/website/src/content/docs/getting-started/overview.md b/website/src/content/docs/getting-started/overview.md
index 676ac9a..f373919 100644
--- a/website/src/content/docs/getting-started/overview.md
+++ b/website/src/content/docs/getting-started/overview.md
@@ -1,15 +1,17 @@
---
title: Overview
-description: Why NASDE exists, what it does, what it is and isn't, and the four-step loop it automates.
+description: Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.
---
## Why NASDE? — the problem
-You changed your agent's skill, its `CLAUDE.md`, or its MCP setup. **Is the agent actually better now — or does it just feel that way?**
+Your team runs AI coding agents, but which setup is actually best for *your* codebase, and what is it costing you? Switch from Claude Code to Codex, swap a model, or add a skill or an MCP server, and you're guessing whether quality went up, went down, or stayed flat while the bill grew.
-Everyone working with AI coding agents hits this wall. Skill changes are a leap of faith: maybe the new prompt improves refactoring but quietly breaks the agent's tests. A passing CI check tells you the code *works*; it doesn't tell you whether the agent's output got *better* on the things you care about — architecture, test quality, clarity — or which of two agents does it better on *your* code.
+The decisions that matter are getting expensive: **which provider, which model, which configuration** — each with a different quality-per-dollar trade-off. Without measurement, you're optimizing AI spend on gut feel.
-NASDE turns that gut feeling into a **repeatable measurement** you can run on your own machine with a Claude Code or Codex subscription. Define a task you already know the answer to, run different agent configurations against it, and get multi-dimensional scores you can compare and trust.
+NASDE measures your **whole harness** — the agent, its skills, its MCP servers, against *your* tasks — and reports not just **how good** the output is, but **how many tokens and how many dollars** it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off.
+
+It runs on your own machine with a subscription you already have, and you stay in control of what "good" means: you define the tasks and the criteria, and NASDE runs the experiment the same way every time and puts quality and cost side by side.
## What NASDE does — in four steps
@@ -72,15 +74,21 @@ NASDE runs every agent **non-interactively** — it scripts them rather than cha
Because it uses Claude programmatically, NASDE falls under Anthropic's terms for programmatic use. Anthropic has announced that, **from June 15, 2026, paid Claude plans include a dedicated monthly credit for programmatic usage** — covering `claude -p`, the Claude Agent SDK, and Claude Code GitHub Actions, among others. So running NASDE on a paid Claude subscription is expected and supported; check [Anthropic's current terms](https://www.anthropic.com/) for the exact credit and limits that apply to your plan.
+### Agents and providers
+
+**Today** NASDE drives **Claude Code, the Codex CLI, and the Gemini CLI**, so you can already compare three providers on the same tasks, with quality and cost side by side. **Planned:** Pi, Cursor, and router-based setups, so that a single benchmark can span more of the agents and routers you're weighing and a migration decision comes with visibility into the quality-vs-cost trade-offs.
+
## What do I use it for?
-Anyone working with AI coding agents eventually hits the same wall: *"I changed my skill / `CLAUDE.md` / MCP setup — is the agent actually better now, or does it just feel that way?"* NASDE turns that gut feeling into a repeatable measurement which is **easy to do on even on a personal machine, with a Claude Code or Codex subscription**.
+The core use is a **cost-and-quality decision** about your AI coding stack: *which agent, which model, which provider, which configuration — for our codebase and our budget?* NASDE answers it with numbers instead of vibes.
Typical things you'd do with it:
-- **Run an agent safely on a realistic task** — a sandboxed container means the agent can `rm -rf`, install random packages, or run your tests in loops without wrecking your laptop.
-- **Compare two configurations of the same agent** — baseline vs. "with my new skill"; see whether the skill moves the score up or down, and on which dimensions.
-- **Compare different agents on the same task** — Claude Code vs. Codex vs. Gemini CLI against *your* workspace and *your* criteria.
-- **Build a regression suite for your AI setup** — once a task set exists, re-run it every time someone tweaks the prompt/skills/MCP and spot regressions before they ship.
+- **Compare providers and models on quality *and* cost** — Claude Code vs. Codex vs. Gemini, Sonnet vs. Opus, against *your* tasks; see the score *and* the tokens and dollars each one spends, so you can pick the best quality-per-dollar for your budget.
+- **Decide whether a migration pays off** — before standardizing on a new agent or model, measure what actually changes in output quality and in spend.
+- **Measure your whole harness, not just one skill** — run your real `CLAUDE.md` + skills + MCP servers as a unit, and see how the full configuration performs, not just an isolated component.
+- **Tune a single skill or config** — baseline vs. "with my new skill"; see whether it moves the score up or down, and on which dimensions.
+- **Build a regression suite for your AI setup** — once a task set exists, re-run it every time someone tweaks the prompt/skills/MCP/model and catch quality *or* cost regressions before they ship.
+- **Run agents safely on realistic tasks** — a sandboxed container means the agent can `rm -rf`, install random packages, or loop your tests without wrecking your laptop.
Ready to try it? Head to the [Quick Start](/nasde-toolkit/getting-started/quick-start/).
diff --git a/website/src/content/docs/index.mdx b/website/src/content/docs/index.mdx
index ef177c2..69c200d 100644
--- a/website/src/content/docs/index.mdx
+++ b/website/src/content/docs/index.mdx
@@ -1,9 +1,9 @@
---
title: Nasde Toolkit Docs
-description: Documentation for Nasde — run an AI coding agent on a task you already know the answer to, score the result, and compare configurations.
+description: Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.
template: splash
hero:
- tagline: The official documentation for the Nasde Toolkit — install it, learn how the scoring works, and find every command and config option.
+ tagline: Measure how your whole AI coding setup performs, and what it costs in tokens and dollars across models and providers, so you can choose where to invest and when to switch. These docs cover installation, how the scoring works, and every command and config option.
image:
file: ../../assets/nasde-toolkit-logo.png
actions:
@@ -52,9 +52,9 @@ import { Card, CardGrid, LinkCard } from '@astrojs/starlight/components';
- Run an AI coding agent on a task you already know the answer to, score the result against your own criteria, and compare configurations. Start with the [Overview](/nasde-toolkit/getting-started/overview/).
+ Measure your whole AI coding setup (agent, skills, MCP servers) on your own tasks, for both quality *and* cost in tokens and dollars per model and provider, so you can decide which stack to use and when to switch. Start with the [Overview](/nasde-toolkit/getting-started/overview/).
- A real task end to end — instruction, criteria, and the scores four agent configurations earned. See [A Real Task](/nasde-toolkit/concepts/real-task-example/) and [Benchmark Results](/nasde-toolkit/guides/benchmark-results/).
+ A real task end to end — the instruction, the criteria, and how four agent configurations scored. See [A Real Task](/nasde-toolkit/concepts/real-task-example/), then the quality-vs-cost view in [Token & Cost](/nasde-toolkit/concepts/token-cost/) and [Benchmark Results](/nasde-toolkit/guides/benchmark-results/).
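
The "quality-per-dollar" comparison that the new copy keeps returning to can be illustrated with a few lines of arithmetic. The sketch below is only an illustration of the idea, not NASDE's actual output format or API: the `RunResult` record, its field names, the configuration names, and every number in it are hypothetical placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One benchmark run. Hypothetical shape, not NASDE's real schema."""
    config: str       # label for the agent/model/skills combination
    score: float      # overall quality score (assumed scale: 0 to 100)
    tokens: int       # total tokens consumed by the run
    cost_usd: float   # dollars spent on the run


def quality_per_dollar(result: RunResult) -> float:
    """Score points earned per dollar spent."""
    if result.cost_usd <= 0:
        raise ValueError("cost_usd must be positive to compute a ratio")
    return result.score / result.cost_usd


def rank_by_value(results: list[RunResult]) -> list[RunResult]:
    """Order runs from best to worst quality-per-dollar."""
    return sorted(results, key=quality_per_dollar, reverse=True)


# Placeholder numbers for illustration only; these are not benchmark results.
RESULTS = [
    RunResult(config="config-a", score=80.0, tokens=400_000, cost_usd=4.00),
    RunResult(config="config-b", score=90.0, tokens=700_000, cost_usd=6.00),
    RunResult(config="config-c", score=60.0, tokens=150_000, cost_usd=2.00),
]

if __name__ == "__main__":
    for run in rank_by_value(RESULTS):
        ratio = quality_per_dollar(run)
        print(f"{run.config}: score={run.score} cost=${run.cost_usd:.2f} "
              f"points/dollar={ratio:.1f}")
```

A single ratio hides the trade-off the docs describe: in the placeholder data the cheapest configuration ranks first on points per dollar while scoring lowest on quality, so in practice the ratio is probably best read next to a minimum acceptable score rather than on its own.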