23 changes: 15 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@

<h3>Noesis Agentic Software Development Evals Toolkit</h3>

<p>Run an AI coding agent on a task you already know the answer to. Score the result. Compare configurations.</p>
<p>Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.</p>

<a href="https://noesis.vision/nasde/"><img src="https://img.shields.io/badge/Product%20Page-Noesis%20Vision-0B6623?style=for-the-badge&logoColor=white" alt="Product Page"></a>
<a href="https://discord.gg/QF5PMX4Dqg"><img src="https://img.shields.io/badge/Discord-Join%20Community-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord"></a>
@@ -14,6 +14,14 @@

---

## Why NASDE?

Your team runs AI coding agents — but **which setup is actually best for *your* codebase, and what is it costing you?** Switch from Claude to Codex, swap a model, add a skill or an MCP server — and you're guessing whether quality went up, down, or just got more expensive.

The decisions that matter are getting expensive: **which provider, which model, which configuration** — each with a different quality-per-dollar trade-off. NASDE measures your **whole harness** — the agent, its skills, its MCP servers, against *your* tasks — and reports not just **how good** the output is, but **how many tokens and how many dollars** it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off.

It runs on your own machine with a subscription you already have. **Today** it drives Claude Code, the Codex CLI, and the Gemini CLI; **planned:** Pi, Cursor, and router-based setups.

## What NASDE does — in four steps

One `nasde run` command executes the whole chain.
@@ -39,14 +47,13 @@ Concepts (how the scoring works, the evaluation pipeline, token & cost, rubric c

## What do I use it for?

Anyone working with AI coding agents eventually hits the same wall: *"I changed my skill / `CLAUDE.md` / MCP setup — is the agent actually better now, or does it just feel that way?"* NASDE turns that gut feeling into a repeatable measurement which is **easy to do even on a personal machine, with a Claude Code or Codex subscription**.

Typical things you'd do with it:
The core use is a **cost-and-quality decision** about your AI coding stack: *which agent, which model, which provider, which configuration — for our codebase and our budget?* NASDE answers it with numbers instead of vibes. Typical things you'd do with it:

- **Run an agent safely on a realistic task** — a sandboxed container means the agent can `rm -rf`, install random packages, or run your tests in loops without wrecking your laptop.
- **Compare two configurations of the same agent** — baseline vs. "with my new skill"; see whether the skill moves the score up or down, and on which dimensions.
- **Compare different agents on the same task** — Claude Code vs. Codex vs. Gemini CLI against *your* workspace and *your* criteria.
- **Build a regression suite for your AI setup** — once a task set exists, re-run it every time someone tweaks the prompt/skills/MCP and spot regressions before they ship.
- **Compare providers and models on quality *and* cost** — Claude Code vs. Codex vs. Gemini, Sonnet vs. Opus, against *your* tasks; see the score *and* the tokens and dollars each one spends, and pick the best quality-per-dollar for your budget.
- **Decide whether a migration pays off** — before standardizing on a new agent or model, measure what actually changes in output quality and in spend.
- **Measure your whole harness, not just one skill** — run your real `CLAUDE.md` + skills + MCP servers as a unit and see how the full configuration performs.
- **Tune a single skill or config** — baseline vs. "with my new skill"; see whether it moves the score up or down, and on which dimensions.
- **Build a regression suite for your AI setup** — re-run the task set whenever someone tweaks the prompt/skills/MCP/model and catch quality *or* cost regressions before they ship.

## Quick start

26 changes: 17 additions & 9 deletions website/src/content/docs/getting-started/overview.md
@@ -1,15 +1,17 @@
---
title: Overview
description: Why NASDE exists, what it does, what it is and isn't, and the four-step loop it automates.
description: Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.
---

## Why NASDE? — the problem

You changed your agent's skill, its `CLAUDE.md`, or its MCP setup. **Is the agent actually better now — or does it just feel that way?**
Your team runs AI coding agents — but which setup is actually best for *your* codebase, and what is it costing you? Switch from Claude to Codex, swap a model, add a skill or an MCP server — and you're guessing whether quality went up, down, or just got more expensive.

Everyone working with AI coding agents hits this wall. Skill changes are a leap of faith: maybe the new prompt improves refactoring but quietly breaks the agent's tests. A passing CI check tells you the code *works*; it doesn't tell you whether the agent's output got *better* on the things you care about — architecture, test quality, clarity — or which of two agents does it better on *your* code.
The decisions that matter are getting expensive: **which provider, which model, which configuration** — each with a different quality-per-dollar trade-off. Without measurement, you're optimizing AI spend on gut feel.

NASDE turns that gut feeling into a **repeatable measurement** you can run on your own machine with a Claude Code or Codex subscription. Define a task you already know the answer to, run different agent configurations against it, and get multi-dimensional scores you can compare and trust.
NASDE measures your **whole harness** — the agent, its skills, its MCP servers, against *your* tasks — and reports not just **how good** the output is, but **how many tokens and how many dollars** it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off.

It runs on your own machine with a subscription you already have, and you stay in control of what "good" means — you define the tasks and the criteria; NASDE runs the experiment the same way every time and puts quality and cost side by side.

## What NASDE does — in four steps

@@ -72,15 +74,21 @@ NASDE runs every agent **non-interactively** — it scripts them rather than cha

Because it uses Claude programmatically, NASDE falls under Anthropic's terms for programmatic use. Anthropic has announced that, **from June 15, 2026, paid Claude plans include a dedicated monthly credit for programmatic usage** — covering `claude -p`, the Claude Agent SDK, and Claude Code GitHub Actions, among others. So running NASDE on a paid Claude subscription is expected and supported; check [Anthropic's current terms](https://www.anthropic.com/) for the exact credit and limits that apply to your plan.

### Agents and providers

**Today** NASDE drives **Claude Code, the Codex CLI, and the Gemini CLI** — so you can already compare across three providers on the same tasks, with quality and cost side by side. **Planned:** Pi, Cursor, and router-based setups — so a single benchmark spans every agent and router you're weighing, and a migration decision comes with full visibility into the quality-vs-cost trade-offs.

## What do I use it for?

Anyone working with AI coding agents eventually hits the same wall: *"I changed my skill / `CLAUDE.md` / MCP setup — is the agent actually better now, or does it just feel that way?"* NASDE turns that gut feeling into a repeatable measurement which is **easy to do on even on a personal machine, with a Claude Code or Codex subscription**.
The core use is a **cost-and-quality decision** about your AI coding stack: *which agent, which model, which provider, which configuration — for our codebase and our budget?* NASDE answers it with numbers instead of vibes.

Typical things you'd do with it:

- **Run an agent safely on a realistic task** — a sandboxed container means the agent can `rm -rf`, install random packages, or run your tests in loops without wrecking your laptop.
- **Compare two configurations of the same agent** — baseline vs. "with my new skill"; see whether the skill moves the score up or down, and on which dimensions.
- **Compare different agents on the same task** — Claude Code vs. Codex vs. Gemini CLI against *your* workspace and *your* criteria.
- **Build a regression suite for your AI setup** — once a task set exists, re-run it every time someone tweaks the prompt/skills/MCP and spot regressions before they ship.
- **Compare providers and models on quality *and* cost** — Claude Code vs. Codex vs. Gemini, Sonnet vs. Opus, against *your* tasks; see the score *and* the tokens and dollars each one spends, so you can pick the best quality-per-dollar for your budget.
- **Decide whether a migration pays off** — before standardizing on a new agent or model, measure what actually changes in output quality and in spend.
- **Measure your whole harness, not just one skill** — run your real `CLAUDE.md` + skills + MCP servers as a unit, and see how the full configuration performs, not just an isolated component.
- **Tune a single skill or config** — baseline vs. "with my new skill"; see whether it moves the score up or down, and on which dimensions.
- **Build a regression suite for your AI setup** — once a task set exists, re-run it every time someone tweaks the prompt/skills/MCP/model and catch quality *or* cost regressions before they ship.
- **Run agents safely on realistic tasks** — a sandboxed container means the agent can `rm -rf`, install random packages, or loop your tests without wrecking your laptop.

Ready to try it? Head to the [Quick Start](/nasde-toolkit/getting-started/quick-start/).
8 changes: 4 additions & 4 deletions website/src/content/docs/index.mdx
@@ -1,9 +1,9 @@
---
title: Nasde Toolkit Docs
description: Documentation for Nasde — run an AI coding agent on a task you already know the answer to, score the result, and compare configurations.
description: Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.
template: splash
hero:
tagline: The official documentation for the Nasde Toolkit — install it, learn how the scoring works, and find every command and config option.
tagline: Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch. This is the documentation.
image:
file: ../../assets/nasde-toolkit-logo.png
actions:
@@ -52,9 +52,9 @@ import { Card, CardGrid, LinkCard } from '@astrojs/starlight/components';

<CardGrid>
<Card title="What is Nasde?" icon="open-book">
Run an AI coding agent on a task you already know the answer to, score the result against your own criteria, and compare configurations. Start with the [Overview](/nasde-toolkit/getting-started/overview/).
Measure your whole AI coding setup — agent, skills, MCP, on your tasks — on both quality *and* cost (tokens and dollars, per model and provider), so you can decide which stack to use and when to switch. Start with the [Overview](/nasde-toolkit/getting-started/overview/).
</Card>
<Card title="See it in action" icon="approve-check">
A real task end to end — instruction, criteria, and the scores four agent configurations earned. See [A Real Task](/nasde-toolkit/concepts/real-task-example/) and [Benchmark Results](/nasde-toolkit/guides/benchmark-results/).
A real task end to end — the instruction, the criteria, and how four agent configurations scored. See [A Real Task](/nasde-toolkit/concepts/real-task-example/), then the quality-vs-cost view in [Token & Cost](/nasde-toolkit/concepts/token-cost/) and [Benchmark Results](/nasde-toolkit/guides/benchmark-results/).
</Card>
</CardGrid>