From 9eafdd7d55e325110b4d9d95831f8fd1d7cbc957 Mon Sep 17 00:00:00 2001 From: Szymon Janikowski Date: Mon, 22 Jun 2026 13:30:56 +0200 Subject: [PATCH] =?UTF-8?q?docs:=20rebrand=20NASDE=20=E2=86=92=20Nasde=20c?= =?UTF-8?q?asing=20across=20docs=20site?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace the all-caps acronym "NASDE" with the word-form "Nasde" throughout the Starlight docs content and brand CSS comments. Pure editorial casing change — 83 lines changed, no content edits. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../src/content/docs/concepts/calibration.md | 4 +-- .../src/content/docs/concepts/how-it-works.md | 14 ++++----- .../src/content/docs/concepts/key-terms.md | 8 ++--- .../docs/concepts/real-task-example.md | 2 +- .../src/content/docs/concepts/token-cost.md | 24 +++++++-------- .../docs/creating-benchmarks/anatomy.md | 2 +- .../assessment-criteria.md | 4 +-- .../content/docs/getting-started/overview.md | 30 +++++++++---------- .../docs/getting-started/quick-start.md | 8 ++--- .../docs/getting-started/reading-results.md | 4 +-- .../content/docs/guides/benchmark-results.mdx | 6 ++-- .../content/docs/guides/plugins-and-skills.md | 6 ++-- .../content/docs/guides/running-benchmarks.md | 8 ++--- .../content/docs/guides/troubleshooting.md | 10 +++---- website/src/content/docs/guides/use-cases.md | 14 ++++----- .../content/docs/reference/authentication.md | 12 ++++---- .../content/docs/reference/configuration.md | 6 ++-- website/src/styles/custom.css | 4 +-- 18 files changed, 83 insertions(+), 83 deletions(-) diff --git a/website/src/content/docs/concepts/calibration.md b/website/src/content/docs/concepts/calibration.md index 5e9bd0f..0c3af3f 100644 --- a/website/src/content/docs/concepts/calibration.md +++ b/website/src/content/docs/concepts/calibration.md @@ -54,6 +54,6 @@ This pulls your comments back. 
The `nasde-benchmark-calibration` skill then line ## Things worth knowing -- **The platform is auto-detected from the repo URL** — `github.com` uses the `gh` CLI, GitLab uses `glab`. You need the matching CLI installed and logged in (`gh auth login` / `glab auth login`); NASDE never handles your token. A self-hosted GitLab host that isn't obviously "gitlab" can be forced with `[calibration] platform = "gitlab"`. +- **The platform is auto-detected from the repo URL** — `github.com` uses the `gh` CLI, GitLab uses `glab`. You need the matching CLI installed and logged in (`gh auth login` / `glab auth login`); Nasde never handles your token. A self-hosted GitLab host that isn't obviously "gitlab" can be forced with `[calibration] platform = "gitlab"`. - **Re-running is idempotent** — a trial whose **open** PR/MR already exists is skipped, so you can publish more trials into the same sink without duplicates. Once you close a calibration round, the same trials can be re-published into a fresh round (a closed PR no longer blocks). -- **The sink repo must already exist** — NASDE pushes branches and opens PRs but does not create repositories. One base branch is seeded per `(repo, commit)` and shared by all that source's trials. +- **The sink repo must already exist** — Nasde pushes branches and opens PRs but does not create repositories. One base branch is seeded per `(repo, commit)` and shared by all that source's trials. diff --git a/website/src/content/docs/concepts/how-it-works.md b/website/src/content/docs/concepts/how-it-works.md index a3c5095..85cf259 100644 --- a/website/src/content/docs/concepts/how-it-works.md +++ b/website/src/content/docs/concepts/how-it-works.md @@ -1,9 +1,9 @@ --- title: How It Works -description: The two kinds of scoring in NASDE and the full evaluation pipeline, stage by stage — from task definition through the reviewer agent to logged results. 
+description: The two kinds of scoring in Nasde and the full evaluation pipeline, stage by stage — from task definition through the reviewer agent to logged results. --- -The question that trips most people up is *how* NASDE scores a run. The short answer: there are **two independent kinds of scoring**, and they answer different questions. The rest of this page explains both, then walks the whole pipeline stage by stage. +The question that trips most people up is *how* Nasde scores a run. The short answer: there are **two independent kinds of scoring**, and they answer different questions. The rest of this page explains both, then walks the whole pipeline stage by stage. ## Two independent kinds of scoring @@ -26,7 +26,7 @@ flowchart LR style E fill:#c0392b,color:#fff ``` -One `nasde run` command drives this whole chain. Stage 1 (the agent doing the work in a sandbox) comes from [Harbor](https://www.harborframework.com/); the optional tracking stage at the end uses [Opik](https://github.com/comet-ml/opik). NASDE is the glue that connects them, adds the reviewer stage in between, and gives you the CLI, the benchmark project layout, and the [authoring skills](/nasde-toolkit/getting-started/quick-start/). +One `nasde run` command drives this whole chain. Stage 1 (the agent doing the work in a sandbox) comes from [Harbor](https://www.harborframework.com/); the optional tracking stage at the end uses [Opik](https://github.com/comet-ml/opik). Nasde is the glue that connects them, adds the reviewer stage in between, and gives you the CLI, the benchmark project layout, and the [authoring skills](/nasde-toolkit/getting-started/quick-start/). ## 1. The task — what you define @@ -41,12 +41,12 @@ You don't have to write all of these by hand — the [authoring skills](/nasde-t ## 2. 
The agent solves it — in an isolated sandbox -NASDE hands the task to the coding agent under test (Claude Code, Codex, or Gemini CLI) running inside a fresh, isolated container — locally on Docker by default, or on a [cloud sandbox provider](/nasde-toolkit/guides/running-benchmarks/) for horizontal scale. Isolation matters for two reasons: +Nasde hands the task to the coding agent under test (Claude Code, Codex, or Gemini CLI) running inside a fresh, isolated container — locally on Docker by default, or on a [cloud sandbox provider](/nasde-toolkit/guides/running-benchmarks/) for horizontal scale. Isolation matters for two reasons: - **Safety** — the agent can `rm -rf`, install arbitrary packages, or loop your test suite without touching your machine. - **Fairness** — every trial starts from the same clean state, so a score difference reflects the *configuration* (the `CLAUDE.md`, the skill, the MCP server, the model, the reasoning effort), not leftover state from a previous run. -What varies between runs is exactly one thing: the **variant** — the agent configuration you're testing. That's how NASDE turns "did my skill help?" into a controlled experiment. See [Configuration → variant.toml](/nasde-toolkit/reference/configuration/) for what a variant can change. +What varies between runs is exactly one thing: the **variant** — the agent configuration you're testing. That's how Nasde turns "did my skill help?" into a controlled experiment. See [Configuration → variant.toml](/nasde-toolkit/reference/configuration/) for what a variant can change. ## 3. Rough tests — a deterministic pass/fail @@ -60,7 +60,7 @@ This is the standard verifier pattern used by [Harbor](https://www.harborframewo ## 4. The reviewer agent — reading the actual work -This is the stage NASDE adds, and it's why the pipeline exists. 
A **second coding agent** (`claude` or `codex`) is pointed at the produced workspace and navigates it with real tools — `Read`, `Glob`, `Grep`, optionally MCP analysis servers — reading only what each dimension needs rather than stuffing the whole repo into a prompt. That's what keeps the review tractable on large codebases. It can also read the agent's full **trajectory** (tool calls, tokens, timing), so your criteria can judge the *process*, not just the final files. +This is the stage Nasde adds, and it's why the pipeline exists. A **second coding agent** (`claude` or `codex`) is pointed at the produced workspace and navigates it with real tools — `Read`, `Glob`, `Grep`, optionally MCP analysis servers — reading only what each dimension needs rather than stuffing the whole repo into a prompt. That's what keeps the review tractable on large codebases. It can also read the agent's full **trajectory** (tool calls, tokens, timing), so your criteria can judge the *process*, not just the final files. The reviewer's reference point is **two files you write** when creating the benchmark: @@ -75,7 +75,7 @@ You decide how strict the criteria are — spell out a ground-truth structure, e The reviewer scores each dimension on whatever scale you chose, with written reasoning for each. One local `nasde run` handles all of it — no separate LLM-as-a-judge stack required. -**The reviewer runs more than once.** An LLM judge is non-deterministic — score the same workspace twice and you can get 0.61 then 0.71. So by default NASDE evaluates each trial **3 times** (`eval_repetitions`, set in `nasde.toml [evaluation]` or with `--eval-repetitions`) and reports the **mean** rather than any single run. Each evaluation is kept as its own `assessment_eval_.json`; a derived `assessment_summary.json` holds the per-dimension mean, standard deviation, and range. 
Means are computed only within a single judge model **and a single rubric** — a Claude review and a Codex review are different benchmarks, and so is a review run after you edited `assessment_dimensions.json` (the rubric is fingerprinted, so changing a dimension, its `max_score`, or even its description starts a fresh cluster rather than silently mixing incomparable scores). After editing the rubric, just re-run `nasde eval` — the new evaluations form their own cluster automatically. +**The reviewer runs more than once.** An LLM judge is non-deterministic — score the same workspace twice and you can get 0.61 then 0.71. So by default Nasde evaluates each trial **3 times** (`eval_repetitions`, set in `nasde.toml [evaluation]` or with `--eval-repetitions`) and reports the **mean** rather than any single run. Each evaluation is kept as its own `assessment_eval_.json`; a derived `assessment_summary.json` holds the per-dimension mean, standard deviation, and range. Means are computed only within a single judge model **and a single rubric** — a Claude review and a Codex review are different benchmarks, and so is a review run after you edited `assessment_dimensions.json` (the rubric is fingerprinted, so changing a dimension, its `max_score`, or even its description starts a fresh cluster rather than silently mixing incomparable scores). After editing the rubric, just re-run `nasde eval` — the new evaluations form their own cluster automatically. When the judge's scores feel off, you can align it with your own judgment — see [Calibrating the Rubric](/nasde-toolkit/concepts/calibration/). 
diff --git a/website/src/content/docs/concepts/key-terms.md b/website/src/content/docs/concepts/key-terms.md index 709026f..883d9fb 100644 --- a/website/src/content/docs/concepts/key-terms.md +++ b/website/src/content/docs/concepts/key-terms.md @@ -1,9 +1,9 @@ --- title: Key Terms -description: A glossary of the NASDE vocabulary — variant, trial, job, rubric, dimension, reviewer, trajectory, and the tools it builds on. +description: A glossary of the Nasde vocabulary — variant, trial, job, rubric, dimension, reviewer, trajectory, and the tools it builds on. --- -NASDE has its own vocabulary. If a word in the docs is unfamiliar, it's probably here. +Nasde has its own vocabulary. If a word in the docs is unfamiliar, it's probably here. ## The benchmark you author @@ -50,10 +50,10 @@ NASDE has its own vocabulary. If a word in the docs is unfamiliar, it's probably **Sandbox** : The isolated container the agent works in. It can't touch your machine, and every trial starts from the same clean state. -## The tools NASDE builds on +## The tools Nasde builds on **Harbor** -: The framework that runs the agent in a sandbox (Stage 1). NASDE uses its Python API directly. [harborframework.com](https://www.harborframework.com/) +: The framework that runs the agent in a sandbox (Stage 1). Nasde uses its Python API directly. [harborframework.com](https://www.harborframework.com/) **Opik** : The optional experiment tracker scores flow to with `--with-opik`. 
[Opik by Comet](https://github.com/comet-ml/opik) diff --git a/website/src/content/docs/concepts/real-task-example.md b/website/src/content/docs/concepts/real-task-example.md index 55c10d4..7e62b0f 100644 --- a/website/src/content/docs/concepts/real-task-example.md +++ b/website/src/content/docs/concepts/real-task-example.md @@ -58,6 +58,6 @@ A separate study took a public DDD skill (the `tactical-ddd` skill from `ntcodin ## More benchmarks in the repo - **Refactoring katas (Java + Python)** — four classic refactorings scored on behavior preservation, clarity, technique, scope discipline. *Takeaway:* a candidate "refactoring skill" didn't move the score — shipping it would have been based on vibes. -- **Project-specific skill validation (NASDE's own repo)** — one task pulled from NASDE's git history; four skill combinations tested. *Takeaway:* the testing-discipline skill alone raised pass rate from 67% → 100%; the "full-stack, everything-on" variant scored *worse* than vanilla. +- **Project-specific skill validation (Nasde's own repo)** — one task pulled from Nasde's git history; four skill combinations tested. *Takeaway:* the testing-discipline skill alone raised pass rate from 67% → 100%; the "full-stack, everything-on" variant scored *worse* than vanilla. See **[Benchmark Results](/nasde-toolkit/guides/benchmark-results/)** for the full tables and methodology, and **[Use Cases](/nasde-toolkit/guides/use-cases/)** for the end-to-end walkthrough of building a benchmark like these yourself. diff --git a/website/src/content/docs/concepts/token-cost.md b/website/src/content/docs/concepts/token-cost.md index 1e6d118..22c97c6 100644 --- a/website/src/content/docs/concepts/token-cost.md +++ b/website/src/content/docs/concepts/token-cost.md @@ -1,13 +1,13 @@ --- title: Token & Cost -description: The raw token, cost, and quality signals NASDE records per trial — and why it compares models as a Pareto front, not a single efficiency ratio. 
+description: The raw token, cost, and quality signals Nasde records per trial — and why it compares models as a Pareto front, not a single efficiency ratio. --- -A passing test tells you the agent *can* do the task. It doesn't tell you what that capability **costs**. NASDE records, for every trial, how many tokens the agent burned and what that would cost in dollars. +A passing test tells you the agent *can* do the task. It doesn't tell you what that capability **costs**. Nasde records, for every trial, how many tokens the agent burned and what that would cost in dollars. ## The raw signals -NASDE records the raw quality, token, and cost signals you need to compare agents and models: +Nasde records the raw quality, token, and cost signals you need to compare agents and models: - **token usage** — total input + output tokens for the run (price-independent; a measure of how much the model "thinks" to reach a given quality). - **cost (USD)** — what those tokens cost at catalog rates. The number that matters when you're choosing a model for a budget. @@ -16,11 +16,11 @@ These appear in three places: the `nasde run` summary prints a per-`(agent, mode ## Quality vs. cost: the Pareto frontier -This is the comparison that actually drives a model choice. NASDE measures quality and cost **independently** so you can see the *trade-off* — which model gives you the best quality for your budget — instead of collapsing it into one number that hides the picture. +This is the comparison that actually drives a model choice. Nasde measures quality and cost **independently** so you can see the *trade-off* — which model gives you the best quality for your budget — instead of collapsing it into one number that hides the picture. Why not a single "efficiency" ratio (quality per dollar)? Because that ratio has an arbitrary zero — a score of 0 means an empty rubric, which no real run reaches — so the same data can re-order which model "wins" just by shifting where you put that zero. 
The trade-off is shift-invariant; a single ratio is not. -So NASDE plots the raw signals as a **Pareto frontier** (quality vs. cost, and quality vs. tokens). Models on the frontier are the best available trade-offs; a model above it is overpaying for its quality, one below it is buying cheap quality. *You* pick the point on the frontier that matches your budget and quality bar. +So Nasde plots the raw signals as a **Pareto frontier** (quality vs. cost, and quality vs. tokens). Models on the frontier are the best available trade-offs; a model above it is overpaying for its quality, one below it is buying cheap quality. *You* pick the point on the frontier that matches your budget and quality bar. ![Quality vs cost across models and skill variants: one shared cost panel plus a per-provider token panel](../../../assets/benchmark/pareto.png) @@ -30,27 +30,27 @@ The Pareto comparison and the chart generator live in the `nasde-benchmark-runne ## Why scores come with a ± (and why that matters) -A benchmark score is never perfectly repeatable — run the same setup twice and you'll get slightly different numbers. So a bare average can lie: if config A scores 0.82 and config B scores 0.80, is A *really* better, or did it just get lucky this time? To answer that honestly, you need to know **how much the score wobbles**. That's why NASDE always reports a score as **`mean ±std`** — the average, plus the typical wobble around it. +A benchmark score is never perfectly repeatable — run the same setup twice and you'll get slightly different numbers. So a bare average can lie: if config A scores 0.82 and config B scores 0.80, is A *really* better, or did it just get lucky this time? To answer that honestly, you need to know **how much the score wobbles**. That's why Nasde always reports a score as **`mean ±std`** — the average, plus the typical wobble around it. 
-There are **two separate sources of wobble**, and NASDE keeps them apart on purpose: +There are **two separate sources of wobble**, and Nasde keeps them apart on purpose: - **The agent writes different code each time.** Ask the same agent to solve the same task twice and it won't produce identical code, so the scores differ. This wobble shows up as the **`±std` in the run summary table**, measured *across your attempts* (the `Trials` column is how many attempts went into the average). Run more attempts — `--attempts` — and this estimate gets sharper. A single attempt is shown honestly as `mean (n=1)`, not a fake `±0.00`. - **The judge scores the same code slightly differently each time.** Even on identical code, the reviewer isn't perfectly consistent. This is a *different* wobble, so it's recorded *per trial* in `metrics.json` (`score_eval_std`, `score_eval_n`). -Why split them? Because the question you actually care about is: **is the gap between two configs bigger than the wobble, or is it just noise?** Keeping the two sources separate lets you answer that — a 0.02 gap means nothing if each score wobbles by ±0.08. (Formal significance testing is a separate, offline step; NASDE's job here is to surface the spread and sample size that make an average trustworthy in the first place.) +Why split them? Because the question you actually care about is: **is the gap between two configs bigger than the wobble, or is it just noise?** Keeping the two sources separate lets you answer that — a 0.02 gap means nothing if each score wobbles by ±0.08. (Formal significance testing is a separate, offline step; Nasde's job here is to surface the spread and sample size that make an average trustworthy in the first place.) ## How the cost is calculated — "as if every run were the first" -The dollar figure NASDE reports is **deliberately consistent**: run the same task ten times and you'll get the same cost ten times. That's on purpose, and here's why it matters. 
+The dollar figure Nasde reports is **deliberately consistent**: run the same task ten times and you'll get the same cost ten times. That's on purpose, and here's why it matters. Most providers give a discount for **prompt caching** — if you send the same prompt again soon after, the repeated part is cheaper. That sounds good, but it makes cost *unpredictable for comparison*: the exact same run can cost more or less depending on whether your cache happened to be "warm" (recently used) or "cold". You'd be comparing models on luck, not on how much they actually cost. -So NASDE **ignores the cache discount entirely** and prices every run **as if it were the very first one** — the full prompt billed at the full catalog rate, every time. The model's reasoning tokens (the "thinking" some models do) are counted as output. The result is a cost number that depends only on the model and the task, not on timing — so when you compare two models, you're comparing them fairly. +So Nasde **ignores the cache discount entirely** and prices every run **as if it were the very first one** — the full prompt billed at the full catalog rate, every time. The model's reasoning tokens (the "thinking" some models do) are counted as output. The result is a cost number that depends only on the model and the task, not on timing — so when you compare two models, you're comparing them fairly. ## Where pricing comes from -Rates live in a small, versioned `pricing.toml` bundled with NASDE, each model stamped with the date and source it came from. A model entry looks like: +Rates live in a small, versioned `pricing.toml` bundled with Nasde, each model stamped with the date and source it came from. A model entry looks like: ```toml [models."your-model-id"] @@ -63,7 +63,7 @@ source = "https://…" A model that isn't in the catalog still gets token metrics — only its `cost_usd` is left blank (with a warning), never a wrong number. 
:::note[Editing the catalog] -The catalog is **bundled into the package**, so editing it depends on how you installed NASDE. From a source checkout (`uv sync`) you can edit `src/nasde_toolkit/pricing.toml` directly. After a PyPI install (`uv tool install` / pipx) the file lives inside an isolated environment and any edit is overwritten on the next upgrade — so for now, adding a model or correcting a rate means contributing it upstream or running from source. A per-project / per-user pricing override is a planned improvement. +The catalog is **bundled into the package**, so editing it depends on how you installed Nasde. From a source checkout (`uv sync`) you can edit `src/nasde_toolkit/pricing.toml` directly. After a PyPI install (`uv tool install` / pipx) the file lives inside an isolated environment and any edit is overwritten on the next upgrade — so for now, adding a model or correcting a rate means contributing it upstream or running from source. A per-project / per-user pricing override is a planned improvement. ::: :::caution[Confirm rates before quoting costs] diff --git a/website/src/content/docs/creating-benchmarks/anatomy.md b/website/src/content/docs/creating-benchmarks/anatomy.md index 881c50b..d8e7872 100644 --- a/website/src/content/docs/creating-benchmarks/anatomy.md +++ b/website/src/content/docs/creating-benchmarks/anatomy.md @@ -1,6 +1,6 @@ --- title: Anatomy of a Benchmark -description: What a NASDE benchmark is made of — tasks, variants, dimensions — and how the files fit together before you ever run it. +description: What a Nasde benchmark is made of — tasks, variants, dimensions — and how the files fit together before you ever run it. --- Before you let a skill scaffold one for you, it helps to understand what a benchmark *is*. It's a directory of plain files with three moving parts: **tasks** (the problems), **variants** (the agent configurations you compare), and **dimensions** (the axes you score on). 
diff --git a/website/src/content/docs/creating-benchmarks/assessment-criteria.md b/website/src/content/docs/creating-benchmarks/assessment-criteria.md index 566d4fe..87fc843 100644 --- a/website/src/content/docs/creating-benchmarks/assessment-criteria.md +++ b/website/src/content/docs/creating-benchmarks/assessment-criteria.md @@ -3,7 +3,7 @@ title: Assessment Criteria & Dimensions description: How to write the rubric the reviewer scores against — define dimensions, set independent scales, and write score ladders that capture what "good" means to you. --- -The rubric is where NASDE's value lives. The rough `test.sh` answers "did it work?"; the rubric answers "how *good* is it?" — and you're the one who decides what good means. This page is how you write it. +The rubric is where Nasde's value lives. The rough `test.sh` answers "did it work?"; the rubric answers "how *good* is it?" — and you're the one who decides what good means. This page is how you write it. The rubric is two files: @@ -67,5 +67,5 @@ Whatever you choose, write it so a careful human reviewer would score the same w ## Tips - **Score the process, not just the output.** With `include_trajectory` enabled, the reviewer can read the agent's tool-call trace — so a dimension can reward verification discipline or penalize thrashing. See [Configuring the reviewer agent](/nasde-toolkit/guides/running-benchmarks/#configuring-the-reviewer-agent). -- **Edit, then re-score, never silently mix.** Changing a dimension or its `max_score` makes a different benchmark; NASDE fingerprints the rubric so old and new scores never average together. Just re-run `nasde eval`. +- **Edit, then re-score, never silently mix.** Changing a dimension or its `max_score` makes a different benchmark; Nasde fingerprints the rubric so old and new scores never average together. Just re-run `nasde eval`. 
- **Start with one task.** Get the rubric right on a single task before scaling — a good ladder is reusable; a vague one multiplies noise. diff --git a/website/src/content/docs/getting-started/overview.md b/website/src/content/docs/getting-started/overview.md index f373919..859369f 100644 --- a/website/src/content/docs/getting-started/overview.md +++ b/website/src/content/docs/getting-started/overview.md @@ -3,17 +3,17 @@ title: Overview description: Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch. --- -## Why NASDE? — the problem +## Why Nasde? — the problem Your team runs AI coding agents — but which setup is actually best for *your* codebase, and what is it costing you? Switch from Claude to Codex, swap a model, add a skill or an MCP server — and you're guessing whether quality went up, down, or just got more expensive. The decisions that matter are getting expensive: **which provider, which model, which configuration** — each with a different quality-per-dollar trade-off. Without measurement, you're optimizing AI spend on gut feel. -NASDE measures your **whole harness** — the agent, its skills, its MCP servers, against *your* tasks — and reports not just **how good** the output is, but **how many tokens and how many dollars** it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off. +Nasde measures your **whole harness** — the agent, its skills, its MCP servers, against *your* tasks — and reports not just **how good** the output is, but **how many tokens and how many dollars** it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off. 
-It runs on your own machine with a subscription you already have, and you stay in control of what "good" means — you define the tasks and the criteria; NASDE runs the experiment the same way every time and puts quality and cost side by side. +It runs on your own machine with a subscription you already have, and you stay in control of what "good" means — you define the tasks and the criteria; Nasde runs the experiment the same way every time and puts quality and cost side by side. -## What NASDE does — in four steps +## What Nasde does — in four steps One `nasde run` command executes the whole chain. @@ -22,7 +22,7 @@ One `nasde run` command executes the whole chain. 3. **A reviewer agent assesses the result against your criteria.** After initial rough tests pass or fail, a second coding agent (`claude` or `codex`) navigates the workspace and scores your chosen dimensions (e.g. *domain modeling*, *test quality*) on whatever scale you picked. The review stays token-efficient even on large codebases. 4. **Results land in a dashboard (optional).** Browse scores, compare variants, and track how your agent setup evolves over time — optionally via [Opik](https://www.comet.com/site/products/opik/). -You're the one defining "what good looks like." NASDE just automates running the experiment and assessing it the same way every time. +You're the one defining "what good looks like." Nasde just automates running the experiment and assessing it the same way every time. ## The evaluation pipeline, end to end @@ -38,7 +38,7 @@ flowchart LR style E fill:#c0392b,color:#fff ``` -Stage 1 (the agent solving the task in a sandbox) comes from [Harbor](https://www.harborframework.com/); the optional tracking stage uses [Opik](https://github.com/comet-ml/opik). NASDE is the glue that connects them and adds the **reviewer stage** in between — the part that turns "did the test pass?" into "how good is the result, on the dimensions *I* care about?" 
See [How It Works](/nasde-toolkit/concepts/how-it-works/) for the two kinds of scoring and the full per-stage detail. +Stage 1 (the agent solving the task in a sandbox) comes from [Harbor](https://www.harborframework.com/); the optional tracking stage uses [Opik](https://github.com/comet-ml/opik). Nasde is the glue that connects them and adds the **reviewer stage** in between — the part that turns "did the test pass?" into "how good is the result, on the dimensions *I* care about?" See [How It Works](/nasde-toolkit/concepts/how-it-works/) for the two kinds of scoring and the full per-stage detail. ## Why this is useful — a concrete example @@ -51,36 +51,36 @@ The value shows up the moment you compare configurations. Here are four agent se | `codex-vanilla` | 89% | 18.8 | 8.7 | **69.4** | | `codex-guided` (same skill) | 50% | 11.5 | 6.0 | **47.4** | -The same "DDD guidance" skill helps Claude a little (+3.5) and *badly* hurts Codex (−22) — an insight that's invisible without per-dimension assessment, and exactly what NASDE is built to surface. See [A Real Task](/nasde-toolkit/concepts/real-task-example/) for the full breakdown and [Benchmark Results](/nasde-toolkit/guides/benchmark-results/) for more. +The same "DDD guidance" skill helps Claude a little (+3.5) and *badly* hurts Codex (−22) — an insight that's invisible without per-dimension assessment, and exactly what Nasde is built to surface. See [A Real Task](/nasde-toolkit/concepts/real-task-example/) for the full breakdown and [Benchmark Results](/nasde-toolkit/guides/benchmark-results/) for more. -## What NASDE is — and is not +## What Nasde is — and is not It helps to be clear about the boundaries before you invest: -**NASDE is:** +**Nasde is:** - A way to **measure and compare agent configurations** (skills, `CLAUDE.md`, MCP, model, reasoning effort) on tasks you define. 
- A **local, offline-friendly** tool — runs on your machine via Docker, billed through your existing `claude` / `codex` / `gemini` subscription or API key. - A framework for **building your own benchmarks** from real work (e.g. your git history), with multi-dimensional LLM-as-a-Judge scoring. -**NASDE is not:** +**Nasde is not:** - **A replacement for your CI or test suite.** The rough `test.sh` is a pass/fail gate, not a full test harness — it answers "did it work?", and the reviewer answers "how good is it?". Neither replaces your real tests. - **A production monitor.** It scores agents on benchmark tasks in a sandbox, not live traffic. - **A judge of the "one true model."** It surfaces trade-offs (quality vs. cost, per-dimension strengths) — *you* decide what wins for your budget and priorities. - **Zero-setup or zero-cost.** It needs Docker and an agent you're authenticated to; each run spends real tokens. -### How NASDE drives the agents +### How Nasde drives the agents -NASDE runs every agent **non-interactively** — it scripts them rather than chatting with them, using each tool's programmatic mode (`claude -p`, `codex exec`, and the Gemini CLI equivalent) under the hood, for both the agent under test and the reviewer. An **interactive mode is planned** but not available yet. +Nasde runs every agent **non-interactively** — it scripts them rather than chatting with them, using each tool's programmatic mode (`claude -p`, `codex exec`, and the Gemini CLI equivalent) under the hood, for both the agent under test and the reviewer. An **interactive mode is planned** but not available yet. -Because it uses Claude programmatically, NASDE falls under Anthropic's terms for programmatic use. Anthropic has announced that, **from June 15, 2026, paid Claude plans include a dedicated monthly credit for programmatic usage** — covering `claude -p`, the Claude Agent SDK, and Claude Code GitHub Actions, among others. 
So running NASDE on a paid Claude subscription is expected and supported; check [Anthropic's current terms](https://www.anthropic.com/) for the exact credit and limits that apply to your plan. +Because it uses Claude programmatically, Nasde falls under Anthropic's terms for programmatic use. Anthropic has announced that, **from June 15, 2026, paid Claude plans include a dedicated monthly credit for programmatic usage** — covering `claude -p`, the Claude Agent SDK, and Claude Code GitHub Actions, among others. So running Nasde on a paid Claude subscription is expected and supported; check [Anthropic's current terms](https://www.anthropic.com/) for the exact credit and limits that apply to your plan. ### Agents and providers -**Today** NASDE drives **Claude Code, the Codex CLI, and the Gemini CLI** — so you can already compare across three providers on the same tasks, with quality and cost side by side. **Planned:** Pi, Cursor, and router-based setups — so a single benchmark spans every agent and router you're weighing, and a migration decision comes with full visibility into the quality-vs-cost trade-offs. +**Today** Nasde drives **Claude Code, the Codex CLI, and the Gemini CLI** — so you can already compare across three providers on the same tasks, with quality and cost side by side. **Planned:** Pi, Cursor, and router-based setups — so a single benchmark spans every agent and router you're weighing, and a migration decision comes with full visibility into the quality-vs-cost trade-offs. ## What do I use it for? -The core use is a **cost-and-quality decision** about your AI coding stack: *which agent, which model, which provider, which configuration — for our codebase and our budget?* NASDE answers it with numbers instead of vibes. +The core use is a **cost-and-quality decision** about your AI coding stack: *which agent, which model, which provider, which configuration — for our codebase and our budget?* Nasde answers it with numbers instead of vibes. 
Typical things you'd do with it: diff --git a/website/src/content/docs/getting-started/quick-start.md b/website/src/content/docs/getting-started/quick-start.md index eb09f2e..d243b8e 100644 --- a/website/src/content/docs/getting-started/quick-start.md +++ b/website/src/content/docs/getting-started/quick-start.md @@ -46,7 +46,7 @@ pip install nasde-toolkit # Latest unreleased changes from main (for testing PRs and dev builds) uv tool install git+https://github.com/NoesisVision/nasde-toolkit.git --python 3.13 -# Local clone (for developing NASDE itself) +# Local clone (for developing Nasde itself) git clone git@github.com:NoesisVision/nasde-toolkit.git cd nasde-toolkit uv sync @@ -60,7 +60,7 @@ pipx upgrade nasde-toolkit # if installed via pipx pip install --upgrade nasde-toolkit # if installed via pip ``` `nasde` checks PyPI for newer releases on startup and prints a one-line notice on stderr when an upgrade is available (severity-tinted: patch / minor / major). Disable with `NASDE_NO_UPDATE_CHECK=1` or `CI=true`. After installation, only `nasde` appears on PATH. Harbor and Opik are bundled as core dependencies. The reviewer agent spawns your already-installed `claude` or `codex` CLI as a subprocess (not bundled), so it reuses whatever authentication you've set up interactively. Check the installed version with `nasde --version`. @@ -73,14 +73,14 @@ nasde install-skills This copies the bundled `nasde-benchmark-*` skills into `~/.claude/skills/` so they're available in every Claude Code session. Use `--scope project` to install into the current project's `.claude/skills/` instead, or `--force` to overwrite after a `nasde` upgrade. :::note -The authoring helpers are Claude Code skills.
Codex and Gemini users can still run NASDE from the CLI — the skills just speed up *creating* benchmarks; they are not required to *run* them. +The authoring helpers are Claude Code skills. Codex and Gemini users can still run Nasde from the CLI — the skills just speed up *creating* benchmarks; they are not required to *run* them. ::: ## Build your first benchmark from git history Open your own project in Claude Code and say something like: -> *"Create a NASDE benchmark with a single task, based on a recent piece of work from this repo — a commit, a range of commits, or a merged PR."* +> *"Create a Nasde benchmark with a single task, based on a recent piece of work from this repo — a commit, a range of commits, or a merged PR."* Start with **one task**. Point the skill at whatever unit of work feels self-contained in your workflow — a single commit, a range, a merged MR/PR, or an issue that was closed by a set of commits. The `nasde-benchmark-from-history` skill proposes a good candidate, and generates one task directory with `instruction.md`, a Dockerfile, `test.sh`, and a starter `assessment_criteria.md`. You review each file before it's written. diff --git a/website/src/content/docs/getting-started/reading-results.md b/website/src/content/docs/getting-started/reading-results.md index 5d5c617..3ad6bb5 100644 --- a/website/src/content/docs/getting-started/reading-results.md +++ b/website/src/content/docs/getting-started/reading-results.md @@ -7,11 +7,11 @@ Your first `nasde run` finishes and prints a table, then writes a pile of files. 
## What a run looks like -When you start a run, NASDE echoes the configuration so you know exactly what's about to execute — agent, variant, model, attempts, whether Opik and assessment are on: +When you start a run, Nasde echoes the configuration so you know exactly what's about to execute — agent, variant, model, attempts, whether Opik and assessment are on: ![The nasde run startup banner showing agent, variant, model, attempts, and tracking configuration](../../../assets/benchmark/nasde-run-summary.png) -As trials complete, progress streams in; at the end NASDE prints a **per-configuration summary table** — one row per `(agent, model, reasoning effort)` group. Read it like this: +As trials complete, progress streams in; at the end Nasde prints a **per-configuration summary table** — one row per `(agent, model, reasoning effort)` group. Read it like this: - **Trials** — how many times that configuration ran (set by `--attempts` / `-n`). The sample size behind the mean. - **Score** — the normalized quality (0–1) as `mean ±std`. The `±std` is the spread *between attempts* (the agent writes different code each run). A single attempt shows `mean (n=1)` rather than a fake `±0.00`. diff --git a/website/src/content/docs/guides/benchmark-results.mdx b/website/src/content/docs/guides/benchmark-results.mdx index 9f2527a..d54157b 100644 --- a/website/src/content/docs/guides/benchmark-results.mdx +++ b/website/src/content/docs/guides/benchmark-results.mdx @@ -1,6 +1,6 @@ --- title: Benchmark Results -description: Results from the three example benchmarks — refactoring katas, DDD architectural challenges, and the NASDE dev-skill — with the tactical-ddd skill deep dive. +description: Results from the three example benchmarks — refactoring katas, DDD architectural challenges, and the Nasde dev-skill — with the tactical-ddd skill deep dive. --- import { Image } from 'astro:assets'; @@ -40,7 +40,7 @@ Results from the three example benchmarks included in `examples/`. 
All scores ar | codex-vanilla | 9 | 89% | 18.8 | 13.8 | 16.8 | 11.4 | 8.7 | **69.4** | | codex-guided | 8 | 50% | 11.5 | 9.6 | 12.9 | 7.4 | 6.0 | **47.4** | -**Takeaway:** Architectural guidance helps Claude (+3.5) but dramatically hurts Codex (-22.0). The same skill applied to different agents can have opposite effects — this is exactly the kind of insight NASDE is designed to surface. +**Takeaway:** Architectural guidance helps Claude (+3.5) but dramatically hurts Codex (-22.0). The same skill applied to different agents can have opposite effects — this is exactly the kind of insight Nasde is designed to surface. ### Deep dive — tactical-ddd skill: public vs repo-tuned (Claude Code) @@ -83,7 +83,7 @@ Cost doesn't track quality. On weather the top-scoring repo-tuned skill spends * - [Per-dimension significance — bootstrap](https://github.com/NoesisVision/nasde-toolkit/blob/main/examples/ddd-architectural-challenges/SIGNIFICANCE-per-dimension-bootstrap.md) - [Per-dimension significance — Bayesian bootstrap vs bootstrap](https://github.com/NoesisVision/nasde-toolkit/blob/main/examples/ddd-architectural-challenges/SIGNIFICANCE-per-dimension-bayes-vs-bootstrap.md) (the two methods agree on every aggregate verdict) -## UC1: Project-Specific Setup — NASDE Dev Skill +## UC1: Project-Specific Setup — Nasde Dev Skill 1 task: Add multi-attempt support to the nasde-toolkit itself. Claude only (project-specific skill, cross-agent comparison not applicable). 
diff --git a/website/src/content/docs/guides/plugins-and-skills.md b/website/src/content/docs/guides/plugins-and-skills.md index 75a4656..4e3a3d2 100644 --- a/website/src/content/docs/guides/plugins-and-skills.md +++ b/website/src/content/docs/guides/plugins-and-skills.md @@ -39,15 +39,15 @@ The **whole** skill directory (including `references/`) is staged into the sandb ## How skills reach each agent -Each agent family auto-discovers skills from a different place, so NASDE delivers them where the CLI actually looks (you don't manage this — but it's worth knowing where your skill files end up): +Each agent family auto-discovers skills from a different place, so Nasde delivers them where the CLI actually looks (you don't manage this — but it's worth knowing where your skill files end up): - **Claude Code** discovers from the project, so its skills land in `/app/.claude/skills/`. -- **Codex** and **Gemini** auto-discover skills only from a HOME-scoped directory — `$HOME/.agents/skills/` for Codex, `~/.gemini/skills/` for Gemini. NASDE routes Codex/Gemini skills there through the agent's native skill-injection (not into the project directory, where the CLI would never scan them). See [ADR-012](https://github.com/NoesisVision/nasde-toolkit/blob/main/docs/adr/012-native-codex-gemini-skill-injection.md). +- **Codex** and **Gemini** auto-discover skills only from a HOME-scoped directory — `$HOME/.agents/skills/` for Codex, `~/.gemini/skills/` for Gemini. Nasde routes Codex/Gemini skills there through the agent's native skill-injection (not into the project directory, where the CLI would never scan them). See [ADR-012](https://github.com/NoesisVision/nasde-toolkit/blob/main/docs/adr/012-native-codex-gemini-skill-injection.md). This applies to all three skill sources for Codex/Gemini: a variant's `agents_skills/` / `gemini_skills/` snapshot, a `[[skill]]` reference, and a `[nasde.plugin]`'s own skills. 
:::caution[Codex/Gemini skills must start with `---` frontmatter] -Codex's loader is strict: a `SKILL.md` that does **not start** with a `---` YAML frontmatter line is silently rejected and the skill is never registered. A common trap is a leading comment (e.g. ``) *above* the frontmatter — move it below the closing `---`. NASDE warns at run time when a skill's `SKILL.md` doesn't open with `---`, so the gotcha surfaces instead of the skill mysteriously doing nothing. (Claude is more lenient, but starting every `SKILL.md` with frontmatter is the safe habit.) +Codex's loader is strict: a `SKILL.md` that does **not start** with a `---` YAML frontmatter line is silently rejected and the skill is never registered. A common trap is a leading comment (e.g. ``) *above* the frontmatter — move it below the closing `---`. Nasde warns at run time when a skill's `SKILL.md` doesn't open with `---`, so the gotcha surfaces instead of the skill mysteriously doing nothing. (Claude is more lenient, but starting every `SKILL.md` with frontmatter is the safe habit.) ::: ## Scoping a variant to specific tasks (`tasks`) diff --git a/website/src/content/docs/guides/running-benchmarks.md b/website/src/content/docs/guides/running-benchmarks.md index 48a2ca2..ce729fd 100644 --- a/website/src/content/docs/guides/running-benchmarks.md +++ b/website/src/content/docs/guides/running-benchmarks.md @@ -6,7 +6,7 @@ description: The operational side of a run — building from a local repo, scali This guide covers the operational lifecycle of a run, in the order you typically hit it: point it at your code, scale it out, configure the two agents (the one under test and the reviewer), and keep the results. :::tip[Two agents, configured the same way] -A NASDE run involves **two** coding agents: the **agent under test** (the one whose configuration you're measuring) and the **reviewer agent** (the LLM-as-a-Judge that scores the result). 
Both are configurable along the same axes — instructions, skills, MCP servers, model, reasoning effort. The two sections below mirror each other deliberately. +A Nasde run involves **two** coding agents: the **agent under test** (the one whose configuration you're measuring) and the **reviewer agent** (the LLM-as-a-Judge that scores the result). Both are configurable along the same axes — instructions, skills, MCP servers, model, reasoning effort. The two sections below mirror each other deliberately. ::: ## Running on a local repo @@ -19,7 +19,7 @@ git = "../.." ref = "abc1234" ``` -NASDE auto-generates the Docker environment — no custom `Dockerfile` needed. See [`examples/nasde-dev-skill/`](https://github.com/NoesisVision/nasde-toolkit/tree/main/examples/nasde-dev-skill) for a complete example that tests nasde-toolkit itself. The full `[nasde.source]` reference is in [Configuration](/nasde-toolkit/reference/configuration/#local-repo-source-nasdesource). +Nasde auto-generates the Docker environment — no custom `Dockerfile` needed. See [`examples/nasde-dev-skill/`](https://github.com/NoesisVision/nasde-toolkit/tree/main/examples/nasde-dev-skill) for a complete example that tests nasde-toolkit itself. The full `[nasde.source]` reference is in [Configuration](/nasde-toolkit/reference/configuration/#local-repo-source-nasdesource). ## Cloud sandbox providers @@ -166,7 +166,7 @@ By default a run's output lives only in the local, gitignored `jobs/` directory nasde results-export jobs/2026-03-13__14-30-00 --to ~/Dropbox/nasde-results -C my-benchmark ``` -The destination is any path you like — an iCloud or Dropbox folder, an external drive, or a git repo you commit yourself. NASDE just writes files there; it never talks to a cloud provider, so there's nothing to authenticate. Each trial becomes one flat folder `__/` containing: +The destination is any path you like — an iCloud or Dropbox folder, an external drive, or a git repo you commit yourself. 
Nasde just writes files there; it never talks to a cloud provider, so there's nothing to authenticate. Each trial becomes one flat folder `__/` containing: - `metrics.json` — self-contained summary: timing, model, variant, task, reward, reasoning effort, **token usage + USD cost** (see [Token & Cost](/nasde-toolkit/concepts/token-cost/)) - `assessment_eval_*.json` — the reviewer's per-dimension scores and reasoning (one file per repetition) @@ -175,4 +175,4 @@ The destination is any path you like — an iCloud or Dropbox folder, an externa - `changes.patch` — exactly what the agent changed (a code diff, not the multi-GB workspace) - `verifier_stdout.txt`, `reward.txt` — the rough-test output -You can pass several paths at once, mixing whole jobs and individual trials — NASDE figures out which is which. Re-running is safe: it merges (copying any evaluations added since the last export) and never re-touches the immutable trajectory or patch. +You can pass several paths at once, mixing whole jobs and individual trials — Nasde figures out which is which. Re-running is safe: it merges (copying any evaluations added since the last export) and never re-touches the immutable trajectory or patch. diff --git a/website/src/content/docs/guides/troubleshooting.md b/website/src/content/docs/guides/troubleshooting.md index a2553b5..7ff815f 100644 --- a/website/src/content/docs/guides/troubleshooting.md +++ b/website/src/content/docs/guides/troubleshooting.md @@ -11,7 +11,7 @@ Most first-run problems fall into a handful of buckets. Here's how to recognize The sandbox needs Docker running. Start Docker Desktop (or your daemon) and re-run. If you'd rather not run containers locally, use a [cloud sandbox provider](/nasde-toolkit/guides/running-benchmarks/#cloud-sandbox-providers) with `--harbor-env daytona`. ### Authentication errors / empty API key -NASDE uses whatever you're already logged into. 
If a run fails with an auth or empty-key error: +Nasde uses whatever you're already logged into. If a run fails with an auth or empty-key error: - **Claude**: `ANTHROPIC_API_KEY`, or `CLAUDE_CODE_OAUTH_TOKEN` from `claude` login. - **Codex**: `CODEX_API_KEY`, or `codex login` for the ChatGPT subscription (an API key always wins over OAuth when both are set). - **Gemini**: `GEMINI_API_KEY` / `GOOGLE_API_KEY`, or `gemini login`. @@ -37,7 +37,7 @@ Harbor reinstalls the agent CLI per trial, which can trip provider rate limits a A single trial (one agent solving one task) is typically a few minutes to ~30 minutes, dominated by how long the agent takes on the task (set by the task's `[agent] timeout_sec`, default 1800s) plus the reviewer pass. Total wall-clock multiplies by **variants × tasks × attempts** (`--attempts` / `-n`, the independent agent runs per task) — and each trial is then reviewed `--eval-repetitions` times. So start with one task and one variant. ### How much does it cost? -Each trial spends real tokens on the agent; each trial is then reviewed several times (`--eval-repetitions`, default 3), so the reviewer cost multiplies too. NASDE records the exact token and USD cost per trial — see [Token & Cost](/nasde-toolkit/concepts/token-cost/). On a Claude Max or ChatGPT Plus subscription, casual benchmarking is covered by your plan; heavy parallel runs may hit subscription windows. +Each trial spends real tokens on the agent; each trial is then reviewed several times (`--eval-repetitions`, default 3), so the reviewer cost multiplies too. Nasde records the exact token and USD cost per trial — see [Token & Cost](/nasde-toolkit/concepts/token-cost/). On a Claude Max or ChatGPT Plus subscription, casual benchmarking is covered by your plan; heavy parallel runs may hit subscription windows. ### How many attempts do I need? 
Two different knobs, two different noise sources: @@ -51,11 +51,11 @@ If two configs differ by less than their combined spread, run more attempts befo **Do I need an API key, or is a subscription enough?** A subscription is enough to get going — runs reuse your `claude` / `codex` / `gemini` CLI login. API keys work too when you have them. -**Does NASDE run the agents interactively?** +**Does Nasde run the agents interactively?** No — it drives them **non-interactively** (`claude -p`, `codex exec`, the Gemini CLI equivalent), scripting them rather than chatting. An interactive mode is **planned** but not available yet. **Does running it programmatically affect my Claude plan?** -NASDE's non-interactive use counts as *programmatic* use of Claude. Anthropic has announced that **from June 15, 2026, paid Claude plans include a dedicated monthly credit for programmatic usage** (covering `claude -p`, the Agent SDK, and Claude Code GitHub Actions), so running NASDE on a paid plan is supported. Check [Anthropic's current terms](https://www.anthropic.com/) for the credit and limits on your plan. +Nasde's non-interactive use counts as *programmatic* use of Claude. Anthropic has announced that **from June 15, 2026, paid Claude plans include a dedicated monthly credit for programmatic usage** (covering `claude -p`, the Agent SDK, and Claude Code GitHub Actions), so running Nasde on a paid plan is supported. Check [Anthropic's current terms](https://www.anthropic.com/) for the credit and limits on your plan. **Can I run this without Docker?** Yes — point `--harbor-env` at a [cloud sandbox provider](/nasde-toolkit/guides/running-benchmarks/#cloud-sandbox-providers). The reviewer (Stage 2) always runs locally on the host regardless. @@ -67,4 +67,4 @@ No. Trials run in a local (or your-cloud) sandbox; results stay in `jobs/`. Only Yes — that's the norm. The agent under test is set per variant; the reviewer is set under `[evaluation]`. 
See [Running & Configuring Runs](/nasde-toolkit/guides/running-benchmarks/). **I edited my rubric — do old and new scores mix?** -No. NASDE fingerprints the rubric, so changing a dimension, its `max_score`, or its description starts a fresh scoring cluster. Just re-run `nasde eval`. +No. Nasde fingerprints the rubric, so changing a dimension, its `max_score`, or its description starts a fresh scoring cluster. Just re-run `nasde eval`. diff --git a/website/src/content/docs/guides/use-cases.md b/website/src/content/docs/guides/use-cases.md index 14be02f..3bdad51 100644 --- a/website/src/content/docs/guides/use-cases.md +++ b/website/src/content/docs/guides/use-cases.md @@ -13,7 +13,7 @@ description: Three end-to-end walkthroughs — evaluating your own agent config, You've tuned how Claude Code operates in your codebase, but you have no way to measure whether the full configuration actually helps. Built-in eval tools like Skill Creator can test individual skills in isolation, but they can't tell you whether your skills work well *together*, how your `CLAUDE.md` interacts with MCP server configurations, or how the same task set performs across different coding agents. Skill changes are a leap of faith — maybe the new prompt improves refactoring but breaks the agent's ability to write tests. Without structured evaluation of the complete configuration, you can't tell what's improving and what's regressing. -### What NASDE enables +### What Nasde enables You turn real problems from your team's history into repeatable benchmark tasks, then run different agent configurations against them — not just individual skills, but the full combination of `CLAUDE.md`, skills, and MCP servers. You can also compare results across different coding agents (Claude Code, Codex, Cursor, etc.) on the same task set. Results are multi-dimensional scores — not just "did it work?" but "how well did it work across code quality, architecture, testing, and whatever else matters to you." 
Once the task set is established, it becomes a regression suite: re-run it every time the configuration changes. @@ -132,12 +132,12 @@ The task files are committed to the benchmark project repo — they're stable, v ### Current constraints -- NASDE supports local git repos and public remote repos. Private remote repos require local clones (not a practical limitation — you already have them). +- Nasde supports local git repos and public remote repos. Private remote repos require local clones (not a practical limitation — you already have them). - Task creation from git history is manual when using `nasde-benchmark-creator` alone. The **`nasde-benchmark-from-history`** skill automates this — see below. ### Skill: nasde-benchmark-from-history -NASDE includes a dedicated skill that accelerates Phase 1 by mining git history for benchmark candidates. Instead of manually browsing PRs and writing task files from scratch, you point the skill at a commit range and it does the heavy lifting. +Nasde includes a dedicated skill that accelerates Phase 1 by mining git history for benchmark candidates. Instead of manually browsing PRs and writing task files from scratch, you point the skill at a commit range and it does the heavy lifting. **How to use it:** Open your repository in Claude Code and describe what you want — e.g., *"create benchmark tasks from the last 20 commits on main"* or *"turn PRs #45, #52, and #61 into evaluation tasks."* The skill activates automatically. @@ -149,7 +149,7 @@ NASDE includes a dedicated skill that accelerates Phase 1 by mining git history **What it won't do:** It doesn't generate instructions that leak the actual solution. The instruction describes the *problem to solve* (derived from the commit message and PR description), not the *implementation* (the diff). The agent must arrive at a solution independently. -**Relationship to other skills:** `nasde-benchmark-from-history` is an alternative entry point into the benchmark creation workflow. 
Where `nasde-benchmark-creator` starts from scratch ("what do you want to evaluate?"), `nasde-benchmark-from-history` starts from evidence ("here's what your team already solved"). Both produce the same NASDE task structure. +**Relationship to other skills:** `nasde-benchmark-from-history` is an alternative entry point into the benchmark creation workflow. Where `nasde-benchmark-creator` starts from scratch ("what do you want to evaluate?"), `nasde-benchmark-from-history` starts from evidence ("here's what your team already solved"). Both produce the same Nasde task structure. See the full skill reference: [`.claude/skills/nasde-benchmark-from-history/SKILL.md`](https://github.com/NoesisVision/nasde-toolkit/blob/main/.claude/skills/nasde-benchmark-from-history/SKILL.md) @@ -165,7 +165,7 @@ See the full skill reference: [`.claude/skills/nasde-benchmark-from-history/SKIL You've tested your skill on a handful of repos and it works. But you have no structured way to validate that it generalizes. Does it handle Python as well as TypeScript? Large monorepos as well as small libraries? Projects with extensive tests as well as those with none? Without a diverse, repeatable test suite, you're shipping based on anecdotes. -### What NASDE enables +### What Nasde enables A benchmark that spans multiple repositories, languages, and problem types. Define the test suite once, re-run it whenever the skill changes. Each run gives you per-task, per-dimension scores — so you can see exactly where the skill shines and where it struggles. @@ -261,7 +261,7 @@ As you discover edge cases (the skill fails on monorepos, or struggles with code ### Skill: nasde-benchmark-from-public-repos -NASDE includes a dedicated skill for curating diverse benchmark suites from public repositories. Instead of manually searching GitHub and scaffolding Dockerfiles for each language, you describe your skill and the tool guides the curation process. 
+Nasde includes a dedicated skill for curating diverse benchmark suites from public repositories. Instead of manually searching GitHub and scaffolding Dockerfiles for each language, you describe your skill and the tool guides the curation process. **How to use it:** Open your benchmark project in Claude Code and describe the skill you're building — e.g., *"I'm building a refactoring skill that should work across Python, TypeScript, Go, and Rust."* The skill activates automatically. @@ -289,7 +289,7 @@ A plugin/skill author whose plugin under test bundles skills **and** an MCP serv Before `[nasde.plugin]`, exercising a plugin meant paying a triple tax: vendor a frozen snapshot of the entire plugin tree into the benchmark, hand-write a Dockerfile `COPY`, hand-write `[environment.mcp_servers]` with an env-export wrapper, **and** copy the plugin's skills into each variant. The snapshot drifted from the live plugin and had to be refreshed by a documented manual procedure. A benchmark testing just one skill still had to copy that skill (and its `references/`) into `variants//skills/`. -### What NASDE enables +### What Nasde enables **Whole plugin, one declaration.** In `task.toml`: diff --git a/website/src/content/docs/reference/authentication.md b/website/src/content/docs/reference/authentication.md index 94a1b46..5600846 100644 --- a/website/src/content/docs/reference/authentication.md +++ b/website/src/content/docs/reference/authentication.md @@ -1,12 +1,12 @@ --- title: Authentication & Opik -description: How NASDE auto-detects credentials per agent — Claude Code, OpenAI Codex, Gemini CLI — plus Opik tracing setup and how to verify results landed. +description: How Nasde auto-detects credentials per agent — Claude Code, OpenAI Codex, Gemini CLI — plus Opik tracing setup and how to verify results landed. --- -NASDE auto-detects the required credentials based on the variant's agent type. +Nasde auto-detects the required credentials based on the variant's agent type. 
 :::note[Programmatic use]
-NASDE drives the agents **non-interactively** (`claude -p`, `codex exec`, and the Gemini CLI equivalent), so it counts as *programmatic* use of those tools. For Claude specifically, Anthropic has announced that **from June 15, 2026, paid plans include a dedicated monthly credit for programmatic usage** — covering `claude -p`, the Claude Agent SDK, and Claude Code GitHub Actions. Running NASDE on a paid Claude subscription is therefore supported; check [Anthropic's current terms](https://www.anthropic.com/) for the credit and limits on your plan.
+Nasde drives the agents **non-interactively** (`claude -p`, `codex exec`, and the Gemini CLI equivalent), so it counts as *programmatic* use of those tools. For Claude specifically, Anthropic has announced that **from June 15, 2026, paid plans include a dedicated monthly credit for programmatic usage** — covering `claude -p`, the Claude Agent SDK, and Claude Code GitHub Actions. Running Nasde on a paid Claude subscription is therefore supported; check [Anthropic's current terms](https://www.anthropic.com/) for the credit and limits on your plan.
 :::

 ## Claude Code
@@ -36,7 +36,7 @@ source scripts/export_codex_oauth_token.sh # validate tokens are present
 uv run nasde run --variant codex-vanilla -C my-benchmark
 ```

-When no API key is set, NASDE auto-detects the presence of `~/.codex/auth.json` (created by `codex login`) and opts into uploading it to the sandbox (it sets `CODEX_FORCE_AUTH_JSON=true`; Harbor does the actual upload). No env vars needed.
+When no API key is set, Nasde auto-detects the presence of `~/.codex/auth.json` (created by `codex login`) and opts into uploading it to the sandbox (it sets `CODEX_FORCE_AUTH_JSON=true`; Harbor does the actual upload). No env vars needed.

 **Option 2: API key** — billed per-token through your OpenAI Platform account.

@@ -65,7 +65,7 @@ export GOOGLE_API_KEY=your-key
 export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
 ```

-These are the env vars NASDE checks for the API-key path (alongside `GEMINI_API_KEY`).
+These are the env vars Nasde checks for the API-key path (alongside `GEMINI_API_KEY`).

 **Option 3: OAuth (Google account)** — uses your Gemini subscription credits.

@@ -75,7 +75,7 @@ source scripts/export_gemini_oauth_token.sh # validate tokens are present
 uv run nasde run --variant gemini-baseline -C my-benchmark
 ```

-NASDE auto-detects `~/.gemini/oauth_creds.json` and injects the credentials into the sandbox. No env vars needed.
+Nasde auto-detects `~/.gemini/oauth_creds.json` and injects the credentials into the sandbox. No env vars needed.

 API key env vars (`GEMINI_API_KEY`, `GOOGLE_API_KEY`, `GOOGLE_APPLICATION_CREDENTIALS`) always take priority over OAuth when present.

diff --git a/website/src/content/docs/reference/configuration.md b/website/src/content/docs/reference/configuration.md
index a1c9a4b..c1433f9 100644
--- a/website/src/content/docs/reference/configuration.md
+++ b/website/src/content/docs/reference/configuration.md
@@ -124,7 +124,7 @@ If no `harbor_config.json` exists, one is auto-generated from the agent type.

 How hard the model thinks is a configuration you should set deliberately, not leave to chance. Each agent family ships a *different* default level, and those defaults are not comparable — Codex's `high` is the top of its three levels, while Claude's `high` is only the middle of five (`xhigh` and `max` sit above it). Comparing two agents on their respective defaults silently compares different thinking budgets.

-Set the effort explicitly with the optional `reasoning_effort` field in `variant.toml`, or override it for a single run with `nasde run --effort`. Priority is **`--effort` > `variant.toml reasoning_effort` > Harbor's family default** (left unset means NASDE passes nothing and the family default applies). Typical levels (for reference — the exact set differs per model and changes over time): Claude `low`/`medium`/`high`/`xhigh`/`max`, Codex `none`/`minimal`/`low`/`medium`/`high`/`xhigh`, Gemini `minimal`/`low`/`medium`/`high`. NASDE does **not** police the value — it passes whatever you set straight to the agent, which is the source of truth and rejects an unknown level itself; this avoids a stale built-in list wrongly blocking a newly-valid level.
+Set the effort explicitly with the optional `reasoning_effort` field in `variant.toml`, or override it for a single run with `nasde run --effort`. Priority is **`--effort` > `variant.toml reasoning_effort` > Harbor's family default** (left unset means Nasde passes nothing and the family default applies). Typical levels (for reference — the exact set differs per model and changes over time): Claude `low`/`medium`/`high`/`xhigh`/`max`, Codex `none`/`minimal`/`low`/`medium`/`high`/`xhigh`, Gemini `minimal`/`low`/`medium`/`high`. Nasde does **not** police the value — it passes whatever you set straight to the agent, which is the source of truth and rejects an unknown level itself; this avoids a stale built-in list wrongly blocking a newly-valid level.

 The effort you set is stamped onto each trial (`reasoning_effort` in `assessment_summary.json` and `metrics.json`), and the `nasde run` cost table groups by `(agent, model, effort)` — a different effort is treated as a different configuration and never averaged in with another.

@@ -145,11 +145,11 @@ ref = "abc1234"

 ## `task.toml`

-A single task config file, shared with Harbor — it reads its standard sections (`[task]`, `[agent]`, `[environment]`, `[verifier]`, `[metadata]`) directly. NASDE-specific fields live under `[nasde.*]` and are ignored by Harbor.
+A single task config file, shared with Harbor — it reads its standard sections (`[task]`, `[agent]`, `[environment]`, `[verifier]`, `[metadata]`) directly. Nasde-specific fields live under `[nasde.*]` and are ignored by Harbor.

 ### Local repo source (`[nasde.source]`)

-Build benchmarks from local (private) repositories by adding `[nasde.source]` to `task.toml` — NASDE auto-generates the Docker environment, no custom `Dockerfile` needed:
+Build benchmarks from local (private) repositories by adding `[nasde.source]` to `task.toml` — Nasde auto-generates the Docker environment, no custom `Dockerfile` needed:

 ```toml
 [nasde.source]
diff --git a/website/src/styles/custom.css b/website/src/styles/custom.css
index 77c9526..a5e3b58 100644
--- a/website/src/styles/custom.css
+++ b/website/src/styles/custom.css
@@ -1,4 +1,4 @@
-/* NASDE Toolkit docs — Noesis teal brand theme.
+/* Nasde Toolkit docs — Noesis teal brand theme.
 Replaces Starlight's default purple accent with the teal/cyan from the Noesis logo, keeping the docs visually consistent with noesis-docs. */
@@ -46,7 +46,7 @@
 }
 /* Hero logo on the splash landing — round the corners and add a soft card
- so the light-background NASDE artwork doesn't clash with dark mode. */
+ so the light-background Nasde artwork doesn't clash with dark mode. */
 .hero > img,
 .hero .hero-html img {
 border-radius: 1rem;