Merged
4 changes: 2 additions & 2 deletions website/src/content/docs/concepts/calibration.md
@@ -54,6 +54,6 @@ This pulls your comments back. The `nasde-benchmark-calibration` skill then line

## Things worth knowing

-- **The platform is auto-detected from the repo URL** — `github.com` uses the `gh` CLI, GitLab uses `glab`. You need the matching CLI installed and logged in (`gh auth login` / `glab auth login`); NASDE never handles your token. A self-hosted GitLab host that isn't obviously "gitlab" can be forced with `[calibration] platform = "gitlab"`.
+- **The platform is auto-detected from the repo URL** — `github.com` uses the `gh` CLI, GitLab uses `glab`. You need the matching CLI installed and logged in (`gh auth login` / `glab auth login`); Nasde never handles your token. A self-hosted GitLab host that isn't obviously "gitlab" can be forced with `[calibration] platform = "gitlab"`.
- **Re-running is idempotent** — a trial whose **open** PR/MR already exists is skipped, so you can publish more trials into the same sink without duplicates. Once you close a calibration round, the same trials can be re-published into a fresh round (a closed PR no longer blocks).
-- **The sink repo must already exist** — NASDE pushes branches and opens PRs but does not create repositories. One base branch is seeded per `(repo, commit)` and shared by all that source's trials.
+- **The sink repo must already exist** — Nasde pushes branches and opens PRs but does not create repositories. One base branch is seeded per `(repo, commit)` and shared by all that source's trials.
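
The auto-detection rule in the first bullet can be sketched in Python. This is a hedged illustration, not Nasde's actual code: `detect_platform`, `CLI_FOR_PLATFORM`, and the substring check for "gitlab" in the host name are assumptions made for the example. Only the mapping (`github.com` uses `gh`, GitLab uses `glab`) and the `platform = "gitlab"` override come from the text.

```python
from typing import Optional
from urllib.parse import urlparse

# Mapping stated in the docs: GitHub is driven through `gh`, GitLab through `glab`.
CLI_FOR_PLATFORM = {"github": "gh", "gitlab": "glab"}


def _host(repo_url: str) -> str:
    """Extract the host from an https:// URL or an scp-style git@host:path URL."""
    if "://" in repo_url:
        return (urlparse(repo_url).hostname or "").lower()
    rest = repo_url.split("@", 1)[-1]          # drop a leading "git@" if present
    return rest.split(":", 1)[0].split("/", 1)[0].lower()


def detect_platform(repo_url: str, override: Optional[str] = None) -> str:
    """Hypothetical helper: return 'github' or 'gitlab' for a sink repo URL.

    `override` stands in for the `[calibration] platform = "gitlab"` setting.
    """
    if override is not None:
        if override not in CLI_FOR_PLATFORM:
            raise ValueError(f"unknown platform: {override!r}")
        return override
    host = _host(repo_url)
    if host == "github.com":
        return "github"
    if "gitlab" in host:                       # assumption: what "obviously gitlab" means
        return "gitlab"
    raise ValueError(f"cannot detect platform for host {host!r}; set it explicitly")
```

A self-hosted host such as `code.example.org` matches neither rule, which is the case where the explicit override is needed.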
14 changes: 7 additions & 7 deletions website/src/content/docs/concepts/how-it-works.md
@@ -1,9 +1,9 @@
---
title: How It Works
-description: The two kinds of scoring in NASDE and the full evaluation pipeline, stage by stage — from task definition through the reviewer agent to logged results.
+description: The two kinds of scoring in Nasde and the full evaluation pipeline, stage by stage — from task definition through the reviewer agent to logged results.
---

-The question that trips most people up is *how* NASDE scores a run. The short answer: there are **two independent kinds of scoring**, and they answer different questions. The rest of this page explains both, then walks the whole pipeline stage by stage.
+The question that trips most people up is *how* Nasde scores a run. The short answer: there are **two independent kinds of scoring**, and they answer different questions. The rest of this page explains both, then walks the whole pipeline stage by stage.

## Two independent kinds of scoring

@@ -26,7 +26,7 @@ flowchart LR
style E fill:#c0392b,color:#fff
```

-One `nasde run` command drives this whole chain. Stage 1 (the agent doing the work in a sandbox) comes from [Harbor](https://www.harborframework.com/); the optional tracking stage at the end uses [Opik](https://github.com/comet-ml/opik). NASDE is the glue that connects them, adds the reviewer stage in between, and gives you the CLI, the benchmark project layout, and the [authoring skills](/nasde-toolkit/getting-started/quick-start/).
+One `nasde run` command drives this whole chain. Stage 1 (the agent doing the work in a sandbox) comes from [Harbor](https://www.harborframework.com/); the optional tracking stage at the end uses [Opik](https://github.com/comet-ml/opik). Nasde is the glue that connects them, adds the reviewer stage in between, and gives you the CLI, the benchmark project layout, and the [authoring skills](/nasde-toolkit/getting-started/quick-start/).

## 1. The task — what you define

@@ -41,12 +41,12 @@ You don't have to write all of these by hand — the [authoring skills](/nasde-t

## 2. The agent solves it — in an isolated sandbox

-NASDE hands the task to the coding agent under test (Claude Code, Codex, or Gemini CLI) running inside a fresh, isolated container — locally on Docker by default, or on a [cloud sandbox provider](/nasde-toolkit/guides/running-benchmarks/) for horizontal scale. Isolation matters for two reasons:
+Nasde hands the task to the coding agent under test (Claude Code, Codex, or Gemini CLI) running inside a fresh, isolated container — locally on Docker by default, or on a [cloud sandbox provider](/nasde-toolkit/guides/running-benchmarks/) for horizontal scale. Isolation matters for two reasons:

- **Safety** — the agent can `rm -rf`, install arbitrary packages, or loop your test suite without touching your machine.
- **Fairness** — every trial starts from the same clean state, so a score difference reflects the *configuration* (the `CLAUDE.md`, the skill, the MCP server, the model, the reasoning effort), not leftover state from a previous run.

-What varies between runs is exactly one thing: the **variant** — the agent configuration you're testing. That's how NASDE turns "did my skill help?" into a controlled experiment. See [Configuration → variant.toml](/nasde-toolkit/reference/configuration/) for what a variant can change.
+What varies between runs is exactly one thing: the **variant** — the agent configuration you're testing. That's how Nasde turns "did my skill help?" into a controlled experiment. See [Configuration → variant.toml](/nasde-toolkit/reference/configuration/) for what a variant can change.

## 3. Rough tests — a deterministic pass/fail

@@ -60,7 +60,7 @@ This is the standard verifier pattern used by [Harbor](https://www.harborframewo

## 4. The reviewer agent — reading the actual work

-This is the stage NASDE adds, and it's why the pipeline exists. A **second coding agent** (`claude` or `codex`) is pointed at the produced workspace and navigates it with real tools — `Read`, `Glob`, `Grep`, optionally MCP analysis servers — reading only what each dimension needs rather than stuffing the whole repo into a prompt. That's what keeps the review tractable on large codebases. It can also read the agent's full **trajectory** (tool calls, tokens, timing), so your criteria can judge the *process*, not just the final files.
+This is the stage Nasde adds, and it's why the pipeline exists. A **second coding agent** (`claude` or `codex`) is pointed at the produced workspace and navigates it with real tools — `Read`, `Glob`, `Grep`, optionally MCP analysis servers — reading only what each dimension needs rather than stuffing the whole repo into a prompt. That's what keeps the review tractable on large codebases. It can also read the agent's full **trajectory** (tool calls, tokens, timing), so your criteria can judge the *process*, not just the final files.

The reviewer's reference point is **two files you write** when creating the benchmark:

@@ -75,7 +75,7 @@ You decide how strict the criteria are — spell out a ground-truth structure, e

The reviewer scores each dimension on whatever scale you chose, with written reasoning for each. One local `nasde run` handles all of it — no separate LLM-as-a-judge stack required.

-**The reviewer runs more than once.** An LLM judge is non-deterministic — score the same workspace twice and you can get 0.61 then 0.71. So by default NASDE evaluates each trial **3 times** (`eval_repetitions`, set in `nasde.toml [evaluation]` or with `--eval-repetitions`) and reports the **mean** rather than any single run. Each evaluation is kept as its own `assessment_eval_<N>.json`; a derived `assessment_summary.json` holds the per-dimension mean, standard deviation, and range. Means are computed only within a single judge model **and a single rubric** — a Claude review and a Codex review are different benchmarks, and so is a review run after you edited `assessment_dimensions.json` (the rubric is fingerprinted, so changing a dimension, its `max_score`, or even its description starts a fresh cluster rather than silently mixing incomparable scores). After editing the rubric, just re-run `nasde eval` — the new evaluations form their own cluster automatically.
+**The reviewer runs more than once.** An LLM judge is non-deterministic — score the same workspace twice and you can get 0.61 then 0.71. So by default Nasde evaluates each trial **3 times** (`eval_repetitions`, set in `nasde.toml [evaluation]` or with `--eval-repetitions`) and reports the **mean** rather than any single run. Each evaluation is kept as its own `assessment_eval_<N>.json`; a derived `assessment_summary.json` holds the per-dimension mean, standard deviation, and range. Means are computed only within a single judge model **and a single rubric** — a Claude review and a Codex review are different benchmarks, and so is a review run after you edited `assessment_dimensions.json` (the rubric is fingerprinted, so changing a dimension, its `max_score`, or even its description starts a fresh cluster rather than silently mixing incomparable scores). After editing the rubric, just re-run `nasde eval` — the new evaluations form their own cluster automatically.
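
The clustering rule in this paragraph can be sketched in Python as a rough illustration. Everything here is an assumption made for the example: the record fields (`judge`, `rubric_fingerprint`, `scores`), the SHA-256 fingerprint over canonical JSON, and the use of the sample standard deviation. The real `assessment_eval_<N>.json` and `assessment_summary.json` schemas are not shown in this diff and may differ.

```python
import hashlib
import json
from collections import defaultdict
from statistics import mean, stdev


def rubric_fingerprint(dimensions: list) -> str:
    """Hypothetical fingerprint: hash of the rubric in a canonical JSON form.

    Any change to a dimension (name, max_score, description) changes the hash.
    """
    canonical = json.dumps(dimensions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def summarize(evaluations: list) -> dict:
    """Aggregate repeated evaluations per (judge, rubric fingerprint) cluster.

    Each evaluation is assumed to look like:
        {"judge": "claude", "rubric_fingerprint": "...", "scores": {"clarity": 0.61}}
    Scores from different judges or different rubrics are never mixed.
    """
    clusters = defaultdict(lambda: defaultdict(list))
    for ev in evaluations:
        key = (ev["judge"], ev["rubric_fingerprint"])
        for dimension, score in ev["scores"].items():
            clusters[key][dimension].append(score)

    summary = {}
    for key, dimensions in clusters.items():
        summary[key] = {
            dimension: {
                "n": len(scores),
                "mean": mean(scores),
                # Sample standard deviation needs at least two scores.
                "stdev": stdev(scores) if len(scores) > 1 else 0.0,
                "min": min(scores),
                "max": max(scores),
            }
            for dimension, scores in dimensions.items()
        }
    return summary
```

Keying the aggregation on the pair `(judge, rubric_fingerprint)` is what keeps a Claude review and a Codex review, or reviews made before and after a rubric edit, in separate clusters.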

When the judge's scores feel off, you can align it with your own judgment — see [Calibrating the Rubric](/nasde-toolkit/concepts/calibration/).

8 changes: 4 additions & 4 deletions website/src/content/docs/concepts/key-terms.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
---
title: Key Terms
-description: A glossary of the NASDE vocabulary — variant, trial, job, rubric, dimension, reviewer, trajectory, and the tools it builds on.
+description: A glossary of the Nasde vocabulary — variant, trial, job, rubric, dimension, reviewer, trajectory, and the tools it builds on.
---

-NASDE has its own vocabulary. If a word in the docs is unfamiliar, it's probably here.
+Nasde has its own vocabulary. If a word in the docs is unfamiliar, it's probably here.

## The benchmark you author

@@ -50,10 +50,10 @@ NASDE has its own vocabulary. If a word in the docs is unfamiliar, it's probably
**Sandbox**
: The isolated container the agent works in. It can't touch your machine, and every trial starts from the same clean state.

-## The tools NASDE builds on
+## The tools Nasde builds on

**Harbor**
-: The framework that runs the agent in a sandbox (Stage 1). NASDE uses its Python API directly. [harborframework.com](https://www.harborframework.com/)
+: The framework that runs the agent in a sandbox (Stage 1). Nasde uses its Python API directly. [harborframework.com](https://www.harborframework.com/)

**Opik**
: The optional experiment tracker scores flow to with `--with-opik`. [Opik by Comet](https://github.com/comet-ml/opik)
2 changes: 1 addition & 1 deletion website/src/content/docs/concepts/real-task-example.md
@@ -58,6 +58,6 @@ A separate study took a public DDD skill (the `tactical-ddd` skill from `ntcodin
## More benchmarks in the repo

- **Refactoring katas (Java + Python)** — four classic refactorings scored on behavior preservation, clarity, technique, scope discipline. *Takeaway:* a candidate "refactoring skill" didn't move the score — shipping it would have been based on vibes.
-- **Project-specific skill validation (NASDE's own repo)** — one task pulled from NASDE's git history; four skill combinations tested. *Takeaway:* the testing-discipline skill alone raised pass rate from 67% → 100%; the "full-stack, everything-on" variant scored *worse* than vanilla.
+- **Project-specific skill validation (Nasde's own repo)** — one task pulled from Nasde's git history; four skill combinations tested. *Takeaway:* the testing-discipline skill alone raised pass rate from 67% → 100%; the "full-stack, everything-on" variant scored *worse* than vanilla.

See **[Benchmark Results](/nasde-toolkit/guides/benchmark-results/)** for the full tables and methodology, and **[Use Cases](/nasde-toolkit/guides/use-cases/)** for the end-to-end walkthrough of building a benchmark like these yourself.