NoesisVision · szjanikowski · Jun 22, 2026 · Jun 22, 2026
@@ -54,6 +54,6 @@ This pulls your comments back. The `nasde-benchmark-calibration` skill then line
 
 ## Things worth knowing
 
-- **The platform is auto-detected from the repo URL** — `github.com` uses the `gh` CLI, GitLab uses `glab`. You need the matching CLI installed and logged in (`gh auth login` / `glab auth login`); NASDE never handles your token. A self-hosted GitLab host that isn't obviously "gitlab" can be forced with `[calibration] platform = "gitlab"`.
+- **The platform is auto-detected from the repo URL** — `github.com` uses the `gh` CLI, GitLab uses `glab`. You need the matching CLI installed and logged in (`gh auth login` / `glab auth login`); Nasde never handles your token. A self-hosted GitLab host that isn't obviously "gitlab" can be forced with `[calibration] platform = "gitlab"`.
 - **Re-running is idempotent** — a trial whose **open** PR/MR already exists is skipped, so you can publish more trials into the same sink without duplicates. Once you close a calibration round, the same trials can be re-published into a fresh round (a closed PR no longer blocks).
-- **The sink repo must already exist** — NASDE pushes branches and opens PRs but does not create repositories. One base branch is seeded per `(repo, commit)` and shared by all that source's trials.
+- **The sink repo must already exist** — Nasde pushes branches and opens PRs but does not create repositories. One base branch is seeded per `(repo, commit)` and shared by all that source's trials.
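
Reviewer note: the auto-detection rule in the first bullet can be sketched roughly as below. This is an illustration only, not Nasde's actual implementation; the function name `detect_platform`, the `override` parameter (standing in for `[calibration] platform`), and the "host contains `gitlab`" heuristic are assumptions made for the example.

```python
from urllib.parse import urlparse

# Platform name -> CLI that must be installed and logged in (from the docs above).
CLI_FOR_PLATFORM = {"github": "gh", "gitlab": "glab"}


def _host(repo_url):
    """Extract the host from an https:// URL or an scp-like git@host:path URL."""
    if "://" in repo_url:
        return (urlparse(repo_url).hostname or "").lower()
    # scp-like form, e.g. git@github.com:org/repo.git
    return repo_url.split("@")[-1].split(":")[0].lower()


def detect_platform(repo_url, override=None):
    """Hypothetical sketch: pick 'github' or 'gitlab' for a sink repo URL.

    `override` plays the role of `[calibration] platform = "gitlab"`.
    """
    if override is not None:
        if override not in CLI_FOR_PLATFORM:
            raise ValueError(f"unknown platform: {override!r}")
        return override
    host = _host(repo_url)
    if host == "github.com":
        return "github"
    if "gitlab" in host:  # assumed heuristic for "obviously gitlab"
        return "gitlab"
    raise ValueError(f"cannot detect platform for {host!r}; set it explicitly")


platform = detect_platform("https://git.example.com/team/sink.git", override="gitlab")
print(platform, "->", CLI_FOR_PLATFORM[platform])
```

A self-hosted host such as the hypothetical `git.example.com` has no "gitlab" in its name, which is the case the explicit setting exists for.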
@@ -1,9 +1,9 @@
 ---
 title: How It Works
-description: The two kinds of scoring in NASDE and the full evaluation pipeline, stage by stage — from task definition through the reviewer agent to logged results.
+description: The two kinds of scoring in Nasde and the full evaluation pipeline, stage by stage — from task definition through the reviewer agent to logged results.
 ---
 
-The question that trips most people up is *how* NASDE scores a run. The short answer: there are **two independent kinds of scoring**, and they answer different questions. The rest of this page explains both, then walks the whole pipeline stage by stage.
+The question that trips most people up is *how* Nasde scores a run. The short answer: there are **two independent kinds of scoring**, and they answer different questions. The rest of this page explains both, then walks the whole pipeline stage by stage.
 
 ## Two independent kinds of scoring
 
@@ -26,7 +26,7 @@ flowchart LR
     style E fill:#c0392b,color:#fff
 ```
 
-One `nasde run` command drives this whole chain. Stage 1 (the agent doing the work in a sandbox) comes from [Harbor](https://www.harborframework.com/); the optional tracking stage at the end uses [Opik](https://github.com/comet-ml/opik). NASDE is the glue that connects them, adds the reviewer stage in between, and gives you the CLI, the benchmark project layout, and the [authoring skills](/nasde-toolkit/getting-started/quick-start/).
+One `nasde run` command drives this whole chain. Stage 1 (the agent doing the work in a sandbox) comes from [Harbor](https://www.harborframework.com/); the optional tracking stage at the end uses [Opik](https://github.com/comet-ml/opik). Nasde is the glue that connects them, adds the reviewer stage in between, and gives you the CLI, the benchmark project layout, and the [authoring skills](/nasde-toolkit/getting-started/quick-start/).
 
 ## 1. The task — what you define
 
@@ -41,12 +41,12 @@ You don't have to write all of these by hand — the [authoring skills](/nasde-t
 
 ## 2. The agent solves it — in an isolated sandbox
 
-NASDE hands the task to the coding agent under test (Claude Code, Codex, or Gemini CLI) running inside a fresh, isolated container — locally on Docker by default, or on a [cloud sandbox provider](/nasde-toolkit/guides/running-benchmarks/) for horizontal scale. Isolation matters for two reasons:
+Nasde hands the task to the coding agent under test (Claude Code, Codex, or Gemini CLI) running inside a fresh, isolated container — locally on Docker by default, or on a [cloud sandbox provider](/nasde-toolkit/guides/running-benchmarks/) for horizontal scale. Isolation matters for two reasons:
 
 - **Safety** — the agent can `rm -rf`, install arbitrary packages, or loop your test suite without touching your machine.
 - **Fairness** — every trial starts from the same clean state, so a score difference reflects the *configuration* (the `CLAUDE.md`, the skill, the MCP server, the model, the reasoning effort), not leftover state from a previous run.
 
-What varies between runs is exactly one thing: the **variant** — the agent configuration you're testing. That's how NASDE turns "did my skill help?" into a controlled experiment. See [Configuration → variant.toml](/nasde-toolkit/reference/configuration/) for what a variant can change.
+What varies between runs is exactly one thing: the **variant** — the agent configuration you're testing. That's how Nasde turns "did my skill help?" into a controlled experiment. See [Configuration → variant.toml](/nasde-toolkit/reference/configuration/) for what a variant can change.
 
 ## 3. Rough tests — a deterministic pass/fail
 
@@ -60,7 +60,7 @@ This is the standard verifier pattern used by [Harbor](https://www.harborframewo
 
 ## 4. The reviewer agent — reading the actual work
 
-This is the stage NASDE adds, and it's why the pipeline exists. A **second coding agent** (`claude` or `codex`) is pointed at the produced workspace and navigates it with real tools — `Read`, `Glob`, `Grep`, optionally MCP analysis servers — reading only what each dimension needs rather than stuffing the whole repo into a prompt. That's what keeps the review tractable on large codebases. It can also read the agent's full **trajectory** (tool calls, tokens, timing), so your criteria can judge the *process*, not just the final files.
+This is the stage Nasde adds, and it's why the pipeline exists. A **second coding agent** (`claude` or `codex`) is pointed at the produced workspace and navigates it with real tools — `Read`, `Glob`, `Grep`, optionally MCP analysis servers — reading only what each dimension needs rather than stuffing the whole repo into a prompt. That's what keeps the review tractable on large codebases. It can also read the agent's full **trajectory** (tool calls, tokens, timing), so your criteria can judge the *process*, not just the final files.
 
 The reviewer's reference point is **two files you write** when creating the benchmark:
 
@@ -75,7 +75,7 @@ You decide how strict the criteria are — spell out a ground-truth structure, e
 
 The reviewer scores each dimension on whatever scale you chose, with written reasoning for each. One local `nasde run` handles all of it — no separate LLM-as-a-judge stack required.
 
-**The reviewer runs more than once.** An LLM judge is non-deterministic — score the same workspace twice and you can get 0.61 then 0.71. So by default NASDE evaluates each trial **3 times** (`eval_repetitions`, set in `nasde.toml [evaluation]` or with `--eval-repetitions`) and reports the **mean** rather than any single run. Each evaluation is kept as its own `assessment_eval_<N>.json`; a derived `assessment_summary.json` holds the per-dimension mean, standard deviation, and range. Means are computed only within a single judge model **and a single rubric** — a Claude review and a Codex review are different benchmarks, and so is a review run after you edited `assessment_dimensions.json` (the rubric is fingerprinted, so changing a dimension, its `max_score`, or even its description starts a fresh cluster rather than silently mixing incomparable scores). After editing the rubric, just re-run `nasde eval` — the new evaluations form their own cluster automatically.
+**The reviewer runs more than once.** An LLM judge is non-deterministic — score the same workspace twice and you can get 0.61 then 0.71. So by default Nasde evaluates each trial **3 times** (`eval_repetitions`, set in `nasde.toml [evaluation]` or with `--eval-repetitions`) and reports the **mean** rather than any single run. Each evaluation is kept as its own `assessment_eval_<N>.json`; a derived `assessment_summary.json` holds the per-dimension mean, standard deviation, and range. Means are computed only within a single judge model **and a single rubric** — a Claude review and a Codex review are different benchmarks, and so is a review run after you edited `assessment_dimensions.json` (the rubric is fingerprinted, so changing a dimension, its `max_score`, or even its description starts a fresh cluster rather than silently mixing incomparable scores). After editing the rubric, just re-run `nasde eval` — the new evaluations form their own cluster automatically.
 
 When the judge's scores feel off, you can align it with your own judgment — see [Calibrating the Rubric](/nasde-toolkit/concepts/calibration/).
 

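Reviewer note: the "reviewer runs more than once" paragraph describes per-dimension means computed only within one judge model and one rubric. A minimal sketch of that aggregation follows. The record layout (`judge`, `rubric_fingerprint`, `scores`) is hypothetical and does not reflect the real `assessment_eval_<N>.json` schema, and the use of population standard deviation is an assumption, since the docs do not say which variant Nasde reports.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(evaluations):
    """Cluster evaluations by (judge, rubric fingerprint), then summarize each dimension."""
    clusters = defaultdict(list)
    for ev in evaluations:
        # Scores from different judges or different rubrics are never mixed.
        clusters[(ev["judge"], ev["rubric_fingerprint"])].append(ev["scores"])

    summary = {}
    for key, runs in clusters.items():
        per_dimension = {}
        for dim in runs[0]:
            values = [run[dim] for run in runs]
            per_dimension[dim] = {
                "mean": mean(values),
                "std": pstdev(values),  # assumption: population std (defined for n == 1)
                "min": min(values),
                "max": max(values),
                "n": len(values),
            }
        summary[key] = per_dimension
    return summary


# Hypothetical data: three Claude evaluations and one Codex evaluation of one trial.
evaluations = [
    {"judge": "claude", "rubric_fingerprint": "abc", "scores": {"clarity": 0.61}},
    {"judge": "claude", "rubric_fingerprint": "abc", "scores": {"clarity": 0.71}},
    {"judge": "claude", "rubric_fingerprint": "abc", "scores": {"clarity": 0.66}},
    {"judge": "codex", "rubric_fingerprint": "abc", "scores": {"clarity": 0.80}},
]
summary = summarize(evaluations)
for key, dims in summary.items():
    print(key, dims)
```

Editing the rubric would change the fingerprint in this sketch, so later evaluations would land in a new cluster instead of shifting the existing mean.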
@@ -1,9 +1,9 @@
 ---
 title: Key Terms
-description: A glossary of the NASDE vocabulary — variant, trial, job, rubric, dimension, reviewer, trajectory, and the tools it builds on.
+description: A glossary of the Nasde vocabulary — variant, trial, job, rubric, dimension, reviewer, trajectory, and the tools it builds on.
 ---
 
-NASDE has its own vocabulary. If a word in the docs is unfamiliar, it's probably here.
+Nasde has its own vocabulary. If a word in the docs is unfamiliar, it's probably here.
 
 ## The benchmark you author
 
@@ -50,10 +50,10 @@ NASDE has its own vocabulary. If a word in the docs is unfamiliar, it's probably
 **Sandbox**
 : The isolated container the agent works in. It can't touch your machine, and every trial starts from the same clean state.
 
-## The tools NASDE builds on
+## The tools Nasde builds on
 
 **Harbor**
-: The framework that runs the agent in a sandbox (Stage 1). NASDE uses its Python API directly. [harborframework.com](https://www.harborframework.com/)
+: The framework that runs the agent in a sandbox (Stage 1). Nasde uses its Python API directly. [harborframework.com](https://www.harborframework.com/)
 
 **Opik**
 : The optional experiment tracker scores flow to with `--with-opik`. [Opik by Comet](https://github.com/comet-ml/opik)

@@ -58,6 +58,6 @@ A separate study took a public DDD skill (the `tactical-ddd` skill from `ntcodin
 ## More benchmarks in the repo
 
 - **Refactoring katas (Java + Python)** — four classic refactorings scored on behavior preservation, clarity, technique, scope discipline. *Takeaway:* a candidate "refactoring skill" didn't move the score — shipping it would have been based on vibes.
-- **Project-specific skill validation (NASDE's own repo)** — one task pulled from NASDE's git history; four skill combinations tested. *Takeaway:* the testing-discipline skill alone raised pass rate from 67% → 100%; the "full-stack, everything-on" variant scored *worse* than vanilla.
+- **Project-specific skill validation (Nasde's own repo)** — one task pulled from Nasde's git history; four skill combinations tested. *Takeaway:* the testing-discipline skill alone raised pass rate from 67% → 100%; the "full-stack, everything-on" variant scored *worse* than vanilla.
 
 See **[Benchmark Results](/nasde-toolkit/guides/benchmark-results/)** for the full tables and methodology, and **[Use Cases](/nasde-toolkit/guides/use-cases/)** for the end-to-end walkthrough of building a benchmark like these yourself.