From 12397443cfabfbd2f60d269c28c539a9c5317a4c Mon Sep 17 00:00:00 2001
From: Szymon Janikowski
Date: Mon, 22 Jun 2026 16:00:32 +0200
Subject: [PATCH 01/10] feat(pricing): convention-based layered override (ADR-013)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Model prices are now overridable post-install without hacking the wheel. A
file literally named pricing.toml at a known location is auto-detected and
merged onto the bundled catalog, with no [pricing] config key:

  /pricing.toml > ~/.nasde/pricing.toml > bundled

Per-model whole-entry merge (higher wins): an override lists only the models
it changes/adds; the rest fall through. User layer is a HOME dotfolder
(~/.nasde/, like ~/.claude/~/.codex/~/.gemini), deliberately NOT platformdirs
(maps to ~/Library/Application Support on macOS = app state, not user config).
An applied layer prints a dim transparency line.

Both write paths thread project_dir so run (assessment_summary.json) and
export (metrics.json) agree on cost, preserving the ADR-011 single-extractor
invariant. New load_pricing_layered(project_dir) in pricing.py; wired through
evaluator/runner, results_exporter/cli, calibration_publisher, and
eval_migration. Bundled lru_cache and load_pricing(path) unchanged; merged
catalog is not cached (cheap per-job re-read).

14 new tests incl. a three-layer compose case with user↔project overlap
(project wins) and run+export e2e. ADR-013 + docs (token-cost.md,
configuration.md) + CLAUDE.md. 411 tests, ruff, mypy green.
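Illustration only (not part of the patch): a minimal sketch of the per-model
whole-entry merge described above, assuming each layer is a plain dict keyed by
model name. `Price` and `merge_layers` are hypothetical stand-ins for the real
`ModelPrice` and `load_pricing_layered`, and the model names and rates are made
up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    # Hypothetical stand-in for a catalog entry; field names are assumptions.
    input_per_1m: float
    output_per_1m: float
    source: str = ""


def merge_layers(*layers: dict[str, Price]) -> dict[str, Price]:
    """Merge layers passed lowest-precedence first; later layers win per model."""
    merged: dict[str, Price] = {}
    for layer in layers:
        # Whole-entry replacement: an overriding entry replaces the lower one
        # entirely, it is not blended field by field.
        merged.update(layer)
    return merged


# Made-up rates, purely for illustration.
bundled = {
    "model-a": Price(3.0, 15.0, source="bundled"),
    "model-b": Price(1.0, 2.0, source="bundled"),
}
user = {"model-a": Price(2.5, 12.0, source="user")}
project = {
    "model-a": Price(2.0, 10.0),  # omits source on purpose
    "model-c": Price(0.5, 1.0, source="project"),
}

catalog = merge_layers(bundled, user, project)
```

In this sketch the project entry for `model-a` wins over both lower layers, and
because it omits `source` the field falls back to the dataclass default rather
than inheriting `"user"` or `"bundled"`; `model-b` falls through from the
bundled layer untouched.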
Co-Authored-By: Claude Opus 4.8 (1M context) --- CLAUDE.md | 5 +- docs/adr/013-layered-pricing-override.md | 86 +++++++++++++ src/nasde_toolkit/calibration_publisher.py | 3 +- src/nasde_toolkit/cli.py | 4 +- src/nasde_toolkit/eval_migration.py | 9 +- src/nasde_toolkit/evaluator.py | 20 +-- src/nasde_toolkit/pricing.py | 96 +++++++++++---- src/nasde_toolkit/results_exporter.py | 23 ++-- src/nasde_toolkit/runner.py | 3 + tests/test_evaluator.py | 56 ++++++++- tests/test_pricing.py | 115 +++++++++++++++++- tests/test_results_exporter.py | 46 +++++++ .../src/content/docs/concepts/token-cost.md | 22 +++- .../content/docs/reference/configuration.md | 1 + 14 files changed, 434 insertions(+), 55 deletions(-) create mode 100644 docs/adr/013-layered-pricing-override.md diff --git a/CLAUDE.md b/CLAUDE.md index adfdbab..08ab070 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -18,7 +18,7 @@ src/nasde_toolkit/ claude_subprocess.py # `claude -p` subprocess backend (default) codex_subprocess.py # `codex exec` subprocess backend results_exporter.py # `nasde results-export` — copy trial artifact essence to a plain dir - pricing.py / pricing.toml # Versioned model price catalog (ADR-011) — load + compute_cost_usd + pricing.py / pricing.toml # Versioned model price catalog (ADR-011) — load + compute_cost_usd; layered override via load_pricing_layered (ADR-013) token_metrics.py # Token usage + cost economics from trajectory final_metrics (ADR-011) calibration_publisher.py # `nasde calibrate` — publish trial diffs+assessments as PRs/MRs (ADR-010) calibration_resolve.py # Resolve [calibration] repo → (api slug, push url, platform) @@ -88,7 +88,7 @@ See [ARCHITECTURE.md](ARCHITECTURE.md) for the full system architecture with dia - **Pass-through CLI**: `nasde harbor ...` delegates to Harbor's Typer app via `add_typer()`. `nasde opik ...` forwards args to Opik's Click CLI via `ctx.args`. 
- **Rubric calibration (ADR-010)**: `nasde calibrate publish PATHS...` / `pull-comments` close the loop between the LLM-as-a-Judge and a human reviewer by publishing trial diffs + assessments as PRs/MRs and pulling review comments back for rubric tuning. Two layers, deliberately separated: **GIT** (`git_platform_backends/git_ops.py` — `git push`/`ls-remote`, platform-agnostic, subprocess pattern from `docker.py`, not behind a Protocol) and **PLATFORM** (`git_platform_backends/` behind a `@runtime_checkable GitPlatformBackend` Protocol — `repo_exists`/`find_open_pr_for_branch`/`create_pr`/`fetch_pr_comments`/`validate_cli_installed`/`validate_auth`, mirroring `evaluator_backends/`). The base is keyed on `(repo, commit)` as an **orphan branch** `base/-` seeded once via `git archive HEAD` from the trial workspace (git deduplicates blobs by content across orphan bases — no shared ancestor needed); each trial is a feature branch `calib/-/` = base + the agent's `changes.patch` applied as a real commit + `.calibration/` files (no trajectory — secrets/clutter). `.calibration/` carries the reviewer's context: the task's `instruction.md` + `assessment_criteria.md` + `assessment_dimensions.json` (resolved from `result.json` `task_name`/`source`, trying both `tasks/` and `evals//tasks/` layouts), all `assessment_eval_.json`, `assessment_summary.json`, and `metrics.json`. Idempotency is **open-only**: `find_open_pr_for_branch` matches only OPEN PRs/MRs (`gh pr list --state open`, `glab mr list` default), so a re-run skips a live round but lets a fresh round publish once the prior one is closed. The PR body is a pure transform of the dominant `AssessmentSummary` cluster (`calibration_publisher._render_pr_body`). Backend is **auto-detected from the sink repo URL host** (`github.com`→`gh`, `*gitlab*`→`glab`; `[calibration] platform` overrides for self-hosted) — no `backend` config field, eliminating the backend≠host mismatch. 
Preflight before any work: detect → `validate_cli_installed` (`shutil.which`, precise per-platform message) → `validate_auth` (`gh|glab auth status` exit code) → `repo_exists` (parses OUTPUT — `gh repo view` exits 0 even for a missing repo). Repo creation is out of scope (push creates branches ad-hoc in an existing repo). Reuses `_expand_to_trials`/`_capture_patch`/`_build_metrics` from `results_exporter.py` and `_aggregate_evaluations`/`_load_raw_evaluations`/`AssessmentSummary` from `evaluator.py`. Prerequisites mirror the evaluator's CLI requirement (ADR-002): `git` + `gh`/`glab` + login, no SDK, CLI keyring holds auth. The `nasde-benchmark-calibration` skill orchestrates the human-in-the-loop flow. - **Results export (EXPERIMENTAL)**: `results_exporter.py` + `nasde results-export PATHS... --to DIR` copy the analytic *essence* of trial artifacts out of the gitignored `jobs/` tree into a flat per-trial layout (`DIR/__/` with `metrics.json`, `assessment_eval_*.json` (all repetitions), `assessment_summary.json`, `trajectory.json`, `changes.patch`, `verifier_stdout.txt`, `reward.txt`). Re-export **merges**: missing eval files are copied and the summary/metrics refreshed, while immutable files (trajectory, patch) are left as-is — so evaluations added after a first export are picked up. A legacy bare `assessment_eval.json` (pre-migration trial) is exported as `assessment_eval_1.json` with a `nasde migrate-evals` hint, so the export is never silently empty. Filesystem-as-interface: `DIR` is any plain path (iCloud/Dropbox/git repo) — no cloud SDK. It scans Harbor artifacts (`result.json`/`config.json`/`assessment_eval*.json`/`agent/trajectory.json`/workspace), **not** the best-effort `EXPERIMENT_LOG.md`. 
`metrics.json` is a self-contained summary composed from `result.json`+`config.json`+`agent/trajectory.json` — including **token & cost economics** (`token_usage`, `cost_usd`, `pricing_as_of`, `reasoning_effort`); see the token-cost note below and [ADR-011](docs/adr/011-token-cost-metrics.md). The code diff is captured as a patch (`git diff HEAD` + untracked via `git ls-files -z` + `git diff --no-index`, never `git add` — the workspace `.git` index is left untouched; `-z`/NUL parsing means non-ASCII untracked filenames are not dropped under `core.quotepath`). Selection is a mixed positional list of job and/or trial dirs (auto-classified: a dir whose children have `result.json` is a job; else a dir whose own `result.json` carries a `trial_name` key is a trial; a dir with a job-level `result.json` but no trial-shaped children/`trial_name` is skipped with a warning rather than mis-exported as garbage); re-export is idempotent and merge-based (a trial is reported `exported` only when something new was copied, else `skipped`). Reuses `_collect_trial_dirs`/`_load_json`/`_compute_duration_sec`/`_resolve_agent_name` from `evaluator.py`. Deliberately does **not** model "experiments" (one job can belong to many — a future UI layer's concern). -- **Token & cost metrics (ADR-011)**: every trial gets **token usage + USD cost** computed from the agent's `agent/trajectory.json` `final_metrics` and a versioned price catalog. `token_metrics.py` is the **single extractor** feeding both write paths: `evaluator.py` writes them onto `assessment_summary.json` (run) and `results_exporter.py` onto `metrics.json` (export). Definitions: `input = total_prompt_tokens` (full, cache included), `output = total_completion_tokens + extra.reasoning_output_tokens` (Codex reasoning folded into output), `total = input + output`. 
**Cost is "as if every run were the first"** — full prompt volume at full input rate, **no cache discount** — so it is deterministic and independent of run order / cache TTL (the prompt-token count is fixed for a task; the cache hit rate is not). **The scalar `token_efficiency`/`cost_efficiency` ratios were REMOVED** — `normalized_score / denominator` has an arbitrary zero (score 0 = empty rubric, unreachable), so the ranking is not invariant to a baseline shift; model comparison is now a **Pareto front** (quality vs cost, quality vs tokens), shift-invariant, living in the `nasde-benchmark-runner` skill, not the toolkit. The raw signals (`token_usage`, `cost_usd`, `pricing_as_of`, score) stay and are the source of truth. Economics are **per-trial** (one agent run) → they live on `AssessmentSummary`, not `EvaluatorGroupSummary`. `model_name` and `reasoning_effort` are stamped on the summary because cross-model analysis groups by `(agent_name, model_name, reasoning_effort)` (agent_name = variant name, does not distinguish models; a different effort is a different configuration, never averaged together — mirrors how a changed `dimensions_fingerprint` is a different benchmark). The `reasoning_effort` stamp is read back from the per-trial Harbor `config.json` (`config.agent.kwargs.reasoning_effort`); when no override was set the stamp is `""` (only explicit overrides are recorded — Codex's implicit `high` default is NOT fabricated, and an unset effort is a valid "family default" state). Pricing lives in the bundled `pricing.toml` (`pricing.py::load_pricing`), each model stamped with `as_of` + `source`; `cached_input_per_1m` is recorded for reference but **not** used in the cost formula. An **unpriced model** → `cost_usd` = `null` + a warning (token metrics still computed); a **missing/legacy trajectory** → all economics `null`. Never crashes the run. 
`nasde run` prints a per-`(agent, model, effort)` cost table (trials, score, tokens, $cost) plus the job path and an export hint (`runner.py::_print_job_summary`, called after assessment so the summaries exist); raw cost/token columns carry an inter-trial `±std` when the group has ≥2 trials (n=1 → bare value). Backfilling existing exports whose source jobs are gone is a **one-shot ad-hoc script** (reads the export's own flat `trajectory.json`) — deliberately NOT a CLI command. +- **Token & cost metrics (ADR-011)**: every trial gets **token usage + USD cost** computed from the agent's `agent/trajectory.json` `final_metrics` and a versioned price catalog. `token_metrics.py` is the **single extractor** feeding both write paths: `evaluator.py` writes them onto `assessment_summary.json` (run) and `results_exporter.py` onto `metrics.json` (export). Definitions: `input = total_prompt_tokens` (full, cache included), `output = total_completion_tokens + extra.reasoning_output_tokens` (Codex reasoning folded into output), `total = input + output`. **Cost is "as if every run were the first"** — full prompt volume at full input rate, **no cache discount** — so it is deterministic and independent of run order / cache TTL (the prompt-token count is fixed for a task; the cache hit rate is not). **The scalar `token_efficiency`/`cost_efficiency` ratios were REMOVED** — `normalized_score / denominator` has an arbitrary zero (score 0 = empty rubric, unreachable), so the ranking is not invariant to a baseline shift; model comparison is now a **Pareto front** (quality vs cost, quality vs tokens), shift-invariant, living in the `nasde-benchmark-runner` skill, not the toolkit. The raw signals (`token_usage`, `cost_usd`, `pricing_as_of`, score) stay and are the source of truth. Economics are **per-trial** (one agent run) → they live on `AssessmentSummary`, not `EvaluatorGroupSummary`. 
`model_name` and `reasoning_effort` are stamped on the summary because cross-model analysis groups by `(agent_name, model_name, reasoning_effort)` (agent_name = variant name, does not distinguish models; a different effort is a different configuration, never averaged together — mirrors how a changed `dimensions_fingerprint` is a different benchmark). The `reasoning_effort` stamp is read back from the per-trial Harbor `config.json` (`config.agent.kwargs.reasoning_effort`); when no override was set the stamp is `""` (only explicit overrides are recorded — Codex's implicit `high` default is NOT fabricated, and an unset effort is a valid "family default" state). Pricing is loaded via `pricing.py::load_pricing_layered(project_dir)` (ADR-013), which merges three layers per-model (whole-entry, higher wins): `/pricing.toml` > `~/.nasde/pricing.toml` > bundled `pricing.toml`. **Convention, not config** — a file literally named `pricing.toml` (mirrors `assessment_dimensions.json`); no `[pricing]` key. User layer is a HOME dotfolder (`~/.nasde/`, like `~/.claude`/`~/.codex`/`~/.gemini`), deliberately NOT `platformdirs` (that maps to `~/Library/Application Support` on macOS = app-state, not user-editable config; `platformdirs` stays for cache in `update_check.py`). Both write paths thread `project_dir` so run (`assessment_summary.json`) and export (`metrics.json`) agree on cost — the ADR-011 single-extractor invariant. The merged catalog is NOT cached (depends on `project_dir` + on-disk contents); bundled `_load_bundled_pricing` keeps its `lru_cache`, `load_pricing(path)` is unchanged. An applied override prints a dim transparency line. Each model stamped with `as_of` + `source`; `cached_input_per_1m` is recorded for reference but **not** used in the cost formula. An **unpriced model** → `cost_usd` = `null` + a warning (token metrics still computed); a **missing/legacy trajectory** → all economics `null`. Never crashes the run. 
`nasde run` prints a per-`(agent, model, effort)` cost table (trials, score, tokens, $cost) plus the job path and an export hint (`runner.py::_print_job_summary`, called after assessment so the summaries exist); raw cost/token columns carry an inter-trial `±std` when the group has ≥2 trials (n=1 → bare value). Backfilling existing exports whose source jobs are gone is a **one-shot ad-hoc script** (reads the export's own flat `trajectory.json`) — deliberately NOT a CLI command. - See `docs/adr/` for detailed decision records. ## CLI reference @@ -145,6 +145,7 @@ A benchmark project managed by `nasde` has this layout: my-benchmark/ nasde.toml # Project config (name, defaults, docker, evaluation, reporting) assessment_dimensions.json # Scoring dimensions (benchmark-wide, each with independent max_score) + pricing.toml # Optional: per-project model price overrides (project > ~/.nasde > bundled, ADR-013) tasks/ / task.toml # Task config: Harbor sections + [nasde.source] for auto-Dockerfile diff --git a/docs/adr/013-layered-pricing-override.md b/docs/adr/013-layered-pricing-override.md new file mode 100644 index 0000000..1f135dc --- /dev/null +++ b/docs/adr/013-layered-pricing-override.md @@ -0,0 +1,86 @@ +# ADR-013: Layered pricing override (convention over config) + +**Status:** Accepted +**Date:** 2026-06-22 + +## Context + +Token-cost metrics (ADR-011) price every trial against a model-rate catalog, +`src/nasde_toolkit/pricing.toml`, bundled into the wheel. The loader +`pricing.load_pricing(path)` accepted an override path, but both real call sites — +`evaluator.py` (run → `assessment_summary.json`) and `results_exporter.py` +(export → `metrics.json`) — always called it with no argument, so the bundled +catalog was the *only* catalog. + +After a PyPI / `uv tool install`, that catalog lives inside an isolated +environment. Correcting a stale rate or adding a model meant editing a file in +`site-packages` — an anti-pattern that is wiped on the next upgrade. 
Model prices +change far more often than `nasde` releases, and users have their own rates +(enterprise discounts, Azure/Bedrock, private contracts). The product goal — +evaluating the *cost* of migrating between agents — depends on locally-correct, +locally-controllable prices. The docs already flagged a per-project / per-user +override as planned. + +## Decision + +Pricing is overridable **by convention, not configuration**. A file literally +named `pricing.toml` placed at a known location is auto-detected and merged onto +the bundled catalog — there is **no `[pricing]` section in `nasde.toml`** and **no +configurable filename**. This mirrors `assessment_dimensions.json`, which is also +a fixed-name file discovered by convention. + +A new public entry point `pricing.load_pricing_layered(project_dir)` merges three +layers, **higher wins, per-model whole-entry replacement**: + +1. `/pricing.toml` — project layer (highest) +2. `~/.nasde/pricing.toml` — user layer +3. bundled `pricing.toml` — the floor (always present, the only required layer) + +An override file lists **only the models it changes or adds**; every other model +falls through to the layer below. The merge is `dict.update` over whole +`ModelPrice` entries — **not** a per-field blend: a model entry in an override +replaces the lower layer's entry entirely, so a field the override omits takes the +`ModelPrice` default (`cached_input_per_1m=None`, `source=""`), it is *not* +inherited. This is the least-surprising rule and mirrors the whole-value override +semantics of `--model > variant.toml > default`. A missing project/user file is +silently skipped. When a layer file is found and applied, one dim console line is +printed (transparency), never for the skipped case. 
+ +**User layer is `~/.nasde/`, a HOME dotfolder — deliberately not `platformdirs`.** +Every agent CLI the user works with keeps its *user config* in a HOME dotfolder +(`~/.claude` + `~/.claude.json`, `~/.codex`, `~/.gemini`); `platformdirs.user_config_dir` +maps to `~/Library/Application Support` on macOS, which is where Electron app-state +(cookies, caches) lives, not a file a human edits. Config belongs in the dotfolder; +`platformdirs` stays for cache (`update_check.py::user_cache_dir`). One path on +every OS. + +**Both write paths thread `project_dir`.** `token_metrics.build_trial_economics` +is the single extractor feeding both `assessment_summary.json` (run) and +`metrics.json` (export) — so the override has to reach both, or the same trial +would report two different costs. The run path already had `project_root` in hand +(`evaluate_job`); the export path gained a `project_dir` argument on +`export_results`, supplied from `config.project_dir` in the CLI. Calibration +(`calibration_publisher`) reuses `_build_metrics` and threads `project_root` too. +The merged catalog is **not** cached: it depends on `project_dir` and on on-disk +file contents that can change between runs, and a per-job re-read is cheap. The +bundled `_load_bundled_pricing` keeps its `lru_cache` (invariant), and +`load_pricing(path)` is unchanged. + +## Consequences + +- Users override model prices post-install without touching the wheel — drop a + `pricing.toml` in the project root, or `~/.nasde/pricing.toml` machine-wide. +- The bundled catalog stays the auditable floor, each entry stamped with + `as_of` / `source`; overrides layer on top per-model. +- A re-run or re-export picks up an edited `pricing.toml` immediately — no new + agent runs needed to recost. +- Run and export agree on cost for the same trial, preserving the ADR-011 + invariant. 
+- Determinism and the "confirm rates before quoting figures" caveat from ADR-011 + carry over — a wrong override produces wrong (but deterministic) costs. + +## References + +- ADR-011 (token & cost metrics) — the single-extractor invariant this override + must respect. +- `assessment_dimensions.json` — the fixed-name-by-convention precedent. diff --git a/src/nasde_toolkit/calibration_publisher.py b/src/nasde_toolkit/calibration_publisher.py index a93bb49..cf6a15a 100644 --- a/src/nasde_toolkit/calibration_publisher.py +++ b/src/nasde_toolkit/calibration_publisher.py @@ -36,6 +36,7 @@ push_feature_branch, ) from nasde_toolkit.git_platform_backends.protocol import GitPlatformBackend, PrRef, ReviewComment +from nasde_toolkit.pricing import load_pricing_layered from nasde_toolkit.results_exporter import ( _build_metrics, _capture_patch, @@ -166,7 +167,7 @@ def _open_pr_for_trial( project_root: Path | None, ) -> PrRef: workspace = trial_dir / "artifacts" / "workspace" - metrics = _build_metrics(trial_dir) + metrics = _build_metrics(trial_dir, load_pricing_layered(project_root)) summary = _summarize_trial(trial_dir) title = _pr_title(trial_dir, summary) body = _render_pr_body(summary, metrics) diff --git a/src/nasde_toolkit/cli.py b/src/nasde_toolkit/cli.py index 4119b01..afa5b5d 100644 --- a/src/nasde_toolkit/cli.py +++ b/src/nasde_toolkit/cli.py @@ -433,7 +433,7 @@ def results_export_command( from nasde_toolkit.config import load_project_config from nasde_toolkit.results_exporter import export_results - load_project_config(project_dir.resolve()) + config = load_project_config(project_dir.resolve()) from nasde_toolkit.banner import print_banner @@ -445,7 +445,7 @@ def results_export_command( ) ) - export_results([p.resolve() for p in paths], to.resolve()) + export_results([p.resolve() for p in paths], to.resolve(), project_dir=config.project_dir) @app.command(name="migrate-evals", hidden=True) diff --git a/src/nasde_toolkit/eval_migration.py 
b/src/nasde_toolkit/eval_migration.py index a04e54b..fcb99bb 100644 --- a/src/nasde_toolkit/eval_migration.py +++ b/src/nasde_toolkit/eval_migration.py @@ -18,28 +18,31 @@ _next_eval_index, _write_assessment_summary, ) +from nasde_toolkit.pricing import ModelPrice, load_pricing_layered console = Console() def migrate_job_evals(path: Path, dry_run: bool = False) -> dict[str, int]: outcomes = {"migrated": 0, "summarized": 0, "noop": 0} + pricing = load_pricing_layered() for trial_dir in _find_trial_dirs(path): - outcome = migrate_trial_evals(trial_dir, dry_run=dry_run) + outcome = migrate_trial_evals(trial_dir, pricing, dry_run=dry_run) outcomes[outcome] += 1 return outcomes -def migrate_trial_evals(trial_dir: Path, dry_run: bool = False) -> str: +def migrate_trial_evals(trial_dir: Path, pricing: dict[str, ModelPrice] | None = None, dry_run: bool = False) -> str: bare = trial_dir / "assessment_eval.json" numbered = _numbered_eval_files(trial_dir) if not bare.exists() and not numbered: return "noop" + pricing = pricing if pricing is not None else load_pricing_layered() changed = _normalize_raw_files(trial_dir, bare, numbered, dry_run) if not dry_run: - _write_assessment_summary(trial_dir) + _write_assessment_summary(trial_dir, pricing) if changed: return "migrated" return "summarized" diff --git a/src/nasde_toolkit/evaluator.py b/src/nasde_toolkit/evaluator.py index dbcff92..4af8f04 100644 --- a/src/nasde_toolkit/evaluator.py +++ b/src/nasde_toolkit/evaluator.py @@ -20,7 +20,7 @@ from nasde_toolkit.config import EvaluationConfig from nasde_toolkit.evaluator_backends import create_backend -from nasde_toolkit.pricing import load_pricing +from nasde_toolkit.pricing import ModelPrice, load_pricing_layered from nasde_toolkit.token_metrics import build_trial_economics console = Console() @@ -142,6 +142,7 @@ async def evaluate_job( _warn_if_throttled(trial_dirs, max_concurrent, eval_config.eval_repetitions) semaphore = asyncio.Semaphore(max_concurrent) + pricing = 
load_pricing_layered(project_root) coros = [ _evaluate_and_record_trial( td, @@ -150,6 +151,7 @@ async def evaluate_job( with_opik, semaphore, eval_config, + pricing, ) for td in trial_dirs ] @@ -164,12 +166,14 @@ async def evaluate_and_record_trial( with_opik: bool, semaphore: asyncio.Semaphore, eval_config: EvaluationConfig | None = None, + pricing: dict[str, ModelPrice] | None = None, ) -> EvaluationResult | None: """Evaluate a single trial with semaphore-based concurrency control. Public wrapper used by runner.py for streaming (Level 2) assessment. """ eval_config = eval_config or EvaluationConfig() + pricing = pricing if pricing is not None else load_pricing_layered(project_root) return await _evaluate_and_record_trial( trial_dir, project_root, @@ -177,6 +181,7 @@ async def evaluate_and_record_trial( with_opik, semaphore, eval_config, + pricing, ) @@ -186,7 +191,8 @@ async def _evaluate_and_record_trial( project_name: str, with_opik: bool, semaphore: asyncio.Semaphore, - eval_config: EvaluationConfig | None = None, + eval_config: EvaluationConfig | None, + pricing: dict[str, ModelPrice], ) -> EvaluationResult | None: eval_config = eval_config or EvaluationConfig() console.print(f"\n[bold]Evaluating: {trial_dir.name} (x{eval_config.eval_repetitions})[/bold]") @@ -196,7 +202,7 @@ async def _evaluate_and_record_trial( return None for evaluation in evaluations: _write_evaluation_result(trial_dir, evaluation) - summary = _write_assessment_summary(trial_dir) + summary = _write_assessment_summary(trial_dir, pricing) if with_opik and summary is not None: await asyncio.to_thread(_upload_to_opik, summary, project_name) return evaluations[-1] @@ -674,12 +680,12 @@ def _next_eval_index(trial_dir: Path) -> int: return max(indices, default=0) + 1 -def _write_assessment_summary(trial_dir: Path) -> AssessmentSummary | None: +def _write_assessment_summary(trial_dir: Path, pricing: dict[str, ModelPrice]) -> AssessmentSummary | None: evaluations = 
_load_raw_evaluations(trial_dir) if not evaluations: return None summary = _aggregate_evaluations(evaluations) - _enrich_with_economics(summary, trial_dir) + _enrich_with_economics(summary, trial_dir, pricing) output_path = trial_dir / "assessment_summary.json" with open(output_path, "w") as f: json.dump(asdict(summary), f, indent=2) @@ -687,9 +693,9 @@ def _write_assessment_summary(trial_dir: Path) -> AssessmentSummary | None: return summary -def _enrich_with_economics(summary: AssessmentSummary, trial_dir: Path) -> None: +def _enrich_with_economics(summary: AssessmentSummary, trial_dir: Path, pricing: dict[str, ModelPrice]) -> None: model = _resolve_model_name(trial_dir) - economics = build_trial_economics(trial_dir, model, load_pricing()) + economics = build_trial_economics(trial_dir, model, pricing) summary.model_name = economics["model_name"] summary.reasoning_effort = resolve_reasoning_effort(trial_dir) summary.token_usage = economics["token_usage"] diff --git a/src/nasde_toolkit/pricing.py b/src/nasde_toolkit/pricing.py index 3fbd032..47c047a 100644 --- a/src/nasde_toolkit/pricing.py +++ b/src/nasde_toolkit/pricing.py @@ -1,8 +1,11 @@ """Model pricing catalog for token-cost metrics. -Loads per-model rates from a bundled ``pricing.toml`` (overridable) and computes -USD cost from token volumes. Cost is the full catalog rate applied to the full -prompt-token volume (cache included, no discount) — see ``pricing.toml`` and ADR-011. +Loads per-model rates from a bundled ``pricing.toml`` and computes USD cost from +token volumes. The catalog is overridable by convention via layered files — +``/pricing.toml`` > ``~/.nasde/pricing.toml`` > bundled, merged per-model +(see ``load_pricing_layered`` and ADR-013). Cost is the full catalog rate applied +to the full prompt-token volume (cache included, no discount) — see ``pricing.toml`` +and ADR-011. 
""" from __future__ import annotations @@ -29,6 +32,23 @@ class ModelPrice: source: str = "" +def compute_cost_usd( + input_tokens: int, + output_tokens: int, + model: str, + pricing: dict[str, ModelPrice], +) -> float | None: + """Full-rate USD cost for the given token volumes, or None if model is unpriced.""" + price = pricing.get(model) + if price is None: + console.print( + f" [yellow]No pricing for model {model!r} — cost left unset. " + f"Add it to pricing.toml to enable cost metrics.[/yellow]" + ) + return None + return input_tokens / 1_000_000 * price.input_per_1m + output_tokens / 1_000_000 * price.output_per_1m + + def load_pricing(path: str | Path | None = None) -> dict[str, ModelPrice]: """Load the model pricing catalog, defaulting to the bundled pricing.toml. @@ -42,36 +62,64 @@ def load_pricing(path: str | Path | None = None) -> dict[str, ModelPrice]: return _pricing_from_raw(raw) +def load_pricing_layered(project_dir: Path | None = None) -> dict[str, ModelPrice]: + """Load the pricing catalog with convention-based layered overrides. + + Merges three layers, higher wins, per-model whole-entry replacement: + 1. ``/pricing.toml`` (project layer, highest) + 2. ``~/.nasde/pricing.toml`` (user layer) + 3. bundled ``pricing.toml`` (the floor, always present) + + An override file lists only the models it changes or adds; every other model + is inherited from the layer below. A missing project/user file is silently + skipped — only the bundled floor is required. See ADR-013. 
+ """ + merged = dict(load_pricing()) + for override_path in _layered_override_paths(project_dir): + if not override_path.is_file(): + continue + layer = load_pricing(override_path) + merged.update(layer) + console.print(f" [dim]pricing: applied override {override_path} ({len(layer)} model(s))[/dim]") + return merged + + +def pricing_as_of(model: str, pricing: dict[str, ModelPrice]) -> str | None: + """Return the as_of date stamped on the model's price, if priced.""" + price = pricing.get(model) + return price.as_of if price is not None else None + + @lru_cache(maxsize=1) def _load_bundled_pricing() -> dict[str, ModelPrice]: return _pricing_from_raw(_read_pricing_toml(None)) -def _pricing_from_raw(raw: dict) -> dict[str, ModelPrice]: - return {name: _model_price_from_dict(entry) for name, entry in raw.get("models", {}).items()} +def _layered_override_paths(project_dir: Path | None) -> list[Path]: + paths = [_user_pricing_path()] + if project_dir is not None: + paths.append(project_dir / "pricing.toml") + return _deduped_paths(paths) -def compute_cost_usd( - input_tokens: int, - output_tokens: int, - model: str, - pricing: dict[str, ModelPrice], -) -> float | None: - """Full-rate USD cost for the given token volumes, or None if model is unpriced.""" - price = pricing.get(model) - if price is None: - console.print( - f" [yellow]No pricing for model {model!r} — cost left unset. 
" - f"Add it to pricing.toml to enable cost metrics.[/yellow]" - ) - return None - return input_tokens / 1_000_000 * price.input_per_1m + output_tokens / 1_000_000 * price.output_per_1m +def _user_pricing_path() -> Path: + return Path.home() / ".nasde" / "pricing.toml" -def pricing_as_of(model: str, pricing: dict[str, ModelPrice]) -> str | None: - """Return the as_of date stamped on the model's price, if priced.""" - price = pricing.get(model) - return price.as_of if price is not None else None +def _deduped_paths(paths: list[Path]) -> list[Path]: + seen: set[Path] = set() + deduped: list[Path] = [] + for path in paths: + resolved = path.resolve() + if resolved in seen: + continue + seen.add(resolved) + deduped.append(path) + return deduped + + +def _pricing_from_raw(raw: dict) -> dict[str, ModelPrice]: + return {name: _model_price_from_dict(entry) for name, entry in raw.get("models", {}).items()} def _read_pricing_toml(path: str | Path | None) -> dict: diff --git a/src/nasde_toolkit/results_exporter.py b/src/nasde_toolkit/results_exporter.py index f0b8127..1b4baca 100644 --- a/src/nasde_toolkit/results_exporter.py +++ b/src/nasde_toolkit/results_exporter.py @@ -24,7 +24,7 @@ _resolve_task_name, resolve_reasoning_effort, ) -from nasde_toolkit.pricing import load_pricing +from nasde_toolkit.pricing import ModelPrice, load_pricing_layered from nasde_toolkit.token_metrics import build_trial_economics console = Console() @@ -39,13 +39,19 @@ class ExportSummary: failed: list[str] = field(default_factory=list) -def export_results(paths: list[Path], dest: Path, include_trajectory: bool = True) -> ExportSummary: +def export_results( + paths: list[Path], + dest: Path, + include_trajectory: bool = True, + project_dir: Path | None = None, +) -> ExportSummary: """Export trial artifacts from the given job and/or trial paths into dest.""" dest.mkdir(parents=True, exist_ok=True) trials = _expand_to_trials(paths) + pricing = load_pricing_layered(project_dir) summary = 
ExportSummary() for job_name, trial_dir in trials: - _export_one_trial(job_name, trial_dir, dest, include_trajectory, summary) + _export_one_trial(job_name, trial_dir, dest, include_trajectory, summary, pricing) _print_summary(summary, dest) return summary @@ -117,13 +123,14 @@ def _export_one_trial( dest: Path, include_trajectory: bool, summary: ExportSummary, + pricing: dict[str, ModelPrice], ) -> None: out_dir = dest / f"{job_name}__{trial_dir.name}" label = out_dir.name existed = out_dir.exists() try: out_dir.mkdir(parents=True, exist_ok=True) - _write_metrics(trial_dir, out_dir) + _write_metrics(trial_dir, out_dir, pricing) copied = _copy_assessment_files(trial_dir, out_dir) _copy_verifier_files(trial_dir, out_dir) if include_trajectory and not (out_dir / "trajectory.json").exists(): @@ -144,16 +151,16 @@ def _export_one_trial( summary.skipped.append(label) -def _write_metrics(trial_dir: Path, out_dir: Path) -> None: - metrics = _build_metrics(trial_dir) +def _write_metrics(trial_dir: Path, out_dir: Path, pricing: dict[str, ModelPrice]) -> None: + metrics = _build_metrics(trial_dir, pricing) (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2)) -def _build_metrics(trial_dir: Path) -> dict: +def _build_metrics(trial_dir: Path, pricing: dict[str, ModelPrice]) -> dict: result = _load_json(trial_dir / "result.json") model = _resolve_model_name(trial_dir) score_stats = _resolve_score_stats(trial_dir) - economics = build_trial_economics(trial_dir, model, load_pricing()) + economics = build_trial_economics(trial_dir, model, pricing) return { "trial_name": result.get("trial_name", trial_dir.name), "task_name": _resolve_task_name(result), diff --git a/src/nasde_toolkit/runner.py b/src/nasde_toolkit/runner.py index a49ad2c..f127eaa 100644 --- a/src/nasde_toolkit/runner.py +++ b/src/nasde_toolkit/runner.py @@ -1069,9 +1069,11 @@ async def _run_job_with_streaming_eval( ) -> None: """Run Harbor job with assessment eval starting per trial as they complete.""" 
from nasde_toolkit.evaluator import evaluate_and_record_trial + from nasde_toolkit.pricing import load_pricing_layered project_name = config.reporting.project_name or config.name eval_semaphore = asyncio.Semaphore(max_concurrent_eval) + pricing = load_pricing_layered(config.project_dir) assessment_tasks: list[asyncio.Task] = [] async def _on_trial_complete(event: object) -> None: @@ -1084,6 +1086,7 @@ async def _on_trial_complete(event: object) -> None: with_opik=with_opik, semaphore=eval_semaphore, eval_config=config.evaluation, + pricing=pricing, ) ) assessment_tasks.append(task) diff --git a/tests/test_evaluator.py b/tests/test_evaluator.py index 0094555..2add714 100644 --- a/tests/test_evaluator.py +++ b/tests/test_evaluator.py @@ -10,6 +10,7 @@ import pytest +from nasde_toolkit import pricing as pricing_module from nasde_toolkit.config import EvaluationConfig from nasde_toolkit.evaluator import ( DimensionScore, @@ -28,6 +29,7 @@ _write_assessment_summary, _write_evaluation_result, ) +from nasde_toolkit.pricing import load_pricing, load_pricing_layered def _make_evaluation( @@ -236,7 +238,7 @@ def test_write_assessment_summary_has_no_judge_reasoning_text(tmp_path: Path) -> _write_evaluation_result(tmp_path, _make_evaluation(0.6, dim_score=6)) _write_evaluation_result(tmp_path, _make_evaluation(0.7, dim_score=7)) - summary = _write_assessment_summary(tmp_path) + summary = _write_assessment_summary(tmp_path, load_pricing()) assert summary is not None raw = (tmp_path / "assessment_summary.json").read_text() assert '"reasoning":' not in raw # verbose per-dimension judge reasoning is not dumped into the summary @@ -269,13 +271,54 @@ def test_assessment_summary_includes_economics(tmp_path: Path) -> None: _write_evaluation_result(tmp_path, _make_evaluation(0.6, dim_score=6)) _write_evaluation_result(tmp_path, _make_evaluation(0.7, dim_score=7)) - summary = _write_assessment_summary(tmp_path) + summary = _write_assessment_summary(tmp_path, load_pricing()) assert 
summary is not None assert summary.model_name == "claude-sonnet-4-6" assert summary.token_usage["total_tokens"] == 1_060_000 # sonnet $3/$15: 1M*3 + 0.06M*15 = 3.9 assert summary.cost_usd == pytest.approx(3.9) assert summary.pricing_as_of == "2026-06-08" + + +def _model_block(name: str, input_per_1m: float, output_per_1m: float) -> str: + return f'[models."{name}"]\ninput_per_1m = {input_per_1m}\noutput_per_1m = {output_per_1m}\n' + + +def test_evaluator_picks_up_project_pricing_override(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + monkeypatch.setattr(pricing_module, "_user_pricing_path", lambda: tmp_path / "no-user" / "pricing.toml") + trial = tmp_path / "trial" + trial.mkdir() + _seed_trajectory(trial) + _write_evaluation_result(trial, _make_evaluation(0.6, dim_score=6)) + project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text(_model_block("claude-sonnet-4-6", 30.0, 150.0)) + + summary = _write_assessment_summary(trial, load_pricing_layered(project)) + assert summary is not None + assert summary.cost_usd == pytest.approx(1_000_000 / 1e6 * 30.0 + 60_000 / 1e6 * 150.0) + + +def test_evaluator_three_layer_compose_e2e(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + user_file = tmp_path / "user-home" / ".nasde" / "pricing.toml" + monkeypatch.setattr(pricing_module, "_user_pricing_path", lambda: user_file) + user_file.parent.mkdir(parents=True) + user_file.write_text(_model_block("azure-gpt5", 1.0, 2.0)) + trial = tmp_path / "trial" + trial.mkdir() + _seed_trajectory(trial) + config = json.loads((trial / "config.json").read_text()) + config["agent"]["model_name"] = "azure-gpt5" + (trial / "config.json").write_text(json.dumps(config)) + _write_evaluation_result(trial, _make_evaluation(0.6, dim_score=6)) + project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text(_model_block("azure-gpt5", 0.5, 1.0)) + + summary = _write_assessment_summary(trial, load_pricing_layered(project)) + 
assert summary is not None + assert summary.model_name == "azure-gpt5" + assert summary.cost_usd == pytest.approx(1_000_000 / 1e6 * 0.5 + 60_000 / 1e6 * 1.0) assert not hasattr(summary, "cost_efficiency") # removed: arbitrary zero → use Pareto front assert not hasattr(summary, "token_efficiency") @@ -283,7 +326,7 @@ def test_assessment_summary_includes_economics(tmp_path: Path) -> None: def test_assessment_summary_economics_null_without_trajectory(tmp_path: Path) -> None: _write_evaluation_result(tmp_path, _make_evaluation(0.6, dim_score=6)) - summary = _write_assessment_summary(tmp_path) + summary = _write_assessment_summary(tmp_path, load_pricing()) assert summary is not None assert summary.token_usage is None assert summary.cost_usd is None @@ -295,7 +338,7 @@ def test_assessment_summary_stamps_reasoning_effort_from_config(tmp_path: Path) ) _write_evaluation_result(tmp_path, _make_evaluation(0.6, dim_score=6)) - summary = _write_assessment_summary(tmp_path) + summary = _write_assessment_summary(tmp_path, load_pricing()) assert summary is not None assert summary.reasoning_effort == "xhigh" @@ -306,7 +349,7 @@ def test_assessment_summary_reasoning_effort_empty_when_unset(tmp_path: Path) -> ) _write_evaluation_result(tmp_path, _make_evaluation(0.6, dim_score=6)) - summary = _write_assessment_summary(tmp_path) + summary = _write_assessment_summary(tmp_path, load_pricing()) assert summary is not None assert summary.reasoning_effort == "" # no override → never synthesize a default @@ -329,6 +372,7 @@ async def fake_evaluate_trial( with_opik=False, semaphore=asyncio.Semaphore(10), eval_config=config, + pricing=load_pricing(), ) ) @@ -355,6 +399,7 @@ async def fake_evaluate_trial( with_opik=False, semaphore=asyncio.Semaphore(10), eval_config=config, + pricing=load_pricing(), ) ) @@ -383,6 +428,7 @@ async def flaky_evaluate_trial( with_opik=False, semaphore=asyncio.Semaphore(10), eval_config=config, + pricing=load_pricing(), ) ) diff --git a/tests/test_pricing.py 
b/tests/test_pricing.py index aaa1b3a..b5917dc 100644 --- a/tests/test_pricing.py +++ b/tests/test_pricing.py @@ -6,7 +6,33 @@ import pytest -from nasde_toolkit.pricing import compute_cost_usd, load_pricing, pricing_as_of +from nasde_toolkit import pricing as pricing_module +from nasde_toolkit.pricing import ( + compute_cost_usd, + load_pricing, + load_pricing_layered, + pricing_as_of, +) + + +def _write_pricing(directory: Path, body: str) -> Path: + path = directory / "pricing.toml" + path.write_text(body) + return path + + +def _model_block(name: str, input_per_1m: float, output_per_1m: float, source: str | None = None) -> str: + block = f'[models."{name}"]\ninput_per_1m = {input_per_1m}\noutput_per_1m = {output_per_1m}\n' + if source is not None: + block += f'source = "{source}"\n' + return block + + +@pytest.fixture +def empty_user_layer(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Path: + user_file = tmp_path / "user-home" / ".nasde" / "pricing.toml" + monkeypatch.setattr(pricing_module, "_user_pricing_path", lambda: user_file) + return user_file def test_load_bundled_pricing_has_matrix_models() -> None: @@ -42,3 +68,90 @@ def test_compute_cost_unknown_model_returns_none() -> None: def test_pricing_as_of_unknown_model_is_none() -> None: assert pricing_as_of("nonexistent-model", load_pricing()) is None assert pricing_as_of("gpt-5.4", load_pricing()) == "2026-06-08" + + +def test_layered_no_overrides_returns_bundled(tmp_path: Path, empty_user_layer: Path) -> None: + merged = load_pricing_layered(tmp_path) + assert set(merged) == set(load_pricing()) + + +def test_layered_project_overrides_bundled(tmp_path: Path, empty_user_layer: Path) -> None: + _write_pricing(tmp_path, _model_block("claude-sonnet-4-6", 99.0, 1.0)) + merged = load_pricing_layered(tmp_path) + assert merged["claude-sonnet-4-6"].input_per_1m == 99.0 + assert merged["gpt-5.5"].input_per_1m == load_pricing()["gpt-5.5"].input_per_1m + + +def test_layered_project_adds_new_model(tmp_path: Path, 
empty_user_layer: Path) -> None: + _write_pricing(tmp_path, _model_block("my-model", 7.0, 8.0)) + merged = load_pricing_layered(tmp_path) + assert "my-model" in merged + assert set(load_pricing()).issubset(set(merged)) + + +def test_layered_user_layer_applied(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("claude-opus-4-8", 4.0, 1.0)) + merged = load_pricing_layered(tmp_path) + assert merged["claude-opus-4-8"].input_per_1m == 4.0 + + +def test_layered_project_beats_user(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("claude-opus-4-8", 4.0, 1.0)) + _write_pricing(tmp_path, _model_block("claude-opus-4-8", 6.0, 2.0)) + merged = load_pricing_layered(tmp_path) + assert merged["claude-opus-4-8"].input_per_1m == 6.0 + + +def test_layered_missing_user_file_skipped(tmp_path: Path, empty_user_layer: Path) -> None: + assert not empty_user_layer.exists() + merged = load_pricing_layered(tmp_path) + assert set(merged) == set(load_pricing()) + + +def test_layered_whole_entry_replacement(tmp_path: Path, empty_user_layer: Path) -> None: + bundled = load_pricing()["claude-sonnet-4-6"] + assert bundled.cached_input_per_1m is not None and bundled.source + _write_pricing(tmp_path, _model_block("claude-sonnet-4-6", 2.0, 1.0)) + merged = load_pricing_layered(tmp_path) + assert merged["claude-sonnet-4-6"].input_per_1m == 2.0 + assert merged["claude-sonnet-4-6"].cached_input_per_1m is None + assert merged["claude-sonnet-4-6"].source == "" + + +def test_layered_three_layers_compose(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text( + _model_block("claude-opus-4-8", 4.0, 1.0) + _model_block("azure-gpt5", 1.0, 2.0) + ) + _write_pricing( + tmp_path, + _model_block("claude-sonnet-4-6", 2.0, 1.0) + 
+ _model_block("azure-gpt5", 0.5, 1.0) + + _model_block("enterprise-claude", 10.0, 20.0), + ) + merged = load_pricing_layered(tmp_path) + assert set(merged) == { + "gpt-5.5", + "gpt-5.4", + "claude-opus-4-8", + "claude-sonnet-4-6", + "azure-gpt5", + "enterprise-claude", + } + assert merged["gpt-5.5"].input_per_1m == 5.0 + assert merged["gpt-5.4"].input_per_1m == 2.50 + assert merged["claude-opus-4-8"].input_per_1m == 4.0 + assert merged["claude-sonnet-4-6"].input_per_1m == 2.0 + assert merged["azure-gpt5"].input_per_1m == 0.5 + assert merged["enterprise-claude"].input_per_1m == 10.0 + + +def test_layered_three_layers_whole_entry_on_overlap(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("azure-gpt5", 1.0, 2.0, source="azure-contract")) + _write_pricing(tmp_path, _model_block("azure-gpt5", 0.5, 1.0)) + merged = load_pricing_layered(tmp_path) + assert merged["azure-gpt5"].input_per_1m == 0.5 + assert merged["azure-gpt5"].source == "" diff --git a/tests/test_results_exporter.py b/tests/test_results_exporter.py index 9c93fe7..9d53a33 100644 --- a/tests/test_results_exporter.py +++ b/tests/test_results_exporter.py @@ -8,6 +8,7 @@ import pytest +from nasde_toolkit import pricing as pricing_module from nasde_toolkit.results_exporter import ( _capture_patch, _classify_path, @@ -15,6 +16,10 @@ ) +def _model_block(name: str, input_per_1m: float, output_per_1m: float) -> str: + return f'[models."{name}"]\ninput_per_1m = {input_per_1m}\noutput_per_1m = {output_per_1m}\n' + + def _git(workspace: Path, *args: str) -> str: completed = subprocess.run( ["git", "-C", str(workspace), *args], @@ -321,6 +326,47 @@ def test_export_legacy_bare_idempotent(tmp_path: Path) -> None: assert out.name in second.skipped +@pytest.fixture +def empty_user_layer(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Path: + user_file = tmp_path / "user-home" / ".nasde" / "pricing.toml" + 
monkeypatch.setattr(pricing_module, "_user_pricing_path", lambda: user_file) + return user_file + + +def test_export_picks_up_project_pricing_override(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: + project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text(_model_block("claude-sonnet-4-6", 30.0, 150.0)) + dest = tmp_path / "export" + + export_results([job_dir], dest, project_dir=project) + metrics = json.loads((dest / "2026-06-03__demo-job__demo-task__aaa111" / "metrics.json").read_text()) + + assert metrics["cost_usd"] == pytest.approx(1_000_000 / 1e6 * 30.0 + 60_000 / 1e6 * 150.0) + + +def test_export_project_dir_none_uses_bundled(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: + dest = tmp_path / "export" + export_results([job_dir], dest, project_dir=None) + metrics = json.loads((dest / "2026-06-03__demo-job__demo-task__aaa111" / "metrics.json").read_text()) + + assert metrics["cost_usd"] == pytest.approx(3.9) + + +def test_export_three_layer_compose_e2e(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("claude-sonnet-4-6", 1.0, 2.0)) + project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text(_model_block("claude-sonnet-4-6", 0.5, 1.0)) + dest = tmp_path / "export" + + export_results([job_dir], dest, project_dir=project) + metrics = json.loads((dest / "2026-06-03__demo-job__demo-task__aaa111" / "metrics.json").read_text()) + + assert metrics["cost_usd"] == pytest.approx(1_000_000 / 1e6 * 0.5 + 60_000 / 1e6 * 1.0) + + def test_capture_patch_includes_non_ascii_untracked_filename(tmp_path: Path) -> None: workspace = tmp_path / "artifacts" / "workspace" workspace.mkdir(parents=True) diff --git a/website/src/content/docs/concepts/token-cost.md b/website/src/content/docs/concepts/token-cost.md index 22c97c6..b3852e0 100644 --- 
a/website/src/content/docs/concepts/token-cost.md +++ b/website/src/content/docs/concepts/token-cost.md @@ -62,8 +62,26 @@ source = "https://…" A model that isn't in the catalog still gets token metrics — only its `cost_usd` is left blank (with a warning), never a wrong number. -:::note[Editing the catalog] -The catalog is **bundled into the package**, so editing it depends on how you installed Nasde. From a source checkout (`uv sync`) you can edit `src/nasde_toolkit/pricing.toml` directly. After a PyPI install (`uv tool install` / pipx) the file lives inside an isolated environment and any edit is overwritten on the next upgrade — so for now, adding a model or correcting a rate means contributing it upstream or running from source. A per-project / per-user pricing override is a planned improvement. +### Overriding rates — drop a `pricing.toml` + +The bundled catalog is the **floor**. To correct a rate or add a model, drop your own `pricing.toml` at one of two locations — Nasde finds it by name, no config setting: + +- **`/pricing.toml`** — per-project, sits next to `nasde.toml`. Highest precedence. +- **`~/.nasde/pricing.toml`** — per-user, applies to every project on the machine. + +The precedence is **project > user > bundled**, merged **per model**: each override file lists *only* the models you want to change or add, and every other model falls through to the layer below. Overriding one model leaves the rest of the catalog intact. (A model entry is replaced whole — fields you omit take their defaults, they aren't inherited from the bundled entry.) When an override is applied, Nasde prints a line saying so. Both `nasde run` and `nasde results-export` read the same layered catalog, so a trial's cost is identical whether you see it in the run summary or a later export. 
+ +```toml +# ~/.nasde/pricing.toml — your enterprise rate for one model; the rest stays bundled +[models."claude-opus-4-8"] +input_per_1m = 4.0 +output_per_1m = 12.0 +as_of = "2026-06-22" +source = "internal contract" +``` + +:::note[Editing the bundled catalog directly] +You can still edit the bundled `src/nasde_toolkit/pricing.toml` from a source checkout (`uv sync`). After a PyPI install (`uv tool install` / pipx) the bundled file lives inside an isolated environment and is overwritten on upgrade — so prefer a `pricing.toml` override (above), which survives upgrades, or contribute the rate upstream. ::: :::caution[Confirm rates before quoting costs] diff --git a/website/src/content/docs/reference/configuration.md b/website/src/content/docs/reference/configuration.md index c1433f9..4a9c573 100644 --- a/website/src/content/docs/reference/configuration.md +++ b/website/src/content/docs/reference/configuration.md @@ -21,6 +21,7 @@ That's the whole agent-under-test surface. The walkthrough — what each knob is my-benchmark/ nasde.toml # Project configuration assessment_dimensions.json # Scoring dimensions (shared across tasks) + pricing.toml # Optional: per-project model price overrides (project > ~/.nasde > bundled) tasks/ feature-a/ task.toml # Task config (Harbor sections + [nasde.source] / [nasde.plugin]) From bcc9639f4559db9bb0e7a8c8bbcd65b5269eef6f Mon Sep 17 00:00:00 2001 From: Szymon Janikowski Date: Mon, 22 Jun 2026 16:04:11 +0200 Subject: [PATCH 02/10] style: ruff format test_pricing.py CI runs `ruff format --check` in addition to `ruff check`; the new three-layer test had a long line the formatter wraps. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/test_pricing.py | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/tests/test_pricing.py b/tests/test_pricing.py index b5917dc..0c41521 100644 --- a/tests/test_pricing.py +++ b/tests/test_pricing.py @@ -122,9 +122,7 @@ def test_layered_whole_entry_replacement(tmp_path: Path, empty_user_layer: Path) def test_layered_three_layers_compose(tmp_path: Path, empty_user_layer: Path) -> None: empty_user_layer.parent.mkdir(parents=True, exist_ok=True) - empty_user_layer.write_text( - _model_block("claude-opus-4-8", 4.0, 1.0) + _model_block("azure-gpt5", 1.0, 2.0) - ) + empty_user_layer.write_text(_model_block("claude-opus-4-8", 4.0, 1.0) + _model_block("azure-gpt5", 1.0, 2.0)) _write_pricing( tmp_path, _model_block("claude-sonnet-4-6", 2.0, 1.0) From ad118fb9adea9fd4c4566c174e07241a0d5a29d2 Mon Sep 17 00:00:00 2001 From: Szymon Janikowski Date: Mon, 22 Jun 2026 16:32:19 +0200 Subject: [PATCH 03/10] fix(pricing): address cloud-review findings on layered override MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - bug_001 (normal): migrate-evals dropped project_dir, so its assessment_summary.json cost used bundled+user only, disagreeing with the other three write paths (ADR-013 invariant). Thread project_dir through cli → migrate_job_evals → migrate_trial_evals → load_pricing_layered. - bug_006 (normal): switching the exporter to load_pricing_layered made pre-existing tests read the developer's real ~/.nasde/pricing.toml. Move empty_user_layer to an autouse fixture in tests/conftest.py so the whole suite is hermetic by default; layered tests opt in by name. - bug_004 (nit): calibration publish re-read layered pricing per-trial (N×L transparency lines). Hoist load_pricing_layered above the loop, thread pricing through _publish_one_trial → _open_pr_for_trial. 
- bug_003 (nit): move the orphaned cost_efficiency/token_efficiency hasattr guards back into test_assessment_summary_includes_economics. +1 regression test (migrate-evals threads project pricing). 412 tests, ruff check + format, mypy green. Verified with a HOME-override hermeticity check and a migrate-evals project-override smoke. Co-Authored-By: Claude Opus 4.8 (1M context) --- src/nasde_toolkit/calibration_publisher.py | 11 +++--- src/nasde_toolkit/cli.py | 4 +-- src/nasde_toolkit/eval_migration.py | 15 +++++--- tests/conftest.py | 22 ++++++++++++ tests/test_eval_migration.py | 40 +++++++++++++++++++++- tests/test_evaluator.py | 4 +-- tests/test_pricing.py | 8 ----- tests/test_results_exporter.py | 8 ----- 8 files changed, 82 insertions(+), 30 deletions(-) create mode 100644 tests/conftest.py diff --git a/src/nasde_toolkit/calibration_publisher.py b/src/nasde_toolkit/calibration_publisher.py index cf6a15a..a0e3be9 100644 --- a/src/nasde_toolkit/calibration_publisher.py +++ b/src/nasde_toolkit/calibration_publisher.py @@ -36,7 +36,7 @@ push_feature_branch, ) from nasde_toolkit.git_platform_backends.protocol import GitPlatformBackend, PrRef, ReviewComment -from nasde_toolkit.pricing import load_pricing_layered +from nasde_toolkit.pricing import ModelPrice, load_pricing_layered from nasde_toolkit.results_exporter import ( _build_metrics, _capture_patch, @@ -89,11 +89,12 @@ def publish_trials( """Publish the given job/trial paths as PRs/MRs on the sink repo.""" backend = _preflight(repo, repo_url, platform_override) trials = _expand_to_trials(paths) + pricing = load_pricing_layered(project_root) summary = PublishSummary() for index, (_, trial_dir) in enumerate(trials): if index > 0: time.sleep(throttle_sec) - _publish_one_trial(trial_dir, backend, repo, repo_url, base_branch, project_root, summary) + _publish_one_trial(trial_dir, backend, repo, repo_url, base_branch, project_root, pricing, summary) _print_publish_summary(summary, repo) return summary @@ -138,6 +139,7 
@@ def _publish_one_trial( repo_url: str, base_branch: str, project_root: Path | None, + pricing: dict[str, ModelPrice], summary: PublishSummary, ) -> None: label = trial_dir.name @@ -149,7 +151,7 @@ def _publish_one_trial( if existing is not None: _record_skip(summary, label, base, feature, existing) return - created = _open_pr_for_trial(trial_dir, backend, repo, repo_url, base, feature, project_root) + created = _open_pr_for_trial(trial_dir, backend, repo, repo_url, base, feature, project_root, pricing) summary.published.append(PublishedTrial(label, base, feature, created.number, created.url, created=True)) console.print(f" [green]published: {label} → {created.url}[/green]") except Exception as error: @@ -165,9 +167,10 @@ def _open_pr_for_trial( base: str, feature: str, project_root: Path | None, + pricing: dict[str, ModelPrice], ) -> PrRef: workspace = trial_dir / "artifacts" / "workspace" - metrics = _build_metrics(trial_dir, load_pricing_layered(project_root)) + metrics = _build_metrics(trial_dir, pricing) summary = _summarize_trial(trial_dir) title = _pr_title(trial_dir, summary) body = _render_pr_body(summary, metrics) diff --git a/src/nasde_toolkit/cli.py b/src/nasde_toolkit/cli.py index afa5b5d..8454ff1 100644 --- a/src/nasde_toolkit/cli.py +++ b/src/nasde_toolkit/cli.py @@ -483,7 +483,7 @@ def migrate_evals_command( from nasde_toolkit.config import load_project_config from nasde_toolkit.eval_migration import migrate_job_evals - load_project_config(project_dir.resolve()) + config = load_project_config(project_dir.resolve()) from nasde_toolkit.banner import print_banner @@ -495,7 +495,7 @@ def migrate_evals_command( ) ) - outcomes = migrate_job_evals(path.resolve(), dry_run=dry_run) + outcomes = migrate_job_evals(path.resolve(), dry_run=dry_run, project_dir=config.project_dir) from rich.table import Table diff --git a/src/nasde_toolkit/eval_migration.py b/src/nasde_toolkit/eval_migration.py index fcb99bb..781d52b 100644 --- 
a/src/nasde_toolkit/eval_migration.py +++ b/src/nasde_toolkit/eval_migration.py @@ -23,23 +23,28 @@ console = Console() -def migrate_job_evals(path: Path, dry_run: bool = False) -> dict[str, int]: +def migrate_job_evals(path: Path, dry_run: bool = False, project_dir: Path | None = None) -> dict[str, int]: outcomes = {"migrated": 0, "summarized": 0, "noop": 0} - pricing = load_pricing_layered() + pricing = load_pricing_layered(project_dir) for trial_dir in _find_trial_dirs(path): - outcome = migrate_trial_evals(trial_dir, pricing, dry_run=dry_run) + outcome = migrate_trial_evals(trial_dir, pricing, dry_run=dry_run, project_dir=project_dir) outcomes[outcome] += 1 return outcomes -def migrate_trial_evals(trial_dir: Path, pricing: dict[str, ModelPrice] | None = None, dry_run: bool = False) -> str: +def migrate_trial_evals( + trial_dir: Path, + pricing: dict[str, ModelPrice] | None = None, + dry_run: bool = False, + project_dir: Path | None = None, +) -> str: bare = trial_dir / "assessment_eval.json" numbered = _numbered_eval_files(trial_dir) if not bare.exists() and not numbered: return "noop" - pricing = pricing if pricing is not None else load_pricing_layered() + pricing = pricing if pricing is not None else load_pricing_layered(project_dir) changed = _normalize_raw_files(trial_dir, bare, numbered, dry_run) if not dry_run: _write_assessment_summary(trial_dir, pricing) diff --git a/tests/conftest.py b/tests/conftest.py new file mode 100644 index 0000000..a51c777 --- /dev/null +++ b/tests/conftest.py @@ -0,0 +1,22 @@ +"""Shared pytest fixtures.""" + +from __future__ import annotations + +from pathlib import Path + +import pytest + +from nasde_toolkit import pricing as pricing_module + + +@pytest.fixture(autouse=True) +def empty_user_layer(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Path: + """Isolate the layered-pricing user layer from the developer's real ~/.nasde. 
+ + Autouse so every test in the tree is hermetic by default: `load_pricing_layered` + never reads the host's `~/.nasde/pricing.toml`. Tests that need a real user + layer request this fixture by name and write to the returned path. + """ + user_file = tmp_path / "user-home" / ".nasde" / "pricing.toml" + monkeypatch.setattr(pricing_module, "_user_pricing_path", lambda: user_file) + return user_file diff --git a/tests/test_eval_migration.py b/tests/test_eval_migration.py index 5247cd4..c803252 100644 --- a/tests/test_eval_migration.py +++ b/tests/test_eval_migration.py @@ -5,7 +5,9 @@ import json from pathlib import Path -from nasde_toolkit.eval_migration import migrate_trial_evals +import pytest + +from nasde_toolkit.eval_migration import migrate_job_evals, migrate_trial_evals def _eval_payload(normalized_score: float, evaluator_model: str = "claude-opus-4-7") -> dict: @@ -120,3 +122,39 @@ def test_migrate_mixed_models_yields_two_groups(tmp_path: Path) -> None: summary = json.loads((tmp_path / "assessment_summary.json").read_text()) models = {g["evaluator_model"] for g in summary["groups"]} assert models == {"claude-opus-4-7", "codex-gpt-5"} + + +def _seed_trajectory(trial_dir: Path) -> None: + (trial_dir / "config.json").write_text( + json.dumps({"agent": {"name": "demo-variant", "model_name": "claude-sonnet-4-6"}}) + ) + (trial_dir / "result.json").write_text(json.dumps({"trial_name": trial_dir.name})) + agent_dir = trial_dir / "agent" + agent_dir.mkdir() + (agent_dir / "trajectory.json").write_text( + json.dumps( + { + "final_metrics": { + "total_prompt_tokens": 1_000_000, + "total_completion_tokens": 50_000, + "extra": {"reasoning_output_tokens": 10_000}, + } + } + ) + ) + + +def test_migrate_job_evals_threads_project_pricing(tmp_path: Path) -> None: + job = tmp_path / "jobs" / "demo-job" + trial = job / "demo-task__aaa" + trial.mkdir(parents=True) + _seed_trajectory(trial) + _write(trial, "assessment_eval.json", _eval_payload(0.6)) + project = tmp_path / 
"project" + project.mkdir() + (project / "pricing.toml").write_text('[models."claude-sonnet-4-6"]\ninput_per_1m = 30.0\noutput_per_1m = 150.0\n') + + migrate_job_evals(job, project_dir=project) + + summary = json.loads((trial / "assessment_summary.json").read_text()) + assert summary["cost_usd"] == pytest.approx(1_000_000 / 1e6 * 30.0 + 60_000 / 1e6 * 150.0) diff --git a/tests/test_evaluator.py b/tests/test_evaluator.py index 2add714..7f7ac2e 100644 --- a/tests/test_evaluator.py +++ b/tests/test_evaluator.py @@ -278,6 +278,8 @@ def test_assessment_summary_includes_economics(tmp_path: Path) -> None: # sonnet $3/$15: 1M*3 + 0.06M*15 = 3.9 assert summary.cost_usd == pytest.approx(3.9) assert summary.pricing_as_of == "2026-06-08" + assert not hasattr(summary, "cost_efficiency") # removed: arbitrary zero → use Pareto front + assert not hasattr(summary, "token_efficiency") def _model_block(name: str, input_per_1m: float, output_per_1m: float) -> str: @@ -319,8 +321,6 @@ def test_evaluator_three_layer_compose_e2e(tmp_path: Path, monkeypatch: pytest.M assert summary is not None assert summary.model_name == "azure-gpt5" assert summary.cost_usd == pytest.approx(1_000_000 / 1e6 * 0.5 + 60_000 / 1e6 * 1.0) - assert not hasattr(summary, "cost_efficiency") # removed: arbitrary zero → use Pareto front - assert not hasattr(summary, "token_efficiency") def test_assessment_summary_economics_null_without_trajectory(tmp_path: Path) -> None: diff --git a/tests/test_pricing.py b/tests/test_pricing.py index 0c41521..a7b63c4 100644 --- a/tests/test_pricing.py +++ b/tests/test_pricing.py @@ -6,7 +6,6 @@ import pytest -from nasde_toolkit import pricing as pricing_module from nasde_toolkit.pricing import ( compute_cost_usd, load_pricing, @@ -28,13 +27,6 @@ def _model_block(name: str, input_per_1m: float, output_per_1m: float, source: s return block -@pytest.fixture -def empty_user_layer(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Path: - user_file = tmp_path / "user-home" / ".nasde" 
/ "pricing.toml" - monkeypatch.setattr(pricing_module, "_user_pricing_path", lambda: user_file) - return user_file - - def test_load_bundled_pricing_has_matrix_models() -> None: pricing = load_pricing() for model in ("gpt-5.5", "gpt-5.4", "claude-opus-4-8", "claude-sonnet-4-6"): diff --git a/tests/test_results_exporter.py b/tests/test_results_exporter.py index 9d53a33..8b154bf 100644 --- a/tests/test_results_exporter.py +++ b/tests/test_results_exporter.py @@ -8,7 +8,6 @@ import pytest -from nasde_toolkit import pricing as pricing_module from nasde_toolkit.results_exporter import ( _capture_patch, _classify_path, @@ -326,13 +325,6 @@ def test_export_legacy_bare_idempotent(tmp_path: Path) -> None: assert out.name in second.skipped -@pytest.fixture -def empty_user_layer(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Path: - user_file = tmp_path / "user-home" / ".nasde" / "pricing.toml" - monkeypatch.setattr(pricing_module, "_user_pricing_path", lambda: user_file) - return user_file - - def test_export_picks_up_project_pricing_override(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: project = tmp_path / "project" project.mkdir() From a5ac22cc0c2b588b734ff9a33d3bce5ae325efbb Mon Sep 17 00:00:00 2001 From: Szymon Janikowski Date: Tue, 23 Jun 2026 13:36:34 +0200 Subject: [PATCH 04/10] feat(pricing): surface the effective merged catalog (show / run / export) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Layered overrides (ADR-013) had no way to inspect the merged result. Add layer provenance and three read surfaces: - pricing.py: resolve_pricing_layers(project_dir) → ordered PricingLayer stack; effective_pricing_with_source(project_dir) → {model: (price, layer)}. load_pricing_layered reimplemented on the same core (DRY, unchanged behaviour + transparency line). load_pricing(path) and bundled lru_cache untouched. 
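Illustration only, not the code in this patch: a minimal sketch of how a per-model, whole-entry merge can carry layer provenance, assuming layers are applied lowest precedence first. `Price` and `merge_with_source` are hypothetical names standing in for `ModelPrice` and the real helpers; the rates are the ones used in the tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    # Hypothetical stand-in for ModelPrice; only the two rate fields are modeled.
    input_per_1m: float
    output_per_1m: float


def merge_with_source(
    layers: list[tuple[str, dict[str, Price]]],
) -> dict[str, tuple[Price, str]]:
    """Merge layers given lowest precedence first.

    A later layer replaces a model's entry whole and records its own name
    as the source; models it does not mention fall through unchanged.
    """
    effective: dict[str, tuple[Price, str]] = {}
    for layer_name, catalog in layers:
        for model, price in catalog.items():
            effective[model] = (price, layer_name)
    return effective


# bundled < user < project, mirroring the three-layer compose tests.
effective = merge_with_source(
    [
        ("bundled", {"claude-sonnet-4-6": Price(3.0, 15.0)}),
        ("user", {"azure-gpt5": Price(1.0, 2.0)}),
        ("project", {"azure-gpt5": Price(0.5, 1.0)}),
    ]
)

for model, (price, source) in sorted(effective.items()):
    print(f"{model}: in={price.input_per_1m} out={price.output_per_1m} [{source}]")
```

In this sketch the user and project layers both define `azure-gpt5`, so the project entry wins and is reported as the source, while `claude-sonnet-4-6` still resolves to the bundled layer.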
- nasde pricing show [--show-source]: new `pricing` sub-app printing the effective catalog (Model / In / Out / as_of, +Layer with --show-source). Sub-app leaves room for future pricing validate/path. - nasde run: "Pricing used (effective)" table at the end of the summary, filtered to the models actually in the run, with source layer. - results-export: pricing_used.json next to the trials — effective rate + source layer per priced model, so a report is a self-contained cost audit. - pricing_report.py: shared Rich table renderer (show + run). Docs: token-cost.md (verifying the catalog), CLAUDE.md CLI reference + architecture note, ADR-013 provenance note. Website builds clean. 423 tests, ruff check + format, mypy green; smoke-verified pricing show and pricing_used.json on a real trial. Co-Authored-By: Claude Opus 4.8 (1M context) --- CLAUDE.md | 9 ++- docs/adr/013-layered-pricing-override.md | 7 ++ src/nasde_toolkit/cli.py | 36 +++++++++ src/nasde_toolkit/pricing.py | 81 ++++++++++++++----- src/nasde_toolkit/pricing_report.py | 38 +++++++++ src/nasde_toolkit/results_exporter.py | 24 ++++++ src/nasde_toolkit/runner.py | 34 +++++++- tests/test_cli.py | 32 ++++++++ tests/test_pricing.py | 38 +++++++++ tests/test_results_exporter.py | 23 ++++++ tests/test_runner_economics.py | 20 +++++ .../src/content/docs/concepts/token-cost.md | 11 +++ 12 files changed, 328 insertions(+), 25 deletions(-) create mode 100644 src/nasde_toolkit/pricing_report.py diff --git a/CLAUDE.md b/CLAUDE.md index 08ab070..f83e122 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -18,7 +18,8 @@ src/nasde_toolkit/ claude_subprocess.py # `claude -p` subprocess backend (default) codex_subprocess.py # `codex exec` subprocess backend results_exporter.py # `nasde results-export` — copy trial artifact essence to a plain dir - pricing.py / pricing.toml # Versioned model price catalog (ADR-011) — load + compute_cost_usd; layered override via load_pricing_layered (ADR-013) + pricing.py / pricing.toml # Versioned model 
price catalog (ADR-011) — load + compute_cost_usd; layered override via load_pricing_layered + provenance via resolve_pricing_layers/effective_pricing_with_source (ADR-013) + pricing_report.py # Shared Rich table renderer for the effective price catalog (nasde pricing show + run summary, ADR-013) token_metrics.py # Token usage + cost economics from trajectory final_metrics (ADR-011) calibration_publisher.py # `nasde calibrate` — publish trial diffs+assessments as PRs/MRs (ADR-010) calibration_resolve.py # Resolve [calibration] repo → (api slug, push url, platform) @@ -88,7 +89,7 @@ See [ARCHITECTURE.md](ARCHITECTURE.md) for the full system architecture with dia - **Pass-through CLI**: `nasde harbor ...` delegates to Harbor's Typer app via `add_typer()`. `nasde opik ...` forwards args to Opik's Click CLI via `ctx.args`. - **Rubric calibration (ADR-010)**: `nasde calibrate publish PATHS...` / `pull-comments` close the loop between the LLM-as-a-Judge and a human reviewer by publishing trial diffs + assessments as PRs/MRs and pulling review comments back for rubric tuning. Two layers, deliberately separated: **GIT** (`git_platform_backends/git_ops.py` — `git push`/`ls-remote`, platform-agnostic, subprocess pattern from `docker.py`, not behind a Protocol) and **PLATFORM** (`git_platform_backends/` behind a `@runtime_checkable GitPlatformBackend` Protocol — `repo_exists`/`find_open_pr_for_branch`/`create_pr`/`fetch_pr_comments`/`validate_cli_installed`/`validate_auth`, mirroring `evaluator_backends/`). The base is keyed on `(repo, commit)` as an **orphan branch** `base/-` seeded once via `git archive HEAD` from the trial workspace (git deduplicates blobs by content across orphan bases — no shared ancestor needed); each trial is a feature branch `calib/-/` = base + the agent's `changes.patch` applied as a real commit + `.calibration/` files (no trajectory — secrets/clutter). 
`.calibration/` carries the reviewer's context: the task's `instruction.md` + `assessment_criteria.md` + `assessment_dimensions.json` (resolved from `result.json` `task_name`/`source`, trying both `tasks/` and `evals//tasks/` layouts), all `assessment_eval_.json`, `assessment_summary.json`, and `metrics.json`. Idempotency is **open-only**: `find_open_pr_for_branch` matches only OPEN PRs/MRs (`gh pr list --state open`, `glab mr list` default), so a re-run skips a live round but lets a fresh round publish once the prior one is closed. The PR body is a pure transform of the dominant `AssessmentSummary` cluster (`calibration_publisher._render_pr_body`). Backend is **auto-detected from the sink repo URL host** (`github.com`→`gh`, `*gitlab*`→`glab`; `[calibration] platform` overrides for self-hosted) — no `backend` config field, eliminating the backend≠host mismatch. Preflight before any work: detect → `validate_cli_installed` (`shutil.which`, precise per-platform message) → `validate_auth` (`gh|glab auth status` exit code) → `repo_exists` (parses OUTPUT — `gh repo view` exits 0 even for a missing repo). Repo creation is out of scope (push creates branches ad-hoc in an existing repo). Reuses `_expand_to_trials`/`_capture_patch`/`_build_metrics` from `results_exporter.py` and `_aggregate_evaluations`/`_load_raw_evaluations`/`AssessmentSummary` from `evaluator.py`. Prerequisites mirror the evaluator's CLI requirement (ADR-002): `git` + `gh`/`glab` + login, no SDK, CLI keyring holds auth. The `nasde-benchmark-calibration` skill orchestrates the human-in-the-loop flow. - **Results export (EXPERIMENTAL)**: `results_exporter.py` + `nasde results-export PATHS... --to DIR` copy the analytic *essence* of trial artifacts out of the gitignored `jobs/` tree into a flat per-trial layout (`DIR/__/` with `metrics.json`, `assessment_eval_*.json` (all repetitions), `assessment_summary.json`, `trajectory.json`, `changes.patch`, `verifier_stdout.txt`, `reward.txt`). 
Re-export **merges**: missing eval files are copied and the summary/metrics refreshed, while immutable files (trajectory, patch) are left as-is — so evaluations added after a first export are picked up. A legacy bare `assessment_eval.json` (pre-migration trial) is exported as `assessment_eval_1.json` with a `nasde migrate-evals` hint, so the export is never silently empty. Filesystem-as-interface: `DIR` is any plain path (iCloud/Dropbox/git repo) — no cloud SDK. It scans Harbor artifacts (`result.json`/`config.json`/`assessment_eval*.json`/`agent/trajectory.json`/workspace), **not** the best-effort `EXPERIMENT_LOG.md`. `metrics.json` is a self-contained summary composed from `result.json`+`config.json`+`agent/trajectory.json` — including **token & cost economics** (`token_usage`, `cost_usd`, `pricing_as_of`, `reasoning_effort`); see the token-cost note below and [ADR-011](docs/adr/011-token-cost-metrics.md). The code diff is captured as a patch (`git diff HEAD` + untracked via `git ls-files -z` + `git diff --no-index`, never `git add` — the workspace `.git` index is left untouched; `-z`/NUL parsing means non-ASCII untracked filenames are not dropped under `core.quotepath`). Selection is a mixed positional list of job and/or trial dirs (auto-classified: a dir whose children have `result.json` is a job; else a dir whose own `result.json` carries a `trial_name` key is a trial; a dir with a job-level `result.json` but no trial-shaped children/`trial_name` is skipped with a warning rather than mis-exported as garbage); re-export is idempotent and merge-based (a trial is reported `exported` only when something new was copied, else `skipped`). Reuses `_collect_trial_dirs`/`_load_json`/`_compute_duration_sec`/`_resolve_agent_name` from `evaluator.py`. Deliberately does **not** model "experiments" (one job can belong to many — a future UI layer's concern). 
-- **Token & cost metrics (ADR-011)**: every trial gets **token usage + USD cost** computed from the agent's `agent/trajectory.json` `final_metrics` and a versioned price catalog. `token_metrics.py` is the **single extractor** feeding both write paths: `evaluator.py` writes them onto `assessment_summary.json` (run) and `results_exporter.py` onto `metrics.json` (export). Definitions: `input = total_prompt_tokens` (full, cache included), `output = total_completion_tokens + extra.reasoning_output_tokens` (Codex reasoning folded into output), `total = input + output`. **Cost is "as if every run were the first"** — full prompt volume at full input rate, **no cache discount** — so it is deterministic and independent of run order / cache TTL (the prompt-token count is fixed for a task; the cache hit rate is not). **The scalar `token_efficiency`/`cost_efficiency` ratios were REMOVED** — `normalized_score / denominator` has an arbitrary zero (score 0 = empty rubric, unreachable), so the ranking is not invariant to a baseline shift; model comparison is now a **Pareto front** (quality vs cost, quality vs tokens), shift-invariant, living in the `nasde-benchmark-runner` skill, not the toolkit. The raw signals (`token_usage`, `cost_usd`, `pricing_as_of`, score) stay and are the source of truth. Economics are **per-trial** (one agent run) → they live on `AssessmentSummary`, not `EvaluatorGroupSummary`. `model_name` and `reasoning_effort` are stamped on the summary because cross-model analysis groups by `(agent_name, model_name, reasoning_effort)` (agent_name = variant name, does not distinguish models; a different effort is a different configuration, never averaged together — mirrors how a changed `dimensions_fingerprint` is a different benchmark). 
The `reasoning_effort` stamp is read back from the per-trial Harbor `config.json` (`config.agent.kwargs.reasoning_effort`); when no override was set the stamp is `""` (only explicit overrides are recorded — Codex's implicit `high` default is NOT fabricated, and an unset effort is a valid "family default" state). Pricing is loaded via `pricing.py::load_pricing_layered(project_dir)` (ADR-013), which merges three layers per-model (whole-entry, higher wins): `/pricing.toml` > `~/.nasde/pricing.toml` > bundled `pricing.toml`. **Convention, not config** — a file literally named `pricing.toml` (mirrors `assessment_dimensions.json`); no `[pricing]` key. User layer is a HOME dotfolder (`~/.nasde/`, like `~/.claude`/`~/.codex`/`~/.gemini`), deliberately NOT `platformdirs` (that maps to `~/Library/Application Support` on macOS = app-state, not user-editable config; `platformdirs` stays for cache in `update_check.py`). Both write paths thread `project_dir` so run (`assessment_summary.json`) and export (`metrics.json`) agree on cost — the ADR-011 single-extractor invariant. The merged catalog is NOT cached (depends on `project_dir` + on-disk contents); bundled `_load_bundled_pricing` keeps its `lru_cache`, `load_pricing(path)` is unchanged. An applied override prints a dim transparency line. Each model stamped with `as_of` + `source`; `cached_input_per_1m` is recorded for reference but **not** used in the cost formula. An **unpriced model** → `cost_usd` = `null` + a warning (token metrics still computed); a **missing/legacy trajectory** → all economics `null`. Never crashes the run. `nasde run` prints a per-`(agent, model, effort)` cost table (trials, score, tokens, $cost) plus the job path and an export hint (`runner.py::_print_job_summary`, called after assessment so the summaries exist); raw cost/token columns carry an inter-trial `±std` when the group has ≥2 trials (n=1 → bare value). 
Backfilling existing exports whose source jobs are gone is a **one-shot ad-hoc script** (reads the export's own flat `trajectory.json`) — deliberately NOT a CLI command. +- **Token & cost metrics (ADR-011)**: every trial gets **token usage + USD cost** computed from the agent's `agent/trajectory.json` `final_metrics` and a versioned price catalog. `token_metrics.py` is the **single extractor** feeding both write paths: `evaluator.py` writes them onto `assessment_summary.json` (run) and `results_exporter.py` onto `metrics.json` (export). Definitions: `input = total_prompt_tokens` (full, cache included), `output = total_completion_tokens + extra.reasoning_output_tokens` (Codex reasoning folded into output), `total = input + output`. **Cost is "as if every run were the first"** — full prompt volume at full input rate, **no cache discount** — so it is deterministic and independent of run order / cache TTL (the prompt-token count is fixed for a task; the cache hit rate is not). **The scalar `token_efficiency`/`cost_efficiency` ratios were REMOVED** — `normalized_score / denominator` has an arbitrary zero (score 0 = empty rubric, unreachable), so the ranking is not invariant to a baseline shift; model comparison is now a **Pareto front** (quality vs cost, quality vs tokens), shift-invariant, living in the `nasde-benchmark-runner` skill, not the toolkit. The raw signals (`token_usage`, `cost_usd`, `pricing_as_of`, score) stay and are the source of truth. Economics are **per-trial** (one agent run) → they live on `AssessmentSummary`, not `EvaluatorGroupSummary`. `model_name` and `reasoning_effort` are stamped on the summary because cross-model analysis groups by `(agent_name, model_name, reasoning_effort)` (agent_name = variant name, does not distinguish models; a different effort is a different configuration, never averaged together — mirrors how a changed `dimensions_fingerprint` is a different benchmark). 
The `reasoning_effort` stamp is read back from the per-trial Harbor `config.json` (`config.agent.kwargs.reasoning_effort`); when no override was set the stamp is `""` (only explicit overrides are recorded — Codex's implicit `high` default is NOT fabricated, and an unset effort is a valid "family default" state). Pricing is loaded via `pricing.py::load_pricing_layered(project_dir)` (ADR-013), which merges three layers per-model (whole-entry, higher wins): `/pricing.toml` > `~/.nasde/pricing.toml` > bundled `pricing.toml`. **Convention, not config** — a file literally named `pricing.toml` (mirrors `assessment_dimensions.json`); no `[pricing]` key. User layer is a HOME dotfolder (`~/.nasde/`, like `~/.claude`/`~/.codex`/`~/.gemini`), deliberately NOT `platformdirs` (that maps to `~/Library/Application Support` on macOS = app-state, not user-editable config; `platformdirs` stays for cache in `update_check.py`). Both write paths thread `project_dir` so run (`assessment_summary.json`) and export (`metrics.json`) agree on cost — the ADR-011 single-extractor invariant. The merged catalog is NOT cached (depends on `project_dir` + on-disk contents); bundled `_load_bundled_pricing` keeps its `lru_cache`, `load_pricing(path)` is unchanged. An applied override prints a dim transparency line. **Layer provenance is exposed** (ADR-013): `pricing.py::resolve_pricing_layers(project_dir)` returns the ordered `PricingLayer` stack and `effective_pricing_with_source(project_dir)` returns `{model: (price, layer_name)}`. Surfaced three ways — `nasde pricing show [--show-source]` (sub-app, room for future `pricing validate`/`path`), a "Pricing used" table at the end of `nasde run` (only models in the run), and a `pricing_used.json` written by `results-export` (effective rate + layer per priced model, self-contained audit). The Rich table renderer is shared in `pricing_report.py::render_pricing_table` (used by `pricing show` + run summary). 
Each model stamped with `as_of` + `source`; `cached_input_per_1m` is recorded for reference but **not** used in the cost formula. An **unpriced model** → `cost_usd` = `null` + a warning (token metrics still computed); a **missing/legacy trajectory** → all economics `null`. Never crashes the run. `nasde run` prints a per-`(agent, model, effort)` cost table (trials, score, tokens, $cost) plus the job path and an export hint (`runner.py::_print_job_summary`, called after assessment so the summaries exist); raw cost/token columns carry an inter-trial `±std` when the group has ≥2 trials (n=1 → bare value). Backfilling existing exports whose source jobs are gone is a **one-shot ad-hoc script** (reads the export's own flat `trajectory.json`) — deliberately NOT a CLI command. - See `docs/adr/` for detailed decision records. ## CLI reference @@ -125,6 +126,10 @@ nasde calibrate pull-comments PATHS... [OPTIONS] # Fetch human review comments --json # Machine-readable output for the calibration orchestrator agent -C, --project-dir PATH +nasde pricing show [OPTIONS] # Print the effective merged price catalog (ADR-013) + --show-source # Add a Layer column (project/user/bundled) — debug + -C, --project-dir PATH + nasde init [PROJECT_DIR] # Scaffold new benchmark project -n, --name TEXT diff --git a/docs/adr/013-layered-pricing-override.md b/docs/adr/013-layered-pricing-override.md index 1f135dc..a844223 100644 --- a/docs/adr/013-layered-pricing-override.md +++ b/docs/adr/013-layered-pricing-override.md @@ -78,6 +78,13 @@ bundled `_load_bundled_pricing` keeps its `lru_cache` (invariant), and invariant. - Determinism and the "confirm rates before quoting figures" caveat from ADR-011 carry over — a wrong override produces wrong (but deterministic) costs. 
+- The effective catalog is **inspectable**: `resolve_pricing_layers` / + `effective_pricing_with_source` expose which layer supplied each rate, surfaced + via `nasde pricing show [--show-source]`, a "Pricing used" table in the `nasde run` + summary, and a `pricing_used.json` in `results-export`. This makes the + transparency the convention promises actually verifiable — you can confirm a + three-layer override composed as intended, and an exported report is a + self-contained cost audit. ## References diff --git a/src/nasde_toolkit/cli.py b/src/nasde_toolkit/cli.py index 8454ff1..f511993 100644 --- a/src/nasde_toolkit/cli.py +++ b/src/nasde_toolkit/cli.py @@ -507,6 +507,42 @@ def migrate_evals_command( console.print(table) +# --------------------------------------------------------------------------- +# Pricing sub-app (nasde pricing ...) +# --------------------------------------------------------------------------- + +pricing_app = typer.Typer( + name="pricing", + help="Inspect the effective model price catalog (project > ~/.nasde > bundled).", + no_args_is_help=True, +) +app.add_typer(pricing_app, name="pricing") + + +@pricing_app.command(name="show") +def pricing_show_command( + project_dir: Path = typer.Option( + Path("."), + "--project-dir", + "-C", + help="Path to evaluation project (its pricing.toml is the highest layer).", + ), + show_source: bool = typer.Option( + False, + "--show-source", + help="Add a Layer column showing which layer each rate comes from (debug).", + ), +) -> None: + """Print the effective merged pricing catalog after layered overrides.""" + from nasde_toolkit.config import load_project_config + from nasde_toolkit.pricing import effective_pricing_with_source + from nasde_toolkit.pricing_report import render_pricing_table + + config = load_project_config(project_dir.resolve()) + entries = effective_pricing_with_source(config.project_dir) + console.print(render_pricing_table(entries, show_source=show_source, title="Effective pricing")) + 
+ # --------------------------------------------------------------------------- # Calibration sub-app (nasde calibrate ...) # --------------------------------------------------------------------------- diff --git a/src/nasde_toolkit/pricing.py b/src/nasde_toolkit/pricing.py index 47c047a..8b5a8e5 100644 --- a/src/nasde_toolkit/pricing.py +++ b/src/nasde_toolkit/pricing.py @@ -32,6 +32,16 @@ class ModelPrice: source: str = "" +@dataclass +class PricingLayer: + """One layer of the convention-based pricing stack (bundled / user / project).""" + + name: str + path: Path | None + present: bool + models: dict[str, ModelPrice] + + def compute_cost_usd( input_tokens: int, output_tokens: int, @@ -49,6 +59,21 @@ def compute_cost_usd( return input_tokens / 1_000_000 * price.input_per_1m + output_tokens / 1_000_000 * price.output_per_1m +def effective_pricing_with_source(project_dir: Path | None = None) -> dict[str, tuple[ModelPrice, str]]: + """Merged catalog plus the name of the winning layer for each model. + + Same per-model whole-entry merge as ``load_pricing_layered`` (bundled → user → + project, higher wins), but each value is paired with the layer that supplied it + (``"project"`` / ``"user"`` / ``"bundled"``). Source of truth for the source + column in ``nasde pricing show`` and ``pricing_used.json``. + """ + resolved: dict[str, tuple[ModelPrice, str]] = {} + for layer in resolve_pricing_layers(project_dir): + for name, price in layer.models.items(): + resolved[name] = (price, layer.name) + return resolved + + def load_pricing(path: str | Path | None = None) -> dict[str, ModelPrice]: """Load the model pricing catalog, defaulting to the bundled pricing.toml. @@ -74,13 +99,11 @@ def load_pricing_layered(project_dir: Path | None = None) -> dict[str, ModelPric is inherited from the layer below. A missing project/user file is silently skipped — only the bundled floor is required. See ADR-013. 
""" - merged = dict(load_pricing()) - for override_path in _layered_override_paths(project_dir): - if not override_path.is_file(): - continue - layer = load_pricing(override_path) - merged.update(layer) - console.print(f" [dim]pricing: applied override {override_path} ({len(layer)} model(s))[/dim]") + merged: dict[str, ModelPrice] = {} + for layer in resolve_pricing_layers(project_dir): + merged.update(layer.models) + if layer.path is not None: + console.print(f" [dim]pricing: applied override {layer.path} ({len(layer.models)} model(s))[/dim]") return merged @@ -90,32 +113,48 @@ def pricing_as_of(model: str, pricing: dict[str, ModelPrice]) -> str | None: return price.as_of if price is not None else None +def resolve_pricing_layers(project_dir: Path | None = None) -> list[PricingLayer]: + """The pricing stack from lowest precedence (bundled) to highest (project). + + The bundled layer is always present; the user (``~/.nasde/pricing.toml``) and + project (``/pricing.toml``) layers appear only when their file + exists, each carrying just the models it declares. See ADR-013. 
+ """ + layers = [PricingLayer(name="bundled", path=None, present=True, models=dict(load_pricing()))] + layers.extend(_override_layers(project_dir)) + return layers + + @lru_cache(maxsize=1) def _load_bundled_pricing() -> dict[str, ModelPrice]: return _pricing_from_raw(_read_pricing_toml(None)) -def _layered_override_paths(project_dir: Path | None) -> list[Path]: - paths = [_user_pricing_path()] +def _override_layers(project_dir: Path | None) -> list[PricingLayer]: + layers = [] + for name, path in _override_layer_paths(project_dir): + if not path.is_file(): + continue + layers.append(PricingLayer(name=name, path=path, present=True, models=load_pricing(path))) + return layers + + +def _override_layer_paths(project_dir: Path | None) -> list[tuple[str, Path]]: + named = [("user", _user_pricing_path())] if project_dir is not None: - paths.append(project_dir / "pricing.toml") - return _deduped_paths(paths) + named.append(("project", project_dir / "pricing.toml")) + return _deduped_named_paths(named) def _user_pricing_path() -> Path: return Path.home() / ".nasde" / "pricing.toml" -def _deduped_paths(paths: list[Path]) -> list[Path]: - seen: set[Path] = set() - deduped: list[Path] = [] - for path in paths: - resolved = path.resolve() - if resolved in seen: - continue - seen.add(resolved) - deduped.append(path) - return deduped +def _deduped_named_paths(named: list[tuple[str, Path]]) -> list[tuple[str, Path]]: + by_resolved: dict[Path, tuple[str, Path]] = {} + for name, path in named: + by_resolved[path.resolve()] = (name, path) + return list(by_resolved.values()) def _pricing_from_raw(raw: dict) -> dict[str, ModelPrice]: diff --git a/src/nasde_toolkit/pricing_report.py b/src/nasde_toolkit/pricing_report.py new file mode 100644 index 0000000..e855b17 --- /dev/null +++ b/src/nasde_toolkit/pricing_report.py @@ -0,0 +1,38 @@ +"""Rendering for the effective (merged) pricing catalog. 
+ +Shared by ``nasde pricing show`` and the ``nasde run`` summary so the rate table +looks identical wherever it appears. See ADR-013. +""" + +from __future__ import annotations + +from rich.table import Table + +from nasde_toolkit.pricing import ModelPrice + + +def render_pricing_table( + entries: dict[str, tuple[ModelPrice, str]], + show_source: bool = False, + title: str = "Effective pricing", +) -> Table: + """Build a Rich table of model rates, optionally with the source layer column.""" + table = Table(title=title) + table.add_column("Model", style="cyan") + table.add_column("In / 1M", justify="right") + table.add_column("Out / 1M", justify="right") + if show_source: + table.add_column("Layer", justify="left") + table.add_column("as_of", justify="left", style="dim") + for model in sorted(entries): + price, layer = entries[model] + row = [model, _fmt_rate(price.input_per_1m), _fmt_rate(price.output_per_1m)] + if show_source: + row.append(layer) + row.append(price.as_of or "—") + table.add_row(*row) + return table + + +def _fmt_rate(rate: float) -> str: + return f"${rate:g}" diff --git a/src/nasde_toolkit/results_exporter.py b/src/nasde_toolkit/results_exporter.py index 1b4baca..9fd35f3 100644 --- a/src/nasde_toolkit/results_exporter.py +++ b/src/nasde_toolkit/results_exporter.py @@ -52,10 +52,34 @@ def export_results( summary = ExportSummary() for job_name, trial_dir in trials: _export_one_trial(job_name, trial_dir, dest, include_trajectory, summary, pricing) + _write_pricing_used(dest, [trial_dir for _, trial_dir in trials], project_dir) _print_summary(summary, dest) return summary +def _write_pricing_used(dest: Path, trial_dirs: list[Path], project_dir: Path | None) -> None: + from nasde_toolkit.pricing import effective_pricing_with_source + + used_models = {_resolve_model_name(td) for td in trial_dirs} + used_models.discard("") + if not used_models: + return + effective = effective_pricing_with_source(project_dir) + report = { + model: { + "input_per_1m": 
price.input_per_1m, + "output_per_1m": price.output_per_1m, + "as_of": price.as_of, + "layer": layer, + } + for model in sorted(used_models) + if (entry := effective.get(model)) is not None + for price, layer in [entry] + } + if report: + (dest / "pricing_used.json").write_text(json.dumps(report, indent=2)) + + def _expand_to_trials(paths: list[Path]) -> list[tuple[str, Path]]: trials: list[tuple[str, Path]] = [] seen: set[Path] = set() diff --git a/src/nasde_toolkit/runner.py b/src/nasde_toolkit/runner.py index f127eaa..9017990 100644 --- a/src/nasde_toolkit/runner.py +++ b/src/nasde_toolkit/runner.py @@ -1104,7 +1104,7 @@ async def _on_trial_complete(event: object) -> None: if assessment_tasks: console.print(f"[dim]Waiting for {len(assessment_tasks)} assessment evaluation(s)...[/dim]") await asyncio.gather(*assessment_tasks, return_exceptions=True) - _print_job_summary(result, _job_dir_from_config(merged_config)) + _print_job_summary(result, _job_dir_from_config(merged_config), config.project_dir) console.print("\n[bold green]Benchmark execution completed[/bold green]\n") @@ -1144,7 +1144,7 @@ async def _run_job( os.chdir(saved_cwd) -def _print_job_summary(result: JobResult, job_dir: Path | None = None) -> None: +def _print_job_summary(result: JobResult, job_dir: Path | None = None, project_dir: Path | None = None) -> None: console.print() console.print("[bold]Job completed[/bold]") console.print(f" Trials: {result.stats.n_completed_trials}") @@ -1155,6 +1155,7 @@ def _print_job_summary(result: JobResult, job_dir: Path | None = None) -> None: if rows: _print_economics_table(rows) _print_label_legend(rows) + _print_pricing_used(job_dir, project_dir) _print_location_hints(job_dir) elif result.stats.evals: _warn_missing_economics(job_dir) @@ -1164,6 +1165,35 @@ def _print_job_summary(result: JobResult, job_dir: Path | None = None) -> None: console.print() +def _print_pricing_used(job_dir: Path, project_dir: Path | None) -> None: + from nasde_toolkit.pricing import 
effective_pricing_with_source + from nasde_toolkit.pricing_report import render_pricing_table + + used_models = _models_used_in_job(job_dir) + if not used_models: + return + effective = effective_pricing_with_source(project_dir) + entries = {model: effective[model] for model in used_models if model in effective} + if not entries: + return + console.print(render_pricing_table(entries, show_source=True, title="Pricing used (effective)")) + + +def _models_used_in_job(job_dir: Path) -> set[str]: + from nasde_toolkit.evaluator import _collect_trial_dirs + + models: set[str] = set() + for trial_dir in _collect_trial_dirs(job_dir): + summary_path = trial_dir / "assessment_summary.json" + if not summary_path.exists(): + continue + summary = json.loads(summary_path.read_text()) + model = summary.get("model_name") + if model: + models.add(model) + return models + + def _warn_missing_economics(job_dir: Path) -> None: console.print( f"[yellow]WARN: no assessment_summary.json found under {job_dir} — " diff --git a/tests/test_cli.py b/tests/test_cli.py index 39835a2..79f3024 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -216,3 +216,35 @@ def test_calibrate_publish_forwards_resolved_sink(mock_publish: object, tmp_path assert kwargs["repo"] == "NoesisVision/nasde-calibration" assert kwargs["repo_url"] == "git@github.com:NoesisVision/nasde-calibration.git" assert kwargs["platform_override"] == "github" + + +def test_pricing_show_lists_bundled_models(tmp_path: Path) -> None: + (tmp_path / "nasde.toml").write_text('[project]\nname = "test"\n') + + result = runner.invoke(app, ["pricing", "show", "-C", str(tmp_path)]) + + assert result.exit_code == 0, result.output + assert "claude-sonnet-4-6" in result.output + assert "Effective pricing" in result.output + + +def test_pricing_show_reflects_project_override(tmp_path: Path) -> None: + (tmp_path / "nasde.toml").write_text('[project]\nname = "test"\n') + (tmp_path / 
"pricing.toml").write_text('[models."claude-sonnet-4-6"]\ninput_per_1m = 42.0\noutput_per_1m = 99.0\n') + + result = runner.invoke(app, ["pricing", "show", "-C", str(tmp_path)]) + + assert result.exit_code == 0, result.output + assert "$42" in result.output + + +def test_pricing_show_source_column(tmp_path: Path) -> None: + (tmp_path / "nasde.toml").write_text('[project]\nname = "test"\n') + (tmp_path / "pricing.toml").write_text('[models."azure-x"]\ninput_per_1m = 0.5\noutput_per_1m = 1.0\n') + + result = runner.invoke(app, ["pricing", "show", "-C", str(tmp_path), "--show-source"]) + + assert result.exit_code == 0, result.output + assert "Layer" in result.output + assert "project" in result.output + assert "bundled" in result.output diff --git a/tests/test_pricing.py b/tests/test_pricing.py index a7b63c4..9dfd10e 100644 --- a/tests/test_pricing.py +++ b/tests/test_pricing.py @@ -8,9 +8,11 @@ from nasde_toolkit.pricing import ( compute_cost_usd, + effective_pricing_with_source, load_pricing, load_pricing_layered, pricing_as_of, + resolve_pricing_layers, ) @@ -145,3 +147,39 @@ def test_layered_three_layers_whole_entry_on_overlap(tmp_path: Path, empty_user_ merged = load_pricing_layered(tmp_path) assert merged["azure-gpt5"].input_per_1m == 0.5 assert merged["azure-gpt5"].source == "" + + +def test_resolve_layers_no_overrides_is_bundled_only(tmp_path: Path, empty_user_layer: Path) -> None: + layers = resolve_pricing_layers(tmp_path) + assert [layer.name for layer in layers] == ["bundled"] + assert layers[0].present is True + assert layers[0].path is None + assert set(layers[0].models) == set(load_pricing()) + + +def test_resolve_layers_three_present(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("claude-opus-4-8", 4.0, 1.0)) + _write_pricing(tmp_path, _model_block("my-model", 7.0, 8.0)) + layers = {layer.name: layer for layer in resolve_pricing_layers(tmp_path)} + 
assert set(layers) == {"bundled", "user", "project"} + assert set(layers["user"].models) == {"claude-opus-4-8"} + assert set(layers["project"].models) == {"my-model"} + assert layers["project"].path == tmp_path / "pricing.toml" + + +def test_effective_pricing_with_source_three_layers(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("claude-opus-4-8", 4.0, 1.0) + _model_block("azure-gpt5", 1.0, 2.0)) + _write_pricing(tmp_path, _model_block("azure-gpt5", 0.5, 1.0) + _model_block("enterprise-claude", 10.0, 20.0)) + eff = effective_pricing_with_source(tmp_path) + assert eff["gpt-5.5"][1] == "bundled" + assert eff["claude-opus-4-8"][1] == "user" + assert eff["azure-gpt5"][1] == "project" + assert eff["azure-gpt5"][0].input_per_1m == 0.5 + assert eff["enterprise-claude"][1] == "project" + + +def test_load_pricing_layered_matches_effective_keys(tmp_path: Path, empty_user_layer: Path) -> None: + _write_pricing(tmp_path, _model_block("my-model", 7.0, 8.0)) + assert set(load_pricing_layered(tmp_path)) == set(effective_pricing_with_source(tmp_path)) diff --git a/tests/test_results_exporter.py b/tests/test_results_exporter.py index 8b154bf..9f4670d 100644 --- a/tests/test_results_exporter.py +++ b/tests/test_results_exporter.py @@ -359,6 +359,29 @@ def test_export_three_layer_compose_e2e(job_dir: Path, tmp_path: Path, empty_use assert metrics["cost_usd"] == pytest.approx(1_000_000 / 1e6 * 0.5 + 60_000 / 1e6 * 1.0) +def test_export_writes_pricing_used_with_layer(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: + project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text(_model_block("claude-sonnet-4-6", 0.5, 1.0)) + dest = tmp_path / "export" + + export_results([job_dir], dest, project_dir=project) + report = json.loads((dest / "pricing_used.json").read_text()) + + assert "claude-sonnet-4-6" in report + assert 
report["claude-sonnet-4-6"]["input_per_1m"] == 0.5 + assert report["claude-sonnet-4-6"]["layer"] == "project" + + +def test_export_pricing_used_layer_bundled(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: + dest = tmp_path / "export" + export_results([job_dir], dest, project_dir=None) + report = json.loads((dest / "pricing_used.json").read_text()) + + assert report["claude-sonnet-4-6"]["layer"] == "bundled" + assert report["claude-sonnet-4-6"]["input_per_1m"] == 3.0 + + def test_capture_patch_includes_non_ascii_untracked_filename(tmp_path: Path) -> None: workspace = tmp_path / "artifacts" / "workspace" workspace.mkdir(parents=True) diff --git a/tests/test_runner_economics.py b/tests/test_runner_economics.py index d45cdc6..ee40a57 100644 --- a/tests/test_runner_economics.py +++ b/tests/test_runner_economics.py @@ -13,6 +13,7 @@ _fmt_score, _fmt_tokens, _job_dir_from_config, + _models_used_in_job, _print_job_summary, _sample_std, _short_label, @@ -236,3 +237,22 @@ def test_print_job_summary_renders_economics_for_own_job( out = _flat(capsys.readouterr().out) assert "Results by agent/model" in out assert "no assessment_summary.json found" not in out + + +def test_models_used_in_job_collects_distinct_models(tmp_path: Path) -> None: + job = tmp_path / "jobs" / "demo" + _write_trial(job, "t__a", "claude-vanilla", "claude-sonnet-4-6", 0.8, 1000, 1.0) + _write_trial(job, "t__b", "claude-vanilla", "claude-sonnet-4-6", 0.7, 1000, 1.0) + _write_trial(job, "t__c", "codex-vanilla", "gpt-5.4", 0.6, 1000, 1.0) + + assert _models_used_in_job(job) == {"claude-sonnet-4-6", "gpt-5.4"} + + +def test_models_used_in_job_skips_trials_without_summary(tmp_path: Path) -> None: + job = tmp_path / "jobs" / "demo" + _write_trial(job, "t__a", "claude-vanilla", "claude-sonnet-4-6", 0.8, 1000, 1.0) + bare = job / "t__nosummary" + bare.mkdir(parents=True) + bare.joinpath("result.json").write_text(json.dumps({"trial_name": "t__nosummary"})) + + assert _models_used_in_job(job) == 
{"claude-sonnet-4-6"} diff --git a/website/src/content/docs/concepts/token-cost.md b/website/src/content/docs/concepts/token-cost.md index b3852e0..2083ca7 100644 --- a/website/src/content/docs/concepts/token-cost.md +++ b/website/src/content/docs/concepts/token-cost.md @@ -80,6 +80,17 @@ as_of = "2026-06-22" source = "internal contract" ``` +### Verifying the effective catalog + +To see the merged result after your overrides, run: + +```bash +nasde pricing show -C ./my-benchmark # effective rates per model +nasde pricing show -C ./my-benchmark --show-source # + which layer each rate came from +``` + +For audit, every `nasde results-export` also writes a `pricing_used.json` next to the exported trials — the effective rate and source layer for each model that was priced in that batch — so a report is self-contained. The `nasde run` summary prints the same "Pricing used" table for the models in the run. + :::note[Editing the bundled catalog directly] You can still edit the bundled `src/nasde_toolkit/pricing.toml` from a source checkout (`uv sync`). After a PyPI install (`uv tool install` / pipx) the bundled file lives inside an isolated environment and is overwritten on upgrade — so prefer a `pricing.toml` override (above), which survives upgrades, or contribute the rate upstream. ::: From e0408b19fd905161cb48610daa74e50d7cb0bcb6 Mon Sep 17 00:00:00 2001 From: Szymon Janikowski Date: Tue, 23 Jun 2026 14:22:17 +0200 Subject: [PATCH 05/10] fix(pricing): freeze ModelPrice + non-scientific rate formatting MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Code-review follow-up (two findings from the /code-review pass): - ModelPrice is now @dataclass(frozen=True). It was a mutable dataclass shared via the bundled lru_cache (resolve_pricing_layers does dict(load_pricing()) — a shallow copy sharing the ModelPrice objects), so an in-place field mutation would silently corrupt every later lookup in the process. 
A rate is an immutable fact; freezing turns a latent cache-corruption footgun into an immediate FrozenInstanceError at the mutation site. No code mutates ModelPrice fields (verified), so this is safe; build a new instance via dataclasses.replace to adjust a rate. - pricing_report._fmt_rate no longer uses ${rate:g}, which renders scientific notation at the extremes ($1e+06, $5e-05). It now trims a fixed 4-decimal format, giving $3 / $2.5 in the normal range and $0.0001 / $1000000 at the edges. (.2f was rejected — it drops sub-cent cached rates to $0.00.) +2 regression tests (frozen guard, _fmt_rate edge table). 433 tests, ruff check + format, mypy green. Co-Authored-By: Claude Opus 4.8 (1M context) --- src/nasde_toolkit/pricing.py | 9 ++++++-- src/nasde_toolkit/pricing_report.py | 3 ++- tests/test_pricing.py | 7 ++++++ tests/test_pricing_report.py | 33 +++++++++++++++++++++++++++++ 4 files changed, 49 insertions(+), 3 deletions(-) create mode 100644 tests/test_pricing_report.py diff --git a/src/nasde_toolkit/pricing.py b/src/nasde_toolkit/pricing.py index 8b5a8e5..aa82852 100644 --- a/src/nasde_toolkit/pricing.py +++ b/src/nasde_toolkit/pricing.py @@ -21,9 +21,14 @@ console = Console() -@dataclass +@dataclass(frozen=True) class ModelPrice: - """Catalog rate for one model (USD per 1M tokens).""" + """Catalog rate for one model (USD per 1M tokens). + + Frozen: a rate is an immutable fact, and the bundled catalog is shared via an + lru_cache, so an in-place mutation would silently corrupt every later lookup + in the process. Build a new instance (``dataclasses.replace``) to adjust a rate. 
+ """ input_per_1m: float output_per_1m: float diff --git a/src/nasde_toolkit/pricing_report.py b/src/nasde_toolkit/pricing_report.py index e855b17..0bba38f 100644 --- a/src/nasde_toolkit/pricing_report.py +++ b/src/nasde_toolkit/pricing_report.py @@ -35,4 +35,5 @@ def render_pricing_table( def _fmt_rate(rate: float) -> str: - return f"${rate:g}" + trimmed = f"{rate:.4f}".rstrip("0").rstrip(".") + return f"${trimmed or '0'}" diff --git a/tests/test_pricing.py b/tests/test_pricing.py index 9dfd10e..e0f9cb7 100644 --- a/tests/test_pricing.py +++ b/tests/test_pricing.py @@ -2,6 +2,7 @@ from __future__ import annotations +from dataclasses import FrozenInstanceError from pathlib import Path import pytest @@ -16,6 +17,12 @@ ) +def test_model_price_is_frozen() -> None: + price = load_pricing()["claude-sonnet-4-6"] + with pytest.raises(FrozenInstanceError): + price.input_per_1m = 0.01 # type: ignore[misc] + + def _write_pricing(directory: Path, body: str) -> Path: path = directory / "pricing.toml" path.write_text(body) diff --git a/tests/test_pricing_report.py b/tests/test_pricing_report.py new file mode 100644 index 0000000..738ef53 --- /dev/null +++ b/tests/test_pricing_report.py @@ -0,0 +1,33 @@ +"""Tests for the effective-pricing table renderer.""" + +from __future__ import annotations + +import pytest + +from nasde_toolkit.pricing import ModelPrice +from nasde_toolkit.pricing_report import _fmt_rate, render_pricing_table + + +@pytest.mark.parametrize( + ("rate", "expected"), + [ + (3.0, "$3"), + (2.5, "$2.5"), + (0.5, "$0.5"), + (0.0125, "$0.0125"), + (0.0001, "$0.0001"), + (0.0, "$0"), + (75.0, "$75"), + (1_000_000.0, "$1000000"), + ], +) +def test_fmt_rate_no_scientific_notation(rate: float, expected: str) -> None: + assert _fmt_rate(rate) == expected + + +def test_render_pricing_table_source_column_toggles() -> None: + entries = {"m": (ModelPrice(input_per_1m=3.0, output_per_1m=15.0, as_of="2026-06-08"), "bundled")} + with_source = render_pricing_table(entries, 
show_source=True) + without_source = render_pricing_table(entries, show_source=False) + assert with_source.columns[-2].header == "Layer" + assert all(column.header != "Layer" for column in without_source.columns) From b52523af6352029e2d947c45160a19ae606ec449 Mon Sep 17 00:00:00 2001 From: Szymon Janikowski Date: Tue, 23 Jun 2026 15:02:25 +0200 Subject: [PATCH 06/10] refactor(pricing): drop redundant trial walk + unwind comprehension Final code-review nits (#1/#2/#3): - _print_pricing_used now derives the used-models set from the economics `rows` it is handed, instead of re-walking every trial dir and re-parsing every assessment_summary.json a second time in the same _print_job_summary call (the rows already carry model_name). _models_used_in_job is deleted. Removes a redundant per-trial I/O pass and the caller-after-callee ordering. - _finalize_economics_row now exposes "model" (already destructured) so the pricing-used table can read it without a second source. - _write_pricing_used: replaced the walrus + `for price, layer in [entry]` dict-comprehension with a plain readable loop. 432 tests, ruff check + format, mypy green. 
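
For reviewers, the two changes above can be sketched in isolation. This is an illustrative reduction, not the toolkit code: `used_models_from_rows` and `build_pricing_report` are hypothetical names, `ModelPrice` is a cut-down stand-in for the real class, and the sample rows and rates are assumed values.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    # Stand-in for nasde_toolkit.pricing.ModelPrice; only the fields used here.
    input_per_1m: float
    output_per_1m: float
    as_of: str = ""


def used_models_from_rows(rows: list[dict]) -> set[str]:
    # Hypothetical helper: the same set comprehension _print_pricing_used uses.
    # Rows whose model string is empty are skipped.
    return {row["model"] for row in rows if row["model"]}


def build_pricing_report(
    used_models: set[str],
    effective: dict[str, tuple[ModelPrice, str]],
) -> dict[str, dict]:
    # Hypothetical helper: the plain loop that replaced the walrus comprehension.
    report: dict[str, dict] = {}
    for model in sorted(used_models):
        entry = effective.get(model)
        if entry is None:
            continue  # no catalog entry for this model, leave it out
        price, layer = entry
        report[model] = {
            "input_per_1m": price.input_per_1m,
            "output_per_1m": price.output_per_1m,
            "as_of": price.as_of,
            "layer": layer,
        }
    return report


# Assumed sample data: economics rows already carry the model name.
rows = [
    {"model": "claude-sonnet-4-6", "trials": 2},
    {"model": "gpt-5.4", "trials": 1},
    {"model": "", "trials": 1},  # an agent with no recorded model
]
effective = {
    "claude-sonnet-4-6": (ModelPrice(2.5, 11.0, "2026-06-24"), "project"),
}
report = build_pricing_report(used_models_from_rows(rows), effective)
print(report)
```

The point of the sketch is that no trial directory is read: the model set comes from data the caller already holds, and a model without a catalog entry is dropped from the report rather than raising.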
Co-Authored-By: Claude Opus 4.8 (1M context) --- src/nasde_toolkit/results_exporter.py | 13 +++++++------ src/nasde_toolkit/runner.py | 22 ++++------------------ tests/test_runner_economics.py | 23 ++++++----------------- 3 files changed, 17 insertions(+), 41 deletions(-) diff --git a/src/nasde_toolkit/results_exporter.py b/src/nasde_toolkit/results_exporter.py index 9fd35f3..9b4dd03 100644 --- a/src/nasde_toolkit/results_exporter.py +++ b/src/nasde_toolkit/results_exporter.py @@ -65,17 +65,18 @@ def _write_pricing_used(dest: Path, trial_dirs: list[Path], project_dir: Path | if not used_models: return effective = effective_pricing_with_source(project_dir) - report = { - model: { + report = {} + for model in sorted(used_models): + entry = effective.get(model) + if entry is None: + continue + price, layer = entry + report[model] = { "input_per_1m": price.input_per_1m, "output_per_1m": price.output_per_1m, "as_of": price.as_of, "layer": layer, } - for model in sorted(used_models) - if (entry := effective.get(model)) is not None - for price, layer in [entry] - } if report: (dest / "pricing_used.json").write_text(json.dumps(report, indent=2)) diff --git a/src/nasde_toolkit/runner.py b/src/nasde_toolkit/runner.py index 9017990..2fad32e 100644 --- a/src/nasde_toolkit/runner.py +++ b/src/nasde_toolkit/runner.py @@ -1155,7 +1155,7 @@ def _print_job_summary(result: JobResult, job_dir: Path | None = None, project_d if rows: _print_economics_table(rows) _print_label_legend(rows) - _print_pricing_used(job_dir, project_dir) + _print_pricing_used(rows, project_dir) _print_location_hints(job_dir) elif result.stats.evals: _warn_missing_economics(job_dir) @@ -1165,11 +1165,11 @@ def _print_job_summary(result: JobResult, job_dir: Path | None = None, project_d console.print() -def _print_pricing_used(job_dir: Path, project_dir: Path | None) -> None: +def _print_pricing_used(rows: list[dict], project_dir: Path | None) -> None: from nasde_toolkit.pricing import 
effective_pricing_with_source from nasde_toolkit.pricing_report import render_pricing_table - used_models = _models_used_in_job(job_dir) + used_models = {row["model"] for row in rows if row["model"]} if not used_models: return effective = effective_pricing_with_source(project_dir) @@ -1179,21 +1179,6 @@ def _print_pricing_used(job_dir: Path, project_dir: Path | None) -> None: console.print(render_pricing_table(entries, show_source=True, title="Pricing used (effective)")) -def _models_used_in_job(job_dir: Path) -> set[str]: - from nasde_toolkit.evaluator import _collect_trial_dirs - - models: set[str] = set() - for trial_dir in _collect_trial_dirs(job_dir): - summary_path = trial_dir / "assessment_summary.json" - if not summary_path.exists(): - continue - summary = json.loads(summary_path.read_text()) - model = summary.get("model_name") - if model: - models.add(model) - return models - - def _warn_missing_economics(job_dir: Path) -> None: console.print( f"[yellow]WARN: no assessment_summary.json found under {job_dir} — " @@ -1282,6 +1267,7 @@ def _finalize_economics_row(label: tuple[str, str, str], agg: dict) -> dict: return { "full_label": f"{agent} / {model}" if model else agent, "short_label": _short_label(agent, model), + "model": model, "reasoning_effort": effort, "trials": agg["trials"], "score": _mean(agg["scores"]), diff --git a/tests/test_runner_economics.py b/tests/test_runner_economics.py index ee40a57..f0b31c7 100644 --- a/tests/test_runner_economics.py +++ b/tests/test_runner_economics.py @@ -13,7 +13,6 @@ _fmt_score, _fmt_tokens, _job_dir_from_config, - _models_used_in_job, _print_job_summary, _sample_std, _short_label, @@ -239,20 +238,10 @@ def test_print_job_summary_renders_economics_for_own_job( assert "no assessment_summary.json found" not in out -def test_models_used_in_job_collects_distinct_models(tmp_path: Path) -> None: - job = tmp_path / "jobs" / "demo" - _write_trial(job, "t__a", "claude-vanilla", "claude-sonnet-4-6", 0.8, 1000, 1.0) - 
_write_trial(job, "t__b", "claude-vanilla", "claude-sonnet-4-6", 0.7, 1000, 1.0) - _write_trial(job, "t__c", "codex-vanilla", "gpt-5.4", 0.6, 1000, 1.0) - - assert _models_used_in_job(job) == {"claude-sonnet-4-6", "gpt-5.4"} - - -def test_models_used_in_job_skips_trials_without_summary(tmp_path: Path) -> None: - job = tmp_path / "jobs" / "demo" - _write_trial(job, "t__a", "claude-vanilla", "claude-sonnet-4-6", 0.8, 1000, 1.0) - bare = job / "t__nosummary" - bare.mkdir(parents=True) - bare.joinpath("result.json").write_text(json.dumps({"trial_name": "t__nosummary"})) +def test_economics_row_carries_model_name(tmp_path: Path) -> None: + job = tmp_path / "job" + _write_trial(job, "t__a", "codex-vanilla", "gpt-5.4", 0.7, 1000, 1.0) + _write_trial(job, "t__b", "claude-vanilla", "claude-sonnet-4-6", 0.8, 1000, 1.0) - assert _models_used_in_job(job) == {"claude-sonnet-4-6"} + rows = _collect_economics_rows(job) + assert {row["model"] for row in rows} == {"gpt-5.4", "claude-sonnet-4-6"} From 1d8f121c0664913270788218d0daccc0b3ad11e6 Mon Sep 17 00:00:00 2001 From: Szymon Janikowski Date: Wed, 24 Jun 2026 12:53:06 +0200 Subject: [PATCH 07/10] feat(pricing): readable override errors + onboarding (example, name caveat) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Improve the experience of authoring a pricing.toml override: - Malformed override files now fail fast with a clear Rich message naming the file (so you know project vs ~/.nasde layer), the cause, and a hint — instead of a raw TOMLDecodeError/KeyError traceback. A decimal comma (2,5) and a missing input_per_1m/output_per_1m are the common cases. _load_override_models wraps the per-layer load; SystemExit(1), no crash. - nasde init now scaffolds a fully-commented pricing.toml.example (a real bundled model name to copy, the decimal-point hint, and the model-name caveat). Named .example so it's inert until copied to pricing.toml. 
- token-cost.md: a :::caution that the model name MUST match variant.toml's `model` or the override is SILENTLY ignored — verify with `pricing show --show-source` (model under `bundled` not `project` = typo). Plus a note that malformed files fail loudly. Out of scope (deliberately later): `nasde pricing validate` (check all entries + flag unknown model names up front) and `nasde pricing set` (add a single override via CLI). 436 tests, ruff check + format, mypy, website build green. Co-Authored-By: Claude Opus 4.8 (1M context) --- src/nasde_toolkit/pricing.py | 21 +++++++++++++++- src/nasde_toolkit/scaffold/__init__.py | 24 ++++++++++++++++++ tests/test_pricing.py | 12 +++++++++ tests/test_scaffold.py | 25 +++++++++++++++++++ .../src/content/docs/concepts/token-cost.md | 16 +++++++++--- 5 files changed, 94 insertions(+), 4 deletions(-) create mode 100644 tests/test_scaffold.py diff --git a/src/nasde_toolkit/pricing.py b/src/nasde_toolkit/pricing.py index aa82852..ee5b1d4 100644 --- a/src/nasde_toolkit/pricing.py +++ b/src/nasde_toolkit/pricing.py @@ -15,6 +15,7 @@ from functools import lru_cache from importlib.resources import as_file, files from pathlib import Path +from typing import NoReturn from rich.console import Console @@ -140,10 +141,28 @@ def _override_layers(project_dir: Path | None) -> list[PricingLayer]: for name, path in _override_layer_paths(project_dir): if not path.is_file(): continue - layers.append(PricingLayer(name=name, path=path, present=True, models=load_pricing(path))) + layers.append(PricingLayer(name=name, path=path, present=True, models=_load_override_models(path))) return layers +def _load_override_models(path: Path) -> dict[str, ModelPrice]: + try: + return load_pricing(path) + except tomllib.TOMLDecodeError as error: + _exit_bad_pricing(path, f"invalid TOML — {error}", hint="prices use a decimal point (2.5), not a comma (2,5)") + except KeyError as error: + field = str(error).strip("'\"") + _exit_bad_pricing(path, f"a model is missing 
the required field {field!r}") + + +def _exit_bad_pricing(path: Path, reason: str, hint: str = "") -> NoReturn: + console.print(f"[red]ERROR: could not load pricing override {path}[/red]") + console.print(f" [red]{reason}[/red]") + if hint: + console.print(f" [yellow]hint: {hint}[/yellow]") + raise SystemExit(1) + + def _override_layer_paths(project_dir: Path | None) -> list[tuple[str, Path]]: named = [("user", _user_pricing_path())] if project_dir is not None: diff --git a/src/nasde_toolkit/scaffold/__init__.py b/src/nasde_toolkit/scaffold/__init__.py index f1f83d5..2cde76c 100644 --- a/src/nasde_toolkit/scaffold/__init__.py +++ b/src/nasde_toolkit/scaffold/__init__.py @@ -102,6 +102,28 @@ # tasks = ["my-benchmark/task-name"] """ +PRICING_TOML_EXAMPLE = """\ +# Optional per-project model price overrides (USD per 1M tokens). +# +# Precedence (higher wins, merged per model): this file > ~/.nasde/pricing.toml +# > the catalog bundled with nasde. List ONLY the models you want to change or +# add; every other model falls through to the layer below. +# +# The model name MUST match the `model` in your variant.toml (and the model_name +# recorded in each trial's config.json) — otherwise the override is silently +# ignored. Verify with: nasde pricing show -C . --show-source +# (your model should appear with Layer "project"; if it shows "bundled", the +# name does not match — check for a typo). +# +# Uncomment and edit to activate. Prices use a decimal POINT (2.5), not a comma. +# +# [models."claude-sonnet-4-6"] +# input_per_1m = 2.5 +# output_per_1m = 11.0 +# as_of = "2026-01-01" # optional: when you confirmed this rate +# source = "your contract / rate card" # optional: where it came from +""" + GITIGNORE_TEMPLATE = """\ jobs/ # Generated by [nasde.plugin]: staged plugin tree inside the build context. 
@@ -160,6 +182,7 @@ def create_project(project_dir: Path, name: str) -> None: _write_if_missing(project_dir / "nasde.toml", NASDE_TOML_TEMPLATE.format(name=name)) _write_if_missing(project_dir / "assessment_dimensions.json", ASSESSMENT_DIMENSIONS_TEMPLATE) + _write_if_missing(project_dir / "pricing.toml.example", PRICING_TOML_EXAMPLE) _write_if_missing(project_dir / ".gitignore", GITIGNORE_TEMPLATE) _write_if_missing(project_dir / ".gitattributes", GITATTRIBUTES_TEMPLATE) @@ -199,6 +222,7 @@ def create_project(project_dir: Path, name: str) -> None: console.print(f"[green]Project scaffolded at[/green] {project_dir}") console.print(" nasde.toml") console.print(" assessment_dimensions.json") + console.print(" pricing.toml.example (optional model price overrides)") console.print(" tasks/example-task/") console.print(" variants/vanilla/variant.toml") console.print(" variants/vanilla/CLAUDE.md") diff --git a/tests/test_pricing.py b/tests/test_pricing.py index e0f9cb7..6d476f1 100644 --- a/tests/test_pricing.py +++ b/tests/test_pricing.py @@ -23,6 +23,18 @@ def test_model_price_is_frozen() -> None: price.input_per_1m = 0.01 # type: ignore[misc] +def test_layered_malformed_toml_exits_with_path(tmp_path: Path, empty_user_layer: Path) -> None: + (tmp_path / "pricing.toml").write_text('[models."m"]\ninput_per_1m = 2,5\noutput_per_1m = 1.0\n') + with pytest.raises(SystemExit): + load_pricing_layered(tmp_path) + + +def test_layered_missing_required_field_exits(tmp_path: Path, empty_user_layer: Path) -> None: + (tmp_path / "pricing.toml").write_text('[models."m"]\ninput_per_1m = 2.5\n') + with pytest.raises(SystemExit): + load_pricing_layered(tmp_path) + + def _write_pricing(directory: Path, body: str) -> Path: path = directory / "pricing.toml" path.write_text(body) diff --git a/tests/test_scaffold.py b/tests/test_scaffold.py new file mode 100644 index 0000000..fff2ac6 --- /dev/null +++ b/tests/test_scaffold.py @@ -0,0 +1,25 @@ +"""Tests for project scaffolding (nasde init).""" 
+ +from __future__ import annotations + +import tomllib +from pathlib import Path + +from nasde_toolkit.scaffold import create_project + + +def test_create_project_writes_pricing_example(tmp_path: Path) -> None: + create_project(tmp_path, "demo") + example = tmp_path / "pricing.toml.example" + assert example.exists() + body = example.read_text() + assert "claude-sonnet-4-6" in body # a real bundled model name to copy + assert "decimal" in body.lower() # the comma-vs-point hint + + +def test_create_project_pricing_example_is_inert(tmp_path: Path) -> None: + create_project(tmp_path, "demo") + parsed = tomllib.loads((tmp_path / "pricing.toml.example").read_text()) + assert parsed == {} # fully commented — no active [models] until the user edits + + assert not (tmp_path / "pricing.toml").exists() # scaffold never writes a LIVE override diff --git a/website/src/content/docs/concepts/token-cost.md b/website/src/content/docs/concepts/token-cost.md index 2083ca7..11b940b 100644 --- a/website/src/content/docs/concepts/token-cost.md +++ b/website/src/content/docs/concepts/token-cost.md @@ -74,12 +74,18 @@ The precedence is **project > user > bundled**, merged **per model**: each overr ```toml # ~/.nasde/pricing.toml — your enterprise rate for one model; the rest stays bundled [models."claude-opus-4-8"] -input_per_1m = 4.0 +input_per_1m = 4.0 # prices use a decimal point (4.0), not a comma output_per_1m = 12.0 -as_of = "2026-06-22" -source = "internal contract" +as_of = "2026-06-22" # optional — when you confirmed this rate +source = "internal contract" # optional — where it came from ``` +`nasde init` drops a fully-commented `pricing.toml.example` in a new project — copy it to `pricing.toml` and edit, rather than writing one from scratch. + +:::caution[The model name must match exactly] +The key (`claude-opus-4-8` above) must match the `model` in your `variant.toml` — i.e. the `model_name` recorded in each trial. 
If it doesn't (a typo like `claude-opus-4.8`, or a model you don't actually run), the override is **silently ignored** and the trial keeps the bundled rate — no error. Confirm your override took effect with `--show-source` (below): your model should show layer `project`/`user`, not `bundled`. +::: + ### Verifying the effective catalog To see the merged result after your overrides, run: @@ -89,6 +95,10 @@ nasde pricing show -C ./my-benchmark # effective rates per model nasde pricing show -C ./my-benchmark --show-source # + which layer each rate came from ``` +If an override doesn't seem to apply, this is the first thing to check: a model showing layer `bundled` when you expected `project` means the name in your `pricing.toml` doesn't match the model you ran. + +A malformed override file (bad TOML, a missing `input_per_1m`/`output_per_1m`) fails fast with a clear message naming the file — it never silently produces a wrong cost. + For audit, every `nasde results-export` also writes a `pricing_used.json` next to the exported trials — the effective rate and source layer for each model that was priced in that batch — so a report is self-contained. The `nasde run` summary prints the same "Pricing used" table for the models in the run. :::note[Editing the bundled catalog directly] From 07facdd21610d0f8b174adc720e14af9ef70a100 Mon Sep 17 00:00:00 2001 From: Szymon Janikowski Date: Wed, 24 Jun 2026 13:40:34 +0200 Subject: [PATCH 08/10] docs: show real pricing-verification output (not just prose) Rewrite the "Verifying the effective catalog" section of token-cost.md to include actual command output: a --show-source table for a working override, a side-by-side example of the silent model-name-typo failure (the real model stays `bundled` while the typo'd key sits as a dead `project` row), the loud malformed-file errors (decimal comma, missing field), and a sample pricing_used.json. Theory alone didn't make the silent-miss case obvious. 
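
Why the typo case is silent can be seen in a minimal sketch of a per-model merge with source tracking. This is an assumption-laden illustration rather than the real `effective_pricing_with_source`: `merge_with_source` is a hypothetical name, rates are reduced to `(input, output)` tuples, and the layer contents reuse the figures from the docs tables.

```python
from __future__ import annotations

Rates = tuple[float, float]  # (input_per_1m, output_per_1m), simplified

# Assumed layer contents for illustration.
bundled: dict[str, Rates] = {
    "claude-sonnet-4-6": (3.0, 15.0),
    "gpt-5.4": (2.5, 15.0),
}
project: dict[str, Rates] = {
    "claude-sonnet-4.6": (2.5, 11.0),  # typo: dot instead of dash
}


def merge_with_source(
    layers: list[tuple[str, dict[str, Rates]]],
) -> dict[str, tuple[Rates, str]]:
    # Later layers win, one whole entry per model, matched by exact key.
    effective: dict[str, tuple[Rates, str]] = {}
    for layer_name, models in layers:
        for model, rates in models.items():
            effective[model] = (rates, layer_name)
    return effective


effective = merge_with_source([("bundled", bundled), ("project", project)])

run_model = "claude-sonnet-4-6"  # the `model` a variant.toml would name
rates, layer = effective[run_model]
if layer == "bundled":
    print(f"{run_model} still uses the bundled rate {rates}; check the override key")
```

Because the merge matches keys exactly, the mistyped name becomes its own `project` entry and the model that actually runs keeps its `bundled` rate, which is the pattern the `--show-source` table makes visible.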
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../src/content/docs/concepts/token-cost.md | 67 +++++++++++++++++-- 1 file changed, 61 insertions(+), 6 deletions(-) diff --git a/website/src/content/docs/concepts/token-cost.md b/website/src/content/docs/concepts/token-cost.md index 11b940b..1b83a7d 100644 --- a/website/src/content/docs/concepts/token-cost.md +++ b/website/src/content/docs/concepts/token-cost.md @@ -88,18 +88,73 @@ The key (`claude-opus-4-8` above) must match the `model` in your `variant.toml` ### Verifying the effective catalog -To see the merged result after your overrides, run: +After dropping an override, **always check it took effect** — run `pricing show` with `--show-source`: ```bash -nasde pricing show -C ./my-benchmark # effective rates per model -nasde pricing show -C ./my-benchmark --show-source # + which layer each rate came from +nasde pricing show -C ./my-benchmark --show-source ``` -If an override doesn't seem to apply, this is the first thing to check: a model showing layer `bundled` when you expected `project` means the name in your `pricing.toml` doesn't match the model you ran. +With a project override of `claude-sonnet-4-6` (to $2.5 / $11), you'll see it win the `project` layer while everything else stays `bundled`: -A malformed override file (bad TOML, a missing `input_per_1m`/`output_per_1m`) fails fast with a clear message naming the file — it never silently produces a wrong cost. 
+``` + Effective pricing +┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓ +┃ Model ┃ In / 1M ┃ Out / 1M ┃ Layer ┃ as_of ┃ +┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩ +│ claude-opus-4-8 │ $5 │ $15 │ bundled │ 2026-06-08 │ +│ claude-sonnet-4-6 │ $2.5 │ $11 │ project │ 2026-06-24 │ +│ gpt-5.4 │ $2.5 │ $15 │ bundled │ 2026-06-08 │ +│ gpt-5.5 │ $5 │ $30 │ bundled │ 2026-06-08 │ +└───────────────────┴─────────┴──────────┴─────────┴────────────┘ +``` + +#### Catching a silent miss (wrong model name) + +This is the failure mode to watch for. Suppose you typo'd the key as `claude-sonnet-4.6` (dot) instead of `claude-sonnet-4-6` (dashes). The override **loads fine — no error** — but it doesn't match the model you actually run, so it just sits there as an extra, unused row while the real model keeps the bundled rate: + +``` +┃ Model ┃ In / 1M ┃ Out / 1M ┃ Layer ┃ as_of ┃ +│ claude-sonnet-4-6 │ $3 │ $15 │ bundled │ 2026-06-08 │ ← the model you run: NOT overridden +│ claude-sonnet-4.6 │ $2.5 │ $11 │ project │ — │ ← your typo: a dead, unused entry +``` + +The tell-tale: **the model you meant to override shows `bundled`, not `project`.** Fix the key to match the `model` in your `variant.toml` and re-run `pricing show`. + +#### Loud errors (malformed file) + +A broken override file never produces a wrong cost — it fails fast, naming the file and the cause. A decimal comma (`2,5` instead of `2.5`) is the classic one: + +``` +ERROR: could not load pricing override /path/to/my-benchmark/pricing.toml + invalid TOML — Expected newline or end of document after a statement (at line 2, column 17) + hint: prices use a decimal point (2.5), not a comma (2,5) +``` + +A missing required field gives: + +``` +ERROR: could not load pricing override /Users/you/.nasde/pricing.toml + a model is missing the required field 'output_per_1m' +``` + +(The path tells you which layer to fix — your project's `pricing.toml` vs the user-wide `~/.nasde/pricing.toml`.) 
+ +#### Self-contained audit + +Every `nasde results-export` also writes a `pricing_used.json` next to the exported trials — the effective rate and source layer for each model priced in that batch: + +```json +{ + "claude-sonnet-4-6": { + "input_per_1m": 2.5, + "output_per_1m": 11.0, + "as_of": "2026-06-24", + "layer": "project" + } +} +``` -For audit, every `nasde results-export` also writes a `pricing_used.json` next to the exported trials — the effective rate and source layer for each model that was priced in that batch — so a report is self-contained. The `nasde run` summary prints the same "Pricing used" table for the models in the run. +So a report carries its own pricing provenance. The `nasde run` summary prints the same "Pricing used" table for the models in the run. :::note[Editing the bundled catalog directly] You can still edit the bundled `src/nasde_toolkit/pricing.toml` from a source checkout (`uv sync`). After a PyPI install (`uv tool install` / pipx) the bundled file lives inside an isolated environment and is overwritten on upgrade — so prefer a `pricing.toml` override (above), which survives upgrades, or contribute the rate upstream. From 72a4b8ebfc30cd5c96747c66faa18397d1118b00 Mon Sep 17 00:00:00 2001 From: Szymon Janikowski Date: Wed, 24 Jun 2026 13:45:03 +0200 Subject: [PATCH 09/10] docs(changelog): bring [Unreleased] up to date for the next release MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The Unreleased section was missing everything merged after the ADR-010/011/012 batch. Added entries for: layered pricing override + visibility (#71, ADR-013), results-export + repeated-evaluation accumulation (#57), the NASDE→Nasde rebrand and Starlight docs migration (#64/#69), the parallel-run job_dir race fix (#62), and the June-2026 CVE dependency pins (#70, new ### Security section). All [#NN] references now have link targets. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 8207ad9..2517650 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,23 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure. ## [Unreleased] ### Added +- **Convention-based layered price overrides ([ADR-013](docs/adr/013-layered-pricing-override.md)).** Model prices + are now overridable after install without touching the wheel: drop a + `pricing.toml` in your project root or at `~/.nasde/pricing.toml` and it is + auto-detected — no config key, mirroring `assessment_dimensions.json`. The three + layers merge **project > user > bundled**, per-model whole-entry (an override + lists only the models it changes/adds; omitted fields take their defaults, never + inherited from a lower layer — so `as_of`/`source` can't leak onto a new rate). + Both write paths thread `project_dir`, so a trial reports the same `cost_usd` in + the run summary and a later export. New `nasde pricing show [--show-source]` + prints the effective merged catalog with the layer each rate came from; the + `nasde run` summary prints a "Pricing used" table for the models in the run; and + `nasde results-export` writes a `pricing_used.json` (effective rate + layer per + priced model) for self-contained cost audit. `ModelPrice` is now frozen + (immutable rate, safe to share via the bundled cache). Malformed override files + fail fast with a clear message naming the file and cause (e.g. a decimal comma); + a model-name typo is caught via `--show-source` (the real model shows `bundled`, + not `project`). `nasde init` scaffolds a commented `pricing.toml.example`. 
([#71]) - **Token & cost metrics ([ADR-011](docs/adr/011-token-cost-metrics.md)).** Every trial now carries token usage and USD cost, derived from the agent's `agent/trajectory.json` `final_metrics` and a versioned price catalog (`pricing.toml`). **Cost is computed @@ -73,6 +90,19 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure. declared tasks (others SKIPPED); with a single `--variant`, requesting a task outside its scope aborts with a clear error. The scope wins over an explicit `--tasks` filter. Absent/empty → unscoped (the default). ([#54]) +- **`nasde results-export` + repeated-evaluation accumulation.** `nasde + results-export PATHS... --to DIR` copies the analytic essence of trial artifacts + (metrics, all `assessment_eval_*.json`, summary, trajectory, code patch, verifier + output) out of the gitignored `jobs/` tree into a flat per-trial layout on any + plain path (iCloud/Dropbox/a git repo — no cloud SDK); re-export merges in newly + added evals. Because the judge is non-deterministic, each trial is now evaluated + `eval_repetitions` times (default 3; `[evaluation] eval_repetitions`, or + `--eval-repetitions`), written **append-only** as `assessment_eval_.json` + (never clobbered), with a derived `assessment_summary.json` holding per-cluster + mean/std aggregates — averaged only within one `(evaluator_model, + dimensions_fingerprint)` cluster, so a different judge model or a changed rubric + is never mixed in. Legacy single-eval jobs are normalized by the hidden + `nasde migrate-evals`. ([#57]) ### Removed - **Scalar efficiency metrics `token_efficiency` / `cost_efficiency` ([ADR-011](docs/adr/011-token-cost-metrics.md)).** @@ -86,6 +116,10 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure. `cost_usd`, `pricing_as_of`, score) stay and are the source of truth. 
 ### Changed
+- **Docs migrated to a Starlight site + rebranded `NASDE` → `Nasde`.** The README
+ long-form docs now live in an Astro/Starlight site under `website/`, deployed to
+ GitHub Pages; the README is a landing page. Prose uses the word-form **Nasde**
+ rather than the all-caps acronym. ([#64], [#69])
 - **Bump Harbor `0.6` → `0.13`.** Pulls 6 minor releases of upstream fixes and features (incl. network-policy config). A full API audit (diffing the 0.13.0 wheel against 0.6.4 for every symbol nasde uses — `Job`/`JobConfig`,
@@ -104,6 +138,11 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure.
 max_turns` in `nasde.toml`. ([#54])
 ### Fixed
+- **`nasde run` cost table resolves its own job dir from config, not the
+ globally-newest one.** With parallel runs, the summary previously picked the
+ newest `jobs/` directory by name — which could be a *different, concurrent*
+ run — and so showed the wrong (or an empty) cost table. The job dir is now
+ resolved from the run's own merged config. ([#62])
 - **`[[skill]]` by-reference and `[nasde.plugin]` skills now register natively for Codex/Gemini ([ADR-012](docs/adr/012-native-codex-gemini-skill-injection.md)).** PR #65 fixed the *snapshot* path (`agents_skills/`, `gemini_skills/`) but left the two other ways a skill reaches an agent — `[[skill]]` by-reference in
@@ -153,6 +192,15 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure.
 - **Bump `starlette` 1.0.0 → 1.1.0** (PYSEC-2026-161; transitive via harbor/fastapi/mcp). ([#54])
+### Security
+- **Pin transitive deps for the June-2026 CVE batch.** `pip-audit` flagged 13
+ advisories across 5 transitive packages (published after the last green main
+ run): `aiohttp>=3.14.1` (CVE-2026-54273..54280), `cryptography>=48.0.1`
+ (GHSA-537c-gmf6-5ccf), `python-multipart>=0.0.31` (CVE-2026-53540),
+ `pydantic-settings>=2.14.2` (GHSA-4xgf-cpjx-pc3j), and `starlette>=1.3.1`
+ (CVE-2026-54282 / CVE-2026-54283).
Floors added in `pyproject.toml`; the CVE
+ audit gate is green again. ([#70])
+
 ## [0.4.0] — 2026-05-21
 ### Added
@@ -540,4 +588,10 @@ Initial release under the **nasde-toolkit** name (rebrand from
 [#61]: https://github.com/NoesisVision/nasde-toolkit/pull/61
 [#65]: https://github.com/NoesisVision/nasde-toolkit/pull/65
 [#67]: https://github.com/NoesisVision/nasde-toolkit/pull/67
+[#57]: https://github.com/NoesisVision/nasde-toolkit/pull/57
+[#62]: https://github.com/NoesisVision/nasde-toolkit/pull/62
+[#64]: https://github.com/NoesisVision/nasde-toolkit/pull/64
+[#69]: https://github.com/NoesisVision/nasde-toolkit/pull/69
+[#70]: https://github.com/NoesisVision/nasde-toolkit/pull/70
+[#71]: https://github.com/NoesisVision/nasde-toolkit/pull/71
 [gh-litellm-2026-04]: https://github.com/BerriAI/litellm/security/advisories/GHSA-xqmj-j6mv-4862

From 5a592189f69beba7d263002f42cc705576e85c2f Mon Sep 17 00:00:00 2001
From: Szymon Janikowski
Date: Wed, 24 Jun 2026 13:52:03 +0200
Subject: [PATCH 10/10] docs: make "update CHANGELOG per feature" an explicit definition-of-done
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Unreleased section drifted 5 PRs behind before v0.5.0 even though the nasde-dev skill already mentioned updating it — the rule was buried mid-list. Surface it: a new "Development workflow" section in CLAUDE.md (which had no release guidance), and a prominent "Definition of done — CHANGELOG first" callout at the top of the skill's doc-consistency step. A user-visible change is not done until it has an [Unreleased] entry with a [#NN] ref, in the same PR.
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .claude/skills/nasde-dev/SKILL.md | 10 +++++++++-
 CLAUDE.md | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/.claude/skills/nasde-dev/SKILL.md b/.claude/skills/nasde-dev/SKILL.md
index c9d7ed3..ab5a1ec 100644
--- a/.claude/skills/nasde-dev/SKILL.md
+++ b/.claude/skills/nasde-dev/SKILL.md
@@ -55,10 +55,18 @@ uv sync
 After any change to the evaluation pipeline, CLI flags, configuration schema, agent support, or sandbox/environment handling, update **all** of these:
+> **Definition of done — CHANGELOG first.** A feature/fix is NOT complete until it
+> has an entry under `## [Unreleased]` in `CHANGELOG.md`. Add it **as part of the
+> change** (same PR), never "at release time" — leaving it for later means the
+> Unreleased section silently drifts behind the merged PRs and release notes go
+> out wrong. This has happened (the section fell 5 PRs behind before v0.5.0). When
+> finishing any user-visible work, the last step before declaring done is: does
+> `[Unreleased]` describe it, in the right section, with a `[#NN]` link-ref?
+
 **Think DX-first:** For every new option or feature, ask "where will the user be when they need this?" and put the documentation there. A feature that exists only in CLAUDE.md is invisible to most users. Check every touchpoint:
 **Documentation:**
-- `CHANGELOG.md` — **add an entry under `## [Unreleased]`** (Added / Changed / Fixed) for any user-visible change: new CLI flag, config field, behavior change, dependency bump. Add the `[#NN]` PR link-reference at the bottom. Easy to forget — do it as part of the change, not at release time.
+- `CHANGELOG.md` — **add an entry under `## [Unreleased]`** in the right section (Added / Changed / Fixed / Removed / Security) for any user-visible change: new CLI flag, config field, behavior change, dependency bump, CVE pin, security fix. Add the `[#NN]` PR link-reference at the bottom.
Easy to forget — do it as part of the change, not at release time. Format is [Keep a Changelog](https://keepachangelog.com/); see `docs/RELEASING.md`.
 - `README.md` — user-facing documentation (CLI options table, nasde.toml config reference, explanatory text). This is where most users look first.
 - `CLAUDE.md` — agent instructions (CLI reference, nasde.toml example, architecture decisions)
 - `ARCHITECTURE.md` — system architecture with mermaid diagrams (end-to-end flow, trial lifecycle, cloud sandbox providers, assessment evaluation)
diff --git a/CLAUDE.md b/CLAUDE.md
index f83e122..9742bda 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -57,6 +57,19 @@ nasde --version
 uv run pytest
 ```
+## Development workflow
+
+**A user-visible change is not done until `CHANGELOG.md` is updated.** When you
+finish a feature, fix, dependency bump, CVE pin, or any change a user would
+notice, add an entry under `## [Unreleased]` in `CHANGELOG.md` (in the right
+Keep-a-Changelog section — Added / Changed / Fixed / Removed / Security) with a
+`[#NN]` PR link-reference at the bottom — **as part of that change, not at
+release time.** Deferring it lets `[Unreleased]` silently drift behind merged
+PRs, so release notes ship incomplete. The full post-change checklist
+(quality gates, docs/skills/scaffold consistency, smoke + e2e) lives in the
+`nasde-dev` skill; releasing itself is in [docs/RELEASING.md](docs/RELEASING.md)
+(move `[Unreleased]` under the new version, then tag → publish).
+
 ## Code style
 1. PEP 8 with type hints on all public functions.