diff --git a/.claude/skills/nasde-dev/SKILL.md b/.claude/skills/nasde-dev/SKILL.md index c9d7ed3..ab5a1ec 100644 --- a/.claude/skills/nasde-dev/SKILL.md +++ b/.claude/skills/nasde-dev/SKILL.md @@ -55,10 +55,18 @@ uv sync After any change to the evaluation pipeline, CLI flags, configuration schema, agent support, or sandbox/environment handling, update **all** of these: +> **Definition of done — CHANGELOG first.** A feature/fix is NOT complete until it +> has an entry under `## [Unreleased]` in `CHANGELOG.md`. Add it **as part of the +> change** (same PR), never "at release time" — leaving it for later means the +> Unreleased section silently drifts behind the merged PRs and release notes go +> out wrong. This has happened (the section fell 5 PRs behind before v0.5.0). When +> finishing any user-visible work, the last step before declaring done is: does +> `[Unreleased]` describe it, in the right section, with a `[#NN]` link-ref? + **Think DX-first:** For every new option or feature, ask "where will the user be when they need this?" and put the documentation there. A feature that exists only in CLAUDE.md is invisible to most users. Check every touchpoint: **Documentation:** -- `CHANGELOG.md` — **add an entry under `## [Unreleased]`** (Added / Changed / Fixed) for any user-visible change: new CLI flag, config field, behavior change, dependency bump. Add the `[#NN]` PR link-reference at the bottom. Easy to forget — do it as part of the change, not at release time. +- `CHANGELOG.md` — **add an entry under `## [Unreleased]`** in the right section (Added / Changed / Fixed / Removed / Security) for any user-visible change: new CLI flag, config field, behavior change, dependency bump, CVE pin, security fix. Add the `[#NN]` PR link-reference at the bottom. Easy to forget — do it as part of the change, not at release time. Format is [Keep a Changelog](https://keepachangelog.com/); see `docs/RELEASING.md`. 
- `README.md` — user-facing documentation (CLI options table, nasde.toml config reference, explanatory text). This is where most users look first. - `CLAUDE.md` — agent instructions (CLI reference, nasde.toml example, architecture decisions) - `ARCHITECTURE.md` — system architecture with mermaid diagrams (end-to-end flow, trial lifecycle, cloud sandbox providers, assessment evaluation) diff --git a/CHANGELOG.md b/CHANGELOG.md index 8207ad9..2517650 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,23 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure. ## [Unreleased] ### Added +- **Convention-based layered price overrides ([ADR-013](docs/adr/013-layered-pricing-override.md)).** Model prices + are now overridable after install without touching the wheel: drop a + `pricing.toml` in your project root or at `~/.nasde/pricing.toml` and it is + auto-detected — no config key, mirroring `assessment_dimensions.json`. The three + layers merge **project > user > bundled**, per-model whole-entry (an override + lists only the models it changes/adds; omitted fields take their defaults, never + inherited from a lower layer — so `as_of`/`source` can't leak onto a new rate). + Both write paths thread `project_dir`, so a trial reports the same `cost_usd` in + the run summary and a later export. New `nasde pricing show [--show-source]` + prints the effective merged catalog with the layer each rate came from; the + `nasde run` summary prints a "Pricing used" table for the models in the run; and + `nasde results-export` writes a `pricing_used.json` (effective rate + layer per + priced model) for self-contained cost audit. `ModelPrice` is now frozen + (immutable rate, safe to share via the bundled cache). Malformed override files + fail fast with a clear message naming the file and cause (e.g. a decimal comma); + a model-name typo is caught via `--show-source` (the real model shows `bundled`, + not `project`). 
`nasde init` scaffolds a commented `pricing.toml.example`. ([#71]) - **Token & cost metrics ([ADR-011](docs/adr/011-token-cost-metrics.md)).** Every trial now carries token usage and USD cost, derived from the agent's `agent/trajectory.json` `final_metrics` and a versioned price catalog (`pricing.toml`). **Cost is computed @@ -73,6 +90,19 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure. declared tasks (others SKIPPED); with a single `--variant`, requesting a task outside its scope aborts with a clear error. The scope wins over an explicit `--tasks` filter. Absent/empty → unscoped (the default). ([#54]) +- **`nasde results-export` + repeated-evaluation accumulation.** `nasde + results-export PATHS... --to DIR` copies the analytic essence of trial artifacts + (metrics, all `assessment_eval_*.json`, summary, trajectory, code patch, verifier + output) out of the gitignored `jobs/` tree into a flat per-trial layout on any + plain path (iCloud/Dropbox/a git repo — no cloud SDK); re-export merges in newly + added evals. Because the judge is non-deterministic, each trial is now evaluated + `eval_repetitions` times (default 3; `[evaluation] eval_repetitions`, or + `--eval-repetitions`), written **append-only** as `assessment_eval_.json` + (never clobbered), with a derived `assessment_summary.json` holding per-cluster + mean/std aggregates — averaged only within one `(evaluator_model, + dimensions_fingerprint)` cluster, so a different judge model or a changed rubric + is never mixed in. Legacy single-eval jobs are normalized by the hidden + `nasde migrate-evals`. ([#57]) ### Removed - **Scalar efficiency metrics `token_efficiency` / `cost_efficiency` ([ADR-011](docs/adr/011-token-cost-metrics.md)).** @@ -86,6 +116,10 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure. `cost_usd`, `pricing_as_of`, score) stay and are the source of truth. 
### Changed +- **Docs migrated to a Starlight site + rebranded `NASDE` → `Nasde`.** The README + long-form docs now live in an Astro/Starlight site under `website/`, deployed to + GitHub Pages; the README is a landing page. Prose uses the word-form **Nasde** + rather than the all-caps acronym. ([#64], [#69]) - **Bump Harbor `0.6` → `0.13`.** Pulls 6 minor releases of upstream fixes and features (incl. network-policy config). A full API audit (diffing the 0.13.0 wheel against 0.6.4 for every symbol nasde uses — `Job`/`JobConfig`, @@ -104,6 +138,11 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure. max_turns` in `nasde.toml`. ([#54]) ### Fixed +- **`nasde run` cost table resolves its own job dir from config, not the + globally-newest one.** With parallel runs, the summary previously picked the + newest `jobs/` directory by name — which could be a *different, concurrent* + run — and so showed the wrong (or an empty) cost table. The job dir is now + resolved from the run's own merged config. ([#62]) - **`[[skill]]` by-reference and `[nasde.plugin]` skills now register natively for Codex/Gemini ([ADR-012](docs/adr/012-native-codex-gemini-skill-injection.md)).** PR #65 fixed the *snapshot* path (`agents_skills/`, `gemini_skills/`) but left the two other ways a skill reaches an agent — `[[skill]]` by-reference in @@ -153,6 +192,15 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure. - **Bump `starlette` 1.0.0 → 1.1.0** (PYSEC-2026-161; transitive via harbor/fastapi/mcp). ([#54]) +### Security +- **Pin transitive deps for the June-2026 CVE batch.** `pip-audit` flagged 13 + advisories across 5 transitive packages (published after the last green main + run): `aiohttp>=3.14.1` (CVE-2026-54273..54280), `cryptography>=48.0.1` + (GHSA-537c-gmf6-5ccf), `python-multipart>=0.0.31` (CVE-2026-53540), + `pydantic-settings>=2.14.2` (GHSA-4xgf-cpjx-pc3j), and `starlette>=1.3.1` + (CVE-2026-54282 / CVE-2026-54283). 
Floors added in `pyproject.toml`; the CVE + audit gate is green again. ([#70]) + ## [0.4.0] — 2026-05-21 ### Added @@ -540,4 +588,10 @@ Initial release under the **nasde-toolkit** name (rebrand from [#61]: https://github.com/NoesisVision/nasde-toolkit/pull/61 [#65]: https://github.com/NoesisVision/nasde-toolkit/pull/65 [#67]: https://github.com/NoesisVision/nasde-toolkit/pull/67 +[#57]: https://github.com/NoesisVision/nasde-toolkit/pull/57 +[#62]: https://github.com/NoesisVision/nasde-toolkit/pull/62 +[#64]: https://github.com/NoesisVision/nasde-toolkit/pull/64 +[#69]: https://github.com/NoesisVision/nasde-toolkit/pull/69 +[#70]: https://github.com/NoesisVision/nasde-toolkit/pull/70 +[#71]: https://github.com/NoesisVision/nasde-toolkit/pull/71 [gh-litellm-2026-04]: https://github.com/BerriAI/litellm/security/advisories/GHSA-xqmj-j6mv-4862 diff --git a/CLAUDE.md b/CLAUDE.md index adfdbab..9742bda 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -18,7 +18,8 @@ src/nasde_toolkit/ claude_subprocess.py # `claude -p` subprocess backend (default) codex_subprocess.py # `codex exec` subprocess backend results_exporter.py # `nasde results-export` — copy trial artifact essence to a plain dir - pricing.py / pricing.toml # Versioned model price catalog (ADR-011) — load + compute_cost_usd + pricing.py / pricing.toml # Versioned model price catalog (ADR-011) — load + compute_cost_usd; layered override via load_pricing_layered + provenance via resolve_pricing_layers/effective_pricing_with_source (ADR-013) + pricing_report.py # Shared Rich table renderer for the effective price catalog (nasde pricing show + run summary, ADR-013) token_metrics.py # Token usage + cost economics from trajectory final_metrics (ADR-011) calibration_publisher.py # `nasde calibrate` — publish trial diffs+assessments as PRs/MRs (ADR-010) calibration_resolve.py # Resolve [calibration] repo → (api slug, push url, platform) @@ -56,6 +57,19 @@ nasde --version uv run pytest ``` +## Development workflow + +**A 
user-visible change is not done until `CHANGELOG.md` is updated.** When you +finish a feature, fix, dependency bump, CVE pin, or any change a user would +notice, add an entry under `## [Unreleased]` in `CHANGELOG.md` (in the right +Keep-a-Changelog section — Added / Changed / Fixed / Removed / Security) with a +`[#NN]` PR link-reference at the bottom — **as part of that change, not at +release time.** Deferring it lets `[Unreleased]` silently drift behind merged +PRs, so release notes ship incomplete. The full post-change checklist +(quality gates, docs/skills/scaffold consistency, smoke + e2e) lives in the +`nasde-dev` skill; releasing itself is in [docs/RELEASING.md](docs/RELEASING.md) +(move `[Unreleased]` under the new version, then tag → publish). + ## Code style 1. PEP 8 with type hints on all public functions. @@ -88,7 +102,7 @@ See [ARCHITECTURE.md](ARCHITECTURE.md) for the full system architecture with dia - **Pass-through CLI**: `nasde harbor ...` delegates to Harbor's Typer app via `add_typer()`. `nasde opik ...` forwards args to Opik's Click CLI via `ctx.args`. - **Rubric calibration (ADR-010)**: `nasde calibrate publish PATHS...` / `pull-comments` close the loop between the LLM-as-a-Judge and a human reviewer by publishing trial diffs + assessments as PRs/MRs and pulling review comments back for rubric tuning. Two layers, deliberately separated: **GIT** (`git_platform_backends/git_ops.py` — `git push`/`ls-remote`, platform-agnostic, subprocess pattern from `docker.py`, not behind a Protocol) and **PLATFORM** (`git_platform_backends/` behind a `@runtime_checkable GitPlatformBackend` Protocol — `repo_exists`/`find_open_pr_for_branch`/`create_pr`/`fetch_pr_comments`/`validate_cli_installed`/`validate_auth`, mirroring `evaluator_backends/`). 
The base is keyed on `(repo, commit)` as an **orphan branch** `base/-` seeded once via `git archive HEAD` from the trial workspace (git deduplicates blobs by content across orphan bases — no shared ancestor needed); each trial is a feature branch `calib/-/` = base + the agent's `changes.patch` applied as a real commit + `.calibration/` files (no trajectory — secrets/clutter). `.calibration/` carries the reviewer's context: the task's `instruction.md` + `assessment_criteria.md` + `assessment_dimensions.json` (resolved from `result.json` `task_name`/`source`, trying both `tasks/` and `evals//tasks/` layouts), all `assessment_eval_.json`, `assessment_summary.json`, and `metrics.json`. Idempotency is **open-only**: `find_open_pr_for_branch` matches only OPEN PRs/MRs (`gh pr list --state open`, `glab mr list` default), so a re-run skips a live round but lets a fresh round publish once the prior one is closed. The PR body is a pure transform of the dominant `AssessmentSummary` cluster (`calibration_publisher._render_pr_body`). Backend is **auto-detected from the sink repo URL host** (`github.com`→`gh`, `*gitlab*`→`glab`; `[calibration] platform` overrides for self-hosted) — no `backend` config field, eliminating the backend≠host mismatch. Preflight before any work: detect → `validate_cli_installed` (`shutil.which`, precise per-platform message) → `validate_auth` (`gh|glab auth status` exit code) → `repo_exists` (parses OUTPUT — `gh repo view` exits 0 even for a missing repo). Repo creation is out of scope (push creates branches ad-hoc in an existing repo). Reuses `_expand_to_trials`/`_capture_patch`/`_build_metrics` from `results_exporter.py` and `_aggregate_evaluations`/`_load_raw_evaluations`/`AssessmentSummary` from `evaluator.py`. Prerequisites mirror the evaluator's CLI requirement (ADR-002): `git` + `gh`/`glab` + login, no SDK, CLI keyring holds auth. The `nasde-benchmark-calibration` skill orchestrates the human-in-the-loop flow. 
- **Results export (EXPERIMENTAL)**: `results_exporter.py` + `nasde results-export PATHS... --to DIR` copy the analytic *essence* of trial artifacts out of the gitignored `jobs/` tree into a flat per-trial layout (`DIR/__/` with `metrics.json`, `assessment_eval_*.json` (all repetitions), `assessment_summary.json`, `trajectory.json`, `changes.patch`, `verifier_stdout.txt`, `reward.txt`). Re-export **merges**: missing eval files are copied and the summary/metrics refreshed, while immutable files (trajectory, patch) are left as-is — so evaluations added after a first export are picked up. A legacy bare `assessment_eval.json` (pre-migration trial) is exported as `assessment_eval_1.json` with a `nasde migrate-evals` hint, so the export is never silently empty. Filesystem-as-interface: `DIR` is any plain path (iCloud/Dropbox/git repo) — no cloud SDK. It scans Harbor artifacts (`result.json`/`config.json`/`assessment_eval*.json`/`agent/trajectory.json`/workspace), **not** the best-effort `EXPERIMENT_LOG.md`. `metrics.json` is a self-contained summary composed from `result.json`+`config.json`+`agent/trajectory.json` — including **token & cost economics** (`token_usage`, `cost_usd`, `pricing_as_of`, `reasoning_effort`); see the token-cost note below and [ADR-011](docs/adr/011-token-cost-metrics.md). The code diff is captured as a patch (`git diff HEAD` + untracked via `git ls-files -z` + `git diff --no-index`, never `git add` — the workspace `.git` index is left untouched; `-z`/NUL parsing means non-ASCII untracked filenames are not dropped under `core.quotepath`). 
Selection is a mixed positional list of job and/or trial dirs (auto-classified: a dir whose children have `result.json` is a job; else a dir whose own `result.json` carries a `trial_name` key is a trial; a dir with a job-level `result.json` but no trial-shaped children/`trial_name` is skipped with a warning rather than mis-exported as garbage); re-export is idempotent and merge-based (a trial is reported `exported` only when something new was copied, else `skipped`). Reuses `_collect_trial_dirs`/`_load_json`/`_compute_duration_sec`/`_resolve_agent_name` from `evaluator.py`. Deliberately does **not** model "experiments" (one job can belong to many — a future UI layer's concern). -- **Token & cost metrics (ADR-011)**: every trial gets **token usage + USD cost** computed from the agent's `agent/trajectory.json` `final_metrics` and a versioned price catalog. `token_metrics.py` is the **single extractor** feeding both write paths: `evaluator.py` writes them onto `assessment_summary.json` (run) and `results_exporter.py` onto `metrics.json` (export). Definitions: `input = total_prompt_tokens` (full, cache included), `output = total_completion_tokens + extra.reasoning_output_tokens` (Codex reasoning folded into output), `total = input + output`. **Cost is "as if every run were the first"** — full prompt volume at full input rate, **no cache discount** — so it is deterministic and independent of run order / cache TTL (the prompt-token count is fixed for a task; the cache hit rate is not). **The scalar `token_efficiency`/`cost_efficiency` ratios were REMOVED** — `normalized_score / denominator` has an arbitrary zero (score 0 = empty rubric, unreachable), so the ranking is not invariant to a baseline shift; model comparison is now a **Pareto front** (quality vs cost, quality vs tokens), shift-invariant, living in the `nasde-benchmark-runner` skill, not the toolkit. The raw signals (`token_usage`, `cost_usd`, `pricing_as_of`, score) stay and are the source of truth. 
Economics are **per-trial** (one agent run) → they live on `AssessmentSummary`, not `EvaluatorGroupSummary`. `model_name` and `reasoning_effort` are stamped on the summary because cross-model analysis groups by `(agent_name, model_name, reasoning_effort)` (agent_name = variant name, does not distinguish models; a different effort is a different configuration, never averaged together — mirrors how a changed `dimensions_fingerprint` is a different benchmark). The `reasoning_effort` stamp is read back from the per-trial Harbor `config.json` (`config.agent.kwargs.reasoning_effort`); when no override was set the stamp is `""` (only explicit overrides are recorded — Codex's implicit `high` default is NOT fabricated, and an unset effort is a valid "family default" state). Pricing lives in the bundled `pricing.toml` (`pricing.py::load_pricing`), each model stamped with `as_of` + `source`; `cached_input_per_1m` is recorded for reference but **not** used in the cost formula. An **unpriced model** → `cost_usd` = `null` + a warning (token metrics still computed); a **missing/legacy trajectory** → all economics `null`. Never crashes the run. `nasde run` prints a per-`(agent, model, effort)` cost table (trials, score, tokens, $cost) plus the job path and an export hint (`runner.py::_print_job_summary`, called after assessment so the summaries exist); raw cost/token columns carry an inter-trial `±std` when the group has ≥2 trials (n=1 → bare value). Backfilling existing exports whose source jobs are gone is a **one-shot ad-hoc script** (reads the export's own flat `trajectory.json`) — deliberately NOT a CLI command. +- **Token & cost metrics (ADR-011)**: every trial gets **token usage + USD cost** computed from the agent's `agent/trajectory.json` `final_metrics` and a versioned price catalog. 
`token_metrics.py` is the **single extractor** feeding both write paths: `evaluator.py` writes them onto `assessment_summary.json` (run) and `results_exporter.py` onto `metrics.json` (export). Definitions: `input = total_prompt_tokens` (full, cache included), `output = total_completion_tokens + extra.reasoning_output_tokens` (Codex reasoning folded into output), `total = input + output`. **Cost is "as if every run were the first"** — full prompt volume at full input rate, **no cache discount** — so it is deterministic and independent of run order / cache TTL (the prompt-token count is fixed for a task; the cache hit rate is not). **The scalar `token_efficiency`/`cost_efficiency` ratios were REMOVED** — `normalized_score / denominator` has an arbitrary zero (score 0 = empty rubric, unreachable), so the ranking is not invariant to a baseline shift; model comparison is now a **Pareto front** (quality vs cost, quality vs tokens), shift-invariant, living in the `nasde-benchmark-runner` skill, not the toolkit. The raw signals (`token_usage`, `cost_usd`, `pricing_as_of`, score) stay and are the source of truth. Economics are **per-trial** (one agent run) → they live on `AssessmentSummary`, not `EvaluatorGroupSummary`. `model_name` and `reasoning_effort` are stamped on the summary because cross-model analysis groups by `(agent_name, model_name, reasoning_effort)` (agent_name = variant name, does not distinguish models; a different effort is a different configuration, never averaged together — mirrors how a changed `dimensions_fingerprint` is a different benchmark). The `reasoning_effort` stamp is read back from the per-trial Harbor `config.json` (`config.agent.kwargs.reasoning_effort`); when no override was set the stamp is `""` (only explicit overrides are recorded — Codex's implicit `high` default is NOT fabricated, and an unset effort is a valid "family default" state). 
Pricing is loaded via `pricing.py::load_pricing_layered(project_dir)` (ADR-013), which merges three layers per-model (whole-entry, higher wins): `/pricing.toml` > `~/.nasde/pricing.toml` > bundled `pricing.toml`. **Convention, not config** — a file literally named `pricing.toml` (mirrors `assessment_dimensions.json`); no `[pricing]` key. User layer is a HOME dotfolder (`~/.nasde/`, like `~/.claude`/`~/.codex`/`~/.gemini`), deliberately NOT `platformdirs` (that maps to `~/Library/Application Support` on macOS = app-state, not user-editable config; `platformdirs` stays for cache in `update_check.py`). Both write paths thread `project_dir` so run (`assessment_summary.json`) and export (`metrics.json`) agree on cost — the ADR-011 single-extractor invariant. The merged catalog is NOT cached (depends on `project_dir` + on-disk contents); bundled `_load_bundled_pricing` keeps its `lru_cache`, `load_pricing(path)` is unchanged. An applied override prints a dim transparency line. **Layer provenance is exposed** (ADR-013): `pricing.py::resolve_pricing_layers(project_dir)` returns the ordered `PricingLayer` stack and `effective_pricing_with_source(project_dir)` returns `{model: (price, layer_name)}`. Surfaced three ways — `nasde pricing show [--show-source]` (sub-app, room for future `pricing validate`/`path`), a "Pricing used" table at the end of `nasde run` (only models in the run), and a `pricing_used.json` written by `results-export` (effective rate + layer per priced model, self-contained audit). The Rich table renderer is shared in `pricing_report.py::render_pricing_table` (used by `pricing show` + run summary). Each model stamped with `as_of` + `source`; `cached_input_per_1m` is recorded for reference but **not** used in the cost formula. An **unpriced model** → `cost_usd` = `null` + a warning (token metrics still computed); a **missing/legacy trajectory** → all economics `null`. Never crashes the run. 
`nasde run` prints a per-`(agent, model, effort)` cost table (trials, score, tokens, $cost) plus the job path and an export hint (`runner.py::_print_job_summary`, called after assessment so the summaries exist); raw cost/token columns carry an inter-trial `±std` when the group has ≥2 trials (n=1 → bare value). Backfilling existing exports whose source jobs are gone is a **one-shot ad-hoc script** (reads the export's own flat `trajectory.json`) — deliberately NOT a CLI command. - See `docs/adr/` for detailed decision records. ## CLI reference @@ -125,6 +139,10 @@ nasde calibrate pull-comments PATHS... [OPTIONS] # Fetch human review comments --json # Machine-readable output for the calibration orchestrator agent -C, --project-dir PATH +nasde pricing show [OPTIONS] # Print the effective merged price catalog (ADR-013) + --show-source # Add a Layer column (project/user/bundled) — debug + -C, --project-dir PATH + nasde init [PROJECT_DIR] # Scaffold new benchmark project -n, --name TEXT @@ -145,6 +163,7 @@ A benchmark project managed by `nasde` has this layout: my-benchmark/ nasde.toml # Project config (name, defaults, docker, evaluation, reporting) assessment_dimensions.json # Scoring dimensions (benchmark-wide, each with independent max_score) + pricing.toml # Optional: per-project model price overrides (project > ~/.nasde > bundled, ADR-013) tasks/ / task.toml # Task config: Harbor sections + [nasde.source] for auto-Dockerfile diff --git a/docs/adr/013-layered-pricing-override.md b/docs/adr/013-layered-pricing-override.md new file mode 100644 index 0000000..a844223 --- /dev/null +++ b/docs/adr/013-layered-pricing-override.md @@ -0,0 +1,93 @@ +# ADR-013: Layered pricing override (convention over config) + +**Status:** Accepted +**Date:** 2026-06-22 + +## Context + +Token-cost metrics (ADR-011) price every trial against a model-rate catalog, +`src/nasde_toolkit/pricing.toml`, bundled into the wheel. 
The loader +`pricing.load_pricing(path)` accepted an override path, but both real call sites — +`evaluator.py` (run → `assessment_summary.json`) and `results_exporter.py` +(export → `metrics.json`) — always called it with no argument, so the bundled +catalog was the *only* catalog. + +After a PyPI / `uv tool install`, that catalog lives inside an isolated +environment. Correcting a stale rate or adding a model meant editing a file in +`site-packages` — an anti-pattern that is wiped on the next upgrade. Model prices +change far more often than `nasde` releases, and users have their own rates +(enterprise discounts, Azure/Bedrock, private contracts). The product goal — +evaluating the *cost* of migrating between agents — depends on locally-correct, +locally-controllable prices. The docs already flagged a per-project / per-user +override as planned. + +## Decision + +Pricing is overridable **by convention, not configuration**. A file literally +named `pricing.toml` placed at a known location is auto-detected and merged onto +the bundled catalog — there is **no `[pricing]` section in `nasde.toml`** and **no +configurable filename**. This mirrors `assessment_dimensions.json`, which is also +a fixed-name file discovered by convention. + +A new public entry point `pricing.load_pricing_layered(project_dir)` merges three +layers, **higher wins, per-model whole-entry replacement**: + +1. `/pricing.toml` — project layer (highest) +2. `~/.nasde/pricing.toml` — user layer +3. bundled `pricing.toml` — the floor (always present, the only required layer) + +An override file lists **only the models it changes or adds**; every other model +falls through to the layer below. The merge is `dict.update` over whole +`ModelPrice` entries — **not** a per-field blend: a model entry in an override +replaces the lower layer's entry entirely, so a field the override omits takes the +`ModelPrice` default (`cached_input_per_1m=None`, `source=""`), it is *not* +inherited. 
This is the least-surprising rule and mirrors the whole-value override +semantics of `--model > variant.toml > default`. A missing project/user file is +silently skipped. When a layer file is found and applied, one dim console line is +printed (transparency), never for the skipped case. + +**User layer is `~/.nasde/`, a HOME dotfolder — deliberately not `platformdirs`.** +Every agent CLI the user works with keeps its *user config* in a HOME dotfolder +(`~/.claude` + `~/.claude.json`, `~/.codex`, `~/.gemini`); `platformdirs.user_config_dir` +maps to `~/Library/Application Support` on macOS, which is where Electron app-state +(cookies, caches) lives, not a file a human edits. Config belongs in the dotfolder; +`platformdirs` stays for cache (`update_check.py::user_cache_dir`). One path on +every OS. + +**Both write paths thread `project_dir`.** `token_metrics.build_trial_economics` +is the single extractor feeding both `assessment_summary.json` (run) and +`metrics.json` (export) — so the override has to reach both, or the same trial +would report two different costs. The run path already had `project_root` in hand +(`evaluate_job`); the export path gained a `project_dir` argument on +`export_results`, supplied from `config.project_dir` in the CLI. Calibration +(`calibration_publisher`) reuses `_build_metrics` and threads `project_root` too. +The merged catalog is **not** cached: it depends on `project_dir` and on on-disk +file contents that can change between runs, and a per-job re-read is cheap. The +bundled `_load_bundled_pricing` keeps its `lru_cache` (invariant), and +`load_pricing(path)` is unchanged. + +## Consequences + +- Users override model prices post-install without touching the wheel — drop a + `pricing.toml` in the project root, or `~/.nasde/pricing.toml` machine-wide. +- The bundled catalog stays the auditable floor, each entry stamped with + `as_of` / `source`; overrides layer on top per-model. 
+- A re-run or re-export picks up an edited `pricing.toml` immediately — no new + agent runs needed to recost. +- Run and export agree on cost for the same trial, preserving the ADR-011 + invariant. +- Determinism and the "confirm rates before quoting figures" caveat from ADR-011 + carry over — a wrong override produces wrong (but deterministic) costs. +- The effective catalog is **inspectable**: `resolve_pricing_layers` / + `effective_pricing_with_source` expose which layer supplied each rate, surfaced + via `nasde pricing show [--show-source]`, a "Pricing used" table in the `nasde run` + summary, and a `pricing_used.json` in `results-export`. This makes the + transparency the convention promises actually verifiable — you can confirm a + three-layer override composed as intended, and an exported report is a + self-contained cost audit. + +## References + +- ADR-011 (token & cost metrics) — the single-extractor invariant this override + must respect. +- `assessment_dimensions.json` — the fixed-name-by-convention precedent. 
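Reviewer note on ADR-013: the per-model, whole-entry merge described in the Decision section can be sketched as below. This is a minimal illustration and not the code in `pricing.py`. The rate field names `input_per_1m` / `output_per_1m`, the helper `merge_layers`, and the sample model names and rates are assumptions made for the example; only `cached_input_per_1m`, `as_of`, `source`, and the `dict.update` whole-entry rule come from the ADR text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    # Hypothetical stand-in for the toolkit's ModelPrice.
    # input_per_1m / output_per_1m are assumed field names.
    input_per_1m: float
    output_per_1m: float
    cached_input_per_1m: float | None = None
    as_of: str = ""
    source: str = ""


def merge_layers(
    layers: list[tuple[str, dict[str, ModelPrice]]],
) -> dict[str, tuple[ModelPrice, str]]:
    """Merge catalogs ordered lowest to highest priority (hypothetical helper)."""
    effective: dict[str, tuple[ModelPrice, str]] = {}
    for layer_name, catalog in layers:
        # Whole-entry replacement: a higher layer's entry replaces the lower
        # one entirely, so omitted fields fall back to the dataclass defaults.
        effective.update({model: (price, layer_name) for model, price in catalog.items()})
    return effective


# Made-up rates, for illustration only.
bundled = {
    "model-a": ModelPrice(3.0, 15.0, cached_input_per_1m=0.3, as_of="2026-06-01", source="vendor page"),
    "model-b": ModelPrice(1.0, 4.0, as_of="2026-06-01", source="vendor page"),
}
user: dict[str, ModelPrice] = {}  # a missing user file behaves like an empty layer
project = {"model-a": ModelPrice(2.5, 12.0)}  # overrides only the model it lists

effective = merge_layers([("bundled", bundled), ("user", user), ("project", project)])
```

Here `model-a` resolves to the project entry with an empty `source` and `as_of` (nothing is inherited from the bundled entry), while `model-b` falls through to the bundled layer. Keeping the layer name next to each price is the kind of pairing that `--show-source` would need to render a Layer column.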
diff --git a/src/nasde_toolkit/calibration_publisher.py b/src/nasde_toolkit/calibration_publisher.py index a93bb49..a0e3be9 100644 --- a/src/nasde_toolkit/calibration_publisher.py +++ b/src/nasde_toolkit/calibration_publisher.py @@ -36,6 +36,7 @@ push_feature_branch, ) from nasde_toolkit.git_platform_backends.protocol import GitPlatformBackend, PrRef, ReviewComment +from nasde_toolkit.pricing import ModelPrice, load_pricing_layered from nasde_toolkit.results_exporter import ( _build_metrics, _capture_patch, @@ -88,11 +89,12 @@ def publish_trials( """Publish the given job/trial paths as PRs/MRs on the sink repo.""" backend = _preflight(repo, repo_url, platform_override) trials = _expand_to_trials(paths) + pricing = load_pricing_layered(project_root) summary = PublishSummary() for index, (_, trial_dir) in enumerate(trials): if index > 0: time.sleep(throttle_sec) - _publish_one_trial(trial_dir, backend, repo, repo_url, base_branch, project_root, summary) + _publish_one_trial(trial_dir, backend, repo, repo_url, base_branch, project_root, pricing, summary) _print_publish_summary(summary, repo) return summary @@ -137,6 +139,7 @@ def _publish_one_trial( repo_url: str, base_branch: str, project_root: Path | None, + pricing: dict[str, ModelPrice], summary: PublishSummary, ) -> None: label = trial_dir.name @@ -148,7 +151,7 @@ def _publish_one_trial( if existing is not None: _record_skip(summary, label, base, feature, existing) return - created = _open_pr_for_trial(trial_dir, backend, repo, repo_url, base, feature, project_root) + created = _open_pr_for_trial(trial_dir, backend, repo, repo_url, base, feature, project_root, pricing) summary.published.append(PublishedTrial(label, base, feature, created.number, created.url, created=True)) console.print(f" [green]published: {label} → {created.url}[/green]") except Exception as error: @@ -164,9 +167,10 @@ def _open_pr_for_trial( base: str, feature: str, project_root: Path | None, + pricing: dict[str, ModelPrice], ) -> PrRef: 
workspace = trial_dir / "artifacts" / "workspace" - metrics = _build_metrics(trial_dir) + metrics = _build_metrics(trial_dir, pricing) summary = _summarize_trial(trial_dir) title = _pr_title(trial_dir, summary) body = _render_pr_body(summary, metrics) diff --git a/src/nasde_toolkit/cli.py b/src/nasde_toolkit/cli.py index 4119b01..f511993 100644 --- a/src/nasde_toolkit/cli.py +++ b/src/nasde_toolkit/cli.py @@ -433,7 +433,7 @@ def results_export_command( from nasde_toolkit.config import load_project_config from nasde_toolkit.results_exporter import export_results - load_project_config(project_dir.resolve()) + config = load_project_config(project_dir.resolve()) from nasde_toolkit.banner import print_banner @@ -445,7 +445,7 @@ def results_export_command( ) ) - export_results([p.resolve() for p in paths], to.resolve()) + export_results([p.resolve() for p in paths], to.resolve(), project_dir=config.project_dir) @app.command(name="migrate-evals", hidden=True) @@ -483,7 +483,7 @@ def migrate_evals_command( from nasde_toolkit.config import load_project_config from nasde_toolkit.eval_migration import migrate_job_evals - load_project_config(project_dir.resolve()) + config = load_project_config(project_dir.resolve()) from nasde_toolkit.banner import print_banner @@ -495,7 +495,7 @@ def migrate_evals_command( ) ) - outcomes = migrate_job_evals(path.resolve(), dry_run=dry_run) + outcomes = migrate_job_evals(path.resolve(), dry_run=dry_run, project_dir=config.project_dir) from rich.table import Table @@ -507,6 +507,42 @@ def migrate_evals_command( console.print(table) +# --------------------------------------------------------------------------- +# Pricing sub-app (nasde pricing ...) 
+# --------------------------------------------------------------------------- + +pricing_app = typer.Typer( + name="pricing", + help="Inspect the effective model price catalog (project > ~/.nasde > bundled).", + no_args_is_help=True, +) +app.add_typer(pricing_app, name="pricing") + + +@pricing_app.command(name="show") +def pricing_show_command( + project_dir: Path = typer.Option( + Path("."), + "--project-dir", + "-C", + help="Path to evaluation project (its pricing.toml is the highest layer).", + ), + show_source: bool = typer.Option( + False, + "--show-source", + help="Add a Layer column showing which layer each rate comes from (debug).", + ), +) -> None: + """Print the effective merged pricing catalog after layered overrides.""" + from nasde_toolkit.config import load_project_config + from nasde_toolkit.pricing import effective_pricing_with_source + from nasde_toolkit.pricing_report import render_pricing_table + + config = load_project_config(project_dir.resolve()) + entries = effective_pricing_with_source(config.project_dir) + console.print(render_pricing_table(entries, show_source=show_source, title="Effective pricing")) + + # --------------------------------------------------------------------------- # Calibration sub-app (nasde calibrate ...) 
# --------------------------------------------------------------------------- diff --git a/src/nasde_toolkit/eval_migration.py b/src/nasde_toolkit/eval_migration.py index a04e54b..781d52b 100644 --- a/src/nasde_toolkit/eval_migration.py +++ b/src/nasde_toolkit/eval_migration.py @@ -18,28 +18,36 @@ _next_eval_index, _write_assessment_summary, ) +from nasde_toolkit.pricing import ModelPrice, load_pricing_layered console = Console() -def migrate_job_evals(path: Path, dry_run: bool = False) -> dict[str, int]: +def migrate_job_evals(path: Path, dry_run: bool = False, project_dir: Path | None = None) -> dict[str, int]: outcomes = {"migrated": 0, "summarized": 0, "noop": 0} + pricing = load_pricing_layered(project_dir) for trial_dir in _find_trial_dirs(path): - outcome = migrate_trial_evals(trial_dir, dry_run=dry_run) + outcome = migrate_trial_evals(trial_dir, pricing, dry_run=dry_run, project_dir=project_dir) outcomes[outcome] += 1 return outcomes -def migrate_trial_evals(trial_dir: Path, dry_run: bool = False) -> str: +def migrate_trial_evals( + trial_dir: Path, + pricing: dict[str, ModelPrice] | None = None, + dry_run: bool = False, + project_dir: Path | None = None, +) -> str: bare = trial_dir / "assessment_eval.json" numbered = _numbered_eval_files(trial_dir) if not bare.exists() and not numbered: return "noop" + pricing = pricing if pricing is not None else load_pricing_layered(project_dir) changed = _normalize_raw_files(trial_dir, bare, numbered, dry_run) if not dry_run: - _write_assessment_summary(trial_dir) + _write_assessment_summary(trial_dir, pricing) if changed: return "migrated" return "summarized" diff --git a/src/nasde_toolkit/evaluator.py b/src/nasde_toolkit/evaluator.py index dbcff92..4af8f04 100644 --- a/src/nasde_toolkit/evaluator.py +++ b/src/nasde_toolkit/evaluator.py @@ -20,7 +20,7 @@ from nasde_toolkit.config import EvaluationConfig from nasde_toolkit.evaluator_backends import create_backend -from nasde_toolkit.pricing import load_pricing +from 
nasde_toolkit.pricing import ModelPrice, load_pricing_layered from nasde_toolkit.token_metrics import build_trial_economics console = Console() @@ -142,6 +142,7 @@ async def evaluate_job( _warn_if_throttled(trial_dirs, max_concurrent, eval_config.eval_repetitions) semaphore = asyncio.Semaphore(max_concurrent) + pricing = load_pricing_layered(project_root) coros = [ _evaluate_and_record_trial( td, @@ -150,6 +151,7 @@ async def evaluate_job( with_opik, semaphore, eval_config, + pricing, ) for td in trial_dirs ] @@ -164,12 +166,14 @@ async def evaluate_and_record_trial( with_opik: bool, semaphore: asyncio.Semaphore, eval_config: EvaluationConfig | None = None, + pricing: dict[str, ModelPrice] | None = None, ) -> EvaluationResult | None: """Evaluate a single trial with semaphore-based concurrency control. Public wrapper used by runner.py for streaming (Level 2) assessment. """ eval_config = eval_config or EvaluationConfig() + pricing = pricing if pricing is not None else load_pricing_layered(project_root) return await _evaluate_and_record_trial( trial_dir, project_root, @@ -177,6 +181,7 @@ async def evaluate_and_record_trial( with_opik, semaphore, eval_config, + pricing, ) @@ -186,7 +191,8 @@ async def _evaluate_and_record_trial( project_name: str, with_opik: bool, semaphore: asyncio.Semaphore, - eval_config: EvaluationConfig | None = None, + eval_config: EvaluationConfig | None, + pricing: dict[str, ModelPrice], ) -> EvaluationResult | None: eval_config = eval_config or EvaluationConfig() console.print(f"\n[bold]Evaluating: {trial_dir.name} (x{eval_config.eval_repetitions})[/bold]") @@ -196,7 +202,7 @@ async def _evaluate_and_record_trial( return None for evaluation in evaluations: _write_evaluation_result(trial_dir, evaluation) - summary = _write_assessment_summary(trial_dir) + summary = _write_assessment_summary(trial_dir, pricing) if with_opik and summary is not None: await asyncio.to_thread(_upload_to_opik, summary, project_name) return evaluations[-1] @@ -674,12 
+680,12 @@ def _next_eval_index(trial_dir: Path) -> int: return max(indices, default=0) + 1 -def _write_assessment_summary(trial_dir: Path) -> AssessmentSummary | None: +def _write_assessment_summary(trial_dir: Path, pricing: dict[str, ModelPrice]) -> AssessmentSummary | None: evaluations = _load_raw_evaluations(trial_dir) if not evaluations: return None summary = _aggregate_evaluations(evaluations) - _enrich_with_economics(summary, trial_dir) + _enrich_with_economics(summary, trial_dir, pricing) output_path = trial_dir / "assessment_summary.json" with open(output_path, "w") as f: json.dump(asdict(summary), f, indent=2) @@ -687,9 +693,9 @@ def _write_assessment_summary(trial_dir: Path) -> AssessmentSummary | None: return summary -def _enrich_with_economics(summary: AssessmentSummary, trial_dir: Path) -> None: +def _enrich_with_economics(summary: AssessmentSummary, trial_dir: Path, pricing: dict[str, ModelPrice]) -> None: model = _resolve_model_name(trial_dir) - economics = build_trial_economics(trial_dir, model, load_pricing()) + economics = build_trial_economics(trial_dir, model, pricing) summary.model_name = economics["model_name"] summary.reasoning_effort = resolve_reasoning_effort(trial_dir) summary.token_usage = economics["token_usage"] diff --git a/src/nasde_toolkit/pricing.py b/src/nasde_toolkit/pricing.py index 3fbd032..ee5b1d4 100644 --- a/src/nasde_toolkit/pricing.py +++ b/src/nasde_toolkit/pricing.py @@ -1,8 +1,11 @@ """Model pricing catalog for token-cost metrics. -Loads per-model rates from a bundled ``pricing.toml`` (overridable) and computes -USD cost from token volumes. Cost is the full catalog rate applied to the full -prompt-token volume (cache included, no discount) — see ``pricing.toml`` and ADR-011. +Loads per-model rates from a bundled ``pricing.toml`` and computes USD cost from +token volumes. 
The catalog is overridable by convention via layered files — +``/pricing.toml`` > ``~/.nasde/pricing.toml`` > bundled, merged per-model +(see ``load_pricing_layered`` and ADR-013). Cost is the full catalog rate applied +to the full prompt-token volume (cache included, no discount) — see ``pricing.toml`` +and ADR-011. """ from __future__ import annotations @@ -12,15 +15,21 @@ from functools import lru_cache from importlib.resources import as_file, files from pathlib import Path +from typing import NoReturn from rich.console import Console console = Console() -@dataclass +@dataclass(frozen=True) class ModelPrice: - """Catalog rate for one model (USD per 1M tokens).""" + """Catalog rate for one model (USD per 1M tokens). + + Frozen: a rate is an immutable fact, and the bundled catalog is shared via an + lru_cache, so an in-place mutation would silently corrupt every later lookup + in the process. Build a new instance (``dataclasses.replace``) to adjust a rate. + """ input_per_1m: float output_per_1m: float @@ -29,26 +38,14 @@ class ModelPrice: source: str = "" -def load_pricing(path: str | Path | None = None) -> dict[str, ModelPrice]: - """Load the model pricing catalog, defaulting to the bundled pricing.toml. - - The bundled catalog (``path=None``) is read and parsed once per process and - cached, since it is invariant — callers (one per trial) must treat the - returned mapping as read-only. 
- """ - if path is None: - return _load_bundled_pricing() - raw = _read_pricing_toml(path) - return _pricing_from_raw(raw) - - -@lru_cache(maxsize=1) -def _load_bundled_pricing() -> dict[str, ModelPrice]: - return _pricing_from_raw(_read_pricing_toml(None)) - +@dataclass +class PricingLayer: + """One layer of the convention-based pricing stack (bundled / user / project).""" -def _pricing_from_raw(raw: dict) -> dict[str, ModelPrice]: - return {name: _model_price_from_dict(entry) for name, entry in raw.get("models", {}).items()} + name: str + path: Path | None + present: bool + models: dict[str, ModelPrice] def compute_cost_usd( @@ -68,12 +65,126 @@ def compute_cost_usd( return input_tokens / 1_000_000 * price.input_per_1m + output_tokens / 1_000_000 * price.output_per_1m +def effective_pricing_with_source(project_dir: Path | None = None) -> dict[str, tuple[ModelPrice, str]]: + """Merged catalog plus the name of the winning layer for each model. + + Same per-model whole-entry merge as ``load_pricing_layered`` (bundled → user → + project, higher wins), but each value is paired with the layer that supplied it + (``"project"`` / ``"user"`` / ``"bundled"``). Source of truth for the source + column in ``nasde pricing show`` and ``pricing_used.json``. + """ + resolved: dict[str, tuple[ModelPrice, str]] = {} + for layer in resolve_pricing_layers(project_dir): + for name, price in layer.models.items(): + resolved[name] = (price, layer.name) + return resolved + + +def load_pricing(path: str | Path | None = None) -> dict[str, ModelPrice]: + """Load the model pricing catalog, defaulting to the bundled pricing.toml. + + The bundled catalog (``path=None``) is read and parsed once per process and + cached, since it is invariant — callers (one per trial) must treat the + returned mapping as read-only. 
+ """ + if path is None: + return _load_bundled_pricing() + raw = _read_pricing_toml(path) + return _pricing_from_raw(raw) + + +def load_pricing_layered(project_dir: Path | None = None) -> dict[str, ModelPrice]: + """Load the pricing catalog with convention-based layered overrides. + + Merges three layers, higher wins, per-model whole-entry replacement: + 1. ``/pricing.toml`` (project layer, highest) + 2. ``~/.nasde/pricing.toml`` (user layer) + 3. bundled ``pricing.toml`` (the floor, always present) + + An override file lists only the models it changes or adds; every other model + is inherited from the layer below. A missing project/user file is silently + skipped — only the bundled floor is required. See ADR-013. + """ + merged: dict[str, ModelPrice] = {} + for layer in resolve_pricing_layers(project_dir): + merged.update(layer.models) + if layer.path is not None: + console.print(f" [dim]pricing: applied override {layer.path} ({len(layer.models)} model(s))[/dim]") + return merged + + def pricing_as_of(model: str, pricing: dict[str, ModelPrice]) -> str | None: """Return the as_of date stamped on the model's price, if priced.""" price = pricing.get(model) return price.as_of if price is not None else None +def resolve_pricing_layers(project_dir: Path | None = None) -> list[PricingLayer]: + """The pricing stack from lowest precedence (bundled) to highest (project). + + The bundled layer is always present; the user (``~/.nasde/pricing.toml``) and + project (``/pricing.toml``) layers appear only when their file + exists, each carrying just the models it declares. See ADR-013. 
+ """ + layers = [PricingLayer(name="bundled", path=None, present=True, models=dict(load_pricing()))] + layers.extend(_override_layers(project_dir)) + return layers + + +@lru_cache(maxsize=1) +def _load_bundled_pricing() -> dict[str, ModelPrice]: + return _pricing_from_raw(_read_pricing_toml(None)) + + +def _override_layers(project_dir: Path | None) -> list[PricingLayer]: + layers = [] + for name, path in _override_layer_paths(project_dir): + if not path.is_file(): + continue + layers.append(PricingLayer(name=name, path=path, present=True, models=_load_override_models(path))) + return layers + + +def _load_override_models(path: Path) -> dict[str, ModelPrice]: + try: + return load_pricing(path) + except tomllib.TOMLDecodeError as error: + _exit_bad_pricing(path, f"invalid TOML — {error}", hint="prices use a decimal point (2.5), not a comma (2,5)") + except KeyError as error: + field = str(error).strip("'\"") + _exit_bad_pricing(path, f"a model is missing the required field {field!r}") + + +def _exit_bad_pricing(path: Path, reason: str, hint: str = "") -> NoReturn: + console.print(f"[red]ERROR: could not load pricing override {path}[/red]") + console.print(f" [red]{reason}[/red]") + if hint: + console.print(f" [yellow]hint: {hint}[/yellow]") + raise SystemExit(1) + + +def _override_layer_paths(project_dir: Path | None) -> list[tuple[str, Path]]: + named = [("user", _user_pricing_path())] + if project_dir is not None: + named.append(("project", project_dir / "pricing.toml")) + return _deduped_named_paths(named) + + +def _user_pricing_path() -> Path: + return Path.home() / ".nasde" / "pricing.toml" + + +def _deduped_named_paths(named: list[tuple[str, Path]]) -> list[tuple[str, Path]]: + by_resolved: dict[Path, tuple[str, Path]] = {} + for name, path in named: + by_resolved[path.resolve()] = (name, path) + return list(by_resolved.values()) + + +def _pricing_from_raw(raw: dict) -> dict[str, ModelPrice]: + return {name: _model_price_from_dict(entry) for name, entry in 
raw.get("models", {}).items()} + + def _read_pricing_toml(path: str | Path | None) -> dict: if path is not None: with open(path, "rb") as f: diff --git a/src/nasde_toolkit/pricing_report.py b/src/nasde_toolkit/pricing_report.py new file mode 100644 index 0000000..0bba38f --- /dev/null +++ b/src/nasde_toolkit/pricing_report.py @@ -0,0 +1,39 @@ +"""Rendering for the effective (merged) pricing catalog. + +Shared by ``nasde pricing show`` and the ``nasde run`` summary so the rate table +looks identical wherever it appears. See ADR-013. +""" + +from __future__ import annotations + +from rich.table import Table + +from nasde_toolkit.pricing import ModelPrice + + +def render_pricing_table( + entries: dict[str, tuple[ModelPrice, str]], + show_source: bool = False, + title: str = "Effective pricing", +) -> Table: + """Build a Rich table of model rates, optionally with the source layer column.""" + table = Table(title=title) + table.add_column("Model", style="cyan") + table.add_column("In / 1M", justify="right") + table.add_column("Out / 1M", justify="right") + if show_source: + table.add_column("Layer", justify="left") + table.add_column("as_of", justify="left", style="dim") + for model in sorted(entries): + price, layer = entries[model] + row = [model, _fmt_rate(price.input_per_1m), _fmt_rate(price.output_per_1m)] + if show_source: + row.append(layer) + row.append(price.as_of or "—") + table.add_row(*row) + return table + + +def _fmt_rate(rate: float) -> str: + trimmed = f"{rate:.4f}".rstrip("0").rstrip(".") + return f"${trimmed or '0'}" diff --git a/src/nasde_toolkit/results_exporter.py b/src/nasde_toolkit/results_exporter.py index f0b8127..9b4dd03 100644 --- a/src/nasde_toolkit/results_exporter.py +++ b/src/nasde_toolkit/results_exporter.py @@ -24,7 +24,7 @@ _resolve_task_name, resolve_reasoning_effort, ) -from nasde_toolkit.pricing import load_pricing +from nasde_toolkit.pricing import ModelPrice, load_pricing_layered from nasde_toolkit.token_metrics import 
build_trial_economics console = Console() @@ -39,17 +39,48 @@ class ExportSummary: failed: list[str] = field(default_factory=list) -def export_results(paths: list[Path], dest: Path, include_trajectory: bool = True) -> ExportSummary: +def export_results( + paths: list[Path], + dest: Path, + include_trajectory: bool = True, + project_dir: Path | None = None, +) -> ExportSummary: """Export trial artifacts from the given job and/or trial paths into dest.""" dest.mkdir(parents=True, exist_ok=True) trials = _expand_to_trials(paths) + pricing = load_pricing_layered(project_dir) summary = ExportSummary() for job_name, trial_dir in trials: - _export_one_trial(job_name, trial_dir, dest, include_trajectory, summary) + _export_one_trial(job_name, trial_dir, dest, include_trajectory, summary, pricing) + _write_pricing_used(dest, [trial_dir for _, trial_dir in trials], project_dir) _print_summary(summary, dest) return summary +def _write_pricing_used(dest: Path, trial_dirs: list[Path], project_dir: Path | None) -> None: + from nasde_toolkit.pricing import effective_pricing_with_source + + used_models = {_resolve_model_name(td) for td in trial_dirs} + used_models.discard("") + if not used_models: + return + effective = effective_pricing_with_source(project_dir) + report = {} + for model in sorted(used_models): + entry = effective.get(model) + if entry is None: + continue + price, layer = entry + report[model] = { + "input_per_1m": price.input_per_1m, + "output_per_1m": price.output_per_1m, + "as_of": price.as_of, + "layer": layer, + } + if report: + (dest / "pricing_used.json").write_text(json.dumps(report, indent=2)) + + def _expand_to_trials(paths: list[Path]) -> list[tuple[str, Path]]: trials: list[tuple[str, Path]] = [] seen: set[Path] = set() @@ -117,13 +148,14 @@ def _export_one_trial( dest: Path, include_trajectory: bool, summary: ExportSummary, + pricing: dict[str, ModelPrice], ) -> None: out_dir = dest / f"{job_name}__{trial_dir.name}" label = out_dir.name existed = 
out_dir.exists() try: out_dir.mkdir(parents=True, exist_ok=True) - _write_metrics(trial_dir, out_dir) + _write_metrics(trial_dir, out_dir, pricing) copied = _copy_assessment_files(trial_dir, out_dir) _copy_verifier_files(trial_dir, out_dir) if include_trajectory and not (out_dir / "trajectory.json").exists(): @@ -144,16 +176,16 @@ def _export_one_trial( summary.skipped.append(label) -def _write_metrics(trial_dir: Path, out_dir: Path) -> None: - metrics = _build_metrics(trial_dir) +def _write_metrics(trial_dir: Path, out_dir: Path, pricing: dict[str, ModelPrice]) -> None: + metrics = _build_metrics(trial_dir, pricing) (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2)) -def _build_metrics(trial_dir: Path) -> dict: +def _build_metrics(trial_dir: Path, pricing: dict[str, ModelPrice]) -> dict: result = _load_json(trial_dir / "result.json") model = _resolve_model_name(trial_dir) score_stats = _resolve_score_stats(trial_dir) - economics = build_trial_economics(trial_dir, model, load_pricing()) + economics = build_trial_economics(trial_dir, model, pricing) return { "trial_name": result.get("trial_name", trial_dir.name), "task_name": _resolve_task_name(result), diff --git a/src/nasde_toolkit/runner.py b/src/nasde_toolkit/runner.py index a49ad2c..2fad32e 100644 --- a/src/nasde_toolkit/runner.py +++ b/src/nasde_toolkit/runner.py @@ -1069,9 +1069,11 @@ async def _run_job_with_streaming_eval( ) -> None: """Run Harbor job with assessment eval starting per trial as they complete.""" from nasde_toolkit.evaluator import evaluate_and_record_trial + from nasde_toolkit.pricing import load_pricing_layered project_name = config.reporting.project_name or config.name eval_semaphore = asyncio.Semaphore(max_concurrent_eval) + pricing = load_pricing_layered(config.project_dir) assessment_tasks: list[asyncio.Task] = [] async def _on_trial_complete(event: object) -> None: @@ -1084,6 +1086,7 @@ async def _on_trial_complete(event: object) -> None: with_opik=with_opik, 
semaphore=eval_semaphore, eval_config=config.evaluation, + pricing=pricing, ) ) assessment_tasks.append(task) @@ -1101,7 +1104,7 @@ async def _on_trial_complete(event: object) -> None: if assessment_tasks: console.print(f"[dim]Waiting for {len(assessment_tasks)} assessment evaluation(s)...[/dim]") await asyncio.gather(*assessment_tasks, return_exceptions=True) - _print_job_summary(result, _job_dir_from_config(merged_config)) + _print_job_summary(result, _job_dir_from_config(merged_config), config.project_dir) console.print("\n[bold green]Benchmark execution completed[/bold green]\n") @@ -1141,7 +1144,7 @@ async def _run_job( os.chdir(saved_cwd) -def _print_job_summary(result: JobResult, job_dir: Path | None = None) -> None: +def _print_job_summary(result: JobResult, job_dir: Path | None = None, project_dir: Path | None = None) -> None: console.print() console.print("[bold]Job completed[/bold]") console.print(f" Trials: {result.stats.n_completed_trials}") @@ -1152,6 +1155,7 @@ def _print_job_summary(result: JobResult, job_dir: Path | None = None) -> None: if rows: _print_economics_table(rows) _print_label_legend(rows) + _print_pricing_used(rows, project_dir) _print_location_hints(job_dir) elif result.stats.evals: _warn_missing_economics(job_dir) @@ -1161,6 +1165,20 @@ def _print_job_summary(result: JobResult, job_dir: Path | None = None) -> None: console.print() +def _print_pricing_used(rows: list[dict], project_dir: Path | None) -> None: + from nasde_toolkit.pricing import effective_pricing_with_source + from nasde_toolkit.pricing_report import render_pricing_table + + used_models = {row["model"] for row in rows if row["model"]} + if not used_models: + return + effective = effective_pricing_with_source(project_dir) + entries = {model: effective[model] for model in used_models if model in effective} + if not entries: + return + console.print(render_pricing_table(entries, show_source=True, title="Pricing used (effective)")) + + def _warn_missing_economics(job_dir: 
Path) -> None: console.print( f"[yellow]WARN: no assessment_summary.json found under {job_dir} — " @@ -1249,6 +1267,7 @@ def _finalize_economics_row(label: tuple[str, str, str], agg: dict) -> dict: return { "full_label": f"{agent} / {model}" if model else agent, "short_label": _short_label(agent, model), + "model": model, "reasoning_effort": effort, "trials": agg["trials"], "score": _mean(agg["scores"]), diff --git a/src/nasde_toolkit/scaffold/__init__.py b/src/nasde_toolkit/scaffold/__init__.py index f1f83d5..2cde76c 100644 --- a/src/nasde_toolkit/scaffold/__init__.py +++ b/src/nasde_toolkit/scaffold/__init__.py @@ -102,6 +102,28 @@ # tasks = ["my-benchmark/task-name"] """ +PRICING_TOML_EXAMPLE = """\ +# Optional per-project model price overrides (USD per 1M tokens). +# +# Precedence (higher wins, merged per model): this file > ~/.nasde/pricing.toml +# > the catalog bundled with nasde. List ONLY the models you want to change or +# add; every other model falls through to the layer below. +# +# The model name MUST match the `model` in your variant.toml (and the model_name +# recorded in each trial's config.json) — otherwise the override is silently +# ignored. Verify with: nasde pricing show -C . --show-source +# (your model should appear with Layer "project"; if it shows "bundled", the +# name does not match — check for a typo). +# +# Uncomment and edit to activate. Prices use a decimal POINT (2.5), not a comma. +# +# [models."claude-sonnet-4-6"] +# input_per_1m = 2.5 +# output_per_1m = 11.0 +# as_of = "2026-01-01" # optional: when you confirmed this rate +# source = "your contract / rate card" # optional: where it came from +""" + GITIGNORE_TEMPLATE = """\ jobs/ # Generated by [nasde.plugin]: staged plugin tree inside the build context. 
@@ -160,6 +182,7 @@ def create_project(project_dir: Path, name: str) -> None: _write_if_missing(project_dir / "nasde.toml", NASDE_TOML_TEMPLATE.format(name=name)) _write_if_missing(project_dir / "assessment_dimensions.json", ASSESSMENT_DIMENSIONS_TEMPLATE) + _write_if_missing(project_dir / "pricing.toml.example", PRICING_TOML_EXAMPLE) _write_if_missing(project_dir / ".gitignore", GITIGNORE_TEMPLATE) _write_if_missing(project_dir / ".gitattributes", GITATTRIBUTES_TEMPLATE) @@ -199,6 +222,7 @@ def create_project(project_dir: Path, name: str) -> None: console.print(f"[green]Project scaffolded at[/green] {project_dir}") console.print(" nasde.toml") console.print(" assessment_dimensions.json") + console.print(" pricing.toml.example (optional model price overrides)") console.print(" tasks/example-task/") console.print(" variants/vanilla/variant.toml") console.print(" variants/vanilla/CLAUDE.md") diff --git a/tests/conftest.py b/tests/conftest.py new file mode 100644 index 0000000..a51c777 --- /dev/null +++ b/tests/conftest.py @@ -0,0 +1,22 @@ +"""Shared pytest fixtures.""" + +from __future__ import annotations + +from pathlib import Path + +import pytest + +from nasde_toolkit import pricing as pricing_module + + +@pytest.fixture(autouse=True) +def empty_user_layer(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Path: + """Isolate the layered-pricing user layer from the developer's real ~/.nasde. + + Autouse so every test in the tree is hermetic by default: `load_pricing_layered` + never reads the host's `~/.nasde/pricing.toml`. Tests that need a real user + layer request this fixture by name and write to the returned path. 
+ """ + user_file = tmp_path / "user-home" / ".nasde" / "pricing.toml" + monkeypatch.setattr(pricing_module, "_user_pricing_path", lambda: user_file) + return user_file diff --git a/tests/test_cli.py b/tests/test_cli.py index 39835a2..79f3024 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -216,3 +216,35 @@ def test_calibrate_publish_forwards_resolved_sink(mock_publish: object, tmp_path assert kwargs["repo"] == "NoesisVision/nasde-calibration" assert kwargs["repo_url"] == "git@github.com:NoesisVision/nasde-calibration.git" assert kwargs["platform_override"] == "github" + + +def test_pricing_show_lists_bundled_models(tmp_path: Path) -> None: + (tmp_path / "nasde.toml").write_text('[project]\nname = "test"\n') + + result = runner.invoke(app, ["pricing", "show", "-C", str(tmp_path)]) + + assert result.exit_code == 0, result.output + assert "claude-sonnet-4-6" in result.output + assert "Effective pricing" in result.output + + +def test_pricing_show_reflects_project_override(tmp_path: Path) -> None: + (tmp_path / "nasde.toml").write_text('[project]\nname = "test"\n') + (tmp_path / "pricing.toml").write_text('[models."claude-sonnet-4-6"]\ninput_per_1m = 42.0\noutput_per_1m = 99.0\n') + + result = runner.invoke(app, ["pricing", "show", "-C", str(tmp_path)]) + + assert result.exit_code == 0, result.output + assert "$42" in result.output + + +def test_pricing_show_source_column(tmp_path: Path) -> None: + (tmp_path / "nasde.toml").write_text('[project]\nname = "test"\n') + (tmp_path / "pricing.toml").write_text('[models."azure-x"]\ninput_per_1m = 0.5\noutput_per_1m = 1.0\n') + + result = runner.invoke(app, ["pricing", "show", "-C", str(tmp_path), "--show-source"]) + + assert result.exit_code == 0, result.output + assert "Layer" in result.output + assert "project" in result.output + assert "bundled" in result.output diff --git a/tests/test_eval_migration.py b/tests/test_eval_migration.py index 5247cd4..c803252 100644 --- a/tests/test_eval_migration.py +++ 
b/tests/test_eval_migration.py @@ -5,7 +5,9 @@ import json from pathlib import Path -from nasde_toolkit.eval_migration import migrate_trial_evals +import pytest + +from nasde_toolkit.eval_migration import migrate_job_evals, migrate_trial_evals def _eval_payload(normalized_score: float, evaluator_model: str = "claude-opus-4-7") -> dict: @@ -120,3 +122,39 @@ def test_migrate_mixed_models_yields_two_groups(tmp_path: Path) -> None: summary = json.loads((tmp_path / "assessment_summary.json").read_text()) models = {g["evaluator_model"] for g in summary["groups"]} assert models == {"claude-opus-4-7", "codex-gpt-5"} + + +def _seed_trajectory(trial_dir: Path) -> None: + (trial_dir / "config.json").write_text( + json.dumps({"agent": {"name": "demo-variant", "model_name": "claude-sonnet-4-6"}}) + ) + (trial_dir / "result.json").write_text(json.dumps({"trial_name": trial_dir.name})) + agent_dir = trial_dir / "agent" + agent_dir.mkdir() + (agent_dir / "trajectory.json").write_text( + json.dumps( + { + "final_metrics": { + "total_prompt_tokens": 1_000_000, + "total_completion_tokens": 50_000, + "extra": {"reasoning_output_tokens": 10_000}, + } + } + ) + ) + + +def test_migrate_job_evals_threads_project_pricing(tmp_path: Path) -> None: + job = tmp_path / "jobs" / "demo-job" + trial = job / "demo-task__aaa" + trial.mkdir(parents=True) + _seed_trajectory(trial) + _write(trial, "assessment_eval.json", _eval_payload(0.6)) + project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text('[models."claude-sonnet-4-6"]\ninput_per_1m = 30.0\noutput_per_1m = 150.0\n') + + migrate_job_evals(job, project_dir=project) + + summary = json.loads((trial / "assessment_summary.json").read_text()) + assert summary["cost_usd"] == pytest.approx(1_000_000 / 1e6 * 30.0 + 60_000 / 1e6 * 150.0) diff --git a/tests/test_evaluator.py b/tests/test_evaluator.py index 0094555..7f7ac2e 100644 --- a/tests/test_evaluator.py +++ b/tests/test_evaluator.py @@ -10,6 +10,7 @@ import pytest 
+from nasde_toolkit import pricing as pricing_module from nasde_toolkit.config import EvaluationConfig from nasde_toolkit.evaluator import ( DimensionScore, @@ -28,6 +29,7 @@ _write_assessment_summary, _write_evaluation_result, ) +from nasde_toolkit.pricing import load_pricing, load_pricing_layered def _make_evaluation( @@ -236,7 +238,7 @@ def test_write_assessment_summary_has_no_judge_reasoning_text(tmp_path: Path) -> _write_evaluation_result(tmp_path, _make_evaluation(0.6, dim_score=6)) _write_evaluation_result(tmp_path, _make_evaluation(0.7, dim_score=7)) - summary = _write_assessment_summary(tmp_path) + summary = _write_assessment_summary(tmp_path, load_pricing()) assert summary is not None raw = (tmp_path / "assessment_summary.json").read_text() assert '"reasoning":' not in raw # verbose per-dimension judge reasoning is not dumped into the summary @@ -269,7 +271,7 @@ def test_assessment_summary_includes_economics(tmp_path: Path) -> None: _write_evaluation_result(tmp_path, _make_evaluation(0.6, dim_score=6)) _write_evaluation_result(tmp_path, _make_evaluation(0.7, dim_score=7)) - summary = _write_assessment_summary(tmp_path) + summary = _write_assessment_summary(tmp_path, load_pricing()) assert summary is not None assert summary.model_name == "claude-sonnet-4-6" assert summary.token_usage["total_tokens"] == 1_060_000 @@ -280,10 +282,51 @@ def test_assessment_summary_includes_economics(tmp_path: Path) -> None: assert not hasattr(summary, "token_efficiency") +def _model_block(name: str, input_per_1m: float, output_per_1m: float) -> str: + return f'[models."{name}"]\ninput_per_1m = {input_per_1m}\noutput_per_1m = {output_per_1m}\n' + + +def test_evaluator_picks_up_project_pricing_override(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + monkeypatch.setattr(pricing_module, "_user_pricing_path", lambda: tmp_path / "no-user" / "pricing.toml") + trial = tmp_path / "trial" + trial.mkdir() + _seed_trajectory(trial) + _write_evaluation_result(trial, 
_make_evaluation(0.6, dim_score=6)) + project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text(_model_block("claude-sonnet-4-6", 30.0, 150.0)) + + summary = _write_assessment_summary(trial, load_pricing_layered(project)) + assert summary is not None + assert summary.cost_usd == pytest.approx(1_000_000 / 1e6 * 30.0 + 60_000 / 1e6 * 150.0) + + +def test_evaluator_three_layer_compose_e2e(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + user_file = tmp_path / "user-home" / ".nasde" / "pricing.toml" + monkeypatch.setattr(pricing_module, "_user_pricing_path", lambda: user_file) + user_file.parent.mkdir(parents=True) + user_file.write_text(_model_block("azure-gpt5", 1.0, 2.0)) + trial = tmp_path / "trial" + trial.mkdir() + _seed_trajectory(trial) + config = json.loads((trial / "config.json").read_text()) + config["agent"]["model_name"] = "azure-gpt5" + (trial / "config.json").write_text(json.dumps(config)) + _write_evaluation_result(trial, _make_evaluation(0.6, dim_score=6)) + project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text(_model_block("azure-gpt5", 0.5, 1.0)) + + summary = _write_assessment_summary(trial, load_pricing_layered(project)) + assert summary is not None + assert summary.model_name == "azure-gpt5" + assert summary.cost_usd == pytest.approx(1_000_000 / 1e6 * 0.5 + 60_000 / 1e6 * 1.0) + + def test_assessment_summary_economics_null_without_trajectory(tmp_path: Path) -> None: _write_evaluation_result(tmp_path, _make_evaluation(0.6, dim_score=6)) - summary = _write_assessment_summary(tmp_path) + summary = _write_assessment_summary(tmp_path, load_pricing()) assert summary is not None assert summary.token_usage is None assert summary.cost_usd is None @@ -295,7 +338,7 @@ def test_assessment_summary_stamps_reasoning_effort_from_config(tmp_path: Path) ) _write_evaluation_result(tmp_path, _make_evaluation(0.6, dim_score=6)) - summary = _write_assessment_summary(tmp_path) + summary = 
_write_assessment_summary(tmp_path, load_pricing()) assert summary is not None assert summary.reasoning_effort == "xhigh" @@ -306,7 +349,7 @@ def test_assessment_summary_reasoning_effort_empty_when_unset(tmp_path: Path) -> ) _write_evaluation_result(tmp_path, _make_evaluation(0.6, dim_score=6)) - summary = _write_assessment_summary(tmp_path) + summary = _write_assessment_summary(tmp_path, load_pricing()) assert summary is not None assert summary.reasoning_effort == "" # no override → never synthesize a default @@ -329,6 +372,7 @@ async def fake_evaluate_trial( with_opik=False, semaphore=asyncio.Semaphore(10), eval_config=config, + pricing=load_pricing(), ) ) @@ -355,6 +399,7 @@ async def fake_evaluate_trial( with_opik=False, semaphore=asyncio.Semaphore(10), eval_config=config, + pricing=load_pricing(), ) ) @@ -383,6 +428,7 @@ async def flaky_evaluate_trial( with_opik=False, semaphore=asyncio.Semaphore(10), eval_config=config, + pricing=load_pricing(), ) ) diff --git a/tests/test_pricing.py b/tests/test_pricing.py index aaa1b3a..6d476f1 100644 --- a/tests/test_pricing.py +++ b/tests/test_pricing.py @@ -2,11 +2,50 @@ from __future__ import annotations +from dataclasses import FrozenInstanceError from pathlib import Path import pytest -from nasde_toolkit.pricing import compute_cost_usd, load_pricing, pricing_as_of +from nasde_toolkit.pricing import ( + compute_cost_usd, + effective_pricing_with_source, + load_pricing, + load_pricing_layered, + pricing_as_of, + resolve_pricing_layers, +) + + +def test_model_price_is_frozen() -> None: + price = load_pricing()["claude-sonnet-4-6"] + with pytest.raises(FrozenInstanceError): + price.input_per_1m = 0.01 # type: ignore[misc] + + +def test_layered_malformed_toml_exits_with_path(tmp_path: Path, empty_user_layer: Path) -> None: + (tmp_path / "pricing.toml").write_text('[models."m"]\ninput_per_1m = 2,5\noutput_per_1m = 1.0\n') + with pytest.raises(SystemExit): + load_pricing_layered(tmp_path) + + +def 
test_layered_missing_required_field_exits(tmp_path: Path, empty_user_layer: Path) -> None: + (tmp_path / "pricing.toml").write_text('[models."m"]\ninput_per_1m = 2.5\n') + with pytest.raises(SystemExit): + load_pricing_layered(tmp_path) + + +def _write_pricing(directory: Path, body: str) -> Path: + path = directory / "pricing.toml" + path.write_text(body) + return path + + +def _model_block(name: str, input_per_1m: float, output_per_1m: float, source: str | None = None) -> str: + block = f'[models."{name}"]\ninput_per_1m = {input_per_1m}\noutput_per_1m = {output_per_1m}\n' + if source is not None: + block += f'source = "{source}"\n' + return block def test_load_bundled_pricing_has_matrix_models() -> None: @@ -42,3 +81,124 @@ def test_compute_cost_unknown_model_returns_none() -> None: def test_pricing_as_of_unknown_model_is_none() -> None: assert pricing_as_of("nonexistent-model", load_pricing()) is None assert pricing_as_of("gpt-5.4", load_pricing()) == "2026-06-08" + + +def test_layered_no_overrides_returns_bundled(tmp_path: Path, empty_user_layer: Path) -> None: + merged = load_pricing_layered(tmp_path) + assert set(merged) == set(load_pricing()) + + +def test_layered_project_overrides_bundled(tmp_path: Path, empty_user_layer: Path) -> None: + _write_pricing(tmp_path, _model_block("claude-sonnet-4-6", 99.0, 1.0)) + merged = load_pricing_layered(tmp_path) + assert merged["claude-sonnet-4-6"].input_per_1m == 99.0 + assert merged["gpt-5.5"].input_per_1m == load_pricing()["gpt-5.5"].input_per_1m + + +def test_layered_project_adds_new_model(tmp_path: Path, empty_user_layer: Path) -> None: + _write_pricing(tmp_path, _model_block("my-model", 7.0, 8.0)) + merged = load_pricing_layered(tmp_path) + assert "my-model" in merged + assert set(load_pricing()).issubset(set(merged)) + + +def test_layered_user_layer_applied(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + 
empty_user_layer.write_text(_model_block("claude-opus-4-8", 4.0, 1.0)) + merged = load_pricing_layered(tmp_path) + assert merged["claude-opus-4-8"].input_per_1m == 4.0 + + +def test_layered_project_beats_user(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("claude-opus-4-8", 4.0, 1.0)) + _write_pricing(tmp_path, _model_block("claude-opus-4-8", 6.0, 2.0)) + merged = load_pricing_layered(tmp_path) + assert merged["claude-opus-4-8"].input_per_1m == 6.0 + + +def test_layered_missing_user_file_skipped(tmp_path: Path, empty_user_layer: Path) -> None: + assert not empty_user_layer.exists() + merged = load_pricing_layered(tmp_path) + assert set(merged) == set(load_pricing()) + + +def test_layered_whole_entry_replacement(tmp_path: Path, empty_user_layer: Path) -> None: + bundled = load_pricing()["claude-sonnet-4-6"] + assert bundled.cached_input_per_1m is not None and bundled.source + _write_pricing(tmp_path, _model_block("claude-sonnet-4-6", 2.0, 1.0)) + merged = load_pricing_layered(tmp_path) + assert merged["claude-sonnet-4-6"].input_per_1m == 2.0 + assert merged["claude-sonnet-4-6"].cached_input_per_1m is None + assert merged["claude-sonnet-4-6"].source == "" + + +def test_layered_three_layers_compose(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("claude-opus-4-8", 4.0, 1.0) + _model_block("azure-gpt5", 1.0, 2.0)) + _write_pricing( + tmp_path, + _model_block("claude-sonnet-4-6", 2.0, 1.0) + + _model_block("azure-gpt5", 0.5, 1.0) + + _model_block("enterprise-claude", 10.0, 20.0), + ) + merged = load_pricing_layered(tmp_path) + assert set(merged) == { + "gpt-5.5", + "gpt-5.4", + "claude-opus-4-8", + "claude-sonnet-4-6", + "azure-gpt5", + "enterprise-claude", + } + assert merged["gpt-5.5"].input_per_1m == 5.0 + assert merged["gpt-5.4"].input_per_1m == 2.50 + assert 
merged["claude-opus-4-8"].input_per_1m == 4.0 + assert merged["claude-sonnet-4-6"].input_per_1m == 2.0 + assert merged["azure-gpt5"].input_per_1m == 0.5 + assert merged["enterprise-claude"].input_per_1m == 10.0 + + +def test_layered_three_layers_whole_entry_on_overlap(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("azure-gpt5", 1.0, 2.0, source="azure-contract")) + _write_pricing(tmp_path, _model_block("azure-gpt5", 0.5, 1.0)) + merged = load_pricing_layered(tmp_path) + assert merged["azure-gpt5"].input_per_1m == 0.5 + assert merged["azure-gpt5"].source == "" + + +def test_resolve_layers_no_overrides_is_bundled_only(tmp_path: Path, empty_user_layer: Path) -> None: + layers = resolve_pricing_layers(tmp_path) + assert [layer.name for layer in layers] == ["bundled"] + assert layers[0].present is True + assert layers[0].path is None + assert set(layers[0].models) == set(load_pricing()) + + +def test_resolve_layers_three_present(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("claude-opus-4-8", 4.0, 1.0)) + _write_pricing(tmp_path, _model_block("my-model", 7.0, 8.0)) + layers = {layer.name: layer for layer in resolve_pricing_layers(tmp_path)} + assert set(layers) == {"bundled", "user", "project"} + assert set(layers["user"].models) == {"claude-opus-4-8"} + assert set(layers["project"].models) == {"my-model"} + assert layers["project"].path == tmp_path / "pricing.toml" + + +def test_effective_pricing_with_source_three_layers(tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("claude-opus-4-8", 4.0, 1.0) + _model_block("azure-gpt5", 1.0, 2.0)) + _write_pricing(tmp_path, _model_block("azure-gpt5", 0.5, 1.0) + _model_block("enterprise-claude", 10.0, 20.0)) + eff = 
effective_pricing_with_source(tmp_path) + assert eff["gpt-5.5"][1] == "bundled" + assert eff["claude-opus-4-8"][1] == "user" + assert eff["azure-gpt5"][1] == "project" + assert eff["azure-gpt5"][0].input_per_1m == 0.5 + assert eff["enterprise-claude"][1] == "project" + + +def test_load_pricing_layered_matches_effective_keys(tmp_path: Path, empty_user_layer: Path) -> None: + _write_pricing(tmp_path, _model_block("my-model", 7.0, 8.0)) + assert set(load_pricing_layered(tmp_path)) == set(effective_pricing_with_source(tmp_path)) diff --git a/tests/test_pricing_report.py b/tests/test_pricing_report.py new file mode 100644 index 0000000..738ef53 --- /dev/null +++ b/tests/test_pricing_report.py @@ -0,0 +1,33 @@ +"""Tests for the effective-pricing table renderer.""" + +from __future__ import annotations + +import pytest + +from nasde_toolkit.pricing import ModelPrice +from nasde_toolkit.pricing_report import _fmt_rate, render_pricing_table + + +@pytest.mark.parametrize( + ("rate", "expected"), + [ + (3.0, "$3"), + (2.5, "$2.5"), + (0.5, "$0.5"), + (0.0125, "$0.0125"), + (0.0001, "$0.0001"), + (0.0, "$0"), + (75.0, "$75"), + (1_000_000.0, "$1000000"), + ], +) +def test_fmt_rate_no_scientific_notation(rate: float, expected: str) -> None: + assert _fmt_rate(rate) == expected + + +def test_render_pricing_table_source_column_toggles() -> None: + entries = {"m": (ModelPrice(input_per_1m=3.0, output_per_1m=15.0, as_of="2026-06-08"), "bundled")} + with_source = render_pricing_table(entries, show_source=True) + without_source = render_pricing_table(entries, show_source=False) + assert with_source.columns[-2].header == "Layer" + assert all(column.header != "Layer" for column in without_source.columns) diff --git a/tests/test_results_exporter.py b/tests/test_results_exporter.py index 9c93fe7..9f4670d 100644 --- a/tests/test_results_exporter.py +++ b/tests/test_results_exporter.py @@ -15,6 +15,10 @@ ) +def _model_block(name: str, input_per_1m: float, output_per_1m: float) -> str: + 
return f'[models."{name}"]\ninput_per_1m = {input_per_1m}\noutput_per_1m = {output_per_1m}\n' + + def _git(workspace: Path, *args: str) -> str: completed = subprocess.run( ["git", "-C", str(workspace), *args], @@ -321,6 +325,63 @@ def test_export_legacy_bare_idempotent(tmp_path: Path) -> None: assert out.name in second.skipped +def test_export_picks_up_project_pricing_override(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: + project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text(_model_block("claude-sonnet-4-6", 30.0, 150.0)) + dest = tmp_path / "export" + + export_results([job_dir], dest, project_dir=project) + metrics = json.loads((dest / "2026-06-03__demo-job__demo-task__aaa111" / "metrics.json").read_text()) + + assert metrics["cost_usd"] == pytest.approx(1_000_000 / 1e6 * 30.0 + 60_000 / 1e6 * 150.0) + + +def test_export_project_dir_none_uses_bundled(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: + dest = tmp_path / "export" + export_results([job_dir], dest, project_dir=None) + metrics = json.loads((dest / "2026-06-03__demo-job__demo-task__aaa111" / "metrics.json").read_text()) + + assert metrics["cost_usd"] == pytest.approx(3.9) + + +def test_export_three_layer_compose_e2e(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: + empty_user_layer.parent.mkdir(parents=True, exist_ok=True) + empty_user_layer.write_text(_model_block("claude-sonnet-4-6", 1.0, 2.0)) + project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text(_model_block("claude-sonnet-4-6", 0.5, 1.0)) + dest = tmp_path / "export" + + export_results([job_dir], dest, project_dir=project) + metrics = json.loads((dest / "2026-06-03__demo-job__demo-task__aaa111" / "metrics.json").read_text()) + + assert metrics["cost_usd"] == pytest.approx(1_000_000 / 1e6 * 0.5 + 60_000 / 1e6 * 1.0) + + +def test_export_writes_pricing_used_with_layer(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: 
+ project = tmp_path / "project" + project.mkdir() + (project / "pricing.toml").write_text(_model_block("claude-sonnet-4-6", 0.5, 1.0)) + dest = tmp_path / "export" + + export_results([job_dir], dest, project_dir=project) + report = json.loads((dest / "pricing_used.json").read_text()) + + assert "claude-sonnet-4-6" in report + assert report["claude-sonnet-4-6"]["input_per_1m"] == 0.5 + assert report["claude-sonnet-4-6"]["layer"] == "project" + + +def test_export_pricing_used_layer_bundled(job_dir: Path, tmp_path: Path, empty_user_layer: Path) -> None: + dest = tmp_path / "export" + export_results([job_dir], dest, project_dir=None) + report = json.loads((dest / "pricing_used.json").read_text()) + + assert report["claude-sonnet-4-6"]["layer"] == "bundled" + assert report["claude-sonnet-4-6"]["input_per_1m"] == 3.0 + + def test_capture_patch_includes_non_ascii_untracked_filename(tmp_path: Path) -> None: workspace = tmp_path / "artifacts" / "workspace" workspace.mkdir(parents=True) diff --git a/tests/test_runner_economics.py b/tests/test_runner_economics.py index d45cdc6..f0b31c7 100644 --- a/tests/test_runner_economics.py +++ b/tests/test_runner_economics.py @@ -236,3 +236,12 @@ def test_print_job_summary_renders_economics_for_own_job( out = _flat(capsys.readouterr().out) assert "Results by agent/model" in out assert "no assessment_summary.json found" not in out + + +def test_economics_row_carries_model_name(tmp_path: Path) -> None: + job = tmp_path / "job" + _write_trial(job, "t__a", "codex-vanilla", "gpt-5.4", 0.7, 1000, 1.0) + _write_trial(job, "t__b", "claude-vanilla", "claude-sonnet-4-6", 0.8, 1000, 1.0) + + rows = _collect_economics_rows(job) + assert {row["model"] for row in rows} == {"gpt-5.4", "claude-sonnet-4-6"} diff --git a/tests/test_scaffold.py b/tests/test_scaffold.py new file mode 100644 index 0000000..fff2ac6 --- /dev/null +++ b/tests/test_scaffold.py @@ -0,0 +1,25 @@ +"""Tests for project scaffolding (nasde init).""" + +from __future__ import 
annotations + +import tomllib +from pathlib import Path + +from nasde_toolkit.scaffold import create_project + + +def test_create_project_writes_pricing_example(tmp_path: Path) -> None: + create_project(tmp_path, "demo") + example = tmp_path / "pricing.toml.example" + assert example.exists() + body = example.read_text() + assert "claude-sonnet-4-6" in body # a real bundled model name to copy + assert "decimal" in body.lower() # the comma-vs-point hint + + +def test_create_project_pricing_example_is_inert(tmp_path: Path) -> None: + create_project(tmp_path, "demo") + parsed = tomllib.loads((tmp_path / "pricing.toml.example").read_text()) + assert parsed == {} # fully commented — no active [models] until the user edits + + assert not (tmp_path / "pricing.toml").exists() # scaffold never writes a LIVE override diff --git a/website/src/content/docs/concepts/token-cost.md b/website/src/content/docs/concepts/token-cost.md index 22c97c6..1b83a7d 100644 --- a/website/src/content/docs/concepts/token-cost.md +++ b/website/src/content/docs/concepts/token-cost.md @@ -62,8 +62,102 @@ source = "https://…" A model that isn't in the catalog still gets token metrics — only its `cost_usd` is left blank (with a warning), never a wrong number. -:::note[Editing the catalog] -The catalog is **bundled into the package**, so editing it depends on how you installed Nasde. From a source checkout (`uv sync`) you can edit `src/nasde_toolkit/pricing.toml` directly. After a PyPI install (`uv tool install` / pipx) the file lives inside an isolated environment and any edit is overwritten on the next upgrade — so for now, adding a model or correcting a rate means contributing it upstream or running from source. A per-project / per-user pricing override is a planned improvement. +### Overriding rates — drop a `pricing.toml` + +The bundled catalog is the **floor**. 
To correct a rate or add a model, drop your own `pricing.toml` at one of two locations — Nasde finds it by name, no config setting: + +- **`/pricing.toml`** — per-project, sits next to `nasde.toml`. Highest precedence. +- **`~/.nasde/pricing.toml`** — per-user, applies to every project on the machine. + +The precedence is **project > user > bundled**, merged **per model**: each override file lists *only* the models you want to change or add, and every other model falls through to the layer below. Overriding one model leaves the rest of the catalog intact. (A model entry is replaced whole — fields you omit take their defaults, they aren't inherited from the bundled entry.) When an override is applied, Nasde prints a line saying so. Both `nasde run` and `nasde results-export` read the same layered catalog, so a trial's cost is identical whether you see it in the run summary or a later export. + +```toml +# ~/.nasde/pricing.toml — your enterprise rate for one model; the rest stays bundled +[models."claude-opus-4-8"] +input_per_1m = 4.0 # prices use a decimal point (4.0), not a comma +output_per_1m = 12.0 +as_of = "2026-06-22" # optional — when you confirmed this rate +source = "internal contract" # optional — where it came from +``` + +`nasde init` drops a fully-commented `pricing.toml.example` in a new project — copy it to `pricing.toml` and edit, rather than writing one from scratch. + +:::caution[The model name must match exactly] +The key (`claude-opus-4-8` above) must match the `model` in your `variant.toml` — i.e. the `model_name` recorded in each trial. If it doesn't (a typo like `claude-opus-4.8`, or a model you don't actually run), the override is **silently ignored** and the trial keeps the bundled rate — no error. Confirm your override took effect with `--show-source` (below): your model should show layer `project`/`user`, not `bundled`. 
+::: + +### Verifying the effective catalog + +After dropping an override, **always check it took effect** — run `pricing show` with `--show-source`: + +```bash +nasde pricing show -C ./my-benchmark --show-source +``` + +With a project override of `claude-sonnet-4-6` (to $2.5 / $11), you'll see it win the `project` layer while everything else stays `bundled`: + +``` + Effective pricing +┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓ +┃ Model ┃ In / 1M ┃ Out / 1M ┃ Layer ┃ as_of ┃ +┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩ +│ claude-opus-4-8 │ $5 │ $15 │ bundled │ 2026-06-08 │ +│ claude-sonnet-4-6 │ $2.5 │ $11 │ project │ 2026-06-24 │ +│ gpt-5.4 │ $2.5 │ $15 │ bundled │ 2026-06-08 │ +│ gpt-5.5 │ $5 │ $30 │ bundled │ 2026-06-08 │ +└───────────────────┴─────────┴──────────┴─────────┴────────────┘ +``` + +#### Catching a silent miss (wrong model name) + +This is the failure mode to watch for. Suppose you typo'd the key as `claude-sonnet-4.6` (dot) instead of `claude-sonnet-4-6` (dashes). The override **loads fine — no error** — but it doesn't match the model you actually run, so it just sits there as an extra, unused row while the real model keeps the bundled rate: + +``` +┃ Model ┃ In / 1M ┃ Out / 1M ┃ Layer ┃ as_of ┃ +│ claude-sonnet-4-6 │ $3 │ $15 │ bundled │ 2026-06-08 │ ← the model you run: NOT overridden +│ claude-sonnet-4.6 │ $2.5 │ $11 │ project │ — │ ← your typo: a dead, unused entry +``` + +The tell-tale: **the model you meant to override shows `bundled`, not `project`.** Fix the key to match the `model` in your `variant.toml` and re-run `pricing show`. + +#### Loud errors (malformed file) + +A broken override file never produces a wrong cost — it fails fast, naming the file and the cause. 
A decimal comma (`2,5` instead of `2.5`) is the classic one: + +``` +ERROR: could not load pricing override /path/to/my-benchmark/pricing.toml + invalid TOML — Expected newline or end of document after a statement (at line 2, column 17) + hint: prices use a decimal point (2.5), not a comma (2,5) +``` + +A missing required field gives: + +``` +ERROR: could not load pricing override /Users/you/.nasde/pricing.toml + a model is missing the required field 'output_per_1m' +``` + +(The path tells you which layer to fix — your project's `pricing.toml` vs the user-wide `~/.nasde/pricing.toml`.) + +#### Self-contained audit + +Every `nasde results-export` also writes a `pricing_used.json` next to the exported trials — the effective rate and source layer for each model priced in that batch: + +```json +{ + "claude-sonnet-4-6": { + "input_per_1m": 2.5, + "output_per_1m": 11.0, + "as_of": "2026-06-24", + "layer": "project" + } +} +``` + +So a report carries its own pricing provenance. The `nasde run` summary prints the same "Pricing used" table for the models in the run. + +:::note[Editing the bundled catalog directly] +You can still edit the bundled `src/nasde_toolkit/pricing.toml` from a source checkout (`uv sync`). After a PyPI install (`uv tool install` / pipx) the bundled file lives inside an isolated environment and is overwritten on upgrade — so prefer a `pricing.toml` override (above), which survives upgrades, or contribute the rate upstream. ::: :::caution[Confirm rates before quoting costs] diff --git a/website/src/content/docs/reference/configuration.md b/website/src/content/docs/reference/configuration.md index c1433f9..4a9c573 100644 --- a/website/src/content/docs/reference/configuration.md +++ b/website/src/content/docs/reference/configuration.md @@ -21,6 +21,7 @@ That's the whole agent-under-test surface. 
The walkthrough — what each knob is my-benchmark/ nasde.toml # Project configuration assessment_dimensions.json # Scoring dimensions (shared across tasks) + pricing.toml # Optional: per-project model price overrides (project > ~/.nasde > bundled) tasks/ feature-a/ task.toml # Task config (Harbor sections + [nasde.source] / [nasde.plugin])