diff --git a/.claude/skills/nasde-benchmark-creator/SKILL.md b/.claude/skills/nasde-benchmark-creator/SKILL.md index 5d8b487..478abe0 100644 --- a/.claude/skills/nasde-benchmark-creator/SKILL.md +++ b/.claude/skills/nasde-benchmark-creator/SKILL.md @@ -336,10 +336,14 @@ variants/gemini-baseline/ > **Codex/Gemini skills are registered natively** (Harbor `config.agent.skills`), > not via `sandbox_files` — these CLIs auto-discover skills only from a -> HOME-scoped dir, never from a `/app` cwd dir. Each `SKILL.md` **must start with -> a `---` YAML frontmatter line**: Codex's loader rejects a file that opens with -> anything else (`missing YAML frontmatter delimited by ---`) and silently skips -> the skill. Put any provenance comment *below* the closing `---`. See +> HOME-scoped dir, never from a `/app` cwd dir. This applies to **all** ways a +> skill is supplied to a Codex/Gemini variant: the `agents_skills/` / +> `gemini_skills/` snapshot above, a `[[skill]]` by-reference entry in +> `variant.toml`, and a `[nasde.plugin]`'s own `skills/`. Each `SKILL.md` +> **must start with a `---` YAML frontmatter line**: Codex's loader rejects a +> file that opens with anything else (`missing YAML frontmatter delimited by +> ---`) and silently skips the skill. Put any provenance comment *below* the +> closing `---`. See > [ADR-012](../../../docs/adr/012-native-codex-gemini-skill-injection.md). If no `harbor_config.json` exists, `nasde` auto-generates one from `variant.toml`. To customize (e.g., add MCP servers), create it explicitly: diff --git a/CHANGELOG.md b/CHANGELOG.md index 5b37b97..8207ad9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -104,6 +104,23 @@ See [docs/RELEASING.md](docs/RELEASING.md) for the release procedure. max_turns` in `nasde.toml`. ([#54]) ### Fixed +- **`[[skill]]` by-reference and `[nasde.plugin]` skills now register natively for Codex/Gemini ([ADR-012](docs/adr/012-native-codex-gemini-skill-injection.md)).** + PR #65 fixed the *snapshot* path (`agents_skills/`, `gemini_skills/`) but left + the two other ways a skill reaches an agent — `[[skill]]` by-reference in + `variant.toml` and a `[nasde.plugin]`'s own `skills/` — still hardcoding + `/app/.claude/skills/` via `stage_skill_dir`. Correct for Claude (cwd discovery + root), but for a **Codex/Gemini** variant the skill landed where the CLI never + scans, so it was never natively registered (the same silent "skill-as-document" + failure). Both paths now resolve to skill *directories* + (`collect_referenced_skill_dirs` / `collect_plugin_skill_dirs`) routed through + Harbor's native `config.agent.skills` for Codex/Gemini, unioned with the + snapshot dirs under one `_nasde_derived_skills` marker (stale-drop + + hand-authored preservation work across both sources); Claude keeps its + unchanged `sandbox_files` path. The frontmatter warning now also fires on this + path, and a new warning flags two derived skill dirs sharing a basename (Harbor + registers only the last). New example variants `codex-nasde-dev-with-arch` and + `gemini-nasde-dev-with-arch` exercise `[[skill]]` by-reference on a non-Claude + agent. ([#67]) - **Codex/Gemini skills are now natively registered ([ADR-012](docs/adr/012-native-codex-gemini-skill-injection.md)).** Codex and Gemini auto-discover skills only from a HOME-scoped directory (`$HOME/.agents/skills`, `~/.gemini/skills`), never from a `.agents/skills` / @@ -522,4 +539,5 @@ Initial release under the **nasde-toolkit** name (rebrand from [#59]: https://github.com/NoesisVision/nasde-toolkit/pull/59 [#61]: https://github.com/NoesisVision/nasde-toolkit/pull/61 [#65]: https://github.com/NoesisVision/nasde-toolkit/pull/65 +[#67]: https://github.com/NoesisVision/nasde-toolkit/pull/67 [gh-litellm-2026-04]: https://github.com/BerriAI/litellm/security/advisories/GHSA-xqmj-j6mv-4862 diff --git a/CLAUDE.md b/CLAUDE.md index 26f29f9..adfdbab 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -76,15 +76,15 @@ See [ARCHITECTURE.md](ARCHITECTURE.md) for the full system architecture with dia - **Configuration**: Two-layer config — `nasde.toml` for project-level settings, `task.toml` per task (single file, shared with Harbor; nasde-specific fields live under `[nasde.*]`). Both parsed into `@dataclass` models in `config.py`. Task discovery walks `tasks/` (or `.nasde/tasks/`) automatically. - **Benchmark runner**: Uses Harbor Python API (`Job`, `JobConfig`) directly instead of subprocess. The runner merges variant config with task registry into a dict, passes it to `JobConfig.model_validate()`, then runs `await job.run()`. Opik tracking via `track_harbor()` (monkey-patches Harbor at runtime). - **Evaluator**: Subprocess-based assessment evaluation with pluggable backends. The evaluator spawns a CLI agent (`claude -p` or `codex exec`) as a subprocess to read trial artifacts and score them against assessment criteria. Backend selection via `[evaluation] backend` in `nasde.toml` (`"claude"` default, `"codex"` available). Claude backend uses `--output-format json` (without `--bare` — preserves OAuth/keychain auth for subscription billing and skills auto-discovery). Codex backend uses `--json --quiet --full-auto` for JSONL event streaming. Both backends respect the user's existing CLI authentication (OAuth subscription or API key) — no Agent SDK required, which avoids Anthropic's SDK-billing restrictions on subscription accounts. Configurable via `[evaluation]` in `nasde.toml` — backend, model, tools, MCP servers, skills, system prompt, and trajectory inclusion can all be customized. When `include_trajectory = true`, the evaluator also has access to the agent's ATIF trajectory (`agent/trajectory.json`). Default Claude model is `claude-opus-4-7`. Skills dir is mounted via temp-dir + `cwd` with `--add-dir ` so Claude's auto-discovery picks them up. **Repeated evaluations (output contract):** the judge is non-deterministic, so each trial is evaluated `eval_repetitions` times (default 3, `[evaluation] eval_repetitions` in `nasde.toml`, override `--eval-repetitions` on `run`/`eval`). The N evaluations of one trial run concurrently via `asyncio.gather` under the shared `--max-concurrent-eval` semaphore; the gather barrier ensures aggregation runs only after all N finish. Each is written **append-only** as `assessment_eval_.json` (never overwritten — the bug being fixed was a single `assessment_eval.json` clobbered on every re-eval). A derived `assessment_summary.json` holds per-cluster mean/std(sample,n-1)/min/max aggregates: averages are computed **only within one `(evaluator_model, dimensions_fingerprint)` cluster**. A different judge model OR a changed `assessment_dimensions.json` rubric (added/removed dimension, changed `max_score`, OR changed `description`) is a different benchmark and is never averaged together — the fingerprint is the first 12 hex chars of the sha256 of the normalized (`sort_keys`) rubric JSON, stamped onto each `assessment_eval_.json` at eval time (legacy files with no fingerprint → `""` cluster). Clustering by fingerprint also makes the per-group `members[0]` dimension/max_score read correct by construction. The largest cluster (tie → newest) is `dominant`. Opik upload uses the **dominant cluster's mean** (plus per-dimension std and `eval_n`), not the last evaluation. Legacy jobs are normalized by `eval_migration.py`, exposed as the **hidden** one-shot command `nasde migrate-evals` (`@app.command(hidden=True)` — invokable, but kept out of `nasde --help` and user docs because it is a migration tool, not a routine command; its tests stay to guard the logic). It takes a `jobs/` root (recurses) or a single trial dir, not a single job dir. -- **Variant system**: Each variant is a directory under `variants/` with a required `variant.toml` declaring the agent type (`agent = "claude"`, `agent = "codex"`, or `agent = "gemini"`). For Claude Code variants, `CLAUDE.md` is injected into `/app/CLAUDE.md`; for Codex variants, `AGENTS.md` is injected into `/app/AGENTS.md`; for Gemini CLI variants, `GEMINI.md` is injected into `/app/GEMINI.md`. An optional `skills/` subdirectory contains Claude skill snapshots — each `skills//` is injected **whole** (incl. `references/` and sibling files, via the shared `plugin_registration.stage_skill_dir`) into `/app/.claude/skills//`. A `variant.toml` may also declare a `[[skill]]` array (skill-by-reference, ADR-009): each entry points at a source skill dir (optional `ref`) staged whole into `/app/.claude/skills//` without a copy under `variants/`. An optional `agents_skills/` subdirectory contains Codex skill snapshots and an optional `gemini_skills/` subdirectory contains Gemini skill snapshots — each `/` (a dir with a `SKILL.md`) is registered through **Harbor's native skill-injection** (`config.agent.skills`), NOT via `sandbox_files`. See the "Codex/Gemini native skill injection" decision below for why: Codex and Gemini auto-discover skills only from a HOME-scoped dir (`$HOME/.agents/skills`, `~/.gemini/skills`), never from a cwd `/app/.agents/skills` or `/app/.gemini/skills` dir. If no `harbor_config.json` exists, one is auto-generated from `variant.toml`. -- **Codex/Gemini native skill injection (ADR-012)**: Codex and Gemini auto-discover skills **only from a HOME-scoped directory** — `$HOME/.agents/skills` for Codex, `~/.gemini/skills` for Gemini — never from a `.agents/skills` / `.gemini/skills` dir sitting in the agent's `/app` cwd. The old path uploaded skill files into `/app/.agents/skills/` and `/app/.gemini/skills/` via `sandbox_files`, where the CLI never scans, so the skill was **never registered as a native slash-command** — the agent only stumbled on the file by reading it by hand. The fix routes `agents_skills//` and `gemini_skills//` through **Harbor's native `config.agent.skills`** list (`runner._collect_native_skill_dirs`): at trial time Harbor uploads each skill dir into the container's `skills_dir`, passes `skills_dir` to the agent ctor, and the agent's `_build_register_skills_command` copies them into the correct HOME-scoped location *inside* `run()` — the only point with the right timing (an earlier `setup()`/`sandbox_files` upload is in the wrong session relative to the CLI's skill scan, empirically verified). The `skills` list on the harbor_config agent is regenerated each run (tracked via `_nasde_derived_skills`), so a removed skill dir drops out and a hand-authored `skills` entry is preserved — mirroring the `sandbox_files` / `_nasde_derived_keys` contract. **Claude is intentionally NOT routed here**: its `variants//skills/` → `sandbox_files` → `/app/.claude/skills/` (+ `~/.claude/skills/`) path already lands where Claude Code auto-discovers from cwd, is tested, and also carries `[nasde.plugin]` / `[[skill]]` entries. See ADR-012. +- **Variant system**: Each variant is a directory under `variants/` with a required `variant.toml` declaring the agent type (`agent = "claude"`, `agent = "codex"`, or `agent = "gemini"`). For Claude Code variants, `CLAUDE.md` is injected into `/app/CLAUDE.md`; for Codex variants, `AGENTS.md` is injected into `/app/AGENTS.md`; for Gemini CLI variants, `GEMINI.md` is injected into `/app/GEMINI.md`. An optional `skills/` subdirectory contains Claude skill snapshots — each `skills//` is injected **whole** (incl. `references/` and sibling files, via the shared `plugin_registration.stage_skill_dir`) into `/app/.claude/skills//`. A `variant.toml` may also declare a `[[skill]]` array (skill-by-reference, ADR-009): each entry points at a source skill dir (optional `ref`) without a copy under `variants/` — for Claude staged whole into `/app/.claude/skills//`, for Codex/Gemini routed through Harbor's native `config.agent.skills` (ADR-012, see the native-injection note below). An optional `agents_skills/` subdirectory contains Codex skill snapshots and an optional `gemini_skills/` subdirectory contains Gemini skill snapshots — each `/` (a dir with a `SKILL.md`) is registered through **Harbor's native skill-injection** (`config.agent.skills`), NOT via `sandbox_files`. See the "Codex/Gemini native skill injection" decision below for why: Codex and Gemini auto-discover skills only from a HOME-scoped dir (`$HOME/.agents/skills`, `~/.gemini/skills`), never from a cwd `/app/.agents/skills` or `/app/.gemini/skills` dir. If no `harbor_config.json` exists, one is auto-generated from `variant.toml`. +- **Codex/Gemini native skill injection (ADR-012)**: Codex and Gemini auto-discover skills **only from a HOME-scoped directory** — `$HOME/.agents/skills` for Codex, `~/.gemini/skills` for Gemini — never from a `.agents/skills` / `.gemini/skills` dir sitting in the agent's `/app` cwd. The old path uploaded skill files into `/app/.agents/skills/` and `/app/.gemini/skills/` via `sandbox_files`, where the CLI never scans, so the skill was **never registered as a native slash-command** — the agent only stumbled on the file by reading it by hand. The fix routes `agents_skills//` and `gemini_skills//` through **Harbor's native `config.agent.skills`** list (`runner._collect_native_skill_dirs`): at trial time Harbor uploads each skill dir into the container's `skills_dir`, passes `skills_dir` to the agent ctor, and the agent's `_build_register_skills_command` copies them into the correct HOME-scoped location *inside* `run()` — the only point with the right timing (an earlier `setup()`/`sandbox_files` upload is in the wrong session relative to the CLI's skill scan, empirically verified). The `skills` list on the harbor_config agent is regenerated each run (tracked via `_nasde_derived_skills`), so a removed skill dir drops out and a hand-authored `skills` entry is preserved — mirroring the `sandbox_files` / `_nasde_derived_keys` contract. This covers **all three** skill paths for Codex/Gemini: the variant snapshot (`agents_skills/`, `gemini_skills/`) plus `[[skill]]` by-reference and `[nasde.plugin]` skills — the latter two are resolved to skill dirs by `plugin_registration.collect_referenced_skill_dirs` / `collect_plugin_skill_dirs` and **unioned** into the same `skills` list (one `_nasde_derived_skills` marker tracks both; nasde warns when two derived dirs share a basename since Harbor's `resolve_skills` is last-wins). **Claude is intentionally NOT routed here**: its `variants//skills/` → `sandbox_files` → `/app/.claude/skills/` (+ `~/.claude/skills/`) path already lands where Claude Code auto-discovers from cwd, is tested, and is the carrier for its own `[nasde.plugin]` / `[[skill]]` skills (via `stage_skill_dir`); a Claude agent drops `extra_skill_dirs` at merge time to avoid double-injecting. See ADR-012. - **Codex ChatGPT OAuth**: Harbor's `Codex` agent uploads `~/.codex/auth.json` into the sandbox as `$CODEX_HOME/auth.json` — but **harbor 0.13 made this opt-in**: the agent now *defaults to `OPENAI_API_KEY`* and only uploads the OAuth `auth.json` when `CODEX_AUTH_JSON_PATH` (a specific file) or `CODEX_FORCE_AUTH_JSON` (truthy → `~/.codex/auth.json`) is set. Without an API key it would otherwise write an **empty** key into the sandbox (`Incorrect API key provided: ''` on `wss://api.openai.com/v1/responses`). So `_ensure_auth` opts into OAuth automatically: when no `OPENAI_API_KEY`/`CODEX_API_KEY` is set but `~/.codex/auth.json` exists, it sets `CODEX_FORCE_AUTH_JSON=true` (respecting a user-set `CODEX_AUTH_JSON_PATH`/`CODEX_FORCE_AUTH_JSON`). API key (`OPENAI_API_KEY`) always takes priority over OAuth. Tokens are not extracted back from sandbox after runs. Validate with `source scripts/export_codex_oauth_token.sh`. The runner bridges `CODEX_API_KEY` → `OPENAI_API_KEY` in `_ensure_auth` for users who prefer the Codex-specific env var name. - **Gemini Google OAuth**: When `~/.gemini/oauth_creds.json` exists (created by `gemini login`) and no API key env var is set (`GEMINI_API_KEY`, `GOOGLE_API_KEY`, `GOOGLE_APPLICATION_CREDENTIALS`), `ConfigurableGemini` injects the OAuth credentials into the sandbox via env var. API key always takes priority over OAuth. Validate with `source scripts/export_gemini_oauth_token.sh`. - **All dependencies are core**: `harbor`, `opik` are in `[project.dependencies]`. The evaluator depends on having `claude` CLI (default) or `codex` CLI installed and authenticated on the host — not bundled. No optional extras — `uv tool install .` gives full functionality. Assessment evaluation is on by default (`--without-eval` to skip). - **Versioning**: Derived from git tags via `hatch-vcs` plugin. Tag `v0.1.0` → version `0.1.0`. Commits after a tag produce dev versions like `0.1.1.dev3+gabcdef`. `_version.py` is auto-generated at build time and gitignored. See ADR-007. - **Bundled authoring skills**: `.claude/skills/nasde-benchmark-*` are shipped inside the wheel via `[tool.hatch.build.targets.wheel.force-include]` → `nasde_toolkit/_bundled_skills/`. `nasde install-skills` discovers them via `importlib.resources` (wheel install) with a fallback to the live `.claude/skills/` when running from an editable / source checkout. `nasde-dev` is deliberately excluded from the bundle (internal dev skill, not for end users). - **Auto-generated Dockerfile**: When a task has no `environment/Dockerfile`, nasde generates one from `source.git` + `[docker]`. For local paths, also generates `docker-compose.yaml` to override the build context. See `docker.py:ensure_task_environment()`. -- **`[nasde.plugin]` (ADR-009)**: Ships a local Claude Code plugin (dir with `.claude-plugin/plugin.json`) into the sandbox with one `task.toml` declaration. `docker.py:ensure_task_plugin()` stages the plugin tree (at `ref` via a temp worktree) into a gitignored `_nasde-plugin/` inside the **active** build context (Harbor pins context to `environment/`; with `[nasde.source]` the context is the source repo/worktree — nasde reads it back from the generated compose and stages there), then appends a sentinel-fenced `COPY`+build stage to the Dockerfile (generated base if none, hand-written preserved). `plugin_registration.py` then registers the plugin's own `skills/` (whole dirs) and injects its MCP server (from `/.mcp.json`, env-wrapped) into the task's `task.toml` — idempotent, fenced, never clobbers an author-declared same-name server. Skill-by-reference (`[[skill]]` in `variant.toml`) feeds the same skill-registration machinery. Derived sandbox files are merged into the variant's `harbor_config.json` each run. MCP injection writes `task.toml` because Harbor reads MCP servers only from there (`trial.py` → `task.config.environment.mcp_servers`). +- **`[nasde.plugin]` (ADR-009)**: Ships a local Claude Code plugin (dir with `.claude-plugin/plugin.json`) into the sandbox with one `task.toml` declaration. `docker.py:ensure_task_plugin()` stages the plugin tree (at `ref` via a temp worktree) into a gitignored `_nasde-plugin/` inside the **active** build context (Harbor pins context to `environment/`; with `[nasde.source]` the context is the source repo/worktree — nasde reads it back from the generated compose and stages there), then appends a sentinel-fenced `COPY`+build stage to the Dockerfile (generated base if none, hand-written preserved). `plugin_registration.py` then registers the plugin's own `skills/` (whole dirs — for Claude into `/app/.claude/skills/`, for Codex/Gemini through Harbor's native `config.agent.skills` via `collect_plugin_skill_dirs`, ADR-012) and injects its MCP server (from `/.mcp.json`, env-wrapped) into the task's `task.toml` — idempotent, fenced, never clobbers an author-declared same-name server. Skill-by-reference (`[[skill]]` in `variant.toml`) feeds the same agent-aware skill-registration machinery. Derived sandbox files are merged into the variant's `harbor_config.json` each run. MCP injection writes `task.toml` because Harbor reads MCP servers only from there (`trial.py` → `task.config.environment.mcp_servers`). - **Pass-through CLI**: `nasde harbor ...` delegates to Harbor's Typer app via `add_typer()`. `nasde opik ...` forwards args to Opik's Click CLI via `ctx.args`. - **Rubric calibration (ADR-010)**: `nasde calibrate publish PATHS...` / `pull-comments` close the loop between the LLM-as-a-Judge and a human reviewer by publishing trial diffs + assessments as PRs/MRs and pulling review comments back for rubric tuning. Two layers, deliberately separated: **GIT** (`git_platform_backends/git_ops.py` — `git push`/`ls-remote`, platform-agnostic, subprocess pattern from `docker.py`, not behind a Protocol) and **PLATFORM** (`git_platform_backends/` behind a `@runtime_checkable GitPlatformBackend` Protocol — `repo_exists`/`find_open_pr_for_branch`/`create_pr`/`fetch_pr_comments`/`validate_cli_installed`/`validate_auth`, mirroring `evaluator_backends/`). The base is keyed on `(repo, commit)` as an **orphan branch** `base/-` seeded once via `git archive HEAD` from the trial workspace (git deduplicates blobs by content across orphan bases — no shared ancestor needed); each trial is a feature branch `calib/-/` = base + the agent's `changes.patch` applied as a real commit + `.calibration/` files (no trajectory — secrets/clutter). `.calibration/` carries the reviewer's context: the task's `instruction.md` + `assessment_criteria.md` + `assessment_dimensions.json` (resolved from `result.json` `task_name`/`source`, trying both `tasks/` and `evals//tasks/` layouts), all `assessment_eval_.json`, `assessment_summary.json`, and `metrics.json`. Idempotency is **open-only**: `find_open_pr_for_branch` matches only OPEN PRs/MRs (`gh pr list --state open`, `glab mr list` default), so a re-run skips a live round but lets a fresh round publish once the prior one is closed. The PR body is a pure transform of the dominant `AssessmentSummary` cluster (`calibration_publisher._render_pr_body`). Backend is **auto-detected from the sink repo URL host** (`github.com`→`gh`, `*gitlab*`→`glab`; `[calibration] platform` overrides for self-hosted) — no `backend` config field, eliminating the backend≠host mismatch. Preflight before any work: detect → `validate_cli_installed` (`shutil.which`, precise per-platform message) → `validate_auth` (`gh|glab auth status` exit code) → `repo_exists` (parses OUTPUT — `gh repo view` exits 0 even for a missing repo). Repo creation is out of scope (push creates branches ad-hoc in an existing repo). Reuses `_expand_to_trials`/`_capture_patch`/`_build_metrics` from `results_exporter.py` and `_aggregate_evaluations`/`_load_raw_evaluations`/`AssessmentSummary` from `evaluator.py`. Prerequisites mirror the evaluator's CLI requirement (ADR-002): `git` + `gh`/`glab` + login, no SDK, CLI keyring holds auth. The `nasde-benchmark-calibration` skill orchestrates the human-in-the-loop flow. - **Results export (EXPERIMENTAL)**: `results_exporter.py` + `nasde results-export PATHS... --to DIR` copy the analytic *essence* of trial artifacts out of the gitignored `jobs/` tree into a flat per-trial layout (`DIR/__/` with `metrics.json`, `assessment_eval_*.json` (all repetitions), `assessment_summary.json`, `trajectory.json`, `changes.patch`, `verifier_stdout.txt`, `reward.txt`). Re-export **merges**: missing eval files are copied and the summary/metrics refreshed, while immutable files (trajectory, patch) are left as-is — so evaluations added after a first export are picked up. A legacy bare `assessment_eval.json` (pre-migration trial) is exported as `assessment_eval_1.json` with a `nasde migrate-evals` hint, so the export is never silently empty. Filesystem-as-interface: `DIR` is any plain path (iCloud/Dropbox/git repo) — no cloud SDK. It scans Harbor artifacts (`result.json`/`config.json`/`assessment_eval*.json`/`agent/trajectory.json`/workspace), **not** the best-effort `EXPERIMENT_LOG.md`. `metrics.json` is a self-contained summary composed from `result.json`+`config.json`+`agent/trajectory.json` — including **token & cost economics** (`token_usage`, `cost_usd`, `pricing_as_of`, `reasoning_effort`); see the token-cost note below and [ADR-011](docs/adr/011-token-cost-metrics.md). The code diff is captured as a patch (`git diff HEAD` + untracked via `git ls-files -z` + `git diff --no-index`, never `git add` — the workspace `.git` index is left untouched; `-z`/NUL parsing means non-ASCII untracked filenames are not dropped under `core.quotepath`). Selection is a mixed positional list of job and/or trial dirs (auto-classified: a dir whose children have `result.json` is a job; else a dir whose own `result.json` carries a `trial_name` key is a trial; a dir with a job-level `result.json` but no trial-shaped children/`trial_name` is skipped with a warning rather than mis-exported as garbage); re-export is idempotent and merge-based (a trial is reported `exported` only when something new was copied, else `skipped`). Reuses `_collect_trial_dirs`/`_load_json`/`_compute_duration_sec`/`_resolve_agent_name` from `evaluator.py`. Deliberately does **not** model "experiments" (one job can belong to many — a future UI layer's concern). @@ -290,7 +290,7 @@ reasoning_effort = "high" # Optional. Passed straight to the agent (no tasks = ["my-benchmark/task-a"] # Optional: task-scope. Restrict this variant to specific tasks. -[[skill]] # Optional (ADR-009): skill-by-reference, Claude only. +[[skill]] # Optional (ADR-009): skill-by-reference. Works for any agent. path = "../../../src/plugins/my-plugin/skills/my-skill" # source skill dir (required) ref = "abc1234" # optional git ref, same semantics as [nasde.source] ``` @@ -311,12 +311,16 @@ the variant's scope the run aborts with a clear error. Either way the scope wins over an explicit `--tasks` filter. Absent or empty → unscoped (runs against every task, the default). -**`[[skill]]` (skill-by-reference)**: each entry stages the **whole** skill dir -(incl. `references/`) into `/app/.claude/skills//` from a source path — -no copy under `variants//skills/`. Optional `ref` reads from a temp git -worktree at that commit. The legacy `variants//skills//` copy path -still works (and now also carries `references/`). Both feed the same -`plugin_registration.stage_skill_dir` machinery as `[nasde.plugin]`. +**`[[skill]]` (skill-by-reference)**: each entry references the **whole** skill +dir (incl. `references/`) from a source path — no copy under +`variants//skills/`. Optional `ref` reads from a temp git worktree at that +commit. **Routing is agent-aware** (ADR-012): for **Claude** the dir is staged +into `/app/.claude/skills//` via `plugin_registration.stage_skill_dir` +(same machinery as `[nasde.plugin]`); for **Codex/Gemini** the dir is resolved +by `collect_referenced_skill_dirs` and routed through Harbor's native +`config.agent.skills` (→ `$HOME/.agents/skills`, `~/.gemini/skills`), because +those CLIs never scan the `/app` cwd. The legacy `variants//skills//` +copy path still works (and carries `references/`). ### harbor_config.json (per variant) diff --git a/docs/adr/012-native-codex-gemini-skill-injection.md b/docs/adr/012-native-codex-gemini-skill-injection.md index f9db717..4232b80 100644 --- a/docs/adr/012-native-codex-gemini-skill-injection.md +++ b/docs/adr/012-native-codex-gemini-skill-injection.md @@ -94,6 +94,36 @@ skills and `[[skill]]` by-reference (ADR-009) via `stage_skill_dir`. Migrating it onto `config.agent.skills` would be a larger, riskier change with no behavioral payoff. Scope is kept to the actual bug. +### All three skill paths, not just the snapshot + +A skill reaches an agent three ways: the variant **snapshot** +(`agents_skills/`, `gemini_skills/`), **`[[skill]]` by-reference** in +`variant.toml`, and a **`[nasde.plugin]`**'s own `skills/`. The first revision +of this decision fixed only the snapshot path; the latter two still went +through `stage_skill_dir`'s hardcoded `/app/.claude/skills/` and so were broken +for Codex/Gemini in exactly the same way. + +All three are now routed natively for Codex/Gemini. `[[skill]]`/plugin skills +are resolved to skill *directories* by `collect_referenced_skill_dirs` / +`collect_plugin_skill_dirs` (the dir-list counterparts of the unchanged +Claude-only `stage_referenced_skills` / `register_plugin_skills`), threaded as +`extra_skill_dirs` and **unioned** with the snapshot dirs inside +`_refresh_agent_skills`. The union is tracked by the single +`_nasde_derived_skills` marker, so stale-drop and hand-authored preservation +work across both sources. For a Claude agent `extra_skill_dirs` is dropped at +merge time (those skills already ride `sandbox_files`, so feeding the native +list too would double-inject). Because Harbor's `resolve_skills` keys skills by +basename (last-wins), nasde warns when two derived dirs share a basename so the +loser does not silently vanish. + +**Coverage note (honest).** The `[[skill]]` path is verified end-to-end on a +live agent (see Consequences). The `[nasde.plugin]` path shares the same channel +from `extra_skill_dirs` onward, but there is no example benchmark with a +`[nasde.plugin]` to run it e2e — so it is covered by an **integration test** that +drives the real `docker.ensure_task_plugin` staging → `collect_plugin_skill_dirs` +→ harbor_config (no Docker build), plus unit tests on the resolver. Full +live-agent e2e for a plugin is a follow-up for when a plugin benchmark exists. + ### Frontmatter guardrail Codex's loader is strict: a `SKILL.md` that does not **start** with a `---` @@ -123,8 +153,11 @@ comments moved below the closing `---`. its native mechanism, not by reading the file. - Skill×model matrix results for Codex/Gemini collected under the old behavior are invalid (skill-as-document, not skill-as-native) and must be re-run. -- `sandbox_files` no longer carries skill files for Codex/Gemini; it carries - only `AGENTS.md` / `GEMINI.md` (and, for Claude, the unchanged skill files). +- For Codex/Gemini, skills (snapshot, `[[skill]]`, and plugin) are all carried + by `config.agent.skills`; the agent's effective `sandbox_files` is just + `AGENTS.md` / `GEMINI.md`. The Claude `sandbox_files` skill map is still built + unconditionally (so Claude stays byte-identical and tested) but a Codex/Gemini + agent never consumes it. - The brittle `_expand_skill_targets` workaround in `ConfigurableGemini` (which hardcoded `/root/.gemini/skills/`) is removed — Harbor's native command uses the shell-expanded `$HOME`, correct for non-root users too. @@ -132,9 +165,13 @@ comments moved below the closing `---`. ## References - `runner._collect_native_skill_dirs`, `runner._refresh_agent_skills`, - `runner._agent_type_from_import_path` + `runner._agent_type_from_import_path`, `runner._referenced_skill_dirs_for_agent`, + `runner._warn_skill_basename_collisions` +- `plugin_registration.collect_referenced_skill_dirs`, + `plugin_registration.collect_plugin_skill_dirs` (dir-list channel), + `stage_referenced_skills` / `register_plugin_skills` (unchanged Claude sink) - Harbor `agents/installed/codex.py::_build_register_skills_command`, `gemini_cli.py::_build_register_skills_command`, `trial.py` skill upload - `openai/codex` `core-skills/src/loader.rs` (native discovery roots) -- [ADR-009](009-plugin-and-skill-by-reference.md) (the Claude skill path this - decision deliberately leaves untouched) +- [ADR-009](009-plugin-and-skill-by-reference.md) (the `[[skill]]` / plugin + mechanism whose Claude sink this decision leaves untouched) diff --git a/examples/nasde-dev-skill/variants/codex-nasde-dev-with-arch/AGENTS.md b/examples/nasde-dev-skill/variants/codex-nasde-dev-with-arch/AGENTS.md new file mode 100644 index 0000000..dcbde60 --- /dev/null +++ b/examples/nasde-dev-skill/variants/codex-nasde-dev-with-arch/AGENTS.md @@ -0,0 +1,9 @@ +# Agent Instructions + +## Approach + +1. Begin by consulting the **nasde-dev** skill and apply its guidance throughout this task. +2. Before writing any code, explore the existing codebase to understand its architecture and conventions. +3. Follow existing architectural patterns exactly — match naming, namespaces, directory structure, and code style. +4. When modifying nasde-toolkit functionality (CLI, runner, evaluator, config, agents), follow the `nasde-dev` skill's verification protocol, including documentation sync requirements. +5. After making changes, verify documentation is consistent (CLAUDE.md, README.md, ARCHITECTURE.md). diff --git a/examples/nasde-dev-skill/variants/codex-nasde-dev-with-arch/variant.toml b/examples/nasde-dev-skill/variants/codex-nasde-dev-with-arch/variant.toml new file mode 100644 index 0000000..31a8c55 --- /dev/null +++ b/examples/nasde-dev-skill/variants/codex-nasde-dev-with-arch/variant.toml @@ -0,0 +1,9 @@ +agent = "codex" +model = "gpt-5.4" + +# nasde-dev skill by-reference (ADR-009) from the live source. For a Codex +# variant this is registered through Harbor's native config.agent.skills so +# the skill lands in $HOME/.agents/skills and Codex auto-discovers it — NOT in +# the /app cwd (which Codex never scans). See ADR-012. +[[skill]] +path = "../../../../.claude/skills/nasde-dev" diff --git a/examples/nasde-dev-skill/variants/gemini-nasde-dev-with-arch/GEMINI.md b/examples/nasde-dev-skill/variants/gemini-nasde-dev-with-arch/GEMINI.md new file mode 100644 index 0000000..dcbde60 --- /dev/null +++ b/examples/nasde-dev-skill/variants/gemini-nasde-dev-with-arch/GEMINI.md @@ -0,0 +1,9 @@ +# Agent Instructions + +## Approach + +1. Begin by consulting the **nasde-dev** skill and apply its guidance throughout this task. +2. Before writing any code, explore the existing codebase to understand its architecture and conventions. +3. Follow existing architectural patterns exactly — match naming, namespaces, directory structure, and code style. +4. When modifying nasde-toolkit functionality (CLI, runner, evaluator, config, agents), follow the `nasde-dev` skill's verification protocol, including documentation sync requirements. +5. After making changes, verify documentation is consistent (CLAUDE.md, README.md, ARCHITECTURE.md). diff --git a/examples/nasde-dev-skill/variants/gemini-nasde-dev-with-arch/variant.toml b/examples/nasde-dev-skill/variants/gemini-nasde-dev-with-arch/variant.toml new file mode 100644 index 0000000..df000e7 --- /dev/null +++ b/examples/nasde-dev-skill/variants/gemini-nasde-dev-with-arch/variant.toml @@ -0,0 +1,9 @@ +agent = "gemini" +model = "google/gemini-3-flash-preview" + +# nasde-dev skill by-reference (ADR-009) from the live source. For a Gemini +# variant this is registered through Harbor's native config.agent.skills so +# the skill lands in ~/.gemini/skills and Gemini auto-discovers it — NOT in +# the /app cwd (which Gemini never scans). See ADR-012. +[[skill]] +path = "../../../../.claude/skills/nasde-dev" diff --git a/src/nasde_toolkit/plugin_registration.py b/src/nasde_toolkit/plugin_registration.py index 82f5bbb..83d841c 100644 --- a/src/nasde_toolkit/plugin_registration.py +++ b/src/nasde_toolkit/plugin_registration.py @@ -11,11 +11,17 @@ 2. The ``[[skill]]`` array in variant.toml references a skill from a source path instead of copying it into ``variants//skills/``. -Both feed :func:`stage_skill_dir`, which expands a skill directory into the -flat ``{container_path: host_file}`` mapping ConfigurableClaude uploads -(it only uploads regular files, mirroring ``_collect_codex_skills``). The -plugin path additionally calls :func:`inject_mcp_server` to register the -plugin's MCP server in the task's Harbor config. +For **Claude** both feed :func:`stage_skill_dir`, which expands a skill +directory into the flat ``{container_path: host_file}`` mapping +ConfigurableClaude uploads to ``/app/.claude/skills`` (Claude's cwd discovery +root). The plugin path additionally calls :func:`inject_mcp_server` to register +the plugin's MCP server in the task's Harbor config. + +For **Codex/Gemini** the same skill *directories* are resolved by +:func:`collect_referenced_skill_dirs` / :func:`collect_plugin_skill_dirs` and +fed to Harbor's native ``config.agent.skills`` instead — those CLIs scan only a +HOME-scoped dir (``$HOME/.agents/skills``, ``~/.gemini/skills``), never the +``/app`` cwd the sandbox map targets. See ADR-012. """ from __future__ import annotations @@ -119,6 +125,54 @@ def stage_referenced_skills( console.print(f" [dim]Registered referenced skill '{resolved.name}'[/dim]") +def collect_referenced_skill_dirs( + variant_dir: Path, + worktree_factory: WorktreeFactory, +) -> list[Path]: + """Resolve the ``[[skill]]`` array to a list of whole skill directories. + + Same resolution as :func:`stage_referenced_skills` (each entry's ``path`` + relative to the variant dir, optional ``ref`` read from a temporary git + worktree) but returns the resolved *directories* instead of flattening them + into a ``sandbox_files`` map under ``/app/.claude/skills``. Codex and Gemini + auto-discover skills only from a HOME-scoped dir, never from the agent's + ``/app`` cwd, so they take this dir list through Harbor's native + ``config.agent.skills`` mechanism. Claude keeps using + :func:`stage_referenced_skills`. See ADR-012. + + Paths are fully ``resolve()``d so they compare equal to the snapshot dirs + from ``runner._collect_native_skill_dirs`` — the union/dedup, stale-drop and + basename-collision logic in ``_refresh_agent_skills`` key on string equality, + which would break on a symlinked ``/tmp`` (macOS) or a ``ref`` worktree path. + """ + dirs: list[Path] = [] + for entry in _read_skill_entries(variant_dir): + resolved = _resolve_skill_source(variant_dir, entry["path"], entry.get("ref", ""), worktree_factory) + if not (resolved / "SKILL.md").exists(): + raise RuntimeError(f"[[skill]] '{resolved}' has no SKILL.md") + dirs.append(resolved.resolve()) + return dirs + + +def collect_plugin_skill_dirs(staged: StagedPlugin) -> list[Path]: + """Resolve a shipped plugin's own ``skills/`` to a list of skill dirs. + + The dir-list counterpart of :func:`register_plugin_skills`, for the same + reason as :func:`collect_referenced_skill_dirs`: Codex/Gemini take these + dirs through Harbor's native ``config.agent.skills`` rather than a + cwd ``sandbox_files`` mapping. Paths are ``resolve()``d for the same + string-equality reason as :func:`collect_referenced_skill_dirs`. + """ + skills_root = staged.staged_dir / "skills" + if not skills_root.is_dir(): + return [] + return [ + skill_dir.resolve() + for skill_dir in sorted(skills_root.iterdir()) + if skill_dir.is_dir() and (skill_dir / "SKILL.md").exists() + ] + + def inject_mcp_server(task_dir: Path, staged: StagedPlugin) -> None: """Write the plugin's MCP servers into the task's task.toml. diff --git a/src/nasde_toolkit/runner.py b/src/nasde_toolkit/runner.py index c43610d..a49ad2c 100644 --- a/src/nasde_toolkit/runner.py +++ b/src/nasde_toolkit/runner.py @@ -28,6 +28,8 @@ ensure_task_plugin, ) from nasde_toolkit.plugin_registration import ( + collect_plugin_skill_dirs, + collect_referenced_skill_dirs, inject_mcp_server, register_plugin_skills, stage_referenced_skills, @@ -74,9 +76,9 @@ async def run_benchmark( variant_dir = resolve_variant_dir(config.project_dir, variant) - extra_sandbox_files = _prepare_task_environments(config, variant_dir) + extra_sandbox_files, extra_skill_dirs = _prepare_task_environments(config, variant_dir) - harbor_config_path = _ensure_harbor_config(variant_dir, variant, extra_sandbox_files) + harbor_config_path = _ensure_harbor_config(variant_dir, variant, extra_sandbox_files, extra_skill_dirs) _ensure_auth(_read_agent_import_path(harbor_config_path)) @@ -121,15 +123,30 @@ async def run_benchmark( # --------------------------------------------------------------------------- -def _prepare_task_environments(config: ProjectConfig, variant_dir: Path) -> dict[str, str]: +def _prepare_task_environments(config: ProjectConfig, variant_dir: Path) -> tuple[dict[str, str], list[str]]: """Generate per-task Docker environments and collect agent skill files. Per task: auto-generate the source Dockerfile/compose ([nasde.source]), then stage the [nasde.plugin] tree into the build context, register the plugin's own skills, and inject its MCP server into the task's task.toml. - Then add the variant's referenced [[skill]] dirs. Returns the extra - ``{container_path: host_file}`` entries to merge into the variant's - harbor_config.json sandbox_files. + Then add the variant's referenced [[skill]] dirs. + + Returns ``(extra_sandbox_files, extra_skill_dirs)``: + + - ``extra_sandbox_files`` — the ``{container_path: host_file}`` map for the + **Claude** carrier (``/app/.claude/skills/...``), built unconditionally so + Claude stays byte-identical and tested. + - ``extra_skill_dirs`` — the same plugin / ``[[skill]]`` skill directories + as a host-path list, fed to **Codex/Gemini** through Harbor's native + ``config.agent.skills`` (those CLIs scan only a HOME-scoped dir, never the + ``/app`` cwd the sandbox map targets). The claude-vs-native choice is made + per agent at merge time in ``_refresh_agent_skills``. See ADR-012. + + The ``[[skill]]`` array is resolved twice (once per channel — the Claude + ``stage_referenced_skills`` and the native ``collect_referenced_skill_dirs``). + A ``ref`` worktree is created by each call; all worktrees live in + ``_active_worktrees`` until ``cleanup_worktrees()`` runs in the ``finally`` + after the job, well after Harbor has uploaded the skill dirs. Plugin skills go to a single variant-wide ``/app/.claude/skills/`` path, so heterogeneous ``[nasde.plugin]`` across tasks in one project would @@ -140,6 +157,7 @@ def _prepare_task_environments(config: ProjectConfig, variant_dir: Path) -> dict _require_homogeneous_plugins(config) extra_sandbox_files: dict[str, str] = {} + extra_skill_dirs: list[str] = [] for task in config.tasks: if task.source is not None: @@ -152,10 +170,12 @@ def _prepare_task_environments(config: ProjectConfig, variant_dir: Path) -> dict has_source=task.source is not None, ) register_plugin_skills(staged, extra_sandbox_files) + extra_skill_dirs.extend(str(d) for d in collect_plugin_skill_dirs(staged)) inject_mcp_server(task.path, staged) stage_referenced_skills(variant_dir, extra_sandbox_files, create_ref_worktree) - return extra_sandbox_files + extra_skill_dirs.extend(str(d) for d in collect_referenced_skill_dirs(variant_dir, create_ref_worktree)) + return extra_sandbox_files, extra_skill_dirs def _require_homogeneous_plugins(config: ProjectConfig) -> None: @@ -535,6 +555,7 @@ def _ensure_harbor_config( variant_dir: Path, variant: str, extra_sandbox_files: dict[str, str], + extra_skill_dirs: list[str] | None = None, ) -> Path: """Generate harbor_config.json if absent, then refresh sandbox_files. @@ -544,12 +565,16 @@ def _ensure_harbor_config( ADR-009). This is unconditional so removed/renamed ``[[skill]]`` or ``[nasde.plugin]`` entries between runs do not linger as stale mappings pointing at now-deleted worktree paths. + + ``extra_skill_dirs`` carries the same plugin / ``[[skill]]`` directories as + a host-path list for the Codex/Gemini native channel (``config.agent.skills``); + Claude ignores it (its copies ride ``extra_sandbox_files``). See ADR-012. """ harbor_config_path = variant_dir / "harbor_config.json" if not harbor_config_path.exists(): _generate_harbor_config(variant_dir, variant) - _refresh_sandbox_files(harbor_config_path, variant_dir, extra_sandbox_files) + _refresh_sandbox_files(harbor_config_path, variant_dir, extra_sandbox_files, extra_skill_dirs or []) return harbor_config_path @@ -583,6 +608,7 @@ def _refresh_sandbox_files( harbor_config_path: Path, variant_dir: Path, extra: dict[str, str], + extra_skill_dirs: list[str] | None = None, ) -> None: """Rebuild each agent's sandbox_files, preserving hand-written entries. @@ -609,7 +635,7 @@ def _refresh_sandbox_files( config = json.loads(harbor_config_path.read_text()) agent_type = load_variant_agent_type(variant_dir) for agent in config.get("agents", []): - _refresh_agent_sandbox_files(agent, variant_dir, agent_type, extra) + _refresh_agent_sandbox_files(agent, variant_dir, agent_type, extra, extra_skill_dirs or []) harbor_config_path.write_text(json.dumps(config, indent=2)) @@ -618,6 +644,7 @@ def _refresh_agent_sandbox_files( variant_dir: Path, agent_type: str, extra: dict[str, str], + extra_skill_dirs: list[str], ) -> None: kwargs = agent.setdefault("kwargs", {}) current = kwargs.get("sandbox_files", {}) or {} @@ -635,27 +662,46 @@ def _refresh_agent_sandbox_files( kwargs["sandbox_files"] = {**extra, **authored, **handwritten} agent["_nasde_derived_keys"] = sorted(extra.keys()) - _refresh_agent_skills(agent, variant_dir, agent_type) + _refresh_agent_skills(agent, variant_dir, agent_type, extra_skill_dirs) -def _refresh_agent_skills(agent: dict, variant_dir: Path, agent_type: str) -> None: +def _refresh_agent_skills( + agent: dict, + variant_dir: Path, + agent_type: str, + extra_skill_dirs: list[str] | None = None, +) -> None: """Rebuild the agent's ``skills`` list, preserving hand-authored entries. - nasde owns the entries derived from the variant's ``agents_skills/`` / - ``gemini_skills/`` subdirs (tracked via ``_nasde_derived_skills``). A - removed skill dir disappears from the list on the next run, exactly like a - removed ``[[skill]]`` drops from ``sandbox_files``. Any other entry was put - in ``harbor_config.json`` by hand and is kept untouched, so an author who - wires Harbor-native ``skills`` themselves is never clobbered. + nasde owns the entries derived from **two** sources, both routed through + Harbor's native ``config.agent.skills`` for Codex/Gemini: + + 1. the variant's ``agents_skills/`` / ``gemini_skills/`` snapshot subdirs + (``_collect_native_skill_dirs``), and + 2. ``[[skill]]`` by-reference + ``[nasde.plugin]`` skills resolved this run + (``extra_skill_dirs``), which for Claude instead ride ``sandbox_files``. + + The union is tracked via a single ``_nasde_derived_skills`` marker, so a + removed skill from *either* source drops on the next run while a + hand-authored ``skills`` entry is preserved. Any entry that is neither + derived this run nor was nasde-derived last run was put in + ``harbor_config.json`` by hand and is kept untouched. The skill subdir is keyed on *this agent's own* type, read back from its ``import_path`` (falling back to the variant's declared type). A variant is one agent type by contract, but a hand-written multi-agent config can mix - types — keying per agent keeps each one reading the right subdir instead of - the variant's first declared type. + types — keying per agent keeps each one reading the right subdir. For a + Claude agent ``extra_skill_dirs`` is intentionally dropped: those skills are + already carried in ``sandbox_files`` to ``/app/.claude/skills`` (and would + double-inject otherwise). See ADR-012. """ effective_type = _agent_type_from_import_path(agent.get("import_path")) or agent_type - derived = _collect_native_skill_dirs(variant_dir, effective_type) + snapshot = _collect_native_skill_dirs(variant_dir, effective_type) + referenced = _referenced_skill_dirs_for_agent(effective_type, extra_skill_dirs or []) + derived = list(dict.fromkeys(snapshot + referenced)) + + _warn_skill_basename_collisions(str(agent.get("name", "")), derived) + prev_derived = set(agent.get("_nasde_derived_skills", []) or []) current = agent.get("skills", []) or [] handwritten = [s for s in current if s not in prev_derived and s not in derived] @@ -663,6 +709,43 @@ def _refresh_agent_skills(agent: dict, variant_dir: Path, agent_type: str) -> No agent["_nasde_derived_skills"] = sorted(derived) +def _referenced_skill_dirs_for_agent(effective_type: str, extra_skill_dirs: list[str]) -> list[str]: + """``[[skill]]``/plugin dirs for the native channel — empty for Claude. + + Claude's referenced/plugin skills already ride ``sandbox_files`` to its + cwd ``/app/.claude/skills`` discovery root, so feeding them here too would + double-inject. Codex/Gemini get the dir list and warn on bad frontmatter. + """ + if effective_type == "claude": + return [] + for skill_dir in extra_skill_dirs: + skill_md = Path(skill_dir) / "SKILL.md" + if skill_md.exists(): + _warn_if_skill_frontmatter_malformed(skill_md, effective_type) + return extra_skill_dirs + + +def _warn_skill_basename_collisions(agent_name: str, skill_dirs: list[str]) -> None: + """Warn when two derived skill dirs share a basename (Harbor last-wins). + + Harbor's ``resolve_skills`` keys skills by ``dir.name`` and the last one + wins, so a snapshot ``agents_skills/foo`` and a referenced ``.../foo`` with + the same basename silently collapse to one in the container. Surface it so + the loser doesn't vanish unnoticed. + """ + by_name: dict[str, list[str]] = {} + for skill_dir in skill_dirs: + by_name.setdefault(Path(skill_dir).name, []).append(skill_dir) + for name, dirs in sorted(by_name.items()): + if len(dirs) > 1: + joined = "\n ".join(dirs) + console.print( + f"[yellow]WARNING ({agent_name}): {len(dirs)} skill dirs share the basename " + f"'{name}' — Harbor registers only the last (the others vanish):\n {joined}\n" + " → rename one skill dir so each has a unique name.[/yellow]" + ) + + def _agent_type_from_import_path(import_path: str | None) -> str | None: """Map an agent import_path back to its ``agent`` type, or None if unknown.""" if _is_codex_agent(import_path): diff --git a/tests/test_plugin_registration.py b/tests/test_plugin_registration.py index 3acf4d6..c8646fe 100644 --- a/tests/test_plugin_registration.py +++ b/tests/test_plugin_registration.py @@ -11,6 +11,8 @@ from nasde_toolkit.docker import StagedPlugin, cleanup_worktrees, create_ref_worktree from nasde_toolkit.plugin_registration import ( + collect_plugin_skill_dirs, + collect_referenced_skill_dirs, inject_mcp_server, register_plugin_skills, stage_referenced_skills, @@ -361,6 +363,126 @@ def test_stage_referenced_skills_no_entries_is_noop(tmp_path: Path) -> None: assert sandbox == {} +# --- collect_referenced_skill_dirs / collect_plugin_skill_dirs (ADR-012) --- +# Dir-list counterparts feeding the Codex/Gemini native config.agent.skills path. + + +def test_collect_referenced_skill_dirs_returns_dirs_not_sandbox(tmp_path: Path) -> None: + src_skills = tmp_path / "src" / "skills" + src_skills.mkdir(parents=True) + skill = _make_skill(src_skills, "analyze-conversation") + + variant_dir = tmp_path / "variants" / "with-skill" + variant_dir.mkdir(parents=True) + (variant_dir / "variant.toml").write_text( + 'agent = "codex"\nmodel = "m"\n\n[[skill]]\npath = "../../src/skills/analyze-conversation"\n' + ) + + dirs = collect_referenced_skill_dirs(variant_dir, create_ref_worktree) + + assert dirs == [skill.resolve()] + + +def test_collect_referenced_skill_dirs_no_entries_is_empty(tmp_path: Path) -> None: + variant_dir = tmp_path / "variants" / "vanilla" + variant_dir.mkdir(parents=True) + (variant_dir / "variant.toml").write_text('agent = "codex"\nmodel = "m"\n') + + assert collect_referenced_skill_dirs(variant_dir, create_ref_worktree) == [] + + +def test_collect_referenced_skill_dirs_returns_canonical_path(tmp_path: Path) -> None: + """Returned dirs must be fully resolved (no symlink, no `..`) so they + string-compare equal to runner._collect_native_skill_dirs' resolved snapshot + dirs — otherwise dedup / stale-drop / basename-collision break on a + symlinked /tmp (macOS) or a ref worktree path. See ADR-012.""" + src_skills = tmp_path / "src" / "skills" + src_skills.mkdir(parents=True) + skill = _make_skill(src_skills, "analyze-conversation") + + variant_dir = tmp_path / "variants" / "with-skill" + variant_dir.mkdir(parents=True) + (variant_dir / "variant.toml").write_text( + 'agent = "codex"\nmodel = "m"\n\n[[skill]]\npath = "../../src/skills/analyze-conversation"\n' + ) + + dirs = collect_referenced_skill_dirs(variant_dir, create_ref_worktree) + + assert dirs == [d.resolve() for d in dirs] + assert dirs == [skill.resolve()] + + +def test_collect_plugin_skill_dirs_returns_canonical_path(tmp_path: Path) -> None: + staged = _staged_plugin(tmp_path) + + dirs = collect_plugin_skill_dirs(staged) + + assert dirs == [d.resolve() for d in dirs] + + +def test_collect_referenced_skill_dirs_missing_skill_md_raises(tmp_path: Path) -> None: + bad = tmp_path / "src" / "not-a-skill" + bad.mkdir(parents=True) + variant_dir = tmp_path / "variants" / "v" + variant_dir.mkdir(parents=True) + (variant_dir / "variant.toml").write_text( + 'agent = "codex"\nmodel = "m"\n\n[[skill]]\npath = "../../src/not-a-skill"\n' + ) + + with pytest.raises(RuntimeError, match="no SKILL.md"): + collect_referenced_skill_dirs(variant_dir, create_ref_worktree) + + +def test_collect_referenced_skill_dirs_ref_uses_snapshot(tmp_path: Path) -> None: + repo = tmp_path / "repo" + skills = repo / "src" / "skills" + skills.mkdir(parents=True) + _make_skill(skills, "analyze-conversation") + subprocess.run(["git", "init"], cwd=repo, capture_output=True, check=True) + subprocess.run(["git", "config", "user.email", "t@t.com"], cwd=repo, capture_output=True, check=True) + subprocess.run(["git", "config", "user.name", "T"], cwd=repo, capture_output=True, check=True) + subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True) + subprocess.run(["git", "commit", "-m", "init"], cwd=repo, capture_output=True, check=True) + commit = subprocess.run( + ["git", "rev-parse", "HEAD"], cwd=repo, capture_output=True, text=True, check=True + ).stdout.strip() + (skills / "analyze-conversation" / "SKILL.md").write_text("---\nname: analyze-conversation\n---\nCHANGED") + subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True) + subprocess.run(["git", "commit", "-m", "later"], cwd=repo, capture_output=True, check=True) + + variant_dir = tmp_path / "variants" / "with-skill" + variant_dir.mkdir(parents=True) + (variant_dir / "variant.toml").write_text( + 'agent = "codex"\nmodel = "m"\n\n' + "[[skill]]\n" + f'path = "../../repo/src/skills/analyze-conversation"\nref = "{commit}"\n' + ) + + try: + dirs = collect_referenced_skill_dirs(variant_dir, create_ref_worktree) + assert len(dirs) == 1 + assert "CHANGED" not in (dirs[0] / "SKILL.md").read_text() + finally: + cleanup_worktrees() + + +def test_collect_plugin_skill_dirs_returns_each_skill(tmp_path: Path) -> None: + staged = _staged_plugin(tmp_path) + + dirs = collect_plugin_skill_dirs(staged) + + assert dirs == [staged.staged_dir / "skills" / "analyze-conversation"] + + +def test_collect_plugin_skill_dirs_noop_without_skills(tmp_path: Path) -> None: + staged_dir = tmp_path / "_nasde-plugin" + (staged_dir / ".claude-plugin").mkdir(parents=True) + (staged_dir / ".claude-plugin" / "plugin.json").write_text("{}") + staged = StagedPlugin(staged_dir=staged_dir, install_root="/opt/x", plugin_name="x", env={}) + + assert collect_plugin_skill_dirs(staged) == [] + + def test_stage_referenced_skills_ref_uses_snapshot(tmp_path: Path) -> None: repo = tmp_path / "repo" skills = repo / "src" / "skills" diff --git a/tests/test_runner_skills.py b/tests/test_runner_skills.py index 97adfff..2bf959d 100644 --- a/tests/test_runner_skills.py +++ b/tests/test_runner_skills.py @@ -13,12 +13,13 @@ import pytest -from nasde_toolkit.config import PluginConfig, ProjectConfig, TaskConfig +from nasde_toolkit.config import DockerConfig, PluginConfig, ProjectConfig, TaskConfig from nasde_toolkit.runner import ( _collect_native_skill_dirs, _collect_sandbox_files, _ensure_harbor_config, _generate_harbor_config, + _prepare_task_environments, _require_homogeneous_plugins, ) @@ -482,7 +483,7 @@ def test_collect_native_skill_dirs_warns_on_leading_comment_frontmatter( dirs = _collect_native_skill_dirs(variant_dir, "codex") assert dirs == [str(skill.resolve())] - assert "missing YAML frontmatter" in capsys.readouterr().out + assert "missing YAML frontmatter" in " ".join(capsys.readouterr().out.split()) def test_collect_native_skill_dirs_no_warning_on_clean_frontmatter( @@ -493,7 +494,7 @@ def test_collect_native_skill_dirs_no_warning_on_clean_frontmatter( _collect_native_skill_dirs(variant_dir, "codex") - assert "missing YAML frontmatter" not in capsys.readouterr().out + assert "missing YAML frontmatter" not in " ".join(capsys.readouterr().out.split()) def test_refresh_keys_skills_per_agent_in_mixed_config(tmp_path: Path) -> None: @@ -520,3 +521,182 @@ def test_refresh_keys_skills_per_agent_in_mixed_config(tmp_path: Path) -> None: assert _skills(harbor_path, agent_index=0) == [str(codex_skill.resolve())] assert _skills(harbor_path, agent_index=1) == [str(gemini_skill.resolve())] + + +# --------------------------------------------------------------------------- +# extra_skill_dirs — [[skill]] / [nasde.plugin] native channel (ADR-012) +# --------------------------------------------------------------------------- +# +# For Codex/Gemini these dirs flow into agent["skills"] (Harbor native), NOT +# into /app/.claude/skills sandbox_files. For Claude they are dropped here +# (they already ride sandbox_files via stage_referenced_skills/register). + + +def _standalone_skill(parent: Path, name: str) -> Path: + skill = parent / name + skill.mkdir(parents=True) + (skill / "SKILL.md").write_text(f"---\nname: {name}\n---\nbody") + return skill + + +def test_extra_skill_dirs_go_to_native_skills_for_codex(tmp_path: Path) -> None: + variant_dir = _codex_variant(tmp_path / "variants" / "v") + referenced = _standalone_skill(tmp_path / "src", "tactical-ddd") + harbor_path = variant_dir / "harbor_config.json" + _generate_harbor_config(variant_dir, "v") + + _ensure_harbor_config(variant_dir, "v", {}, [str(referenced)]) + + assert _skills(harbor_path) == [str(referenced)] + + +def test_extra_skill_dirs_dropped_for_claude(tmp_path: Path) -> None: + """Claude's [[skill]]/plugin skills ride sandbox_files (stage_skill_dir), + so the native channel must stay empty — feeding it here would double-inject.""" + variant_dir = _bare_variant(tmp_path / "variants" / "v") + referenced = _standalone_skill(tmp_path / "src", "tactical-ddd") + harbor_path = variant_dir / "harbor_config.json" + _generate_harbor_config(variant_dir, "v") + + _ensure_harbor_config(variant_dir, "v", {}, [str(referenced)]) + + assert _skills(harbor_path) == [] + + +def test_extra_skill_dirs_union_with_snapshot(tmp_path: Path) -> None: + """snapshot (agents_skills/) ∪ [[skill]]/plugin (extra_skill_dirs) — both + land in agent.skills, tracked by one _nasde_derived_skills marker.""" + variant_dir = _codex_variant(tmp_path / "variants" / "v") + snapshot = _make_native_skill(variant_dir, "agents_skills", "snap-skill") + referenced = _standalone_skill(tmp_path / "src", "ref-skill") + harbor_path = variant_dir / "harbor_config.json" + _generate_harbor_config(variant_dir, "v") + + _ensure_harbor_config(variant_dir, "v", {}, [str(referenced)]) + + assert set(_skills(harbor_path)) == {str(snapshot.resolve()), str(referenced)} + + +def test_extra_skill_dirs_stale_drop_between_runs(tmp_path: Path) -> None: + """A [[skill]] removed between runs drops from agent.skills next run.""" + variant_dir = _codex_variant(tmp_path / "variants" / "v") + referenced = _standalone_skill(tmp_path / "src", "tactical-ddd") + harbor_path = variant_dir / "harbor_config.json" + _generate_harbor_config(variant_dir, "v") + + _ensure_harbor_config(variant_dir, "v", {}, [str(referenced)]) + assert _skills(harbor_path) == [str(referenced)] + + _ensure_harbor_config(variant_dir, "v", {}, []) + assert _skills(harbor_path) == [] + + +def test_extra_skill_dirs_preserves_handwritten(tmp_path: Path) -> None: + variant_dir = _codex_variant(tmp_path / "variants" / "v") + referenced = _standalone_skill(tmp_path / "src", "tactical-ddd") + harbor_path = variant_dir / "harbor_config.json" + _generate_harbor_config(variant_dir, "v") + config = json.loads(harbor_path.read_text()) + config["agents"][0]["skills"] = ["/host/hand/authored-skill"] + harbor_path.write_text(json.dumps(config)) + + _ensure_harbor_config(variant_dir, "v", {}, [str(referenced)]) + + skills = _skills(harbor_path) + assert str(referenced) in skills + assert "/host/hand/authored-skill" in skills + + +def test_extra_skill_dirs_warns_on_basename_collision(tmp_path: Path, capsys: pytest.CaptureFixture[str]) -> None: + """snapshot agents_skills/foo and referenced .../foo share a basename — + Harbor registers only the last (last-wins). Warn so the loser isn't lost.""" + variant_dir = _codex_variant(tmp_path / "variants" / "v") + _make_native_skill(variant_dir, "agents_skills", "tactical-ddd") + collide = _standalone_skill(tmp_path / "elsewhere", "tactical-ddd") + _generate_harbor_config(variant_dir, "v") + + _ensure_harbor_config(variant_dir, "v", {}, [str(collide)]) + + out = " ".join(capsys.readouterr().out.split()) + assert "share the basename" in out + assert "tactical-ddd" in out + + +def test_extra_skill_dirs_warns_on_malformed_frontmatter(tmp_path: Path, capsys: pytest.CaptureFixture[str]) -> None: + variant_dir = _codex_variant(tmp_path / "variants" / "v") + bad = tmp_path / "src" / "tactical-ddd" + bad.mkdir(parents=True) + (bad / "SKILL.md").write_text("\n---\nname: tactical-ddd\n---\nbody") + _generate_harbor_config(variant_dir, "v") + + _ensure_harbor_config(variant_dir, "v", {}, [str(bad)]) + + assert "missing YAML frontmatter" in " ".join(capsys.readouterr().out.split()) + + +# --------------------------------------------------------------------------- +# [nasde.plugin] skills — integration through the REAL ensure_task_plugin (ADR-012) +# --------------------------------------------------------------------------- +# +# Unlike the [[skill]] e2e (verified on a live agent), there is no example with +# a [nasde.plugin], so these tests drive the real ensure_task_plugin → staging +# → collect_plugin_skill_dirs → harbor_config chain (no Docker) to prove a +# plugin's own skills route natively for codex/gemini and to sandbox for claude. + + +def _make_local_plugin(root: Path, plugin_name: str, skill_name: str) -> Path: + plugin = root / plugin_name + (plugin / ".claude-plugin").mkdir(parents=True) + (plugin / ".claude-plugin" / "plugin.json").write_text(json.dumps({"name": plugin_name})) + skill = plugin / "skills" / skill_name + skill.mkdir(parents=True) + (skill / "SKILL.md").write_text(f"---\nname: {skill_name}\n---\nbody") + return plugin + + +def _project_with_plugin(tmp_path: Path, plugin: Path) -> ProjectConfig: + task_path = tmp_path / "tasks" / "t" + (task_path / "environment").mkdir(parents=True) + (task_path / "environment" / "Dockerfile").write_text("FROM ubuntu:22.04\n") + (task_path / "task.toml").write_text('version = "1.0"\n\n[task]\nname = "p/t"\n') + task = TaskConfig( + name="t", + path=task_path, + plugin=PluginConfig(path=str(plugin)), + ) + return ProjectConfig(name="p", project_dir=tmp_path, tasks=[task], docker=DockerConfig()) + + +def test_plugin_skills_route_native_for_codex(tmp_path: Path) -> None: + """A [nasde.plugin]'s own skills/ must reach a Codex agent through the + native config.agent.skills (resolved via the REAL ensure_task_plugin + staging). The Claude sandbox map is still built unconditionally, but a + Codex agent consumes the native skills list, not that map.""" + plugin = _make_local_plugin(tmp_path / "plugins", "my-plugin", "plug-skill") + config = _project_with_plugin(tmp_path, plugin) + variant_dir = _codex_variant(tmp_path / "variants" / "v") + harbor_path = variant_dir / "harbor_config.json" + _generate_harbor_config(variant_dir, "v") + + extra_sandbox, extra_skill_dirs = _prepare_task_environments(config, variant_dir) + _ensure_harbor_config(variant_dir, "v", extra_sandbox, extra_skill_dirs) + + skills = _skills(harbor_path) + assert any(Path(s).name == "plug-skill" for s in skills) + assert skills == [str(Path(s).resolve()) for s in skills] + + +def test_plugin_skills_route_sandbox_for_claude(tmp_path: Path) -> None: + """The same plugin on a Claude variant rides sandbox_files to + /app/.claude/skills (unchanged), and the native skills list stays empty.""" + plugin = _make_local_plugin(tmp_path / "plugins", "my-plugin", "plug-skill") + config = _project_with_plugin(tmp_path, plugin) + variant_dir = _bare_variant(tmp_path / "variants" / "v") + harbor_path = variant_dir / "harbor_config.json" + _generate_harbor_config(variant_dir, "v") + + extra_sandbox, extra_skill_dirs = _prepare_task_environments(config, variant_dir) + _ensure_harbor_config(variant_dir, "v", extra_sandbox, extra_skill_dirs) + + assert "/app/.claude/skills/plug-skill/SKILL.md" in _sandbox(harbor_path) + assert _skills(harbor_path) == []