62 changes: 62 additions & 0 deletions .claude/skills/xfer-manifest-analyze/SKILL.md
@@ -0,0 +1,62 @@
---
name: xfer-manifest-analyze
description: Analyze an xfer manifest's file-size distribution and suggest rclone flags plus a shard count. Use after `xfer manifest build` and before sharding/rendering, whenever the user asks "how should I tune rclone?", "how many shards?", or wants to understand the dataset shape before transferring.
---

# xfer-manifest-analyze

Drives `xfer manifest analyze` — reads `run/manifest.jsonl` (produced by `xfer-manifest-build`) and writes a histogram + suggested rclone flags to `run/analyze.json`.

## Operating model

Runs **locally on the workstation** — no cluster access needed. Pure file processing over the JSONL manifest.

## Step 1 — Locate the manifest

Default: `run/manifest.jsonl` at the repo root. If the user has a different path or multiple runs under `run_*/`, ask which to analyze.

## Step 2 — Run analyze

```bash
uv run xfer manifest analyze \
--in run/manifest.jsonl \
--out run/analyze.json
```

Optional flags that tune the **shard-count suggestion** (not the rclone flag suggestion):

| Flag | Default | What it does |
| ------------------------------- | ------- | --------------------------------------------------------------------------- |
| `--assumed-cpus-per-task` | `4` | Cores each worker will request. Matches `xfer slurm render` default. |
| `--assumed-array-concurrency` | `64` | Expected Slurm array concurrency. Matches `xfer slurm render` default. |
| `--assumed-core-budget` | unset | Total cores the partition will make available (supply from `sinfo`). |
| `--max-shard-bytes-tb` | `10` | Per-shard byte cap. No single shard should carry more than this. |
| `--base-flags "<flags>"`        | —       | Prepend the user's preferred rclone flags to the suggested ones.            |

If the user already knows the transfer cluster's available core budget, pass it — the shard-count suggestion will be sharper. Otherwise the default (concurrency + bytes-only) is fine.

## Step 3 — Report

Read `run/analyze.json` and report to the user:

1. **Dataset shape**: total object count, total bytes, median size, p10/p90 sizes, and the histogram bin counts (power-of-2 edges).
2. **Profile classification**: which profile the analyzer picked (`small_files`, `large_files`, or `mixed`) and the reasoning (e.g., ">70% of objects are under 1 MiB").
3. **Suggested rclone flags** (`suggested_flags`): the concrete string to pass to `--rclone-flags` for render. Typical examples:
   - small_files → `--transfers 64 --checkers 128 --fast-list`
   - large_files → `--transfers 16 --checkers 32 --buffer-size 256M`
   - mixed → `--transfers 32 --checkers 64 --fast-list`
4. **Suggested shard count** (`suggested_shard_count`, plus `shard_count_reasoning` and `shard_count_assumptions`). The heuristic:
   - If `total_bytes` is below the per-shard cap (default 10 TiB), **1 shard** — don't shard small datasets.
   - Otherwise `ceil(total_bytes / cap)` shards, upper-bounded by `4 × array_concurrency` and (if a core budget was supplied) `core_budget // cpus_per_task`.

Quote `shard_count_reasoning` verbatim back to the user so they can see the trade-offs.

## Step 4 — Persist for downstream skills

`run/analyze.json` is the source of truth for flag/shard decisions. `xfer-manifest-shard` reads `suggested_shard_count` and `xfer-slurm-render` reads `suggested_flags` — point at this file, don't re-derive.

If the user's plan changes (different transfer cluster, different concurrency cap), re-run `xfer manifest analyze` with updated `--assumed-*` flags before calling `xfer-manifest-shard`.

## After this skill

Recommend `xfer-manifest-shard` next.
117 changes: 117 additions & 0 deletions .claude/skills/xfer-manifest-build/SKILL.md
@@ -0,0 +1,117 @@
---
name: xfer-manifest-build
description: Build an xfer JSONL manifest for a large S3-to-S3 (or POSIX-to-S3) data transfer. Use when the user wants to list source objects for a transfer, kick off `xfer manifest build`, or start the xfer pipeline from scratch. Prefers running on a Slurm cluster that has a POSIX mount of the source bucket, since listing over POSIX is far faster than listing over S3.
---

# xfer-manifest-build

Drives `xfer manifest build` — the first stage of the xfer pipeline. Produces `run/manifest.jsonl`.

## Operating model

Assume the user is working from a **local workstation** at the root of the `xfer` repo, in a `uv` environment (`uv venv && uv sync` already done). The workstation orchestrates. **`xfer manifest build` itself must run on a Slurm login node** because it invokes `srun` + pyxis internally.

## Step 1 — Pick the build host (POSIX-first)

Ask the user (or infer from prior conversation / CLAUDE.md / a site-config file if one exists):

1. What is the source? (S3 remote like `s3src:bucket/prefix` or a POSIX path like `/mnt/data/dataset`)
2. What is the destination?
3. **Does any Slurm cluster have a POSIX mount equivalent to the source bucket?** (e.g., `/mnt/data/important-files` on cluster `weka` corresponds to `weka-s3:important-files`.) If yes, strongly prefer that cluster as the build host and pass the POSIX path as `--source` — S3 listing is latency-bound, so walking the POSIX mount is much faster.
4. If no POSIX mount exists, pick any Slurm cluster with network access to the source endpoint. Still prefer the side with better network proximity to source.

Record the chosen build host's hostname, username, and the xfer repo path on that host (default `~/xfer`). These are needed for SSH.

## Step 2 — Pre-flight on the login node

Run a single non-destructive SSH probe to discover what's already in place:

```bash
ssh <user>@<login-node> '
command -v uv || echo "UV_MISSING"
test -d <xfer-repo-path> && echo "REPO_PRESENT" || echo "REPO_MISSING"
test -d <xfer-repo-path>/.venv && echo "VENV_PRESENT" || echo "VENV_MISSING"
test -f <rclone-conf-path> && echo "RCLONE_CONF_PRESENT" || echo "RCLONE_CONF_MISSING"
sinfo -h -o "%P %a %D %C %G" | head -20
'
```

Parse results and react:

| State | Action |
| ---------------------------- | ----------------------------------------------------------------------------------- |
| `UV_MISSING` | Install per-user: `curl -LsSf https://astral.sh/uv/install.sh \| sh` (confirm first)|
| `REPO_MISSING` | Rsync the local repo up (see step 2a below) |
| `REPO_PRESENT` but older | Offer to rsync updates from the workstation (step 2a); never force without asking |
| `VENV_MISSING` | Run `uv sync` on the login node after repo is in place (step 2b) |
| `RCLONE_CONF_MISSING`        | Invoke `xfer-rclone-config` to create/deploy the config to this cluster — do not `scp` blindly |

Also verify a CPU-only partition is visible in `sinfo` output (the `%G` column should be empty or `(null)` for CPU-only).
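The `%G` check can be done mechanically. A sketch, assuming lines in the probe's `sinfo -h -o "%P %a %D %C %G"` format (the helper name is illustrative):

```python
def cpu_only_partitions(sinfo_lines: list[str]) -> list[str]:
    """Return partition names whose GRES column is empty or "(null)".

    Expects lines shaped like the probe's output:
    PARTITION AVAIL NODES CPUS(A/I/O/T) GRES
    """
    cpu_only = []
    for line in sinfo_lines:
        fields = line.split()
        if not fields:
            continue
        # %G is the fifth column; missing or "(null)" GRES means no GPUs attached.
        gres = fields[4] if len(fields) >= 5 else ""
        if gres in ("", "(null)"):
            # Strip Slurm's trailing "*" default-partition marker.
            cpu_only.append(fields[0].rstrip("*"))
    return cpu_only
```

If this returns an empty list, there is no CPU-only partition visible and the user should pick one explicitly.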

### Step 2a — Sync the repo to the login node (if needed)

```bash
rsync -av \
--exclude='.venv/' --exclude='.git/' --exclude='run*/' \
--exclude='__pycache__/' --exclude='*.egg-info/' \
./ <user>@<login-node>:<xfer-repo-path>/
```

Exclude `.venv/` (platform-specific; must be rebuilt remotely), `.git/` (not needed to run), and any local `run*/` dirs (those stay on the workstation or move separately).

### Step 2b — Bootstrap the uv environment on the login node

```bash
ssh <user>@<login-node> '
cd <xfer-repo-path> &&
uv venv &&
uv sync
'
```

Stream the output so the user sees `uv sync` progress. If `uv sync` fails (e.g., no network from login node, locked-down Python), surface the error and stop — don't try to force-install.

## Step 3 — Run the build

Prefer a **CPU-only partition** with **4–8 cores**. The `srun` inside `xfer manifest build` already requests 8 cores (see `cli.py:251`), so ask the user to confirm the partition and any `--sbatch-extras`-style account/QoS flags their site requires before submission.

Invoke on the login node via SSH:

```bash
ssh <user>@<login-node> '
cd <xfer-repo-path> &&
uv run xfer manifest build \
--source <source> \
--dest <dest> \
--out run/manifest.jsonl \
--rclone-image rclone/rclone:latest \
--rclone-config <rclone-conf-path-on-this-cluster>
'
```

`<rclone-conf-path-on-this-cluster>` is the absolute path on the **build cluster's login node**, not the workstation path. If unsure, ask the user — or run `xfer-rclone-config` to resolve/deploy it for this cluster.

Notes:
- If the source is a POSIX path, the `--source` value is the filesystem path (e.g., `/mnt/data/dataset`), not an rclone remote. The destination remains an rclone remote.
- `--fast-list` is already the default for S3 sources (see `cli.py:200`). Pass `--no-fast-list` when the source is a POSIX path.
- Stream output back so the user sees progress. Don't background it.

## Step 4 — Retrieve the manifest and note the vantage point

Pull the manifest back to the workstation so downstream skills (analyze, shard, render) can run locally:

```bash
rsync -av <user>@<login-node>:<xfer-repo-path>/run/manifest.jsonl ./run/manifest.jsonl
```

Tell the user **the vantage point of the manifest** — i.e., "source was listed as `<source>` from host `<login-node>`." If the transfer will run on a different cluster with a different view (e.g., built from POSIX `/mnt/data/x`, transferred via `weka-s3:x`), they will need to invoke the `xfer-manifest-rebase` skill before render/submit.

## Safety

- Do not delete or overwrite an existing `run/manifest.jsonl` without confirming.
- Do not pick a build partition without the user's confirmation if the cluster has multiple options.
- If `ssh` or `rsync` would touch a shared path on the login node (e.g., a group scratch dir), confirm the path first.

## After this skill

Recommend the user next invoke `xfer-manifest-analyze` to pick rclone flags from the file-size histogram.
78 changes: 78 additions & 0 deletions .claude/skills/xfer-manifest-combine/SKILL.md
@@ -0,0 +1,78 @@
---
name: xfer-manifest-combine
description: Combine multiple `rclone lsjson` part files into a single xfer JSONL manifest. Use this instead of `xfer-manifest-build` when the source is too large for a single `rclone lsjson` call and the user has already produced parallel listings (one JSON-array file per top-level prefix, with a `.prefix` sidecar naming the prefix). Produces the same `xfer.manifest.v1` schema the rest of the pipeline consumes.
---

# xfer-manifest-combine

Drives `xfer manifest combine` — an alternative entry point to the pipeline when `xfer manifest build`'s single-shot listing is too slow or too large to buffer. Combines per-prefix lsjson outputs into one `manifest.jsonl`.

## When to use this instead of `xfer-manifest-build`

Use `xfer-manifest-build` by default. Reach for combine when **all** of the following are true:

- The source has so many objects that a single `rclone lsjson` call would OOM or run for days.
- The user (or a previous job) has already produced per-prefix lsjson outputs in a directory.
- Each part file is a JSON array (rclone's native lsjson format), and each has a sibling `.prefix` file naming the prefix that was listed.

If only the first condition is true and there are no parts yet, `xfer-manifest-build` is simpler — running parallel lsjson jobs just to feed combine is out of scope for this skill; the user should do that with their own orchestration.

## Operating model

Runs **locally on the workstation** if the parts dir is accessible locally, otherwise on whichever host can see the part files. Pure file processing — no Slurm, no SSH.

## Step 1 — Verify the parts directory layout

Each part file must follow this pattern:

```
parts/
├── lsjson-0001.json     # JSON array: `rclone lsjson <remote>:<bucket>/<prefix-1> --recursive`
├── lsjson-0001.prefix   # text file containing the literal prefix, e.g. "prefix-1"
├── lsjson-0002.json
├── lsjson-0002.prefix
...
```

Quick probe:

```bash
ls <parts-dir>/lsjson-*.json | head
ls <parts-dir>/lsjson-*.prefix | head
```

If a `.prefix` sidecar is missing for any part, combine will use an empty prefix for that part — the resulting `path` fields will be bucket-relative, which is usually **not** what you want. Flag this and confirm with the user before running.
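The missing-sidecar check can be scripted. A minimal sketch, assuming the `lsjson-NNNN.json` / `lsjson-NNNN.prefix` naming convention shown above (the helper name is illustrative):

```python
from pathlib import Path

def parts_missing_prefix(parts_dir: str) -> list[str]:
    """Names of lsjson part files that have no matching .prefix sidecar."""
    missing = []
    for part in sorted(Path(parts_dir).glob("lsjson-*.json")):
        # Sidecar convention: lsjson-0001.json -> lsjson-0001.prefix
        if not part.with_suffix(".prefix").is_file():
            missing.append(part.name)
    return missing
```

Report any names this returns to the user before running combine.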

## Step 2 — Run combine

```bash
uv run xfer manifest combine \
--source <rclone-source-root> \
--dest <rclone-dest-root> \
--parts-dir <parts-dir> \
--out run/manifest.jsonl
```

`--source` / `--dest` are the full roots (e.g., `s3src:bucket`) — combine prepends them to each part's prefix + object path to produce the `source`/`dest` URIs in the manifest.

`--run-id <id>` is optional; if omitted, one is generated per run.
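The root-prepending described above can be illustrated with a sketch — an assumption about how the pieces compose, not xfer's actual code:

```python
def compose_uri(root: str, prefix: str, path: str) -> str:
    """Join a root, a part's prefix, and an lsjson Path into one URI."""
    parts = [root.rstrip("/")]
    if prefix:  # an empty prefix (missing .prefix sidecar) drops this segment
        parts.append(prefix.strip("/"))
    parts.append(path.lstrip("/"))
    return "/".join(parts)
```

So `compose_uri("s3src:bucket", "prefix-1", "a/b.bin")` yields `s3src:bucket/prefix-1/a/b.bin`, and an empty prefix yields the bucket-relative `s3src:bucket/a/b.bin` pitfall flagged in Step 1.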

## Step 3 — Sanity-check the manifest

```bash
wc -l run/manifest.jsonl
head -1 run/manifest.jsonl | python -m json.tool
```

Confirm:
- Line count matches the sum of non-dir entries across parts (combine's final log line reports this).
- `source_root`, `dest_root`, and the first record's `source`/`path` look right (path should start with a known prefix from the parts).
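The line-count cross-check can be done independently of combine's log. A sketch, relying on the fact that `rclone lsjson` entries carry an `IsDir` field (the helper names are illustrative):

```python
import json
from pathlib import Path

def count_part_objects(parts_dir: str) -> int:
    """Sum of non-directory entries across all lsjson part files."""
    total = 0
    for part in sorted(Path(parts_dir).glob("lsjson-*.json")):
        entries = json.loads(part.read_text())
        # rclone lsjson marks directories with "IsDir": true; skip them.
        total += sum(1 for e in entries if not e.get("IsDir", False))
    return total

def manifest_line_count(manifest: str) -> int:
    """Number of records in a JSONL manifest (one object per line)."""
    with open(manifest) as fh:
        return sum(1 for _ in fh)
```

The two counts should match; a mismatch usually means a part failed to parse or a sidecar/part pair is inconsistent.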

## Safety

- If `run/manifest.jsonl` already exists, confirm with the user before overwriting — combine writes unconditionally.
- If the parts dir is on a shared filesystem, treat it as read-only.

## After this skill

Continue with `xfer-manifest-analyze` exactly as you would after `xfer-manifest-build`. Downstream skills don't care whether the manifest came from build or combine — the schema is identical.
80 changes: 80 additions & 0 deletions .claude/skills/xfer-manifest-rebase/SKILL.md
@@ -0,0 +1,80 @@
---
name: xfer-manifest-rebase
description: Remap an xfer manifest's source/dest roots when the transfer host has a different view than the manifest build host (e.g., manifest built over POSIX `/mnt/data/x`, transfer runs via S3 `weka-s3:x`). Use whenever the vantage point changes between manifest build and transfer — this MUST run before render/submit or the transfer will fail.
---

# xfer-manifest-rebase

Drives `xfer manifest rebase` — rewrites the source/dest roots in `run/manifest.jsonl` so the manifest is valid from the transfer host's perspective.

## When to run this

**Trigger condition**: the host that will execute the transfer has a different view of either the source or the destination than the host that built the manifest. Common cases:

- Manifest built on a cluster with POSIX mount (`/mnt/data/dataset`); transfer runs on a different cluster that only sees the bucket as an rclone remote (`weka-s3:dataset`).
- Source built via one rclone remote alias; transfer host's rclone.conf uses a different alias for the same bucket.
- Dest root changed (e.g., adding a subprefix) after build.

If the transfer host sees source and dest identically to the build host, **do not rebase** — it's a no-op that wastes a pass over the manifest.

## Step 1 — Determine the mismatch

Ask the user (or look at prior conversation / `run/manifest.jsonl`'s header):

1. What were the `source_root` and `dest_root` recorded in the manifest? (Peek at the first line of `run/manifest.jsonl`.)
2. What does the transfer host see as the source and dest? (Usually rclone remote names; confirm with `rclone listremotes` on that host.)

Show the user the proposed before/after mapping and confirm before proceeding.
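Peeking at the recorded roots can be scripted. A sketch, assuming the manifest's first record carries `source_root` and `dest_root` fields (the field names match the sanity check in `xfer-manifest-combine`; the helper name is illustrative):

```python
import json

def manifest_roots(manifest_path: str) -> tuple[str, str]:
    """Read (source_root, dest_root) from the manifest's first record."""
    with open(manifest_path) as fh:
        first = json.loads(fh.readline())
    return first["source_root"], first["dest_root"]
```

Print these next to what the transfer host reports (`rclone listremotes`) so the user can eyeball the before/after mapping.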

## Step 2 — Run rebase

```bash
uv run xfer manifest rebase \
--in run/manifest.jsonl \
--out run/manifest.rebased.jsonl \
--source-root <new-source-root> \
--dest-root <new-dest-root>
```

Always write to a new file (don't overwrite `manifest.jsonl`). Keeping the original manifest as a record of the build vantage point is useful for debugging and audits.

## Step 3 — Re-shard

Sharding is derived from the manifest, so **re-shard after rebasing**. The existing `run/shards/` directory contains pre-rebase paths and must be replaced.

**Before running the command below, ask the user explicitly:** "OK to delete `run/shards/` and re-generate from the rebased manifest?" Offer to move it aside (`mv run/shards run/shards.pre-rebase`) as a safer alternative if they want to keep the old shards for debugging.

Only after explicit confirmation:

```bash
rm -rf run/shards
uv run xfer manifest shard \
--in run/manifest.rebased.jsonl \
--outdir run/shards \
--num-shards <same-N-as-before>
```

(Or invoke `xfer-manifest-shard` with the rebased manifest as input.) Byte balance won't change meaningfully, but the shard files need to carry the rebased paths or workers will try to copy from the wrong URI.

## Step 4 — Point `xfer slurm render` at the rebased manifest

`xfer slurm render` reads `source_root` and `dest_root` from a manifest file. By default it reads `<run_dir>/manifest.jsonl`, which is intentionally left at the pre-rebase vantage point as an audit record. Pass `--manifest` to read the rebased file instead:

```bash
uv run xfer slurm render \
--run-dir run \
--manifest run/manifest.rebased.jsonl \
...
```

Without `--manifest`, render would use the original roots and every array task would target the wrong URI.

## Safety

- Never delete the original manifest — always keep `run/manifest.jsonl` as an audit trail alongside `run/manifest.rebased.jsonl`.
- Rebase is a remap, not a content migration. It does not move data. It only relabels what each shard points to.
- Confirm before `rm -rf run/shards` β€” the user may want to move the old shards aside rather than delete them.

## After this skill

Recommend `xfer-slurm-render` (or re-shard first if you didn't in step 3).