From 3904b86b24424eaa6029738c5d47fe87b86b2491 Mon Sep 17 00:00:00 2001
From: Joe Schoonover
Date: Fri, 17 Apr 2026 13:50:10 -0400
Subject: [PATCH 1/5] Add Claude Code skills and CLAUDE.md for xfer workflow
 guidance

Introduces seven Skill definitions under .claude/skills/ covering each stage
of a large data transfer (rclone.conf setup, manifest build, analyze, shard,
rebase, slurm render, slurm submit) along with a CLAUDE.md that captures the
workstation-as-orchestrator model and the cross-cutting invariants
(per-system paths, POSIX-first manifest build, load-aware transfer cluster
selection, rebase-on-vantage-change, credential hygiene). Also adds
.claude/settings.local.json to .gitignore since it holds user-local
permission grants.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .claude/skills/xfer-manifest-analyze/SKILL.md |  46 +++++++
 .claude/skills/xfer-manifest-build/SKILL.md   | 118 ++++++++++++++++++
 .claude/skills/xfer-manifest-rebase/SKILL.md  |  66 ++++++++++
 .claude/skills/xfer-manifest-shard/SKILL.md   |  56 +++++++++
 .claude/skills/xfer-rclone-config/SKILL.md    | 114 +++++++++++++++++
 .claude/skills/xfer-slurm-render/SKILL.md     | 100 +++++++++++++++
 .claude/skills/xfer-slurm-submit/SKILL.md     |  98 +++++++++++++++
 .gitignore                                    |   1 +
 CLAUDE.md                                     |  37 ++++++
 9 files changed, 636 insertions(+)
 create mode 100644 .claude/skills/xfer-manifest-analyze/SKILL.md
 create mode 100644 .claude/skills/xfer-manifest-build/SKILL.md
 create mode 100644 .claude/skills/xfer-manifest-rebase/SKILL.md
 create mode 100644 .claude/skills/xfer-manifest-shard/SKILL.md
 create mode 100644 .claude/skills/xfer-rclone-config/SKILL.md
 create mode 100644 .claude/skills/xfer-slurm-render/SKILL.md
 create mode 100644 .claude/skills/xfer-slurm-submit/SKILL.md
 create mode 100644 CLAUDE.md

diff --git a/.claude/skills/xfer-manifest-analyze/SKILL.md b/.claude/skills/xfer-manifest-analyze/SKILL.md
new file mode 100644
index 0000000..5ccb4e2
--- /dev/null
+++ b/.claude/skills/xfer-manifest-analyze/SKILL.md
@@ -0,0 +1,46 @@
+---
+name: xfer-manifest-analyze
+description: Analyze an xfer manifest's file-size distribution and suggest rclone flags plus a shard count. Use after `xfer manifest build` and before sharding/rendering, whenever the user asks "how should I tune rclone?", "how many shards?", or wants to understand the dataset shape before transferring.
+---
+
+# xfer-manifest-analyze
+
+Drives `xfer manifest analyze` — reads `run/manifest.jsonl` (produced by `xfer-manifest-build`) and writes a histogram + suggested rclone flags to `run/analyze.json`.
+
+## Operating model
+
+Runs **locally on the workstation** — no cluster access needed. Pure file processing over the JSONL manifest.
+
+## Step 1 — Locate the manifest
+
+Default: `run/manifest.jsonl` at the repo root. If the user has a different path or multiple runs under `run_*/`, ask which to analyze.
+
+## Step 2 — Run analyze
+
+```bash
+uv run xfer manifest analyze \
+  --in run/manifest.jsonl \
+  --out run/analyze.json
+```
+
+Optionally, if the user already has a preferred base set of rclone flags they want to layer on top, pass `--base-flags "<flags>"`.
+
+## Step 3 — Report
+
+Read `run/analyze.json` and report to the user:
+
+1. **Dataset shape**: total object count, total bytes, median size, p90/p99 sizes, and the histogram bin counts (power-of-2 edges).
+2. **Profile classification**: which profile the analyzer picked (`small_files`, `large_files`, or `mixed`) and the reasoning (e.g., ">70% of objects are under 1 MiB").
+3. **Suggested rclone flags**: the concrete string to pass to `--rclone-flags` for render. Typical examples:
+   - small_files → `--transfers 64 --checkers 128 --fast-list`
+   - large_files → `--transfers 16 --checkers 32 --buffer-size 256M`
+   - mixed → `--transfers 32 --checkers 64 --fast-list`
+4. **Suggested shard count**: derive from total object count / target objects-per-shard (aim for ~10k–50k objects per shard for small-file workloads, smaller for large-file).
+Cap at what the chosen transfer cluster can reasonably host in a job array. If you don't yet know the transfer cluster, give a range and defer to `xfer-manifest-shard`.
+
+## Step 4 — Persist for downstream skills
+
+`run/analyze.json` is the source of truth for flag/shard decisions. `xfer-manifest-shard` and `xfer-slurm-render` both read it. Don't re-derive flags by hand in those skills — point at this file.
+
+## After this skill
+
+Recommend `xfer-manifest-shard` next.

diff --git a/.claude/skills/xfer-manifest-build/SKILL.md b/.claude/skills/xfer-manifest-build/SKILL.md
new file mode 100644
index 0000000..ad976ac
--- /dev/null
+++ b/.claude/skills/xfer-manifest-build/SKILL.md
@@ -0,0 +1,118 @@
+---
+name: xfer-manifest-build
+description: Build an xfer JSONL manifest for a large S3-to-S3 (or POSIX-to-S3) data transfer. Use when the user wants to list source objects for a transfer, kick off `xfer manifest build`, or start the xfer pipeline from scratch. Prefers running on a Slurm cluster that has a POSIX mount of the source bucket, since listing over POSIX is far faster than listing over S3.
+---
+
+# xfer-manifest-build
+
+Drives `xfer manifest build` — the first stage of the xfer pipeline. Produces `run/manifest.jsonl`.
+
+## Operating model
+
+Assume the user is working from a **local workstation** at the root of the `xfer` repo, in a `uv` environment (`uv venv && uv sync` already done). The workstation orchestrates. **`xfer manifest build` itself must run on a Slurm login node** because it invokes `srun` + pyxis internally.
+
+## Step 1 — Pick the build host (POSIX-first)
+
+Ask the user (or infer from prior conversation / CLAUDE.md / a site-config file if one exists):
+
+1. What is the source? (S3 remote like `s3src:bucket/prefix` or a POSIX path like `/mnt/data/dataset`)
+2. What is the destination?
+3. **Does any Slurm cluster have a POSIX mount equivalent to the source bucket?** (e.g., `/mnt/data/important-files` on cluster `weka` corresponds to `weka-s3:important-files`.) If yes, strongly prefer that cluster as the build host and pass the POSIX path as `--source` — listing over S3 is latency-bound, so listing over POSIX is much faster.
+4. If no POSIX mount exists, pick any Slurm cluster with network access to the source endpoint. Still prefer the side with better network proximity to source.
+
+Record the chosen build host's hostname, username, and the xfer repo path on that host (default `~/xfer`). These are needed for SSH.
+
+## Step 2 — Pre-flight on the login node
+
+Run a single non-destructive SSH probe to discover what's already in place:
+
+```bash
+ssh <user>@<host> '
+  command -v uv || echo "UV_MISSING"
+  test -d <repo-path> && echo "REPO_PRESENT" || echo "REPO_MISSING"
+  test -d <repo-path>/.venv && echo "VENV_PRESENT" || echo "VENV_MISSING"
+  test -f <rclone-conf-path> && echo "RCLONE_CONF_PRESENT" || echo "RCLONE_CONF_MISSING"
+  sinfo -h -o "%P %a %D %C %G" | head -20
+'
+```
+
+Parse results and react:
+
+| State | Action |
+| ---------------------------- | ----------------------------------------------------------------------------------- |
+| `UV_MISSING` | Install per-user: `curl -LsSf https://astral.sh/uv/install.sh \| sh` (confirm first) |
+| `REPO_MISSING` | Rsync the local repo up (see step 2a below) |
+| `REPO_PRESENT` but older | Offer to rsync updates from the workstation (step 2a); never force without asking |
+| `VENV_MISSING` | Run `uv sync` on the login node after repo is in place (step 2b) |
+| `RCLONE_CONF_MISSING` | Invoke `xfer-rclone-config` to create/deploy the config to this cluster — do not `scp` blindly |
+
+Also verify a CPU-only partition is visible in `sinfo` output (the `%G` column should be empty or `(null)` for CPU-only).
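The `%G` check can be scripted when eyeballing the probe output gets tedious. A minimal sketch with made-up `sinfo` sample lines (real column values vary by site):

```shell
# Hypothetical output of `sinfo -h -o "%P %a %D %C %G"`; two invented partitions.
sinfo_sample='cpu* up 12 96/276/12/384 (null)
gpu up 4 120/8/0/128 gpu:a100:8'

# Keep only partitions whose GRES column is (null), i.e. CPU-only candidates.
printf '%s\n' "$sinfo_sample" | awk '$5 == "(null)" { print $1 }'
```

Run against the real probe output, this prints the CPU-only partition names to offer the user.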
+
+### Step 2a — Sync the repo to the login node (if needed)
+
+```bash
+rsync -av \
+  --exclude='.venv/' --exclude='.git/' --exclude='run*/' \
+  --exclude='__pycache__/' --exclude='*.egg-info/' \
+  ./ <user>@<host>:<repo-path>/
+```
+
+Exclude `.venv/` (platform-specific; must be rebuilt remotely), `.git/` (not needed to run), and any local `run*/` dirs (those stay on the workstation or move separately).
+
+### Step 2b — Bootstrap the uv environment on the login node
+
+```bash
+ssh <user>@<host> '
+  cd <repo-path> &&
+  uv venv &&
+  uv sync
+'
+```
+
+Stream the output so the user sees `uv sync` progress. If `uv sync` fails (e.g., no network from login node, locked-down Python), surface the error and stop — don't try to force-install.
+
+## Step 3 — Run the build
+
+Prefer a **CPU-only partition** with **4–8 cores**. The `srun` inside `xfer manifest build` already requests 8 cores (see `cli.py:251`), so ask the user to confirm the partition and any `--sbatch-extras`-style account/QoS flags their site requires before submission.
+
+Invoke on the login node via SSH:
+
+```bash
+ssh <user>@<host> '
+  cd <repo-path> &&
+  uv run xfer manifest build \
+    --source <source> \
+    --dest <dest> \
+    --out run/manifest.jsonl \
+    --rclone-image rclone/rclone:latest \
+    --rclone-config <rclone-conf-path> \
+    --extra-lsjson-flags "--fast-list"
+'
+```
+
+`<rclone-conf-path>` is the absolute path on the **build cluster's login node**, not the workstation path. If unsure, ask the user — or run `xfer-rclone-config` to resolve/deploy it for this cluster.
+
+Notes:
+- If the source is a POSIX path, the `--source` value is the filesystem path (e.g., `/mnt/data/dataset`), not an rclone remote. The destination remains an rclone remote.
+- `--fast-list` is usually a win for S3 sources. Skip it for POSIX sources.
+- Stream output back so the user sees progress. Don't background it.
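Once the build completes, a crude sanity check over the JSONL is cheap on whichever host holds it. The field names below (`key`, `size`) are assumptions about the manifest schema, and both records are synthetic:

```shell
# Two synthetic manifest records; the real schema may carry more fields.
cat > /tmp/manifest_demo.jsonl <<'EOF'
{"key":"a.bin","size":1048576}
{"key":"b.bin","size":2097152}
EOF

# Object count and total bytes without a JSON parser (crude but dependency-free).
awk -F'"size":' '{ split($2, a, /[,}]/); n += 1; b += a[1] }
                 END { print n " objects, " b " bytes" }' /tmp/manifest_demo.jsonl
```

If the counts look wildly off from expectations, stop before analyzing or sharding.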
+
+## Step 4 — Retrieve the manifest and note the vantage point
+
+Pull the manifest back to the workstation so downstream skills (analyze, shard, render) can run locally:
+
+```bash
+rsync -av <user>@<host>:<repo-path>/run/manifest.jsonl ./run/manifest.jsonl
+```
+
+Tell the user **the vantage point of the manifest** — i.e., "source was listed as `<source>` from host `<host>`." If the transfer will run on a different cluster with a different view (e.g., built from POSIX `/mnt/data/x`, transferred via `weka-s3:x`), they will need to invoke the `xfer-manifest-rebase` skill before render/submit.
+
+## Safety
+
+- Do not delete or overwrite an existing `run/manifest.jsonl` without confirming.
+- Do not pick a build partition without the user's confirmation if the cluster has multiple options.
+- If `ssh` or `rsync` would touch a shared path on the login node (e.g., a group scratch dir), confirm the path first.
+
+## After this skill
+
+Recommend the user next invoke `xfer-manifest-analyze` to pick rclone flags from the file-size histogram.

diff --git a/.claude/skills/xfer-manifest-rebase/SKILL.md b/.claude/skills/xfer-manifest-rebase/SKILL.md
new file mode 100644
index 0000000..f249e99
--- /dev/null
+++ b/.claude/skills/xfer-manifest-rebase/SKILL.md
@@ -0,0 +1,66 @@
+---
+name: xfer-manifest-rebase
+description: Remap an xfer manifest's source/dest roots when the transfer host has a different view than the manifest build host (e.g., manifest built over POSIX `/mnt/data/x`, transfer runs via S3 `weka-s3:x`). Use whenever the vantage point changes between manifest build and transfer — this MUST run before render/submit or the transfer will fail.
+---
+
+# xfer-manifest-rebase
+
+Drives `xfer manifest rebase` — rewrites the source/dest roots in `run/manifest.jsonl` so the manifest is valid from the transfer host's perspective.
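Conceptually, the remap is a prefix rewrite over every record. An illustrative one-liner with an invented field name and roots; the real command handles quoting and edge cases:

```shell
# Rewrite the source-root prefix in a single synthetic record.
# "src" and both roots are made-up; real manifest fields may differ.
echo '{"src":"/mnt/data/dataset/a/b.bin"}' \
  | sed 's#"src":"/mnt/data/dataset/#"src":"weka-s3:dataset/#'
# → {"src":"weka-s3:dataset/a/b.bin"}
```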
+
+## When to run this
+
+**Trigger condition**: the host that will execute the transfer has a different view of either the source or the destination than the host that built the manifest. Common cases:
+
+- Manifest built on a cluster with POSIX mount (`/mnt/data/dataset`); transfer runs on a different cluster that only sees the bucket as an rclone remote (`weka-s3:dataset`).
+- Source built via one rclone remote alias; transfer host's rclone.conf uses a different alias for the same bucket.
+- Dest root changed (e.g., adding a subprefix) after build.
+
+If the transfer host sees source and dest identically to the build host, **do not rebase** — it's a no-op that wastes a pass over the manifest.
+
+## Step 1 — Determine the mismatch
+
+Ask the user (or look at prior conversation / `run/manifest.jsonl`'s header):
+
+1. What were the `source_root` and `dest_root` recorded in the manifest? (Peek at the first line of `run/manifest.jsonl`.)
+2. What does the transfer host see as the source and dest? (Usually rclone remote names; confirm with `rclone listremotes` on that host.)
+
+Show the user the proposed before/after mapping and confirm before proceeding.
+
+## Step 2 — Run rebase
+
+```bash
+uv run xfer manifest rebase \
+  --in run/manifest.jsonl \
+  --out run/manifest.rebased.jsonl \
+  --source-root <new-source-root> \
+  --dest-root <new-dest-root>
+```
+
+Always write to a new file (don't overwrite `manifest.jsonl`). Keeping the original manifest as a record of the build vantage point is useful for debugging and audits.
+
+## Step 3 — Re-shard
+
+Sharding is derived from the manifest, so **re-shard after rebasing**:
+
+```bash
+rm -rf run/shards
+uv run xfer manifest shard \
+  --in run/manifest.rebased.jsonl \
+  --outdir run/shards \
+  --num-shards <num-shards>
+```
+
+(Or invoke `xfer-manifest-shard`.) Byte balance won't change meaningfully, but the shard files need to carry the rebased paths or workers will try to copy from the wrong URI.
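A cheap guard against a forgotten re-shard is a timestamp comparison before render. A sketch using a throwaway demo directory; in a real run the files live under the repo's `run/` tree:

```shell
# Demo layout under /tmp; shard metadata written first, rebased manifest second.
demo=/tmp/xfer_rebase_demo/run
mkdir -p "$demo/shards"
touch "$demo/shards/shards.meta.json"
sleep 1
touch "$demo/manifest.rebased.jsonl"

# -nt means "newer than": stale shards would bake pre-rebase paths into render.
if [ "$demo/manifest.rebased.jsonl" -nt "$demo/shards/shards.meta.json" ]; then
  echo "shards are stale: re-shard before render"
fi
```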
+
+## Step 4 — Point downstream skills at the rebased manifest
+
+When you invoke `xfer-slurm-render` next, pass `--run-dir run` but ensure `config.resolved.json` references the rebased manifest. If the user plans a fresh `xfer slurm render`, that's automatic (render reads from `run/shards/`).
+
+## Safety
+
+- Never delete the original manifest — always keep `run/manifest.jsonl` as an audit trail alongside `run/manifest.rebased.jsonl`.
+- Rebase is a remap, not a content migration. It does not move data. It only relabels what each shard points to.
+
+## After this skill
+
+Recommend `xfer-slurm-render` (or re-shard first if you didn't in step 3).

diff --git a/.claude/skills/xfer-manifest-shard/SKILL.md b/.claude/skills/xfer-manifest-shard/SKILL.md
new file mode 100644
index 0000000..a87ecc7
--- /dev/null
+++ b/.claude/skills/xfer-manifest-shard/SKILL.md
@@ -0,0 +1,56 @@
+---
+name: xfer-manifest-shard
+description: Split an xfer manifest into byte-balanced shards for parallel transfer. Use after analyze and before slurm render, whenever the user wants to shard a manifest or asks how many array tasks the transfer should use.
+---
+
+# xfer-manifest-shard
+
+Drives `xfer manifest shard` — splits `run/manifest.jsonl` into `run/shards/shard_*.jsonl` + `shards.meta.json`.
+
+## Operating model
+
+Runs **locally on the workstation**. Pure file processing. No Slurm/SSH needed.
+
+## Step 1 — Read the analyze output
+
+Read `run/analyze.json` (from `xfer-manifest-analyze`). Use its `suggested_shard_count` / profile as the starting point. If analyze hasn't been run yet, invoke `xfer-manifest-analyze` first — don't guess shard counts from the raw manifest.
+
+## Step 2 — Reconcile shard count with cluster resources
+
+The right shard count depends on **both** rclone settings (from analyze) **and** the transfer cluster's available resources. Ask the user:
+
+1. Which cluster will run the transfer? (Same as build host, or different?)
+2. What's the target array concurrency — how many shards should run at once? Typical range: 32–256, capped by the partition's `MaxArraySize` and the throughput both S3 endpoints can handle.
+3. What's the partition's per-node core/memory budget?
+
+Rule of thumb:
+- Total shards ≈ max(suggested_shard_count_from_analyze, 4 × array_concurrency). This gives the scheduler enough slack to keep the array fully packed even as slow shards trail.
+- For small-files profiles (heavy listing, light bytes), bias toward **more shards** of fewer objects each.
+- For large-files profiles (heavy bytes, few objects), bias toward **fewer shards** with more bytes each — byte-balancing matters more than object count.
+
+State your recommendation with the reasoning, then confirm before running.
+
+## Step 3 — Run shard
+
+```bash
+uv run xfer manifest shard \
+  --in run/manifest.jsonl \
+  --outdir run/shards \
+  --num-shards <num-shards> \
+  --strategy bytes
+```
+
+`bytes` (the default) is almost always right — it minimizes long-tail tasks. Use `count` only if the user explicitly wants equal object counts per shard, or `hash` for deterministic assignment by key.
+
+## Step 4 — Report
+
+After the command completes, read `run/shards/shards.meta.json` and report:
+- Number of shards created
+- Min / max / median shard bytes (to show balance quality)
+- Total object count
+
+If max/min bytes ratio is > 3x, warn the user — the manifest may have very large individual objects that a byte-balanced greedy bin-pack can't split. Suggest bumping shard count or splitting the outlier files out into a separate run.
+
+## After this skill
+
+Recommend `xfer-slurm-render`. If the transfer cluster has a different view of source/dest than the build host, recommend `xfer-manifest-rebase` first.
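The max/min balance check can also be done by hand from the per-shard byte totals. A sketch with synthetic shard sizes standing in for the values read out of `shards.meta.json`:

```shell
# Four invented per-shard byte totals; 3900/1000 = 3.9, past the 3x warning line.
printf '%s\n' 1200 1100 3900 1000 \
  | sort -n \
  | awk 'NR == 1 { min = $1 } { max = $1 } END { printf "max/min = %.1f\n", max / min }'
# → max/min = 3.9
```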
diff --git a/.claude/skills/xfer-rclone-config/SKILL.md b/.claude/skills/xfer-rclone-config/SKILL.md
new file mode 100644
index 0000000..f0d7d88
--- /dev/null
+++ b/.claude/skills/xfer-rclone-config/SKILL.md
@@ -0,0 +1,114 @@
+---
+name: xfer-rclone-config
+description: Create or extend an rclone.conf for xfer — collect S3 endpoints and credentials for each remote (source, destination, optional staging), write the config with 0600 permissions, optionally test it, and guide the user through deploying it to each Slurm cluster that will run xfer jobs. Use whenever the user needs to set up rclone remotes, is bootstrapping a new transfer, or an existing skill reports a missing/incomplete rclone.conf.
+---
+
+# xfer-rclone-config
+
+Authors an rclone.conf suitable for xfer. The config is consumed by **containerized rclone** inside Slurm jobs, so it must be:
+
+1. Readable on the workstation (convenient for local testing).
+2. Present at a known absolute path on **every Slurm cluster** that will run any stage of xfer (`xfer manifest build` and the transfer array).
+3. Restricted to `0600` permissions — it contains S3 secret keys.
+
+The authoritative template is `rclone.conf.example` at the repo root.
+
+## Critical invariant — paths are per-system
+
+The `--rclone-config` flag always takes an **absolute path on whichever host the xfer command runs on**. That host differs between stages:
+
+- `xfer manifest build` — path on the **build cluster's login node**
+- `xfer slurm render` — the path baked into `sbatch_array.sh` must be valid on the **transfer cluster's compute nodes**
+- Local `uv run xfer analyze/shard/rebase` — path on the **workstation** (only if the user wants to sanity-check remotes locally via host rclone)
+
+A single workstation config does not automatically work on the cluster.
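One low-tech way to keep the `(system, path)` pairs straight is a plain-text note that can be grepped later. The filename and values here are invented:

```shell
# Scratch note mapping each system to its rclone.conf path (all values made up).
cat > /tmp/rclone-paths.txt <<'EOF'
workstation /home/me/.config/rclone/rclone.conf
alpha-login /home/me/.config/rclone/rclone.conf
beta-login  /shared/rclone/me.conf
EOF

# Look up the path for one system before invoking a stage there.
awk '$1 == "beta-login" { print $2 }' /tmp/rclone-paths.txt
# → /shared/rclone/me.conf
```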
+**Always ask for and record the path per system.** Do not assume `~/.config/rclone/rclone.conf` on the workstation resolves the same way on the cluster — home directories, shared scratch, and site-standard locations (e.g., `/etc/rclone/rclone.conf`) all differ.
+
+## Step 1 — Inventory what's needed
+
+Ask the user:
+
+1. **Which remotes?** At minimum `source` and `destination`. Some transfers also need a staging remote. Each remote needs a short config-section name (e.g., `s3-src`, `s3-dest`, `weka-s3`).
+2. **Per remote**: endpoint URL, access key ID, secret access key, provider (Other / AWS / Wasabi / Ceph / Minio / etc.), region (AWS only), and whether path-style addressing is required (true for most on-prem / VAST / Weka / Minio; false for AWS).
+3. **Remote naming consistency across clusters.** If the same bucket will be referenced from more than one cluster, encourage the user to use **the same remote name everywhere** — this avoids needing `xfer-manifest-rebase` later. If a cluster has a POSIX mount of the same bucket, that's fine too; the build skill can point at the POSIX path directly on that cluster.
+4. **Target file path on the workstation.** Default `~/.config/rclone/rclone.conf`. If the file already exists, default to **appending** new remotes rather than overwriting — confirm before touching existing sections.
+
+Collect secrets **one at a time** and do not echo them back in your user-facing text. When running the write step, never log the secret values.
+
+## Step 2 — Write the config
+
+For each remote, emit a section following the project's template style (`rclone.conf.example`):
+
+```ini
+[<remote-name>]
+type = s3
+provider = <provider>
+access_key_id = <access-key-id>
+secret_access_key = <secret-access-key>
+endpoint = <endpoint-url>
+region = <region>
+force_path_style = <true|false>
+no_check_bucket = true
+```
+
+Notes:
+- `force_path_style = true` for on-prem / VAST / Weka / Minio / Ceph. Leave unset for AWS.
+- `no_check_bucket = true` is safe for xfer — workers assume the bucket exists; skipping the probe is faster at high concurrency.
+- For AWS, set `region` and omit `endpoint`.
+
+Write the file, then immediately lock it down:
+
+```bash
+chmod 600 <rclone-conf-path>
+```
+
+If appending to an existing file, confirm with the user that no section-name collisions exist before writing; if a section already exists, ask whether to overwrite or pick a different name.
+
+## Step 3 — Smoke-test (optional, workstation-local)
+
+If the user has `rclone` installed locally, offer to run:
+
+```bash
+rclone --config <rclone-conf-path> listremotes
+rclone --config <rclone-conf-path> lsd <remote-name>: --max-depth 1
+```
+
+The `lsd` probe validates credentials and endpoint reachability. If `rclone` is not installed on the workstation, skip — the real validation happens on the cluster at job time anyway.
+
+## Step 4 — Deploy to each Slurm cluster
+
+For **every cluster** that will run any xfer stage:
+
+1. Ask the user the **absolute path** on that cluster where rclone.conf should live (e.g., `~/.config/rclone/rclone.conf`, `/home/<user>/xfer/rclone.conf`, or a site-provided shared location).
+2. Check whether the cluster already has one at that path:
+
+   ```bash
+   ssh <user>@<cluster> 'test -f <path> && echo PRESENT || echo MISSING'
+   ```
+
+3. If missing, copy it over (**confirm with the user first — this transmits credentials**):
+
+   ```bash
+   scp <local-rclone-conf> <user>@<cluster>:<path>
+   ssh <user>@<cluster> 'chmod 600 <path>'
+   ```
+
+4. Record the `(cluster, path)` pairs in the current conversation so `xfer-manifest-build`, `xfer-slurm-render`, and `xfer-slurm-submit` can use the correct path per system. Example shape:
+
+   ```
+   workstation → ~/.config/rclone/rclone.conf
+   alpha-login.example.com → /home/joe/.config/rclone/rclone.conf
+   beta-login.example.com → /shared/rclone/joe.conf
+   ```
+
+## Step 5 — Remind about credential hygiene
+
+Tell the user, explicitly:
+
+- The file must be `0600` on every host it lives on.
+- If credentials rotate, all copies must be updated — xfer does not re-fetch.
+- Do not commit rclone.conf to the repo (`.gitignore` already excludes it; confirm if unsure).
+- If a cluster has a site-managed shared rclone.conf (e.g., admin-provisioned at `/etc/rclone/rclone.conf`), prefer that and skip the scp step — don't duplicate credentials.
+
+## After this skill
+
+If the user is starting fresh, continue to `xfer-manifest-build`. If they just added a remote for a new cluster, they're good to re-run whichever stage prompted this detour.

diff --git a/.claude/skills/xfer-slurm-render/SKILL.md b/.claude/skills/xfer-slurm-render/SKILL.md
new file mode 100644
index 0000000..5c625ce
--- /dev/null
+++ b/.claude/skills/xfer-slurm-render/SKILL.md
@@ -0,0 +1,100 @@
+---
+name: xfer-slurm-render
+description: Render Slurm batch scripts (`worker.sh`, `sbatch_array.sh`, `submit.sh`, `config.resolved.json`) for an xfer transfer. Use after manifest shard — and after rebase, if the vantage point changed — whenever the user wants to generate the runnable job artifacts. Picks partition and resources informed by current cluster load.
+---
+
+# xfer-slurm-render
+
+Drives `xfer slurm render` — produces the runnable job artifacts under `run/`.
+
+## Operating model
+
+Runs **locally on the workstation**. No SSH needed to render, but you will SSH to candidate login nodes in step 1 to check cluster load before picking the transfer cluster.
+
+## Step 1 — Pick the transfer cluster (load-aware, CPU-only preferred)
+
+If the user already decided which cluster to transfer on, skip to step 2. Otherwise, help them pick.
+
+1. Ask for the list of candidate Slurm clusters with login-node SSH access.
+2. For each candidate, probe load and CPU-only partitions:
+
+```bash
+ssh <user>@<host> '
+  sinfo -h -o "%P %a %l %D %C %G" | head -20
+  echo "---"
+  squeue -h -o "%T %C" | awk '"'"'{s[$1]+=$2} END{for(k in s) print k, s[k]}'"'"'
+'
+```
+
+- `%G` is the GRES column — a partition reporting `(null)` or no GPUs is CPU-only.
+**Prefer CPU-only partitions** — transfers are I/O-bound and GPU nodes are wasted here.
+- From `%C` (allocated/idle/other/total cores), compute idle-fraction per partition. Prefer the partition with the highest idle fraction.
+- From `squeue` aggregated by state, check how many jobs are queued (`PD`) vs running (`R`). High pending counts mean the next job will wait.
+
+Present a short comparison to the user and recommend one. Let them override.
+
+## Step 2 — Confirm "vantage-unchanged" invariant
+
+Before rendering, verify the manifest and shards reflect the transfer host's view of source/dest:
+
+- If the build host differs from the chosen transfer host **and** their rclone.conf views / POSIX mounts differ, **stop and invoke `xfer-manifest-rebase` first**. Otherwise workers will hit wrong-URI errors.
+
+Don't assume — if unsure, check `run/config.resolved.json` (if a prior render exists) against the chosen cluster's rclone remotes.
+
+## Step 3 — Gather render inputs
+
+Collect from the user (with defaults from `run/analyze.json` and the chosen partition):
+
+| Flag | Source |
+| ---------------------- | ------------------------------------------------------- |
+| `--run-dir` | `run` (or whatever the prior skills used) |
+| `--num-shards` | count of `run/shards/shard_*.jsonl` |
+| `--array-concurrency` | ask user; typically 32–256, bounded by partition |
+| `--job-name` | ask user; default `xfer-<dataset>` |
+| `--time-limit` | ask user; default 24:00:00 |
+| `--partition` | from step 1 |
+| `--cpus-per-task` | 4 (default) |
+| `--mem` | 8G (default; bump for large-files profile) |
+| `--rclone-image` | `rclone/rclone:latest` |
+| `--rclone-config` | absolute path to rclone.conf **on the transfer cluster's compute nodes** (see note below) |
+| `--rclone-flags` | `suggested_flags` from `run/analyze.json` |
+| `--max-attempts` | 3 (default) |
+| `--sbatch-extras` | site-specific `--account=...`, `--qos=...`, etc. |
+| `--pyxis-extra` | extra `srun --container-*` flags if site requires them |
+
+The `--rclone-config` path is baked into `sbatch_array.sh` and resolved **on the transfer cluster at job time**, not on the workstation. It must:
+
+- Be an absolute path valid on that cluster's compute nodes (home dirs and shared paths differ between sites).
+- Exist with `0600` permissions before the job starts.
+
+If the user doesn't already have the config deployed to this cluster at a known path, **stop and invoke `xfer-rclone-config`** to set it up and record the path. Don't guess the path — a wrong value here means every array task will fail identically at container start.
+
+## Step 4 — Render
+
+```bash
+uv run xfer slurm render \
+  --run-dir <run-dir> \
+  --num-shards <num-shards> \
+  --array-concurrency <array-concurrency> \
+  --job-name <job-name> \
+  --time-limit <time-limit> \
+  --partition <partition> \
+  --cpus-per-task <cpus-per-task> \
+  --mem <mem> \
+  --rclone-image <rclone-image> \
+  --rclone-config <rclone-conf-path-on-cluster> \
+  --rclone-flags "<suggested-flags>" \
+  --max-attempts 3 \
+  --sbatch-extras '<site-extras>' \
+  --pyxis-extra '<pyxis-extras>'
+```
+
+## Step 5 — Verify the outputs
+
+After render, show the user:
+- `run/sbatch_array.sh` — read and show the `#SBATCH` header so they can eyeball partition/time/mem
+- `run/config.resolved.json` — the frozen run config
+- `run/worker.sh` exists and is executable
+
+## After this skill
+
+Recommend `xfer-slurm-submit`.

diff --git a/.claude/skills/xfer-slurm-submit/SKILL.md b/.claude/skills/xfer-slurm-submit/SKILL.md
new file mode 100644
index 0000000..7535a4e
--- /dev/null
+++ b/.claude/skills/xfer-slurm-submit/SKILL.md
@@ -0,0 +1,98 @@
+---
+name: xfer-slurm-submit
+description: Copy a rendered xfer run directory to a Slurm login node and submit the array job via sbatch. Use after `xfer slurm render` whenever the user is ready to kick off the actual transfer. Handles workstation→cluster staging and post-submit monitoring pointers.
+---
+
+# xfer-slurm-submit
+
+Submits the rendered Slurm array job. Assumes `xfer-slurm-render` already produced `run/worker.sh`, `run/sbatch_array.sh`, `run/shards/`, etc. on the workstation.
+
+## Operating model
+
+The user renders locally and submits to the **transfer cluster's login node**. This skill:
+1. Copies (or syncs) the run directory to the login node.
+2. SSHes in and runs `sbatch`.
+3. Returns the job id and monitoring pointers.
+
+## Step 1 — Identify the transfer cluster and paths
+
+From `run/config.resolved.json` (or ask the user):
+- Transfer cluster login node (hostname, user)
+- Destination path for the run dir on the cluster (e.g., `~/xfer-runs/`)
+- Path to `rclone.conf` on the cluster (referenced from `config.resolved.json`)
+
+Verify the rclone.conf path recorded in `run/config.resolved.json` exists on the transfer cluster — this is the path the compute nodes will see, which may differ from the workstation's path:
+
+```bash
+ssh <user>@<host> 'test -f <rclone-conf-path> && stat -c "%a" <rclone-conf-path>'
+```
+
+Expect `600`. If the file is missing, **stop and invoke `xfer-rclone-config`** to deploy it — that skill handles credential hygiene, 0600 permissions, and recording the per-cluster path. Don't `scp` the file directly from this skill.
+
+If the file exists but isn't `0600`, ask the user whether to fix it (`chmod 600 <rclone-conf-path>` via SSH) before proceeding.
+
+## Step 2 — Stage the run directory
+
+Use `rsync -av` so re-submits don't re-send unchanged shards:
+
+```bash
+rsync -av --exclude='logs/' --exclude='state/' \
+  ./<run-dir>/ \
+  <user>@<host>:<remote-run-dir>/
+```
+
+Exclude `logs/` and `state/` — those are produced by the running job and re-pulling them would clobber live progress.
+
+If this is a re-submit after failure, do **not** `--delete` — preserve any `.done` markers so completed shards are skipped on replay.
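The `.done`/`.fail` markers can be summarized in one pipeline before deciding whether to re-submit. A sketch against a throwaway demo directory; marker naming follows this skill's description, and the shard filenames are invented:

```shell
# Fake state dir with two successes and one failure.
dir=/tmp/xfer_state_demo
mkdir -p "$dir"
touch "$dir"/shard_0000.done "$dir"/shard_0001.done "$dir"/shard_0002.fail

# Count markers by extension: how many shards finished vs failed.
ls "$dir" | sed 's/.*\.//' | sort | uniq -c
```

Run against the real `state/` directory on the cluster, this gives a quick done/fail tally.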
+
+## Step 3 — Submit
+
+```bash
+ssh <user>@<host> '
+  cd <remote-run-dir> &&
+  sbatch --export=NONE sbatch_array.sh
+'
+```
+
+`--export=NONE` is important — it prevents the workstation's or parent job's `SLURM_*` env from leaking into the array job (per the existing project convention; see `.claude/settings.local.json`).
+
+Capture the returned job id (e.g., `Submitted batch job 12345`).
+
+## Step 4 — Report monitoring commands
+
+Print back to the user — ready to paste:
+
+```bash
+# Live queue state
+ssh <user>@<host> 'squeue -j <jobid>'
+
+# Completed tasks summary
+ssh <user>@<host> 'sacct -j <jobid> --format=JobID,State,Elapsed,ExitCode'
+
+# Per-shard logs
+ssh <user>@<host> 'ls <remote-run-dir>/logs/ | tail -20'
+
+# Back off array concurrency live (if S3 endpoint is rate-limiting)
+ssh <user>@<host> 'scontrol update ArrayTaskThrottle=<n> JobId=<jobid>'
+
+# State markers: .done = success, .fail = last attempt failed
+ssh <user>@<host> 'ls <remote-run-dir>/state/ | sort | uniq -c | head'
+```
+
+## Step 5 — Tell the user about resume semantics
+
+Remind them of xfer's safety net:
+- Failed shards auto-requeue up to `--max-attempts` (set at render time).
+- Re-submitting the same `sbatch_array.sh` is safe — `.done` markers cause completed shards to skip.
+- To fully restart, delete `run/state/` on the cluster before re-submitting.
+
+## Safety
+
+- **Always confirm before copying rclone.conf** — it contains credentials.
+- **Always confirm the sbatch command** before running it — this is the moment real cluster resources get consumed.
+- Never run `scontrol update` or `scancel` on a running job without explicit user direction.
+- Do not use `rsync --delete` against a running job's run dir.
+
+## After this skill
+
+The pipeline is live. Direct the user to monitor via the commands printed in step 4.
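Capturing the job id from `sbatch`'s stdout (format: `Submitted batch job <id>`) can be scripted with plain parameter expansion. A stand-in string replaces the live submit here:

```shell
# Stand-in for: out=$(ssh ... 'sbatch --export=NONE sbatch_array.sh')
out="Submitted batch job 12345"

# ${out##* } strips everything up to the last space, leaving the id.
jobid=${out##* }
echo "$jobid"
# → 12345
```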
diff --git a/.gitignore b/.gitignore
index c18dd8d..6658aed 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
 __pycache__/
+.claude/settings.local.json

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..5a3f059
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,37 @@
+# CLAUDE.md — xfer
+
+`xfer` orchestrates large S3↔S3 (and POSIX↔S3) data transfers via rclone-in-a-container, Slurm job arrays, and pyxis/enroot. Manifest-driven, sharded, resumable.
+
+## Mental model
+
+The **user's local workstation** is the orchestrator. Long-running and compute-bound work happens on **Slurm clusters** reached via SSH/scp to login nodes. Commands like `uv run xfer manifest build` that invoke `srun` must run on a login node, not the workstation. `analyze`, `shard`, `rebase`, and `render` are pure file processing and run locally.
+
+## Cross-cutting invariants
+
+- **Paths are per-system.** `--rclone-config`, the xfer repo path, and the run directory all differ between workstation, build cluster, and transfer cluster. Never assume the workstation's path resolves on the cluster. Always ask, verify with `ssh ... test -f`, or consult `run/config.resolved.json`.
+- **POSIX-first for manifest build.** If any Slurm cluster has a POSIX mount of the source bucket, build the manifest there against the POSIX path. Listing is latency-bound; POSIX beats S3 by a large margin.
+- **CPU-only, load-aware for transfer.** Prefer CPU-only partitions for both build and transfer. Pick the transfer cluster by current load (`sinfo`/`squeue` on candidate login nodes), not habit.
+- **Vantage change ⇒ rebase.** If the manifest was built on a host whose view of source/dest differs from the transfer host's view, run `xfer manifest rebase` and re-shard before render. Skipping this makes every array task fail identically.
+- **Credential hygiene.** rclone.conf is `0600` everywhere it lives. Never commit it (`.gitignore` already excludes it). Never echo secrets back to the user. Confirm before transmitting credentials over scp.
+- **Confirm before cluster-side actions.** SSH probes and `rsync` of local files are fine. `sbatch`, `scontrol update`, `scancel`, and any write to a shared path require explicit user confirmation.
+
+## Workflow skills
+
+Use these in order; each `Skill` trigger runs the corresponding stage. Invoke explicitly when the user's intent matches.
+
+| Skill | Stage |
+| ------------------------- | ---------------------------------------------------------------------- |
+| `xfer-rclone-config` | Create rclone.conf, deploy to each cluster that will run xfer jobs |
+| `xfer-manifest-build` | Run `xfer manifest build` on a login node (POSIX-preferred source) |
+| `xfer-manifest-analyze` | File-size histogram → suggested rclone flags + shard count |
+| `xfer-manifest-shard` | Byte-balanced split of the manifest into `run/shards/` |
+| `xfer-manifest-rebase` | Remap source/dest roots when vantage changes; re-shard after |
+| `xfer-slurm-render` | Render `worker.sh` / `sbatch_array.sh` / `config.resolved.json` |
+| `xfer-slurm-submit` | Stage run dir to transfer cluster, `sbatch`, return monitoring cmds |
+
+## Dev conventions
+
+- Python ≥ 3.10, managed with `uv` (`uv venv && uv sync`; `uv run xfer --help`).
+- Formatting: `uv run black .`; pre-commit hook available via `uv run pre-commit install`.
+- Branch names: `<type>/<description>`, where type ∈ {feature, patch, docs}.
+- Do **not** squash-merge PRs.

From 4e677c9fc29c0b28b469745db20392187c3d508b Mon Sep 17 00:00:00 2001
From: Joe Schoonover
Date: Fri, 17 Apr 2026 13:52:26 -0400
Subject: [PATCH 2/5] Document Claude Code skills and conventions in README

Adds a "Using Claude Code with xfer" section covering the seven shipped
skills and the workflow invariants they enforce (workstation/cluster split,
per-system paths, POSIX-first manifest build, load-aware transfer cluster
selection, rebase-on-vantage-change, credential hygiene) so the same
guidance applies to manual runs.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/README.md b/README.md
index 654d766..d589873 100644
--- a/README.md
+++ b/README.md
@@ -309,6 +309,37 @@ run/
 
 ---
 
+## Using Claude Code with xfer
+
+This repo ships a set of [Claude Code](https://claude.com/claude-code) skills under `.claude/skills/` that walk through each stage of a transfer. Each skill drives the corresponding `xfer` subcommand and encodes conventions we've found important on real clusters.
+
+### Available skills
+
+Invoke each by intent in a Claude Code session, or explicitly via `/<skill-name>`:
+
+| Skill                    | Stage                                                                |
+| ------------------------ | -------------------------------------------------------------------- |
+| `xfer-rclone-config`     | Create `rclone.conf` and deploy it to each cluster                   |
+| `xfer-manifest-build`    | Run `xfer manifest build` on a login node (POSIX source preferred)   |
+| `xfer-manifest-analyze`  | File-size histogram → suggested rclone flags and shard count         |
+| `xfer-manifest-shard`    | Byte-balanced split of the manifest into shards                      |
+| `xfer-manifest-rebase`   | Remap source/dest roots when the transfer host's view differs        |
+| `xfer-slurm-render`      | Render `worker.sh` / `sbatch_array.sh` / `config.resolved.json`      |
+| `xfer-slurm-submit`      | Stage the run directory to the cluster and `sbatch`                  |
+
+See `CLAUDE.md` for the cross-cutting context Claude loads in every session in this repo.
+
+### Conventions
+
+The skills (and `CLAUDE.md`) enforce a few invariants. These apply whether you drive xfer through Claude Code or by hand:
+
+- **Workstation orchestrates, clusters execute.** Run xfer from a local checkout in a `uv` environment. SSH to Slurm login nodes for `manifest build` and `sbatch`; `analyze`, `shard`, `rebase`, and `render` run locally.
+- **Paths are per-system.** `--rclone-config`, the xfer repo path, and the run directory all differ between workstation, build cluster, and transfer cluster. Always resolve the correct absolute path on whichever host the command runs on — do not assume a workstation path resolves identically on a cluster.
+- **POSIX-first manifest build.** If any Slurm cluster has a POSIX mount of the source bucket, build the manifest there against the POSIX path. Listing is latency-bound, and POSIX beats S3 by a wide margin.
+- **CPU-only, load-aware transfer.** Prefer CPU-only partitions for both build and transfer. Pick the transfer cluster by current `sinfo`/`squeue` load rather than by habit.
+- **Vantage change ⇒ rebase.** When the host that will run the transfer has a different view of source or destination than the host that built the manifest, run `xfer manifest rebase` and re-shard before render. Skipping this makes every array task fail identically.
+- **Credential hygiene.** Keep `rclone.conf` at mode `0600` on every host it lives on. Never commit it to the repo, and confirm before transmitting it over `scp`.
+
 ## Design notes
 
 * **Manifest is immutable** → enables reproducibility and auditing

From 2cb48f158a46edd3840f4d0bd489296a0fcad435 Mon Sep 17 00:00:00 2001
From: Joe Schoonover
Date: Fri, 17 Apr 2026 14:19:49 -0400
Subject: [PATCH 3/5] Add shard-count suggestion and render knobs used by Claude skills

The skills claimed manifest analyze emitted `suggested_shard_count`, that slurm render picked up a rebased manifest automatically, and that render could be run without a local rclone.conf. None of those matched the CLI. Make the CLI match the documented behavior:

* `xfer manifest analyze` now emits `suggested_shard_count`, `shard_count_reasoning`, and `shard_count_assumptions` based on a 10 TiB per-shard cap, `4 * array_concurrency` slack, and an optional core budget. New `--assumed-*` / `--max-shard-bytes-tb` flags let the user sharpen the suggestion.
* `xfer slurm render` gains `--manifest`, so rebased manifests can be consumed without clobbering `run/manifest.jsonl`, and its `--rclone-config` no longer requires a local file (it warns instead).

* Skill docs updated to match, plus pytest coverage for the shard-count heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .claude/skills/xfer-manifest-analyze/SKILL.md | 26 ++++--
 .claude/skills/xfer-manifest-rebase/SKILL.md  | 20 ++++-
 .claude/skills/xfer-manifest-shard/SKILL.md   | 21 +++--
 .claude/skills/xfer-slurm-render/SKILL.md     | 18 ++--
 pyproject.toml                                |  4 +
 src/xfer/cli.py                               | 56 +++++++++++--
 src/xfer/est.py                               | 84 +++++++++++++++++++
 tests/test_est_shard_count.py                 | 72 ++++++++++++++++
 uv.lock                                       | 50 +++++++++++
 9 files changed, 316 insertions(+), 35 deletions(-)
 create mode 100644 tests/test_est_shard_count.py

diff --git a/.claude/skills/xfer-manifest-analyze/SKILL.md b/.claude/skills/xfer-manifest-analyze/SKILL.md
index 5ccb4e2..9c7b7bd 100644
--- a/.claude/skills/xfer-manifest-analyze/SKILL.md
+++ b/.claude/skills/xfer-manifest-analyze/SKILL.md
@@ -23,23 +23,39 @@ uv run xfer manifest analyze \
   --out run/analyze.json
 ```
 
-Optionally, if the user already has a preferred base set of rclone flags they want to layer on top, pass `--base-flags "<flags>"`.
+Optional flags that tune the **shard-count suggestion** (not the rclone flag suggestion):
+
+| Flag                            | Default | What it does                                                                |
+| ------------------------------- | ------- | --------------------------------------------------------------------------- |
+| `--assumed-cpus-per-task`       | `4`     | Cores each worker will request. Matches `xfer slurm render` default.        |
+| `--assumed-array-concurrency`   | `64`    | Expected Slurm array concurrency. Matches `xfer slurm render` default.      |
+| `--assumed-core-budget`         | unset   | Total cores the partition will make available (supply from `sinfo`).        |
+| `--max-shard-bytes-tb`          | `10`    | Per-shard byte cap. No single shard should carry more than this.            |
+| `--base-flags "<flags>"`        | —       | Prepend the user's preferred rclone flags to the suggested ones.            |
+
+If the user already knows the transfer cluster's available core budget, pass it — the shard-count suggestion will be sharper. Otherwise the default (concurrency + bytes-only) is fine.
 
 ## Step 3 — Report
 
 Read `run/analyze.json` and report to the user:
 
-1. **Dataset shape**: total object count, total bytes, median size, p90/p99 sizes, and the histogram bin counts (power-of-2 edges).
+1. **Dataset shape**: total object count, total bytes, median size, p10/p90 sizes, and the histogram bin counts (power-of-2 edges).
 2. **Profile classification**: which profile the analyzer picked (`small_files`, `large_files`, or `mixed`) and the reasoning (e.g., ">70% of objects are under 1 MiB").
-3. **Suggested rclone flags**: the concrete string to pass to `--rclone-flags` for render. Typical examples:
+3. **Suggested rclone flags** (`suggested_flags`): the concrete string to pass to `--rclone-flags` for render. Typical examples:
    - small_files → `--transfers 64 --checkers 128 --fast-list`
    - large_files → `--transfers 16 --checkers 32 --buffer-size 256M`
    - mixed → `--transfers 32 --checkers 64 --fast-list`
-4. **Suggested shard count**: derive from total object count / target objects-per-shard (aim for ~10k–50k objects per shard for small-file workloads, smaller for large-file). Cap at what the chosen transfer cluster can reasonably host in a job array. If you don't yet know the transfer cluster, give a range and defer to `xfer-manifest-shard`.
+4. **Suggested shard count** (`suggested_shard_count`, plus `shard_count_reasoning` and `shard_count_assumptions`). The heuristic:
+   - If `total_bytes` is below the per-shard cap (default 10 TiB), **1 shard** — don't shard small datasets.
+   - Otherwise `ceil(total_bytes / cap)` shards, upper-bounded by `4 × array_concurrency` and (if a core budget was supplied) `core_budget // cpus_per_task`.
+
+   Quote `shard_count_reasoning` verbatim back to the user so they can see the trade-offs.
 
 ## Step 4 — Persist for downstream skills
 
-`run/analyze.json` is the source of truth for flag/shard decisions. `xfer-manifest-shard` and `xfer-slurm-render` both read it. Don't re-derive flags by hand in those skills — point at this file.
+`run/analyze.json` is the source of truth for flag/shard decisions. `xfer-manifest-shard` reads `suggested_shard_count` and `xfer-slurm-render` reads `suggested_flags` — point at this file, don't re-derive.
+
+If the user's plan changes (different transfer cluster, different concurrency cap), re-run `xfer manifest analyze` with updated `--assumed-*` flags before calling `xfer-manifest-shard`.
 
 ## After this skill
 
diff --git a/.claude/skills/xfer-manifest-rebase/SKILL.md b/.claude/skills/xfer-manifest-rebase/SKILL.md
index f249e99..841ec3e 100644
--- a/.claude/skills/xfer-manifest-rebase/SKILL.md
+++ b/.claude/skills/xfer-manifest-rebase/SKILL.md
@@ -40,7 +40,9 @@ Always write to a new file (don't overwrite `manifest.jsonl`). Keeping the origi
 
 ## Step 3 — Re-shard
 
-Sharding is derived from the manifest, so **re-shard after rebasing**:
+Sharding is derived from the manifest, so **re-shard after rebasing**. The existing `run/shards/` directory contains pre-rebase paths and must be replaced.
+
+Confirm with the user before removing the old shards (`run/shards` is small but removing it is irreversible locally):
 
 ```bash
 rm -rf run/shards
@@ -50,16 +52,26 @@ uv run xfer manifest shard \
   --num-shards <N>
 ```
 
-(Or invoke `xfer-manifest-shard`.) Byte balance won't change meaningfully, but the shard files need to carry the rebased paths or workers will try to copy from the wrong URI.
+(Or invoke `xfer-manifest-shard` with the rebased manifest as input.) Byte balance won't change meaningfully, but the shard files need to carry the rebased paths or workers will try to copy from the wrong URI.
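Conceptually, the remap that rebase performs is just a prefix swap over manifest records. A minimal sketch follows; the `source` field name and record layout are assumptions for illustration, not the real `xfer.manifest.v1` schema:

```python
import json


def rebase_line(line: str, old_root: str, new_root: str) -> str:
    """Relabel one JSONL manifest record from the old vantage point to the new one.

    No data moves: only the path each record points at changes.
    """
    rec = json.loads(line)
    if rec["source"].startswith(old_root):
        rec["source"] = new_root + rec["source"][len(old_root):]
    return json.dumps(rec)
```

For example, a record listed as `s3:src-bucket/a/b.bin` on the build host becomes `/mnt/src/a/b.bin` when the transfer host sees the same bucket as a POSIX mount. This is also why the shards must be regenerated: they carry copies of these paths.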
+
+## Step 4 — Point `xfer slurm render` at the rebased manifest
+
+`xfer slurm render` reads `source_root` and `dest_root` from a manifest file. By default it reads `<run-dir>/manifest.jsonl`, which is intentionally left at the pre-rebase vantage point as an audit record. Pass `--manifest` to read the rebased file instead:
 
-## Step 4 — Point downstream skills at the rebased manifest
+```bash
+uv run xfer slurm render \
+  --run-dir run \
+  --manifest run/manifest.rebased.jsonl \
+  ...
+```
 
-When you invoke `xfer-slurm-render` next, pass `--run-dir run` but ensure `config.resolved.json` references the rebased manifest. If the user plans a fresh `xfer slurm render`, that's automatic (render reads from `run/shards/`).
+Without `--manifest`, render would use the original roots and every array task would target the wrong URI.
 
 ## Safety
 
 - Never delete the original manifest — always keep `run/manifest.jsonl` as an audit trail alongside `run/manifest.rebased.jsonl`.
 - Rebase is a remap, not a content migration. It does not move data. It only relabels what each shard points to.
+- Confirm before `rm -rf run/shards` — the user may want to move the old shards aside rather than delete them.
 
 ## After this skill
 
diff --git a/.claude/skills/xfer-manifest-shard/SKILL.md b/.claude/skills/xfer-manifest-shard/SKILL.md
index a87ecc7..dbc0e75 100644
--- a/.claude/skills/xfer-manifest-shard/SKILL.md
+++ b/.claude/skills/xfer-manifest-shard/SKILL.md
@@ -13,22 +13,21 @@ Runs **locally on the workstation**. Pure file processing. No Slurm/SSH needed.
 
 ## Step 1 — Read the analyze output
 
-Read `run/analyze.json` (from `xfer-manifest-analyze`). Use its `suggested_shard_count` / profile as the starting point. If analyze hasn't been run yet, invoke `xfer-manifest-analyze` first — don't guess shard counts from the raw manifest.
+Read `run/analyze.json` (from `xfer-manifest-analyze`) and use `suggested_shard_count` directly as the shard count. The analyzer already factors in the 10 TiB/shard cap, the expected array concurrency, and (if supplied) the core budget.
 
-## Step 2 — Reconcile shard count with cluster resources
+If `run/analyze.json` doesn't exist yet, invoke `xfer-manifest-analyze` first — don't guess shard counts from the raw manifest.
 
-The right shard count depends on **both** rclone settings (from analyze) **and** the transfer cluster's available resources. Ask the user:
+## Step 2 — Decide whether to override
 
-1. Which cluster will run the transfer? (Same as build host, or different?)
-2. What's the target array concurrency — how many shards should run at once? Typical range: 32–256, capped by the partition's `MaxArraySize` and the throughput both S3 endpoints can handle.
-3. What's the partition's per-node core/memory budget?
+Only override `suggested_shard_count` if one of the inputs that fed it has changed since analyze ran:
 
-Rule of thumb:
-- Total shards ≈ max(suggested_shard_count_from_analyze, 4 × array_concurrency). This gives the scheduler enough slack to keep the array fully packed even as slow shards trail.
-- For small-files profiles (heavy listing, light bytes), bias toward **more shards** of fewer objects each.
-- For large-files profiles (heavy bytes, few objects), bias toward **fewer shards** with more bytes each — byte-balancing matters more than object count.
+- The transfer cluster is different from what analyze assumed (different core budget).
+- The array concurrency cap is different from what analyze assumed (defaults: `--assumed-array-concurrency=64`).
+- The user wants a different per-shard byte cap (default 10 TiB).
 
-State your recommendation with the reasoning, then confirm before running.
+In that case, **re-run `xfer-manifest-analyze`** with updated `--assumed-*` flags rather than hand-picking a new number here. Sharing the reasoning/assumptions via `run/analyze.json` is how downstream skills stay coherent.
+
+Show the user `suggested_shard_count` alongside `shard_count_reasoning` from analyze, then confirm before running.
 
 ## Step 3 — Run shard
 
diff --git a/.claude/skills/xfer-slurm-render/SKILL.md b/.claude/skills/xfer-slurm-render/SKILL.md
index 5c625ce..1078dac 100644
--- a/.claude/skills/xfer-slurm-render/SKILL.md
+++ b/.claude/skills/xfer-slurm-render/SKILL.md
@@ -57,16 +57,19 @@ Collect from the user (with defaults from `run/analyze.json` and the chosen part
 | `--rclone-image` | `rclone/rclone:latest` |
 | `--rclone-config` | absolute path to rclone.conf **on the transfer cluster's compute nodes** (see note below) |
 | `--rclone-flags` | `suggested_flags` from `run/analyze.json` |
-| `--max-attempts` | 3 (default) |
+| `--max-attempts` | 5 (default) |
 | `--sbatch-extras` | site-specific `--account=...`, `--qos=...`, etc. |
 | `--pyxis-extra` | extra `srun --container-*` flags if site requires them |
+| `--manifest` | optional; path to a specific manifest (pass `run/manifest.rebased.jsonl` if the rebase skill ran) |
 
-The `--rclone-config` path is baked into `sbatch_array.sh` and resolved **on the transfer cluster at job time**, not on the workstation. It must:
+The `--rclone-config` path is baked into `sbatch_array.sh` and resolved **on the transfer cluster at job time**, not at render time. Render itself no longer requires the file to exist on the workstation — it only prints a warning if the local path is missing, since the actual consumer is the compute node. Still:
 
-- Be an absolute path valid on that cluster's compute nodes (home dirs and shared paths differ between sites).
-- Exist with `0600` permissions before the job starts.
+- The path must be an absolute path valid on the cluster's compute nodes.
+- It must exist with `0600` permissions **on the cluster** before the job starts.
 
-If the user doesn't already have the config deployed to this cluster at a known path, **stop and invoke `xfer-rclone-config`** to set it up and record the path. Don't guess the path — a wrong value here means every array task will fail identically at container start.
+If the user doesn't already have the config deployed to the cluster at a known path, invoke `xfer-rclone-config` to set it up and record the path. A wrong value here means every array task will fail identically at container start, so double-check the path.
+
+Use `--manifest` whenever the user ran `xfer-manifest-rebase`: render reads `source_root` / `dest_root` from a manifest file, and the default path (`<run-dir>/manifest.jsonl`) is intentionally left at the pre-rebase vantage point as an audit record. Passing `--manifest run/manifest.rebased.jsonl` is how render picks up the rebased roots.
 
 ## Step 4 — Render
 
@@ -83,9 +86,10 @@ uv run xfer slurm render \
   --rclone-image <image> \
   --rclone-config <path-on-cluster> \
   --rclone-flags "<flags>" \
-  --max-attempts 3 \
+  --max-attempts 5 \
   --sbatch-extras '<extras>' \
-  --pyxis-extra '<extras>'
+  --pyxis-extra '<extras>' \
+  [--manifest run/manifest.rebased.jsonl]  # only after rebase
 ```
 
 ## Step 5 — Verify the outputs
 
diff --git a/pyproject.toml b/pyproject.toml
index 9c94356..fb85195 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -30,5 +30,9 @@ select = ["E", "F", "I", "B", "UP"]
 dev = [
     "black>=26.1.0",
     "pre-commit>=4.5.1",
+    "pytest>=8.0.0",
 ]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
diff --git a/src/xfer/cli.py b/src/xfer/cli.py
index 2ae9ffc..25d364f 100644
--- a/src/xfer/cli.py
+++ b/src/xfer/cli.py
@@ -467,6 +467,25 @@ def manifest_analyze(
         "--retries 10 --low-level-retries 20 --stats 600s --progress",
         help="Base rclone flags to always include",
     ),
+    assumed_cpus_per_task: int = typer.Option(
+        4,
+        min=1,
+        help="Cores per shard worker (matches slurm_render default). Used only for shard-count suggestion.",
+    ),
+    assumed_array_concurrency: int = typer.Option(
+        64,
+        min=1,
+        help="Expected Slurm array concurrency (matches slurm_render default). Used only for shard-count suggestion.",
+    ),
+    assumed_core_budget: Optional[int] = typer.Option(
+        None,
+        help="Total cores the transfer cluster's partition will make available. Used only for shard-count suggestion. If omitted, the core constraint is skipped.",
+    ),
+    max_shard_bytes_tb: int = typer.Option(
+        10,
+        min=1,
+        help="Per-shard byte cap in TiB (no single shard should exceed this).",
+    ),
 ) -> None:
     """
     Analyze manifest file sizes and suggest optimal rclone flags.
@@ -481,6 +500,7 @@ def manifest_analyze(
         format_histogram_data,
         human_bytes,
         suggest_rclone_flags_from_sizes,
+        suggest_shard_count,
     )
 
     # Read manifest and extract sizes
@@ -505,6 +525,13 @@ def manifest_analyze(
     stats = compute_file_size_stats(sizes)
     suggestion = suggest_rclone_flags_from_sizes(sizes)
     histogram = format_histogram_data(sizes)
+    shard_suggestion = suggest_shard_count(
+        stats.total_bytes,
+        cpus_per_task=assumed_cpus_per_task,
+        array_concurrency=assumed_array_concurrency,
+        core_budget=assumed_core_budget,
+        max_shard_bytes_tb=max_shard_bytes_tb,
+    )
 
     # Combine base flags with profile-specific flags
     combined_flags = f"{suggestion.flags} {base_flags}"
@@ -528,6 +555,9 @@ def manifest_analyze(
         "profile": suggestion.profile,
         "profile_explanation": suggestion.explanation,
         "suggested_flags": combined_flags,
+        "suggested_shard_count": shard_suggestion.num_shards,
+        "shard_count_reasoning": shard_suggestion.reasoning,
+        "shard_count_assumptions": shard_suggestion.assumptions,
         "histogram": histogram,
     }
 
@@ -882,9 +912,7 @@ def slurm_render(
     rclone_image: str = typer.Option(..., help="Container image containing rclone"),
     rclone_config: Path = typer.Option(
         ...,
-        exists=True,
-        dir_okay=False,
-        help="Host path to rclone.conf",
+        help="Absolute path to rclone.conf on the transfer cluster's compute nodes. Not required to exist on this host; a warning is emitted if it is missing locally.",
         resolve_path=True,
     ),
     rclone_conf_in_container: str = typer.Option(
@@ -911,6 +939,11 @@ def slurm_render(
     pyxis_extra: str = typer.Option(
         "", help="Extra pyxis flags (string placed after --container-mounts...)"
     ),
+    manifest: Optional[Path] = typer.Option(
+        None,
+        help="Manifest JSONL to read source/dest_root from. Defaults to <run_dir>/manifest.jsonl. Use this after `xfer manifest rebase` to point render at the rebased file.",
+        resolve_path=True,
+    ),
 ) -> None:
     """
     Render worker.sh, sbatch_array.sh, and submit.sh under run_dir.
@@ -919,12 +952,19 @@ def slurm_render(
     mkdirp(run_dir / "logs")
     mkdirp(run_dir / "state")
 
-    # If source/dest not provided, try to read first line of manifest.jsonl (if present)
+    if not rclone_config.exists():
+        eprint(
+            f"WARNING: rclone_config {rclone_config} does not exist on this host. "
+            "That is fine if the path is valid on the transfer cluster's compute nodes. "
+            "Verify before submitting."
+        )
+
+    # If source/dest not provided, try to read first line of the manifest (if present)
     if source_root is None or dest_root is None:
-        manifest = run_dir / "manifest.jsonl"
-        if manifest.exists():
+        manifest_path = manifest or (run_dir / "manifest.jsonl")
+        if manifest_path.exists():
             first = None
-            for ln in manifest.read_text(encoding="utf-8").splitlines():
+            for ln in manifest_path.read_text(encoding="utf-8").splitlines():
                 if ln.strip():
                     first = json.loads(ln)
                     break
@@ -934,7 +974,7 @@ def slurm_render(
 
     if not source_root or not dest_root:
         raise typer.BadParameter(
-            "source_root/dest_root not set and could not be read from run_dir/manifest.jsonl"
+            "source_root/dest_root not set and could not be read from the manifest (pass --manifest, --source-root, and/or --dest-root explicitly)"
         )
 
     # Write scripts
diff --git a/src/xfer/est.py b/src/xfer/est.py
index 5d10e8c..fc31e09 100644
--- a/src/xfer/est.py
+++ b/src/xfer/est.py
@@ -569,6 +569,90 @@ def suggest_rclone_flags_from_sizes(sizes: List[int]) -> RcloneFlagsSuggestion:
     )
 
 
+@dataclass
+class ShardCountSuggestion:
+    """Suggested shard count for `xfer manifest shard` based on bytes, cores, and concurrency."""
+
+    num_shards: int
+    reasoning: str
+    assumptions: Dict[str, Any]
+
+
+def suggest_shard_count(
+    total_bytes: int,
+    *,
+    cpus_per_task: int = 4,
+    array_concurrency: int = 64,
+    core_budget: Optional[int] = None,
+    max_shard_bytes_tb: int = 10,
+) -> ShardCountSuggestion:
+    """
+    Suggest a shard count for a transfer based on three constraints:
+
+    1. Bytes cap: no single shard should carry more than ``max_shard_bytes_tb`` TiB
+       of data (worker wall-clock dominates the array's long tail otherwise).
+    2. Concurrency cap: producing more than ``4 * array_concurrency`` shards is
+       wasteful (the scheduler only needs enough slack to keep the queue packed
+       as slow shards trail).
+    3. Core cap (optional): if ``core_budget`` is supplied, the array can't
+       usefully exceed ``core_budget // cpus_per_task`` concurrent workers, so
+       producing more shards just lengthens the queue.
+
+    Special case: if ``total_bytes`` is below the per-shard cap, return
+    ``num_shards=1`` — sharding isn't helpful.
+    """
+    tib = 1024**4
+    max_shard_bytes = max_shard_bytes_tb * tib
+
+    assumptions: Dict[str, Any] = {
+        "total_bytes": total_bytes,
+        "cpus_per_task": cpus_per_task,
+        "array_concurrency": array_concurrency,
+        "core_budget": core_budget,
+        "max_shard_bytes_tb": max_shard_bytes_tb,
+    }
+
+    if total_bytes < max_shard_bytes:
+        reasoning = (
+            f"total_bytes ({human_bytes(total_bytes)}) is below the per-shard cap "
+            f"({max_shard_bytes_tb} TiB); a single shard is sufficient."
+        )
+        return ShardCountSuggestion(
+            num_shards=1, reasoning=reasoning, assumptions=assumptions
+        )
+
+    shards_by_bytes = math.ceil(total_bytes / max_shard_bytes)
+    shards_by_concurrency = 4 * array_concurrency
+    shards_by_cores: Optional[int] = None
+    if core_budget is not None:
+        shards_by_cores = max(1, core_budget // cpus_per_task)
+
+    upper = shards_by_concurrency
+    if shards_by_cores is not None:
+        upper = min(upper, shards_by_cores)
+
+    num_shards = max(1, min(upper, max(shards_by_bytes, 1)))
+
+    reasoning_parts = [
+        f"shards_by_bytes={shards_by_bytes} "
+        f"(total {human_bytes(total_bytes)} / {max_shard_bytes_tb} TiB cap)",
+        f"shards_by_concurrency={shards_by_concurrency} (4 x {array_concurrency})",
+    ]
+    if shards_by_cores is not None:
+        reasoning_parts.append(
+            f"shards_by_cores={shards_by_cores} "
+            f"({core_budget} cores / {cpus_per_task} cpus_per_task)"
+        )
+    reasoning_parts.append(
+        f"chose max(1, min(upper={upper}, shards_by_bytes)) = {num_shards}"
+    )
+    reasoning = "; ".join(reasoning_parts)
+
+    return ShardCountSuggestion(
+        num_shards=num_shards, reasoning=reasoning, assumptions=assumptions
+    )
+
+
 def format_histogram_data(
     sizes: List[int],
 ) -> List[Dict[str, Any]]:
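The arithmetic behind the heuristic above is easy to check by hand. The following standalone re-statement mirrors the three constraints (it deliberately avoids importing `xfer.est`, so it runs anywhere; it is a sketch, not the shipped function):

```python
import math

TIB = 1024**4  # one tebibyte


def sketch(total_bytes, cpus_per_task=4, array_concurrency=64, core_budget=None, cap_tb=10):
    """Mirror of the three constraints in suggest_shard_count, for illustration."""
    if total_bytes < cap_tb * TIB:
        return 1  # small dataset: don't shard
    by_bytes = math.ceil(total_bytes / (cap_tb * TIB))
    upper = 4 * array_concurrency  # enough slack to keep the array packed
    if core_budget is not None:
        upper = min(upper, max(1, core_budget // cpus_per_task))
    return max(1, min(upper, by_bytes))


print(sketch(100 * TIB))                  # → 10 (bytes-bound)
print(sketch(500 * TIB, core_budget=16))  # → 4  (core-bound)
```

The pytest coverage added in this patch exercises the same cases against the real `suggest_shard_count`.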
diff --git a/tests/test_est_shard_count.py b/tests/test_est_shard_count.py
new file mode 100644
index 0000000..fb58106
--- /dev/null
+++ b/tests/test_est_shard_count.py
@@ -0,0 +1,72 @@
+from xfer.est import suggest_shard_count
+
+TIB = 1024**4
+
+
+def test_below_cap_returns_one_shard():
+    suggestion = suggest_shard_count(5 * TIB)
+    assert suggestion.num_shards == 1
+    assert "below the per-shard cap" in suggestion.reasoning
+
+
+def test_zero_bytes_returns_one_shard():
+    suggestion = suggest_shard_count(0)
+    assert suggestion.num_shards == 1
+
+
+def test_bytes_bounded_by_concurrency_default():
+    # 100 TiB, default cap 10 TiB -> 10 shards by bytes.
+    # Defaults: array_concurrency=64 -> concurrency bound 256. No core_budget.
+    # min(256, max(10, 1)) = 10.
+    suggestion = suggest_shard_count(100 * TIB)
+    assert suggestion.num_shards == 10
+
+
+def test_core_budget_tightens_upper_bound():
+    # 100 TiB / 10 TiB = 10 shards by bytes.
+    # core_budget=40, cpus_per_task=4 -> core bound 10.
+    # concurrency bound = 4 * 64 = 256.
+    # min(min(256, 10), 10) = 10. Matches the bytes count exactly.
+    suggestion = suggest_shard_count(100 * TIB, core_budget=40, cpus_per_task=4)
+    assert suggestion.num_shards == 10
+    assert "shards_by_cores=10" in suggestion.reasoning
+
+
+def test_core_budget_caps_below_bytes_requirement():
+    # 500 TiB / 10 TiB = 50 shards by bytes.
+    # core_budget=16, cpus_per_task=4 -> core bound 4.
+    # concurrency bound 256. upper = min(256, 4) = 4.
+    # num_shards = max(1, min(4, 50)) = 4.
+    suggestion = suggest_shard_count(500 * TIB, core_budget=16, cpus_per_task=4)
+    assert suggestion.num_shards == 4
+
+
+def test_concurrency_caps_below_bytes_requirement():
+    # 500 TiB / 10 TiB = 50 shards by bytes. array_concurrency=8 -> bound 32.
+    # num_shards = max(1, min(32, 50)) = 32.
+    suggestion = suggest_shard_count(500 * TIB, array_concurrency=8)
+    assert suggestion.num_shards == 32
+
+
+def test_custom_max_shard_bytes_tb():
+    # max_shard_bytes_tb=1 => cap is 1 TiB.
+    # 5 TiB at 1 TiB cap -> 5 shards by bytes. Default concurrency=64 -> upper 256.
+    suggestion = suggest_shard_count(5 * TIB, max_shard_bytes_tb=1)
+    assert suggestion.num_shards == 5
+
+
+def test_assumptions_are_echoed():
+    suggestion = suggest_shard_count(
+        100 * TIB,
+        cpus_per_task=8,
+        array_concurrency=32,
+        core_budget=200,
+        max_shard_bytes_tb=20,
+    )
+    assert suggestion.assumptions == {
+        "total_bytes": 100 * TIB,
+        "cpus_per_task": 8,
+        "array_concurrency": 32,
+        "core_budget": 200,
+        "max_shard_bytes_tb": 20,
+    }
diff --git a/uv.lock b/uv.lock
index 9e0f8b4..5b2eb22 100644
--- a/uv.lock
+++ b/uv.lock
@@ -85,6 +85,18 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/33/6b/e0547afaf41bf2c42e52430072fa5658766e3d65bd4b03a563d1b6336f57/distlib-0.4.0-py2.py3-none-any.whl", hash = "sha256:9659f7d87e46584a30b5780e43ac7a2143098441670ff0a49d5f9034c54a6c16", size = 469047, upload-time = "2025-07-17T16:51:58.613Z" },
 ]
 
+[[package]]
+name = "exceptiongroup"
+version = "1.3.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "typing-extensions", marker = "python_full_version < '3.13'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/50/79/66800aadf48771f6b62f7eb014e352e5d06856655206165d775e675a02c9/exceptiongroup-1.3.1.tar.gz", hash = "sha256:8b412432c6055b0b7d14c310000ae93352ed6754f70fa8f7c34141f91c4e3219", size = 30371, upload-time = "2025-11-21T23:01:54.787Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/8a/0e/97c33bf5009bdbac74fd2beace167cab3f978feb69cc36f1ef79360d6c4e/exceptiongroup-1.3.1-py3-none-any.whl", hash = "sha256:a7a39a3bd276781e98394987d3a5701d0c4edffb633bb7a5144577f82c773598", size = 16740, upload-time = "2025-11-21T23:01:53.443Z" },
+]
+
 [[package]]
 name = "filelock"
 version = "3.20.3"
@@ -103,6 +115,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/b8/58/40fbbcefeda82364720eba5cf2270f98496bdfa19ea75b4cccae79c698e6/identify-2.6.16-py2.py3-none-any.whl",
hash = "sha256:391ee4d77741d994189522896270b787aed8670389bfd60f326d677d64a6dfb0", size = 99202, upload-time = "2026-01-12T18:58:56.627Z" }, ] +[[package]] +name = "iniconfig" +version = "2.3.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/72/34/14ca021ce8e5dfedc35312d08ba8bf51fdd999c576889fc2c24cb97f4f10/iniconfig-2.3.0.tar.gz", hash = "sha256:c76315c77db068650d49c5b56314774a7804df16fee4402c1f19d6d15d8c4730", size = 20503, upload-time = "2025-10-18T21:55:43.219Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12", size = 7484, upload-time = "2025-10-18T21:55:41.639Z" }, +] + [[package]] name = "markdown-it-py" version = "4.0.0" @@ -169,6 +190,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/cb/28/3bfe2fa5a7b9c46fe7e13c97bda14c895fb10fa2ebf1d0abb90e0cea7ee1/platformdirs-4.5.1-py3-none-any.whl", hash = "sha256:d03afa3963c806a9bed9d5125c8f4cb2fdaf74a55ab60e5d59b3fde758104d31", size = 18731, upload-time = "2025-12-05T13:52:56.823Z" }, ] +[[package]] +name = "pluggy" +version = "1.6.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/f9/e2/3e91f31a7d2b083fe6ef3fa267035b518369d9511ffab804f839851d2779/pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3", size = 69412, upload-time = "2025-05-15T12:30:07.975Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" }, +] + [[package]] name = "pre-commit" version = "4.5.1" @@ -194,6 +224,24 @@ wheels = [ 
{ url = "https://files.pythonhosted.org/packages/c7/21/705964c7812476f378728bdf590ca4b771ec72385c533964653c68e86bdc/pygments-2.19.2-py3-none-any.whl", hash = "sha256:86540386c03d588bb81d44bc3928634ff26449851e99741617ecb9037ee5ec0b", size = 1225217, upload-time = "2025-06-21T13:39:07.939Z" }, ] +[[package]] +name = "pytest" +version = "9.0.3" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "colorama", marker = "sys_platform == 'win32'" }, + { name = "exceptiongroup", marker = "python_full_version < '3.11'" }, + { name = "iniconfig" }, + { name = "packaging" }, + { name = "pluggy" }, + { name = "pygments" }, + { name = "tomli", marker = "python_full_version < '3.11'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/7d/0d/549bd94f1a0a402dc8cf64563a117c0f3765662e2e668477624baeec44d5/pytest-9.0.3.tar.gz", hash = "sha256:b86ada508af81d19edeb213c681b1d48246c1a91d304c6c81a427674c17eb91c", size = 1572165, upload-time = "2026-04-07T17:16:18.027Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/d4/24/a372aaf5c9b7208e7112038812994107bc65a84cd00e0354a88c2c77a617/pytest-9.0.3-py3-none-any.whl", hash = "sha256:2c5efc453d45394fdd706ade797c0a81091eccd1d6e4bccfcd476e2b8e0ab5d9", size = 375249, upload-time = "2026-04-07T17:16:16.13Z" }, +] + [[package]] name = "pytokens" version = "0.4.0" @@ -425,6 +473,7 @@ dependencies = [ dev = [ { name = "black" }, { name = "pre-commit" }, + { name = "pytest" }, ] [package.metadata] @@ -437,4 +486,5 @@ requires-dist = [ dev = [ { name = "black", specifier = ">=26.1.0" }, { name = "pre-commit", specifier = ">=4.5.1" }, + { name = "pytest", specifier = ">=8.0.0" }, ] From 2baaae6a8752f2c9d25345a015701de67c898abe Mon Sep 17 00:00:00 2001 From: Joe Schoonover Date: Fri, 17 Apr 2026 14:23:39 -0400 Subject: [PATCH 4/5] Add xfer-manifest-combine and xfer-oneshot skills; rework submit Addresses the medium-priority gaps from the PR review: * xfer-slurm-submit now prefers `uv run xfer slurm 
submit` (which bakes in `--export=NONE`) over raw sbatch, with raw-sbatch as a fallback. Drops the incorrect attribution of --export=NONE to .claude/settings.local.json and explains the rationale inline. * New xfer-manifest-combine skill covers `xfer manifest combine` for users who build manifests from parallel `rclone lsjson` parts instead of a single listing job. * New xfer-oneshot skill covers `xfer run` as an escape hatch for small/simple transfers, with explicit redirect-to-staged-flow conditions for large or cross-vantage jobs. * CLAUDE.md and README reorganize the skill table into "main pipeline" vs "on-demand / alternative entry points" so rclone-config and the new skills are positioned as bootstrap/alternatives rather than pipeline steps. Co-Authored-By: Claude Opus 4.7 (1M context) --- .claude/skills/xfer-manifest-combine/SKILL.md | 78 ++++++++++++++ .claude/skills/xfer-oneshot/SKILL.md | 100 ++++++++++++++++++ .claude/skills/xfer-slurm-submit/SKILL.md | 13 ++- CLAUDE.md | 11 +- README.md | 13 ++- 5 files changed, 210 insertions(+), 5 deletions(-) create mode 100644 .claude/skills/xfer-manifest-combine/SKILL.md create mode 100644 .claude/skills/xfer-oneshot/SKILL.md diff --git a/.claude/skills/xfer-manifest-combine/SKILL.md b/.claude/skills/xfer-manifest-combine/SKILL.md new file mode 100644 index 0000000..edb0a3e --- /dev/null +++ b/.claude/skills/xfer-manifest-combine/SKILL.md @@ -0,0 +1,78 @@ +--- +name: xfer-manifest-combine +description: Combine multiple `rclone lsjson` part files into a single xfer JSONL manifest. Use this instead of `xfer-manifest-build` when the source is too large for a single `rclone lsjson` call and the user has already produced parallel listings (one JSON-array file per top-level prefix, with a `.prefix` sidecar naming the prefix). Produces the same `xfer.manifest.v1` schema the rest of the pipeline consumes. 
+---
+
+# xfer-manifest-combine
+
+Drives `xfer manifest combine` — an alternative entry point to the pipeline when `xfer manifest build`'s single-shot listing is too slow or too large to buffer. Combines per-prefix lsjson outputs into one `manifest.jsonl`.
+
+## When to use this instead of `xfer-manifest-build`
+
+Use `xfer-manifest-build` by default. Reach for combine when **all** of the following are true:
+
+- The source has so many objects that a single `rclone lsjson` call would OOM or run for days.
+- The user (or a previous job) has already produced per-prefix lsjson outputs in a directory.
+- Each part file is a JSON array (rclone's native lsjson format), and each has a sibling `.prefix` file naming the prefix that was listed.
+
+If only the first condition is true and there are no parts yet, `xfer-manifest-build` is simpler — running parallel lsjson jobs just to feed combine is out of scope for this skill; the user should do that with their own orchestration.
+
+## Operating model
+
+Runs **locally on the workstation** if the parts dir is accessible locally, otherwise on whichever host can see the part files. Pure file processing — no Slurm, no SSH.
+
+## Step 1 — Verify the parts directory layout
+
+Each part file must follow this pattern:
+
+```
+parts/
+├── lsjson-0001.json     # JSON array: `rclone lsjson <remote>:<bucket>/<prefix> --recursive`
+├── lsjson-0001.prefix   # text file containing the literal prefix, e.g. "prefix-1"
+├── lsjson-0002.json
+├── lsjson-0002.prefix
+...
+```
+
+Quick probe:
+
+```bash
+ls <parts-dir>/lsjson-*.json | head
+ls <parts-dir>/lsjson-*.prefix | head
+```
+
+If a `.prefix` sidecar is missing for any part, combine will use an empty prefix for that part — the resulting `path` fields will be bucket-relative, which is usually **not** what you want. Flag this and confirm with the user before running.
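The sidecar check above can be scripted instead of eyeballed. A minimal sketch — the `check_prefix_sidecars` helper is hypothetical, not part of the xfer CLI:

```shell
# Hypothetical helper — not part of xfer. Reports every lsjson part file
# in the given directory that lacks its .prefix sidecar.
check_prefix_sidecars() {
  dir="$1"
  missing=0
  for part in "$dir"/lsjson-*.json; do
    [ -e "$part" ] || continue            # glob matched nothing: no parts at all
    sidecar="${part%.json}.prefix"
    if [ ! -f "$sidecar" ]; then
      echo "MISSING sidecar: $sidecar"
      missing=$((missing + 1))
    fi
  done
  echo "$missing part(s) without a .prefix sidecar"
}
```

If the final count is nonzero, stop and confirm with the user before running combine, per the warning above.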
+
+## Step 2 — Run combine
+
+```bash
+uv run xfer manifest combine \
+  --source <source-root> \
+  --dest <dest-root> \
+  --parts-dir <parts-dir> \
+  --out run/manifest.jsonl
+```
+
+`--source` / `--dest` are the full roots (e.g., `s3src:bucket`) — combine prepends them to each part's prefix + object path to produce the `source`/`dest` URIs in the manifest.
+
+`--run-id <run-id>` is optional; if omitted, one is generated per run.
+
+## Step 3 — Sanity-check the manifest
+
+```bash
+wc -l run/manifest.jsonl
+head -1 run/manifest.jsonl | python -m json.tool
+```
+
+Confirm:
+- Line count matches the sum of non-dir entries across parts (combine's final log line reports this).
+- `source_root`, `dest_root`, and the first record's `source`/`path` look right (path should start with a known prefix from the parts).
+
+## Safety
+
+- If `run/manifest.jsonl` already exists, confirm with the user before overwriting — combine writes unconditionally.
+- If the parts dir is on a shared filesystem, treat it as read-only.
+
+## After this skill
+
+Continue with `xfer-manifest-analyze` exactly as you would after `xfer-manifest-build`. Downstream skills don't care whether the manifest came from build or combine — the schema is identical.
diff --git a/.claude/skills/xfer-oneshot/SKILL.md b/.claude/skills/xfer-oneshot/SKILL.md
new file mode 100644
index 0000000..2375d6c
--- /dev/null
+++ b/.claude/skills/xfer-oneshot/SKILL.md
@@ -0,0 +1,100 @@
+---
+name: xfer-oneshot
+description: Run the whole xfer pipeline (build → shard → render → optional submit) in a single `xfer run` invocation. Use this as an escape hatch for small, straightforward transfers where the user doesn't need the knobs that the staged skills (analyze, rebase, etc.) expose. Do NOT use for very large datasets, multi-cluster transfers, or any case where a separate vantage point (POSIX mount on one cluster, S3 on another) is involved — those need the staged pipeline.
+---
+
+# xfer-oneshot
+
+Drives `xfer run` — the one-shot pipeline in `cli.py:1030`.
It invokes `manifest build`, `manifest shard`, and `slurm render` in order, optionally submitting. No `analyze`, no `rebase`, no load-aware cluster selection. + +## When to use this — and when not to + +**Good fit:** + +- Source and destination are both reachable as rclone remotes from a single Slurm cluster. +- Object count and total bytes are small enough that default shard/concurrency knobs are fine. +- The user has already run through the full pipeline at least once for this dataset and trusts the defaults. +- Quick demos, smoke tests, CI-style small transfers. + +**Bad fit — redirect to the staged pipeline:** + +- Total bytes ≥ 10 TiB (`xfer-manifest-analyze` will compute shard count properly). +- Source available only via POSIX mount on a different cluster than the transfer cluster (need `xfer-manifest-rebase`). +- The user wants to tune rclone flags to the dataset's size profile (`xfer-manifest-analyze` does this). +- User wants to inspect / edit the manifest, shards, or sbatch script before submitting. + +If any of the "bad fit" conditions apply, **do not invoke this skill** — recommend `xfer-manifest-build` and the staged flow instead. + +## Operating model + +Runs from a **Slurm login node** (not the workstation), same as `xfer-manifest-build`, because `xfer run` calls `xfer manifest build` internally, which requires `srun` + pyxis. Pre-flight (uv installed, repo staged, venv synced, rclone.conf deployed) is identical to `xfer-manifest-build` Step 2 — invoke `xfer-manifest-build`'s pre-flight probe or `xfer-rclone-config` as needed before running this skill. + +## Step 1 — Confirm fit + +Ask the user explicitly: + +- Rough total bytes? (If ≥ 10 TiB, redirect to staged flow.) +- Same cluster for listing and transfer? (If not, redirect.) +- Do you want tuned rclone flags? (If yes, redirect to analyze first.) + +Don't proceed silently — users sometimes reach for `xfer run` by habit when they shouldn't. 
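The "rough total bytes?" question can be answered mechanically when host rclone can reach the source: capture `rclone size <remote>:<path> --json` and compare its `bytes` field against the 10 TiB redirect threshold. A minimal sketch — the helper name is ours; the `{"count": ..., "bytes": ...}` shape is rclone's `size --json` output:

```python
import json

TEN_TIB = 10 * 1024**4  # redirect-to-staged-flow threshold from this skill


def fits_oneshot(rclone_size_json: str) -> bool:
    """True if the dataset is under 10 TiB; input is `rclone size --json` stdout."""
    stats = json.loads(rclone_size_json)
    return stats["bytes"] < TEN_TIB


# Example: feed in the captured output of `rclone size s3src:bucket --json`
print(fits_oneshot('{"count": 120000, "bytes": 734003200}'))          # → True (≈700 MiB)
print(fits_oneshot('{"count": 900000000, "bytes": 12094627905536}'))  # → False (11 TiB)
```

This only settles the size question; the same-cluster and tuned-flags questions still need an explicit answer from the user.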
+
+## Step 2 — Gather inputs
+
+| Flag                  | Notes                                                                                     |
+| --------------------- | ----------------------------------------------------------------------------------------- |
+| `--run-dir`           | default `run`; pick a fresh directory                                                      |
+| `--source`            | rclone remote or POSIX path                                                                |
+| `--dest`              | rclone remote                                                                              |
+| `--num-shards`        | default 256; override only if the user has a reason                                        |
+| `--array-concurrency` | default 64                                                                                 |
+| `--rclone-image`      | e.g. `rclone/rclone:latest`                                                                |
+| `--rclone-config`     | absolute path on **this cluster** (the login node = transfer host here, since `xfer run` assumes same cluster) |
+| `--rclone-flags`      | sensible default is baked in; override if the user specifies                               |
+| `--partition`         | a CPU-only partition on this cluster                                                       |
+| `--time-limit`        | default `24:00:00`                                                                         |
+| `--cpus-per-task`     | default 4                                                                                  |
+| `--mem`               | default `8G`                                                                               |
+| `--max-attempts`      | default 5                                                                                  |
+| `--sbatch-extras`     | site-specific `--account=...`, `--qos=...`                                                 |
+| `--pyxis-extra`       | extra `srun --container-*` flags if needed                                                 |
+| `--submit`            | boolean; if set, sbatch runs immediately after render                                      |
+
+## Step 3 — Run
+
+Via SSH to the login node, inside the xfer repo:
+
+```bash
+ssh <user>@<login-node> '
+  cd <xfer-repo-path> &&
+  uv run xfer run \
+    --run-dir <run-dir> \
+    --source <source> \
+    --dest <dest-remote> \
+    --rclone-image <image> \
+    --rclone-config <abs-path-to-rclone.conf> \
+    --partition <partition> \
+    --sbatch-extras "<site extras>" \
+    [--submit]
+'
+```
+
+Stream output so the user sees manifest progress, sharding stats, and the render summary. If `--submit` was set, capture the `Submitted batch job <id>` line.
+
+## Step 4 — Report
+
+Print a short summary:
+- Total objects / total bytes (from the manifest build line).
+- Number of shards (from the shard summary).
+- Rendered artifacts in `<run-dir>/` on the login node.
+- If submitted: job id and the monitoring commands from `xfer-slurm-submit` Step 4 (`squeue`, `sacct`, logs, state markers).
+
+## Safety
+
+- Never overwrite an existing `run-dir` without confirming — `xfer run` will re-run build and clobber `manifest.jsonl`.
+- If `--submit` is set, this skill consumes real cluster resources on the spot. Confirm before running.
+- If the user wants to inspect the sbatch script before it runs, run **without** `--submit`, show them `<run-dir>/sbatch_array.sh`, then invoke `xfer-slurm-submit` separately.
+
+## After this skill
+
+If `--submit` was used, direct the user to the monitoring commands from `xfer-slurm-submit` Step 4. If not, `xfer-slurm-submit` is next.
diff --git a/.claude/skills/xfer-slurm-submit/SKILL.md b/.claude/skills/xfer-slurm-submit/SKILL.md
index 7535a4e..a20aa53 100644
--- a/.claude/skills/xfer-slurm-submit/SKILL.md
+++ b/.claude/skills/xfer-slurm-submit/SKILL.md
@@ -47,6 +47,17 @@ If this is a re-submit after failure, do **not** `--delete` — preserve any `.d
 
 ## Step 3 — Submit
 
+**Prefer `uv run xfer slurm submit`** when the xfer repo is available on the transfer cluster's login node (same pre-flight as `xfer-manifest-build` — `uv` installed, repo synced, `.venv` bootstrapped). The CLI wraps `sbatch --export=NONE` (see `cli.py:1021`) so env vars from the parent shell don't leak into the array job:
+
+```bash
+ssh <user>@<transfer-login> '
+  cd <staged-run-parent> &&
+  uv run xfer slurm submit --run-dir <run-dir>
+'
+```
+
+If the xfer repo isn't on the transfer cluster (and the user doesn't want to stage it), fall back to raw `sbatch`:
+
 ```bash
 ssh <user>@<transfer-login> '
   cd <staged-run-parent> &&
@@ -54,7 +65,7 @@ ssh <user>@<transfer-login> '
 '
 ```
 
-`--export=NONE` is important — it prevents the workstation's or parent job's `SLURM_*` env from leaking into the array job (per the existing project convention; see `.claude/settings.local.json`).
+`--export=NONE` is important either way — it prevents `SLURM_*` env from the workstation or a parent job from leaking into the array job (e.g., a stale `SLURM_MEM_PER_NODE` colliding with `SLURM_MEM_PER_CPU`).
 
 Capture the returned job id (e.g., `Submitted batch job 12345`).
diff --git a/CLAUDE.md b/CLAUDE.md
index 5a3f059..8e252c5 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -17,11 +17,10 @@ The **user's local workstation** is the orchestrator.
Long-running and compute-b ## Workflow skills -Use these in order; each `Skill` trigger runs the corresponding stage. Invoke explicitly when the user's intent matches. +The main pipeline is `build → analyze → shard → (rebase) → render → submit`. Invoke each skill when the user's intent matches its stage. | Skill | Stage | | ------------------------- | ---------------------------------------------------------------------- | -| `xfer-rclone-config` | Create rclone.conf, deploy to each cluster that will run xfer jobs | | `xfer-manifest-build` | Run `xfer manifest build` on a login node (POSIX-preferred source) | | `xfer-manifest-analyze` | File-size histogram → suggested rclone flags + shard count | | `xfer-manifest-shard` | Byte-balanced split of the manifest into `run/shards/` | @@ -29,6 +28,14 @@ Use these in order; each `Skill` trigger runs the corresponding stage. Invoke ex | `xfer-slurm-render` | Render `worker.sh` / `sbatch_array.sh` / `config.resolved.json` | | `xfer-slurm-submit` | Stage run dir to transfer cluster, `sbatch`, return monitoring cmds | +**On-demand / alternative entry points:** + +| Skill | When to use | +| ------------------------- | ---------------------------------------------------------------------- | +| `xfer-rclone-config` | One-time (per cluster) setup — create/deploy rclone.conf. Invoke when another skill discovers a missing config. | +| `xfer-manifest-combine` | Alternative to `build` when the user already has parallel `rclone lsjson` parts and needs them unified into a single manifest. | +| `xfer-oneshot` | `xfer run` escape hatch for small/simple transfers that don't need the staged knobs. Redirect to the staged pipeline for large or cross-vantage jobs. | + ## Dev conventions - Python ≥ 3.10, managed with `uv` (`uv venv && uv sync`; `uv run xfer --help`). 
diff --git a/README.md b/README.md
index d589873..2c4eae6 100644
--- a/README.md
+++ b/README.md
@@ -315,11 +315,12 @@ This repo ships a set of [Claude Code](https://claude.com/claude-code) skills un
 
 ### Available skills
 
-Invoke each by intent in a Claude Code session, or explicitly via `/<skill-name>`:
+Invoke each by intent in a Claude Code session, or explicitly via `/<skill-name>`.
+
+Main pipeline (`build → analyze → shard → (rebase) → render → submit`):
 
 | Skill                    | Stage                                                                |
 | ------------------------ | -------------------------------------------------------------------- |
-| `xfer-rclone-config`     | Create `rclone.conf` and deploy it to each cluster                   |
 | `xfer-manifest-build`    | Run `xfer manifest build` on a login node (POSIX source preferred)   |
 | `xfer-manifest-analyze`  | File-size histogram → suggested rclone flags and shard count         |
 | `xfer-manifest-shard`    | Byte-balanced split of the manifest into shards                      |
@@ -327,6 +328,14 @@ Invoke each by intent in a Claude Code session, or explicitly via `/
 | `xfer-slurm-render`      | Render `worker.sh` / `sbatch_array.sh` / `config.resolved.json`      |
 | `xfer-slurm-submit`      | Stage the run directory to the cluster and `sbatch`                  |
 
+On-demand / alternative entry points:
+
+| Skill                    | When to use                                                          |
+| ------------------------ | -------------------------------------------------------------------- |
+| `xfer-rclone-config`     | One-time (per cluster) setup — create/deploy `rclone.conf`           |
+| `xfer-manifest-combine`  | Combine parallel `rclone lsjson` parts into a single manifest        |
+| `xfer-oneshot`           | `xfer run` escape hatch for small transfers that don't need the staged knobs |
+
 See `CLAUDE.md` for the cross-cutting context Claude loads in every session in this repo.
### Conventions

From 3e012865f36f2c5a61e6396d427b7fafada85dc9 Mon Sep 17 00:00:00 2001
From: Joe Schoonover
Date: Fri, 17 Apr 2026 14:30:11 -0400
Subject: [PATCH 5/5] Clean up low-priority skill wording

Addresses the remaining review items:

* xfer-manifest-build: drop the redundant --extra-lsjson-flags "--fast-list"
  (fast_list=True is already the CLI default); note how to disable it for
  POSIX sources instead.
* xfer-manifest-rebase: require explicit user confirmation before rm -rf
  run/shards, and suggest `mv run/shards run/shards.pre-rebase` as a safer
  alternative.
* xfer-rclone-config: replace the standalone "paths are per-system" invariant
  with a pointer to CLAUDE.md plus a trimmed per-stage path list. Reword the
  description so it no longer implies skills report to each other.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .claude/skills/xfer-manifest-build/SKILL.md  |  5 ++---
 .claude/skills/xfer-manifest-rebase/SKILL.md |  4 +++-
 .claude/skills/xfer-rclone-config/SKILL.md   | 14 +++++++-------
 3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/.claude/skills/xfer-manifest-build/SKILL.md b/.claude/skills/xfer-manifest-build/SKILL.md
index ad976ac..bd9b8fb 100644
--- a/.claude/skills/xfer-manifest-build/SKILL.md
+++ b/.claude/skills/xfer-manifest-build/SKILL.md
@@ -85,8 +85,7 @@ ssh <user>@<build-login> '
     --dest <dest-remote> \
     --out run/manifest.jsonl \
     --rclone-image rclone/rclone:latest \
-    --rclone-config <rclone-conf-path> \
-    --extra-lsjson-flags "--fast-list"
+    --rclone-config <rclone-conf-path>
 '
 ```
@@ -94,7 +93,7 @@ ssh <user>@<build-login> '
 Notes:
 
 - If the source is a POSIX path, the `--source` value is the filesystem path (e.g., `/mnt/data/dataset`), not an rclone remote. The destination remains an rclone remote.
-- `--fast-list` is usually a win for S3 sources. Skip it for POSIX sources.
+- `--fast-list` is already the default for S3 sources (see `cli.py:200`). Pass `--no-fast-list` when the source is a POSIX path.
 - Stream output back so the user sees progress. Don't background it.
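The POSIX-path-versus-remote distinction in the notes above can be checked mechanically before building the command. A hedged sketch — the helper is ours, not part of xfer; it mirrors rclone's convention that a colon before the first slash marks a remote, and that single-character "remotes" (Windows drive letters) are treated as paths:

```python
def looks_like_rclone_remote(source: str) -> bool:
    """Heuristic: `name:path` is an rclone remote; anything else is a POSIX path."""
    head = source.split("/", 1)[0]   # only a colon before the first slash counts
    if ":" not in head:
        return False
    name = head.split(":", 1)[0]
    return len(name) > 1             # `C:` style drive letters are paths, not remotes


print(looks_like_rclone_remote("s3src:bucket/prefix"))  # → True  (remote)
print(looks_like_rclone_remote("/mnt/data/dataset"))    # → False (POSIX path)
```

When this returns False, build the command with the filesystem path as `--source` and drop `--fast-list` (i.e., pass `--no-fast-list`), per the note above.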
## Step 4 — Retrieve the manifest and note the vantage point diff --git a/.claude/skills/xfer-manifest-rebase/SKILL.md b/.claude/skills/xfer-manifest-rebase/SKILL.md index 841ec3e..a5f231d 100644 --- a/.claude/skills/xfer-manifest-rebase/SKILL.md +++ b/.claude/skills/xfer-manifest-rebase/SKILL.md @@ -42,7 +42,9 @@ Always write to a new file (don't overwrite `manifest.jsonl`). Keeping the origi Sharding is derived from the manifest, so **re-shard after rebasing**. The existing `run/shards/` directory contains pre-rebase paths and must be replaced. -Confirm with the user before removing the old shards (`run/shards` is small but removing it is irreversible locally): +**Before running the command below, ask the user explicitly:** "OK to delete `run/shards/` and re-generate from the rebased manifest?" Offer to move it aside (`mv run/shards run/shards.pre-rebase`) as a safer alternative if they want to keep the old shards for debugging. + +Only after explicit confirmation: ```bash rm -rf run/shards diff --git a/.claude/skills/xfer-rclone-config/SKILL.md b/.claude/skills/xfer-rclone-config/SKILL.md index f0d7d88..9a93d4d 100644 --- a/.claude/skills/xfer-rclone-config/SKILL.md +++ b/.claude/skills/xfer-rclone-config/SKILL.md @@ -1,6 +1,6 @@ --- name: xfer-rclone-config -description: Create or extend an rclone.conf for xfer — collect S3 endpoints and credentials for each remote (source, destination, optional staging), write the config with 0600 permissions, optionally test it, and guide the user through deploying it to each Slurm cluster that will run xfer jobs. Use whenever the user needs to set up rclone remotes, is bootstrapping a new transfer, or an existing skill reports a missing/incomplete rclone.conf. 
+description: Create or extend an rclone.conf for xfer — collect S3 endpoints and credentials for each remote (source, destination, optional staging), write the config with 0600 permissions, optionally test it, and guide the user through deploying it to each Slurm cluster that will run xfer jobs. Use whenever the user needs to set up rclone remotes, is bootstrapping a new transfer, or a previous step discovered a missing rclone.conf on a cluster. --- # xfer-rclone-config @@ -13,15 +13,15 @@ Authors an rclone.conf suitable for xfer. The config is consumed by **containeri The authoritative template is `rclone.conf.example` at the repo root. -## Critical invariant — paths are per-system +## Per-stage rclone.conf paths -The `--rclone-config` flag always takes an **absolute path on whichever host the xfer command runs on**. That host differs between stages: +See CLAUDE.md's "Paths are per-system" cross-cutting invariant. Applied to `--rclone-config`, the host where each stage runs is: -- `xfer manifest build` — path on the **build cluster's login node** -- `xfer slurm render` — the path baked into `sbatch_array.sh` must be valid on the **transfer cluster's compute nodes** -- Local `uv run xfer analyze/shard/rebase` — path on the **workstation** (only if the user wants to sanity-check remotes locally via host rclone) +- `xfer manifest build` — build cluster's login node +- `xfer slurm render` — transfer cluster's compute nodes (the path is embedded into `sbatch_array.sh` and resolved at job time) +- Local `uv run xfer analyze/shard/rebase` — workstation (only if the user wants to sanity-check remotes via host rclone) -A single workstation config does not automatically work on the cluster. **Always ask for and record the path per system.** Do not assume `~/.config/rclone/rclone.conf` on the workstation resolves the same way on the cluster — home directories, shared scratch, and site-standard locations (e.g., `/etc/rclone/rclone.conf`) all differ. 
+Collect the absolute path per system — don't assume `~/.config/rclone/rclone.conf` means the same thing on every host. ## Step 1 — Inventory what's needed