diff --git a/.claude/skills/xfer-manifest-analyze/SKILL.md b/.claude/skills/xfer-manifest-analyze/SKILL.md
new file mode 100644
index 0000000..9c7b7bd
--- /dev/null
+++ b/.claude/skills/xfer-manifest-analyze/SKILL.md
@@ -0,0 +1,62 @@
---
name: xfer-manifest-analyze
description: Analyze an xfer manifest's file-size distribution and suggest rclone flags plus a shard count. Use after `xfer manifest build` and before sharding/rendering, whenever the user asks "how should I tune rclone?", "how many shards?", or wants to understand the dataset shape before transferring.
---

# xfer-manifest-analyze

Drives `xfer manifest analyze` — reads `run/manifest.jsonl` (produced by `xfer-manifest-build`) and writes a histogram plus suggested rclone flags to `run/analyze.json`.

## Operating model

Runs **locally on the workstation** — no cluster access needed. Pure file processing over the JSONL manifest.

## Step 1 — Locate the manifest

Default: `run/manifest.jsonl` at the repo root. If the user has a different path or multiple runs under `run_*/`, ask which to analyze.

## Step 2 — Run analyze

```bash
uv run xfer manifest analyze \
  --in run/manifest.jsonl \
  --out run/analyze.json
```

Optional flags that tune the **shard-count suggestion** (not the rclone flag suggestion):

| Flag | Default | What it does |
| --- | --- | --- |
| `--assumed-cpus-per-task` | `4` | Cores each worker will request. Matches the `xfer slurm render` default. |
| `--assumed-array-concurrency` | `64` | Expected Slurm array concurrency. Matches the `xfer slurm render` default. |
| `--assumed-core-budget` | unset | Total cores the partition will make available (supply from `sinfo`). |
| `--max-shard-bytes-tb` | `10` | Per-shard byte cap. No single shard should carry more than this. |
| `--base-flags "<flags>"` | — | Prepend the user's preferred rclone flags to the suggested ones. |

If the user already knows the transfer cluster's available core budget, pass it — the shard-count suggestion will be sharper. Otherwise the default (concurrency + bytes-only) is fine.

## Step 3 — Report

Read `run/analyze.json` and report to the user:

1. **Dataset shape**: total object count, total bytes, median size, p10/p90 sizes, and the histogram bin counts (power-of-2 edges).
2. **Profile classification**: which profile the analyzer picked (`small_files`, `large_files`, or `mixed`) and the reasoning (e.g., ">70% of objects are under 1 MiB").
3. **Suggested rclone flags** (`suggested_flags`): the concrete string to pass to `--rclone-flags` for render. Typical examples:
   - small_files → `--transfers 64 --checkers 128 --fast-list`
   - large_files → `--transfers 16 --checkers 32 --buffer-size 256M`
   - mixed → `--transfers 32 --checkers 64 --fast-list`
4. **Suggested shard count** (`suggested_shard_count`, plus `shard_count_reasoning` and `shard_count_assumptions`). The heuristic:
   - If `total_bytes` is below the per-shard cap (default 10 TiB), **1 shard** — don't shard small datasets.
   - Otherwise `ceil(total_bytes / cap)` shards, upper-bounded by `4 × array_concurrency` and (if a core budget was supplied) `core_budget // cpus_per_task`.

   Quote `shard_count_reasoning` verbatim back to the user so they can see the trade-offs.

## Step 4 — Persist for downstream skills

`run/analyze.json` is the source of truth for flag/shard decisions. `xfer-manifest-shard` reads `suggested_shard_count` and `xfer-slurm-render` reads `suggested_flags` — point at this file, don't re-derive.

If the user's plan changes (different transfer cluster, different concurrency cap), re-run `xfer manifest analyze` with updated `--assumed-*` flags before calling `xfer-manifest-shard`.

## After this skill

Recommend `xfer-manifest-shard` next.
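The shard-count heuristic described in Step 3 can be sketched as follows. This is an illustration of the stated rules, not the analyzer's actual code; `suggest_shard_count` is a hypothetical helper name:

```python
import math

TIB = 1024 ** 4

def suggest_shard_count(total_bytes, cap_bytes=10 * TIB,
                        array_concurrency=64, cpus_per_task=4,
                        core_budget=None):
    # Below the per-shard byte cap: one shard, don't shard small datasets.
    if total_bytes <= cap_bytes:
        return 1
    # Enough shards that none exceeds the cap...
    n = math.ceil(total_bytes / cap_bytes)
    # ...bounded by 4x the expected array concurrency...
    n = min(n, 4 * array_concurrency)
    # ...and, if known, by the cores the partition can actually grant.
    if core_budget is not None:
        n = min(n, core_budget // cpus_per_task)
    return n
```

For example, a 50 TiB dataset with defaults suggests 5 shards, while a 5 PiB dataset hits the `4 × 64 = 256` concurrency bound.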
diff --git a/.claude/skills/xfer-manifest-build/SKILL.md b/.claude/skills/xfer-manifest-build/SKILL.md
new file mode 100644
index 0000000..bd9b8fb
--- /dev/null
+++ b/.claude/skills/xfer-manifest-build/SKILL.md
@@ -0,0 +1,117 @@
---
name: xfer-manifest-build
description: Build an xfer JSONL manifest for a large S3-to-S3 (or POSIX-to-S3) data transfer. Use when the user wants to list source objects for a transfer, kick off `xfer manifest build`, or start the xfer pipeline from scratch. Prefers running on a Slurm cluster that has a POSIX mount of the source bucket, since listing over POSIX is far faster than listing over S3.
---

# xfer-manifest-build

Drives `xfer manifest build` — the first stage of the xfer pipeline. Produces `run/manifest.jsonl`.

## Operating model

Assume the user is working from a **local workstation** at the root of the `xfer` repo, in a `uv` environment (`uv venv && uv sync` already done). The workstation orchestrates. **`xfer manifest build` itself must run on a Slurm login node** because it invokes `srun` + pyxis internally.

## Step 1 — Pick the build host (POSIX-first)

Ask the user (or infer from prior conversation / CLAUDE.md / a site-config file if one exists):

1. What is the source? (S3 remote like `s3src:bucket/prefix` or a POSIX path like `/mnt/data/dataset`)
2. What is the destination?
3. **Does any Slurm cluster have a POSIX mount equivalent to the source bucket?** (e.g., `/mnt/data/important-files` on cluster `weka` corresponds to `weka-s3:important-files`.) If yes, strongly prefer that cluster as the build host and pass the POSIX path as `--source` — listing over S3 is latency-bound, so listing over the POSIX mount is much faster.
4. If no POSIX mount exists, pick any Slurm cluster with network access to the source endpoint. Still prefer the side with better network proximity to the source.

Record the chosen build host's hostname, username, and the xfer repo path on that host (default `~/xfer`).
These are needed for SSH.

## Step 2 — Pre-flight on the login node

Run a single non-destructive SSH probe to discover what's already in place:

```bash
ssh <user>@<build_host> '
  command -v uv || echo "UV_MISSING"
  test -d <repo_path> && echo "REPO_PRESENT" || echo "REPO_MISSING"
  test -d <repo_path>/.venv && echo "VENV_PRESENT" || echo "VENV_MISSING"
  test -f <rclone_conf_path> && echo "RCLONE_CONF_PRESENT" || echo "RCLONE_CONF_MISSING"
  sinfo -h -o "%P %a %D %C %G" | head -20
'
```

Parse the results and react:

| State | Action |
| --- | --- |
| `UV_MISSING` | Install per-user: `curl -LsSf https://astral.sh/uv/install.sh \| sh` (confirm first) |
| `REPO_MISSING` | Rsync the local repo up (see step 2a below) |
| `REPO_PRESENT` but older | Offer to rsync updates from the workstation (step 2a); never force without asking |
| `VENV_MISSING` | Run `uv sync` on the login node after the repo is in place (step 2b) |
| `RCLONE_CONF_MISSING` | Invoke `xfer-rclone-config` to create/deploy the config to this cluster — do not `scp` blindly |

Also verify a CPU-only partition is visible in the `sinfo` output (the `%G` column should be empty or `(null)` for CPU-only).

### Step 2a — Sync the repo to the login node (if needed)

```bash
rsync -av \
  --exclude='.venv/' --exclude='.git/' --exclude='run*/' \
  --exclude='__pycache__/' --exclude='*.egg-info/' \
  ./ <user>@<build_host>:<repo_path>/
```

Exclude `.venv/` (platform-specific; must be rebuilt remotely), `.git/` (not needed to run), and any local `run*/` dirs (those stay on the workstation or move separately).

### Step 2b — Bootstrap the uv environment on the login node

```bash
ssh <user>@<build_host> '
  cd <repo_path> &&
  uv venv &&
  uv sync
'
```

Stream the output so the user sees `uv sync` progress. If `uv sync` fails (e.g., no network from the login node, locked-down Python), surface the error and stop — don't try to force-install.
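The marker-to-action mapping in the pre-flight table can be sketched as a small triage helper (the marker strings are the ones the probe prints; `triage` itself is a hypothetical name):

```python
# Marker → follow-up action, per the pre-flight table above.
ACTIONS = {
    "UV_MISSING": "install uv per-user (confirm first)",
    "REPO_MISSING": "rsync the repo up (step 2a)",
    "VENV_MISSING": "run `uv sync` on the login node (step 2b)",
    "RCLONE_CONF_MISSING": "invoke xfer-rclone-config",
}

def triage(probe_output: str) -> list[str]:
    # The probe prints one marker per check; collect whatever appeared.
    markers = set(probe_output.split())
    return [action for marker, action in ACTIONS.items() if marker in markers]
```

The `*_PRESENT` markers need no action, so only the missing pieces come back.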
## Step 3 — Run the build

Prefer a **CPU-only partition** with **4–8 cores**. The `srun` inside `xfer manifest build` already requests 8 cores (see `cli.py:251`), so ask the user to confirm the partition and any `--sbatch-extras`-style account/QoS flags their site requires before submission.

Invoke on the login node via SSH:

```bash
ssh <user>@<build_host> '
  cd <repo_path> &&
  uv run xfer manifest build \
    --source <source> \
    --dest <dest> \
    --out run/manifest.jsonl \
    --rclone-image rclone/rclone:latest \
    --rclone-config <rclone_conf_path>
'
```

`<rclone_conf_path>` is the absolute path on the **build cluster's login node**, not the workstation path. If unsure, ask the user — or run `xfer-rclone-config` to resolve/deploy it for this cluster.

Notes:
- If the source is a POSIX path, the `--source` value is the filesystem path (e.g., `/mnt/data/dataset`), not an rclone remote. The destination remains an rclone remote.
- `--fast-list` is already the default for S3 sources (see `cli.py:200`). Pass `--no-fast-list` when the source is a POSIX path.
- Stream output back so the user sees progress. Don't background it.

## Step 4 — Retrieve the manifest and note the vantage point

Pull the manifest back to the workstation so downstream skills (analyze, shard, render) can run locally:

```bash
rsync -av <user>@<build_host>:<repo_path>/run/manifest.jsonl ./run/manifest.jsonl
```

Tell the user **the vantage point of the manifest** — i.e., "source was listed as `<source>` from host `<build_host>`." If the transfer will run on a different cluster with a different view (e.g., built from POSIX `/mnt/data/x`, transferred via `weka-s3:x`), they will need to invoke the `xfer-manifest-rebase` skill before render/submit.

## Safety

- Do not delete or overwrite an existing `run/manifest.jsonl` without confirming.
- Do not pick a build partition without the user's confirmation if the cluster has multiple options.
- If `ssh` or `rsync` would touch a shared path on the login node (e.g., a group scratch dir), confirm the path first.
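For the Step 4 vantage-point report, the recorded roots can be read off the first manifest record. This assumes per-record `source_root`/`dest_root` fields as named in the combine skill's sanity checks; adjust to the actual `xfer.manifest.v1` schema, and treat `manifest_vantage` as a hypothetical helper:

```python
import json

def manifest_vantage(path="run/manifest.jsonl"):
    # Only the first line is needed; the roots are the same for every record.
    with open(path) as fh:
        first = json.loads(fh.readline())
    return first.get("source_root"), first.get("dest_root")
```

Report the returned pair alongside the build host, so a later vantage change is obvious.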
## After this skill

Recommend the user next invoke `xfer-manifest-analyze` to pick rclone flags from the file-size histogram.

diff --git a/.claude/skills/xfer-manifest-combine/SKILL.md b/.claude/skills/xfer-manifest-combine/SKILL.md
new file mode 100644
index 0000000..edb0a3e
--- /dev/null
+++ b/.claude/skills/xfer-manifest-combine/SKILL.md
@@ -0,0 +1,78 @@
---
name: xfer-manifest-combine
description: Combine multiple `rclone lsjson` part files into a single xfer JSONL manifest. Use this instead of `xfer-manifest-build` when the source is too large for a single `rclone lsjson` call and the user has already produced parallel listings (one JSON-array file per top-level prefix, with a `.prefix` sidecar naming the prefix). Produces the same `xfer.manifest.v1` schema the rest of the pipeline consumes.
---

# xfer-manifest-combine

Drives `xfer manifest combine` — an alternative entry point to the pipeline when `xfer manifest build`'s single-shot listing is too slow or too large to buffer. Combines per-prefix lsjson outputs into one `manifest.jsonl`.

## When to use this instead of `xfer-manifest-build`

Use `xfer-manifest-build` by default. Reach for combine when **all** of the following are true:

- The source has so many objects that a single `rclone lsjson` call would OOM or run for days.
- The user (or a previous job) has already produced per-prefix lsjson outputs in a directory.
- Each part file is a JSON array (rclone's native lsjson format), and each has a sibling `.prefix` file naming the prefix that was listed.

If only the first condition is true and there are no parts yet, `xfer-manifest-build` is simpler — running parallel lsjson jobs just to feed combine is out of scope for this skill; the user should do that with their own orchestration.

## Operating model

Runs **locally on the workstation** if the parts dir is accessible locally, otherwise on whichever host can see the part files. Pure file processing — no Slurm, no SSH.
## Step 1 — Verify the parts directory layout

Each part file must follow this pattern:

```
parts/
├── lsjson-0001.json    # JSON array: `rclone lsjson <remote>:<bucket>/<prefix> --recursive`
├── lsjson-0001.prefix  # text file containing the literal prefix, e.g. "prefix-1"
├── lsjson-0002.json
├── lsjson-0002.prefix
...
```

Quick probe:

```bash
ls <parts_dir>/lsjson-*.json | head
ls <parts_dir>/lsjson-*.prefix | head
```

If a `.prefix` sidecar is missing for any part, combine will use an empty prefix for that part — the resulting `path` fields will be bucket-relative, which is usually **not** what you want. Flag this and confirm with the user before running.

## Step 2 — Run combine

```bash
uv run xfer manifest combine \
  --source <source_root> \
  --dest <dest_root> \
  --parts-dir <parts_dir> \
  --out run/manifest.jsonl
```

`--source` / `--dest` are the full roots (e.g., `s3src:bucket`) — combine prepends them to each part's prefix + object path to produce the `source`/`dest` URIs in the manifest.

`--run-id <id>` is optional; if omitted, one is generated per run.

## Step 3 — Sanity-check the manifest

```bash
wc -l run/manifest.jsonl
head -1 run/manifest.jsonl | python -m json.tool
```

Confirm:
- The line count matches the sum of non-dir entries across parts (combine's final log line reports this).
- `source_root`, `dest_root`, and the first record's `source`/`path` look right (the path should start with a known prefix from the parts).

## Safety

- If `run/manifest.jsonl` already exists, confirm with the user before overwriting — combine writes unconditionally.
- If the parts dir is on a shared filesystem, treat it as read-only.

## After this skill

Continue with `xfer-manifest-analyze` exactly as you would after `xfer-manifest-build`. Downstream skills don't care whether the manifest came from build or combine — the schema is identical.
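The Step 1 sidecar check and the Step 3 line-count check above can be sketched together. This relies on rclone's lsjson entry fields `Path`/`IsDir` (which are real lsjson output fields); `check_parts` is a hypothetical helper name:

```python
import json
from pathlib import Path

def check_parts(parts_dir):
    """Return (part files missing a .prefix sidecar, expected manifest line count)."""
    parts = sorted(Path(parts_dir).glob("lsjson-*.json"))
    missing = [p.name for p in parts if not p.with_suffix(".prefix").exists()]
    # combine emits one manifest line per non-directory entry
    expected = sum(
        1
        for p in parts
        for entry in json.loads(p.read_text())
        if not entry.get("IsDir", False)
    )
    return missing, expected
```

If `missing` is non-empty, stop and confirm with the user before running combine; afterwards, `wc -l run/manifest.jsonl` should equal `expected`.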
diff --git a/.claude/skills/xfer-manifest-rebase/SKILL.md b/.claude/skills/xfer-manifest-rebase/SKILL.md
new file mode 100644
index 0000000..a5f231d
--- /dev/null
+++ b/.claude/skills/xfer-manifest-rebase/SKILL.md
@@ -0,0 +1,80 @@
---
name: xfer-manifest-rebase
description: Remap an xfer manifest's source/dest roots when the transfer host has a different view than the manifest build host (e.g., manifest built over POSIX `/mnt/data/x`, transfer runs via S3 `weka-s3:x`). Use whenever the vantage point changes between manifest build and transfer — this MUST run before render/submit or the transfer will fail.
---

# xfer-manifest-rebase

Drives `xfer manifest rebase` — rewrites the source/dest roots in `run/manifest.jsonl` so the manifest is valid from the transfer host's perspective.

## When to run this

**Trigger condition**: the host that will execute the transfer has a different view of either the source or the destination than the host that built the manifest. Common cases:

- Manifest built on a cluster with a POSIX mount (`/mnt/data/dataset`); transfer runs on a different cluster that only sees the bucket as an rclone remote (`weka-s3:dataset`).
- Source built via one rclone remote alias; the transfer host's rclone.conf uses a different alias for the same bucket.
- Dest root changed (e.g., adding a subprefix) after build.

If the transfer host sees source and dest identically to the build host, **do not rebase** — it's a no-op that wastes a pass over the manifest.

## Step 1 — Determine the mismatch

Ask the user (or look at prior conversation / `run/manifest.jsonl`'s header):

1. What were the `source_root` and `dest_root` recorded in the manifest? (Peek at the first line of `run/manifest.jsonl`.)
2. What does the transfer host see as the source and dest? (Usually rclone remote names; confirm with `rclone listremotes` on that host.)

Show the user the proposed before/after mapping and confirm before proceeding.
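To make the proposed mapping concrete, a per-record preview of what the remap will do can help. This is a sketch of the remap semantics only (swap the root prefix, keep the relative remainder); `rebase_preview` is hypothetical, and the real tool handles the full record schema:

```python
def rebase_preview(record, old_source_root, new_source_root,
                   old_dest_root, new_dest_root):
    # Rebase swaps the root prefix and keeps the relative remainder.
    def swap(uri, old, new):
        if not uri.startswith(old):
            raise ValueError(f"{uri!r} does not start with {old!r}")
        return new + uri[len(old):]
    return {
        **record,
        "source": swap(record["source"], old_source_root, new_source_root),
        "dest": swap(record["dest"], old_dest_root, new_dest_root),
    }
```

Printing the first record's before/after pair gives the user something concrete to confirm.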
## Step 2 — Run rebase

```bash
uv run xfer manifest rebase \
  --in run/manifest.jsonl \
  --out run/manifest.rebased.jsonl \
  --source-root <new_source_root> \
  --dest-root <new_dest_root>
```

Always write to a new file (don't overwrite `manifest.jsonl`). Keeping the original manifest as a record of the build vantage point is useful for debugging and audits.

## Step 3 — Re-shard

Sharding is derived from the manifest, so **re-shard after rebasing**. The existing `run/shards/` directory contains pre-rebase paths and must be replaced.

**Before running the command below, ask the user explicitly:** "OK to delete `run/shards/` and re-generate from the rebased manifest?" Offer to move it aside (`mv run/shards run/shards.pre-rebase`) as a safer alternative if they want to keep the old shards for debugging.

Only after explicit confirmation:

```bash
rm -rf run/shards
uv run xfer manifest shard \
  --in run/manifest.rebased.jsonl \
  --outdir run/shards \
  --num-shards <N>
```

(Or invoke `xfer-manifest-shard` with the rebased manifest as input.) Byte balance won't change meaningfully, but the shard files need to carry the rebased paths or workers will try to copy from the wrong URI.

## Step 4 — Point `xfer slurm render` at the rebased manifest

`xfer slurm render` reads `source_root` and `dest_root` from a manifest file. By default it reads `<run_dir>/manifest.jsonl`, which is intentionally left at the pre-rebase vantage point as an audit record. Pass `--manifest` to read the rebased file instead:

```bash
uv run xfer slurm render \
  --run-dir run \
  --manifest run/manifest.rebased.jsonl \
  ...
```

Without `--manifest`, render would use the original roots and every array task would target the wrong URI.

## Safety

- Never delete the original manifest — always keep `run/manifest.jsonl` as an audit trail alongside `run/manifest.rebased.jsonl`.
- Rebase is a remap, not a content migration. It does not move data. It only relabels what each shard points to.
- Confirm before `rm -rf run/shards` — the user may want to move the old shards aside rather than delete them.

## After this skill

Recommend `xfer-slurm-render` (or re-shard first if you didn't in step 3).

diff --git a/.claude/skills/xfer-manifest-shard/SKILL.md b/.claude/skills/xfer-manifest-shard/SKILL.md
new file mode 100644
index 0000000..dbc0e75
--- /dev/null
+++ b/.claude/skills/xfer-manifest-shard/SKILL.md
@@ -0,0 +1,55 @@
---
name: xfer-manifest-shard
description: Split an xfer manifest into byte-balanced shards for parallel transfer. Use after analyze and before slurm render, whenever the user wants to shard a manifest or asks how many array tasks the transfer should use.
---

# xfer-manifest-shard

Drives `xfer manifest shard` — splits `run/manifest.jsonl` into `run/shards/shard_*.jsonl` + `shards.meta.json`.

## Operating model

Runs **locally on the workstation**. Pure file processing. No Slurm/SSH needed.

## Step 1 — Read the analyze output

Read `run/analyze.json` (from `xfer-manifest-analyze`) and use `suggested_shard_count` directly as the shard count. The analyzer already factors in the 10 TiB/shard cap, the expected array concurrency, and (if supplied) the core budget.

If `run/analyze.json` doesn't exist yet, invoke `xfer-manifest-analyze` first — don't guess shard counts from the raw manifest.

## Step 2 — Decide whether to override

Only override `suggested_shard_count` if one of the inputs that fed it has changed since analyze ran:

- The transfer cluster is different from what analyze assumed (different core budget).
- The array concurrency cap is different from what analyze assumed (default: `--assumed-array-concurrency=64`).
- The user wants a different per-shard byte cap (default 10 TiB).

In that case, **re-run `xfer-manifest-analyze`** with updated `--assumed-*` flags rather than hand-picking a new number here. Sharing the reasoning/assumptions via `run/analyze.json` is how downstream skills stay coherent.

Show the user `suggested_shard_count` alongside `shard_count_reasoning` from analyze, then confirm before running.

## Step 3 — Run shard

```bash
uv run xfer manifest shard \
  --in run/manifest.jsonl \
  --outdir run/shards \
  --num-shards <N> \
  --strategy bytes
```

`bytes` (the default) is almost always right — it minimizes long-tail tasks. Use `count` only if the user explicitly wants equal object counts per shard, or `hash` for deterministic assignment by key.

## Step 4 — Report

After the command completes, read `run/shards/shards.meta.json` and report:
- Number of shards created
- Min / max / median shard bytes (to show balance quality)
- Total object count

If the max/min bytes ratio is > 3x, warn the user — the manifest may have very large individual objects that a byte-balanced greedy bin-pack can't split. Suggest bumping the shard count or splitting the outlier files out into a separate run.

## After this skill

Recommend `xfer-slurm-render`. If the transfer cluster has a different view of source/dest than the build host, recommend `xfer-manifest-rebase` first.

diff --git a/.claude/skills/xfer-oneshot/SKILL.md b/.claude/skills/xfer-oneshot/SKILL.md
new file mode 100644
index 0000000..2375d6c
--- /dev/null
+++ b/.claude/skills/xfer-oneshot/SKILL.md
@@ -0,0 +1,100 @@
---
name: xfer-oneshot
description: Run the whole xfer pipeline (build → shard → render → optional submit) in a single `xfer run` invocation. Use this as an escape hatch for small, straightforward transfers where the user doesn't need the knobs that the staged skills (analyze, rebase, etc.) expose. Do NOT use for very large datasets, multi-cluster transfers, or any case where a separate vantage point (POSIX mount on one cluster, S3 on another) is involved — those need the staged pipeline.
---

# xfer-oneshot

Drives `xfer run` — the one-shot pipeline in `cli.py:1030`.
It invokes `manifest build`, `manifest shard`, and `slurm render` in order, optionally submitting. No `analyze`, no `rebase`, no load-aware cluster selection.

## When to use this — and when not to

**Good fit:**

- Source and destination are both reachable as rclone remotes from a single Slurm cluster.
- Object count and total bytes are small enough that default shard/concurrency knobs are fine.
- The user has already run through the full pipeline at least once for this dataset and trusts the defaults.
- Quick demos, smoke tests, CI-style small transfers.

**Bad fit — redirect to the staged pipeline:**

- Total bytes ≥ 10 TiB (`xfer-manifest-analyze` will compute the shard count properly).
- Source available only via a POSIX mount on a different cluster than the transfer cluster (need `xfer-manifest-rebase`).
- The user wants to tune rclone flags to the dataset's size profile (`xfer-manifest-analyze` does this).
- The user wants to inspect / edit the manifest, shards, or sbatch script before submitting.

If any of the "bad fit" conditions apply, **do not invoke this skill** — recommend `xfer-manifest-build` and the staged flow instead.

## Operating model

Runs from a **Slurm login node** (not the workstation), same as `xfer-manifest-build`, because `xfer run` calls `xfer manifest build` internally, which requires `srun` + pyxis. Pre-flight (uv installed, repo staged, venv synced, rclone.conf deployed) is identical to `xfer-manifest-build` Step 2 — invoke `xfer-manifest-build`'s pre-flight probe or `xfer-rclone-config` as needed before running this skill.

## Step 1 — Confirm fit

Ask the user explicitly:

- Rough total bytes? (If ≥ 10 TiB, redirect to the staged flow.)
- Same cluster for listing and transfer? (If not, redirect.)
- Do you want tuned rclone flags? (If yes, redirect to analyze first.)

Don't proceed silently — users sometimes reach for `xfer run` by habit when they shouldn't.
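For the rough total-bytes question, `rclone size <remote>:<path> --json` prints a JSON object with `count` and `bytes` fields, which can be checked against the 10 TiB redirect threshold. A minimal sketch (`fits_oneshot` is a hypothetical helper; it takes the command's captured output as a string):

```python
import json

TIB = 1024 ** 4

def fits_oneshot(rclone_size_json: str, threshold_bytes: int = 10 * TIB) -> bool:
    # e.g. the output of `rclone size s3src:bucket --json`
    stats = json.loads(rclone_size_json)
    return stats["bytes"] < threshold_bytes
```

If this returns `False`, redirect to the staged flow rather than proceeding.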
## Step 2 — Gather inputs

| Flag | Notes |
| --- | --- |
| `--run-dir` | default `run`; pick a fresh directory |
| `--source` | rclone remote or POSIX path |
| `--dest` | rclone remote |
| `--num-shards` | default 256; override only if the user has a reason |
| `--array-concurrency` | default 64 |
| `--rclone-image` | e.g. `rclone/rclone:latest` |
| `--rclone-config` | absolute path on **this cluster** (the login node = transfer host here, since `xfer run` assumes the same cluster) |
| `--rclone-flags` | sensible default is baked in; override if the user specifies |
| `--partition` | a CPU-only partition on this cluster |
| `--time-limit` | default `24:00:00` |
| `--cpus-per-task` | default 4 |
| `--mem` | default `8G` |
| `--max-attempts` | default 5 |
| `--sbatch-extras` | site-specific `--account=...`, `--qos=...` |
| `--pyxis-extra` | extra `srun --container-*` flags if needed |
| `--submit` | boolean; if set, sbatch runs immediately after render |

## Step 3 — Run

Via SSH to the login node, inside the xfer repo:

```bash
ssh <user>@<login_host> '
  cd <repo_path> &&
  uv run xfer run \
    --run-dir <run_dir> \
    --source <source> \
    --dest <dest> \
    --rclone-image <image> \
    --rclone-config <rclone_conf_path> \
    --partition <partition> \
    --sbatch-extras "<extras>" \
    [--submit]
'
```

Stream output so the user sees manifest progress, sharding stats, and the render summary. If `--submit` was set, capture the `Submitted batch job <job_id>` line.

## Step 4 — Report

Print a short summary:
- Total objects / total bytes (from the manifest build line).
- Number of shards (from the shard summary).
- Rendered artifacts in `<repo_path>/<run_dir>/` on the login node.
- If submitted: job id and the monitoring commands from `xfer-slurm-submit` Step 4 (`squeue`, `sacct`, logs, state markers).

## Safety

- Never overwrite an existing `run-dir` without confirming — `xfer run` will re-run build and clobber `manifest.jsonl`.
- If `--submit` is set, this skill consumes real cluster resources on the spot. Confirm before running.
- If the user wants to inspect the sbatch script before it runs, run **without** `--submit`, show them `<run_dir>/sbatch_array.sh`, then invoke `xfer-slurm-submit` separately.

## After this skill

If `--submit` was used, direct the user to the monitoring commands from `xfer-slurm-submit` Step 4. If not, `xfer-slurm-submit` is next.

diff --git a/.claude/skills/xfer-rclone-config/SKILL.md b/.claude/skills/xfer-rclone-config/SKILL.md
new file mode 100644
index 0000000..9a93d4d
--- /dev/null
+++ b/.claude/skills/xfer-rclone-config/SKILL.md
@@ -0,0 +1,114 @@
---
name: xfer-rclone-config
description: Create or extend an rclone.conf for xfer — collect S3 endpoints and credentials for each remote (source, destination, optional staging), write the config with 0600 permissions, optionally test it, and guide the user through deploying it to each Slurm cluster that will run xfer jobs. Use whenever the user needs to set up rclone remotes, is bootstrapping a new transfer, or a previous step discovered a missing rclone.conf on a cluster.
---

# xfer-rclone-config

Authors an rclone.conf suitable for xfer. The config is consumed by **containerized rclone** inside Slurm jobs, so it must be:

1. Readable on the workstation (convenient for local testing).
2. Present at a known absolute path on **every Slurm cluster** that will run any stage of xfer (`xfer manifest build` and the transfer array).
3. Restricted to `0600` permissions — it contains S3 secret keys.

The authoritative template is `rclone.conf.example` at the repo root.

## Per-stage rclone.conf paths

See CLAUDE.md's "Paths are per-system" cross-cutting invariant.
Applied to `--rclone-config`, the host where each stage runs is:

- `xfer manifest build` — the build cluster's login node
- `xfer slurm render` — the transfer cluster's compute nodes (the path is embedded into `sbatch_array.sh` and resolved at job time)
- Local `uv run xfer analyze/shard/rebase` — the workstation (only if the user wants to sanity-check remotes via host rclone)

Collect the absolute path per system — don't assume `~/.config/rclone/rclone.conf` means the same thing on every host.

## Step 1 — Inventory what's needed

Ask the user:

1. **Which remotes?** At minimum `source` and `destination`. Some transfers also need a staging remote. Each remote needs a short config-section name (e.g., `s3-src`, `s3-dest`, `weka-s3`).
2. **Per remote**: endpoint URL, access key ID, secret access key, provider (Other / AWS / Wasabi / Ceph / Minio / etc.), region (AWS only), and whether path-style addressing is required (true for most on-prem / VAST / Weka / Minio; false for AWS).
3. **Remote naming consistency across clusters.** If the same bucket will be referenced from more than one cluster, encourage the user to use **the same remote name everywhere** — this avoids needing `xfer-manifest-rebase` later. If a cluster has a POSIX mount of the same bucket, that's fine too; the build skill can point at the POSIX path directly on that cluster.
4. **Target file path on the workstation.** Default `~/.config/rclone/rclone.conf`. If the file already exists, default to **appending** new remotes rather than overwriting — confirm before touching existing sections.

Collect secrets **one at a time** and do not echo them back in your user-facing text. When running the write step, never log the secret values.
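Such a write step can be sketched with `configparser`, whose `key = value` output matches the template's INI style (verify against `rclone.conf.example` before relying on it; `write_remote` is a hypothetical helper):

```python
import configparser
import os

def write_remote(conf_path, name, *, endpoint, access_key_id,
                 secret_access_key, provider="Other", force_path_style=True):
    cfg = configparser.ConfigParser()
    cfg.read(conf_path)  # append to an existing file rather than clobbering it
    if cfg.has_section(name):
        raise ValueError(f"section {name!r} already exists; pick another name")
    cfg[name] = {
        "type": "s3",
        "provider": provider,
        "access_key_id": access_key_id,
        "secret_access_key": secret_access_key,
        "endpoint": endpoint,
        "force_path_style": str(force_path_style).lower(),
        "no_check_bucket": "true",
    }
    with open(conf_path, "w") as fh:
        cfg.write(fh)
    os.chmod(conf_path, 0o600)  # secrets: owner-only, always
```

Nothing in the function prints the secret values; keep it that way in any wrapper around it.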
## Step 2 — Write the config

For each remote, emit a section following the project's template style (`rclone.conf.example`):

```ini
[<name>]
type = s3
provider = <provider>
access_key_id = <access_key_id>
secret_access_key = <secret_access_key>
endpoint = <endpoint_url>
region = <region>
force_path_style = <true|false>
no_check_bucket = true
```

Notes:
- `force_path_style = true` for on-prem / VAST / Weka / Minio / Ceph. Leave unset for AWS.
- `no_check_bucket = true` is safe for xfer — workers assume the bucket exists; skipping the probe is faster at high concurrency.
- For AWS, set `region` and omit `endpoint`.

Write the file, then immediately lock it down:

```bash
chmod 600 <rclone_conf_path>
```

If appending to an existing file, confirm with the user that no section-name collisions exist before writing; if a section already exists, ask whether to overwrite or pick a different name.

## Step 3 — Smoke-test (optional, workstation-local)

If the user has `rclone` installed locally, offer to run:

```bash
rclone --config <rclone_conf_path> listremotes
rclone --config <rclone_conf_path> lsd <name>: --max-depth 1
```

The `lsd` probe validates credentials and endpoint reachability. If `rclone` is not installed on the workstation, skip — the real validation happens on the cluster at job time anyway.

## Step 4 — Deploy to each Slurm cluster

For **every cluster** that will run any xfer stage:

1. Ask the user the **absolute path** on that cluster where rclone.conf should live (e.g., `~/.config/rclone/rclone.conf`, `/home/<user>/xfer/rclone.conf`, or a site-provided shared location).
2. Check whether the cluster already has one at that path:

   ```bash
   ssh <user>@<cluster_host> 'test -f <cluster_conf_path> && echo PRESENT || echo MISSING'
   ```

3. If missing, copy it over (**confirm with the user first — this transmits credentials**):

   ```bash
   scp <local_conf_path> <user>@<cluster_host>:<cluster_conf_path>
   ssh <user>@<cluster_host> 'chmod 600 <cluster_conf_path>'
   ```

4. Record the `(cluster, path)` pairs in the current conversation so `xfer-manifest-build`, `xfer-slurm-render`, and `xfer-slurm-submit` can use the correct path per system.
Example shape: + + ``` + workstation → ~/.config/rclone/rclone.conf + alpha-login.example.com → /home/joe/.config/rclone/rclone.conf + beta-login.example.com → /shared/rclone/joe.conf + ``` + +## Step 5 — Remind about credential hygiene + +Tell the user, explicitly: + +- The file must be `0600` on every host it lives on. +- If credentials rotate, all copies must be updated — xfer does not re-fetch. +- Do not commit rclone.conf to the repo (`.gitignore` already excludes it; confirm if unsure). +- If a cluster has a site-managed shared rclone.conf (e.g., admin-provisioned at `/etc/rclone/rclone.conf`), prefer that and skip the scp step — don't duplicate credentials. + +## After this skill + +If the user is starting fresh, continue to `xfer-manifest-build`. If they just added a remote for a new cluster, they're good to re-run whichever stage prompted this detour. diff --git a/.claude/skills/xfer-slurm-render/SKILL.md b/.claude/skills/xfer-slurm-render/SKILL.md new file mode 100644 index 0000000..1078dac --- /dev/null +++ b/.claude/skills/xfer-slurm-render/SKILL.md @@ -0,0 +1,104 @@ +--- +name: xfer-slurm-render +description: Render Slurm batch scripts (`worker.sh`, `sbatch_array.sh`, `submit.sh`, `config.resolved.json`) for an xfer transfer. Use after manifest shard — and after rebase, if the vantage point changed — whenever the user wants to generate the runnable job artifacts. Picks partition and resources informed by current cluster load. +--- + +# xfer-slurm-render + +Drives `xfer slurm render` — produces the runnable job artifacts under `run/`. + +## Operating model + +Runs **locally on the workstation**. No SSH needed to render, but you will SSH to candidate login nodes in step 1 to check cluster load before picking the transfer cluster. + +## Step 1 — Pick the transfer cluster (load-aware, CPU-only preferred) + +If the user already decided which cluster to transfer on, skip to step 2. Otherwise, help them pick. + +1. 
Ask for the list of candidate Slurm clusters with login-node SSH access.
+2. For each candidate, probe load and CPU-only partitions:
+
+```bash
+ssh <user>@<candidate-login-host> '
+  sinfo -h -o "%P %a %l %D %C %G" | head -20
+  echo "---"
+  squeue -h -o "%T %C" | awk '"'"'{s[$1]+=$2} END{for(k in s) print k, s[k]}'"'"'
+'
+```
+
+- `%G` is the GRES column — a partition reporting `(null)` or no GPUs is CPU-only. **Prefer CPU-only partitions** — transfers are I/O-bound and GPU nodes are wasted here.
+- From `%C` (allocated/idle/other/total cores), compute idle-fraction per partition. Prefer the partition with the highest idle fraction.
+- From `squeue` aggregated by state, check how many jobs are queued (`PD`) vs running (`R`). High pending counts mean the next job will wait.
+
+Present a short comparison to the user and recommend one. Let them override.
+
+## Step 2 — Confirm "vantage-unchanged" invariant
+
+Before rendering, verify the manifest and shards reflect the transfer host's view of source/dest:
+
+- If the build host differs from the chosen transfer host **and** their rclone.conf views / POSIX mounts differ, **stop and invoke `xfer-manifest-rebase` first**. Otherwise workers will hit wrong-URI errors.
+
+Don't assume — if unsure, check `run/config.resolved.json` (if a prior render exists) against the chosen cluster's rclone remotes.
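The idle-fraction ranking described in step 1 can be sketched as a small parser. The sample `sinfo` lines and the `rank_partitions` helper are hypothetical illustrations, not part of xfer:

```python
# Rank partitions from `sinfo -h -o "%P %a %l %D %C %G"` output:
# prefer CPU-only (GRES == "(null)"), then highest idle fraction from %C (A/I/O/T).
def rank_partitions(sinfo_text: str) -> list[dict]:
    rows = []
    for line in sinfo_text.strip().splitlines():
        part, avail, timelimit, nodes, cpus, gres = line.split()
        alloc, idle, other, total = (int(x) for x in cpus.split("/"))
        rows.append({
            "partition": part.rstrip("*"),  # sinfo marks the default partition with "*"
            "cpu_only": gres == "(null)",
            "idle_fraction": idle / total if total else 0.0,
        })
    # CPU-only partitions first, then highest idle fraction.
    rows.sort(key=lambda r: (not r["cpu_only"], -r["idle_fraction"]))
    return rows

# Hypothetical probe output: one CPU partition, one GPU partition.
sample = """\
cpu* up 7-00:00:00 40 320/1280/0/1600 (null)
gpu up 2-00:00:00 16 500/140/0/640 gpu:a100:8
"""
print(rank_partitions(sample)[0]["partition"])  # → cpu
```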
+
+## Step 3 — Gather render inputs
+
+Collect from the user (with defaults from `run/analyze.json` and the chosen partition):
+
+| Flag | Source |
+| ---------------------- | ------------------------------------------------------- |
+| `--run-dir` | `run` (or whatever the prior skills used) |
+| `--num-shards` | count of `run/shards/shard_*.jsonl` |
+| `--array-concurrency` | ask user; typically 32–256, bounded by partition |
+| `--job-name` | ask user; default `xfer-<name>` |
+| `--time-limit` | ask user; default 24:00:00 |
+| `--partition` | from step 1 |
+| `--cpus-per-task` | 4 (default) |
+| `--mem` | 8G (default; bump for large-files profile) |
+| `--rclone-image` | `rclone/rclone:latest` |
+| `--rclone-config` | absolute path to rclone.conf **on the transfer cluster's compute nodes** (see note below) |
+| `--rclone-flags` | `suggested_flags` from `run/analyze.json` |
+| `--max-attempts` | 5 (default) |
+| `--sbatch-extras` | site-specific `--account=...`, `--qos=...`, etc. |
+| `--pyxis-extra` | extra `srun --container-*` flags if site requires them |
+| `--manifest` | optional; path to a specific manifest (pass `run/manifest.rebased.jsonl` if the rebase skill ran) |
+
+The `--rclone-config` path is baked into `sbatch_array.sh` and resolved **on the transfer cluster at job time**, not at render time. Render itself no longer requires the file to exist on the workstation — it only prints a warning if the local path is missing, since the actual consumer is the compute node. Still:
+
+- The path must be an absolute path valid on the cluster's compute nodes.
+- It must exist with `0600` permissions **on the cluster** before the job starts.
+
+If the user doesn't already have the config deployed to the cluster at a known path, invoke `xfer-rclone-config` to set it up and record the path. A wrong value here means every array task will fail identically at container start, so double-check the path.
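As a sketch of how these defaults fall out of the earlier artifacts, the helper below reads `suggested_flags` and `profile` from `run/analyze.json` and counts shard files. The function name, the `16G` bump, and the fallback logic are illustrative assumptions, not xfer API:

```python
import json
from pathlib import Path

def render_inputs(run_dir: str = "run") -> dict:
    """Derive render defaults from a prior analyze/shard run (illustrative helper)."""
    run = Path(run_dir)
    analyze = json.loads((run / "analyze.json").read_text())
    return {
        # --num-shards: count of run/shards/shard_*.jsonl
        "num_shards": len(list((run / "shards").glob("shard_*.jsonl"))),
        # --rclone-flags: suggested_flags from analyze
        "rclone_flags": analyze["suggested_flags"],
        # --mem: 8G default, bumped for the large-files profile (assumed bump size)
        "mem": "16G" if analyze["profile"] == "large_files" else "8G",
    }
```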
+
+Use `--manifest` whenever the user ran `xfer-manifest-rebase`: render reads `source_root` / `dest_root` from a manifest file, and the default path (`<run-dir>/manifest.jsonl`) is intentionally left at the pre-rebase vantage point as an audit record. Passing `--manifest run/manifest.rebased.jsonl` is how render picks up the rebased roots.
+
+## Step 4 — Render
+
+```bash
+uv run xfer slurm render \
+  --run-dir <run-dir> \
+  --num-shards <N> \
+  --array-concurrency <C> \
+  --job-name <job-name> \
+  --time-limit <HH:MM:SS> \
+  --partition <partition> \
+  --cpus-per-task <cpus> \
+  --mem <mem> \
+  --rclone-image <image> \
+  --rclone-config <cluster-path-to-rclone.conf> \
+  --rclone-flags "<suggested-flags>" \
+  --max-attempts 5 \
+  --sbatch-extras '<extras>' \
+  --pyxis-extra '<extras>' \
+  [--manifest run/manifest.rebased.jsonl]  # only after rebase
+```
+
+## Step 5 — Verify the outputs
+
+After render, show the user:
+- `run/sbatch_array.sh` — read and show the `#SBATCH` header so they can eyeball partition/time/mem
+- `run/config.resolved.json` — the frozen run config
+- `run/worker.sh` exists and is executable
+
+## After this skill
+
+Recommend `xfer-slurm-submit`.
diff --git a/.claude/skills/xfer-slurm-submit/SKILL.md b/.claude/skills/xfer-slurm-submit/SKILL.md
new file mode 100644
index 0000000..a20aa53
--- /dev/null
+++ b/.claude/skills/xfer-slurm-submit/SKILL.md
@@ -0,0 +1,109 @@
+---
+name: xfer-slurm-submit
+description: Copy a rendered xfer run directory to a Slurm login node and submit the array job via sbatch. Use after `xfer slurm render` whenever the user is ready to kick off the actual transfer. Handles workstation→cluster staging and post-submit monitoring pointers.
+---
+
+# xfer-slurm-submit
+
+Submits the rendered Slurm array job. Assumes `xfer-slurm-render` already produced `run/worker.sh`, `run/sbatch_array.sh`, `run/shards/`, etc. on the workstation.
+
+## Operating model
+
+The user renders locally and submits to the **transfer cluster's login node**. This skill:
+1. Copies (or syncs) the run directory to the login node.
+2. SSHes in and runs `sbatch`.
+3. 
Returns the job id and monitoring pointers.
+
+## Step 1 — Identify the transfer cluster and paths
+
+From `run/config.resolved.json` (or ask the user):
+- Transfer cluster login node (hostname, user)
+- Destination path for the run dir on the cluster (e.g., `~/xfer-runs/<run-name>`)
+- Path to `rclone.conf` on the cluster (referenced from `config.resolved.json`)
+
+Verify the rclone.conf path recorded in `run/config.resolved.json` exists on the transfer cluster — this is the path the compute nodes will see, which may differ from the workstation's path:
+
+```bash
+ssh <user>@<login-host> 'test -f <rclone-conf-path> && stat -c "%a" <rclone-conf-path>'
+```
+
+Expect `600`. If the file is missing, **stop and invoke `xfer-rclone-config`** to deploy it — that skill handles credential hygiene, 0600 permissions, and recording the per-cluster path. Don't `scp` the file directly from this skill.
+
+If the file exists but isn't `0600`, ask the user whether to fix it (`chmod 600 <rclone-conf-path>` via SSH) before proceeding.
+
+## Step 2 — Stage the run directory
+
+Use `rsync -av` so re-submits don't re-send unchanged shards:
+
+```bash
+rsync -av --exclude='logs/' --exclude='state/' \
+  ./<run-dir>/ \
+  <user>@<login-host>:<remote-run-dir>/
+```
+
+Exclude `logs/` and `state/` — those are produced by the running job, and re-sending stale local copies would clobber live progress.
+
+If this is a re-submit after failure, do **not** `--delete` — preserve any `.done` markers so completed shards are skipped on replay.
+
+## Step 3 — Submit
+
+**Prefer `uv run xfer slurm submit`** when the xfer repo is available on the transfer cluster's login node (same pre-flight as `xfer-manifest-build` — `uv` installed, repo synced, `.venv` bootstrapped). 
The CLI wraps `sbatch --export=NONE` (see `cli.py:1021`) so env vars from the parent shell don't leak into the array job:
+
+```bash
+ssh <user>@<login-host> '
+  cd <xfer-repo-path> &&
+  uv run xfer slurm submit --run-dir <remote-run-dir>
+'
+```
+
+If the xfer repo isn't on the transfer cluster (and the user doesn't want to stage it), fall back to raw `sbatch`:
+
+```bash
+ssh <user>@<login-host> '
+  cd <remote-run-dir> &&
+  sbatch --export=NONE sbatch_array.sh
+'
+```
+
+`--export=NONE` is important either way — it prevents `SLURM_*` env from the workstation or a parent job from leaking into the array job (e.g., a stale `SLURM_MEM_PER_NODE` colliding with `SLURM_MEM_PER_CPU`).
+
+Capture the returned job id (e.g., `Submitted batch job 12345`).
+
+## Step 4 — Report monitoring commands
+
+Print back to the user — ready to paste:
+
+```bash
+# Live queue state
+ssh <user>@<login-host> 'squeue -j <jobid>'
+
+# Completed tasks summary
+ssh <user>@<login-host> 'sacct -j <jobid> --format=JobID,State,Elapsed,ExitCode'
+
+# Per-shard logs
+ssh <user>@<login-host> 'ls <remote-run-dir>/logs/ | tail -20'
+
+# Back off array concurrency live (if S3 endpoint is rate-limiting)
+ssh <user>@<login-host> 'scontrol update ArrayTaskThrottle=<new-limit> JobId=<jobid>'
+
+# State markers: .done = success, .fail = last attempt failed
+ssh <user>@<login-host> 'ls <remote-run-dir>/state/ | sed "s/.*\.//" | sort | uniq -c'
+```
+
+## Step 5 — Tell the user about resume semantics
+
+Remind them of xfer's safety net:
+- Failed shards auto-requeue up to `--max-attempts` (set at render time).
+- Re-submitting the same `sbatch_array.sh` is safe — `.done` markers cause completed shards to skip.
+- To fully restart, delete `run/state/` on the cluster before re-submitting.
+
+## Safety
+
+- **Always confirm before copying rclone.conf** — it contains credentials.
+- **Always confirm the sbatch command** before running it — this is the moment real cluster resources get consumed.
+- Never run `scontrol update` or `scancel` on a running job without explicit user direction.
+- Do not use `rsync --delete` against a running job's run dir.
+
+## After this skill
+
+The pipeline is live. Direct the user to monitor via the commands printed in step 4.
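For interpreting the state directory programmatically, a minimal sketch, assuming marker files named `<shard>.done` / `<shard>.fail` as described above (`state_summary` is not an xfer command):

```python
from collections import Counter
from pathlib import Path

def state_summary(state_dir: str) -> Counter:
    """Count marker files by suffix, e.g. Counter({'.done': 98, '.fail': 2})."""
    return Counter(p.suffix for p in Path(state_dir).iterdir() if p.is_file())
```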
diff --git a/.gitignore b/.gitignore index c18dd8d..6658aed 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,2 @@ __pycache__/ +.claude/settings.local.json diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..8e252c5 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,44 @@ +# CLAUDE.md — xfer + +`xfer` orchestrates large S3↔S3 (and POSIX↔S3) data transfers via rclone-in-a-container, Slurm job arrays, and pyxis/enroot. Manifest-driven, sharded, resumable. + +## Mental model + +The **user's local workstation** is the orchestrator. Long-running and compute-bound work happens on **Slurm clusters** reached via SSH/scp to login nodes. Commands like `uv run xfer manifest build` that invoke `srun` must run on a login node, not the workstation. `analyze`, `shard`, `rebase`, and `render` are pure file processing and run locally. + +## Cross-cutting invariants + +- **Paths are per-system.** `--rclone-config`, the xfer repo path, and the run directory all differ between workstation, build cluster, and transfer cluster. Never assume the workstation's path resolves on the cluster. Always ask, verify with `ssh ... test -f`, or consult `run/config.resolved.json`. +- **POSIX-first for manifest build.** If any Slurm cluster has a POSIX mount of the source bucket, build the manifest there against the POSIX path. Listing is latency-bound; POSIX beats S3 by a large margin. +- **CPU-only, load-aware for transfer.** Prefer CPU-only partitions for both build and transfer. Pick the transfer cluster by current load (`sinfo`/`squeue` on candidate login nodes), not habit. +- **Vantage change ⇒ rebase.** If the manifest was built on a host whose view of source/dest differs from the transfer host's view, run `xfer manifest rebase` and re-shard before render. Skipping this makes every array task fail identically. +- **Credential hygiene.** rclone.conf is `0600` everywhere it lives. Never commit it (`.gitignore` already excludes it). Never echo secrets back to the user. 
Confirm before transmitting credentials over scp.
+- **Confirm before cluster-side actions.** SSH probes and `rsync` of local files are fine. `sbatch`, `scontrol update`, `scancel`, and any write to a shared path require explicit user confirmation.
+
+## Workflow skills
+
+The main pipeline is `build → analyze → shard → (rebase) → render → submit`. Invoke each skill when the user's intent matches its stage.
+
+| Skill | Stage |
+| ------------------------- | ---------------------------------------------------------------------- |
+| `xfer-manifest-build` | Run `xfer manifest build` on a login node (POSIX-preferred source) |
+| `xfer-manifest-analyze` | File-size histogram → suggested rclone flags + shard count |
+| `xfer-manifest-shard` | Byte-balanced split of the manifest into `run/shards/` |
+| `xfer-manifest-rebase` | Remap source/dest roots when vantage changes; re-shard after |
+| `xfer-slurm-render` | Render `worker.sh` / `sbatch_array.sh` / `config.resolved.json` |
+| `xfer-slurm-submit` | Stage run dir to transfer cluster, `sbatch`, return monitoring cmds |
+
+**On-demand / alternative entry points:**
+
+| Skill | When to use |
+| ------------------------- | ---------------------------------------------------------------------- |
+| `xfer-rclone-config` | One-time (per cluster) setup — create/deploy rclone.conf. Invoke when another skill discovers a missing config. |
+| `xfer-manifest-combine` | Alternative to `build` when the user already has parallel `rclone lsjson` parts and needs them unified into a single manifest. |
+| `xfer-oneshot` | `xfer run` escape hatch for small/simple transfers that don't need the staged knobs. Redirect to the staged pipeline for large or cross-vantage jobs. |
+
+## Dev conventions
+
+- Python ≥ 3.10, managed with `uv` (`uv venv && uv sync`; `uv run xfer --help`).
+- Formatting: `uv run black .`; pre-commit hook available via `uv run pre-commit install`.
+- Branch names: `<type>/<name>` or `<user>/<name>`, where type ∈ {feature, patch, docs}.
+- Do **not** squash-merge PRs.
diff --git a/README.md b/README.md
index 654d766..2c4eae6 100644
--- a/README.md
+++ b/README.md
@@ -309,6 +309,46 @@ run/
+## Using Claude Code with xfer
+
+This repo ships a set of [Claude Code](https://claude.com/claude-code) skills under `.claude/skills/` that walk through each stage of a transfer. Each skill drives the corresponding `xfer` subcommand and encodes conventions we've found important on real clusters.
+
+### Available skills
+
+Invoke each by intent in a Claude Code session, or explicitly via `/<skill-name>`.
+
+Main pipeline (`build → analyze → shard → (rebase) → render → submit`):
+
+| Skill | Stage |
+| ------------------------ | -------------------------------------------------------------------- |
+| `xfer-manifest-build` | Run `xfer manifest build` on a login node (POSIX source preferred) |
+| `xfer-manifest-analyze` | File-size histogram → suggested rclone flags and shard count |
+| `xfer-manifest-shard` | Byte-balanced split of the manifest into shards |
+| `xfer-manifest-rebase` | Remap source/dest roots when the transfer host's view differs |
+| `xfer-slurm-render` | Render `worker.sh` / `sbatch_array.sh` / `config.resolved.json` |
+| `xfer-slurm-submit` | Stage the run directory to the cluster and `sbatch` |
+
+On-demand / alternative entry points:
+
+| Skill | When to use |
+| ------------------------ | -------------------------------------------------------------------- |
+| `xfer-rclone-config` | One-time (per cluster) setup — create/deploy `rclone.conf` |
+| `xfer-manifest-combine` | Combine parallel `rclone lsjson` parts into a single manifest |
+| `xfer-oneshot` | `xfer run` escape hatch for small transfers that don't need the staged knobs |
+
+See `CLAUDE.md` for the cross-cutting context Claude loads in every session in this repo.
+
+### Conventions
+
+The skills (and `CLAUDE.md`) enforce a few invariants. 
These apply whether you drive xfer through Claude Code or by hand: + +- **Workstation orchestrates, clusters execute.** Run xfer from a local checkout in a `uv` environment. SSH to Slurm login nodes for `manifest build` and `sbatch`; `analyze`, `shard`, `rebase`, and `render` run locally. +- **Paths are per-system.** `--rclone-config`, the xfer repo path, and the run directory all differ between workstation, build cluster, and transfer cluster. Always resolve the correct absolute path on whichever host the command runs on — do not assume a workstation path resolves identically on a cluster. +- **POSIX-first manifest build.** If any Slurm cluster has a POSIX mount of the source bucket, build the manifest there against the POSIX path. Listing is latency-bound, and POSIX beats S3 by a wide margin. +- **CPU-only, load-aware transfer.** Prefer CPU-only partitions for both build and transfer. Pick the transfer cluster by current `sinfo`/`squeue` load rather than by habit. +- **Vantage change ⇒ rebase.** When the host that will run the transfer has a different view of source or destination than the host that built the manifest, run `xfer manifest rebase` and re-shard before render. Skipping this makes every array task fail identically. +- **Credential hygiene.** Keep `rclone.conf` at mode `0600` on every host it lives on. Never commit it to the repo, and confirm before transmitting it over `scp`. 
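The per-system path invariant can be made mechanical with a host-to-path registry. The hosts and paths below reuse the illustrative mapping from the rclone-config skill, and `probe_cmd` only constructs the verification command rather than running it:

```python
# Per-host registry of rclone.conf locations ("paths are per-system").
# Hosts and paths are illustrative, not real infrastructure.
RCLONE_CONF = {
    "workstation": "~/.config/rclone/rclone.conf",
    "alpha-login.example.com": "/home/joe/.config/rclone/rclone.conf",
    "beta-login.example.com": "/shared/rclone/joe.conf",
}

def probe_cmd(host: str) -> list[str]:
    """argv for the `ssh <host> test -f <path>` check to run before render/submit."""
    path = RCLONE_CONF[host]  # KeyError here means the path was never recorded
    return ["ssh", host, f"test -f {path} && echo PRESENT || echo MISSING"]

print(" ".join(probe_cmd("beta-login.example.com")[:2]))  # → ssh beta-login.example.com
```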
+ ## Design notes * **Manifest is immutable** → enables reproducibility and auditing diff --git a/pyproject.toml b/pyproject.toml index 9c94356..fb85195 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -30,5 +30,9 @@ select = ["E", "F", "I", "B", "UP"] dev = [ "black>=26.1.0", "pre-commit>=4.5.1", + "pytest>=8.0.0", ] +[tool.pytest.ini_options] +testpaths = ["tests"] + diff --git a/src/xfer/cli.py b/src/xfer/cli.py index 2ae9ffc..25d364f 100644 --- a/src/xfer/cli.py +++ b/src/xfer/cli.py @@ -467,6 +467,25 @@ def manifest_analyze( "--retries 10 --low-level-retries 20 --stats 600s --progress", help="Base rclone flags to always include", ), + assumed_cpus_per_task: int = typer.Option( + 4, + min=1, + help="Cores per shard worker (matches slurm_render default). Used only for shard-count suggestion.", + ), + assumed_array_concurrency: int = typer.Option( + 64, + min=1, + help="Expected Slurm array concurrency (matches slurm_render default). Used only for shard-count suggestion.", + ), + assumed_core_budget: Optional[int] = typer.Option( + None, + help="Total cores the transfer cluster's partition will make available. Used only for shard-count suggestion. If omitted, the core constraint is skipped.", + ), + max_shard_bytes_tb: int = typer.Option( + 10, + min=1, + help="Per-shard byte cap in TiB (no single shard should exceed this).", + ), ) -> None: """ Analyze manifest file sizes and suggest optimal rclone flags. 
@@ -481,6 +500,7 @@ def manifest_analyze(
         format_histogram_data,
         human_bytes,
         suggest_rclone_flags_from_sizes,
+        suggest_shard_count,
     )
 
     # Read manifest and extract sizes
@@ -505,6 +525,13 @@ def manifest_analyze(
     stats = compute_file_size_stats(sizes)
     suggestion = suggest_rclone_flags_from_sizes(sizes)
     histogram = format_histogram_data(sizes)
+    shard_suggestion = suggest_shard_count(
+        stats.total_bytes,
+        cpus_per_task=assumed_cpus_per_task,
+        array_concurrency=assumed_array_concurrency,
+        core_budget=assumed_core_budget,
+        max_shard_bytes_tb=max_shard_bytes_tb,
+    )
 
     # Combine base flags with profile-specific flags
     combined_flags = f"{suggestion.flags} {base_flags}"
@@ -528,6 +555,9 @@ def manifest_analyze(
         "profile": suggestion.profile,
         "profile_explanation": suggestion.explanation,
         "suggested_flags": combined_flags,
+        "suggested_shard_count": shard_suggestion.num_shards,
+        "shard_count_reasoning": shard_suggestion.reasoning,
+        "shard_count_assumptions": shard_suggestion.assumptions,
         "histogram": histogram,
     }
 
@@ -882,9 +912,7 @@ def slurm_render(
     rclone_image: str = typer.Option(..., help="Container image containing rclone"),
     rclone_config: Path = typer.Option(
         ...,
-        exists=True,
-        dir_okay=False,
-        help="Host path to rclone.conf",
+        help="Absolute path to rclone.conf on the transfer cluster's compute nodes. Not required to exist on this host; a warning is emitted if it is missing locally.",
         resolve_path=True,
     ),
     rclone_conf_in_container: str = typer.Option(
@@ -911,6 +939,11 @@ def slurm_render(
     pyxis_extra: str = typer.Option(
         "", help="Extra pyxis flags (string placed after --container-mounts...)"
     ),
+    manifest: Optional[Path] = typer.Option(
+        None,
+        help="Manifest JSONL to read source/dest_root from. Defaults to <run_dir>/manifest.jsonl. Use this after `xfer manifest rebase` to point render at the rebased file.",
+        resolve_path=True,
+    ),
 ) -> None:
     """
     Render worker.sh, sbatch_array.sh, and submit.sh under run_dir. 
@@ -919,12 +952,19 @@ def slurm_render( mkdirp(run_dir / "logs") mkdirp(run_dir / "state") - # If source/dest not provided, try to read first line of manifest.jsonl (if present) + if not rclone_config.exists(): + eprint( + f"WARNING: rclone_config {rclone_config} does not exist on this host. " + "That is fine if the path is valid on the transfer cluster's compute nodes. " + "Verify before submitting." + ) + + # If source/dest not provided, try to read first line of the manifest (if present) if source_root is None or dest_root is None: - manifest = run_dir / "manifest.jsonl" - if manifest.exists(): + manifest_path = manifest or (run_dir / "manifest.jsonl") + if manifest_path.exists(): first = None - for ln in manifest.read_text(encoding="utf-8").splitlines(): + for ln in manifest_path.read_text(encoding="utf-8").splitlines(): if ln.strip(): first = json.loads(ln) break @@ -934,7 +974,7 @@ def slurm_render( if not source_root or not dest_root: raise typer.BadParameter( - "source_root/dest_root not set and could not be read from run_dir/manifest.jsonl" + "source_root/dest_root not set and could not be read from the manifest (pass --manifest, --source-root, and/or --dest-root explicitly)" ) # Write scripts diff --git a/src/xfer/est.py b/src/xfer/est.py index 5d10e8c..fc31e09 100644 --- a/src/xfer/est.py +++ b/src/xfer/est.py @@ -569,6 +569,90 @@ def suggest_rclone_flags_from_sizes(sizes: List[int]) -> RcloneFlagsSuggestion: ) +@dataclass +class ShardCountSuggestion: + """Suggested shard count for `xfer manifest shard` based on bytes, cores, and concurrency.""" + + num_shards: int + reasoning: str + assumptions: Dict[str, Any] + + +def suggest_shard_count( + total_bytes: int, + *, + cpus_per_task: int = 4, + array_concurrency: int = 64, + core_budget: Optional[int] = None, + max_shard_bytes_tb: int = 10, +) -> ShardCountSuggestion: + """ + Suggest a shard count for a transfer based on three constraints: + + 1. 
Bytes cap: no single shard should carry more than ``max_shard_bytes_tb`` TiB + of data (worker wall-clock dominates the array's long tail otherwise). + 2. Concurrency cap: producing more than ``4 * array_concurrency`` shards is + wasteful (the scheduler only needs enough slack to keep the queue packed + as slow shards trail). + 3. Core cap (optional): if ``core_budget`` is supplied, the array can't + usefully exceed ``core_budget // cpus_per_task`` concurrent workers, so + producing more shards just lengthens the queue. + + Special case: if ``total_bytes`` is below the per-shard cap, return + ``num_shards=1`` — sharding isn't helpful. + """ + tib = 1024**4 + max_shard_bytes = max_shard_bytes_tb * tib + + assumptions: Dict[str, Any] = { + "total_bytes": total_bytes, + "cpus_per_task": cpus_per_task, + "array_concurrency": array_concurrency, + "core_budget": core_budget, + "max_shard_bytes_tb": max_shard_bytes_tb, + } + + if total_bytes < max_shard_bytes: + reasoning = ( + f"total_bytes ({human_bytes(total_bytes)}) is below the per-shard cap " + f"({max_shard_bytes_tb} TiB); a single shard is sufficient." 
+ ) + return ShardCountSuggestion( + num_shards=1, reasoning=reasoning, assumptions=assumptions + ) + + shards_by_bytes = math.ceil(total_bytes / max_shard_bytes) + shards_by_concurrency = 4 * array_concurrency + shards_by_cores: Optional[int] = None + if core_budget is not None: + shards_by_cores = max(1, core_budget // cpus_per_task) + + upper = shards_by_concurrency + if shards_by_cores is not None: + upper = min(upper, shards_by_cores) + + num_shards = max(1, min(upper, max(shards_by_bytes, 1))) + + reasoning_parts = [ + f"shards_by_bytes={shards_by_bytes} " + f"(total {human_bytes(total_bytes)} / {max_shard_bytes_tb} TiB cap)", + f"shards_by_concurrency={shards_by_concurrency} (4 x {array_concurrency})", + ] + if shards_by_cores is not None: + reasoning_parts.append( + f"shards_by_cores={shards_by_cores} " + f"({core_budget} cores / {cpus_per_task} cpus_per_task)" + ) + reasoning_parts.append( + f"chose max(1, min(upper={upper}, shards_by_bytes)) = {num_shards}" + ) + reasoning = "; ".join(reasoning_parts) + + return ShardCountSuggestion( + num_shards=num_shards, reasoning=reasoning, assumptions=assumptions + ) + + def format_histogram_data( sizes: List[int], ) -> List[Dict[str, Any]]: diff --git a/tests/test_est_shard_count.py b/tests/test_est_shard_count.py new file mode 100644 index 0000000..fb58106 --- /dev/null +++ b/tests/test_est_shard_count.py @@ -0,0 +1,72 @@ +from xfer.est import suggest_shard_count + +TIB = 1024**4 + + +def test_below_cap_returns_one_shard(): + suggestion = suggest_shard_count(5 * TIB) + assert suggestion.num_shards == 1 + assert "below the per-shard cap" in suggestion.reasoning + + +def test_zero_bytes_returns_one_shard(): + suggestion = suggest_shard_count(0) + assert suggestion.num_shards == 1 + + +def test_bytes_bounded_by_concurrency_default(): + # 100 TiB, default cap 10 TiB -> 10 shards by bytes. + # Defaults: array_concurrency=64 -> concurrency bound 256. No core_budget. + # min(256, max(10, 1)) = 10. 
+ suggestion = suggest_shard_count(100 * TIB) + assert suggestion.num_shards == 10 + + +def test_core_budget_tightens_upper_bound(): + # 100 TiB / 10 TiB = 10 shards by bytes. + # core_budget=40, cpus_per_task=4 -> core bound 10. + # concurrency bound = 4 * 64 = 256. + # min(min(256, 10), 10) = 10. Matches the bytes count exactly. + suggestion = suggest_shard_count(100 * TIB, core_budget=40, cpus_per_task=4) + assert suggestion.num_shards == 10 + assert "shards_by_cores=10" in suggestion.reasoning + + +def test_core_budget_caps_below_bytes_requirement(): + # 500 TiB / 10 TiB = 50 shards by bytes. + # core_budget=16, cpus_per_task=4 -> core bound 4. + # concurrency bound 256. upper = min(256, 4) = 4. + # num_shards = max(1, min(4, 50)) = 4. + suggestion = suggest_shard_count(500 * TIB, core_budget=16, cpus_per_task=4) + assert suggestion.num_shards == 4 + + +def test_concurrency_caps_below_bytes_requirement(): + # 500 TiB / 10 TiB = 50 shards by bytes. array_concurrency=8 -> bound 32. + # num_shards = max(1, min(32, 50)) = 32. + suggestion = suggest_shard_count(500 * TIB, array_concurrency=8) + assert suggestion.num_shards == 32 + + +def test_custom_max_shard_bytes_tb(): + # max_shard_bytes_tb=1 => cap is 1 TiB. + # 5 TiB at 1 TiB cap -> 5 shards by bytes. Default concurrency=64 -> upper 256. 
+ suggestion = suggest_shard_count(5 * TIB, max_shard_bytes_tb=1) + assert suggestion.num_shards == 5 + + +def test_assumptions_are_echoed(): + suggestion = suggest_shard_count( + 100 * TIB, + cpus_per_task=8, + array_concurrency=32, + core_budget=200, + max_shard_bytes_tb=20, + ) + assert suggestion.assumptions == { + "total_bytes": 100 * TIB, + "cpus_per_task": 8, + "array_concurrency": 32, + "core_budget": 200, + "max_shard_bytes_tb": 20, + } diff --git a/uv.lock b/uv.lock index 9e0f8b4..5b2eb22 100644 --- a/uv.lock +++ b/uv.lock @@ -85,6 +85,18 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/33/6b/e0547afaf41bf2c42e52430072fa5658766e3d65bd4b03a563d1b6336f57/distlib-0.4.0-py2.py3-none-any.whl", hash = "sha256:9659f7d87e46584a30b5780e43ac7a2143098441670ff0a49d5f9034c54a6c16", size = 469047, upload-time = "2025-07-17T16:51:58.613Z" }, ] +[[package]] +name = "exceptiongroup" +version = "1.3.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "typing-extensions", marker = "python_full_version < '3.13'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/50/79/66800aadf48771f6b62f7eb014e352e5d06856655206165d775e675a02c9/exceptiongroup-1.3.1.tar.gz", hash = "sha256:8b412432c6055b0b7d14c310000ae93352ed6754f70fa8f7c34141f91c4e3219", size = 30371, upload-time = "2025-11-21T23:01:54.787Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/8a/0e/97c33bf5009bdbac74fd2beace167cab3f978feb69cc36f1ef79360d6c4e/exceptiongroup-1.3.1-py3-none-any.whl", hash = "sha256:a7a39a3bd276781e98394987d3a5701d0c4edffb633bb7a5144577f82c773598", size = 16740, upload-time = "2025-11-21T23:01:53.443Z" }, +] + [[package]] name = "filelock" version = "3.20.3" @@ -103,6 +115,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/b8/58/40fbbcefeda82364720eba5cf2270f98496bdfa19ea75b4cccae79c698e6/identify-2.6.16-py2.py3-none-any.whl", hash = "sha256:391ee4d77741d994189522896270b787aed8670389bfd60f326d677d64a6dfb0", 
size = 99202, upload-time = "2026-01-12T18:58:56.627Z" }, ] +[[package]] +name = "iniconfig" +version = "2.3.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/72/34/14ca021ce8e5dfedc35312d08ba8bf51fdd999c576889fc2c24cb97f4f10/iniconfig-2.3.0.tar.gz", hash = "sha256:c76315c77db068650d49c5b56314774a7804df16fee4402c1f19d6d15d8c4730", size = 20503, upload-time = "2025-10-18T21:55:43.219Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12", size = 7484, upload-time = "2025-10-18T21:55:41.639Z" }, +] + [[package]] name = "markdown-it-py" version = "4.0.0" @@ -169,6 +190,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/cb/28/3bfe2fa5a7b9c46fe7e13c97bda14c895fb10fa2ebf1d0abb90e0cea7ee1/platformdirs-4.5.1-py3-none-any.whl", hash = "sha256:d03afa3963c806a9bed9d5125c8f4cb2fdaf74a55ab60e5d59b3fde758104d31", size = 18731, upload-time = "2025-12-05T13:52:56.823Z" }, ] +[[package]] +name = "pluggy" +version = "1.6.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/f9/e2/3e91f31a7d2b083fe6ef3fa267035b518369d9511ffab804f839851d2779/pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3", size = 69412, upload-time = "2025-05-15T12:30:07.975Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" }, +] + [[package]] name = "pre-commit" version = "4.5.1" @@ -194,6 +224,24 @@ wheels = [ { url = 
"https://files.pythonhosted.org/packages/c7/21/705964c7812476f378728bdf590ca4b771ec72385c533964653c68e86bdc/pygments-2.19.2-py3-none-any.whl", hash = "sha256:86540386c03d588bb81d44bc3928634ff26449851e99741617ecb9037ee5ec0b", size = 1225217, upload-time = "2025-06-21T13:39:07.939Z" }, ] +[[package]] +name = "pytest" +version = "9.0.3" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "colorama", marker = "sys_platform == 'win32'" }, + { name = "exceptiongroup", marker = "python_full_version < '3.11'" }, + { name = "iniconfig" }, + { name = "packaging" }, + { name = "pluggy" }, + { name = "pygments" }, + { name = "tomli", marker = "python_full_version < '3.11'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/7d/0d/549bd94f1a0a402dc8cf64563a117c0f3765662e2e668477624baeec44d5/pytest-9.0.3.tar.gz", hash = "sha256:b86ada508af81d19edeb213c681b1d48246c1a91d304c6c81a427674c17eb91c", size = 1572165, upload-time = "2026-04-07T17:16:18.027Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/d4/24/a372aaf5c9b7208e7112038812994107bc65a84cd00e0354a88c2c77a617/pytest-9.0.3-py3-none-any.whl", hash = "sha256:2c5efc453d45394fdd706ade797c0a81091eccd1d6e4bccfcd476e2b8e0ab5d9", size = 375249, upload-time = "2026-04-07T17:16:16.13Z" }, +] + [[package]] name = "pytokens" version = "0.4.0" @@ -425,6 +473,7 @@ dependencies = [ dev = [ { name = "black" }, { name = "pre-commit" }, + { name = "pytest" }, ] [package.metadata] @@ -437,4 +486,5 @@ requires-dist = [ dev = [ { name = "black", specifier = ">=26.1.0" }, { name = "pre-commit", specifier = ">=4.5.1" }, + { name = "pytest", specifier = ">=8.0.0" }, ]