Backup: chunked encrypted backup pipeline (Path B sub-project A)#16

Merged
kleopasevan merged 37 commits into main from feature/backup-pipeline
Apr 29, 2026

@kleopasevan
Contributor

Summary

Implements the spec at docs/superpowers/specs/2026-04-29-backup-pipeline-design.md (commit 9ed3564).

  • New nexus-backup crate: pure transforms — FastCDC chunking, BLAKE3 hashing, XChaCha20-Poly1305 convergent encryption, manifest serialization (bincode + zstd). 14 unit tests covering determinism, convergence, round-trips, tamper detection.
  • Trait addition: HostBackend::read_snapshot (LocalFile + Iscsi impls).
  • Migration 0036_backup.sql: backup_target, backup, backup_gc_run tables; volume.backup_* columns.
  • Manager backup_targets feature: AES-GCM envelope wrap of S3 secrets + per-target chunk key, full CRUD API.
  • Manager backups feature: orchestration (snapshot → agent_backup → DB + retention pruning + restore-to-new-volume), per-volume cron scheduler, daily mark-and-sweep GC per target, reconciler for stuck rows, DR index-rebuild subcommand, REST endpoints.
  • Agent: S3 client wrapper, chunker pipeline, POST /v1/storage/{backup,restore} routes, per-chunk retry with exponential backoff.
  • UI: BackupTarget, Backup, BackupSchedule types; useBackupTargets, useBackups hooks; BackupTargetForm, BackupList, RestoreDialog, BackupScheduleEditor, VolumeBackupsTab components; /backup-targets page; /volumes/[id] detail page (created from scratch).

Architecture

Chunking runs on the agent, so the manager is not a bandwidth bottleneck. Encryption is convergent per target: the same plaintext yields the same ciphertext, which keeps cross-volume dedup intact, while an S3 compromise on its own exposes only ciphertext. The manifest is stored in S3 (canonical, DR-safe) and in the DB (queryable). Resume is implicit via content-addressing (HEAD-before-PUT), with no checkpointing. GC is a daily mark-and-sweep per target with a 24h grace period.
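
As a rough illustration of why content-addressing gives resume for free, here is a minimal std-only sketch. `MemStore`, `chunk_key`, and `upload_chunks` are hypothetical names, the store is an in-memory stand-in for S3, and `DefaultHasher` is only a placeholder for BLAKE3 (it is not collision-resistant). Whether the real key is derived from plaintext or ciphertext is not stated in this PR, so the sketch simply hashes the bytes it is given.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical in-memory stand-in for an S3 bucket.
#[derive(Default)]
pub struct MemStore {
    objects: HashMap<String, Vec<u8>>,
    /// Number of PUTs actually issued, to make skipped uploads visible.
    pub puts: usize,
}

impl MemStore {
    /// Stand-in for an S3 HEAD request: does the object exist?
    pub fn head(&self, key: &str) -> bool {
        self.objects.contains_key(key)
    }

    /// Stand-in for an S3 PUT request.
    pub fn put(&mut self, key: String, body: Vec<u8>) {
        self.puts += 1;
        self.objects.insert(key, body);
    }
}

/// Placeholder content address. The PR uses BLAKE3; DefaultHasher is used
/// here only to keep the sketch free of external crates.
pub fn chunk_key(bytes: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    format!("chunks/{:016x}", hasher.finish())
}

/// Upload chunks, skipping any whose key already exists in the store.
/// Returns the ordered list of chunk keys (the core of a manifest).
pub fn upload_chunks(store: &mut MemStore, chunks: &[Vec<u8>]) -> Vec<String> {
    let mut manifest = Vec::with_capacity(chunks.len());
    for chunk in chunks {
        let key = chunk_key(chunk);
        if !store.head(&key) {
            store.put(key.clone(), chunk.clone());
        }
        manifest.push(key);
    }
    manifest
}
```

Re-running `upload_chunks` after an interrupted backup issues PUTs only for the chunks that are still missing, which is the "implicit resume" property; identical chunks from different volumes collapse onto one key for the same reason.
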

Test plan

  • cargo fmt --check clean
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo test --workspace --exclude installer: agent 13, manager 32 + 15 ignored, nexus-backup 14, nexus-storage 8
  • UI next build clean
  • All 6 final-review issues addressed (C1 manifest deletion, C2 restore UUID, I1 GC timing, I2 chunk retry, I4 scheduler concurrency guard, M1 iSCSI session cleanup)
  • DB-gated tests pass against a real Postgres (DATABASE_URL=… cargo test -- --ignored)
  • E2E backup-restore cycle against a real SeaweedFS / MinIO target
  • DR drill: manager backup index-rebuild --target <id> against a target with manifests

Deferred (per spec out-of-scope)

  • Embedded SeaweedFS lifecycle
  • Restore-in-place (only restore-to-new-volume is supported)
  • Cluster-wide selector-based policies
  • Live (non-snapshot) backup with quiescing
  • Bandwidth throttling, parallel chunk upload (sequential v1)
  • TrueNAS-iSCSI native snapshots (still stubbed NotSupported from foundation)
  • [backup] TOML config section (concurrency hardcoded; can be added in a follow-up)
  • PATCH /v1/backup_targets/:id is a 501 Not Implemented stub for v1; operators delete + recreate

Notes for review

  • This is half of "Path B" from the storage HCI roadmap. The other half (SPDK + Raft distributed block backend) is a separate PR.
  • Branch was created off main at 8aacd69; 36 commits.

🤖 Generated with Claude Code

kleopasevan and others added 30 commits April 29, 2026 11:21
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends the HostBackend trait with read_snapshot(&VolumeSnapshotHandle)
returning Box<dyn AsyncRead + Send + Unpin>, for use by the backup
pipeline to stream snapshot bytes. Adds tokio to nexus-storage
[dependencies] (was dev-only) so the trait can name tokio::io::AsyncRead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements HostBackend::read_snapshot for LocalFileHostBackend: treats
the snapshot locator as a file path and opens it with tokio::fs::File.
Adds a tokio test verifying round-trip file contents via AsyncReadExt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements HostBackend::read_snapshot for IscsiHostBackend: parses the
snapshot locator JSON, logs in via iscsiadm (same as attach), then polls
for the by-path block device to appear (up to 3 s) before returning a
tokio::fs::File handle for streaming reads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add BackupReq/BackupResp/RestoreReq/RestoreResp RPC types in a new
features/backups/types.rs, and extend storage/agent_rpc.rs with
agent_backup and agent_restore HTTP helpers (dead_code allowed — wired
in T16).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce apps/agent/src/features/storage/s3.rs with make_client() using
aws-sdk-s3 1.x, force_path_style for MinIO/Ceph compatibility, and async
head_object/put_object/get_object helpers used by the backup pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add FastCDC→encrypt→HEAD-or-PUT backup pipeline (backup.rs), extend
routes.rs with POST /v1/storage/backup and /v1/storage/restore handlers,
and add blake3 dep to agent for direct hashing in the pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds BackupRepository with insert_running, mark_completed, mark_failed,
get, list_for_volume, list_completed_oldest_first, delete_row, and
list_stale_running (for future GC).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Orchestrates snapshot → agent_backup → DB update + S3 manifest pruning
(enforce_retention). Restore provisions a new volume via the target
backend, delegates to agent_restore, and records the resulting volume.
Removes #[allow(dead_code)] from agent_backup/agent_restore now that
they are called by the service.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
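
The retention step in the commit above can be pictured with a small sketch. It assumes the input is already ordered oldest first (as `list_completed_oldest_first` suggests) and that retention keeps the newest `retain_count` completed backups; `BackupRow` and `backups_to_prune` are hypothetical names, not the actual `enforce_retention` code.

```rust
/// Hypothetical, trimmed-down view of a completed backup row.
#[derive(Debug, Clone, PartialEq)]
pub struct BackupRow {
    pub id: u32,
    pub created_at: u64,
}

/// Given completed backups ordered oldest first, return the prefix that
/// exceeds the retention count and should therefore be pruned.
pub fn backups_to_prune(
    completed_oldest_first: &[BackupRow],
    retain_count: usize,
) -> &[BackupRow] {
    let excess = completed_oldest_first.len().saturating_sub(retain_count);
    &completed_oldest_first[..excess]
}
```

With five completed backups and a retain count of three, the two oldest rows are returned; with fewer backups than the retain count, the slice is empty.
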
Wires GET/DELETE /v1/backups, GET /v1/backups/:id,
POST /v1/backups/:id/restore, and POST /v1/volumes/:id/backup into
the feature router.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add gc.rs with gc_loop (wakes once per hour, runs at each target's
configured gc_hour) and run_gc (records to backup_gc_run table).
Wire POST /v1/backup_targets/:id/gc for ad-hoc trigger. Spawn
gc_loop from main.rs after AppState construction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add reconciler.rs with reconcile_loop that wakes every 5 minutes,
queries list_stale_running(1440), and marks them failed with an
explanatory message. Remove #[allow(dead_code)] from
BackupRepository::list_stale_running (now used). Spawn from main.rs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
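
A hedged sketch of the staleness check behind the reconciler commit above. It assumes the `1440` argument is a threshold in minutes (24 hours), which the commit message does not state explicitly; `RunningBackup` and `stale_ids` are hypothetical names, and the real query runs in the database rather than in memory.

```rust
use std::time::{Duration, SystemTime};

/// Assumed meaning of `list_stale_running(1440)`: 1440 minutes.
pub const STALE_AFTER: Duration = Duration::from_secs(1440 * 60);

/// Hypothetical view of a backup row still marked as running.
pub struct RunningBackup {
    pub id: u32,
    pub started_at: SystemTime,
}

/// Return the ids of running backups that started at least STALE_AFTER ago.
/// Rows with a start time in the future (clock skew) are treated as fresh.
pub fn stale_ids(running: &[RunningBackup], now: SystemTime) -> Vec<u32> {
    running
        .iter()
        .filter(|b| {
            now.duration_since(b.started_at)
                .map(|age| age >= STALE_AFTER)
                .unwrap_or(false)
        })
        .map(|b| b.id)
        .collect()
}
```
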
Add scheduler.rs with schedule_loop that wakes once per minute,
queries volumes with backup_cron + backup_target_id set, computes
the next fire time after the last successful backup, and spawns
create_backup when due. Uses the cron 0.12 crate's Schedule::after API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the patch_backup_schedule handler that validates cron syntax via
the cron crate and performs a partial UPDATE on backup_cron,
backup_retain_count, and backup_target_id using COALESCE semantics.
Routes registered under /:id/backup_schedule with axum::routing::patch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds backups::index_rebuild::run() that pages through S3 manifest
objects for a target, decrypts each with the target key, and
INSERT-or-skips into the backup table. Wired as
`manager backup index-rebuild --target <uuid>` via raw args parsed
in main() before AppState construction, so the binary doubles as a
DR CLI tool without a separate binary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
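
The "raw args parsed in main()" approach from the commit above might look roughly like the following. This is a sketch only: the `Cli` enum and `parse_cli` are hypothetical, the arguments are assumed to exclude the program name (for example `std::env::args().skip(1)`), and the target is kept as an unvalidated string rather than a parsed UUID.

```rust
/// Hypothetical top-level mode for the manager binary.
#[derive(Debug, PartialEq)]
pub enum Cli {
    /// Default: start the server as usual.
    Serve,
    /// DR mode: rebuild the backup index for one target, then exit.
    IndexRebuild { target: String },
}

/// Parse raw arguments (without the program name) before any application
/// state is constructed.
pub fn parse_cli(args: &[String]) -> Result<Cli, String> {
    let parts: Vec<&str> = args.iter().map(String::as_str).collect();
    match parts.as_slice() {
        ["backup", "index-rebuild", "--target", target] => Ok(Cli::IndexRebuild {
            target: (*target).to_owned(),
        }),
        ["backup", "index-rebuild", ..] => {
            Err("usage: manager backup index-rebuild --target <uuid>".to_owned())
        }
        _ => Ok(Cli::Serve),
    }
}
```

Dispatching before `AppState` exists keeps the DR path independent of whatever the server needs at startup, which matters when the index being rebuilt is the thing that was lost.
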
Add BackupTarget, Backup, BackupSchedule types; facade API methods for
backup targets, backups, restore, and schedule; useBackupTargets and
useBackups TanStack Query hooks (B.T22).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add backup-target-form, backup-list, restore-dialog, backup-schedule-editor
components; VolumeBackupsTab for volume detail; backup-targets page route;
volumes/[id] detail page with Backups tab (B.T23).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kleopasevan and others added 7 commits April 29, 2026 12:48
…B row (C1)

Look up the backup row and target, decrypt credentials, and call
delete_object on the manifest key before deleting the DB row.  An
orphaned manifest was keeping all its chunks alive forever because GC
uses the manifest set as the live reference set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
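
To see why an orphaned manifest pins its chunks, here is a minimal mark-and-sweep sketch. It assumes a manifest can be reduced to a list of chunk keys and that the 24h grace period is measured from each chunk's last-modified time; `StoredChunk` and `sweep_candidates` are hypothetical names, not the contents of gc.rs.

```rust
use std::collections::HashSet;
use std::time::{Duration, SystemTime};

/// Grace period from the PR description (24 hours).
pub const GRACE: Duration = Duration::from_secs(24 * 60 * 60);

/// Hypothetical listing entry for a chunk object in the target bucket.
pub struct StoredChunk {
    pub key: String,
    pub last_modified: SystemTime,
}

/// Mark: collect every chunk key referenced by any manifest.
/// Sweep: return stored chunks that are unreferenced and older than GRACE.
pub fn sweep_candidates(
    manifests: &[Vec<String>],
    stored: &[StoredChunk],
    now: SystemTime,
) -> Vec<String> {
    let live: HashSet<&str> = manifests.iter().flatten().map(String::as_str).collect();
    stored
        .iter()
        .filter(|c| !live.contains(c.key.as_str()))
        .filter(|c| {
            now.duration_since(c.last_modified)
                .map(|age| age >= GRACE)
                .unwrap_or(false)
        })
        .map(|c| c.key.clone())
        .collect()
}
```

Because the live set is built purely from the manifests present in S3, a manifest that outlives its DB row keeps every chunk it lists in `live`; deleting the manifest first is what lets the next sweep reclaim them.
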
…d id (C2)

VolumeRepository::create inserts with DEFAULT gen_random_uuid() and
returns the full row.  Capture the inserted row and return inserted.id
so callers receive a UUID that actually exists in the volume table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…(I1)

The old formula slept to the next hour boundary; if the manager started
at gc_hour:00:01 the next wake landed at gc_hour+1:00:01 and the check
always missed.  Now we sleep only to the next minute boundary and check
on every matching minute.  A backup_gc_run lookup guards against
launching a second GC run within the same gc_hour window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
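
The timing fix in the commit above can be sketched with plain integer arithmetic. The sketch assumes `gc_hour` is interpreted in UTC and that the "already ran" guard compares the last recorded run against the start of the current hour; both are assumptions, and `secs_until_next_minute` and `should_run_gc` are hypothetical names.

```rust
/// Seconds to sleep so the next wake lands on a minute boundary.
/// On an exact boundary this returns a full 60 seconds.
pub fn secs_until_next_minute(unix_secs: u64) -> u64 {
    60 - unix_secs % 60
}

/// Decide whether GC should start on this tick.
/// `last_run_unix_secs` stands in for the backup_gc_run lookup.
pub fn should_run_gc(unix_secs: u64, gc_hour: u64, last_run_unix_secs: Option<u64>) -> bool {
    // Hour of day in UTC (assumption: the real code may use local time).
    let hour_of_day = (unix_secs / 3600) % 24;
    if hour_of_day != gc_hour {
        return false;
    }
    // Start of the current gc_hour window.
    let window_start = unix_secs - unix_secs % 3600;
    match last_run_unix_secs {
        Some(last) => last < window_start,
        None => true,
    }
}
```

Waking every minute means a manager that starts at `gc_hour:00:01` still sees up to 59 matching ticks inside the window, and the last-run check keeps those ticks from launching GC more than once.
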
HEAD, PUT (chunk), PUT (manifest), and GET (manifest + chunk) calls now
retry up to 5 times with 200 ms * 2^attempt backoff.  A transient
network blip no longer aborts the entire backup or restore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
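
A minimal synchronous sketch of the retry policy described in the commit above. It assumes "up to 5 times" means five attempts in total and that `attempt` is zero-based, so the first delay is 200 ms; the agent's real code is async, and `retry` and `backoff_delay` are hypothetical names. The sleep function is injected so the policy can be exercised without waiting.

```rust
use std::time::Duration;

/// Assumed total number of attempts per S3 call.
pub const MAX_ATTEMPTS: u32 = 5;

/// Delay before the retry that follows a failed attempt:
/// 200 ms * 2^attempt, with attempt starting at 0.
pub fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_millis(200 * 2u64.pow(attempt))
}

/// Run `op` until it succeeds or MAX_ATTEMPTS is reached, calling `sleep`
/// with the backoff delay between attempts. Returns the last error on failure.
pub fn retry<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt + 1 >= MAX_ATTEMPTS => return Err(err),
            Err(_) => {
                sleep(backoff_delay(attempt));
                attempt += 1;
            }
        }
    }
}
```

In blocking code `std::thread::sleep` can be passed as the `sleep` argument. Under these assumptions the delays are 200, 400, 800, and 1600 ms, so a call that fails on every attempt gives up after about three seconds of waiting.
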
MAX(created_at) includes running rows, so a 5-minute cron with a
10-minute backup would fire a second concurrent run at T+5.  Add a
NOT EXISTS sub-query to exclude volumes that already have a running
backup before the tick even evaluates the cron schedule.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
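
The guard in the commit above is a SQL `NOT EXISTS` sub-query; the same predicate can be shown in memory as a sketch. `Status`, `Backup`, and `schedulable_volumes` are hypothetical names, and the real filter runs inside the scheduler's volume query rather than in Rust.

```rust
/// Hypothetical backup status values.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Running,
    Completed,
    Failed,
}

/// Hypothetical, trimmed-down backup row.
pub struct Backup {
    pub volume_id: u32,
    pub status: Status,
}

/// Keep only volumes that have no running backup, mirroring
/// `NOT EXISTS (SELECT 1 FROM backup WHERE volume_id = ... AND status = 'running')`.
pub fn schedulable_volumes(volume_ids: &[u32], backups: &[Backup]) -> Vec<u32> {
    volume_ids
        .iter()
        .copied()
        .filter(|v| {
            !backups
                .iter()
                .any(|b| b.volume_id == *v && b.status == Status::Running)
        })
        .collect()
}
```

Filtering before the cron evaluation means a volume with a slow backup is simply skipped on each tick until that run finishes or is marked failed.
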
read_snapshot was calling iscsiadm_login but never logout, leaving a
dangling session after each backup.  Wrap the opened File in
IscsiSnapshotReader, which spawns a detached logout task in its Drop
impl.  The error path (device never appeared) now also explicitly calls
iscsiadm_logout before returning the error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
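
The cleanup-on-Drop pattern from the commit above, reduced to a synchronous sketch. The real `IscsiSnapshotReader` wraps a `tokio::fs::File` and spawns a detached logout task; here `LogoutOnDrop` is a hypothetical wrapper over any `std::io::Read` that runs a caller-supplied closure when dropped.

```rust
use std::io::{self, Read};

/// Hypothetical reader wrapper that runs a cleanup closure on drop.
pub struct LogoutOnDrop<R, F: FnOnce()> {
    inner: R,
    // Option so the FnOnce can be moved out of `&mut self` in Drop.
    logout: Option<F>,
}

impl<R, F: FnOnce()> LogoutOnDrop<R, F> {
    pub fn new(inner: R, logout: F) -> Self {
        Self {
            inner,
            logout: Some(logout),
        }
    }
}

impl<R: Read, F: FnOnce()> Read for LogoutOnDrop<R, F> {
    /// Reads are forwarded unchanged to the wrapped reader.
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.read(buf)
    }
}

impl<R, F: FnOnce()> Drop for LogoutOnDrop<R, F> {
    /// Runs the cleanup exactly once, whether the reader was fully
    /// consumed or abandoned early.
    fn drop(&mut self) {
        if let Some(logout) = self.logout.take() {
            logout();
        }
    }
}
```

Tying logout to `Drop` covers the success path and early returns in the caller alike; the one case it cannot cover is a failure before the reader exists, which is why the commit adds an explicit logout on the "device never appeared" path.
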
- RUSTSEC-2026-0098/0099/0104: rustls-webpki 0.101.x via aws-smithy-http-client
  → rustls 0.21. Our other code uses 0.103.13; the AWS SDK pins old rustls.
- RUSTSEC-2025-0141: bincode is unmaintained but functionally stable.
  Migration to postcard/ciborium is a follow-up.
@kleopasevan kleopasevan merged commit 8ccddaa into main Apr 29, 2026
8 checks passed
@kleopasevan kleopasevan deleted the feature/backup-pipeline branch April 29, 2026 06:27