Backup: chunked encrypted backup pipeline (Path B sub-project A) #16
Merged
kleopasevan merged 37 commits into main on Apr 29, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends the HostBackend trait with read_snapshot(&VolumeSnapshotHandle) returning Box<dyn AsyncRead + Send + Unpin>, for use by the backup pipeline to stream snapshot bytes. Adds tokio to nexus-storage [dependencies] (was dev-only) so the trait can name tokio::io::AsyncRead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements HostBackend::read_snapshot for LocalFileHostBackend: treats the snapshot locator as a file path and opens it with tokio::fs::File. Adds a tokio test verifying round-trip file contents via AsyncReadExt. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements HostBackend::read_snapshot for IscsiHostBackend: parses the snapshot locator JSON, logs in via iscsiadm (same as attach), then polls for the by-path block device to appear (up to 3 s) before returning a tokio::fs::File handle for streaming reads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
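The bounded wait for the by-path device can be illustrated with a small polling loop. This is a blocking, std-only sketch: the function name `wait_for_path` and the poll interval are assumptions, and the commit's actual implementation is async and returns a `tokio::fs::File` rather than a boolean.

```rust
use std::path::Path;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls until `path` exists or `timeout` elapses.
/// Returns true if the path appeared in time, false otherwise.
/// Hypothetical helper: the real code polls asynchronously for up to 3 s.
fn wait_for_path(path: &Path, timeout: Duration, interval: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if path.exists() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

A caller would pass the device path derived from the snapshot locator with a 3 s timeout, and treat `false` as the "device never appeared" error.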
…gent Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add BackupReq/BackupResp/RestoreReq/RestoreResp RPC types in a new features/backups/types.rs, and extend storage/agent_rpc.rs with agent_backup and agent_restore HTTP helpers (dead_code allowed — wired in T16). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce apps/agent/src/features/storage/s3.rs with make_client() using aws-sdk-s3 1.x, force_path_style for MinIO/Ceph compatibility, and async head_object/put_object/get_object helpers used by the backup pipeline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add FastCDC→encrypt→HEAD-or-PUT backup pipeline (backup.rs), extend routes.rs with POST /v1/storage/backup and /v1/storage/restore handlers, and add blake3 dep to agent for direct hashing in the pipeline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds BackupRepository with insert_running, mark_completed, mark_failed, get, list_for_volume, list_completed_oldest_first, delete_row, and list_stale_running (for future GC). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Orchestrates snapshot → agent_backup → DB update + S3 manifest pruning (enforce_retention). Restore provisions a new volume via the target backend, delegates to agent_restore, and records the resulting volume. Removes #[allow(dead_code)] from agent_backup/agent_restore now that they are called by the service. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
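The retention step can be sketched as a pure selection over completed backups ordered oldest first, which is the order `list_completed_oldest_first` suggests. The `BackupRow` shape and the function name `backups_to_prune` are hypothetical; the real `enforce_retention` also deletes S3 manifests and DB rows, which this sketch leaves out.

```rust
/// Hypothetical row shape carrying only the fields the sketch needs.
#[derive(Debug, Clone, PartialEq)]
struct BackupRow {
    id: u32,
    manifest_key: String,
}

/// Given completed backups ordered oldest first, returns the prefix that
/// exceeds `retain_count` and is therefore eligible for pruning.
fn backups_to_prune(oldest_first: &[BackupRow], retain_count: usize) -> &[BackupRow] {
    let excess = oldest_first.len().saturating_sub(retain_count);
    &oldest_first[..excess]
}
```

Because the input is oldest first, pruning a prefix always keeps the newest `retain_count` backups.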
Wires GET/DELETE /v1/backups, GET /v1/backups/:id, POST /v1/backups/:id/restore, and POST /v1/volumes/:id/backup into the feature router. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add gc.rs with gc_loop (wakes once per hour, runs at each target's configured gc_hour) and run_gc (records to backup_gc_run table). Wire POST /v1/backup_targets/:id/gc for ad-hoc trigger. Spawn gc_loop from main.rs after AppState construction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add reconciler.rs with reconcile_loop that wakes every 5 minutes, queries list_stale_running(1440), and marks them failed with an explanatory message. Remove #[allow(dead_code)] from BackupRepository::list_stale_running (now used). Spawn from main.rs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
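A minimal in-memory sketch of the reconciler's rule follows. It assumes staleness is measured from a `created_at` timestamp and that the threshold of 1440 is in minutes (24 hours); the `Row` and `Status` types and the failure message are illustrative, not the repository's actual schema.

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, Clone, PartialEq)]
enum Status {
    Running,
    Completed,
    Failed(String),
}

/// Hypothetical in-memory stand-in for a row of the backup table.
struct Row {
    id: u32,
    status: Status,
    created_at: SystemTime,
}

/// Marks every Running row older than `max_age_minutes` as Failed.
/// Returns how many rows were changed.
fn fail_stale_running(rows: &mut [Row], now: SystemTime, max_age_minutes: u64) -> usize {
    let max_age = Duration::from_secs(max_age_minutes * 60);
    let mut changed = 0;
    for row in rows.iter_mut() {
        // Rows dated in the future are treated as age zero.
        let age = now.duration_since(row.created_at).unwrap_or(Duration::ZERO);
        if row.status == Status::Running && age > max_age {
            row.status = Status::Failed("marked failed by reconciler: stuck in running".into());
            changed += 1;
        }
    }
    changed
}
```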
Add scheduler.rs with schedule_loop that wakes once per minute, queries volumes with backup_cron + backup_target_id set, computes the next fire time after the last successful backup, and spawns create_backup when due. Uses the cron 0.12 crate's Schedule::after API. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the patch_backup_schedule handler that validates cron syntax via the cron crate and performs a partial UPDATE on backup_cron, backup_retain_count, and backup_target_id using COALESCE semantics. Routes registered under /:id/backup_schedule with axum::routing::patch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
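The COALESCE semantics of the partial update can be modelled with `Option::or`: a field supplied in the patch wins, and a missing field keeps the stored value. The `Schedule` struct and its field types are assumptions made for this sketch; the handler itself performs the merge in SQL.

```rust
/// Hypothetical view of the three schedule columns on a volume.
#[derive(Debug, Clone, PartialEq, Default)]
struct Schedule {
    backup_cron: Option<String>,
    backup_retain_count: Option<i32>,
    backup_target_id: Option<String>,
}

/// Mirrors `COALESCE(new, old)` per column: take the patch value when it is
/// present, otherwise keep the current value.
fn apply_patch(current: Schedule, patch: Schedule) -> Schedule {
    Schedule {
        backup_cron: patch.backup_cron.or(current.backup_cron),
        backup_retain_count: patch.backup_retain_count.or(current.backup_retain_count),
        backup_target_id: patch.backup_target_id.or(current.backup_target_id),
    }
}
```

One consequence of this pattern, if the SQL matches the sketch, is that a null in the patch cannot clear a column; it is indistinguishable from "not provided".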
Adds backups::index_rebuild::run() that pages through S3 manifest objects for a target, decrypts each with the target key, and INSERT-or-skips into the backup table. Wired as `manager backup index-rebuild --target <uuid>` via raw args parsed in main() before AppState construction, so the binary doubles as a DR CLI tool without a separate binary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add BackupTarget, Backup, BackupSchedule types; facade API methods for backup targets, backups, restore, and schedule; useBackupTargets and useBackups TanStack Query hooks (B.T22). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add backup-target-form, backup-list, restore-dialog, backup-schedule-editor components; VolumeBackupsTab for volume detail; backup-targets page route; volumes/[id] detail page with Backups tab (B.T23). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…B row (C1) Look up the backup row and target, decrypt credentials, and call delete_object on the manifest key before deleting the DB row. An orphaned manifest was keeping all its chunks alive forever because GC uses the manifest set as the live reference set. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
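Why an orphaned manifest pins its chunks can be seen in a small mark-and-sweep model. This sketch assumes each manifest is a list of chunk keys and that a chunk is swept only when it is unreferenced and at least as old as the grace period (the PR summary mentions 24h); the function name and data shapes are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Mark: collect every chunk key referenced by any manifest.
/// Sweep: return unreferenced chunk keys whose age has reached the grace period.
/// `chunk_age_hours` maps chunk key -> age in hours (illustrative input shape).
fn sweep_candidates(
    manifests: &[Vec<String>],
    chunk_age_hours: &HashMap<String, u64>,
    grace_hours: u64,
) -> Vec<String> {
    let live: HashSet<&str> = manifests.iter().flatten().map(String::as_str).collect();
    let mut dead: Vec<String> = chunk_age_hours
        .iter()
        .filter(|(key, age)| !live.contains(key.as_str()) && **age >= grace_hours)
        .map(|(key, _)| key.clone())
        .collect();
    dead.sort(); // deterministic order for callers and tests
    dead
}
```

If a manifest object survives the deletion of its DB row, its chunk keys stay in `live` on every run and are never returned, which matches the leak this commit describes.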
…d id (C2) VolumeRepository::create inserts with DEFAULT gen_random_uuid() and returns the full row. Capture the inserted row and return inserted.id so callers receive a UUID that actually exists in the volume table. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…(I1) The old formula slept to the next hour boundary; if the manager started at gc_hour:00:01 the next wake landed at gc_hour+1:00:01 and the check always missed. Now we sleep only to the next minute boundary and check on every matching minute. A backup_gc_run lookup guards against launching a second GC run within the same gc_hour window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
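The timing bug and its fix can be reproduced with plain arithmetic on Unix seconds. The helper names below are hypothetical, hours are taken in UTC for simplicity, and the boolean `already_ran_this_window` stands in for the `backup_gc_run` lookup.

```rust
/// Hour of day (0..=23) in UTC for a Unix timestamp in seconds.
fn hour_of_day(unix_secs: u64) -> u64 {
    (unix_secs / 3600) % 24
}

/// Old behaviour: sleep until the next hour boundary.
fn secs_until_next_hour(now_unix_secs: u64) -> u64 {
    3600 - (now_unix_secs % 3600)
}

/// New behaviour: sleep only until the next minute boundary.
fn secs_until_next_minute(now_unix_secs: u64) -> u64 {
    60 - (now_unix_secs % 60)
}

/// GC starts when the current hour matches and no run has been recorded
/// for this gc_hour window yet.
fn should_start_gc(current_hour: u64, gc_hour: u64, already_ran_this_window: bool) -> bool {
    current_hour == gc_hour && !already_ran_this_window
}
```

Starting one second into the GC hour, the hourly sleep wakes in the following hour and misses the window, while the per-minute sleep wakes inside it; the run guard then keeps the later matching minutes from launching a duplicate.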
HEAD, PUT (chunk), PUT (manifest), and GET (manifest + chunk) calls now retry up to 5 times with 200 ms * 2^attempt backoff. A transient network blip no longer aborts the entire backup or restore. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
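The backoff schedule described here works out to 200, 400, 800, 1600 ms between attempts. The sketch below is synchronous and std-only, takes the sleep function as a parameter so it can be tested without waiting, and assumes "up to 5 times" means five attempts in total; the agent's real helpers are async and wrap S3 calls.

```rust
use std::time::Duration;

const MAX_ATTEMPTS: u32 = 5; // assumption: total attempts, not extra retries
const BASE_DELAY_MS: u64 = 200;

/// Delay before the retry that follows failed attempt number `attempt` (0-based).
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_millis(BASE_DELAY_MS * 2u64.pow(attempt))
}

/// Runs `op` until it succeeds or MAX_ATTEMPTS is reached, sleeping
/// 200 ms * 2^attempt between attempts. The last error is returned as-is.
fn retry<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt + 1 >= MAX_ATTEMPTS => return Err(err),
            Err(_) => {
                sleep(backoff_delay(attempt));
                attempt += 1;
            }
        }
    }
}
```

In production code the second argument would be a real sleep such as `std::thread::sleep`; injecting it keeps the schedule observable.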
MAX(created_at) includes running rows, so a 5-minute cron with a 10-minute backup would fire a second concurrent run at T+5. Add a NOT EXISTS sub-query to exclude volumes that already have a running backup before the tick even evaluates the cron schedule. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
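The effect of the NOT EXISTS guard can be modelled in memory: a volume is considered for scheduling only if it has a schedule and no backup of it is currently running. The types and the `has_schedule` flag are hypothetical simplifications of the `volume` and `backup` tables; the commit applies the filter in SQL before the cron check.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BackupStatus {
    Running,
    Completed,
    Failed,
}

/// Hypothetical in-memory stand-ins for rows of the backup and volume tables.
struct Backup {
    volume_id: u32,
    status: BackupStatus,
}

struct Volume {
    id: u32,
    has_schedule: bool, // stands in for backup_cron + backup_target_id being set
}

/// Returns ids of scheduled volumes that have no running backup,
/// i.e. the in-memory equivalent of the NOT EXISTS sub-query.
fn schedulable_volumes(volumes: &[Volume], backups: &[Backup]) -> Vec<u32> {
    volumes
        .iter()
        .filter(|v| v.has_schedule)
        .filter(|v| {
            !backups
                .iter()
                .any(|b| b.volume_id == v.id && b.status == BackupStatus::Running)
        })
        .map(|v| v.id)
        .collect()
}
```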
read_snapshot was calling iscsiadm_login but never logout, leaving a dangling session after each backup. Wrap the opened File in IscsiSnapshotReader, which spawns a detached logout task in its Drop impl. The error path (device never appeared) now also explicitly calls iscsiadm_logout before returning the error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
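The cleanup-on-drop pattern can be sketched with a generic wrapper that runs a callback when the reader goes out of scope. `LogoutOnDrop` is a hypothetical stand-in for `IscsiSnapshotReader`: it wraps a synchronous `std::io::Read` and calls the callback inline, whereas the commit wraps a `tokio::fs::File` and spawns a detached logout task.

```rust
use std::io::{self, Read};

/// Wraps a reader and runs `logout` exactly once when the wrapper is dropped.
struct LogoutOnDrop<R> {
    inner: R,
    logout: Option<Box<dyn FnOnce() + Send>>,
}

impl<R> LogoutOnDrop<R> {
    fn new(inner: R, logout: impl FnOnce() + Send + 'static) -> Self {
        Self { inner, logout: Some(Box::new(logout)) }
    }
}

impl<R: Read> Read for LogoutOnDrop<R> {
    // Reads are forwarded unchanged to the wrapped reader.
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.read(buf)
    }
}

impl<R> Drop for LogoutOnDrop<R> {
    fn drop(&mut self) {
        // `take` guarantees the callback cannot run twice.
        if let Some(logout) = self.logout.take() {
            logout();
        }
    }
}
```

Tying the logout to `Drop` means every exit path of the consumer, including early returns and errors, ends the session without the caller having to remember it.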
- RUSTSEC-2026-0098/0099/0104: rustls-webpki 0.101.x via aws-smithy-http-client → rustls 0.21. Our other code uses 0.103.13; the AWS SDK pins old rustls.
- RUSTSEC-2025-0141: bincode is unmaintained but functionally stable. Migration to postcard/ciborium is a follow-up.
Summary
Implements the spec at `docs/superpowers/specs/2026-04-29-backup-pipeline-design.md` (commit `9ed3564`).
- `nexus-backup` crate: pure transforms (FastCDC chunking, BLAKE3 hashing, XChaCha20-Poly1305 convergent encryption, manifest serialization with bincode + zstd). 14 unit tests covering determinism, convergence, round-trips, tamper detection.
- `HostBackend::read_snapshot` (LocalFile + Iscsi impls).
- `0036_backup.sql`: `backup_target`, `backup`, `backup_gc_run` tables; `volume.backup_*` columns.
- `backup_targets` feature: AES-GCM envelope wrap of S3 secrets + per-target chunk key, full CRUD API.
- `backups` feature: orchestration (snapshot → agent_backup → DB + retention pruning + restore-to-new-volume), per-volume cron scheduler, daily mark-and-sweep GC per target, reconciler for stuck rows, DR `index-rebuild` subcommand, REST endpoints.
- `POST /v1/storage/{backup,restore}` routes, per-chunk retry with exponential backoff.
- `BackupTarget`, `Backup`, `BackupSchedule` types; `useBackupTargets`, `useBackups` hooks; `BackupTargetForm`, `BackupList`, `RestoreDialog`, `BackupScheduleEditor`, `VolumeBackupsTab` components; `/backup-targets` page; `/volumes/[id]` detail page (created from scratch).
Architecture
Agent-side chunking (no manager bandwidth bottleneck). Per-target convergent encryption: same plaintext → same ciphertext → cross-volume dedup intact, but S3 compromise reveals nothing. Manifest in S3 (canonical, DR-safe) and DB (queryable). Implicit resume via content-addressing (HEAD-before-PUT); no checkpointing. Daily mark-and-sweep GC per target with 24h grace period.
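The implicit-resume property of HEAD-before-PUT can be shown against an in-memory object store. Everything named here (`ObjectStore`, `MemStore`, `upload_missing`) is hypothetical, and the chunk keys are passed in directly; in the pipeline the key is derived from the chunk content, which is what makes a repeated run skip work already done.

```rust
use std::collections::HashMap;

/// Minimal object-store interface for the sketch (not the agent's S3 helpers).
trait ObjectStore {
    fn head(&self, key: &str) -> bool;
    fn put(&mut self, key: &str, body: Vec<u8>);
}

#[derive(Default)]
struct MemStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStore for MemStore {
    fn head(&self, key: &str) -> bool {
        self.objects.contains_key(key)
    }
    fn put(&mut self, key: &str, body: Vec<u8>) {
        self.objects.insert(key.to_string(), body);
    }
}

/// Uploads only the chunks whose key is not yet present.
/// Returns the number of PUTs performed.
fn upload_missing<S: ObjectStore>(store: &mut S, chunks: &[(String, Vec<u8>)]) -> usize {
    let mut uploaded = 0;
    for (key, ciphertext) in chunks {
        if !store.head(key) {
            store.put(key, ciphertext.clone());
            uploaded += 1;
        }
    }
    uploaded
}
```

Re-running an interrupted backup over the same data finds the earlier chunks via HEAD and uploads only what is missing, so no checkpoint state is needed. The same check gives cross-volume dedup when two volumes under one target produce identical chunks.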
Test plan
- `cargo fmt --check` clean
- `cargo clippy --workspace --all-targets -- -D warnings` clean
- `cargo test --workspace --exclude installer`: agent 13, manager 32 + 15 ignored, nexus-backup 14, nexus-storage 8
- `next build` clean
- (`DATABASE_URL=… cargo test -- --ignored`)
- `manager backup index-rebuild --target <id>` against a target with manifests
Deferred (per spec out-of-scope)
- `NotSupported` from foundation)
- `[backup]` TOML config section (concurrency hardcoded; can be added in a follow-up)
- `PATCH /v1/backup_targets/:id` is a `501 Not Implemented` stub for v1; operators delete + recreate
Notes for review
8aacd69; 36 commits.
🤖 Generated with Claude Code