diff --git a/.workhorse/specs/jobs/backup.md b/.workhorse/specs/jobs/backup.md new file mode 100644 index 00000000..a99f4e99 --- /dev/null +++ b/.workhorse/specs/jobs/backup.md @@ -0,0 +1,67 @@ +--- +id: BKJ +--- + +# Backup control plane + +Canopy maintains, verifies, and watches over the fleet's backups itself — clients neither run maintenance nor hold the rights to do so. +This is the autonomous half of the backup system: the work Canopy does on a cadence with no device asking, and the health signals it raises from it. + +## Scope + +This spec covers Canopy's own background backup work: repo maintenance, inspection, storage metering, upstream preflight, and the detection and alerting that turn all of it into incidents. + +It does not cover the device contract (see [BAK](../public-server/backup.md)), the operator's configuration of a group (see [BKO](../private-server/backup.md)), or restore-health (the managed restore replicas spec, `RST`). + +Canopy acts only on groups whose configuration is ready, runs at most one operation per group at a time, and bounds how many groups it works on at once. + +## Maintenance + +Canopy runs each group's repo maintenance on a cadence — clients are never granted the rights to do so. +It enforces the group's retention as part of maintenance, and records every run's outcome so a stuck or failing maintenance is itself detectable. + +## Passphrase rotation + +Canopy rotates each group's repo passphrase on a cadence, so a leaked passphrase is useful only until the next rotation rather than indefinitely. +Rotation is crash-safe: an interrupted rotation is reconciled on the next attempt, and throughout it the repo stays openable with either the previous or the new passphrase — it is never left unopenable. +Like maintenance, rotation is Canopy's to do; operators never run it. 
+ +## Recovery escrow + +Because Canopy holds the only copy of every passphrase, it continuously escrows the state needed to recover access without it: the per-group passphrases and repo coordinates, and the group, server, configuration, schedule, and capability records that frame them. +The escrow is encrypted to a set of offline recipient keys whose private halves Canopy never holds, and written to versioned, object-locked storage. +So Canopy can write the escrow but never read it back — a full Canopy compromise cannot disclose the escrowed secrets, and object-lock keeps past versions undeletable until they expire. +Recipients are mandatory: Canopy refuses to run without them, so there is never a silent recovery gap. +This is the escrow the operator recovery ceremony verifies (see [BKO](../private-server/backup.md)). + +## Inspection + +Canopy periodically inspects each group's repo against the storage directly, independent of what devices reported: + +- it verifies repo integrity, and a failed verification is repo corruption; +- it inventories the repo — the latest snapshot per source — as the ground truth a device's report is reconciled against; +- it records repo size, logical and physical, and the storage cost basis for display. + +## Upstream preflight + +Canopy watches its own access to each group's storage, so a broken control plane is caught at the source rather than when the fleet starts failing. +It checks that its identity resolves, that it can assume each group's role and perform a read-only no-op, and that the bucket's object-lock is present and at least the required retention. +Preflight only alerts; it never pulls Canopy out of service, because a failing check must not make a degraded situation worse. 
+ +## Detection + +Canopy reconciles three sources — what a device reported, what credentials were issued, and what actually landed in the repo — and alerts on disagreement: + +- **staleness** — a server with a prior successful backup but none recent, or one that has never backed up though it has been expected long enough. +- **reconcile** — a device reported a successful backup but no matching snapshot landed (the report is false or the upload didn't persist), or a fresh snapshot exists but no recent report (the reporting path is broken). +- **maintenance** — a group whose maintenance is overdue, or whose most recent maintenance failed. + +## Alerting + +Backup alerts are raised at one of two scopes: + +- **Per-server** signals (staleness, never-backed-up, the report-gap) are subject to the server's monitoring gate: still recorded for visibility, but they contribute to an incident only when the server is monitored, because some servers are intentionally intermittent. +- **Group-level** signals (repo corruption, maintenance failure, missing-snapshot reconciliation, preflight failures, and restore-verification — see the managed restore replicas spec, `RST`) page regardless of any member's monitoring state, because they are control-plane or data-safety concerns that belong to no single server. + +Each signal has a stable key by which operators silence or snooze it and by which the interface and notifications refer to it; the keys are a contract and are not renamed without migrating stored silences. +A signal recovers when the condition that raised it clears. 
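Review note on the BKJ spec above: the reconcile rule and the two alert scopes can be sketched in a few lines of Rust. This is a minimal sketch, not Canopy's implementation. The type and variant names (`Signal`, `Scope`, `reconcile`, `contributes_to_incident`) are hypothetical, and the spec's stable signal keys are not listed in it, so none are invented here. The only behaviour encoded is what the Detection and Alerting sections state.

```rust
// Illustrative only: names and shapes are assumptions, not Canopy's real types.

/// The scope at which a backup alert is raised.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Scope {
    PerServer,
    Group,
}

/// A subset of the signals the spec names. Variant names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Signal {
    Staleness,
    NeverBackedUp,
    ReportGap,
    MissingSnapshot,
    RepoCorruption,
    MaintenanceFailure,
    PreflightFailure,
}

impl Signal {
    /// Scope assignment as listed in the Alerting section.
    fn scope(self) -> Scope {
        match self {
            Signal::Staleness | Signal::NeverBackedUp | Signal::ReportGap => Scope::PerServer,
            Signal::MissingSnapshot
            | Signal::RepoCorruption
            | Signal::MaintenanceFailure
            | Signal::PreflightFailure => Scope::Group,
        }
    }
}

/// Reconcile what a device reported against what landed in the repo.
/// The (false, false) case is left to the separate staleness check.
fn reconcile(reported_success: bool, snapshot_landed: bool) -> Option<Signal> {
    match (reported_success, snapshot_landed) {
        // Reported success, but no matching snapshot landed.
        (true, false) => Some(Signal::MissingSnapshot),
        // A fresh snapshot exists, but there is no recent report.
        (false, true) => Some(Signal::ReportGap),
        _ => None,
    }
}

/// Every signal is recorded; this decides only whether it feeds an incident.
fn contributes_to_incident(signal: Signal, server_monitored: bool) -> bool {
    match signal.scope() {
        // Group-level signals page regardless of any member's monitoring state.
        Scope::Group => true,
        // Per-server signals pass through the server's monitoring gate.
        Scope::PerServer => server_monitored,
    }
}
```

Keeping the scope as a property of the signal, rather than of the call site, means the monitoring gate cannot be applied to a group-level signal by accident.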
diff --git a/.workhorse/specs/private-server/backup.md b/.workhorse/specs/private-server/backup.md new file mode 100644 index 00000000..774bc033 --- /dev/null +++ b/.workhorse/specs/private-server/backup.md @@ -0,0 +1,70 @@ +--- +id: BKO +--- + +# Operator backup control + +An operator configures, through Canopy, how a server group backs up: where its repo lives, on what cadence, with what retention, and which servers and types participate. +Canopy owns the repo passphrase throughout — it is generated or accepted once, stored in Canopy's secret store, and never handed back except through the audited recovery ceremony. + +## Scope + +This spec covers the operator-facing control surface: per-group backup configuration and its lifecycle, scheduling and retention, per-server participation, on-demand backups, the status view, and passphrase recovery. + +It does not cover the device contract (see [BAK](../public-server/backup.md)) or Canopy's autonomous maintenance, inspection, detection, and alerting (see [BKJ](../jobs/backup.md)). + +Reads are available to any tailnet user; changes require an administrator. + +## Per-group configuration + +A group has at most one backup configuration: the bucket, prefix, region, the cross-account roles Canopy assumes, the reference to the group's passphrase, and its placement and lifecycle state. + +Placement is one of: + +- **external** — the operator brings their own bucket and supplies the role ARNs Canopy will assume. +- **shared** — Canopy provisions and names a bucket in its own shared account; the operator supplies nothing about location. + +A configuration is created once and its structural fields (bucket, roles, placement) are immutable; the region and the operational settings below are editable. +Decommissioning a group deletes its configuration row — which stops all credential issuance for the group — and deletes the Canopy-owned passphrase. 
+The bucket and its object-locked contents persist independently and are not Canopy's to delete; teardown is a separate, deliberate act gated by the lock window. + +## Lifecycle and provisioning + +A configuration moves from **provisioning** to **ready**; devices are refused until it is ready. +Creating a configuration sets it provisioning and asks Canopy to create or connect the repo; that work transitions the configuration to ready, or records the error it failed with so the operator sees why. +The operator interface depends only on these observable states, not on how provisioning is carried out. + +A configuration may also be created or reconciled idempotently by machine — for infrastructure-as-code — under administrator-equivalent authentication, with the same probe and provisioning behaviour as the interactive path. + +## Setup and the passphrase + +When a configuration is created, Canopy probes the target bucket and classifies it: empty, an existing kopia repo, holding unrelated content, or inaccessible. +The classification chooses the mode: + +- **from-birth** — an empty bucket; Canopy generates a fresh passphrase and creates a new repo. +- **passphrase** — an existing repo; the operator supplies its passphrase and Canopy connects to it. + +A bucket holding unrelated content is refused rather than written into; Canopy never deletes to make room. +Either way Canopy creates and owns the passphrase secret, and configuration and secret are created together — if the secret cannot be stored, the configuration is rolled back, so a configuration never exists without its passphrase. +The supplied or generated passphrase is only the starting point: Canopy rotates it on a cadence thereafter (see [BKJ](../jobs/backup.md)), and the recovery ceremony recovers whatever the current passphrase is. 
+ +## Scheduling and retention + +Each `(group, type)` has an expected backup interval and a retention policy, taken from a per-`(group, type)` override when set, otherwise from the fleet-wide default for that type. +A manual-only type has no interval and is backed up only on an explicit request. +Retention is floored to an organisational minimum; a configuration may deliberately opt out of the floor, which is recorded as the dangerous choice it is. + +## Participation and on-demand + +A server participates in a type when that type is an enabled capability on it; an operator toggles participation per `(server, type)`. +An operator may queue a one-off backup — or restore — for a `(server, type)` to run on the next cycle, and may cancel a queued one before it runs. + +## Status + +The operator can see, per group: the repo's size and cost basis, recent runs with their outcomes and errors, recent maintenance, the latest snapshot per server, and any in-flight or pending one-off requests. + +## Passphrase recovery + +Because Canopy owns the only copy of each passphrase, the ability to recover it without Canopy is verified, not assumed. +Recovery is a ceremony: a passphrase is escrowed encrypted to a set of offline recipient keys, and an operator periodically proves the recipients can still decrypt it. +The ceremony is recorded so staleness — too long since the last successful proof — is visible. diff --git a/.workhorse/specs/public-server/backup.md b/.workhorse/specs/public-server/backup.md new file mode 100644 index 00000000..b132121b --- /dev/null +++ b/.workhorse/specs/public-server/backup.md @@ -0,0 +1,63 @@ +--- +id: BAK +--- + +# Device backups + +A server device backs up to object storage that Canopy mediates: it holds no long-lived storage credentials and knows nothing of the bucket layout until Canopy tells it, per run. +Canopy is the control plane — it owns the credentials, the repo location, the passphrase, and the record of what ran. 
+ +## Scope + +This spec covers the device-facing contract: how a device learns what it may back up, obtains short-lived credentials and the repo coordinates for a run, and reports the outcome. + +It does not cover what an operator configures (see [BKO](../private-server/backup.md)), what Canopy does on its own — maintenance, inspection, detection, alerting (see [BKJ](../jobs/backup.md)) — or restoring backups (the managed restore replicas spec, `RST`). + +## Identity and resolution + +A device authenticates with the `server` role, over either transport Canopy accepts (a client certificate on the internet-facing path, or tailnet identity on the private mount). +Every device request resolves through the authenticated identity, never the request body: device → its single live server → that server's group → the group's backup configuration. +A device bound to no live server is refused; a server with no group, or whose group has no ready configuration, is refused. + +## Capabilities + +A device registers the backup types it can run on its server. +A newly seen type is enabled for scheduling or not according to that type's fleet default; a type already known keeps the operator's setting. +Registration requires the server to be grouped, but not the group's configuration to be ready. + +## Credentials + +A device requests credentials for a `(type, purpose)`. +Canopy issues short-lived credentials by assuming the group's dedicated cross-account storage role under a session policy that confines them to the group's bucket and prefix: + +- **backup** purpose grants the write set kopia needs, including a version-less delete — but never deletion of a locked version, nor any weakening of object-lock or retention. +- **restore** purpose grants read-only access. + +The credentials carry the storage role's identity for at most an hour; a device refreshes them as a run outruns that lifetime. +Every issuance is recorded before the credentials are returned. 
+ +A `(type, purpose)` is issuable only when the type is an enabled capability of the server, or an operator has queued a one-off request of that purpose for it; otherwise it is refused. +The group's configuration must be ready: until then the endpoints refuse, so a half-provisioned group cannot be written to. + +## Target + +A device fetches the repo coordinates for its group each run: the storage kind, bucket, prefix, region, and the repo passphrase. +The passphrase is Canopy-owned and read from the group's secret store at request time; the device never stores it. + +## Reporting + +A device reports each run's outcome: the type and purpose, success or failure, an error when it failed, the resulting snapshot identifier, the bytes uploaded, and the object-storage traffic the run moved. +The run is keyed by an identifier the device mints at the start of the run; the device, server, and group are taken from the authenticated context, so a device cannot report a run as another group's. +A duplicate run identifier is refused. +Reporting a run clears any matching operator one-off request, so the standing "back up now" prompt stops. + +## Guarantees + +A compromised device cannot destroy backups. +Its credentials cannot delete a locked object version or weaken the bucket's object-lock; at worst it writes a delete-marker that object-lock and versioning leave recoverable. +Decommissioning a device is revoking its certificate: it can no longer obtain credentials, and any it already holds expire within the hour. + +## Failure contract + +The device endpoints distinguish: the caller is bound to no live server; the server is ungrouped, has no ready configuration, the type is not issuable, or a run identifier is duplicate; and Canopy's own dependency — the credential issuer or the secret store — is unavailable or unconfigured. +Each is a distinct, stable status so a device need not guess. 
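Review note on the BAK spec above: the resolution chain and the issuability rule for a `(type, purpose)` can be sketched as a single check. This is a minimal sketch under stated assumptions. `Server`, `GroupState`, `Refusal`, and `check_issuable` are hypothetical names, the order in which the refusals are tested is an assumption, and no status codes are assigned because the spec says only that the statuses are distinct and stable.

```rust
// Illustrative only: names, shapes, and check order are assumptions.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Purpose {
    Backup,
    Restore,
}

/// State of the server's group as seen by the device endpoints.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GroupState {
    Ungrouped,
    Provisioning,
    Ready,
}

/// Reasons a credentials request is refused (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
enum Refusal {
    NoLiveServer,
    Ungrouped,
    ConfigNotReady,
    NotIssuable,
}

/// The facts about a resolved server that the check needs.
struct Server {
    group: GroupState,
    /// Backup types enabled as capabilities on this server.
    enabled_types: Vec<&'static str>,
    /// One-off requests an operator has queued, as (type, purpose).
    queued_one_offs: Vec<(&'static str, Purpose)>,
}

/// `server` is what the authenticated device identity resolved to, if anything.
fn check_issuable(
    server: Option<&Server>,
    backup_type: &str,
    purpose: Purpose,
) -> Result<(), Refusal> {
    // A device bound to no live server is refused.
    let server = server.ok_or(Refusal::NoLiveServer)?;

    // The group must exist and its configuration must be ready.
    match server.group {
        GroupState::Ungrouped => return Err(Refusal::Ungrouped),
        GroupState::Provisioning => return Err(Refusal::ConfigNotReady),
        GroupState::Ready => {}
    }

    // Issuable when the type is an enabled capability, or an operator has
    // queued a one-off request of this purpose for it.
    let enabled = server.enabled_types.iter().any(|t| *t == backup_type);
    let queued = server
        .queued_one_offs
        .iter()
        .any(|(t, p)| *t == backup_type && *p == purpose);

    if enabled || queued {
        Ok(())
    } else {
        Err(Refusal::NotIssuable)
    }
}
```

Note that a queued one-off matches on purpose as well as type, so a queued restore does not make a backup of the same type issuable.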
diff --git a/docs/plans/backup-credentials-implementation-order.md b/docs/plans/backup-credentials-implementation-order.md deleted file mode 100644 index d0ce8300..00000000 --- a/docs/plans/backup-credentials-implementation-order.md +++ /dev/null @@ -1,351 +0,0 @@ -# Backup-credentials — cross-repo implementation order - -Direction / ordering doc for building the backup-credentials system across the -four repos (`canopy`, `ops/pulumi`, `bestool`, `pgro`). It does **not** restate -the design — read [`backup-credentials.md`](./backup-credentials.md) for that, -and each component spec for the how. This document answers one question: **in -what order, and on which tracks, do we build it so nothing waits on something -that isn't there yet.** - -## The eight component specs - -| # | Component | Repo | Spec | -|---|-----------|------|------| -| 1 | canopy-database (tables, models, migrations) | canopy | [specs/canopy-database.md](./specs/canopy-database.md) | -| 2 | canopy-public-server (device endpoints + AWS/kube on AppState) | canopy | [specs/canopy-public-server.md](./specs/canopy-public-server.md) | -| 3 | canopy-jobs-maintenance-inspection (maintenance/inspection/S3-metrics/init Jobs) | canopy | [specs/canopy-jobs-maintenance-inspection.md](./specs/canopy-jobs-maintenance-inspection.md) | -| 4 | canopy-jobs-detection-preflight (staleness, reconciliation, group-level alerting, preflight) | canopy | [specs/canopy-jobs-detection-preflight.md](./specs/canopy-jobs-detection-preflight.md) | -| 5 | canopy-operator-ui (private-server fns + private-web) | canopy | [specs/canopy-operator-ui.md](./specs/canopy-operator-ui.md) | -| 6 | ops (per-bucket roles, IRSA/ServiceAccounts, OIDC, scheduler Deployments) | ops/pulumi | [../../../ops/pulumi/docs/canopy-backup-credentials.md](../../../ops/pulumi/docs/canopy-backup-credentials.md) | -| 7 | bestool (device `backup-credentials` / `backup` subcommands) | bestool | 
[../../../bestool/docs/canopy-backup-credentials.md](../../../bestool/docs/canopy-backup-credentials.md) | -| 8 | pgro (restore consumer + signal-3 restore-verification) | pgro | [../../../pgro/docs/canopy-backup-integration.md](../../../pgro/docs/canopy-backup-integration.md) | - ---- - -## Dependency graph (derived from each spec's provides / depends_on) - -Arrows mean "depends on / must exist or be stubbed first". - -``` - ┌─────────────────────────────────────────────┐ - │ SPIKE: kopia vs GOVERNANCE-default-retention │ - │ bucket, no client-side PutObjectRetention │ - │ (gates ops A2 action-set + bestool kopia) │ - └───────────────┬─────────────────────────────-┘ - │ (verifies an assumption; doesn't block code start) - ▼ - (1) canopy-database ◄──────────────────────────── everything in canopy reads/writes these tables - │ tables, models, lib.rs re-exports, commons-types enums, - │ Option-B group-scoped-issues migration handled with (4) - │ - ├──────────────┬───────────────────────┬──────────────────────┐ - ▼ ▼ ▼ ▼ - (2) public-server (3) jobs-maint/insp (4) jobs-detect/preflight (5) operator-ui - AWS SDK + kube kube client + Job- group-level alerting private-server fns - on AppState; spawn lib; init Job; (Option-B issues); + private-web; reads - /backup-* endpoints maintenance/inspection staleness + reconcile; status/stats; reveal - │ /S3-metrics schedulers preflight (AWS) escrow (needs kube) - │ │ │ - │ contracts: HTTP endpoint shapes, IRSA role ARNs / ServiceAccount subs, billing labels - ▼ ▼ ▼ - (6) ops ◄────────────────┴──────────────────────┘ provides per-bucket role ARNs, IRSA roles, - │ OIDC providers, scheduler Deployments; consumes canopy SA names + OIDC issuer URL - │ - (7) bestool ◄── public-server endpoints (2) + the kopia spike - │ - (8) pgro ◄── restore endpoint + external-restore grant + first-party auth (canopy, later) - + a non-chained / longer-lived restore-cred decision -``` - -Two cross-cutting net-new capabilities sit underneath most of 
canopy and are -the real gate (see Critical path): - -- **AWS SDK + kube client** — first use anywhere in canopy. Lands on - `public-server` (component 2) and the `jobs` crate (components 3/4). -- **ServiceAccount + IRSA + OIDC** — first ServiceAccount canopy has ever had. - Owned by ops (component 6), consumed by 2/3/4. - -The canopy↔ops boundary is **mutually dependent** and resolved by contract, not -by serialising: canopy publishes the SA names + central-cluster OIDC issuer -URL; ops publishes the per-bucket role ARNs + IRSA role ARNs. Each side codes -against the agreed names and the two meet at deploy. - ---- - -## The early blocker: kopia-behaviour verification spike - -> **Concluded (from kopia docs/source + S3 semantics) — Branch A:** device -> creds = `AWS_S3_MULTIPART_ACTIONS` (no `PutObjectRetention`/no delete); -> repo created **non-lock-aware**; rely on the bucket's default GOVERNANCE -> retention + versioning + lifecycle. `--session-token`, -> `--override-hostname`, and `--point-in-time` are all supported. Two items -> still want a **live confirm** (the no-`PutObjectRetention` write path, and -> PIT on real AWS S3 per issue #4346). Full verdict + test script: -> [`backup-credentials-kopia-spike.md`](./backup-credentials-kopia-spike.md). - -**Do this first, in parallel with stage 0, before committing the ops action-set -and the bestool kopia wiring.** It's cheap, it's a known unknown, and it -changes two specs if it comes out the wrong way. - -The question (from ops spec A2 / bestool open-Q 2 / canopy-database H3): does -kopia **write and maintain** against an S3 bucket with **GOVERNANCE 30-day -default Object-Lock retention** when the client has **no `s3:PutObjectRetention`** -(the device action set is `AWS_S3_MULTIPART_ACTIONS`, delete- and -retention-free)? Also confirm: - -- kopia's S3 backend honours `AWS_SESSION_TOKEN` temporary creds (pgro open-Q 2; - bestool credential_process path). 
-- `--override-hostname` exists on the installed kopia for source-host = server-id - (bestool open-Q 3). -- which `BucketSizeBytes` `StorageType` dimension a versioned+locked bucket emits - (jobs-maint open-Q 6) — needed by the S3-metrics task, lower-stakes, can trail. - -Outcome drives: -- **ops A2**: device role is exactly `AWS_S3_MULTIPART_ACTIONS`, or that **plus** - `s3:PutObjectRetention` (safe under GOVERNANCE-without-bypass — can only - lengthen a lock). Don't finalise the managed policy until this is known. -- **bestool kopia helpers**: connect/snapshot wiring and how creds reach kopia. - -Run it against a throwaway dev bucket the ops `backups` stack can stand up. If it -comes back "kopia insists on PutObjectRetention", the fallback is already -specified — re-grant it — so this never blocks, it just picks a branch. Start -the spike at day 0; it must conclude before ops merges the action-set change and -before bestool finalises the kopia connect path. - ---- - -## Critical path - -The longest chain of hard dependencies, and what unblocks the most downstream -work, is: - -1. **Net-new enabling work** (the gate for all of canopy): - - **(1) canopy-database** — tables, models, `lib.rs` re-exports, and the - shared `commons-types` enums (`Purpose`/`Outcome`/`kind`). Nothing in - canopy compiles against these until they exist. This is the true - foundation; land it first. - - **AWS SDK + kube client deps + AppState wiring** (inside component 2) and - the **kube client + Job-spawn library** (inside component 3). First AWS/k8s - code in the repo; verify crate versions against the registry (no guessing), - pin `k8s-openapi` to the cluster's control-plane version. - - **(6) ops IRSA/ServiceAccount/OIDC + per-bucket role ARNs** — without the - ServiceAccount + IRSA trust, no canopy pod can `AssumeRole`, and without the - role ARNs there's no `target_role_arn` to put in config. This runs in - parallel with the canopy enabling work, joined by the ARN/SA-name contract. 
- -2. **The issuance hot path**: (1) → (2) public-server `/backup-credentials` + - `/backup-target` + `/backup-report`, against (6)'s role ARNs and Secret-read - RBAC. This is the contract bestool consumes. - -3. **bestool (7)** — needs (2)'s endpoint shapes live (or contract-frozen) and - the kopia spike concluded. - -4. **pgro (8)** — additive last stage; needs the restore endpoint, the - external-restore grant, first-party auth, and a longer-lived-cred decision, - none of which exist until canopy ships its restore surface. - -The single highest-leverage item is **canopy-database (1)**: components 2, 3, 4, -and 5 all import its models. Land it, with the `commons-types` enums, before the -four canopy tracks fan out. The second is the **ops IRSA/OIDC plumbing (6)**, -because it's the longest-lead infra item and gates every AWS-touching code path -at deploy time even though the code can be written against the contract earlier. - ---- - -## Build order (stages) - -Stages are sequencing guidance, not hard gates — within a stage, tracks run in -parallel. A later stage starts when its named dependencies from the earlier -stage are merged (or contract-frozen and stubbed). - -### Stage 0 — foundations (must land first) - -- **Spike**: kopia-behaviour verification (above). Parallel, concludes before - ops action-set + bestool kopia. -- **(1) canopy-database**: the `backup_credentials` migration (all 7 tables), - `backups.rs` models + `lib.rs` re-exports, the `commons-types` - `Purpose`/`Outcome`/`kind` enums. Resolve its open decisions up front because - they ripple: enum representation (shared enums vs validated String), the - `backup_runs` client-supplied-PK → `AppError::Conflict` mapping, cascade - policy for stats/requests vs the no-cascade audit rule, and the - `backup_repo_snapshots.server_id` on-delete behaviour. DB-only tests via - `TestDb::run`. 
-- **(6) ops — contract freeze + long-lead infra**: agree the names/ARNs both - sides code against (`canopyIssuerRoleArn`, `canopyJobsRoleArn`, - `canopy-issuer`/`canopy-jobs` ServiceAccount subs, central-cluster OIDC issuer - URL, per-bucket `deviceRoleArn`/`maintenanceRoleArn`, the - `billing.{product,stage,deployment}` label keys). Then start the actual - Pulumi: Component B (central ServiceAccounts + IRSA + RBAC, `spec.ts` - `serviceAccountName`), Component A1/A3/A4/A6 (per-bucket trust, Object-Lock - read action, ARN exports, lifecycle rules), Component C (OIDC provider per - deployment account). A2 (action-set reduction) waits on the spike. - -The canopy↔ops contract is the coordination spine for everything after. - -### Stage 1 — issuance hot path + enabling clients (the device-facing MVP) - -Depends on Stage 0's (1) and the (6) contract. - -- **(2) canopy-public-server**: add the AWS SDK + kube deps, the - `AppState.sts` / `AppState.kube` fields + `FromRef` impls + async init, - `AppError::Upstream` (502) + ERRORS.md, and the three handlers - (`/backup-credentials`, `/backup-target`, `/backup-report`) with the restore - session-policy builder. This component **owns** the AWS/kube-on-AppState - capability the rest of canopy reuses. Tests: the 412/409/502 resolution - matrix with `None` clients, the session-policy unit test, a stubbed-STS 200 - path. - -This is the first end-to-end slice: a device can mint creds and report a run. - -### Stage 2 — control-plane jobs (parallel canopy tracks) - -All depend on Stage 0 (1) and reuse the AWS/kube patterns from Stage 1 (2). -These three run in parallel with each other and with bestool. 
- -- **(3) canopy-jobs-maintenance-inspection**: the shared Job-spawn library - (recommended `commons-servers::backup_jobs` so private-server can call the - init-Job spawn without depending on the `jobs` crate), the three scheduler - bins (maintenance / inspection / S3-metrics), the kopia-Job arg contract, and - the migrations it owns (`backup_maintenance_runs`, `backup_repo_snapshots`, - `backup_repo_stats`) — coordinate with (1) on single-vs-split migration - ownership. -- **(4) canopy-jobs-detection-preflight**: the **Option-B group-scoped-issues - migration** + the thorough `issues.rs` sweep (this is the largest single - decision in the system and the central new shared plumbing — - `raise_group_event` is consumed by the inspection Job in (3) and PGRO ingest - in (8)), the `backup_staleness` and `backup_preflight` bins, and the shared - `jitter_slot` helper. Resolve Option A vs B before building; recommend B. -- **(6) ops — scheduler Deployments**: B4 wires the - `backup-maintenance`/`backup-inspection`/`backup-preflight` (and possibly - `backup-s3-metrics`/`backup-staleness`) single-replica Deployments on the - `canopy-jobs` SA, once the bin names are pinned by (3)/(4). - -Cross-track coordination inside Stage 2: -- (3) and (4) **share** `commons-servers` helpers (`jitter_slot`, - retention-floor) and the `(source, ref)` alert keys — agree these once. -- The group-level alerting path from (4) is a **prerequisite** for (3)'s - corruption alert and (4)'s own group-level refs — (4) should land the Option-B - plumbing early in the stage so (3) can call `raise_group_event`. - -### Stage 3 — operator UI + device client (parallel) - -- **(5) canopy-operator-ui**: private-server `/api/backups/*` fns + the React - screens. 
Depends on (1) models, reuses (2)'s kube client for `reveal_escrow` - (resolve open-Q: private-server gets its own `canopy-issuer` SA + Secret-read - RBAC — coordinate with ops open-Q 2), and depends on (3)'s init-Job contract - for the `provisioning → escrow_pending/ready` lifecycle. `just gen-openapi` + - Playwright e2e in the same change. -- **(7) bestool**: the two subcommands + `CanopyClient` methods, against (2)'s - frozen endpoint shapes and the concluded kopia spike. The "back up now" - command-channel transport is **deferred upstream** — build the - transport-independent subcommands now; wire the trigger when canopy defines - the status-response payload. - -(5) and (7) are independent and parallel. (7) can start as soon as (2)'s -endpoint contract is frozen, even before (3)/(4) land. - -### Stage 4 — PGRO (additive, last) - -- **(8) pgro**: restore-consumer CRD (`canopyBackup.group`), `fetch_restore_creds` - / `report_restore`, signal-3 `RestoreReport` into a future - `backup_restore_checks` table + the `restore-verification` group-level alert - (routed through (4)'s `raise_group_event`). - -PGRO is explicitly last because it needs canopy-side surfaces that don't exist -until the earlier stages ship: -- a **restore-credentials** path (purpose=restore creds + target + repo - password) — built on (2); -- the **external-restore grant** (operator-authorized, audited "consumer pgro - may read group X read-only") — net-new canopy authz surface; -- a **first-party non-device auth** path (Tailscale now, OIDC later) — joint - canopy+ops design; -- the **`backup_restore_checks` table + ingest endpoint + signal-3 detection**; -- a **decision on longer-lived / non-chained restore creds** so restores >1h - survive (mirror the maintenance-Job direct web-identity). **This decision is - owed by canopy and should be made during Stage 2** (when the maintenance-Job - direct-web-identity path is built) so PGRO isn't blocked on it in Stage 4. 
- ---- - -## Parallelizable tracks (one per repo) - -Once Stage 0's (1) + (6)-contract land, the repos proceed largely in parallel, -coordinated only by the contracts named below. - -- **canopy track**: (1) → then (2), (3), (4), (5) fan out. (2) blocks (7) (HTTP - contract). (4)'s Option-B plumbing blocks (3)'s corruption alert. (5) needs - (3)'s init-Job contract and (2)'s kube client. -- **ops track**: (6) runs alongside the canopy enabling work, joined by the - ARN/SA-name contract; its scheduler-Deployment piece (B4) trails (3)/(4)'s bin - names. -- **bestool track**: (7) starts when (2)'s endpoint shapes are frozen and the - spike is done; otherwise independent of (3)/(4)/(5). -- **pgro track**: (8) is last; nothing else depends on it. - -### The contracts that let the tracks run independently - -1. **HTTP endpoint shapes** (canopy public-server ⇆ bestool, and later ⇆ pgro): - `POST /backup-credentials`, `GET /backup-target`, `POST /backup-report` — - request/response bodies, the 412/409/502 semantics, and `backup_runs.id` = - client-minted UUID PK with `device_id`/`group_id` server-derived. Freeze - these from spec (2)/(7) before bestool starts; bestool's `canopy_contract.rs` - `#[ignore]`d suite is the drift detector. -2. **IRSA role ARNs + ServiceAccount subs + OIDC issuer** (canopy ⇆ ops): - `target_role_arn` (= ops `deviceRoleArn`), `maintenanceRoleArn`, - `canopyIssuerRoleArn`, `canopyJobsRoleArn`, the central-cluster OIDC issuer - URL, and the `canopy-issuer`/`canopy-jobs` SA names in namespace - `tamanu-meta-`. The hard isolation invariant: the maintenance/ - fullaccess role MUST NOT trust the issuer principal. -3. **Billing label keys** (canopy ⇆ ops): `billing.{product,stage,deployment}`, - with `ServerRank::Production → "prod"` (the load-bearing mapping gotcha). -4. 
**Shared `commons-types` enums** (canopy-internal, spec 1): `Purpose` / - `Outcome` / `kind` shared across public-server, jobs, and the generated - `api-types.ts`, so the three components don't drift. -5. **`raise_group_event` group-level alert entrypoint** (spec 4, consumed by 3 - and 8): the single place that opens a group-scoped incident bypassing - `is_monitored`. -6. **Init-Job lifecycle contract** (spec 3 ⇆ spec 5): UI sets - `status='provisioning'` + clears `last_init_error`; the init Job transitions - to `escrow_pending`/`ready` or sets `last_init_error`. UI depends only on the - observable fields, not the handoff mechanism. -7. **kopia Job image + entrypoint arg contract** (ops-built image ⇆ canopy jobs - ⇆ bestool source conventions): args (bucket/prefix/region/role/retention/ - run-id), `secretKeyRef` password mount, source-host `canopy@`, - snapshot tags `canopy-device`/`canopy-run`. - ---- - -## Cross-cutting decisions to settle before the dependent stage - -These appear in multiple specs' open questions; resolving them early prevents -rework. Each is tagged with the latest stage by which it must be decided. - -- **Enum representation** (`commons-types` shared vs validated String) — **Stage 0**, - blocks (1) and the generated `api-types.ts`. -- **Migration ownership** (one `backup_credentials` migration vs split across - (1)/(3)) — **Stage 0/2**, coordinate (1) and (3). -- **Group-level alerting Option A vs B** — **Stage 2**, recommend B; blocks (3)'s - corruption alert and all group-level refs in (4). -- **Where `reveal_escrow` reads the Secret** (private-server own kube client) — - **Stage 3**, ties to ops open-Q 2 (does private-server get `canopy-issuer`). -- **Longer-lived / non-chained restore creds for first-party consumers** — - **Stage 2** (decided when the maintenance-Job direct-web-identity lands), so - (8) isn't blocked. 
-- **"Back up now" command-channel transport** — deferred upstream; **does not - block** Stage 3's bestool subcommands, which are transport-independent. -- **kopia + default-retention without PutObjectRetention** — the **spike**; - blocks ops A2 and bestool kopia wiring. - ---- - -## Summary one-liner per stage - -- **Stage 0**: land the DB layer + shared enums (1); freeze the canopy↔ops - contract and start the long-lead IRSA/OIDC infra (6); run the kopia spike. -- **Stage 1**: build the issuance hot path + the AWS/kube-on-AppState - capability (2) — first end-to-end device slice. -- **Stage 2**: the control-plane Jobs (3) + detection/preflight + group-level - alerting (4) in parallel, plus ops scheduler Deployments; decide restore-cred - lifetime here. -- **Stage 3**: operator UI (5) and the bestool device client (7) in parallel. -- **Stage 4**: PGRO restore-verification (8), additive and last. diff --git a/docs/plans/backup-credentials-kopia-spike.md b/docs/plans/backup-credentials-kopia-spike.md deleted file mode 100644 index dce9917f..00000000 --- a/docs/plans/backup-credentials-kopia-spike.md +++ /dev/null @@ -1,166 +0,0 @@ -# kopia-behaviour verification spike (day-0 blocker) - -Resolves the spike named in -[`backup-credentials-implementation-order.md`](./backup-credentials-implementation-order.md). -Its job: pick the device IAM action-set branch and confirm the kopia -assumptions, so ops can finalize the managed policy and bestool can wire -the kopia connect/snapshot path. - -**Method:** authoritative from kopia docs + source and S3 semantics (no -live test was runnable in-session — no kopia binary, no valid AWS creds). -A gold-standard live-test script is at the end; the *decision* doesn't wait -on it, but two items (PIT on real AWS, the bucket-default-retention write -path) warrant a live confirm before relying on them. 
- -## Verdict - -**Branch A confirmed: device creds = `AWS_S3_MULTIPART_ACTIONS` (no -`s3:PutObjectRetention`, no `s3:DeleteObject`); the kopia repo is created -*non-lock-aware*; we rely on the bucket's default GOVERNANCE 30-day -retention + versioning + lifecycle.** Ops can finalize the device managed -policy as exactly `AWS_S3_MULTIPART_ACTIONS`. - -## Findings (per question) - -### 1. PutObjectRetention vs bucket-default retention (the branch decision) - -kopia needs `s3:PutObjectRetention` **only when it manages retention -itself** — i.e. the repo is created with `--retention-mode` and locks are -renewed via full-maintenance `--extend-object-locks`. That's the -kopia-documented "ransomware" path, and it's confirmed that even with -`--extend-object-locks` kopia still requires `PutObjectRetention` on the -primary bucket (so it can't be isolated away). - -We deliberately do **not** use that mode. Instead we create a *plain* -kopia repo (no `--retention-mode`) against a bucket whose **default object -lock retention** is GOVERNANCE 30d. S3 applies the default retention to -every `PutObject` **server-side**, which requires only `s3:PutObject` — not -`s3:PutObjectRetention`. So the device key needs neither delete nor -retention permission. ✔ matches the plan's H3. - -- **Consequence (already accepted):** without `--extend-object-locks`, a - live blob's lock is fixed at 30d from its last write and never renewed. - Irrelevant under the device-compromise threat (the device can't delete); - it only matters against an AWS-level attacker, which is out of scope. - Re-enabling renewal later = lock-aware mode + `PutObjectRetention` on the - **maintenance** role (never the device). -- **Caveat:** this is *not* the kopia-documented happy path; it rests on S3 - default-retention semantics (solid) rather than a kopia doc page. Live - test item (a) confirms kopia writes + maintains happily this way. - -### 2. 
Maintenance deletes on a versioned bucket (H2) - -Non-lock-aware kopia issues real `DeleteObject`; on a versioned bucket that -writes a **delete marker** (succeeds, reclaims nothing) — it does **not** -error. Reclamation is via the S3 lifecycle `noncurrentVersionExpiration` -rule, as the plan says. The maintenance role needs `s3:DeleteObject` (it -has it). (Note: kopia *also* has a "hidden marker" soft-delete it uses with -restricted/lock-aware keys — not our path; our maintenance role deletes for -real and lets lifecycle reclaim.) ✔ matches the plan's H2; the earlier -"throws errors on locked deletes" framing was wrong. - -### 3. Temporary credentials / `credential_process` (`AWS_SESSION_TOKEN`) - -`kopia repository create/connect s3` supports `--session-token` (and the -`AWS_SESSION_TOKEN` env). So the short-lived STS creds (which include a -session token) work, and the `credential_process`-style refresh is viable. -✔ unblocks the bestool credential path. - -### 4. Source host = server-id - -`--override-hostname` and `--override-username` exist, set at **`kopia -repository connect`** time (connection-level, *not* per-snapshot — the -per-snapshot `--hostname`/`--username` were removed in 0.6.0). bestool -reconnects per run (it re-derives the connection from Canopy every run), so -it passes `--override-hostname=` (`--override-username=canopy`) -on connect → source `canopy@:`. The **type** goes in the -path and a `canopy-type=` snapshot tag (`kopia snapshot create ---tags`). ✔ matches the plan's per-`(server, type)` source model. - -### 5. Point-in-time recovery (H1) - -`kopia repository connect … --point-in-time=` exists and is the -documented recovery path for a versioned+locked bucket (recover to before -a poisoning/deletion). ✔ the H1 recovery runbook is real. 
- -- **Caveat:** GitHub issue #4346 reports `--point-in-time` failing with - "repository not initialized" on some S3-*compatible* endpoints, and - #3492 covers a recovery edge case (missing files after deleted objects). - Real AWS S3 is kopia's primary supported target, but **PIT recovery must - be live-tested on real AWS S3** before we depend on it operationally — - it's our break-glass path. Live test item (b). - -### 6. CloudWatch `BucketSizeBytes` dimension (lower-stakes, may trail) - -`BucketSizeBytes` carries a `StorageType` dimension, and **all object -versions (current + noncurrent) count** toward it per storage class. So the -S3-metrics task sums `BucketSizeBytes` across the relevant `StorageType`s — -`StandardStorage` plus the intelligent-tiering classes (`.storageconfig` -puts pack blobs in `INTELLIGENT_TIERING`). Confirm the exact emitted -dimensions against a real bucket. Lower-stakes; `bucket_bytes` is -best-effort anyway. - -## What this unblocks - -- **ops** (action-set): the device role = `AWS_S3_MULTIPART_ACTIONS`, no - `PutObjectRetention`. The repo is created non-lock-aware; the bucket - keeps its default GOVERNANCE 30d retention + the lifecycle rules. -- **bestool** (kopia wiring): connect with `--session-token` + - `--override-hostname=`; `kopia snapshot create --tags - canopy-device=… canopy-run=… canopy-type=…`; do **not** pass - `--retention-mode`. - -## Remaining live confirmations (run when a throwaway bucket + creds exist) - -These don't change the branch; they de-risk the two assumptions that rest -on semantics/known-issues rather than a kopia doc. Script below. - -(a) Plain kopia repo create + snapshot + maintenance against a versioned, - default-GOVERNANCE-retention bucket, using a **device key without - PutObjectRetention/Delete** and a **maintenance key with delete** — - confirm no `AccessDenied` for retention and that maintenance succeeds. -(b) `--point-in-time` reconnect works on real AWS S3. 
-(c) The `BucketSizeBytes` `StorageType` dimensions emitted. - -```bash -#!/usr/bin/env bash -# Operator-run live confirmation. Needs: aws cli with creds, kopia. -# Creates a throwaway bucket — review + delete after. -set -euo pipefail -B="bes-kopia-spike-$(date +%s)"; R="ap-southeast-2" -KP="spike-pass-$(openssl rand -hex 8)" - -# 1. Versioned bucket + object lock + 30d GOVERNANCE default retention -aws s3api create-bucket --bucket "$B" --region "$R" \ - --create-bucket-configuration LocationConstraint="$R" \ - --object-lock-enabled-for-bucket -aws s3api put-object-lock-configuration --bucket "$B" \ - --object-lock-configuration 'ObjectLockEnabled=Enabled,Rule={DefaultRetention={Mode=GOVERNANCE,Days=30}}' - -# 2. DEVICE creds: NO delete, NO PutObjectRetention (AWS_S3_MULTIPART_ACTIONS). -# Use a scoped IAM user/role with: s3:GetObject,PutObject, -# AbortMultipartUpload,ListBucketMultipartUploads,ListMultipartUploadParts, -# ListBucket,GetBucketLocation on the bucket. Export its creds, then: -kopia repository create s3 --bucket "$B" --region "$R" --password "$KP" \ - --override-hostname server-test --override-username canopy -# ^ EXPECT: success. FAIL = AccessDenied mentioning PutObjectRetention -# → fall back to granting PutObjectRetention (safe; lengthen-only). -echo hello > /tmp/spike.txt -kopia snapshot create /tmp/spike.txt --tags canopy-type:tamanu-postgres -aws s3api list-object-versions --bucket "$B" --query 'Versions[0].ObjectLockMode' # EXPECT: GOVERNANCE (default applied on PUT) - -# 3. MAINTENANCE creds: full S3 incl. delete. Re-connect with those, then: -kopia maintenance run --full --safety none # EXPECT: success; deletes become markers -aws s3api list-object-versions --bucket "$B" --query 'DeleteMarkers' # EXPECT: markers present, no errors - -# 4. 
PIT (item b): note a timestamp, mutate, then: -kopia repository connect s3 --bucket "$B" --region "$R" --password "$KP" \ - --point-in-time="$(date -u +%Y-%m-%dT%H:%M:%SZ)" # EXPECT: connects (watch for issue #4346) - -# 5. CloudWatch dimensions (item c): after metrics populate (~a day), -aws cloudwatch list-metrics --namespace AWS/S3 --metric-name BucketSizeBytes \ - --dimensions Name=BucketName,Value="$B" - -# cleanup: object-locked objects can't be deleted for 30d; the throwaway -# bucket will linger until the lock lapses (expected). Tag it for teardown. -``` diff --git a/docs/plans/backup-setup-wizard-ops-handoff.md b/docs/plans/backup-setup-wizard-ops-handoff.md deleted file mode 100644 index 33202044..00000000 --- a/docs/plans/backup-setup-wizard-ops-handoff.md +++ /dev/null @@ -1,241 +0,0 @@ -# Ops/pulumi handoff — backup setup wizard + chained-AssumeRole cred model - -Companion to `backup-setup-wizard.md`. This lists **only the ops/pulumi-side -changes** canopy needs. Canopy owns the Rust/UI/DB work; this is what the ops -agent must adjust. (canopy ticket TAM-6877; ops ticket TAM-6878.) - -## Context - -Canopy is moving the backup credential flow to **explicit chained -`sts:AssumeRole`** (there is no deployment-account OIDC provider, so the old -`AWS_ROLE_ARN`-override / direct-web-identity trick is gone), and adding an -interactive setup wizard that **probes the target bucket from private-server** -before a config is saved. private-server has no AWS identity today — that's the -main new ask. - -## 1. New `canopy-private` ServiceAccount + IRSA role - -private-server currently has no AWS identity. Add a **dedicated** -`canopy-private` SA (not a reuse of `canopy-jobs`/`canopy-issuer` — we want room -to grow private-server AWS features later): - -- IRSA role annotated on the `canopy-private` SA. -- The role carries `sts:AssumeRole`. 
-- k8s RBAC for the SA: **`get` + `create` secrets** in the namespace (today - private-server only needs `get`; the wizard now creates the passphrase Secret). - -## 2. Trust-policy matrix (chained AssumeRole) — the main change - -Verified current state in `pulumi/tamanu/on-linux/src/backup/kopia.ts`: - -- device role: `assumeRolePolicyDocument: deviceAssumeRolePolicy(canopyIssuerRoleArn)` - → trusts **canopy-issuer**. -- maintenance role: `maintenanceAssumeRolePolicy(canopyJobsRoleArn)` → trusts - **canopy-jobs**. - -Target trust per group: - -| Per-group role | Trusted by (can `sts:AssumeRole` it) | Used for | -|---|---|---| -| **device role** (`deviceRoleArn` / `target_role_arn`) | `canopy-issuer` (existing) **+ `canopy-private` (NEW)** | mint device creds; wizard validation (`sts:get-caller-identity`) | -| **maintenance role** (`maintenanceRoleArn` / `maintenance_role_arn`) | `canopy-jobs` (existing) **+ `canopy-private` (NEW)** | maintenance/inspection/s3-metrics; wizard content-probe + connect-verify | - -**Concrete change:** `deviceAssumeRolePolicy` / `maintenanceAssumeRolePolicy` -currently take a single trusted role ARN. Extend them to also trust a new -`canopyPrivateRoleArn` (add a `canopy.get('canopyPrivateRoleArn')` alongside the -existing `canopyIssuerRoleArn` / `canopyJobsRoleArn` config reads), and have the -canopy deployment stack export that ARN (next to `canopyIssuerRoleArn` / -`canopyJobsRoleArn`). - -## 3. Maintenance role permissions — already correct, just confirm - -Verified: the maintenance role already grants `s3:*` on the bucket **and** -`CLOUDWATCH_METRIC_ACTIONS` (the s3-metrics CloudWatch reads). No change needed — -this section is just to confirm canopy will now assume **this** role for -maintenance/inspection/s3-metrics. - -> Bug being fixed canopy-side: maintenance/s3-metrics currently assume the -> *device* role (can't delete). They'll switch to `maintenance_role_arn`. 
The -> maintenance role is already complete, so no ops grant change — the device role -> stays minimal. - -## 4. Session duration — **NO ops change** (MaxSessionDuration ask withdrawn) - -Disregard the earlier "raise `canopy-jobs` `MaxSessionDuration` to 12h" note — -**withdrawn.** No `MaxSessionDuration` change is needed on any role. - -Why it's moot: we verified (kopia v0.23.1 + minio-go v7.2.0 source) that kopia -cannot use `credential_process`/a creds file, **and** the `--role` approach -(which would have needed a long base session) is superseded. The chosen -mechanism is kopia's **IAM container-credentials endpoint**: canopy runs a tiny -localhost endpoint that mints a fresh (1h chained) maintenance-role session per -poll; kopia's minio-go re-polls at ~80% of lifetime, and the Rust SDK keeps the -pod's IRSA base fresh on its own. So a 90-min (or longer) run just re-polls — no -session ceiling, no role-duration tuning. Per-bucket roles stay 1h-capped -(fine); `canopy-jobs`/`canopy-private` need no duration change. - -## 5. `.storageconfig` — informational, likely no change - -Canopy will **create `.storageconfig` as a fallback** during repo init -*only if absent*, and **never overwrites** an existing one, using the same -schema ops already writes (`blobOptions: p→INTELLIGENT_TIERING, else STANDARD`, -from `pulumi/tamanu/on-linux/src/backup/kopia.ts` and `pulumi/backups/index.ts`). -Since pulumi writes it at bucket creation, pulumi's object normally wins and -canopy's fallback is a no-op. No ops change required — just be aware canopy may -write it for buckets/prefixes pulumi didn't seed. - -## 6. Config-as-a-resource API (so pulumi stops hand-copying ARNs) - -Canopy will expose an API for pulumi to register a backup config as a managed -resource after it creates the bucket+roles — create/update/delete/get, -idempotent upsert. 
Ops side: - -- **Auth: `TailscaleAdmin`** for now (pulumi already has tailnet access — call - the private-server admin API over the tailnet). A proper non-interactive - machine-auth path over Tailscale (tagged/ACL-grant) is wanted **later, not in - this plan** — flag if ops wants to start designing it, but canopy isn't - building it yet. -- **Inputs pulumi must supply per group:** `server_group_id`, `bucket`, - `prefix`, `region`, `target_role_arn` (= `deviceRoleArn`), `maintenance_role_arn` - (= `maintenanceRoleArn`), `mode` (machine flows: `from_birth` — canopy - generates the passphrase; the human still escrows via the UI). -- **Delete** removes the config **and the canopy-owned passphrase Secret**. -- The create/update path runs the **same server-side access-check (the probe)** - before persisting, so a misconfigured role/bucket fails fast. -- Exact request/response schema is canopy's to finalize; canopy will share the - generated openapi. Ops only needs to confirm the **inputs above** are - available as pulumi outputs (they are: `deviceRoleArn` + `maintenanceRoleArn` - are already exported). - -## 6a. Device path note (informational — bestool repo, not pulumi) - -Heads-up that the device backup path is changing on the bestool side (TAM-6879), -not here: bestool will serve a localhost **container-credentials** endpoint that -kopia polls (fed by public-server creds), because we verified `credential_process` -doesn't work and ~90-min snapshot jobs make any <1h static-cred path non-viable. -This may prompt revisiting public-server's device-cred response shape. No pulumi -action — flagged only so the ops/bestool picture is consistent. - -## 7. Not changing - -- Device credential path *trust/roles* (public-server / `canopy-issuer` - chain-assuming the device role) — unchanged (only the bestool-side cred - *delivery* changes, §6a). -- The image still bundles kopia. - -## Summary of ops action items - -1. 
New `canopy-private` SA + IRSA role (`sts:AssumeRole`); SA RBAC `get`+`create` - secrets; export its role ARN as `canopyPrivateRoleArn`. -2. Add `canopyPrivateRoleArn` to **both** `deviceAssumeRolePolicy` and - `maintenanceAssumeRolePolicy` (they take a single ARN today). -3. Maintenance role perms — already `s3:*` + CloudWatch; nothing to change, just - confirmed. -4. **No `MaxSessionDuration` change (ask withdrawn, §4)** — canopy uses kopia's - container-credentials endpoint, which refreshes with no session ceiling. No - `.storageconfig` change either. -5. Plan to call canopy's config-as-a-resource API over the tailnet - (`TailscaleAdmin`) feeding `deviceRoleArn`+`maintenanceRoleArn`+bucket/prefix/ - region; delete cascades to the Secret. Schema TBD from canopy. - ---- - -## Changelog (append-only — do NOT edit the body above after handoff) - -**v1 — the version ops actioned** (everything above as of the first handoff). The -action items §1–§3, §5 are the source of truth; treat them as done. - -**2026-06-20 — delta since v1 (nothing here needs new ops IAM/pulumi work):** -- **§4 reworded, net zero for ops.** v1 already said "no `MaxSessionDuration` - change," and that's still true. The *reason* changed (canopy-internal): kopia - now gets creds via a localhost **container-credentials endpoint** (verified - against kopia 0.23.1 + minio-go 7.2.0), not `credential_process` and not - `--role`+long-session. No role-duration tuning on any role. - - ⚠️ **If anyone verbally relayed a "raise `canopy-jobs` `MaxSessionDuration` - to 12h" ask (it was never in this doc), it is WITHDRAWN — ignore/revert it.** -- **§6a added — informational only, no pulumi action.** The device backup path - moves to a bestool-served container-credentials endpoint (TAM-6879); may prompt - revisiting public-server's device-cred response shape. Flagged for picture - consistency only. -- **§1–§3, §5 unchanged** (byte-identical to v1). 
- -**2026-06-20 — NEW ops action (passphrase rotation):** -- ⚠️ **`canopy-jobs` SA now needs WRITE on secrets** (`create`/`update`/`patch`, - on top of the existing `get`). Why: the backups pod rotates each repo's - passphrase regularly (forward protection) — after `kopia change-password` it - writes the new passphrase back to the group's k8s Secret (dual-key - `password`/`password_next`, server-side apply, field-manager `canopy-backups`). - Read-only `get secrets` no longer covers the rotation path. -- No other ops change; rotation cadence is a canopy env - (`CANOPY_BACKUP_ROTATION_DAYS`, default 7). - -**2026-06-21 — NEW ops action (recovery vault):** -- ⚠️ **A new object-locked S3 bucket for the recovery vault, in a SEPARATE account** - from both the canopy cluster account and the per-tenant backup accounts. - Requirements: - - **Object Lock = COMPLIANCE** + **versioning** on (so a Canopy compromise - can't delete history; each daily write is a new immutable version of the - same key). Pick a retention period (with a lifecycle expiry so it doesn't - grow forever). SSE on. - - A **writer role** the `canopy-jobs` SA assumes (chained AssumeRole), granted - **`s3:PutObject` ONLY** on that bucket — **no delete, no get** (Canopy never - reads the vault back; the blob is asymmetrically encrypted so it couldn't - read it anyway). -- ⚠️ **age recipient keypairs (recovery-key custody).** Generate **multiple** age - keypairs (e.g. one per recovery officer; `bestool crypto keygen`). The - **public** keys go to Canopy via `CANOPY_RECOVERY_VAULT_KEYS` (space/comma- - separated `age1…`); the **private** keys are held **offline, out-of-band** - (any one can recover). Custody is an ops runbook — Canopy never sees a private - key. 
-- **Canopy env (backups pod):** `CANOPY_RECOVERY_VAULT_KEYS` (**mandatory** — the - pod refuses to start without it), `CANOPY_RECOVERY_VAULT_BUCKET` (**mandatory**), - `CANOPY_RECOVERY_VAULT_REGION`, `CANOPY_RECOVERY_VAULT_ROLE_ARN` (the writer role), - `CANOPY_RECOVERY_VAULT_SNAPSHOT_HOURS` (default 24). The object key/path within - the bucket is not configurable (fixed at `canopy-recovery/state.age`). These - must be provisioned **before** the backups pod is deployed with this build, or - it will crash-loop on the mandatory check. -- **Verification ceremony (runbook):** operators run a yearly (and on-key-change) - ceremony in the canopy admin UI (recovery vault page): Canopy issues an age-encrypted - challenge, the operator decrypts it offline with a held private key - (`bestool crypto decrypt`) and pastes it back. The vault blob itself is plain - `age` v1 (decryptable with `bestool crypto decrypt` / `age` / `rage`). -- No k8s RBAC change (the vault is S3, not a Secret). - -**2026-06-22 — clarification (private-server also needs the recipients):** -- ⚠️ **Set `CANOPY_RECOVERY_VAULT_KEYS` on private-server too**, not just the - backups pod. They're **public** keys (non-secret), so use the same value. The - private-server needs them to run the verification ceremony (issue the - age-encrypted challenge); without them the recovery-vault page reports the - ceremony as unavailable. private-server does **not** hard-require them (it - starts fine without; only the ceremony page is degraded) — unlike the backups - pod, which won't start without them. Nothing else on private-server needs it. - -**2026-06-22 — fix (repo-password Secret namespace):** -- Canopy now reads/creates the repo-password Secrets in the **pod's own - namespace** (the ServiceAccount's namespace from the in-cluster config), not a - hardcoded `canopy`. So in `tamanu-meta-prod` the Secrets live in - `tamanu-meta-prod`. **No `POD_NAMESPACE` env needed** (it's still honored as an - override if set). 
Earlier the default `canopy` caused `canopy-jobs` (in - `tamanu-meta-prod`) to hit `403 Forbidden` reading `backup-repo-*` in `canopy`. -- ⚠️ **Ensure both SAs have secret RBAC in the deployment namespace:** - `canopy-private` (`get`+`create` secrets) and `canopy-jobs` (`get`+`create`/ - `update`/`patch` secrets, for rotation) — in whatever namespace the pods run - (e.g. `tamanu-meta-prod`), which is the standard same-namespace grant. If those - Role/RoleBindings were created in `canopy`, move them to the pods' namespace. - -**2026-06-22 — fix (kopia is now bundled in the canopy image):** -- The backups pod drives kopia as in-process subprocesses, but kopia was **not** - in the image it ran: CI only ever built `ghcr.io/beyondessential/canopy` (a - busybox base with the canopy binaries, no kopia), and the standalone - `images/backups/` Dockerfile that bundled kopia was never wired into CI. So in - prod every repo-init/maintenance kopia spawn failed with - `failed to spawn kopia … No such file or directory (os error 2)`. -- Fix: the (static) kopia binary is now copied into the single shipped - `ghcr.io/beyondessential/canopy` image (`.github/Dockerfile.native`, from - `kopia/kopia:0.23.1`). **The backups pod uses the same image as every other - component** — there is no separate backups image. The orphaned `images/backups/` - dir was removed. -- ⚠️ **Ops action: ensure the backups Deployment references - `ghcr.io/beyondessential/canopy` (the standard image), not a separate - `canopy-backups`/kopia image** — none is published. Once deployed on this - build, kopia resolves on `PATH` and repo creation works. diff --git a/docs/plans/backup-setup-wizard.md b/docs/plans/backup-setup-wizard.md deleted file mode 100644 index 6f260d43..00000000 --- a/docs/plans/backup-setup-wizard.md +++ /dev/null @@ -1,381 +0,0 @@ -# Backup setup wizard + Canopy-owned passphrases + chained-AssumeRole cred model - -Status: **PLAN / for review** (2026-06-20). 
Supersedes parts of the cred model -in `backup-credentials.md` (the direct-web-identity scheme) — see §1. - -## Why - -Two things landed together: - -1. **Operator feedback (feature):** the backup setup form should become an - interactive wizard. The operator enters bucket/prefix/role(s)/region; Canopy - *immediately* uses those creds to probe the bucket — verifying the creds work - and reporting whether the prefix is empty, looks like an existing kopia repo, - holds other (forgotten) content, or is already configured in Canopy — and - offers next steps based on that. Only once the passphrase situation is - settled do we collect schedule/retention. Rework repo modes so **Canopy owns - every passphrase**: `from_birth` (generate + escrow) or `passphrase` (operator - types it); drop the old import-an-existing-Secret mode. - -2. **Ops-driven cred-model change:** there is **no deployment-account OIDC - provider**, so the previous "override `AWS_ROLE_ARN` + reuse the projected - web-identity token → direct `AssumeRoleWithWebIdentity`" scheme is gone. - Everything cross-account is now **explicit chained `sts:AssumeRole`** from the - pod's own IRSA creds, and there are now **two roles per group**. - -Both touch the same credential plumbing, so they're planned together. - ---- - -## 1. Credential model change (ops-driven) - -### 1.1 Two roles per group - -`server_group_backup_config` carries **both**: - -- `target_role_arn` — **device role** (`deviceRoleArn`). No delete. public-server - mints device creds from it. **Unchanged.** -- `maintenance_role_arn` — **maintenance role** (`maintenanceRoleArn`). `s3:*` + - delete + CloudWatch. The backups pod assumes this for - maintenance / inspection / s3-metrics. **New column.** - -The current code assuming `target_role_arn` for maintenance / s3-metrics is a -**bug** (the device role deliberately can't delete). Fixing it is in scope here. 
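The two-role split in §1.1 can be sketched as a single lookup from operation to role. This is a minimal illustration, not the canopy code: `GroupRoles`, `BackupOp`, and `role_for` are hypothetical names, and only the two column names come from the plan. Preflight is kept on the device role because the plan says so in §1.2.

```rust
/// The two per-group role columns named in §1.1 (struct name is hypothetical).
pub struct GroupRoles {
    /// Device role: deliberately cannot delete.
    pub target_role_arn: String,
    /// Maintenance role: s3:* + delete + CloudWatch.
    pub maintenance_role_arn: String,
}

/// Operations run against a group's bucket (hypothetical enum for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BackupOp {
    MintDeviceCreds,
    Preflight,
    Maintenance,
    Inspection,
    S3Metrics,
}

/// Pick the role ARN to assume for an operation.
/// Maintenance, inspection and s3-metrics must use the maintenance role;
/// assuming the device role for them is the bug described in §1.1.
pub fn role_for(op: BackupOp, roles: &GroupRoles) -> &str {
    match op {
        BackupOp::MintDeviceCreds | BackupOp::Preflight => &roles.target_role_arn,
        BackupOp::Maintenance | BackupOp::Inspection | BackupOp::S3Metrics => {
            &roles.maintenance_role_arn
        }
    }
}
```

Keeping the choice in one function would make it harder for a new job to pick up the device role by accident, which is how the maintenance and s3-metrics bug arose.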
- -### 1.2 Chained AssumeRole everywhere cross-account - -The backups pod keeps its own `canopy-jobs` IRSA creds (default credential -chain). For each group op it needs to reach `maintenance_role_arn`. Two -consumers: - -- **CloudWatch SDK client (s3-metrics):** in-process Rust `aws-sdk-sts` - `AssumeRole(maintenance_role_arn)` → hand to the SDK client. The Rust SDK's - assume-role provider auto-refreshes; the 1h chained cap is a non-issue. - (preflight is *not* in this list — it stays on the device role, §preflight.) -- **kopia subprocess:** see §1.3 — kopia polls a localhost container-credentials - endpoint and self-refreshes. - -Drop every direct-web-identity-against-deployment-account assumption, and drop -the old `AWS_ROLE_ARN`-override path in `kopia.rs`. - -### 1.3 kopia creds via a container-credentials endpoint (the 1h-cap solution) - -**Verified against kopia v0.23.1 + minio-go v7.2.0 source (do not relitigate).** -~90-min snapshot/maintenance runs are routine, so **no <1h static-cred path is -viable**, and: - -- kopia's S3 chain is hardcoded **Static → EnvAWS → IAM**; it never instantiates - minio-go's file/`credential_process` provider → **`credential_process` and "a - creds file" do NOT work**. (Dead. No kopia fork.) -- Static `AWS_*` env creds **never refresh** → break past 1h. -- kopia's `--role` path *would* self-refresh but is bounded by the base session's - lifetime — **superseded** by the endpoint below (simpler ops, no ceiling). -- minio-go's **IAM provider** *does* support the ECS-style **container-credentials - endpoint** (`AWS_CONTAINER_CREDENTIALS_FULL_URI` + `AWS_CONTAINER_AUTHORIZATION_TOKEN`), - polls it, and **self-refreshes** (`SetExpiration` → re-GET at ~80% of lifetime). - -**Plan:** the `backups` bin runs one tiny **localhost** HTTP creds endpoint. 
Per -in-flight op it registers `token → (maintenance_role_arn, region)`, mints that -group's session via the Rust SDK (`AssumeRole`, base auto-refreshed by IRSA), and -returns container-creds JSON. Each kopia subprocess is launched pointing at it; -minio-go re-polls before expiry, so a 90-min+ run just gets fresh creds — **no -session ceiling, no `MaxSessionDuration` change**. - -Verified endpoint contract (minio-go v7.2.0 FULL_URI/ECS path) — implement exactly: -- Subprocess env: `AWS_CONTAINER_CREDENTIALS_FULL_URI=http://127.0.0.1:/` - (use the **`127.0.0.1`** literal — loopback is checked via `LookupHost`; `localhost` - can fail; http is fine, no https needed) + `AWS_CONTAINER_AUTHORIZATION_TOKEN=` - (sent **raw** as the `Authorization` header, no `Bearer`). -- **Env hygiene (critical):** on the kopia subprocess, *unset* - `AWS_WEB_IDENTITY_TOKEN_FILE`, `AWS_ROLE_ARN`, `AWS_ACCESS_KEY_ID`/`SECRET`, - `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI`, `AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE` - — all *precede* the FULL_URI path (IRSA injects the web-identity ones into the - pod). Leave kopia's S3 `accessKeyID/secretAccessKey/sessionToken/roleARN` empty - (Static/RoleARN bypass the chain). -- Server returns **HTTP 200** + JSON `{"AccessKeyId","SecretAccessKey","Token","Expiration"}` - — field is **`Token`** (not SessionToken); `Expiration` is **RFC3339**, a few - minutes out (never omit/zero, or minio-go re-GETs every call). Non-200 = hard - failure (no IMDS fallback on this path). - -Device side (bestool, TAM-6879) uses the *same* protocol, its endpoint fed by -public-server. Revisit public-server's device-cred output shape there. - -### 1.4 Device path: unchanged - -public-server / canopy-issuer already chain-assumes `target_role_arn` and hands -creds to the device over mTLS. No change. 
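The endpoint contract in §1.3 can be sketched with the standard library alone. This is an illustration under assumptions, not the `backups` bin's implementation: the function names are hypothetical, the URI path is left as `/` because the plan does not spell it out, `SECRET` in the plan's list is read as `AWS_SECRET_ACCESS_KEY`, and the expiration is passed in as an already-formatted RFC3339 string.

```rust
use std::process::Command;

/// Env vars that precede the FULL_URI path and must be removed from the
/// kopia subprocess (list from §1.3; AWS_SECRET_ACCESS_KEY is an assumed
/// expansion of the plan's "SECRET").
pub const SCRUBBED_ENV: [&str; 6] = [
    "AWS_WEB_IDENTITY_TOKEN_FILE",
    "AWS_ROLE_ARN",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI",
    "AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE",
];

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Body of the HTTP 200 response. The session token field is `Token`,
/// and `Expiration` must be a non-empty RFC3339 timestamp.
pub fn container_creds_json(
    access_key_id: &str,
    secret_access_key: &str,
    token: &str,
    expiration_rfc3339: &str,
) -> String {
    format!(
        "{{\"AccessKeyId\":\"{}\",\"SecretAccessKey\":\"{}\",\"Token\":\"{}\",\"Expiration\":\"{}\"}}",
        json_escape(access_key_id),
        json_escape(secret_access_key),
        json_escape(token),
        json_escape(expiration_rfc3339),
    )
}

/// Build a kopia command with the competing AWS env vars removed and the
/// container-credentials variables set. Uses the 127.0.0.1 literal, not
/// `localhost`; the auth token is sent raw by the client (no `Bearer`).
pub fn kopia_command(port: u16, auth_token: &str) -> Command {
    let mut cmd = Command::new("kopia");
    for key in SCRUBBED_ENV.iter().copied() {
        cmd.env_remove(key);
    }
    cmd.env(
        "AWS_CONTAINER_CREDENTIALS_FULL_URI",
        format!("http://127.0.0.1:{}/", port),
    )
    .env("AWS_CONTAINER_AUTHORIZATION_TOKEN", auth_token);
    cmd
}
```

A caller would add the kopia arguments to the returned `Command` and spawn it. The real endpoint would mint the session per registered token and return this body on each poll.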
- -### 1.5 What ops provides (so canopy just uses the default chain → AssumeRole) - -- `canopy-issuer` + `canopy-jobs` SAs annotated with IRSA role ARNs, both - carrying `sts:AssumeRole` + `get secrets`. -- Per-bucket roles trust the matching SA role ARN. -- Image still bundles kopia. - -> **Ops coordination (TAM-6878):** the wizard's synchronous probe runs in -> **private-server**, which today has no AWS identity. **Decided:** private-server -> gets a **dedicated `canopy-private` SA + IRSA role** (room to grow more -> private-server AWS features later), carrying `sts:AssumeRole`; the per-bucket -> roles' trust policies must include this role ARN, and the SA needs `create -> secrets` (§3) on top of the existing `get secrets`. - ---- - -## 2. Interactive setup wizard (probe) - -### 2.1 Flow - -Step 1 — **Identity & target.** Operator enters: bucket, prefix, region (default -**`ap-southeast-2`** — most buckets live there), `target_role_arn` (device), -`maintenance_role_arn`. → **Probe.** - -Step 2 — **Probe result & passphrase.** Canopy assumes the role and inspects the -prefix. Based on the result (§2.3) it presents the right passphrase choice -(from_birth generate vs operator-typed passphrase). Probe also reports if this -bucket+prefix is already configured in Canopy (DB check, §2.4). For an existing -repo, once the operator types the passphrase Canopy runs a **second -(verify) probe** that attempts `kopia repository connect` to confirm the -passphrase before committing (§2.3). - -Step 3 — **Schedule & retention.** Only reached once the passphrase situation is -settled. Same fields as today (interval + per-type retention with the org -floors). Then create + provision. - -### 2.2 Probe endpoint - -New private-server endpoint, e.g. `POST /api/backups/probe`. 
Two phases share it: -an **inspect** phase (no passphrase) and a **verify** phase (with passphrase, for -existing repos): - -``` -ProbeArgs { - bucket, prefix, region, target_role_arn, maintenance_role_arn, - passphrase: Option, // present ⇒ also run the connect-verify -} -ProbeResult { - creds_ok: bool, - error: Option, // assume/list failure surfaced verbatim-ish - state: "empty" | "kopia_repo" | "other_content" | "inaccessible", - object_sample: Vec, // a few keys, for "other content" context - already_configured_in_canopy: Option, // group id if bucket+prefix taken - passphrase_ok: Option, // set only when a passphrase was supplied -} -``` - -Implementation: add `aws-sdk-sts` + `aws-sdk-s3` + `aws-config` to private-server -(mirrors `public-server/src/backup.rs` and `jobs/src/backup/preflight.rs`). -Assume the **`maintenance_role_arn`** (full read; it's the path that does the -heavy lifting, so validating it is the most useful signal), then: - -- `ListObjectsV2(bucket, prefix, max-keys=small)`. -- Probe for the kopia repo marker `HeadObject`/`GetObject` on - `kopia.repository` (confirmed: kopia 0.23.1 writes its format blob - there). Present ⇒ `kopia_repo`. -- **`.storageconfig`-only counts as empty:** if the only object(s) under the - prefix are `.storageconfig` (and no `kopia.repository`), treat as `empty`. -- Non-empty (beyond `.storageconfig`) with no marker ⇒ `other_content` (return a - sample of keys). -- Assume/list failure ⇒ `creds_ok=false`, `inaccessible`, surface the error. -- If `passphrase` supplied and state is `kopia_repo`: attempt `kopia repository - connect` with it (using the §1.3 container-credentials endpoint for S3 creds, - since `credential_process` does not work with kopia) → - `passphrase_ok`. (Connect leaves no writes.) - -Also cheaply validate `target_role_arn` with `sts:get-caller-identity` under the -assumed session so a bad device role is caught at setup, not at first device -mint. - -### 2.3 State → offered options - -| Probe state | What we show | -|---|---| -| `empty` | Proceed. 
Mode: **from_birth** (generate + escrow, recommended) or **passphrase** (type your own). | -| `kopia_repo` | An existing kopia repo. Only **passphrase** mode (operator provides the existing passphrase) — *not* from_birth (won't create over an existing repo). The verify-probe must return `passphrase_ok` before the operator can continue. | -| `other_content` | **Block** with a warning + sample of keys. Canopy does **not** delete anything — the operator clears the contents themselves (or chooses a different prefix/bucket) and hits **Retry** to re-probe. No "proceed anyway", no Canopy-side delete. | -| `inaccessible` | Block step 1; show the assume/list error so the operator can fix the role/bucket/region. | - -### 2.4 Already-configured-in-Canopy check - -Before/with the probe, query whether `(bucket, prefix)` (or the group) already -has a `server_group_backup_config`. If so, surface it (link to the existing -config) and block creating a duplicate. Pure DB check, no creds needed. - -### 2.5 `other_content` → Retry only (no Canopy-side delete) - -Canopy never deletes bucket contents. On `other_content` the wizard blocks and -shows a **Retry** button (re-runs the probe); the operator must clear the prefix -/ pick another prefix or bucket on their own (AWS console, etc.). No delete -endpoint, no `s3:DeleteObject*` grant needed. - ---- - -## 3. Repo modes + private-server-owned Secret creation - -### 3.1 Two modes only — drop import-by-Secret - -**Decision:** **Canopy owns all repo passphrases.** Remove the existing -`import` (operator-supplies-a-Secret-name) variant entirely. `BackupRepoMode` -becomes exactly two variants: - -- `from_birth` — Canopy generates the passphrase + escrow flow (reveal-once + - ack). Only valid on an `empty` prefix. -- `passphrase` — operator provides the passphrase; Canopy stores it. **Skips - escrow** → `provisioning → ready` on successful init. 
Covers *both* "set my - own on a fresh repo" (empty prefix → create) and "connect to an existing repo" - (`kopia_repo` → connect, passphrase pre-verified by the §2.2 verify probe). The - repo *state*, not the mode, decides create-vs-connect. - -DB migration changes the `CHECK` on `server_group_backup_config.mode` from -`IN ('from_birth','import')` to `IN ('from_birth','passphrase')`. No existing -rows (confirmed), so no data migration. Remove `import`-specific handling -(`repo_password_ref` is no longer a user input — Canopy always names/owns the -Secret) and the `BackupRepoMode::Import` match arms. - -### 3.2 Secret creation — currently missing - -**Gap found:** nothing in the codebase *creates* the passphrase Secret today. -`from_birth` init only ever *reads* it (`worker.read_repo_password`), and there -is no passphrase generation — so `from_birth` is not actually wired end-to-end. - -**Decision (confirmed):** private-server owns Secret creation for both modes, -at config-create time: - -- `from_birth`: generate a strong passphrase, create the k8s Secret - (`backup-repo-{group_id}`, key `password`), record the ref. Escrow flow - unchanged. -- `passphrase`: create the Secret from the operator-typed value, ref recorded, - no escrow. - -This gives the `canopy-private` SA `create secrets` RBAC (today it has `get` -only). The `backups` init loop keeps only *reading* the Secret — no change. - -> Scope note (rule_no_self_scoping): finishing `from_birth` generate+create is -> pulled in here because both modes share the same (missing) machinery; calling -> it out rather than silently bundling or dropping it. - -### 3.3 `.storageconfig` (Intelligent-Tiering) on init - -Normally pulumi writes `.storageconfig` at bucket creation -(`ops/pulumi/tamanu/on-linux/src/backup/kopia.ts` + `ops/pulumi/backups/index.ts`). -Canopy creates it as a **fallback**: on repo init, if `.storageconfig` -is **absent**, write it; **never overwrite** an existing one. 
A prefix -containing only `.storageconfig` is treated as `empty` by the probe (§2.2), so a -pre-seeded tiering config doesn't block from_birth. - -Use the exact schema ops uses (kopia `blobOptions` — data blobs under the `p` -prefix → Intelligent-Tiering, everything else → Standard so indexes stay in the -frequent tier), written as pretty JSON, `Content-Type: application/json`: - -```json -{ - "blobOptions": [ - { "prefix": "p", "storageClass": "INTELLIGENT_TIERING" }, - { "storageClass": "STANDARD" } - ] -} -``` - -The `p` prefix is relative to the kopia storage root, so the same content is -correct whether or not the repo lives under a canopy prefix; place the object at -`.storageconfig`. - ---- - -## 3a. Machine-facing config-as-a-resource API (ops/pulumi) - -Complements the wizard (does **not** replace it). Pulumi creates the bucket + -device/maintenance roles, then pushes the backup config to Canopy as a managed -resource — so operators don't hand-copy ARNs out of pulumi. - -- **Endpoints:** create / update / delete / get a `server_group_backup_config` - (the wizard's create/update reuse the same handlers). Create/update run the - **same server-side access-check/probe** (§2.2) before persisting, so a config - pushed by pulumi is validated identically — bad creds/role/bucket fail fast. -- **Resource semantics:** idempotent upsert keyed by group (or bucket+prefix), - suitable for a Pulumi dynamic provider / `Command`-style resource. **Delete - tears down the config *and* deletes the Canopy-owned passphrase Secret** (both - modes — Canopy owns it now). -- **Auth (decided):** gate on `TailscaleAdmin` for now — pulumi has tailnet - access. A proper non-interactive machine path (still over Tailscale, via - tagged/ACL-grant access) is wanted **later but explicitly out of scope here**. - ---- - -## 4. 
Work breakdown - -### DB (database crate, migration via `just migration`) -- Add `maintenance_role_arn TEXT NOT NULL` to `server_group_backup_config` (no - existing rows, so NOT NULL is clean) + model field + re-exports. -- Change `mode` CHECK from `IN ('from_birth','import')` to - `IN ('from_birth','passphrase')`. - -### commons-types -- `BackupRepoMode`: replace `Import` with `Passphrase` (`"passphrase"`). - -### private-server (`fns/backups.rs`, `state.rs`) -- AWS deps (`aws-sdk-sts`/`aws-sdk-s3`/`aws-config`); `probe` endpoint (inspect + - verify phases, §2.2); `already_configured` DB check. No delete-contents - endpoint — `other_content` is Retry-only (§2.5). -- Secret creation on `create` for from_birth + passphrase (§3.2); extend the - kube wrapper with a create op; `create secrets` RBAC. -- `CreateBackupConfigArgs`: add `maintenance_role_arn`; accept the typed - passphrase for `passphrase` mode; drop `repo_password_ref` as a user input and - the `Import` arms. -- **Config-as-a-resource API (§3a):** create/update/delete/get usable by - pulumi, sharing the access-check; resolve the machine-auth path. -- openapi regen (`just gen-openapi`). - -### jobs (`backup/{kopia,worker,maintenance,inspection,s3_metrics}.rs`, bin) -- Switch **s3-metrics** (Rust SDK) to `AssumeRole(maintenance_role_arn)` (the - CloudWatch grant lives there); auto-refreshes. **preflight stays on - `target_role_arn`** — it validates the *device* path (assumes the device role - both ways incl. the restore session policy); a separate maintenance-role - preflight check is a follow-up, not this PR. -- **kopia container-creds endpoint (§1.3):** the bin runs a localhost endpoint - (token → `maintenance_role_arn` registry, mints via Rust SDK); each kopia - subprocess gets `AWS_CONTAINER_CREDENTIALS_FULL_URI`+`AWS_CONTAINER_AUTHORIZATION_TOKEN` - and the env-hygiene scrub (unset web-identity/static/relative-uri vars). Remove - the `AWS_ROLE_ARN`-override path in `kopia.rs`. 
(No `credential_process`/`--role`.) -- **`.storageconfig` on init (§3.3):** create if absent, never overwrite. - -### frontend (`private-web/`) -- `BackupConfig.tsx` → multi-step wizard (step 1 identity+probe with both role - ARNs + region default `ap-southeast-2`; step 2 probe result + passphrase + - verify-probe for existing repos; step 3 schedule/retention). Render the - state→options matrix (§2.3); already-configured; `other_content` blocking with - a Retry button (no delete). -- Generated api-types (`just gen-openapi`). - -### tests -- Rust: probe endpoint (inspect + verify; mock/seed S3 where feasible), secret - creation, two-mode status machine, migration, the resource API. -- Playwright e2e: wizard steps, region default, probe states (the e2e kube/AWS - clients are `None` today → probe needs a test seam to stub responses), - passphrase-mode flow, from_birth escrow flow. - -### cross-repo / ops (TAM-6878 pulumi) -- New `canopy-private` SA + IRSA role; per-bucket trust includes it; `create - secrets` RBAC. -- `maintenanceRoleArn` per group (CloudWatch + delete grant on it); pulumi calls - the new resource API to register configs (§3a). - ---- - -## 5. Decisions (all resolved) - -Resolved: dedicated `canopy-private` SA; probe assumes -`maintenance_role_arn` (+ cheap `target_role_arn` validate); `maintenance_role_arn` -NOT NULL (no existing rows); passphrase mode straight to `ready` (no escrow); -`other_content` hard-blocks, Retry-only (Canopy never deletes contents); -import-by-Secret dropped; **kopia creds via a localhost container-credentials -endpoint** the bin serves (verified: `credential_process`/creds-file/`--role` -all rejected; minio-go IAM provider polls + self-refreshes, no session ceiling, -**no `MaxSessionDuration` ops change**); `kopia.repository` marker confirmed. 
- -Also resolved this round: pulumi→private-server auth = `TailscaleAdmin` gate -(pulumi has tailnet access; machine/tagged-grant auth is a later, out-of-scope -follow-up); deleting a config also deletes the Canopy-owned Secret; -`.storageconfig` schema taken verbatim from the ops repo (§3.3); e2e probe is -stubbed/mocked (whatever's simplest — the e2e AWS/kube clients are `None`, so the -probe path gets a test seam returning canned `ProbeResult`s; no real S3). - -**Nothing blocking remains** — ready to implement on plan approval. diff --git a/docs/plans/specs/canopy-database.md b/docs/plans/specs/canopy-database.md deleted file mode 100644 index c094cf4a..00000000 --- a/docs/plans/specs/canopy-database.md +++ /dev/null @@ -1,681 +0,0 @@ -# Implementation spec: `canopy-database` (backup-credentials) - -Component: the **database crate** (`crates/database`) changes for the -backup-credentials system. This is the foundational layer every other -component (public-server endpoints, the `jobs`-crate schedulers, the -private-server operator UI) builds on: it owns the migrations, the diesel -models, and the `lib.rs` re-exports for all backup tables. - -Authoritative design: [`../backup-credentials.md`](../backup-credentials.md) -(stage-2 stub: [`../backup-credentials-blind-relay.md`](../backup-credentials-blind-relay.md)). -This spec does not re-litigate decided shape — it makes the DB layer concrete. - ---- - -## Purpose - -Provide the persistent state for (**10 tables** as shipped — the "Backup -types addendum" at the foot of this spec added `backup_type_defaults`, -`server_backup_capabilities`, and `server_group_backup_schedule` to the -original seven; they're folded into the list here): - -- **`server_group_backup_config`** — per-group repo-level backup configuration + - lifecycle status (one row per configured group; `group_id` PK → `server_groups`). 
-- **`backup_type_defaults`** — Canopy-wide per-type defaults (`default_interval`, - `default_retention`, `auto_enable`). -- **`server_backup_capabilities`** — what each server advertises it can back up - (bestool-registered), with a per-server `enabled` toggle. -- **`server_group_backup_schedule`** — per-`(group, type)` schedule/retention - overrides over the type defaults. -- **`backup_credential_issuances`** — audit log of every STS credential issuance. -- **`backup_runs`** — what bestool reported per backup/restore run (client-minted UUID PK). -- **`backup_maintenance_runs`** — Canopy-owned maintenance-Job outcomes (per-group). -- **`backup_repo_snapshots`** — ground-truth inventory from the read-only inspection Job. -- **`backup_repo_stats`** — cached repo + bucket size/stats for operator display (per-group). -- **`backup_requests`** — pending operator one-off "backup now" flags (per `(server, type, purpose)`). - -Plus the diesel model structs, insert/query helpers, and `lib.rs` module + -re-exports. Where helpers fall on a component boundary (e.g. staleness scan -queries used by the `jobs` crate, issuance recording used by public-server) -this spec defines the **signatures and ownership**; the calling logic lives -in those components' own specs. - ---- - -## Conventions to follow (grounded in the repo) - -Read before implementing: `crates/database/src/{schema,servers,server_groups,devices,issues,statuses,pg_duration}.rs` -and `migrations/2026-05-22-120000-0000_server_groups/{up,down}.sql`. - -- **Migrations are scaffolded with `just migration NAME`** (never hand-create - the directory — inconsistent naming; this is a flagged repeat mistake). - That runs `diesel migration generate`, producing - `migrations/_/{up,down}.sql`. Then `just migrate` runs them and - `cargo fmt`s the regenerated `schema.rs`. 
The diesel CLI **regenerates - `crates/database/src/schema.rs`** from the live DB — do **not** hand-edit - `schema.rs`; let the migration drive it, then commit the diff. -- One migration per logical change is the norm, but a cohesive feature can be - several sequential migrations (see the `2026-06-01-012906-000{0,1,2}` triple). - RESOLVED (impl): the core landed as one migration - `2026-06-12-090526-0000_backup_credentials` creating all **10** tables (the - addendum tables included from the start), with a clean reverse-order - `DROP TABLE` `down.sql`. Two follow-up migrations layered on later: - `2026-06-15-064431-0000_backup_group_scoped_issues` and - `2026-06-16-001346-0000_backup_config_lifecycle_columns` (adds `mode`, - `last_init_error`, `escrow_acked_at`, `escrow_acked_by` to - `server_group_backup_config`). -- **Timestamps**: columns are `TIMESTAMPTZ NOT NULL DEFAULT NOW()`. Models map - them with `#[diesel(deserialize_as = jiff_diesel::Timestamp, serialize_as = - jiff_diesel::Timestamp)]` over a `jiff::Timestamp` field; nullable ones use - `jiff_diesel::NullableTimestamp` over `Option<jiff::Timestamp>`. -- **`updated_at` auto-touch**: for tables with an `updated_at`, call - `SELECT diesel_manage_updated_at('<table>');` in `up.sql` (as - `server_groups` does). `server_group_backup_config` and - `server_group_backup_schedule` need this here. -- **INTERVAL columns** map to `crate::pg_duration::PgDuration` (wraps - `jiff::SignedDuration`; serde wire form is whole seconds as `i64`). For a - *nullable* interval (`expected_interval`), the field is - `Option<PgDuration>`; annotate the schema with `#[schema(value_type = - Option<i64>, format = "int64")]` for utoipa (see `ServerGroup::slack_open_delay`). -- **JSONB `retention`** maps to `serde_json::Value` *at the storage layer* — - diesel handles `Jsonb -> serde_json::Value` natively (see `statuses.health`, - `statuses.extra`). The retention columns stay `JsonValue` in the model structs. 
- RESOLVED (impl): a validated shape **was** wanted after all — a typed - `RetentionPolicy` struct (`backups::RetentionPolicy`) sits *over* the raw - value with the kopia `keep_*` fields, `FLOOR_DAILY/WEEKLY/MONTHLY` constants, - `validate_floor()` (returns `AppError::BadRequest` listing the violated - fields), and `from_json`/`to_json`/`to_value` converters. The floor logic - lives in the DB crate and is called by the private-server write path; the - storage columns themselves remain `JsonValue` (so `RetentionPolicy` is a - helper, not a diesel column type). -- **Models**: `#[derive(Debug, Clone, Serialize, Deserialize, Queryable, - Selectable, Insertable, utoipa::ToSchema)]`, `#[diesel(table_name = - crate::schema::<table>)]`, `#[diesel(check_for_backend(diesel::pg::Pg))]`. - A separate `New<Table>` `Insertable` struct is used where the insert shape - differs from the row (see `NewServerGroup`, `NewStatus`). Add - `#[diesel(belongs_to(...))]` + a `joinable!` entry where a join is wanted. -- **Helper methods** are `impl` blocks with - `pub async fn <name>(db: &mut AsyncPgConnection, ...) -> Result<...>` returning - `commons_errors::Result`, ending each query with `.map_err(AppError::from)`. - Use `use crate::schema::<table>::dsl;` inside each fn (the established style). -- **`BIGSERIAL` PK** maps to `pub id: i64` in the model and is **omitted** from - the `New<Table>` insertable. -- **Schema regen verification**: after the migration, `schema.rs` gains the 10 - `diesel::table!` blocks, plus `joinable!` and `allow_tables_to_appear_in_same_query!` - entries. Confirm `bigserial` surfaces as `Int8`, `JSONB` as `Jsonb`, - `INTERVAL` as `Interval`/`Nullable<Interval>`. - ---- - -## Migration: `backup_credentials` - -`up.sql` creates the 10 tables below. DDL is normative (it is what the -diesel schema regen reads); the design doc's snippets are the source. - -RESOLVED (impl) — **FK semantics are uniform: plain `REFERENCES` with NO -`ON DELETE`/`ON UPDATE` clause everywhere.** `server_groups`, `servers`, and -`devices` are *archived* (`deleted_at` soft-delete), never hard-deleted, so the -cascade-vs-preserve distinction the original per-table notes agonised over is -moot in practice. The per-table "CASCADE here / no-CASCADE there" prose below -is **superseded** by this single rule; open questions 1 & 2 are resolved -accordingly (see below). The original notes are kept inline for design history, -struck through where they no longer hold. - -The addendum tables (`backup_type_defaults`, `server_backup_capabilities`, -`server_group_backup_schedule`) and the type-keying deltas are folded into the -DDL shown here; see the addendum at the foot for the design rationale. - -### `server_group_backup_config` - -RESOLVED (impl) — `expected_interval` and `retention` moved off this table -(repo-level only now; schedule/retention live per-`(group, type)` in -`server_group_backup_schedule` per the addendum). Lifecycle columns (`mode`, -`last_init_error`, `escrow_acked_at`, `escrow_acked_by`) were added by the -`2026-06-16-...backup_config_lifecycle_columns` migration. 
As-shipped DDL: - -```sql -CREATE TABLE server_group_backup_config ( - group_id UUID PRIMARY KEY REFERENCES server_groups(id), - bucket TEXT NOT NULL, - prefix TEXT NOT NULL DEFAULT '', - target_role_arn TEXT NOT NULL, - region TEXT, - repo_password_ref TEXT NOT NULL, - status TEXT NOT NULL CHECK (status IN ('provisioning', 'escrow_pending', 'ready')), - created_at TIMESTAMPTZ NOT NULL DEFAULT now(), - updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), - -- added by 2026-06-16-...backup_config_lifecycle_columns: - mode TEXT NOT NULL DEFAULT 'from_birth' CHECK (mode IN ('from_birth', 'import')), - last_init_error TEXT, - escrow_acked_at TIMESTAMPTZ, - escrow_acked_by TEXT -); -SELECT diesel_manage_updated_at('server_group_backup_config'); -``` - -- ~~**`ON DELETE CASCADE` is intentional** here~~ — superseded: plain - `REFERENCES`, no cascade (archival model; groups are soft-deleted, not - hard-deleted). -- `status` is a `TEXT` column with a `CHECK (status IN (...))` for the closed - three-value set `{provisioning, escrow_pending, ready}`, validated in code via - the `BackupConfigStatus` enum. The closed enums all carry a DB `CHECK` - in the shipped schema (status, mode, purpose, outcome, kind). -- **Lifecycle columns:** `mode` is the 5th closed enum `BackupRepoMode` - (`from_birth` / `import`, with a DB CHECK); `last_init_error` is set by the - init Job on `kopia repository create` failure and cleared by the operator-UI - on retry; `escrow_acked_at`/`escrow_acked_by` stamp the Bitwarden-escrow ack - that flips `escrow_pending → ready`. 
- -### `backup_type_defaults`, `server_backup_capabilities`, `server_group_backup_schedule` (addendum tables) - -```sql -CREATE TABLE backup_type_defaults ( - type TEXT PRIMARY KEY, - default_interval INTERVAL, - default_retention JSONB NOT NULL CHECK (jsonb_typeof(default_retention) = 'object'), - auto_enable BOOLEAN NOT NULL DEFAULT false -); - -CREATE TABLE server_backup_capabilities ( - server_id UUID NOT NULL REFERENCES servers(id), - type TEXT NOT NULL, - enabled BOOLEAN NOT NULL, - registered_at TIMESTAMPTZ NOT NULL DEFAULT now(), - PRIMARY KEY (server_id, type) -); - -CREATE TABLE server_group_backup_schedule ( - group_id UUID NOT NULL REFERENCES server_groups(id), - type TEXT NOT NULL, - expected_interval INTERVAL, - retention JSONB CHECK (retention IS NULL OR jsonb_typeof(retention) = 'object'), - created_at TIMESTAMPTZ NOT NULL DEFAULT now(), - updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), - PRIMARY KEY (group_id, type) -); -SELECT diesel_manage_updated_at('server_group_backup_schedule'); -``` - -- `retention` now lives here (nullable, `jsonb_typeof='object'` CHECK when - present) and on `backup_type_defaults.default_retention` (NOT NULL, same - CHECK) — **superseding** the original `retention` column on - `server_group_backup_config`. Effective value for a `(group, type)` is the - schedule override `?? backup_type_defaults`, with the org retention floor - applied in code (`RetentionPolicy::validate_floor`). -- `server_backup_capabilities.enabled` is **seeded** from - `backup_type_defaults.auto_enable` at first registration, then operator- - toggleable per server. 
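The "schedule override, else type default, then floor check" rule for retention can be sketched as below. This is a minimal std-only illustration, not the shipped `backups::RetentionPolicy`: the three `keep_*` fields, the floor values, and the `effective_retention` function are assumptions made for the example, and the real `validate_floor()` returns `AppError::BadRequest` rather than a list of field names.

```rust
/// Simplified stand-in for the typed retention helper (fields are assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RetentionPolicy {
    pub keep_daily: u32,
    pub keep_weekly: u32,
    pub keep_monthly: u32,
}

// Placeholder floor values; the real org floors are not stated in this spec.
pub const FLOOR_DAILY: u32 = 7;
pub const FLOOR_WEEKLY: u32 = 4;
pub const FLOOR_MONTHLY: u32 = 3;

impl RetentionPolicy {
    /// Returns the names of every field that falls below its floor.
    pub fn validate_floor(&self) -> Result<(), Vec<&'static str>> {
        let mut violated = Vec::new();
        if self.keep_daily < FLOOR_DAILY {
            violated.push("keep_daily");
        }
        if self.keep_weekly < FLOOR_WEEKLY {
            violated.push("keep_weekly");
        }
        if self.keep_monthly < FLOOR_MONTHLY {
            violated.push("keep_monthly");
        }
        if violated.is_empty() {
            Ok(())
        } else {
            Err(violated)
        }
    }
}

/// Hypothetical resolver: the per-(group, type) override wins when present,
/// otherwise the type default applies; the result must satisfy the floor.
pub fn effective_retention(
    schedule_override: Option<RetentionPolicy>,
    type_default: RetentionPolicy,
) -> Result<RetentionPolicy, Vec<&'static str>> {
    let policy = schedule_override.unwrap_or(type_default);
    policy.validate_floor()?;
    Ok(policy)
}
```

Returning every violated field at once, rather than failing on the first, matches the described behaviour of listing the violated fields in a single error.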
- -### `backup_credential_issuances` - -```sql -CREATE TABLE backup_credential_issuances ( - id BIGSERIAL PRIMARY KEY, - device_id UUID NOT NULL REFERENCES devices(id), - group_id UUID NOT NULL REFERENCES server_groups(id), - type TEXT NOT NULL, - issued_at TIMESTAMPTZ NOT NULL DEFAULT now(), - expires_at TIMESTAMPTZ NOT NULL, - purpose TEXT NOT NULL CHECK (purpose IN ('backup', 'restore')), - sts_assumed_role TEXT NOT NULL, - sts_request_id TEXT, - access_key_id TEXT, - bucket TEXT NOT NULL, - prefix TEXT NOT NULL -); -CREATE INDEX ON backup_credential_issuances (device_id, issued_at DESC); -CREATE INDEX ON backup_credential_issuances (group_id, issued_at DESC); -``` - -- ~~**No CASCADE on `group_id`/`device_id` — deliberate audit preservation**~~ — - superseded by the uniform no-cascade/archival rule (the FK is plain - `REFERENCES` regardless; the audit data is preserved because rows are - soft-deleted, not hard-deleted). -- `bucket`/`prefix` are **snapshots at issuance time**, not FKs back to config. -- `type TEXT` (addendum) — backups are keyed `(server, type)`. -- `purpose` is `TEXT` with a DB `CHECK (purpose IN ('backup','restore'))`, - also validated in code via `BackupPurpose`. - -### `backup_runs` - -```sql -CREATE TABLE backup_runs ( - id UUID PRIMARY KEY, - device_id UUID NOT NULL REFERENCES devices(id), - group_id UUID NOT NULL REFERENCES server_groups(id), - server_id UUID REFERENCES servers(id), - type TEXT NOT NULL, - purpose TEXT NOT NULL CHECK (purpose IN ('backup', 'restore')), - outcome TEXT NOT NULL CHECK (outcome IN ('success', 'failure')), - error TEXT, - bytes_uploaded BIGINT, - snapshot_id TEXT, - reported_at TIMESTAMPTZ NOT NULL DEFAULT now() -); -CREATE INDEX ON backup_runs (group_id, reported_at DESC); -CREATE INDEX ON backup_runs (device_id, reported_at DESC); -CREATE INDEX ON backup_runs (server_id, type, reported_at DESC); -``` - -- `server_id` (nullable) + `type TEXT` were added by the addendum so staleness - is per-`(server, type)`. 
The third index `(server_id, type, reported_at DESC)` - serves that per-`(server, type)` "latest run" staleness scan. - -- **`id` is a client-supplied UUID** (the run-uuid bestool mints at run start), - **not** `gen_random_uuid()` and **not** `BIGSERIAL`. No `DEFAULT`. The - `New`-side insert provides it. A duplicate `id` fails its own insert (PK - violation) — that's the intended safety (design §`backup_runs`); the model - helper should surface that as a clean error, not panic. -- `device_id`/`group_id` come from the authenticated `ServerDevice` context in - the caller, **never** from the client body — the model helper takes them as - parameters (see contract below), it does not read them from a deserialized - client struct. -- Plain `REFERENCES` on the FKs (uniform no-cascade/archival rule). -- For the staleness scan, the hot query is "latest successful `purpose='backup'` - run per `(server, type)`"; the `(server_id, type, reported_at DESC)` index - serves it (the `(group_id, reported_at DESC)` index serves repo-level cuts). - -### `backup_maintenance_runs` - -```sql -CREATE TABLE backup_maintenance_runs ( - id BIGSERIAL PRIMARY KEY, - group_id UUID NOT NULL REFERENCES server_groups(id), - kind TEXT NOT NULL, -- "quick" | "full" - started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), - finished_at TIMESTAMPTZ, - outcome TEXT, -- NULL while running - error TEXT, - bytes_reclaimed BIGINT -); -CREATE INDEX ON backup_maintenance_runs (group_id, started_at DESC); -``` - -- Plain `REFERENCES` on `group_id` (uniform no-cascade/archival rule). -- `outcome` NULL = still running; the model helper has a `start()` (insert, - returns the new `i64` id) and a `finish(id, outcome, error, bytes_reclaimed)` - update — the Job-side caller (jobs crate) owns the start/finish bracket. 
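The start/finish bracket on `backup_maintenance_runs` can be pictured with an in-memory model. This is only a sketch of the contract (insert with a NULL outcome, later update by id); the `MaintenanceLog` type, the `u64` group id standing in for a UUID, and the `still_running` helper are assumptions for illustration, and the real helpers are async diesel queries.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MaintenanceKind {
    Quick,
    Full,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunOutcome {
    Success,
    Failure,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MaintenanceRun {
    pub id: i64,
    pub group_id: u64, // stand-in for the UUID group id
    pub kind: MaintenanceKind,
    pub outcome: Option<RunOutcome>, // None while the run is in flight
    pub error: Option<String>,
    pub bytes_reclaimed: Option<i64>,
}

/// Hypothetical in-memory table, used only to illustrate the bracket.
#[derive(Default)]
pub struct MaintenanceLog {
    rows: Vec<MaintenanceRun>,
}

impl MaintenanceLog {
    /// Inserts a running row and returns its id (mirrors the BIGSERIAL id).
    pub fn start(&mut self, group_id: u64, kind: MaintenanceKind) -> i64 {
        let id = self.rows.len() as i64 + 1;
        self.rows.push(MaintenanceRun {
            id,
            group_id,
            kind,
            outcome: None,
            error: None,
            bytes_reclaimed: None,
        });
        id
    }

    /// Closes the bracket; an unknown id is an error, not a panic.
    pub fn finish(
        &mut self,
        id: i64,
        outcome: RunOutcome,
        error: Option<String>,
        bytes_reclaimed: Option<i64>,
    ) -> Result<(), String> {
        let row = self
            .rows
            .iter_mut()
            .find(|r| r.id == id)
            .ok_or_else(|| format!("no maintenance run with id {}", id))?;
        row.outcome = Some(outcome);
        row.error = error;
        row.bytes_reclaimed = bytes_reclaimed;
        Ok(())
    }

    /// Ids of runs that were started but never finished.
    pub fn still_running(&self) -> Vec<i64> {
        self.rows
            .iter()
            .filter(|r| r.outcome.is_none())
            .map(|r| r.id)
            .collect()
    }
}
```

Because a row exists from the moment a run starts, a run that never reaches `finish` stays visible with a NULL outcome, which is what makes a stuck maintenance detectable.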
- -### `backup_repo_snapshots` - -```sql -CREATE TABLE backup_repo_snapshots ( - group_id UUID NOT NULL REFERENCES server_groups(id), - source TEXT NOT NULL, - server_id UUID REFERENCES servers(id), - type TEXT, - latest_snapshot_at TIMESTAMPTZ, - observed_at TIMESTAMPTZ NOT NULL DEFAULT now(), - PRIMARY KEY (group_id, source) -); -``` - -- Composite PK `(group_id, source)`. The inspection Job **upserts** per source - (`ON CONFLICT (group_id, source) DO UPDATE`) — provide an `upsert` helper. -- `server_id` (and `type`, addendum) are parsed from `source` by the caller and - are **nullable** (a source whose server-id no longer resolves still records). - Plain `REFERENCES` on both FKs (uniform no-cascade/archival rule) — RESOLVED: - no `SET NULL`; servers are archived not deleted, so the "block on delete" - worry is moot. - -### `backup_repo_stats` - -```sql -CREATE TABLE backup_repo_stats ( - group_id UUID PRIMARY KEY REFERENCES server_groups(id), - snapshot_count INTEGER, - source_count INTEGER, - logical_bytes BIGINT, - physical_bytes BIGINT, - bucket_bytes BIGINT, - observed_at TIMESTAMPTZ NOT NULL DEFAULT NOW() -); -``` - -- One row per group (PK = `group_id`). Filled by **two distinct writers**: the - inspection Job sets the repo-derived fields + `source_count`/`snapshot_count`; - the S3-metrics task sets `bucket_bytes` (best-effort/nullable, may lag). - Provide **two separate update helpers** so each writer touches only its - fields (don't clobber `bucket_bytes` from the inspection writer, or vice - versa) — both upsert on `group_id`. -- This is a *cache*, not audit. RESOLVED (impl): plain `REFERENCES`, no - cascade — the uniform archival rule applies here too (groups are - soft-deleted, so a "rebuildable cache should cascade" exception isn't needed). 
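The two-writer rule for `backup_repo_stats` (each upsert touches only its own columns) can be illustrated with an in-memory map. This is a sketch under stated assumptions: `StatsCache` and the `u64` group id are hypothetical stand-ins, and the real helpers are `ON CONFLICT (group_id) DO UPDATE` queries.

```rust
use std::collections::HashMap;

/// One cached row per group; every field is nullable, as in the DDL.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct RepoStats {
    pub snapshot_count: Option<i32>,
    pub source_count: Option<i32>,
    pub logical_bytes: Option<i64>,
    pub physical_bytes: Option<i64>,
    pub bucket_bytes: Option<i64>,
}

/// Hypothetical in-memory stand-in for the table, keyed by group id.
#[derive(Default)]
pub struct StatsCache {
    rows: HashMap<u64, RepoStats>,
}

impl StatsCache {
    /// Inspection writer: sets the repo-derived fields, leaves bucket_bytes alone.
    pub fn upsert_repo_fields(
        &mut self,
        group_id: u64,
        snapshot_count: i32,
        source_count: i32,
        logical: i64,
        physical: i64,
    ) {
        let row = self.rows.entry(group_id).or_default();
        row.snapshot_count = Some(snapshot_count);
        row.source_count = Some(source_count);
        row.logical_bytes = Some(logical);
        row.physical_bytes = Some(physical);
    }

    /// S3-metrics writer: sets bucket_bytes only.
    pub fn upsert_bucket_bytes(&mut self, group_id: u64, bucket_bytes: i64) {
        self.rows.entry(group_id).or_default().bucket_bytes = Some(bucket_bytes);
    }

    pub fn get(&self, group_id: u64) -> Option<&RepoStats> {
        self.rows.get(&group_id)
    }
}
```

Either writer may arrive first; the row is created on demand and the other writer's fields stay as they were.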
- -### `backup_requests` - -```sql -CREATE TABLE backup_requests ( - server_id UUID NOT NULL REFERENCES servers(id), - type TEXT NOT NULL, - purpose TEXT NOT NULL CHECK (purpose IN ('backup', 'restore')), - requested_at TIMESTAMPTZ NOT NULL DEFAULT now(), - requested_by TEXT, - PRIMARY KEY (server_id, type, purpose) -); -``` - -- Keyed on `server_id` (one-off requests are server-scoped, cleared when the - run is reported). Composite PK is `(server_id, type, purpose)` (addendum - added `type`) — one pending request per `(server, type, purpose)`; a second - request is an upsert (refresh `requested_at`/`requested_by`). -- This is transient operator intent, not audit. RESOLVED (impl): plain - `REFERENCES` on `server_id`, no cascade (uniform archival rule). - -`down.sql`: `DROP TABLE` all 10 in reverse-dependency order (no inter-table -FKs among them; all FKs point at pre-existing tables). - ---- - -## Diesel models + `lib.rs` - -New module `crates/database/src/backups.rs` (single module for all 10 -tables — they're one cohesive feature, mirroring how `issues.rs` holds -issues/events/incidents together). RESOLVED (impl) — the as-shipped `lib.rs` -re-export superset: - -```rust -pub mod backups; -pub use backups::{ - BackupCredentialIssuance, BackupMaintenanceRun, BackupRepoSnapshot, BackupRepoStats, - BackupRequest, BackupRun, BackupTypeDefault, NewBackupCredentialIssuance, NewBackupRun, - NewBackupTypeDefault, NewServerGroupBackupConfig, NewServerGroupBackupSchedule, - ServerBackupCapability, ServerGroupBackupConfig, ServerGroupBackupSchedule, -}; -// the enums are defined in commons-types and re-exported through database: -pub use commons_types::backup::{ - BackupConfigStatus, BackupPurpose, BackupRepoMode, BackupType, MaintenanceKind, RunOutcome, -}; -// RetentionPolicy is reached as `database::backups::RetentionPolicy` -// (it lives in the backups module; not in the flat re-export set). 
-``` - -(Existing `lib.rs` re-exports `devices::*` and `bestool_snippets::*`; backups -is the same pattern. The five closed enums — `BackupPurpose`, `RunOutcome`, -`MaintenanceKind`, `BackupRepoMode`, `BackupConfigStatus` — live in -`commons-types` plus the open `BackupType{Custom}`; see below.) - -### String-typed enums - -`purpose`, `status`, `outcome`, `kind` are `TEXT` in the DB. Two options, -consistent with existing code: - -- **Plain `String` fields**, validated at the API layer (matches - `issues.resolved_reason`, `servers.kind`). Simplest; lowest ceremony. -- A small enum in `commons-types` with `Display`/`FromStr` and stored via - `deserialize_as = String, serialize_as = String` (matches how `Severity` and - `ServerKind` are handled). Preferred if the values are reused across - public-server, jobs, and private-web wire types — which they are - (`purpose` flows through three components). - -RESOLVED (impl): the enum option won across the board — **all** the closed -enum-ish columns are typed `commons-types` enums (via a `text_enum!` macro that -implements `Display`/`FromStr` + diesel `ToSql`/`FromSql` over `Text`), each -backed by a DB `CHECK`. The five closed enums are `BackupPurpose -{Backup, Restore}`, `RunOutcome {Success, Failure}`, `MaintenanceKind -{Quick, Full}`, `BackupRepoMode {FromBirth, Import}`, and `BackupConfigStatus -{Provisioning, EscrowPending, Ready}` (the original spec listed only the first -four — `BackupRepoMode` is the 5th, added with the lifecycle columns). -`status` did **not** stay a bare `String` — it's `BackupConfigStatus`. -Separately, `backup type` is the **open** enum `BackupType` with a `Custom(String)` -arm (no DB CHECK, any advertised name preserved verbatim). The model field -types and the `CHECK` constraints agree. 
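As an illustration of what a closed text-backed enum provides, here is a hand-written sketch for `BackupConfigStatus`. It is not the `text_enum!` macro itself (whose expansion is not shown in this spec) and it omits the diesel `ToSql`/`FromSql` impls; the `as_str` helper and the `String` error type are assumptions for the example.

```rust
use std::fmt;
use std::str::FromStr;

/// Closed three-value set, mirroring the DB CHECK on `status`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackupConfigStatus {
    Provisioning,
    EscrowPending,
    Ready,
}

impl BackupConfigStatus {
    /// The exact text stored in the TEXT column.
    pub fn as_str(self) -> &'static str {
        match self {
            BackupConfigStatus::Provisioning => "provisioning",
            BackupConfigStatus::EscrowPending => "escrow_pending",
            BackupConfigStatus::Ready => "ready",
        }
    }
}

impl fmt::Display for BackupConfigStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

impl FromStr for BackupConfigStatus {
    type Err = String;

    /// Rejects anything outside the closed set, as the CHECK would.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "provisioning" => Ok(BackupConfigStatus::Provisioning),
            "escrow_pending" => Ok(BackupConfigStatus::EscrowPending),
            "ready" => Ok(BackupConfigStatus::Ready),
            other => Err(format!("unknown backup config status: {}", other)),
        }
    }
}
```

Keeping the string mapping in one `as_str` function makes it easy to compare against the `CHECK (status IN (...))` list when either side changes.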
- -### Model sketches (abbreviated; full set in `backups.rs`) - -```rust -#[derive(Debug, Clone, Serialize, Deserialize, Queryable, Selectable, Insertable, utoipa::ToSchema)] -#[diesel(table_name = crate::schema::server_group_backup_config)] -#[diesel(check_for_backend(diesel::pg::Pg))] -pub struct ServerGroupBackupConfig { - pub group_id: Uuid, - pub bucket: String, - pub prefix: String, - pub target_role_arn: String, - pub region: Option<String>, - pub repo_password_ref: String, - pub status: BackupConfigStatus, - #[diesel(deserialize_as = jiff_diesel::Timestamp, serialize_as = jiff_diesel::Timestamp)] - pub created_at: Timestamp, - #[diesel(deserialize_as = jiff_diesel::Timestamp, serialize_as = jiff_diesel::Timestamp)] - pub updated_at: Timestamp, - // lifecycle columns (2026-06-16 migration): - #[schema(value_type = String)] - pub mode: BackupRepoMode, - pub last_init_error: Option<String>, - #[diesel(deserialize_as = jiff_diesel::NullableTimestamp, serialize_as = jiff_diesel::NullableTimestamp)] - pub escrow_acked_at: Option<Timestamp>, - pub escrow_acked_by: Option<String>, -} -``` - -(Note: `expected_interval`/`retention` are **not** on this struct — they moved -to `server_group_backup_schedule` / `backup_type_defaults` per the addendum.) - -`backup_runs` row maps `id: Uuid` (client-supplied, no default). Its -`NewBackupRun` insertable **includes** `id` (unlike the BIGSERIAL tables which -omit it). `bytes_uploaded`/`snapshot_id`/`error` are `Option<_>`. - -### Model helper methods (DB-layer surface this component owns) - -Defined here; their callers live in other components' specs. Signatures -(`db: &mut AsyncPgConnection`, returning `commons_errors::Result`): - -- `ServerGroupBackupConfig::get(db, group_id) -> Result<Option<Self>>` — the - endpoint resolution path (absent → caller maps to 409). -- `ServerGroupBackupConfig::upsert(db, NewServerGroupBackupConfig) -> Result` - and `set_status(db, group_id, status) -> Result` — operator UI / repo-init flow. 
-- `ServerGroupBackupConfig::list_scheduled(db) -> Result>` — rows - with `expected_interval IS NOT NULL` (the staleness-scan set). -- `BackupCredentialIssuance::record(db, NewBackupCredentialIssuance) -> Result` - — called by public-server step 7. Snapshots bucket/prefix; takes resolved - `device_id`/`group_id`/`access_key_id`/`sts_request_id`. -- `BackupRun::record(db, NewBackupRun) -> Result` — called by - `POST /backup-report`. PK violation on duplicate `id` returns a clean - `Result::Err` (caller decides idempotency response). -- `BackupRun::latest_success_for_server(db, server_id) -> Result>` - and a bulk `latest_success_by_server(db, &[Uuid]) -> Result>` - filtered to `purpose='backup'`, `outcome='success'` — the staleness join. - (Server-centric: `backup_runs` rows carry `device_id`; the scan joins via the - server's associated devices, or filters by `group_id` then maps device→server - in the caller. Provide the query keyed the way the `jobs` scan needs — settle - with the jobs spec; the DB helper exposes both a per-group and per-device cut.) -- `BackupMaintenanceRun::start(db, group_id, kind) -> Result` and - `finish(db, id, outcome, error, bytes_reclaimed) -> Result<()>`. -- `BackupRepoSnapshot::upsert(db, group_id, source, server_id, latest_snapshot_at) -> Result<()>` - and `list_for_group(db, group_id) -> Result>`. -- `BackupRepoStats::upsert_repo_fields(db, group_id, snapshot_count, source_count, logical, physical) -> Result<()>` - and `upsert_bucket_bytes(db, group_id, bucket_bytes) -> Result<()>` — the two - separate writers; both `ON CONFLICT (group_id) DO UPDATE` touching only their - own columns. `get(db, group_id) -> Result>` for the stats panel. -- `BackupRequest::enqueue(db, server_id, purpose, requested_by) -> Result<()>` - (upsert), `clear(db, server_id, purpose) -> Result<()>`, - `pending_for_server(db, server_id) -> Result>`. 
- -The "present since" anchor for never-backed-up detection -(`max(MIN(device_server_associations.first_seen) over the server, -server_group_backup_config.created_at)`) is a **jobs-crate query** that joins -these tables; the DB crate exposes the building blocks (`list_scheduled`, -`latest_success_*`, and existing `device_server_associations` access). Note -`first_seen` is per `(device_id, server_id)` pair — use `MIN` over the server. - ---- - -## Interfaces / contracts - -### Provided (to other components) - -- **Tables + schema** — the seven `diesel::table!` blocks in `schema.rs` and - the typed models in `backups.rs`, re-exported from `lib.rs`. Public-server, - the jobs crate, and private-server all `use database::{...}`. -- **`ServerGroupBackupConfig` row** — the single source for repo-level - `bucket`, `prefix`, `region`, `target_role_arn`, `repo_password_ref`, - `status`, `mode`, and the lifecycle fields. (Schedule/retention are now - per-`(group, type)` on `server_group_backup_schedule` / `backup_type_defaults`.) - Consumed by: `GET /backup-target` & - `POST /backup-credentials` (public-server); maintenance/inspection schedulers - & preflight (jobs); onboarding/stats UI (private-server). -- **Audit-record helpers** — `BackupCredentialIssuance::record`, - `BackupRun::record`, `BackupMaintenanceRun::{start,finish}` — the write - surface for the issuance/report/maintenance flows. -- **Scan helpers** — `list_scheduled`, `latest_success_*`, - `BackupRepoSnapshot::list_for_group` — inputs to staleness/reconciliation - detection (jobs crate, signals 1 & 2). -- **`backup_requests` queue** — `enqueue`/`clear`/`pending_for_server` — the - operator one-off "backup now" home, read by the cadence-trigger path. 
-- **Wire/utoipa shapes** — every model derives `utoipa::ToSchema`, so the - private-server handlers can use them in `#[utoipa::path]` and the regenerated - `openapi.json` → `api-types.ts` exposes them to private-web (run - `just gen-openapi` in the component that adds the handlers, not here). - -### Consumed (from existing code) - -- `server_groups(id)`, `servers(id)`, `devices(id)` — FK targets. -- `crate::pg_duration::PgDuration` (INTERVAL ↔ `SignedDuration`). -- `jiff_diesel::{Timestamp, NullableTimestamp}` (timestamp mapping). -- `commons_errors::{AppError, Result}` (return type + `AppError::from`). -- `device_server_associations` (`first_seen`, per device-server pair) — read by - the staleness "present since" query, joined alongside these tables. -- `database::issues::NewEvent::save(conn, server_id, device_id)` — *not* called - by this crate, but the staleness/poisoning alerting (jobs) writes through it; - this crate must not duplicate alerting logic. (`source="canopy"`, - `ref="backup-staleness"` etc. live in the jobs component.) - -### Explicitly NOT in this component - -- No AWS SDK / STS / S3 calls (public-server + jobs). -- No kube client / Secret reads (public-server + jobs). -- No HTTP handlers, no scheduler loops, no alerting/`NewEvent` construction. -- No utoipa `#[path]` annotations or `openapi.json` regen (that's whichever - crate adds the handlers). - ---- - -## Data shapes - -- **`retention`** (JSONB, on `server_group_backup_schedule` / - `backup_type_defaults`): a kopia keep-policy object, e.g. - `{"keep_latest":1,"keep_daily":7,"keep_weekly":4,"keep_monthly":6,"keep_annual":0}`. - Stored as `JsonValue`; the typed `RetentionPolicy` helper sits over it. 
- **Floor enforcement** (`keep_daily≥7`, `keep_weekly≥4`, `keep_monthly≥6`) is - `RetentionPolicy::validate_floor()` — a DB-crate function (returns - `AppError::BadRequest`) called by the private-server write path, **not** a DB - constraint (the only DB CHECK on these columns is `jsonb_typeof='object'`). -- **`status`**: `provisioning` → `escrow_pending` → `ready`. Backups dormant - (412/409 from the endpoints) until `ready` — enforced by the *endpoint*, but - the column is the source of truth. -- **`purpose`**: `backup` | `restore`. -- **`outcome`**: `success` | `failure` (`backup_runs`); same plus NULL-while- - running for `backup_maintenance_runs`. -- **`kind`**: `quick` | `full`. -- **`source`** (`backup_repo_snapshots`): kopia source string - `canopy@:`; `server_id` parsed out by the caller. - ---- - -## Testing approach (per AGENTS.md) - -DB-only tests via `commons_tests::db::TestDb::run(|mut conn, _url| async move { ... })`, -`#[tokio::test(flavor = "multi_thread")]`, exercising **model functions -directly** (not HTTP). Put them in `crates/database/tests/` with no `_test` -suffix (e.g. `tests/backups.rs`), `use database::*;` for the models. Run with -`just test-package database` or `just test-name `. - -Cover: - -1. **Migration applies cleanly** — implicitly via every test (each spins a - fresh migrated DB) plus an explicit smoke test inserting one row per table. -2. **`server_group_backup_config`** — insert/upsert round-trip incl. NULL - `region`/`expected_interval`, JSONB `retention` round-trips, `status` - transitions, `updated_at` auto-touch fires on update, `jsonb_typeof` CHECK - rejects a non-object retention. -3. **FK behaviour (archival model)** — RESOLVED (impl): there is **no** cascade - anywhere; groups/servers are archived (`deleted_at`), never hard-deleted, so - a config-delete does not cascade and a hard `DELETE` on a `server_groups` row - with any backup rows simply fails the FK (and is never done in practice). 
The - original "cascade the config / block on audit rows" split no longer applies — - the rule is uniform plain `REFERENCES`. (No dedicated cascade test is needed; - the archival path is what's exercised.) -4. **`backup_runs` client-supplied PK** — insert with a chosen UUID succeeds; - re-inserting the same UUID returns an error (PK violation surfaced as - `Result::Err`, not a panic); `device_id`/`group_id` are taken from - parameters (a test that the helper signature doesn't read them from a body). -5. **Issuance audit** — `record` snapshots bucket/prefix; later changing the - config row does not mutate the issuance row. Indexes exercised by an - ordered `(device_id, issued_at DESC)` query returning newest-first. -6. **Scan helpers** — `list_scheduled` returns only non-NULL-interval rows; - `latest_success_*` filters to `purpose='backup'` + `outcome='success'` and - ignores a newer `restore` success (the staleness-reset bug guard). -7. **`backup_repo_stats` split writers** — `upsert_repo_fields` then - `upsert_bucket_bytes` accumulate without clobbering each other; either order. -8. **`backup_repo_snapshots` upsert** — second observation of the same - `(group_id, source)` updates `latest_snapshot_at`/`observed_at` in place. -9. **`backup_requests`** — enqueue is upsert on `(server_id, purpose)`; `clear` - removes; `pending_for_server` lists. - -No HTTP/e2e here (those belong to the public-server and private-server specs). -Per repo memory: per-package tests while coding (`just test-package database`), -no final full-suite run. - ---- - -## Open questions / decisions to make - -1. **`backup_repo_stats` / `backup_requests` cascade.** RESOLVED (impl): - **no cascade** — there is *no* cache/transient-vs-audit split. Every backup - FK is plain `REFERENCES`, because groups/servers are archived (`deleted_at`), - not hard-deleted, so cascade-vs-preserve never fires. -2. 
**`backup_repo_snapshots.server_id` FK on a server delete.** RESOLVED (impl): - plain `REFERENCES servers(id)`, no `SET NULL`/cascade — same archival rule. - The column is nullable only because a `source` server-id may not resolve at - observation time, not for delete semantics. -3. **Enum representation.** RESOLVED (impl): `commons-types` enums for **all** - the closed sets (`purpose`, `outcome`, `kind`, `mode`/`BackupRepoMode`, and - `status`/`BackupConfigStatus` — `status` did not stay a plain `String`), each - with a matching DB `CHECK`; the open `BackupType{Custom}` for the type name. -4. **Where `purpose`/`status` CHECK constraints live** — RESOLVED (impl): the - closed enums carry **both** a DB `CHECK (... IN ...)` *and* the typed - `commons-types` enum at the model layer. -5. **`backup_runs.id` collision response contract.** DB returns a PK-violation - error; the *endpoint* decides whether a duplicate report is a 409 or an - idempotent 204. That's a public-server decision, but the DB helper's - error-mapping (does `record` map unique-violation to a typed `AppError` - variant, or pass the raw diesel error?) should be settled here so the caller - can match on it. Lean: map to `AppError::Conflict` so the caller can branch. -6. **`retention` validation surface.** RESOLVED (impl): code-level via the typed - `RetentionPolicy::validate_floor()` (DB crate), called by the private-server - write path; not a DB floor-CHECK (the only DB CHECK is `jsonb_typeof`). -7. **Indexing for the staleness scan.** The provided indexes cover the - per-group/per-device "latest run" cuts. If the jobs scan ends up doing a - `DISTINCT ON (server)` over `backup_runs` joined through - `device_server_associations`, a covering index may be wanted — defer until - the jobs query is concrete, then add in a follow-up migration. - ---- - -## Backup types addendum (supersedes the relevant schema above) - -Added after this spec: backups are keyed `(server, type)`, not `(server)`. 
-See the plan's "Backup types" section. Concrete deltas: - -- **`server_group_backup_config`** drops `expected_interval` and - `retention` — it's now repo-level only (`bucket`, `prefix`, - `target_role_arn`, `region`, `repo_password_ref`, `status`). -- **New tables/models:** - - `server_backup_capabilities(server_id, type, enabled, registered_at)` - PK `(server_id, type)` — bestool-registered; `enabled` **seeded from** - `backup_type_defaults.auto_enable` at first registration, then - operator-toggleable per server. - - `backup_type_defaults(type PK, default_interval, default_retention - JSONB, auto_enable BOOL)` — canopy-wide per-type defaults. - - `server_group_backup_schedule(group_id, type, expected_interval, - retention)` PK `(group_id, type)` — schedule/retention overrides over - the type defaults; absent row → defaults. -- **`type TEXT` column added to** `backup_credential_issuances`, - `backup_runs` (+ a `server_id` column for per-server-type staleness), - `backup_repo_snapshots`, and `backup_requests` (PK now - `(server_id, type, purpose)`). `backup_maintenance_runs` and - `backup_repo_stats` stay per-group (repo-level). -- **New model surface:** capability upsert + per-server toggle; effective - schedule/retention resolution (`override ?? type-default`, with the org - retention floor enforced); "list active `(server, type)`" for the - scheduler/staleness. diff --git a/docs/plans/specs/canopy-jobs-detection-preflight.md b/docs/plans/specs/canopy-jobs-detection-preflight.md deleted file mode 100644 index bbc85b82..00000000 --- a/docs/plans/specs/canopy-jobs-detection-preflight.md +++ /dev/null @@ -1,602 +0,0 @@ -# Spec: canopy-jobs-detection — staleness, reconciliation, alerting & upstream preflight - -Component of the **backup-credentials** system. Authoritative design: -[`../backup-credentials.md`](../backup-credentials.md) (stage-2 stub: -[`../backup-credentials-blind-relay.md`](../backup-credentials-blind-relay.md)). 
- -This spec covers the **detection / alerting / preflight** half of the -Canopy control plane: the periodic jobs that decide whether each group's -backups are healthy and raise issues/events when they are not. It does -**not** cover credential issuance (`public-server` endpoints), the -maintenance/inspection/init Jobs themselves, the operator UI, or the IaC — -those are sibling components. It *consumes* the tables those components -write (`backup_runs`, `backup_maintenance_runs`, `backup_repo_snapshots`, -`server_group_backup_config`) and the AWS-client plumbing they introduce. - -## Purpose - -Three classes of periodic check, all running as loops in the `jobs` crate -and all alerting through the existing issues/events/incidents model -(`NewEvent::save`, `source="canopy"`): - -1. **Signal 1 — staleness scan** (DB-only, frequent): scan servers that are - *expected* to be backed up and alert when no recent successful backup is - on record. Server-centric. Also catches stuck maintenance. -2. **Signal reconciliation (1 / 2 / 3)**: cross-check what devices - *reported* (signal 1, `backup_runs`) against what *actually landed* - (signal 2, `backup_repo_snapshots`) and — later — what PGRO proved - *restorable* (signal 3, `backup_restore_checks`). Disagreement is itself - an alert; repo corruption (poisoning) is a group-level critical. -3. **Upstream preflight** (AWS-touching, hash-jittered): Canopy checking its - *own* access — `GetCallerIdentity` (shared, ~minute) plus per-group deep - checks (both purposes issue working creds + Object-Lock still in place), - hourly. - -The shared thread running through all three: a **group-level** failure -(can't mint creds, lock removed, repo corrupt, restore broken) must page -**regardless of any server's `is_monitored`**, whereas **per-server** -staleness obeys the existing `is_monitored` gate. 
The incident model is -server-keyed today, so the group-level path needs new plumbing (see -[Group-level alerting](#group-level-alerting-server-independent)). - -## Where it lives - -RESOLVED (impl) — the bin layout changed from what's sketched below: - -- **No separate `backup_staleness` bin.** Open-question 3 resolved → - **folded into the renamed `monitor` bin** (formerly `reachability`). The - `monitor` bin (`crates/jobs/src/bin/monitor.rs`) now runs the reachability - sweep, the backup staleness + reconcile sweep (`database::backup::sweep`), - **and** the tailnet key-expiry sweep, all on its one minute loop. The - staleness/reconcile *logic* lives in the `database` crate (`database::backup`, - see below). -- **Preflight is not its own bin either** — it's a module - (`crates/jobs/src/backup/preflight.rs`) run by the consolidated `backups` - bin (see canopy-jobs-maintenance-inspection.md §2/§4 for that single bin). - -The original sketch (separate `backup_staleness` + `backup_preflight` bins -following the `reachability`/`pingtask` template) is kept below for design -history: - -- ~~`crates/jobs/src/bin/backup_staleness.rs`~~ — signal-1 + reconciliation - scan (DB-only; no AWS). ~1–5 min cadence. **Folded into `monitor`.** -- ~~`crates/jobs/src/bin/backup_preflight.rs`~~ — upstream preflight (AWS SDK; - STS + S3). **Now a module under the `backups` bin.** - -The bulk of the logic lives in the **`database` crate** as model functions -(like `Status::sweep_reachability`), so it's testable with -`commons_tests::db::TestDb::run` without standing up a binary. As shipped, the -`database::backup` module (`crates/database/src/backup/`) holds: - -- `staleness.rs` — the scan + classify + file-events logic. -- `reconcile.rs` — signal 1↔2(↔3) reconciliation. -- `alerts.rs` — `raise_group_event`, the single group-scoped incident - entrypoint (bypasses per-server `is_monitored`). -- `refs.rs` — the `(source, ref)` constants. 
- -`database::backup::sweep` is the top-level entry the `monitor` bin calls each -tick (runs signal-1 classify + reconciliation). -- The preflight's AWS calls live in the **binary** (the `database` crate - must not gain an AWS dependency); the preflight's *alerting* reuses the - same `NewEvent` helpers. The binary reads config rows via a `database` - model function and calls the AWS SDK directly. - -Per the workspace memory: `database` is the only crate allowed diesel; the -preflight binary depends on the new AWS-SDK plumbing the issuance component -adds (`aws-config` + `aws-sdk-sts` + `aws-sdk-s3`). The `jobs` crate gains -those deps for `backup_preflight` only. - -## Refs and sources (issues/events keys) - -All events use `source = "canopy"` (the existing `CANOPY_SOURCE` constant -in `statuses.rs`; promote it somewhere shared if both crates need it, or -re-declare a `const BACKUP_*` set in the backup module). Refs (new -constants — keep them all in one place, e.g. `database::backup::refs`): - -| ref | level | severity (active) | severity (recovery) | gate | -|-----|-------|-------------------|---------------------|------| -| `backup-staleness` | server | `Error` | `Info` (`active:false`) | `is_monitored` | -| `backup-never` | server | `Error` | n/a (clears when first success lands) | `is_monitored` | -| `backup-maintenance-stale` | group | `Error` | `Info` | none (group-level) | -| `backup-reconcile-missing` | group | `Error` | `Info` | none (group-level) | -| `backup-reconcile-report-gap` | server | `Warning` | `Info` | `is_monitored` | -| `backup-corruption` | group | `Critical` | `Info` | none (group-level) | -| `preflight-identity` | fleet/group | `Critical` | `Info` | none (group-level) | -| `preflight-assume` | group | `Error` | `Info` | none (group-level) | -| `preflight-object-lock` | group | `Critical` | `Info` | none (group-level) | -| `restore-verification` (signal 3, later) | group | `Error` | `Info` | none (group-level) | - -Notes: -- 
Staleness/never/report-gap are **per-server** → ordinary - `NewEvent::save(conn, server_id, Some(device_id))`. They inherit the - `is_monitored` incident gate by design (see plan: some prods are - intentionally intermittently-alive; per-server backup noise on them is - unwanted). They are still *recorded* (visible on the server page) even - when unmonitored — `NewEvent::save` records the issue/event unconditionally - and only skips the incident contribution. -- Everything marked **group-level** must page even on unmonitored servers, - so it must **not** go through a per-server `NewEvent::save` (which would - re-inherit the gate). See [Group-level alerting](#group-level-alerting-server-independent). -- `Error`+ is required for `opens_incident()` (`OPENS_INCIDENT = [Critical, - Error]`, `commons-types/src/issue.rs`). `Warning`/`Info` only join an - already-open incident for context; they never open one. So the - report-gap notice (`Warning`) is deliberately non-paging on its own. - -## Signal 1 — staleness scan - -Server-centric. The subject is the **server** being protected; the device -is the actor recorded in `backup_runs`/snapshot tags. - -### Scanned set - -Servers in a group whose `server_group_backup_config` has: -- `status = 'ready'` (dormant configs — `provisioning`/`escrow_pending` — - are not yet expected to back up), **and** -- a non-NULL `expected_interval` (manual-only groups have no schedule, so - no staleness alerting — they're simply not in the set). - -A manual-only or unconfigured group is therefore never scanned, so -unauthorized/un-set-up devices never alert. Implement as a single query -joining `servers` → `server_group_backup_config` (on `servers.group_id`) -filtered as above, returning `(server_id, group_id, expected_interval, -config.created_at)`. 
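
The scanned-set rule above can be read as a single predicate over a group's config row. The sketch below is in-memory only, assuming a hypothetical `GroupConfig` view with just the two relevant fields; the shipped version is a SQL join, not this function.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackupConfigStatus {
    Provisioning,
    EscrowPending,
    Ready,
}

/// Hypothetical view of the two config columns the scan filters on.
pub struct GroupConfig {
    pub status: BackupConfigStatus,
    pub expected_interval: Option<Duration>,
}

/// A server is scanned only if its group has a config row that is `ready`
/// and carries a schedule; unconfigured, dormant, and manual-only groups
/// are all excluded.
pub fn is_scanned(cfg: Option<&GroupConfig>) -> bool {
    matches!(
        cfg,
        Some(c) if c.status == BackupConfigStatus::Ready && c.expected_interval.is_some()
    )
}
```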
- -### Per-server classification - -For each scanned server, find its most recent `backup_runs` row with -`purpose = 'backup' AND outcome = 'success'` (the `(device_id, …)` / -`(group_id, reported_at DESC)` indexes support this; a server-centric query -joins runs to the server via `group_id` **and** the server identity — see -the source-mapping note below). Let `grace = expected_interval * 2`. - -- **Stale** — a prior successful backup exists but none newer than - `now - grace` → file `backup-staleness` at `Error`, `active:true`. -- **Never backed up** — *no* successful `purpose='backup'` row ever, **and** - the server has been expected long enough: `now - anchor > grace`, where - - ``` - anchor = max( MIN(first_seen over this server's device_server_associations rows), - server_group_backup_config.created_at ) - ``` - - → file `backup-never` at `Error`, `active:true`. Below the grace from the - anchor: no alert yet (freshly-present server or freshly-authorized group - must not false-alarm). -- **Recovered** — a previously-stale server reporting success again: file - `backup-staleness` `active:false` at `Info` (the issue leaves the - incident and auto-closes). Mirror the reachability sweep's - `(false, Some(issue)) if !issue.active => continue` short-circuit so we - don't re-file an already-closed recovery every tick. - -**Anchor details (do not get these wrong — they're explicit decisions):** -- `first_seen` in `device_server_associations` is per `(device_id, - server_id)` **pair**, *not* a per-server scalar. Use - `MIN(first_seen)` over **all** of that server's association rows - (earliest any device saw it). Schema: - `device_server_associations (device_id, server_id, first_seen, last_seen)`. -- `created_at` is `server_group_backup_config.created_at` (group-authorized - time). A server present long ago but whose group was authorized 5 minutes - ago must use the *later* of the two, so a just-authorized group doesn't - instantly fire `backup-never` on every member. 
-- Filter runs on `purpose='backup'` **specifically** — a recent successful - *restore* must **not** reset backup staleness. - -### Mapping a `backup_run` to a server - -`backup_runs` carries `device_id` + `group_id` but **not** `server_id` -directly. The protected server is identified via the kopia source -(`canopy@:`, recorded in `backup_repo_snapshots.server_id`) -and via the device→server association at report time. For signal 1, resolve -the server from the run's `device_id` via `Server::live_by_device_id` (the -`servers_device_id_unique` partial unique index guarantees at most one live -server per device). Scan-side, it's cleaner to drive **from the server**: -for each scanned server, find runs whose `device_id` is one of that -server's associated devices and whose `group_id` matches. Encode this as a -single classify query rather than per-server round-trips. - -### Maintenance staleness - -`backup_maintenance_runs` (group-level) feeds the same scan: a group whose -last `outcome='success'` maintenance run (any `kind`) is older than a -maintenance-cadence threshold (full-weekly default → e.g. `8 days`; make it -a constant, not `expected_interval`-derived, since maintenance cadence is -independent of backup cadence) → file `backup-maintenance-stale` at `Error` -via the **group-level** path. Recovery: a fresh successful maintenance run -clears it. - -## Signal reconciliation (1 / 2 / 3) - -Runs in the same `backup_staleness` loop (after signal-1 classify), reading -`backup_repo_snapshots` (signal-2 ground truth) against `backup_runs` -(signal-1 reports). Per scanned server (resolved to a kopia `source` = -`canopy@:`, so the join key is `server_id`): - -- **report says success but no recent snapshot** (a `backup_runs` success - newer than `grace`, but `backup_repo_snapshots.latest_snapshot_at` for - that source is older than `grace`, or no snapshot row at all) → the report - is wrong or the upload didn't persist. 
**`backup-reconcile-missing`**, - `Error`, **group-level** (a device lying about success / data not landing - endangers the group's actual recoverability, so it pages regardless of - monitored). This is the case signal 1 alone cannot catch. -- **recent snapshot but no report** (`latest_snapshot_at` fresh, but no - recent `backup_runs` success) → backups are fine, the *reporting path* is - broken. **`backup-reconcile-report-gap`**, `Warning`, **per-server** - (low-severity, non-paging — it's a telemetry gap, not a backup failure). -- **neither** → genuinely stale; already covered by signal 1, emit nothing - extra here (avoid double-filing on the same `(server)`). - -Signal 2 is only as fresh as the inspection Job's last run; if -`backup_repo_snapshots.observed_at` for a group is itself stale (older than -the inspection floor), reconciliation can't conclude "missing" reliably — -**skip the `reconcile-missing` verdict when signal-2 data is stale** and -instead rely on the inspection Job's own failure to surface (it writes -`backup_repo_snapshots`/stats; a Job that stops running is caught by the -preflight/maintenance-staleness machinery, not here). Record this as a -guard so a lagging inspector doesn't produce false "report lied" alerts. - -**Poisoning / corruption** is reported by the inspection Job (signal 2), -not computed here: when inspection detects content-blob hash mismatch / -unreadable index, it raises **`backup-corruption`** at `Critical`, -group-level. This spec owns the *alerting shape* (the constant, severity, -group-level routing, recovery-runbook pointer in the message body); the -*detection* (running `kopia` verify) is the inspection-Job component. To -avoid two components both knowing how to raise a group-level event, expose -a single helper (below) that the inspection Job calls. 
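
The reconciliation cases above, including the stale-inspector guard, can be sketched as a pure decision function. The `Verdict` enum, the parameter names, and the choice to pass the group's most recent `observed_at` as a single value are assumptions for illustration; this is not the shipped `reconcile.rs`.

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Report and snapshot are both fresh: nothing to file.
    Consistent,
    /// Report says success but nothing landed: `backup-reconcile-missing`.
    Missing,
    /// Snapshot landed but no report: `backup-reconcile-report-gap`.
    ReportGap,
    /// Neither is fresh: left to signal 1, nothing extra filed here.
    Stale,
    /// Signal-2 data is itself too old to conclude "missing".
    Inconclusive,
}

pub fn reconcile(
    now: SystemTime,
    grace: Duration,
    inspection_floor: Duration,
    last_reported_success: Option<SystemTime>,
    latest_snapshot_at: Option<SystemTime>,
    observed_at: Option<SystemTime>,
) -> Verdict {
    // A timestamp is fresh if it is no older than `window`; a timestamp in
    // the future (clock skew) is treated as fresh rather than as an error.
    let fresh = |t: Option<SystemTime>, window: Duration| match t {
        None => false,
        Some(t) => match now.duration_since(t) {
            Ok(age) => age <= window,
            Err(_) => true,
        },
    };

    let report_fresh = fresh(last_reported_success, grace);
    let snapshot_fresh = fresh(latest_snapshot_at, grace);
    let inspection_fresh = fresh(observed_at, inspection_floor);

    match (report_fresh, snapshot_fresh) {
        (true, true) => Verdict::Consistent,
        (true, false) if inspection_fresh => Verdict::Missing,
        (true, false) => Verdict::Inconclusive,
        (false, true) => Verdict::ReportGap,
        (false, false) => Verdict::Stale,
    }
}
```

Keeping the decision free of database access means each case in the list can be exercised with plain timestamps.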
- -**Signal 3 (restore-verification, later/additive):** PGRO reports -per-replica restore outcomes into a future `backup_restore_checks` table; a -failed/stale restorability check is **`restore-verification`** at `Error`, -group-level. Same group-level helper. Stubbed here so the routing is -designed-for, not bolted on; the table + ingest endpoint are out of scope -for this component's first cut. - -## Group-level alerting (server-independent) - -**The core mechanism wrinkle.** The incident model -(`crates/database/src/issues.rs`) is **server-keyed**: `Issue.server_id` is -`NOT NULL`, `NewEvent::save(conn, server_id, device_id)` requires a server, -and `re_evaluate_incident_membership` gates incident contribution on that -server's `is_monitored`. Incidents themselves are **group-keyed** -(`incidents.server_group_id`). There is no "group-level issue with no -server" path today. Group-level backup checks must page regardless of -`is_monitored`, so routing them through a per-server `NewEvent::save` is -wrong (it would inherit the monitored gate and could be silenced by an -unmonitored member). - -**Decision required — pick one (flagged in the plan as -implementation-time):** - -- **Option A — representative monitored server.** Pick a deterministic - server in the group (e.g. the highest-rank live member, reusing - `ServerGroup::highest_member_ranks` ordering) and file against it, but - **bypass the monitored gate** for these refs. This needs a new code path - because `re_evaluate_incident_membership` hard-gates on `monitored`; - passing `monitored=true` unconditionally for group-level refs is the - smallest change but is a lie in the data. Fragile if the group has no - live members. -- **Option B — group sentinel issue (recommended).** Add first-class - support for a group-scoped issue with no member server. 
Concretely: make - `issues.server_id` nullable **or** add an `issues.server_group_id` - nullable column, and teach `re_evaluate_incident_membership` / - `find_or_open_incident` to accept a group directly (the incident is - already group-keyed, so `find_or_open_incident(conn, group_id, …)` works - as-is — the only gap is producing an `Issue` that points at a group, not - a server, and skipping the `is_monitored` lookup for it). This is the - clean model and matches "group/control-plane concern, not any one - server's." It's a migration + a branch in the membership evaluator. - -This spec **recommends Option B** and treats it as the deliverable's -central new piece of shared plumbing. Provide one helper that both this -component and the inspection Job call: - -```rust -// database::backup::alerts (new) -pub async fn raise_group_event( - conn: &mut AsyncPgConnection, - group_id: Uuid, - r#ref: &str, - severity: Severity, - description: Option<&str>, - message: &str, - active: bool, -) -> Result<()>; -``` - -Internally it find-or-creates a **group-scoped** issue keyed by -`(server_group_id, source="canopy", ref)`, appends/coalesces an event (reuse -`hash_event`), and runs the group-aware membership evaluation that ignores -`is_monitored`. Recovery is the same `(source, ref)` with `active:false` at -a lower severity, which lets the issue leave the incident and auto-close — -identical lifecycle to the per-server path. **Do not** add an -`Incident::open_for`; there is no such function — reuse -`find_or_open_incident` → `enqueue_slack_open` → `SlackOutbox::enqueue`, -which the existing evaluator already drives. - -The migration for Option B must be a separate `just migration` step -(`just migration backup_group_scoped_issues`); never hand-create migration -dirs. 
If `issues.server_id` becomes nullable, audit every existing query in -`issues.rs` that assumes it non-null (the model is large — `list_for_server`, -`list`, `reconcile_open_incidents`, `re_evaluate_incident_membership`'s -`Server::get_by_id`). `reconcile_open_incidents` (run on reachability -startup) must handle group-scoped issues (no server → resolve group -directly, skip the `is_monitored` short-circuit). This is the -refactor-thoroughly cost of Option B and must not be half-done. - -## Upstream preflight - -Watches **Canopy's own upstream access**, not the devices. Lives in -`backup_preflight.rs`. Alert, **never gate readiness** (a failing check must -not pull the pod out of rotation — that makes it worse). - -### Shared check (every ~minute, on the loop tick) - -- **`sts:GetCallerIdentity`** — confirms the pod's IRSA web-identity is - mounted and valid. Cheap; rides the minute loop. On failure → raise - **`preflight-identity`** at `Critical`, group-level (route it against a - fleet sentinel — Option B's group-scoped issue keyed to a "control-plane" - pseudo-group, **or** fan out one `preflight-identity` per configured group - since "every group's per-group check fails" is the same signal). The plan - says: a check failing for *every* group points at the shared IRSA - identity rather than any one bucket — so emitting per-group and letting - the operator see the fan-out is acceptable, but a single fleet-level alert - is cleaner. **Decision required** — see open questions. - -### Per-group deep checks (hourly, hash-jittered) - -For each `status='ready'` group with a config row, on its jittered slot -(`hash(group_id) mod window`, stable per group — same scheme as maintenance -and inspection; factor the jitter helper so all three share it): - -1. 
**Both purposes issue working creds.** Cross-account `sts:AssumeRole` on - the group's `target_role_arn`: - - **backup path**: plain assume (no session policy), then a **read-only - no-op** S3 call against the bucket (e.g. `HeadBucket` or - `GetBucketLocation` — a harmless op the backup role policy allows). - - **restore path**: assume **with the read-only restore session policy** - (the normative JSON from the plan — `GetObject` + unconditioned - `GetBucketLocation` + conditioned `ListBucket`), then the same - read-only no-op. This proves the restore session policy actually works - and catches the `GetBucketLocation`-folded-under-`s3:prefix` class of - bug **proactively**, while plain backup issuance still looks fine. - - Any failure (assume or no-op) → **`preflight-assume`**, `Error`, - group-level. Message should distinguish which purpose/leg failed. -2. **Object Lock still in place.** `s3:GetBucketObjectLockConfiguration` on - the group's bucket; assert it returns an enabled lock with `mode` - present and `days >= 30` (GOVERNANCE, the `backups` stack's `mode: - 'GOVERNANCE', days: 30`). Missing/weakened lock → **`preflight-object-lock`**, - `Critical`, group-level (the whole "can't destroy backups" guarantee - rests on it, and there's no other symptom). This action is **not** in - `AWS_S3_MULTIPART_ACTIONS`; the issuance/IaC component must add it to the - per-bucket role Canopy assumes, or the check itself 403s on day one — - note that dependency, don't silently absorb it. - -Prefer **behavioural** checks (assume + harmless S3 op) over IAM/policy -*introspection*: behavioural checks test the real path and need no extra -`iam:Get*`. The Object-Lock read is the one allowed exception. 
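
The Object Lock assertion in step 2 separates cleanly from the AWS call that fetches the configuration. Below is a minimal sketch of only the assertion, assuming a hypothetical `ObjectLockView` struct that the preflight would fill from the S3 response; no SDK types or calls are modelled here.

```rust
/// Hypothetical, SDK-independent view of a bucket's Object Lock settings.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ObjectLockView {
    pub enabled: bool,
    pub mode: Option<String>,
    pub days: Option<u32>,
}

/// Minimum default retention the preflight accepts, per the spec text.
pub const REQUIRED_RETENTION_DAYS: u32 = 30;

/// Returns `None` when the lock is acceptable, or a human-readable reason
/// suitable for the `preflight-object-lock` event message.
pub fn object_lock_problem(lock: Option<&ObjectLockView>) -> Option<String> {
    let lock = match lock {
        Some(lock) => lock,
        None => return Some("bucket has no Object Lock configuration".to_string()),
    };
    if !lock.enabled {
        return Some("Object Lock is not enabled".to_string());
    }
    if lock.mode.is_none() {
        return Some("Object Lock has no default retention mode".to_string());
    }
    match lock.days {
        Some(days) if days >= REQUIRED_RETENTION_DAYS => None,
        Some(days) => Some(format!(
            "retention is {} days, below the required {}",
            days, REQUIRED_RETENTION_DAYS
        )),
        None => Some("Object Lock has no retention period in days".to_string()),
    }
}
```

Returning the reason as a string lets the caller distinguish a missing lock from a weakened one in the alert body.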
- -The **maintenance path** needs no separate preflight: the read-only -inspection Job already connects each group's repo on its cadence (proving -reachability + password), and maintenance-specific failures surface via -`backup_maintenance_runs` → `backup-maintenance-stale` (signal 1 above). - -### Reactive rate-tracking (light) - -The live paths are signals too: `/backup-credentials` 502s on STS failure -and maintenance failures land in `backup_maintenance_runs`. Out of scope to -build a metrics pipeline here, but note that a spike between hourly -preflights should ideally surface — for the first cut, the hourly preflight -+ maintenance staleness cover it; richer rate-tracking is deferred. - -## Loop / scheduling shape - -Mirror `reachability`/`pingtask`: - -```rust -pub fn spawn() -> JoinHandle<()> { - let pool = database::init(); - task::spawn(async move { - loop { - sleep(Duration::from_secs(60)).await; - let Ok(mut db) = pool.get().await else { error!(…); continue; }; - // signal-1 + reconcile scan (DB only) - } - }) -} -``` - -- **staleness + reconcile**: DB-only, 60 s tick. RESOLVED (impl): **folded - into the `monitor` bin** (formerly `reachability`), which already does the - minute-cadence DB sweep + the startup `reconcile_open_incidents`. No separate - Deployment; `database::backup::sweep` runs each tick after the reachability - sweep. (The earlier recommendation of a separate binary was reversed — see - open question 3.) -- **preflight**: AWS-touching. 60 s tick for `GetCallerIdentity`; per-group - deep checks fire only when the tick lands in the group's jittered hourly slot. - RESOLVED (impl): runs as a **module under the consolidated `backups` bin** - (alongside maintenance/inspection/s3-metrics via `tokio::try_join!`), not its - own Deployment — see canopy-jobs-maintenance-inspection.md. 
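
The "tick lands in the group's jittered hourly slot" gate can be sketched as follows. This is a minimal std-only illustration, not the shipped helper: the group id is taken as raw 16 bytes rather than a `Uuid`, and FNV-1a is an assumed stand-in for whatever stable hash the real code uses. The only property taken from the spec is that the slot is `hash(group_id) mod window` and stays stable across restarts.

```rust
use std::time::Duration;

/// FNV-1a 64-bit: deterministic across processes, unlike a randomly-seeded
/// `HashMap` hasher. (Assumed choice of hash, for illustration only.)
fn stable_hash(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// The group's fixed offset, in seconds, within the window.
pub fn jitter_slot(group_id: &[u8; 16], window: Duration) -> u64 {
    let secs = window.as_secs().max(1);
    stable_hash(group_id) % secs
}

/// True only on the tick whose half-open range `[start, start + tick)`
/// contains the group's offset.
pub fn slot_is_due(
    group_id: &[u8; 16],
    window: Duration,
    tick: Duration,
    secs_into_window: u64,
) -> bool {
    let tick_secs = tick.as_secs().max(1);
    let offset = jitter_slot(group_id, window);
    // Round the current position down to the start of its tick.
    let start = secs_into_window - secs_into_window % tick_secs;
    offset >= start && offset < start + tick_secs
}
```

With a 3600 s window and a 60 s tick, each group is due on exactly one of the 60 ticks per hour, and on the same one after every restart.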
- -Hash-jitter helper (shared with maintenance/inspection): RESOLVED (impl) it -lives in `commons_servers::backup_jobs` (`jitter_slot(group_id, window)` + -`slot_is_due(...)`), so all schedulers agree. - -## Interfaces / contracts - -### Consumes (written by sibling components) - -- **`server_group_backup_config`** — `group_id`, `expected_interval` - (NULL / set states), `created_at`, `status` (`provisioning` / - `escrow_pending` / `ready`), `bucket`, `target_role_arn`, `region`. Read - via a new `database` model fn, e.g. - `BackupConfig::scannable(conn) -> Vec` and - `BackupConfig::ready_groups(conn) -> Vec<…>`. -- **`backup_runs`** — `device_id`, `group_id`, `purpose`, `outcome`, - `reported_at`. (Written by `POST /backup-report`, issuance component.) -- **`backup_maintenance_runs`** — `group_id`, `kind`, `outcome`, - `started_at`/`finished_at`. (Written by maintenance Jobs.) -- **`backup_repo_snapshots`** — `group_id`, `source`, `server_id`, - `latest_snapshot_at`, `observed_at`. (Written by inspection Job.) -- **`device_server_associations`** — `(device_id, server_id, first_seen, - last_seen)`, for the `MIN(first_seen)` anchor. -- **`servers`** / `Server::live_by_device_id`, `is_monitored`, `group_id`. -- **AWS SDK plumbing** (`aws-config`, `aws-sdk-sts`, `aws-sdk-s3`), the - ServiceAccount + IRSA role, and `s3:GetBucketObjectLockConfiguration` on - the per-bucket roles — all introduced by the issuance/IaC components. - -### Provides (to other components / operators) - -- **`database::backup::alerts::raise_group_event(conn, group_id, ref, - severity, …)`** — the single group-level alerting entrypoint. The - **inspection Job** calls it for `backup-corruption`; **PGRO ingest** - (later) calls it for `restore-verification`. Owning this here means there - is exactly one place that knows how to open a group-level incident - without the `is_monitored` gate. 
-- **Stable `(source, ref)` keys** (the table above) — operators silence / - snooze by these via the existing `silenced_refs` mechanism; the UI / Slack - reference them. Documenting them is part of the contract. -- **Group-scoped issue support** (Option B migration) — a reusable - capability beyond backups (any future control-plane-level check can raise - a group issue). - -## Data shapes - -No new tables are owned by *this* component except the Option-B schema -change to `issues` (nullable `server_id` or new nullable `server_group_id`) -and — for signal 3, later — `backup_restore_checks` (out of scope for the -first cut, noted for design-for). Everything else is reads. - -A small internal struct for the scan, e.g.: - -```rust -struct ScanRow { - server_id: Uuid, - group_id: Uuid, - device_id: Option<Uuid>, // latest-associated device, for NewEvent - expected_interval: SignedDuration, - config_created_at: Timestamp, - min_first_seen: Option<Timestamp>, - last_success_at: Option<Timestamp>, // purpose='backup', outcome='success' - latest_snapshot_at: Option<Timestamp>, // from backup_repo_snapshots (reconcile) - snapshot_observed_at: Option<Timestamp>, // signal-2 freshness guard -} -``` - -## Testing approach (per AGENTS.md) - -UPDATE (shipped): DB-level detection tests exist in -`crates/database/tests/backup_detection.rs` — covering classify boundaries -(`classify_boundaries`, `classify_restore_only_history_is_never`), -staleness/never/reconcile sweeps with the `is_monitored` gate -(`sweep_files_staleness_*`, `sweep_files_never_*`, -`unmonitored_staleness_records_issue_but_no_incident_link`, -`reconcile_files_report_gap_*`, `reconcile_files_missing_*`, -`reconcile_clears_report_gap_*`), and the headline -group-level-alert-pages-even-when-all-members-unmonitored case -(`group_event_pages_even_when_all_members_unmonitored`). - -- **Database-level tests** (`commons_tests::db::TestDb::run`) are the - primary coverage, since the scan/classify/reconcile logic lives in the - `database` crate as model fns. 
Use direct model functions, not HTTP. - Always `use database::ModelName;`. Seed `server_group_backup_config`, - `servers`, `device_server_associations`, `backup_runs`, - `backup_maintenance_runs`, `backup_repo_snapshots` directly, then assert - on the issues/events rows produced. -- Cases to cover (success **and** the boundary/negative cases): - - stale (success older than `×2`) fires `backup-staleness` `Error`; - - just-under-`×2` does **not** fire; - - never-backed-up past anchor fires `backup-never`; just-authorized group - (recent `config.created_at`) does **not**, even with an old - `first_seen`; freshly-present server (recent `MIN(first_seen)`) does - **not**, even with an old `config.created_at` — assert the `max(...)` - anchor explicitly with both orderings; - - a recent successful **restore** does **not** clear backup staleness - (purpose filter); - - recovery: stale → success files `active:false`, and re-running the scan - does not re-file (idempotence); - - manual-only (`expected_interval` NULL) and non-`ready` configs are - **not** scanned; - - maintenance staleness fires/clears on `backup_maintenance_runs`; - - reconcile: report-success-but-no-snapshot → `backup-reconcile-missing` - (group-level, **pages even when the server is unmonitored** — assert the - incident opens); snapshot-but-no-report → `report-gap` `Warning` - (does not open an incident on its own); - - reconcile **skips** the missing verdict when `snapshot_observed_at` is - stale; - - **group-level vs per-server gating**: a `backup-staleness` on an - unmonitored server records the issue but opens **no** incident; a - `backup-corruption` / `preflight-object-lock` on a group whose servers - are all unmonitored **does** open an incident (this is the headline - behaviour and must be tested directly against `incidents` rows). 
-- **Reconciliation/incident interplay**: reuse the patterns in the existing - issues/events tests — assert `incidents` / `incident_issues` rows and the - `slack_outbox` enqueue (`KIND_INCIDENT_OPEN`) for the paging cases, and - that recovery enqueues the resolve. -- **Preflight** AWS calls can't hit real STS/S3 in tests; structure the - binary so the AWS-touching functions take a trait/client object that can - be faked, and unit-test the **decision logic** (lock-config → - pass/fail, assume-result → which ref/severity) separately from the SDK - wiring. The alerting side (given a verdict, the right group event is - raised) is DB-testable via `raise_group_event`. -- Use `#[tokio::test(flavor = "multi_thread")]`. Tests run on the ramdisk - Postgres via `just test` / `just test-package`. There's no rendered UI in - this component, so no Playwright here (the operator stats/onboarding UI is - a sibling component and owns its own e2e). - -## Open questions / decisions to make - -1. **Group-level routing (Option A vs B).** Recommend **B** (group-scoped - issue: nullable `issues.server_id` or new `server_group_id`). It's the - clean model and is reused by inspection (corruption) and PGRO (signal 3), - but it's a migration + a thorough sweep of `issues.rs`. Confirm before - building — this is the largest single decision and the rest of the - group-level alerting depends on it. -2. **`preflight-identity` fan-out.** One fleet-level alert (needs a - control-plane sentinel target) vs one-per-group (reuses per-group - routing, operator sees the fan-out and infers "shared identity"). Lean - fleet-level if Option B gives us a non-group sentinel cheaply; otherwise - per-group. -3. **Separate `backup_staleness` binary vs folding into `reachability`.** - RESOLVED (impl): **folded** into the bin formerly called `reachability`, - now renamed **`monitor`** (it runs reachability + backup - staleness/reconcile + tailnet key-expiry on one minute loop). 
No separate - `backup_staleness` Deployment. -4. **Maintenance-staleness threshold.** Independent of `expected_interval` - (maintenance cadence is full-weekly). Proposed constant ~`8 days`; - confirm and make it a named constant, not magic. -5. **Reconcile severities.** `reconcile-missing` = `Error` group-level - (pages); `report-gap` = `Warning` per-server (non-paging). Confirm the - report-gap shouldn't be group-level — argument for per-server: a broken - *reporting* path is a single device's telemetry problem, not a - recoverability risk. -6. **Signal-2 freshness floor for the reconcile guard.** What - `observed_at` age makes signal-2 "too stale to conclude missing"? Tie to - the inspection cadence floor (weekly for manual-only). Needs the - inspection component's cadence to be pinned first. -7. **Anchor when a server has zero `device_server_associations` rows.** - `MIN(first_seen)` is NULL → fall back to `config.created_at` alone (the - `max` degenerates). Confirm that's the intended behaviour (a config'd - group with a server that no device has ever reported for: it's `never` - once `config.created_at` + grace elapses). -8. **`CANOPY_SOURCE` sharing.** It currently lives in `statuses.rs`. Promote - to a shared location, or re-declare in the backup module? Minor, but pick - one to avoid drift. -9. **Signal 3 (`backup_restore_checks` + PGRO ingest)** is explicitly - later/additive — confirm it stays out of this component's first cut - (only the group-level routing is built now, ready for it). - ---- - -## Backup types addendum - -Per the plan's "Backup types": staleness is per-`(server, type)`. - -- The staleness scan iterates **enabled `(server, type)` capabilities whose - effective schedule is non-NULL**, comparing each to its most recent - `backup_runs` row **for that type** (`type = ?`, `purpose='backup'`, - `outcome='success'`). The `×2` grace and `max(MIN(first_seen), - schedule-created)` anchor are unchanged, just per-type. 
-- Disabled / manual-only / unconfigured `(server, type)` are out of the - scanned set. -- Group-level alerting (corruption, preflight) is unchanged — it's - per-group, not per-type. diff --git a/docs/plans/specs/canopy-jobs-maintenance-inspection.md b/docs/plans/specs/canopy-jobs-maintenance-inspection.md deleted file mode 100644 index 7bc140fa..00000000 --- a/docs/plans/specs/canopy-jobs-maintenance-inspection.md +++ /dev/null @@ -1,734 +0,0 @@ -# Spec: canopy-jobs-maintenance-inspection - -**Component:** `canopy-jobs-maintenance` (repo: `canopy`) -**Authoritative design:** [`../backup-credentials.md`](../backup-credentials.md) (and the blind-relay stub -[`../backup-credentials-blind-relay.md`](../backup-credentials-blind-relay.md)). -This spec implements the Canopy-owned **maintenance**, **read-only inspection**, **S3-metrics**, and -**repo-creation init** paths — the scheduler loops in the `jobs` crate that drive kopia **in-process**. - -UPDATE (shipped) — the architecture changed fundamentally from the original -"spawn one-shot k8s Jobs that report back" design to **a single long-lived -`backups` Deployment that runs kopia as an in-process subprocess** for each due -group. There are **no Kubernetes Jobs** anywhere: the loops parse kopia's -`--json` output and write results **inline** to the DB. Throughout this spec, -where older text describes Job manifests, a `JobSpawner`/`jobspec`, Job reaping, -a `/job-report` endpoint, a pod termination-message, or a separate `kopia-job` -image/binary, it is **superseded** — see the inline "UPDATE (shipped)" notes and -§5/§8. The reasons: every Job already shared the one `canopy-jobs` IRSA -identity, so collapsing into one process loses no isolation; and a long-lived -process can hold a **refreshing** per-group credential, fixing the 1-hour cap -that one-shot static creds hit. 
- -This is (jobs-side) the first IRSA usage anywhere in canopy, and the Kubernetes -API client is now used only for **Secret reads** (the repo passphrase), not Job -create/watch. "Like reachability" describes only the `spawn()` + -`loop { sleep(60); pool.get; … }` shape. - ---- - -## 1. Purpose - -Canopy owns kopia repository lifecycle for every backup-configured server-group: repo creation, retention -enforcement, snapshot expiry, blob GC/compaction, ground-truth inventory, poisoning detection, and the bucket -billing-size readout. Devices never run these (they have no `DeleteObject`); the control plane does, off the -client servers. UPDATE (shipped): it runs them **in-process** in the long-lived -`backups` Deployment (kopia is a bundled subprocess), **not** as one-shot -Kubernetes Jobs. - -RESOLVED (impl) — **the four scheduler loops ship as ONE bin**, -`crates/jobs/src/bin/backups.rs`, which runs four modules -(`crates/jobs/src/backup/{maintenance,inspection,preflight,s3_metrics}.rs`) -concurrently via `tokio::try_join!`. (Preflight, originally specced in the -sibling detection-preflight doc, is one of those modules.) The text below -still describes the work as four loops for clarity of each loop's job; read -"four scheduler loops" as "four modules in the one `backups` bin" throughout, -and §2/§4 are updated accordingly. - -This component delivers (as four modules of the single `backups` bin). Each due -group's kopia work runs as an **in-process subprocess** of the same pod (UPDATE -(shipped) — no per-group k8s Jobs): - -1. **Maintenance scheduler** — per-group cycle `assert-retention → kopia snapshot expire → kopia maintenance - run`; quick-daily / full-weekly; hash-jittered per group; writes `backup_maintenance_runs` inline. -2. **Inspection scheduler** — read-only `kopia snapshot list` + repo stats + repo verify (poisoning detection); - writes `backup_repo_snapshots` and the repo-derived fields of `backup_repo_stats` inline. -3. 
**S3-metrics task** — CloudWatch `BucketSizeBytes` → `backup_repo_stats.bucket_bytes` (best-effort, separate - permissions, separate cadence). -4. **Repo-creation init** — driven by the maintenance loop (not a Job): for each `provisioning` group it runs - `kopia repository create` + asserts initial retention in-process, using the group's per-bucket role. - -Out of scope here (other specs/components own them): the public-server device endpoints -(`/backup-credentials`, `/backup-target`, `/backup-report`), the AWS-SDK client on `public-server`'s -`AppState`, staleness detection over `backup_runs` (signal 1), the per-group upstream **preflight**, the -operator UI, and all Pulumi `backups`-stack bucket/role changes. Where this component *depends on* those, it is -called out in §6/§7. UPDATE (shipped): kopia is a **bundled subprocess** of the -`backups` bin. The kopia binary is copied into the single shipped -`ghcr.io/beyondessential/canopy` image (`.github/Dockerfile.native`), so the -backups pod runs that same image — there is no separate job-pod image and no -inter-process contract — §5 now documents how the bin invokes the kopia CLI -directly. - ---- - -## 2. 
Where it lives & the loop template - -RESOLVED (impl) — **one** bin, not four, following `reachability.rs` / -`pingtask.rs` for the outer `spawn()`/`main()` shape but driving four loop -modules concurrently: - -``` -crates/jobs/src/bin/backups.rs # the single long-lived bin -crates/jobs/src/backup.rs # backup module root -crates/jobs/src/backup/maintenance.rs # maintenance scheduler loop (+ drives init) -crates/jobs/src/backup/inspection.rs # read-only inspection scheduler loop -crates/jobs/src/backup/preflight.rs # upstream preflight (see detection-preflight spec) -crates/jobs/src/backup/s3_metrics.rs # CloudWatch BucketSizeBytes task -crates/jobs/src/backup/kopia.rs # in-process kopia execution layer (subprocess wrappers + parsing) -crates/jobs/src/backup/worker.rs # shared Worker: pool, kube client (Secret reads), concurrency, in-flight set -crates/jobs/src/backup/complete.rs # inline DB writes from a kopia op's typed outcome -``` - -UPDATE (shipped) — the bin's `main()` builds a shared -[`Worker`](#) (DB pool, `kube::Client` for Secret reads, concurrency semaphore, -in-flight group set) **once**, then launches the four loops under one -`tokio::try_join!`. Maintenance and inspection share the `Worker`; preflight and -s3-metrics build their own pool/AWS clients. The kopia ops for each due group run -**in-process** as subprocesses (the kopia CLI is bundled in the image), parse -kopia's `--json`, and write results inline via `complete.rs`. (Original sketch was -three separate `backup_*` bins each as its own Deployment, each *spawning k8s -Jobs* — both superseded: one bin, in-process kopia.) - -Repo-creation is **not** its own loop and **not** handler-driven — UPDATE -(shipped): the **maintenance loop** runs `init` in-process for `provisioning` -groups (gated by the same in-flight set), then advances the status inline (see -§3.8). The onboarding handler only sets `status = 'provisioning'`; private-server -holds no kube/jobs dependency. 
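
The in-flight set that gates maintenance, inspection, and init can be sketched as an RAII guard. This is a simplified std-only illustration under stated assumptions: the spec describes a tokio semaphore plus an in-flight set, whereas here the size of the set doubles as the concurrency cap, and group ids are plain `u64` values rather than UUIDs. `Slots`, `try_claim`, and `InFlightGuard` follow the names in this spec, but the bodies are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Shared, cheaply cloneable claim table (assumed shape, not the shipped one).
#[derive(Clone)]
pub struct Slots {
    in_flight: Arc<Mutex<HashSet<u64>>>,
    max_concurrency: usize,
}

/// Holding the guard means the group is claimed; dropping it releases the claim.
pub struct InFlightGuard {
    in_flight: Arc<Mutex<HashSet<u64>>>,
    group_id: u64,
}

impl Slots {
    pub fn new(max_concurrency: usize) -> Self {
        Slots {
            in_flight: Arc::new(Mutex::new(HashSet::new())),
            max_concurrency,
        }
    }

    /// Returns `None` when the cap is reached or the group already has an
    /// operation running, so the caller skips the group for this tick.
    pub fn try_claim(&self, group_id: u64) -> Option<InFlightGuard> {
        let mut set = self.in_flight.lock().unwrap();
        if set.len() >= self.max_concurrency || !set.insert(group_id) {
            return None;
        }
        Some(InFlightGuard {
            in_flight: Arc::clone(&self.in_flight),
            group_id,
        })
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        // Release even if another thread poisoned the lock by panicking.
        let mut set = match self.in_flight.lock() {
            Ok(guard) => guard,
            Err(poisoned) => poisoned.into_inner(),
        };
        set.remove(&self.group_id);
    }
}
```

Moving the guard into the task that runs the kopia operation ties the release to task completion, including the panic path, which is the property the loops rely on.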
- -Each bin keeps the established structure verbatim: - -```rust -pub fn spawn() -> JoinHandle<()> { - let pool = database::init(); - task::spawn(async move { - // build the kube client + scheduler config ONCE at startup - // (like reachability builds the TailnetDirectory once) - loop { - sleep(Duration::from_secs(TICK)).await; - let Ok(mut db) = pool.get().await else { error!(...); continue; }; - // … per-tick work … - } - }) -} - -#[derive(Debug, Parser)] -struct Args { #[command(flatten)] logging: LoggingArgs } - -#[tokio::main] -async fn main() -> miette::Result<()> { /* identical to reachability.rs main() */ } -``` - -Deviation from the DB-only sweeps: at startup the bin builds a **kube client** (used only for repo-password -**Secret reads**, not Jobs) and the s3-metrics/preflight tasks build AWS clients. UPDATE (shipped): the kube -client is built once in `main()` (a hard failure there exits the pod); a transient API blip when *reading a -Secret* mid-loop is per-group and just skips that group's op for the tick, so it doesn't kill the pod. - -RESOLVED (impl): the four loops share **one** single-replica `Recreate` -Deployment (`backups`) in `ops/pulumi/tamanu/meta/src/jobs.ts` (see §4). UPDATE -(shipped): the per-group work runs **in-process** (kopia subprocess) inside the -loop pod — it is **not** a k8s Job. - -### Tick vs. cadence - -The loop ticks frequently (default 60s, matching reachability) but **per-group work is gated by hash-jittered -cadence**, so a tick mostly finds nothing due. The loop's job each tick is: enumerate configured+`ready` -groups, compute each group's due-ness for *this* loop's cadence, and — for those due and not already in-flight — -**claim a per-group + concurrency slot and run the kopia op in-process**. This keeps "is anything due" cheap (a -DB read + arithmetic) and the heavy work in bounded in-process subprocesses. - ---- - -## 3. 
Concrete changes (canopy) - -### 3.1 New crate dependencies (`crates/jobs/Cargo.toml`) - -Net-new; **do not pin versions without checking the registry** (per global rule). UPDATE (shipped) — the -shipped `crates/jobs/Cargo.toml` carries: - -- `kube` and `k8s-openapi` — used **only for Secret reads** (the per-group repo passphrase, via - `worker::read_repo_password`). UPDATE (shipped): **no** Job create/list/watch/delete — there are no Jobs. - `k8s-openapi` is pinned to a feature matching the cluster (verify against the deployed control-plane version, - do not guess). -- `aws-config` + `aws-sdk-cloudwatch` + `aws-sdk-s3` + `aws-sdk-sts` for **preflight** (upstream reachability) - and the **S3-metrics** task. The maintenance/inspection loops do **not** use the AWS SDK directly — kopia's - own bundled AWS SDK talks to S3 (the bin only overrides `AWS_ROLE_ARN` per subprocess). -- **No `axum`** — UPDATE (shipped): there is no `/job-report` HTTP server; results are typed Rust values - written inline. -- Hash-jitter uses a stable hash of the group UUID (stable across restarts — not a randomly-seeded hasher); - the helpers live in `commons_servers::backup_jobs`. - -(The AWS SDK also lands on `public-server` per the endpoints spec; the kube client also lands on `public-server` -for Secret-read per `/backup-target` — both are *separate* additions owned by the endpoints component.) - -### 3.2 In-process execution model: `kopia.rs` + `worker.rs` + `complete.rs` - -UPDATE (shipped) — there is **no** k8s-Job manifest builder (`jobspec.rs`), no `JobSpawner`, and no -`spawn_*_job` helpers. The kopia work runs in-process. 
The code splits three ways: - -``` -crates/jobs/src/backup/kopia.rs # subprocess wrappers + parsing + per-kind orchestration (run_init/run_maintenance/run_inspect) -crates/jobs/src/backup/worker.rs # shared Worker: pool, kube client (Secret reads), Slots (semaphore + in-flight set) -crates/jobs/src/backup/complete.rs # inline DB writes from a kopia op's typed outcome -``` - -Pure scheduler logic (hash-jitter, due-ness, billing labels, retention floor, `JobKind`) still lives in -**`commons_servers::backup_jobs`**, reused by the loops and by private-server (so private-server need not depend -on the `jobs` crate). UPDATE (shipped): private-server does not call any spawn helper — init is scheduler-driven -(§3.8). - -**`kopia.rs`** (in-process execution layer): - -- Builds a per-op `KopiaEnv { target_role_arn, region, password }` and applies it to each `tokio::process::Command`: - it sets `AWS_ROLE_ARN` = the group's `target_role_arn` (overriding the pod's shared `canopy-jobs` IRSA role), - `AWS_REGION`/`AWS_DEFAULT_REGION` = the group's region, and `KOPIA_PASSWORD` = the repo passphrase. The - projected `AWS_WEB_IDENTITY_TOKEN_FILE` is **inherited** from the pod env, so kopia's own AWS SDK does - `AssumeRoleWithWebIdentity` against the per-bucket role **directly** (not chained → up to the role's - `MaxSessionDuration`, auto-refreshed — no 1h cap). This replaces the per-kind IRSA-SA distinction: every op - uses the same pod SA and overrides the role per subprocess. -- `connect(...)` always connects with `--override-username canopy --override-hostname canopy-maintenance` so the - running identity is the maintenance owner (kopia 0.23.1 requires running identity == owner for - `maintenance run`; see §5). `run_init` sets that identity as the owner. -- `run_init` / `run_maintenance` / `run_inspect` orchestrate the kopia subcommands and return **typed Rust - outcomes** (e.g. 
`MaintOutcome { bytes_reclaimed }`, `InspectOutcome { verify_ok, counts, per-source - inventory }`) — no JSON-over-HTTP, no termination-log. -- The repo password is **never** logged; it is read from the group's k8s Secret and passed only via the - subprocess env. - -**`worker.rs`** (concurrency + Secret reads): - -- `Worker { pool, kube, cfg, slots }` is built once in `main()` and shared (cheaply cloned) by maintenance + - inspection. `Cfg::from_env()` reads `CANOPY_NAMESPACE`, the Secret password key, and the web-identity token - file path — so one binary works across stacks. -- `read_repo_password(secret_name)` reads the named key from the group's k8s Secret (the only kube API use). -- `Slots` holds a tokio `Semaphore` (max concurrency from `CANOPY_BACKUP_MAX_CONCURRENCY`, default 4) and an - in-flight `HashSet`. `try_claim(group_id)` takes a permit then marks the group in-flight, returning an - `InFlightGuard` that releases both on drop — enforcing **one op per group at a time** across maintenance + - inspection + init, plus a global concurrency cap. - -**`complete.rs`** (inline completion): - -- Called inline with the typed outcome: `complete_maint` closes the `backup_maintenance_runs` row (success → - `bytes_reclaimed`; failure → error), `complete_init` advances `provisioning → escrow_pending`/`ready` or - records `last_init_error`, and the inspection path upserts inventory/stats and raises/recovers the corruption - alert off `verify_ok`. There is no report endpoint and no crash-detection: the op runs in the same process, so - its outcome is known directly. - -**Billing labels** (`commons_servers::backup_jobs`) are still computed from the group (for the Deployment's cost -labels, not per-Job pods): - -- `billing.product` = group's `billing.product` tag if present else `"tamanu"`. -- `billing.deployment` = group's `billing.deployment` tag if present else the group **name**. 
-- `billing.stage` = group's `billing.stage` tag if present, else derived from - `ServerGroup::highest_member_ranks` → `rank_priority`, mapped **explicitly** to the CUR stage strings ops - already emits — **not** the `ServerRank` `Display` strings, which don't match: - - | `ServerRank` | `Display` | billing stage | - |---|---|---| - | `Production` | `production` | `prod` | - | `Clone` | `clone` | `clone` | - | `Demo` | `demo` | `demo` | - | `Test` | `test` | `test` | - | `Dev` | `dev` | `dev` | - - RESOLVED (impl): `billing.stage` maps explicitly, with `ServerRank::Production - → "prod"` (the load-bearing mismatch); the others coincide but are mapped - explicitly so a future `Display` rename can't silently break CUR tags. (See §8 - for the all-unranked fallback.) - -`commons_servers::backup_jobs` provides hash-jittered due-ness (UPDATE (shipped) — these live in -`backup_jobs`, not a `schedule.rs`): - -```rust -/// Cadence elapsed since the last run-of-this-kind (window arithmetic only). -pub fn is_due(window: Duration, last: Option<Timestamp>, now: Timestamp) -> bool; - -/// Stable per-group jitter slot: true only on the tick matching this group's -/// hashed offset within the window. Stable across restarts (hashes the group -/// UUID, NOT a randomly-seeded hasher). -pub fn slot_is_due(group_id: Uuid, window: Duration, tick: Duration, secs_into_window: u64) -> bool; -``` - -### 3.3 Database changes - -The **tables** are defined in the design doc and shared with sibling components; this component **reads** config -and **writes** run/inventory/stats rows. Migrations are created with **`just migration NAME`** (never -hand-authored — per project rule). To avoid two specs both trying to own the same migration, ownership is: - -- `server_group_backup_config`, `backup_credential_issuances`, `backup_runs`, `backup_requests` — owned by the - **endpoints/onboarding** components (this component only **reads** `server_group_backup_config`). 
-- **This component owns the migrations for** `backup_maintenance_runs`, `backup_repo_snapshots`, - `backup_repo_stats` (DDL verbatim from the design doc §"Database changes"). If a single migration is - preferred for the whole feature, coordinate so this component contributes these three tables. - -New database-crate model modules (mirroring `chrome_releases.rs` shape: a `Queryable` struct + a `New*` insert -struct + impl methods, re-exported from `lib.rs`): - -- `crates/database/src/backup_maintenance_runs.rs` — `MaintenanceRun` / `NewMaintenanceRun`. - - `NewMaintenanceRun::start(conn, group_id, kind) -> id` (insert with `outcome = NULL`, returns `BIGSERIAL`). - - `MaintenanceRun::finish(conn, id, outcome, error, bytes_reclaimed)`. - - `MaintenanceRun::latest_for_group(conn, group_id, kind) -> Option` (for due-ness + - staleness). -- `crates/database/src/backup_repo_snapshots.rs` — `RepoSnapshot` / `NewRepoSnapshot`. - - `NewRepoSnapshot::upsert_many(conn, group_id, rows)` (PK `(group_id, source)`, `ON CONFLICT … DO UPDATE` - `latest_snapshot_at`/`observed_at`). - - parse `server_id` from the kopia `source` (`canopy@:`) at write time. -- `crates/database/src/backup_repo_stats.rs` — `RepoStats`. - - `RepoStats::upsert_repo_fields(conn, group_id, snapshot_count, source_count, logical, physical)` — written - by inspection. - - `RepoStats::upsert_bucket_bytes(conn, group_id, bytes)` — written by the S3-metrics task; must **not** - clobber the repo fields (partial upsert), since the two tasks run on different cadences. `bucket_bytes` is - nullable/best-effort. - -Use PostgreSQL-native upserts (`ON CONFLICT`) per project DB conventions; keep the per-task partial-update -separation so the two writers don't race over each other's columns. 
- -`backup_repo_config` reader: add `server_group_backup_config` model (likely owned by the config/onboarding -component) — this component needs a read like `BackupConfig::all_ready(conn) -> Vec` (status = -`'ready'`, used to enumerate groups to schedule) and `BackupConfig::by_group(conn, group_id)`. If that model -doesn't exist yet, this component adds the read-only accessors it needs. - -### 3.4 Maintenance scheduler loop (`backup/maintenance.rs`) - -Per tick (UPDATE (shipped) — no Jobs; the op is an in-process subprocess task): - -1. `ServerGroupBackupConfig::all(...)` → candidate groups; `provisioning` ones go through init (§3.8), - `ready` ones through maintenance. -2. For each `ready` group, decide quick vs full and due-ness (`due_kind`): - - **full**: due weekly (`is_due(WEEK, last_full, now)`) **and** this tick matches the group's hashed slot - (`slot_is_due(group, WEEK, TICK, …)`). - - **quick**: due daily, slot-jittered over the day; `full` subsumes quick. - - If both are due, run **full**. -3. For each due group: `Worker::try_claim(group_id)` — skip if the group is already in-flight (across - maintenance/inspection/init) or the concurrency cap is hit. No kube API query. -4. `NewMaintenanceRun::start(...)` → `run_id`; read the repo password from the group's Secret, build the - `KopiaEnv`, **spawn a tokio task** that runs `kopia::run_maintenance(...)` in-process and then calls - `complete::complete_maint(run_id, …)` inline with the typed outcome. The `InFlightGuard` releases the slot on - task completion. -5. Cadence defaults: quick-daily, full-weekly, deployment-wide (`TICK`/`DAY`/`WEEK` constants). Per-group override - is later (design non-goal). - -**Finish recording — UPDATE (shipped): inline, in-process.** The op runs as a -subprocess of the same pod, so the loop knows its outcome directly: it calls -`complete::complete_maint(run_id, outcome, error)` → `MaintenanceRun::finish`. 
-There is **no** `/job-report` endpoint, no axum server, and no kube poll for -reaping/crash-detection. (Superseded design: first a report-endpoint, then a pod -termination-message / pod-log read, then a bearer-authed `/job-report` POST with -a kube reap/crash-detect poll — see §8 #2. The termination-message/pod-log read -proved unreliable in practice — k8s truncates/drops the message and the pod may -be gone before it's read, PGRO's pattern too — and the report round-trip became -unnecessary once kopia runs in-process, since the IRSA was shared anyway.) A -crash now can't leave a row stuck at `outcome IS NULL` via a missing report: if -the in-process op panics or errors, the loop's task records the failure inline. - -The maintenance cycle's **three steps run in-process** (`kopia.rs::run_maintenance`), not in a separate Job: -`assert retention → kopia snapshot expire → kopia maintenance run [--full]`, all under the group's per-bucket -role. RESOLVED (impl) — **per-`(group, type)` retention is resolved and applied per source.** The loop calls -`commons_servers::backup_jobs::effective_retention_for_group` (per enabled type: `server_group_backup_schedule` -override → `backup_type_defaults` → floor baseline, each `.enforce_floor()`-clamped — never below `keep_daily 7, -keep_weekly 4, keep_monthly 6`) and builds a `{type → policy}` **map** (`kopia::RetentionMap`). The kopia layer -applies it **per source**: for each `canopy@:` source it sets that type's kopia policy -(`policy set `), then expires — so different types sharing a group's repo get their own -retention. (`init` sets a strictest-of-the-map global baseline since the repo has no sources yet.) The -private-server write path validates operator input against the same floor via the DB-crate -`RetentionPolicy::validate_floor()` (rejects below-floor rather than clamping). 
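
The difference between clamping and rejecting against the retention floor can be sketched as below. The three-field `RetentionPolicy` is an assumption for illustration (the real type may carry more `keep_*` fields), and `effective_retention` is a hypothetical single-type stand-in for `effective_retention_for_group`; the floor values and the override → type default → floor order come from the paragraph above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RetentionPolicy {
    pub keep_daily: u32,
    pub keep_weekly: u32,
    pub keep_monthly: u32,
}

/// The floor no effective policy may go below.
pub const FLOOR: RetentionPolicy = RetentionPolicy {
    keep_daily: 7,
    keep_weekly: 4,
    keep_monthly: 6,
};

impl RetentionPolicy {
    /// Scheduler side: silently raise each field to at least the floor.
    pub fn enforce_floor(self) -> Self {
        RetentionPolicy {
            keep_daily: self.keep_daily.max(FLOOR.keep_daily),
            keep_weekly: self.keep_weekly.max(FLOOR.keep_weekly),
            keep_monthly: self.keep_monthly.max(FLOOR.keep_monthly),
        }
    }

    /// Operator write path: reject below-floor input instead of clamping.
    /// A policy is valid exactly when clamping would leave it unchanged.
    pub fn validate_floor(self) -> Result<Self, String> {
        if self == self.enforce_floor() {
            Ok(self)
        } else {
            Err(format!("retention {:?} is below the floor {:?}", self, FLOOR))
        }
    }
}

/// Resolution order for one backup type: schedule override, then the type
/// default, then the floor baseline; the result is always clamped.
pub fn effective_retention(
    schedule_override: Option<RetentionPolicy>,
    type_default: Option<RetentionPolicy>,
) -> RetentionPolicy {
    schedule_override
        .or(type_default)
        .unwrap_or(FLOOR)
        .enforce_floor()
}
```

Clamping in the scheduler means a below-floor row that somehow reached the database still cannot shorten retention, while rejecting at the write path tells the operator their input was not accepted as given.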
- -### 3.5 Inspection scheduler loop (`backup/inspection.rs`) - -Per tick, same enumerate-and-gate shape, on its **own cadence** (default ≈ `expected_interval`, tunable; floor -weekly for manual-only `NULL`-interval groups that still hold backups). RESOLVED (impl): the per-group cadence -is `commons_servers::backup_jobs::effective_interval_for_group` (the **min** effective `expected_interval` -across the group's enabled types), floored to weekly. UPDATE (shipped): for each due group it claims a slot and -**runs `kopia::run_inspect(...)` in-process** (no read-only Job), which: - -- `kopia snapshot list --all --json` → per-source inventory (latest snapshot per source). -- repo stats (`kopia content stats` — note: **no `--json`**, parsed from text) → snapshot/source counts + - logical/physical bytes. -- **repo verify** → a `verify_ok` flag (poisoning/corruption signal). - -UPDATE (shipped) — **inspection completion runs inline**, in-process. There is no -`/job-report` POST and no kube reap poll. The loop calls `complete.rs` directly -with the typed `InspectOutcome`, which: - -- writes `backup_repo_snapshots` (latest snapshot per source; `server_id`/`type` - parsed from each source) and the repo-derived fields of `backup_repo_stats`. -- on `verify_ok: false`, raises the **`backup-corruption`** `Severity::Critical` - **group-level** alert (via `raise_group_event`, §3.7), with a matching - `active: false` recovery when a later inspection verifies clean. - -This is a **group-level** alert that fires **regardless of any server's -`is_monitored`** — routed through `raise_group_event` (§3.7), not a per-server -`NewEvent`. - -Inspection results vs signal-1 reconciliation (report-said-success-but-no-snapshot, etc.) is **owned by the -signal-1 staleness component**, which reads `backup_repo_snapshots`/`backup_runs`. This component's job is to -*write the ground truth* and to raise the *corruption* alert; the cross-signal reconciliation alerts are the -staleness component's. 
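
As a rough sketch of the corruption raise/recover pairing described above: `GroupEvent` and `corruption_event` are hypothetical names used only for illustration and do not reflect the real `raise_group_event` signature. Whether a clean inspection always emits the recovery event, or only after a prior failure, is an open detail left to the caller here.

```rust
// Sketch only: a pure mapping from an inspection's verify result to the
// group-level event it would raise. Types and names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Critical,
    Error,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GroupEvent {
    pub source: &'static str,
    pub reference: &'static str,
    pub severity: Severity,
    /// `true` raises the issue; `false` is the matching recovery.
    pub active: bool,
}

/// A failed verify yields an active Critical event; a clean verify yields
/// the same `(source, ref)` pair with `active: false`.
pub fn corruption_event(verify_ok: bool) -> GroupEvent {
    GroupEvent {
        source: "canopy",
        reference: "backup-corruption",
        severity: Severity::Critical,
        active: !verify_ok,
    }
}
```

Because the raise and the recovery share one `(source, ref)` pair, the recovery can resolve exactly the issue the failure opened.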
- -### 3.6 S3-metrics task (`backup/s3_metrics.rs`) - -A loop module in the one `backups` bin (UPDATE (shipped) — not a separate bin); it builds its own AWS clients. -Per tick (own cadence, ≈ `expected_interval`, weekly floor): - -- For each `ready` group, read CloudWatch `AWS/S3 BucketSizeBytes`. It is reported **per `StorageType`** - (storage class) with no "all storage types" total, and the class depends on bucket config (Standard, - Intelligent-Tiering tiers, …), so the task **`ListMetrics`-discovers** whichever `StorageType`s the bucket - actually emits and **sums** the latest `GetMetricStatistics` datapoint across them — no hardcoded class. - RESOLVED (impl): the metric lives in the **deployment** account, so the task **assumes the group's - `target_role_arn`** (the same role preflight assumes) and reads CloudWatch with those cross-account - credentials — no dedicated canopy-side cross-account CloudWatch IRSA. The per-bucket role must grant - `cloudwatch:GetMetricStatistics` **and `cloudwatch:ListMetrics`** (ops `backups`-stack). -- `RepoStats::upsert_bucket_bytes(conn, group_id, bytes)` — best-effort; on error log + continue, never alert - (best-effort/nullable per design). - -This task reads CloudWatch directly from the `backups` pod (lightweight); it never ran as a Job. - -### 3.7 Group-level alerting path (shared concern, must be settled here) - -Maintenance failure (stuck/failed maintenance) and inspection corruption are **group/control-plane** concerns -that must **not** pass the per-server `is_monitored` gate (design "Group-level checks alert regardless of -`is_monitored`"). But the incident model (`issues.rs`) is **server-keyed**: `NewEvent::save(conn, server_id, -device_id)` and `re_evaluate_incident_membership` gate on the server's `is_monitored`. There is no -"group-level issue with no server" path today. - -RESOLVED (impl): the gap was closed with a **group-scoped issue** path. 
The -shipped entrypoint is **`database::backup::alerts::raise_group_event(conn, -group_id, ref, severity, …)`**, which find-or-creates a group-scoped issue -(nullable `issues.server_id`, group resolved directly — migration -`2026-06-15-...backup_group_scoped_issues`), runs the group-aware membership -evaluation that **bypasses `is_monitored`**, and drains to `slacker_outbox` -unchanged. Both this component (corruption) and the detection component call it; -its tests cover the all-members-unmonitored paging case. Do **not** route -group-level alerts through a per-server `NewEvent`. (See the detection-preflight -spec for the full `raise_group_event` contract and the migration sweep.) - -`(source, ref)` conventions for this component (mirroring reachability's `source="canopy"`): -- maintenance stuck/failed → `ref = "backup-maintenance"`, `Severity::Error` (opens incident). -- repo corruption/poisoning → `ref = "backup-corruption"`, `Severity::Critical`. -Recovery is the **same `(source, ref)`** event with `active: false` / lower severity, so the issue leaves the -incident and auto-closes (same pattern reachability uses). `slacker_outbox` drains to Slack unchanged. - -(The `backup_maintenance_runs` staleness scan — "a group whose maintenance silently stopped" — can live in the -maintenance bin's loop or the signal-1 staleness component; **recommend** it lives with signal-1 so all -staleness logic is in one place, with this component only emitting the corruption alert and writing the runs -table. **Decision §8.**) - -### 3.8 Repo-creation init - -RESOLVED (impl) — **init is scheduler-driven, not handler-driven** (cleaner: no -kube/Jobs dependency in private-server). UPDATE (shipped): it runs **in-process**, -not as a Job. The onboarding handler only sets `status = 'provisioning'`; it does -**not** spawn anything. 
The **maintenance loop** then: - -- enumerates groups in `provisioning`, guarded by `last_init_error IS NULL` - (cleared by the operator-UI retry) **and** not-already-in-flight, -- runs `kopia::run_init(...)` in-process under the group's per-bucket role - (creating the repo format blob needs the full-access role, not the device's - no-delete set): `kopia repository create` (CONFIRMED kopia 0.23.1: exits - non-zero if the repo already exists → falls back to `connect` and treats that - as success), connects with the fixed `canopy@canopy-maintenance` identity and - sets it as the maintenance **owner**, and asserts the floor-enforced initial - retention, -- on completion, `complete::complete_init(...)` advances the status inline: - `provisioning → escrow_pending` for **FromBirth** mode (Canopy-minted - passphrase → escrow flow) / `provisioning → ready` for **Import** mode - (operator already holds the passphrase), **or** records `last_init_error` on - failure (operator-UI clears it to retry). - -(So both the original "the onboarding handler calls `spawn_init_job`" *and* the -intermediate "scheduler spawns an init Job that POSTs `/job-report`" are -superseded: private-server holds no kube/jobs dependency, and the maintenance -loop runs init in-process and advances the status directly.) - ---- - -## 4. IaC changes (ops — `ops/pulumi/tamanu/meta`) - -Owned jointly with the ops/IaC spec; the canopy-jobs-relevant pieces: - -- RESOLVED (impl): **one** new single-replica `Recreate` Deployment in `jobs.ts` - mirroring `reachability`/`pingtask` — `backups` (`['backups']`), running all - four loop modules — `dependsOn: [migrator]`, with `costLabels`. (Originally - specced as three separate `backup-maintenance`/`backup-inspection`/`backup-s3-metrics` - Deployments — superseded.) 
-- **ServiceAccount + IRSA, net-new to canopy.** UPDATE (shipped) — there is **one** SA, the `canopy-jobs` SA on - the single `backups` Deployment; **no per-Job SAs**, no per-kind maintenance/inspection SA split (every kopia - subprocess overrides `AWS_ROLE_ARN` to the group's per-bucket role and reuses the pod's projected web-identity - token). `spec.ts` injects no `serviceAccountName` today, so add an optional `serviceAccountName` to the - `spec()` container args (or a sibling helper) and create the SA via the existing - **`common/eksServiceAccount.ts`** helper. The SA needs: - - **k8s RBAC: `get` Secrets** in canopy's namespace — to read the per-group repo passwords — **and that's it**. - UPDATE (shipped): **NO** create/list/watch/delete Jobs, **no** pods, **no** tokenreviews. - - **AWS/IRSA:** assume the per-bucket roles via **web-identity** (direct, refreshing — up to each role's - `MaxSessionDuration`; set it high enough to cover a long maintenance run). The per-bucket role must **trust - the `canopy-jobs` SA's OIDC subject** and (for s3-metrics) grant `cloudwatch:GetMetricStatistics`. The - per-bucket role trust + action set + `s3:GetBucketObjectLockConfiguration` are **`backups`-stack** changes - owned by the ops spec. - - **OIDC-provider-per-account** wiring so the pod's web-identity can assume cross-account (ops/IaC). -- **The shipped `ghcr.io/beyondessential/canopy` image** bundles kopia (the kopia binary is copied into - `.github/Dockerfile.native` from `kopia/kopia:0.23.1`), so the backups pod runs the same image as the other - components — there is no separate kopia-job image, and there is **no `CANOPY_BACKUP_IMAGE` env** (no Job image - to reference). See §5. -- UPDATE (shipped): **no** report Service/Secret (`CANOPY_BACKUP_REPORT_*` / bearer token) — there is no - `/job-report` endpoint. The k8s RBAC is a least-privilege namespace `Role`/`RoleBinding` granting only - `get secrets`. 
- -This component's canopy code reads the namespace (and Secret password key / web-identity token file) from -**env/config** (like DATABASE_URL), not hardcoded, so the same binary works across stacks. - ---- - -## 5. Interfaces / contracts - -### Consumes - -- **DB config:** `server_group_backup_config` (read): `group_id`, `bucket`, `prefix`, `target_role_arn`, - `region`, `repo_password_ref`, `status`, `mode`, `last_init_error`. Schedule/retention are read from - `server_group_backup_schedule` / `backup_type_defaults` (the addendum moved `expected_interval`/`retention` - off the config table). `status = 'ready'` groups are scheduled for maintenance/inspection; `provisioning` - groups drive the init flow (§3.8). -- **`server_groups`:** `ServerGroup::highest_member_ranks`, `rank_priority`, `tags` (`TagMap`) for billing - labels. -- **kopia repo password Secret** named by `repo_password_ref`, in canopy's namespace — UPDATE (shipped): **read - via the kube API** (`worker::read_repo_password`) and passed to the kopia subprocess as `KOPIA_PASSWORD`, not - mounted via `secretKeyRef`. Owned by the repo-password/onboarding component; consumed here read-only. -- **Per-bucket IAM roles** (`target_role_arn`) trusting the **`canopy-jobs` SA** OIDC subject cross-account; the - kopia subprocess assumes them directly via web-identity. Owned by the ops `backups`-stack spec. UPDATE - (shipped): one role per group (no separate full-access vs read-only role per kind). -- **`database::backup::alerts::raise_group_event`** (group-level alert path, §3.7). `Severity` from - `commons_types::issue` (`OPENS_INCIDENT = [Critical, Error]`). -- **kube API** (Secret reads only) + **CloudWatch** (s3-metrics) + **S3/STS** (preflight, and CloudWatch - cross-account assume). - -### Provides - -- **DB writes** other components read: - - `backup_maintenance_runs` (start/finish; consumed by signal-1 staleness + the stats UI panel). 
- - `backup_repo_snapshots` (ground-truth inventory; consumed by signal-1/2 reconciliation + UI). - - `backup_repo_stats` repo fields + `bucket_bytes` (consumed by the operator stats panel). -- **Shared library** (`commons_servers::backup_jobs`): pure scheduler helpers — `JobKind`, billing labels, - `RetentionPolicy`/floor (`effective_retention_for_group`), `effective_interval_for_group`, - `is_due`/`slot_is_due` — shared by the loops and by private-server's validation. UPDATE (shipped): **no** - `spawn_*_job` helpers (init is scheduler-driven, in-process). -- **Group-level alerts** `(source="canopy", ref ∈ {backup-maintenance, backup-corruption})` feeding the - existing incident → Slack pipeline. - -### kopia invocation (in-process subprocess — no inter-process contract) - -UPDATE (shipped) — there is **no inter-process contract** anymore. The earlier -designs (config-via-ENV/args + results-via-POST-to-`/job-report`, and the -separate `images/kopia-job/` image with its `CONTRACT.md`) are **superseded**: -the `backups` bin invokes the **bundled** kopia CLI directly (`tokio::process`) -in `kopia.rs`, parses its `--json`/text output into typed Rust values, and writes -the results inline. No ENV-config handoff, no result JSON over HTTP, no -`terminationMessagePolicy`, no `kopia-job` image. - -Per op the bin builds a `KopiaEnv` (`AWS_ROLE_ARN` = group's `target_role_arn`, -`AWS_REGION`, `KOPIA_PASSWORD` from the Secret; projected web-identity token -inherited) and runs: - -- **maint-quick / maint-full:** `connect` → per source set that type's policy - (`policy set --keep-*`, from the `{type → policy}` map) → - `snapshot expire` → `maintenance run [--full]`. (`init` sets a - strictest-of-map global baseline since there are no sources yet.) -- **inspect:** `snapshot list --all --json` → repo stats (`content stats`) → verify. 
-- **init:** `repository create` + assert initial retention; connect as the canopy - identity and set it as the maintenance **owner**, **disabling** client-side - maintenance/expiry (so devices never attempt delete-needing ops). - -**Verified kopia 0.23.1 facts** (confirmed against the bundled version, encoded in `kopia.rs`): - -- **Connect identity / maintenance owner:** `maintenance run` refuses unless the connected client identity - equals the maintenance owner. So every op connects with `--override-username canopy --override-hostname - canopy-maintenance` (constants `MAINTENANCE_USER`/`MAINTENANCE_HOST`), and `init` sets that identity as the - owner. Devices connect with their own identity, so they never become owner. -- **Per-source policy:** retention is applied per source via `kopia policy set --keep-*` - (per-type, keyed by `canopy@:`). -- **`kopia content stats` has no `--json`** — physical-bytes are parsed from its text output (a "Total Bytes:" - line), best-effort (`None` if unparseable). -- **`kopia repository create` exits non-zero if the repo already exists** — `run_init` treats that as success by - falling back to `connect`. -- **`kopia snapshot list --all --json`** elements carry `source = { userName, host, path }`, parsed into the - per-source inventory (`server_id`/`type` from `host`/`path`). - -Typed outcomes (written inline by `complete.rs`, no wire schema): - -- **init:** ok / error → status advance or `last_init_error`. -- **maint:** `MaintOutcome { bytes_reclaimed }`. -- **inspect:** `InspectOutcome { verify_ok, snapshot/source counts, logical/physical bytes, per-source - inventory }`. - ---- - -## 6. 
Data shapes (Rust) - -```rust -// commons_servers::backup_jobs (kept as a *kind* enum even though there are no Jobs) -pub enum JobKind { MaintQuick, MaintFull, Inspect, Init } - -pub struct BillingLabels { - pub product: String, // default "tamanu" - pub deployment: String, // default = group name - pub stage: Option, // None => omit label (all-unranked group) -} - -// RESOLVED (impl): RetentionPolicy lives in the DATABASE crate -// (database::backups::RetentionPolicy), over the schedule/type-default JSONB. -pub struct RetentionPolicy { - pub keep_latest: i32, // default 1, not floored - pub keep_daily: i32, // floor 7 (FLOOR_DAILY) - pub keep_weekly: i32, // floor 4 (FLOOR_WEEKLY) - pub keep_monthly: i32, // floor 6 (FLOOR_MONTHLY) - pub keep_annual: i32, // default 0 -} -impl RetentionPolicy { - // validates (does NOT silently clamp): below-floor → AppError::BadRequest - pub fn validate_floor(&self) -> Result<()>; - pub fn from_json(&JsonValue) -> Option; - pub fn to_json(&self) -> JsonValue; -} -``` - -```rust -// database models -pub struct MaintenanceRun { pub id: i64, pub group_id: Uuid, pub kind: String, - pub started_at: Timestamp, pub finished_at: Option, - pub outcome: Option, pub error: Option, pub bytes_reclaimed: Option } - -pub struct RepoSnapshot { pub group_id: Uuid, pub source: String, - pub server_id: Option, pub latest_snapshot_at: Option, pub observed_at: Timestamp } - -pub struct RepoStats { pub group_id: Uuid, pub snapshot_count: Option, pub source_count: Option, - pub logical_bytes: Option, pub physical_bytes: Option, - pub bucket_bytes: Option, pub observed_at: Timestamp } -``` - -Kopia source parse: `canopy@:` → `server_id = Uuid::parse(...)` (best-effort; `None` if the -host segment isn't a UUID, e.g. legacy/imported repos — store the row with `server_id = NULL` rather than -dropping it). - ---- - -## 7. 
Testing approach (per AGENTS.md) - -- **DB model tests** with `commons_tests::db::TestDb::run(|mut conn, _url| async move { … })`, calling model - functions directly (not HTTP), per the project rule. Cover: - - `NewMaintenanceRun::start` then `MaintenanceRun::finish` (success + failure rows), `latest_for_group`. - - `RepoSnapshot::upsert_many` idempotency on `(group_id, source)`; `server_id` parse from a real - `canopy@:/path` source and from a non-UUID host (→ `NULL`). - - `RepoStats` **partial upserts**: `upsert_repo_fields` then `upsert_bucket_bytes` must not clobber each - other (the two-writer split is the load-bearing invariant — test it explicitly). - - retention **floor enforcement**: a policy below floor is raised; an above-floor override is preserved; - `keep_latest` is **not** floored. - - 404/absent cases (`latest_for_group` for an unknown group → `None`). - - Always `use database::ModelName;` imports. -- **Pure-logic unit tests** (plain `#[test]` / `#[tokio::test(flavor = "multi_thread")]`): - - `slot_is_due`/`slot_offset` is **stable** for a fixed UUID across calls (regression guard against a - randomly-seeded hasher) and spreads across the window for distinct UUIDs. - - `is_due` boundaries (just-before / just-after the window, full subsumes quick). - - **billing label mapping** — especially `Production → "prod"` (the gotcha) and all-unranked → `None`. - - UPDATE (shipped) — instead of a Job-manifest test: **`kopia.rs` parsing/policy helpers** (retention `--keep-*` - flag building, `snapshot list --json` source parsing, `content stats` text parsing) and **`worker::Slots`** - concurrency (one-op-per-group exclusion + semaphore cap) — all unit-testable without a cluster. -- **Kube/kopia interaction:** do **not** stand up a real cluster or invoke real kopia in tests. 
UPDATE (shipped): - there is no `JobSpawner` trait — concurrency/due-ness is tested via `Slots` + the `commons_servers::backup_jobs` - helpers; the kopia subprocess itself is not exercised in CI. -- **Alerting:** assert the corruption path raises the group-level event (`Severity::Critical`, `ref = - "backup-corruption"`) via `raise_group_event` and that recovery emits the matching `active: false` event (the - group-scoped-issue path is shipped, §3.7). -- **No e2e/Playwright here** — this component has no rendered UI (the onboarding/stats UI is a separate - component and carries its own Playwright per AGENTS.md). The init flow is exercised via the - `complete_init` status-advance path. -- Run per-package while iterating: `just test-package jobs` and `just test-package database`; let CI run the - full suite (no final local full-suite run, per memory). `just check` for compile/warnings. - ---- - -## 8. Open questions / decisions to make - -1. **Shared-library location** — RESOLVED: **`commons_servers::backup_jobs`**. - (And, as shipped, init is scheduler-driven so private-server doesn't call a - spawn helper at all — §3.8.) -2. **Maintenance/inspection completion signal** — RESOLVED (impl): **in-process.** - kopia runs as a subprocess of the `backups` pod and its typed outcome is - written inline (`complete.rs`); there is no completion *signal* to receive. - This supersedes both earlier ideas: (a) the kube-watch / pod-termination-message - / pod-log read (reverted — k8s truncates/drops the message and the pod may be - gone before it's read, unreliable in practice), and (b) the bearer-authed - `/job-report` POST + kube reap/crash-detect poll (dropped — the report - round-trip became unnecessary once kopia runs in-process, since the IRSA is - shared anyway). No `/job-report` endpoint, no axum, no kube poll. -3. 
**Group-level alert mechanism** — RESOLVED: **group-scoped issue** via - `database::backup::alerts::raise_group_event` (bypasses `is_monitored`), - backed by the `backup_group_scoped_issues` migration (nullable - `issues.server_id`). No per-server `NewEvent` workaround. -4. **Where the maintenance-staleness scan lives** — RESOLVED: the - staleness/reconcile sweep lives with the detection slice - (`database::backup::sweep`, run by the `monitor` bin); this component writes - `backup_maintenance_runs` and raises only the corruption alert. -5. **kopia image** — RESOLVED (impl): there is no separate kopia-job image and no entrypoint contract. kopia is - **bundled** into the single shipped `ghcr.io/beyondessential/canopy` image (the kopia binary is copied into - `.github/Dockerfile.native` from `kopia/kopia:0.23.1`) and invoked in-process; no `CANOPY_BACKUP_IMAGE` - env. (§5.) -6. **S3-metrics cross-account read** — RESOLVED (impl): assume the group's `target_role_arn` and read CloudWatch - with those creds (no dedicated canopy CloudWatch IRSA); per-bucket role grants `cloudwatch:GetMetricStatistics` - + `cloudwatch:ListMetrics`. `BucketSizeBytes` is summed across the bucket's actual `StorageType`s - (ListMetrics-discovered — handles Standard / Intelligent-Tiering), not a hardcoded class. -7. **All-unranked group billing stage** — fall back to `prod` vs omit `billing.stage`. *Recommend omit.* -8. **k8s-openapi API version** — pin to the cluster's actual control-plane version (verify; don't guess), per - the no-guessing-versions rule. Same for `kube`/AWS-SDK crate versions (check the registry before pinning). -9. **Migration ownership** — confirm with the config/endpoints components whether the feature ships one - migration or several; this component contributes `backup_maintenance_runs` / `backup_repo_snapshots` / - `backup_repo_stats` either way. -10. 
**Cadence/tick defaults** — quick-daily / full-weekly / inspection ≈ `expected_interval` (weekly floor) / - metrics ≈ `expected_interval` (weekly floor); confirmed-tunable per design but the constants live in code. - ---- - -## 9. Net-new infrastructure summary (none exists in canopy today) - -UPDATE (shipped) — the shape changed from "k8s Jobs + per-kind IRSA SAs + a separate kopia image" to "one -long-lived Deployment that runs bundled kopia in-process": - -- **Kubernetes API client** (`kube` + `k8s-openapi`) on the `backups` bin — used **only to `get` Secrets** (repo - passwords). No Job create/list/watch/delete. -- **ServiceAccount + IRSA** plumbed through `spec.ts` for the single `backups` pod: **one** `canopy-jobs` SA with - k8s RBAC = `get secrets` and an IRSA role; each kopia subprocess overrides `AWS_ROLE_ARN` to the group's - per-bucket role and assumes it **directly via web-identity** (refreshing, up to the role's `MaxSessionDuration`). - No per-Job/per-kind SAs, no chained read-only path. -- **AWS SDK** (`aws-config` + `aws-sdk-cloudwatch` + `aws-sdk-s3` + `aws-sdk-sts`) on the `backups` bin for - **preflight** + **s3-metrics**. kopia's own bundled AWS SDK handles the S3 repo I/O. -- **The shipped `ghcr.io/beyondessential/canopy` image** bundles kopia (its binary is copied into - `.github/Dockerfile.native` from `kopia/kopia:0.23.1`). No separate kopia-job image, no ENV-config/POST - contract. -- **OIDC-provider-per-account** wiring for cross-account web-identity (ops/IaC). - -This component owns the **jobs-side** kube (Secret-read) client + the `canopy-jobs` IRSA; `public-server`'s -kube/AWS additions are owned by the endpoints component. - ---- - -## Backup types addendum - -Per the plan's "Backup types": - -- **Retention is per-`(group, type)`.** The maintenance cycle's - assert-retention step asserts *each type's* effective keep-policy - (`server_group_backup_schedule.retention ?? 
backup_type_defaults`, org - floor applied) as a kopia **per-source/path policy**, so `kopia snapshot - expire` honours the right policy per type. The maintenance *run* itself - stays per-group (one repo per group, shared by all types). -- **Scheduling is per-`(group, type)`** — the maintenance/inspection - schedulers iterate active `(group, type)` (or per-group for the - repo-wide maintenance run; per-type for retention assertion). -- **Inspection** parses the snapshot's `canopy-type` tag → writes - `backup_repo_snapshots.type`; `(server, type)` is one source. -- `backup_repo_stats` stays per-group (repo is shared; size is repo-level). diff --git a/docs/plans/specs/canopy-operator-ui.md b/docs/plans/specs/canopy-operator-ui.md deleted file mode 100644 index 4362e199..00000000 --- a/docs/plans/specs/canopy-operator-ui.md +++ /dev/null @@ -1,571 +0,0 @@ -# Spec: canopy-operator-ui - -Implementation spec for the **operator UI** of the backup-credentials system: -the `TailscaleAdmin`-gated private-server admin endpoints and the private-web -React/MUI screens that drive group backup onboarding/config, repo-creation -trigger + status, the reveal-once escrow + acknowledgment, the one-off "backup -now" trigger, and the read-only stats panel. - -Authoritative design: [`../backup-credentials.md`](../backup-credentials.md) -(see esp. "Operator workflows & repo provisioning (private-server UI)", -"Repository password ownership" → "DR escrow", "Backup cadence and triggering" -→ "Operator one-off", "Operational story"). This component owns **only** the -operator-facing surface; it consumes the data model, AWS/k8s machinery, -schedulers, and detection owned by the other backup-credentials components. - -This spec is scoped to repo `canopy`: `crates/private-server` (axum admin fns) -and `private-web` (React SPA + Playwright e2e), following the patterns in -`AGENTS.md` ("Private server architecture", "React frontend"). - ---- - -## 1. 
Purpose - -Make group backup onboarding a real, self-serve operator workflow in the -existing Tailscale-gated admin SPA — *not* a SQL bootstrap. Concretely, give an -operator the ability to: - -1. **Onboard / configure** a group's backup: set `bucket`, `target_role_arn`, - `region`, `expected_interval`, `retention`, choose from-birth vs. import - mode, and kick off repo creation. -2. **See repo-creation status** and the lifecycle state machine - (`provisioning → escrow_pending → ready`), including init-Job failures. -3. **Reveal the generated passphrase once** (from-birth repos), with a "saved - to Bitwarden" acknowledgment that flips `escrow_pending → ready`. -4. **Trigger a one-off "backup now"** for any group (scheduled or - manual-only), writing a `backup_requests` row the device picks up on its - next ~1-minute tick. -5. **Read a stats panel** — cached `backup_repo_stats` plus recent - `backup_runs` / `backup_maintenance_runs` per group. - -The UI is the human end of the control plane; it never talks to AWS, kopia, or -k8s directly — it only reads/writes Canopy's database via private-server fns -and *triggers* the init Job through a fn that the jobs-side machinery acts on. - ---- - -## 2. Where it lives in the repo - -### Backend (private-server) - -- New module `crates/private-server/src/fns/backups.rs` exposing - `pub fn routes() -> OpenApiRouter`, mounted under `/api/backups` - in `crates/private-server/src/fns/mod.rs` (add - `.nest("/backups", backups::routes())` and `pub mod backups;`). -- Follows the exact handler shape used by `server_groups.rs`: bare axum - handlers `(State, [TailscaleAdmin], Json) -> Result>`, - each annotated with `#[utoipa::path(post, path = "/", operation_id = - "backups_", tag = "backups", security(("tailscale-admin" = [])), …)]`. - Read-only endpoints use `security(("tailscale-user" = []))` (matching how - `server_groups::list`/`get` are user-gated while mutations are admin-gated). 
-- DB access via `state.db.get().await?`; all model logic lives in the - **database crate** (`crates/database/src/`), never inline in private-server - (per AGENTS.md: no diesel in private-server). - -### Backend (database crate) - -The UI fns are thin wrappers over model functions. The migrations and models -listed in §4 are **shared with the data-model component** of -backup-credentials; if that component lands them first, this component only -adds the *operator-facing* query/mutation methods. To avoid a silent gap, this -spec lists the full set the UI needs; whoever lands the table owns the -migration, and this component owns the query methods it calls. Coordinate via -the `depends_on` contract in the orchestration metadata. - -### Frontend (private-web) - -- New route components under `private-web/src/routes/`: - - `BackupConfig.tsx` — onboarding / edit config form (create + edit modes, - mirroring `GroupEdit.tsx`'s split). - - `BackupEscrow.tsx` — reveal-once passphrase + ack (often rendered as a - section inside the group backup page, gated on `status === 'escrow_pending'`). - - `BackupPanel.tsx` — the per-group backup overview: status, stats, recent - runs, "backup now" button, links to config/escrow. -- Surfaced from the existing **group detail page** (`GroupDetail.tsx`): add a - "Backups" section/card that renders `BackupPanel` (or a "Set up backups" - CTA when no config row exists). New routes registered in `App.tsx`: - - `/groups/:id/backups` → `BackupPanel` - - `/groups/:id/backups/config` → `BackupConfig` (create or edit) -- Wire types come from `private-web/src/api-types.ts` (generated) re-exported - through `private-web/src/types.ts`. UI-only label/order constants - (status labels, retention field labels) go in `types.ts` below the - re-exports, same as `SEVERITY_INTENT` / `SERVER_RANK_ORDER`. 
-- After any handler request/response change, run **`just gen-openapi`** and
-  commit both `private-web/openapi.json` and `private-web/src/api-types.ts`
-  alongside the Rust change (per AGENTS.md).
-
-### e2e
-
-- `private-web/e2e/backups.spec.ts` (new), using `./test-fixtures` +
-  `./seed.ts`. Extend `seed.ts` with `seedServerGroupBackupConfig`,
-  `seedBackupRun`, `seedBackupRepoStats`, `seedBackupRequest` helpers and add
-  the new tables to `resetSeededTables`'s `TRUNCATE` list.
-
----
-
-## 3. Lifecycle state machine (the UI's spine)
-
-`server_group_backup_config.status ∈ { 'provisioning', 'escrow_pending',
-'ready' }`. The UI renders one of three top-level states per group, plus the
-"no config yet" zero-state and an explicit "init failed" sub-state:
-
-```
-(no row)        ──[Set up backups: create config]──► provisioning
-provisioning    ──[init Job creates repo, from-birth]──► escrow_pending
-provisioning    ──[init Job creates repo, import mode]──► ready
-provisioning    ──[init Job fails]──► provisioning + last_init_error shown (retry available)
-escrow_pending  ──[operator acks "saved to Bitwarden"]──► ready
-ready           ──[edit non-structural config]──► ready
-```
-
-- The UI **does not** itself run the init Job; it calls `backups.create_repo`,
-  which records intent / sets `status='provisioning'` and lets the jobs-side
-  init-Job machinery (owned by the maintenance/jobs component) pick it up. The
-  UI polls config status (`useReloadInterval`, like the incidents badge) to
-  reflect `provisioning → escrow_pending/ready`.
-- **Backups are dormant until `ready`**: this is enforced on the device path
-  (412/409), not in this UI; the UI surfaces *why* (status chip + helper text)
-  so an operator isn't confused that "configured" ≠ "live".
-- **Import mode** skips escrow: `create_repo` with `mode='import'` moves
-  `provisioning → ready` once the repo connects (operator already holds the
-  passphrase / points `repo_password_ref` at an existing Secret).
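
The transitions in the diagram above can be written down as a small pure function. This is only a sketch: the `Status`, `Mode`, and `Event` enums and `next_status` are hypothetical names (the spec stores `status` and `mode` as `TEXT`), and in the real system most of these transitions are performed by the jobs-side component, not by the UI.

```rust
// Sketch only: the lifecycle transitions as data. All names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Provisioning,
    EscrowPending,
    Ready,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    FromBirth,
    Import,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    InitSucceeded,
    InitFailed,
    EscrowAcked,
    ConfigEdited,
}

/// Returns the next status, or `None` when the event is not valid in the
/// current state (for example, acknowledging escrow on a `ready` group).
pub fn next_status(status: Status, mode: Mode, event: Event) -> Option<Status> {
    match (status, event) {
        (Status::Provisioning, Event::InitSucceeded) => Some(match mode {
            Mode::FromBirth => Status::EscrowPending,
            Mode::Import => Status::Ready,
        }),
        // A failed init stays in `provisioning`; `last_init_error` carries the detail.
        (Status::Provisioning, Event::InitFailed) => Some(Status::Provisioning),
        (Status::EscrowPending, Event::EscrowAcked) => Some(Status::Ready),
        (Status::Ready, Event::ConfigEdited) => Some(Status::Ready),
        _ => None,
    }
}
```

Returning `None` for anything off the diagram gives a handler one place to reject out-of-order requests, such as an escrow acknowledgment arriving before the repo exists.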
- -How the init Job's outcome reaches `status` and `last_init_error` is the -jobs-side component's contract (see §6 consumed contracts). The UI only reads -those fields; it must not assume an in-process transition. - ---- - -## 4. Data shapes (DB) - -These tables come from the backup-credentials data model; the UI reads/writes -the subset below. Migrations are created with `just migration NAME` (never -hand-authored — per AGENTS.md). Two of these (`status`, `last_init_error`, -`mode`, `repo_password_ref`, escrow tracking) are the columns the UI most -depends on, so if the base table is authored elsewhere, confirm these exist. - -### `server_group_backup_config` (read + write by UI) - -Per the main plan's schema, plus the columns the UI lifecycle needs. If the -base table predates this work, the UI requires at minimum: - -```sql --- (from backup-credentials.md "New table: server_group_backup_config") -group_id UUID PRIMARY KEY REFERENCES server_groups(id) ON DELETE CASCADE, -bucket TEXT NOT NULL, -prefix TEXT NOT NULL DEFAULT '', -target_role_arn TEXT NOT NULL, -region TEXT, -expected_interval INTERVAL, -- NULL = manual-only -retention JSONB NOT NULL, -- kopia keep-* policy -repo_password_ref TEXT NOT NULL, -status TEXT NOT NULL, -- 'provisioning'|'escrow_pending'|'ready' -created_at TIMESTAMPTZ NOT NULL DEFAULT now(), -updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), --- UI/lifecycle additions this component needs (confirm/author): -mode TEXT NOT NULL DEFAULT 'from_birth', -- 'from_birth' | 'import' -last_init_error TEXT, -- set when the init Job fails; cleared on success/retry -escrow_acked_at TIMESTAMPTZ, -- set when the operator acks the reveal-once (from-birth) -escrow_acked_by TEXT -- operator identity (from TailscaleAdmin) -``` - -`expected_interval` maps to/from the UI via the same minutes/seconds pattern -`GroupEdit.tsx` uses for `slack_open_delay`, except the API column is -`INTERVAL`. 
Reuse `database::pg_duration::PgDuration` (already used on -`server_groups.slack_open_delay`) and `#[schema(value_type = Option)]` -so the wire type is seconds. **NULL `expected_interval` = manual-only** must be -representable distinctly from `0`; the form needs a "Manual only (no schedule)" -toggle, not just an empty number field. - -`retention` is a small JSON object; on the wire model it as a typed struct -(not raw `serde_json::Value`) so `openapi-typescript` emits a real shape: - -```rust -#[derive(Serialize, Deserialize, ToSchema)] -pub struct RetentionPolicy { - pub keep_latest: i32, // default 1 (not floor-enforced) - pub keep_daily: i32, // floor 7 - pub keep_weekly: i32, // floor 4 - pub keep_monthly: i32, // floor 6 - pub keep_annual: i32, // default 0 -} -``` - -The org-minimum **floor** (`keep_daily ≥ 7, keep_weekly ≥ 4, keep_monthly ≥ 6`) -is enforced in the model/handler on create+update; the UI also validates -client-side (helper text + disabled submit) but the server is authoritative -(returns `400 AppError::BadRequest`-style problem-details on violation). - -### `backup_requests` (write + read by UI — "backup now") - -```sql -CREATE TABLE backup_requests ( - server_id UUID NOT NULL REFERENCES servers(id), - purpose TEXT NOT NULL, -- "backup" | "restore" - requested_at TIMESTAMPTZ NOT NULL DEFAULT now(), - requested_by TEXT, -- operator identity (TailscaleAdmin login) - PRIMARY KEY (server_id, purpose) -); -``` - -The "backup now" button targets a **server** (the request is keyed by -`server_id`), so the UI offers per-server one-off triggers within the group's -backup panel (the group's member servers come from `server_groups.get`). A -group-wide "backup all members" convenience can fan out to one row per member -server (open question §8). Cleared by the device path on report; the UI shows -"requested , pending" while a row exists. 
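The org-minimum retention floor described earlier in this section is the kind of rule that is easy to check in one place on the server. A minimal sketch follows, assuming a hypothetical helper named `check_retention_floor`; the struct repeats the spec's `RetentionPolicy` fields but omits the serde/utoipa derives so that it compiles with the standard library alone.

```rust
/// Mirrors the spec's `RetentionPolicy` fields (derives for serde/utoipa omitted).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RetentionPolicy {
    pub keep_latest: i32,
    pub keep_daily: i32,
    pub keep_weekly: i32,
    pub keep_monthly: i32,
    pub keep_annual: i32,
}

// Org-minimum floors as stated in the spec.
const FLOOR_DAILY: i32 = 7;
const FLOOR_WEEKLY: i32 = 4;
const FLOOR_MONTHLY: i32 = 6;

/// Hypothetical helper: returns one message per violated floor.
/// `keep_latest` and `keep_annual` have defaults but no floor, so they are not checked.
pub fn check_retention_floor(policy: &RetentionPolicy) -> Result<(), Vec<String>> {
    let checks = [
        ("keep_daily", policy.keep_daily, FLOOR_DAILY),
        ("keep_weekly", policy.keep_weekly, FLOOR_WEEKLY),
        ("keep_monthly", policy.keep_monthly, FLOOR_MONTHLY),
    ];
    let mut violations = Vec::new();
    for (name, value, floor) in checks {
        if value < floor {
            violations.push(format!("{} is {}, below the floor of {}", name, value, floor));
        }
    }
    if violations.is_empty() {
        Ok(())
    } else {
        Err(violations)
    }
}
```

Collecting every violation instead of stopping at the first lets a handler return all of them in a single bad-request response, so the form can annotate each offending field at once.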
- -### Read-only display tables (read by UI) - -- `backup_repo_stats` (PK `group_id`): `snapshot_count`, `source_count`, - `logical_bytes`, `physical_bytes`, `bucket_bytes` (nullable), `observed_at`. -- `backup_runs` (recent N per group): `device_id`, `purpose`, `outcome`, - `error`, `bytes_uploaded`, `snapshot_id`, `reported_at`. -- `backup_maintenance_runs` (recent N per group): `kind`, `started_at`, - `finished_at`, `outcome`, `error`, `bytes_reclaimed`. -- `backup_repo_snapshots` (optional, for a per-source "latest snapshot" list): - `source`, `server_id`, `latest_snapshot_at`. - -### Escrow secret read - -The reveal-once passphrase is read from the **k8s Secret** named by -`repo_password_ref`. This requires `public-server`/the relevant pod to have a -kube client + Secret-read RBAC — that machinery is **net-new and owned by the -AWS/k8s-infra component**, not this UI. The escrow reveal endpoint -(`backups.reveal_escrow`) consumes that kube-client capability on `AppState`. -If private-server does not have the kube client at the time this lands, see -§8 open question on where the escrow read executes. - ---- - -## 5. Interfaces this component EXPOSES - -All under `/api/backups/`, POST, `TailscaleAdmin`-gated unless noted. -Argument/response structs live in `backups.rs` with `#[derive(…, ToSchema)]`; -operation ids prefixed `backups_`. Names are the contract for the React layer -and any other consumer. 
- -| fn | gate | args | returns | purpose | -|----|------|------|---------|---------| -| `backups_get` | user | `{ server_group_id }` | `BackupConfigView \| null` | full config + lifecycle for a group (null = no config) | -| `backups_list` | user | `{}` | `Vec` | all configured groups (fleet overview) | -| `backups_create` | admin | `CreateBackupConfigArgs` | `BackupConfigView` | insert config row (`status='provisioning'`), validate floor; does **not** create repo | -| `backups_update` | admin | `UpdateBackupConfigArgs` | `BackupConfigView` | edit config on a `ready` group. RESOLVED (impl): `UpdateBackupConfigArgs` carries **only `region`** — structural fields are inexpressible (no 409 path). (Interval/retention are per-`(group, type)` on the schedule table, edited separately.) | -| `backups_create_repo` | admin | `{ server_group_id }` | `BackupConfigView` | record intent for the init Job (sets/keeps `provisioning`, clears `last_init_error`); idempotent retry | -| `backups_reveal_escrow` | admin | `{ server_group_id }` | `RevealEscrowResponse` | reveal-once passphrase (from-birth, `escrow_pending` only); reads the k8s Secret | -| `backups_ack_escrow` | admin | `{ server_group_id }` | `BackupConfigView` | flip `escrow_pending → ready`, stamp `escrow_acked_at/by` | -| `backups_request_now` | admin | `{ server_id, purpose }` | `()` | upsert a `backup_requests` row (one-off "backup now") | -| `backups_cancel_request` | admin | `{ server_id, purpose }` | `()` | delete a pending `backup_requests` row | -| `backups_stats` | user | `{ server_group_id }` | `BackupStatsView` | `backup_repo_stats` + recent `backup_runs` + recent `backup_maintenance_runs` + pending requests | -| `backups_delete` | admin | `{ server_group_id }` | `()` | delete the config row (decommission; see audit-table FK note in main plan) | - -### Response/argument shapes (wire) - -```rust -pub struct BackupConfigView { - pub server_group_id: Uuid, - pub bucket: String, - pub prefix: String, - pub 
target_role_arn: String, - pub region: Option<String>, - #[schema(value_type = Option)] // seconds; None = manual-only - pub expected_interval: Option<PgDuration>, - pub retention: RetentionPolicy, - pub mode: BackupRepoMode, // FromBirth | Import (serde snake_case) - pub status: BackupConfigStatus, // Provisioning | EscrowPending | Ready - pub last_init_error: Option<String>, - pub escrow_acked_at: Option<Timestamp>, - pub created_at: Timestamp, - pub updated_at: Timestamp, - // NOTE: never includes repo_password_ref's *value* — only reveal_escrow does. -} - -pub struct CreateBackupConfigArgs { - pub server_group_id: Uuid, - pub bucket: String, - #[serde(default)] pub prefix: String, - pub target_role_arn: String, - pub region: Option<String>, - #[schema(value_type = Option)] - pub expected_interval: Option<PgDuration>, - pub retention: RetentionPolicy, - pub mode: BackupRepoMode, - /// Import mode only: name of a pre-existing k8s Secret holding the - /// passphrase. From-birth leaves this None (Canopy generates + names it). - pub repo_password_ref: Option<String>, -} - -pub struct RevealEscrowResponse { - pub passphrase: String, // shown once; UI must not persist - pub repo_password_ref: String, // the Secret name, for the "saved where" note -} - -pub struct BackupStatsView { - pub stats: Option, // None until first inspection - pub recent_runs: Vec, // most-recent first, capped (e.g. 20) - pub recent_maintenance: Vec, - pub pending_requests: Vec, // server_id, purpose, requested_at, requested_by -} -``` - -Use `commons_types::Uuid` and `jiff::Timestamp` to match the rest of the -codebase (`server_groups.rs` uses these). Status/mode are string enums with -`#[serde(rename_all = "snake_case")]` so the generated TS unions read -`"provisioning" | "escrow_pending" | "ready"` and `"from_birth" | "import"`.
- -### Error contract (problem-details) - -Reuse existing `AppError` variants; map to the documented statuses in -`#[utoipa::path(responses(...))]`: - -- `404` (`AppError`'s not-found path) — only for a bad `server_group_id` - (the group itself is missing). `backups_get` returns `null` for "no config" - rather than 404, matching the "zero-state" UI. -- `400` — retention floor violation (use a bad-request variant). A `create` - for a group that already has config should prefer - `AppError::Conflict(String)` (→ 409) over 400; pick per existing - `commons-errors` variants and **update ERRORS.md** if a new variant is added - (per AGENTS.md, heading must match the problem type). -- `409` (`AppError::Conflict`) — `reveal_escrow`/`ack_escrow` called when - `status != 'escrow_pending'`, or `create_repo` on an already-ready group. - RESOLVED (impl) — the "409 on **structural-field** update" path **does not - exist by design**: `UpdateBackupConfigArgs` carries **only `region`**, so a - bucket/role/mode edit is simply *inexpressible* over the wire rather than - rejected at runtime. Field-omission supersedes the 409. -- `502` — `reveal_escrow` if the k8s Secret read fails (control-plane error). - ---- - -## 6. Interfaces this component CONSUMES - -From **other backup-credentials components** (must exist first or be stubbed): - -- **Data model component (canopy DB):** the migrations/tables in §4 and the - base `database::server_group_backup_config` / `backup_requests` / - `backup_runs` / `backup_maintenance_runs` / `backup_repo_stats` models. - Contract: model structs with the columns in §4; the UI adds query methods - (`get_for_group`, `list_configured`, `create`, `update`, `set_status`, - `ack_escrow`, recent-runs queries) — author these in the database crate.
-- **AWS/k8s-infra component:** a **kube client on `AppState`** with Secret-read - capability, so `backups_reveal_escrow` can read the passphrase Secret named - by `repo_password_ref`. This is net-new (canopy has no kube client today). - Contract consumed: `state.kube` (or equivalent) + a helper like - `read_secret(name, key) -> Result`. The UI does **not** create - Secrets — from-birth passphrase generation + Secret creation is the init - Job's job; the UI only reveals. -- **Jobs/maintenance component:** the **init Job** that performs - `kopia repository create` and drives `status`/`last_init_error`. Contract - consumed: the UI's `backups_create_repo` records intent (sets - `status='provisioning'`, clears `last_init_error`); the Job is expected to - transition the row to `escrow_pending` (from-birth) or `ready` (import), or - set `last_init_error` on failure. The exact handoff (a flag column, a queue, - or the Job polling `provisioning` rows) is the jobs component's decision — - this UI only depends on the *observable* `status`/`last_init_error` fields. -- **Device path / detection components:** none consumed directly; the UI - surfaces their *output* (runs, staleness via the existing issues/events - model already shown on the server/group pages — no new wiring here). - UPDATE (shipped): the **group-scoped** issues raised by detection/inspection - (nullable `server_id`, see the detection-preflight spec's `raise_group_event`) - are rendered — `IssueRow`/the `issues` fn handle a null `server_id` and key - off `server_group_id`/`server_group_name` for group-scoped issues (member - servers resolved via `Server::group_refs_by_server_ids`), so a corruption / - preflight alert with no member server shows correctly. - -From **existing canopy code** (already present): - -- `commons_servers::tailscale_auth::TailscaleAdmin` extractor (gate + - operator identity for `requested_by` / `escrow_acked_by`). 
Confirm how to - extract the login string from it (mirror whatever `admins.rs` / audited - endpoints do). -- `database::server_groups::ServerGroup` (member list for per-server "backup - now"; group existence checks). -- React: `useApi` / `useApiAction` (`private-web/src/api.ts`), - `useIsAdmin` (`hooks/useIsAdmin.tsx`), `useReloadInterval` - (status polling), `TimeAgo`, `usePageTitle`, `TagsEditor` pattern. - ---- - -## 7. Frontend behaviour detail - -### `BackupPanel` (`/groups/:id/backups`, and a card on `GroupDetail`) - -- `useApi("backups", "get", { server_group_id: id }, [id])`. - - `null` → zero-state: "Backups not set up" + admin-only "Set up backups" - button → `/groups/:id/backups/config`. - - non-null → status chip (`provisioning`/`escrow_pending`/`ready` with the - same intent-helper-text pattern as `SEVERITY_INTENT`), config summary - (bucket, region, interval or "Manual only", retention), and: - - `provisioning` → spinner + "Creating repository…"; if `last_init_error`, - an error Alert + admin "Retry repo creation" (`create_repo`). - - `escrow_pending` → prominent warning card → render `BackupEscrow`. - - `ready` → stats (`backups.stats`), recent runs table, per-server - "Backup now" buttons. -- Poll status with `useReloadInterval` (e.g. 5s while `provisioning`, slower - when `ready`) so the operator sees the init Job land without a manual reload. - -### `BackupConfig` (`/groups/:id/backups/config`) - -- Create vs edit split like `GroupEdit.tsx` (`isCreate = no config row`). -- Fields: bucket, target_role_arn, region (optional), **schedule mode toggle** - (Manual only ↔ Scheduled every N minutes — `expected_interval`), retention - (5 number fields with floor validation + helper text), repo mode - (From-birth ↔ Import; Import reveals a `repo_password_ref` field). -- Structural fields (bucket, target_role_arn, mode) are **create-only**; - disabled in edit mode with helper text ("changing the bucket is a repo - migration — out of scope here"). 
-- On create success → if from-birth, `create_repo` is offered as the next step - (or auto-called) so the operator flows into provisioning → escrow. - -### `BackupEscrow` - -- Renders only when `status === 'escrow_pending'` and `mode === 'from_birth'`. -- "Reveal passphrase" button → `useApiAction("backups", "reveal_escrow")`; - shows the passphrase in a monospace, copy-to-clipboard block with a loud - "Save this to Bitwarden NOW — it cannot be shown again" warning. -- A required checkbox "I have saved this passphrase to Bitwarden" enables the - "Acknowledge & activate backups" button → `ack_escrow` → flips to `ready`. -- The reveal is deliberately re-callable while `escrow_pending` (operator may - reload before acking); once `ready`, `reveal_escrow` returns 409. - -### Admin gating - -- Read views (`get`/`list`/`stats`) render for any Tailscale user (user-gate), - matching `server_groups::list`/`get`. -- All mutating buttons gate on `useIsAdmin() === true`, mirroring - `GroupDetail.tsx`'s `admin && (