diff --git a/.workhorse/specs/jobs/backup.md b/.workhorse/specs/jobs/backup.md
new file mode 100644
index 00000000..a99f4e99
--- /dev/null
+++ b/.workhorse/specs/jobs/backup.md
@@ -0,0 +1,67 @@
+---
+id: BKJ
+---
+
+# Backup control plane
+
+Canopy maintains, verifies, and watches over the fleet's backups itself: clients neither run maintenance nor hold the rights to do so.
+This is the autonomous half of the backup system: the work Canopy does on a cadence with no device asking, and the health signals it raises from it.
+
+## Scope
+
+This spec covers Canopy's own background backup work: repo maintenance, inspection, storage metering, upstream preflight, and the detection and alerting that turn all of it into incidents.
+
+It does not cover the device contract (see [BAK](../public-server/backup.md)), the operator's configuration of a group (see [BKO](../private-server/backup.md)), or restore-health (the managed restore replicas spec, `RST`).
+
+Canopy acts only on groups whose configuration is ready, runs at most one operation per group at a time, and bounds how many groups it works on at once.
+
+## Maintenance
+
+Canopy runs each group's repo maintenance on a cadence; clients are never granted the rights to run it.
+It enforces the group's retention as part of maintenance, and records every run's outcome so a stuck or failing maintenance is itself detectable.
+
+## Passphrase rotation
+
+Canopy rotates each group's repo passphrase on a cadence, so a leaked passphrase is useful only until the next rotation rather than indefinitely.
+Rotation is crash-safe: an interrupted rotation is reconciled on the next attempt, and throughout it the repo stays openable with either the previous or the new passphrase — it is never left unopenable.
+Like maintenance, rotation is Canopy's to do; operators never run it.
+
+## Recovery escrow
+
+Because Canopy holds the only copy of every passphrase, it continuously escrows the state needed to recover access without Canopy: the per-group passphrases and repo coordinates, and the group, server, configuration, schedule, and capability records that frame them.
+The escrow is encrypted to a set of offline recipient keys whose private halves Canopy never holds, and written to versioned, object-locked storage.
+So Canopy can write the escrow but never read it back — a full Canopy compromise cannot disclose the escrowed secrets, and object-lock keeps past versions undeletable until they expire.
+Recipients are mandatory: Canopy refuses to run without them, so there is never a silent recovery gap.
+This is the escrow the operator recovery ceremony verifies (see [BKO](../private-server/backup.md)).
+
+## Inspection
+
+Canopy periodically inspects each group's repo against the storage directly, independent of what devices reported:
+
+- it verifies repo integrity, and a failed verification is repo corruption;
+- it inventories the repo — the latest snapshot per source — as the ground truth a device's report is reconciled against;
+- it records repo size, logical and physical, and the storage cost basis for display.
+
+## Upstream preflight
+
+Canopy watches its own access to each group's storage, so a broken control plane is caught at the source rather than when the fleet starts failing.
+It checks that its identity resolves, that it can assume each group's role and perform a read-only no-op, and that the bucket's object-lock is present with at least the required retention.
+Preflight only alerts; it never pulls Canopy out of service, because a failing check must not make a degraded situation worse.
+
+## Detection
+
+Canopy reconciles three sources — what a device reported, what credentials were issued, and what actually landed in the repo — and alerts on disagreement:
+
+- **staleness** — a server with a prior successful backup but none recent, or one that has never backed up though it has been expected long enough.
+- **reconcile** — a device reported a successful backup but no matching snapshot landed (the report is false or the upload didn't persist), or a fresh snapshot exists but no recent report (the reporting path is broken).
+- **maintenance** — a group whose maintenance is overdue, or whose most recent maintenance failed.
+
+## Alerting
+
+Backup alerts are raised at one of two scopes:
+
+- **Per-server** signals (staleness, never-backed-up, the report-gap) are subject to the server's monitoring gate: still recorded for visibility, but they contribute to an incident only when the server is monitored, because some servers are intentionally intermittent.
+- **Group-level** signals (repo corruption, maintenance failure, missing-snapshot reconciliation, preflight failures, and restore-verification — see the managed restore replicas spec, `RST`) page regardless of any member's monitoring state, because they are control-plane or data-safety concerns that belong to no single server.
+
+Each signal has a stable key by which operators silence or snooze it and by which the interface and notifications refer to it; the keys are a contract and are not renamed without migrating stored silences.
+A signal recovers when the condition that raised it clears.
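The reconcile rule in the Detection section of BKJ can be sketched as follows. This is a minimal illustration under stated assumptions, not Canopy's implementation: the `Report` and `Snapshot` shapes, the 24-hour window, and the signal key strings are all hypothetical, and the spec does not say how freshness is measured.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Report:
    """What a device said about a run (hypothetical shape)."""
    server_id: str
    snapshot_id: str
    succeeded: bool
    reported_at: datetime


@dataclass(frozen=True)
class Snapshot:
    """What inspection found in the repo (hypothetical shape)."""
    server_id: str
    snapshot_id: str
    created_at: datetime


def reconcile(reports, snapshots, now, window=timedelta(hours=24)):
    """Return sorted (signal, server_id) pairs where reports and repo disagree."""
    landed = {(s.server_id, s.snapshot_id) for s in snapshots}
    signals = set()

    # A successful report with no matching snapshot: the report is false
    # or the upload did not persist.
    for r in reports:
        if r.succeeded and (r.server_id, r.snapshot_id) not in landed:
            signals.add(("missing-snapshot", r.server_id))

    # A fresh snapshot with no recent report: the reporting path is broken.
    recently_reported = {
        r.server_id for r in reports if now - r.reported_at <= window
    }
    for s in snapshots:
        if now - s.created_at <= window and s.server_id not in recently_reported:
            signals.add(("report-gap", s.server_id))

    return sorted(signals)
```

Under the Alerting rules, the first signal would page at group scope and the second would be subject to the server's monitoring gate.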
diff --git a/.workhorse/specs/private-server/backup.md b/.workhorse/specs/private-server/backup.md
new file mode 100644
index 00000000..774bc033
--- /dev/null
+++ b/.workhorse/specs/private-server/backup.md
@@ -0,0 +1,70 @@
+---
+id: BKO
+---
+
+# Operator backup control
+
+An operator configures, through Canopy, how a server group backs up: where its repo lives, on what cadence, with what retention, and which servers and types participate.
+Canopy owns the repo passphrase throughout — it is generated or accepted once, stored in Canopy's secret store, and never handed back except through the audited recovery ceremony.
+
+## Scope
+
+This spec covers the operator-facing control surface: per-group backup configuration and its lifecycle, scheduling and retention, per-server participation, on-demand backups, the status view, and passphrase recovery.
+
+It does not cover the device contract (see [BAK](../public-server/backup.md)) or Canopy's autonomous maintenance, inspection, detection, and alerting (see [BKJ](../jobs/backup.md)).
+
+Reads are available to any tailnet user; changes require an administrator.
+
+## Per-group configuration
+
+A group has at most one backup configuration: the bucket, prefix, region, the cross-account roles Canopy assumes, the reference to the group's passphrase, and its placement and lifecycle state.
+
+Placement is one of:
+
+- **external** — the operator brings their own bucket and supplies the role ARNs Canopy will assume.
+- **shared** — Canopy provisions and names a bucket in its own shared account; the operator supplies nothing about location.
+
+A configuration is created once and its structural fields (bucket, roles, placement) are immutable; the region and the operational settings below are editable.
+Decommissioning a group deletes its configuration row — which stops all credential issuance for the group — and deletes the Canopy-owned passphrase.
+The bucket and its object-locked contents persist independently and are not Canopy's to delete; teardown is a separate, deliberate act gated by the lock window.
+
+## Lifecycle and provisioning
+
+A configuration moves from **provisioning** to **ready**; devices are refused until it is ready.
+Creating a configuration sets it provisioning and asks Canopy to create or connect the repo; that work transitions the configuration to ready, or records the error it failed with so the operator sees why.
+The operator interface depends only on these observable states, not on how provisioning is carried out.
+
+A configuration may also be created or reconciled idempotently by machine — for infrastructure-as-code — under administrator-equivalent authentication, with the same probe and provisioning behaviour as the interactive path.
+
+## Setup and the passphrase
+
+When a configuration is created, Canopy probes the target bucket and classifies it: empty, an existing kopia repo, holding unrelated content, or inaccessible.
+The classification chooses the mode:
+
+- **from-birth** — an empty bucket; Canopy generates a fresh passphrase and creates a new repo.
+- **passphrase** — an existing repo; the operator supplies its passphrase and Canopy connects to it.
+
+A bucket holding unrelated content is refused rather than written into; Canopy never deletes to make room.
+Either way Canopy creates and owns the passphrase secret, and configuration and secret are created together — if the secret cannot be stored, the configuration is rolled back, so a configuration never exists without its passphrase.
+The supplied or generated passphrase is only the starting point: Canopy rotates it on a cadence thereafter (see [BKJ](../jobs/backup.md)), and the recovery ceremony recovers whatever the current passphrase is.
+
+## Scheduling and retention
+
+Each `(group, type)` has an expected backup interval and a retention policy, taken from a per-`(group, type)` override when set, otherwise from the fleet-wide default for that type.
+A manual-only type has no interval and is backed up only on an explicit request.
+Retention is floored to an organisational minimum; a configuration may deliberately opt out of the floor, which is recorded as the dangerous choice it is.
+
+## Participation and on-demand
+
+A server participates in a type when that type is an enabled capability on it; an operator toggles participation per `(server, type)`.
+An operator may queue a one-off backup — or restore — for a `(server, type)` to run on the next cycle, and may cancel a queued one before it runs.
+
+## Status
+
+The operator can see, per group: the repo's size and cost basis, recent runs with their outcomes and errors, recent maintenance, the latest snapshot per server, and any in-flight or pending one-off requests.
+
+## Passphrase recovery
+
+Because Canopy owns the only copy of each passphrase, the ability to recover it without Canopy is verified, not assumed.
+Recovery is a ceremony: each passphrase is escrowed, encrypted to a set of offline recipient keys, and an operator periodically proves the recipients can still decrypt it.
+The ceremony is recorded so staleness — too long since the last successful proof — is visible.
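The "Scheduling and retention" rules in BKO (override, else fleet default, then the organisational floor) can be sketched as below. This is only an illustration: the `Policy` shape, expressing retention as a number of days, and treating an override as replacing the whole policy rather than individual fields are assumptions the spec does not make.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-(group, type) policy."""
    interval_hours: Optional[int]  # None models a manual-only type
    retention_days: int


def effective_policy(override, fleet_default, floor_days, opt_out_of_floor=False):
    """Resolve the policy for one (group, type)."""
    # The per-(group, type) override wins when set; otherwise use the
    # fleet-wide default for the type.
    chosen = override if override is not None else fleet_default
    # A configuration may deliberately opt out of the retention floor.
    if opt_out_of_floor:
        return chosen
    # Otherwise retention is raised to the organisational minimum.
    return Policy(chosen.interval_hours, max(chosen.retention_days, floor_days))
```

For example, with a fleet default of `Policy(24, 7)` and a 30-day floor, a group with no override resolves to `Policy(24, 30)`.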
diff --git a/.workhorse/specs/public-server/backup.md b/.workhorse/specs/public-server/backup.md
new file mode 100644
index 00000000..b132121b
--- /dev/null
+++ b/.workhorse/specs/public-server/backup.md
@@ -0,0 +1,63 @@
+---
+id: BAK
+---
+
+# Device backups
+
+A server device backs up to object storage that Canopy mediates: it holds no long-lived storage credentials and knows nothing of the bucket layout until Canopy tells it, per run.
+Canopy is the control plane — it owns the credentials, the repo location, the passphrase, and the record of what ran.
+
+## Scope
+
+This spec covers the device-facing contract: how a device learns what it may back up, obtains short-lived credentials and the repo coordinates for a run, and reports the outcome.
+
+It does not cover what an operator configures (see [BKO](../private-server/backup.md)), what Canopy does on its own — maintenance, inspection, detection, alerting (see [BKJ](../jobs/backup.md)) — or restoring backups (the managed restore replicas spec, `RST`).
+
+## Identity and resolution
+
+A device authenticates with the `server` role, over either transport Canopy accepts (a client certificate on the internet-facing path, or tailnet identity on the private mount).
+Every device request resolves through the authenticated identity, never the request body: device → its single live server → that server's group → the group's backup configuration.
+A device bound to no live server is refused; a server with no group, or whose group has no ready configuration, is refused.
+
+## Capabilities
+
+A device registers the backup types it can run on its server.
+A newly seen type is enabled for scheduling or not according to that type's fleet default; a type already known keeps the operator's setting.
+Registration requires the server to be grouped, but not the group's configuration to be ready.
+
+## Credentials
+
+A device requests credentials for a `(type, purpose)`.
+Canopy issues short-lived credentials by assuming the group's dedicated cross-account storage role under a session policy that confines them to the group's bucket and prefix:
+
+- **backup** purpose grants the write set kopia needs, including a version-less delete — but never deletion of a locked version, nor any weakening of object-lock or retention.
+- **restore** purpose grants read-only access.
+
+The credentials carry the storage role's identity for at most an hour; a device refreshes them when a run outlasts that lifetime.
+Every issuance is recorded before the credentials are returned.
+
+A `(type, purpose)` is issuable only when the type is an enabled capability of the server, or an operator has queued a one-off request of that purpose for it; otherwise it is refused.
+The group's configuration must be ready: until then the endpoints refuse, so a half-provisioned group cannot be written to.
+
+## Target
+
+A device fetches the repo coordinates for its group each run: the storage kind, bucket, prefix, region, and the repo passphrase.
+The passphrase is Canopy-owned and read from the group's secret store at request time; the device never stores it.
+
+## Reporting
+
+A device reports each run's outcome: the type and purpose, success or failure, an error when it failed, the resulting snapshot identifier, the bytes uploaded, and the object-storage traffic the run moved.
+The run is keyed by an identifier the device mints at the start of the run; the device, server, and group are taken from the authenticated context, so a device cannot report a run as another group's.
+A duplicate run identifier is refused.
+Reporting a run clears any matching operator one-off request, so the standing "back up now" prompt stops.
+
+## Guarantees
+
+A compromised device cannot destroy backups.
+Its credentials cannot delete a locked object version or weaken the bucket's object-lock; at worst it writes a delete-marker that object-lock and versioning leave recoverable.
+Decommissioning a device means revoking its certificate: it can no longer obtain credentials, and any it already holds expire within the hour.
+
+## Failure contract
+
+The device endpoints distinguish these failures: the caller is bound to no live server; the server is ungrouped, its group has no ready configuration, the type is not issuable, or a run identifier is a duplicate; and Canopy's own dependency — the credential issuer or the secret store — is unavailable or unconfigured.
+Each is a distinct, stable status so a device need not guess.
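The resolution chain and the issuability rule in BAK (device → server → group → ready configuration, then `(type, purpose)`) can be sketched as one check. Everything named here is hypothetical: the `Refusal` values, the `Server` shape, the `ready_groups` set, and the order of the last two checks are assumptions for illustration, not the statuses or data model Canopy uses.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Refusal(Enum):
    """Hypothetical stable refusal reasons; not the spec's status values."""
    NO_LIVE_SERVER = "no-live-server"
    UNGROUPED = "ungrouped"
    NOT_READY = "config-not-ready"
    NOT_ISSUABLE = "type-not-issuable"


@dataclass
class Server:
    """Hypothetical view of the server resolved from the caller's identity."""
    group_id: Optional[str]
    enabled_types: set = field(default_factory=set)
    queued_one_offs: set = field(default_factory=set)  # {(type, purpose)}


def check_issuable(server, ready_groups, backup_type, purpose):
    """Return None when credentials may be issued, else the refusal reason."""
    if server is None:  # device bound to no live server
        return Refusal.NO_LIVE_SERVER
    if server.group_id is None:  # server has no group
        return Refusal.UNGROUPED
    if server.group_id not in ready_groups:  # no ready configuration
        return Refusal.NOT_READY
    # Issuable when the type is an enabled capability, or an operator has
    # queued a one-off request of this purpose for it.
    if backup_type in server.enabled_types:
        return None
    if (backup_type, purpose) in server.queued_one_offs:
        return None
    return Refusal.NOT_ISSUABLE
```

The `server` argument stands for whatever the authenticated identity resolved to, never a value taken from the request body.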
diff --git a/docs/plans/backup-credentials-implementation-order.md b/docs/plans/backup-credentials-implementation-order.md
deleted file mode 100644
index d0ce8300..00000000
--- a/docs/plans/backup-credentials-implementation-order.md
+++ /dev/null
@@ -1,351 +0,0 @@
-# Backup-credentials — cross-repo implementation order
-
-Direction / ordering doc for building the backup-credentials system across the
-four repos (`canopy`, `ops/pulumi`, `bestool`, `pgro`). It does **not** restate
-the design — read [`backup-credentials.md`](./backup-credentials.md) for that,
-and each component spec for the how. This document answers one question: **in
-what order, and on which tracks, do we build it so nothing waits on something
-that isn't there yet.**
-
-## The eight component specs
-
-| # | Component | Repo | Spec |
-|---|-----------|------|------|
-| 1 | canopy-database (tables, models, migrations) | canopy | [specs/canopy-database.md](./specs/canopy-database.md) |
-| 2 | canopy-public-server (device endpoints + AWS/kube on AppState) | canopy | [specs/canopy-public-server.md](./specs/canopy-public-server.md) |
-| 3 | canopy-jobs-maintenance-inspection (maintenance/inspection/S3-metrics/init Jobs) | canopy | [specs/canopy-jobs-maintenance-inspection.md](./specs/canopy-jobs-maintenance-inspection.md) |
-| 4 | canopy-jobs-detection-preflight (staleness, reconciliation, group-level alerting, preflight) | canopy | [specs/canopy-jobs-detection-preflight.md](./specs/canopy-jobs-detection-preflight.md) |
-| 5 | canopy-operator-ui (private-server fns + private-web) | canopy | [specs/canopy-operator-ui.md](./specs/canopy-operator-ui.md) |
-| 6 | ops (per-bucket roles, IRSA/ServiceAccounts, OIDC, scheduler Deployments) | ops/pulumi | [../../../ops/pulumi/docs/canopy-backup-credentials.md](../../../ops/pulumi/docs/canopy-backup-credentials.md) |
-| 7 | bestool (device `backup-credentials` / `backup` subcommands) | bestool | [../../../bestool/docs/canopy-backup-credentials.md](../../../bestool/docs/canopy-backup-credentials.md) |
-| 8 | pgro (restore consumer + signal-3 restore-verification) | pgro | [../../../pgro/docs/canopy-backup-integration.md](../../../pgro/docs/canopy-backup-integration.md) |
-
----
-
-## Dependency graph (derived from each spec's provides / depends_on)
-
-Arrows mean "depends on / must exist or be stubbed first".
-
-```
-                         ┌─────────────────────────────────────────────┐
-                         │  SPIKE: kopia vs GOVERNANCE-default-retention │
-                         │  bucket, no client-side PutObjectRetention    │
-                         │  (gates ops A2 action-set + bestool kopia)    │
-                         └───────────────┬─────────────────────────────-┘
-                                         │ (verifies an assumption; doesn't block code start)
-                                         ▼
-   (1) canopy-database  ◄──────────────────────────── everything in canopy reads/writes these tables
-        │  tables, models, lib.rs re-exports, commons-types enums,
-        │  Option-B group-scoped-issues migration handled with (4)
-        │
-        ├──────────────┬───────────────────────┬──────────────────────┐
-        ▼              ▼                       ▼                      ▼
-   (2) public-server  (3) jobs-maint/insp   (4) jobs-detect/preflight  (5) operator-ui
-   AWS SDK + kube      kube client + Job-     group-level alerting       private-server fns
-   on AppState;        spawn lib; init Job;   (Option-B issues);         + private-web; reads
-   /backup-* endpoints maintenance/inspection staleness + reconcile;     status/stats; reveal
-        │              /S3-metrics schedulers preflight (AWS)            escrow (needs kube)
-        │                    │                      │
-        │   contracts: HTTP endpoint shapes, IRSA role ARNs / ServiceAccount subs, billing labels
-        ▼                    ▼                      ▼
-   (6) ops  ◄────────────────┴──────────────────────┘   provides per-bucket role ARNs, IRSA roles,
-        │   OIDC providers, scheduler Deployments; consumes canopy SA names + OIDC issuer URL
-        │
-   (7) bestool ◄── public-server endpoints (2) + the kopia spike
-        │
-   (8) pgro    ◄── restore endpoint + external-restore grant + first-party auth (canopy, later)
-                   + a non-chained / longer-lived restore-cred decision
-```
-
-Two cross-cutting net-new capabilities sit underneath most of canopy and are
-the real gate (see Critical path):
-
-- **AWS SDK + kube client** — first use anywhere in canopy. Lands on
-  `public-server` (component 2) and the `jobs` crate (components 3/4).
-- **ServiceAccount + IRSA + OIDC** — first ServiceAccount canopy has ever had.
-  Owned by ops (component 6), consumed by 2/3/4.
-
-The canopy↔ops boundary is **mutually dependent** and resolved by contract, not
-by serialising: canopy publishes the SA names + central-cluster OIDC issuer
-URL; ops publishes the per-bucket role ARNs + IRSA role ARNs. Each side codes
-against the agreed names and the two meet at deploy.
-
----
-
-## The early blocker: kopia-behaviour verification spike
-
-> **Concluded (from kopia docs/source + S3 semantics) — Branch A:** device
-> creds = `AWS_S3_MULTIPART_ACTIONS` (no `PutObjectRetention`/no delete);
-> repo created **non-lock-aware**; rely on the bucket's default GOVERNANCE
-> retention + versioning + lifecycle. `--session-token`,
-> `--override-hostname`, and `--point-in-time` are all supported. Two items
-> still want a **live confirm** (the no-`PutObjectRetention` write path, and
-> PIT on real AWS S3 per issue #4346). Full verdict + test script:
-> [`backup-credentials-kopia-spike.md`](./backup-credentials-kopia-spike.md).
-
-**Do this first, in parallel with stage 0, before committing the ops action-set
-and the bestool kopia wiring.** It's cheap, it's a known unknown, and it
-changes two specs if it comes out the wrong way.
-
-The question (from ops spec A2 / bestool open-Q 2 / canopy-database H3): does
-kopia **write and maintain** against an S3 bucket with **GOVERNANCE 30-day
-default Object-Lock retention** when the client has **no `s3:PutObjectRetention`**
-(the device action set is `AWS_S3_MULTIPART_ACTIONS`, delete- and
-retention-free)? Also confirm:
-
-- kopia's S3 backend honours `AWS_SESSION_TOKEN` temporary creds (pgro open-Q 2;
-  bestool credential_process path).
-- `--override-hostname` exists on the installed kopia for source-host = server-id
-  (bestool open-Q 3).
-- which `BucketSizeBytes` `StorageType` dimension a versioned+locked bucket emits
-  (jobs-maint open-Q 6) — needed by the S3-metrics task, lower-stakes, can trail.
-
-Outcome drives:
-- **ops A2**: device role is exactly `AWS_S3_MULTIPART_ACTIONS`, or that **plus**
-  `s3:PutObjectRetention` (safe under GOVERNANCE-without-bypass — can only
-  lengthen a lock). Don't finalise the managed policy until this is known.
-- **bestool kopia helpers**: connect/snapshot wiring and how creds reach kopia.
-
-Run it against a throwaway dev bucket the ops `backups` stack can stand up. If it
-comes back "kopia insists on PutObjectRetention", the fallback is already
-specified — re-grant it — so this never blocks, it just picks a branch. Start
-the spike at day 0; it must conclude before ops merges the action-set change and
-before bestool finalises the kopia connect path.
-
----
-
-## Critical path
-
-The longest chain of hard dependencies, and what unblocks the most downstream
-work, is:
-
-1. **Net-new enabling work** (the gate for all of canopy):
-   - **(1) canopy-database** — tables, models, `lib.rs` re-exports, and the
-     shared `commons-types` enums (`Purpose`/`Outcome`/`kind`). Nothing in
-     canopy compiles against these until they exist. This is the true
-     foundation; land it first.
-   - **AWS SDK + kube client deps + AppState wiring** (inside component 2) and
-     the **kube client + Job-spawn library** (inside component 3). First AWS/k8s
-     code in the repo; verify crate versions against the registry (no guessing),
-     pin `k8s-openapi` to the cluster's control-plane version.
-   - **(6) ops IRSA/ServiceAccount/OIDC + per-bucket role ARNs** — without the
-     ServiceAccount + IRSA trust, no canopy pod can `AssumeRole`, and without the
-     role ARNs there's no `target_role_arn` to put in config. This runs in
-     parallel with the canopy enabling work, joined by the ARN/SA-name contract.
-
-2. **The issuance hot path**: (1) → (2) public-server `/backup-credentials` +
-   `/backup-target` + `/backup-report`, against (6)'s role ARNs and Secret-read
-   RBAC. This is the contract bestool consumes.
-
-3. **bestool (7)** — needs (2)'s endpoint shapes live (or contract-frozen) and
-   the kopia spike concluded.
-
-4. **pgro (8)** — additive last stage; needs the restore endpoint, the
-   external-restore grant, first-party auth, and a longer-lived-cred decision,
-   none of which exist until canopy ships its restore surface.
-
-The single highest-leverage item is **canopy-database (1)**: components 2, 3, 4,
-and 5 all import its models. Land it, with the `commons-types` enums, before the
-four canopy tracks fan out. The second is the **ops IRSA/OIDC plumbing (6)**,
-because it's the longest-lead infra item and gates every AWS-touching code path
-at deploy time even though the code can be written against the contract earlier.
-
----
-
-## Build order (stages)
-
-Stages are sequencing guidance, not hard gates — within a stage, tracks run in
-parallel. A later stage starts when its named dependencies from the earlier
-stage are merged (or contract-frozen and stubbed).
-
-### Stage 0 — foundations (must land first)
-
-- **Spike**: kopia-behaviour verification (above). Parallel, concludes before
-  ops action-set + bestool kopia.
-- **(1) canopy-database**: the `backup_credentials` migration (all 7 tables),
-  `backups.rs` models + `lib.rs` re-exports, the `commons-types`
-  `Purpose`/`Outcome`/`kind` enums. Resolve its open decisions up front because
-  they ripple: enum representation (shared enums vs validated String), the
-  `backup_runs` client-supplied-PK → `AppError::Conflict` mapping, cascade
-  policy for stats/requests vs the no-cascade audit rule, and the
-  `backup_repo_snapshots.server_id` on-delete behaviour. DB-only tests via
-  `TestDb::run`.
-- **(6) ops — contract freeze + long-lead infra**: agree the names/ARNs both
-  sides code against (`canopyIssuerRoleArn`, `canopyJobsRoleArn`,
-  `canopy-issuer`/`canopy-jobs` ServiceAccount subs, central-cluster OIDC issuer
-  URL, per-bucket `deviceRoleArn`/`maintenanceRoleArn`, the
-  `billing.{product,stage,deployment}` label keys). Then start the actual
-  Pulumi: Component B (central ServiceAccounts + IRSA + RBAC, `spec.ts`
-  `serviceAccountName`), Component A1/A3/A4/A6 (per-bucket trust, Object-Lock
-  read action, ARN exports, lifecycle rules), Component C (OIDC provider per
-  deployment account). A2 (action-set reduction) waits on the spike.
-
-The canopy↔ops contract is the coordination spine for everything after.
-
-### Stage 1 — issuance hot path + enabling clients (the device-facing MVP)
-
-Depends on Stage 0's (1) and the (6) contract.
-
-- **(2) canopy-public-server**: add the AWS SDK + kube deps, the
-  `AppState.sts` / `AppState.kube` fields + `FromRef` impls + async init,
-  `AppError::Upstream` (502) + ERRORS.md, and the three handlers
-  (`/backup-credentials`, `/backup-target`, `/backup-report`) with the restore
-  session-policy builder. This component **owns** the AWS/kube-on-AppState
-  capability the rest of canopy reuses. Tests: the 412/409/502 resolution
-  matrix with `None` clients, the session-policy unit test, a stubbed-STS 200
-  path.
-
-This is the first end-to-end slice: a device can mint creds and report a run.
-
-### Stage 2 — control-plane jobs (parallel canopy tracks)
-
-All depend on Stage 0 (1) and reuse the AWS/kube patterns from Stage 1 (2).
-These three run in parallel with each other and with bestool.
-
-- **(3) canopy-jobs-maintenance-inspection**: the shared Job-spawn library
-  (recommended `commons-servers::backup_jobs` so private-server can call the
-  init-Job spawn without depending on the `jobs` crate), the three scheduler
-  bins (maintenance / inspection / S3-metrics), the kopia-Job arg contract, and
-  the migrations it owns (`backup_maintenance_runs`, `backup_repo_snapshots`,
-  `backup_repo_stats`) — coordinate with (1) on single-vs-split migration
-  ownership.
-- **(4) canopy-jobs-detection-preflight**: the **Option-B group-scoped-issues
-  migration** + the thorough `issues.rs` sweep (this is the largest single
-  decision in the system and the central new shared plumbing —
-  `raise_group_event` is consumed by the inspection Job in (3) and PGRO ingest
-  in (8)), the `backup_staleness` and `backup_preflight` bins, and the shared
-  `jitter_slot` helper. Resolve Option A vs B before building; recommend B.
-- **(6) ops — scheduler Deployments**: B4 wires the
-  `backup-maintenance`/`backup-inspection`/`backup-preflight` (and possibly
-  `backup-s3-metrics`/`backup-staleness`) single-replica Deployments on the
-  `canopy-jobs` SA, once the bin names are pinned by (3)/(4).
-
-Cross-track coordination inside Stage 2:
-- (3) and (4) **share** `commons-servers` helpers (`jitter_slot`,
-  retention-floor) and the `(source, ref)` alert keys — agree these once.
-- The group-level alerting path from (4) is a **prerequisite** for (3)'s
-  corruption alert and (4)'s own group-level refs — (4) should land the Option-B
-  plumbing early in the stage so (3) can call `raise_group_event`.
-
-### Stage 3 — operator UI + device client (parallel)
-
-- **(5) canopy-operator-ui**: private-server `/api/backups/*` fns + the React
-  screens. Depends on (1) models, reuses (2)'s kube client for `reveal_escrow`
-  (resolve open-Q: private-server gets its own `canopy-issuer` SA + Secret-read
-  RBAC — coordinate with ops open-Q 2), and depends on (3)'s init-Job contract
-  for the `provisioning → escrow_pending/ready` lifecycle. `just gen-openapi` +
-  Playwright e2e in the same change.
-- **(7) bestool**: the two subcommands + `CanopyClient` methods, against (2)'s
-  frozen endpoint shapes and the concluded kopia spike. The "back up now"
-  command-channel transport is **deferred upstream** — build the
-  transport-independent subcommands now; wire the trigger when canopy defines
-  the status-response payload.
-
-(5) and (7) are independent and parallel. (7) can start as soon as (2)'s
-endpoint contract is frozen, even before (3)/(4) land.
-
-### Stage 4 — PGRO (additive, last)
-
-- **(8) pgro**: restore-consumer CRD (`canopyBackup.group`), `fetch_restore_creds`
-  / `report_restore`, signal-3 `RestoreReport` into a future
-  `backup_restore_checks` table + the `restore-verification` group-level alert
-  (routed through (4)'s `raise_group_event`).
-
-PGRO is explicitly last because it needs canopy-side surfaces that don't exist
-until the earlier stages ship:
-- a **restore-credentials** path (purpose=restore creds + target + repo
-  password) — built on (2);
-- the **external-restore grant** (operator-authorized, audited "consumer pgro
-  may read group X read-only") — net-new canopy authz surface;
-- a **first-party non-device auth** path (Tailscale now, OIDC later) — joint
-  canopy+ops design;
-- the **`backup_restore_checks` table + ingest endpoint + signal-3 detection**;
-- a **decision on longer-lived / non-chained restore creds** so restores >1h
-  survive (mirror the maintenance-Job direct web-identity). **This decision is
-  owed by canopy and should be made during Stage 2** (when the maintenance-Job
-  direct-web-identity path is built) so PGRO isn't blocked on it in Stage 4.
-
----
-
-## Parallelizable tracks (one per repo)
-
-Once Stage 0's (1) + (6)-contract land, the repos proceed largely in parallel,
-coordinated only by the contracts named below.
-
-- **canopy track**: (1) → then (2), (3), (4), (5) fan out. (2) blocks (7) (HTTP
-  contract). (4)'s Option-B plumbing blocks (3)'s corruption alert. (5) needs
-  (3)'s init-Job contract and (2)'s kube client.
-- **ops track**: (6) runs alongside the canopy enabling work, joined by the
-  ARN/SA-name contract; its scheduler-Deployment piece (B4) trails (3)/(4)'s bin
-  names.
-- **bestool track**: (7) starts when (2)'s endpoint shapes are frozen and the
-  spike is done; otherwise independent of (3)/(4)/(5).
-- **pgro track**: (8) is last; nothing else depends on it.
-
-### The contracts that let the tracks run independently
-
-1. **HTTP endpoint shapes** (canopy public-server ⇆ bestool, and later ⇆ pgro):
-   `POST /backup-credentials`, `GET /backup-target`, `POST /backup-report` —
-   request/response bodies, the 412/409/502 semantics, and `backup_runs.id` =
-   client-minted UUID PK with `device_id`/`group_id` server-derived. Freeze
-   these from spec (2)/(7) before bestool starts; bestool's `canopy_contract.rs`
-   `#[ignore]`d suite is the drift detector.
-2. **IRSA role ARNs + ServiceAccount subs + OIDC issuer** (canopy ⇆ ops):
-   `target_role_arn` (= ops `deviceRoleArn`), `maintenanceRoleArn`,
-   `canopyIssuerRoleArn`, `canopyJobsRoleArn`, the central-cluster OIDC issuer
-   URL, and the `canopy-issuer`/`canopy-jobs` SA names in namespace
-   `tamanu-meta-<stack>`. The hard isolation invariant: the maintenance/
-   fullaccess role MUST NOT trust the issuer principal.
-3. **Billing label keys** (canopy ⇆ ops): `billing.{product,stage,deployment}`,
-   with `ServerRank::Production → "prod"` (the load-bearing mapping gotcha).
-4. **Shared `commons-types` enums** (canopy-internal, spec 1): `Purpose` /
-   `Outcome` / `kind` shared across public-server, jobs, and the generated
-   `api-types.ts`, so the three components don't drift.
-5. **`raise_group_event` group-level alert entrypoint** (spec 4, consumed by 3
-   and 8): the single place that opens a group-scoped incident bypassing
-   `is_monitored`.
-6. **Init-Job lifecycle contract** (spec 3 ⇆ spec 5): UI sets
-   `status='provisioning'` + clears `last_init_error`; the init Job transitions
-   to `escrow_pending`/`ready` or sets `last_init_error`. UI depends only on the
-   observable fields, not the handoff mechanism.
-7. **kopia Job image + entrypoint arg contract** (ops-built image ⇆ canopy jobs
-   ⇆ bestool source conventions): args (bucket/prefix/region/role/retention/
-   run-id), `secretKeyRef` password mount, source-host `canopy@<server-id>`,
-   snapshot tags `canopy-device`/`canopy-run`.
-
----
-
-## Cross-cutting decisions to settle before the dependent stage
-
-These appear in multiple specs' open questions; resolving them early prevents
-rework. Each is tagged with the latest stage by which it must be decided.
-
-- **Enum representation** (`commons-types` shared vs validated String) — **Stage 0**,
-  blocks (1) and the generated `api-types.ts`.
-- **Migration ownership** (one `backup_credentials` migration vs split across
-  (1)/(3)) — **Stage 0/2**, coordinate (1) and (3).
-- **Group-level alerting Option A vs B** — **Stage 2**, recommend B; blocks (3)'s
-  corruption alert and all group-level refs in (4).
-- **Where `reveal_escrow` reads the Secret** (private-server own kube client) —
-  **Stage 3**, ties to ops open-Q 2 (does private-server get `canopy-issuer`).
-- **Longer-lived / non-chained restore creds for first-party consumers** —
-  **Stage 2** (decided when the maintenance-Job direct-web-identity lands), so
-  (8) isn't blocked.
-- **"Back up now" command-channel transport** — deferred upstream; **does not
-  block** Stage 3's bestool subcommands, which are transport-independent.
-- **kopia + default-retention without PutObjectRetention** — the **spike**;
-  blocks ops A2 and bestool kopia wiring.
-
----
-
-## Summary one-liner per stage
-
-- **Stage 0**: land the DB layer + shared enums (1); freeze the canopy↔ops
-  contract and start the long-lead IRSA/OIDC infra (6); run the kopia spike.
-- **Stage 1**: build the issuance hot path + the AWS/kube-on-AppState
-  capability (2) — first end-to-end device slice.
-- **Stage 2**: the control-plane Jobs (3) + detection/preflight + group-level
-  alerting (4) in parallel, plus ops scheduler Deployments; decide restore-cred
-  lifetime here.
-- **Stage 3**: operator UI (5) and the bestool device client (7) in parallel.
-- **Stage 4**: PGRO restore-verification (8), additive and last.
diff --git a/docs/plans/backup-credentials-kopia-spike.md b/docs/plans/backup-credentials-kopia-spike.md
deleted file mode 100644
index dce9917f..00000000
--- a/docs/plans/backup-credentials-kopia-spike.md
+++ /dev/null
@@ -1,166 +0,0 @@
-# kopia-behaviour verification spike (day-0 blocker)
-
-Resolves the spike named in
-[`backup-credentials-implementation-order.md`](./backup-credentials-implementation-order.md).
-Its job: pick the device IAM action-set branch and confirm the kopia
-assumptions, so ops can finalize the managed policy and bestool can wire
-the kopia connect/snapshot path.
-
-**Method:** authoritative from kopia docs + source and S3 semantics (no
-live test was runnable in-session — no kopia binary, no valid AWS creds).
-A gold-standard live-test script is at the end; the *decision* doesn't wait
-on it, but two items (PIT on real AWS, the bucket-default-retention write
-path) warrant a live confirm before relying on them.
-
-## Verdict
-
-**Branch A confirmed: device creds = `AWS_S3_MULTIPART_ACTIONS` (no
-`s3:PutObjectRetention`, no `s3:DeleteObject`); the kopia repo is created
-*non-lock-aware*; we rely on the bucket's default GOVERNANCE 30-day
-retention + versioning + lifecycle.** Ops can finalize the device managed
-policy as exactly `AWS_S3_MULTIPART_ACTIONS`.
-
-## Findings (per question)
-
-### 1. PutObjectRetention vs bucket-default retention (the branch decision)
-
-kopia needs `s3:PutObjectRetention` **only when it manages retention
-itself** — i.e. the repo is created with `--retention-mode` and locks are
-renewed via full-maintenance `--extend-object-locks`. That's the
-kopia-documented "ransomware" path, and it's confirmed that even with
-`--extend-object-locks` kopia still requires `PutObjectRetention` on the
-primary bucket (so it can't be isolated away).
-
-We deliberately do **not** use that mode. Instead we create a *plain*
-kopia repo (no `--retention-mode`) against a bucket whose **default object
-lock retention** is GOVERNANCE 30d. S3 applies the default retention to
-every `PutObject` **server-side**, which requires only `s3:PutObject` — not
-`s3:PutObjectRetention`. So the device key needs neither delete nor
-retention permission. ✔ matches the plan's H3.
-
-- **Consequence (already accepted):** without `--extend-object-locks`, a
-  live blob's lock is fixed at 30d from its last write and never renewed.
-  Irrelevant under the device-compromise threat (the device can't delete);
-  it only matters against an AWS-level attacker, which is out of scope.
-  Re-enabling renewal later = lock-aware mode + `PutObjectRetention` on the
-  **maintenance** role (never the device).
-- **Caveat:** this is *not* the kopia-documented happy path; it rests on S3
-  default-retention semantics (solid) rather than a kopia doc page. Live
-  test item (a) confirms kopia writes + maintains happily this way.
-
-### 2. Maintenance deletes on a versioned bucket (H2)
-
-Non-lock-aware kopia issues real `DeleteObject`; on a versioned bucket that
-writes a **delete marker** (succeeds, reclaims nothing) — it does **not**
-error. Reclamation is via the S3 lifecycle `noncurrentVersionExpiration`
-rule, as the plan says. The maintenance role needs `s3:DeleteObject` (it
-has it). (Note: kopia *also* has a "hidden marker" soft-delete it uses with
-restricted/lock-aware keys — not our path; our maintenance role deletes for
-real and lets lifecycle reclaim.) ✔ matches the plan's H2; the earlier
-"throws errors on locked deletes" framing was wrong.
-
-### 3. Temporary credentials / `credential_process` (`AWS_SESSION_TOKEN`)
-
-`kopia repository create/connect s3` supports `--session-token` (and the
-`AWS_SESSION_TOKEN` env). So the short-lived STS creds (which include a
-session token) work, and the `credential_process`-style refresh is viable.
-✔ unblocks the bestool credential path.
-
-### 4. Source host = server-id
-
-`--override-hostname` and `--override-username` exist, set at **`kopia
-repository connect`** time (connection-level, *not* per-snapshot — the
-per-snapshot `--hostname`/`--username` were removed in 0.6.0). bestool
-reconnects per run (it re-derives the connection from Canopy every run), so
-it passes `--override-hostname=<server-id>` (`--override-username=canopy`)
-on connect → source `canopy@<server-id>:<path>`. The **type** goes in the
-path and a `canopy-type:<type>` snapshot tag (`kopia snapshot create
---tags`). ✔ matches the plan's per-`(server, type)` source model.
-
-### 5. Point-in-time recovery (H1)
-
-`kopia repository connect … --point-in-time=<ts>` exists and is the
-documented recovery path for a versioned+locked bucket (recover to before
-a poisoning/deletion). ✔ the H1 recovery runbook is real.
-
-- **Caveat:** GitHub issue #4346 reports `--point-in-time` failing with
-  "repository not initialized" on some S3-*compatible* endpoints, and
-  #3492 covers a recovery edge case (missing files after deleted objects).
-  Real AWS S3 is kopia's primary supported target, but **PIT recovery must
-  be live-tested on real AWS S3** before we depend on it operationally —
-  it's our break-glass path. Live test item (b).
-
-### 6. CloudWatch `BucketSizeBytes` dimension (lower-stakes, may trail)
-
-`BucketSizeBytes` carries a `StorageType` dimension, and **all object
-versions (current + noncurrent) count** toward it per storage class. So the
-S3-metrics task sums `BucketSizeBytes` across the relevant `StorageType`s —
-`StandardStorage` plus the intelligent-tiering classes (`.storageconfig`
-puts pack blobs in `INTELLIGENT_TIERING`). Confirm the exact emitted
-dimensions against a real bucket. Lower-stakes; `bucket_bytes` is
-best-effort anyway.
-
-## What this unblocks
-
-- **ops** (action-set): the device role = `AWS_S3_MULTIPART_ACTIONS`, no
-  `PutObjectRetention`. The repo is created non-lock-aware; the bucket
-  keeps its default GOVERNANCE 30d retention + the lifecycle rules.
-- **bestool** (kopia wiring): connect with `--session-token` +
-  `--override-hostname=<server-id>`; `kopia snapshot create --tags
-  canopy-device:… --tags canopy-run:… --tags canopy-type:…`; do **not** pass
-  `--retention-mode`.
-
-## Remaining live confirmations (run when a throwaway bucket + creds exist)
-
-These don't change the branch; they de-risk the two assumptions that rest
-on semantics/known-issues rather than a kopia doc. Script below.
-
-(a) Plain kopia repo create + snapshot + maintenance against a versioned,
-    default-GOVERNANCE-retention bucket, using a **device key without
-    PutObjectRetention/Delete** and a **maintenance key with delete** —
-    confirm no `AccessDenied` for retention and that maintenance succeeds.
-(b) `--point-in-time` reconnect works on real AWS S3.
-(c) The `BucketSizeBytes` `StorageType` dimensions emitted.
-
-```bash
-#!/usr/bin/env bash
-# Operator-run live confirmation. Needs: aws cli with creds, kopia.
-# Creates a throwaway bucket — review + delete after.
-set -euo pipefail
-B="bes-kopia-spike-$(date +%s)"; R="ap-southeast-2"
-KP="spike-pass-$(openssl rand -hex 8)"
-
-# 1. Versioned bucket + object lock + 30d GOVERNANCE default retention
-aws s3api create-bucket --bucket "$B" --region "$R" \
-  --create-bucket-configuration LocationConstraint="$R" \
-  --object-lock-enabled-for-bucket
-aws s3api put-object-lock-configuration --bucket "$B" \
-  --object-lock-configuration 'ObjectLockEnabled=Enabled,Rule={DefaultRetention={Mode=GOVERNANCE,Days=30}}'
-
-# 2. DEVICE creds: NO delete, NO PutObjectRetention (AWS_S3_MULTIPART_ACTIONS).
-#    Use a scoped IAM user/role with: s3:GetObject,PutObject,
-#    AbortMultipartUpload,ListBucketMultipartUploads,ListMultipartUploadParts,
-#    ListBucket,GetBucketLocation on the bucket. Export its creds, then:
-kopia repository create s3 --bucket "$B" --region "$R" --password "$KP" \
-  --override-hostname server-test --override-username canopy
-#    ^ EXPECT: success. FAIL = AccessDenied mentioning PutObjectRetention
-#      → fall back to granting PutObjectRetention (safe; lengthen-only).
-echo hello > /tmp/spike.txt
-kopia snapshot create /tmp/spike.txt --tags canopy-type:tamanu-postgres
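-aws s3api get-object-retention --bucket "$B" --key "$(aws s3api list-object-versions --bucket "$B" --query 'Versions[0].Key' --output text)" --query 'Retention.Mode' # EXPECT: GOVERNANCE (default applied on PUT); list-object-versions returns no lock fields; needs s3:GetObjectRetention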
-
-# 3. MAINTENANCE creds: full S3 incl. delete. Re-connect with those, then:
-kopia maintenance run --full --safety none   # EXPECT: success; deletes become markers
-aws s3api list-object-versions --bucket "$B" --query 'DeleteMarkers' # EXPECT: markers present, no errors
-
-# 4. PIT (item b): set TS to a UTC timestamp noted before the mutation, mutate, then:
-kopia repository connect s3 --bucket "$B" --region "$R" --password "$KP" \
-  --point-in-time="${TS:?set TS to the timestamp noted before the mutation}"   # EXPECT: connects (watch for issue #4346)
-
-# 5. CloudWatch dimensions (item c): after metrics populate (~a day),
-aws cloudwatch list-metrics --namespace AWS/S3 --metric-name BucketSizeBytes \
-  --dimensions Name=BucketName,Value="$B"
-
-# cleanup: object-locked objects can't be deleted for 30d; the throwaway
-# bucket will linger until the lock lapses (expected). Tag it for teardown.
-```
diff --git a/docs/plans/backup-setup-wizard-ops-handoff.md b/docs/plans/backup-setup-wizard-ops-handoff.md
deleted file mode 100644
index 33202044..00000000
--- a/docs/plans/backup-setup-wizard-ops-handoff.md
+++ /dev/null
@@ -1,241 +0,0 @@
-# Ops/pulumi handoff — backup setup wizard + chained-AssumeRole cred model
-
-Companion to `backup-setup-wizard.md`. This lists **only the ops/pulumi-side
-changes** canopy needs. Canopy owns the Rust/UI/DB work; this is what the ops
-agent must adjust. (canopy ticket TAM-6877; ops ticket TAM-6878.)
-
-## Context
-
-Canopy is moving the backup credential flow to **explicit chained
-`sts:AssumeRole`** (there is no deployment-account OIDC provider, so the old
-`AWS_ROLE_ARN`-override / direct-web-identity trick is gone), and adding an
-interactive setup wizard that **probes the target bucket from private-server**
-before a config is saved. private-server has no AWS identity today — that's the
-main new ask.
-
-## 1. New `canopy-private` ServiceAccount + IRSA role
-
-private-server currently has no AWS identity. Add a **dedicated**
-`canopy-private` SA (not a reuse of `canopy-jobs`/`canopy-issuer` — we want room
-to grow private-server AWS features later):
-
-- IRSA role annotated on the `canopy-private` SA.
-- The role carries `sts:AssumeRole`.
-- k8s RBAC for the SA: **`get` + `create` secrets** in the namespace (today
-  private-server only needs `get`; the wizard now creates the passphrase Secret).
-
-## 2. Trust-policy matrix (chained AssumeRole) — the main change
-
-Verified current state in `pulumi/tamanu/on-linux/src/backup/kopia.ts`:
-
-- device role: `assumeRolePolicyDocument: deviceAssumeRolePolicy(canopyIssuerRoleArn)`
-  → trusts **canopy-issuer**.
-- maintenance role: `maintenanceAssumeRolePolicy(canopyJobsRoleArn)` → trusts
-  **canopy-jobs**.
-
-Target trust per group:
-
-| Per-group role | Trusted by (can `sts:AssumeRole` it) | Used for |
-|---|---|---|
-| **device role** (`deviceRoleArn` / `target_role_arn`) | `canopy-issuer` (existing) **+ `canopy-private` (NEW)** | mint device creds; wizard validation (`sts:GetCallerIdentity`) |
-| **maintenance role** (`maintenanceRoleArn` / `maintenance_role_arn`) | `canopy-jobs` (existing) **+ `canopy-private` (NEW)** | maintenance/inspection/s3-metrics; wizard content-probe + connect-verify |
-
-**Concrete change:** `deviceAssumeRolePolicy` / `maintenanceAssumeRolePolicy`
-currently take a single trusted role ARN. Extend them to also trust a new
-`canopyPrivateRoleArn` (add a `canopy.get('canopyPrivateRoleArn')` alongside the
-existing `canopyIssuerRoleArn` / `canopyJobsRoleArn` config reads), and have the
-canopy deployment stack export that ARN (next to `canopyIssuerRoleArn` /
-`canopyJobsRoleArn`).
-
-## 3. Maintenance role permissions — already correct, just confirm
-
-Verified: the maintenance role already grants `s3:*` on the bucket **and**
-`CLOUDWATCH_METRIC_ACTIONS` (the s3-metrics CloudWatch reads). No change needed —
-this section is just to confirm canopy will now assume **this** role for
-maintenance/inspection/s3-metrics.
-
-> Bug being fixed canopy-side: maintenance/s3-metrics currently assume the
-> *device* role (can't delete). They'll switch to `maintenance_role_arn`. The
-> maintenance role is already complete, so no ops grant change — the device role
-> stays minimal.
-
-## 4. Session duration — **NO ops change** (MaxSessionDuration ask withdrawn)
-
-Disregard the earlier "raise `canopy-jobs` `MaxSessionDuration` to 12h" note —
-**withdrawn.** No `MaxSessionDuration` change is needed on any role.
-
-Why it's moot: we verified (kopia v0.23.1 + minio-go v7.2.0 source) that kopia
-cannot use `credential_process`/a creds file, **and** the `--role` approach
-(which would have needed a long base session) is superseded. The chosen
-mechanism is kopia's **IAM container-credentials endpoint**: canopy runs a tiny
-localhost endpoint that mints a fresh (1h chained) maintenance-role session per
-poll; kopia's minio-go re-polls at ~80% of lifetime, and the Rust SDK keeps the
-pod's IRSA base fresh on its own. So a 90-min (or longer) run just re-polls — no
-session ceiling, no role-duration tuning. Per-bucket roles stay 1h-capped
-(fine); `canopy-jobs`/`canopy-private` need no duration change.
-
-## 5. `.storageconfig` — informational, likely no change
-
-Canopy will **create `<prefix>.storageconfig` as a fallback** during repo init
-*only if absent*, and **never overwrites** an existing one, using the same
-schema ops already writes (`blobOptions: p→INTELLIGENT_TIERING, else STANDARD`,
-from `pulumi/tamanu/on-linux/src/backup/kopia.ts` and `pulumi/backups/index.ts`).
-Since pulumi writes it at bucket creation, pulumi's object normally wins and
-canopy's fallback is a no-op. No ops change required — just be aware canopy may
-write it for buckets/prefixes pulumi didn't seed.
-
-## 6. Config-as-a-resource API (so pulumi stops hand-copying ARNs)
-
-Canopy will expose an API for pulumi to register a backup config as a managed
-resource after it creates the bucket+roles — create/update/delete/get,
-idempotent upsert. Ops side:
-
-- **Auth: `TailscaleAdmin`** for now (pulumi already has tailnet access — call
-  the private-server admin API over the tailnet). A proper non-interactive
-  machine-auth path over Tailscale (tagged/ACL-grant) is wanted **later, not in
-  this plan** — flag if ops wants to start designing it, but canopy isn't
-  building it yet.
-- **Inputs pulumi must supply per group:** `server_group_id`, `bucket`,
-  `prefix`, `region`, `target_role_arn` (= `deviceRoleArn`), `maintenance_role_arn`
-  (= `maintenanceRoleArn`), `mode` (machine flows: `from_birth` — canopy
-  generates the passphrase; the human still escrows via the UI).
-- **Delete** removes the config **and the canopy-owned passphrase Secret**.
-- The create/update path runs the **same server-side access-check (the probe)**
-  before persisting, so a misconfigured role/bucket fails fast.
-- Exact request/response schema is canopy's to finalize; canopy will share the
-  generated openapi. Ops only needs to confirm the **inputs above** are
-  available as pulumi outputs (they are: `deviceRoleArn` + `maintenanceRoleArn`
-  are already exported).
-
-## 6a. Device path note (informational — bestool repo, not pulumi)
-
-Heads-up that the device backup path is changing on the bestool side (TAM-6879),
-not here: bestool will serve a localhost **container-credentials** endpoint that
-kopia polls (fed by public-server creds), because we verified `credential_process`
-doesn't work and ~90-min snapshot jobs make any <1h static-cred path non-viable.
-This may prompt revisiting public-server's device-cred response shape. No pulumi
-action — flagged only so the ops/bestool picture is consistent.
-
-## 7. Not changing
-
-- Device credential path *trust/roles* (public-server / `canopy-issuer`
-  chain-assuming the device role) — unchanged (only the bestool-side cred
-  *delivery* changes, §6a).
-- The image still bundles kopia.
-
-## Summary of ops action items
-
-1. New `canopy-private` SA + IRSA role (`sts:AssumeRole`); SA RBAC `get`+`create`
-   secrets; export its role ARN as `canopyPrivateRoleArn`.
-2. Add `canopyPrivateRoleArn` to **both** `deviceAssumeRolePolicy` and
-   `maintenanceAssumeRolePolicy` (they take a single ARN today).
-3. Maintenance role perms — already `s3:*` + CloudWatch; nothing to change, just
-   confirmed.
-4. **No `MaxSessionDuration` change (ask withdrawn, §4)** — canopy uses kopia's
-   container-credentials endpoint, which refreshes with no session ceiling. No
-   `.storageconfig` change either.
-5. Plan to call canopy's config-as-a-resource API over the tailnet
-   (`TailscaleAdmin`) feeding `deviceRoleArn`+`maintenanceRoleArn`+bucket/prefix/
-   region; delete cascades to the Secret. Schema TBD from canopy.
-
----
-
-## Changelog (append-only — do NOT edit the body above after handoff)
-
-**v1 — the version ops actioned** (everything above as of the first handoff). The
-action items §1–§3, §5 are the source of truth; treat them as done.
-
-**2026-06-20 — delta since v1 (nothing here needs new ops IAM/pulumi work):**
-- **§4 reworded, net zero for ops.** v1 already said "no `MaxSessionDuration`
-  change," and that's still true. The *reason* changed (canopy-internal): kopia
-  now gets creds via a localhost **container-credentials endpoint** (verified
-  against kopia 0.23.1 + minio-go 7.2.0), not `credential_process` and not
-  `--role`+long-session. No role-duration tuning on any role.
-  - ⚠️ **If anyone verbally relayed a "raise `canopy-jobs` `MaxSessionDuration`
-    to 12h" ask (it was never in this doc), it is WITHDRAWN — ignore/revert it.**
-- **§6a added — informational only, no pulumi action.** The device backup path
-  moves to a bestool-served container-credentials endpoint (TAM-6879); may prompt
-  revisiting public-server's device-cred response shape. Flagged for picture
-  consistency only.
-- **§1–§3, §5 unchanged** (byte-identical to v1).
-
-**2026-06-20 — NEW ops action (passphrase rotation):**
-- ⚠️ **`canopy-jobs` SA now needs WRITE on secrets** (`create`/`update`/`patch`,
-  on top of the existing `get`). Why: the backups pod rotates each repo's
-  passphrase regularly (forward protection) — after `kopia change-password` it
-  writes the new passphrase back to the group's k8s Secret (dual-key
-  `password`/`password_next`, server-side apply, field-manager `canopy-backups`).
-  Read-only `get secrets` no longer covers the rotation path.
-- No other ops change; rotation cadence is a canopy env
-  (`CANOPY_BACKUP_ROTATION_DAYS`, default 7).
-
-**2026-06-21 — NEW ops action (recovery vault):**
-- ⚠️ **A new object-locked S3 bucket for the recovery vault, in a SEPARATE account**
-  from both the canopy cluster account and the per-tenant backup accounts.
-  Requirements:
-  - **Object Lock = COMPLIANCE** + **versioning** on (so a Canopy compromise
-    can't delete history; each daily write is a new immutable version of the
-    same key). Pick a retention period (with a lifecycle expiry so it doesn't
-    grow forever). SSE on.
-  - A **writer role** the `canopy-jobs` SA assumes (chained AssumeRole), granted
-    **`s3:PutObject` ONLY** on that bucket — **no delete, no get** (Canopy never
-    reads the vault back; the blob is asymmetrically encrypted so it couldn't
-    read it anyway).
-- ⚠️ **age recipient keypairs (recovery-key custody).** Generate **multiple** age
-  keypairs (e.g. one per recovery officer; `bestool crypto keygen`). The
-  **public** keys go to Canopy via `CANOPY_RECOVERY_VAULT_KEYS` (space/comma-
-  separated `age1…`); the **private** keys are held **offline, out-of-band**
-  (any one can recover). Custody is an ops runbook — Canopy never sees a private
-  key.
-- **Canopy env (backups pod):** `CANOPY_RECOVERY_VAULT_KEYS` (**mandatory** — the
-  pod refuses to start without it), `CANOPY_RECOVERY_VAULT_BUCKET` (**mandatory**),
-  `CANOPY_RECOVERY_VAULT_REGION`, `CANOPY_RECOVERY_VAULT_ROLE_ARN` (the writer role),
-  `CANOPY_RECOVERY_VAULT_SNAPSHOT_HOURS` (default 24). The object key/path within
-  the bucket is not configurable (fixed at `canopy-recovery/state.age`). These
-  must be provisioned **before** the backups pod is deployed with this build, or
-  it will crash-loop on the mandatory check.
-- **Verification ceremony (runbook):** operators run a yearly (and on-key-change)
-  ceremony in the canopy admin UI (recovery vault page): Canopy issues an age-encrypted
-  challenge, the operator decrypts it offline with a held private key
-  (`bestool crypto decrypt`) and pastes it back. The vault blob itself is plain
-  `age` v1 (decryptable with `bestool crypto decrypt` / `age` / `rage`).
-- No k8s RBAC change (the vault is S3, not a Secret).
-
-**2026-06-22 — clarification (private-server also needs the recipients):**
-- ⚠️ **Set `CANOPY_RECOVERY_VAULT_KEYS` on private-server too**, not just the
-  backups pod. They're **public** keys (non-secret), so use the same value. The
-  private-server needs them to run the verification ceremony (issue the
-  age-encrypted challenge); without them the recovery-vault page reports the
-  ceremony as unavailable. private-server does **not** hard-require them (it
-  starts fine without; only the ceremony page is degraded) — unlike the backups
-  pod, which won't start without them. Nothing else on private-server needs it.
-
-**2026-06-22 — fix (repo-password Secret namespace):**
-- Canopy now reads/creates the repo-password Secrets in the **pod's own
-  namespace** (the ServiceAccount's namespace from the in-cluster config), not a
-  hardcoded `canopy`. So in `tamanu-meta-prod` the Secrets live in
-  `tamanu-meta-prod`. **No `POD_NAMESPACE` env needed** (it's still honored as an
-  override if set). Earlier the default `canopy` caused `canopy-jobs` (in
-  `tamanu-meta-prod`) to hit `403 Forbidden` reading `backup-repo-*` in `canopy`.
-- ⚠️ **Ensure both SAs have secret RBAC in the deployment namespace:**
-  `canopy-private` (`get`+`create` secrets) and `canopy-jobs` (`get`+`create`/
-  `update`/`patch` secrets, for rotation) — in whatever namespace the pods run
-  (e.g. `tamanu-meta-prod`), which is the standard same-namespace grant. If those
-  Role/RoleBindings were created in `canopy`, move them to the pods' namespace.
-
-**2026-06-22 — fix (kopia is now bundled in the canopy image):**
-- The backups pod runs kopia as subprocesses inside the pod, but kopia was **not**
-  in the image it ran: CI only ever built `ghcr.io/beyondessential/canopy` (a
-  busybox base with the canopy binaries, no kopia), and the standalone
-  `images/backups/` Dockerfile that bundled kopia was never wired into CI. So in
-  prod every repo-init/maintenance kopia spawn failed with
-  `failed to spawn kopia … No such file or directory (os error 2)`.
-- Fix: the (static) kopia binary is now copied into the single shipped
-  `ghcr.io/beyondessential/canopy` image (`.github/Dockerfile.native`, from
-  `kopia/kopia:0.23.1`). **The backups pod uses the same image as every other
-  component** — there is no separate backups image. The orphaned `images/backups/`
-  dir was removed.
-- ⚠️ **Ops action: ensure the backups Deployment references
-  `ghcr.io/beyondessential/canopy` (the standard image), not a separate
-  `canopy-backups`/kopia image** — none is published. Once deployed on this
-  build, kopia resolves on `PATH` and repo creation works.
diff --git a/docs/plans/backup-setup-wizard.md b/docs/plans/backup-setup-wizard.md
deleted file mode 100644
index 6f260d43..00000000
--- a/docs/plans/backup-setup-wizard.md
+++ /dev/null
@@ -1,381 +0,0 @@
-# Backup setup wizard + Canopy-owned passphrases + chained-AssumeRole cred model
-
-Status: **PLAN / for review** (2026-06-20). Supersedes parts of the cred model
-in `backup-credentials.md` (the direct-web-identity scheme) — see §1.
-
-## Why
-
-Two things landed together:
-
-1. **Operator feedback (feature):** the backup setup form should become an
-   interactive wizard. The operator enters bucket/prefix/role(s)/region; Canopy
-   *immediately* uses those creds to probe the bucket — verifying the creds work
-   and reporting whether the prefix is empty, looks like an existing kopia repo,
-   holds other (forgotten) content, or is already configured in Canopy — and
-   offers next steps based on that. Only once the passphrase situation is
-   settled do we collect schedule/retention. Rework repo modes so **Canopy owns
-   every passphrase**: `from_birth` (generate + escrow) or `passphrase` (operator
-   types it); drop the old import-an-existing-Secret mode.
-
-2. **Ops-driven cred-model change:** there is **no deployment-account OIDC
-   provider**, so the previous "override `AWS_ROLE_ARN` + reuse the projected
-   web-identity token → direct `AssumeRoleWithWebIdentity`" scheme is gone.
-   Everything cross-account is now **explicit chained `sts:AssumeRole`** from the
-   pod's own IRSA creds, and there are now **two roles per group**.
-
-Both touch the same credential plumbing, so they're planned together.
-
----
-
-## 1. Credential model change (ops-driven)
-
-### 1.1 Two roles per group
-
-`server_group_backup_config` carries **both**:
-
-- `target_role_arn` — **device role** (`deviceRoleArn`). No delete. public-server
-  mints device creds from it. **Unchanged.**
-- `maintenance_role_arn` — **maintenance role** (`maintenanceRoleArn`). `s3:*` +
-  delete + CloudWatch. The backups pod assumes this for
-  maintenance / inspection / s3-metrics. **New column.**
-
-The current code assuming `target_role_arn` for maintenance / s3-metrics is a
-**bug** (the device role deliberately can't delete). Fixing it is in scope here.
-
-### 1.2 Chained AssumeRole everywhere cross-account
-
-The backups pod keeps its own `canopy-jobs` IRSA creds (default credential
-chain). For each group op it needs to reach `maintenance_role_arn`. Two
-consumers:
-
-- **CloudWatch SDK client (s3-metrics):** in-process Rust `aws-sdk-sts`
-  `AssumeRole(maintenance_role_arn)` → hand to the SDK client. The Rust SDK's
-  assume-role provider auto-refreshes; the 1h chained cap is a non-issue.
-  (preflight is *not* in this list — it stays on the device role, §preflight.)
-- **kopia subprocess:** see §1.3 — kopia polls a localhost container-credentials
-  endpoint and self-refreshes.
-
-Drop every direct-web-identity-against-deployment-account assumption, and drop
-the old `AWS_ROLE_ARN`-override path in `kopia.rs`.
-
-### 1.3 kopia creds via a container-credentials endpoint (the 1h-cap solution)
-
-**Verified against kopia v0.23.1 + minio-go v7.2.0 source (do not relitigate).**
-~90-min snapshot/maintenance runs are routine, so **no <1h static-cred path is
-viable**, and:
-
-- kopia's S3 chain is hardcoded **Static → EnvAWS → IAM**; it never instantiates
-  minio-go's file/`credential_process` provider → **`credential_process` and "a
-  creds file" do NOT work**. (Dead. No kopia fork.)
-- Static `AWS_*` env creds **never refresh** → break past 1h.
-- kopia's `--role` path *would* self-refresh but is bounded by the base session's
-  lifetime — **superseded** by the endpoint below (simpler ops, no ceiling).
-- minio-go's **IAM provider** *does* support the ECS-style **container-credentials
-  endpoint** (`AWS_CONTAINER_CREDENTIALS_FULL_URI` + `AWS_CONTAINER_AUTHORIZATION_TOKEN`),
-  polls it, and **self-refreshes** (`SetExpiration` → re-GET at ~80% of lifetime).
-
-**Plan:** the `backups` bin runs one tiny **localhost** HTTP creds endpoint. Per
-in-flight op it registers `token → (maintenance_role_arn, region)`, mints that
-group's session via the Rust SDK (`AssumeRole`, base auto-refreshed by IRSA), and
-returns container-creds JSON. Each kopia subprocess is launched pointing at it;
-minio-go re-polls before expiry, so a 90-min+ run just gets fresh creds — **no
-session ceiling, no `MaxSessionDuration` change**.
-
-Verified endpoint contract (minio-go v7.2.0 FULL_URI/ECS path) — implement exactly:
-- Subprocess env: `AWS_CONTAINER_CREDENTIALS_FULL_URI=http://127.0.0.1:<port>/<path>`
-  (use the **`127.0.0.1`** literal — loopback is checked via `LookupHost`; `localhost`
-  can fail; http is fine, no https needed) + `AWS_CONTAINER_AUTHORIZATION_TOKEN=<secret>`
-  (sent **raw** as the `Authorization` header, no `Bearer`).
-- **Env hygiene (critical):** on the kopia subprocess, *unset*
-  `AWS_WEB_IDENTITY_TOKEN_FILE`, `AWS_ROLE_ARN`, `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY`,
-  `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI`, `AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE`
-  — all *precede* the FULL_URI path (IRSA injects the web-identity ones into the
-  pod). Leave kopia's S3 `accessKeyID/secretAccessKey/sessionToken/roleARN` empty
-  (Static/RoleARN bypass the chain).
-- Server returns **HTTP 200** + JSON `{"AccessKeyId","SecretAccessKey","Token","Expiration"}`
-  — field is **`Token`** (not SessionToken); `Expiration` is **RFC3339**, a few
-  minutes out (never omit/zero, or minio-go re-GETs every call). Non-200 = hard
-  failure (no IMDS fallback on this path).
-
-Device side (bestool, TAM-6879) uses the *same* protocol, its endpoint fed by
-public-server. Revisit public-server's device-cred output shape there.
-
-### 1.4 Device path: unchanged
-
-public-server / canopy-issuer already chain-assumes `target_role_arn` and hands
-creds to the device over mTLS. No change.
-
-### 1.5 What ops provides (so canopy just uses the default chain → AssumeRole)
-
-- `canopy-issuer` + `canopy-jobs` SAs annotated with IRSA role ARNs, both
-  carrying `sts:AssumeRole` + `get secrets`.
-- Per-bucket roles trust the matching SA role ARN.
-- Image still bundles kopia.
-
-> **Ops coordination (TAM-6878):** the wizard's synchronous probe runs in
-> **private-server**, which today has no AWS identity. **Decided:** private-server
-> gets a **dedicated `canopy-private` SA + IRSA role** (room to grow more
-> private-server AWS features later), carrying `sts:AssumeRole`; the per-bucket
-> roles' trust policies must include this role ARN, and the SA needs `create
-> secrets` (§3) on top of the existing `get secrets`.
-
----
-
-## 2. Interactive setup wizard (probe)
-
-### 2.1 Flow
-
-Step 1 — **Identity & target.** Operator enters: bucket, prefix, region (default
-**`ap-southeast-2`** — most buckets live there), `target_role_arn` (device),
-`maintenance_role_arn`. → **Probe.**
-
-Step 2 — **Probe result & passphrase.** Canopy assumes the role and inspects the
-prefix. Based on the result (§2.3) it presents the right passphrase choice
-(from_birth generate vs operator-typed passphrase). Probe also reports if this
-bucket+prefix is already configured in Canopy (DB check, §2.4). For an existing
-repo, once the operator types the passphrase Canopy runs a **second
-(verify) probe** that attempts `kopia repository connect` to confirm the
-passphrase before committing (§2.3).
-
-Step 3 — **Schedule & retention.** Only reached once the passphrase situation is
-settled. Same fields as today (interval + per-type retention with the org
-floors). Then create + provision.
-
-### 2.2 Probe endpoint
-
-New private-server endpoint, e.g. `POST /api/backups/probe`. Two phases share it:
-an **inspect** phase (no passphrase) and a **verify** phase (with passphrase, for
-existing repos):
-
-```
-ProbeArgs {
-  bucket, prefix, region, target_role_arn, maintenance_role_arn,
-  passphrase: Option<String>,     // present ⇒ also run the connect-verify
-}
-ProbeResult {
-  creds_ok: bool,
-  error: Option<String>,          // assume/list failure surfaced verbatim-ish
-  state: "empty" | "kopia_repo" | "other_content" | "inaccessible",
-  object_sample: Vec<String>,     // a few keys, for "other content" context
-  already_configured_in_canopy: Option<Uuid>,  // group id if bucket+prefix taken
-  passphrase_ok: Option<bool>,    // set only when a passphrase was supplied
-}
-```
-
-Implementation: add `aws-sdk-sts` + `aws-sdk-s3` + `aws-config` to private-server
-(mirrors `public-server/src/backup.rs` and `jobs/src/backup/preflight.rs`).
-Assume the **`maintenance_role_arn`** (full read; it's the path that does the
-heavy lifting, so validating it is the most useful signal), then:
-
-- `ListObjectsV2(bucket, prefix, max-keys=small)`.
-- Probe for the kopia repo marker with `HeadObject`/`GetObject` on
-  `<prefix>kopia.repository` (confirmed: kopia 0.23.1 writes its format blob
-  there). Present ⇒ `kopia_repo`.
-- **`.storageconfig`-only counts as empty:** if the only object(s) under the
-  prefix are `.storageconfig` (and no `kopia.repository`), treat as `empty`.
-- Non-empty (beyond `.storageconfig`) with no marker ⇒ `other_content` (return a
-  sample of keys).
-- Assume/list failure ⇒ `creds_ok=false`, `inaccessible`, surface the error.
-- If `passphrase` is supplied and state is `kopia_repo`: attempt `kopia repository
-  connect` with it (S3 creds via the container-credentials endpoint, §1.3) →
-  `passphrase_ok`. (Connect leaves no writes.)
-
-Also cheaply validate `target_role_arn` with `sts:GetCallerIdentity` under the
-assumed session so a bad device role is caught at setup, not at first device
-mint.
-
-### 2.3 State → offered options
-
-| Probe state | What we show |
-|---|---|
-| `empty` | Proceed. Mode: **from_birth** (generate + escrow, recommended) or **passphrase** (type your own). |
-| `kopia_repo` | An existing kopia repo. Only **passphrase** mode (operator provides the existing passphrase) — *not* from_birth (won't create over an existing repo). The verify-probe must return `passphrase_ok` before the operator can continue. |
-| `other_content` | **Block** with a warning + sample of keys. Canopy does **not** delete anything — the operator clears the contents themselves (or chooses a different prefix/bucket) and hits **Retry** to re-probe. No "proceed anyway", no Canopy-side delete. |
-| `inaccessible` | Block step 1; show the assume/list error so the operator can fix the role/bucket/region. |
-
-### 2.4 Already-configured-in-Canopy check
-
-Before/with the probe, query whether `(bucket, prefix)` (or the group) already
-has a `server_group_backup_config`. If so, surface it (link to the existing
-config) and block creating a duplicate. Pure DB check, no creds needed.
-
-### 2.5 `other_content` → Retry only (no Canopy-side delete)
-
-Canopy never deletes bucket contents. On `other_content` the wizard blocks and
-shows a **Retry** button (re-runs the probe); the operator must clear the prefix
-/ pick another prefix or bucket on their own (AWS console, etc.). No delete
-endpoint, no `s3:DeleteObject*` grant needed.
-
----
-
-## 3. Repo modes + private-server-owned Secret creation
-
-### 3.1 Two modes only — drop import-by-Secret
-
-**Decision:** **Canopy owns all repo passphrases.** Remove the existing
-`import` (operator-supplies-a-Secret-name) variant entirely. `BackupRepoMode`
-becomes exactly two variants:
-
-- `from_birth` — Canopy generates the passphrase + escrow flow (reveal-once +
-  ack). Only valid on an `empty` prefix.
-- `passphrase` — operator provides the passphrase; Canopy stores it. **Skips
-  escrow** → `provisioning → ready` on successful init. Covers *both* "set my
-  own on a fresh repo" (empty prefix → create) and "connect to an existing repo"
-  (`kopia_repo` → connect, passphrase pre-verified by the §2.2 verify probe). The
-  repo *state*, not the mode, decides create-vs-connect.
-
-DB migration changes the `CHECK` on `server_group_backup_config.mode` from
-`IN ('from_birth','import')` to `IN ('from_birth','passphrase')`. No existing
-rows (confirmed), so no data migration. Remove `import`-specific handling
-(`repo_password_ref` is no longer a user input — Canopy always names/owns the
-Secret) and the `BackupRepoMode::Import` match arms.
-
-### 3.2 Secret creation — currently missing
-
-**Gap found:** nothing in the codebase *creates* the passphrase Secret today.
-`from_birth` init only ever *reads* it (`worker.read_repo_password`), and there
-is no passphrase generation — so `from_birth` is not actually wired end-to-end.
-
-**Decision (confirmed):** private-server owns Secret creation for both modes,
-at config-create time:
-
-- `from_birth`: generate a strong passphrase, create the k8s Secret
-  (`backup-repo-{group_id}`, key `password`), record the ref. Escrow flow
-  unchanged.
-- `passphrase`: create the Secret from the operator-typed value, ref recorded,
-  no escrow.
-
-This requires granting the `canopy-private` SA `create secrets` RBAC (today it has `get`
-only). The `backups` init loop keeps only *reading* the Secret — no change.
-
-> Scope note (rule_no_self_scoping): finishing `from_birth` generate+create is
-> pulled in here because both modes share the same (missing) machinery; calling
-> it out rather than silently bundling or dropping it.
-
-### 3.3 `.storageconfig` (Intelligent-Tiering) on init
-
-Normally pulumi writes `.storageconfig` at bucket creation
-(`ops/pulumi/tamanu/on-linux/src/backup/kopia.ts` + `ops/pulumi/backups/index.ts`).
-Canopy creates it as a **fallback**: on repo init, if `<prefix>.storageconfig`
-is **absent**, write it; **never overwrite** an existing one. A prefix
-containing only `.storageconfig` is treated as `empty` by the probe (§2.2), so a
-pre-seeded tiering config doesn't block from_birth.
-
-Use the exact schema ops uses (kopia `blobOptions` — data blobs under the `p`
-prefix → Intelligent-Tiering, everything else → Standard so indexes stay in the
-frequent tier), written as pretty JSON, `Content-Type: application/json`:
-
-```json
-{
-  "blobOptions": [
-    { "prefix": "p", "storageClass": "INTELLIGENT_TIERING" },
-    { "storageClass": "STANDARD" }
-  ]
-}
-```
-
-The `p` prefix is relative to the kopia storage root, so the same content is
-correct whether or not the repo lives under a canopy prefix; place the object at
-`<prefix>.storageconfig`.
-
----
-
-## 3a. Machine-facing config-as-a-resource API (ops/pulumi)
-
-Complements the wizard (does **not** replace it). Pulumi creates the bucket +
-device/maintenance roles, then pushes the backup config to Canopy as a managed
-resource — so operators don't hand-copy ARNs out of pulumi.
-
-- **Endpoints:** create / update / delete / get a `server_group_backup_config`
-  (the wizard's create/update reuse the same handlers). Create/update run the
-  **same server-side access-check/probe** (§2.2) before persisting, so a config
-  pushed by pulumi is validated identically — bad creds/role/bucket fail fast.
-- **Resource semantics:** idempotent upsert keyed by group (or bucket+prefix),
-  suitable for a Pulumi dynamic provider / `Command`-style resource. **Delete
-  tears down the config *and* deletes the Canopy-owned passphrase Secret** (both
-  modes — Canopy owns it now).
-- **Auth (decided):** gate on `TailscaleAdmin` for now — pulumi has tailnet
-  access. A proper non-interactive machine path (still over Tailscale, via
-  tagged/ACL-grant access) is wanted **later but explicitly out of scope here**.
-
----
-
-## 4. Work breakdown
-
-### DB (database crate, migration via `just migration`)
-- Add `maintenance_role_arn TEXT NOT NULL` to `server_group_backup_config` (no
-  existing rows, so NOT NULL is clean) + model field + re-exports.
-- Change `mode` CHECK from `IN ('from_birth','import')` to
-  `IN ('from_birth','passphrase')`.
-
-### commons-types
-- `BackupRepoMode`: replace `Import` with `Passphrase` (`"passphrase"`).
-
-### private-server (`fns/backups.rs`, `state.rs`)
-- AWS deps (`aws-sdk-sts`/`aws-sdk-s3`/`aws-config`); `probe` endpoint (inspect +
-  verify phases, §2.2); `already_configured` DB check. No delete-contents
-  endpoint — `other_content` is Retry-only (§2.5).
-- Secret creation on `create` for from_birth + passphrase (§3.2); extend the
-  kube wrapper with a create op; `create secrets` RBAC.
-- `CreateBackupConfigArgs`: add `maintenance_role_arn`; accept the typed
-  passphrase for `passphrase` mode; drop `repo_password_ref` as a user input and
-  the `Import` arms.
-- **Config-as-a-resource API (§3a):** create/update/delete/get usable by
-  pulumi, sharing the access-check; gated on `TailscaleAdmin` for now (§3a).
-- openapi regen (`just gen-openapi`).
-
-### jobs (`backup/{kopia,worker,maintenance,inspection,s3_metrics}.rs`, bin)
-- Switch **s3-metrics** (Rust SDK) to `AssumeRole(maintenance_role_arn)` (the
-  CloudWatch grant lives there); auto-refreshes. **preflight stays on
-  `target_role_arn`** — it validates the *device* path (assumes the device role
-  both ways incl. the restore session policy); a separate maintenance-role
-  preflight check is a follow-up, not this PR.
-- **kopia container-creds endpoint (§1.3):** the bin runs a localhost endpoint
-  (token → `maintenance_role_arn` registry, mints via Rust SDK); each kopia
-  subprocess gets `AWS_CONTAINER_CREDENTIALS_FULL_URI`+`AWS_CONTAINER_AUTHORIZATION_TOKEN`
-  and the env-hygiene scrub (unset web-identity/static/relative-uri vars). Remove
-  the `AWS_ROLE_ARN`-override path in `kopia.rs`. (No `credential_process`/`--role`.)
-- **`.storageconfig` on init (§3.3):** create if absent, never overwrite.
-
-### frontend (`private-web/`)
-- `BackupConfig.tsx` → multi-step wizard (step 1 identity+probe with both role
-  ARNs + region default `ap-southeast-2`; step 2 probe result + passphrase +
-  verify-probe for existing repos; step 3 schedule/retention). Render the
-  state→options matrix (§2.3); already-configured; `other_content` blocking with
-  a Retry button (no delete).
-- Generated api-types (`just gen-openapi`).
-
-### tests
-- Rust: probe endpoint (inspect + verify; mock/seed S3 where feasible), secret
-  creation, two-mode status machine, migration, the resource API.
-- Playwright e2e: wizard steps, region default, probe states (the e2e kube/AWS
-  clients are `None` today → probe needs a test seam to stub responses),
-  passphrase-mode flow, from_birth escrow flow.
-
-### cross-repo / ops (TAM-6878 pulumi)
-- New `canopy-private` SA + IRSA role; per-bucket trust includes it; `create
-  secrets` RBAC.
-- `maintenanceRoleArn` per group (CloudWatch + delete grant on it); pulumi calls
-  the new resource API to register configs (§3a).
-
----
-
-## 5. Decisions (all resolved)
-
-Resolved: dedicated `canopy-private` SA; probe assumes
-`maintenance_role_arn` (+ cheap `target_role_arn` validate); `maintenance_role_arn`
-NOT NULL (no existing rows); passphrase mode straight to `ready` (no escrow);
-`other_content` hard-blocks, Retry-only (Canopy never deletes contents);
-import-by-Secret dropped; **kopia creds via a localhost container-credentials
-endpoint** the bin serves (verified: `credential_process`/creds-file/`--role`
-all rejected; minio-go IAM provider polls + self-refreshes, no session ceiling,
-**no `MaxSessionDuration` ops change**); `kopia.repository` marker confirmed.
-
-Also resolved this round: pulumi→private-server auth = `TailscaleAdmin` gate
-(pulumi has tailnet access; machine/tagged-grant auth is a later, out-of-scope
-follow-up); deleting a config also deletes the Canopy-owned Secret;
-`.storageconfig` schema taken verbatim from the ops repo (§3.3); e2e probe is
-stubbed/mocked (whatever's simplest — the e2e AWS/kube clients are `None`, so the
-probe path gets a test seam returning canned `ProbeResult`s; no real S3).
-
-**Nothing blocking remains** — ready to implement on plan approval.
diff --git a/docs/plans/specs/canopy-database.md b/docs/plans/specs/canopy-database.md
deleted file mode 100644
index c094cf4a..00000000
--- a/docs/plans/specs/canopy-database.md
+++ /dev/null
@@ -1,681 +0,0 @@
-# Implementation spec: `canopy-database` (backup-credentials)
-
-Component: the **database crate** (`crates/database`) changes for the
-backup-credentials system. This is the foundational layer every other
-component (public-server endpoints, the `jobs`-crate schedulers, the
-private-server operator UI) builds on: it owns the migrations, the diesel
-models, and the `lib.rs` re-exports for all backup tables.
-
-Authoritative design: [`../backup-credentials.md`](../backup-credentials.md)
-(stage-2 stub: [`../backup-credentials-blind-relay.md`](../backup-credentials-blind-relay.md)).
-This spec does not re-litigate decided shape — it makes the DB layer concrete.
-
----
-
-## Purpose
-
-Provide the persistent state listed below (**10 tables** as shipped: the
-original seven plus `backup_type_defaults`, `server_backup_capabilities`, and
-`server_group_backup_schedule`, which the "Backup types addendum" at the foot
-of this spec added and which are folded into the list here):
-
-- **`server_group_backup_config`** — per-group repo-level backup configuration +
-  lifecycle status (one row per configured group; `group_id` PK → `server_groups`).
-- **`backup_type_defaults`** — Canopy-wide per-type defaults (`default_interval`,
-  `default_retention`, `auto_enable`).
-- **`server_backup_capabilities`** — what each server advertises it can back up
-  (bestool-registered), with a per-server `enabled` toggle.
-- **`server_group_backup_schedule`** — per-`(group, type)` schedule/retention
-  overrides over the type defaults.
-- **`backup_credential_issuances`** — audit log of every STS credential issuance.
-- **`backup_runs`** — what bestool reported per backup/restore run (client-minted UUID PK).
-- **`backup_maintenance_runs`** — Canopy-owned maintenance-Job outcomes (per-group).
-- **`backup_repo_snapshots`** — ground-truth inventory from the read-only inspection Job.
-- **`backup_repo_stats`** — cached repo + bucket size/stats for operator display (per-group).
-- **`backup_requests`** — pending operator one-off "backup now" flags (per `(server, type, purpose)`).
-
-Plus the diesel model structs, insert/query helpers, and `lib.rs` module +
-re-exports. Where helpers fall on a component boundary (e.g. staleness scan
-queries used by the `jobs` crate, issuance recording used by public-server)
-this spec defines the **signatures and ownership**; the calling logic lives
-in those components' own specs.
-
----
-
-## Conventions to follow (grounded in the repo)
-
-Read before implementing: `crates/database/src/{schema,servers,server_groups,devices,issues,statuses,pg_duration}.rs`
-and `migrations/2026-05-22-120000-0000_server_groups/{up,down}.sql`.
-
-- **Migrations are scaffolded with `just migration NAME`** (never hand-create
-  the directory — inconsistent naming; this is a flagged repeat mistake).
-  That runs `diesel migration generate`, producing
-  `migrations/<ts>_<name>/{up,down}.sql`. Then `just migrate` runs them and
-  `cargo fmt`s the regenerated `schema.rs`. The diesel CLI **regenerates
-  `crates/database/src/schema.rs`** from the live DB — do **not** hand-edit
-  `schema.rs`; let the migration drive it, then commit the diff.
-- One migration per logical change is the norm, but a cohesive feature can be
-  several sequential migrations (see the `2026-06-01-012906-000{0,1,2}` triple).
-  RESOLVED (impl): the core landed as one migration
-  `2026-06-12-090526-0000_backup_credentials` creating all **10** tables (the
-  addendum tables included from the start), with a clean reverse-order
-  `DROP TABLE` `down.sql`. Two follow-up migrations layered on later:
-  `2026-06-15-064431-0000_backup_group_scoped_issues` and
-  `2026-06-16-001346-0000_backup_config_lifecycle_columns` (adds `mode`,
-  `last_init_error`, `escrow_acked_at`, `escrow_acked_by` to
-  `server_group_backup_config`).
-- **Timestamps**: columns are `TIMESTAMPTZ NOT NULL DEFAULT NOW()`. Models map
-  them with `#[diesel(deserialize_as = jiff_diesel::Timestamp, serialize_as =
-  jiff_diesel::Timestamp)]` over a `jiff::Timestamp` field; nullable ones use
-  `jiff_diesel::NullableTimestamp` over `Option<Timestamp>`.
-- **`updated_at` auto-touch**: for tables with an `updated_at`, call
-  `SELECT diesel_manage_updated_at('<table>');` in `up.sql` (as
-  `server_groups` does). Only `server_group_backup_config` needs this here.
-- **INTERVAL columns** map to `crate::pg_duration::PgDuration` (wraps
-  `jiff::SignedDuration`; serde wire form is whole seconds as `i64`). For a
-  *nullable* interval (`expected_interval`), the field is
-  `Option<PgDuration>`; annotate the schema with `#[schema(value_type =
-  Option<i64>, format = "int64")]` for utoipa (see `ServerGroup::slack_open_delay`).
-- **JSONB `retention`** maps to `serde_json::Value` *at the storage layer* —
-  diesel handles `Jsonb -> serde_json::Value` natively (see `statuses.health`,
-  `statuses.extra`). The retention columns stay `JsonValue` in the model structs.
-  RESOLVED (impl): a validated shape **was** wanted after all — a typed
-  `RetentionPolicy` struct (`backups::RetentionPolicy`) sits *over* the raw
-  value with the kopia `keep_*` fields, `FLOOR_DAILY/WEEKLY/MONTHLY` constants,
-  `validate_floor()` (returns `AppError::BadRequest` listing the violated
-  fields), and `from_json`/`to_json`/`to_value` converters. The floor logic
-  lives in the DB crate and is called by the private-server write path; the
-  storage columns themselves remain `JsonValue` (so `RetentionPolicy` is a
-  helper, not a diesel column type).
-- **Models**: `#[derive(Debug, Clone, Serialize, Deserialize, Queryable,
-  Selectable, Insertable, utoipa::ToSchema)]`, `#[diesel(table_name =
-  crate::schema::<table>)]`, `#[diesel(check_for_backend(diesel::pg::Pg))]`.
-  A separate `New<Table>` `Insertable` struct is used where the insert shape
-  differs from the row (see `NewServerGroup`, `NewStatus`). Add
-  `#[diesel(belongs_to(...))]` + a `joinable!` entry where a join is wanted.
-- **Helper methods** are `impl` blocks with
-  `pub async fn <name>(db: &mut AsyncPgConnection, ...) -> Result<...>` returning
-  `commons_errors::Result`, ending each query with `.map_err(AppError::from)`.
-  Use `use crate::schema::<table>::dsl;` inside each fn (the established style).
-- **`BIGSERIAL` PK** maps to `pub id: i64` in the model and is **omitted** from
-  the `New<Table>` insertable.
-- **Schema regen verification**: after the migration, `schema.rs` gains the 10
-  `diesel::table!` blocks, plus `joinable!` and `allow_tables_to_appear_in_same_query!`
-  entries. Confirm `bigserial` surfaces as `Int8`, `JSONB` as `Jsonb`,
-  `INTERVAL` as `Interval`/`Nullable<Interval>`.
-
----
-
-## Migration: `backup_credentials`
-
-`up.sql` creates the 10 tables below. DDL is normative (it is what the
-diesel schema regen reads); the design doc's snippets are the source.
-
-RESOLVED (impl) — **FK semantics are uniform: plain `REFERENCES` with NO
-`ON DELETE`/`ON UPDATE` clause everywhere.** `server_groups`, `servers`, and
-`devices` are *archived* (`deleted_at` soft-delete), never hard-deleted, so the
-cascade-vs-preserve distinction the original per-table notes agonised over is
-moot in practice. The per-table "CASCADE here / no-CASCADE there" prose below
-is **superseded** by this single rule; open questions 1 & 2 are resolved
-accordingly (see below). The original notes are kept inline for design history,
-struck through where they no longer hold.
-
-The addendum tables (`backup_type_defaults`, `server_backup_capabilities`,
-`server_group_backup_schedule`) and the type-keying deltas are folded into the
-DDL shown here; see the addendum at the foot for the design rationale.
-
-### `server_group_backup_config`
-
-RESOLVED (impl) — `expected_interval` and `retention` moved off this table
-(repo-level only now; schedule/retention live per-`(group, type)` in
-`server_group_backup_schedule` per the addendum). Lifecycle columns (`mode`,
-`last_init_error`, `escrow_acked_at`, `escrow_acked_by`) were added by the
-`2026-06-16-...backup_config_lifecycle_columns` migration. As-shipped DDL:
-
-```sql
-CREATE TABLE server_group_backup_config (
-    group_id          UUID PRIMARY KEY REFERENCES server_groups(id),
-    bucket            TEXT NOT NULL,
-    prefix            TEXT NOT NULL DEFAULT '',
-    target_role_arn   TEXT NOT NULL,
-    region            TEXT,
-    repo_password_ref TEXT NOT NULL,
-    status            TEXT NOT NULL CHECK (status IN ('provisioning', 'escrow_pending', 'ready')),
-    created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
-    updated_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
-    -- added by 2026-06-16-...backup_config_lifecycle_columns:
-    mode              TEXT NOT NULL DEFAULT 'from_birth' CHECK (mode IN ('from_birth', 'import')),
-    last_init_error   TEXT,
-    escrow_acked_at   TIMESTAMPTZ,
-    escrow_acked_by   TEXT
-);
-SELECT diesel_manage_updated_at('server_group_backup_config');
-```
-
-- ~~**`ON DELETE CASCADE` is intentional** here~~ — superseded: plain
-  `REFERENCES`, no cascade (archival model; groups are soft-deleted, not
-  hard-deleted).
-- `status` is a `TEXT` column with a `CHECK (status IN (...))` for the closed
-  three-value set `{provisioning, escrow_pending, ready}`, validated in code via
-  the `BackupConfigStatus` enum. The closed enums all carry a DB `CHECK`
-  in the shipped schema (status, mode, purpose, outcome, kind).
-- **Lifecycle columns:** `mode` is the 5th closed enum `BackupRepoMode`
-  (`from_birth` / `import`, with a DB CHECK); `last_init_error` is set by the
-  init Job on `kopia repository create` failure and cleared by the operator-UI
-  on retry; `escrow_acked_at`/`escrow_acked_by` stamp the Bitwarden-escrow ack
-  that flips `escrow_pending → ready`.
-
-### `backup_type_defaults`, `server_backup_capabilities`, `server_group_backup_schedule` (addendum tables)
-
-```sql
-CREATE TABLE backup_type_defaults (
-    type              TEXT PRIMARY KEY,
-    default_interval  INTERVAL,
-    default_retention JSONB NOT NULL CHECK (jsonb_typeof(default_retention) = 'object'),
-    auto_enable       BOOLEAN NOT NULL DEFAULT false
-);
-
-CREATE TABLE server_backup_capabilities (
-    server_id     UUID NOT NULL REFERENCES servers(id),
-    type          TEXT NOT NULL,
-    enabled       BOOLEAN NOT NULL,
-    registered_at TIMESTAMPTZ NOT NULL DEFAULT now(),
-    PRIMARY KEY (server_id, type)
-);
-
-CREATE TABLE server_group_backup_schedule (
-    group_id          UUID NOT NULL REFERENCES server_groups(id),
-    type              TEXT NOT NULL,
-    expected_interval INTERVAL,
-    retention         JSONB CHECK (retention IS NULL OR jsonb_typeof(retention) = 'object'),
-    created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
-    updated_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
-    PRIMARY KEY (group_id, type)
-);
-SELECT diesel_manage_updated_at('server_group_backup_schedule');
-```
-
-- `retention` now lives here (nullable, `jsonb_typeof='object'` CHECK when
-  present) and on `backup_type_defaults.default_retention` (NOT NULL, same
-  CHECK) — **superseding** the original `retention` column on
-  `server_group_backup_config`. Effective value for a `(group, type)` is the
-  schedule override `?? backup_type_defaults`, with the org retention floor
-  applied in code (`RetentionPolicy::validate_floor`).
-- `server_backup_capabilities.enabled` is **seeded** from
-  `backup_type_defaults.auto_enable` at first registration, then operator-
-  toggleable per server.
-
-### `backup_credential_issuances`
-
-```sql
-CREATE TABLE backup_credential_issuances (
-    id                  BIGSERIAL PRIMARY KEY,
-    device_id           UUID NOT NULL REFERENCES devices(id),
-    group_id            UUID NOT NULL REFERENCES server_groups(id),
-    type                TEXT NOT NULL,
-    issued_at           TIMESTAMPTZ NOT NULL DEFAULT now(),
-    expires_at          TIMESTAMPTZ NOT NULL,
-    purpose             TEXT NOT NULL CHECK (purpose IN ('backup', 'restore')),
-    sts_assumed_role    TEXT NOT NULL,
-    sts_request_id      TEXT,
-    access_key_id       TEXT,
-    bucket              TEXT NOT NULL,
-    prefix              TEXT NOT NULL
-);
-CREATE INDEX ON backup_credential_issuances (device_id, issued_at DESC);
-CREATE INDEX ON backup_credential_issuances (group_id, issued_at DESC);
-```
-
-- ~~**No CASCADE on `group_id`/`device_id` — deliberate audit preservation**~~ —
-  superseded by the uniform no-cascade/archival rule (the FK is plain
-  `REFERENCES` regardless; the audit data is preserved because the parent rows
-  are soft-deleted, not hard-deleted).
-- `bucket`/`prefix` are **snapshots at issuance time**, not FKs back to config.
-- `type TEXT` (addendum) — backups are keyed `(server, type)`.
-- `purpose` is `TEXT` with a DB `CHECK (purpose IN ('backup','restore'))`,
-  also validated in code via `BackupPurpose`.
-
-### `backup_runs`
-
-```sql
-CREATE TABLE backup_runs (
-    id              UUID PRIMARY KEY,
-    device_id       UUID NOT NULL REFERENCES devices(id),
-    group_id        UUID NOT NULL REFERENCES server_groups(id),
-    server_id       UUID REFERENCES servers(id),
-    type            TEXT NOT NULL,
-    purpose         TEXT NOT NULL CHECK (purpose IN ('backup', 'restore')),
-    outcome         TEXT NOT NULL CHECK (outcome IN ('success', 'failure')),
-    error           TEXT,
-    bytes_uploaded  BIGINT,
-    snapshot_id     TEXT,
-    reported_at     TIMESTAMPTZ NOT NULL DEFAULT now()
-);
-CREATE INDEX ON backup_runs (group_id, reported_at DESC);
-CREATE INDEX ON backup_runs (device_id, reported_at DESC);
-CREATE INDEX ON backup_runs (server_id, type, reported_at DESC);
-```
-
-- `server_id` (nullable) + `type TEXT` were added by the addendum so staleness
-  is per-`(server, type)`. The third index `(server_id, type, reported_at DESC)`
-  serves that per-`(server, type)` "latest run" staleness scan.
-
-- **`id` is a client-supplied UUID** (the run-uuid bestool mints at run start),
-  **not** `gen_random_uuid()` and **not** `BIGSERIAL`. No `DEFAULT`. The
-  `New`-side insert provides it. A duplicate `id` fails its own insert (PK
-  violation) — that's the intended safety (design §`backup_runs`); the model
-  helper should surface that as a clean error, not panic.
-- `device_id`/`group_id` come from the authenticated `ServerDevice` context in
-  the caller, **never** from the client body — the model helper takes them as
-  parameters (see contract below), it does not read them from a deserialized
-  client struct.
-- Plain `REFERENCES` on the FKs (uniform no-cascade/archival rule).
-- For the staleness scan, the hot query is "latest successful `purpose='backup'`
-  run per `(server, type)`"; the `(server_id, type, reported_at DESC)` index
-  serves it (the `(group_id, reported_at DESC)` index serves repo-level cuts).
-
-### `backup_maintenance_runs`
-
-```sql
-CREATE TABLE backup_maintenance_runs (
-    id              BIGSERIAL PRIMARY KEY,
-    group_id        UUID NOT NULL REFERENCES server_groups(id),
-    kind            TEXT NOT NULL,            -- "quick" | "full"
-    started_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-    finished_at     TIMESTAMPTZ,
-    outcome         TEXT,                     -- NULL while running
-    error           TEXT,
-    bytes_reclaimed BIGINT
-);
-CREATE INDEX ON backup_maintenance_runs (group_id, started_at DESC);
-```
-
-- Plain `REFERENCES` on `group_id` (uniform no-cascade/archival rule).
-- `outcome` NULL = still running; the model helper has a `start()` (insert,
-  returns the new `i64` id) and a `finish(id, outcome, error, bytes_reclaimed)`
-  update — the Job-side caller (jobs crate) owns the start/finish bracket.
-
-### `backup_repo_snapshots`
-
-```sql
-CREATE TABLE backup_repo_snapshots (
-    group_id           UUID NOT NULL REFERENCES server_groups(id),
-    source             TEXT NOT NULL,
-    server_id          UUID REFERENCES servers(id),
-    type               TEXT,
-    latest_snapshot_at TIMESTAMPTZ,
-    observed_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
-    PRIMARY KEY (group_id, source)
-);
-```
-
-- Composite PK `(group_id, source)`. The inspection Job **upserts** per source
-  (`ON CONFLICT (group_id, source) DO UPDATE`) — provide an `upsert` helper.
-- `server_id` (and `type`, addendum) are parsed from `source` by the caller and
-  are **nullable** (a source whose server-id no longer resolves still records).
-  Plain `REFERENCES` on both FKs (uniform no-cascade/archival rule) — RESOLVED:
-  no `SET NULL`; servers are archived not deleted, so the "block on delete"
-  worry is moot.
-
-### `backup_repo_stats`
-
-```sql
-CREATE TABLE backup_repo_stats (
-    group_id         UUID PRIMARY KEY REFERENCES server_groups(id),
-    snapshot_count   INTEGER,
-    source_count     INTEGER,
-    logical_bytes    BIGINT,
-    physical_bytes   BIGINT,
-    bucket_bytes     BIGINT,
-    observed_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
-);
-```
-
-- One row per group (PK = `group_id`). Filled by **two distinct writers**: the
-  inspection Job sets the repo-derived fields + `source_count`/`snapshot_count`;
-  the S3-metrics task sets `bucket_bytes` (best-effort/nullable, may lag).
-  Provide **two separate update helpers** so each writer touches only its
-  fields (don't clobber `bucket_bytes` from the inspection writer, or vice
-  versa) — both upsert on `group_id`.
-- This is a *cache*, not audit. RESOLVED (impl): plain `REFERENCES`, no
-  cascade — the uniform archival rule applies here too (groups are
-  soft-deleted, so a "rebuildable cache should cascade" exception isn't needed).
-
-### `backup_requests`
-
-```sql
-CREATE TABLE backup_requests (
-    server_id    UUID NOT NULL REFERENCES servers(id),
-    type         TEXT NOT NULL,
-    purpose      TEXT NOT NULL CHECK (purpose IN ('backup', 'restore')),
-    requested_at TIMESTAMPTZ NOT NULL DEFAULT now(),
-    requested_by TEXT,
-    PRIMARY KEY (server_id, type, purpose)
-);
-```
-
-- Keyed on `server_id` (one-off requests are server-scoped, cleared when the
-  run is reported). Composite PK is `(server_id, type, purpose)` (addendum
-  added `type`) — one pending request per `(server, type, purpose)`; a second
-  request is an upsert (refresh `requested_at`/`requested_by`).
-- This is transient operator intent, not audit. RESOLVED (impl): plain
-  `REFERENCES` on `server_id`, no cascade (uniform archival rule).
-
-`down.sql`: `DROP TABLE` all 10 in reverse-dependency order (no inter-table
-FKs among them; all FKs point at pre-existing tables).
-
----
-
-## Diesel models + `lib.rs`
-
-New module `crates/database/src/backups.rs` (single module for all 10
-tables — they're one cohesive feature, mirroring how `issues.rs` holds
-issues/events/incidents together). RESOLVED (impl) — the as-shipped `lib.rs`
-re-export superset:
-
-```rust
-pub mod backups;
-pub use backups::{
-    BackupCredentialIssuance, BackupMaintenanceRun, BackupRepoSnapshot, BackupRepoStats,
-    BackupRequest, BackupRun, BackupTypeDefault, NewBackupCredentialIssuance, NewBackupRun,
-    NewBackupTypeDefault, NewServerGroupBackupConfig, NewServerGroupBackupSchedule,
-    ServerBackupCapability, ServerGroupBackupConfig, ServerGroupBackupSchedule,
-};
-// the enums are defined in commons-types and re-exported through database:
-pub use commons_types::backup::{
-    BackupConfigStatus, BackupPurpose, BackupRepoMode, BackupType, MaintenanceKind, RunOutcome,
-};
-// RetentionPolicy is reached as `database::backups::RetentionPolicy`
-// (it lives in the backups module; not in the flat re-export set).
-```
-
-(Existing `lib.rs` re-exports `devices::*` and `bestool_snippets::*`; backups
-is the same pattern. The five closed enums — `BackupPurpose`, `RunOutcome`,
-`MaintenanceKind`, `BackupRepoMode`, `BackupConfigStatus` — live in
-`commons-types` plus the open `BackupType{Custom}`; see below.)
-
-### String-typed enums
-
-`purpose`, `status`, `outcome`, `kind` are `TEXT` in the DB. Two options,
-consistent with existing code:
-
-- **Plain `String` fields**, validated at the API layer (matches
-  `issues.resolved_reason`, `servers.kind`). Simplest; lowest ceremony.
-- A small enum in `commons-types` with `Display`/`FromStr` and stored via
-  `deserialize_as = String, serialize_as = String` (matches how `Severity` and
-  `ServerKind` are handled). Preferred if the values are reused across
-  public-server, jobs, and private-web wire types — which they are
-  (`purpose` flows through three components).
-
-RESOLVED (impl): the enum option won across the board — **all** the closed
-enum-ish columns are typed `commons-types` enums (via a `text_enum!` macro that
-implements `Display`/`FromStr` + diesel `ToSql`/`FromSql` over `Text`), each
-backed by a DB `CHECK`. The five closed enums are `BackupPurpose
-{Backup, Restore}`, `RunOutcome {Success, Failure}`, `MaintenanceKind
-{Quick, Full}`, `BackupRepoMode {FromBirth, Import}`, and `BackupConfigStatus
-{Provisioning, EscrowPending, Ready}` (the original spec listed only four of
-these; `BackupRepoMode` is the fifth, added with the lifecycle columns).
-`status` did **not** stay a bare `String` — it's `BackupConfigStatus`.
-Separately, `backup type` is the **open** enum `BackupType` with a `Custom(String)`
-arm (no DB CHECK, any advertised name preserved verbatim). The model field
-types and the `CHECK` constraints agree.
-
-### Model sketches (abbreviated; full set in `backups.rs`)
-
-```rust
-#[derive(Debug, Clone, Serialize, Deserialize, Queryable, Selectable, Insertable, utoipa::ToSchema)]
-#[diesel(table_name = crate::schema::server_group_backup_config)]
-#[diesel(check_for_backend(diesel::pg::Pg))]
-pub struct ServerGroupBackupConfig {
-    pub group_id: Uuid,
-    pub bucket: String,
-    pub prefix: String,
-    pub target_role_arn: String,
-    pub region: Option<String>,
-    pub repo_password_ref: String,
-    pub status: BackupConfigStatus,
-    #[diesel(deserialize_as = jiff_diesel::Timestamp, serialize_as = jiff_diesel::Timestamp)]
-    pub created_at: Timestamp,
-    #[diesel(deserialize_as = jiff_diesel::Timestamp, serialize_as = jiff_diesel::Timestamp)]
-    pub updated_at: Timestamp,
-    // lifecycle columns (2026-06-16 migration):
-    #[schema(value_type = String)]
-    pub mode: BackupRepoMode,
-    pub last_init_error: Option<String>,
-    #[diesel(deserialize_as = jiff_diesel::NullableTimestamp, serialize_as = jiff_diesel::NullableTimestamp)]
-    pub escrow_acked_at: Option<Timestamp>,
-    pub escrow_acked_by: Option<String>,
-}
-```
-
-(Note: `expected_interval`/`retention` are **not** on this struct — they moved
-to `server_group_backup_schedule` / `backup_type_defaults` per the addendum.)
-
-`backup_runs` row maps `id: Uuid` (client-supplied, no default). Its
-`NewBackupRun` insertable **includes** `id` (unlike the BIGSERIAL tables which
-omit it). `bytes_uploaded`/`snapshot_id`/`error` are `Option<_>`.
-
-### Model helper methods (DB-layer surface this component owns)
-
-Defined here; their callers live in other components' specs. Signatures
-(`db: &mut AsyncPgConnection`, returning `commons_errors::Result`):
-
-- `ServerGroupBackupConfig::get(db, group_id) -> Result<Option<Self>>` — the
-  endpoint resolution path (absent → caller maps to 409).
-- `ServerGroupBackupConfig::upsert(db, NewServerGroupBackupConfig) -> Result<Self>`
-  and `set_status(db, group_id, status) -> Result<Self>` — operator UI / repo-init flow.
-- `ServerGroupBackupConfig::list_scheduled(db) -> Result<Vec<Self>>` — rows
-  with `expected_interval IS NOT NULL` (the staleness-scan set).
-- `BackupCredentialIssuance::record(db, NewBackupCredentialIssuance) -> Result<Self>`
-  — called by public-server step 7. Snapshots bucket/prefix; takes resolved
-  `device_id`/`group_id`/`access_key_id`/`sts_request_id`.
-- `BackupRun::record(db, NewBackupRun) -> Result<Self>` — called by
-  `POST /backup-report`. PK violation on duplicate `id` returns a clean
-  `Result::Err` (caller decides idempotency response).
-- `BackupRun::latest_success_for_server(db, server_id) -> Result<Option<Self>>`
-  and a bulk `latest_success_by_server(db, &[Uuid]) -> Result<HashMap<Uuid, Self>>`
-  filtered to `purpose='backup'`, `outcome='success'` — the staleness join.
-  (Server-centric: `backup_runs` rows carry `device_id`; the scan joins via the
-  server's associated devices, or filters by `group_id` then maps device→server
-  in the caller. Provide the query keyed the way the `jobs` scan needs — settle
-  with the jobs spec; the DB helper exposes both a per-group and per-device cut.)
-- `BackupMaintenanceRun::start(db, group_id, kind) -> Result<i64>` and
-  `finish(db, id, outcome, error, bytes_reclaimed) -> Result<()>`.
-- `BackupRepoSnapshot::upsert(db, group_id, source, server_id, latest_snapshot_at) -> Result<()>`
-  and `list_for_group(db, group_id) -> Result<Vec<Self>>`.
-- `BackupRepoStats::upsert_repo_fields(db, group_id, snapshot_count, source_count, logical, physical) -> Result<()>`
-  and `upsert_bucket_bytes(db, group_id, bucket_bytes) -> Result<()>` — the two
-  separate writers; both `ON CONFLICT (group_id) DO UPDATE` touching only their
-  own columns. `get(db, group_id) -> Result<Option<Self>>` for the stats panel.
-- `BackupRequest::enqueue(db, server_id, purpose, requested_by) -> Result<()>`
-  (upsert), `clear(db, server_id, purpose) -> Result<()>`,
-  `pending_for_server(db, server_id) -> Result<Vec<Self>>`.
-
-The "present since" anchor for never-backed-up detection
-(`max(MIN(device_server_associations.first_seen) over the server,
-server_group_backup_config.created_at)`) is a **jobs-crate query** that joins
-these tables; the DB crate exposes the building blocks (`list_scheduled`,
-`latest_success_*`, and existing `device_server_associations` access). Note
-`first_seen` is per `(device_id, server_id)` pair — use `MIN` over the server.
-
----
-
-## Interfaces / contracts
-
-### Provided (to other components)
-
-- **Tables + schema** — the ten `diesel::table!` blocks in `schema.rs` and
-  the typed models in `backups.rs`, re-exported from `lib.rs`. Public-server,
-  the jobs crate, and private-server all `use database::{...}`.
-- **`ServerGroupBackupConfig` row** — the single source for repo-level
-  `bucket`, `prefix`, `region`, `target_role_arn`, `repo_password_ref`,
-  `status`, `mode`, and the lifecycle fields. (Schedule/retention are now
-  per-`(group, type)` on `server_group_backup_schedule` / `backup_type_defaults`.)
-  Consumed by: `GET /backup-target` &
-  `POST /backup-credentials` (public-server); maintenance/inspection schedulers
-  & preflight (jobs); onboarding/stats UI (private-server).
-- **Audit-record helpers** — `BackupCredentialIssuance::record`,
-  `BackupRun::record`, `BackupMaintenanceRun::{start,finish}` — the write
-  surface for the issuance/report/maintenance flows.
-- **Scan helpers** — `list_scheduled`, `latest_success_*`,
-  `BackupRepoSnapshot::list_for_group` — inputs to staleness/reconciliation
-  detection (jobs crate, signals 1 & 2).
-- **`backup_requests` queue** — `enqueue`/`clear`/`pending_for_server` — the
-  operator one-off "backup now" home, read by the cadence-trigger path.
-- **Wire/utoipa shapes** — every model derives `utoipa::ToSchema`, so the
-  private-server handlers can use them in `#[utoipa::path]` and the regenerated
-  `openapi.json` → `api-types.ts` exposes them to private-web (run
-  `just gen-openapi` in the component that adds the handlers, not here).
-
-### Consumed (from existing code)
-
-- `server_groups(id)`, `servers(id)`, `devices(id)` — FK targets.
-- `crate::pg_duration::PgDuration` (INTERVAL ↔ `SignedDuration`).
-- `jiff_diesel::{Timestamp, NullableTimestamp}` (timestamp mapping).
-- `commons_errors::{AppError, Result}` (return type + `AppError::from`).
-- `device_server_associations` (`first_seen`, per device-server pair) — read by
-  the staleness "present since" query, joined alongside these tables.
-- `database::issues::NewEvent::save(conn, server_id, device_id)` — *not* called
-  by this crate, but the staleness/poisoning alerting (jobs) writes through it;
-  this crate must not duplicate alerting logic. (`source="canopy"`,
-  `ref="backup-staleness"` etc. live in the jobs component.)
-
-### Explicitly NOT in this component
-
-- No AWS SDK / STS / S3 calls (public-server + jobs).
-- No kube client / Secret reads (public-server + jobs).
-- No HTTP handlers, no scheduler loops, no alerting/`NewEvent` construction.
-- No utoipa `#[path]` annotations or `openapi.json` regen (that's whichever
-  crate adds the handlers).
-
----
-
-## Data shapes
-
-- **`retention`** (JSONB, on `server_group_backup_schedule` /
-  `backup_type_defaults`): a kopia keep-policy object, e.g.
-  `{"keep_latest":1,"keep_daily":7,"keep_weekly":4,"keep_monthly":6,"keep_annual":0}`.
-  Stored as `JsonValue`; the typed `RetentionPolicy` helper sits over it.
-  **Floor enforcement** (`keep_daily≥7`, `keep_weekly≥4`, `keep_monthly≥6`) is
-  `RetentionPolicy::validate_floor()` — a DB-crate function (returns
-  `AppError::BadRequest`) called by the private-server write path, **not** a DB
-  constraint (the only DB CHECK on these columns is `jsonb_typeof='object'`).
-- **`status`**: `provisioning` → `escrow_pending` → `ready`. Backups stay
-  dormant (412/409 from the endpoints) until `ready`; this is enforced by the
-  *endpoint*, but the column is the source of truth.
-- **`purpose`**: `backup` | `restore`.
-- **`outcome`**: `success` | `failure` (`backup_runs`); the same plus NULL
-  while running for `backup_maintenance_runs`.
-- **`kind`**: `quick` | `full`.
-- **`source`** (`backup_repo_snapshots`): kopia source string
-  `canopy@<server-id>:<path>`; `server_id` parsed out by the caller.
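-
-The retention floor described above could look roughly like the following. This
-is a minimal sketch under stated assumptions: the struct fields mirror the
-example keep-policy object, the error is a plain `String` standing in for
-`AppError::BadRequest`, and the real `RetentionPolicy` sits over a `JsonValue`
-rather than holding typed fields directly.
-
-```rust
-/// Illustrative stand-in for the typed retention helper.
-#[derive(Debug, Clone, Copy, PartialEq, Eq)]
-pub struct RetentionPolicy {
-    pub keep_latest: u32,
-    pub keep_daily: u32,
-    pub keep_weekly: u32,
-    pub keep_monthly: u32,
-    pub keep_annual: u32,
-}
-
-impl RetentionPolicy {
-    /// Checks the floors named in the spec: daily >= 7, weekly >= 4, monthly >= 6.
-    /// `String` is a placeholder for the real error type.
-    pub fn validate_floor(&self) -> Result<(), String> {
-        let floors = [
-            ("keep_daily", self.keep_daily, 7),
-            ("keep_weekly", self.keep_weekly, 4),
-            ("keep_monthly", self.keep_monthly, 6),
-        ];
-        for (name, value, floor) in floors {
-            if value < floor {
-                return Err(format!("{name} must be at least {floor}, got {value}"));
-            }
-        }
-        Ok(())
-    }
-}
-```
-
-Keeping the check in code rather than in a DB `CHECK` means the floor values can
-change without a migration, which matches the split described above.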
-
----
-
-## Testing approach (per AGENTS.md)
-
-DB-only tests via `commons_tests::db::TestDb::run(|mut conn, _url| async move { ... })`,
-`#[tokio::test(flavor = "multi_thread")]`, exercising **model functions
-directly** (not HTTP). Put them in `crates/database/tests/` with no `_test`
-suffix (e.g. `tests/backups.rs`), `use database::*;` for the models. Run with
-`just test-package database` or `just test-name <name>`.
-
-Cover:
-
-1. **Migration applies cleanly** — implicitly via every test (each spins a
-   fresh migrated DB) plus an explicit smoke test inserting one row per table.
-2. **`server_group_backup_config`** — insert/upsert round-trip incl. NULL
-   `region`/`expected_interval`, JSONB `retention` round-trips, `status`
-   transitions, `updated_at` auto-touch fires on update, `jsonb_typeof` CHECK
-   rejects a non-object retention.
-3. **FK behaviour (archival model)** — RESOLVED (impl): there is **no** cascade
-   anywhere; groups/servers are archived (`deleted_at`), never hard-deleted, so
-   a config-delete does not cascade and a hard `DELETE` on a `server_groups` row
-   with any backup rows simply fails the FK (and is never done in practice). The
-   original "cascade the config / block on audit rows" split no longer applies —
-   the rule is uniform plain `REFERENCES`. (No dedicated cascade test is needed;
-   the archival path is what's exercised.)
-4. **`backup_runs` client-supplied PK** — insert with a chosen UUID succeeds;
-   re-inserting the same UUID returns an error (PK violation surfaced as
-   `Result::Err`, not a panic); `device_id`/`group_id` are taken from
-   parameters (a test that the helper signature doesn't read them from a body).
-5. **Issuance audit** — `record` snapshots bucket/prefix; later changing the
-   config row does not mutate the issuance row. Indexes exercised by an
-   ordered `(device_id, issued_at DESC)` query returning newest-first.
-6. **Scan helpers** — `list_scheduled` returns only non-NULL-interval rows;
-   `latest_success_*` filters to `purpose='backup'` + `outcome='success'` and
-   ignores a newer `restore` success (the staleness-reset bug guard).
-7. **`backup_repo_stats` split writers** — `upsert_repo_fields` then
-   `upsert_bucket_bytes` accumulate without clobbering each other; either order.
-8. **`backup_repo_snapshots` upsert** — second observation of the same
-   `(group_id, source)` updates `latest_snapshot_at`/`observed_at` in place.
-9. **`backup_requests`** — enqueue is upsert on `(server_id, type, purpose)`;
-   `clear` removes; `pending_for_server` lists.
-
-No HTTP/e2e here (those belong to the public-server and private-server specs).
-Per repo memory: per-package tests while coding (`just test-package database`),
-no final full-suite run.
-
----
-
-## Open questions / decisions to make
-
-1. **`backup_repo_stats` / `backup_requests` cascade.** RESOLVED (impl):
-   **no cascade** — there is *no* cache/transient-vs-audit split. Every backup
-   FK is plain `REFERENCES`, because groups/servers are archived (`deleted_at`),
-   not hard-deleted, so cascade-vs-preserve never fires.
-2. **`backup_repo_snapshots.server_id` FK on a server delete.** RESOLVED (impl):
-   plain `REFERENCES servers(id)`, no `SET NULL`/cascade — same archival rule.
-   The column is nullable only because a `source` server-id may not resolve at
-   observation time, not for delete semantics.
-3. **Enum representation.** RESOLVED (impl): `commons-types` enums for **all**
-   the closed sets (`purpose`, `outcome`, `kind`, `mode`/`BackupRepoMode`, and
-   `status`/`BackupConfigStatus` — `status` did not stay a plain `String`), each
-   with a matching DB `CHECK`; the open `BackupType{Custom}` for the type name.
-4. **Where `purpose`/`status` CHECK constraints live** — RESOLVED (impl): the
-   closed enums carry **both** a DB `CHECK (... IN ...)` *and* the typed
-   `commons-types` enum at the model layer.
-5. **`backup_runs.id` collision response contract.** DB returns a PK-violation
-   error; the *endpoint* decides whether a duplicate report is a 409 or an
-   idempotent 204. That's a public-server decision, but the DB helper's
-   error-mapping (does `record` map unique-violation to a typed `AppError`
-   variant, or pass the raw diesel error?) should be settled here so the caller
-   can match on it. Lean: map to `AppError::Conflict` so the caller can branch.
-6. **`retention` validation surface.** RESOLVED (impl): code-level via the typed
-   `RetentionPolicy::validate_floor()` (DB crate), called by the private-server
-   write path; not a DB floor-CHECK (the only DB CHECK is `jsonb_typeof`).
-7. **Indexing for the staleness scan.** The provided indexes cover the
-   per-group/per-device "latest run" cuts. If the jobs scan ends up doing a
-   `DISTINCT ON (server)` over `backup_runs` joined through
-   `device_server_associations`, a covering index may be wanted — defer until
-   the jobs query is concrete, then add in a follow-up migration.
-
----
-
-## Backup types addendum (supersedes the relevant schema above)
-
-Added after this spec: backups are keyed `(server, type)`, not `(server)`.
-See the plan's "Backup types" section. Concrete deltas:
-
-- **`server_group_backup_config`** drops `expected_interval` and
-  `retention` — it's now repo-level only (`bucket`, `prefix`,
-  `target_role_arn`, `region`, `repo_password_ref`, `status`).
-- **New tables/models:**
-  - `server_backup_capabilities(server_id, type, enabled, registered_at)`
-    PK `(server_id, type)` — bestool-registered; `enabled` **seeded from**
-    `backup_type_defaults.auto_enable` at first registration, then
-    operator-toggleable per server.
-  - `backup_type_defaults(type PK, default_interval, default_retention
-    JSONB, auto_enable BOOL)` — canopy-wide per-type defaults.
-  - `server_group_backup_schedule(group_id, type, expected_interval,
-    retention)` PK `(group_id, type)` — schedule/retention overrides over
-    the type defaults; absent row → defaults.
-- **`type TEXT` column added to** `backup_credential_issuances`,
-  `backup_runs` (+ a `server_id` column for per-server-type staleness),
-  `backup_repo_snapshots`, and `backup_requests` (PK now
-  `(server_id, type, purpose)`). `backup_maintenance_runs` and
-  `backup_repo_stats` stay per-group (repo-level).
-- **New model surface:** capability upsert + per-server toggle; effective
-  schedule/retention resolution (`override ?? type-default`, with the org
-  retention floor enforced); "list active `(server, type)`" for the
-  scheduler/staleness.
diff --git a/docs/plans/specs/canopy-jobs-detection-preflight.md b/docs/plans/specs/canopy-jobs-detection-preflight.md
deleted file mode 100644
index bbc85b82..00000000
--- a/docs/plans/specs/canopy-jobs-detection-preflight.md
+++ /dev/null
@@ -1,602 +0,0 @@
-# Spec: canopy-jobs-detection — staleness, reconciliation, alerting & upstream preflight
-
-Component of the **backup-credentials** system. Authoritative design:
-[`../backup-credentials.md`](../backup-credentials.md) (stage-2 stub:
-[`../backup-credentials-blind-relay.md`](../backup-credentials-blind-relay.md)).
-
-This spec covers the **detection / alerting / preflight** half of the
-Canopy control plane: the periodic jobs that decide whether each group's
-backups are healthy and raise issues/events when they are not. It does
-**not** cover credential issuance (`public-server` endpoints), the
-maintenance/inspection/init Jobs themselves, the operator UI, or the IaC —
-those are sibling components. It *consumes* the tables those components
-write (`backup_runs`, `backup_maintenance_runs`, `backup_repo_snapshots`,
-`server_group_backup_config`) and the AWS-client plumbing they introduce.
-
-## Purpose
-
-Three classes of periodic check, all running as loops in the `jobs` crate
-and all alerting through the existing issues/events/incidents model
-(`NewEvent::save`, `source="canopy"`):
-
-1. **Signal 1 — staleness scan** (DB-only, frequent): scan servers that are
-   *expected* to be backed up and alert when no recent successful backup is
-   on record. Server-centric. Also catches stuck maintenance.
-2. **Signal reconciliation (1 / 2 / 3)**: cross-check what devices
-   *reported* (signal 1, `backup_runs`) against what *actually landed*
-   (signal 2, `backup_repo_snapshots`) and — later — what PGRO proved
-   *restorable* (signal 3, `backup_restore_checks`). Disagreement is itself
-   an alert; repo corruption (poisoning) is a group-level critical.
-3. **Upstream preflight** (AWS-touching, hash-jittered): Canopy checking its
-   *own* access — `GetCallerIdentity` (shared, ~minute) plus per-group deep
-   checks (both purposes issue working creds + Object-Lock still in place),
-   hourly.
-
-The shared thread running through all three: a **group-level** failure
-(can't mint creds, lock removed, repo corrupt, restore broken) must page
-**regardless of any server's `is_monitored`**, whereas **per-server**
-staleness obeys the existing `is_monitored` gate. The incident model is
-server-keyed today, so the group-level path needs new plumbing (see
-[Group-level alerting](#group-level-alerting-server-independent)).
-
-## Where it lives
-
-RESOLVED (impl) — the bin layout changed from what's sketched below:
-
-- **No separate `backup_staleness` bin.** Open-question 3 resolved →
-  **folded into the renamed `monitor` bin** (formerly `reachability`). The
-  `monitor` bin (`crates/jobs/src/bin/monitor.rs`) now runs the reachability
-  sweep, the backup staleness + reconcile sweep (`database::backup::sweep`),
-  **and** the tailnet key-expiry sweep, all on its one-minute loop. The
-  staleness/reconcile *logic* lives in the `database` crate (`database::backup`,
-  see below).
-- **Preflight is not its own bin either** — it's a module
-  (`crates/jobs/src/backup/preflight.rs`) run by the consolidated `backups`
-  bin (see canopy-jobs-maintenance-inspection.md §2/§4 for that single bin).
-
-The original sketch (separate `backup_staleness` + `backup_preflight` bins
-following the `reachability`/`pingtask` template) is kept below for design
-history:
-
-- ~~`crates/jobs/src/bin/backup_staleness.rs`~~ — signal-1 + reconciliation
-  scan (DB-only; no AWS). ~1–5 min cadence. **Folded into `monitor`.**
-- ~~`crates/jobs/src/bin/backup_preflight.rs`~~ — upstream preflight (AWS SDK;
-  STS + S3). **Now a module under the `backups` bin.**
-
-The bulk of the logic lives in the **`database` crate** as model functions
-(like `Status::sweep_reachability`), so it's testable with
-`commons_tests::db::TestDb::run` without standing up a binary. As shipped, the
-`database::backup` module (`crates/database/src/backup/`) holds:
-
-- `staleness.rs` — the scan + classify + file-events logic.
-- `reconcile.rs` — signal 1↔2(↔3) reconciliation.
-- `alerts.rs` — `raise_group_event`, the single group-scoped incident
-  entrypoint (bypasses per-server `is_monitored`).
-- `refs.rs` — the `(source, ref)` constants.
-
-`database::backup::sweep` is the top-level entry the `monitor` bin calls each
-tick (runs signal-1 classify + reconciliation).
-
-The preflight's AWS calls live in the **binary** (the `database` crate
-must not gain an AWS dependency); the preflight's *alerting* reuses the
-same `NewEvent` helpers. The binary reads config rows via a `database`
-model function and calls the AWS SDK directly.
-
-Per the workspace memory: `database` is the only crate allowed diesel; the
-preflight binary depends on the new AWS-SDK plumbing the issuance component
-adds (`aws-config` + `aws-sdk-sts` + `aws-sdk-s3`). The `jobs` crate gains
-those deps for `backup_preflight` only.
-
-## Refs and sources (issues/events keys)
-
-All events use `source = "canopy"` (the existing `CANOPY_SOURCE` constant
-in `statuses.rs`; promote it somewhere shared if both crates need it, or
-re-declare a `const BACKUP_*` set in the backup module). Refs (new
-constants — keep them all in one place, e.g. `database::backup::refs`):
-
-| ref | level | severity (active) | severity (recovery) | gate |
-|-----|-------|-------------------|---------------------|------|
-| `backup-staleness` | server | `Error` | `Info` (`active:false`) | `is_monitored` |
-| `backup-never` | server | `Error` | n/a (clears when first success lands) | `is_monitored` |
-| `backup-maintenance-stale` | group | `Error` | `Info` | none (group-level) |
-| `backup-reconcile-missing` | group | `Error` | `Info` | none (group-level) |
-| `backup-reconcile-report-gap` | server | `Warning` | `Info` | `is_monitored` |
-| `backup-corruption` | group | `Critical` | `Info` | none (group-level) |
-| `preflight-identity` | fleet/group | `Critical` | `Info` | none (group-level) |
-| `preflight-assume` | group | `Error` | `Info` | none (group-level) |
-| `preflight-object-lock` | group | `Critical` | `Info` | none (group-level) |
-| `restore-verification` (signal 3, later) | group | `Error` | `Info` | none (group-level) |
-
-Notes:
-- Staleness/never/report-gap are **per-server** → ordinary
-  `NewEvent::save(conn, server_id, Some(device_id))`. They inherit the
-  `is_monitored` incident gate by design (see plan: some prods are
-  intentionally intermittently-alive; per-server backup noise on them is
-  unwanted). They are still *recorded* (visible on the server page) even
-  when unmonitored — `NewEvent::save` records the issue/event unconditionally
-  and only skips the incident contribution.
-- Everything marked **group-level** must page even on unmonitored servers,
-  so it must **not** go through a per-server `NewEvent::save` (which would
-  re-inherit the gate). See [Group-level alerting](#group-level-alerting-server-independent).
-- `Error`+ is required for `opens_incident()` (`OPENS_INCIDENT = [Critical,
-  Error]`, `commons-types/src/issue.rs`). `Warning`/`Info` only join an
-  already-open incident for context; they never open one. So the
-  report-gap notice (`Warning`) is deliberately non-paging on its own.
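-
-To make the paging rule concrete, here is a small self-contained sketch. The
-`Severity` enum and `OPENS_INCIDENT` constant below are local stand-ins written
-for this example; the real definitions live in `commons-types/src/issue.rs` and
-may differ in shape.
-
-```rust
-/// Local stand-in for the severity levels named in the refs table.
-#[derive(Debug, Clone, Copy, PartialEq, Eq)]
-pub enum Severity {
-    Info,
-    Warning,
-    Error,
-    Critical,
-}
-
-/// Severities that may open a new incident on their own.
-pub const OPENS_INCIDENT: [Severity; 2] = [Severity::Critical, Severity::Error];
-
-impl Severity {
-    /// `Warning` and `Info` can only join an incident that is already open.
-    pub fn opens_incident(self) -> bool {
-        OPENS_INCIDENT.contains(&self)
-    }
-}
-```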
-
-## Signal 1 — staleness scan
-
-Server-centric. The subject is the **server** being protected; the device
-is the actor recorded in `backup_runs`/snapshot tags.
-
-### Scanned set
-
-Servers in a group whose `server_group_backup_config` has:
-- `status = 'ready'` (dormant configs — `provisioning`/`escrow_pending` —
-  are not yet expected to back up), **and**
-- a non-NULL `expected_interval` (manual-only groups have no schedule, so
-  no staleness alerting — they're simply not in the set).
-
-A manual-only or unconfigured group is therefore never scanned, so devices
-that are unauthorized or not yet set up never alert. Implement as a single query
-joining `servers` → `server_group_backup_config` (on `servers.group_id`)
-filtered as above, returning `(server_id, group_id, expected_interval,
-config.created_at)`.
-
-### Per-server classification
-
-For each scanned server, find its most recent `backup_runs` row with
-`purpose = 'backup' AND outcome = 'success'` (the `(device_id, …)` /
-`(group_id, reported_at DESC)` indexes support this; a server-centric query
-joins runs to the server via `group_id` **and** the server identity — see
-the source-mapping note below). Let `grace = expected_interval * 2`.
-
-- **Stale** — a prior successful backup exists but none newer than
-  `now - grace` → file `backup-staleness` at `Error`, `active:true`.
-- **Never backed up** — *no* successful `purpose='backup'` row ever, **and**
-  the server has been expected long enough: `now - anchor > grace`, where
-
-  ```
-  anchor = max( MIN(first_seen over this server's device_server_associations rows),
-                server_group_backup_config.created_at )
-  ```
-
-  → file `backup-never` at `Error`, `active:true`. Below the grace from the
-  anchor: no alert yet (freshly-present server or freshly-authorized group
-  must not false-alarm).
-- **Recovered** — a previously-stale server reporting success again: file
-  `backup-staleness` `active:false` at `Info` (the issue leaves the
-  incident and auto-closes). Mirror the reachability sweep's
-  `(false, Some(issue)) if !issue.active => continue` short-circuit so we
-  don't re-file an already-closed recovery every tick.
-
-**Anchor details (do not get these wrong — they're explicit decisions):**
-- `first_seen` in `device_server_associations` is per `(device_id,
-  server_id)` **pair**, *not* a per-server scalar. Use
-  `MIN(first_seen)` over **all** of that server's association rows
-  (earliest any device saw it). Schema:
-  `device_server_associations (device_id, server_id, first_seen, last_seen)`.
-- `created_at` is `server_group_backup_config.created_at` (group-authorized
-  time). A server present long ago but whose group was authorized 5 minutes
-  ago must use the *later* of the two, so a just-authorized group doesn't
-  instantly fire `backup-never` on every member.
-- Filter runs on `purpose='backup'` **specifically** — a recent successful
-  *restore* must **not** reset backup staleness.
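-
-The classification and anchor rules above can be sketched as a pure function.
-This is illustrative only: times are plain seconds, `ServerFacts`, `Verdict`,
-and `classify` are hypothetical names, and falling back to the config's
-`created_at` when a server has no association rows is an assumption the spec
-does not state. The real logic is a DB query, not an in-memory struct.
-
-```rust
-#[derive(Debug, PartialEq, Eq)]
-pub enum Verdict {
-    Healthy,
-    Stale,   // would file `backup-staleness`
-    Never,   // would file `backup-never`
-    Pending, // never backed up, but still inside the grace window
-}
-
-/// All times are seconds since an arbitrary epoch (hypothetical shape).
-pub struct ServerFacts {
-    pub now: u64,
-    pub expected_interval: u64,
-    /// Latest run with purpose = backup and outcome = success, if any.
-    pub last_backup_success: Option<u64>,
-    /// `first_seen` of every association row for this server.
-    pub first_seen: Vec<u64>,
-    pub config_created_at: u64,
-}
-
-pub fn classify(f: &ServerFacts) -> Verdict {
-    let grace = f.expected_interval * 2;
-    match f.last_backup_success {
-        Some(t) if f.now.saturating_sub(t) > grace => Verdict::Stale,
-        Some(_) => Verdict::Healthy,
-        None => {
-            // anchor = max(MIN(first_seen), config.created_at)
-            let anchor = match f.first_seen.iter().copied().min() {
-                Some(earliest) => earliest.max(f.config_created_at),
-                None => f.config_created_at, // assumption: no rows yet
-            };
-            if f.now.saturating_sub(anchor) > grace {
-                Verdict::Never
-            } else {
-                Verdict::Pending
-            }
-        }
-    }
-}
-```
-
-Taking the later of the two anchor inputs is what keeps a just-authorized group
-in `Pending` even when its servers were first seen long ago.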
-
-### Mapping a `backup_run` to a server
-
-`backup_runs` carries `device_id` + `group_id` but **not** `server_id`
-directly. The protected server is identified via the kopia source
-(`canopy@<server-id>:<path>`, recorded in `backup_repo_snapshots.server_id`)
-and via the device→server association at report time. For signal 1, resolve
-the server from the run's `device_id` via `Server::live_by_device_id` (the
-`servers_device_id_unique` partial unique index guarantees at most one live
-server per device). Scan-side, it's cleaner to drive **from the server**:
-for each scanned server, find runs whose `device_id` is one of that
-server's associated devices and whose `group_id` matches. Encode this as a
-single classify query rather than per-server round-trips.
-
-### Maintenance staleness
-
-`backup_maintenance_runs` (group-level) feeds the same scan: a group whose
-last `outcome='success'` maintenance run (any `kind`) is older than a
-maintenance-cadence threshold (full-weekly default → e.g. `8 days`; make it
-a constant, not `expected_interval`-derived, since maintenance cadence is
-independent of backup cadence) → file `backup-maintenance-stale` at `Error`
-via the **group-level** path. Recovery: a fresh successful maintenance run
-clears it.
-
-## Signal reconciliation (1 / 2 / 3)
-
-Runs in the same `backup_staleness` loop (after signal-1 classify), reading
-`backup_repo_snapshots` (signal-2 ground truth) against `backup_runs`
-(signal-1 reports). Per scanned server (resolved to a kopia `source` =
-`canopy@<server-id>:<path>`, so the join key is `server_id`):
-
-- **report says success but no recent snapshot** (a `backup_runs` success
-  newer than `now - grace`, but `backup_repo_snapshots.latest_snapshot_at` for
-  that source is older than `now - grace`, or no snapshot row at all) → the report
-  is wrong or the upload didn't persist. **`backup-reconcile-missing`**,
-  `Error`, **group-level** (a device lying about success / data not landing
-  endangers the group's actual recoverability, so it pages regardless of
-  monitored). This is the case signal 1 alone cannot catch.
-- **recent snapshot but no report** (`latest_snapshot_at` fresh, but no
-  recent `backup_runs` success) → backups are fine, the *reporting path* is
-  broken. **`backup-reconcile-report-gap`**, `Warning`, **per-server**
-  (low-severity, non-paging — it's a telemetry gap, not a backup failure).
-- **neither** → genuinely stale; already covered by signal 1, emit nothing
-  extra here (avoid double-filing on the same `(server)`).
-
-Signal 2 is only as fresh as the inspection Job's last run; if
-`backup_repo_snapshots.observed_at` for a group is itself stale (older than
-the inspection floor), reconciliation can't conclude "missing" reliably —
-**skip the `reconcile-missing` verdict when signal-2 data is stale** and
-instead rely on the inspection Job's own failure to surface (it writes
-`backup_repo_snapshots`/stats; a Job that stops running is caught by the
-preflight/maintenance-staleness machinery, not here). Record this as a
-guard so a lagging inspector doesn't produce false "report lied" alerts.
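-
-The verdicts and the freshness guard above can be sketched as one pure
-function. This is a minimal sketch, not the shipped code: the names
-`Verdict` and `reconcile`, the `i64` Unix-second timestamps, the inclusive
-`<=` boundary, and the `inspection_floor` parameter are all assumptions
-made for illustration.
-
-```rust
-/// Outcome of reconciling signal 1 (reports) against signal 2 (snapshots).
-#[derive(Debug, PartialEq)]
-enum Verdict {
-    /// Report says success but no recent snapshot: `backup-reconcile-missing`.
-    Missing,
-    /// Recent snapshot but no recent report: `backup-reconcile-report-gap`.
-    ReportGap,
-    /// Signal-2 data is too old to conclude "missing"; emit nothing.
-    SkippedStale,
-    /// Healthy, or genuinely stale (signal 1 already covers that case).
-    Nothing,
-}
-
-/// True when `at` is present and no older than `max_age` seconds.
-/// (Inclusive boundary is an assumption of this sketch.)
-fn is_recent(at: Option<i64>, now: i64, max_age: i64) -> bool {
-    at.map_or(false, |t| now - t <= max_age)
-}
-
-fn reconcile(
-    last_report_success_at: Option<i64>, // from `backup_runs`
-    latest_snapshot_at: Option<i64>,     // from `backup_repo_snapshots`
-    group_observed_at: Option<i64>,      // freshness of the inspection data
-    now: i64,
-    grace: i64,
-    inspection_floor: i64,
-) -> Verdict {
-    let reported = is_recent(last_report_success_at, now, grace);
-    let snapshotted = is_recent(latest_snapshot_at, now, grace);
-    match (reported, snapshotted) {
-        // Guard: a lagging inspector must not produce "report lied" alerts.
-        (true, false) if !is_recent(group_observed_at, now, inspection_floor) => {
-            Verdict::SkippedStale
-        }
-        (true, false) => Verdict::Missing,
-        (false, true) => Verdict::ReportGap,
-        _ => Verdict::Nothing,
-    }
-}
-```
-
-Keeping the decision free of DB access makes each boundary case a one-line
-assertion, separate from the query that gathers the three timestamps.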
-
-**Poisoning / corruption** is reported by the inspection Job (signal 2),
-not computed here: when inspection detects content-blob hash mismatch /
-unreadable index, it raises **`backup-corruption`** at `Critical`,
-group-level. This spec owns the *alerting shape* (the constant, severity,
-group-level routing, recovery-runbook pointer in the message body); the
-*detection* (running `kopia` verify) is the inspection-Job component. To
-avoid two components both knowing how to raise a group-level event, expose
-a single helper (below) that the inspection Job calls.
-
-**Signal 3 (restore-verification, later/additive):** PGRO reports
-per-replica restore outcomes into a future `backup_restore_checks` table; a
-failed/stale restorability check is **`restore-verification`** at `Error`,
-group-level. Same group-level helper. Stubbed here so the routing is
-designed-for, not bolted on; the table + ingest endpoint are out of scope
-for this component's first cut.
-
-## Group-level alerting (server-independent)
-
-**The core mechanism wrinkle.** The incident model
-(`crates/database/src/issues.rs`) is **server-keyed**: `Issue.server_id` is
-`NOT NULL`, `NewEvent::save(conn, server_id, device_id)` requires a server,
-and `re_evaluate_incident_membership` gates incident contribution on that
-server's `is_monitored`. Incidents themselves are **group-keyed**
-(`incidents.server_group_id`). There is no "group-level issue with no
-server" path today. Group-level backup checks must page regardless of
-`is_monitored`, so routing them through a per-server `NewEvent::save` is
-wrong (it would inherit the monitored gate and could be silenced by an
-unmonitored member).
-
-**Decision required — pick one (flagged in the plan as
-implementation-time):**
-
-- **Option A — representative monitored server.** Pick a deterministic
-  server in the group (e.g. the highest-rank live member, reusing
-  `ServerGroup::highest_member_ranks` ordering) and file against it, but
-  **bypass the monitored gate** for these refs. This needs a new code path
-  because `re_evaluate_incident_membership` hard-gates on `monitored`;
-  passing `monitored=true` unconditionally for group-level refs is the
-  smallest change but is a lie in the data. Fragile if the group has no
-  live members.
-- **Option B — group sentinel issue (recommended).** Add first-class
-  support for a group-scoped issue with no member server. Concretely: make
-  `issues.server_id` nullable **or** add an `issues.server_group_id`
-  nullable column, and teach `re_evaluate_incident_membership` /
-  `find_or_open_incident` to accept a group directly (the incident is
-  already group-keyed, so `find_or_open_incident(conn, group_id, …)` works
-  as-is — the only gap is producing an `Issue` that points at a group, not
-  a server, and skipping the `is_monitored` lookup for it). This is the
-  clean model and matches "group/control-plane concern, not any one
-  server's." It's a migration + a branch in the membership evaluator.
-
-This spec **recommends Option B** and treats it as the deliverable's
-central new piece of shared plumbing. Provide one helper that both this
-component and the inspection Job call:
-
-```rust
-// database::backup::alerts (new)
-pub async fn raise_group_event(
-    conn: &mut AsyncPgConnection,
-    group_id: Uuid,
-    r#ref: &str,
-    severity: Severity,
-    description: Option<&str>,
-    message: &str,
-    active: bool,
-) -> Result<()>;
-```
-
-Internally it find-or-creates a **group-scoped** issue keyed by
-`(server_group_id, source="canopy", ref)`, appends/coalesces an event (reuse
-`hash_event`), and runs the group-aware membership evaluation that ignores
-`is_monitored`. Recovery is the same `(source, ref)` with `active:false` at
-a lower severity, which lets the issue leave the incident and auto-close —
-identical lifecycle to the per-server path. **Do not** add an
-`Incident::open_for`; there is no such function — reuse
-`find_or_open_incident` → `enqueue_slack_open` → `SlackOutbox::enqueue`,
-which the existing evaluator already drives.
-
-The migration for Option B must be a separate `just migration` step
-(`just migration backup_group_scoped_issues`); never hand-create migration
-dirs. If `issues.server_id` becomes nullable, audit every existing query in
-`issues.rs` that assumes it non-null (the model is large — `list_for_server`,
-`list`, `reconcile_open_incidents`, `re_evaluate_incident_membership`'s
-`Server::get_by_id`). `reconcile_open_incidents` (run on reachability
-startup) must handle group-scoped issues (no server → resolve group
-directly, skip the `is_monitored` short-circuit). This is the
-refactor-thoroughly cost of Option B and must not be half-done.
-
-## Upstream preflight
-
-Watches **Canopy's own upstream access**, not the devices. Lives in
-`backup_preflight.rs`. Alert, **never gate readiness** (a failing check must
-not pull the pod out of rotation — that makes it worse).
-
-### Shared check (every ~minute, on the loop tick)
-
-- **`sts:GetCallerIdentity`** — confirms the pod's IRSA web-identity is
-  mounted and valid. Cheap; rides the minute loop. On failure → raise
-  **`preflight-identity`** at `Critical`, group-level (route it against a
-  fleet sentinel — Option B's group-scoped issue keyed to a "control-plane"
-  pseudo-group, **or** fan out one `preflight-identity` per configured group
-  since "every group's per-group check fails" is the same signal). The plan
-  says: a check failing for *every* group points at the shared IRSA
-  identity rather than any one bucket — so emitting per-group and letting
-  the operator see the fan-out is acceptable, but a single fleet-level alert
-  is cleaner. **Decision required** — see open questions.
-
-### Per-group deep checks (hourly, hash-jittered)
-
-For each `status='ready'` group with a config row, on its jittered slot
-(`hash(group_id) mod window`, stable per group — same scheme as maintenance
-and inspection; factor the jitter helper so all three share it):
-
-1. **Both purposes issue working creds.** Cross-account `sts:AssumeRole` on
-   the group's `target_role_arn`:
-   - **backup path**: plain assume (no session policy), then a **read-only
-     no-op** S3 call against the bucket (e.g. `HeadBucket` or
-     `GetBucketLocation` — a harmless op the backup role policy allows).
-   - **restore path**: assume **with the read-only restore session policy**
-     (the normative JSON from the plan — `GetObject` + unconditioned
-     `GetBucketLocation` + conditioned `ListBucket`), then the same
-     read-only no-op. This proves the restore session policy actually works
-     and catches the `GetBucketLocation`-folded-under-`s3:prefix` class of
-     bug **proactively**, while plain backup issuance still looks fine.
-
-   Any failure (assume or no-op) → **`preflight-assume`**, `Error`,
-   group-level. Message should distinguish which purpose/leg failed.
-2. **Object Lock still in place.** `s3:GetBucketObjectLockConfiguration` on
-   the group's bucket; assert it returns an enabled lock with `mode`
-   present and `days >= 30` (GOVERNANCE, the `backups` stack's `mode:
-   'GOVERNANCE', days: 30`). Missing/weakened lock → **`preflight-object-lock`**,
-   `Critical`, group-level (the whole "can't destroy backups" guarantee
-   rests on it, and there's no other symptom). This action is **not** in
-   `AWS_S3_MULTIPART_ACTIONS`; the issuance/IaC component must add it to the
-   per-bucket role Canopy assumes, or the check itself 403s on day one —
-   note that dependency, don't silently absorb it.
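-
-The Object Lock assertion in step 2 reduces to a small decision function
-that can be unit-tested without touching S3. A minimal sketch follows; the
-`LockConfig` struct, `LockMode` enum, and `object_lock_ok` name are
-hypothetical stand-ins for whatever the SDK response is mapped into, and
-a retention expressed in years is deliberately not modelled here.
-
-```rust
-/// Retention modes S3 Object Lock supports.
-#[derive(Debug, Clone, Copy, PartialEq)]
-enum LockMode {
-    Governance,
-    Compliance,
-}
-
-/// The fields of a bucket's default lock rule that the check reads.
-/// (Hypothetical shape; the SDK wiring would populate it.)
-struct LockConfig {
-    enabled: bool,
-    mode: Option<LockMode>,
-    days: Option<u32>,
-}
-
-/// Minimum retention the check accepts, per the `days >= 30` rule above.
-const MIN_LOCK_DAYS: u32 = 30;
-
-/// `None` models a bucket with no Object Lock configuration at all.
-/// A `false` result is what would raise `preflight-object-lock`.
-fn object_lock_ok(cfg: Option<&LockConfig>) -> bool {
-    match cfg {
-        Some(c) => {
-            c.enabled
-                && c.mode.is_some()
-                && c.days.map_or(false, |d| d >= MIN_LOCK_DAYS)
-        }
-        None => false,
-    }
-}
-```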
-
-Prefer **behavioural** checks (assume + harmless S3 op) over IAM/policy
-*introspection*: behavioural checks test the real path and need no extra
-`iam:Get*`. The Object-Lock read is the one allowed exception.
-
-The **maintenance path** needs no separate preflight: the read-only
-inspection Job already connects each group's repo on its cadence (proving
-reachability + password), and maintenance-specific failures surface via
-`backup_maintenance_runs` → `backup-maintenance-stale` (signal 1 above).
-
-### Reactive rate-tracking (light)
-
-The live paths are signals too: `/backup-credentials` returns 502 on STS
-failure, and maintenance failures land in `backup_maintenance_runs`. Building
-a metrics pipeline is out of scope here. Ideally a spike between hourly
-preflights would surface; for the first cut, the hourly preflight plus
-maintenance staleness cover it, and richer rate-tracking is deferred.
-
-## Loop / scheduling shape
-
-Mirror `reachability`/`pingtask`:
-
-```rust
-pub fn spawn() -> JoinHandle<()> {
-    let pool = database::init();
-    task::spawn(async move {
-        loop {
-            sleep(Duration::from_secs(60)).await;
-            let Ok(mut db) = pool.get().await else { error!(…); continue; };
-            // signal-1 + reconcile scan (DB only)
-        }
-    })
-}
-```
-
-- **staleness + reconcile**: DB-only, 60 s tick. RESOLVED (impl): **folded
-  into the `monitor` bin** (formerly `reachability`), which already does the
-  minute-cadence DB sweep + the startup `reconcile_open_incidents`. No separate
-  Deployment; `database::backup::sweep` runs each tick after the reachability
-  sweep. (The earlier recommendation of a separate binary was reversed — see
-  open question 3.)
-- **preflight**: AWS-touching. 60 s tick for `GetCallerIdentity`; per-group
-  deep checks fire only when the tick lands in the group's jittered hourly slot.
-  RESOLVED (impl): runs as a **module under the consolidated `backups` bin**
-  (alongside maintenance/inspection/s3-metrics via `tokio::try_join!`), not its
-  own Deployment — see canopy-jobs-maintenance-inspection.md.
-
-Hash-jitter helper (shared with maintenance/inspection): RESOLVED (impl) it
-lives in `commons_servers::backup_jobs` (`jitter_slot(group_id, window)` +
-`slot_is_due(...)`), so all schedulers agree.
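-
-As an illustration of the `hash(group_id) mod window` scheme, here is a
-minimal sketch. Only the names `jitter_slot` and `slot_is_due` come from
-this spec; the signatures, the `u128` stand-in for the group UUID, the
-choice of FNV-1a as the stable hash, and the second-based arithmetic are
-assumptions, not the shipped helper.
-
-```rust
-/// FNV-1a over the id's bytes. Unlike a randomly seeded hasher, the result
-/// is identical across processes and restarts.
-fn stable_hash(group_id: u128) -> u64 {
-    let mut h: u64 = 0xcbf29ce484222325; // FNV-1a 64-bit offset basis
-    for b in group_id.to_be_bytes() {
-        h ^= u64::from(b);
-        h = h.wrapping_mul(0x100000001b3); // FNV-1a 64-bit prime
-    }
-    h
-}
-
-/// The group's fixed offset (in seconds) inside each window.
-fn jitter_slot(group_id: u128, window_secs: u64) -> u64 {
-    stable_hash(group_id) % window_secs
-}
-
-/// True when `now_secs` falls in the tick-wide interval starting at the
-/// group's slot. The modular distance handles slots near the window's end.
-fn slot_is_due(group_id: u128, window_secs: u64, tick_secs: u64, now_secs: u64) -> bool {
-    let slot = jitter_slot(group_id, window_secs);
-    let pos = now_secs % window_secs;
-    (pos + window_secs - slot) % window_secs < tick_secs
-}
-```
-
-With a 60 s tick and a 3600 s window, each group is due on exactly one tick
-per hour, provided the loop does not drift by more than a tick.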
-
-## Interfaces / contracts
-
-### Consumes (written by sibling components)
-
-- **`server_group_backup_config`** — `group_id`, `expected_interval`
-  (NULL / set states), `created_at`, `status` (`provisioning` /
-  `escrow_pending` / `ready`), `bucket`, `target_role_arn`, `region`. Read
-  via a new `database` model fn, e.g.
-  `BackupConfig::scannable(conn) -> Vec<ScanRow>` and
-  `BackupConfig::ready_groups(conn) -> Vec<…>`.
-- **`backup_runs`** — `device_id`, `group_id`, `purpose`, `outcome`,
-  `reported_at`. (Written by `POST /backup-report`, issuance component.)
-- **`backup_maintenance_runs`** — `group_id`, `kind`, `outcome`,
-  `started_at`/`finished_at`. (Written by maintenance Jobs.)
-- **`backup_repo_snapshots`** — `group_id`, `source`, `server_id`,
-  `latest_snapshot_at`, `observed_at`. (Written by inspection Job.)
-- **`device_server_associations`** — `(device_id, server_id, first_seen,
-  last_seen)`, for the `MIN(first_seen)` anchor.
-- **`servers`** / `Server::live_by_device_id`, `is_monitored`, `group_id`.
-- **AWS SDK plumbing** (`aws-config`, `aws-sdk-sts`, `aws-sdk-s3`), the
-  ServiceAccount + IRSA role, and `s3:GetBucketObjectLockConfiguration` on
-  the per-bucket roles — all introduced by the issuance/IaC components.
-
-### Provides (to other components / operators)
-
-- **`database::backup::alerts::raise_group_event(conn, group_id, ref,
-  severity, …)`** — the single group-level alerting entrypoint. The
-  **inspection Job** calls it for `backup-corruption`; **PGRO ingest**
-  (later) calls it for `restore-verification`. Owning this here means there
-  is exactly one place that knows how to open a group-level incident
-  without the `is_monitored` gate.
-- **Stable `(source, ref)` keys** (the table above) — operators silence /
-  snooze by these via the existing `silenced_refs` mechanism; the UI / Slack
-  reference them. Documenting them is part of the contract.
-- **Group-scoped issue support** (Option B migration) — a reusable
-  capability beyond backups (any future control-plane-level check can raise
-  a group issue).
-
-## Data shapes
-
-No new tables are owned by *this* component except the Option-B schema
-change to `issues` (nullable `server_id` or new nullable `server_group_id`)
-and — for signal 3, later — `backup_restore_checks` (out of scope for the
-first cut, noted for design-for). Everything else is reads.
-
-A small internal struct for the scan, e.g.:
-
-```rust
-struct ScanRow {
-    server_id: Uuid,
-    group_id: Uuid,
-    device_id: Option<Uuid>,        // latest-associated device, for NewEvent
-    expected_interval: SignedDuration,
-    config_created_at: Timestamp,
-    min_first_seen: Option<Timestamp>,
-    last_success_at: Option<Timestamp>,   // purpose='backup', outcome='success'
-    latest_snapshot_at: Option<Timestamp>,// from backup_repo_snapshots (reconcile)
-    snapshot_observed_at: Option<Timestamp>, // signal-2 freshness guard
-}
-```
-
-## Testing approach (per AGENTS.md)
-
-UPDATE (shipped): DB-level detection tests exist in
-`crates/database/tests/backup_detection.rs` — covering classify boundaries
-(`classify_boundaries`, `classify_restore_only_history_is_never`),
-staleness/never/reconcile sweeps with the `is_monitored` gate
-(`sweep_files_staleness_*`, `sweep_files_never_*`,
-`unmonitored_staleness_records_issue_but_no_incident_link`,
-`reconcile_files_report_gap_*`, `reconcile_files_missing_*`,
-`reconcile_clears_report_gap_*`), and the headline
-group-level-alert-pages-even-when-all-members-unmonitored case
-(`group_event_pages_even_when_all_members_unmonitored`).
-
-- **Database-level tests** (`commons_tests::db::TestDb::run`) are the
-  primary coverage, since the scan/classify/reconcile logic lives in the
-  `database` crate as model fns. Use direct model functions, not HTTP.
-  Always `use database::ModelName;`. Seed `server_group_backup_config`,
-  `servers`, `device_server_associations`, `backup_runs`,
-  `backup_maintenance_runs`, `backup_repo_snapshots` directly, then assert
-  on the issues/events rows produced.
-- Cases to cover (success **and** the boundary/negative cases):
-  - stale (success older than `×2`) fires `backup-staleness` `Error`;
-  - just-under-`×2` does **not** fire;
-  - never-backed-up past anchor fires `backup-never`; just-authorized group
-    (recent `config.created_at`) does **not**, even with an old
-    `first_seen`; freshly-present server (recent `MIN(first_seen)`) does
-    **not**, even with an old `config.created_at` — assert the `max(...)`
-    anchor explicitly with both orderings;
-  - a recent successful **restore** does **not** clear backup staleness
-    (purpose filter);
-  - recovery: stale → success files `active:false`, and re-running the scan
-    does not re-file (idempotence);
-  - manual-only (`expected_interval` NULL) and non-`ready` configs are
-    **not** scanned;
-  - maintenance staleness fires/clears on `backup_maintenance_runs`;
-  - reconcile: report-success-but-no-snapshot → `backup-reconcile-missing`
-    (group-level, **pages even when the server is unmonitored** — assert the
-    incident opens); snapshot-but-no-report → `report-gap` `Warning`
-    (does not open an incident on its own);
-  - reconcile **skips** the missing verdict when `snapshot_observed_at` is
-    stale;
-  - **group-level vs per-server gating**: a `backup-staleness` on an
-    unmonitored server records the issue but opens **no** incident; a
-    `backup-corruption` / `preflight-object-lock` on a group whose servers
-    are all unmonitored **does** open an incident (this is the headline
-    behaviour and must be tested directly against `incidents` rows).
-- **Reconciliation/incident interplay**: reuse the patterns in the existing
-  issues/events tests — assert `incidents` / `incident_issues` rows and the
-  `slack_outbox` enqueue (`KIND_INCIDENT_OPEN`) for the paging cases, and
-  that recovery enqueues the resolve.
-- **Preflight** AWS calls can't hit real STS/S3 in tests; structure the
-  binary so the AWS-touching functions take a trait/client object that can
-  be faked, and unit-test the **decision logic** (lock-config →
-  pass/fail, assume-result → which ref/severity) separately from the SDK
-  wiring. The alerting side (given a verdict, the right group event is
-  raised) is DB-testable via `raise_group_event`.
-- Use `#[tokio::test(flavor = "multi_thread")]`. Tests run on the ramdisk
-  Postgres via `just test` / `just test-package`. There's no rendered UI in
-  this component, so no Playwright here (the operator stats/onboarding UI is
-  a sibling component and owns its own e2e).
-
-## Open questions / decisions to make
-
-1. **Group-level routing (Option A vs B).** Recommend **B** (group-scoped
-   issue: nullable `issues.server_id` or new `server_group_id`). It's the
-   clean model and is reused by inspection (corruption) and PGRO (signal 3),
-   but it's a migration + a thorough sweep of `issues.rs`. Confirm before
-   building — this is the largest single decision and the rest of the
-   group-level alerting depends on it.
-2. **`preflight-identity` fan-out.** One fleet-level alert (needs a
-   control-plane sentinel target) vs one-per-group (reuses per-group
-   routing, operator sees the fan-out and infers "shared identity"). Lean
-   fleet-level if Option B gives us a non-group sentinel cheaply; otherwise
-   per-group.
-3. **Separate `backup_staleness` binary vs folding into `reachability`.**
-   RESOLVED (impl): **folded** into the bin formerly called `reachability`,
-   now renamed **`monitor`** (it runs reachability + backup
-   staleness/reconcile + tailnet key-expiry on one minute loop). No separate
-   `backup_staleness` Deployment.
-4. **Maintenance-staleness threshold.** Independent of `expected_interval`
-   (maintenance cadence is full-weekly). Proposed constant ~`8 days`;
-   confirm and make it a named constant, not magic.
-5. **Reconcile severities.** `reconcile-missing` = `Error` group-level
-   (pages); `report-gap` = `Warning` per-server (non-paging). Confirm the
-   report-gap shouldn't be group-level — argument for per-server: a broken
-   *reporting* path is a single device's telemetry problem, not a
-   recoverability risk.
-6. **Signal-2 freshness floor for the reconcile guard.** What
-   `observed_at` age makes signal-2 "too stale to conclude missing"? Tie to
-   the inspection cadence floor (weekly for manual-only). Needs the
-   inspection component's cadence to be pinned first.
-7. **Anchor when a server has zero `device_server_associations` rows.**
-   `MIN(first_seen)` is NULL → fall back to `config.created_at` alone (the
-   `max` degenerates). Confirm that's the intended behaviour (a config'd
-   group with a server that no device has ever reported for: it's `never`
-   once `config.created_at` + grace elapses).
-8. **`CANOPY_SOURCE` sharing.** It currently lives in `statuses.rs`. Promote
-   to a shared location, or re-declare in the backup module? Minor, but pick
-   one to avoid drift.
-9. **Signal 3 (`backup_restore_checks` + PGRO ingest)** is explicitly
-   later/additive — confirm it stays out of this component's first cut
-   (only the group-level routing is built now, ready for it).
-
----
-
-## Backup types addendum
-
-Per the plan's "Backup types": staleness is per-`(server, type)`.
-
-- The staleness scan iterates **enabled `(server, type)` capabilities whose
-  effective schedule is non-NULL**, comparing each to its most recent
-  `backup_runs` row **for that type** (`type = ?`, `purpose='backup'`,
-  `outcome='success'`). The `×2` grace and `max(MIN(first_seen),
-  schedule-created)` anchor are unchanged, just per-type.
-- Disabled / manual-only / unconfigured `(server, type)` are out of the
-  scanned set.
-- Group-level alerting (corruption, preflight) is unchanged — it's
-  per-group, not per-type.
diff --git a/docs/plans/specs/canopy-jobs-maintenance-inspection.md b/docs/plans/specs/canopy-jobs-maintenance-inspection.md
deleted file mode 100644
index 7bc140fa..00000000
--- a/docs/plans/specs/canopy-jobs-maintenance-inspection.md
+++ /dev/null
@@ -1,734 +0,0 @@
-# Spec: canopy-jobs-maintenance-inspection
-
-**Component:** `canopy-jobs-maintenance` (repo: `canopy`)
-**Authoritative design:** [`../backup-credentials.md`](../backup-credentials.md) (and the blind-relay stub
-[`../backup-credentials-blind-relay.md`](../backup-credentials-blind-relay.md)).
-This spec implements the Canopy-owned **maintenance**, **read-only inspection**, **S3-metrics**, and
-**repo-creation init** paths — the scheduler loops in the `jobs` crate that drive kopia **in-process**.
-
-UPDATE (shipped) — the architecture changed fundamentally from the original
-"spawn one-shot k8s Jobs that report back" design to **a single long-lived
-`backups` Deployment that runs kopia as an in-process subprocess** for each due
-group. There are **no Kubernetes Jobs** anywhere: the loops parse kopia's
-`--json` output and write results **inline** to the DB. Throughout this spec,
-where older text describes Job manifests, a `JobSpawner`/`jobspec`, Job reaping,
-a `/job-report` endpoint, a pod termination-message, or a separate `kopia-job`
-image/binary, it is **superseded** — see the inline "UPDATE (shipped)" notes and
-§5/§8. The reasons: every Job already shared the one `canopy-jobs` IRSA
-identity, so collapsing into one process loses no isolation; and a long-lived
-process can hold a **refreshing** per-group credential, fixing the 1-hour cap
-that one-shot static creds hit.
-
-This is (jobs-side) the first IRSA usage anywhere in canopy, and the Kubernetes
-API client is now used only for **Secret reads** (the repo passphrase), not Job
-create/watch. "Like reachability" describes only the `spawn()` +
-`loop { sleep(60); pool.get; … }` shape.
-
----
-
-## 1. Purpose
-
-Canopy owns kopia repository lifecycle for every backup-configured server-group: repo creation, retention
-enforcement, snapshot expiry, blob GC/compaction, ground-truth inventory, poisoning detection, and the bucket
-billing-size readout. Devices never run these (they have no `DeleteObject`); the control plane does, off the
-client servers. UPDATE (shipped): it runs them **in-process** in the long-lived
-`backups` Deployment (kopia is a bundled subprocess), **not** as one-shot
-Kubernetes Jobs.
-
-RESOLVED (impl) — **the four scheduler loops ship as ONE bin**,
-`crates/jobs/src/bin/backups.rs`, which runs four modules
-(`crates/jobs/src/backup/{maintenance,inspection,preflight,s3_metrics}.rs`)
-concurrently via `tokio::try_join!`. (Preflight, originally specced in the
-sibling detection-preflight doc, is one of those modules.) The text below
-still describes the work as four loops for clarity of each loop's job; read
-"four scheduler loops" as "four modules in the one `backups` bin" throughout,
-and §2/§4 are updated accordingly.
-
-Each due group's kopia work runs as an **in-process subprocess** of the same
-pod (UPDATE (shipped): no per-group k8s Jobs). This component delivers the
-following, as four modules of the single `backups` bin:
-
-1. **Maintenance scheduler** — per-group cycle `assert-retention → kopia snapshot expire → kopia maintenance
-   run`; quick-daily / full-weekly; hash-jittered per group; writes `backup_maintenance_runs` inline.
-2. **Inspection scheduler** — read-only `kopia snapshot list` + repo stats + repo verify (poisoning detection);
-   writes `backup_repo_snapshots` and the repo-derived fields of `backup_repo_stats` inline.
-3. **S3-metrics task** — CloudWatch `BucketSizeBytes` → `backup_repo_stats.bucket_bytes` (best-effort, separate
-   permissions, separate cadence).
-4. **Repo-creation init** — driven by the maintenance loop (not a Job): for each `provisioning` group it runs
-   `kopia repository create` + asserts initial retention in-process, using the group's per-bucket role.
-
-Out of scope here (other specs/components own them): the public-server device endpoints
-(`/backup-credentials`, `/backup-target`, `/backup-report`), the AWS-SDK client on `public-server`'s
-`AppState`, staleness detection over `backup_runs` (signal 1), the per-group upstream **preflight**, the
-operator UI, and all Pulumi `backups`-stack bucket/role changes. Where this component *depends on* those, it is
-called out in §6/§7. UPDATE (shipped): kopia is a **bundled subprocess** of the
-`backups` bin. The kopia binary is copied into the single shipped
-`ghcr.io/beyondessential/canopy` image (`.github/Dockerfile.native`), so the
-backups pod runs that same image: there is no separate job-pod image and no
-inter-process contract. §5 now documents how the bin invokes the kopia CLI
-directly.
-
----
-
-## 2. Where it lives & the loop template
-
-RESOLVED (impl) — **one** bin, not four, following `reachability.rs` /
-`pingtask.rs` for the outer `spawn()`/`main()` shape but driving four loop
-modules concurrently:
-
-```
-crates/jobs/src/bin/backups.rs                  # the single long-lived bin
-crates/jobs/src/backup.rs                        # backup module root
-crates/jobs/src/backup/maintenance.rs           # maintenance scheduler loop (+ drives init)
-crates/jobs/src/backup/inspection.rs            # read-only inspection scheduler loop
-crates/jobs/src/backup/preflight.rs             # upstream preflight (see detection-preflight spec)
-crates/jobs/src/backup/s3_metrics.rs            # CloudWatch BucketSizeBytes task
-crates/jobs/src/backup/kopia.rs                  # in-process kopia execution layer (subprocess wrappers + parsing)
-crates/jobs/src/backup/worker.rs                 # shared Worker: pool, kube client (Secret reads), concurrency, in-flight set
-crates/jobs/src/backup/complete.rs               # inline DB writes from a kopia op's typed outcome
-```
-
-UPDATE (shipped) — the bin's `main()` builds a shared
-[`Worker`](#) (DB pool, `kube::Client` for Secret reads, concurrency semaphore,
-in-flight group set) **once**, then launches the four loops under one
-`tokio::try_join!`. Maintenance and inspection share the `Worker`; preflight and
-s3-metrics build their own pool/AWS clients. The kopia ops for each due group run
-**in-process** as subprocesses (the kopia CLI is bundled in the image), parse
-kopia's `--json`, and write results inline via `complete.rs`. (Original sketch was
-three separate `backup_*` bins each as its own Deployment, each *spawning k8s
-Jobs* — both superseded: one bin, in-process kopia.)
-
-Repo-creation is **not** its own loop and **not** handler-driven — UPDATE
-(shipped): the **maintenance loop** runs `init` in-process for `provisioning`
-groups (gated by the same in-flight set), then advances the status inline (see
-§3.8). The onboarding handler only sets `status = 'provisioning'`; private-server
-holds no kube/jobs dependency.
-
-The bin keeps the established structure verbatim:
-
-```rust
-pub fn spawn() -> JoinHandle<()> {
-    let pool = database::init();
-    task::spawn(async move {
-        // build the kube client + scheduler config ONCE at startup
-        // (like reachability builds the TailnetDirectory once)
-        loop {
-            sleep(Duration::from_secs(TICK)).await;
-            let Ok(mut db) = pool.get().await else { error!(...); continue; };
-            // … per-tick work …
-        }
-    })
-}
-
-#[derive(Debug, Parser)]
-struct Args { #[command(flatten)] logging: LoggingArgs }
-
-#[tokio::main]
-async fn main() -> miette::Result<()> { /* identical to reachability.rs main() */ }
-```
-
-Deviation from the DB-only sweeps: at startup the bin builds a **kube client** (used only for repo-password
-**Secret reads**, not Jobs) and the s3-metrics/preflight tasks build AWS clients. UPDATE (shipped): the kube
-client is built once in `main()` (a hard failure there exits the pod); a transient API blip when *reading a
-Secret* mid-loop is per-group and just skips that group's op for the tick, so it doesn't kill the pod.
-
-RESOLVED (impl): the four loops share **one** single-replica `Recreate`
-Deployment (`backups`) in `ops/pulumi/tamanu/meta/src/jobs.ts` (see §4). UPDATE
-(shipped): the per-group work runs **in-process** (kopia subprocess) inside the
-loop pod — it is **not** a k8s Job.
-
-### Tick vs. cadence
-
-The loop ticks frequently (default 60s, matching reachability) but **per-group work is gated by hash-jittered
-cadence**, so a tick mostly finds nothing due. The loop's job each tick is: enumerate configured+`ready`
-groups, compute each group's due-ness for *this* loop's cadence, and — for those due and not already in-flight —
-**claim a per-group + concurrency slot and run the kopia op in-process**. This keeps "is anything due" cheap (a
-DB read + arithmetic) and the heavy work in bounded in-process subprocesses.
-
----
-
-## 3. Concrete changes (canopy)
-
-### 3.1 New crate dependencies (`crates/jobs/Cargo.toml`)
-
-Net-new; **do not pin versions without checking the registry** (per global rule). UPDATE (shipped) — the
-shipped `crates/jobs/Cargo.toml` carries:
-
-- `kube` and `k8s-openapi` — used **only for Secret reads** (the per-group repo passphrase, via
-  `worker::read_repo_password`). UPDATE (shipped): **no** Job create/list/watch/delete — there are no Jobs.
-  `k8s-openapi` is pinned to a feature matching the cluster (verify against the deployed control-plane version,
-  do not guess).
-- `aws-config` + `aws-sdk-cloudwatch` + `aws-sdk-s3` + `aws-sdk-sts` for **preflight** (upstream reachability)
-  and the **S3-metrics** task. The maintenance/inspection loops do **not** use the AWS SDK directly — kopia's
-  own bundled AWS SDK talks to S3 (the bin only overrides `AWS_ROLE_ARN` per subprocess).
-- **No `axum`** — UPDATE (shipped): there is no `/job-report` HTTP server; results are typed Rust values
-  written inline.
-- Hash-jitter uses a stable hash of the group UUID (stable across restarts — not a randomly-seeded hasher);
-  the helpers live in `commons_servers::backup_jobs`.
-
-(The AWS SDK also lands on `public-server` per the endpoints spec; the kube client also lands on `public-server`
-for Secret-read per `/backup-target` — both are *separate* additions owned by the endpoints component.)
-
-### 3.2 In-process execution model: `kopia.rs` + `worker.rs` + `complete.rs`
-
-UPDATE (shipped) — there is **no** k8s-Job manifest builder (`jobspec.rs`), no `JobSpawner`, and no
-`spawn_*_job` helpers. The kopia work runs in-process. The code splits three ways:
-
-```
-crates/jobs/src/backup/kopia.rs      # subprocess wrappers + parsing + per-kind orchestration (run_init/run_maintenance/run_inspect)
-crates/jobs/src/backup/worker.rs     # shared Worker: pool, kube client (Secret reads), Slots (semaphore + in-flight set)
-crates/jobs/src/backup/complete.rs   # inline DB writes from a kopia op's typed outcome
-```
-
-Pure scheduler logic (hash-jitter, due-ness, billing labels, retention floor, `JobKind`) still lives in
-**`commons_servers::backup_jobs`**, reused by the loops and by private-server (so private-server need not depend
-on the `jobs` crate). UPDATE (shipped): private-server does not call any spawn helper — init is scheduler-driven
-(§3.8).
-
-**`kopia.rs`** (in-process execution layer):
-
-- Builds a per-op `KopiaEnv { target_role_arn, region, password }` and applies it to each `tokio::process::Command`:
-  it sets `AWS_ROLE_ARN` = the group's `target_role_arn` (overriding the pod's shared `canopy-jobs` IRSA role),
-  `AWS_REGION`/`AWS_DEFAULT_REGION` = the group's region, and `KOPIA_PASSWORD` = the repo passphrase. The
-  projected `AWS_WEB_IDENTITY_TOKEN_FILE` is **inherited** from the pod env, so kopia's own AWS SDK does
-  `AssumeRoleWithWebIdentity` against the per-bucket role **directly** (not chained → up to the role's
-  `MaxSessionDuration`, auto-refreshed — no 1h cap). This replaces the per-kind IRSA-SA distinction: every op
-  uses the same pod SA and overrides the role per subprocess.
-- `connect(...)` always connects with `--override-username canopy --override-hostname canopy-maintenance` so the
-  running identity is the maintenance owner (kopia 0.23.1 requires running identity == owner for
-  `maintenance run`; see §5). `run_init` sets that identity as the owner.
-- `run_init` / `run_maintenance` / `run_inspect` orchestrate the kopia subcommands and return **typed Rust
-  outcomes** (e.g. `MaintOutcome { bytes_reclaimed }`, `InspectOutcome { verify_ok, counts, per-source
-  inventory }`) — no JSON-over-HTTP, no termination-log.
-- The repo password is **never** logged; it is read from the group's k8s Secret and passed only via the
-  subprocess env.
-
-**`worker.rs`** (concurrency + Secret reads):
-
-- `Worker { pool, kube, cfg, slots }` is built once in `main()` and shared (cheaply cloned) by maintenance +
-  inspection. `Cfg::from_env()` reads `CANOPY_NAMESPACE`, the Secret password key, and the web-identity token
-  file path — so one binary works across stacks.
-- `read_repo_password(secret_name)` reads the named key from the group's k8s Secret (the only kube API use).
-- `Slots` holds a tokio `Semaphore` (max concurrency from `CANOPY_BACKUP_MAX_CONCURRENCY`, default 4) and an
-  in-flight `HashSet<Uuid>`. `try_claim(group_id)` takes a permit then marks the group in-flight, returning an
-  `InFlightGuard` that releases both on drop — enforcing **one op per group at a time** across maintenance +
-  inspection + init, plus a global concurrency cap.
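-
-A minimal sketch of the claim-and-release shape described above, not the shipped code. To stay std-only it
-swaps the tokio `Semaphore` for a plain counter under a `Mutex`, and uses a `u128` where the real code keys on
-`Uuid`. The `State` struct and `Slots::new` are hypothetical names introduced for illustration.
-
-```rust
-use std::collections::HashSet;
-use std::sync::{Arc, Mutex};
-
-/// Shared bookkeeping: which groups have an op running, and how many ops in total.
-#[derive(Default)]
-struct State {
-    in_flight: HashSet<u128>,
-    active: usize,
-}
-
-#[derive(Clone)]
-pub struct Slots {
-    max: usize,
-    state: Arc<Mutex<State>>,
-}
-
-/// Releases both the concurrency slot and the per-group marker when dropped.
-pub struct InFlightGuard {
-    group_id: u128,
-    state: Arc<Mutex<State>>,
-}
-
-impl Slots {
-    pub fn new(max: usize) -> Self {
-        Slots { max, state: Arc::new(Mutex::new(State::default())) }
-    }
-
-    /// Returns `None` when the global cap is reached or the group already has an op running.
-    pub fn try_claim(&self, group_id: u128) -> Option<InFlightGuard> {
-        let mut st = self.state.lock().unwrap();
-        if st.active >= self.max || st.in_flight.contains(&group_id) {
-            return None;
-        }
-        st.active += 1;
-        st.in_flight.insert(group_id);
-        Some(InFlightGuard { group_id, state: Arc::clone(&self.state) })
-    }
-}
-
-impl Drop for InFlightGuard {
-    fn drop(&mut self) {
-        let mut st = self.state.lock().unwrap();
-        st.in_flight.remove(&self.group_id);
-        st.active -= 1;
-    }
-}
-```
-
-Tying release to `Drop` means an early return or a panic in the op's task still frees the slot.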
-
-**`complete.rs`** (inline completion):
-
-- Called inline with the typed outcome: `complete_maint` closes the `backup_maintenance_runs` row (success →
-  `bytes_reclaimed`; failure → error), `complete_init` advances `provisioning → escrow_pending`/`ready` or
-  records `last_init_error`, and the inspection path upserts inventory/stats and raises/recovers the corruption
-  alert off `verify_ok`. There is no report endpoint and no crash-detection: the op runs in the same process, so
-  its outcome is known directly.
-
-**Billing labels** (`commons_servers::backup_jobs`) are still computed from the group (for the Deployment's cost
-labels, not per-Job pods):
-
-- `billing.product` = group's `billing.product` tag if present else `"tamanu"`.
-- `billing.deployment` = group's `billing.deployment` tag if present else the group **name**.
-- `billing.stage` = group's `billing.stage` tag if present, else derived from
-  `ServerGroup::highest_member_ranks` → `rank_priority`, mapped **explicitly** to the CUR stage strings ops
-  already emits — **not** the `ServerRank` `Display` strings, which don't match:
-
-  | `ServerRank` | `Display` | billing stage |
-  |---|---|---|
-  | `Production` | `production` | `prod` |
-  | `Clone` | `clone` | `clone` |
-  | `Demo` | `demo` | `demo` |
-  | `Test` | `test` | `test` |
-  | `Dev` | `dev` | `dev` |
-
-  RESOLVED (impl): `billing.stage` maps explicitly, with `ServerRank::Production
-  → "prod"` (the load-bearing mismatch); the others coincide but are mapped
-  explicitly so a future `Display` rename can't silently break CUR tags. (See §8
-  for the all-unranked fallback.)
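-
-The explicit mapping in the table can be sketched as below. The `ServerRank` enum here is a local stand-in for
-the real type, and `billing_stage`/`resolve_stage` are hypothetical names; returning `None` for an all-unranked
-group follows the "omit the label" option, which is an assumption about the final behaviour.
-
-```rust
-#[derive(Clone, Copy, Debug, PartialEq, Eq)]
-pub enum ServerRank {
-    Production,
-    Clone,
-    Demo,
-    Test,
-    Dev,
-}
-
-/// Explicit rank-to-CUR-stage mapping; deliberately not derived from `Display`.
-pub fn billing_stage(rank: ServerRank) -> &'static str {
-    match rank {
-        ServerRank::Production => "prod", // the one value that differs from `Display`
-        ServerRank::Clone => "clone",
-        ServerRank::Demo => "demo",
-        ServerRank::Test => "test",
-        ServerRank::Dev => "dev",
-    }
-}
-
-/// A `billing.stage` tag on the group wins; otherwise derive from the highest member rank.
-/// `None` means the label is omitted (assumed behaviour for an all-unranked group).
-pub fn resolve_stage(tag: Option<&str>, highest: Option<ServerRank>) -> Option<String> {
-    tag.map(str::to_owned)
-        .or_else(|| highest.map(|r| billing_stage(r).to_owned()))
-}
-```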
-
-`commons_servers::backup_jobs` provides hash-jittered due-ness (UPDATE (shipped) — these live in
-`backup_jobs`, not a `schedule.rs`):
-
-```rust
-/// Cadence elapsed since the last run-of-this-kind (window arithmetic only).
-pub fn is_due(window: Duration, last: Option<Timestamp>, now: Timestamp) -> bool;
-
-/// Stable per-group jitter slot: true only on the tick matching this group's
-/// hashed offset within the window. Stable across restarts (hashes the group
-/// UUID, NOT a randomly-seeded hasher).
-pub fn slot_is_due(group_id: Uuid, window: Duration, tick: Duration, secs_into_window: u64) -> bool;
-```
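-
-One way these two signatures could be implemented, as a sketch under stated assumptions: plain `u64` seconds
-stand in for `Duration`/`Timestamp`, raw UUID bytes stand in for `Uuid`, the hash is FNV-1a (the source only
-requires that it be stable, not which hash), and a group that has never run is treated as due.
-
-```rust
-/// FNV-1a over the UUID bytes. The constants are fixed, so the value is the same in
-/// every process and across restarts (unlike std's randomly seeded `RandomState`).
-fn stable_hash(group_id: &[u8; 16]) -> u64 {
-    let mut h: u64 = 0xcbf29ce484222325;
-    for b in group_id {
-        h ^= u64::from(*b);
-        h = h.wrapping_mul(0x100000001b3);
-    }
-    h
-}
-
-/// Window arithmetic only: due if never run (assumption) or a full window has elapsed.
-pub fn is_due(window_secs: u64, last: Option<u64>, now: u64) -> bool {
-    match last {
-        None => true,
-        Some(t) => now.saturating_sub(t) >= window_secs,
-    }
-}
-
-/// Index of the tick (0-based) assigned to this group within the window.
-pub fn slot_offset(group_id: &[u8; 16], window_secs: u64, tick_secs: u64) -> u64 {
-    let slots = (window_secs / tick_secs.max(1)).max(1);
-    stable_hash(group_id) % slots
-}
-
-/// True only on the tick whose index matches the group's hashed slot.
-pub fn slot_is_due(group_id: &[u8; 16], window_secs: u64, tick_secs: u64, secs_into_window: u64) -> bool {
-    secs_into_window / tick_secs.max(1) == slot_offset(group_id, window_secs, tick_secs)
-}
-```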
-
-### 3.3 Database changes
-
-The **tables** are defined in the design doc and shared with sibling components; this component **reads** config
-and **writes** run/inventory/stats rows. Migrations are created with **`just migration NAME`** (never
-hand-authored — per project rule). To avoid two specs both trying to own the same migration, ownership is:
-
-- `server_group_backup_config`, `backup_credential_issuances`, `backup_runs`, `backup_requests` — owned by the
-  **endpoints/onboarding** components (this component only **reads** `server_group_backup_config`).
-- **This component owns the migrations for** `backup_maintenance_runs`, `backup_repo_snapshots`,
-  `backup_repo_stats` (DDL verbatim from the design doc §"Database changes"). If a single migration is
-  preferred for the whole feature, coordinate so this component contributes these three tables.
-
-New database-crate model modules (mirroring `chrome_releases.rs` shape: a `Queryable` struct + a `New*` insert
-struct + impl methods, re-exported from `lib.rs`):
-
-- `crates/database/src/backup_maintenance_runs.rs` — `MaintenanceRun` / `NewMaintenanceRun`.
-  - `NewMaintenanceRun::start(conn, group_id, kind) -> id` (insert with `outcome = NULL`, returns `BIGSERIAL`).
-  - `MaintenanceRun::finish(conn, id, outcome, error, bytes_reclaimed)`.
-  - `MaintenanceRun::latest_for_group(conn, group_id, kind) -> Option<MaintenanceRun>` (for due-ness +
-    staleness).
-- `crates/database/src/backup_repo_snapshots.rs` — `RepoSnapshot` / `NewRepoSnapshot`.
-  - `NewRepoSnapshot::upsert_many(conn, group_id, rows)` (PK `(group_id, source)`, `ON CONFLICT … DO UPDATE`
-    `latest_snapshot_at`/`observed_at`).
-  - parse `server_id` from the kopia `source` (`canopy@<server-id>:<path>`) at write time.
-- `crates/database/src/backup_repo_stats.rs` — `RepoStats`.
-  - `RepoStats::upsert_repo_fields(conn, group_id, snapshot_count, source_count, logical, physical)` — written
-    by inspection.
-  - `RepoStats::upsert_bucket_bytes(conn, group_id, bytes)` — written by the S3-metrics task; must **not**
-    clobber the repo fields (partial upsert), since the two tasks run on different cadences. `bucket_bytes` is
-    nullable/best-effort.
-
-Use PostgreSQL-native upserts (`ON CONFLICT`) per project DB conventions; keep the per-task partial-update
-separation so the two writers don't race over each other's columns.
-
-Config reader: this component needs a model over `server_group_backup_config` (likely owned by the
-config/onboarding component) with reads like `BackupConfig::all_ready(conn) -> Vec<BackupConfig>` (status =
-`'ready'`, used to enumerate groups to schedule) and `BackupConfig::by_group(conn, group_id)`. If that model
-doesn't exist yet, this component adds the read-only accessors it needs.
-
-### 3.4 Maintenance scheduler loop (`backup/maintenance.rs`)
-
-Per tick (UPDATE (shipped) — no Jobs; the op is an in-process subprocess task):
-
-1. `ServerGroupBackupConfig::all(...)` → candidate groups; `provisioning` ones go through init (§3.8),
-   `ready` ones through maintenance.
-2. For each `ready` group, decide quick vs full and due-ness (`due_kind`):
-   - **full**: due weekly (`is_due(WEEK, last_full, now)`) **and** this tick matches the group's hashed slot
-     (`slot_is_due(group, WEEK, TICK, …)`).
-   - **quick**: due daily, slot-jittered over the day; `full` subsumes quick.
-   - If both are due, run **full**.
-3. For each due group: `Worker::try_claim(group_id)` — skip if the group is already in-flight (across
-   maintenance/inspection/init) or the concurrency cap is hit. No kube API query.
-4. `NewMaintenanceRun::start(...)` → `run_id`; read the repo password from the group's Secret, build the
-   `KopiaEnv`, **spawn a tokio task** that runs `kopia::run_maintenance(...)` in-process and then calls
-   `complete::complete_maint(run_id, …)` inline with the typed outcome. The `InFlightGuard` releases the slot on
-   task completion.
-5. Cadence defaults: quick-daily, full-weekly, deployment-wide (`TICK`/`DAY`/`WEEK` constants). Per-group override
-   is later (design non-goal).
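-
-The quick-vs-full decision in step 2 can be sketched as a pure function. The `MaintKind` enum and the
-four-boolean signature are assumptions made for illustration; the source names only `due_kind`, `is_due`, and
-`slot_is_due`.
-
-```rust
-#[derive(Clone, Copy, Debug, PartialEq, Eq)]
-pub enum MaintKind {
-    Quick,
-    Full,
-}
-
-/// Each kind needs both its window elapsed and its jitter slot matched.
-/// Full is checked first, so it wins whenever both kinds are due.
-pub fn due_kind(
-    full_window_elapsed: bool,
-    full_slot_matches: bool,
-    quick_window_elapsed: bool,
-    quick_slot_matches: bool,
-) -> Option<MaintKind> {
-    if full_window_elapsed && full_slot_matches {
-        Some(MaintKind::Full)
-    } else if quick_window_elapsed && quick_slot_matches {
-        Some(MaintKind::Quick)
-    } else {
-        None
-    }
-}
-```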
-
-**Finish recording — UPDATE (shipped): inline, in-process.** The op runs as a
-subprocess of the same pod, so the loop knows its outcome directly: it calls
-`complete::complete_maint(run_id, outcome, error)` → `MaintenanceRun::finish`.
-There is **no** `/job-report` endpoint, no axum server, and no kube poll for
-reaping/crash-detection. (Superseded design: first a report-endpoint, then a pod
-termination-message / pod-log read, then a bearer-authed `/job-report` POST with
-a kube reap/crash-detect poll — see §8 #2. The termination-message/pod-log read
-proved unreliable in practice — k8s truncates/drops the message and the pod may
-be gone before it's read, PGRO's pattern too — and the report round-trip became
-unnecessary once kopia runs in-process, since the IRSA was shared anyway.) A
-crash now can't leave a row stuck at `outcome IS NULL` via a missing report: if
-the in-process op panics or errors, the loop's task records the failure inline.
-
-The maintenance cycle's **three steps run in-process** (`kopia.rs::run_maintenance`), not in a separate Job:
-`assert retention → kopia snapshot expire → kopia maintenance run [--full]`, all under the group's per-bucket
-role. RESOLVED (impl) — **per-`(group, type)` retention is resolved and applied per source.** The loop calls
-`commons_servers::backup_jobs::effective_retention_for_group` (per enabled type: `server_group_backup_schedule`
-override → `backup_type_defaults` → floor baseline, each `.enforce_floor()`-clamped — never below `keep_daily 7,
-keep_weekly 4, keep_monthly 6`) and builds a `{type → policy}` **map** (`kopia::RetentionMap`). The kopia layer
-applies it **per source**: for each `canopy@<server-id>:<type>` source it sets that type's kopia policy
-(`policy set <user@host:path>`), then expires — so different types sharing a group's repo get their own
-retention. (`init` sets a strictest-of-the-map global baseline since the repo has no sources yet.) The
-private-server write path validates operator input against the same floor via the DB-crate
-`RetentionPolicy::validate_floor()` (rejects below-floor rather than clamping).
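-
-The resolution order (schedule override, then type default, then floor baseline, always clamped) can be
-sketched as follows. The field names and floor values come from this spec; the `BASELINE` constant and the
-`effective_retention` helper are hypothetical simplifications of `effective_retention_for_group`, which also
-handles the per-type map.
-
-```rust
-#[derive(Clone, Copy, Debug, PartialEq, Eq)]
-pub struct RetentionPolicy {
-    pub keep_latest: i32,
-    pub keep_daily: i32,
-    pub keep_weekly: i32,
-    pub keep_monthly: i32,
-    pub keep_annual: i32,
-}
-
-pub const FLOOR_DAILY: i32 = 7;
-pub const FLOOR_WEEKLY: i32 = 4;
-pub const FLOOR_MONTHLY: i32 = 6;
-
-/// Hypothetical baseline used when neither an override nor a type default exists.
-pub const BASELINE: RetentionPolicy = RetentionPolicy {
-    keep_latest: 1,
-    keep_daily: FLOOR_DAILY,
-    keep_weekly: FLOOR_WEEKLY,
-    keep_monthly: FLOOR_MONTHLY,
-    keep_annual: 0,
-};
-
-impl RetentionPolicy {
-    /// Clamps the three floored fields upward; `keep_latest` and `keep_annual` pass through.
-    pub fn enforce_floor(self) -> Self {
-        RetentionPolicy {
-            keep_daily: self.keep_daily.max(FLOOR_DAILY),
-            keep_weekly: self.keep_weekly.max(FLOOR_WEEKLY),
-            keep_monthly: self.keep_monthly.max(FLOOR_MONTHLY),
-            ..self
-        }
-    }
-}
-
-/// Override wins over the type default, which wins over the baseline; the result is always clamped.
-pub fn effective_retention(
-    schedule_override: Option<RetentionPolicy>,
-    type_default: Option<RetentionPolicy>,
-) -> RetentionPolicy {
-    schedule_override.or(type_default).unwrap_or(BASELINE).enforce_floor()
-}
-```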
-
-### 3.5 Inspection scheduler loop (`backup/inspection.rs`)
-
-Per tick, same enumerate-and-gate shape, on its **own cadence** (default ≈ `expected_interval`, tunable; floor
-weekly for manual-only `NULL`-interval groups that still hold backups). RESOLVED (impl): the per-group cadence
-is `commons_servers::backup_jobs::effective_interval_for_group` (the **min** effective `expected_interval`
-across the group's enabled types), floored to weekly. UPDATE (shipped): for each due group it claims a slot and
-**runs `kopia::run_inspect(...)` in-process** (no read-only Job), which:
-
-- `kopia snapshot list --all --json` → per-source inventory (latest snapshot per source).
-- repo stats (`kopia content stats` — note: **no `--json`**, parsed from text) → snapshot/source counts +
-  logical/physical bytes.
-- **repo verify** → a `verify_ok` flag (poisoning/corruption signal).
-
-UPDATE (shipped) — **inspection completion runs inline**, in-process. There is no
-`/job-report` POST and no kube reap poll. The loop calls `complete.rs` directly
-with the typed `InspectOutcome`, which:
-
-- writes `backup_repo_snapshots` (latest snapshot per source; `server_id`/`type`
-  parsed from each source) and the repo-derived fields of `backup_repo_stats`.
-- on `verify_ok: false`, raises the **`backup-corruption`** `Severity::Critical`
-  **group-level** alert (via `raise_group_event`, §3.7), with a matching
-  `active: false` recovery when a later inspection verifies clean.
-
-This is a **group-level** alert that fires **regardless of any server's
-`is_monitored`** — routed through `raise_group_event` (§3.7), not a per-server
-`NewEvent`.
-
-Reconciling inspection results against signal-1 (report-said-success-but-no-snapshot, etc.) is **owned by the
-signal-1 staleness component**, which reads `backup_repo_snapshots`/`backup_runs`. This component's job is to
-*write the ground truth* and to raise the *corruption* alert; the cross-signal reconciliation alerts belong to
-the staleness component.
-
-### 3.6 S3-metrics task (`backup/s3_metrics.rs`)
-
-A loop module in the one `backups` bin (UPDATE (shipped) — not a separate bin); it builds its own AWS clients.
-Per tick (own cadence, ≈ `expected_interval`, weekly floor):
-
-- For each `ready` group, read CloudWatch `AWS/S3 BucketSizeBytes`. It is reported **per `StorageType`**
-  (storage class) with no "all storage types" total, and the class depends on bucket config (Standard,
-  Intelligent-Tiering tiers, …), so the task **`ListMetrics`-discovers** whichever `StorageType`s the bucket
-  actually emits and **sums** the latest `GetMetricStatistics` datapoint across them — no hardcoded class.
-  RESOLVED (impl): the metric lives in the **deployment** account, so the task **assumes the group's
-  `target_role_arn`** (the same role preflight assumes) and reads CloudWatch with those cross-account
-  credentials — no dedicated canopy-side cross-account CloudWatch IRSA. The per-bucket role must grant
-  `cloudwatch:GetMetricStatistics` **and `cloudwatch:ListMetrics`** (ops `backups`-stack).
-- `RepoStats::upsert_bucket_bytes(conn, group_id, bytes)` — best-effort; on error log + continue, never alert
-  (best-effort/nullable per design).
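-
-The "sum the latest datapoint across discovered storage types" step, sketched as a pure function over
-already-fetched data so it carries no AWS SDK dependency. The `Series` struct and `total_bucket_bytes` are
-hypothetical; returning `None` when no storage type has a datapoint is an assumption that matches the nullable
-`bucket_bytes` column.
-
-```rust
-/// One discovered `StorageType` dimension and its (unix_seconds, bytes) datapoints.
-pub struct Series {
-    pub storage_type: String,
-    pub datapoints: Vec<(u64, f64)>,
-}
-
-/// Sums the newest datapoint of every storage type that reported one.
-pub fn total_bucket_bytes(series: &[Series]) -> Option<i64> {
-    let mut total = 0.0_f64;
-    let mut seen = false;
-    for s in series {
-        // Newest datapoint for this storage type, if it reported any.
-        if let Some((_, bytes)) = s.datapoints.iter().max_by_key(|(ts, _)| *ts) {
-            total += *bytes;
-            seen = true;
-        }
-    }
-    seen.then(|| total as i64)
-}
-```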
-
-This task reads CloudWatch directly from the `backups` pod (lightweight); it never ran as a Job.
-
-### 3.7 Group-level alerting path (shared concern, must be settled here)
-
-Maintenance failure (stuck/failed maintenance) and inspection corruption are **group/control-plane** concerns
-that must **not** pass the per-server `is_monitored` gate (design "Group-level checks alert regardless of
-`is_monitored`"). But the incident model (`issues.rs`) is **server-keyed**: `NewEvent::save(conn, server_id,
-device_id)` and `re_evaluate_incident_membership` gate on the server's `is_monitored`. There is no
-"group-level issue with no server" path today.
-
-RESOLVED (impl): the gap was closed with a **group-scoped issue** path. The
-shipped entrypoint is **`database::backup::alerts::raise_group_event(conn,
-group_id, ref, severity, …)`**, which find-or-creates a group-scoped issue
-(nullable `issues.server_id`, group resolved directly — migration
-`2026-06-15-...backup_group_scoped_issues`), runs the group-aware membership
-evaluation that **bypasses `is_monitored`**, and drains to `slacker_outbox`
-unchanged. Both this component (corruption) and the detection component call it;
-its tests cover the all-members-unmonitored paging case. Do **not** route
-group-level alerts through a per-server `NewEvent`. (See the detection-preflight
-spec for the full `raise_group_event` contract and the migration sweep.)
-
-`(source, ref)` conventions for this component (mirroring reachability's `source="canopy"`):
-- maintenance stuck/failed → `ref = "backup-maintenance"`, `Severity::Error` (opens incident).
-- repo corruption/poisoning → `ref = "backup-corruption"`, `Severity::Critical`.
-Recovery is the **same `(source, ref)`** event with `active: false` / lower severity, so the issue leaves the
-incident and auto-closes (same pattern reachability uses). `slacker_outbox` drains to Slack unchanged.
-
-(The `backup_maintenance_runs` staleness scan — "a group whose maintenance silently stopped" — can live in the
-maintenance bin's loop or the signal-1 staleness component; **recommend** it lives with signal-1 so all
-staleness logic is in one place, with this component only emitting the corruption alert and writing the runs
-table. **Decision §8.**)
-
-### 3.8 Repo-creation init
-
-RESOLVED (impl) — **init is scheduler-driven, not handler-driven** (cleaner: no
-kube/Jobs dependency in private-server). UPDATE (shipped): it runs **in-process**,
-not as a Job. The onboarding handler only sets `status = 'provisioning'`; it does
-**not** spawn anything. The **maintenance loop** then:
-
-- enumerates groups in `provisioning`, guarded by `last_init_error IS NULL`
-  (cleared by the operator-UI retry) **and** not-already-in-flight,
-- runs `kopia::run_init(...)` in-process under the group's per-bucket role
-  (creating the repo format blob needs the full-access role, not the device's
-  no-delete set): `kopia repository create` (CONFIRMED kopia 0.23.1: exits
-  non-zero if the repo already exists → falls back to `connect` and treats that
-  as success), connects with the fixed `canopy@canopy-maintenance` identity and
-  sets it as the maintenance **owner**, and asserts the floor-enforced initial
-  retention,
-- on completion, `complete::complete_init(...)` advances the status inline:
-  `provisioning → escrow_pending` for **FromBirth** mode (Canopy-minted
-  passphrase → escrow flow) / `provisioning → ready` for **Import** mode
-  (operator already holds the passphrase), **or** records `last_init_error` on
-  failure (operator-UI clears it to retry).
-
-(So both the original "the onboarding handler calls `spawn_init_job`" *and* the
-intermediate "scheduler spawns an init Job that POSTs `/job-report`" are
-superseded: private-server holds no kube/jobs dependency, and the maintenance
-loop runs init in-process and advances the status directly.)
-
----
-
-## 4. IaC changes (ops — `ops/pulumi/tamanu/meta`)
-
-Owned jointly with the ops/IaC spec; the canopy-jobs-relevant pieces:
-
-- RESOLVED (impl): **one** new single-replica `Recreate` Deployment in `jobs.ts`
-  mirroring `reachability`/`pingtask` — `backups` (`['backups']`), running all
-  four loop modules — `dependsOn: [migrator]`, with `costLabels`. (Originally
-  specced as three separate `backup-maintenance`/`backup-inspection`/`backup-s3-metrics`
-  Deployments — superseded.)
-- **ServiceAccount + IRSA, net-new to canopy.** UPDATE (shipped) — there is **one** SA, the `canopy-jobs` SA on
-  the single `backups` Deployment; **no per-Job SAs**, no per-kind maintenance/inspection SA split (every kopia
-  subprocess overrides `AWS_ROLE_ARN` to the group's per-bucket role and reuses the pod's projected web-identity
-  token). `spec.ts` injects no `serviceAccountName` today, so add an optional `serviceAccountName` to the
-  `spec()` container args (or a sibling helper) and create the SA via the existing
-  **`common/eksServiceAccount.ts`** helper. The SA needs:
-  - **k8s RBAC: `get` Secrets** in canopy's namespace — to read the per-group repo passwords — **and that's it**.
-    UPDATE (shipped): **NO** create/list/watch/delete Jobs, **no** pods, **no** tokenreviews.
-  - **AWS/IRSA:** assume the per-bucket roles via **web-identity** (direct, refreshing — up to each role's
-    `MaxSessionDuration`; set it high enough to cover a long maintenance run). The per-bucket role must **trust
-    the `canopy-jobs` SA's OIDC subject** and (for s3-metrics) grant `cloudwatch:GetMetricStatistics` and
-    `cloudwatch:ListMetrics` (§3.6). The per-bucket role trust + action set +
-    `s3:GetBucketObjectLockConfiguration` are **`backups`-stack** changes owned by the ops spec.
-  - **OIDC-provider-per-account** wiring so the pod's web-identity can assume cross-account (ops/IaC).
-- **The shipped `ghcr.io/beyondessential/canopy` image** bundles kopia (the kopia binary is copied into
-  `.github/Dockerfile.native` from `kopia/kopia:0.23.1`), so the backups pod runs the same image as the other
-  components — there is no separate kopia-job image, and there is **no `CANOPY_BACKUP_IMAGE` env** (no Job image
-  to reference). See §5.
-- UPDATE (shipped): **no** report Service/Secret (`CANOPY_BACKUP_REPORT_*` / bearer token) — there is no
-  `/job-report` endpoint. The k8s RBAC is a least-privilege namespace `Role`/`RoleBinding` granting only
-  `get secrets`.
-
-This component's canopy code reads the namespace (and Secret password key / web-identity token file) from
-**env/config** (like DATABASE_URL), not hardcoded, so the same binary works across stacks.
-
----
-
-## 5. Interfaces / contracts
-
-### Consumes
-
-- **DB config:** `server_group_backup_config` (read): `group_id`, `bucket`, `prefix`, `target_role_arn`,
-  `region`, `repo_password_ref`, `status`, `mode`, `last_init_error`. Schedule/retention are read from
-  `server_group_backup_schedule` / `backup_type_defaults` (the addendum moved `expected_interval`/`retention`
-  off the config table). `status = 'ready'` groups are scheduled for maintenance/inspection; `provisioning`
-  groups drive the init flow (§3.8).
-- **`server_groups`:** `ServerGroup::highest_member_ranks`, `rank_priority`, `tags` (`TagMap`) for billing
-  labels.
-- **kopia repo password Secret** named by `repo_password_ref`, in canopy's namespace — UPDATE (shipped): **read
-  via the kube API** (`worker::read_repo_password`) and passed to the kopia subprocess as `KOPIA_PASSWORD`, not
-  mounted via `secretKeyRef`. Owned by the repo-password/onboarding component; consumed here read-only.
-- **Per-bucket IAM roles** (`target_role_arn`) trusting the **`canopy-jobs` SA** OIDC subject cross-account; the
-  kopia subprocess assumes them directly via web-identity. Owned by the ops `backups`-stack spec. UPDATE
-  (shipped): one role per group (no separate full-access vs read-only role per kind).
-- **`database::backup::alerts::raise_group_event`** (group-level alert path, §3.7). `Severity` from
-  `commons_types::issue` (`OPENS_INCIDENT = [Critical, Error]`).
-- **kube API** (Secret reads only) + **CloudWatch** (s3-metrics) + **S3/STS** (preflight, and CloudWatch
-  cross-account assume).
-
-### Provides
-
-- **DB writes** other components read:
-  - `backup_maintenance_runs` (start/finish; consumed by signal-1 staleness + the stats UI panel).
-  - `backup_repo_snapshots` (ground-truth inventory; consumed by signal-1/2 reconciliation + UI).
-  - `backup_repo_stats` repo fields + `bucket_bytes` (consumed by the operator stats panel).
-- **Shared library** (`commons_servers::backup_jobs`): pure scheduler helpers — `JobKind`, billing labels,
-  `RetentionPolicy`/floor (`effective_retention_for_group`), `effective_interval_for_group`,
-  `is_due`/`slot_is_due` — shared by the loops and by private-server's validation. UPDATE (shipped): **no**
-  `spawn_*_job` helpers (init is scheduler-driven, in-process).
-- **Group-level alerts** `(source="canopy", ref ∈ {backup-maintenance, backup-corruption})` feeding the
-  existing incident → Slack pipeline.
-
-### kopia invocation (in-process subprocess — no inter-process contract)
-
-UPDATE (shipped) — there is **no inter-process contract** anymore. The earlier
-designs (config-via-ENV/args + results-via-POST-to-`/job-report`, and the
-separate `images/kopia-job/` image with its `CONTRACT.md`) are **superseded**:
-the `backups` bin invokes the **bundled** kopia CLI directly (`tokio::process`)
-in `kopia.rs`, parses its `--json`/text output into typed Rust values, and writes
-the results inline. No ENV-config handoff, no result JSON over HTTP, no
-`terminationMessagePolicy`, no `kopia-job` image.
-
-Per op the bin builds a `KopiaEnv` (`AWS_ROLE_ARN` = group's `target_role_arn`,
-`AWS_REGION`, `KOPIA_PASSWORD` from the Secret; projected web-identity token
-inherited) and runs:
-
-- **maint-quick / maint-full:** `connect` → per source set that type's policy
-  (`policy set <user@host:path> --keep-*`, from the `{type → policy}` map) →
-  `snapshot expire` → `maintenance run [--full]`. (`init` sets a
-  strictest-of-map global baseline since there are no sources yet.)
-- **inspect:** `snapshot list --all --json` → repo stats (`content stats`) → verify.
-- **init:** `repository create` + assert initial retention; connect as the canopy
-  identity and set it as the maintenance **owner**, **disabling** client-side
-  maintenance/expiry (so devices never attempt delete-needing ops).
-
-**Verified kopia 0.23.1 facts** (confirmed against the bundled version, encoded in `kopia.rs`):
-
-- **Connect identity / maintenance owner:** `maintenance run` refuses unless the connected client identity
-  equals the maintenance owner. So every op connects with `--override-username canopy --override-hostname
-  canopy-maintenance` (constants `MAINTENANCE_USER`/`MAINTENANCE_HOST`), and `init` sets that identity as the
-  owner. Devices connect with their own identity, so they never become owner.
-- **Per-source policy:** retention is applied per source via `kopia policy set <user@host:path> --keep-*`
-  (per-type, keyed by `canopy@<server-id>:<type>`).
-- **`kopia content stats` has no `--json`** — physical-bytes are parsed from its text output (a "Total Bytes:"
-  line), best-effort (`None` if unparseable).
-- **`kopia repository create` exits non-zero if the repo already exists** — `run_init` treats that as success by
-  falling back to `connect`.
-- **`kopia snapshot list --all --json`** elements carry `source = { userName, host, path }`, parsed into the
-  per-source inventory (`server_id`/`type` from `host`/`path`).
-
-Typed outcomes (written inline by `complete.rs`, no wire schema):
-
-- **init:** ok / error → status advance or `last_init_error`.
-- **maint:** `MaintOutcome { bytes_reclaimed }`.
-- **inspect:** `InspectOutcome { verify_ok, snapshot/source counts, logical/physical bytes, per-source
-  inventory }`.
-
----
-
-## 6. Data shapes (Rust)
-
-```rust
-// commons_servers::backup_jobs (kept as a *kind* enum even though there are no Jobs)
-pub enum JobKind { MaintQuick, MaintFull, Inspect, Init }
-
-pub struct BillingLabels {
-    pub product: String,            // default "tamanu"
-    pub deployment: String,         // default = group name
-    pub stage: Option<String>,      // None => omit label (all-unranked group)
-}
-
-// RESOLVED (impl): RetentionPolicy lives in the DATABASE crate
-// (database::backups::RetentionPolicy), over the schedule/type-default JSONB.
-pub struct RetentionPolicy {
-    pub keep_latest: i32,           // default 1, not floored
-    pub keep_daily: i32,            // floor 7  (FLOOR_DAILY)
-    pub keep_weekly: i32,           // floor 4  (FLOOR_WEEKLY)
-    pub keep_monthly: i32,          // floor 6  (FLOOR_MONTHLY)
-    pub keep_annual: i32,           // default 0
-}
-impl RetentionPolicy {
-    // validates (does NOT silently clamp): below-floor → AppError::BadRequest
-    pub fn validate_floor(&self) -> Result<()>;
-    pub fn from_json(value: &JsonValue) -> Option<Self>;
-    pub fn to_json(&self) -> JsonValue;
-}
-```
-
-```rust
-// database models
-pub struct MaintenanceRun { pub id: i64, pub group_id: Uuid, pub kind: String,
-    pub started_at: Timestamp, pub finished_at: Option<Timestamp>,
-    pub outcome: Option<String>, pub error: Option<String>, pub bytes_reclaimed: Option<i64> }
-
-pub struct RepoSnapshot { pub group_id: Uuid, pub source: String,
-    pub server_id: Option<Uuid>, pub latest_snapshot_at: Option<Timestamp>, pub observed_at: Timestamp }
-
-pub struct RepoStats { pub group_id: Uuid, pub snapshot_count: Option<i32>, pub source_count: Option<i32>,
-    pub logical_bytes: Option<i64>, pub physical_bytes: Option<i64>,
-    pub bucket_bytes: Option<i64>, pub observed_at: Timestamp }
-```
-
-Kopia source parse: `canopy@<server-id>:<path>` → `server_id = Uuid::parse_str(...)` (best-effort; `None` if the
-host segment isn't a UUID, e.g. legacy/imported repos; store the row with `server_id = NULL` rather than
-dropping it).
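-
-A std-only sketch of that best-effort parse. The real code parses into a `Uuid`; here a hand-written shape
-check (8-4-4-4-12 hex digits) stands in for it, and `split_source`, `looks_like_uuid`, and `server_id_of` are
-hypothetical helper names.
-
-```rust
-/// Splits a kopia source string `user@host:path` into its three parts.
-pub fn split_source(source: &str) -> Option<(&str, &str, &str)> {
-    let (user, rest) = source.split_once('@')?;
-    let (host, path) = rest.split_once(':')?;
-    Some((user, host, path))
-}
-
-/// Shape check only: five hyphen-separated hex groups of 8, 4, 4, 4 and 12 digits.
-fn looks_like_uuid(s: &str) -> bool {
-    let groups: Vec<&str> = s.split('-').collect();
-    groups.len() == 5
-        && groups
-            .iter()
-            .zip([8usize, 4, 4, 4, 12])
-            .all(|(g, n)| g.len() == n && g.bytes().all(|b| b.is_ascii_hexdigit()))
-}
-
-/// The host segment when it looks like a UUID; `None` otherwise, so the caller
-/// stores the row with a NULL `server_id` instead of dropping it.
-pub fn server_id_of(source: &str) -> Option<&str> {
-    let (_, host, _) = split_source(source)?;
-    looks_like_uuid(host).then_some(host)
-}
-```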
-
----
-
-## 7. Testing approach (per AGENTS.md)
-
-- **DB model tests** with `commons_tests::db::TestDb::run(|mut conn, _url| async move { … })`, calling model
-  functions directly (not HTTP), per the project rule. Cover:
-  - `NewMaintenanceRun::start` then `MaintenanceRun::finish` (success + failure rows), `latest_for_group`.
-  - `RepoSnapshot::upsert_many` idempotency on `(group_id, source)`; `server_id` parse from a real
-    `canopy@<uuid>:/path` source and from a non-UUID host (→ `NULL`).
-  - `RepoStats` **partial upserts**: `upsert_repo_fields` then `upsert_bucket_bytes` must not clobber each
-    other (the two-writer split is the load-bearing invariant — test it explicitly).
-  - retention **floor enforcement**: a below-floor policy is clamped up to the floor by `enforce_floor()` and
-    rejected by `validate_floor()`; an above-floor override is preserved; `keep_latest` is **not** floored.
-  - 404/absent cases (`latest_for_group` for an unknown group → `None`).
-  - Always `use database::ModelName;` imports.
-- **Pure-logic unit tests** (plain `#[test]` / `#[tokio::test(flavor = "multi_thread")]`):
-  - `slot_is_due`/`slot_offset` is **stable** for a fixed UUID across calls (regression guard against a
-    randomly-seeded hasher) and spreads across the window for distinct UUIDs.
-  - `is_due` boundaries (just-before / just-after the window, full subsumes quick).
-  - **billing label mapping** — especially `Production → "prod"` (the gotcha) and all-unranked → `None`.
-  - UPDATE (shipped) — instead of a Job-manifest test: **`kopia.rs` parsing/policy helpers** (retention `--keep-*`
-    flag building, `snapshot list --json` source parsing, `content stats` text parsing) and **`worker::Slots`**
-    concurrency (one-op-per-group exclusion + semaphore cap) — all unit-testable without a cluster.
-- **Kube/kopia interaction:** do **not** stand up a real cluster or invoke real kopia in tests. UPDATE (shipped):
-  there is no `JobSpawner` trait — concurrency/due-ness is tested via `Slots` + the `commons_servers::backup_jobs`
-  helpers; the kopia subprocess itself is not exercised in CI.
-- **Alerting:** assert the corruption path raises the group-level event (`Severity::Critical`, `ref =
-  "backup-corruption"`) via `raise_group_event` and that recovery emits the matching `active: false` event (the
-  group-scoped-issue path is shipped, §3.7).
-- **No e2e/Playwright here** — this component has no rendered UI (the onboarding/stats UI is a separate
-  component and carries its own Playwright per AGENTS.md). The init flow is exercised via the
-  `complete_init` status-advance path.
-- Run per-package while iterating: `just test-package jobs` and `just test-package database`; let CI run the
-  full suite (no final local full-suite run). `just check` for compile/warnings.
-
----
-
-## 8. Open questions / decisions to make
-
-1. **Shared-library location** — RESOLVED: **`commons_servers::backup_jobs`**.
-   (And, as shipped, init is scheduler-driven so private-server doesn't call a
-   spawn helper at all — §3.8.)
-2. **Maintenance/inspection completion signal** — RESOLVED (impl): **in-process.**
-   kopia runs as a subprocess of the `backups` pod and its typed outcome is
-   written inline (`complete.rs`); there is no completion *signal* to receive.
-   This supersedes both earlier ideas: (a) the kube-watch / pod-termination-message
-   / pod-log read (reverted — k8s truncates/drops the message and the pod may be
-   gone before it's read, unreliable in practice), and (b) the bearer-authed
-   `/job-report` POST + kube reap/crash-detect poll (dropped — the report
-   round-trip became unnecessary once kopia runs in-process, since the IRSA is
-   shared anyway). No `/job-report` endpoint, no axum, no kube poll.
-3. **Group-level alert mechanism** — RESOLVED: **group-scoped issue** via
-   `database::backup::alerts::raise_group_event` (bypasses `is_monitored`),
-   backed by the `backup_group_scoped_issues` migration (nullable
-   `issues.server_id`). No per-server `NewEvent` workaround.
-4. **Where the maintenance-staleness scan lives** — RESOLVED: the
-   staleness/reconcile sweep lives with the detection slice
-   (`database::backup::sweep`, run by the `monitor` bin); this component writes
-   `backup_maintenance_runs` and raises only the corruption alert.
-5. **kopia image** — RESOLVED (impl): there is no separate kopia-job image and no entrypoint contract. kopia is
-   **bundled** into the single shipped `ghcr.io/beyondessential/canopy` image (the kopia binary is copied into
-   `.github/Dockerfile.native` from `kopia/kopia:0.23.1`) and invoked in-process; no `CANOPY_BACKUP_IMAGE`
-   env. (§5.)
-6. **S3-metrics cross-account read** — RESOLVED (impl): assume the group's `target_role_arn` and read CloudWatch
-   with those creds (no dedicated canopy CloudWatch IRSA); per-bucket role grants `cloudwatch:GetMetricStatistics`
-   + `cloudwatch:ListMetrics`. `BucketSizeBytes` is summed across the bucket's actual `StorageType`s
-   (ListMetrics-discovered — handles Standard / Intelligent-Tiering), not a hardcoded class.
-7. **All-unranked group billing stage** — fall back to `prod` vs omit `billing.stage`. *Recommend omit.*
-8. **k8s-openapi API version** — pin to the cluster's actual control-plane version (verify; don't guess), per
-   the no-guessing-versions rule. Same for `kube`/AWS-SDK crate versions (check the registry before pinning).
-9. **Migration ownership** — confirm with the config/endpoints components whether the feature ships one
-   migration or several; this component contributes `backup_maintenance_runs` / `backup_repo_snapshots` /
-   `backup_repo_stats` either way.
-10. **Cadence/tick defaults** — quick-daily / full-weekly / inspection ≈ `expected_interval` (weekly floor) /
-    metrics ≈ `expected_interval` (weekly floor); confirmed-tunable per design but the constants live in code.
-
----
-
-## 9. Net-new infrastructure summary (none exists in canopy today)
-
-UPDATE (shipped) — the shape changed from "k8s Jobs + per-kind IRSA SAs + a separate kopia image" to "one
-long-lived Deployment that runs bundled kopia in-process":
-
-- **Kubernetes API client** (`kube` + `k8s-openapi`) on the `backups` bin — used **only to `get` Secrets** (repo
-  passwords). No Job create/list/watch/delete.
-- **ServiceAccount + IRSA** plumbed through `spec.ts` for the single `backups` pod: **one** `canopy-jobs` SA with
-  k8s RBAC = `get secrets` and an IRSA role; each kopia subprocess overrides `AWS_ROLE_ARN` to the group's
-  per-bucket role and assumes it **directly via web-identity** (refreshing, up to the role's `MaxSessionDuration`).
-  No per-Job/per-kind SAs, no chained read-only path.
-- **AWS SDK** (`aws-config` + `aws-sdk-cloudwatch` + `aws-sdk-s3` + `aws-sdk-sts`) on the `backups` bin for
-  **preflight** + **s3-metrics**. kopia's own bundled AWS SDK handles the S3 repo I/O.
-- **The shipped `ghcr.io/beyondessential/canopy` image** bundles kopia (its binary is copied into
-  `.github/Dockerfile.native` from `kopia/kopia:0.23.1`). No separate kopia-job image, no ENV-config/POST
-  contract.
-- **OIDC-provider-per-account** wiring for cross-account web-identity (ops/IaC).
-
-This component owns the **jobs-side** kube (Secret-read) client + the `canopy-jobs` IRSA; `public-server`'s
-kube/AWS additions are owned by the endpoints component.
-
----
-
-## Backup types addendum
-
-Per the plan's "Backup types":
-
-- **Retention is per-`(group, type)`.** The maintenance cycle's
-  assert-retention step asserts *each type's* effective keep-policy
-  (`server_group_backup_schedule.retention ?? backup_type_defaults`, org
-  floor applied) as a kopia **per-source/path policy**, so `kopia snapshot
-  expire` honours the right policy per type. The maintenance *run* itself
-  stays per-group (one repo per group, shared by all types).
-- **Scheduling is per-`(group, type)`** — the maintenance/inspection
-  schedulers iterate active `(group, type)` (or per-group for the
-  repo-wide maintenance run; per-type for retention assertion).
-- **Inspection** parses the snapshot's `canopy-type` tag → writes
-  `backup_repo_snapshots.type`; `(server, type)` is one source.
-- `backup_repo_stats` stays per-group (repo is shared; size is repo-level).
diff --git a/docs/plans/specs/canopy-operator-ui.md b/docs/plans/specs/canopy-operator-ui.md
deleted file mode 100644
index 4362e199..00000000
--- a/docs/plans/specs/canopy-operator-ui.md
+++ /dev/null
@@ -1,571 +0,0 @@
-# Spec: canopy-operator-ui
-
-Implementation spec for the **operator UI** of the backup-credentials system:
-the `TailscaleAdmin`-gated private-server admin endpoints and the private-web
-React/MUI screens that drive group backup onboarding/config, repo-creation
-trigger + status, the reveal-once escrow + acknowledgment, the one-off "backup
-now" trigger, and the read-only stats panel.
-
-Authoritative design: [`../backup-credentials.md`](../backup-credentials.md)
-(see esp. "Operator workflows & repo provisioning (private-server UI)",
-"Repository password ownership" → "DR escrow", "Backup cadence and triggering"
-→ "Operator one-off", "Operational story"). This component owns **only** the
-operator-facing surface; it consumes the data model, AWS/k8s machinery,
-schedulers, and detection owned by the other backup-credentials components.
-
-This spec is scoped to repo `canopy`: `crates/private-server` (axum admin fns)
-and `private-web` (React SPA + Playwright e2e), following the patterns in
-`AGENTS.md` ("Private server architecture", "React frontend").
-
----
-
-## 1. Purpose
-
-Make group backup onboarding a real, self-serve operator workflow in the
-existing Tailscale-gated admin SPA — *not* a SQL bootstrap. Concretely, give an
-operator the ability to:
-
-1. **Onboard / configure** a group's backup: set `bucket`, `target_role_arn`,
-   `region`, `expected_interval`, `retention`, choose from-birth vs. import
-   mode, and kick off repo creation.
-2. **See repo-creation status** and the lifecycle state machine
-   (`provisioning → escrow_pending → ready`), including init-Job failures.
-3. **Reveal the generated passphrase once** (from-birth repos), with a "saved
-   to Bitwarden" acknowledgment that flips `escrow_pending → ready`.
-4. **Trigger a one-off "backup now"** for any group (scheduled or
-   manual-only), writing a `backup_requests` row the device picks up on its
-   next ~1-minute tick.
-5. **Read a stats panel** — cached `backup_repo_stats` plus recent
-   `backup_runs` / `backup_maintenance_runs` per group.
-
-The UI is the human end of the control plane; it never talks to AWS, kopia, or
-k8s directly — it only reads/writes Canopy's database via private-server fns
-and *triggers* the init Job through a fn that the jobs-side machinery acts on.
-
----
-
-## 2. Where it lives in the repo
-
-### Backend (private-server)
-
-- New module `crates/private-server/src/fns/backups.rs` exposing
-  `pub fn routes() -> OpenApiRouter<AppState>`, mounted under `/api/backups`
-  in `crates/private-server/src/fns/mod.rs` (add
-  `.nest("/backups", backups::routes())` and `pub mod backups;`).
-- Follows the exact handler shape used by `server_groups.rs`: bare axum
-  handlers `(State<AppState>, [TailscaleAdmin], Json<Args>) -> Result<Json<T>>`,
-  each annotated with `#[utoipa::path(post, path = "/<fn>", operation_id =
-  "backups_<fn>", tag = "backups", security(("tailscale-admin" = [])), …)]`.
-  Read-only endpoints use `security(("tailscale-user" = []))` (matching how
-  `server_groups::list`/`get` are user-gated while mutations are admin-gated).
-- DB access via `state.db.get().await?`; all model logic lives in the
-  **database crate** (`crates/database/src/`), never inline in private-server
-  (per AGENTS.md: no diesel in private-server).
-
-### Backend (database crate)
-
-The UI fns are thin wrappers over model functions. The migrations and models
-listed in §4 are **shared with the data-model component** of
-backup-credentials; if that component lands them first, this component only
-adds the *operator-facing* query/mutation methods. To avoid a silent gap, this
-spec lists the full set the UI needs; whoever lands the table owns the
-migration, and this component owns the query methods it calls. Coordinate via
-the `depends_on` contract in the orchestration metadata.
-
-### Frontend (private-web)
-
-- New route components under `private-web/src/routes/`:
-  - `BackupConfig.tsx` — onboarding / edit config form (create + edit modes,
-    mirroring `GroupEdit.tsx`'s split).
-  - `BackupEscrow.tsx` — reveal-once passphrase + ack (often rendered as a
-    section inside the group backup page, gated on `status === 'escrow_pending'`).
-  - `BackupPanel.tsx` — the per-group backup overview: status, stats, recent
-    runs, "backup now" button, links to config/escrow.
-- Surfaced from the existing **group detail page** (`GroupDetail.tsx`): add a
-  "Backups" section/card that renders `BackupPanel` (or a "Set up backups"
-  CTA when no config row exists). New routes registered in `App.tsx`:
-  - `/groups/:id/backups` → `BackupPanel`
-  - `/groups/:id/backups/config` → `BackupConfig` (create or edit)
-- Wire types come from `private-web/src/api-types.ts` (generated) re-exported
-  through `private-web/src/types.ts`. UI-only label/order constants
-  (status labels, retention field labels) go in `types.ts` below the
-  re-exports, same as `SEVERITY_INTENT` / `SERVER_RANK_ORDER`.
-- After any handler request/response change, run **`just gen-openapi`** and
-  commit both `private-web/openapi.json` and `private-web/src/api-types.ts`
-  alongside the Rust change (per AGENTS.md).
-
-### e2e
-
-- `private-web/e2e/backups.spec.ts` (new), using `./test-fixtures` +
-  `./seed.ts`. Extend `seed.ts` with `seedServerGroupBackupConfig`,
-  `seedBackupRun`, `seedBackupRepoStats`, `seedBackupRequest` helpers and add
-  the new tables to `resetSeededTables`'s `TRUNCATE` list.
-
----
-
-## 3. Lifecycle state machine (the UI's spine)
-
-`server_group_backup_config.status ∈ { 'provisioning', 'escrow_pending',
-'ready' }`. The UI renders one of three top-level states per group, plus the
-"no config yet" zero-state and an explicit "init failed" sub-state:
-
-```
-(no row)  ──[Set up backups: create config]──►  provisioning
-provisioning  ──[init Job creates repo, from-birth]──►  escrow_pending
-provisioning  ──[init Job creates repo, import mode]──►  ready
-provisioning  ──[init Job fails]──►  provisioning + last_init_error shown (retry available)
-escrow_pending  ──[operator acks "saved to Bitwarden"]──►  ready
-ready  ──[edit non-structural config]──►  ready
-```
-
-- The UI **does not** itself run the init Job; it calls `backups.create_repo`,
-  which records intent / sets `status='provisioning'` and lets the jobs-side
-  init-Job machinery (owned by the maintenance/jobs component) pick it up. The
-  UI polls config status (`useReloadInterval`, like the incidents badge) to
-  reflect `provisioning → escrow_pending/ready`.
-- **Backups are dormant until `ready`** — this is enforced on the device path
-  (412/409), not in this UI; the UI surfaces *why* (status chip + helper text)
-  so an operator isn't confused that "configured" ≠ "live".
-- **Import mode** skips escrow: `create_repo` with `mode='import'` moves
-  `provisioning → ready` once the repo connects (operator already holds the
-  passphrase / points `repo_password_ref` at an existing Secret).
-
-How the init Job's outcome reaches `status` and `last_init_error` is the
-jobs-side component's contract (see §6 consumed contracts). The UI only reads
-those fields; it must not assume an in-process transition.
-
----
-
-## 4. Data shapes (DB)
-
-These tables come from the backup-credentials data model; the UI reads/writes
-the subset below. Migrations are created with `just migration NAME` (never
-hand-authored, per AGENTS.md). Several of these (`status`, `last_init_error`,
-`mode`, `repo_password_ref`, escrow tracking) are the columns the UI most
-depends on, so if the base table is authored elsewhere, confirm these exist.
-
-### `server_group_backup_config` (read + write by UI)
-
-Per the main plan's schema, plus the columns the UI lifecycle needs. If the
-base table predates this work, the UI requires at minimum:
-
-```sql
--- (from backup-credentials.md "New table: server_group_backup_config")
-group_id          UUID PRIMARY KEY REFERENCES server_groups(id) ON DELETE CASCADE,
-bucket            TEXT NOT NULL,
-prefix            TEXT NOT NULL DEFAULT '',
-target_role_arn   TEXT NOT NULL,
-region            TEXT,
-expected_interval INTERVAL,            -- NULL = manual-only
-retention         JSONB NOT NULL,      -- kopia keep-* policy
-repo_password_ref TEXT NOT NULL,
-status            TEXT NOT NULL,       -- 'provisioning'|'escrow_pending'|'ready'
-created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
-updated_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
--- UI/lifecycle additions this component needs (confirm/author):
-mode              TEXT NOT NULL DEFAULT 'from_birth',  -- 'from_birth' | 'import'
-last_init_error   TEXT,                -- set when the init Job fails; cleared on success/retry
-escrow_acked_at   TIMESTAMPTZ,         -- set when the operator acks the reveal-once (from-birth)
-escrow_acked_by   TEXT                 -- operator identity (from TailscaleAdmin)
-```
-
-`expected_interval` maps to/from the UI via the same minutes/seconds pattern
-`GroupEdit.tsx` uses for `slack_open_delay`, except the API column is
-`INTERVAL`. Reuse `database::pg_duration::PgDuration` (already used on
-`server_groups.slack_open_delay`) and `#[schema(value_type = Option<i64>)]`
-so the wire type is seconds. **NULL `expected_interval` = manual-only** must be
-representable distinctly from `0`; the form needs a "Manual only (no schedule)"
-toggle, not just an empty number field.
-
-`retention` is a small JSON object; on the wire model it as a typed struct
-(not raw `serde_json::Value`) so `openapi-typescript` emits a real shape:
-
-```rust
-#[derive(Serialize, Deserialize, ToSchema)]
-pub struct RetentionPolicy {
-    pub keep_latest:  i32,  // default 1 (not floor-enforced)
-    pub keep_daily:   i32,  // floor 7
-    pub keep_weekly:  i32,  // floor 4
-    pub keep_monthly: i32,  // floor 6
-    pub keep_annual:  i32,  // default 0
-}
-```
-
-The org-minimum **floor** (`keep_daily ≥ 7, keep_weekly ≥ 4, keep_monthly ≥ 6`)
-is enforced in the model/handler on create+update; the UI also validates
-client-side (helper text + disabled submit) but the server is authoritative
-(returns `400 AppError::BadRequest`-style problem-details on violation).
-
-### `backup_requests` (write + read by UI — "backup now")
-
-```sql
-CREATE TABLE backup_requests (
-    server_id    UUID NOT NULL REFERENCES servers(id),
-    purpose      TEXT NOT NULL,            -- "backup" | "restore"
-    requested_at TIMESTAMPTZ NOT NULL DEFAULT now(),
-    requested_by TEXT,                     -- operator identity (TailscaleAdmin login)
-    PRIMARY KEY (server_id, purpose)
-);
-```
-
-The "backup now" button targets a **server** (the request is keyed by
-`server_id`), so the UI offers per-server one-off triggers within the group's
-backup panel (the group's member servers come from `server_groups.get`). A
-group-wide "backup all members" convenience can fan out to one row per member
-server (open question §9, item 2). The device path clears the row on report;
-the UI shows "requested <TimeAgo>, pending" while a row exists.
-
-### Read-only display tables (read by UI)
-
-- `backup_repo_stats` (PK `group_id`): `snapshot_count`, `source_count`,
-  `logical_bytes`, `physical_bytes`, `bucket_bytes` (nullable), `observed_at`.
-- `backup_runs` (recent N per group): `device_id`, `purpose`, `outcome`,
-  `error`, `bytes_uploaded`, `snapshot_id`, `reported_at`.
-- `backup_maintenance_runs` (recent N per group): `kind`, `started_at`,
-  `finished_at`, `outcome`, `error`, `bytes_reclaimed`.
-- `backup_repo_snapshots` (optional, for a per-source "latest snapshot" list):
-  `source`, `server_id`, `latest_snapshot_at`.
-
-### Escrow secret read
-
-The reveal-once passphrase is read from the **k8s Secret** named by
-`repo_password_ref`. This requires `public-server`/the relevant pod to have a
-kube client + Secret-read RBAC — that machinery is **net-new and owned by the
-AWS/k8s-infra component**, not this UI. The escrow reveal endpoint
-(`backups.reveal_escrow`) consumes that kube-client capability on `AppState`.
-If private-server does not have the kube client at the time this lands, see
-§9, open question 1, on where the escrow read executes.
-
----
-
-## 5. Interfaces this component EXPOSES
-
-All under `/api/backups/<fn>`, POST, `TailscaleAdmin`-gated unless noted.
-Argument/response structs live in `backups.rs` with `#[derive(…, ToSchema)]`;
-operation ids prefixed `backups_`. Names are the contract for the React layer
-and any other consumer.
-
-| fn | gate | args | returns | purpose |
-|----|------|------|---------|---------|
-| `backups_get` | user | `{ server_group_id }` | `BackupConfigView \| null` | full config + lifecycle for a group (null = no config) |
-| `backups_list` | user | `{}` | `Vec<BackupConfigSummary>` | all configured groups (fleet overview) |
-| `backups_create` | admin | `CreateBackupConfigArgs` | `BackupConfigView` | insert config row (`status='provisioning'`), validate floor; does **not** create repo |
-| `backups_update` | admin | `UpdateBackupConfigArgs` | `BackupConfigView` | edit config on a `ready` group. RESOLVED (impl): `UpdateBackupConfigArgs` carries **only `region`** — structural fields are inexpressible (no 409 path). (Interval/retention are per-`(group, type)` on the schedule table, edited separately.) |
-| `backups_create_repo` | admin | `{ server_group_id }` | `BackupConfigView` | record intent for the init Job (sets/keeps `provisioning`, clears `last_init_error`); idempotent retry |
-| `backups_reveal_escrow` | admin | `{ server_group_id }` | `RevealEscrowResponse` | reveal-once passphrase (from-birth, `escrow_pending` only); reads the k8s Secret |
-| `backups_ack_escrow` | admin | `{ server_group_id }` | `BackupConfigView` | flip `escrow_pending → ready`, stamp `escrow_acked_at/by` |
-| `backups_request_now` | admin | `{ server_id, purpose }` | `()` | upsert a `backup_requests` row (one-off "backup now") |
-| `backups_cancel_request` | admin | `{ server_id, purpose }` | `()` | delete a pending `backup_requests` row |
-| `backups_stats` | user | `{ server_group_id }` | `BackupStatsView` | `backup_repo_stats` + recent `backup_runs` + recent `backup_maintenance_runs` + pending requests |
-| `backups_delete` | admin | `{ server_group_id }` | `()` | delete the config row (decommission; see audit-table FK note in main plan) |
-
-### Response/argument shapes (wire)
-
-```rust
-pub struct BackupConfigView {
-    pub server_group_id: Uuid,
-    pub bucket: String,
-    pub prefix: String,
-    pub target_role_arn: String,
-    pub region: Option<String>,
-    #[schema(value_type = Option<i64>)]            // seconds; None = manual-only
-    pub expected_interval: Option<PgDuration>,
-    pub retention: RetentionPolicy,
-    pub mode: BackupRepoMode,                       // FromBirth | Import (serde snake_case)
-    pub status: BackupConfigStatus,                 // Provisioning | EscrowPending | Ready
-    pub last_init_error: Option<String>,
-    pub escrow_acked_at: Option<Timestamp>,
-    pub created_at: Timestamp,
-    pub updated_at: Timestamp,
-    // NOTE: never includes repo_password_ref's *value* — only reveal_escrow does.
-}
-
-pub struct CreateBackupConfigArgs {
-    pub server_group_id: Uuid,
-    pub bucket: String,
-    #[serde(default)] pub prefix: String,
-    pub target_role_arn: String,
-    pub region: Option<String>,
-    #[schema(value_type = Option<i64>)]
-    pub expected_interval: Option<PgDuration>,
-    pub retention: RetentionPolicy,
-    pub mode: BackupRepoMode,
-    /// Import mode only: name of a pre-existing k8s Secret holding the
-    /// passphrase. From-birth leaves this None (Canopy generates + names it).
-    pub repo_password_ref: Option<String>,
-}
-
-pub struct RevealEscrowResponse {
-    pub passphrase: String,        // shown once; UI must not persist
-    pub repo_password_ref: String, // the Secret name, for the "saved where" note
-}
-
-pub struct BackupStatsView {
-    pub stats: Option<BackupRepoStats>,            // None until first inspection
-    pub recent_runs: Vec<BackupRunRow>,            // most-recent first, capped (e.g. 20)
-    pub recent_maintenance: Vec<BackupMaintenanceRow>,
-    pub pending_requests: Vec<PendingRequestRow>,  // server_id, purpose, requested_at, requested_by
-}
-```
-
-Use `commons_types::Uuid` and `jiff::Timestamp` to match the rest of the
-codebase (`server_groups.rs` uses these). Status/mode are string enums with
-`#[serde(rename_all = "snake_case")]` so the generated TS unions read
-`"provisioning" | "escrow_pending" | "ready"` and `"from_birth" | "import"`.
-
-### Error contract (problem-details)
-
-Reuse existing `AppError` variants; map to the documented statuses in
-`#[utoipa::path(responses(...))]`:
-
-- `404` (`AppError`'s not-found path) — the group itself is missing (a bad
-  `server_group_id`). A group that exists but has no config is not a 404:
-  `backups_get` returns `null` for "no config" rather than 404, matching the
-  "zero-state" UI.
-- `400` — retention floor violation (a bad-request variant). `create` for a
-  group that already has config is not a 400: prefer
-  `AppError::Conflict(String)` (→ 409) for "already configured". Pick per
-  existing `commons-errors` variants and **update ERRORS.md** if a new variant
-  is added (per AGENTS.md, heading must match the problem type).
-- `409` (`AppError::Conflict`) — `reveal_escrow`/`ack_escrow` called when
-  `status != 'escrow_pending'`, or `create_repo` on an already-ready group.
-  RESOLVED (impl) — the "409 on **structural-field** update" path **does not
-  exist by design**: `UpdateBackupConfigArgs` carries **only `region`**, so a
-  bucket/role/mode edit is simply *inexpressible* over the wire rather than
-  rejected at runtime. Field-omission supersedes the 409.
-- `502` — `reveal_escrow` if the k8s Secret read fails (control-plane error).
-
----
-
-## 6. Interfaces this component CONSUMES
-
-From **other backup-credentials components** (must exist first or be stubbed):
-
-- **Data model component (canopy DB):** the migrations/tables in §4 and the
-  base `database::server_group_backup_config` / `backup_requests` /
-  `backup_runs` / `backup_maintenance_runs` / `backup_repo_stats` models.
-  Contract: model structs with the columns in §4; the UI adds query methods
-  (`get_for_group`, `list_configured`, `create`, `update`, `set_status`,
-  `ack_escrow`, recent-runs queries) — author these in the database crate.
-- **AWS/k8s-infra component:** a **kube client on `AppState`** with Secret-read
-  capability, so `backups_reveal_escrow` can read the passphrase Secret named
-  by `repo_password_ref`. This is net-new (canopy has no kube client today).
-  Contract consumed: `state.kube` (or equivalent) + a helper like
-  `read_secret(name, key) -> Result<String>`. The UI does **not** create
-  Secrets — from-birth passphrase generation + Secret creation is the init
-  Job's job; the UI only reveals.
-- **Jobs/maintenance component:** the **init Job** that performs
-  `kopia repository create` and drives `status`/`last_init_error`. Contract
-  consumed: the UI's `backups_create_repo` records intent (sets
-  `status='provisioning'`, clears `last_init_error`); the Job is expected to
-  transition the row to `escrow_pending` (from-birth) or `ready` (import), or
-  set `last_init_error` on failure. The exact handoff (a flag column, a queue,
-  or the Job polling `provisioning` rows) is the jobs component's decision —
-  this UI only depends on the *observable* `status`/`last_init_error` fields.
-- **Device path / detection components:** none consumed directly; the UI
-  surfaces their *output* (runs, staleness via the existing issues/events
-  model already shown on the server/group pages — no new wiring here).
-  UPDATE (shipped): the **group-scoped** issues raised by detection/inspection
-  (nullable `server_id`, see the detection-preflight spec's `raise_group_event`)
-  are rendered — `IssueRow`/the `issues` fn handle a null `server_id` and key
-  off `server_group_id`/`server_group_name` for group-scoped issues (member
-  servers resolved via `Server::group_refs_by_server_ids`), so a corruption /
-  preflight alert with no member server shows correctly.
-
-From **existing canopy code** (already present):
-
-- `commons_servers::tailscale_auth::TailscaleAdmin` extractor (gate +
-  operator identity for `requested_by` / `escrow_acked_by`). Confirm how to
-  extract the login string from it (mirror whatever `admins.rs` / audited
-  endpoints do).
-- `database::server_groups::ServerGroup` (member list for per-server "backup
-  now"; group existence checks).
-- React: `useApi` / `useApiAction` (`private-web/src/api.ts`),
-  `useIsAdmin` (`hooks/useIsAdmin.tsx`), `useReloadInterval`
-  (status polling), `TimeAgo`, `usePageTitle`, `TagsEditor` pattern.
-
----
-
-## 7. Frontend behaviour detail
-
-### `BackupPanel` (`/groups/:id/backups`, and a card on `GroupDetail`)
-
-- `useApi("backups", "get", { server_group_id: id }, [id])`.
-  - `null` → zero-state: "Backups not set up" + admin-only "Set up backups"
-    button → `/groups/:id/backups/config`.
-  - non-null → status chip (`provisioning`/`escrow_pending`/`ready` with the
-    same intent-helper-text pattern as `SEVERITY_INTENT`), config summary
-    (bucket, region, interval or "Manual only", retention), and:
-    - `provisioning` → spinner + "Creating repository…"; if `last_init_error`,
-      an error Alert + admin "Retry repo creation" (`create_repo`).
-    - `escrow_pending` → prominent warning card → render `BackupEscrow`.
-    - `ready` → stats (`backups.stats`), recent runs table, per-server
-      "Backup now" buttons.
-- Poll status with `useReloadInterval` (e.g. 5s while `provisioning`, slower
-  when `ready`) so the operator sees the init Job land without a manual reload.
-
-### `BackupConfig` (`/groups/:id/backups/config`)
-
-- Create vs edit split like `GroupEdit.tsx` (`isCreate = no config row`).
-- Fields: bucket, target_role_arn, region (optional), **schedule mode toggle**
-  (Manual only ↔ Scheduled every N minutes — `expected_interval`), retention
-  (5 number fields with floor validation + helper text), repo mode
-  (From-birth ↔ Import; Import reveals a `repo_password_ref` field).
-- Structural fields (bucket, target_role_arn, mode) are **create-only**;
-  disabled in edit mode with helper text ("changing the bucket is a repo
-  migration — out of scope here").
-- On create success → if from-birth, `create_repo` is offered as the next step
-  (or auto-called) so the operator flows into provisioning → escrow.
-
-### `BackupEscrow`
-
-- Renders only when `status === 'escrow_pending'` and `mode === 'from_birth'`.
-- "Reveal passphrase" button → `useApiAction("backups", "reveal_escrow")`;
-  shows the passphrase in a monospace, copy-to-clipboard block with a loud
-  "Save this to Bitwarden NOW — it cannot be shown again" warning.
-- A required checkbox "I have saved this passphrase to Bitwarden" enables the
-  "Acknowledge & activate backups" button → `ack_escrow` → flips to `ready`.
-- The reveal is deliberately re-callable while `escrow_pending` (operator may
-  reload before acking); once `ready`, `reveal_escrow` returns 409.
-
-### Admin gating
-
-- Read views (`get`/`list`/`stats`) render for any Tailscale user (user-gate),
-  matching `server_groups::list`/`get`.
-- All mutating buttons gate on `useIsAdmin() === true`, mirroring
-  `GroupDetail.tsx`'s `admin && (<Button …/>)` pattern.
-
----
-
-## 8. Testing approach (per AGENTS.md)
-
-### Rust endpoint tests (`crates/private-server/tests/`)
-
-- File `backups.rs` (no `_test` suffix). Use `commons_tests::server::run(|conn,
-  _public, private| async move { … })`. Endpoints at
-  `/api/backups/<fn>`, params via `.json(&serde_json::json!({...}))`, empty
-  body `{}` for no-arg fns.
-- Cover, with `use database::…;` imports and direct model seeding via `conn`:
-  - `create` happy path (status becomes `provisioning`); retention-floor
-    rejection → 400/expected status; duplicate-config → 409.
-  - `get` returns `null` for unconfigured group; full view once configured.
-  - `create_repo` clears `last_init_error` and is idempotent.
-  - `ack_escrow` only from `escrow_pending` (409 otherwise); stamps
-    `escrow_acked_at/by`.
-  - `request_now` upserts; `cancel_request` deletes; PK `(server_id, purpose)`
-    means re-request is a no-op upsert, not an error.
-  - `update` accepts `region`. RESOLVED (impl): there's no structural-field
-    rejection to test — `UpdateBackupConfigArgs` only carries `region`, so a
-    structural change is inexpressible (no 409 path).
-  - Auth: confirm admin-gated fns reject non-admin (the test harness's auth
-    posture — match how other admin fns are tested).
-  - `reveal_escrow`: since the kube client is net-new, test against a stubbed
-    secret reader if the harness allows; otherwise gate this test on the infra
-    component and assert the 409-when-not-escrow_pending branch (which needs no
-    Secret read).
-
-### Playwright e2e (`private-web/e2e/backups.spec.ts`)
-
-Per AGENTS.md, UI features ship with e2e coverage in the same change. Seed
-state directly via `seed.ts` helpers (extend it — see §2). Cover the rendered
-behaviour Rust tests can't:
-
-- Zero-state: group with no config shows "Set up backups" (admin) /
-  hidden (non-admin).
-- Config form: create writes a `server_group_backup_config` row with the right
-  `expected_interval` (assert via `EXTRACT(EPOCH …)` like the cooldown test),
-  retention JSON, and `status='provisioning'`; floor violation blocks submit.
-- Manual-only toggle persists `expected_interval IS NULL` (distinct from 0).
-- Escrow flow: seed `status='escrow_pending'`, mode from_birth; reveal shows
-  the passphrase, ack checkbox gates the button, ack flips DB row to `ready`
-  and stamps `escrow_acked_at`. (Stub/seed the Secret value via whatever the
-  reveal path reads — coordinate with the infra component; if the kube client
-  isn't available in e2e, test the ack transition with a pre-revealed state and
-  cover reveal separately or behind a fixture flag.)
-- "Backup now": clicking writes a `backup_requests` row for the server; a
-  pending row shows "requested <ago>"; cancel deletes it.
-- Stats panel: seed `backup_repo_stats` + a couple of `backup_runs`; assert the
-  numbers and recent-run rows render; `bucket_bytes` NULL renders as "unknown",
-  not hidden (per the user's "indicators show unknown state" rule).
-
-### Frontend typecheck / unit
-
-- `just typecheck` for TS (not bare `tsc` — per AGENTS.md). Run `just
-  gen-openapi` first so `api-types.ts` matches the handlers.
-- Optional vitest unit tests for any pure helper (e.g. interval↔minutes,
-  retention-floor validation) mirroring `humanDuration` style.
-
----
-
-## 9. Open questions / decisions to make
-
-1. **Where does `reveal_escrow` read the Secret?** The plan says the repo
-   password is a k8s Secret Canopy owns, and that `public-server` gets a kube
-   client for `/backup-target`. But escrow reveal is a **private-server**
-   (admin) concern. Decide: (a) give private-server its own kube client +
-   Secret-read RBAC, or (b) have private-server proxy to an internal endpoint,
-   or (c) store the from-birth passphrase transiently for the escrow window.
-   Leaning (a) — least machinery, and private-server is already the admin trust
-   surface. This is a dependency on the AWS/k8s-infra component.
-
-2. **One-off "backup now" granularity.** `backup_requests` is keyed by
-   `server_id`. Do we expose per-server buttons only, a group-wide "back up all
-   members" fan-out (N rows), or both? The main plan's operator-story says
-   "trigger a best-effort immediate backup for any group", implying a
-   group-level affordance. Proposal: per-server buttons + a group-level "Back
-   up all" that fans out, with restore as a separate (less prominent) action.
-
-3. **Init-Job handoff signal.** This UI sets `status='provisioning'` and clears
-   `last_init_error` on `create_repo`; the exact mechanism the jobs component
-   uses to notice (poll `provisioning` rows? a dedicated `init_requested_at`
-   column? a queue table?) is the jobs component's call. Confirm so the UI
-   writes whatever field the Job reads. Default assumption: Job polls
-   `status='provisioning'` rows.
-
-4. **Retry semantics on init failure.** Is "Retry repo creation" just
-   `create_repo` again (idempotent), or does a failed repo need cleanup first
-   (e.g. a half-created kopia format blob)? UI calls `create_repo`; the Job
-   owns idempotency/cleanup. Flag for the jobs component.
-
-5. **Editing structural config.** Spec rejects bucket/role/mode edits
-   post-creation (repo migration is out of scope). Confirm with product that
-   the only "change" path for those is delete-config + re-onboard, and document
-   that in the onboarding runbook.
-
-6. **`region` change UX.** The plan notes changing region/bucket is really a
-   repo migration. `region` is editable per the config schema but pointing at a
-   different region typically means a different bucket. Decide whether the edit
-   form allows `region` edits at all, or gates them behind a warning. Proposal:
-   allow but warn loudly.
-
-7. **Group-level vs server-level stats anchor.** Stats are per-group
-   (`backup_repo_stats` PK `group_id`) but runs/requests are per-server/device.
-   The panel mixes both; confirm the grouping the operator expects (group
-   headline stats + per-server run history).
-
-8. **Decommission flow in UI.** `backups_delete` removes the config row;
-   audit tables intentionally have no CASCADE and the bucket persists
-   (object-locked). Decide how much of that the UI explains vs. defers to the
-   runbook — at minimum a confirm dialog noting "the bucket and its locked
-   objects persist; this only stops issuance".
-
----
-
-## Backup types addendum
-
-Per the plan's "Backup types", the UI is type-aware:
-
-- **Capabilities view** per server: the registered types + an `enabled`
-  toggle (the per-server on/off; seeded from `auto_enable`).
-  RESOLVED (impl) — **the per-server capabilities enable-toggle is shipped**
-  (no longer future/missing): endpoints `backups.capabilities` (read, the
-  server's registered `(type, enabled)` rows) + `backups.set_capability` (admin
-  toggle of `enabled`), with the UI on the **ServerDetail** page.
-- **Per-`(group, type)` schedule + retention** editing
-  (`server_group_backup_schedule` overrides; show the inherited type
-  default when no override). Org retention floor enforced in the form.
-- **One-off "backup now"** picks a `(server, type)` (writes
-  `backup_requests` with the type).
-- Stats panel groups by type where useful (latest snapshot per
-  `(server, type)`).
-- (Optional, later) a small admin view of `backup_type_defaults`.
diff --git a/docs/plans/specs/canopy-public-server.md b/docs/plans/specs/canopy-public-server.md
deleted file mode 100644
index a45b3cc4..00000000
--- a/docs/plans/specs/canopy-public-server.md
+++ /dev/null
@@ -1,612 +0,0 @@
-# Spec: canopy-public-server — device backup endpoints (backup-credentials, backup-target, backup-report, backup-capabilities)
-
-Implementation spec for the `public-server` slice of the
-[backup-credentials](../backup-credentials.md) system. This is **canopy's
-first AWS SDK usage** and its first Kubernetes API client on the
-internet-facing pod. Read the parent plan for the why; this file is the
-how, grounded in the real `crates/public-server` code.
-
-Authoritative design: [`backup-credentials.md`](../backup-credentials.md)
-(esp. "Endpoint shape", "Permission templates", "IAM model", "Repository
-password ownership", and the "Accepted stage-1 risk" note). The stage-2
-hardening that removes this component's blast radius is
-[`backup-credentials-blind-relay.md`](../backup-credentials-blind-relay.md)
-— **out of scope here**; we build the stage-1 on-demand minting path.
-
-## Purpose
-
-Add `ServerDevice`-authenticated device endpoints to `public-server`.
-UPDATE (shipped): **four** endpoints — the backup-types addendum added
-`POST /backup-capabilities` (bestool registers the backup types its server can
-run). The original three:
-
-- `POST /backup-credentials` — mint short-lived per-group S3 creds via a
-  cross-account `sts:AssumeRole` and return them in `credential_process`
-  JSON. `restore` purpose adds a read-only session policy.
-- `GET /backup-target` — return `{storage, bucket, prefix, region,
-  repo_password}` so bestool can reconstruct the kopia repo connection on
-  every run. `repo_password` is read from a k8s Secret.
-- `POST /backup-report` — record a run outcome into `backup_runs` (the
-  "a backup actually completed" signal staleness detection reads).
-
-All three resolve `device → server (live_by_device_id) → group_id →
-server_group_backup_config` identically, returning **412** when the device
-is bound to no live server and **409** when the server is ungrouped or the
-group has no `ready` backup config.
-
-This component **owns the AWS STS client and the kube client on
-`AppState`**, plus the net-new deps to support them. It **consumes** the
-DB models (new tables defined by the canopy-database component) and the
-IAM trust / bucket config provisioned by the ops `backups` stack.
-
-## Scope boundary
-
-In scope (this component):
-- The three handlers + their module(s) under `crates/public-server/src/`.
-- Net-new workspace deps: `aws-config`, `aws-sdk-sts`, `aws-sdk-s3` (s3
-  only if a behavioural no-op lands here; see open questions), `kube`,
-  `k8s-openapi`.
-- `AppState` AWS-client + kube-client fields and their `FromRef` impls;
-  binary `init()` wiring; test-harness wiring.
-- The restore session-policy JSON builder.
-- Inserting `backup_credential_issuances` rows (capturing `AccessKeyId` +
-  best-effort `sts_request_id`).
-- A new `AppError` variant for STS/upstream failure → **502**.
-
-Out of scope (other components — depend on, don't build):
-- The table schemas/migrations and diesel models (`canopy-database`).
-- Staleness scan, maintenance/inspection/preflight Jobs, schedulers
-  (`canopy-jobs`).
-- Operator onboarding UI, escrow, one-off "backup now" (`private-server` /
-  `private-web`).
-- The ops `backups`-stack IRSA-trust + action-set + lifecycle changes, the
-  ServiceAccount/IRSA Pulumi wiring (`ops`). This spec states the contract
-  it needs from ops but does not author the Pulumi.
-- bestool's `canopy backup` / `backup-credentials` subcommands (separate
-  repo).
-
-## Where it lives in the repo
-
-`public-server` mounts feature routers in `crate::routes()`
-(`crates/public-server/src/lib.rs:24`), each module exposing
-`pub fn routes() -> OpenApiRouter<AppState>`. Existing device modules:
-`events.rs`, `tags.rs`, `statuses.rs`, `servers.rs`, `versions.rs`.
-
-Add one module, `crates/public-server/src/backup.rs`, exposing all three
-endpoints, mounted **at the root** (not nested) so the paths are exactly
-`/backup-credentials`, `/backup-target`, `/backup-report` (matching the
-plan's endpoint shape). `events.rs` is already merged at the root via
-`.merge(events::routes())` and is the pattern to copy:
-
-```rust
-// crates/public-server/src/lib.rs, in routes()
-let mut router = OpenApiRouter::new()
-    .merge(events::routes())
-    .merge(backup::routes())   // NEW — root-mounted, like events
-    .nest("/artifacts", artifacts::routes())
-    // ...
-```
-
-`backup::routes()` registers all three handlers:
-
-```rust
-pub fn routes() -> OpenApiRouter<AppState> {
-    OpenApiRouter::new()
-        .routes(routes!(credentials)) // POST /backup-credentials
-        .routes(routes!(target))      // GET  /backup-target
-        .routes(routes!(report))      // POST /backup-report
-}
-```
-
-Add `pub mod backup;` to `lib.rs`. The module is **not** behind the `ui`
-feature (it's a device API, like `events`/`statuses`).
-
-## Net-new dependencies
-
-No AWS SDK and no kube client exist anywhere in the workspace today (the
-only `aws-*` presence is the `aws-lc-rs` crypto backend, unrelated). Add
-to the workspace `Cargo.toml` `[workspace.dependencies]` and reference
-from `crates/public-server/Cargo.toml`:
-
-- `aws-config` — default credential/region provider chain. In-cluster this
-  resolves the pod's IRSA web-identity automatically (`AWS_ROLE_ARN` +
-  `AWS_WEB_IDENTITY_TOKEN_FILE`, injected by EKS once the ServiceAccount is
-  IRSA-annotated). No explicit credential wiring in canopy code.
-- `aws-sdk-sts` — for `assume_role`.
-- `aws-sdk-s3` — **only if** a behavioural no-op (e.g. `GetBucketLocation`)
-  is performed at issuance time. The plan puts deep S3 checks in the
-  preflight Job, not the hot issuance path, so the public-server may not
-  need `aws-sdk-s3` at all. Decide per the open question below; do not add
-  it speculatively.
-- `kube` (client + `Api<Secret>`) and `k8s-openapi` (with a pinned
-  Kubernetes feature, e.g. `v1_30` — match the cluster; verify, don't
-  guess). Used by `GET /backup-target` to read the repo-password Secret.
-
-Do **not** pin AWS/kube crate versions from memory — check the registry
-(`cargo add --dry-run` / crates.io) and use the workspace's
-`[workspace.dependencies]` convention. The repo rule is "never guess
-versions; verify or ask."
-
-Feature-gating: these are core to the backup endpoints, which are always
-compiled (not `ui`-gated). Add them as unconditional `public-server` deps.
-
-## AppState changes
-
-`AppState` (`crates/public-server/src/state.rs`) is `Clone + Debug` and is
-constructed in three places: `init()` (binary), `from_db*()` (helpers),
-and the test harness (`commons-tests/src/server.rs:92` builds the struct
-literal directly). All three must stay compiling.
-
-Add two fields, both `Option<…>` so the test harness and the
-private-server's nested `/public/...` mount (which also build `AppState`)
-can leave them `None` and so a missing AWS/kube environment degrades to a
-clean error rather than a panic at startup:
-
-```rust
-pub struct AppState {
-    pub db: Db,
-    // ... existing fields ...
-
-    /// STS client built from the pod's IRSA web-identity. `None` when no
-    /// AWS environment is configured (tests, the nested private mount).
-    /// Backup-credentials issuance requires it; absent ⇒ 502 with a
-    /// clear "issuer not configured" message.
-    pub sts: Option<aws_sdk_sts::Client>,
-
-    /// Kube client for reading repo-password Secrets in canopy's
-    /// namespace. `None` in tests / non-cluster runs ⇒ `/backup-target`
-    /// 502s. The namespace to read from is fixed at construction.
-    pub kube: Option<BackupSecrets>,
-}
-```
-
-`BackupSecrets` is a small wrapper holding the `kube::Client` + the
-namespace (read from `POD_NAMESPACE` / downward-API env, default
-`canopy`), exposing `async fn read_password(&self, secret_name: &str,
-key: &str) -> Result<String>`. Keep the kube surface this narrow — the
-handler only ever does `get` on one Secret and pulls one key out.
-
-`Debug` derive: `aws_sdk_sts::Client` and `kube::Client` are `Debug`;
-`BackupSecrets` derives `Debug`. If any field isn't `Debug`, switch
-`AppState` to a manual `Debug` impl rather than dropping the derive
-elsewhere.
-
-### FromRef impls
-
-Add `FromRef<AppState>` for the two new client types so handlers can take
-them as `State<…>` extractors, mirroring the existing `Db` / `RateLimiter`
-impls (`state.rs:77-93`):
-
-```rust
-impl FromRef<AppState> for Option<aws_sdk_sts::Client> { /* clone */ }
-impl FromRef<AppState> for Option<BackupSecrets> { /* clone */ }
-```
-
-(The AWS SDK client and `kube::Client` are both cheap to clone; they are
-`Arc`-backed handles.)
-
-### Binary init wiring
-
-In `AppState::init()` (and a new async constructor, since
-`aws_config::load_defaults` is async while `init()` is currently sync), build the clients:
-
-```rust
-let aws = aws_config::load_defaults(aws_config::BehaviorVersion::latest()).await;
-let sts = Some(aws_sdk_sts::Client::new(&aws));
-let kube = match kube::Client::try_default().await {
-    Ok(c) => Some(BackupSecrets::new(c, namespace_from_env())),
-    Err(_) => None, // log; backup-target will 502 until fixed
-};
-```
-
-`init()` is called from `main.rs:49` (`AppState::init()?`). Make the
-backup wiring an **async** init path; `main` is already `#[tokio::main]`
-so awaiting is fine. Keep a sync/`None`-clients fallback constructor for
-the private-server nested mount and any non-AWS deployment.
-
-Region: per-request region comes from the group config row
-(`server_group_backup_config.region`, nullable → deployment default). The
-STS client itself uses the provider-chain region; only the eventual S3
-addressing cares about the bucket's region, and that is handed to the
-device in `GET /backup-target`. STS `AssumeRole` is not tied to the bucket's
-region, so set the STS client region from the default provider.
-
-## Handler 1 — `POST /backup-credentials`
-
-### Request / response
-
-```rust
-#[derive(Deserialize, ToSchema)]
-pub struct CredentialsArgs {
-    #[serde(default)] pub purpose: Purpose, // default Backup
-}
-
-#[derive(Deserialize, Serialize, ToSchema, Clone, Copy, PartialEq)]
-#[serde(rename_all = "lowercase")]
-pub enum Purpose { Backup, Restore }
-impl Default for Purpose { fn default() -> Self { Self::Backup } }
-
-// credential_process output — field names fixed by the AWS SDK.
-#[derive(Serialize, ToSchema)]
-#[serde(rename_all = "PascalCase")]
-pub struct CredentialProcessOutput {
-    pub version: u8,            // serialized as "Version": 1
-    pub access_key_id: String,
-    pub secret_access_key: String,
-    pub session_token: String,
-    pub expiration: String,     // RFC3339 / ISO8601 Z
-}
-```
-
-`Version` is the literal `1`; the wire output must be exactly
-`Version/AccessKeyId/SecretAccessKey/SessionToken/Expiration` — the AWS
-`credential_process` v1 contract.
-UPDATE (shipped): a single `#[serde(rename_all = "PascalCase")]` on the struct
-does produce the exact casing (`access_key_id → AccessKeyId`, etc.), so no
-per-field renames were needed; the earlier caution about per-field casing drift
-turned out unwarranted.
-
-### Handler flow (mirrors plan step list)
-
-Signature follows the bare-handler pattern (events.rs is the closest
-twin):
-
-```rust
-async fn credentials(
-    State(db): State<Db>,
-    State(sts): State<Option<aws_sdk_sts::Client>>,
-    device: ServerDevice,
-    Json(args): Json<CredentialsArgs>,
-) -> Result<Json<CredentialProcessOutput>>
-```
-
-1. `ServerDevice` authenticates (it yields only a `Device`;
-   `device.0.0.id` is the device id — same access as `events.rs:44`,
-   `statuses.rs`).
-2. Resolve server: `Server::live_by_device_id(&mut conn, device_id)`
-   (`crates/database/src/servers.rs:352`). It returns `Vec<Server>`; the
-   `servers_device_id_unique` partial index guarantees ≤1, so
-   `.into_iter().next()`. Empty ⇒ `AppError::DeviceHasNoServer` (**412**,
-   maps at `commons-errors/src/lib.rs:193`). Use `live_by_device_id` (not
-   `get_by_device_id`) so archived servers don't issue creds.
-3. Read `server.group_id: Option<Uuid>` (`servers.rs:58`). `None` ⇒
-   **409** via `AppError::Conflict("server is not in a group")`.
-4. Load `ServerGroupBackupConfig::by_group_id(&mut conn, group_id)` (model
-   provided by canopy-database). Absent ⇒ **409**
-   `AppError::Conflict("group has no backup config")`. Also gate on
-   `status == 'ready'`: a `provisioning`/`escrow_pending` row is **409**
-   (dormant). This yields `target_role_arn`, `bucket`, `prefix`, `region`.
-5. For `purpose == Restore` only: build the read-only **session policy**
-   JSON (template below). `Backup` needs none (the per-bucket role's own
-   policy is the scoping).
-6. Require that `sts` is `Some`; otherwise return **502** (`AppError::Upstream(...)`,
-   "issuer not configured"). Then make the cross-account `AssumeRole` call:
-   ```rust
-   sts.assume_role()
-      .role_arn(&cfg.target_role_arn)
-      .role_session_name(format!("canopy-{}-{}", purpose_str, device_id))
-      .set_policy(restore_policy_json)   // None for backup
-      .duration_seconds(3600)            // chained sessions cap at 1h anyway
-      .send().await
-   ```
-   Any SDK error ⇒ **502** (`AppError::Upstream`). Capture
-   `request_id()` (via `aws_sdk_sts::error::ProvideErrorMetadata` /
-   `RequestId` trait) best-effort for `sts_request_id`.
-7. Pull `credentials` from the response: `access_key_id`,
-   `secret_access_key`, `session_token`, `expiration`. A response missing
-   credentials ⇒ **502**.
-8. Insert `backup_credential_issuances` (canopy-database model
-   `NewBackupCredentialIssuance`): `device_id`, `group_id`,
-   `expires_at` (from STS `Expiration`), `purpose`, `sts_assumed_role`
-   (= `target_role_arn`), `sts_request_id` (nullable), `access_key_id`,
-   `bucket`/`prefix` (snapshot of config). A failed audit insert should
-   fail the request (don't hand out creds we didn't record).
-9. Return `Json(CredentialProcessOutput { version: 1, .. })`, **200**.
-
-`RoleSessionName`: `canopy-backup-<device-id>` / `canopy-restore-<device-id>`
-(decision #2; the `canopy-` prefix makes CloudTrail provenance
-unambiguous). Note `RoleSessionName` is capped at 64 chars — `canopy-` (7)
-+ `restore-` (8) + a 36-char UUID = 51, within budget.
-
-### Restore session policy (normative)
-
-Authored at assume-time; ANDs down to read-only against the per-bucket
-role. `<prefix>` is normally empty (repo at bucket root). `GetBucketLocation`
-**must be its own unconditioned statement** — the `s3:prefix` context key
-isn't populated for it, so folding it under the prefix condition would
-silently deny it:
-
-```json
-{
-  "Version": "2012-10-17",
-  "Statement": [
-    { "Effect": "Allow", "Action": ["s3:GetObject"],
-      "Resource": "arn:aws:s3:::<bucket>/<prefix>*" },
-    { "Effect": "Allow", "Action": ["s3:GetBucketLocation"],
-      "Resource": "arn:aws:s3:::<bucket>" },
-    { "Effect": "Allow", "Action": ["s3:ListBucket"],
-      "Resource": "arn:aws:s3:::<bucket>",
-      "Condition": { "StringLike": { "s3:prefix": ["<prefix>*"] } } }
-  ]
-}
-```
-
-Build this with `serde_json` (a typed struct or `json!` macro) from the
-config's `bucket`/`prefix`; pass the serialized string as `Some(...)` to `.set_policy(...)`.
-The **backup** permission set is **not** authored here — it's the
-per-bucket role's own policy in ops (`AWS_S3_MULTIPART_ACTIONS` is the
-source of truth). public-server only authors the restore downscope.
-
-### OpenAPI annotation
-
-`#[utoipa::path(post, path = "/backup-credentials", tag = "backup",
-security(("server-device" = [])), request_body = CredentialsArgs,
-responses((status=200, body=CredentialProcessOutput), (status=409,
-body=ProblemDetailsSchema), (status=412, body=ProblemDetailsSchema),
-(status=502, body=ProblemDetailsSchema)))]`. `just gen-openapi` is a
-private-web concern; the **public** server has its own `openapi.rs`
-(`ApiDoc`), so register the new tag/handlers there.
-
-## Handler 2 — `GET /backup-target`
-
-```rust
-#[derive(Serialize, ToSchema)]
-pub struct BackupTarget {
-    pub storage: String,       // "s3"
-    pub bucket: String,
-    pub prefix: String,        // normally ""
-    pub region: String,        // config.region or deployment default
-    pub repo_password: String, // read from the k8s Secret
-}
-
-async fn target(
-    State(db): State<Db>,
-    State(kube): State<Option<BackupSecrets>>,
-    device: ServerDevice,
-) -> Result<Json<BackupTarget>>
-```
-
-Flow: steps 1–4 identical to credentials (same 412/409, same `ready`
-gate). Then:
-
-5. Require that `kube` is `Some`; otherwise return **502**.
-6. `kube.read_password(&cfg.repo_password_ref, KEY).await?` — reads the
-   Secret named by `repo_password_ref` from canopy's namespace and pulls
-   the password key out (decide the key name — e.g. `password`; fix it as
-   a constant and document it for the operator-UI/escrow component that
-   creates the Secret). A missing Secret or key ⇒ **502** (upstream
-   misconfig), with the group named in the (server-side) log, not the
-   body.
-7. `region` = `cfg.region` or the deployment default (an env/config
-   constant; the plan calls it "deployment default (AWS region)").
-8. Return `Json(BackupTarget { storage: "s3".into(), .. })`, **200**.
-
-**Blast-radius note (carry into the PR description, accepted stage-1
-risk):** serving `repo_password` makes the internet-facing pod hold
-Secret-read for every group's repo password. The stage-1 plan
-**accepts** this; the blind-relay stub removes it later. Two invariants
-this component must not violate: it only ever does `get` on the one named
-Secret (never `list`), and it holds no delete/bypass capability.
-
-## Handler 3 — `POST /backup-report`
-
-```rust
-#[derive(Deserialize, ToSchema)]
-pub struct BackupReport {
-    pub run_id: Uuid,          // becomes backup_runs.id (client-minted)
-    pub purpose: Purpose,
-    pub outcome: Outcome,      // success | failure
-    pub error: Option<String>,
-    pub bytes_uploaded: Option<i64>,
-    pub snapshot_id: Option<String>,
-}
-
-#[derive(Deserialize, ToSchema)]
-#[serde(rename_all = "lowercase")]
-pub enum Outcome { Success, Failure }
-
-async fn report(
-    State(db): State<Db>,
-    device: ServerDevice,
-    Json(rep): Json<BackupReport>,
-) -> Result<StatusCode>  // 204
-```
-
-Flow: steps 1–3 (resolve device → live server → `group_id`; 412/409). The
-config row need **not** be `ready` to accept a report (a report is just an
-observation), but the server **must** be grouped — `device_id`/`group_id`
-come from the authenticated context, never the client's claim (per plan:
-forgery-proof attribution).
-
-Insert `backup_runs` via `NewBackupRun` (canopy-database model) with
-`id = rep.run_id`. The client-supplied PK is safe: `device_id`/`group_id` are
-server-derived, and a duplicate `id` fails its **own** insert with a PK
-violation, so it cannot overwrite another row. On a PK conflict, return a clean
-**409** (duplicate `run_id`) rather than a 500 by mapping the diesel unique
-violation. Return **204 No Content** on success.
-
-This component **does not** run staleness detection or clear the one-off
-`backup_requests` flag — those are `canopy-jobs` / scheduler concerns that
-read the `backup_runs` rows this writes. (The plan says the one-off flag
-is "cleared when the run is reported"; whether that clear happens in this
-handler or the scheduler is an open question below.)
-
-## Error handling — new variant
-
-Add `AppError::Upstream(String)` (or `AppError::StsFailed` /
-`AppError::BadGateway`) to `crates/commons-errors/src/lib.rs`:
-
-- `IntoResponse` status map (currently `commons-errors/src/lib.rs:180+`):
-  `Self::Upstream(_) => StatusCode::BAD_GATEWAY`.
-- Problem-type slug (the `match` near `:243`): `Self::Upstream(_) =>
-  "upstream"`.
-- Add an entry to `ERRORS.md` with heading matching the slug (`upstream`)
-  — AGENTS.md requires this.
-
-Reuse existing variants where they fit: `DeviceHasNoServer` (412) and
-`Conflict(String)` (409) already exist and map correctly — do not add new
-409/412 variants. Only the 502 path is new.
-
-Keep STS/kube error **detail out of the response body** (it can name
-roles/buckets); log it server-side and return a generic upstream message.
-
-## Interfaces / contracts
-
-### Consumed (must exist first)
-
-- **canopy-database** models + migrations:
-  - `ServerGroupBackupConfig` with `by_group_id(conn, group_id) ->
-    Result<Option<Self>>`; fields `bucket`, `prefix`, `target_role_arn`,
-    `region: Option<String>`, `repo_password_ref`, `status`
-    (`provisioning|escrow_pending|ready`).
-  - `NewBackupCredentialIssuance` insert (fields per plan's
-    `backup_credential_issuances`).
-  - `NewBackupRun` insert with caller-supplied `id` (UUID PK), exposing a
-    way to distinguish a unique-violation for the 409 mapping.
-  - Re-exports in `database/src/lib.rs`.
-- **`Server::live_by_device_id`** — already exists
-  (`database/src/servers.rs:352`); `Server.group_id` already exists.
-- **ops `backups` stack** (contract, provisioned elsewhere): the
-  per-bucket `target_role_arn` **trusts canopy's IRSA principal** for
-  cross-account `AssumeRole`; the role's own policy grants the device
-  backup action set (`AWS_S3_MULTIPART_ACTIONS`, no delete) and
-  `s3:GetBucketObjectLockConfiguration`. The pod's ServiceAccount is
-  IRSA-annotated and has RBAC `get` on Secrets in canopy's namespace.
-  public-server is **never** granted `CreateBucket` / delete / bypass.
-- **Repo-password Secret**: a k8s Secret named by `repo_password_ref`,
-  with the password under an agreed key, created by the
-  onboarding/escrow component before a group reaches `ready`.
-
-### Provided (to others)
-
-- **`POST /backup-credentials`** → `credential_process` JSON
-  (`Version/AccessKeyId/SecretAccessKey/SessionToken/Expiration`); consumed
-  by bestool's `credential_process` hook. 412/409/502 contract above.
-- **`GET /backup-target`** → `{storage, bucket, prefix, region,
-  repo_password}`; consumed by bestool to build the kopia connection.
-- **`POST /backup-report`** → 204; writes `backup_runs` rows that
-  `canopy-jobs` staleness/reconciliation reads.
-- **`backup_credential_issuances` rows** — the audit log + CloudTrail join
-  key (`access_key_id`) other components/operators query.
-- **`AppError::Upstream` (502)** — reusable upstream-failure variant.
-- **`AppState.sts` / `AppState.kube`** — the AWS/kube clients now available
-  on public-server's state for any future device-facing AWS use.
-
-## Data shapes (wire)
-
-- credentials request: `{ "purpose": "backup" | "restore" }` (default
-  backup).
-- credentials response: the four-field `credential_process` JSON +
-  `Version: 1`.
-- target response: `{ "storage": "s3", "bucket": "...", "prefix": "",
-  "region": "...", "repo_password": "..." }`.
-- report request: `{ "run_id": uuid, "purpose": ..., "outcome": ...,
-  "error"?: str, "bytes_uploaded"?: int, "snapshot_id"?: str }` → 204.
-
-## Testing approach
-
-Per AGENTS.md, HTTP endpoint tests use
-`commons_tests::server::run_with_device_auth("server", |conn, cert,
-device_id, public, private| async move { ... })`, adding the
-`mtls-certificate` header on each request:
-`.add_header("mtls-certificate", &cert)`. Use
-`#[tokio::test(flavor = "multi_thread")]`. Put tests in
-`crates/public-server/tests/backup.rs` (no `_test` suffix) or an inline
-`#[cfg(test)] mod tests` as `bestool.rs` does.
-
-**The AWS/kube clients are `None` in the test harness** (the harness
-builds `AppState` directly without AWS env — `commons-tests/src/server.rs:92`).
-That cleanly tests the **resolution + error** matrix without a live AWS:
-
-- `POST /backup-credentials` with a device bound to **no live server** ⇒
-  **412** (seed a device with no server row).
-- device → **ungrouped** server ⇒ **409** (seed a server with
-  `group_id = NULL`).
-- device → grouped server but **no config row** ⇒ **409**.
-- device → grouped, config row in `provisioning`/`escrow_pending` ⇒ **409**
-  (dormant gate).
-- device → grouped, config `ready`, but `sts == None` ⇒ **502** ("issuer
-  not configured") — this is the harness default and proves the 502 path.
-- `GET /backup-target`: same 412/409 matrix; with config `ready` and
-  `kube == None` ⇒ **502**.
-- `POST /backup-report`: grouped server ⇒ row written, **204**; assert the
-  `backup_runs` row via the model (`device_id`/`group_id` from context,
-  not from body — try sending a bogus group and confirm it's ignored);
-  duplicate `run_id` ⇒ **409**; ungrouped ⇒ **409**; no live server ⇒
-  **412**.
-
-**Successful issuance (200)** needs the STS call mocked or stubbed —
-options to weigh (open question): (a) inject a stub STS behind a small
-trait so the harness can return canned creds; (b) `aws-smithy-mocks` /
-`StaticReplayClient` to replay an `AssumeRole` HTTP response with the
-`sts` client present; (c) leave the 200 path to a manual/integration test
-against a real role. Prefer (a) or (b) so the happy path + the
-issuance-audit insert + the restore-vs-backup `policy` difference are
-covered in CI. The restore session-policy JSON builder should also have a
-**pure unit test** (assert the three statements, the unconditioned
-`GetBucketLocation`, the `<prefix>*` substitution) — that's the
-correctness-critical, AWS-free piece.
-
-Seeding helpers: `run_with_device_auth` creates the device + key; insert
-the `servers` / `server_groups` / `server_group_backup_config` rows via
-the database models (add a seed helper if one doesn't exist). Use direct
-model calls for DB state, HTTP for the endpoint behaviour (AGENTS.md).
-
-No frontend/Playwright work in this component (device API, no UI).
-
-## Open questions / decisions to make
-
-1. **STS happy-path test strategy** — stub trait vs `aws-smithy-mocks`
-   replay vs manual-only. Recommend a stub trait or smithy mock so CI
-   covers 200 + audit insert + the restore/backup policy split. (Pick
-   before implementing the happy path.)
-2. **`aws-sdk-s3` on public-server at all?** The plan keeps deep S3
-   behavioural checks in the preflight Job, not the issuance hot path. If
-   issuance does **no** S3 no-op, public-server needs only `aws-sdk-sts`.
-   Confirm we are not adding a per-issuance `GetBucketLocation` here
-   (latency + an extra permission on the issuer). Default: STS only.
-3. **`AppState::init()` becomes async** — it's currently sync and called
-   `?`-style from `main`. Confirm making the AWS/kube-aware constructor
-   async (and keeping a sync `None`-clients path for the private nested
-   mount + tests) is acceptable, vs. lazily building the clients on first
-   use.
-4. **Repo-password Secret key name** — fix the data key inside the Secret
-   (e.g. `password`) and align it with the onboarding/escrow component
-   that creates the Secret. Single source of truth needed.
-5. **Deployment-default region** — where does the fallback region live
-   (env var `AWS_REGION` / a canopy config constant)? `GET /backup-target`
-   must always return a concrete region string even when
-   `config.region IS NULL`.
-6. **Who clears the one-off `backup_requests` flag** — the plan says
-   "cleared when the run is reported." Decide whether `POST /backup-report`
-   clears it (this component) or the scheduler does (`canopy-jobs`).
-   Leaning: the scheduler owns `backup_requests`; this handler just writes
-   `backup_runs`. Confirm so the flag-clear isn't dropped between
-   components.
-7. **Namespace discovery** — read `POD_NAMESPACE` via the downward API, or
-   `kube`'s in-cluster namespace inference? Pick one and set it at
-   `BackupSecrets` construction.
-8. **`Purpose`/`Outcome` shared types** — these enums recur across
-   `backup_credential_issuances`, `backup_runs`, the endpoints, and the
-   jobs crate. Decide whether they live in `commons-types` (shared) or are
-   defined per-crate. Recommend `commons-types` to avoid drift.
-
----
-
-## Backup types addendum
-
-Per the plan's "Backup types": requests carry a `type`, and there's a new
-registration endpoint.
-
-- **New `POST /backup-capabilities`** (ServerDevice): body
-  `{ "types": [...] }`; resolve device→server; upsert
-  `server_backup_capabilities`, seeding `enabled` from each type's
-  `backup_type_defaults.auto_enable` for newly-seen types (don't clobber an
-  operator-set `enabled`). 204.
-- **`POST /backup-credentials`** body gains `"type"`; **`POST
-  /backup-report`** body gains `"type"`; both record it
-  (`backup_credential_issuances.type`, `backup_runs.type`/`server_id`).
-- Issuance/credentials gating is per `(server, type)`: the capability must
-  be `enabled` and the group `ready`.
-- Add a shared **effective-config resolver** (override ?? type-default,
-  retention floor) — also consumed by the jobs + UI components.
diff --git a/migrations/2026-06-15-064431-0000_backup_group_scoped_issues/up.sql b/migrations/2026-06-15-064431-0000_backup_group_scoped_issues/up.sql
index 4df9b984..3e198ba8 100644
--- a/migrations/2026-06-15-064431-0000_backup_group_scoped_issues/up.sql
+++ b/migrations/2026-06-15-064431-0000_backup_group_scoped_issues/up.sql
@@ -1,4 +1,4 @@
--- Group-scoped issues (Option B from specs/canopy-jobs-detection-preflight.md).
+-- Group-scoped issues: an issue keyed to a server_group with no member server.
 --
 -- The incident model is server-keyed today: issues.server_id is NOT NULL and
 -- every issue belongs to exactly one server. Group-level backup checks