beyondessential · passcod · Jun 30, 2026 · Jun 30, 2026 · Jun 30, 2026
diff --git a/.workhorse/specs/jobs/backup.md b/.workhorse/specs/jobs/backup.md
@@ -0,0 +1,67 @@
+---
+id: BKJ
+---
+
+# Backup control plane
+
+Canopy maintains, verifies, and watches over the fleet's backups itself; clients neither run maintenance nor hold the rights to do so.
+This is the autonomous half of the backup system: the work Canopy does on a cadence with no device asking, and the health signals it raises from it.
+
+## Scope
+
+This spec covers Canopy's own background backup work: repo maintenance, inspection, storage metering, upstream preflight, and the detection and alerting that turn all of it into incidents.
+
+It does not cover the device contract (see [BAK](../public-server/backup.md)), the operator's configuration of a group (see [BKO](../private-server/backup.md)), or restore-health (the managed restore replicas spec, `RST`).
+
+Canopy acts only on groups whose configuration is ready, runs at most one operation per group at a time, and bounds how many groups it works on at once.
+
+## Maintenance
+
+Canopy runs each group's repo maintenance on a cadence; clients are never granted the rights to run it.
+It enforces the group's retention as part of maintenance, and records every run's outcome so a stuck or failing maintenance is itself detectable.
+
+## Passphrase rotation
+
+Canopy rotates each group's repo passphrase on a cadence, so a leaked passphrase is useful only until the next rotation rather than indefinitely.
+Rotation is crash-safe: an interrupted rotation is reconciled on the next attempt, and throughout it the repo stays openable with either the previous or the new passphrase — it is never left unopenable.
+Like maintenance, rotation is Canopy's to do; operators never run it.
+
+## Recovery escrow
+
+Because Canopy holds the only copy of every passphrase, it continuously escrows the state needed to recover access without Canopy: the per-group passphrases and repo coordinates, and the group, server, configuration, schedule, and capability records that frame them.
+The escrow is encrypted to a set of offline recipient keys whose private halves Canopy never holds, and written to versioned, object-locked storage.
+So Canopy can write the escrow but never read it back — a full Canopy compromise cannot disclose the escrowed secrets, and object-lock keeps past versions undeletable until they expire.
+Recipients are mandatory: Canopy refuses to run without them, so there is never a silent recovery gap.
+This is the escrow the operator recovery ceremony verifies (see [BKO](../private-server/backup.md)).
+
+## Inspection
+
+Canopy periodically inspects each group's repo against the storage directly, independent of what devices reported:
+
+- it verifies repo integrity, and a failed verification is repo corruption;
+- it inventories the repo — the latest snapshot per source — as the ground truth a device's report is reconciled against;
+- it records repo size, logical and physical, and the storage cost basis for display.
+
+## Upstream preflight
+
+Canopy watches its own access to each group's storage, so a broken control plane is caught at the source rather than when the fleet starts failing.
+It checks that its identity resolves, that it can assume each group's role and perform a read-only no-op, and that the bucket's object-lock is present with at least the required retention.
+Preflight only alerts; it never pulls Canopy out of service, because a failing check must not make a degraded situation worse.
+
+## Detection
+
+Canopy reconciles three sources — what a device reported, what credentials were issued, and what actually landed in the repo — and alerts on disagreement:
+
+- **staleness** — a server with a prior successful backup but none recent, or one that has never backed up though it has been expected long enough.
+- **reconcile** — a device reported a successful backup but no matching snapshot landed (the report is false or the upload didn't persist), or a fresh snapshot exists but no recent report (the reporting path is broken).
+- **maintenance** — a group whose maintenance is overdue, or whose most recent maintenance failed.
+
+## Alerting
+
+Backup alerts are raised at one of two scopes:
+
+- **Per-server** signals (staleness, never-backed-up, the report-gap) are subject to the server's monitoring gate: still recorded for visibility, but they contribute to an incident only when the server is monitored, because some servers are intentionally intermittent.
+- **Group-level** signals (repo corruption, maintenance failure, missing-snapshot reconciliation, preflight failures, and restore-verification — see the managed restore replicas spec, `RST`) page regardless of any member's monitoring state, because they are control-plane or data-safety concerns that belong to no single server.
+
+Each signal has a stable key by which operators silence or snooze it and by which the interface and notifications refer to it; the keys are a contract and are not renamed without migrating stored silences.
+A signal recovers when the condition that raised it clears.
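
The two alert scopes in the BKJ "Alerting" section above can be illustrated with a small sketch. It is not part of the patch, and every name in it (`Scope`, `Signal`, `contributes_to_incident`, the example keys) is hypothetical; the spec fixes only the behaviour, namely that per-server signals are always recorded but count toward an incident only when the server is monitored, while group-level signals page regardless.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    PER_SERVER = "per-server"
    GROUP = "group"


@dataclass(frozen=True)
class Signal:
    key: str                        # stable key (example values are made up)
    scope: Scope
    server_monitored: bool = False  # only meaningful for per-server signals


def contributes_to_incident(signal: Signal) -> bool:
    """Return True when the signal should count toward an incident.

    Group-level signals always count; per-server signals count only when
    the server passes its monitoring gate. Recording the signal for
    visibility happens either way and is outside this sketch.
    """
    if signal.scope is Scope.GROUP:
        return True
    return signal.server_monitored


stale_unmonitored = Signal("backup.staleness", Scope.PER_SERVER, server_monitored=False)
stale_monitored = Signal("backup.staleness", Scope.PER_SERVER, server_monitored=True)
corruption = Signal("backup.repo-corruption", Scope.GROUP)
```

With these inputs, only `stale_unmonitored` is kept out of incidents; the other two contribute.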
diff --git a/.workhorse/specs/private-server/backup.md b/.workhorse/specs/private-server/backup.md
@@ -0,0 +1,70 @@
+---
+id: BKO
+---
+
+# Operator backup control
+
+An operator configures, through Canopy, how a server group backs up: where its repo lives, on what cadence, with what retention, and which servers and types participate.
+Canopy owns the repo passphrase throughout — it is generated or accepted once, stored in Canopy's secret store, and never handed back except through the audited recovery ceremony.
+
+## Scope
+
+This spec covers the operator-facing control surface: per-group backup configuration and its lifecycle, scheduling and retention, per-server participation, on-demand backups, the status view, and passphrase recovery.
+
+It does not cover the device contract (see [BAK](../public-server/backup.md)) or Canopy's autonomous maintenance, inspection, detection, and alerting (see [BKJ](../jobs/backup.md)).
+
+Reads are available to any tailnet user; changes require an administrator.
+
+## Per-group configuration
+
+A group has at most one backup configuration: the bucket, prefix, region, the cross-account roles Canopy assumes, the reference to the group's passphrase, and its placement and lifecycle state.
+
+Placement is one of:
+
+- **external** — the operator brings their own bucket and supplies the role ARNs Canopy will assume.
+- **shared** — Canopy provisions and names a bucket in its own shared account; the operator supplies nothing about location.
+
+A configuration is created once and its structural fields (bucket, roles, placement) are immutable; the region and the operational settings below are editable.
+Decommissioning a group deletes its configuration row — which stops all credential issuance for the group — and deletes the Canopy-owned passphrase.
+The bucket and its object-locked contents persist independently and are not Canopy's to delete; teardown is a separate, deliberate act gated by the lock window.
+
+## Lifecycle and provisioning
+
+A configuration moves from **provisioning** to **ready**; devices are refused until it is ready.
+Creating a configuration sets it provisioning and asks Canopy to create or connect the repo; that work transitions the configuration to ready, or records the error it failed with so the operator sees why.
+The operator interface depends only on these observable states, not on how provisioning is carried out.
+
+A configuration may also be created or reconciled idempotently by machine — for infrastructure-as-code — under administrator-equivalent authentication, with the same probe and provisioning behaviour as the interactive path.
+
+## Setup and the passphrase
+
+When a configuration is created, Canopy probes the target bucket and classifies it: empty, an existing kopia repo, holding unrelated content, or inaccessible.
+The classification chooses the mode:
+
+- **from-birth** — an empty bucket; Canopy generates a fresh passphrase and creates a new repo.
+- **passphrase** — an existing repo; the operator supplies its passphrase and Canopy connects to it.
+
+A bucket holding unrelated content is refused rather than written into; Canopy never deletes to make room.
+Either way Canopy creates and owns the passphrase secret, and configuration and secret are created together — if the secret cannot be stored, the configuration is rolled back, so a configuration never exists without its passphrase.
+The supplied or generated passphrase is only the starting point: Canopy rotates it on a cadence thereafter (see [BKJ](../jobs/backup.md)), and the recovery ceremony recovers whatever the current passphrase is.
+
+## Scheduling and retention
+
+Each `(group, type)` has an expected backup interval and a retention policy, taken from a per-`(group, type)` override when set, otherwise from the fleet-wide default for that type.
+A manual-only type has no interval and is backed up only on an explicit request.
+Retention is floored to an organisational minimum; a configuration may deliberately opt out of the floor, which is recorded as the dangerous choice it is.
+
+## Participation and on-demand
+
+A server participates in a type when that type is an enabled capability on it; an operator toggles participation per `(server, type)`.
+An operator may queue a one-off backup — or restore — for a `(server, type)` to run on the next cycle, and may cancel a queued one before it runs.
+
+## Status
+
+The operator can see, per group: the repo's size and cost basis, recent runs with their outcomes and errors, recent maintenance, the latest snapshot per server, and any in-flight or pending one-off requests.
+
+## Passphrase recovery
+
+Because Canopy owns the only copy of each passphrase, the ability to recover it without Canopy is verified, not assumed.
+Recovery is a ceremony: each passphrase is escrowed, encrypted to a set of offline recipient keys, and an operator periodically proves the recipients can still decrypt it.
+The ceremony is recorded so staleness — too long since the last successful proof — is visible.
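
The resolution rule in the BKO "Scheduling and retention" section above can be sketched as follows. This is an illustration and not part of the patch. It assumes that an override replaces the fleet default as a whole, that retention is expressed in days, and that a manual-only type is modelled by a missing interval; the names `Policy` and `effective_policy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    interval_hours: Optional[int]  # None models a manual-only type
    retention_days: int            # unit is an assumption for this sketch


def effective_policy(
    default: Policy,
    override: Optional[Policy] = None,
    floor_days: int = 0,
    opt_out_of_floor: bool = False,
) -> Policy:
    """Pick the override when set, else the fleet default, then apply the floor.

    Opting out of the floor returns the chosen policy unchanged; the spec
    requires that choice to be recorded, which is outside this sketch.
    """
    chosen = override if override is not None else default
    if opt_out_of_floor:
        return chosen
    return Policy(chosen.interval_hours, max(chosen.retention_days, floor_days))


fleet_default = Policy(interval_hours=24, retention_days=30)
short_override = Policy(interval_hours=6, retention_days=3)
```

For example, `effective_policy(fleet_default, short_override, floor_days=7)` keeps the six-hour interval but raises retention to the seven-day floor, while passing `opt_out_of_floor=True` leaves the three-day retention in place.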
diff --git a/.workhorse/specs/public-server/backup.md b/.workhorse/specs/public-server/backup.md
@@ -0,0 +1,63 @@
+---
+id: BAK
+---
+
+# Device backups
+
+A server device backs up to object storage that Canopy mediates: it holds no long-lived storage credentials and knows nothing of the bucket layout until Canopy tells it, per run.
+Canopy is the control plane — it owns the credentials, the repo location, the passphrase, and the record of what ran.
+
+## Scope
+
+This spec covers the device-facing contract: how a device learns what it may back up, obtains short-lived credentials and the repo coordinates for a run, and reports the outcome.
+
+It does not cover what an operator configures (see [BKO](../private-server/backup.md)), what Canopy does on its own — maintenance, inspection, detection, alerting (see [BKJ](../jobs/backup.md)) — or restoring backups (the managed restore replicas spec, `RST`).
+
+## Identity and resolution
+
+A device authenticates with the `server` role, over either transport Canopy accepts (a client certificate on the internet-facing path, or tailnet identity on the private mount).
+Every device request resolves through the authenticated identity, never the request body: device → its single live server → that server's group → the group's backup configuration.
+A device bound to no live server is refused; a server with no group, or whose group has no ready configuration, is refused.
+
+## Capabilities
+
+A device registers the backup types it can run on its server.
+A newly seen type is enabled for scheduling or not according to that type's fleet default; a type already known keeps the operator's setting.
+Registration requires the server to be grouped, but not the group's configuration to be ready.
+
+## Credentials
+
+A device requests credentials for a `(type, purpose)`.
+Canopy issues short-lived credentials by assuming the group's dedicated cross-account storage role under a session policy that confines them to the group's bucket and prefix:
+
+- **backup** purpose grants the write set kopia needs, including a version-less delete — but never deletion of a locked version, nor any weakening of object-lock or retention.
+- **restore** purpose grants read-only access.
+
+The credentials carry the storage role's identity for at most an hour; a device refreshes them when a run outlasts that lifetime.
+Every issuance is recorded before the credentials are returned.
+
+A `(type, purpose)` is issuable only when the type is an enabled capability of the server, or an operator has queued a one-off request of that purpose for it; otherwise it is refused.
+The group's configuration must be ready: until then the endpoints refuse, so a half-provisioned group cannot be written to.
+
+## Target
+
+A device fetches the repo coordinates for its group each run: the storage kind, bucket, prefix, region, and the repo passphrase.
+The passphrase is Canopy-owned and read from the group's secret store at request time; the device never stores it.
+
+## Reporting
+
+A device reports each run's outcome: the type and purpose, success or failure, an error when it failed, the resulting snapshot identifier, the bytes uploaded, and the object-storage traffic the run moved.
+The run is keyed by an identifier the device mints at the start of the run; the device, server, and group are taken from the authenticated context, so a device cannot report a run as another group's.
+A duplicate run identifier is refused.
+Reporting a run clears any matching operator one-off request, so the standing "back up now" prompt stops.
+
+## Guarantees
+
+A compromised device cannot destroy backups.
+Its credentials cannot delete a locked object version or weaken the bucket's object-lock; at worst it writes a delete-marker that object-lock and versioning leave recoverable.
+Decommissioning a device means revoking its certificate: it can no longer obtain credentials, and any it already holds expire within the hour.
+
+## Failure contract
+
+The device endpoints distinguish: the caller is bound to no live server; the server is ungrouped, its group has no ready configuration, the type is not issuable, or the run identifier is a duplicate; and a dependency of Canopy's own (the credential issuer or the secret store) is unavailable or unconfigured.
+Each is a distinct, stable status so a device need not guess.
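
The BAK "Credentials" and "Failure contract" sections above can be read as an ordered series of checks on a credentials request. The sketch below is not part of the patch. The status names, the `Context` shape, and the order in which the checks run are assumptions; the spec requires only that each outcome be distinct and stable. The duplicate-run status belongs to reporting and is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Tuple


class Status(Enum):
    OK = "ok"
    NO_LIVE_SERVER = "no-live-server"
    SERVER_UNGROUPED = "server-ungrouped"
    CONFIG_NOT_READY = "config-not-ready"
    TYPE_NOT_ISSUABLE = "type-not-issuable"
    DEPENDENCY_UNAVAILABLE = "dependency-unavailable"


@dataclass(frozen=True)
class Context:
    """What Canopy resolved from the authenticated identity (hypothetical shape)."""
    server: Optional[str]
    group: Optional[str]
    config_ready: bool
    enabled_types: FrozenSet[str] = frozenset()
    queued_one_offs: FrozenSet[Tuple[str, str]] = frozenset()  # (type, purpose)


def check_credentials_request(
    ctx: Context, backup_type: str, purpose: str, issuer_available: bool = True
) -> Status:
    # Resolution chain: device -> live server -> group -> ready configuration.
    if ctx.server is None:
        return Status.NO_LIVE_SERVER
    if ctx.group is None:
        return Status.SERVER_UNGROUPED
    if not ctx.config_ready:
        return Status.CONFIG_NOT_READY
    # Issuable when the type is enabled, or a one-off of this purpose is queued.
    issuable = (
        backup_type in ctx.enabled_types
        or (backup_type, purpose) in ctx.queued_one_offs
    )
    if not issuable:
        return Status.TYPE_NOT_ISSUABLE
    if not issuer_available:
        return Status.DEPENDENCY_UNAVAILABLE
    return Status.OK


ready = Context("srv-1", "grp-1", True, frozenset({"postgres"}),
                frozenset({("files", "restore")}))
```

Here `ready` may obtain credentials for `("postgres", "backup")` through its enabled capability and for `("files", "restore")` through the queued one-off, but `("files", "backup")` is refused as not issuable.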