67 changes: 67 additions & 0 deletions .workhorse/specs/jobs/backup.md
@@ -0,0 +1,67 @@
---
id: BKJ
---

# Backup control plane

Canopy maintains, verifies, and watches over the fleet's backups itself; clients neither run maintenance nor hold the rights to do so.
This is the autonomous half of the backup system: the work Canopy does on a cadence with no device asking, and the health signals it raises from it.

## Scope

This spec covers Canopy's own background backup work: repo maintenance, inspection, storage metering, upstream preflight, and the detection and alerting that turn all of it into incidents.

It does not cover the device contract (see [BAK](../public-server/backup.md)), the operator's configuration of a group (see [BKO](../private-server/backup.md)), or restore-health (the managed restore replicas spec, `RST`).

Canopy acts only on groups whose configuration is ready, runs at most one operation per group at a time, and bounds how many groups it works on at once.
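
The three constraints in the previous sentence can be read as a selection step. The following is a minimal sketch under that reading, not Canopy's scheduler; `Group`, `pick_runnable`, `busy`, and `max_concurrent` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Group:
    name: str
    config_ready: bool  # assumption: readiness is a simple flag


def pick_runnable(groups: List[Group], busy: Set[str], max_concurrent: int) -> List[str]:
    """Return the groups that may start an operation right now.

    busy holds the names of groups that already have an operation in flight.
    """
    free_slots = max(0, max_concurrent - len(busy))
    # Only ready groups qualify, and a busy group never gets a second operation.
    eligible = [g.name for g in groups if g.config_ready and g.name not in busy]
    return eligible[:free_slots]
```

A group that is not ready or already busy is never returned, and the number of busy plus newly started groups never exceeds `max_concurrent`.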

## Maintenance

Canopy runs each group's repo maintenance on a cadence; clients are never granted the rights to run it.
It enforces the group's retention as part of maintenance, and records every run's outcome so a stuck or failing maintenance is itself detectable.

## Passphrase rotation

Canopy rotates each group's repo passphrase on a cadence, so a leaked passphrase is useful only until the next rotation rather than indefinitely.
Rotation is crash-safe: an interrupted rotation is reconciled on the next attempt, and throughout it the repo stays openable with either the previous or the new passphrase, so it is never left unopenable.
Like maintenance, rotation is Canopy's to do; operators never run it.
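
One way to get the crash-safety described above is to persist the new passphrase before changing the repo, and to promote it only afterwards. The sketch below assumes that ordering; the spec does not state how rotation is implemented. `FakeRepo`, `SecretStore`, `rotate`, `reconcile`, and the `crash_after` test hook are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


class FakeRepo:
    """Stand-in for a repo that opens with exactly one passphrase."""

    def __init__(self, passphrase: str) -> None:
        self._passphrase = passphrase

    def opens_with(self, candidate: str) -> bool:
        return candidate == self._passphrase

    def change_passphrase(self, old: str, new: str) -> None:
        if not self.opens_with(old):
            raise ValueError("old passphrase does not open the repo")
        self._passphrase = new


@dataclass
class SecretStore:
    current: str
    pending: Optional[str] = None  # set only while a rotation is in flight


def reconcile(store: SecretStore, repo: FakeRepo) -> None:
    """Finish or discard an interrupted rotation."""
    if store.pending is None:
        return
    if repo.opens_with(store.pending):
        store.current = store.pending  # the repo change landed: promote
    store.pending = None  # otherwise the repo still opens with current


def rotate(store: SecretStore, repo: FakeRepo, new_passphrase: str,
           crash_after: Optional[int] = None) -> None:
    reconcile(store, repo)
    store.pending = new_passphrase  # step 1: persist before touching the repo
    if crash_after == 1:
        return  # simulated crash
    repo.change_passphrase(store.current, new_passphrase)  # step 2
    if crash_after == 2:
        return  # simulated crash
    store.current, store.pending = new_passphrase, None  # step 3: promote
```

At every point, one of the two stored values opens the repo, which is the invariant the section asks for.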

## Recovery escrow

Because Canopy holds the only copy of every passphrase, it continuously escrows the state needed to recover access without Canopy: the per-group passphrases and repo coordinates, and the group, server, configuration, schedule, and capability records that frame them.
The escrow is encrypted to a set of offline recipient keys whose private halves Canopy never holds, and written to versioned, object-locked storage.
So Canopy can write the escrow but never read it back: a full Canopy compromise cannot disclose the escrowed secrets, and object-lock keeps past versions undeletable until they expire.
Recipients are mandatory: Canopy refuses to run without them, so there is never a silent recovery gap.
This is the escrow the operator recovery ceremony verifies (see [BKO](../private-server/backup.md)).

## Inspection

Canopy periodically inspects each group's repo against the storage directly, independent of what devices reported:

- it verifies repo integrity, and a failed verification is repo corruption;
- it inventories the repo (the latest snapshot per source) as the ground truth a device's report is reconciled against;
- it records repo size, logical and physical, and the storage cost basis for display.

## Upstream preflight

Canopy watches its own access to each group's storage, so a broken control plane is caught at the source rather than when the fleet starts failing.
It checks that its identity resolves, that it can assume each group's role and perform a read-only no-op, and that the bucket's object-lock is present with at least the required retention.
Preflight only alerts; it never pulls Canopy out of service, because a failing check must not make a degraded situation worse.
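
The checks above can be sketched as a pure function from observations to alerts. This is an illustration only: `PreflightObservation`, `preflight_alerts`, the alert names, and the choice to express retention in days are assumptions, and the per-group detail is collapsed into a single observation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PreflightObservation:
    identity_resolves: bool
    can_assume_role: bool
    read_noop_ok: bool
    object_lock_days: Optional[int]  # None when the bucket has no object-lock


def preflight_alerts(obs: PreflightObservation, required_days: int) -> List[str]:
    """Return alert names for failed checks; never changes service state."""
    alerts: List[str] = []
    if not obs.identity_resolves:
        alerts.append("preflight-identity")
    if not obs.can_assume_role:
        alerts.append("preflight-assume-role")
    elif not obs.read_noop_ok:
        # Only meaningful once the role could be assumed.
        alerts.append("preflight-read")
    if obs.object_lock_days is None:
        alerts.append("preflight-object-lock-missing")
    elif obs.object_lock_days < required_days:
        alerts.append("preflight-object-lock-retention")
    return alerts
```

The function only returns alerts and has no side effects, which mirrors the rule that preflight never pulls Canopy out of service.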

## Detection

Canopy reconciles three sources (what a device reported, what credentials were issued, and what actually landed in the repo) and alerts on disagreement:

- **staleness**: a server with a prior successful backup but none recent, or one that has never backed up though it has been expected long enough.
- **reconcile**: a device reported a successful backup but no matching snapshot landed (the report is false or the upload didn't persist), or a fresh snapshot exists but no recent report (the reporting path is broken).
- **maintenance**: a group whose maintenance is overdue, or whose most recent maintenance failed.
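
The per-server part of these rules can be sketched as a comparison between what was reported and what the repo holds. This is a simplified illustration: `ServerView`, `detect`, the signal names, and the use of a single `interval` for both "recent" and "expected long enough" are assumptions, and the maintenance signal and the issued-credentials source are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class ServerView:
    expected_since: datetime                   # when backups were first expected
    last_success_report_at: Optional[datetime]
    reported_snapshot_id: Optional[str]        # from the latest successful report
    repo_snapshot_ids: FrozenSet[str]          # from Canopy's own inspection
    latest_repo_snapshot_at: Optional[datetime]


def detect(view: ServerView, now: datetime, interval: timedelta) -> Set[str]:
    signals: Set[str] = set()
    last = view.last_success_report_at

    # Staleness: nothing recent, or nothing ever though long expected.
    if last is None:
        if now - view.expected_since > interval:
            signals.add("never-backed-up")
    elif now - last > interval:
        signals.add("staleness")

    # Reconcile, direction 1: a reported snapshot that never landed.
    reported = view.reported_snapshot_id
    if reported is not None and reported not in view.repo_snapshot_ids:
        signals.add("missing-snapshot")

    # Reconcile, direction 2: a fresh snapshot with no recent report.
    latest = view.latest_repo_snapshot_at
    fresh = latest is not None and now - latest <= interval
    report_recent = last is not None and now - last <= interval
    if fresh and not report_recent:
        signals.add("report-gap")

    return signals
```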

## Alerting

Backup alerts are raised at one of two scopes:

- **Per-server** signals (staleness, never-backed-up, the report-gap) are subject to the server's monitoring gate: still recorded for visibility, but they contribute to an incident only when the server is monitored, because some servers are intentionally intermittent.
- **Group-level** signals (repo corruption, maintenance failure, missing-snapshot reconciliation, preflight failures, and restore-verification, for which see the managed restore replicas spec, `RST`) page regardless of any member's monitoring state, because they are control-plane or data-safety concerns that belong to no single server.
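
The two scopes above amount to a small gate. The sketch below illustrates it; the signal names in `PER_SERVER` and `GROUP_LEVEL` are placeholders derived from the list above, not the stable keys the spec refers to, and `contributes_to_incident` is a hypothetical function.

```python
PER_SERVER = frozenset({"staleness", "never-backed-up", "report-gap"})
GROUP_LEVEL = frozenset({
    "repo-corruption",
    "maintenance-failure",
    "missing-snapshot",
    "preflight-failure",
    "restore-verification",
})


def contributes_to_incident(signal: str, server_monitored: bool) -> bool:
    """Decide whether a recorded signal feeds an incident."""
    if signal in GROUP_LEVEL:
        return True  # pages regardless of any member's monitoring state
    if signal in PER_SERVER:
        return server_monitored  # recorded either way, gated for incidents
    raise ValueError(f"unknown signal: {signal}")
```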

Each signal has a stable key by which operators silence or snooze it and by which the interface and notifications refer to it; the keys are a contract and are not renamed without migrating stored silences.
A signal recovers when the condition that raised it clears.
70 changes: 70 additions & 0 deletions .workhorse/specs/private-server/backup.md
@@ -0,0 +1,70 @@
---
id: BKO
---

# Operator backup control

An operator configures, through Canopy, how a server group backs up: where its repo lives, on what cadence, with what retention, and which servers and types participate.
Canopy owns the repo passphrase throughout: it is generated or accepted once, stored in Canopy's secret store, and never handed back except through the audited recovery ceremony.

## Scope

This spec covers the operator-facing control surface: per-group backup configuration and its lifecycle, scheduling and retention, per-server participation, on-demand backups, the status view, and passphrase recovery.

It does not cover the device contract (see [BAK](../public-server/backup.md)) or Canopy's autonomous maintenance, inspection, detection, and alerting (see [BKJ](../jobs/backup.md)).

Reads are available to any tailnet user; changes require an administrator.

## Per-group configuration

A group has at most one backup configuration: the bucket, prefix, region, the cross-account roles Canopy assumes, the reference to the group's passphrase, and its placement and lifecycle state.

Placement is one of:

- **external**: the operator brings their own bucket and supplies the role ARNs Canopy will assume.
- **shared**: Canopy provisions and names a bucket in its own shared account; the operator supplies nothing about location.

A configuration is created once and its structural fields (bucket, roles, placement) are immutable; the region and the operational settings below are editable.
Decommissioning a group deletes its configuration row, which stops all credential issuance for the group, and deletes the Canopy-owned passphrase.
The bucket and its object-locked contents persist independently and are not Canopy's to delete; teardown is a separate, deliberate act gated by the lock window.

## Lifecycle and provisioning

A configuration moves from **provisioning** to **ready**; devices are refused until it is ready.
Creating a configuration sets it provisioning and asks Canopy to create or connect the repo; that work transitions the configuration to ready, or records the error it failed with so the operator sees why.
The operator interface depends only on these observable states, not on how provisioning is carried out.

A configuration may also be created or reconciled idempotently by machine, for infrastructure-as-code, under administrator-equivalent authentication, with the same probe and provisioning behaviour as the interactive path.

## Setup and the passphrase

When a configuration is created, Canopy probes the target bucket and classifies it: empty, an existing kopia repo, holding unrelated content, or inaccessible.
The classification chooses the mode:

- **from-birth**: an empty bucket; Canopy generates a fresh passphrase and creates a new repo.
- **passphrase**: an existing repo; the operator supplies its passphrase and Canopy connects to it.

A bucket holding unrelated content is refused rather than written into; Canopy never deletes to make room.
Either way Canopy creates and owns the passphrase secret, and configuration and secret are created together: if the secret cannot be stored, the configuration is rolled back, so a configuration never exists without its passphrase.
The supplied or generated passphrase is only the starting point: Canopy rotates it on a cadence thereafter (see [BKJ](../jobs/backup.md)), and the recovery ceremony recovers whatever the current passphrase is.
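
The mapping from probe result to setup mode can be sketched as follows. `BucketState`, `SetupRefused`, and `choose_mode` are hypothetical names, and requiring a passphrase for an existing repo at this step is an assumption about where that check happens.

```python
from enum import Enum
from typing import Optional


class BucketState(Enum):
    EMPTY = "empty"
    EXISTING_REPO = "existing-repo"
    UNRELATED_CONTENT = "unrelated-content"
    INACCESSIBLE = "inaccessible"


class SetupRefused(Exception):
    """Raised when the probe result allows no setup mode."""


def choose_mode(state: BucketState, supplied_passphrase: Optional[str]) -> str:
    if state is BucketState.EMPTY:
        return "from-birth"  # Canopy generates the passphrase itself
    if state is BucketState.EXISTING_REPO:
        if not supplied_passphrase:
            raise SetupRefused("an existing repo needs its passphrase")
        return "passphrase"
    if state is BucketState.UNRELATED_CONTENT:
        # Never written into, and never emptied to make room.
        raise SetupRefused("bucket holds unrelated content")
    raise SetupRefused("bucket is inaccessible")
```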

## Scheduling and retention

Each `(group, type)` has an expected backup interval and a retention policy, taken from a per-`(group, type)` override when set, otherwise from the fleet-wide default for that type.
A manual-only type has no interval and is backed up only on an explicit request.
Retention is floored to an organisational minimum; a configuration may deliberately opt out of the floor, which is recorded as the dangerous choice it is.
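
The override-then-default resolution and the retention floor can be sketched as below. Reducing a retention policy to a single number of days is a simplification made for illustration, and `resolve` and `effective_retention_days` are hypothetical names.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def resolve(override: Optional[T], fleet_default: T) -> T:
    """Use the per-(group, type) override when set, else the fleet default."""
    return override if override is not None else fleet_default


def effective_retention_days(
    override: Optional[int],
    fleet_default: int,
    floor: int,
    opted_out_of_floor: bool = False,
) -> int:
    chosen = resolve(override, fleet_default)
    if opted_out_of_floor:
        return chosen  # the recorded, deliberately dangerous choice
    return max(chosen, floor)
```

For example, an override of 3 days against a floor of 14 yields 14 unless the configuration has opted out of the floor, in which case it yields 3.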

## Participation and on-demand

A server participates in a type when that type is an enabled capability on it; an operator toggles participation per `(server, type)`.
An operator may queue a one-off backup or restore for a `(server, type)` to run on the next cycle, and may cancel a queued one before it runs.

## Status

The operator can see, per group: the repo's size and cost basis, recent runs with their outcomes and errors, recent maintenance, the latest snapshot per server, and any in-flight or pending one-off requests.

## Passphrase recovery

Because Canopy owns the only copy of each passphrase, the ability to recover it without Canopy is verified, not assumed.
Recovery is a ceremony: a passphrase is escrowed encrypted to a set of offline recipient keys, and an operator periodically proves the recipients can still decrypt it.
The ceremony is recorded so that staleness (too long since the last successful proof) is visible.
63 changes: 63 additions & 0 deletions .workhorse/specs/public-server/backup.md
@@ -0,0 +1,63 @@
---
id: BAK
---

# Device backups

A server device backs up to object storage that Canopy mediates: it holds no long-lived storage credentials and knows nothing of the bucket layout until Canopy tells it, per run.
Canopy is the control plane β€” it owns the credentials, the repo location, the passphrase, and the record of what ran.

## Scope

This spec covers the device-facing contract: how a device learns what it may back up, obtains short-lived credentials and the repo coordinates for a run, and reports the outcome.

It does not cover what an operator configures (see [BKO](../private-server/backup.md)), what Canopy does on its own, namely maintenance, inspection, detection, and alerting (see [BKJ](../jobs/backup.md)), or restoring backups (the managed restore replicas spec, `RST`).

## Identity and resolution

A device authenticates with the `server` role, over either transport Canopy accepts (a client certificate on the internet-facing path, or tailnet identity on the private mount).
Every device request resolves through the authenticated identity, never the request body: device → its single live server → that server's group → the group's backup configuration.
A device bound to no live server is refused; a server with no group, or whose group has no ready configuration, is refused.
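
The resolution chain and its refusals can be sketched as a sequence of lookups keyed only by the authenticated device identity. `Directory`, `Refused`, `resolve_config`, and the reason strings are hypothetical; in-memory dictionaries stand in for whatever storage Canopy uses.

```python
from dataclasses import dataclass
from typing import Dict


class Refused(Exception):
    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


@dataclass(frozen=True)
class Directory:
    live_server_by_device: Dict[str, str]
    group_by_server: Dict[str, str]
    ready_config_by_group: Dict[str, str]  # only ready configurations appear


def resolve_config(device_id: str, directory: Directory) -> str:
    """Walk device -> live server -> group -> ready backup configuration."""
    server = directory.live_server_by_device.get(device_id)
    if server is None:
        raise Refused("no-live-server")
    group = directory.group_by_server.get(server)
    if group is None:
        raise Refused("ungrouped")
    config = directory.ready_config_by_group.get(group)
    if config is None:
        raise Refused("no-ready-configuration")
    return config
```

Nothing from the request body enters the lookup, so a device cannot steer the result toward another group.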

## Capabilities

A device registers the backup types it can run on its server.
A newly seen type is enabled for scheduling or not according to that type's fleet default; a type already known keeps the operator's setting.
Registration requires the server to be grouped, but not the group's configuration to be ready.

## Credentials

A device requests credentials for a `(type, purpose)`.
Canopy issues short-lived credentials by assuming the group's dedicated cross-account storage role under a session policy that confines them to the group's bucket and prefix:

- **backup** purpose grants the write set kopia needs, including a version-less delete, but never deletion of a locked version, nor any weakening of object-lock or retention.
- **restore** purpose grants read-only access.

The credentials carry the storage role's identity for at most an hour; a device refreshes them when a run outlasts that lifetime.
Every issuance is recorded before the credentials are returned.

A `(type, purpose)` is issuable only when the type is an enabled capability of the server, or an operator has queued a one-off request of that purpose for it; otherwise it is refused.
The group's configuration must be ready: until then the endpoints refuse, so a half-provisioned group cannot be written to.
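
The issuance rule above can be sketched as a predicate. `is_issuable` and its parameters are hypothetical, and the sketch assumes the only two purposes are `backup` and `restore`, as listed earlier in this section.

```python
from typing import Set, Tuple

PURPOSES = frozenset({"backup", "restore"})


def is_issuable(
    backup_type: str,
    purpose: str,
    enabled_types: Set[str],
    queued_one_offs: Set[Tuple[str, str]],  # (type, purpose) pairs for this server
    config_ready: bool,
) -> bool:
    """Decide whether credentials may be issued for a (type, purpose)."""
    if purpose not in PURPOSES:
        return False
    if not config_ready:
        return False  # a half-provisioned group is never written to
    if backup_type in enabled_types:
        return True
    # A disabled type is still issuable for a queued one-off of the same purpose.
    return (backup_type, purpose) in queued_one_offs
```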

## Target

A device fetches the repo coordinates for its group each run: the storage kind, bucket, prefix, region, and the repo passphrase.
The passphrase is Canopy-owned and read from the group's secret store at request time; the device never stores it.

## Reporting

A device reports each run's outcome: the type and purpose, success or failure, an error when it failed, the resulting snapshot identifier, the bytes uploaded, and the object-storage traffic the run moved.
The run is keyed by an identifier the device mints at the start of the run; the device, server, and group are taken from the authenticated context, so a device cannot report a run as another group's.
A duplicate run identifier is refused.
Reporting a run clears any matching operator one-off request, so the standing "back up now" prompt stops.
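
The reporting rules can be sketched with a small in-memory ledger. `RunLedger`, `DuplicateRun`, and the method names are hypothetical, and the sketch assumes a one-off request is cleared by any matching report, whether the run succeeded or failed.

```python
from typing import Dict, Set, Tuple


class DuplicateRun(Exception):
    """Raised when a run identifier has already been reported."""


class RunLedger:
    def __init__(self) -> None:
        self.runs: Dict[str, Tuple[str, str, str, bool]] = {}
        self.one_offs: Set[Tuple[str, str, str]] = set()

    def queue_one_off(self, server: str, backup_type: str, purpose: str) -> None:
        self.one_offs.add((server, backup_type, purpose))

    def report(self, run_id: str, server: str, backup_type: str,
               purpose: str, success: bool) -> None:
        # server comes from the authenticated context, never the request body.
        if run_id in self.runs:
            raise DuplicateRun(run_id)
        self.runs[run_id] = (server, backup_type, purpose, success)
        # Clearing the matching request stops the standing prompt.
        self.one_offs.discard((server, backup_type, purpose))
```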

## Guarantees

A compromised device cannot destroy backups.
Its credentials cannot delete a locked object version or weaken the bucket's object-lock; at worst it writes a delete-marker that object-lock and versioning leave recoverable.
Decommissioning a device means revoking its certificate: it can no longer obtain credentials, and any it already holds expire within the hour.

## Failure contract

The device endpoints distinguish three kinds of failure: the caller is bound to no live server; the server is ungrouped, its group has no ready configuration, the type is not issuable, or a run identifier is a duplicate; and one of Canopy's own dependencies (the credential issuer or the secret store) is unavailable or unconfigured.
Each is a distinct, stable status so a device need not guess.