feat(canopy): tear down ephemeral (verify) replicas after verification #87

Merged
passcod merged 1 commit into main from feat/canopy-verify-teardown on Jul 2, 2026

@passcod passcod commented Jul 2, 2026

🤖 A verify replica's job is to prove a snapshot restores, not to serve
queries — but it was booting postgres and then idling forever. This adds
spec.ephemeral (default false; set true by the verify intent):
once a restore reaches Active (postgres came up healthy, and for
canopy replicas the RestoreVerification was reported), the reconciler
records status.verifiedSnapshotId and deletes the restore — reclaiming
the Deployment + PVC. The replica CR and namespace stay so canopy's
worklist stays satisfied.
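
As a rough illustration of the teardown step described above, here is a minimal sketch in Rust. It is not the reconciler's actual code: the `Phase`, `Restore`, `Status`, and `Action` types and the `after_restore` function are hypothetical stand-ins, and only the field names `verified_snapshot_id` and `current_restore` mirror the `status.verifiedSnapshotId` and `currentRestore` fields mentioned in this PR.

```rust
/// Hypothetical restore phases; only `Active` matters for teardown.
#[derive(Debug, Clone, PartialEq)]
enum Phase {
    Restoring,
    Active,
}

/// Hypothetical, simplified view of a restore.
#[derive(Debug, Clone, PartialEq)]
struct Restore {
    snapshot_id: String,
    phase: Phase,
}

/// Hypothetical mirror of the replica status fields touched by teardown.
#[derive(Debug, Default, PartialEq)]
struct Status {
    current_restore: Option<String>,
    verified_snapshot_id: Option<String>,
}

/// What the caller should do with the restore object afterwards.
#[derive(Debug, PartialEq)]
enum Action {
    Keep,
    DeleteRestore,
}

/// For an ephemeral replica whose restore reached Active, record the
/// verified snapshot, clear the current restore, and ask for deletion.
/// Non-ephemeral replicas and non-Active restores are left alone.
fn after_restore(ephemeral: bool, restore: &Restore, status: &mut Status) -> Action {
    if !ephemeral || restore.phase != Phase::Active {
        return Action::Keep;
    }
    status.verified_snapshot_id = Some(restore.snapshot_id.clone());
    status.current_restore = None;
    Action::DeleteRestore
}
```

In this sketch the marker is written before deletion is requested, so a later reconcile pass that finds no active restore still has something to compare against.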

Re-restore gating

With no active restore after teardown, the naive triggers
(active_restore_deleted, canopy desired-snapshot-changed) would fire
immediately and loop. The verified marker breaks that: the reconciler
compares the desired snapshot against verifiedSnapshotId when there's
no active restore, so it only restores again when canopy offers a
newer snapshot (canopy path) or the schedule fires (legacy path).
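
The canopy-path comparison can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: `ReplicaState` and `needs_restore` are hypothetical names, and "newer" is approximated here as "different from the baseline", which may not match how the reconciler actually orders snapshots.

```rust
/// Hypothetical, simplified view of the state the reconciler consults.
#[derive(Debug, Default)]
struct ReplicaState {
    /// Snapshot backing the currently active restore, if there is one.
    active_snapshot: Option<String>,
    /// Mirrors `status.verifiedSnapshotId`; set when an ephemeral
    /// restore is torn down.
    verified_snapshot_id: Option<String>,
}

/// Canopy path only: decide whether the desired snapshot warrants a
/// new restore.
fn needs_restore(state: &ReplicaState, desired: &str) -> bool {
    // Prefer the active restore as the baseline. After teardown there
    // is none, so fall back to the verified marker.
    let baseline = state
        .active_snapshot
        .as_deref()
        .or(state.verified_snapshot_id.as_deref());
    // Assumption: any snapshot other than the baseline counts as new.
    baseline != Some(desired)
}
```

With `active_snapshot` empty and `verified_snapshot_id` set to the desired snapshot, `needs_restore` returns `false`, which is the case that previously looped. Without the fallback, the baseline would be `None` after every teardown and the function would always return `true`.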

The analytics intents stay ephemeral: false (long-lived query
replicas). The whole teardown path is gated behind spec.ephemeral, so
existing replicas are unchanged.

Tests

tests/ephemeral.rs (+ CI matrix entry) drives a legacy ephemeral
replica — no stub-canopy needed — through restore → Active → teardown
and asserts the restore is gone, verifiedSnapshotId is set,
currentRestore is cleared, and it does not re-restore.

A `verify` replica's job is to prove a snapshot restores, not to serve
queries. It was booting postgres and then idling forever. Add
`spec.ephemeral` (default false; set true by the verify intent): once a
restore reaches Active (postgres came up healthy, and for canopy
replicas the RestoreVerification was reported in the switchover block),
the reconciler records `status.verifiedSnapshotId` and deletes the
restore, reclaiming the Deployment + PVC. The replica CR and namespace
stay so canopy's worklist stays satisfied.

Re-restore is gated on the verified marker: with no active restore after
teardown, the reconciler compares the desired snapshot against
`verifiedSnapshotId` instead of a (now absent) active restore, so it only
restores again when canopy offers a newer snapshot (canopy path) or the
schedule fires (legacy path). Without the marker the
active-restore-deleted / desired-changed triggers would loop.

The analytics intents keep ephemeral=false (long-lived query replicas).
Gated entirely behind spec.ephemeral, so non-ephemeral replicas are
unchanged.

Adds an integration test (tests/ephemeral.rs, new CI matrix entry) that
does not need stub-canopy: it drives a legacy ephemeral replica through
restore -> Active -> teardown and asserts no re-restore loop.

Does NOT include the requested health_details on RestoreVerification:
bestool-canopy 0.4.3 has no such field yet. Deferred until the crate
ships it.
@passcod passcod force-pushed the feat/canopy-verify-teardown branch from 30ab42c to 518d535 Compare July 2, 2026 06:18
@passcod passcod merged commit 310a078 into main Jul 2, 2026
14 checks passed
@passcod passcod deleted the feat/canopy-verify-teardown branch July 2, 2026 06:39