
feat(restore): restore-health ingest, alerting & overdue sweep (PR2) #294

Merged
passcod merged 2 commits into main from restore-verification-health on Jun 30, 2026

passcod commented Jun 30, 2026

Member

🤖 PR2 of the PGRO restore-verification integration (TAM-6877): the reports-in / health half. Stacked on #293 — review/merge that first.

Closes the lifecycle loop: produced → persisted → restorable. The consumer reports each replica's restore outcome; a failed or overdue restore is a group-level incident that pages regardless of any one server's monitoring state.

What's here

  • backup_restore_checks table + database::restore::BackupRestoreCheck. record_report writes the report and raises or recovers a per-(server, type, intent) group-level restore-verification alert (via the existing raise_group_event + the RESTORE_VERIFICATION ref). Per-replica keying means a healthy verify never masks a failing disaster-recovery on the same server; success-and-healthy recovers, anything else (including restored-but-unhealthy) raises.
  • Overdue-freshness sweep (sweep_overdue, hooked into database::backup::sweep() — the 60s monitor loop): raises restore-verification for any enabled, capability-supported, freshness-bound replica whose last healthy report is missing or older than its window. Gaps (unsupported intents) are skipped — they're config notices, not health incidents. Recovery is driven by the next healthy report, so the sweep only raises.
  • POST /restore-verification (backup-restore role): same declaration-based authz as credentials, records the report, drives the alert. Outcome is success/failure only (no unsupported — that's handled by capability filtering upstream).
  • Operator UI: a Recent restore checks panel on /restore-replicas, backed by a restore_replicas/checks admin endpoint.
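The raise/recover rule and the overdue check described above can be sketched as two pure functions. This is a minimal illustration, not the code in database::restore: the names `action_for_report`, `is_overdue`, `ReplicaState`, and its fields are hypothetical, and representing health as a plain boolean is an assumption.

```rust
use std::time::{Duration, SystemTime};

/// Outcome reported by the consumer; the endpoint accepts only these two.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Failure,
}

/// What to do with the per-replica restore-verification alert.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AlertAction {
    Raise,
    Recover,
}

/// Hypothetical stand-in for the decision inside record_report:
/// only a successful AND healthy restore recovers; anything else,
/// including restored-but-unhealthy, raises.
pub fn action_for_report(outcome: Outcome, healthy: bool) -> AlertAction {
    if outcome == Outcome::Success && healthy {
        AlertAction::Recover
    } else {
        AlertAction::Raise
    }
}

/// Hypothetical view of one replica as the sweep sees it.
#[derive(Debug, Clone)]
pub struct ReplicaState {
    pub enabled: bool,
    /// False for a "gap" (unsupported intent).
    pub capability_supported: bool,
    /// None when the replica is not freshness-bound.
    pub freshness_window: Option<Duration>,
    /// None when no healthy report has been recorded yet.
    pub last_healthy: Option<SystemTime>,
}

/// Returns true when the sweep should raise for this replica.
/// The sweep never recovers; the next healthy report does that.
pub fn is_overdue(replica: &ReplicaState, now: SystemTime) -> bool {
    // Disabled replicas and gaps are skipped.
    if !replica.enabled || !replica.capability_supported {
        return false;
    }
    // Replicas without a freshness window cannot be overdue.
    let Some(window) = replica.freshness_window else {
        return false;
    };
    match replica.last_healthy {
        // A missing healthy report counts as overdue.
        None => true,
        Some(at) => match now.duration_since(at) {
            Ok(age) => age > window,
            // A timestamp in the future is not older than the window.
            Err(_) => false,
        },
    }
}
```

Keeping both checks free of database access makes the edge cases (gaps, missing reports, restored-but-unhealthy) easy to cover in unit tests, separately from the alert plumbing.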

Wire shapes

RestoreVerification carries server_id, intent, and the optional replica_id (from the worklist entry) on top of the handoff's original fields — frozen now for the bestool Appendix A restatement.
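As a rough picture of the fields named above, the report could look like the struct below. This is only a sketch: the field types are assumptions (plain strings here), the handoff's original fields are not listed in this PR and are left out, and the `has_replica` helper is hypothetical.

```rust
/// Hypothetical sketch of the RestoreVerification report body.
/// Only the three fields named in this PR are shown; the handoff's
/// original fields are omitted because they are not listed here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RestoreVerification {
    /// Server the restore was verified for (type is an assumption).
    pub server_id: String,
    /// Restore intent, e.g. a verify or disaster-recovery run (type is an assumption).
    pub intent: String,
    /// Optional replica id, taken from the worklist entry when present.
    pub replica_id: Option<String>,
}

impl RestoreVerification {
    /// Hypothetical helper: true when the report names a specific replica.
    pub fn has_replica(&self) -> bool {
        self.replica_id.is_some()
    }
}
```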

Tests

  • database::restore: record_report raise→recover, restored-but-unhealthy still raises, sweep_overdue raises for stale replicas and skips gaps.
  • public-server::restore: /restore-verification returns 403 without a declaration; with one, it records a check row and raises an active group issue.
  • private-web/e2e: the restore-health panel renders a seeded check with its outcome.

Reports-in half of managed restore replicas.

- Migration backup_restore_checks + database::restore::BackupRestoreCheck
  model: record_report writes the check and raises/recovers a per-(server,
  type,intent) group-level restore-verification alert (always pages, recovers
  independently); queries for recent checks and latest-healthy anchors.
- sweep_overdue hooked into database::backup::sweep() (the 60s monitor loop):
  raises restore-verification for any enabled, capability-supported,
  freshness-bound replica whose last healthy report is missing or stale;
  skips gaps (unsupported intents).
- public-server POST /restore-verification (backup-restore role): authz via
  declaration, records the report, drives the alert.
- private-server restore_replicas/checks + a Recent-restore-checks panel on
  the operator UI.
- Regenerated openapi + api-types.
- Tests: db record_report raise/recover, unhealthy-success, sweep overdue vs
  gaps; public-server verification 403 + records-and-alerts; e2e health panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

passcod commented Jun 30, 2026

Member Author

🤖 Added a trailing unplan(restore) commit: removes the now-implemented coordination docs docs/plans/pgro-restore-verification-handoff.md and docs/plans/pgro-restore-replicas-canopy-response.md. The durable spec .workhorse/specs/public-server/restore-replicas.md stays. So this PR both ships restore-health and unplans the restore design docs as it lands.

Base automatically changed from pgro-restore-verification to main June 30, 2026 05:27
passcod added this pull request to the merge queue Jun 30, 2026
Merged via the queue into main with commit 432df4e Jun 30, 2026
7 checks passed
passcod deleted the restore-verification-health branch June 30, 2026 05:39