feat(restore): restore-health ingest, alerting & overdue sweep (PR2)#294
Merged
Reports-in half of managed restore replicas.

- Migration backup_restore_checks + database::restore::BackupRestoreCheck model: record_report writes the check and raises/recovers a per-(server, type, intent) group-level restore-verification alert (always pages, recovers independently); queries for recent checks and latest-healthy anchors.
- sweep_overdue hooked into database::backup::sweep() (the 60s monitor loop): raises restore-verification for any enabled, capability-supported, freshness-bound replica whose last healthy report is missing or stale; skips gaps (unsupported intents).
- public-server POST /restore-verification (backup-restore role): authz via declaration, records the report, drives the alert.
- private-server restore_replicas/checks + a Recent-restore-checks panel on the operator UI.
- Regenerated openapi + api-types.
- Tests: db record_report raise/recover, unhealthy-success, sweep overdue vs gaps; public-server verification 403 + records-and-alerts; e2e health panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
🤖 Added a trailing |
🤖 PR2 of the PGRO restore-verification integration (TAM-6877): the reports-in / health half. Stacked on #293 — review/merge that first.
Closes the lifecycle loop: produced → persisted → restorable. The consumer reports each replica's restore outcome; a failed or overdue restore is a group-level incident that pages regardless of any one server's monitoring state.
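A minimal sketch of the report-driven half of that rule, assuming hypothetical types (`Outcome`, `AlertAction`, `ReplicaKey`, `ActiveAlerts`) that stand in for the real `database::restore::BackupRestoreCheck` model and group-event plumbing. It only illustrates the decision this PR describes: a report recovers the alert when the restore succeeded and the replica is healthy, anything else raises, and each (server, type, intent) replica is tracked on its own.

```rust
use std::collections::HashSet;

// Hypothetical stand-ins for illustration only; not the types added in this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Outcome {
    Success,
    Failure,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum AlertAction {
    Raise,
    Recover,
}

/// One alert per (server, backup type, intent) replica.
type ReplicaKey = (String, String, String);

fn key(server: &str, backup_type: &str, intent: &str) -> ReplicaKey {
    (server.to_string(), backup_type.to_string(), intent.to_string())
}

/// Only success-and-healthy recovers; restored-but-unhealthy still raises.
fn alert_action(outcome: Outcome, healthy: bool) -> AlertAction {
    if outcome == Outcome::Success && healthy {
        AlertAction::Recover
    } else {
        AlertAction::Raise
    }
}

/// In-memory stand-in for the set of active group-level alerts.
#[derive(Default)]
struct ActiveAlerts(HashSet<ReplicaKey>);

impl ActiveAlerts {
    /// Applies one report to the alert keyed by its replica and returns the action taken.
    fn record_report(&mut self, key: ReplicaKey, outcome: Outcome, healthy: bool) -> AlertAction {
        let action = alert_action(outcome, healthy);
        match action {
            AlertAction::Raise => {
                self.0.insert(key);
            }
            AlertAction::Recover => {
                self.0.remove(&key);
            }
        }
        action
    }

    fn is_active(&self, key: &ReplicaKey) -> bool {
        self.0.contains(key)
    }
}
```

Because the key includes the intent, a healthy `verify` report on a server leaves a failing `disaster-recovery` alert on the same server untouched.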
What's here
- `backup_restore_checks` table + `database::restore::BackupRestoreCheck`. `record_report` writes the report and raises or recovers a per-(server, type, intent) group-level `restore-verification` alert (via the existing `raise_group_event` + the `RESTORE_VERIFICATION` ref). Per-replica keying means a healthy `verify` never masks a failing `disaster-recovery` on the same server; success-and-healthy recovers, anything else (including restored-but-unhealthy) raises.
- `sweep_overdue` (hooked into `database::backup::sweep()`, the 60s monitor loop): raises `restore-verification` for any enabled, capability-supported, freshness-bound replica whose last healthy report is missing or older than its window. Gaps (unsupported intents) are skipped: they're config notices, not health incidents. Recovery is driven by the next healthy report, so the sweep only raises.
- `POST /restore-verification` on the public server (backup-restore role): same declaration-based authz as credentials, records the report, drives the alert. Outcome is `success`/`failure` only (no `unsupported`; that's handled by capability filtering upstream).
- A Recent-restore-checks panel on `/restore-replicas`, backed by a `restore_replicas/checks` admin endpoint.

Wire shapes

`RestoreVerification` carries `server_id`, `intent`, and the optional `replica_id` (from the worklist entry) on top of the handoff's original fields; frozen now for the bestool Appendix A restatement.

Tests

- `database::restore`: `record_report` raise→recover, restored-but-unhealthy still raises, `sweep_overdue` raises for stale replicas and skips gaps.
- `public-server::restore`: `/restore-verification` returns 403 without a declaration; with one, it records a check row and raises an active group issue.
- `private-web/e2e`: the restore-health panel renders a seeded check with its outcome.
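The overdue rule that the sweep tests exercise can be sketched as below. This is an illustration under assumptions, not the `sweep_overdue` implementation: the `Replica` struct, its field names, and `overdue_replicas` are hypothetical, and the real code reads replicas and latest-healthy anchors from the database.

```rust
use std::time::{Duration, SystemTime};

// Hypothetical in-memory view of a replica; field names are assumptions.
struct Replica {
    name: String,
    enabled: bool,
    /// False for a "gap" (unsupported intent): a config notice, not an incident.
    supported: bool,
    /// None means the replica is not freshness-bound.
    freshness_window: Option<Duration>,
    /// Timestamp of the last healthy report, if any.
    last_healthy: Option<SystemTime>,
}

/// True when the replica should raise: enabled, supported, freshness-bound,
/// and its last healthy report is missing or older than its window.
fn is_overdue(replica: &Replica, now: SystemTime) -> bool {
    if !replica.enabled || !replica.supported {
        return false;
    }
    let Some(window) = replica.freshness_window else {
        return false;
    };
    match replica.last_healthy {
        None => true,
        // A timestamp in the future yields Err; treat that as not overdue.
        Some(at) => now
            .duration_since(at)
            .map(|age| age > window)
            .unwrap_or(false),
    }
}

/// Names of replicas that would raise on this pass. The sweep only raises;
/// recovery comes from the next healthy report.
fn overdue_replicas(replicas: &[Replica], now: SystemTime) -> Vec<&str> {
    replicas
        .iter()
        .filter(|r| is_overdue(r, now))
        .map(|r| r.name.as_str())
        .collect()
}
```

Keeping the check a pure function of the replica and the current time makes the stale-versus-gap cases straightforward to test without a running monitor loop.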