
fix: re-enable legacy metrics reporter for audit bootstrap #275

Merged
mateeullahmalik merged 1 commit into master from fix/re-enable-legacy-metrics
Mar 10, 2026

@mateeullahmalik
Collaborator

Problem

The audit module's epoch-end recovery for POSTPONED supernodes requires peer observations from active probers. On testnet, a deadlock exists where:

  1. All supernodes running v2.4.5-testnet were already POSTPONED before the chain upgrade (the legacy staleness check postponed them ~500 blocks after they stopped submitting MsgReportSupernodeMetrics)
  2. The epoch anchor's active_supernode_accounts has been empty since epoch 1 — no active probers exist
  3. POSTPONED SNs faithfully submit audit epoch reports (~100 each) but cannot recover without peer observations
  4. The 3 SNs on old releases (v2.4.29/35/36) bounce ACTIVE↔POSTPONED via legacy metrics every epoch but are always POSTPONED at epoch-start anchor freeze time
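The deadlock in the list above can be illustrated with a small model. This is only a sketch under stated assumptions: the `Supernode` type, `FreezeAnchor`, and `RecoverAtEpochEnd` are hypothetical names, not the audit module's API, and "at least one prober exists" stands in for "peer observations are available".

```go
package main

import "fmt"

// State is a simplified supernode state (the real module may have more).
type State int

const (
	Active State = iota
	Postponed
)

// Supernode is a hypothetical, minimal view of a supernode for this sketch.
type Supernode struct {
	Account   string
	State     State
	HasReport bool // submitted an audit epoch report this epoch
}

// FreezeAnchor returns the accounts that are ACTIVE at epoch start.
// In this model only these accounts act as probers for the epoch.
func FreezeAnchor(sns []Supernode) []string {
	var active []string
	for _, sn := range sns {
		if sn.State == Active {
			active = append(active, sn.Account)
		}
	}
	return active
}

// RecoverAtEpochEnd models the recovery rule: a POSTPONED supernode recovers
// only if it reported AND peer observations exist. Assumption: any non-empty
// prober set yields observations; an empty set yields none.
func RecoverAtEpochEnd(sns []Supernode, probers []string) int {
	recovered := 0
	for i := range sns {
		if sns[i].State == Postponed && sns[i].HasReport && len(probers) > 0 {
			sns[i].State = Active
			recovered++
		}
	}
	return recovered
}

func main() {
	// Every supernode is POSTPONED but reports faithfully.
	sns := []Supernode{
		{Account: "sn1", State: Postponed, HasReport: true},
		{Account: "sn2", State: Postponed, HasReport: true},
	}
	for epoch := 1; epoch <= 3; epoch++ {
		probers := FreezeAnchor(sns)
		n := RecoverAtEpochEnd(sns, probers)
		fmt.Printf("epoch %d: probers=%d recovered=%d\n", epoch, len(probers), n)
	}
}
```

In this model the prober count and the recovered count stay at zero in every epoch, because recovery depends on a set that only recovered nodes can populate.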

Fix

Re-enable the legacy supernode_metrics reporter alongside the audit host_reporter. Both run in parallel:

  • Legacy MsgReportSupernodeMetrics → instant recovery from POSTPONED → ACTIVE (mid-epoch)
  • Audit MsgSubmitEpochReport → prevents re-postponement at epoch end (report exists, host minimums disabled)
  • Next epoch: SN is ACTIVE at anchor freeze → enters active_supernode_accounts → becomes a prober
  • Peer observations flow → remaining POSTPONED SNs recover via audit path

Evidence (testnet lumera-testnet-2)

  • 47/59 supernodes POSTPONED before audit activation (h=3763000)
  • 12 SNs postponed at h≈3730xxx — exactly 501 blocks after last legacy metrics (staleness threshold = 500)
  • 0 epoch reports from 3 old-release SNs (no audit code) — 229 bounce cycles each
  • ~100 epoch reports from each v2.4.5-testnet SN — all stuck POSTPONED, zero peer observations available
  • 0 active_supernode_accounts in every epoch anchor since epoch 1
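The "exactly 501 blocks" figure is consistent with a strict greater-than comparison against the threshold of 500. The sketch below shows that arithmetic; the strict comparison is inferred from the evidence above rather than taken from the chain code, and `IsStale` and `FirstStaleHeight` are hypothetical helpers.

```go
package main

import "fmt"

// stalenessThreshold is the legacy staleness threshold quoted above, in blocks.
const stalenessThreshold = 500

// IsStale reports whether a supernode whose last legacy metrics report landed
// at lastReportHeight counts as stale at currentHeight.
// Assumption: the check is strictly greater-than.
func IsStale(currentHeight, lastReportHeight int64) bool {
	return currentHeight-lastReportHeight > stalenessThreshold
}

// FirstStaleHeight returns the first height at which IsStale becomes true.
func FirstStaleHeight(lastReportHeight int64) int64 {
	return lastReportHeight + stalenessThreshold + 1
}

func main() {
	last := int64(1000) // hypothetical height of the last legacy report

	fmt.Println(IsStale(last+500, last)) // false: still within the threshold
	fmt.Println(IsStale(last+501, last)) // true: first stale block
	fmt.Println(FirstStaleHeight(last))  // 1501
}
```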

After stabilization

Once the active set is stable (1-2 epochs), the legacy reporter can be removed in a future release.

The audit module's epoch-end recovery requires peer observations from
active probers. When the module was first activated on testnet, all
supernodes running v2.4.5-testnet had already been POSTPONED by the
legacy staleness handler: they stopped submitting MsgReportSupernodeMetrics
after upgrading and were postponed ~500 blocks later, before the chain
upgrade.

This created a deadlock:
- Recovery needs peer observations from active probers
- No active probers exist (empty active_supernode_accounts in every anchor)
- POSTPONED SNs submit epoch reports but cannot recover
- The 3 SNs on old releases bounce ACTIVE↔POSTPONED via legacy metrics
  but are always POSTPONED at epoch start (anchor freeze time)

Fix: run the legacy metrics reporter alongside the audit host_reporter.
Legacy MsgReportSupernodeMetrics recovers POSTPONED SNs to ACTIVE
mid-epoch. Since they also submit audit epoch reports, the audit
EndBlocker won't re-postpone them (report exists, host minimums are
disabled, no peer-port streak). They survive the epoch end and appear
ACTIVE in the next epoch anchor, bootstrapping the peer-observation
cycle for all remaining POSTPONED SNs.
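The three conditions named above (report exists, host minimums disabled, no peer-port streak) can be read as a single epoch-end predicate. The following is a sketch of that reading only: `EpochEndInput`, `ShouldPostpone`, and the `streakLimit` parameter are hypothetical and do not mirror the audit EndBlocker's actual types or parameters.

```go
package main

import "fmt"

// EpochEndInput is a hypothetical summary of one ACTIVE supernode's epoch.
type EpochEndInput struct {
	HasReport             bool // an audit epoch report was submitted
	HostMinimumsEnabled   bool // host minimum checks are switched on
	MeetsHostMinimums     bool // only consulted when the checks are enabled
	PeerPortFailureStreak int  // consecutive epochs with peer-port failures
}

// ShouldPostpone returns true if any of the three conditions would postpone
// the supernode at epoch end. streakLimit is an assumed parameter; a value
// of 0 disables the streak check in this sketch.
func ShouldPostpone(in EpochEndInput, streakLimit int) bool {
	if !in.HasReport {
		return true
	}
	if in.HostMinimumsEnabled && !in.MeetsHostMinimums {
		return true
	}
	if streakLimit > 0 && in.PeerPortFailureStreak >= streakLimit {
		return true
	}
	return false
}

func main() {
	// A supernode recovered mid-epoch via legacy metrics that also reported.
	recovered := EpochEndInput{HasReport: true}
	fmt.Println(ShouldPostpone(recovered, 3)) // false

	// An old-release supernode without audit code never submits a report.
	oldRelease := EpochEndInput{HasReport: false}
	fmt.Println(ShouldPostpone(oldRelease, 3)) // true
}
```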

Once the active set stabilizes, the legacy reporter can be removed
in a future release.
@roomote-v0

roomote-v0 bot commented Mar 10, 2026

Reviewed the diff and traced the Collector type, its Run/Stop lifecycle, the service interface contract, the reachability.SetCurrentEpochID concurrency model, and the host_reporter interaction. No issues found -- the change is a clean re-enablement of previously working code with correct service interface compliance and safe concurrent behavior.

@mateeullahmalik mateeullahmalik merged commit 235d45f into master Mar 10, 2026
7 checks passed
j-rafique added a commit to LumeraProtocol/lumera that referenced this pull request Mar 10, 2026
Add unit and system tests that reproduce the testnet deadlock where all
supernodes are POSTPONED and the epoch anchor has an empty active set.
Without active probers, peer observations cannot be generated, making
audit recovery impossible.

Tests:
- TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover (unit)
  Proves that compliant host-only reports from POSTPONED SNs are
  insufficient for recovery when no peer observations exist.

- TestEnforceEpochEnd_LegacyRecoveredSN_SurvivesWithReport (unit)
  Proves that SNs recovered to ACTIVE mid-epoch (via legacy metrics)
  with audit reports survive the EndBlocker enforcement.

- TestAuditEmptyActiveSetDeadlock_HostOnlyReportsCannotRecover (system)
  Full-chain E2E: registers SNs, misses epoch 0, submits host-only
  reports for 3 epochs — all remain POSTPONED throughout.

- TestAuditEmptyActiveSetBootstrap_LegacyMetricsBreaksDeadlock (system)
  Full-chain E2E: same deadlock setup, then legacy metrics recovery
  breaks the deadlock — SNs survive enforcement and remain ACTIVE.

Ref: LumeraProtocol/supernode#275
mateeullahmalik added a commit that referenced this pull request Mar 11, 2026
mateeullahmalik added a commit that referenced this pull request Mar 11, 2026