Skip to content

fix: atomic session-guarded broker disconnect to prevent reconnect race#144

Closed
scion-gteam[bot] wants to merge 1 commit into
mainfrom
fix/session-guarded-broker-disconnect
Closed

fix: atomic session-guarded broker disconnect to prevent reconnect race#144
scion-gteam[bot] wants to merge 1 commit into
mainfrom
fix/session-guarded-broker-disconnect

Conversation

@scion-gteam
Copy link
Copy Markdown

@scion-gteam scion-gteam Bot commented Jun 5, 2026

Summary

Fixes the recurring no_runtime_broker 422 errors caused by a TOCTOU race in the broker disconnect callback (issue #131).

  • Root cause: The onDisconnect callback used separate ReleaseRuntimeBrokerConnection + UpdateRuntimeBrokerHeartbeat(offline) calls. When a broker disconnects and reconnects rapidly (same second), the stale disconnect's ReleaseRuntimeBrokerConnection CAS succeeds (old session still in DB), then the subsequent UpdateRuntimeBrokerHeartbeat(offline) — which has no session guard — clobbers the new connection's online status. Provider statuses are also clobbered and never restored by heartbeats, leaving the broker permanently invisible until hub restart.
  • Fix: New ReleaseAndMarkBrokerOffline store method that atomically clears affinity AND stamps status=offline in a single CAS write. If a concurrent reconnect has already claimed the broker with a new session, the compare fails and the callback is a no-op. Also adds a re-check guard before updating provider statuses to prevent clobbering by the provider status loop.
  • Why PR fix: prevent stale disconnect from marking reconnected broker offline #133 wasn't enough: PR fix: prevent stale disconnect from marking reconnected broker offline #133's fix (session-ID guard in removeConnection) is already in the code and prevents a different race layer. This fix addresses the deeper store-level TOCTOU between the session check and the offline stamp.

Changes

  • pkg/store/store.go — Add ReleaseAndMarkBrokerOffline to RuntimeBrokerStore interface
  • pkg/store/entadapter/project_store.go — Implement atomic release+offline CAS
  • pkg/hub/server.go — Update onDisconnect callback to use new method + provider re-check guard
  • pkg/store/entadapter/broker_affinity_test.go — 4 new tests covering the happy path, session mismatch, post-reclaim no-op, and unclaimed no-op

Test plan

  • TestReleaseAndMarkBrokerOffline_StampsOffline — matching session clears affinity and stamps offline
  • TestReleaseAndMarkBrokerOffline_NoopOnSessionMismatch — stale session is a no-op, status stays online
  • TestReleaseAndMarkBrokerOffline_NoopAfterReclaim — reproduces the exact race from issue Bug: Runtime broker control channel disconnects after ~4h, does not recover without hub restart #131
  • TestReleaseAndMarkBrokerOffline_NoopWhenUnclaimed — unclaimed broker is a no-op
  • All existing TestBrokerAffinity_* and TestControlChannel* tests pass (regression)
  • Full pkg/store/... test suite passes

Refs #131

…ce (#131)

The onDisconnect callback previously used separate ReleaseRuntimeBrokerConnection
and UpdateRuntimeBrokerHeartbeat calls. When a broker disconnects and reconnects
rapidly, the stale disconnect's offline stamp can clobber the new connection's
online status because UpdateRuntimeBrokerHeartbeat has no session guard — it
unconditionally overwrites status. Provider statuses are also clobbered and never
restored by heartbeats, leaving the broker permanently invisible until hub restart.

Add ReleaseAndMarkBrokerOffline which atomically clears affinity AND stamps
status=offline in a single CAS write. If a concurrent reconnect has already
claimed the broker with a new session, the compare fails and the callback is
a no-op. Also add a re-check guard before updating provider statuses.
@ptone ptone closed this Jun 5, 2026
@ptone
Copy link
Copy Markdown
Owner

ptone commented Jun 5, 2026

Closing — fix incorporated into upstream PR GoogleCloudPlatform#303 (branch scion/dev-issue-131) via cherry-pick of commit 4c10502. The atomic ReleaseAndMarkBrokerOffline solution is fully covered there.

@scion-gteam scion-gteam Bot deleted the fix/session-guarded-broker-disconnect branch June 5, 2026 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant