fix: atomic session-guarded broker disconnect to prevent reconnect race#144
Closed
scion-gteam[bot] wants to merge 1 commit into
Closed
fix: atomic session-guarded broker disconnect to prevent reconnect race#144scion-gteam[bot] wants to merge 1 commit into
scion-gteam[bot] wants to merge 1 commit into
Conversation
…ce (#131) The onDisconnect callback previously used separate ReleaseRuntimeBrokerConnection and UpdateRuntimeBrokerHeartbeat calls. When a broker disconnects and reconnects rapidly, the stale disconnect's offline stamp can clobber the new connection's online status because UpdateRuntimeBrokerHeartbeat has no session guard — it unconditionally overwrites status. Provider statuses are also clobbered and never restored by heartbeats, leaving the broker permanently invisible until hub restart. Add ReleaseAndMarkBrokerOffline which atomically clears affinity AND stamps status=offline in a single CAS write. If a concurrent reconnect has already claimed the broker with a new session, the compare fails and the callback is a no-op. Also add a re-check guard before updating provider statuses.
Owner
|
Closing — fix incorporated into upstream PR GoogleCloudPlatform#303 (branch scion/dev-issue-131) via cherry-pick of commit 4c10502. The atomic ReleaseAndMarkBrokerOffline solution is fully covered there. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the recurring
no_runtime_broker422 errors caused by a TOCTOU race in the broker disconnect callback (issue #131).onDisconnectcallback used separateReleaseRuntimeBrokerConnection+UpdateRuntimeBrokerHeartbeat(offline)calls. When a broker disconnects and reconnects rapidly (same second), the stale disconnect'sReleaseRuntimeBrokerConnectionCAS succeeds (old session still in DB), then the subsequentUpdateRuntimeBrokerHeartbeat(offline)— which has no session guard — clobbers the new connection's online status. Provider statuses are also clobbered and never restored by heartbeats, leaving the broker permanently invisible until hub restart.ReleaseAndMarkBrokerOfflinestore method that atomically clears affinity AND stamps status=offline in a single CAS write. If a concurrent reconnect has already claimed the broker with a new session, the compare fails and the callback is a no-op. Also adds a re-check guard before updating provider statuses to prevent clobbering by the provider status loop.removeConnection) is already in the code and prevents a different race layer. This fix addresses the deeper store-level TOCTOU between the session check and the offline stamp.Changes
pkg/store/store.go— AddReleaseAndMarkBrokerOfflinetoRuntimeBrokerStoreinterfacepkg/store/entadapter/project_store.go— Implement atomic release+offline CASpkg/hub/server.go— UpdateonDisconnectcallback to use new method + provider re-check guardpkg/store/entadapter/broker_affinity_test.go— 4 new tests covering the happy path, session mismatch, post-reclaim no-op, and unclaimed no-opTest plan
TestReleaseAndMarkBrokerOffline_StampsOffline— matching session clears affinity and stamps offlineTestReleaseAndMarkBrokerOffline_NoopOnSessionMismatch— stale session is a no-op, status stays onlineTestReleaseAndMarkBrokerOffline_NoopAfterReclaim— reproduces the exact race from issue Bug: Runtime broker control channel disconnects after ~4h, does not recover without hub restart #131TestReleaseAndMarkBrokerOffline_NoopWhenUnclaimed— unclaimed broker is a no-opTestBrokerAffinity_*andTestControlChannel*tests pass (regression)pkg/store/...test suite passesRefs #131