Skip to content

fix: prevent stale disconnect from marking reconnected broker offline#303

Open
ptone wants to merge 2 commits into
GoogleCloudPlatform:mainfrom
ptone:scion/dev-issue-131
Open

fix: prevent stale disconnect from marking reconnected broker offline#303
ptone wants to merge 2 commits into
GoogleCloudPlatform:mainfrom
ptone:scion/dev-issue-131

Conversation

@ptone
Copy link
Copy Markdown
Member

@ptone ptone commented Jun 4, 2026

Summary

  • Root cause: When a broker's control channel disconnects and reconnects rapidly (within the same second), the old connection's deferred removeConnection goroutine races with the new connection. It removes the new connection from the map and fires onDisconnect, leaving onlineProviders=0 despite a live WebSocket session.
  • Fix: removeConnection now takes a *BrokerConnection pointer and only removes the map entry / fires the disconnect callback when the passed connection is still the active one for that brokerID. Superseded connections are silently skipped.
  • Adds two new tests reproducing the exact race scenario described in the issue.

Fixes issue 131

Test plan

  • TestControlChannelManager_ReconnectDoesNotTriggerDisconnect — verifies a stale removeConnection from the old connection does NOT fire onDisconnect or remove the new connection
  • TestControlChannelManager_StaleRemoveAfterReconnect_ThenRealDisconnect — verifies the full lifecycle: stale remove is skipped, then a real disconnect of the current connection fires the callback correctly
  • Existing TestControlChannelManager_OnDisconnectCallback and _NilSafe updated and passing
  • Full pkg/hub/... test suite passes

ptone added 2 commits June 3, 2026 23:37
When a broker's control channel disconnects and reconnects in the same
second, the old connection's deferred removeConnection was deleting the
new connection from the map and firing the onDisconnect callback. This
left onlineProviders=0 despite a live WebSocket session.

Guard removeConnection with a pointer comparison: only remove the map
entry and fire the disconnect callback when the connection being cleaned
up is still the active one for that brokerID. If a reconnect has already
replaced it, skip both — the new session is healthy.

Fixes #131
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a race condition where a rapid disconnect and reconnect of a broker's control channel could leave the broker permanently offline. The fix updates removeConnection to accept and verify the specific *BrokerConnection pointer, ensuring that stale deferred disconnects do not remove newly established connections. Corresponding unit tests have been added to verify this behavior. There are no review comments, and I have no additional feedback to provide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant