fix(relay): force-reconnect path survives backoff + stale sockets (#86)#87
fix(relay): force-reconnect path survives backoff + stale sockets (#86)#87variablefate wants to merge 2 commits into
Conversation
#86) The "Reconnect to Relays" button could leave the app permanently unable to connect after a single press, with subsequent presses and even app restarts failing to recover. Root cause was five compounding bugs in the manual-reconnect path: 1. No WebSocket pingInterval — NAT/wifi transitions silently dropped sockets while the client thought they were still CONNECTED. 2. disconnectAll() used graceful WebSocket.close(), which can hold an OkHttp dispatcher slot for up to readTimeout (30s) while the close handshake races with the immediate connectAll() that followed. 3. reconnectAttempts only reset on a successful onOpen, so repeated auto-retry failures ramped backoff to 60s — manual presses that themselves failed just kept incrementing the counter. 4. ensureConnected() (used by the roadflare-rider button) only reconnects DISCONNECTED relays, so a connection stuck in CONNECTING couldn't recover. 5. The button's "Reconnecting…" state was a fake 2s timer disconnected from actual connection state, and it leaked a MainScope coroutine on every press. This adds a dedicated forceReconnect path: - RelayConnection.forceReconnect(): single synchronized transition that cancels (not closes) the prior socket for immediate teardown, resets reconnectAttempts to 0, bumps connectionGeneration so stale callbacks from the prior socket are filtered, and opens a new socket. - RelayManager.forceReconnectAll(newRelayUrls: List<String>?): fan-out to every connection, plus optional relay-list sync so user edits to custom relays take effect without an app restart. - NostrService.forceReconnect(): thin facade. - OkHttp client now sends 30s pings (matches CashuWebSocket). - RelayManagementScreen button state now derives from connectionStates rather than a fixed timer, and uses LaunchedEffect rather than MainScope so no coroutine leaks. All three apps' "Reconnect" handlers (rider-app, drivestr, roadflare-rider) are migrated to the new path. Public API of disconnectAll/connectAll/ensureConnected is preserved — they're still called from MainActivity onCreate, ProfileSyncManager, LogoutManager, etc. New methods are additive. Tests: 4 new unit tests cover the forceReconnect contract (reset backoff, bump generation, transition to CONNECTING, recover after disconnect). Full common suite: 508/508 passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. 5 parallel reviewers raised 8 candidate findings (concurrency in 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Address two findings flagged by the self-review on PR #87 that scored just below the strict-merge threshold but were verified as real: 1. Subscription-seeding gap on dynamically-added relays --------------------------------------------------- `RelayManager.addRelay()` created a new `RelayConnection` and added it to `connections`, but never pushed the manager-level `subscriptions` registry to it. `subscribe()` only fans out to connections that existed at call time, and `RelayConnection.resubscribeAll()` only sends entries from the connection's own local map. Result: a relay added mid-session (which the new `forceReconnectAll(newRelayUrls)` path now does when the user edits custom relays then taps Reconnect) would open CONNECTED but receive zero events — a silent data-loss footgun. Fix: in `addRelay()`, after inserting the new connection, push every active `subscriptions[*]` entry into it. At construction time this is a no-op because `subscriptions` is empty. 2. Spinner-clears-too-early race on partial connectivity ----------------------------------------------------- `connection.state` collectors run on `Dispatchers.IO` (RelayManager line 49). The button onClick runs on Main: it writes `reconnectInitiatedAt`, calls `forceReconnect()` (which mutates each `RelayConnection._state.value = CONNECTING` synchronously), then returns. Compose then recomposes for the `reconnectInitiatedAt` change BEFORE the IO→Main hop has propagated the new state through `collectAsState()`. The LaunchedEffect fires with stale `connectedCount > 0`, clears `reconnectInitiatedAt`, and the spinner never appears. Fix: replace the single key+early-clear pattern with a two-phase `snapshotFlow` wait keyed only on `reconnectInitiatedAt`: - Phase 1: wait for `connectedCount` to reach 0 (absorbs the IO→Main propagation window — without this, pre-existing CONNECTED state would let Phase 2 resolve instantly). - Phase 2: wait for `connectedCount > 0` (reconnect succeeded) or hit the 15s hard cap. `rememberUpdatedState` exposes the parameter to the snapshot tracker so the flow observes recomposition updates without re-keying the effect. Tests: 3 new `RelayManagerAddRelayTest` cases covering subscription seeding under three conditions (existing subs at addRelay time, empty subs at construction time, end-to-end via forceReconnectAll). Total common suite: 511/511 passing. Test-only accessors added: - `RelayConnection.activeSubscriptionIdsForTest()` - `RelayManager.connectionForTest(url)` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review follow-up — both items verified and fixedPushed 8492fb0. Closes the two findings the prior scoring pass left at 70/55 confidence (below the 80 merge bar), now confirmed as real bugs. 1. Subscription-seeding gap — verified by re-reading RelayManager.addRelay: the new 2. Spinner-clears-too-early race — verified by tracing dispatchers: Tests: 3 new cases in 🤖 Generated with Claude Code |
Summary
Fixes #86 — the critical "Reconnect to Relays" bug where pressing the button could leave the app permanently unable to connect, with subsequent presses, force-close, and even cache clear failing to recover.
After a deep dive I identified five compounding bugs in the manual-reconnect path. No single one explains every symptom, but together they create the "broken forever" experience the reporter saw:
pingIntervalon the OkHttp client — NAT/wifi transitions can silently drop the WebSocket while the client still thinks it's CONNECTED.CashuWebSocketsets a 30s ping interval;RelayManagerdidn't. Most likely explains the "still couldn't connect after force-close + cache clear" symptom: the new process opened sockets, network silently ate them, and there was no ping to detect it.disconnectAll(); connectAll()uses gracefulWebSocket.close()— close initiates a close handshake that can hold an OkHttp dispatcher slot for up toreadTimeout(30s). The immediateconnectAll()afterwards starts new sockets that compete with the dying ones.reconnectAttemptsis only reset on a successfulonOpen— after repeated auto-retry failures, backoff ramps to 60s. Manual presses that themselves fail just keep incrementing the counter.ensureConnected()(the roadflare-rider button) only reconnects DISCONNECTED relays — a connection stuck in CONNECTING from a prior never-completing handshake stays stuck.MainScope()coroutine on every press.What changed
A new dedicated force-reconnect path that does the right thing in one synchronized transition:
RelayConnection.forceReconnect()(common/.../RelayConnection.kt): cancels (not closes) the prior socket for immediate teardown, resetsreconnectAttemptsto 0, bumpsconnectionGenerationso stale callbacks are filtered, and opens a new socket.RelayManager.forceReconnectAll(newRelayUrls: List<String>? = null)(common/.../RelayManager.kt): fan-out to every connection, plus optional relay-list reconciliation so user edits to custom relays take effect without an app restart.NostrService.forceReconnect()(common/.../NostrService.kt): thin facade.RelayManagementScreen(common/.../RelayManagementScreen.kt) — button state now derives fromconnectionStates(clears on success, hard-capped at 15s), usesLaunchedEffectinstead ofMainScope()so no coroutine leak.All three apps' "Reconnect" handlers (rider-app, drivestr, roadflare-rider) migrated to the new path.
Backward compatibility
disconnectAll/connectAll/ensureConnectedare preserved unchanged — they're still called fromMainActivity.onCreate,ProfileSyncManager,LogoutManager, etc. The new methods are purely additive. Impact analysis via gitnexus showed HIGH risk on those existing APIs (28+ callers across all three apps), so I deliberately avoided changing their signatures.Test plan
disconnect(). See RelayConnectionForceReconnectTest.kt.:commontest suite: 508/508 passing, no regressions.:rider-app:compileDebugKotlin,:drivestr:compileDebugKotlin,:roadflare-rider:compileDebugKotlin.Notes for reviewer
disconnectAll()/connectAll()directly instead of adding new methods, but impact analysis showed they're load-bearing for app startup and logout flows (28+ callers). Surgical additive approach felt safer.CONNECT_TIMEOUT_MS(10s) + slack. If all relays are unreachable, the button re-enables so the user can retry rather than being locked out.setReconnectAttemptsForTestis@VisibleForTesting internal, used only by the new test to simulate prior backoff state without needing a real failing WebSocket.🤖 Generated with Claude Code