fix(relay): force-reconnect path survives backoff + stale sockets (#86) by variablefate · Pull Request #87 · variablefate/ridestr

variablefate · 2026-05-14T08:07:19Z

Summary

Fixes #86 — the critical "Reconnect to Relays" bug where pressing the button could leave the app permanently unable to connect, with subsequent presses, force-close, and even cache clear failing to recover.

After a deep dive I identified five compounding bugs in the manual-reconnect path. No single one explains every symptom, but together they create the "broken forever" experience the reporter saw:

No pingInterval on the OkHttp client — NAT/wifi transitions can silently drop the WebSocket while the client still thinks it's CONNECTED. CashuWebSocket sets a 30s ping interval; RelayManager didn't. Most likely explains the "still couldn't connect after force-close + cache clear" symptom: the new process opened sockets, network silently ate them, and there was no ping to detect it.
disconnectAll(); connectAll() uses graceful WebSocket.close() — close initiates a close handshake that can hold an OkHttp dispatcher slot for up to readTimeout (30s). The immediate connectAll() afterwards starts new sockets that compete with the dying ones.
reconnectAttempts is only reset on a successful onOpen — after repeated auto-retry failures, backoff ramps to 60s. Manual presses that themselves fail just keep incrementing the counter.
ensureConnected() (the roadflare-rider button) only reconnects DISCONNECTED relays — a connection stuck in CONNECTING from a prior never-completing handshake stays stuck.
The button's "Reconnecting…" state was a fixed 2-second timer disconnected from actual connection state, and it leaked a MainScope() coroutine on every press.

What changed

A new dedicated force-reconnect path that does the right thing in one synchronized transition:

RelayConnection.forceReconnect() (common/.../RelayConnection.kt): cancels (not closes) the prior socket for immediate teardown, resets reconnectAttempts to 0, bumps connectionGeneration so stale callbacks are filtered, and opens a new socket.
RelayManager.forceReconnectAll(newRelayUrls: List<String>? = null) (common/.../RelayManager.kt): fan-out to every connection, plus optional relay-list reconciliation so user edits to custom relays take effect without an app restart.
NostrService.forceReconnect() (common/.../NostrService.kt): thin facade.
OkHttp client now sends 30s pings (common/.../RelayConfig.kt + RelayManager) — matches the existing CashuWebSocket setup.
RelayManagementScreen (common/.../RelayManagementScreen.kt) — button state now derives from connectionStates (clears on success, hard-capped at 15s), uses LaunchedEffect instead of MainScope() so no coroutine leak.

All three apps' "Reconnect" handlers (rider-app, drivestr, roadflare-rider) migrated to the new path.

Backward compatibility

disconnectAll/connectAll/ensureConnected are preserved unchanged — they're still called from MainActivity.onCreate, ProfileSyncManager, LogoutManager, etc. The new methods are purely additive. Impact analysis via gitnexus showed HIGH risk on those existing APIs (28+ callers across all three apps), so I deliberately avoided changing their signatures.

Test plan

Unit: 4 new tests cover the force-reconnect contract — reset backoff, bump generation, transition to CONNECTING, recover after disconnect(). See RelayConnectionForceReconnectTest.kt.
Full :common test suite: 508/508 passing, no regressions.
All three apps compile: :rider-app:compileDebugKotlin, :drivestr:compileDebugKotlin, :roadflare-rider:compileDebugKotlin.
Manual smoke (reviewer): tap "Reconnect to Relays" — should drop and re-establish all relays within ~5s. Toggle airplane mode for 10s then back, observe that pings detect the dropped sockets and recover (vs. the previous behavior where they'd stay "CONNECTED" indefinitely).
Manual smoke (reviewer): with custom relays configured, add a new relay then tap reconnect — the new relay should connect without needing app restart.

Notes for reviewer

I considered changing disconnectAll()/connectAll() directly instead of adding new methods, but impact analysis showed they're load-bearing for app startup and logout flows (28+ callers). Surgical additive approach felt safer.
The 15s hard-cap on the button's "Reconnecting…" state is arbitrary but bounded by CONNECT_TIMEOUT_MS (10s) + slack. If all relays are unreachable, the button re-enables so the user can retry rather than being locked out.
setReconnectAttemptsForTest is @VisibleForTesting internal, used only by the new test to simulate prior backoff state without needing a real failing WebSocket.

🤖 Generated with Claude Code

#86) The "Reconnect to Relays" button could leave the app permanently unable to connect after a single press, with subsequent presses and even app restarts failing to recover. Root cause was five compounding bugs in the manual-reconnect path: 1. No WebSocket pingInterval — NAT/wifi transitions silently dropped sockets while the client thought they were still CONNECTED. 2. disconnectAll() used graceful WebSocket.close(), which can hold an OkHttp dispatcher slot for up to readTimeout (30s) while the close handshake races with the immediate connectAll() that followed. 3. reconnectAttempts only reset on a successful onOpen, so repeated auto-retry failures ramped backoff to 60s — manual presses that themselves failed just kept incrementing the counter. 4. ensureConnected() (used by the roadflare-rider button) only reconnects DISCONNECTED relays, so a connection stuck in CONNECTING couldn't recover. 5. The button's "Reconnecting…" state was a fake 2s timer disconnected from actual connection state, and it leaked a MainScope coroutine on every press. This adds a dedicated forceReconnect path: - RelayConnection.forceReconnect(): single synchronized transition that cancels (not closes) the prior socket for immediate teardown, resets reconnectAttempts to 0, bumps connectionGeneration so stale callbacks from the prior socket are filtered, and opens a new socket. - RelayManager.forceReconnectAll(newRelayUrls: List<String>?): fan-out to every connection, plus optional relay-list sync so user edits to custom relays take effect without an app restart. - NostrService.forceReconnect(): thin facade. - OkHttp client now sends 30s pings (matches CashuWebSocket). - RelayManagementScreen button state now derives from connectionStates rather than a fixed timer, and uses LaunchedEffect rather than MainScope so no coroutine leaks. All three apps' "Reconnect" handlers (rider-app, drivestr, roadflare-rider) are migrated to the new path. Public API of disconnectAll/connectAll/ensureConnected is preserved — they're still called from MainActivity onCreate, ProfileSyncManager, LogoutManager, etc. New methods are additive. Tests: 4 new unit tests cover the forceReconnect contract (reset backoff, bump generation, transition to CONNECTING, recover after disconnect). Full common suite: 508/508 passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

variablefate · 2026-05-14T08:16:32Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

5 parallel reviewers raised 8 candidate findings (concurrency in forceReconnect(), torn-Long log read, LaunchedEffect key choice, ConcurrentHashMap iterator weakness, new-relay subscription seeding, button-spinner clearing, main-thread settings read, per-symbol impact-analysis documentation). After confidence scoring, none cleared the 80 threshold — most were pre-existing patterns that the PR preserves intentionally, false-positive theories that don't hold against OkHttp's documented non-blocking newWebSocket() contract, or workflow guidance not auditable post-hoc on merged code.

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

Address two findings flagged by the self-review on PR #87 that scored just below the strict-merge threshold but were verified as real: 1. Subscription-seeding gap on dynamically-added relays --------------------------------------------------- `RelayManager.addRelay()` created a new `RelayConnection` and added it to `connections`, but never pushed the manager-level `subscriptions` registry to it. `subscribe()` only fans out to connections that existed at call time, and `RelayConnection.resubscribeAll()` only sends entries from the connection's own local map. Result: a relay added mid-session (which the new `forceReconnectAll(newRelayUrls)` path now does when the user edits custom relays then taps Reconnect) would open CONNECTED but receive zero events — a silent data-loss footgun. Fix: in `addRelay()`, after inserting the new connection, push every active `subscriptions[*]` entry into it. At construction time this is a no-op because `subscriptions` is empty. 2. Spinner-clears-too-early race on partial connectivity ----------------------------------------------------- `connection.state` collectors run on `Dispatchers.IO` (RelayManager line 49). The button onClick runs on Main: it writes `reconnectInitiatedAt`, calls `forceReconnect()` (which mutates each `RelayConnection._state.value = CONNECTING` synchronously), then returns. Compose then recomposes for the `reconnectInitiatedAt` change BEFORE the IO→Main hop has propagated the new state through `collectAsState()`. The LaunchedEffect fires with stale `connectedCount > 0`, clears `reconnectInitiatedAt`, and the spinner never appears. Fix: replace the single key+early-clear pattern with a two-phase `snapshotFlow` wait keyed only on `reconnectInitiatedAt`: - Phase 1: wait for `connectedCount` to reach 0 (absorbs the IO→Main propagation window — without this, pre-existing CONNECTED state would let Phase 2 resolve instantly). - Phase 2: wait for `connectedCount > 0` (reconnect succeeded) or hit the 15s hard cap. `rememberUpdatedState` exposes the parameter to the snapshot tracker so the flow observes recomposition updates without re-keying the effect. Tests: 3 new `RelayManagerAddRelayTest` cases covering subscription seeding under three conditions (existing subs at addRelay time, empty subs at construction time, end-to-end via forceReconnectAll). Total common suite: 511/511 passing. Test-only accessors added: - `RelayConnection.activeSubscriptionIdsForTest()` - `RelayManager.connectionForTest(url)` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

variablefate · 2026-05-14T15:52:04Z

Self-review follow-up — both items verified and fixed

Pushed 8492fb0. Closes the two findings the prior scoring pass left at 70/55 confidence (below the 80 merge bar), now confirmed as real bugs.

1. Subscription-seeding gap — verified by re-reading RelayManager.addRelay: the new RelayConnection was added to connections but never received the manager-level subscriptions registry. A relay added mid-session via the new forceReconnectAll(newRelayUrls) path would open CONNECTED and silently deliver zero events. Fixed in addRelay — at construction time the loop is a no-op (no subs yet); mid-session it seeds every active subscription into the new connection's local map so resubscribeAll() actually sends them on socket open.

2. Spinner-clears-too-early race — verified by tracing dispatchers: connection.state.collect runs on Dispatchers.IO while the button onClick is on Main. After forceReconnect() mutates each state, Compose can recompose for the reconnectInitiatedAt write before the IO→Main hop propagates the new connectionStates. Old LaunchedEffect saw stale connectedCount > 0 and cleared instantly. Fixed with a two-phase snapshotFlow wait keyed only on reconnectInitiatedAt: Phase 1 waits for connectedCount == 0 (absorbs the propagation window), Phase 2 waits for > 0 or the 15s cap. rememberUpdatedState exposes the parameter to the snapshot tracker.

Tests: 3 new cases in RelayManagerAddRelayTest covering the seeding contract end-to-end. Full common suite: 511/511 passing.

🤖 Generated with Claude Code

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(relay): force-reconnect path survives backoff + stale sockets (#86)#87

fix(relay): force-reconnect path survives backoff + stale sockets (#86)#87
variablefate wants to merge 2 commits into
mainfrom
claude/cranky-curie-1d8700

variablefate commented May 14, 2026

Uh oh!

variablefate commented May 14, 2026

Uh oh!

variablefate commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

variablefate commented May 14, 2026

Summary

What changed

Backward compatibility

Test plan

Notes for reviewer

Uh oh!

variablefate commented May 14, 2026

Code review

Uh oh!

variablefate commented May 14, 2026

Self-review follow-up — both items verified and fixed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant