Skip to content

fix(relay): force-reconnect path survives backoff + stale sockets (#86)#87

Open
variablefate wants to merge 2 commits into
mainfrom
claude/cranky-curie-1d8700
Open

fix(relay): force-reconnect path survives backoff + stale sockets (#86)#87
variablefate wants to merge 2 commits into
mainfrom
claude/cranky-curie-1d8700

Conversation

@variablefate
Copy link
Copy Markdown
Owner

Summary

Fixes #86 — the critical "Reconnect to Relays" bug where pressing the button could leave the app permanently unable to connect, with subsequent presses, force-close, and even cache clear failing to recover.

After a deep dive I identified five compounding bugs in the manual-reconnect path. No single one explains every symptom, but together they create the "broken forever" experience the reporter saw:

  1. No pingInterval on the OkHttp client — NAT/wifi transitions can silently drop the WebSocket while the client still thinks it's CONNECTED. CashuWebSocket sets a 30s ping interval; RelayManager didn't. Most likely explains the "still couldn't connect after force-close + cache clear" symptom: the new process opened sockets, network silently ate them, and there was no ping to detect it.
  2. disconnectAll(); connectAll() uses graceful WebSocket.close() — close initiates a close handshake that can hold an OkHttp dispatcher slot for up to readTimeout (30s). The immediate connectAll() afterwards starts new sockets that compete with the dying ones.
  3. reconnectAttempts is only reset on a successful onOpen — after repeated auto-retry failures, backoff ramps to 60s. Manual presses that themselves fail just keep incrementing the counter.
  4. ensureConnected() (the roadflare-rider button) only reconnects DISCONNECTED relays — a connection stuck in CONNECTING from a prior never-completing handshake stays stuck.
  5. The button's "Reconnecting…" state was a fixed 2-second timer disconnected from actual connection state, and it leaked a MainScope() coroutine on every press.

What changed

A new dedicated force-reconnect path that does the right thing in one synchronized transition:

  • RelayConnection.forceReconnect() (common/.../RelayConnection.kt): cancels (not closes) the prior socket for immediate teardown, resets reconnectAttempts to 0, bumps connectionGeneration so stale callbacks are filtered, and opens a new socket.
  • RelayManager.forceReconnectAll(newRelayUrls: List<String>? = null) (common/.../RelayManager.kt): fan-out to every connection, plus optional relay-list reconciliation so user edits to custom relays take effect without an app restart.
  • NostrService.forceReconnect() (common/.../NostrService.kt): thin facade.
  • OkHttp client now sends 30s pings (common/.../RelayConfig.kt + RelayManager) — matches the existing CashuWebSocket setup.
  • RelayManagementScreen (common/.../RelayManagementScreen.kt) — button state now derives from connectionStates (clears on success, hard-capped at 15s), uses LaunchedEffect instead of MainScope() so no coroutine leak.

All three apps' "Reconnect" handlers (rider-app, drivestr, roadflare-rider) migrated to the new path.

Backward compatibility

disconnectAll/connectAll/ensureConnected are preserved unchanged — they're still called from MainActivity.onCreate, ProfileSyncManager, LogoutManager, etc. The new methods are purely additive. Impact analysis via gitnexus showed HIGH risk on those existing APIs (28+ callers across all three apps), so I deliberately avoided changing their signatures.

Test plan

  • Unit: 4 new tests cover the force-reconnect contract — reset backoff, bump generation, transition to CONNECTING, recover after disconnect(). See RelayConnectionForceReconnectTest.kt.
  • Full :common test suite: 508/508 passing, no regressions.
  • All three apps compile: :rider-app:compileDebugKotlin, :drivestr:compileDebugKotlin, :roadflare-rider:compileDebugKotlin.
  • Manual smoke (reviewer): tap "Reconnect to Relays" — should drop and re-establish all relays within ~5s. Toggle airplane mode for 10s then back, observe that pings detect the dropped sockets and recover (vs. the previous behavior where they'd stay "CONNECTED" indefinitely).
  • Manual smoke (reviewer): with custom relays configured, add a new relay then tap reconnect — the new relay should connect without needing app restart.

Notes for reviewer

  • I considered changing disconnectAll()/connectAll() directly instead of adding new methods, but impact analysis showed they're load-bearing for app startup and logout flows (28+ callers). Surgical additive approach felt safer.
  • The 15s hard-cap on the button's "Reconnecting…" state is arbitrary but bounded by CONNECT_TIMEOUT_MS (10s) + slack. If all relays are unreachable, the button re-enables so the user can retry rather than being locked out.
  • setReconnectAttemptsForTest is @VisibleForTesting internal, used only by the new test to simulate prior backoff state without needing a real failing WebSocket.

🤖 Generated with Claude Code

#86)

The "Reconnect to Relays" button could leave the app permanently unable
to connect after a single press, with subsequent presses and even app
restarts failing to recover. Root cause was five compounding bugs in the
manual-reconnect path:

1. No WebSocket pingInterval — NAT/wifi transitions silently dropped
   sockets while the client thought they were still CONNECTED.
2. disconnectAll() used graceful WebSocket.close(), which can hold an
   OkHttp dispatcher slot for up to readTimeout (30s) while the close
   handshake races with the immediate connectAll() that followed.
3. reconnectAttempts only reset on a successful onOpen, so repeated
   auto-retry failures ramped backoff to 60s — manual presses that
   themselves failed just kept incrementing the counter.
4. ensureConnected() (used by the roadflare-rider button) only reconnects
   DISCONNECTED relays, so a connection stuck in CONNECTING couldn't
   recover.
5. The button's "Reconnecting…" state was a fake 2s timer disconnected
   from actual connection state, and it leaked a MainScope coroutine on
   every press.

This adds a dedicated forceReconnect path:

  - RelayConnection.forceReconnect(): single synchronized transition that
    cancels (not closes) the prior socket for immediate teardown,
    resets reconnectAttempts to 0, bumps connectionGeneration so stale
    callbacks from the prior socket are filtered, and opens a new socket.
  - RelayManager.forceReconnectAll(newRelayUrls: List<String>?):
    fan-out to every connection, plus optional relay-list sync so user
    edits to custom relays take effect without an app restart.
  - NostrService.forceReconnect(): thin facade.
  - OkHttp client now sends 30s pings (matches CashuWebSocket).
  - RelayManagementScreen button state now derives from connectionStates
    rather than a fixed timer, and uses LaunchedEffect rather than
    MainScope so no coroutine leaks.

All three apps' "Reconnect" handlers (rider-app, drivestr, roadflare-rider)
are migrated to the new path.

Public API of disconnectAll/connectAll/ensureConnected is preserved —
they're still called from MainActivity onCreate, ProfileSyncManager,
LogoutManager, etc. New methods are additive.

Tests: 4 new unit tests cover the forceReconnect contract (reset
backoff, bump generation, transition to CONNECTING, recover after
disconnect). Full common suite: 508/508 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@variablefate
Copy link
Copy Markdown
Owner Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

5 parallel reviewers raised 8 candidate findings (concurrency in forceReconnect(), torn-Long log read, LaunchedEffect key choice, ConcurrentHashMap iterator weakness, new-relay subscription seeding, button-spinner clearing, main-thread settings read, per-symbol impact-analysis documentation). After confidence scoring, none cleared the 80 threshold — most were pre-existing patterns that the PR preserves intentionally, false-positive theories that don't hold against OkHttp's documented non-blocking newWebSocket() contract, or workflow guidance not auditable post-hoc on merged code.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Address two findings flagged by the self-review on PR #87 that scored
just below the strict-merge threshold but were verified as real:

1. Subscription-seeding gap on dynamically-added relays
   ---------------------------------------------------
   `RelayManager.addRelay()` created a new `RelayConnection` and added it
   to `connections`, but never pushed the manager-level `subscriptions`
   registry to it. `subscribe()` only fans out to connections that
   existed at call time, and `RelayConnection.resubscribeAll()` only
   sends entries from the connection's own local map. Result: a relay
   added mid-session (which the new `forceReconnectAll(newRelayUrls)`
   path now does when the user edits custom relays then taps Reconnect)
   would open CONNECTED but receive zero events — a silent data-loss
   footgun.

   Fix: in `addRelay()`, after inserting the new connection, push every
   active `subscriptions[*]` entry into it. At construction time this is
   a no-op because `subscriptions` is empty.

2. Spinner-clears-too-early race on partial connectivity
   -----------------------------------------------------
   `connection.state` collectors run on `Dispatchers.IO` (RelayManager
   line 49). The button onClick runs on Main: it writes
   `reconnectInitiatedAt`, calls `forceReconnect()` (which mutates each
   `RelayConnection._state.value = CONNECTING` synchronously), then
   returns. Compose then recomposes for the `reconnectInitiatedAt`
   change BEFORE the IO→Main hop has propagated the new state through
   `collectAsState()`. The LaunchedEffect fires with stale
   `connectedCount > 0`, clears `reconnectInitiatedAt`, and the spinner
   never appears.

   Fix: replace the single key+early-clear pattern with a two-phase
   `snapshotFlow` wait keyed only on `reconnectInitiatedAt`:
     - Phase 1: wait for `connectedCount` to reach 0 (absorbs the IO→Main
       propagation window — without this, pre-existing CONNECTED state
       would let Phase 2 resolve instantly).
     - Phase 2: wait for `connectedCount > 0` (reconnect succeeded) or
       hit the 15s hard cap.
   `rememberUpdatedState` exposes the parameter to the snapshot tracker
   so the flow observes recomposition updates without re-keying the
   effect.

Tests: 3 new `RelayManagerAddRelayTest` cases covering subscription
seeding under three conditions (existing subs at addRelay time, empty
subs at construction time, end-to-end via forceReconnectAll). Total
common suite: 511/511 passing.

Test-only accessors added:
- `RelayConnection.activeSubscriptionIdsForTest()`
- `RelayManager.connectionForTest(url)`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@variablefate
Copy link
Copy Markdown
Owner Author

Self-review follow-up — both items verified and fixed

Pushed 8492fb0. Closes the two findings the prior scoring pass left at 70/55 confidence (below the 80 merge bar), now confirmed as real bugs.

1. Subscription-seeding gap — verified by re-reading RelayManager.addRelay: the new RelayConnection was added to connections but never received the manager-level subscriptions registry. A relay added mid-session via the new forceReconnectAll(newRelayUrls) path would open CONNECTED and silently deliver zero events. Fixed in addRelay — at construction time the loop is a no-op (no subs yet); mid-session it seeds every active subscription into the new connection's local map so resubscribeAll() actually sends them on socket open.

2. Spinner-clears-too-early race — verified by tracing dispatchers: connection.state.collect runs on Dispatchers.IO while the button onClick is on Main. After forceReconnect() mutates each state, Compose can recompose for the reconnectInitiatedAt write before the IO→Main hop propagates the new connectionStates. Old LaunchedEffect saw stale connectedCount > 0 and cleared instantly. Fixed with a two-phase snapshotFlow wait keyed only on reconnectInitiatedAt: Phase 1 waits for connectedCount == 0 (absorbs the propagation window), Phase 2 waits for > 0 or the 15s cap. rememberUpdatedState exposes the parameter to the snapshot tracker.

Tests: 3 new cases in RelayManagerAddRelayTest covering the seeding contract end-to-end. Full common suite: 511/511 passing.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Critical: Relay 'Reconnect to Relays' button disconnects but never reconnects; persists across app restart and cache clear

1 participant