feat: repeater liveness indicator with relay stats (#662)#755
feat: repeater liveness indicator with relay stats (#662)#755efiten wants to merge 21 commits intoKpa-clawbot:masterfrom
Conversation
Kpa-clawbot
left a comment
There was a problem hiding this comment.
Protocol Engineer + Torvalds Review
Protocol Engineer Lens
Relay detection model — correct. The PR indexes relay activity from tx.ResolvedPath (full resolved pubkeys), not raw hop prefixes. This avoids the hash-collision noise at 1-byte prefixes with 2K+ nodes. Any packet where a node appears in the resolved path counts as relay activity for that node. The API gates output on role == "repeater", so companions/rooms/sensors don't get spurious relay metrics. Sound design.
Three-state model — matches the issue spec exactly:
relaying: heard within threshold ANDrelay_count_24h > 0active(idle): heard within threshold ANDrelay_count_24h == 0stale: beyond silent threshold (relay count irrelevant)
Stale takes precedence over relay activity — correct, because a repeater that hasn't been heard at all shouldn't show green just because it had old path appearances.
Performance — O(log n) at query time, O(log n) on insert. Sorted slices with binary search. No per-request full scans of byPathHop. Relay metrics are computed inline during GetBulkHealth/GetNodeHealth which already hold s.mu.RLock — no extra lock needed. Acceptable.
Torvalds Lens
The code is clean and does what it says. Three focused functions (addTxToRelayTimeIndex, removeFromRelayTimeIndex, relayMetrics), each under 30 lines. Binary search insert/remove is textbook. Test coverage is thorough — 16 Go tests, 10 frontend tests. No complaints about the core logic.
Now, what's wrong:
Must-Fix
1. removeFromRelayTimeIndex over-allocates on removal
newSlice := make([]int64, 0, len(slice)-1)
newSlice = append(newSlice, slice[:i]...)
newSlice = append(newSlice, slice[i+1:]...)
idx[pk] = newSliceThis allocates a brand new slice on every single removal. Compare with removeTxFromPathHopIndex right next door which does append(slice[:i], slice[i+1:]...) — zero allocation, in-place. There's no aliasing risk here since the slice lives only in this map. Use the same pattern for consistency and to avoid pointless GC pressure during eviction (which removes thousands of packets at once).
2. Gratuitous HEALTH_THRESHOLDS → window.HEALTH_THRESHOLDS change in roles.js
- var staleMs = isInfra ? HEALTH_THRESHOLDS.infraSilentMs : ...
+ var staleMs = isInfra ? window.HEALTH_THRESHOLDS.infraSilentMs : ...HEALTH_THRESHOLDS is already in closure scope within the IIFE. Changing to window.HEALTH_THRESHOLDS is unnecessary and subtly changes the lookup path. If this was needed to make the test file work (it accesses ctx.window.HEALTH_THRESHOLDS), that's the test adapting the production code to its needs — not the other way around. Revert and fix the test instead.
3. 1,222 lines of plan/spec docs committed to the repo
docs/superpowers/plans/2026-04-15-repeater-liveness.md (1,059 lines) and docs/superpowers/specs/2026-04-15-repeater-liveness-design.md (163 lines) are process artifacts. The plan literally contains the full code of every file that was written — it's a build script, not documentation. These will rot immediately and add review noise. The PR description and issue #662 already capture the design. Remove both files from this PR.
4. relayMetrics returns last_relayed for entries older than 24h
When relayTimes[pk] has entries but all are older than 24h, relayMetrics returns count1h=0, count24h=0 but lastRelayed=<old timestamp>. The API then includes last_relayed in the response even though both counts are zero. This is arguably useful info ("last relayed 3 days ago") but the behavior should be documented with a comment, and the frontend explanation text should handle this case — currently if count24h == 0 the node shows "Idle" with no mention of when it last relayed, even though the data is available.
5. addTxToRelayTimeIndex re-parses FirstSeen as RFC3339 on every call
time.Parse(time.RFC3339, tx.FirstSeen) is called for every packet during buildPathHopIndex (full store rebuild). The same string is parsed elsewhere in the codebase. Consider caching the parsed millis on StoreTx (or using an existing cached field if one exists) to avoid redundant parsing during bulk rebuilds with 30K+ packets.
6. Frontend template indentation is broken in the relay stats rows
The <tr> tags inside the conditional repeater block sit at column 0 inside a deeply-nested template literal:
${si.role === 'repeater' ? `
<tr><td>Relay (1h)</td>...
<tr><td>Relay (24h)</td>...
` : ''}This produces inconsistent HTML indentation in the output. Match the surrounding indentation level ( <tr>...).
7. Test TestRelayTimesWiredIntoIngest has a weak assertion
if relayKeys == 0 {
t.Fatalf("relayTimes not populated: byPathHop has %d keys but relayTimes has 0", hopKeys)
}This only checks "non-zero." It should verify that relayKeys <= hopKeys (relay keys are a subset of path-hop keys, since relay only indexes full pubkeys while path-hop also indexes raw prefixes). A simple bounds assertion adds confidence the wiring is correct.
f90b128 to
09c1fe0
Compare
PR #755 — Final Review: Repeater Liveness Three-State IndicatorReviewers: Protocol Engineer · Code Quality Engineer Overall AssessmentGood feature — solves a real operator need (issue #662). The three-state model ( 🔴 Must-Fix1. RFC3339 re-parsed on every
|
| Spec Item | Status | Notes |
|---|---|---|
Milestone 1: relay_count_1h / relay_count_24h in API |
✅ | Both /api/nodes and /api/nodes/{pubkey}/health |
Milestone 1: last_relayed timestamp |
✅ | Present but see item #4 above |
| Milestone 2: Three-state indicator (🟢/🟡/⚪) | ✅ | Implemented in list + detail pane |
| Milestone 2: Relay counts in detail pane | ✅ | Both 1h and 24h shown |
| Milestone 3: Repeater health dashboard | ❌ Not in scope | Spec marked this as "optional" — fine to defer |
| Spec limitation #4: Performance of time-windowed counts | Addressed via sorted index + binary search, but RFC3339 parsing on rebuild is the new bottleneck (item #1) | |
| No plan/spec docs committed to repo | ✅ | Clean — no markdown spec files in the diff |
Summary
4 must-fix items, 3 should-fix. The feature design is correct and well-tested. The issues are all implementation-level: a hot-path performance concern (RFC3339 parsing under lock), an allocation pattern in removal, a lock scope that's wider than necessary, and an ambiguous API behavior. None require rearchitecting — they're all targeted fixes.
Must-fix items: - addTxToRelayTimeIndex now accepts pre-parsed millis to avoid 30K+ RFC3339 re-parses during bulk startup rebuild under the write lock - removeFromRelayTimeIndex uses copy+reslice instead of append to avoid per-eviction heap allocation proportional to path length - handleNodes snapshots relay times under RLock then releases before enrichment, keeping the lock scope to a quick map copy rather than the full node loop - Suppress last_relayed in API response when both counts are zero across all three endpoints (handleNodes, bulk health, single-node health) Should-fix items: - Add NOTE comment on HEALTH_THRESHOLDS mirror in test-repeater-liveness.js - Flatten multiline repeater template block into per-row inline conditionals in both table and dl views to match avgSnr/avgHops style Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Must-fix items: - addTxToRelayTimeIndex now accepts pre-parsed millis to avoid 30K+ RFC3339 re-parses during bulk startup rebuild under the write lock - removeFromRelayTimeIndex uses copy+reslice instead of append to avoid per-eviction heap allocation proportional to path length - handleNodes snapshots relay times under RLock then releases before enrichment, keeping the lock scope to a quick map copy rather than the full node loop - Suppress last_relayed in API response when both counts are zero across all three endpoints (handleNodes, bulk health, single-node health) Should-fix items: - Add NOTE comment on HEALTH_THRESHOLDS mirror in test-repeater-liveness.js - Flatten multiline repeater template block into per-row inline conditionals in both table and dl views to match avgSnr/avgHops style Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kpa-clawbot
left a comment
There was a problem hiding this comment.
Final adversarial review (REVIEW-ONLY — fork, no push access)
Status: 🟡 Approve-after-rebase. No new substantive changes required.
950b56b cleanly addresses the prior CHANGES_REQUESTED review (must-fix items 1–4, should-fix 5–7 — all confirmed in diff). Targeted Go tests pass (go test -run "Repeater|Liveness|Relay" — all green, incl. TestAddTxToRelayTimeIndex_LowercasesKey, TestGetBulkHealthRepeaterRelayFields, TestTouchRelayLastSeen_Debouncing). Frontend tests pass (546 passed, 0 failed; test-repeater-liveness.js 10 passed).
🚧 Conflict scope (mergeStateStatus = DIRTY)
Dry-run rebase against origin/master aborted at commit 3/20 (feat(store): add relayTimes index and relay metrics functions (#662) — i.e. PR-internal commit).
Conflicting files: cmd/server/store.go only.
Root cause is #806 (perf(#800): remove per-StoreTx ResolvedPath, replace with membership index + on-demand decode) — which deleted the StoreTx.ResolvedPath and StoreObs.ResolvedPath fields this PR's addTxToRelayTimeIndex / removeFromRelayTimeIndex iterate over (store.go lines 2519, 2526, 2541, 2548).
Rebase recipe for the author:
git fetch origin && git rebase origin/master- In
cmd/server/store.go, replacetx.ResolvedPath/obs.ResolvedPathiteration with the new on-demand decode helper (see howextractResolvedPubkeys/ the new resolved-path index are used inIngestNewObservationsandbyPathHopupdates on master). - Re-run
go test ./cmd/server/...andnode test-repeater-liveness.js. - Likely also touch
neighbor_persist.goif it referenced the removed field — verify.
No other files conflict (routes.go, nodes.js, roles.js, style.css, tests are clean).
Per-expert verdicts
| Persona | Verdict | Notes |
|---|---|---|
| mesh-operator | ✅ ok | Three-state (relaying/active/stale) matches field monitoring needs; "stale beats relay" precedence is right — a silent repeater shouldn't show green from old paths. |
| meshcore | ✅ ok | Indexing from full ResolvedPath (not 1-byte hop prefixes) correctly avoids hash-collision noise at 2K nodes. Role-gated to repeaters. Idempotent insert dedups multi-observer hits of same packet. |
| dijkstra | ✅ ok | Sorted-millis slice + binary search is correct; idempotent insert prevents dup; remove is symmetric. Window boundaries are now − 1h/now − 24h half-open, consistent. |
| carmack | ✅ ok | Hot-path parse hoisted out (must-fix #1 done); remove is now in-place copy (must-fix #2 done); handleNodes snapshots under RLock then enriches outside (must-fix #3 done). |
| torvalds | ✅ ok | Three small functions, ~30 LOC each. Cohesive. |
| tufte | 🟢 nit (non-blocking) | Three-state is graded, not just binary — good. Confirm color choices remain WCAG-AA after rebase; not changed in this PR. |
| djb | ✅ ok | Counters derived from already-validated resolved pubkeys; no new attacker-controllable strings into SQL/log. Spoof surface = same as existing path indexing. |
Smoke recipe (after rebase)
cd cmd/server && go test ./... -count=1 -timeout 180snode test-frontend-helpers.js && node test-repeater-liveness.jscurl -s localhost:8080/api/nodes | jq '[.[] | select(.role=="Repeater") | {pk: .public_key, s: .stats}] | .[0:3]'— verifyrelay_count_1h/relay_count_24hpresent andlast_relayedonly present when a count > 0.- UI: nodes list — repeater rows show 🟢/🟡/⚪; non-repeaters unaffected.
Final go/no-go
Go — recommend merge after a clean rebase against current master. No further substantive changes requested from this seat.
store.go and neighbor_persist.go from upstream/master removed relayTimes/relayMetrics which PR Kpa-clawbot#755 (not yet upstream) still requires. Restore fork versions until Kpa-clawbot#755 is merged upstream. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pstream resolved-path index Upstream merge dropped relayTimes/relayMetrics needed by PR Kpa-clawbot#755 (repeater liveness, not yet upstream). Restore relay functions while keeping upstream's new resolvedPubkeyIndex fields required by resolved_index.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…veness compat ResolvedPath on StoreTx/StoreObs needed by neighbor_persist.go (PR Kpa-clawbot#755). whereClause variable rename was incomplete — fix to use correct name. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…awbot#662) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…es (Kpa-clawbot#662) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d GetNodeHealth test (Kpa-clawbot#662)
…detail pane (Kpa-clawbot#662) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- removeFromRelayTimeIndex: use in-place append instead of extra allocation - roles.js: revert window.HEALTH_THRESHOLDS to bare HEALTH_THRESHOLDS; fix test to expose bare global in VM context - docs: remove plan/spec process artifacts from repo - relayMetrics: document last_relayed behavior for entries older than 24h - addTxToRelayTimeIndex: add comment explaining FirstSeen parse rationale - nodes.js: show last relayed time in idle repeater explanation when available - nodes.js: fix relay stats template indentation in both detail pane blocks - TestRelayTimesWiredIntoIngest: add relayKeys <= hopKeys bounds assertion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Kpa-clawbot#662) backfillResolvedPathsAsync was updating tx.ResolvedPath via pickBestObservation but never updating relayTimes. On a fresh deploy (or any instance where resolved_path was NULL in the DB), all repeaters showed as Idle despite active relay traffic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ats (Kpa-clawbot#662) Two bugs: 1. addTxToRelayTimeIndex only used tx.ResolvedPath (best observation). If the best observer received a packet directly while another observer received it via a repeater, that repeater's relay activity was invisible. Fix: scan ALL observations' resolved paths. Insert is now idempotent so the function can be called multiple times safely. removeFromRelayTimeIndex updated symmetrically to avoid orphaned entries on eviction. Also add an else branch in pollAndMerge so new observations on existing packets (path unchanged) still update relay times immediately. 2. renderDetail/loadFullNode called getStatusInfo(n) where n = data.node (from /api/nodes/{pubkey}, no relay stats). Relay counts and status were always undefined / Idle. Fix: assign n.stats = stats (from the health endpoint) before calling getStatusInfo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pa-clawbot#662) New 'St.' column after Last Seen shows 🟢/🟡/⚪ for each node. Hovering shows full context: 'Relaying — Relayed 2592 packets in last 24h, last 26s ago' (or 'Idle — Alive but no relay traffic...', 'Stale — Not heard for...'). Also replaces the manual status/class computation in renderRows with a single getStatusInfo() call for consistency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The nodes list was calling getStatusInfo(n) on nodes that had no stats (the list endpoint only returned DB data). This caused all repeaters to show as yellow/idle in the overview even when actively relaying. Fix: handleNodes in routes.go now enriches each repeater node with relay_count_1h, relay_count_24h, and last_relayed from the in-memory relayTimes index, matching exactly what the single-node health endpoint returns. The status filter also updated to use getStatusInfo() so relay state is considered when filtering active/stale. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
buildPathHopIndex was indexing relay timestamps for all historical packets. On a live instance with 2.8M observations this allocated ~14 GB causing the OOM killer to terminate the server process. relayMetrics only queries 1h and 24h windows so any timestamp older than 24h is useless in the index. Skip addTxToRelayTimeIndex for packets older than 24h during the initial build. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vent OOM Loading all 366K transmissions and 2.8M observations at startup caused the server to consume 14+ GB and get OOM-killed on the live instance. The Load() SQL now adds a WHERE clause filtering to the configured retentionHours window when set. With retentionHours=168 (7 days), only ~7 days of packets are loaded into memory — sufficient for all in-memory queries. Historical data remains in SQLite and is available via DB queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Must-fix items: - addTxToRelayTimeIndex now accepts pre-parsed millis to avoid 30K+ RFC3339 re-parses during bulk startup rebuild under the write lock - removeFromRelayTimeIndex uses copy+reslice instead of append to avoid per-eviction heap allocation proportional to path length - handleNodes snapshots relay times under RLock then releases before enrichment, keeping the lock scope to a quick map copy rather than the full node loop - Suppress last_relayed in API response when both counts are zero across all three endpoints (handleNodes, bulk health, single-node health) Should-fix items: - Add NOTE comment on HEALTH_THRESHOLDS mirror in test-repeater-liveness.js - Flatten multiline repeater template block into per-row inline conditionals in both table and dl views to match avgSnr/avgHops style Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
950b56b to
0c35da5
Compare
…e maxPackets fallback relay_liveness_test.go used StoreTx.ResolvedPath (removed by Kpa-clawbot#800) and called addTxToRelayTimeIndex as a standalone function — both no longer existed after the rebase. Fix by extracting relayIndexInsertPaths / relayIndexRemovePaths as package-level helpers that operate on map[string][]int64 directly; store methods delegate to them. Tests rewritten to call helpers without StoreTx involvement. Also restores the maxPackets whereClause fallback dropped when retentionHours support was added, fixing TestBoundedLoad_LimitedMemory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
page.$() captures an element handle at query time. When the nodes page WebSocket auto-refresh fires between the querySelector and the .click() call, the table is re-rendered and the element is detached, causing "Element is not attached to the DOM". Replace both occurrences with page.click(selector), which re-queries the DOM at click time and retries until the element is stable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Final Review: LGTM after rebaseConflicts to resolve: Once rebased, this is ready to merge. Positives
Non-blocking nits
|
Summary
relayTimesin-memory index (sorted unix-millis per repeater pubkey), maintained in lockstep withbyPathHop. Populated at startup from all packet observations (not just best), updated on ingest/evict/backfill. Exposesrelay_count_1h,relay_count_24h,last_relayedin both/api/nodes(for repeaters) and/api/nodes/{pubkey}/health.getNodeStatusextended to three-state (relaying/active/stale) for repeaters based on relay_count_24h.getStatusInfois the single source of truth for status label, explanation, and relay stats. Detail pane shows relay counts and last relayed time. Nodes list gets a status emoji column with hover tooltip showing relay info.Changes
cmd/server/store.gorelayTimes map[string][]int64field added toPacketStoreaddTxToRelayTimeIndex/removeFromRelayTimeIndex: scan all observations, idempotent sorted insert, lowercase keysrelayMetrics(times, nowMs): returns(count1h, count24h, lastRelayed)buildPathHopIndex: populatesrelayTimesat startuppollAndMerge: updates relay index on ingest and eviction; newelsebranch for path-unchanged observationsaddTxToPathHopIndex/removeTxFromPathHopIndex: lowercase resolved pubkeys (fixes casing mismatch with lookup)cmd/server/routes.goGetBulkHealth/GetNodeHealth: include relay stats for repeater nodeshandleNodes: enriches repeater nodes with relay stats fromrelayTimesso list view has same data as detail panecmd/server/neighbor_persist.gobackfillResolvedPathsAsync: callsaddTxToRelayTimeIndexafterpickBestObservationto capture newly resolved pubkeyspublic/roles.jsgetNodeStatus(role, lastSeenMs, relayCount24h): three-state logic for repeatersgetStatusInfo(n): single source of truth returning status, label, explanation, relay counts, last relayedpublic/nodes.jsn.statspopulated from health endpoint beforegetStatusInfocallgetStatusInfoTests
relay_liveness_test.go: index functions, relay metrics, wiring integration, bulk/single health endpointstest-repeater-liveness.js: three-state frontend logic, backward compatTest plan
go test ./...passesnode test-repeater-liveness.jspasses🤖 Generated with Claude Code