Skip to content

feat: repeater liveness indicator with relay stats (#662)#755

Open
efiten wants to merge 21 commits intoKpa-clawbot:masterfrom
efiten:feat/repeater-liveness-662
Open

feat: repeater liveness indicator with relay stats (#662)#755
efiten wants to merge 21 commits intoKpa-clawbot:masterfrom
efiten:feat/repeater-liveness-662

Conversation

@efiten
Copy link
Copy Markdown
Contributor

@efiten efiten commented Apr 15, 2026

Summary

  • Backend: adds relayTimes in-memory index (sorted unix-millis per repeater pubkey), maintained in lockstep with byPathHop. Populated at startup from all packet observations (not just best), updated on ingest/evict/backfill. Exposes relay_count_1h, relay_count_24h, last_relayed in both /api/nodes (for repeaters) and /api/nodes/{pubkey}/health.
  • Frontend: getNodeStatus extended to three-state (relaying / active / stale) for repeaters based on relay_count_24h. getStatusInfo is the single source of truth for status label, explanation, and relay stats. Detail pane shows relay counts and last relayed time. Nodes list gets a status emoji column with hover tooltip showing relay info.
  • Correctness fixes: relay index scans all observations per packet (not just best); backfill now updates relay index after resolving paths; pubkeys lowercased consistently throughout index.

Changes

cmd/server/store.go

  • relayTimes map[string][]int64 field added to PacketStore
  • addTxToRelayTimeIndex / removeFromRelayTimeIndex: scan all observations, idempotent sorted insert, lowercase keys
  • relayMetrics(times, nowMs): returns (count1h, count24h, lastRelayed)
  • buildPathHopIndex: populates relayTimes at startup
  • pollAndMerge: updates relay index on ingest and eviction; new else branch for path-unchanged observations
  • addTxToPathHopIndex / removeTxFromPathHopIndex: lowercase resolved pubkeys (fixes casing mismatch with lookup)

cmd/server/routes.go

  • GetBulkHealth / GetNodeHealth: include relay stats for repeater nodes
  • handleNodes: enriches repeater nodes with relay stats from relayTimes so list view has same data as detail pane

cmd/server/neighbor_persist.go

  • backfillResolvedPathsAsync: calls addTxToRelayTimeIndex after pickBestObservation to capture newly resolved pubkeys

public/roles.js

  • getNodeStatus(role, lastSeenMs, relayCount24h): three-state logic for repeaters
  • getStatusInfo(n): single source of truth returning status, label, explanation, relay counts, last relayed

public/nodes.js

  • Detail pane: n.stats populated from health endpoint before getStatusInfo call
  • Nodes list: status emoji column with relay hover tooltip; status filter uses getStatusInfo

Tests

  • relay_liveness_test.go: index functions, relay metrics, wiring integration, bulk/single health endpoints
  • test-repeater-liveness.js: three-state frontend logic, backward compat

Test plan

  • Repeater with recent relay traffic shows green relaying emoji in list and detail pane
  • Repeater with no relay traffic in 24h shows yellow idle in both views
  • Repeater not heard recently shows grey stale in both views
  • Non-repeater nodes unaffected (no relay stats, no status change)
  • Hover tooltip on list emoji shows relay count and last relayed time
  • go test ./... passes
  • node test-repeater-liveness.js passes

🤖 Generated with Claude Code

Copy link
Copy Markdown
Owner

@Kpa-clawbot Kpa-clawbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protocol Engineer + Torvalds Review

Protocol Engineer Lens

Relay detection model — correct. The PR indexes relay activity from tx.ResolvedPath (full resolved pubkeys), not raw hop prefixes. This avoids the hash-collision noise at 1-byte prefixes with 2K+ nodes. Any packet where a node appears in the resolved path counts as relay activity for that node. The API gates output on role == "repeater", so companions/rooms/sensors don't get spurious relay metrics. Sound design.

Three-state model — matches the issue spec exactly:

  • relaying: heard within threshold AND relay_count_24h > 0
  • active (idle): heard within threshold AND relay_count_24h == 0
  • stale: beyond silent threshold (relay count irrelevant)

Stale takes precedence over relay activity — correct, because a repeater that hasn't been heard at all shouldn't show green just because it had old path appearances.

Performance — O(log n) at query time, O(log n) on insert. Sorted slices with binary search. No per-request full scans of byPathHop. Relay metrics are computed inline during GetBulkHealth/GetNodeHealth which already hold s.mu.RLock — no extra lock needed. Acceptable.

Torvalds Lens

The code is clean and does what it says. Three focused functions (addTxToRelayTimeIndex, removeFromRelayTimeIndex, relayMetrics), each under 30 lines. Binary search insert/remove is textbook. Test coverage is thorough — 16 Go tests, 10 frontend tests. No complaints about the core logic.

Now, what's wrong:

Must-Fix

1. removeFromRelayTimeIndex over-allocates on removal

newSlice := make([]int64, 0, len(slice)-1)
newSlice = append(newSlice, slice[:i]...)
newSlice = append(newSlice, slice[i+1:]...)
idx[pk] = newSlice

This allocates a brand new slice on every single removal. Compare with removeTxFromPathHopIndex right next door which does append(slice[:i], slice[i+1:]...) — zero allocation, in-place. There's no aliasing risk here since the slice lives only in this map. Use the same pattern for consistency and to avoid pointless GC pressure during eviction (which removes thousands of packets at once).

2. Gratuitous HEALTH_THRESHOLDSwindow.HEALTH_THRESHOLDS change in roles.js

-    var staleMs = isInfra ? HEALTH_THRESHOLDS.infraSilentMs : ...
+    var staleMs = isInfra ? window.HEALTH_THRESHOLDS.infraSilentMs : ...

HEALTH_THRESHOLDS is already in closure scope within the IIFE. Changing to window.HEALTH_THRESHOLDS is unnecessary and subtly changes the lookup path. If this was needed to make the test file work (it accesses ctx.window.HEALTH_THRESHOLDS), that's the test adapting the production code to its needs — not the other way around. Revert and fix the test instead.

3. 1,222 lines of plan/spec docs committed to the repo

docs/superpowers/plans/2026-04-15-repeater-liveness.md (1,059 lines) and docs/superpowers/specs/2026-04-15-repeater-liveness-design.md (163 lines) are process artifacts. The plan literally contains the full code of every file that was written — it's a build script, not documentation. These will rot immediately and add review noise. The PR description and issue #662 already capture the design. Remove both files from this PR.

4. relayMetrics returns last_relayed for entries older than 24h

When relayTimes[pk] has entries but all are older than 24h, relayMetrics returns count1h=0, count24h=0 but lastRelayed=<old timestamp>. The API then includes last_relayed in the response even though both counts are zero. This is arguably useful info ("last relayed 3 days ago") but the behavior should be documented with a comment, and the frontend explanation text should handle this case — currently if count24h == 0 the node shows "Idle" with no mention of when it last relayed, even though the data is available.

5. addTxToRelayTimeIndex re-parses FirstSeen as RFC3339 on every call

time.Parse(time.RFC3339, tx.FirstSeen) is called for every packet during buildPathHopIndex (full store rebuild). The same string is parsed elsewhere in the codebase. Consider caching the parsed millis on StoreTx (or using an existing cached field if one exists) to avoid redundant parsing during bulk rebuilds with 30K+ packets.

6. Frontend template indentation is broken in the relay stats rows

The <tr> tags inside the conditional repeater block sit at column 0 inside a deeply-nested template literal:

          ${si.role === 'repeater' ? `
  <tr><td>Relay (1h)</td>...
  <tr><td>Relay (24h)</td>...
` : ''}

This produces inconsistent HTML indentation in the output. Match the surrounding indentation level ( <tr>...).

7. Test TestRelayTimesWiredIntoIngest has a weak assertion

if relayKeys == 0 {
    t.Fatalf("relayTimes not populated: byPathHop has %d keys but relayTimes has 0", hopKeys)
}

This only checks "non-zero." It should verify that relayKeys <= hopKeys (relay keys are a subset of path-hop keys, since relay only indexes full pubkeys while path-hop also indexes raw prefixes). A simple bounds assertion adds confidence the wiring is correct.

@efiten efiten changed the title feat: repeater liveness — three-state indicator and relay traffic metrics (#662) feat: repeater liveness indicator with relay stats (#662) Apr 16, 2026
@efiten efiten force-pushed the feat/repeater-liveness-662 branch from f90b128 to 09c1fe0 Compare April 16, 2026 16:07
@efiten efiten requested a review from Kpa-clawbot April 16, 2026 18:29
@Kpa-clawbot
Copy link
Copy Markdown
Owner

PR #755 — Final Review: Repeater Liveness Three-State Indicator

Reviewers: Protocol Engineer · Code Quality Engineer


Overall Assessment

Good feature — solves a real operator need (issue #662). The three-state model (relaying / active / stale) is correct per firmware behavior, the relayTimes index design is sound, and test coverage is solid. However, there are several issues ranging from a performance concern in a hot path to missing spec items. Listing in priority order.


🔴 Must-Fix

1. RFC3339 re-parsed on every addTxToRelayTimeIndex call during bulk rebuild

func addTxToRelayTimeIndex(idx map[string][]int64, tx *StoreTx) {
    ms, err := time.Parse(time.RFC3339, tx.FirstSeen)

buildPathHopIndex iterates every packet in the store and calls this function. On a store with 30K+ packets, that's 30K+ time.Parse(time.RFC3339, ...) calls — all during startup under the write lock. StoreTx already has no cached millis field (the comment in the code acknowledges this!), but calling it "cheap relative to binary-search inserts" is hand-waving. On startup rebuild the sorted-insert is O(1) amortized (appending in roughly chronological order), making the parse the dominant cost.

Fix: Cache the parsed millis on StoreTx (e.g. firstSeenMs int64, populated once at ingest/load), or at minimum parse once in the caller and pass millis as a parameter. This is exactly the kind of thing the repo's AGENTS.md rule #0 ("Performance is a feature") is about — no expensive work under locks without justification.

2. removeFromRelayTimeIndex over-allocates on removal via append(slice[:i], slice[i+1:]...)

idx[pk] = append(slice[:i], slice[i+1:]...)

This creates a new backing array on every removal. During EvictStale (which removes old packets in bulk), this fires for every evicted packet × every hop in its path. The existing removeTxFromPathHopIndex uses the removeTxFromSlice helper which does an in-place swap-delete. The relay index should do the same — shift in place and reslice:

copy(slice[i:], slice[i+1:])
slice = slice[:len(slice)-1]

Or use slices.Delete if on Go 1.21+. The current approach generates garbage proportional to eviction batch size × average path length.

3. handleNodes holds store.mu.RLock across the entire node enrichment loop

s.store.mu.RLock()
for _, node := range nodes {
    // ... EnrichNodeWithHashSize + relayMetrics per node
}
s.store.mu.RUnlock()

EnrichNodeWithHashSize was previously outside any store lock. Now it's inside RLock, meaning every /api/nodes request holds the store read lock for the duration of iterating + enriching ALL nodes. If there are 2K nodes, that's 2K iterations under lock blocking any writer (ingest, eviction).

Fix: Copy relayTimes data you need under the lock (a quick map snapshot of just the repeater keys), then release the lock and enrich outside it. EnrichNodeWithHashSize doesn't need the store lock at all — hashInfo is already a snapshot.

4. last_relayed returned even when both counts are zero — undocumented, potentially confusing

From the relayMetrics function comment:

last_relayed is always the most recent entry regardless of window age — callers receive it even when both counts are zero

The issue #662 spec says last_relayed is "timestamp of the most recent packet where this node appeared as a path hop." The PR returns it even when counts are 0/0 (e.g. last relay was 5 days ago). This is fine semantically but the frontend doesn't handle this case — getStatusInfo only checks lastRelayed truthiness, so a stale node could show "last relayed 5 days ago" without context that both counts are zero. The detail pane shows relay counts (which would be 0) and last relayed — this could confuse operators.

Fix: Either suppress last_relayed in the API response when both counts are zero (cleaner), or add explicit frontend handling showing "Last relayed 5d ago (none in 24h)" for clarity. Pick one — don't leave it ambiguous.


🟡 Should-Fix

5. HEALTH_THRESHOLDS exposed to global scope via test harness

In test-repeater-liveness.js:

ctx.HEALTH_THRESHOLDS = ctx.window.HEALTH_THRESHOLDS;

This is fine for testing, but the test relies on HEALTH_THRESHOLDS being a property of window — which it is, because roles.js assigns it there. The concern is that this pattern encourages treating the thresholds as a stable public API. If thresholds ever move to a config object, the test breaks silently. Minor, but worth a // NOTE: mirrors browser global; update if HEALTH_THRESHOLDS moves comment.

6. getStatusInfo calls getNodeStatus on every row render in the nodes list

The node list filter now calls getStatusInfo(n) per node during filtering:

filtered = filtered.filter(n => {
    const si = getStatusInfo(n);
    if (statusFilter === 'active') return si.status === 'active' || si.status === 'relaying';
    return si.status === statusFilter;
});

Then renderNodesTable calls getStatusInfo(n) again per row. That's 2× calls for filtered nodes. getStatusInfo isn't expensive today, but it does new Date() parsing and string building. With 2K nodes this is fine; just note it as a minor inefficiency to watch.

7. HTML template indentation is inconsistent in the detail pane additions

The new relay stats rows use inconsistent indentation compared to surrounding template literals:

          ${si.role === 'repeater' ? `
          <tr><td>Relay (1h)</td>...

The conditional block starts at the same indent level as the surrounding <tr> but the content inside has no additional indent. Compare with the existing avgSnr row which is inline. Minor readability issue — pick one style and be consistent.


🟢 Looks Good

  • Three-state model correctly implements the feat: repeater liveness — detect if repeater is actively relaying traffic #662 spec: relaying (heard + relay > 0), active/idle (heard + relay == 0), stale (not heard). Firmware behavior matches — repeaters prepend their hash when forwarding, so appearing in resolved_path = evidence of relay activity. ✅
  • Relay index design is sound: sorted millis slices with binary search for O(log n) windowed counting. Idempotent inserts prevent duplicates. Scanning all observations (not just best) is correct — relay hops appear in non-best observations too. ✅
  • Lowercase normalization throughout the relay index fixes the casing mismatch that would have silently dropped relay counts. Good catch. ✅
  • Backfill integration in neighbor_persist.go correctly re-indexes after path resolution. The idempotency guarantee makes this safe. ✅
  • Test coverage is thorough — 340 lines of Go tests covering index CRUD, metrics, wiring, bulk/single health endpoints, case normalization, role gating. Frontend tests cover all three states + backward compat. ✅
  • Non-repeater nodes unaffected — role gate in both backend and frontend ensures companions/rooms/sensors don't get relay fields. ✅
  • Backward compatibilitygetNodeStatus with 2 args still works (relay arg undefined → active for heard repeaters). ✅

Spec Compliance vs Issue #662

Spec Item Status Notes
Milestone 1: relay_count_1h / relay_count_24h in API Both /api/nodes and /api/nodes/{pubkey}/health
Milestone 1: last_relayed timestamp Present but see item #4 above
Milestone 2: Three-state indicator (🟢/🟡/⚪) Implemented in list + detail pane
Milestone 2: Relay counts in detail pane Both 1h and 24h shown
Milestone 3: Repeater health dashboard ❌ Not in scope Spec marked this as "optional" — fine to defer
Spec limitation #4: Performance of time-windowed counts ⚠️ Addressed via sorted index + binary search, but RFC3339 parsing on rebuild is the new bottleneck (item #1)
No plan/spec docs committed to repo Clean — no markdown spec files in the diff

Summary

4 must-fix items, 3 should-fix. The feature design is correct and well-tested. The issues are all implementation-level: a hot-path performance concern (RFC3339 parsing under lock), an allocation pattern in removal, a lock scope that's wider than necessary, and an ambiguous API behavior. None require rearchitecting — they're all targeted fixes.

efiten added a commit to efiten/meshcore-analyzer that referenced this pull request Apr 19, 2026
Must-fix items:
- addTxToRelayTimeIndex now accepts pre-parsed millis to avoid 30K+ RFC3339
  re-parses during bulk startup rebuild under the write lock
- removeFromRelayTimeIndex uses copy+reslice instead of append to avoid
  per-eviction heap allocation proportional to path length
- handleNodes snapshots relay times under RLock then releases before enrichment,
  keeping the lock scope to a quick map copy rather than the full node loop
- Suppress last_relayed in API response when both counts are zero across all
  three endpoints (handleNodes, bulk health, single-node health)

Should-fix items:
- Add NOTE comment on HEALTH_THRESHOLDS mirror in test-repeater-liveness.js
- Flatten multiline repeater template block into per-row inline conditionals
  in both table and dl views to match avgSnr/avgHops style

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
efiten added a commit to efiten/meshcore-analyzer that referenced this pull request Apr 19, 2026
Must-fix items:
- addTxToRelayTimeIndex now accepts pre-parsed millis to avoid 30K+ RFC3339
  re-parses during bulk startup rebuild under the write lock
- removeFromRelayTimeIndex uses copy+reslice instead of append to avoid
  per-eviction heap allocation proportional to path length
- handleNodes snapshots relay times under RLock then releases before enrichment,
  keeping the lock scope to a quick map copy rather than the full node loop
- Suppress last_relayed in API response when both counts are zero across all
  three endpoints (handleNodes, bulk health, single-node health)

Should-fix items:
- Add NOTE comment on HEALTH_THRESHOLDS mirror in test-repeater-liveness.js
- Flatten multiline repeater template block into per-row inline conditionals
  in both table and dl views to match avgSnr/avgHops style

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Owner

@Kpa-clawbot Kpa-clawbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final adversarial review (REVIEW-ONLY — fork, no push access)

Status: 🟡 Approve-after-rebase. No new substantive changes required.

950b56b cleanly addresses the prior CHANGES_REQUESTED review (must-fix items 1–4, should-fix 5–7 — all confirmed in diff). Targeted Go tests pass (go test -run "Repeater|Liveness|Relay" — all green, incl. TestAddTxToRelayTimeIndex_LowercasesKey, TestGetBulkHealthRepeaterRelayFields, TestTouchRelayLastSeen_Debouncing). Frontend tests pass (546 passed, 0 failed; test-repeater-liveness.js 10 passed).

🚧 Conflict scope (mergeStateStatus = DIRTY)

Dry-run rebase against origin/master aborted at commit 3/20 (feat(store): add relayTimes index and relay metrics functions (#662) — i.e. PR-internal commit).

Conflicting files: cmd/server/store.go only.

Root cause is #806 (perf(#800): remove per-StoreTx ResolvedPath, replace with membership index + on-demand decode) — which deleted the StoreTx.ResolvedPath and StoreObs.ResolvedPath fields this PR's addTxToRelayTimeIndex / removeFromRelayTimeIndex iterate over (store.go lines 2519, 2526, 2541, 2548).

Rebase recipe for the author:

  1. git fetch origin && git rebase origin/master
  2. In cmd/server/store.go, replace tx.ResolvedPath / obs.ResolvedPath iteration with the new on-demand decode helper (see how extractResolvedPubkeys / the new resolved-path index are used in IngestNewObservations and byPathHop updates on master).
  3. Re-run go test ./cmd/server/... and node test-repeater-liveness.js.
  4. Likely also touch neighbor_persist.go if it referenced the removed field — verify.

No other files conflict (routes.go, nodes.js, roles.js, style.css, tests are clean).

Per-expert verdicts

Persona Verdict Notes
mesh-operator ✅ ok Three-state (relaying/active/stale) matches field monitoring needs; "stale beats relay" precedence is right — a silent repeater shouldn't show green from old paths.
meshcore ✅ ok Indexing from full ResolvedPath (not 1-byte hop prefixes) correctly avoids hash-collision noise at 2K nodes. Role-gated to repeaters. Idempotent insert dedups multi-observer hits of same packet.
dijkstra ✅ ok Sorted-millis slice + binary search is correct; idempotent insert prevents dup; remove is symmetric. Window boundaries are now − 1h/now − 24h half-open, consistent.
carmack ✅ ok Hot-path parse hoisted out (must-fix #1 done); remove is now in-place copy (must-fix #2 done); handleNodes snapshots under RLock then enriches outside (must-fix #3 done).
torvalds ✅ ok Three small functions, ~30 LOC each. Cohesive.
tufte 🟢 nit (non-blocking) Three-state is graded, not just binary — good. Confirm color choices remain WCAG-AA after rebase; not changed in this PR.
djb ✅ ok Counters derived from already-validated resolved pubkeys; no new attacker-controllable strings into SQL/log. Spoof surface = same as existing path indexing.

Smoke recipe (after rebase)

  1. cd cmd/server && go test ./... -count=1 -timeout 180s
  2. node test-frontend-helpers.js && node test-repeater-liveness.js
  3. curl -s localhost:8080/api/nodes | jq '[.[] | select(.role=="Repeater") | {pk: .public_key, s: .stats}] | .[0:3]' — verify relay_count_1h/relay_count_24h present and last_relayed only present when a count > 0.
  4. UI: nodes list — repeater rows show 🟢/🟡/⚪; non-repeaters unaffected.

Final go/no-go

Go — recommend merge after a clean rebase against current master. No further substantive changes requested from this seat.

efiten added a commit to efiten/meshcore-analyzer that referenced this pull request Apr 26, 2026
store.go and neighbor_persist.go from upstream/master removed
relayTimes/relayMetrics which PR Kpa-clawbot#755 (not yet upstream) still requires.
Restore fork versions until Kpa-clawbot#755 is merged upstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
efiten added a commit to efiten/meshcore-analyzer that referenced this pull request Apr 26, 2026
…pstream resolved-path index

Upstream merge dropped relayTimes/relayMetrics needed by PR Kpa-clawbot#755 (repeater
liveness, not yet upstream). Restore relay functions while keeping upstream's
new resolvedPubkeyIndex fields required by resolved_index.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
efiten added a commit to efiten/meshcore-analyzer that referenced this pull request Apr 26, 2026
…veness compat

ResolvedPath on StoreTx/StoreObs needed by neighbor_persist.go (PR Kpa-clawbot#755).
whereClause variable rename was incomplete — fix to use correct name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
efiten and others added 16 commits April 26, 2026 15:09
…awbot#662)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…es (Kpa-clawbot#662)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…detail pane (Kpa-clawbot#662)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- removeFromRelayTimeIndex: use in-place append instead of extra allocation
- roles.js: revert window.HEALTH_THRESHOLDS to bare HEALTH_THRESHOLDS; fix test to expose bare global in VM context
- docs: remove plan/spec process artifacts from repo
- relayMetrics: document last_relayed behavior for entries older than 24h
- addTxToRelayTimeIndex: add comment explaining FirstSeen parse rationale
- nodes.js: show last relayed time in idle repeater explanation when available
- nodes.js: fix relay stats template indentation in both detail pane blocks
- TestRelayTimesWiredIntoIngest: add relayKeys <= hopKeys bounds assertion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Kpa-clawbot#662)

backfillResolvedPathsAsync was updating tx.ResolvedPath via
pickBestObservation but never updating relayTimes. On a fresh deploy
(or any instance where resolved_path was NULL in the DB), all repeaters
showed as Idle despite active relay traffic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ats (Kpa-clawbot#662)

Two bugs:

1. addTxToRelayTimeIndex only used tx.ResolvedPath (best observation).
   If the best observer received a packet directly while another observer
   received it via a repeater, that repeater's relay activity was invisible.
   Fix: scan ALL observations' resolved paths. Insert is now idempotent
   so the function can be called multiple times safely. removeFromRelayTimeIndex
   updated symmetrically to avoid orphaned entries on eviction.
   Also add an else branch in pollAndMerge so new observations on existing
   packets (path unchanged) still update relay times immediately.

2. renderDetail/loadFullNode called getStatusInfo(n) where n = data.node
   (from /api/nodes/{pubkey}, no relay stats). Relay counts and status
   were always undefined / Idle. Fix: assign n.stats = stats (from the
   health endpoint) before calling getStatusInfo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pa-clawbot#662)

New 'St.' column after Last Seen shows 🟢/🟡/⚪ for each node.
Hovering shows full context: 'Relaying — Relayed 2592 packets in
last 24h, last 26s ago' (or 'Idle — Alive but no relay traffic...',
'Stale — Not heard for...').

Also replaces the manual status/class computation in renderRows with
a single getStatusInfo() call for consistency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The nodes list was calling getStatusInfo(n) on nodes that had no stats
(the list endpoint only returned DB data). This caused all repeaters to
show as yellow/idle in the overview even when actively relaying.

Fix: handleNodes in routes.go now enriches each repeater node with
relay_count_1h, relay_count_24h, and last_relayed from the in-memory
relayTimes index, matching exactly what the single-node health endpoint
returns. The status filter also updated to use getStatusInfo() so relay
state is considered when filtering active/stale.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
efiten and others added 3 commits April 26, 2026 15:19
buildPathHopIndex was indexing relay timestamps for all historical
packets. On a live instance with 2.8M observations this allocated ~14 GB
causing the OOM killer to terminate the server process.

relayMetrics only queries 1h and 24h windows so any timestamp older than
24h is useless in the index. Skip addTxToRelayTimeIndex for packets
older than 24h during the initial build.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vent OOM

Loading all 366K transmissions and 2.8M observations at startup caused
the server to consume 14+ GB and get OOM-killed on the live instance.

The Load() SQL now adds a WHERE clause filtering to the configured
retentionHours window when set. With retentionHours=168 (7 days), only
~7 days of packets are loaded into memory — sufficient for all in-memory
queries. Historical data remains in SQLite and is available via DB
queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Must-fix items:
- addTxToRelayTimeIndex now accepts pre-parsed millis to avoid 30K+ RFC3339
  re-parses during bulk startup rebuild under the write lock
- removeFromRelayTimeIndex uses copy+reslice instead of append to avoid
  per-eviction heap allocation proportional to path length
- handleNodes snapshots relay times under RLock then releases before enrichment,
  keeping the lock scope to a quick map copy rather than the full node loop
- Suppress last_relayed in API response when both counts are zero across all
  three endpoints (handleNodes, bulk health, single-node health)

Should-fix items:
- Add NOTE comment on HEALTH_THRESHOLDS mirror in test-repeater-liveness.js
- Flatten multiline repeater template block into per-row inline conditionals
  in both table and dl views to match avgSnr/avgHops style

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@efiten efiten force-pushed the feat/repeater-liveness-662 branch from 950b56b to 0c35da5 Compare April 26, 2026 13:33
efiten and others added 2 commits April 27, 2026 10:32
…e maxPackets fallback

relay_liveness_test.go used StoreTx.ResolvedPath (removed by Kpa-clawbot#800) and called
addTxToRelayTimeIndex as a standalone function — both no longer existed after the
rebase. Fix by extracting relayIndexInsertPaths / relayIndexRemovePaths as
package-level helpers that operate on map[string][]int64 directly; store methods
delegate to them. Tests rewritten to call helpers without StoreTx involvement.

Also restores the maxPackets whereClause fallback dropped when retentionHours
support was added, fixing TestBoundedLoad_LimitedMemory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
page.$() captures an element handle at query time. When the nodes page
WebSocket auto-refresh fires between the querySelector and the .click()
call, the table is re-rendered and the element is detached, causing
"Element is not attached to the DOM".

Replace both occurrences with page.click(selector), which re-queries
the DOM at click time and retries until the element is stable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Kpa-clawbot
Copy link
Copy Markdown
Owner

Final Review: LGTM after rebase

Conflicts to resolve: cmd/server/neighbor_persist.go, cmd/server/store.go

Once rebased, this is ready to merge.

Positives

  1. O(log n) relay metrics — binary search on pre-sorted slice, no per-request scan. Lock-minimal snapshot pattern in /api/nodes handler.
  2. Comprehensive test coverage — 340-line Go test file covers boundary cases (empty, partial remove, duplicate pubkeys, lowercase normalization, role gating). Frontend tests cover three-state logic and backward compat (omitted relay arg).
  3. Clean UI semantics — three-state (🟢 Relaying / 🟡 Idle / ⚪ Stale) with informative tooltips explaining what "idle" means for operators. All colors via CSS variables.

Non-blocking nits

  • Clock skew edge case: A future-dated FirstSeen from a clock-skewed node would inflate relay counts until time catches up. Not a blocker — liveness itself depends on last_heard age which is observer-timestamped. Could add a if millis > now { millis = now } clamp in a follow-up if this surfaces in the field.
  • "Idle" framing: Operators in quiet mesh areas will see healthy repeaters as yellow/idle permanently. The tooltip explains it well, but consider a note in the UI or docs that "idle" means "alive but no observed forwarding" — not "broken."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants