Skip to content

fix(repeater-liveness): address issue #755 review feedback#798

Closed
efiten wants to merge 43 commits intoKpa-clawbot:masterfrom
efiten:fix/755-review-feedback
Closed

fix(repeater-liveness): address issue #755 review feedback#798
efiten wants to merge 43 commits intoKpa-clawbot:masterfrom
efiten:fix/755-review-feedback

Conversation

@efiten
Copy link
Copy Markdown
Contributor

@efiten efiten commented Apr 19, 2026

Summary

  • Perf: addTxToRelayTimeIndex now accepts pre-parsed millis — eliminates 30K+ redundant RFC3339 parses during startup rebuild under the write lock
  • Perf: removeFromRelayTimeIndex uses copy+reslice instead of append — no per-eviction heap allocation
  • Lock scope: handleNodes snapshots relay times under RLock then releases before enrichment, keeping lock held for a quick map copy only
  • API clarity: last_relayed suppressed when both counts are zero across all three endpoints
  • Should-fix: NOTE comment on HEALTH_THRESHOLDS mirror in test file; repeater template rows flattened to inline conditionals

Test plan

  • go test ./... passes (excluding pre-existing TestMultiByteCapability_* failures)
  • node test-repeater-liveness.js — 10 passed, 0 failed
  • Startup log shows relay-time keys populated correctly
  • /api/nodes repeater nodes: last_relayed absent when both counts are 0, present when counts > 0

🤖 Generated with Claude Code

efiten and others added 30 commits April 1, 2026 23:19
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Async timers (setInterval/setTimeout) started by animateHop() can fire
after destroy() has nulled animLayer and removed DOM elements. This
caused three console errors on the Live page when navigating away mid-
animation. Guards added at each async callback site.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…a-clawbot#475)

bufferPacket was stamping all live WS packets with Date.now() (receive
time). Packets arriving in the same batch all got identical timestamps,
making the message history show the same "Xs ago" for every entry.

Use pkt.timestamp || pkt.created_at (mirroring dbPacketToLive) so each
packet reflects its actual origination time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…a-clawbot#284)

Two bugs remained after prior fix attempts:

1. buildExport() diffed home fields against DEFAULTS instead of the
   server's original config (_SITE_CONFIG_ORIGINAL_HOME). If the server
   provided custom steps and the user only edited the checklist, those
   server steps were captured in localStorage — blocking future server
   updates from being visible after reload.

2. autoSave() synced branding to SITE_CONFIG but not home, so the live
   preview (hashchange re-render) read stale SITE_CONFIG.home and didn't
   reflect the user's edits until a full page reload.

Fix: diff home fields against _SITE_CONFIG_ORIGINAL_HOME in buildExport,
and sync state.home back to SITE_CONFIG.home in autoSave using the same
original snapshot as base to prevent cumulative mutations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ve (Kpa-clawbot#288)

P0 items from issue Kpa-clawbot#288:

- Override indicators: small ⬤ dot next to any setting that differs from
  the server's configured value (not just DEFAULTS). Badge on each tab
  showing the count of overridden settings in that section.
- "Saved!" toast: brief green pill animates in the panel after every
  autoSave, replacing the removed manual save button.
- Remove "Save as my theme" button: autoSave already handles this;
  the Export tab now shows an override count summary instead.
- Full export: Download/Copy now exports all current settings (not just
  diffs), so admins get a complete theme.json. localStorage still stores
  only user overrides.
- Per-setting reset uses server values: Reset buttons for colors/branding
  restore to server config (theme.json / config.json), not DEFAULTS.
- _serverState snapshot: computed once in initState() from DEFAULTS +
  server config (no localStorage) — single source of truth for all
  override comparisons.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rideIndicators

Dots and badges were only updated on full render() (tab switch). Now
autoSave() calls refreshOverrideIndicators() which updates tab badges
and field dots in-place without a full re-render, so indicators appear
immediately when a setting is changed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…a-clawbot#464)

When groupByHash=true, each group only carries the representative observer_id.
Client-side filter was checking only that field, silently dropping groups seen
by the selected observer but with a different representative.

Fix: pass observer param to the server so filterPackets/buildGroupedWhere do
the correct "any observation matches" check. Skip the broken client-side
observer filter in grouped mode. Extend both db and store paths to support
comma-separated observer IDs (multi-select UI).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Kpa-clawbot#463)

Two complementary fixes:

1. Ingestor (cmd/ingestor/main.go): move the status topic handler BEFORE
   the JSON parse. Previously a non-JSON or malformed status payload caused
   an early return, skipping UpsertObserver entirely and leaving last_seen
   stale. Now last_seen is always updated when a status message arrives,
   regardless of payload format.

2. Frontend (public/observers.js): replace Date.now() with a server-clock-
   corrected reference in healthStatus(). The API response already includes
   server_time; we now capture it at fetch time and use it (plus elapsed ms)
   as the reference point. This eliminates client/server clock skew from
   the online/stale/offline calculation.

Adds TestHandleMessageStatusNonJSONPayload to cover the non-JSON case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
estimatedMemoryMB() used a hardcoded formula (packets × 5120 + obs × 500)
that ignored distHops (1.5M records), distPaths (93K records), and spIndex
(4.1M entries). This caused the formula to report ~1.2GB while actual heap
was ~5GB, so the memory-based eviction check barely fired and the process
grew until OOM-killed.

Replace estimatedMemoryMB() with runtime.ReadMemStats so all data structures
are counted. Replace the inner eviction simulation loop (which re-used the
wrong formula) with a proportional calculation: if heap is N× over budget,
evict enough packets to keep (1/N × 0.9) of the current count, with a 10%
buffer so the next ingest cycle doesn't immediately re-trigger.

Add memoryEstimator func() float64 field to PacketStore for test injection,
so eviction tests stay deterministic without spawning goroutines or relying
on process heap size.

Addresses the "Packet store estimated memory" item in epic Kpa-clawbot#559.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Direct/zero-hop adverts always use pathByte=0x00 regardless of the
node's actual hash size setting. When hashCount==0, the hash size bits
are meaningless — display them as unknown rather than "1 byte".

- packets.js: suppress Hash Size card entry and Advertised Hash Size
  field row when hashCount==0; update Path Length description to say
  "direct advert" instead of showing spurious hash_size=1
- nodes.js: skip per-advert hash size badge for zero-hop packets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirror the server decoder test suite — add TestZeroHopTransportDirectHashSize
and TestZeroHopTransportDirectHashSizeWithNonZeroUpperBits to the ingestor,
closing the symmetry gap flagged in PR Kpa-clawbot#653 review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…awbot#662)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…es (Kpa-clawbot#662)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…detail pane (Kpa-clawbot#662)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
efiten and others added 13 commits April 16, 2026 18:05
- removeFromRelayTimeIndex: use in-place append instead of extra allocation
- roles.js: revert window.HEALTH_THRESHOLDS to bare HEALTH_THRESHOLDS; fix test to expose bare global in VM context
- docs: remove plan/spec process artifacts from repo
- relayMetrics: document last_relayed behavior for entries older than 24h
- addTxToRelayTimeIndex: add comment explaining FirstSeen parse rationale
- nodes.js: show last relayed time in idle repeater explanation when available
- nodes.js: fix relay stats template indentation in both detail pane blocks
- TestRelayTimesWiredIntoIngest: add relayKeys <= hopKeys bounds assertion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ot#662)

byPathHop was storing resolved pubkeys with original case while
relayTimes and routes.go lookups (addCandidates(lowerPK)) both use
lowercase, breaking the subset invariant and causing missed lookups for
uppercase pubkeys. Lowercase consistently in both add and remove.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Kpa-clawbot#662)

backfillResolvedPathsAsync was updating tx.ResolvedPath via
pickBestObservation but never updating relayTimes. On a fresh deploy
(or any instance where resolved_path was NULL in the DB), all repeaters
showed as Idle despite active relay traffic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ats (Kpa-clawbot#662)

Two bugs:

1. addTxToRelayTimeIndex only used tx.ResolvedPath (best observation).
   If the best observer received a packet directly while another observer
   received it via a repeater, that repeater's relay activity was invisible.
   Fix: scan ALL observations' resolved paths. Insert is now idempotent
   so the function can be called multiple times safely. removeFromRelayTimeIndex
   updated symmetrically to avoid orphaned entries on eviction.
   Also add an else branch in pollAndMerge so new observations on existing
   packets (path unchanged) still update relay times immediately.

2. renderDetail/loadFullNode called getStatusInfo(n) where n = data.node
   (from /api/nodes/{pubkey}, no relay stats). Relay counts and status
   were always undefined / Idle. Fix: assign n.stats = stats (from the
   health endpoint) before calling getStatusInfo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pa-clawbot#662)

New 'St.' column after Last Seen shows 🟢/🟡/⚪ for each node.
Hovering shows full context: 'Relaying — Relayed 2592 packets in
last 24h, last 26s ago' (or 'Idle — Alive but no relay traffic...',
'Stale — Not heard for...').

Also replaces the manual status/class computation in renderRows with
a single getStatusInfo() call for consistency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The nodes list was calling getStatusInfo(n) on nodes that had no stats
(the list endpoint only returned DB data). This caused all repeaters to
show as yellow/idle in the overview even when actively relaying.

Fix: handleNodes in routes.go now enriches each repeater node with
relay_count_1h, relay_count_24h, and last_relayed from the in-memory
relayTimes index, matching exactly what the single-node health endpoint
returns. The status filter also updated to use getStatusInfo() so relay
state is considered when filtering active/stale.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Conflicts:
#	cmd/ingestor/decoder_test.go
buildPathHopIndex was indexing relay timestamps for all historical
packets. On a live instance with 2.8M observations this allocated ~14 GB
causing the OOM killer to terminate the server process.

relayMetrics only queries 1h and 24h windows so any timestamp older than
24h is useless in the index. Skip addTxToRelayTimeIndex for packets
older than 24h during the initial build.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vent OOM

Loading all 366K transmissions and 2.8M observations at startup caused
the server to consume 14+ GB and get OOM-killed on the live instance.

The Load() SQL now adds a WHERE clause filtering to the configured
retentionHours window when set. With retentionHours=168 (7 days), only
~7 days of packets are loaded into memory — sufficient for all in-memory
queries. Historical data remains in SQLite and is available via DB
queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Must-fix items:
- addTxToRelayTimeIndex now accepts pre-parsed millis to avoid 30K+ RFC3339
  re-parses during bulk startup rebuild under the write lock
- removeFromRelayTimeIndex uses copy+reslice instead of append to avoid
  per-eviction heap allocation proportional to path length
- handleNodes snapshots relay times under RLock then releases before enrichment,
  keeping the lock scope to a quick map copy rather than the full node loop
- Suppress last_relayed in API response when both counts are zero across all
  three endpoints (handleNodes, bulk health, single-node health)

Should-fix items:
- Add NOTE comment on HEALTH_THRESHOLDS mirror in test-repeater-liveness.js
- Flatten multiline repeater template block into per-row inline conditionals
  in both table and dl views to match avgSnr/avgHops style

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@efiten
Copy link
Copy Markdown
Contributor Author

efiten commented Apr 19, 2026

Closing — fixes should go on the existing PR #755 branch instead.

@efiten efiten closed this Apr 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant