fix: bimodal clock hysteresis — don't flip to 'No Clock' on transient bad bursts#894
Closed
Kpa-clawbot wants to merge 1 commit intomasterfrom
Closed
fix: bimodal clock hysteresis — don't flip to 'No Clock' on transient bad bursts#894Kpa-clawbot wants to merge 1 commit intomasterfrom
Kpa-clawbot wants to merge 1 commit intomasterfrom
Conversation
… bad bursts Symptom: Kpa Roof Solar (and other historically-bimodal nodes) briefly flip to '🚫 No Clock' when the last 5 adverts in the recent-1h window all happen to have nonsense timestamps, even though the very next advert decoded with a valid 2026 timestamp. Root cause: the bimodal classifier from #845 looks at the last 5 samples within the past hour. When a bimodal node hits a transient burst of bad RTC samples, recent goodFraction = 0 and severity flips to no_clock regardless of the long-term picture (16k+ samples, ~38% historically good). Fix (option C from triage): 1. Widen the recent window: 5 → 20 samples, 1h → 6h time bound (more data, less jitter, still recent). 2. Add hysteresis: only drop to no_clock when BOTH recent goodFraction AND long-term goodFraction are < 10%. A node with historical good samples stays bimodal even when the recent window is 100% bad. 3. When recent has zero good samples, fall back to long-term good median for the displayed skew so the operator sees a meaningful number instead of stale poison. API: add longTermGoodFraction to /api/nodes/{pk}/clock-skew so operators can see the hysteresis input directly. Tests: - TestBimodalHysteresis: recent all-bad + long-term mixed → bimodal_clock - TestNoClock_BothWindowsBad: recent all-bad + long-term all-bad → no_clock - Updated TestSeverityUsesRecentNotMedian + TestReporterScenario_789 to match the new wider window and accept bimodal_clock for nodes with massive historical poison (#845's whole premise: bimodal deserves a flag, not OK status).
This was referenced Apr 24, 2026
Owner
Author
|
Superseded by PR #907 (default-detection redesign). Closing. |
KpaBap
pushed a commit
that referenced
this pull request
May 2, 2026
## Summary UI completion of #690 — surfaces observer clock skew and per-hash evidence that the backend already computes but wasn't exposed in the frontend. **Not related to #845/PR #894** (bimodal detection) — this is the UI surface for the original #690 scope. ## Changes ### Backend: per-hash evidence in node clock-skew API (commit 1) - Extended `GET /api/nodes/{pubkey}/clock-skew` to return `recentHashEvidence` (most recent 10 hashes with per-observer raw/corrected skew and observer offset) and `calibrationSummary` (total/calibrated/uncalibrated counts). - Evidence is cached during `ClockSkewEngine.Recompute()` — route handler is cheap. - Fleet endpoint omits evidence to keep payload small. ### Frontend: observer list page — clock offset column (commit 2) - Added "Clock Offset" column to observers table. - Fetches `/api/observers/clock-skew` once on page load, joins by ObserverID. - Color-coded severity badge + sample count tooltip. - Singleton observers show "—" not "0". ### Frontend: observer-detail clock card (commit 3) - Added clock offset card mirroring node clock card style. - Shows: offset value, sample count, severity badge. - Inline explainer describing how offset is computed from multi-observer packets. ### Frontend: node clock card evidence panel (commit 4) - Collapsible "Evidence" section in existing node clock skew card. - Per-hash breakdown: observer count, median corrected skew, per-observer raw/corrected/offset. - Calibration summary line and plain-English severity reason at top. ## Test Results ``` go test ./... (cmd/server) — PASS (19.3s) go test ./... (cmd/ingestor) — PASS (31.6s) Frontend helpers: 610 passed, 0 failed ``` New test: `TestNodeClockSkew_EvidencePayload` — 3-observer scenario verifying per-hash array shape, corrected = raw + offset math, and median. No frontend JS smoke test added — no existing test harness for clock/observer rendering. Noted for future. ## Screenshots Screenshots TBD ## Perf justification Evidence is computed inside the existing `Recompute()` cycle (already O(n) on samples). The `hashEvidence` map adds ~32 bytes per sample of memory. Evidence is stripped from fleet responses. Per-node endpoint returns at most 10 evidence entries — bounded payload. --------- Co-authored-by: you <you@example.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Symptom
Operator on staging:
/#/nodes/c0dedad4…(Kpa Roof Solar) shows🚫 No Clockeven though the most recent advert decoded with a valid 2026 timestamp. Same shape on other historically-bimodal nodes when their last few adverts happen to all be bad.Root cause
The bimodal classifier from #845 looks at the last 5 samples within the past hour. When a bimodal node hits a transient burst of bad RTC timestamps,
recent goodFraction = 0and severity flips tono_clock— regardless of the long-term picture (16k+ samples, ~38% historically good in this case).Fix (option C)
no_clockwhen BOTH recent goodFraction AND long-term goodFraction are < 10%. A node with historical good samples staysbimodal_clockeven when the recent window is 100% bad.API change
Add
longTermGoodFractionto/api/nodes/{pk}/clock-skewso operators can see the hysteresis input directly.Tests
TestBimodalHysteresis— recent all-bad + long-term mixed →bimodal_clock(NEW; reproduces the staging symptom)TestNoClock_BothWindowsBad— recent all-bad + long-term all-bad →no_clock(NEW; verifies real no-clock nodes still classify correctly)TestSeverityUsesRecentNotMedian+TestReporterScenario_789to match the new wider window and acceptbimodal_clockfor nodes with massive historical poison — feat: bimodal_clock severity — surface flaky-RTC nodes instead of hiding as 'No Clock' #845's whole premise: bimodal deserves a flag, not OK status.All clock_skew tests pass; full
go test ./...green.