
fix: bimodal clock hysteresis — don't flip to 'No Clock' on transient bad bursts#894

Closed
Kpa-clawbot wants to merge 1 commit into master from fix/845-no-clock-hysteresis

Conversation

@Kpa-clawbot
Owner

Symptom

Operator on staging: /#/nodes/c0dedad4… (Kpa Roof Solar) shows 🚫 No Clock even though the most recent advert decoded with a valid 2026 timestamp. Same shape on other historically-bimodal nodes when their last few adverts happen to all be bad.

Root cause

The bimodal classifier from #845 looks at the last 5 samples within the past hour. When a bimodal node hits a transient burst of bad RTC timestamps, recent goodFraction = 0 and severity flips to no_clock — regardless of the long-term picture (16k+ samples, ~38% historically good in this case).

Fix (option C)

  1. Widen the recent window: 5 → 20 samples, 1h → 6h time bound. More data, less jitter, still meaningfully recent.
  2. Add hysteresis: only drop to no_clock when BOTH recent goodFraction AND long-term goodFraction are < 10%. A node with historical good samples stays bimodal_clock even when the recent window is 100% bad.
  3. Fallback skew display: when recent has zero good samples but long-term does, use the long-term good median as the displayed skew so the operator sees a meaningful number instead of stale poison.
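A minimal sketch of the hysteresis rule in steps 1–2, using the thresholds from this description (recent = last 20 samples within 6h, cutoff = 10%). Type and function names here are hypothetical, not the actual clock_skew implementation:

```go
package main

import "fmt"

// Cutoff below which a window counts as "bad" (10%, per the fix description).
const noClockCutoff = 0.10

type Windows struct {
	RecentGoodFraction   float64 // good fraction over the last 20 samples (≤ 6h old)
	LongTermGoodFraction float64 // good fraction over all samples on record
}

// Classify applies the hysteresis rule: only drop to no_clock when BOTH
// windows are below the cutoff; a history of good samples keeps the node
// at bimodal_clock even when the recent window is 100% bad.
func Classify(w Windows) string {
	switch {
	case w.RecentGoodFraction < noClockCutoff && w.LongTermGoodFraction < noClockCutoff:
		return "no_clock"
	case w.RecentGoodFraction < 1.0 || w.LongTermGoodFraction < 1.0:
		return "bimodal_clock"
	default:
		return "ok"
	}
}

func main() {
	// Staging symptom: recent burst all bad, ~38% historically good.
	fmt.Println(Classify(Windows{RecentGoodFraction: 0.0, LongTermGoodFraction: 0.38})) // bimodal_clock
	// Genuinely dead RTC: bad in both windows.
	fmt.Println(Classify(Windows{RecentGoodFraction: 0.0, LongTermGoodFraction: 0.02})) // no_clock
}
```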

API change

Add longTermGoodFraction to /api/nodes/{pk}/clock-skew so operators can see the hysteresis input directly.
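For illustration only — every field and value below other than longTermGoodFraction is hypothetical — the extended response could look like:

```json
{
  "severity": "bimodal_clock",
  "recentGoodFraction": 0.0,
  "longTermGoodFraction": 0.38,
  "skewSeconds": 1.4
}
```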

Tests

  • TestBimodalHysteresis — recent all-bad + long-term mixed → bimodal_clock (NEW; reproduces the staging symptom)
  • TestNoClock_BothWindowsBad — recent all-bad + long-term all-bad → no_clock (NEW; verifies real no-clock nodes still classify correctly)
  • Updated TestSeverityUsesRecentNotMedian + TestReporterScenario_789 to match the new wider window and accept bimodal_clock for nodes with massive historical poison — consistent with #845's whole premise: bimodal deserves a flag, not OK status.

All clock_skew tests pass; full go test ./... green.

@Kpa-clawbot
Owner Author

Superseded by PR #907 (default-detection redesign). Closing.

KpaBap pushed a commit that referenced this pull request May 2, 2026
## Summary

UI completion of #690 — surfaces observer clock skew and per-hash
evidence that the backend already computes but wasn't exposed in the
frontend.

**Not related to #845/PR #894** (bimodal detection) — this is the UI
surface for the original #690 scope.

## Changes

### Backend: per-hash evidence in node clock-skew API (commit 1)
- Extended `GET /api/nodes/{pubkey}/clock-skew` to return
`recentHashEvidence` (most recent 10 hashes with per-observer
raw/corrected skew and observer offset) and `calibrationSummary`
(total/calibrated/uncalibrated counts).
- Evidence is cached during `ClockSkewEngine.Recompute()` — route
handler is cheap.
- Fleet endpoint omits evidence to keep payload small.
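A sketch of what the extended payload shape might look like. The JSON field names follow the description above (`recentHashEvidence`, `calibrationSummary`); the Go struct names and example values are hypothetical, not the real types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObserverSkew is one observer's view of a single hash.
type ObserverSkew struct {
	ObserverID    string  `json:"observerId"`
	RawSkew       float64 `json:"rawSkew"`       // seconds, as reported
	Offset        float64 `json:"offset"`        // per-observer clock offset
	CorrectedSkew float64 `json:"correctedSkew"` // rawSkew + offset
}

// HashEvidence is the per-hash breakdown (most recent 10 hashes).
type HashEvidence struct {
	Hash      string         `json:"hash"`
	Observers []ObserverSkew `json:"observers"`
}

type CalibrationSummary struct {
	Total        int `json:"total"`
	Calibrated   int `json:"calibrated"`
	Uncalibrated int `json:"uncalibrated"`
}

type ClockSkewResponse struct {
	RecentHashEvidence []HashEvidence     `json:"recentHashEvidence"` // capped at 10 entries
	CalibrationSummary CalibrationSummary `json:"calibrationSummary"`
}

// buildResponse assembles a tiny example payload.
func buildResponse() ClockSkewResponse {
	raw, offset := 1.2, 0.3
	return ClockSkewResponse{
		RecentHashEvidence: []HashEvidence{{
			Hash: "ab12cd34",
			Observers: []ObserverSkew{{
				ObserverID:    "obs-1",
				RawSkew:       raw,
				Offset:        offset,
				CorrectedSkew: raw + offset, // corrected = raw + offset
			}},
		}},
		CalibrationSummary: CalibrationSummary{Total: 3, Calibrated: 2, Uncalibrated: 1},
	}
}

func main() {
	b, _ := json.MarshalIndent(buildResponse(), "", "  ")
	fmt.Println(string(b))
}
```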

### Frontend: observer list page — clock offset column (commit 2)
- Added "Clock Offset" column to observers table.
- Fetches `/api/observers/clock-skew` once on page load, joins by
ObserverID.
- Color-coded severity badge + sample count tooltip.
- Singleton observers show "—" not "0".

### Frontend: observer-detail clock card (commit 3)
- Added clock offset card mirroring node clock card style.
- Shows: offset value, sample count, severity badge.
- Inline explainer describing how offset is computed from multi-observer
packets.

### Frontend: node clock card evidence panel (commit 4)
- Collapsible "Evidence" section in existing node clock skew card.
- Per-hash breakdown: observer count, median corrected skew,
per-observer raw/corrected/offset.
- Calibration summary line and plain-English severity reason at top.

## Test Results

```
go test ./... (cmd/server) — PASS (19.3s)
go test ./... (cmd/ingestor) — PASS (31.6s)
Frontend helpers: 610 passed, 0 failed
```

New test: `TestNodeClockSkew_EvidencePayload` — 3-observer scenario
verifying per-hash array shape, corrected = raw + offset math, and
median.

No frontend JS smoke test added — no existing test harness for
clock/observer rendering. Noted for future.

## Screenshots

Screenshots TBD

## Perf justification

Evidence is computed inside the existing `Recompute()` cycle (already
O(n) on samples). The `hashEvidence` map adds ~32 bytes per sample of
memory. Evidence is stripped from fleet responses. Per-node endpoint
returns at most 10 evidence entries — bounded payload.

---------

Co-authored-by: you <you@example.com>
