Skip to content

feat: clock skew classifier redesign — default-detection model#907

Open
Kpa-clawbot wants to merge 13 commits intomasterfrom
feat/clock-skew-default-detection
Open

feat: clock skew classifier redesign — default-detection model#907
Kpa-clawbot wants to merge 13 commits intomasterfrom
feat/clock-skew-default-detection

Conversation

@Kpa-clawbot
Copy link
Copy Markdown
Owner

Clock Skew Classifier Redesign

Replaces the windowed-median / hysteresis / good-fraction classifier with a simple default-detection model based on known firmware epochs.

Spec: docs/clock-skew-redesign.md

New 5-tier severity model

Tier Condition Color
Default advert_ts within [epoch, epoch + 2y] of any known firmware default Gray
OK ` skew
Degrading `15s < skew
Degraded `60s < skew
Wrong ` skew

What changed

  1. Backend classifier (cmd/server/clock_skew.go): dropped no_clock, severe, warn, absurd, bimodal_clock severities and all windowing/hysteresis logic. New isDefaultEpoch() checks advert timestamp against known firmware defaults [0, 1609459200, 1672531200, 1715770351]. Per-node severity = classification of most recent advert. Dropped recentMedianSkewSec, goodFraction, recentBadSampleCount, recentSampleCount from API. Added defaultEpoch field.

  2. Backend tests (cmd/server/clock_skew_test.go): new tests for all tiers, boundary values, default epoch detection, beyond-uptime-cap fallthrough. Observer calibration tests preserved.

  3. Frontend badges (public/roles.js, public/style.css, public/analytics.js, public/nodes.js): updated severity labels, CSS classes, filter tabs, and badge rendering. Removed bimodal/no_clock-specific UI code.

  4. Frontend explainer (public/nodes.js): per-tier plain-English reason line in node clock card.

Go test results

ok  github.com/corescope/server  19.364s

References

…epoch test

- Remove stale 'recentMedianSkewSec' reference in nodes.js comment
- Add TestIsDefault_OverlappingWindowsPicksLargest covering epoch
  selection when default ranges overlap
Copy link
Copy Markdown
Owner Author

@Kpa-clawbot Kpa-clawbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent Adversarial Review

Reviewed the full diff, spec doc, and file context for each changed file. Applied lenses: Frankenstein check, duplicate logic, spec drift, edge cases, API compat, test coverage, performance.


Must-fix (4 items — requesting changes)

1. extractTimestamp conflates "no timestamp" with epoch-0 — false default classification
cmd/server/clock_skew.go:247extractTimestamp returns 0 both when no timestamp field exists AND when the timestamp is literally 0. After the advertTS < 0 filter change (line 216), advertTS == 0 now passes through to isDefaultEpoch, which matches epoch 0 and classifies as default. Any ADVERT packet with a missing or unparseable timestamp will be incorrectly treated as an epoch-0 node.

Fix: Change extractTimestamp to return -1 for "not found" (or use a (int64, bool) return), and keep the advertTS < 0 guard in collectSamples. Alternatively, keep the old advertTS <= 0 filter and accept that actual epoch-0 nodes need advertTS >= 1 to be detected — but this conflicts with the spec's explicit "epoch 0 → default" requirement.

2. Stale CSS row classes .clock-fleet-row--warning and .clock-fleet-row--critical
public/style.css:2307-2308 — The old severity names warning and critical were replaced everywhere except these two CSS row highlight rules. They should be .clock-fleet-row--degrading and .clock-fleet-row--degraded respectively. The analytics table in analytics.js:3516 builds row classes as clock-fleet-row--<severity>, so rows with degrading/degraded severity get no background highlight (the CSS rules target the old names).

Fix: Rename .clock-fleet-row--warning.clock-fleet-row--degrading and .clock-fleet-row--critical.clock-fleet-row--degraded.

3. computeNodeSkew classifies using last-appended observer's skew, not most-recent observation
cmd/server/clock_skew.go:371lastCorrectedSkew := cs[len(cs)-1].skew takes the last element of the cs slice for a hash, which is the last observer appended (by sample iteration order), not necessarily the most recent observation. The spec says "most recent observation across all observers." The per-node path in getNodeClockSkewLocked correctly picks by lastObsTS, but the per-hash intermediate NodeClockSkew may have incorrect severity if observers arrive non-chronologically.

Fix: Pick the skew from the cs element with the largest observedTS, or sort cs by observedTS before indexing.

4. No integration test for advertTS == 0 through PacketStore.GetNodeClockSkew
cmd/server/clock_skew_test.goTestIsDefaultEpoch_Boundaries tests isDefaultEpoch(0) in isolation, but no integration test puts an ADVERT with timestamp: 0 through the full PacketStore path. This is critical given bug #1 — need a test that confirms real epoch-0 adverts classify correctly AND that missing-timestamp packets don't false-positive.


Out-of-scope (pre-existing / follow-up)

  • --status-red CSS variable: skew-badge--wrong uses var(--status-red) — verify it exists in :root. Not introduced by this PR.
  • Breaking API fields removal: recentMedianSkewSec, goodFraction, recentBadSampleCount, recentSampleCount silently disappear. Consider a "Breaking changes" section in the PR description for external consumers.
  • isDefaultEpoch O(n) loop: Negligible at 4 epochs, note for future if the list grows.

you added 4 commits April 24, 2026 23:42
Distinguishes 'no timestamp field' (returns -1) from real epoch-0
(returns 0). Adds jsonNumberOk helper that returns (value, bool).
The collectSamples guard 'advertTS < 0' correctly filters missing
timestamps while allowing epoch-0 through to isDefaultEpoch.

Updates TestExtractTimestamp to verify both cases.

Review item #1 on PR #907.
.clock-fleet-row--warning  → .clock-fleet-row--degrading
.clock-fleet-row--critical → .clock-fleet-row--degraded

The JS in analytics.js builds classes as 'clock-fleet-row--<severity>'
so the CSS must match the actual severity strings.

Review item #2 on PR #907.
Uses max observedTS instead of last-appended slice element to
determine the most recent skew sample per hash. Consolidates
the latestObsTS and anyCal loop into a single pass.

Review item #3 on PR #907.
TestGetNodeClockSkew_EpochZeroAdvert: verifies advert with timestamp==0
flows through PacketStore and classifies as severity=default, epoch=0.

TestGetNodeClockSkew_MissingTimestamp: verifies advert with no timestamp
field is skipped (extractTimestamp returns -1, filtered by collectSamples).

Review item #4 on PR #907.
@Kpa-clawbot
Copy link
Copy Markdown
Owner Author

Review fixes addressed

All 4 must-fix items from the adversarial review:

Fix 1 — extractTimestamp sentinel for missing timestamp — d4b1aa4
Returns timestampMissing (-1) instead of 0 when no timestamp field exists. Added jsonNumberOk helper. Epoch-0 (ts == 0) now correctly flows through to isDefaultEpoch.

Fix 2 — CSS fleet row class rename — c46a60f
.clock-fleet-row--warning.clock-fleet-row--degrading, .clock-fleet-row--critical.clock-fleet-row--degraded. Already had --wrong and --default variants.

Fix 3 — computeNodeSkew uses chronologically-latest — 86a4403
Picks skew from sample with max observedTS instead of last-appended slice element. Consolidated the loop.

Fix 4 — Integration tests for epoch-0 and missing timestamp — 3cd7186
TestGetNodeClockSkew_EpochZeroAdvert: timestamp=0 → severity=default, epoch=0.
TestGetNodeClockSkew_MissingTimestamp: no timestamp field → skipped (nil result).

All tests pass: go test ./... → ok

Copy link
Copy Markdown
Owner Author

@Kpa-clawbot Kpa-clawbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎓 Dijkstra Review — The Formalist

"Simplicity is prerequisite for reliability."

This redesign is a rare pleasure: a complex state machine replaced by a clean, total function with well-separated concerns. The spec is rigorous, the production data analysis is thorough, and the implementation follows the spec faithfully. I shall nonetheless hold it to the standard it deserves.


Invariants & Mutual Exclusivity ✅

The 5-tier severity model is mutually exclusive by construction: classifySkew checks default-epoch membership first (preempting all bands), then cascades through <= boundaries in a switch. No input can satisfy two branches. The tiers partition the input space completely — every (advertTS, absCorrectedSkewSec) pair maps to exactly one severity.

State Space ✅

Can a node be both "default" and "wrong"? No. isDefaultEpoch runs first; if it matches, severity is SkewDefault regardless of skew magnitude. The test TestClassify_Default_AllKnownEpochs correctly passes skew=999999 to verify this preemption.

Separation of Concerns ✅

Default detection operates on raw advertTS — independent of observer calibration. Band classification operates on absCorrectedSkewSec — the calibrated value. These are orthogonal: observer offsets (seconds-to-minutes scale) cannot move a raw advertTS from 2024 into 2026. Phase 2 calibration and default detection cannot interfere with each other. Clean.

Type Safety ✅

SkewSeverity is a named string type with named constants — idiomatic Go. The classifier only returns values from the constant set. Frontend uses SKEW_SEVERITY_COLORS / SKEW_SEVERITY_LABELS maps as the authoritative value registry; unknown severities degrade gracefully (unstyled badge).

Totality & Edge Cases ✅ (with one exception)

  • Negative advertTS: No epoch matches (all epochs ≥ 0), falls through to band classification. Safe.
  • advertTS = 0: Matches epoch-0 → default. Tested (TestGetNodeClockSkew_EpochZeroAdvert).
  • Missing timestamp: extractTimestamp returns timestampMissing (-1), collectSamples filters advertTS < 0. Clean sentinel. Tested (TestGetNodeClockSkew_MissingTimestamp).
  • NaN skew: All <= comparisons return false → falls through to SkewWrong. Correct behavior, though implicit.
  • +Inf skew: Same — SkewWrong. Correct.
  • Overlapping default-epoch windows: isDefaultEpoch picks the largest epoch ≤ advertTS. Tested (TestIsDefault_OverlappingWindowsPicksLargest).

Observer Calibration Orthogonality ✅

calibrateObservers is unchanged, computing offsets from multi-observer packet medians. computeNodeSkew applies offsets to raw skew before passing to classifySkew. Default detection reads only raw advertTS. These are cleanly layered — calibration cannot cause a false default, and default detection cannot corrupt calibration.


🔴 Must-Fix (1)

classifySkew does not enforce its own preconditioncmd/server/clock_skew.go:66

The parameter is named absCorrectedSkewSec (implying non-negative), but the function body does not enforce this. If a future caller passes a raw (possibly negative) skew without math.Abs():

classifySkew(advTS, -86400)  // -86400 <= 15 is TRUE → SkewOK ❌

A node 1 day behind would be classified as OK. Both current call sites correctly pass math.Abs(...), so this is latent — but the function violates a core principle: enforce preconditions at the boundary, do not rely on callers to uphold implicit contracts.

Fix (one line):

func classifySkew(advertTS int64, absCorrectedSkewSec float64) (SkewSeverity, int64) {
    absCorrectedSkewSec = math.Abs(absCorrectedSkewSec)  // ← add this
    if ok, epoch := isDefaultEpoch(advertTS); ok {

Add a test:

func TestClassify_NegativeSkewTreatedAsAbsolute(t *testing.T) {
    advTS := int64(1900000000)
    sev, _ := classifySkew(advTS, -86400)
    if sev != SkewWrong {
        t.Errorf("classifySkew(advTS, -86400) = %v, want wrong", sev)
    }
}

Out-of-scope (pre-existing, for follow-up)

  • jsonNumber (the non-Ok variant) still exists and returns 0 for missing keys — indistinguishable from a real zero. Pre-existing; not introduced by this PR.

Verdict

1 must-fix (cmd/server/clock_skew.go:66 — defensive math.Abs in classifySkew). Once addressed: zero structural concerns — the classifier is correct by construction, tiers are exhaustive and mutually exclusive, separation is clean, and tests cover all boundaries including the tricky epoch-0 and missing-timestamp cases.

"Testing shows the presence, not the absence of bugs." — but this classifier is simple enough that we can reason about its correctness directly, which is the higher standard.

@Kpa-clawbot
Copy link
Copy Markdown
Owner Author

✅ Dijkstra must-fix addressed in 4291b38: classifySkew now takes math.Abs internally; parameter renamed from absCorrectedSkewSec to skewSec. Callers no longer need to pre-abs. Regression test TestClassify_NegativeSkewUsesAbsoluteValue added.

Copy link
Copy Markdown
Owner Author

@Kpa-clawbot Kpa-clawbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔬 MeshCore Protocol Engineer Review

Reviewed against firmware source at firmware/src/helpers/ArduinoHelpers.h and full git history of VolatileRTCClock.

Protocol-correctness summary

✅ ADVERT-only classificationcollectSamples() filters by PayloadADVERT (line 195), getNodeClockSkewLocked() skips non-ADVERT (line 422). No from_node leakage from non-ADVERTs. Correct.

1715770351 confirmedVolatileRTCClock constructor in ArduinoHelpers.h:11: base_time = 1715770351. Present since initial commit. set time RESET in CommonCLI.cpp calls setCurrentTime(1715770351). Verified.

✅ Epoch-0 / missing-timestamp distinctiontimestampMissing = -1 sentinel, jsonNumberOk helper, collectSamples filters advertTS < 0. Epoch-0 flows through to isDefaultEpoch. Clean.

✅ Originator attribution — classifier operates on the ADVERT's embedded timestamp (originator's clock), not observation metadata. Repeater forwarding doesn't alter this. Correct.

✅ Default detection independent of calibrationclassifySkew() checks raw advert_ts against isDefaultEpoch() before examining corrected skew. Observer offsets (seconds-to-minutes scale) can't move a 2024 timestamp to 2026. Correct per spec.


Must-fix (1)

MF-1: maxPlausibleUptimeSec = 730 days is too tight — imminent misclassification

File: cmd/server/clock_skew.go:47

The firmware default 1715770351 is 2024-05-15. Today is 2026-04-25 — that's 710 days. Solar-powered repeaters deployed at firmware release and never rebooted are 20 days from being reclassified from default to wrong.

Real-world solar repeaters (mountain-top, off-grid) run for years without reboot. A node at 800 days uptime with firmware-default clock should be default, not wrong — the latter implies "operator set incorrect time" which triggers unnecessary intervention.

Fix: Increase to 1095 * 86400 (3 years), or compute dynamically: max(baseline, now - oldest_known_epoch + margin). The constant table already acknowledges epochs will grow; the uptime cap should track that reality.

This is imminent — the 00id.net deployment already has nodes at 1715770351 + 0..113d. Some booted near firmware release.


Comments (not blocking)

C-1: 1672531200 — no firmware source evidence. Git history shows base_time was always 1715770351 from initial commit. The 5 production nodes at ~1672531542 are empirical evidence, but likely from a pre-repo firmware version or fork. Add a code comment noting "production-observed, firmware source not available."

C-2: 1609459200 — acknowledged speculation. Spec says "single-sample evidence." Shipping it is fine (false positive = harmless gray badge). Add a // speculative comment in the constant table.

C-3: VCR replay misclassification. Old replayed adverts will show as wrong because their timestamps are months old vs current time.Now(). Pre-existing issue, not introduced here, but the new red wrong badges make it more visible. Suggest a follow-up issue.

C-4: Single-observer uncalibrated nodes. When only one observer exists with no calibration, raw skew is used. If that observer's clock is off by 30s, a perfectly-synced node shows degrading. Inherent to Phase 2's design. Follow-up: surface "uncalibrated — single observer" more prominently in UI.


Verdict

1 must-fix (MF-1: uptime cap about to expire on real nodes). After that: approve.

Everything else is correct. Default-detection model matches firmware behavior, ADVERT-only classification is clean, observer calibration preserved, sentinel handling for epoch-0 vs missing-timestamp is solid work.

@Kpa-clawbot
Copy link
Copy Markdown
Owner Author

✅ MF-1 addressed in 86ca793: maxPlausibleUptimeSec bumped 730d → 1095d (3 years). Spec doc + tests updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant