Add realtime operator display: push-based reporting + console viewer#45

Draft
firstmorecoffee wants to merge 3 commits into deimoscontrols:main from firstmorecoffee:feat/realtime-operator-display

Conversation

@firstmorecoffee
Contributor

@firstmorecoffee firstmorecoffee commented Apr 21, 2026

Summary

Adds a decision-grade live view of controller signals for test-stand operators, complementing the existing firmware-level auto-abort interlocks. The Grafana path (dispatchers → TimescaleDB → query) is too indirect for judgment-cadence decisions; this gives the control room a wait-free direct view whose contents can be defended as equal to what the controller saw.

Reporting dispatcher (software/deimos/src/dispatcher/reporting/)

  • UDP multicast transport, default 239.255.0.1:29573; writes are wait-free and infallible from the control loop (WouldBlock increments a drop counter and returns Ok(())).
  • Wire format: postcard-encoded ReportingMessage::{Schema, Row} with a leading byte tag. Schema re-emits every 2 s (configurable) so late-joining viewers discover channels within a bounded window.
  • terminate() emits one final session-end Schema for clean boundary detection.
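
The wait-free write path described above can be sketched roughly as follows. This is a minimal illustration, not the dispatcher's actual code: the `FrameSink` trait and the `send_wait_free` function are hypothetical names, the socket is assumed to be connected and already in non-blocking mode, and propagating errors other than `WouldBlock` is an assumption (the PR only states the `WouldBlock` behavior).

```rust
use std::io::{self, ErrorKind};
use std::net::UdpSocket;
use std::sync::atomic::{AtomicU64, Ordering};

/// Anything that can accept one datagram without blocking.
/// Hypothetical trait, introduced so the drop path can be exercised
/// without a real network.
pub trait FrameSink {
    fn send_frame(&self, frame: &[u8]) -> io::Result<usize>;
}

impl FrameSink for UdpSocket {
    fn send_frame(&self, frame: &[u8]) -> io::Result<usize> {
        // Assumes `connect(...)` and `set_nonblocking(true)` were called
        // during setup, so this call returns immediately.
        self.send(frame)
    }
}

/// Try to send one frame without ever blocking the control loop.
/// A full socket buffer (`WouldBlock`) is counted as a drop and reported
/// as success. Other errors are propagated here, which is an assumption.
pub fn send_wait_free<S: FrameSink>(
    sink: &S,
    frame: &[u8],
    drops: &AtomicU64,
) -> io::Result<()> {
    match sink.send_frame(frame) {
        Ok(_) => Ok(()),
        Err(e) if e.kind() == ErrorKind::WouldBlock => {
            drops.fetch_add(1, Ordering::Relaxed);
            Ok(())
        }
        Err(e) => Err(e),
    }
}
```

The counter is a relaxed atomic so that a telemetry reader on another thread can sample it without taking a lock on the control-loop side.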

deimos-console viewer (software/deimos-console/)

  • Pure-Rust eframe/egui — no browser, no npm.
  • Receiver thread joins the multicast group via socket2, buffers Row until the first Schema, then renders one egui_plot Plot per configured panel with unit-labeled y-axes at 30 Hz repaint.
  • Sequence-gap detection inserts NaN sentinels so dropped frames render as explicit discontinuities — never silent interpolation.
  • Five-state connection health: NoSchemaYet / Fresh / Stale / SessionEnded / ReceiverDead, with a configurable staleness threshold (default 2 s). SessionEnded fires the instant the controller's session-end Schema arrives so operators distinguish clean shutdown from silent stall immediately, without waiting out the stale timeout.
  • Freeze-and-inspect toggle halts scrolling while the receiver continues feeding the ring buffer and forensic log.
  • Per-session forensic CSV with size-based rotation records (viewer_received_at, seq, controller_timestamp, controller_system_time, <channels...>) so post-run review correlates what was on screen with the underlying sample.
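
As an illustration of the sequence-gap bullet above, here is a minimal sketch of inserting a NaN sentinel whenever `seq` does not advance by exactly one. `GapAwareSeries` and its methods are hypothetical names, and treating any non-consecutive `seq` (including duplicates or reordering) as a gap is an assumption of this sketch rather than something the PR states.

```rust
/// One plotted sample as an [x, y] pair.
pub type Point = [f64; 2];

/// Appends samples to a series and breaks the line with a NaN point
/// whenever the sequence number skips. Hypothetical type for illustration.
pub struct GapAwareSeries {
    points: Vec<Point>,
    last_seq: Option<u64>,
    gaps: u64,
}

impl GapAwareSeries {
    pub fn new() -> Self {
        Self { points: Vec::new(), last_seq: None, gaps: 0 }
    }

    /// Push one sample taken at time `t` with sequence number `seq`.
    pub fn push(&mut self, seq: u64, t: f64, value: f64) {
        if let Some(prev) = self.last_seq {
            if seq != prev.wrapping_add(1) {
                // Frames were lost between `prev` and `seq`: emit a NaN so
                // the renderer shows a break instead of a straight line.
                self.points.push([t, f64::NAN]);
                self.gaps += 1;
            }
        }
        self.points.push([t, value]);
        self.last_seq = Some(seq);
    }

    pub fn points(&self) -> &[Point] {
        &self.points
    }

    pub fn gaps(&self) -> u64 {
        self.gaps
    }
}
```

Pushing samples with `seq` values 1, 2, 5 leaves four points in the series, with the NaN sentinel sitting between the second and the last real sample.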

Viewer observability

  • Per-second stderr tick with row count, rate, last seq, and per-category drop counters, so during debug the console is an inspectable pipeline rather than a black box.
  • Wall-clock rx-lag min/mean/max, anchored on Row.system_time (the controller's cycle-start SystemTime::now()) so the number answers "how old is what the operator sees" across transport + UI drain + any clock skew. A long comment in process_row captures why that value sits near one cycle_duration on loopback and documents the rejected epoch_ns + timestamp alternative.
  • Schema events tagged new-session / re-emission / session-end so controller lifecycle is visible alongside the UI.
  • Idle suppression: after three consecutive zero-row ticks, emit one heartbeat per minute instead of flooding the log.
  • New chrono dep (clock feature, no default features) for the local HH:MM:SS.mmm log prefix that lets viewer lines correlate with the controller's ISO-timestamped tracing output.
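
The rx-lag min/mean/max figures above amount to a small accumulator over "receive time minus controller cycle-start time". The sketch below shows one way to compute them with `std::time` only. `LagStats` is a hypothetical name, and skipping rows whose controller time is ahead of the viewer clock is an assumption of this sketch, not a description of what `process_row` does.

```rust
use std::time::{Duration, SystemTime};

/// Running min/mean/max of wall-clock receive lag for one reporting tick.
/// Hypothetical type for illustration.
#[derive(Debug, Default)]
pub struct LagStats {
    count: u32,
    sum: Duration,
    min: Option<Duration>,
    max: Option<Duration>,
}

impl LagStats {
    /// Record one row. Returns false when the row is skipped because the
    /// controller timestamp is later than the viewer's receive time
    /// (clock skew); that policy is an assumption of this sketch.
    pub fn record(
        &mut self,
        controller_system_time: SystemTime,
        received_at: SystemTime,
    ) -> bool {
        let lag = match received_at.duration_since(controller_system_time) {
            Ok(lag) => lag,
            Err(_) => return false,
        };
        self.count += 1;
        self.sum += lag;
        self.min = Some(self.min.map_or(lag, |m| m.min(lag)));
        self.max = Some(self.max.map_or(lag, |m| m.max(lag)));
        true
    }

    pub fn min(&self) -> Option<Duration> {
        self.min
    }

    pub fn max(&self) -> Option<Duration> {
        self.max
    }

    /// Mean lag, or None if no rows were recorded in this tick.
    pub fn mean(&self) -> Option<Duration> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count)
        }
    }
}
```

Resetting the struct with `LagStats::default()` at each per-second tick would give per-tick rather than cumulative figures.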

Calc unit metadata

  • Calc::get_output_units() -> Vec<Option<String>> is a required trait method. Every built-in calc declares its output units explicitly.
  • RtdPt100, TcKtype → Some("K"). SequenceMachine → Some("s") for sequence_time_s, None for user-defined data channels. Affine, InverseAffine, Polynomial, Constant, Sin → optional output_unit via a with_output_unit(...) builder.
  • Butter2 and Pid return vec![None] with a doc comment noting that passthrough-inheritance from the input channel's unit is the intended eventual behavior, once CalcOrchestrator plumbs input units into each calc's init.
  • CalcOrchestrator::get_dispatch_units() mirrors get_dispatch_names(). Units flow to dispatchers via the new ControllerCtx.channel_units field (parallel to channel_names). Dispatcher::init signature is unchanged.
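
To make the shape of the unit metadata concrete, here is a heavily simplified sketch. Only `get_output_units()` and the `with_output_unit(...)` builder come from the description above. The one-method `Calc` trait, the field layout of `Affine`, the unit-struct `RtdPt100`, and the free function `dispatch_units` are stand-ins invented for illustration; the real trait and orchestrator carry considerably more.

```rust
/// Simplified stand-in for the Calc trait: only the unit-reporting
/// method is modeled here.
pub trait Calc {
    /// One entry per output channel; None means "no unit declared".
    fn get_output_units(&self) -> Vec<Option<String>>;
}

/// Stand-in for a calc whose unit depends on user configuration.
pub struct Affine {
    pub slope: f64,
    pub offset: f64,
    output_unit: Option<String>,
}

impl Affine {
    pub fn new(slope: f64, offset: f64) -> Self {
        Self { slope, offset, output_unit: None }
    }

    /// Builder-style setter for the optional output unit.
    pub fn with_output_unit(mut self, unit: impl Into<String>) -> Self {
        self.output_unit = Some(unit.into());
        self
    }
}

impl Calc for Affine {
    fn get_output_units(&self) -> Vec<Option<String>> {
        vec![self.output_unit.clone()]
    }
}

/// Stand-in for a calc whose output unit is fixed by its physics.
pub struct RtdPt100;

impl Calc for RtdPt100 {
    fn get_output_units(&self) -> Vec<Option<String>> {
        vec![Some("K".to_string())]
    }
}

/// Flatten per-calc units into one list, in calc order, so it can sit
/// parallel to a list of channel names (hypothetical helper).
pub fn dispatch_units(calcs: &[Box<dyn Calc>]) -> Vec<Option<String>> {
    calcs.iter().flat_map(|c| c.get_output_units()).collect()
}
```

Keeping the units as `Option<String>` per channel lets a viewer label an axis when a unit exists and fall back to an unlabeled axis otherwise.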

Incidental cleanup

  • CsvDispatcher downgrades the core_affinity::set_for_current returned-false warning from warn! to debug!. Affinity on that path is best-effort — macOS's advisory scheduler always returns false — and the noise obscures real issues at the default log level. Still surfaceable via RUST_LOG=deimos::dispatcher::csv=debug.

Test plan

  • cargo test --manifest-path software/deimos/Cargo.toml — 38 passing
  • cargo test --manifest-path software/deimos-console/Cargo.toml — 7 passing (5 unit + 2 integration)
  • cargo run --example hootl_reporting --manifest-path software/deimos/Cargo.toml — exercises reporting dispatcher end-to-end, asserts zero dropped frames
  • cargo run --example hootl_with_console --manifest-path software/deimos/Cargo.toml — two-terminal run renders live signals with unit-labeled y-axes

Docs

  • software/deimos-console/README.md — TOML config format and invocation.

Deferred (noted for follow-up)

  • Runtime unit checking (pint-style typo/conversion detection) — too large for this PR; addressed as a separate change.
  • Butter2/Pid passthrough-inheritance of input units — requires CalcOrchestrator to pass the assembled channel-units slice into each calc's init; tracked as a separate follow-on.

@firstmorecoffee firstmorecoffee force-pushed the feat/realtime-operator-display branch from b5f9cc7 to eb326f7 on April 21, 2026 14:10
@jlogan03
Contributor

Neat! This is very cool - it'll take me a bit to dig through all of it, think about multicast networks, try it out, and have some thoughts. I like the core premise, and think we can lean into parts of it even more (there are no third-party plugins yet, so we can make units a required output of calcs, but we'll need to eventually combine that with a run-time unit checking system similar to pint to catch typos and conversion errors). This also appears to be separate in functionality from the earlier 3 PRs, so it should be possible to bring some of this in while keeping the spec files and such as a local environment.

Thanks again for putting this together!

@firstmorecoffee firstmorecoffee force-pushed the feat/realtime-operator-display branch 7 times, most recently from fbc2e7e to 30f6a81 on April 24, 2026 15:08
@jlogan03
Contributor

@firstmorecoffee

Finally had a chance to test-drive this today! It's very slick. I like the super lightweight, low-latency plotting interface; I might have reached for some heavier web-y stack like leptos+plotly myself, out of convenience, but this native-first egui setup is objectively better since it's less complex, optimizes better, and can run on closed networks.

The multicast system for getting data to the consoles also solves a long-standing issue that I've had in the back of my head re: how to handle multiple user consoles without over-subscribing the root node. Postcard with periodic schema delivery is an excellent balance of simplicity and performance to accomplish that without a shared compiled boundary.

A couple of notes from running it with Deimos DAQ rev7 hardware attached via an Ethernet-USB adapter on an Ubuntu laptop, and from running it through a Codex review pass:

  • When minimizing the egui app tab, the gui updates appear to stall.
    • When navigating back to the tab after 5-10 seconds, they may either (1) leave a large gap in the plot with missing data and register more dropped frames or (2) accelerate through updating the plots one frame at a time until the buffer is drained (which can take several seconds, during which only stale data is displayed)
    • Ideally, instead of draining a buffer backlog by doing individual plot updates, we would skip forward to consume the latest data so that, to the best of our ability, the latest data is always displayed, but we may jump over some stale data to clear a backlog
  • There's a potential synchronization gotcha in main.rs
    • Codex's summary: Drop the extra sender so receiver death is observable (/home/jlogan/git/jbass_deimos/software/deimos-console/src/main.rs:67-68). Because _tx is a real binding, it stays alive until run_native returns. If the receiver thread exits after a socket error, rx.try_recv() will not report Disconnected while this sender is still held, so the ReceiverDead health state is unreachable in the binary; drop the returned sender immediately or avoid returning it from spawn.
    • The underlying issue is that, in Rust, let _tx = ... and let _ = ... differ semantically in when the value is dropped: _tx is a real binding, so the value is dropped when it goes out of scope, whereas _ does not bind at all, so a freshly created value is dropped at the end of the statement. For types with trivial drop implementations, these usually optimize to the same behavior, but for synchronization utilities that manipulate atomics during drop, there is a real distinction that can meaningfully change the behavior of the program.
  • hootl_reporting.rs and hootl_with_console.rs are pretty similar. I'd recommend keeping hootl_with_console.rs because it gives more description of the what-and-why
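
The `let _tx` versus `let _` distinction from the synchronization note above can be reproduced with nothing but `std::sync::mpsc`. The sketch below is standalone and does not mirror the structure of main.rs; `spawn_worker` and the two helper functions are hypothetical names. The worker thread exits immediately to stand in for a receiver thread that died after a socket error.

```rust
use std::sync::mpsc::{channel, Receiver, Sender, TryRecvError};
use std::thread;

/// Spawn a worker that owns one sender and exits right away, then hand
/// back an extra sender clone together with the receiver.
pub fn spawn_worker() -> (Sender<u32>, Receiver<u32>) {
    let (tx, rx) = channel();
    let worker_tx = tx.clone();
    // The worker drops its sender when it finishes; join so the outcome
    // below does not depend on thread scheduling.
    thread::spawn(move || drop(worker_tx)).join().unwrap();
    (tx, rx)
}

/// `_tx` is a real binding: the extra sender lives until the end of this
/// function, so the channel still looks connected.
pub fn held_sender_hides_death() -> TryRecvError {
    let (_tx, rx) = spawn_worker();
    rx.try_recv().unwrap_err()
}

/// `_` does not bind: the extra sender is dropped at the end of the `let`
/// statement, so the worker's exit becomes visible as a disconnect.
pub fn dropped_sender_reveals_death() -> TryRecvError {
    let (_, rx) = spawn_worker();
    rx.try_recv().unwrap_err()
}
```

Here `held_sender_hides_death()` returns `TryRecvError::Empty`, while `dropped_sender_reveals_death()` returns `TryRecvError::Disconnected`, which is the condition a health check would need in order to reach a dead-receiver state.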

@jlogan03
Contributor

@firstmorecoffee Here's a snip of the gap after navigating away for ~5sec, and the very large number of dropped frames due to navigating away to write the previous comment

image

@firstmorecoffee firstmorecoffee force-pushed the feat/realtime-operator-display branch from 30f6a81 to a6a7554 on April 27, 2026 19:12
@firstmorecoffee
Contributor Author

Updates:

1. Stall on minimize / drain-the-backlog behavior. Reproduced via SIGSTOP/SIGCONT on the binary. Two compounding root causes:

  • The receiver thread's bounded channel used try_send, which on full drops the newest incoming message — so after a long pause the viewer was guaranteed stale-first. The "drain frame-by-frame" effect you saw is exactly that: 3000 oldest frames played out before the latest data could reach the plot.
  • The per-channel ring buffer evicted on timestamp − window_seconds, so pre-pause samples lingered in the visible window until live controller time advanced past them.

Fix is viewer-side only (no wire-format or controller change):

  • Receiver: drop-oldest on full channel. New OVERWRITTEN_FRAMES counter distinguishes channel-eviction backpressure from genuine wire_drops packet loss.
  • App: stall detection in drain_messages — wall-clock gap between drains, or a drained row whose received_at − controller_system_time exceeds staleness_threshold_secs. On a stall, evict pre-stall samples down to tail_keep_secs and splice a NaN sentinel before the first retained post-stall point. The result is the behavior you described: skip forward to the latest data, with a visible discontinuity marking what was elided.
  • Recovering health state: amber dot for recovery_settle_secs after the last stall, then back to Fresh — so an operator can tell "we paused and recovered" from "the link is unhealthy."
  • Per-second tick gains recv_drops, overwritten_frames, wire_drops, stalls_detected, stale_rows_evicted so the failure mode is observable at a glance.

Verified end-to-end: post-SIGCONT on a long pause, plot jumps to current data with one sentinel-break, stalls_detected=1, no false positives during steady streaming. The forensic CSV records receipt, so stall-elided rows still appear in the audit trail.
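
For readers unfamiliar with drop-oldest backpressure, here is a minimal sketch of the idea. It deliberately swaps in a `Mutex<VecDeque>` for the bounded channel with eviction-then-retry that the fix describes, so it illustrates the semantics only. `DropOldestQueue` and its methods are hypothetical names.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Bounded queue that evicts the oldest element when full, so a consumer
/// that falls behind always finds the newest data waiting.
pub struct DropOldestQueue<T> {
    inner: Mutex<VecDeque<T>>,
    capacity: usize,
    overwritten: AtomicU64,
}

impl<T> DropOldestQueue<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self {
            inner: Mutex::new(VecDeque::with_capacity(capacity)),
            capacity,
            overwritten: AtomicU64::new(0),
        }
    }

    /// Push never fails: on a full queue the oldest item is displaced and
    /// the overwritten counter is bumped.
    pub fn push(&self, item: T) {
        let mut q = self.inner.lock().unwrap();
        if q.len() == self.capacity {
            q.pop_front();
            self.overwritten.fetch_add(1, Ordering::Relaxed);
        }
        q.push_back(item);
    }

    /// Take everything currently queued, oldest first.
    pub fn drain(&self) -> Vec<T> {
        self.inner.lock().unwrap().drain(..).collect()
    }

    /// Number of items displaced by backpressure (not wire loss).
    pub fn overwritten(&self) -> u64 {
        self.overwritten.load(Ordering::Relaxed)
    }
}
```

With a capacity of 3, pushing the values 1 through 5 and then draining yields 3, 4, 5 with an overwritten count of 2, which is the "latest data first" behavior requested in the review.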

2. let _tx keepalive. Fixed. The clone is gone; the recv thread now holds the only sender, so Disconnected propagates on its death and ReceiverDead becomes reachable.

3. hootl_reporting vs hootl_with_console. Folded the smoke test into hootl_with_console.rs and removed hootl_reporting.rs. The richer doc comment wins.

firstmorecoffee and others added 3 commits May 2, 2026 13:03
Adds a decision-grade live view of controller signals for test-stand
operators, complementing the existing firmware-level auto-abort
interlocks. The Grafana path (dispatchers -> TimescaleDB -> query) is
too indirect for judgment-cadence decisions; this change gives the
control room a wait-free direct view whose contents can be defended as
equal to what the controller saw.

- ReportingDispatcher serializes each per-cycle Row onto a UDP
  multicast transport (default 239.255.0.1:29573). Writes are wait-free
  and infallible from the control loop: WouldBlock increments a drop
  counter and returns Ok(()).
- Wire format is postcard-encoded ReportingMessage::{Schema, Row} with
  a leading byte tag; the Schema re-emits every 2 s (configurable) so
  late-joining viewers discover channels within a bounded window.
- terminate() emits one final session-end Schema so viewers can mark
  clean session boundaries.
- Schema/Row round-trip tests plus a HOOTL example (hootl_reporting.rs)
  verify no drops in the normal path.

- Pure-Rust eframe/egui viewer; no browser, no npm. Matches Deimos's
  "no platform restrictions" principle.
- Receiver thread joins the multicast group via socket2, buffers Row
  until the first Schema arrives, then renders one egui_plot Plot per
  configured panel with unit-labeled y-axes at a fixed 30 Hz repaint
  cadence.
- Sequence-gap detection inserts NaN sentinels so dropped frames render
  as explicit discontinuities, never silent interpolation.
- Three-state connection-health indicator (Fresh / Stale / NoSchemaYet)
  with a configurable staleness threshold (default 2 s).
- Freeze-and-inspect toggle halts scrolling while the receiver
  continues feeding the ring buffer and forensic log.
- Per-session forensic CSV records (viewer_received_at, seq,
  controller_timestamp, controller_system_time, <channels...>) with
  size-based rotation so post-run review can correlate what was on
  screen with the underlying sample.

- Calc::get_output_units() -> Vec<Option<String>> with a default
  returning all None - third-party calc plugins keep compiling.
- RtdPt100 and TcKtype declare K; Affine, InverseAffine, Polynomial,
  Constant, and Sin accept an optional output_unit via a
  with_output_unit(...) builder.
- CalcOrchestrator::get_dispatch_units() mirrors get_dispatch_names().
- Units flow from calcs to dispatchers via ControllerCtx.channel_units
  (parallel to channel_names); the Dispatcher::init signature is
  unchanged so user dispatcher plugins keep compiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Turns the console from a "does it work?" black box into an inspectable
pipeline during debug, cleans up existing log noise, and fixes a
handful of readability issues exposed by the first round of hands-on
use.

Viewer stderr telemetry (app.rs, main.rs):
- Per-second tick with row count, rate, last seq, and per-category
  drop counters.
- Wall-clock rx-lag min/mean/max, anchored on Row.system_time (the
  controller's cycle-start SystemTime::now()) so it answers "how old
  is what the operator sees" across transport + UI drain + any clock
  skew. A long comment in process_row captures why that number sits
  near one cycle_duration on loopback and documents the rejected
  epoch_ns + timestamp approach.
- Schema lines tagged new-session / re-emission / session-end so
  controller lifecycle is visible alongside the UI.
- Local HH:MM:SS.mmm prefix on every line for easy correlation with
  the controller's ISO-timestamped tracing output.
- Idle suppression: after three consecutive zero-row ticks, emit only
  one heartbeat per minute instead of flooding the log.

New ConnectionHealth::SessionEnded state:
- Fires the instant the controller's session-end Schema arrives, not
  after the 2 s stale timeout — an operator can distinguish "clean
  shutdown" from "silent stall" immediately.
- Replaces the secondary "Session ended." text line previously drawn
  below the indicator; unit test asserts precedence over Fresh.
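
A rough sketch of how such a health classification could be ordered is shown
below. The variant names follow the five states listed in the PR description.
`HealthInputs`, `classify`, and the precedence among states other than
"SessionEnded wins over Fresh" are assumptions made for illustration.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionHealth {
    NoSchemaYet,
    Fresh,
    Stale,
    SessionEnded,
    ReceiverDead,
}

/// Facts the UI would gather each frame (hypothetical struct).
pub struct HealthInputs {
    pub receiver_alive: bool,
    pub schema_seen: bool,
    pub session_ended: bool,
    /// Time since the last Row arrived, if any Row has arrived.
    pub since_last_row: Option<Duration>,
}

/// Pick one state. SessionEnded is checked before freshness so a clean
/// shutdown is reported immediately; the remaining order is an assumption.
pub fn classify(i: &HealthInputs, staleness_threshold: Duration) -> ConnectionHealth {
    if !i.receiver_alive {
        return ConnectionHealth::ReceiverDead;
    }
    if i.session_ended {
        return ConnectionHealth::SessionEnded;
    }
    if !i.schema_seen {
        return ConnectionHealth::NoSchemaYet;
    }
    match i.since_last_row {
        Some(age) if age <= staleness_threshold => ConnectionHealth::Fresh,
        _ => ConnectionHealth::Stale,
    }
}
```

Keeping the classification a pure function of its inputs makes the
precedence rule easy to pin down in a unit test.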

UI palette and layout (app.rs):
- Replace saturated screen-primary colors (YELLOW/GREEN/RED) with a
  muted palette (MINT_GREEN, AMBER, CORAL_RED, SKY_BLUE, ORCHID)
  reserved for indicator glyphs only.
- Labels render in the theme default foreground color so the text is
  legible on both dark and light themes; status communicated via a
  leading dot / warning glyph.
- "All clean" dropped-frames row uses default text, not GRAY — no
  more looking-like-a-disabled-widget when nothing is wrong.

Controller noise cleanup (csv.rs):
- CSV-dispatcher core_affinity::set_for_current returned-false
  warning downgraded from warn! to debug!. Affinity on that path is
  best-effort; macOS's advisory scheduler always returns false. Still
  surfaceable via RUST_LOG=deimos::dispatcher::csv=debug.

Two-terminal debug workflow (scripts/, .gitignore):
- scripts/hootl-console / scripts/console run the two sides with
  RUST_LOG=deimos=debug,info and tee timestamped transcripts under
  logs/.
- Forensic log pre-enabled in the example config (commented path
  explains per-session suffixing and 64 MiB rotation).
- New dep: chrono (clock feature, no default features) for local-time
  formatting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The console could display stale data on a "live" plot for up to one
window after the UI thread paused (e.g. an obscured window on Linux).
Two root causes, both reproduced via SIGSTOP/SIGCONT:

1. The receiver thread's bounded channel used `try_send`, which on full
   drops the *newest* incoming message and keeps the oldest 3000 — so
   after a long pause the viewer is guaranteed stale-first.
2. The per-channel ring buffer evicted on `timestamp − window_seconds`,
   so pre-pause samples lingered in the visible window until live
   controller time advanced past them.

Fix on the viewer side (no wire-format or controller changes):

- Receiver: drop-oldest semantics on a full channel. New
  `OVERWRITTEN_FRAMES` counter distinguishes drop-oldest evictions
  from genuine packet loss in `DROPPED_FRAMES`, and is only
  incremented on actual displacement (the eviction-then-retry cannot
  fail under the single-writer invariant).
- App: stall detection in `drain_messages` — wall-clock gap between
  drains, or a drained row whose `received_at − controller_system_time`
  exceeds `staleness_threshold_secs`. On a stall, clear pre-stall
  samples down to `tail_keep_secs` and splice a NaN discontinuity
  sentinel before the first retained post-stall point. Empty drains
  no longer advance `last_drain_at`, so stall detection fires on the
  first drain after a pause rather than being masked.
- Sentinel emission deferred to a post-second-pass `apply_pending_stall`
  step so catch-up rows can't render after the sentinel, and so a
  stall that fires while the view is frozen still emits its sentinel
  on the first non-frozen drain after unfreeze.
- New `Recovering` connection-health state: amber dot for
  `recovery_settle_secs` after the last stall, then back to Fresh.
- Per-second telemetry tick gains `recv_drops`, `overwritten_frames`,
  `wire_drops`, `stalls_detected`, `stale_rows_evicted`.
- Lag accounting skips rows whose `system_time` falls outside
  `chrono::DateTime::timestamp_nanos_opt`'s representable range so a
  misconfigured clock can't inject a ~56-year spurious lag value.
- Forensic log records *receipt*, not display, so stall-elided rows
  still appear in the audit trail.
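
The stall-detection and eviction steps above can be sketched as follows. This
is a simplified illustration, not the code in `drain_messages`:
`StallDetector`, `on_drain`, and `evict_pre_stall` are hypothetical names, a
single threshold is used for both the drain-gap and row-lag checks, and the
deferred `apply_pending_stall` step and freeze handling are left out.

```rust
use std::time::{Duration, Instant};

/// Flags a stall when drains are too far apart in wall-clock time or when
/// a drained row is older than the threshold.
pub struct StallDetector {
    threshold: Duration,
    last_drain_at: Option<Instant>,
    pub stalls_detected: u64,
}

impl StallDetector {
    pub fn new(threshold: Duration) -> Self {
        Self { threshold, last_drain_at: None, stalls_detected: 0 }
    }

    /// Call once per drain with the row count and the largest observed
    /// `received_at - controller_system_time` among the drained rows.
    pub fn on_drain(&mut self, now: Instant, rows: usize, max_row_lag: Duration) -> bool {
        if rows == 0 {
            // Empty drains must not advance `last_drain_at`, otherwise the
            // gap would be hidden on the first real drain after a pause.
            return false;
        }
        let gap_stall = match self.last_drain_at {
            Some(prev) => now.duration_since(prev) > self.threshold,
            None => false,
        };
        let lag_stall = max_row_lag > self.threshold;
        self.last_drain_at = Some(now);
        let stalled = gap_stall || lag_stall;
        if stalled {
            self.stalls_detected += 1;
        }
        stalled
    }
}

/// Trim pre-stall samples to the last `tail_keep_secs` and append a NaN
/// sentinel, so the next (post-stall) point starts a new line segment.
/// Returns how many samples were evicted.
pub fn evict_pre_stall(points: &mut Vec<[f64; 2]>, tail_keep_secs: f64) -> usize {
    let newest = match points.last() {
        Some(p) => p[0],
        None => return 0,
    };
    let before = points.len();
    points.retain(|p| p[0] >= newest - tail_keep_secs);
    points.push([newest, f64::NAN]);
    before - points.len() + 1
}
```

Calling `evict_pre_stall` before appending the first post-stall row places the
sentinel between the retained tail and the fresh data.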

Also bundled (prior scramble review on this branch):

- Bound freeze-buffer growth and broaden console test coverage.
- Mark `deimos-console` as `publish = false` (internal binary).
- Fold `hootl_reporting` smoke test into `hootl_with_console`.
- Misc doc-string corrections (csv_row width claim, hootl_lifecycle
  initial-state name, receiver loop description, dropped_frames /
  Schema re-emission notes, stalls_detected tick-line description,
  pre-Schema-evicted-rows carve-out).
- `tests/integration_wire.rs` uses `.expect(...)` on dispatcher
  consume/terminate so contract violations surface as test failures.

Spec: `improve-console-freshness-on-stall` adds the freshness
invariant and stall-recovery requirement to `realtime-reporting`,
archived under `openspec/changes/archive/2026-04-27-improve-console-
freshness-on-stall/`. The dispatcher CLAUDE.md gains a one-line
cross-reference to the recv-side freshness invariants.

End-to-end verified: SIGSTOP/SIGCONT on the patched binary reports
`stalls_detected=1` post-resume with no false positives during steady
streaming. Higher-rate overflow of the bounded channel was not
reproducible in this environment because the OS UDP buffer drops
backlogs first (`wire_drops`); not a regression — an environment
limit on forcing channel overflow.

Verified: cargo test -p deimos -p deimos-console (all pass),
cargo fmt --all -- --check (clean), scripts/validate-branch
james/main (clippy-on-this-branch's-lines clean).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@firstmorecoffee firstmorecoffee force-pushed the feat/realtime-operator-display branch from a6a7554 to 9a247ae on May 2, 2026 17:03
@jlogan03
Contributor

jlogan03 commented May 3, 2026

Cool! Those fixes sound reasonable. I tried this out again today using the basic.rs example connected to a rev7 DAQ at 100Hz samplerate. Still seeing a gap and some rubberbanding after minimizing and then bringing the console up again.

I think the egui app update loop might be pausing when minimized, which makes it so that any logic implemented behind it to better handle buffer overflow doesn't end up running at all. We might have to separate the buffering logic from the display update, which would be good for handling higher data rates in any case - with 1kHz+ samplerate, that'll drive a lot of display updates that aren't visible anyway.

console_rubberbanding.webm

It looks like the only way to get around it is to run the buffer processing on another thread with an update loop that isn't tied to egui's display update. I'll see if I can prototype something along those lines as a proof-of-concept.

@jlogan03
Contributor

jlogan03 commented May 3, 2026

Here's a proof-of-concept with the buffer processing on a different thread from the UI update. This does resolve the rubberbanding, and the console is able to stay stable up to a 1kHz data rate.

As written, the buffer processing thread uses the same update rate as the UI draw cycle (30Hz), but I think a higher rate would be preferable in order to reduce maximum frame lag to well below human reaction time. Maybe 120Hz buffer processing and configurable UI draw rate up to 60Hz or something along those lines.
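
For readers following along without the patch, the general shape of decoupling buffer processing from the UI draw loop might look like the sketch below. It is not taken from ui_thread.patch: `spawn_buffer_worker`, `SharedRing`, and `snapshot` are hypothetical names, and the worker here wakes on each incoming sample (with a timeout) rather than running at a fixed 30 Hz or 120 Hz cadence.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Ring buffer shared between the worker thread and the UI thread.
pub type SharedRing = Arc<Mutex<VecDeque<f64>>>;

/// Move samples from the channel into a bounded ring on a dedicated
/// thread, so the buffer stays current even when the UI stops repainting.
pub fn spawn_buffer_worker(
    rx: Receiver<f64>,
    ring: SharedRing,
    capacity: usize,
    tick: Duration,
) -> JoinHandle<()> {
    thread::spawn(move || loop {
        match rx.recv_timeout(tick) {
            Ok(sample) => {
                let mut buf = ring.lock().unwrap();
                if buf.len() >= capacity {
                    // Evict the oldest sample to keep the ring bounded.
                    buf.pop_front();
                }
                buf.push_back(sample);
            }
            // Nothing arrived within one tick: keep waiting.
            Err(RecvTimeoutError::Timeout) => continue,
            // All senders are gone: shut the worker down.
            Err(RecvTimeoutError::Disconnected) => break,
        }
    })
}

/// What the UI would call on each repaint: copy out the current window.
pub fn snapshot(ring: &SharedRing) -> Vec<f64> {
    ring.lock().unwrap().iter().copied().collect()
}
```

Because the UI only ever takes a snapshot, a paused or minimized window costs nothing on the ingest side, and the first repaint after resuming already sees the latest samples.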

ui_thread.patch

console_stable.webm
