feat(profilers): GPU + link profiling with directional measurements #2010

Open
AlexCheema wants to merge 13 commits into main from alexcheema/profilers-dashboard


Summary

  • New active-probe profiler subsystem (src/exo/utils/profilers/) measuring GPU FP16 TFLOPS, GPU memory bandwidth, per-edge socket bandwidth (upload + download separately), socket RTT, per-edge RDMA bandwidth (rank0→rank1 and rank1→rank0 via mx.distributed.send/recv), and RDMA RTT.
  • State-driven reconciliation loop (15 s tick, replay-safe) with TTLs: GPU 1 h, socket 5 min, RDMA 6 h. GPU and RDMA probes skip while runners are active; all RDMA work serialised through a process-global lock. Subprocess isolation for jaccl init keeps the worker process clean.
  • Dashboard adds: a HuggingFace-style "GPU poor → GPU rich" bar (weighted blend of TFLOPS / memory bandwidth / memory against an 8× H100 anchor; FLOPS heaviest, memory least; per-dimension cap so a single insanely high axis still pulls the thumb right), per-node TFLOPS/bandwidth labels, a directional edge label (↑X ↓Y · Z RTT), and a hover tooltip with a per-direction breakdown including inferred connection type (Wi-Fi / Ethernet / Thunderbolt 4 / Thunderbolt 5 / RDMA — generation parsed from system_profiler's linkSpeed).
  • GPU-bench variance fix: long warm-up to lock the Apple GPU into peak performance state, plus best-of-N timing. Two healthy M3 Ultras now agree to ~2% (was ~15% before).
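The warm-up plus best-of-N idea from the last bullet can be sketched in plain Python. This is an illustration only, not the PR's benchmark code: `best_of_n`, `tflops`, the warm-up budget, and the repeat count are all hypothetical. With MLX the timed callable would also need to force evaluation (for example with `mx.eval`), because MLX evaluates lazily.

```python
import time
from typing import Callable


def best_of_n(fn: Callable[[], object], warmup_s: float = 0.05, repeats: int = 5) -> float:
    """Return the fastest wall-clock duration (seconds) of `fn` after a warm-up."""
    # Warm-up phase: call fn repeatedly until the time budget is spent and
    # discard the timings, so the device can ramp up to its peak state.
    deadline = time.perf_counter() + warmup_s
    while time.perf_counter() < deadline:
        fn()
    # Measurement phase: keep the minimum, which is the run least disturbed
    # by scheduling noise or clock ramping.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def tflops(n: int, seconds: float) -> float:
    """Throughput of an n x n x n matmul (2 * n^3 FLOPs) in TFLOPS."""
    return 2 * n**3 / seconds / 1e12
```

Taking the minimum rather than the mean reports the best run, which is why two identical machines are more likely to agree.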

Screenshots

Topology view (no hover)

Two-node M3 Ultra cluster over Thunderbolt 5 + Tailscale. The GPU rich bar shows 47.0 TFLOPS / 1.33 TB/s / 1.00 TB; the thumb sits in the middle yellow zone — driven by the maxed-out memory dimension while compute and bandwidth are well below an 8× H100 anchor. Per-node label under each Mac shows GPU profile (e.g. 23.6 TFLOPS · 661 GB/s). Edge label reads ↑74.77 Gbps ↓74.79 Gbps · 18 µs RTT.

[Screenshot: Topology]

Hovering an edge — directional breakdown tooltip

Each row is one measured profile (one direction × one transport). Connection type is inferred per row.

[Screenshot: Tooltip]

Implementation notes

  • Why MLX mx.sum instead of GEMV for memory bandwidth. GEMV pays for an inner reduction plus a vector broadcast and caps at ~58 % of peak DRAM bandwidth on Apple Silicon. A pure streaming mx.sum over a 2 GB FP16 buffer (well past M3 Ultra's ~96 MB SLC) hits ~80 % — the closer-to-the-ceiling number that's useful for placement decisions.
  • RDMA probe is subprocess-isolated. mx.distributed.init(backend="jaccl") is process-global; running it inside the worker would clash with active inference. Probe lives in rdma_probe_main.py, spawned via /profile/rdma_probe HTTP rendezvous.
  • One-way bandwidth, RTT-only latency. Socket uploads use server-side timing (POST /profile/upload returns its own recv_duration_ms) so the small response's RTT doesn't pollute the result. RDMA uses mx.distributed.send/recv per direction. Latency stays as RTT — true one-way needs sub-µs clock sync that we don't have, and synthetic half-RTT split would be misleading.
  • State-driven reconciliation, not event-reactive. EXO replays the event log on master changes; reactive probe triggers would re-fire historical events. The reconciler reads current state every tick and probes anything missing or past TTL.
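A minimal sketch of the state-driven reconciliation described in the last note, assuming a simplified state shape. `ProbeState`, `due_probes`, and the `(kind, target)` keys are hypothetical names; only the TTL values and the skip-while-runners-active rule come from this PR.

```python
from dataclasses import dataclass, field

# TTLs from the PR description: GPU 1 h, socket 5 min, RDMA 6 h.
TTL_SECONDS = {"gpu": 3600.0, "socket": 300.0, "rdma": 6 * 3600.0}
# Probe kinds that must not compete with active inference.
SKIP_WHILE_RUNNERS_ACTIVE = {"gpu", "rdma"}


@dataclass
class ProbeState:
    # Hypothetical shape: (kind, target) -> timestamp of the last completed probe.
    last_probed: dict[tuple[str, str], float] = field(default_factory=dict)
    runners_active: bool = False


def due_probes(
    state: ProbeState, wanted: list[tuple[str, str]], now: float
) -> list[tuple[str, str]]:
    """Return the probes that are missing or past their TTL in the current state."""
    due = []
    for kind, target in wanted:
        if kind in SKIP_WHILE_RUNNERS_ACTIVE and state.runners_active:
            continue  # retried on a later tick once runners are idle
        last = state.last_probed.get((kind, target))
        if last is None or now - last >= TTL_SECONDS[kind]:
            due.append((kind, target))
    return due
```

Because the decision depends only on current state and the clock, replaying historical events cannot trigger a probe by itself.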

Test plan

  • uv run basedpyright && uv run ruff check && nix fmt && uv run pytest — all green (416 pass, 16 new tests in src/exo/utils/profilers/tests/).
  • Two-node end-to-end on M3 Ultra cluster (mike + james over TB5 + Tailscale): GPU profile, socket bandwidth (upload + download), socket RTT, RDMA bandwidth (upload + download), RDMA RTT all populated; values stable across hourly probes.
  • Connection-type inference: TB5 RDMA / Wi-Fi / Ethernet / Tailscale all classified correctly (Tailscale shows as "Unknown" by design — we don't fake CGNAT classification).
  • Inference safety: starting a generation while a GPU/RDMA probe interval elapses skips the probe cleanly (no contention, no errors).
  • Reviewer eyeball: confirm dashboard rendering on a multi-node cluster including hover behaviour.

🤖 Generated with Claude Code

AlexCheema and others added 6 commits May 1, 2026 01:21
Adds an active-probe profiler subsystem alongside the existing info
gatherer. Each node measures:

- GPU FP16 TFLOPS (square FP16 matmul) and memory bandwidth (large
  `mx.sum` streaming read). Long warm-up + best-of-N timing keeps two
  M3 Ultras agreeing to within ~2%.
- Per-edge socket bandwidth (upload + download separately, server
  times the receive on uploads) and RTT, via new `/profile/echo`,
  `/profile/upload`, and `/profile/download` endpoints.
- Per-edge RDMA bandwidth (rank0->rank1 and rank1->rank0 via
  `mx.distributed.send`/`recv`) and RTT (tiny-payload `all_sum`
  ping-pong), in a child process so jaccl init doesn't poison the
  worker process.
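For the upload measurement, the server-timed receive converts to a one-way bandwidth figure with simple arithmetic. The sketch below assumes the client knows how many bytes it sent and that the server reports the receive duration in milliseconds; the function name `one_way_gbps` is hypothetical.

```python
def one_way_gbps(payload_bytes: int, recv_duration_ms: float) -> float:
    """Convert a server-timed receive into Gbps (decimal gigabits per second)."""
    if recv_duration_ms <= 0:
        raise ValueError("receive duration must be positive")
    bits = payload_bytes * 8
    seconds = recv_duration_ms / 1000.0
    return bits / seconds / 1e9
```

For example, 125,000,000 bytes received in 1000 ms works out to 1.0 Gbps.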

Scheduling is a state-driven reconciliation loop (15s tick) so the
event log can be replayed without re-firing probes. TTLs: GPU 1h,
socket 5m, RDMA 6h. GPU and RDMA probes skip while runners are active;
all RDMA work serialised through a process-global lock.

Dashboard:
- New GpuRichBar component at the top of the topology view: gradient
  thumb scored as a weighted blend of TFLOPS / bandwidth / memory
  against an 8x H100 anchor (TFLOPS heaviest, memory least), with a
  per-dimension cap so a single insanely high axis still pulls the
  thumb right.
- Per-node TFLOPS / memory bandwidth labels under each device.
- Edge label shows max upload / max download / min RTT across all
  profiles for the pair.
- Hovering an edge shows a breakdown table: direction, inferred
  connection type (Wi-Fi / Ethernet / Thunderbolt N / RDMA),
  upload / download / RTT per profile.
- Connection-type inference reads `nodeNetwork.interfaceType` for
  socket edges and parses `nodeThunderbolt.linkSpeed` (e.g. "Up to
  80 Gb/s") to distinguish TB4 from TB5 on RDMA edges.
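The link-speed parsing in the last bullet could look roughly like the following. It is written in Python for illustration and is not the dashboard's code. The function name and the thresholds (80 Gb/s or more read as Thunderbolt 5, 40 Gb/s or more as Thunderbolt 4) are assumptions, and the fallback label for unparseable strings is hypothetical.

```python
import re


def thunderbolt_generation(link_speed: str) -> str:
    """Guess the Thunderbolt generation from a string such as 'Up to 80 Gb/s'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*Gb/s", link_speed)
    if match is None:
        return "Thunderbolt"  # speed not parseable: report no generation
    gbps = float(match.group(1))
    if gbps >= 80:
        return "Thunderbolt 5"
    if gbps >= 40:
        return "Thunderbolt 4"
    return "Thunderbolt"
```

Note that a 40 Gb/s link alone cannot separate Thunderbolt 3 from Thunderbolt 4, so this sketch only addresses the TB4 versus TB5 split mentioned above.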

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lock was a defensive guess that simultaneous jaccl/RDMA probes
would fight for the Thunderbolt RDMA hardware. They don't —
concurrent processes get independent QPs. Verified end-to-end on a
2x M3 Ultra cluster: both nodes start probes simultaneously at
discovery and both complete cleanly with sensible numbers.

Keep the `state.runners` gate — that one is still real (don't compete
with active inference traffic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The white circle reads as a slider thumb that the user can grab and
drag, but the bar is a passive readout. Apple HIG distinguishes
controls (slider with prominent thumb) from indicators (gauge, with
a tick or filled portion). For a "where do you fall on this scale"
display, the indicator pattern is the right one.

Replace the circular thumb with:
- A thin 2px vertical tick line through the track at the value, with
  a dark halo so it stays legible across the red->green gradient.
- The gradient past the marker is dimmed so the eye lands on the
  position instead of perceiving a static spectrum with a tick on it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The edge metrics label was "↑X ↓Y · Z µs RTT" — the "RTT" suffix
was redundant once the unit was already there and made the label
crowded. Keep the "RTT" column header in the hover tooltip where
the explicit semantic still helps readers parse the table.

Also picks up minor nix-fmt comment-spacing tweaks across the
profiler modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AlexCheema force-pushed the alexcheema/profilers-dashboard branch from c0644e3 to e5760dd on May 1, 2026 00:23
AlexCheema and others added 7 commits May 1, 2026 01:30
`mx.random.uniform(shape=..., dtype=float16)` internally generates
fp32 then casts to fp16, which doubles the peak Metal allocation —
our 2 GiB fp16 buffer briefly needs 4 GiB of heap. CI's macOS
runner has max_buffer_size = 3.5 GiB and rejected the alloc:

    RuntimeError: [metal::malloc] Attempting to allocate 4294967296
    bytes which is greater than the maximum allowed buffer size of
    3758096384 bytes.

DRAM bandwidth is independent of the values being streamed, so we
just allocate `mx.zeros` directly. No fp32 temp, peak heap = exactly
the 2 GiB we want.
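The numbers in the error message can be checked with a few lines of arithmetic. The variable names are illustrative; the byte counts come from the message above.

```python
GIB = 1024**3
FP16_BYTES, FP32_BYTES = 2, 4

buffer_bytes = 2 * GIB                        # the fp16 buffer we want
num_elements = buffer_bytes // FP16_BYTES     # number of fp16 elements
fp32_temp_bytes = num_elements * FP32_BYTES   # temporary if values are generated as fp32
max_buffer_bytes = 3_758_096_384              # limit quoted in the error message

print(fp32_temp_bytes)                     # 4294967296, the rejected allocation
print(max_buffer_bytes / GIB)              # 3.5
print(fp32_temp_bytes > max_buffer_bytes)  # True: the fp32 temp exceeds the limit
print(buffer_bytes <= max_buffer_bytes)    # True: the fp16 buffer alone fits
```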

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dashboard re-derives the connection type on every reactive
update. The source data — `nodeNetwork[peer].interfaces` — is
re-parsed on the backend every 10 s from `networksetup` and
occasionally drops an entry for one tick before the next refresh
puts it back. Without this fix the user sees the label flicker
between "Ethernet" and "Unknown" several times a minute.

Solution: cache the last *concrete* (non-"Unknown") classification
per (sinkNodeId, sinkIp). When a fresh lookup returns "Unknown"
we ignore it in favour of the cached answer. Concrete answers
always update the cache, so a real network change propagates
immediately.

Cache holds one entry per directed edge × IP, so it is bounded by
O(N² · K) for N nodes with at most K IPs per peer. No leak.
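The sticky-classification rule can be sketched as below. This is a Python illustration rather than the dashboard's code; the class and method names are hypothetical, while the `(sinkNodeId, sinkIp)` key and the "Unknown" sentinel follow the description above.

```python
class StickyClassifier:
    """Remember the last concrete connection type per (sink node, sink IP)."""

    def __init__(self) -> None:
        self._last_concrete: dict[tuple[str, str], str] = {}

    def resolve(self, sink_node_id: str, sink_ip: str, fresh: str) -> str:
        key = (sink_node_id, sink_ip)
        if fresh != "Unknown":
            # Concrete answers always win and refresh the cache, so a real
            # network change shows up immediately.
            self._last_concrete[key] = fresh
            return fresh
        # A transient "Unknown" falls back to the cached answer, if there is one.
        return self._last_concrete.get(key, "Unknown")
```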

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the apply layer's link-profile dedup keyed on transport
alone, so a peer reachable on multiple IPs (LAN + Tailscale +
link-local + ...) would have all of its socket profiles collapsed
into one slot — and the reconciler probing each IP in turn would
overwrite the previous, making the displayed bandwidth and
classification bounce ("Ethernet 1.1 Gbps" → "Unknown 400 Mbps" →
"Ethernet 1.1 Gbps" → ...).

Switch the dedup to the natural identity per transport:
- socket: (transport, sink_ip)  — one row per IP
- rdma:   (transport, source_iface, sink_iface)

Each connection now gets its own stable row. The dashboard's edge
label still shows max-up / max-down / min-RTT across all profiles,
so the summary is the best path; the hover tooltip shows the full
breakdown per connection.
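A rough sketch of the per-transport dedup key and the edge summary, under the assumption of a flat profile record. `LinkProfile`, `dedup_key`, `upsert`, `edge_summary`, and the field names are hypothetical stand-ins for the real types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkProfile:
    transport: str                      # "socket" or "rdma"
    sink_ip: Optional[str] = None       # used by socket profiles
    source_iface: Optional[str] = None  # used by RDMA profiles
    sink_iface: Optional[str] = None    # used by RDMA profiles
    upload_gbps: float = 0.0
    download_gbps: float = 0.0
    rtt_ms: float = 0.0


def dedup_key(p: LinkProfile) -> tuple:
    """Natural identity per transport: one row per IP, or per interface pair."""
    if p.transport == "socket":
        return (p.transport, p.sink_ip)
    return (p.transport, p.source_iface, p.sink_iface)


def upsert(rows: dict, p: LinkProfile) -> None:
    """A re-probe of the same connection overwrites only its own row."""
    rows[dedup_key(p)] = p


def edge_summary(rows: dict) -> tuple[float, float, float]:
    """Best path across all rows: (max upload, max download, min RTT)."""
    profiles = list(rows.values())
    return (
        max(p.upload_gbps for p in profiles),
        max(p.download_gbps for p in profiles),
        min(p.rtt_ms for p in profiles),
    )
```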

Verified live on the 4-node M3 Ultra cluster: james -> s14 has 4
distinct socket rows (link-local, LAN, Tailscale, and a slow path)
plus the RDMA row, all stable across 5+ minutes of probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added `latency_jitter_ms` to both SocketLinkProfile and
RDMALinkProfile. Defined as the mean of |Δ| between consecutive
RTT samples — same convention iperf3 reports, captures short-term
variance better than stddev.

- Socket: bumped LATENCY_SAMPLES from 5 → 10 so the mean-of-deltas
  is meaningful (4 deltas was thin).
- RDMA: 50 samples already, just compute the deltas alongside the
  median.
- State + apply: plumb `latency_jitter_ms` through. NodeSocketLinkProfile
  defaults to 0.0; NodeRdmaLinkProfile to None (matches the rest of
  its optional fields).
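The jitter definition above (mean absolute difference between consecutive RTT samples) is short enough to sketch directly. The function name is illustrative, and returning 0.0 for fewer than two samples is an assumption.

```python
def latency_jitter_ms(rtt_samples_ms: list[float]) -> float:
    """Mean of |delta| between consecutive RTT samples, in milliseconds."""
    if len(rtt_samples_ms) < 2:
        return 0.0  # assumption: no deltas available means no measurable jitter
    deltas = [abs(b - a) for a, b in zip(rtt_samples_ms, rtt_samples_ms[1:])]
    return sum(deltas) / len(deltas)
```

With 10 samples this averages 9 deltas, compared with 4 deltas for the earlier 5-sample setting.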

Dashboard: new "Jitter" column in the hover tooltip. Edge label
left alone — keeping it short.

Tooltip styling fixes pulled in along the way:
- `position: fixed` so it can escape `overflow: hidden` on the
  topology container — multi-row tooltips were getting clipped at
  the box edge.
- `white-space: nowrap` on table cells; bandwidth numbers were
  wrapping onto two lines and overlapping. Slightly wider per-cell
  padding for breathing room.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TT/2

GpuRichBar simplified to "ClusterStats": dropped the gradient bar +
scoring and kept just the three aggregate tiles (FP16 compute,
memory bandwidth, total memory). Centered, inline.

Edge labels reworked: instead of one combined "↑X ↓Y · Z RTT" label
per pair (which collided on tight layouts and forced ↑↓ glyphs to
disambiguate direction), now we render up to three labels per edge:

- Per-direction bandwidth — placed next to its arrow head, on the
  matching side of the midpoint. The arrow direction implies which
  way the number applies, so no glyphs needed.
- Latency centered at midpoint, on the *other* side of the edge so
  the eye doesn't have to disambiguate it from the bandwidth labels.

Latency display also switched from RTT to RTT/2 (one-way
approximation) — the topology edge label and the tooltip both show
RTT/2 now, with the column header explicitly labeled "RTT/2" so the
semantic is clear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that fix the label-pile-up at the cluster centroid and
make labels actually adjacent to their arrows:

- Push labels AWAY from the viewport centroid (outer side of each
  edge), not toward it. Previously every edge in a 4-node diamond
  ended up with its labels piled near the centroid because
  `towardCenter` pointed there. For diagonals whose midpoint *is*
  the centroid we pick a stable side based on edge direction.

- Latency also halved in the tooltip (jitter/2 alongside RTT/2),
  for consistency with the topology edge label.

End result: each perimeter edge has its A→B bandwidth, latency,
and B→A bandwidth strung along the outer side, each adjacent to
its arrow head. Crossing diagonals' labels form a "+" pattern at
the center instead of overlapping each other.
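One way to place a label on the outer side of an edge is to offset the midpoint along the edge normal that points away from the centroid. The sketch below is a Python illustration of that geometry, not the dashboard's code. `label_anchor` is a hypothetical name, and the tie-break (keep the left-hand normal of a→b) is an assumption about what "a stable side based on edge direction" could mean.

```python
import math

Point = tuple[float, float]


def label_anchor(a: Point, b: Point, centroid: Point, distance: float) -> Point:
    """Offset the midpoint of edge a->b by `distance`, away from the centroid."""
    mx, my = (a[0] + b[0]) / 2, (a[1] + b[1]) / 2
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return (mx, my)  # degenerate edge: nothing to offset along
    # Unit normal obtained by rotating the edge direction 90 degrees.
    nx, ny = -dy / length, dx / length
    # Positive when the normal points from the centroid toward the midpoint.
    side = nx * (mx - centroid[0]) + ny * (my - centroid[1])
    if side < -1e-9:
        nx, ny = -nx, -ny  # flip so the label lands on the outer side
    # When side is ~0 (midpoint at the centroid), keep the normal as-is.
    return (mx + nx * distance, my + ny * distance)
```

For a perimeter edge the result is the same whichever endpoint is listed first, while a diagonal through the centroid picks its side from the a→b direction.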

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>