feat(profilers): GPU + link profiling with directional measurements by AlexCheema · Pull Request #2010 · exo-explore/exo

AlexCheema · 2026-04-30T23:30:47Z

Summary

New active-probe profiler subsystem (src/exo/utils/profilers/) measuring GPU FP16 TFLOPS, GPU memory bandwidth, per-edge socket bandwidth (upload + download separately), socket RTT, per-edge RDMA bandwidth (rank0→rank1 and rank1→rank0 via mx.distributed.send/recv), and RDMA RTT.
State-driven reconciliation loop (15 s tick, replay-safe) with TTLs: GPU 1 h, socket 5 min, RDMA 6 h. GPU and RDMA probes skip while runners are active; all RDMA work serialised through a process-global lock. Subprocess isolation for jaccl init keeps the worker process clean.
Dashboard adds: a HuggingFace-style "GPU poor → GPU rich" bar (weighted blend of TFLOPS / memory bandwidth / memory against an 8× H100 anchor; FLOPS heaviest, memory least; per-dimension cap so a single insanely high axis still pulls the thumb right), per-node TFLOPS/bandwidth labels, a directional edge label (↑X ↓Y · Z RTT), and a hover tooltip with a per-direction breakdown including inferred connection type (Wi-Fi / Ethernet / Thunderbolt 4 / Thunderbolt 5 / RDMA — generation parsed from system_profiler's linkSpeed).
GPU-bench variance fix: long warm-up to lock the Apple GPU into peak performance state, plus best-of-N timing. Two healthy M3 Ultras now agree to ~2% (was ~15% before).
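The warm-up plus best-of-N idea in the last Summary item can be sketched in a few lines. This is an illustration, not the PR's benchmark code: `best_of_n_seconds`, `tflops`, the iteration counts, and the injectable `clock` are all hypothetical names and values. With a lazy framework such as MLX, the `kernel` callable would also have to force evaluation before returning, otherwise only graph construction is timed.

```python
import time
from typing import Callable


def best_of_n_seconds(
    kernel: Callable[[], None],
    warmup_iters: int = 20,   # hypothetical default, not taken from the PR
    timed_iters: int = 5,     # hypothetical default, not taken from the PR
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the fastest wall-clock duration of `kernel` over `timed_iters` runs."""
    # Untimed warm-up runs give the GPU time to ramp to its peak performance state.
    for _ in range(warmup_iters):
        kernel()

    best = float("inf")
    for _ in range(timed_iters):
        start = clock()
        kernel()
        # Keep the minimum: slow outliers are noise, the fastest run is closest to peak.
        best = min(best, clock() - start)
    return best


def tflops(flop_count: float, seconds: float) -> float:
    """Convert a FLOP count and a duration into TFLOPS."""
    return flop_count / seconds / 1e12
```

For a square N×N matmul the FLOP count is 2·N³, so `tflops(2 * n**3, best_of_n_seconds(run_matmul))` would give the reported figure, where `run_matmul` stands for whatever callable performs and evaluates the multiplication.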

Screenshots

Topology view (no hover)

Two-node M3 Ultra cluster over Thunderbolt 5 + Tailscale. The GPU rich bar shows 47.0 TFLOPS / 1.33 TB/s / 1.00 TB; the thumb sits in the middle yellow zone — driven by the maxed-out memory dimension while compute and bandwidth are well below an 8× H100 anchor. Per-node label under each Mac shows GPU profile (e.g. 23.6 TFLOPS · 661 GB/s). Edge label reads ↑74.77 Gbps ↓74.79 Gbps · 18 µs RTT.

Hovering an edge — directional breakdown tooltip

Each row is one measured profile (one direction × one transport). Connection type is inferred per row.

Implementation notes

Why MLX mx.sum instead of GEMV for memory bandwidth. GEMV pays for an inner reduction plus a vector broadcast and caps at ~58 % of peak DRAM bandwidth on Apple Silicon. A pure streaming mx.sum over a 2 GB FP16 buffer (well past M3 Ultra's ~96 MB SLC) hits ~80 % — the closer-to-the-ceiling number that's useful for placement decisions.
RDMA probe is subprocess-isolated. mx.distributed.init(backend="jaccl") is process-global; running it inside the worker would clash with active inference. Probe lives in rdma_probe_main.py, spawned via /profile/rdma_probe HTTP rendezvous.
One-way bandwidth, RTT-only latency. Socket uploads use server-side timing (POST /profile/upload returns its own recv_duration_ms) so the small response's RTT doesn't pollute the result. RDMA uses mx.distributed.send/recv per direction. Latency stays as RTT — true one-way needs sub-µs clock sync that we don't have, and synthetic half-RTT split would be misleading.
State-driven reconciliation, not event-reactive. EXO replays the event log on master changes; reactive probe triggers would re-fire historical events. The reconciler reads current state every tick and probes anything missing or past TTL.
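As a rough illustration of the state-driven reconciliation described in the notes above, the sketch below decides which probes are due on a given tick. The TTLs and the "skip GPU and RDMA while runners are active" rule come from the Summary; `ClusterState`, `due_probes`, and the `(kind, target)` keying are hypothetical and are not the PR's actual types.

```python
from dataclasses import dataclass, field

# TTLs from the PR summary: GPU 1 h, socket 5 min, RDMA 6 h.
TTL_SECONDS = {"gpu": 3600.0, "socket": 300.0, "rdma": 6 * 3600.0}
# Probe kinds that must not compete with active inference.
SKIP_WHILE_RUNNERS_ACTIVE = {"gpu", "rdma"}


@dataclass
class ClusterState:
    """Hypothetical stand-in for the state the reconciler reads each tick."""
    runners_active: bool = False
    # (kind, target) -> timestamp of the last recorded measurement
    last_measured: dict[tuple[str, str], float] = field(default_factory=dict)


def due_probes(
    state: ClusterState, wanted: list[tuple[str, str]], now: float
) -> list[tuple[str, str]]:
    """Return the probes that are missing or past their TTL right now."""
    due = []
    for kind, target in wanted:
        if kind in SKIP_WHILE_RUNNERS_ACTIVE and state.runners_active:
            continue  # retried on a later tick once runners are idle
        measured_at = state.last_measured.get((kind, target))
        if measured_at is None or now - measured_at >= TTL_SECONDS[kind]:
            due.append((kind, target))
    return due
```

Because the decision depends only on current state and the clock, replaying the event log does not schedule anything by itself; a probe runs only if its result is absent or stale when the tick fires.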

Test plan

uv run basedpyright && uv run ruff check && nix fmt && uv run pytest — all green (416 pass, 16 new tests in src/exo/utils/profilers/tests/).
Two-node end-to-end on M3 Ultra cluster (mike + james over TB5 + Tailscale): GPU profile, socket bandwidth (upload + download), socket RTT, RDMA bandwidth (upload + download), RDMA RTT all populated; values stable across hourly probes.
Connection-type inference: TB5 RDMA / Wi-Fi / Ethernet / Tailscale all classified correctly (Tailscale shows as "Unknown" by design — we don't fake CGNAT classification).
Inference safety: starting a generation while a GPU/RDMA probe interval elapses skips the probe cleanly (no contention, no errors).
Reviewer eyeball: confirm dashboard rendering on a multi-node cluster including hover behaviour.

🤖 Generated with Claude Code

Adds an active-probe profiler subsystem alongside the existing info gatherer. Each node measures:
- GPU FP16 TFLOPS (square FP16 matmul) and memory bandwidth (large `mx.sum` streaming read). Long warm-up + best-of-N timing keeps two M3 Ultras agreeing to within ~2%.
- Per-edge socket bandwidth (upload + download separately, server times the receive on uploads) and RTT, via new `/profile/echo`, `/profile/upload`, and `/profile/download` endpoints.
- Per-edge RDMA bandwidth (rank0->rank1 and rank1->rank0 via `mx.distributed.send`/`recv`) and RTT (tiny-payload `all_sum` ping-pong), in a child process so jaccl init doesn't poison the worker process.

Scheduling is a state-driven reconciliation loop (15s tick) so the event log can be replayed without re-firing probes. TTLs: GPU 1h, socket 5m, RDMA 6h. GPU and RDMA probes skip while runners are active; all RDMA work serialised through a process-global lock.

Dashboard:
- New GpuRichBar component at the top of the topology view: gradient thumb scored as a weighted blend of TFLOPS / bandwidth / memory against an 8x H100 anchor (TFLOPS heaviest, memory least), with a per-dimension cap so a single insanely high axis still pulls the thumb right.
- Per-node TFLOPS / memory bandwidth labels under each device.
- Edge label shows max upload / max download / min RTT across all profiles for the pair.
- Hovering an edge shows a breakdown table: direction, inferred connection type (Wi-Fi / Ethernet / Thunderbolt N / RDMA), upload / download / RTT per profile.
- Connection-type inference reads `nodeNetwork.interfaceType` for socket edges and parses `nodeThunderbolt.linkSpeed` (e.g. "Up to 80 Gb/s") to distinguish TB4 from TB5 on RDMA edges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
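The `linkSpeed` parsing mentioned in the last dashboard item could look roughly like the sketch below. The dashboard itself is not written in Python and the PR's actual rule is not shown here, so treat `thunderbolt_generation` and its thresholds as assumptions: it maps 80 Gb/s and above to Thunderbolt 5 and 40 Gb/s to Thunderbolt 4, following the nominal link rates of those generations.

```python
import re


def thunderbolt_generation(link_speed: str) -> str:
    """Guess the Thunderbolt generation from a string like 'Up to 80 Gb/s'.

    Thresholds are an assumption based on nominal link rates
    (Thunderbolt 4: 40 Gb/s, Thunderbolt 5: 80 Gb/s).
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*Gb/s", link_speed)
    if match is None:
        return "Thunderbolt"  # speed missing or unparseable: do not guess a generation
    gbps = float(match.group(1))
    if gbps >= 80:
        return "Thunderbolt 5"
    if gbps >= 40:
        return "Thunderbolt 4"
    return "Thunderbolt"
```

Returning a generic label when the string cannot be parsed keeps the tooltip honest, in the same spirit as showing "Unknown" for Tailscale paths.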

The lock was a defensive guess that simultaneous jaccl/RDMA probes would fight for the Thunderbolt RDMA hardware. They don't: concurrent processes get independent QPs. Verified end-to-end on a 2x M3 Ultra cluster: both nodes start probes simultaneously at discovery and both complete cleanly with sensible numbers.

Keep the `state.runners` gate; that one is still real (don't compete with active inference traffic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The white circle reads as a slider thumb that the user can grab and drag, but the bar is a passive readout. Apple HIG distinguishes controls (slider with prominent thumb) from indicators (gauge, with a tick or filled portion). For a "where do you fall on this scale" display, the indicator pattern is the right one.

Replace the circular thumb with:
- A thin 2px vertical tick line through the track at the value, with a dark halo so it stays legible across the red->green gradient.
- The gradient past the marker is dimmed so the eye lands on the position instead of perceiving a static spectrum with a tick on it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The edge metrics label was "↑X ↓Y · Z µs RTT". The "RTT" suffix was redundant once the unit was already there, and it made the label crowded. Keep the "RTT" column header in the hover tooltip, where the explicit semantic still helps readers parse the table.

Also picks up minor nix-fmt comment-spacing tweaks across the profiler modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`mx.random.uniform(shape=..., dtype=float16)` internally generates fp32 then casts to fp16, which doubles the peak Metal allocation: our 2 GiB fp16 buffer briefly needs 4 GiB of heap. CI's macOS runner has max_buffer_size = 3.5 GiB and rejected the alloc:

RuntimeError: [metal::malloc] Attempting to allocate 4294967296 bytes which is greater than the maximum allowed buffer size of 3758096384 bytes.

DRAM bandwidth is independent of the values being streamed, so we just allocate `mx.zeros` directly. No fp32 temp, peak heap = exactly the 2 GiB we want.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
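The byte counts in the error message can be checked with plain arithmetic. The sketch below reproduces them without touching MLX; the variable names are illustrative, and the only inputs are the 2 GiB target and the limit quoted in the error.

```python
GIB = 1024**3
FP16_BYTES = 2
FP32_BYTES = 4

target_bytes = 2 * GIB                       # the fp16 buffer the benchmark wants
n_elements = target_bytes // FP16_BYTES      # 2**30 half-precision values
fp32_temp_bytes = n_elements * FP32_BYTES    # size of an fp32 temporary with the same element count
max_buffer_bytes = 3_758_096_384             # limit quoted in the CI error message (3.5 GiB)


def fits(nbytes: int, limit: int = max_buffer_bytes) -> bool:
    """True if a single allocation of `nbytes` is within the buffer limit."""
    return nbytes <= limit


print(fp32_temp_bytes)                  # size of the rejected allocation
print(fits(fp32_temp_bytes))            # the fp32 temporary exceeds the limit
print(fits(target_bytes))               # the fp16 buffer on its own fits
```

The fp32 temporary alone is 4294967296 bytes, which matches the size in the `metal::malloc` error, while the 2 GiB fp16 buffer is comfortably under the 3.5 GiB ceiling.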

The dashboard re-derives the connection type on every reactive update. The source data, `nodeNetwork[peer].interfaces`, is re-parsed on the backend every 10 s from `networksetup` and occasionally drops an entry for one tick before the next refresh puts it back. Without this fix the user sees the label flicker between "Ethernet" and "Unknown" several times a minute.

Solution: cache the last *concrete* (non-"Unknown") classification per (sinkNodeId, sinkIp). When a fresh lookup returns "Unknown" we ignore it in favour of the cached answer. Concrete answers always update the cache, so a real network change propagates immediately. Cache is bounded by O(N²) for N nodes (one entry per directed edge × IP), so no leak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
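The caching rule in this commit message is small enough to sketch directly. The real fix lives in the dashboard; the Python below only mirrors the logic, and `StickyClassifier` and `resolve` are hypothetical names.

```python
class StickyClassifier:
    """Remember the last concrete connection type per (sink node, sink IP)."""

    def __init__(self) -> None:
        self._last_concrete: dict[tuple[str, str], str] = {}

    def resolve(self, sink_node_id: str, sink_ip: str, fresh: str) -> str:
        key = (sink_node_id, sink_ip)
        if fresh != "Unknown":
            # Concrete answers always win and refresh the cache,
            # so a real network change shows up immediately.
            self._last_concrete[key] = fresh
            return fresh
        # A transient "Unknown" falls back to the last concrete answer, if any.
        return self._last_concrete.get(key, "Unknown")
```

A one-tick gap in the interface list therefore leaves the label unchanged, while an edge that has never been classified still reads "Unknown".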

Previously the apply layer's link-profile dedup keyed on transport alone, so a peer reachable on multiple IPs (LAN + Tailscale + link-local + ...) would have all of its socket profiles collapsed into one slot, and the reconciler probing each IP in turn would overwrite the previous, making the displayed bandwidth and classification bounce ("Ethernet 1.1 Gbps" → "Unknown 400 Mbps" → "Ethernet 1.1 Gbps" → ...).

Switch the dedup to the natural identity per transport:
- socket: (transport, sink_ip), one row per IP
- rdma: (transport, source_iface, sink_iface)

Each connection now gets its own stable row. The dashboard's edge label still shows max-up / max-down / min-RTT across all profiles, so the summary is the best path; the hover tooltip shows the full breakdown per connection.

Verified live on the 4-node M3 Ultra cluster: james -> s14 has 4 distinct socket rows (link-local, LAN, Tailscale, and a slow path) plus the RDMA row, all stable across 5+ minutes of probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
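A minimal sketch of the per-transport identity and of the edge summary described above. `LinkProfile`, `identity`, `upsert`, and `edge_summary` are hypothetical names, and the field layout is assumed rather than copied from the PR's `SocketLinkProfile` / `RDMALinkProfile` types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkProfile:
    """Simplified, hypothetical stand-in for one measured link profile."""
    transport: str                      # "socket" or "rdma"
    sink_ip: Optional[str] = None       # set for socket profiles
    source_iface: Optional[str] = None  # set for RDMA profiles
    sink_iface: Optional[str] = None    # set for RDMA profiles
    upload_gbps: float = 0.0
    download_gbps: float = 0.0
    rtt_ms: float = 0.0


def identity(p: LinkProfile) -> tuple:
    """Dedup key: one row per IP for sockets, one per interface pair for RDMA."""
    if p.transport == "socket":
        return (p.transport, p.sink_ip)
    return (p.transport, p.source_iface, p.sink_iface)


def upsert(rows: dict[tuple, LinkProfile], p: LinkProfile) -> None:
    """Replace only the row that shares this profile's identity."""
    rows[identity(p)] = p


def edge_summary(rows: dict[tuple, LinkProfile]) -> tuple[float, float, float]:
    """Edge label summary: max upload, max download, min RTT across all rows."""
    profiles = list(rows.values())
    return (
        max(p.upload_gbps for p in profiles),
        max(p.download_gbps for p in profiles),
        min(p.rtt_ms for p in profiles),
    )
```

With this keying, re-probing the LAN address updates only the LAN row, so a slower Tailscale measurement can no longer overwrite it and the summary stays pinned to the best path.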

Added `latency_jitter_ms` to both SocketLinkProfile and RDMALinkProfile. Defined as the mean of |Δ| between consecutive RTT samples: same convention iperf3 reports, captures short-term variance better than stddev.
- Socket: bumped LATENCY_SAMPLES from 5 → 10 so the mean-of-deltas is meaningful (4 deltas was thin).
- RDMA: 50 samples already, just compute the deltas alongside the median.
- State + apply: plumb `latency_jitter_ms` through. NodeSocketLinkProfile defaults to 0.0; NodeRdmaLinkProfile to None (matches the rest of its optional fields).

Dashboard: new "Jitter" column in the hover tooltip. Edge label left alone, keeping it short.

Tooltip styling fixes pulled in along the way:
- `position: fixed` so it can escape `overflow: hidden` on the topology container; multi-row tooltips were getting clipped at the box edge.
- `white-space: nowrap` on table cells; bandwidth numbers were wrapping onto two lines and overlapping. Slightly wider per-cell padding for breathing room.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
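The jitter definition above (mean absolute difference between consecutive RTT samples) translates into a few lines of Python. This is a sketch of the formula only; the function name matches the field name for readability and is not claimed to be the PR's implementation.

```python
from statistics import mean


def latency_jitter_ms(rtt_samples_ms: list[float]) -> float:
    """Mean of |delta| between consecutive RTT samples, in milliseconds."""
    if len(rtt_samples_ms) < 2:
        return 0.0  # no consecutive pair, so no delta to average
    deltas = [abs(b - a) for a, b in zip(rtt_samples_ms, rtt_samples_ms[1:])]
    return mean(deltas)
```

For the samples `[10.0, 12.0, 11.0, 15.0]` the deltas are 2, 1 and 4, giving a jitter of 7/3 ms. Ten samples yield nine deltas, compared with four deltas from the earlier five samples.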

…TT/2

GpuRichBar simplified to "ClusterStats": dropped the gradient bar + scoring and kept just the three aggregate tiles (FP16 compute, memory bandwidth, total memory). Centered, inline.

Edge labels reworked: instead of one combined "↑X ↓Y · Z RTT" label per pair (which collided on tight layouts and forced ↑↓ glyphs to disambiguate direction), now we render up to three labels per edge:
- Per-direction bandwidth: placed next to its arrow head, on the matching side of the midpoint. The arrow direction implies which way the number applies, so no glyphs needed.
- Latency centered at midpoint, on the *other* side of the edge so the eye doesn't have to disambiguate it from the bandwidth labels.

Latency display also switched from RTT to RTT/2 (one-way approximation): the topology edge label and the tooltip both show RTT/2 now, with the column header explicitly labeled "RTT/2" so the semantic is clear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two changes that fix the label-pile-up at the cluster centroid and make labels actually adjacent to their arrows:
- Push labels AWAY from the viewport centroid (outer side of each edge), not toward it. Previously every edge in a 4-node diamond ended up with its labels piled near the centroid because `towardCenter` pointed there. For diagonals whose midpoint *is* the centroid we pick a stable side based on edge direction.
- Latency also halved in the tooltip (jitter/2 alongside RTT/2), for consistency with the topology edge label.

End result: each perimeter edge has its A→B bandwidth, latency, and B→A bandwidth strung along the outer side, each adjacent to its arrow head. Crossing diagonals' labels form a "+" pattern at the center instead of overlapping each other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
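The "push labels away from the centroid" rule can be illustrated with a little vector geometry. The dashboard code is not Python and is not reproduced here; `outward_normal`, `label_position`, and the tie-break (keep the left-hand normal of A→B) are assumptions made for this sketch.

```python
import math

Point = tuple[float, float]


def outward_normal(a: Point, b: Point, centroid: Point, eps: float = 1e-9) -> Point:
    """Unit normal of edge a->b that points away from the centroid."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length < eps:
        return (0.0, 0.0)  # degenerate edge: no meaningful direction
    nx, ny = -dy / length, dx / length  # left-hand normal of a->b
    mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    # Positive when the normal points from the centroid toward the midpoint side.
    side = nx * (mid[0] - centroid[0]) + ny * (mid[1] - centroid[1])
    if abs(side) < eps:
        # The centroid sits on the edge's line (e.g. a diagonal through the centre):
        # neither side is "outer", so keep a side that depends only on edge direction.
        return (nx, ny)
    return (nx, ny) if side > 0 else (-nx, -ny)


def label_position(a: Point, b: Point, centroid: Point, offset: float) -> Point:
    """Midpoint of the edge pushed `offset` units to its outer side."""
    nx, ny = outward_normal(a, b, centroid)
    return ((a[0] + b[0]) / 2 + nx * offset, (a[1] + b[1]) / 2 + ny * offset)
```

On a square with its centroid in the middle, every perimeter edge gets its label outside the square, while a diagonal through the centre falls back to the direction-based side.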

AlexCheema and others added 6 commits May 1, 2026 01:21

temp: add screenshots for PR description

7452a94

chore: remove temporary PR screenshot files

871414e

AlexCheema force-pushed the alexcheema/profilers-dashboard branch from c0644e3 to e5760dd May 1, 2026 00:23

AlexCheema and others added 7 commits May 1, 2026 01:30

fix(profilers): use local RDMA interfaces in probe matrix

d1f4b24

AlexCheema wants to merge 13 commits into main from alexcheema/profilers-dashboard
