feat(profilers): GPU + link profiling with directional measurements#2010
Open
AlexCheema wants to merge 13 commits into
Adds an active-probe profiler subsystem alongside the existing info gatherer. Each node measures:
- GPU FP16 TFLOPS (square FP16 matmul) and memory bandwidth (large `mx.sum` streaming read). Long warm-up + best-of-N timing keeps two M3 Ultras agreeing to within ~2%.
- Per-edge socket bandwidth (upload + download separately, server times the receive on uploads) and RTT, via new `/profile/echo`, `/profile/upload`, and `/profile/download` endpoints.
- Per-edge RDMA bandwidth (rank0->rank1 and rank1->rank0 via `mx.distributed.send`/`recv`) and RTT (tiny-payload `all_sum` ping-pong), in a child process so jaccl init doesn't poison the worker process.

Scheduling is a state-driven reconciliation loop (15s tick) so the event log can be replayed without re-firing probes. TTLs: GPU 1h, socket 5m, RDMA 6h. GPU and RDMA probes skip while runners are active; all RDMA work is serialised through a process-global lock.

Dashboard:
- New GpuRichBar component at the top of the topology view: gradient thumb scored as a weighted blend of TFLOPS / bandwidth / memory against an 8x H100 anchor (TFLOPS heaviest, memory least), with a per-dimension cap so a single insanely high axis still pulls the thumb right.
- Per-node TFLOPS / memory bandwidth labels under each device.
- Edge label shows max upload / max download / min RTT across all profiles for the pair.
- Hovering an edge shows a breakdown table: direction, inferred connection type (Wi-Fi / Ethernet / Thunderbolt N / RDMA), upload / download / RTT per profile.
- Connection-type inference reads `nodeNetwork.interfaceType` for socket edges and parses `nodeThunderbolt.linkSpeed` (e.g. "Up to 80 Gb/s") to distinguish TB4 from TB5 on RDMA edges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
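As a rough illustration of the TTL-driven scheduling described above, here is a minimal sketch of the "which probes are due" decision. The function name `due_probes` and the `last_measured` mapping are hypothetical and do not come from the PR; only the TTL values and the rule that GPU and RDMA probes skip while runners are active are taken from the commit message.

```python
# TTLs quoted in the commit message, in seconds.
TTL_SECONDS = {"gpu": 3600.0, "socket": 300.0, "rdma": 6 * 3600.0}

# Probe kinds the commit message says are skipped while runners are active.
SKIP_WHILE_RUNNERS_ACTIVE = {"gpu", "rdma"}


def due_probes(
    last_measured: dict[str, float], now: float, runners_active: bool
) -> list[str]:
    """Return the probe kinds whose last result is missing or past its TTL.

    The decision depends only on state (timestamps of stored results), so
    replaying the same state yields the same answer without side effects.
    """
    due = []
    for kind, ttl in TTL_SECONDS.items():
        if runners_active and kind in SKIP_WHILE_RUNNERS_ACTIVE:
            continue  # do not compete with active inference
        measured_at = last_measured.get(kind)
        if measured_at is None or now - measured_at >= ttl:
            due.append(kind)
    return due
```

A reconciliation tick (every 15 s in the PR) would call something like this with the current state and fire only the probes it returns.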
The lock was a defensive guess that simultaneous jaccl/RDMA probes would fight for the Thunderbolt RDMA hardware. They don't: concurrent processes get independent QPs. Verified end-to-end on a 2x M3 Ultra cluster: both nodes start probes simultaneously at discovery and both complete cleanly with sensible numbers.

Keep the `state.runners` gate; that one is still real (don't compete with active inference traffic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The white circle reads as a slider thumb that the user can grab and drag, but the bar is a passive readout. Apple HIG distinguishes controls (slider with prominent thumb) from indicators (gauge, with a tick or filled portion). For a "where do you fall on this scale" display, the indicator pattern is the right one.

Replace the circular thumb with:
- A thin 2px vertical tick line through the track at the value, with a dark halo so it stays legible across the red->green gradient.
- The gradient past the marker is dimmed so the eye lands on the position instead of perceiving a static spectrum with a tick on it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The edge metrics label was "↑X ↓Y · Z µs RTT"; the "RTT" suffix was redundant once the unit was already there and made the label crowded. Keep the "RTT" column header in the hover tooltip where the explicit semantic still helps readers parse the table.

Also picks up minor nix-fmt comment-spacing tweaks across the profiler modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`mx.random.uniform(shape=..., dtype=float16)` internally generates
fp32 then casts to fp16, which doubles the peak Metal allocation —
our 2 GiB fp16 buffer briefly needs 4 GiB of heap. CI's macOS
runner has max_buffer_size = 3.5 GiB and rejected the alloc:
RuntimeError: [metal::malloc] Attempting to allocate 4294967296
bytes which is greater than the maximum allowed buffer size of
3758096384 bytes.
DRAM bandwidth is independent of the values being streamed, so we
just allocate `mx.zeros` directly. No fp32 temp, peak heap = exactly
the 2 GiB we want.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
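The arithmetic behind the failed allocation can be checked with a short sketch. The helper `peak_alloc_bytes` is hypothetical and only models the sizes described in the commit message (an fp32 temporary at 4 bytes per element versus a direct fp16 buffer at 2 bytes per element); it does not call MLX.

```python
GIB = 1024**3

# Limit reported in the CI error message (3.5 GiB).
MAX_BUFFER_SIZE = 3758096384

# A 2 GiB fp16 buffer holds 2**30 elements at 2 bytes each.
N_ELEMENTS = (2 * GIB) // 2


def peak_alloc_bytes(n_elements: int, via_fp32_temp: bool) -> int:
    """Largest single buffer requested while building an fp16 array.

    via_fp32_temp=True models generating fp32 values and casting to fp16;
    via_fp32_temp=False models allocating the fp16 buffer directly.
    """
    fp16_bytes = 2 * n_elements
    fp32_bytes = 4 * n_elements
    return fp32_bytes if via_fp32_temp else fp16_bytes


if __name__ == "__main__":
    for label, via_temp in (("fp32 temp + cast", True), ("direct fp16", False)):
        size = peak_alloc_bytes(N_ELEMENTS, via_temp)
        verdict = "fits" if size <= MAX_BUFFER_SIZE else "rejected"
        print(f"{label}: {size} bytes ({size / GIB:.1f} GiB) -> {verdict}")
```

The fp32 path requests 4294967296 bytes, which matches the figure in the error above and exceeds the 3758096384-byte limit; the direct fp16 path requests 2 GiB and fits.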
The dashboard re-derives the connection type on every reactive update. The source data, `nodeNetwork[peer].interfaces`, is re-parsed on the backend every 10 s from `networksetup` and occasionally drops an entry for one tick before the next refresh puts it back. Without this fix the user sees the label flicker between "Ethernet" and "Unknown" several times a minute.

Solution: cache the last *concrete* (non-"Unknown") classification per (sinkNodeId, sinkIp). When a fresh lookup returns "Unknown" we ignore it in favour of the cached answer. Concrete answers always update the cache, so a real network change propagates immediately. Cache is bounded by O(N²) for N nodes (one entry per directed edge × IP), so no leak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
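The caching rule above can be sketched in a few lines. This is a Python illustration of the logic only; the dashboard's real implementation is not shown in this PR text, and the class name `StickyClassifier` and its method are hypothetical.

```python
UNKNOWN = "Unknown"


class StickyClassifier:
    """Remember the last concrete classification per (sink node, sink IP)."""

    def __init__(self) -> None:
        self._last_concrete: dict[tuple[str, str], str] = {}

    def resolve(self, sink_node_id: str, sink_ip: str, fresh: str) -> str:
        key = (sink_node_id, sink_ip)
        if fresh != UNKNOWN:
            # Concrete answers always win, so real changes show up at once.
            self._last_concrete[key] = fresh
            return fresh
        # A transient "Unknown" falls back to the cached answer, if any.
        return self._last_concrete.get(key, UNKNOWN)
```

For example, a sequence of lookups `Ethernet`, `Unknown`, `Wi-Fi` for the same key resolves to `Ethernet`, `Ethernet`, `Wi-Fi`.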
Previously the apply layer's link-profile dedup keyed on transport
alone, so a peer reachable on multiple IPs (LAN + Tailscale +
link-local + ...) would have all of its socket profiles collapsed
into one slot — and the reconciler probing each IP in turn would
overwrite the previous, making the displayed bandwidth and
classification bounce ("Ethernet 1.1 Gbps" → "Unknown 400 Mbps" →
"Ethernet 1.1 Gbps" → ...).
Switch the dedup to the natural identity per transport:
- socket: (transport, sink_ip) — one row per IP
- rdma: (transport, source_iface, sink_iface)
Each connection now gets its own stable row. The dashboard's edge
label still shows max-up / max-down / min-RTT across all profiles,
so the summary is the best path; the hover tooltip shows the full
breakdown per connection.
Verified live on the 4-node M3 Ultra cluster: james -> s14 has 4
distinct socket rows (link-local, LAN, Tailscale, and a slow path)
plus the RDMA row, all stable across 5+ minutes of probes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
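A minimal sketch of the per-transport dedup key described in this commit. The `LinkProfile` dataclass and the `apply_profile` helper are hypothetical stand-ins for the PR's real state types; only the key shapes, (transport, sink_ip) for socket and (transport, source_iface, sink_iface) for RDMA, come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkProfile:
    # Hypothetical, simplified profile record for illustration.
    transport: str  # "socket" or "rdma"
    sink_ip: Optional[str] = None
    source_iface: Optional[str] = None
    sink_iface: Optional[str] = None
    bandwidth_gbps: float = 0.0


def dedup_key(profile: LinkProfile) -> tuple:
    """Natural identity of a connection, per transport."""
    if profile.transport == "socket":
        return (profile.transport, profile.sink_ip)
    if profile.transport == "rdma":
        return (profile.transport, profile.source_iface, profile.sink_iface)
    raise ValueError(f"unknown transport: {profile.transport!r}")


def apply_profile(rows: dict[tuple, LinkProfile], profile: LinkProfile) -> None:
    """Insert or overwrite the row that belongs to this connection only."""
    rows[dedup_key(profile)] = profile
```

With this key, probing a second IP of the same peer adds a row instead of overwriting the first one, while re-probing the same IP still replaces its own row.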
Added `latency_jitter_ms` to both SocketLinkProfile and RDMALinkProfile. Defined as the mean of |Δ| between consecutive RTT samples: same convention iperf3 reports, captures short-term variance better than stddev.
- Socket: bumped LATENCY_SAMPLES from 5 → 10 so the mean-of-deltas is meaningful (4 deltas was thin).
- RDMA: 50 samples already, just compute the deltas alongside the median.
- State + apply: plumb `latency_jitter_ms` through. NodeSocketLinkProfile defaults to 0.0; NodeRdmaLinkProfile to None (matches the rest of its optional fields).

Dashboard: new "Jitter" column in the hover tooltip. Edge label left alone to keep it short.

Tooltip styling fixes pulled in along the way:
- `position: fixed` so it can escape `overflow: hidden` on the topology container; multi-row tooltips were getting clipped at the box edge.
- `white-space: nowrap` on table cells; bandwidth numbers were wrapping onto two lines and overlapping. Slightly wider per-cell padding for breathing room.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
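The jitter definition above (mean of |Δ| between consecutive RTT samples) can be written out as a short sketch. The function name mirrors the field name from the commit, but the body is an illustration under that stated definition, not the PR's code; returning 0.0 for fewer than two samples is an assumption.

```python
from collections.abc import Sequence


def latency_jitter_ms(rtt_samples_ms: Sequence[float]) -> float:
    """Mean absolute difference between consecutive RTT samples.

    N samples give N - 1 deltas, which is why 5 samples (4 deltas) were
    considered thin and the socket probe moved to 10 samples.
    """
    if len(rtt_samples_ms) < 2:
        return 0.0  # assumption: no deltas means no measurable jitter
    deltas = [abs(b - a) for a, b in zip(rtt_samples_ms, rtt_samples_ms[1:])]
    return sum(deltas) / len(deltas)
```

For samples `[1.0, 2.0, 1.0, 3.0]` the deltas are 1, 1 and 2, so the jitter is 4/3 ms.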
…TT/2

GpuRichBar simplified to "ClusterStats": dropped the gradient bar + scoring and kept just the three aggregate tiles (FP16 compute, memory bandwidth, total memory). Centered, inline.

Edge labels reworked: instead of one combined "↑X ↓Y · Z RTT" label per pair (which collided on tight layouts and forced ↑↓ glyphs to disambiguate direction), now we render up to three labels per edge:
- Per-direction bandwidth: placed next to its arrow head, on the matching side of the midpoint. The arrow direction implies which way the number applies, so no glyphs needed.
- Latency centered at midpoint, on the *other* side of the edge so the eye doesn't have to disambiguate it from the bandwidth labels.

Latency display also switched from RTT to RTT/2 (one-way approximation): the topology edge label and the tooltip both show RTT/2 now, with the column header explicitly labeled "RTT/2" so the semantic is clear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that fix the label-pile-up at the cluster centroid and make labels actually adjacent to their arrows:
- Push labels AWAY from the viewport centroid (outer side of each edge), not toward it. Previously every edge in a 4-node diamond ended up with its labels piled near the centroid because `towardCenter` pointed there. For diagonals whose midpoint *is* the centroid we pick a stable side based on edge direction.
- Latency also halved in the tooltip (jitter/2 alongside RTT/2), for consistency with the topology edge label.

End result: each perimeter edge has its A→B bandwidth, latency, and B→A bandwidth strung along the outer side, each adjacent to its arrow head. Crossing diagonals' labels form a "+" pattern at the center instead of overlapping each other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
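The "push labels away from the centroid" rule can be illustrated with a small geometric sketch. This is a Python rendering of the idea only; the function name, the `eps` tolerance, and the choice of the left-hand normal as the tie-break for edges whose midpoint sits on the centroid are all assumptions, not the dashboard's actual code.

```python
import math

Point = tuple[float, float]


def outward_label_position(
    a: Point, b: Point, centroid: Point, distance: float, eps: float = 1e-9
) -> Point:
    """Place a label `distance` away from the midpoint of edge a->b,
    on the side of the edge that faces away from `centroid`."""
    mid_x, mid_y = (a[0] + b[0]) / 2, (a[1] + b[1]) / 2
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return (mid_x, mid_y)  # degenerate edge: nothing to offset against
    # Unit normal on the left-hand side of the a->b direction.
    nx, ny = -dy / length, dx / length
    # Positive when the normal points away from the centroid.
    side = nx * (mid_x - centroid[0]) + ny * (mid_y - centroid[1])
    if abs(side) < eps:
        sign = 1.0  # tie-break: stable side derived from edge direction
    else:
        sign = 1.0 if side > 0 else -1.0
    return (mid_x + sign * nx * distance, mid_y + sign * ny * distance)
```

For a perimeter edge the result is the same whichever endpoint is passed first, so both directions of an edge agree on which side is "outer".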
Summary
- …`src/exo/utils/profilers/`) measuring GPU FP16 TFLOPS, GPU memory bandwidth, per-edge socket bandwidth (upload + download separately), socket RTT, per-edge RDMA bandwidth (rank0→rank1 and rank1→rank0 via `mx.distributed.send`/`recv`), and RDMA RTT.
- …`system_profiler`'s linkSpeed).

Screenshots
Topology view (no hover)
Two-node M3 Ultra cluster over Thunderbolt 5 + Tailscale. The GPU rich bar shows 47.0 TFLOPS / 1.33 TB/s / 1.00 TB; the thumb sits in the middle yellow zone, driven by the maxed-out memory dimension while compute and bandwidth are well below an 8× H100 anchor. The per-node label under each Mac shows its GPU profile (e.g. 23.6 TFLOPS · 661 GB/s). The edge label reads ↑74.77 Gbps ↓74.79 Gbps · 18 µs RTT.

Hovering an edge: directional breakdown tooltip
Each row is one measured profile (one direction × one transport). Connection type is inferred per row.
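As an illustration of the per-row inference for Thunderbolt links, here is a hedged sketch of parsing a `linkSpeed` string such as "Up to 80 Gb/s". The function names are hypothetical, and the thresholds (80 Gb/s and above labelled Thunderbolt 5, 40 Gb/s and above labelled Thunderbolt 4) are an assumption based on the nominal link rates, not taken from the PR's code.

```python
import re
from typing import Optional

# Matches the numeric part of strings like "Up to 80 Gb/s".
_LINK_SPEED_RE = re.compile(r"(\d+(?:\.\d+)?)\s*Gb/s", re.IGNORECASE)


def parse_link_speed_gbps(link_speed: str) -> Optional[float]:
    """Extract the Gb/s figure from a linkSpeed string, or None if absent."""
    match = _LINK_SPEED_RE.search(link_speed)
    return float(match.group(1)) if match else None


def thunderbolt_label(link_speed: str) -> str:
    """Map a linkSpeed string to a display label (assumed thresholds)."""
    gbps = parse_link_speed_gbps(link_speed)
    if gbps is None:
        return "Unknown"
    if gbps >= 80:
        return "Thunderbolt 5"
    if gbps >= 40:
        return "Thunderbolt 4"
    return "Thunderbolt"
```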
Implementation notes
- `mx.sum` instead of GEMV for memory bandwidth. GEMV pays for an inner reduction plus a vector broadcast and caps at ~58% of peak DRAM bandwidth on Apple Silicon. A pure streaming `mx.sum` over a 2 GB FP16 buffer (well past M3 Ultra's ~96 MB SLC) hits ~80%, the closer-to-the-ceiling number that's useful for placement decisions.
- `mx.distributed.init(backend="jaccl")` is process-global; running it inside the worker would clash with active inference. The probe lives in `rdma_probe_main.py`, spawned via the `/profile/rdma_probe` HTTP rendezvous.
- …`POST /profile/upload` returns its own `recv_duration_ms`) so the small response's RTT doesn't pollute the result. RDMA uses `mx.distributed.send`/`recv` per direction. Latency stays as RTT: true one-way needs sub-µs clock sync that we don't have, and a synthetic half-RTT split would be misleading.

Test plan
- `uv run basedpyright && uv run ruff check && nix fmt && uv run pytest`: all green (416 pass, 16 new tests in `src/exo/utils/profilers/tests/`).

🤖 Generated with Claude Code
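The "long warm-up + best-of-N timing" approach mentioned for the GPU probes can be sketched generically as follows. The helper names, default counts, and the injectable `clock` parameter are hypothetical; the PR's real probes time MLX operations, which this stdlib-only sketch does not attempt to reproduce.

```python
import time
from typing import Callable


def best_of_n_seconds(
    fn: Callable[[], object],
    warmup: int = 3,
    repeats: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run `fn` untimed `warmup` times, then return the fastest of `repeats`."""
    for _ in range(warmup):
        fn()  # warm caches / clocks; results are discarded
    best = float("inf")
    for _ in range(repeats):
        start = clock()
        fn()
        best = min(best, clock() - start)
    return best


def bandwidth_gb_per_s(bytes_moved: int, seconds: float) -> float:
    """Convert a byte count and a duration into decimal GB/s."""
    return bytes_moved / seconds / 1e9
```

Taking the minimum rather than the mean discards runs disturbed by scheduling noise, which is one plausible reason two identical machines would agree closely.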