feat: peer-to-peer model distribution #1992
Open
adurham wants to merge 2 commits into
added 2 commits
April 26, 2026 16:57
When more than one node in a cluster needs the same model, this lets nodes that already hold the weights serve them to peers over the local network instead of every node fetching independently from HuggingFace. On a Thunderbolt-meshed cluster a 200 GB model is pulled from the internet exactly once and then fans out at link-local speed.

Components
----------

* New ``src/exo/download/file_server.py`` exposes a small aiohttp server on ``EXO_FILE_SERVER_PORT`` (default 52416) that streams files out of any directory listed in ``EXO_MODELS_DIRS`` / ``EXO_MODELS_READ_ONLY_DIRS``. Supports HTTP Range for resumable downloads and rejects path-traversal via ``Path.is_relative_to``.
* New ``src/exo/download/peer_discovery.py`` walks the global ``download_status`` for any peer in ``DownloadCompleted`` and picks the best routable IPv4 address from that peer's ``NodeNetworkInfo``, ranking ``thunderbolt > maybe_ethernet > ethernet`` and skipping loopback / link-local / IPv6.
* ``StartDownload`` / ``DownloadModel`` now carry an optional ``repo_url``. When set, the shard downloader skips HF and uses curl (``-C -`` for resume, ``-f`` for fail-fast) to pull from the peer. Per-file parallelism is bumped to 16 concurrent files for the peer path since the bottleneck is local NIC throughput, not HF rate limits.
* New env var ``EXO_FILE_SERVER_BIND_HOST`` (default ``0.0.0.0``). Operators on untrusted networks can set it to ``127.0.0.1`` to disable peer serving on a node, or to a specific interface IP to narrow exposure.

Tests
-----

* ``tests/test_peer_discovery.py`` (16 cases): IP priority ordering, IPv6 / fe80 / 127 / 169.254 skip, no-completed-peers handling, multi-peer fallthrough, peer-is-self skip.
* ``tests/test_file_server.py`` (11 cases): Range / partial / 416, 404 paths, path-traversal rejection, multi-directory resolution, read-only mirror serving.
* ``tests/test_p2p_download.py`` (5 cases, skipped if curl absent): end-to-end round-trip via a real file_server on an ephemeral port, progress callback shape, resume from a partial file, curl error on 404 and unreachable host.

Docs
----

* ``docs/p2p-model-distribution.md`` covers the discovery/serve flow, the env vars, the curl runtime dependency, and the security stance (no auth on the listener, which assumes a private cluster; recommended firewall posture documented).
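
For reviewers, the address-ranking rule described for ``peer_discovery.py`` can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: ``Interface``, ``is_routable_ipv4`` and ``pick_peer_address`` are hypothetical names, the real ``NodeNetworkInfo`` shape may differ, and ranking unknown interface kinds last is a choice made for this sketch.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional

# Lower rank is preferred: thunderbolt > maybe_ethernet > ethernet.
_PRIORITY = {"thunderbolt": 0, "maybe_ethernet": 1, "ethernet": 2}


@dataclass(frozen=True)
class Interface:
    """Hypothetical stand-in for one interface entry of a peer."""

    kind: str
    address: str


def is_routable_ipv4(address: str) -> bool:
    """True for IPv4 addresses that are neither loopback nor link-local."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # unparsable addresses are never candidates
    return ip.version == 4 and not (ip.is_loopback or ip.is_link_local)


def pick_peer_address(interfaces: Iterable[Interface]) -> Optional[str]:
    """Return the best-ranked routable IPv4 address, or None if there is none."""
    candidates = [i for i in interfaces if is_routable_ipv4(i.address)]
    if not candidates:
        return None
    # Unknown interface kinds sort after the three named ones.
    best = min(candidates, key=lambda i: _PRIORITY.get(i.kind, len(_PRIORITY)))
    return best.address
```

A caller would try the next completed peer when this returns ``None``, which corresponds to the multi-peer fallthrough case covered by the tests.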
Cancellation of an active download was being modelled as a transition back to ``DownloadPending``, the same state used for "we haven't started yet". This conflated _paused with bytes on disk_ with _never ran_, and meant the worker plan loop happily auto-restarted a download the user had explicitly paused. Operators kept losing partial-download context across pauses.

* Adds ``DownloadPaused`` to the download state union, carrying the ``downloaded`` / ``total`` already accumulated on disk.
* Coordinator's existing pause path now writes ``DownloadPaused`` rather than ``DownloadPending``.
* Renames the user-facing operation cancel → pause across the API surface, the dashboard store, and the downloads page UI. The internal ``CancelDownload`` IPC command name is kept so the master/worker wiring and the legacy ``/download/cancel`` HTTP endpoint stay backwards-compatible.
* Adds ``POST /download/pause`` alongside the existing ``/download/cancel`` for clients that want the new naming. Both dispatch to the same handler under the hood.
* ``placement.py`` treats ``DownloadPaused`` identically to ``DownloadPending`` for cycle-fitness scoring: partial progress still counts toward the cycle's affinity for that model.
* ``test_cancel_download.py`` updated end-to-end to assert the new ``DownloadPaused`` event shape.
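
The state split can be illustrated with a small sketch. It assumes a simplified state union: only the ``downloaded`` / ``total`` fields on ``DownloadPaused`` come from this commit message, while ``DownloadOngoing``, ``pause`` and ``should_auto_start`` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class DownloadPending:
    """Never started; the plan loop may start it."""


@dataclass(frozen=True)
class DownloadOngoing:
    """Hypothetical in-progress state for this sketch."""

    downloaded: int
    total: int


@dataclass(frozen=True)
class DownloadPaused:
    """Explicitly paused; keeps the byte counts already on disk."""

    downloaded: int
    total: int


DownloadState = Union[DownloadPending, DownloadOngoing, DownloadPaused]


def pause(state: DownloadState) -> DownloadState:
    """Pausing an active download preserves its progress instead of resetting it."""
    if isinstance(state, DownloadOngoing):
        return DownloadPaused(downloaded=state.downloaded, total=state.total)
    return state


def should_auto_start(state: DownloadState) -> bool:
    """Only never-started downloads are picked up automatically."""
    return isinstance(state, DownloadPending)
```

With a distinct paused state, the plan loop can tell "paused by the user" apart from "not started yet" without inspecting the files on disk.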
adurham pushed a commit to adurham/exo that referenced this pull request on Apr 26, 2026
Re-authored the original P2P chain from "not started" to a 2-commit upstream PR after a security sweep surfaced the lateral path-traversal bug and a malformed-Range DoS. PR includes hash verification via X-File-SHA256 sidecars and a configurable concurrency cap.
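
The malformed-Range fix mentioned here can be sketched as a parser that never raises on bad input. This is a hedged illustration rather than the handler in this PR: ``parse_range`` and ``RangeNotSatisfiable`` are hypothetical names, and mapping an out-of-bounds start to 416 is an assumption based on the "Range / partial / 416" test cases.

```python
import re
from typing import Optional, Tuple

# Accept only the single-range forms "bytes=START-" and "bytes=START-END".
_RANGE_RE = re.compile(r"bytes=([0-9]+)-([0-9]*)")


class RangeNotSatisfiable(Exception):
    """Hypothetical signal that the handler should answer 416."""


def parse_range(header: Optional[str], size: int) -> Optional[Tuple[int, int]]:
    """Return an inclusive (start, end) range, or None to serve the whole file."""
    if not header:
        return None
    match = _RANGE_RE.fullmatch(header.strip())
    if match is None:
        # Non-numeric, multi-range, and suffix-form headers are ignored, not fatal.
        return None
    start = int(match.group(1))
    if start >= size:
        raise RangeNotSatisfiable(f"start {start} is beyond size {size}")
    end = int(match.group(2)) if match.group(2) else size - 1
    end = min(end, size - 1)  # clamp an over-long end to the last byte
    if end < start:
        return None  # inverted range: treat as absent
    return start, end
```

Because the regular expression only admits decimal digits, ``int()`` is never called on untrusted text such as ``abc``, which is the failure mode the security sweep describes.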
Summary
When more than one node in a cluster needs the same model, this lets nodes that already hold the weights serve them to peers over the local network instead of every node fetching independently from HuggingFace. On a Thunderbolt-meshed cluster, a 200 GB model is pulled from the internet exactly once and then fans out at link-local speed.
The PR is two commits, both reviewable independently:

* ``feat: peer-to-peer model distribution``: the file server, peer discovery, curl-based fetch with hash verification, and the concurrency cap.
* ``refactor: rename download cancel→pause and surface DownloadPaused state``: splits the existing "cancel" download state into a real ``DownloadPaused`` so the worker plan loop doesn't auto-restart paused downloads. Adds ``POST /download/pause`` alongside ``/download/cancel`` (the latter is kept for backwards compat).

How it works
* ``src/exo/download/file_server.py`` exposes a small aiohttp server on ``EXO_FILE_SERVER_PORT`` (default ``52416``) that streams files from any directory listed in ``EXO_MODELS_DIRS`` / ``EXO_MODELS_READ_ONLY_DIRS``. Supports HTTP Range for resumable downloads.
* ``src/exo/download/peer_discovery.py`` walks the global ``download_status`` for any peer in ``DownloadCompleted`` and picks the best routable IPv4 address from that peer's ``NodeNetworkInfo``, ranking ``thunderbolt > maybe_ethernet > ethernet > everything else`` and skipping IPv6, loopback, and link-local addresses.
* ``StartDownload`` / ``DownloadModel`` carry an optional ``repo_url``. When set, the shard downloader skips HF and uses curl (``-C -`` for resume, ``-f`` for fail-fast, ``-D <file>`` to capture response headers) to pull from the peer. Per-file parallelism is bumped to 16 concurrent files for the peer path since the bottleneck is local NIC throughput, not HF rate limits.

Security stance
Documented in detail in ``docs/p2p-model-distribution.md``. Headlines:

* Requests are confined to the requested ``<model_dir>/<normalized>/`` subdirectory: ``..``-traversal that would escape into a sibling model under the same root is rejected with 404 (an earlier ``is_relative_to(model_dir)``-only check missed this). Tested at the HTTP-protocol level with raw sockets so client-side URL normalization can't hide a regression.
* Malformed ``Range`` headers (``bytes=abc-``, multi-range, suffix-form, etc.) are silently ignored rather than crashing the handler. Pre-fix this was a one-line DoS: ``int("abc")`` raised ``ValueError`` and aiohttp returned 500.
* When a file has a ``<file>.sha256`` sidecar (written by the existing HF integrity-check path after it verifies HF's etag), the file server emits ``X-File-SHA256``. The receiver captures the header in the same curl round-trip via ``-D``, hashes the downloaded bytes, and refuses to rename ``.partial`` → final on mismatch, after which the outer retry loop re-fetches. On success the receiver writes its own sidecar so it can in turn pass verified hashes to other peers. This catches transmission corruption and disk-side decay; the trust boundary remains "every node in the cluster is trusted."
* Concurrent serves are capped by ``EXO_FILE_SERVER_MAX_CONCURRENCY`` (default 64). Excess in-flight serves get ``503 Retry-After: 1`` and the receiver's existing retry loop handles it (curl re-runs and ``-C -`` resumes, so no bytes are wasted).
* The listener binds to ``EXO_FILE_SERVER_BIND_HOST`` (default ``0.0.0.0``). Set to ``127.0.0.1`` to disable P2P serving on a node, or to a specific interface IP to narrow exposure.
* 404 responses say ``Not found``, not ``File not found: <user-controlled-input>``.

Tests
61 new tests, all passing alongside the existing 360. Notable:

* ``test_peer_discovery.py``: 16 cases covering interface priority, routability filtering, multi-peer fallthrough.
* ``test_file_server.py``: 32 cases including 4 raw-socket protocol-level tests that exercise path traversal and malformed Range without aiohttp client-side URL normalization.
* ``test_p2p_download.py``: 13 end-to-end cases through a real ephemeral file_server, including hash mismatch handling, sidecar absence (backward compat), and ``_parse_x_file_sha256`` redirect-chain handling.

Test plan
* ``docs/p2p-model-distribution.md`` accurately describes the security stance
* ``download_file_with_retry`` retry semantics

Related
Fork-side issue tracking from when this was scoped:
``docs/upstream-prs.md`` (private): this is the "P2P model distribution" entry.