feat: peer-to-peer model distribution #1992

Open
adurham wants to merge 2 commits into exo-explore:main from adurham:p2p-upstream-pr

Contributor

@adurham adurham commented Apr 26, 2026

Summary

When more than one node in a cluster needs the same model, this lets nodes that already hold the weights serve them to peers over the local network instead of every node fetching independently from HuggingFace. On a Thunderbolt-meshed cluster, a 200 GB model is pulled from the internet exactly once and then fans out at link-local speed.

The PR is two commits, both reviewable independently:

  1. feat: peer-to-peer model distribution — the file server, peer discovery, curl-based fetch with hash verification, and the concurrency cap.
  2. refactor: rename download cancel→pause and surface DownloadPaused state — splits the existing "cancel" download state into a real DownloadPaused so the worker plan loop doesn't auto-restart paused downloads. Adds POST /download/pause alongside /download/cancel (the latter is kept for backwards compat).

How it works

  • New src/exo/download/file_server.py exposes a small aiohttp server on EXO_FILE_SERVER_PORT (default 52416) that streams files from any directory listed in EXO_MODELS_DIRS / EXO_MODELS_READ_ONLY_DIRS. Supports HTTP Range for resumable downloads.
  • New src/exo/download/peer_discovery.py walks the global download_status for any peer in DownloadCompleted and picks the best routable IPv4 address from that peer's NodeNetworkInfo, ranking thunderbolt > maybe_ethernet > ethernet > everything else and skipping IPv6, loopback, and link-local addresses.
  • StartDownload / DownloadModel carry an optional repo_url. When set, the shard downloader skips HF and uses curl (-C - for resume, -f for fail-fast, -D <file> to capture response headers) to pull from the peer. Per-file parallelism is bumped to 16 concurrent files for the peer path since the bottleneck is local NIC throughput, not HF rate limits.
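
The address ranking in the peer-discovery bullet can be sketched as below. This is a minimal illustration, not the code in peer_discovery.py: `INTERFACE_RANK`, `is_routable_ipv4`, `pick_best_address`, and the `(interface_type, address)` input shape are all hypothetical names chosen for the example.

```python
import ipaddress
from typing import Iterable, Optional

# Lower rank wins; interface types not listed here sort last.
# The names mirror the ordering in the PR text; the dict itself is an assumption.
INTERFACE_RANK = {"thunderbolt": 0, "maybe_ethernet": 1, "ethernet": 2}


def is_routable_ipv4(address: str) -> bool:
    """True for an IPv4 address that is neither loopback nor link-local."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False
    return ip.version == 4 and not ip.is_loopback and not ip.is_link_local


def pick_best_address(interfaces: Iterable[tuple[str, str]]) -> Optional[str]:
    """Pick the highest-priority routable IPv4 address from (type, address) pairs."""
    candidates = [
        (INTERFACE_RANK.get(kind, len(INTERFACE_RANK)), address)
        for kind, address in interfaces
        if is_routable_ipv4(address)
    ]
    if not candidates:
        return None
    # min() keeps the first entry among equal ranks, so input order breaks ties.
    return min(candidates, key=lambda pair: pair[0])[1]
```

Returning `None` when nothing qualifies leaves the caller free to move on to the next peer or fall back to HuggingFace.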

Security stance

Documented in detail in docs/p2p-model-distribution.md. Headlines:

  • Path-traversal defense. Resolved file paths are pinned to the specific <model_dir>/<normalized>/ subdirectory: a .. traversal that would escape into a sibling model under the same root is rejected with 404 (an earlier check that used only is_relative_to(model_dir) missed this). Tested at the HTTP-protocol level with raw sockets so client-side URL normalization can't hide a regression.
  • Range header robustness. Malformed Range headers (bytes=abc-, multi-range, suffix-form, etc.) are silently ignored rather than crashing the handler. Pre-fix this was a one-line DoS — int("abc") raised ValueError and aiohttp returned 500.
  • Hash verification. When the source has a <file>.sha256 sidecar (written by the existing HF integrity-check path after it verifies HF's etag), the file server emits X-File-SHA256. The receiver captures the header in the same curl round-trip via -D, hashes the downloaded bytes, refuses to rename .partial → final on mismatch, and the outer retry loop re-fetches. On success the receiver writes its own sidecar so it can in turn pass verified hashes to other peers. Catches transmission corruption and disk-side decay; trust boundary remains "every node in the cluster is trusted."
  • Concurrency cap. EXO_FILE_SERVER_MAX_CONCURRENCY (default 64). Excess in-flight serves get 503 Retry-After: 1 and the receiver's existing retry loop handles it (curl re-runs and -C - resumes — no bytes wasted).
  • Configurable bind host. EXO_FILE_SERVER_BIND_HOST (default 0.0.0.0). Set to 127.0.0.1 to disable P2P serving on a node, or to a specific interface IP to narrow exposure.
  • Error responses do not echo request paths. A 404 says Not found, not File not found: <user-controlled-input>.
  • No reverse-proxy assumption. Inter-cluster traffic is the use case; rate-limiting at the reverse-proxy layer doesn't fit the topology.
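
To make the path-traversal bullet concrete, here is a minimal sketch of pinning a request to one model's subdirectory instead of the shared root. `resolve_served_file` and its parameters are hypothetical, and `model_name` is assumed to be already normalized; the real handler in file_server.py may be structured differently.

```python
from pathlib import Path
from typing import Optional


def resolve_served_file(model_root: Path, model_name: str, requested: str) -> Optional[Path]:
    """Return the file to serve, or None if the request leaves the model's own directory."""
    pinned_dir = (model_root / model_name).resolve()
    candidate = (pinned_dir / requested).resolve()
    # Checking against model_root alone would accept "../other-model/weights.bin",
    # because that path is still inside the shared root.
    if not candidate.is_relative_to(pinned_dir):
        return None
    return candidate if candidate.is_file() else None
```

A caller would map `None` to a plain 404 without echoing the requested path, in line with the error-response bullet above.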

Tests

61 new tests, all passing alongside the existing 360. Notable:

  • test_peer_discovery.py — 16 cases covering interface priority, routability filtering, multi-peer fallthrough.
  • test_file_server.py — 32 cases including 4 raw-socket protocol-level tests that exercise path traversal and malformed Range without aiohttp client-side URL normalization.
  • test_p2p_download.py — 13 end-to-end cases through a real ephemeral file_server, including hash mismatch handling, sidecar absence (backward compat), and _parse_x_file_sha256 redirect-chain handling.
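
The hash-verification flow that test_p2p_download.py exercises can be sketched as follows. This is an illustration under assumptions, not the PR's code: `parse_x_file_sha256` here is a stand-in for the private `_parse_x_file_sha256`, and `sha256_of` and `finalize_download` are hypothetical helpers. It assumes the `curl -D` dump holds one header block per response in a redirect chain, each starting with an `HTTP/` status line, and that only the final response's header should count.

```python
import hashlib
from pathlib import Path
from typing import Optional


def parse_x_file_sha256(header_dump: str) -> Optional[str]:
    """Return X-File-SHA256 from the final response in a curl -D header dump."""
    digest: Optional[str] = None
    for line in header_dump.splitlines():
        if line.upper().startswith("HTTP/"):
            digest = None  # a new response began; forget headers from earlier hops
            continue
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "x-file-sha256":
            digest = value.strip().lower()
    return digest


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are never fully in memory."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def finalize_download(partial: Path, final: Path, expected: Optional[str]) -> bool:
    """Rename partial to final unless an advertised hash disagrees with the bytes."""
    if expected is not None and sha256_of(partial) != expected:
        return False  # leave the rename undone; the caller's retry loop re-fetches
    partial.replace(final)
    return True
```

When `expected` is `None` (no sidecar on the source), the sketch renames without checking, which matches the backward-compatible behaviour described above. Writing the receiver's own sidecar and deciding what to do with a mismatched `.partial` file are left out.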

Test plan

  • CI green (basedpyright, ruff, pytest)
  • Reviewer sanity-check: docs/p2p-model-distribution.md accurately describes the security stance
  • Reviewer sanity-check: hash-verification flow makes sense given the existing download_file_with_retry retry semantics
  • Optional manual test: 2-node cluster, place an instance of a model neither node has, confirm node A pulls from HF and node B pulls from node A

Related

Fork-side issue tracking from when this was scoped: docs/upstream-prs.md (private) — this is the "P2P model distribution" entry.

Adam Durham added 2 commits April 26, 2026 16:57

feat: peer-to-peer model distribution

When more than one node in a cluster needs the same model, this lets
nodes that already hold the weights serve them to peers over the local
network instead of every node fetching independently from HuggingFace.
On a Thunderbolt-meshed cluster a 200 GB model is pulled from the
internet exactly once and then fans out at link-local speed.

Components
----------
* New ``src/exo/download/file_server.py`` exposes a small aiohttp
  server on ``EXO_FILE_SERVER_PORT`` (default 52416) that streams files
  out of any directory listed in ``EXO_MODELS_DIRS`` /
  ``EXO_MODELS_READ_ONLY_DIRS``. Supports HTTP Range for resumable
  downloads and rejects path-traversal via ``Path.is_relative_to``.

* New ``src/exo/download/peer_discovery.py`` walks the global
  ``download_status`` for any peer in ``DownloadCompleted`` and picks
  the best routable IPv4 address from that peer's ``NodeNetworkInfo``,
  ranking ``thunderbolt > maybe_ethernet > ethernet`` and skipping
  loopback / link-local / IPv6.

* ``StartDownload`` / ``DownloadModel`` now carry an optional
  ``repo_url``. When set, the shard downloader skips HF and uses curl
  (``-C -`` for resume, ``-f`` for fail-fast) to pull from the peer.
  Per-file parallelism is bumped to 16 concurrent files for the peer
  path since the bottleneck is local NIC throughput, not HF rate
  limits.

* New env var ``EXO_FILE_SERVER_BIND_HOST`` (default ``0.0.0.0``).
  Operators on untrusted networks can set it to ``127.0.0.1`` to
  disable peer serving on a node, or to a specific interface IP to
  narrow exposure.

Tests
-----
* ``tests/test_peer_discovery.py`` (16 cases): IP priority ordering,
  IPv6 / fe80 / 127 / 169.254 skip, no-completed-peers handling,
  multi-peer fallthrough, peer-is-self skip.

* ``tests/test_file_server.py`` (11 cases): Range / partial / 416,
  404 paths, path-traversal rejection, multi-directory resolution,
  read-only mirror serving.

* ``tests/test_p2p_download.py`` (5 cases, skipped if curl absent):
  end-to-end round-trip via a real file_server on an ephemeral port,
  progress callback shape, resume from a partial file, curl error on
  404 and unreachable host.

Docs
----
* ``docs/p2p-model-distribution.md`` covers the discovery/serve flow,
  the env vars, the curl runtime dependency, and the security stance
  (no auth on the listener — assumes a private cluster; recommended
  firewall posture documented).

refactor: rename download cancel→pause and surface DownloadPaused state

Cancellation of an active download was being modelled as a transition
back to ``DownloadPending`` — the same state used for "we haven't
started yet". This conflated _paused with bytes on disk_ with _never
ran_, and meant the worker plan loop happily auto-restarted a download
the user had explicitly paused. Operators kept losing partial-download
context across pauses.

* Adds ``DownloadPaused`` to the download state union, carrying the
  ``downloaded`` / ``total`` already accumulated on disk.
* Coordinator's existing pause path now writes ``DownloadPaused``
  rather than ``DownloadPending``.
* Renames the user-facing operation cancel → pause across the API
  surface, the dashboard store, and the downloads page UI. The
  internal ``CancelDownload`` IPC command name is kept so the
  master/worker wiring and the legacy ``/download/cancel`` HTTP
  endpoint stay backwards-compatible.
* Adds ``POST /download/pause`` alongside the existing
  ``/download/cancel`` for clients that want the new naming. Both
  dispatch to the same handler under the hood.
* ``placement.py`` treats ``DownloadPaused`` identically to
  ``DownloadPending`` for cycle-fitness scoring — partial progress
  still counts toward the cycle's affinity for that model.
* ``test_cancel_download.py`` updated end-to-end to assert the new
  ``DownloadPaused`` event shape.
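
The distinction this commit draws can be illustrated with a small model of the state union. The classes below are simplified, hypothetical stand-ins for exo's real download states (which carry more fields), and `should_auto_start` stands in for whatever check the worker plan loop performs.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class DownloadPending:
    """The download has never started."""


@dataclass(frozen=True)
class DownloadPaused:
    """The user stopped the download; some bytes are already on disk."""
    downloaded: int
    total: int


@dataclass(frozen=True)
class DownloadCompleted:
    total: int


DownloadState = Union[DownloadPending, DownloadPaused, DownloadCompleted]


def should_auto_start(state: DownloadState) -> bool:
    """Only a download that never ran is eligible for an automatic start."""
    return isinstance(state, DownloadPending)


def pause(downloaded: int, total: int) -> DownloadState:
    """Record a pause as its own state instead of falling back to pending."""
    return DownloadPaused(downloaded=downloaded, total=total)
```

Because a paused download is no longer represented as `DownloadPending`, a plan loop written this way leaves it alone until the user resumes it, and the byte counts survive the pause.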

adurham pushed a commit to adurham/exo that referenced this pull request Apr 26, 2026

Re-authored the original P2P chain from "not started" to a 2-commit
upstream PR after a security sweep surfaced the lateral path-traversal
bug and a malformed-Range DoS. PR includes hash verification via
X-File-SHA256 sidecars and a configurable concurrency cap.