feat(scheduler): TP-vs-MP pick_strategy cost function #19
Merged
marcos-mendez merged 1 commit on May 6, 2026
Adds `spanker_scheduler::decision` with a bandwidth-bound cost model that picks tensor-parallel (TP) vs model-parallel (MP) for a given matmul tile on N sails, using the rev-A bandwidth constants from PR #17.

## Cost model (mock-only, bandwidth-bound)

For a tile (m, n, k) with `bytes_per_element` per scalar on `n_sails` sails:

- TP: weights sharded → `weight_bytes / (n_sails * local_ddr_bw)`; activations AllReduce'd → `activation_bytes / intercard_bw`.
- MP: weights local on one sail → `weight_bytes / local_ddr_bw`; activations sharded forward → `activation_bytes / (n_sails * intercard_bw)`.

Pick the smaller. The cost is bandwidth-bound only; compute and latency are deferred until silicon characterisation lands.

The asymmetric pipeline-MP cost (per-card-boundary forwarding) is intentionally NOT modelled in this PR; it requires a more sophisticated cost surface and is filed as a follow-up.

## Public surface

- `Strategy` (`#[non_exhaustive]`): `TensorParallel | ModelParallel`
- `TileShape { m, n, k, bytes_per_element }`
- `pick_strategy(tile, n_sails, local_ddr_bw, intercard_bw) -> Strategy`
- `TileShape::weight_bytes()`, `activation_bytes()`: saturating

Re-exported from `spanker_scheduler` root.

## Capacity-planning datapoint (TinyLlama-1.1B Q4_0 decode)

For the FFN up-projection tile `m=1, n=5632, k=2048, bpe=2` on 4 sails with rev-A constants (LOCAL_DDR=2.0 GB/s, INTERCARD=500 MB/s), the decision is **ModelParallel**. The per-token activation (~11 KB) is the scarce-bandwidth axis we want to shard, NOT the per-token weight read (~4 KB in the m=1 decode regime). Prefill (m >> 1) would flip back to TP; that scenario lands when the runtime exposes prefill batch shapes.
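The two cost formulas above can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's code: `pick_strategy_sketch` is a hypothetical name, it takes the byte counts directly instead of a `TileShape` (the PR does not spell out how `weight_bytes()` and `activation_bytes()` are computed), and the zero-bandwidth and `n_sails = 0` handling are assumptions inferred from the test names. The byte counts in `main` are the ~4 KB and ~11 KB figures quoted for the TinyLlama datapoint.

```rust
/// Hypothetical stand-in for the PR's `Strategy` (the real enum is `#[non_exhaustive]`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    TensorParallel,
    ModelParallel,
}

/// Bandwidth-only costs in seconds, returned as (tp, mp), following the two formulas above.
fn costs(weight_bytes: u64, activation_bytes: u64, n_sails: u32, local_ddr_bw: u64, intercard_bw: u64) -> (f64, f64) {
    let n = f64::from(n_sails.max(1)); // assumption: n_sails = 0 is treated as 1
    let (w, a) = (weight_bytes as f64, activation_bytes as f64);
    let (ddr, link) = (local_ddr_bw as f64, intercard_bw as f64);
    let tp = w / (n * ddr) + a / link; // weights sharded, activations AllReduce'd
    let mp = w / ddr + a / (n * link); // weights local, activations sharded forward
    (tp, mp)
}

/// Picks the cheaper strategy; ties and zero bandwidths fall back to TP (assumed default).
fn pick_strategy_sketch(weight_bytes: u64, activation_bytes: u64, n_sails: u32, local_ddr_bw: u64, intercard_bw: u64) -> Strategy {
    if local_ddr_bw == 0 || intercard_bw == 0 {
        return Strategy::TensorParallel;
    }
    let (tp, mp) = costs(weight_bytes, activation_bytes, n_sails, local_ddr_bw, intercard_bw);
    if mp < tp { Strategy::ModelParallel } else { Strategy::TensorParallel }
}

fn main() {
    // ~4 KB weights, ~11 KB activations, 4 sails, 2.0 GB/s local DDR, 500 MB/s inter-card.
    let choice = pick_strategy_sketch(4 * 1024, 11 * 1024, 4, 2_000_000_000, 500_000_000);
    assert_eq!(choice, Strategy::ModelParallel);
    println!("{:?}", choice);
}
```

With a single sail both costs are identical, so the sketch returns `TensorParallel` through the tie-break.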
## Tests (8 unit + 1 doctest)

- `pick_strategy_returns_tp_when_activation_dominates`
- `pick_strategy_returns_mp_when_weights_dominate`
- `pick_strategy_n_sails_1_returns_tp` (degenerate)
- `pick_strategy_n_sails_0_treated_as_1`
- `pick_strategy_with_bandwidth_overrides` (flips via custom BW)
- `pick_strategy_for_tinyllama_decode_step` (capacity-planning)
- `pick_strategy_saturates_on_overflow_inputs` (panic-free)
- `pick_strategy_with_zero_bandwidth_picks_tp_default`
- `tile_shape_byte_helpers`
- module-level doctest showing default-constants usage
- `compile_fail` doctest on `Strategy` proving `#[non_exhaustive]`

Cargo gates: build, test (24+9+5=38 pass), clippy -D warnings, fmt.

Authored by Agent 3 (Software Stack - Spanker).

Signed-off-by: Marcos <m@pop.coop>
Review by Agent R: APPROVE

CI 3/3 SUCCESS. Local: 56 unit/int tests + 5 doctests + clippy/fmt clean. 466+/0- across 2 files.

Capacity-planning datapoint captured: TinyLlama-1.1B Q4_0 decode, m=1, n=5632, k=2048, n_sails=4, rev-A constants → ModelParallel. Reasoning: the m=1 decode regime has weight_bytes ~4 KB but activation_bytes ~11 KB; activation is the scarce-bandwidth axis to shard.

Follow-up Spanker #20 filed for the asymmetric pipeline-MP cost model. API surface kept minimal (single function with explicit BW params; default usage shown in module-level doctest).

Merging via two-step. Forgejo sync follows.

Authored by Agent R (Reviewer).
marcos-mendez added a commit that referenced this pull request on May 6, 2026
…loses #21) (#22)

Add `HOST_LINK_BW_BYTES_PER_SEC = 100_000_000` (100 MB/s) to the bandwidth model, capturing the rev-A GbE host link as the third tier of the bandwidth hierarchy:

| Tier | Bandwidth | Constant |
| --- | --- | --- |
| Local DDR (per card) | ~2.0 GB/s | `LOCAL_DDR_BW` |
| Inter-card (per direction) | ~500 MB/s | `INTERCARD_BW` |
| Host link (GbE) | ~100 MB/s | `HOST_LINK_BW` (NEW) |

Source-of-truth: Stays `docs/upstream-contributions/2026-05-06-liteeth-ecp5-sgmii.md` (Stays PR #34, merged 2026-05-06). Community measurements on Versa-ECP5 and ECPIX-5 land at 800-940 Mbps UDP iperf3, i.e. 80-94 % of GbE line rate. The 100 MB/s number is the realistic post-IP/UDP/Ethernet-header steady-state ceiling.

The host link is 5x slower than inter-card and 20x slower than local DDR; it is the dominant cost when collective ops must reach the host (model load, gradient checkpoint to host RAM, dataset streaming, prompt-embedding upload).

## Scope

Minimal, per the issue spec's "if pick_strategy already handles this" branch:

- `pick_strategy` is the per-token TP/MP decision and most decode tokens stay on-card; host-link cost is small per-token and only matters at session boundaries.
- No callers exist today for a session-level cost-budget API, so introducing `bytes_per_second_per_token_estimate` would be speculative generality (YAGNI). Defer until the runtime needs it.
- This PR keeps the public surface to a constant + module-level doctest update + tests.

## Tests

3 new unit tests in `bandwidth.rs`:

- `host_link_bw_constant_matches_recon_doc`: pins value to 100_000_000 (guards against silent "round up to 125 MB/s line rate" drift).
- `host_link_bw_is_slowest_hop`: pins the three-tier ordering HOST_LINK < INTERCARD < LOCAL_DDR.
- `host_link_bw_is_inside_observed_range`: pins 80-125 MB/s envelope (community recon range, with line-rate ceiling).

Plus the existing `constants_are_positive` test extended to cover the new constant.

Module-level doctest in `bandwidth.rs` updated to demonstrate all three constants. Crate-root doctest in `lib.rs` updated to assert the three-tier ordering.

## Cargo gates

- `cargo build -p spanker-scheduler`: green
- `cargo test -p spanker-scheduler`: 27 unit + 9 integration + 6 doctests, all green (delta: +3 unit tests vs PR #19 baseline)
- `cargo clippy -p spanker-scheduler --all-targets -- -D warnings`: green
- `cargo fmt -p spanker-scheduler -- --check`: clean

Refs:

- #21 (this issue)
- popsolutions/Stays#34 (LiteEth ECP5 SGMII recon, source-of-truth)
- #17 (PR that landed initial 2-tier model)
- #19 (PR that landed pick_strategy)

Authored by Agent 3 (Software Stack - Spanker).

Signed-off-by: Marcos <m@pop.coop>
Co-authored-by: Marcos <m@pop.coop>
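The three-tier ordering and the 5x / 20x ratios in the commit message above can be checked with a short Rust sketch. Only `HOST_LINK_BW_BYTES_PER_SEC = 100_000_000` is quoted exactly in the commit; the other two constant names and their exact values (taken from "~2.0 GB/s" and "~500 MB/s") are assumptions, and `mbps_to_bytes_per_sec` is a hypothetical helper added only to show the unit conversion.

```rust
// Assumed names and exact values for the first two tiers; only the host-link constant is quoted verbatim.
const LOCAL_DDR_BW: u64 = 2_000_000_000; // ~2.0 GB/s per card
const INTERCARD_BW: u64 = 500_000_000; // ~500 MB/s per direction
const HOST_LINK_BW_BYTES_PER_SEC: u64 = 100_000_000; // 100 MB/s GbE host link

/// Hypothetical helper: decimal megabits per second to bytes per second.
const fn mbps_to_bytes_per_sec(mbps: u64) -> u64 {
    mbps * 1_000_000 / 8
}

fn main() {
    // Three-tier ordering: host link is the slowest hop.
    assert!(HOST_LINK_BW_BYTES_PER_SEC < INTERCARD_BW);
    assert!(INTERCARD_BW < LOCAL_DDR_BW);

    // Ratios quoted in the commit message.
    assert_eq!(INTERCARD_BW / HOST_LINK_BW_BYTES_PER_SEC, 5);
    assert_eq!(LOCAL_DDR_BW / HOST_LINK_BW_BYTES_PER_SEC, 20);

    // GbE line rate and the measured 800-940 Mbps range, expressed in bytes per second.
    let line_rate = mbps_to_bytes_per_sec(1000);
    let measured_low = mbps_to_bytes_per_sec(800);
    let measured_high = mbps_to_bytes_per_sec(940);
    println!("line rate: {line_rate} B/s, measured: {measured_low}..={measured_high} B/s");

    // The 80-125 MB/s envelope that the third unit test is described as pinning.
    assert!(HOST_LINK_BW_BYTES_PER_SEC >= 80_000_000 && HOST_LINK_BW_BYTES_PER_SEC <= line_rate);
}
```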
## Summary

- `spanker_scheduler::decision` module with a bandwidth-bound cost function that picks TP vs MP for a matmul tile on N sails, using the rev-A bandwidth constants from PR #17 ("feat(scheduler): bandwidth model constants - realistic ECP5 + 8b/10b ceilings (closes #14)").
- `Strategy { TensorParallel, ModelParallel }` (`#[non_exhaustive]`), `TileShape { m, n, k, bytes_per_element }`, `pick_strategy(tile, n_sails, local_ddr_bw, intercard_bw) -> Strategy`. Re-exported from `spanker_scheduler` root.
- TinyLlama-1.1B Q4_0 decode: the FFN up-projection tile (`m=1, n=5632, k=2048, bpe=2`) on 4 sails with rev-A constants picks ModelParallel; the per-token activation (~11 KB) is the scarce-bandwidth axis we want to shard.

## Cost model

For a tile (m, n, k) with `bytes_per_element` per scalar on `n_sails`:

- TP: weights sharded → `weight_bytes / (n_sails * local_ddr_bw)`; activations AllReduce'd over intercard → `activation_bytes / intercard_bw`.
- MP: weights local on one sail → `weight_bytes / local_ddr_bw`; activations sharded forward → `activation_bytes / (n_sails * intercard_bw)`.

Pick the smaller. Bandwidth-bound only; compute and latency deferred until silicon characterisation. The asymmetric pipeline-MP per-card-boundary forwarding cost is intentionally NOT modelled in this PR; follow-up below.
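Taking the two formulas exactly as written, the comparison can be worked through algebraically. The symbols are shorthand introduced here and do not appear in the PR: $W$ = `weight_bytes`, $A$ = `activation_bytes`, $n$ = `n_sails`, $D$ = `local_ddr_bw`, $I$ = `intercard_bw`.

```latex
\begin{aligned}
C_{\mathrm{TP}} &= \frac{W}{nD} + \frac{A}{I}, \qquad
C_{\mathrm{MP}} = \frac{W}{D} + \frac{A}{nI}, \\
C_{\mathrm{TP}} - C_{\mathrm{MP}}
  &= \frac{W}{D}\left(\frac{1}{n} - 1\right) + \frac{A}{I}\left(1 - \frac{1}{n}\right)
   = \left(1 - \frac{1}{n}\right)\left(\frac{A}{I} - \frac{W}{D}\right).
\end{aligned}
```

Under these formulas, for $n > 1$ MP is the smaller cost exactly when $A/I > W/D$, and for $n = 1$ the two costs are equal. Reading the quoted ~4 KB and ~11 KB figures as 4096 and 11264 bytes (an assumption about the rounding), $W/D \approx 2\,\mu\text{s}$ and $A/I \approx 22.5\,\mu\text{s}$, which agrees with the ModelParallel result reported for the TinyLlama decode tile.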
## Test plan

- `cargo build -p spanker-scheduler` clean
- `cargo test -p spanker-scheduler`: 24 unit + 9 integration + 5 doctests = 38 pass
- `cargo clippy -p spanker-scheduler --all-targets -- -D warnings` clean
- `cargo fmt -p spanker-scheduler --check` clean
- Unit tests cover: degenerate `n_sails` (`n=0`, `n=1`), activation-dominated → TP, weight-dominated → MP, bandwidth overrides flip the answer, TinyLlama decode capacity-planning datapoint, overflow saturation, zero-bandwidth guard, byte-helper smoke
- `#[non_exhaustive]` regression doctest pinning the semver-evolution guard on `Strategy`

## Follow-up

- Asymmetric pipeline-MP cost model (… `(n_sails - 1) * activation_bytes / intercard_bw`) so MP is correctly penalised for deep pipelines on bandwidth-starved intercard links.
- Prefill (`m >> 1`) regime: confirm the cost model flips back to TP for prefill tiles when the runtime exposes prefill batch shapes.

Authored by Agent 3 (Software Stack - Spanker).
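To get a feel for the size of the pipeline term named in the follow-up, the sketch below evaluates the quoted expression `(n_sails - 1) * activation_bytes / intercard_bw` next to the sharded-forward term the current model uses for MP. Both function names are hypothetical, and how the follow-up will actually combine this term with the rest of the cost is not specified in this PR; the inputs are the ~11 KB activation and 500 MB/s inter-card figures from the datapoint.

```rust
/// Hypothetical: per-card-boundary forwarding time, `(n_sails - 1) * activation_bytes / intercard_bw`.
fn pipeline_forward_seconds(activation_bytes: u64, n_sails: u32, intercard_bw: u64) -> f64 {
    // saturating_sub keeps n_sails = 0 from underflowing; a single sail has no boundary to cross.
    let hops = f64::from(n_sails.saturating_sub(1));
    hops * activation_bytes as f64 / intercard_bw as f64
}

/// Hypothetical: the activation term of the current MP cost, `activation_bytes / (n_sails * intercard_bw)`.
fn sharded_forward_seconds(activation_bytes: u64, n_sails: u32, intercard_bw: u64) -> f64 {
    let n = f64::from(n_sails.max(1));
    activation_bytes as f64 / (n * intercard_bw as f64)
}

fn main() {
    let (activation_bytes, n_sails, intercard_bw) = (11 * 1024_u64, 4_u32, 500_000_000_u64);
    let pipeline = pipeline_forward_seconds(activation_bytes, n_sails, intercard_bw);
    let sharded = sharded_forward_seconds(activation_bytes, n_sails, intercard_bw);
    // The ratio of the two terms is n * (n - 1), so the gap widens quickly as sails are added.
    println!("pipeline: {pipeline:e} s, sharded: {sharded:e} s, ratio: {:.1}", pipeline / sharded);
}
```

A zero `intercard_bw` would produce an infinite or NaN result here; a real implementation would presumably guard that case as the existing zero-bandwidth test does.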