feat(scheduler): TP-vs-MP pick_strategy cost function by marcos-mendez · Pull Request #19 · popsolutions/Spanker

marcos-mendez · 2026-05-06T15:53:36Z

Summary

Adds spanker_scheduler::decision module with a bandwidth-bound cost function that picks TP vs MP for a matmul tile on N sails, using the rev-A bandwidth constants from PR #17 (feat(scheduler): bandwidth model constants, realistic ECP5 + 8b/10b ceilings; closes #14).
Public API: Strategy { TensorParallel, ModelParallel } (#[non_exhaustive]), TileShape { m, n, k, bytes_per_element }, pick_strategy(tile, n_sails, local_ddr_bw, intercard_bw) -> Strategy. Re-exported from spanker_scheduler root.
Capacity-planning datapoint: TinyLlama-1.1B Q4_0 decode (m=1, n=5632, k=2048, bpe=2) on 4 sails with rev-A constants picks ModelParallel — the per-token activation (~11 KB) is the scarce-bandwidth axis we want to shard.

Cost model

For a tile (m, n, k) with bytes_per_element per scalar on n_sails:

TP: weights sharded across sails → weight_bytes / (n_sails * local_ddr_bw); activations AllReduce'd over intercard → activation_bytes / intercard_bw.
MP: weights local to one sail → weight_bytes / local_ddr_bw; activations sharded forward → activation_bytes / (n_sails * intercard_bw).

Pick the smaller. Bandwidth-bound only — compute and latency deferred until silicon characterisation. The asymmetric pipeline-MP per-card-boundary forwarding cost is intentionally NOT modelled in this PR; follow-up below.
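The two cost expressions above can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the code merged in this PR: the function name `pick_strategy_sketch` is hypothetical, it takes byte counts directly rather than a `TileShape`, and the integer and float types are guesses. The tie-break toward TP and the zero-bandwidth fallback follow the behaviour named in the test plan.

```rust
/// Strategy enum as described in the PR's public API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[non_exhaustive]
pub enum Strategy {
    TensorParallel,
    ModelParallel,
}

/// Hypothetical sketch of the bandwidth-bound decision.
/// Byte counts are passed in directly (assumption); bandwidths are bytes/s.
pub fn pick_strategy_sketch(
    weight_bytes: u64,
    activation_bytes: u64,
    n_sails: u32,
    local_ddr_bw: u64,
    intercard_bw: u64,
) -> Strategy {
    // Zero-bandwidth guard: avoid dividing by zero and default to TP.
    if local_ddr_bw == 0 || intercard_bw == 0 {
        return Strategy::TensorParallel;
    }
    // n_sails = 0 is treated as a single sail.
    let n = f64::from(n_sails.max(1));
    let w = weight_bytes as f64;
    let a = activation_bytes as f64;
    let ddr = local_ddr_bw as f64;
    let ic = intercard_bw as f64;

    // TP: weights sharded over sails, activations AllReduce'd over intercard.
    let tp = w / (n * ddr) + a / ic;
    // MP: weights local to one sail, activations sharded forward.
    let mp = w / ddr + a / (n * ic);

    // Pick the smaller cost; ties (for example n_sails = 1) fall to TP.
    if mp < tp {
        Strategy::ModelParallel
    } else {
        Strategy::TensorParallel
    }
}
```

Reading the ~4 KB weight and ~11 KB activation figures quoted elsewhere in this thread as 4096 and 11264 bytes (an assumption about how they were rounded), the sketch returns `ModelParallel` on 4 sails with 2.0 GB/s local DDR and 500 MB/s intercard, matching the reported datapoint.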

Test plan

cargo build -p spanker-scheduler clean
cargo test -p spanker-scheduler — 24 unit + 9 integration + 5 doctests = 38 pass
cargo clippy -p spanker-scheduler --all-targets -- -D warnings clean
cargo fmt -p spanker-scheduler --check clean
9 unit tests covering: degenerate (n=0, n=1), activation-dominated → TP, weight-dominated → MP, bandwidth overrides flip the answer, TinyLlama decode capacity-planning datapoint, overflow saturation, zero-bandwidth guard, byte-helper smoke
#[non_exhaustive] regression doctest pinning the semver-evolution guard on Strategy
Module-level doctest demonstrating default-constants usage
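For readers unfamiliar with the overflow-saturation item in the list above, the idea can be sketched as follows. The helper `saturating_bytes` is hypothetical and deliberately generic: it does not claim which tile dimensions the real `weight_bytes()` and `activation_bytes()` helpers multiply, only how a product of dimensions can clamp at `u64::MAX` instead of panicking or wrapping.

```rust
/// Hypothetical helper: multiply the given dimensions by the element size,
/// clamping at u64::MAX on overflow instead of wrapping or panicking.
pub fn saturating_bytes(dims: &[u64], bytes_per_element: u64) -> u64 {
    dims.iter()
        .fold(bytes_per_element, |acc, &d| acc.saturating_mul(d))
}
```

With in-range inputs the result is the plain product (for example 1 × 5632 elements at 2 bytes each gives 11264 bytes); with adversarial inputs such as `u64::MAX` the result pins to `u64::MAX`, so a downstream cost comparison still runs without a panic.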

Follow-up

File issue: model an asymmetric pipeline-MP cost surface (per-card-boundary forward cost = (n_sails - 1) * activation_bytes / intercard_bw) so MP is correctly penalised for deep pipelines on bandwidth-starved intercard links.
File issue: prefill (m >> 1) regime — confirm the cost model flips back to TP for prefill tiles when the runtime exposes prefill batch shapes.
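The per-card-boundary term proposed in the first follow-up can be written down directly from the formula given there. The sketch below is illustrative only: the function name, the `Option` return for zero bandwidth, and the integer types are assumptions, and nothing like it is part of this PR.

```rust
/// Hypothetical sketch of the follow-up's forwarding term:
/// (n_sails - 1) * activation_bytes / intercard_bw, in seconds.
/// Returns None when the intercard bandwidth is zero.
pub fn pipeline_forward_cost_secs(
    n_sails: u32,
    activation_bytes: u64,
    intercard_bw: u64,
) -> Option<f64> {
    if intercard_bw == 0 {
        return None;
    }
    // A pipeline over n sails crosses n - 1 card boundaries; 0 or 1 sails cross none.
    let boundaries = u64::from(n_sails.saturating_sub(1));
    let forwarded = boundaries.saturating_mul(activation_bytes);
    Some(forwarded as f64 / intercard_bw as f64)
}
```

For 4 sails, an 11264-byte activation, and a 500 MB/s link this gives roughly 68 µs, which grows linearly with pipeline depth; that growth is the penalty the follow-up wants the model to capture.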

Authored by Agent 3 (Software Stack — Spanker).

Adds `spanker_scheduler::decision` with a bandwidth-bound cost model that picks tensor-parallel vs model-parallel for a given matmul tile on N sails, using the rev-A bandwidth constants from PR #17.

## Cost model (mock-only, bandwidth-bound)

For a tile (m, n, k) with `bytes_per_element` per scalar on `n_sails`:

- TP: weights sharded → `weight_bytes / (n_sails * local_ddr_bw)`; activations AllReduce'd → `activation_bytes / intercard_bw`.
- MP: weights local on one sail → `weight_bytes / local_ddr_bw`; activations sharded forward → `activation_bytes / (n_sails * intercard_bw)`.

Pick the smaller. Cost is bandwidth-bound only; compute and latency are deferred until silicon characterisation lands. The asymmetric pipeline-MP cost (per-card-boundary forwarding) is intentionally NOT modelled in this PR; it requires a more sophisticated cost surface and is filed as a follow-up.

## Public surface

- `Strategy` (`#[non_exhaustive]`): `TensorParallel | ModelParallel`
- `TileShape { m, n, k, bytes_per_element }`
- `pick_strategy(tile, n_sails, local_ddr_bw, intercard_bw) -> Strategy`
- `TileShape::weight_bytes()`, `activation_bytes()`: saturating

Re-exported from `spanker_scheduler` root.

## Capacity-planning datapoint (TinyLlama-1.1B Q4_0 decode)

For the FFN up-projection tile `m=1, n=5632, k=2048, bpe=2` on 4 sails with rev-A constants (LOCAL_DDR=2.0 GB/s, INTERCARD=500 MB/s), the decision is **ModelParallel**. The per-token activation (~11 KB) is the scarce-bandwidth axis we want to shard, NOT the per-token weight read (~4 KB in the m=1 decode regime). Prefill (m >> 1) would flip back to TP; that scenario lands when the runtime exposes prefill batch shapes.

## Tests (9 unit + 2 doctests)

- `pick_strategy_returns_tp_when_activation_dominates`
- `pick_strategy_returns_mp_when_weights_dominate`
- `pick_strategy_n_sails_1_returns_tp` (degenerate)
- `pick_strategy_n_sails_0_treated_as_1`
- `pick_strategy_with_bandwidth_overrides` (flips via custom BW)
- `pick_strategy_for_tinyllama_decode_step` (capacity-planning)
- `pick_strategy_saturates_on_overflow_inputs` (panic-free)
- `pick_strategy_with_zero_bandwidth_picks_tp_default`
- `tile_shape_byte_helpers`
- module-level doctest showing default-constants usage
- `compile_fail` doctest on `Strategy` proving `#[non_exhaustive]`

Cargo gates: build, test (24+9+5=38 pass), clippy -D warnings, fmt.

Authored by Agent 3 (Software Stack — Spanker).
Signed-off-by: Marcos <m@pop.coop>

marcos-mendez · 2026-05-06T15:57:15Z

Review by Agent R — APPROVE

CI 3/3 SUCCESS. Local: 56 unit/integration tests + 5 doctests pass; clippy and fmt clean. Diff: 466 additions, 0 deletions across 2 files.

Capacity-planning datapoint captured: TinyLlama-1.1B Q4_0 decode m=1, n=5632, k=2048, n_sails=4, rev-A constants → ModelParallel. Reasoning: m=1 decode regime has weight_bytes ~4KB but activation_bytes ~11KB; activation is the scarce-bandwidth axis to shard.
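The reasoning behind this datapoint follows from the two cost expressions in the PR description. Writing $W$ for `weight_bytes`, $A$ for `activation_bytes`, $B_{\mathrm{ddr}}$ and $B_{\mathrm{ic}}$ for the two bandwidths, and $N$ for `n_sails` (symbols introduced here only for brevity), a short worked derivation is:

```latex
\begin{align*}
T_{\mathrm{TP}} &= \frac{W}{N\,B_{\mathrm{ddr}}} + \frac{A}{B_{\mathrm{ic}}},
\qquad
T_{\mathrm{MP}} = \frac{W}{B_{\mathrm{ddr}}} + \frac{A}{N\,B_{\mathrm{ic}}} \\[4pt]
T_{\mathrm{TP}} - T_{\mathrm{MP}}
  &= \left(1 - \frac{1}{N}\right)
     \left(\frac{A}{B_{\mathrm{ic}}} - \frac{W}{B_{\mathrm{ddr}}}\right)
\end{align*}
```

For $N > 1$ the first factor is positive, so MP is cheaper exactly when $A / B_{\mathrm{ic}} > W / B_{\mathrm{ddr}}$. With the quoted figures, about 11 KB over a 500 MB/s intercard link is roughly 22 µs, while about 4 KB over 2.0 GB/s local DDR is roughly 2 µs, so the activation term dominates and the model picks ModelParallel.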

Follow-up Spanker #20 filed for asymmetric pipeline-MP cost model. API surface kept minimal (single function with explicit BW params; default usage shown in module-level doctest).

Merging via two-step. Forgejo sync follows.

Authored by Agent R (Reviewer).

…loses #21) (#22)

Add `HOST_LINK_BW_BYTES_PER_SEC = 100_000_000` (100 MB/s) to the bandwidth model, capturing the rev-A GbE host link as the third tier of the bandwidth hierarchy:

| Tier | Bandwidth | Constant |
| --- | --- | --- |
| Local DDR (per card) | ~2.0 GB/s | LOCAL_DDR_BW |
| Inter-card (per direction) | ~500 MB/s | INTERCARD_BW |
| Host link (GbE) | ~100 MB/s | HOST_LINK_BW (NEW) |

Source-of-truth: Stays `docs/upstream-contributions/2026-05-06-liteeth-ecp5-sgmii.md` (Stays PR #34, merged 2026-05-06). Community measurements on Versa-ECP5 and ECPIX-5 land at 800-940 Mbps UDP iperf3, i.e. 80-94 % of GbE line rate. The 100 MB/s number is the realistic post-IP/UDP/Ethernet-header steady-state ceiling.

The host link is 5x slower than inter-card and 20x slower than local DDR; it is the dominant cost when collective ops must reach the host (model load, gradient checkpoint to host RAM, dataset streaming, prompt-embedding upload).

## Scope

Minimal, per the issue spec's "if pick_strategy already handles this" branch:

- `pick_strategy` is the per-token TP/MP decision and most decode tokens stay on-card; host-link cost is small per-token and only matters at session boundaries.
- No callers exist today for a session-level cost-budget API, so introducing `bytes_per_second_per_token_estimate` would be speculative generality (YAGNI). Defer until the runtime needs it.
- This PR keeps the public surface to a constant + module-level doctest update + tests.

## Tests

3 new unit tests in `bandwidth.rs`:

- `host_link_bw_constant_matches_recon_doc`: pins value to 100_000_000 (guards against silent "round up to 125 MB/s line rate" drift).
- `host_link_bw_is_slowest_hop`: pins the three-tier ordering HOST_LINK < INTERCARD < LOCAL_DDR.
- `host_link_bw_is_inside_observed_range`: pins 80-125 MB/s envelope (community recon range, with line-rate ceiling).

Plus the existing `constants_are_positive` test extended to cover the new constant. Module-level doctest in `bandwidth.rs` updated to demonstrate all three constants. Crate-root doctest in `lib.rs` updated to assert the three-tier ordering.

## Cargo gates

- `cargo build -p spanker-scheduler`: green
- `cargo test -p spanker-scheduler`: 27 unit + 9 integration + 6 doctests, all green (delta: +3 unit tests vs PR #19 baseline)
- `cargo clippy -p spanker-scheduler --all-targets -- -D warnings`: green
- `cargo fmt -p spanker-scheduler -- --check`: clean

Refs:

- #21 (this issue)
- popsolutions/Stays#34 (LiteEth ECP5 SGMII recon, source-of-truth)
- #17 (PR that landed initial 2-tier model)
- #19 (PR that landed pick_strategy)

Authored by Agent 3 (Software Stack — Spanker).
Signed-off-by: Marcos <m@pop.coop>
Co-authored-by: Marcos <m@pop.coop>
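The three-tier hierarchy and the properties the new tests pin can be sketched as below. Only `HOST_LINK_BW_BYTES_PER_SEC` and its value are taken from the commit message; the `_BYTES_PER_SEC` suffix on the other two constants, the `u64` type, the line-rate constant, and the helper functions are assumptions made for illustration.

```rust
/// Local DDR bandwidth per card, ~2.0 GB/s (full name and type assumed).
pub const LOCAL_DDR_BW_BYTES_PER_SEC: u64 = 2_000_000_000;
/// Inter-card bandwidth per direction, ~500 MB/s (full name and type assumed).
pub const INTERCARD_BW_BYTES_PER_SEC: u64 = 500_000_000;
/// Host link (GbE) bandwidth, 100 MB/s, as stated in the commit message.
pub const HOST_LINK_BW_BYTES_PER_SEC: u64 = 100_000_000;
/// GbE line rate: 1 Gbit/s = 125 MB/s (hypothetical helper constant).
pub const GBE_LINE_RATE_BYTES_PER_SEC: u64 = 125_000_000;

/// Mirrors the ordering check: HOST_LINK < INTERCARD < LOCAL_DDR.
pub fn three_tier_ordering_holds() -> bool {
    HOST_LINK_BW_BYTES_PER_SEC < INTERCARD_BW_BYTES_PER_SEC
        && INTERCARD_BW_BYTES_PER_SEC < LOCAL_DDR_BW_BYTES_PER_SEC
}

/// Mirrors the envelope check: 80 MB/s up to the 125 MB/s line rate.
pub fn host_link_in_observed_envelope() -> bool {
    (80_000_000..=GBE_LINE_RATE_BYTES_PER_SEC).contains(&HOST_LINK_BW_BYTES_PER_SEC)
}

/// How many times faster a tier is than the host link (integer ratio).
pub fn speedup_over_host_link(tier_bw: u64) -> u64 {
    tier_bw / HOST_LINK_BW_BYTES_PER_SEC
}
```

The integer ratios reproduce the 5x and 20x figures quoted above, and the envelope check shows why pinning the value guards against drifting up to the 125 MB/s line rate without a test change.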

marcos-mendez added the stream-3 (Software Stack (Agent 3): driver, runtime, GGML, Spanker) and review-pending (PR awaiting reviewer agent (R)) labels May 6, 2026

marcos-mendez mentioned this pull request May 6, 2026

scheduler: model asymmetric pipeline-MP per-card-boundary forward cost #20

Open

marcos-mendez merged commit 2c7cec0 into main May 6, 2026
3 checks passed

marcos-mendez deleted the feat/stream-3/pr-XX-tp-mp-decision-cost-function branch May 6, 2026 15:57

marcos-mendez restored the feat/stream-3/pr-XX-tp-mp-decision-cost-function branch May 6, 2026 15:57

marcos-mendez mentioned this pull request May 6, 2026

feat(scheduler): add HOST_LINK_BW constant + 3-way bandwidth model (closes #21) #22

Merged

4 tasks

marcos-mendez merged 1 commit into main from feat/stream-3/pr-XX-tp-mp-decision-cost-function
