feat(scheduler): bandwidth model constants — realistic ECP5 + 8b/10b ceilings (closes #14)#17
Conversation
…ceilings (closes #14) Adds `spanker_scheduler::bandwidth` with two `pub const u64` values that the TP-vs-MP partitioning logic must use instead of the stale ADR-001 "DDR3-1600 = 12.8 GB/s" theoretical number: - `LOCAL_DDR_BW_BYTES_PER_SEC = 2_000_000_000` (2.0 GB/s — midpoint of the 1.5–2.4 GB/s realistic ceiling observed across the three production references that drive `ECP5DDRPHY` on the open ECP5 toolchain: OrangeCrab, Trellis Board, Versa-ECP5). Source-of-truth: Stays `docs/upstream-contributions/2026-05-06-litedram-ecp5.md` §Bandwidth budget (merged 2026-05-06). - `INTERCARD_BW_BYTES_PER_SEC = 500_000_000` (500 MB/s per direction — 4 lanes × 1.25 Gbps × 8b/10b coding). Source-of-truth: InnerJib7EA `docs/hw/intercard-connector-pinout.md` §9. When ADR-014 lands and selects 64b/66b coding the effective rate becomes ~600 MB/s; bump in the same PR that lands ADR-014. Re-exported from `spanker_scheduler` so consumers can write `use spanker_scheduler::{LOCAL_DDR_BW_BYTES_PER_SEC, INTERCARD_BW_BYTES_PER_SEC};` directly. Five unit tests (in `bandwidth::tests`) plus one crate-level doc-test pin the contract: - `local_ddr_bw_is_not_the_stale_12_8_gb_per_sec_value` — regression guard against accidental revert to the theoretical-peak number; asserts `LOCAL_DDR_BW < 4 GB/s` (envelope below both 6.4 and 12.8 GB/s but with rev-B headroom). - `local_ddr_bw_is_inside_observed_range` — pins to the 1.5–2.4 GB/s band reported by Stays recon. - `intercard_bw_matches_innerjib7ea_pinout_section_9` — pins INTERCARD to the doc-stated 500 MB/s. - `local_ddr_dominates_intercard_by_at_least_4x` — `LOCAL_DDR_BW >= 4 × INTERCARD_BW` (rev-A lands at exactly 4× with the spec values; the assertion uses `>=` so the rev-A baseline passes; if a future bump breaks the invariant the TP/MP heuristics need re-deriving). - `constants_are_positive` — rules out a zero that would silently become a divide-by-zero in throughput-divided cost estimates. No TP/MP decision logic in the scheduler today (PRs #6/#10/#13 are scaffolding), so this PR is intentionally constants-only — the load-bearing artefact for downstream cost-model PRs to consume. Cross-stream: - MAST issue #32 amends ADR-001's bandwidth claim; constant values here track that amendment. - InnerJib7EA PR #11 §9 is the source-of-truth for the 500 MB/s intercard ceiling. - Stays PR #26 (merged 2026-05-06) is the source-of-truth for the 1.5–2.4 GB/s DDR ceiling. No stale `12.8` or `12_800_000_000` constant existed in any Rust source under `src/` prior to this PR — the value was only a doc-claim in MAST ADR-001. Confirmed via repo-wide grep. Out of scope: TP/MP decision logic, ADR-001 amendment (MAST #32), bench-harness validation. All filed as follow-ups. Refs: #14, MAST #32, Stays #26, InnerJib7EA #11. Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos (Agent 3) <popsolutions.co@gmail.com>
Review by Agent RVerdict: APPROVE Pre-review gates (
|
Adds `spanker_scheduler::decision` with a bandwidth-bound cost model that picks tensor-parallel vs model-parallel for a given matmul tile on N sails, using the rev-A bandwidth constants from PR #17. ## Cost model (mock-only, bandwidth-bound) For a tile (m, n, k) with `bytes_per_element` per scalar on `n_sails`: - TP: weights sharded → `weight_bytes / (n * local_ddr_bw)`; activations AllReduce'd → `activation_bytes / intercard_bw`. - MP: weights local on one sail → `weight_bytes / local_ddr_bw`; activations sharded forward → `activation_bytes / (n * intercard_bw)`. Pick the smaller. Cost is bandwidth-bound only — compute and latency deferred until silicon characterisation lands. The asymmetric pipeline-MP cost (per-card-boundary forwarding) is intentionally NOT modelled in this PR; it requires a more sophisticated cost surface and is filed as a follow-up. ## Public surface - `Strategy` (`#[non_exhaustive]`) — `TensorParallel | ModelParallel` - `TileShape { m, n, k, bytes_per_element }` - `pick_strategy(tile, n_sails, local_ddr_bw, intercard_bw) -> Strategy` - `TileShape::weight_bytes()`, `activation_bytes()` — saturating Re-exported from `spanker_scheduler` root. ## Capacity-planning datapoint (TinyLlama-1.1B Q4_0 decode) For the FFN up-projection tile `m=1, n=5632, k=2048, bpe=2` on 4 sails with rev-A constants (LOCAL_DDR=2.0 GB/s, INTERCARD=500 MB/s), the decision is **ModelParallel**. The per-token activation (~11 KB) is the scarce-bandwidth axis we want to shard, NOT the per-token weight read (~4 KB in the m=1 decode regime). Prefill (m >> 1) would flip back to TP — that scenario lands when the runtime exposes prefill batch shapes. ## Tests (8 unit + 1 doctest) - `pick_strategy_returns_tp_when_activation_dominates` - `pick_strategy_returns_mp_when_weights_dominate` - `pick_strategy_n_sails_1_returns_tp` (degenerate) - `pick_strategy_n_sails_0_treated_as_1` - `pick_strategy_with_bandwidth_overrides` (flips via custom BW) - `pick_strategy_for_tinyllama_decode_step` (capacity-planning) - `pick_strategy_saturates_on_overflow_inputs` (panic-free) - `pick_strategy_with_zero_bandwidth_picks_tp_default` - `tile_shape_byte_helpers` - module-level doctest showing default-constants usage - `compile_fail` doctest on `Strategy` proving `#[non_exhaustive]` Cargo gates: build, test (24+9+5=38 pass), clippy -D warnings, fmt. Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos <m@pop.coop> Co-authored-by: Marcos <m@pop.coop>
…loses #21) (#22) Add `HOST_LINK_BW_BYTES_PER_SEC = 100_000_000` (100 MB/s) to the bandwidth model, capturing the rev-A GbE host link as the third tier of the bandwidth hierarchy: Local DDR (per card) ~2.0 GB/s LOCAL_DDR_BW Inter-card (per direction) ~500 MB/s INTERCARD_BW Host link (GbE) ~100 MB/s HOST_LINK_BW (NEW) Source-of-truth: Stays `docs/upstream-contributions/2026-05-06-liteeth-ecp5-sgmii.md` (Stays PR #34, merged 2026-05-06). Community measurements on Versa-ECP5 and ECPIX-5 land at 800-940 Mbps UDP iperf3, i.e. 80-94 % of GbE line rate. The 100 MB/s number is the realistic post-IP/UDP/Ethernet-header steady-state ceiling. The host link is 5x slower than inter-card and 20x slower than local DDR — it is the dominant cost when collective ops must reach the host (model load, gradient checkpoint to host RAM, dataset streaming, prompt-embedding upload). ## Scope Minimal — per the issue spec's "if pick_strategy already handles this" branch: - `pick_strategy` is the per-token TP/MP decision and most decode tokens stay on-card; host-link cost is small per-token and only matters at session boundaries. - No callers exist today for a session-level cost-budget API, so introducing `bytes_per_second_per_token_estimate` would be speculative generality (YAGNI). Defer until the runtime needs it. - This PR keeps the public surface to a constant + module-level doctest update + tests. ## Tests 3 new unit tests in `bandwidth.rs`: - `host_link_bw_constant_matches_recon_doc` — pins value to 100_000_000 (guards against silent "round up to 125 MB/s line rate" drift). - `host_link_bw_is_slowest_hop` — pins the three-tier ordering HOST_LINK < INTERCARD < LOCAL_DDR. - `host_link_bw_is_inside_observed_range` — pins 80-125 MB/s envelope (community recon range, with line-rate ceiling). Plus the existing `constants_are_positive` test extended to cover the new constant. Module-level doctest in `bandwidth.rs` updated to demonstrate all three constants. Crate-root doctest in `lib.rs` updated to assert the three-tier ordering. ## Cargo gates - `cargo build -p spanker-scheduler`: green - `cargo test -p spanker-scheduler`: 27 unit + 9 integration + 6 doctests, all green (delta: +3 unit tests vs PR #19 baseline) - `cargo clippy -p spanker-scheduler --all-targets -- -D warnings`: green - `cargo fmt -p spanker-scheduler -- --check`: clean Refs: - #21 (this issue) - popsolutions/Stays#34 (LiteEth ECP5 SGMII recon, source-of-truth) - #17 (PR that landed initial 2-tier model) - #19 (PR that landed pick_strategy) Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos <m@pop.coop> Co-authored-by: Marcos <m@pop.coop>
Summary
Codifies the corrected rev-A bandwidth ceilings in the Spanker
scheduler so its capacity-planning logic uses real numbers, not the
stale ADR-001 "DDR3-1600 = 12.8 GB/s" theoretical claim that Stays
PR #26 (merged 2026-05-06) invalidated for the open ECP5 toolchain.
Adds a new module
src/scheduler/src/bandwidth.rswith twopub const u64values, re-exported at the crate root:LOCAL_DDR_BWis the midpoint of the 1.5–2.4 GB/s realistic rangeobserved across the three production references that drive
ECP5DDRPHYon the open toolchain (OrangeCrab, Trellis Board,Versa-ECP5).
INTERCARD_BWis the 4 lanes × 1.25 Gbps × 8b/10b ≈500 MB/s number from InnerJib7EA pinout §9 (current 8b/10b baseline;
becomes ~600 MB/s when ADR-014 selects 64b/66b — bump in lockstep
with the ADR PR).
Source-of-truth references
LOCAL_DDR_BW: Staysdocs/upstream-contributions/2026-05-06-litedram-ecp5.md§Bandwidth budget (Stays PR #26, merged 2026-05-06).
INTERCARD_BW: InnerJib7EAdocs/hw/intercard-connector-pinout.md§9 (InnerJib7EA PR chore(scheduler): doc-only sentinel tests + closure-based all_reduce dispatch + comment ordering #11 corrected to 500 MB/s).
amendment (Stream 1 in flight).
Did I find / replace any stale
12.8or12_800_000_000?No. Repo-wide grep for
12.8,12_800,LOCAL_DDR_BW, andBW_BYTES_PER_SECacross*.rs,*.toml, and*.md(excludingtarget/,.git/,.claude/worktrees/,external/) returned zerohits in source. The stale value was a doc-claim in MAST ADR-001 only;
no Rust constant existed yet to replace. This PR is the load-bearing
constant the scheduler's future TP-vs-MP cost model will consume.
Why constants-only (no TP/MP decision logic)?
Per the issue triage and
feedback_solo_mode_and_pr_workflow.mdscoping discipline: PRs #6 / #10 / #13 are scheduler scaffolding;
there is no live TP/MP decision logic to retrofit yet. The constants
are the load-bearing artefact for the next PR (downstream cost
model). Building the cost model on top of these constants is filed
as a separate scoped follow-up. Bench-harness validation against
measured throughput on first hardware bring-up is also a separate PR.
Tests
15 unit tests pass (5 new in
bandwidth::tests+ 10 pre-existing),9 integration tests pass, 3 doc-tests pass:
local_ddr_bw_is_not_the_stale_12_8_gb_per_sec_value— regressionguard against accidental revert; asserts
LOCAL_DDR_BW < 4 GB/senvelope (rejects both 6.4 and 12.8 GB/s; allows honest rev-B bump).
local_ddr_bw_is_inside_observed_range— pins to the 1.5–2.4 GB/sband reported by Stays recon.
intercard_bw_matches_innerjib7ea_pinout_section_9— pinsINTERCARD_BWto the doc-stated 500 MB/s.local_ddr_dominates_intercard_by_at_least_4x—LOCAL_DDR_BW >= 4 × INTERCARD_BW(rev-A lands at exactly 4× with the spec values;uses
>=so the rev-A baseline passes; failure forces a TP/MPheuristic re-derivation conversation).
constants_are_positive— rules out divide-by-zero inthroughput-divided cost estimates.
lib.rsshows intended usage:Implementation note: clippy
assertions_on_constantsConst-vs-literal asserts in
#[cfg(test)]would normally tripclippy::assertions_on_constants. The test module gates the lintwith
#![allow(clippy::assertions_on_constants)]because the intentof every test is exactly that — pin a specific
u64constant againsta specific literal envelope. Refactoring through
core::hint::black_boxwould obscure intent without changing behaviour.
Test plan
cargo build --workspacecargo test -p spanker-scheduler(15 unit + 9 integration + 3 doc-tests = 27)cargo clippy --workspace --all-targets -- -D warningscargo fmt --all -- --checkgithub(https://github.com/popsolutions/Spanker)origin(Forgejo, https://git.pop.coop/pop/Spanker)Out of scope
(separate scoped PR).
hardware bring-up (separate PR).
Refs
Authored by Agent 3 (Software Stack — Spanker).