feat(scheduler): bandwidth model constants — realistic ECP5 + 8b/10b ceilings (closes #14) by marcos-mendez · Pull Request #17 · popsolutions/Spanker

marcos-mendez · 2026-05-06T14:44:40Z

Summary

Codifies the corrected rev-A bandwidth ceilings in the Spanker
scheduler so its capacity-planning logic uses real numbers, not the
stale ADR-001 "DDR3-1600 = 12.8 GB/s" theoretical claim that Stays
PR #26 (merged 2026-05-06) invalidated for the open ECP5 toolchain.

Adds a new module src/scheduler/src/bandwidth.rs with two
pub const u64 values, re-exported at the crate root:

pub const LOCAL_DDR_BW_BYTES_PER_SEC: u64 = 2_000_000_000; // 2.0 GB/s
pub const INTERCARD_BW_BYTES_PER_SEC: u64 =   500_000_000; // 500 MB/s per direction

LOCAL_DDR_BW is the midpoint of the 1.5–2.4 GB/s realistic range
observed across the three production references that drive
ECP5DDRPHY on the open toolchain (OrangeCrab, Trellis Board,
Versa-ECP5). INTERCARD_BW is the 4 lanes × 1.25 Gbps × 8b/10b ≈
500 MB/s number from InnerJib7EA pinout §9 (current 8b/10b baseline;
becomes ~600 MB/s when ADR-014 selects 64b/66b — bump in lockstep
with the ADR PR).

Source-of-truth references

LOCAL_DDR_BW: Stays
docs/upstream-contributions/2026-05-06-litedram-ecp5.md
§Bandwidth budget (Stays PR #26, merged 2026-05-06).
INTERCARD_BW: InnerJib7EA
docs/hw/intercard-connector-pinout.md
§9 (InnerJib7EA PR chore(scheduler): doc-only sentinel tests + closure-based all_reduce dispatch + comment ordering #11 corrected to 500 MB/s).
Cross-stream amendment: MAST issue #32 — ADR-001 bandwidth
amendment (Stream 1 in flight).

Did I find / replace any stale `12.8` or `12_800_000_000`?

No. Repo-wide grep for 12.8, 12_800, LOCAL_DDR_BW, and
BW_BYTES_PER_SEC across *.rs, *.toml, and *.md (excluding
target/, .git/, .claude/worktrees/, external/) returned zero
hits in source. The stale value was a doc-claim in MAST ADR-001 only;
no Rust constant existed yet to replace. This PR is the load-bearing
constant the scheduler's future TP-vs-MP cost model will consume.

Why constants-only (no TP/MP decision logic)?

Per the issue triage and feedback_solo_mode_and_pr_workflow.md
scoping discipline: PRs #6 / #10 / #13 are scheduler scaffolding;
there is no live TP/MP decision logic to retrofit yet. The constants
are the load-bearing artefact for the next PR (downstream cost
model). Building the cost model on top of these constants is filed
as a separate scoped follow-up. Bench-harness validation against
measured throughput on first hardware bring-up is also a separate PR.

Tests

15 unit tests pass (5 new in bandwidth::tests + 10 pre-existing),
9 integration tests pass, 3 doc-tests pass:

local_ddr_bw_is_not_the_stale_12_8_gb_per_sec_value — regression
guard against accidental revert; asserts LOCAL_DDR_BW < 4 GB/s
envelope (rejects both 6.4 and 12.8 GB/s; allows honest rev-B bump).
local_ddr_bw_is_inside_observed_range — pins to the 1.5–2.4 GB/s
band reported by Stays recon.
intercard_bw_matches_innerjib7ea_pinout_section_9 — pins
INTERCARD_BW to the doc-stated 500 MB/s.
local_ddr_dominates_intercard_by_at_least_4x — LOCAL_DDR_BW >= 4 × INTERCARD_BW (rev-A lands at exactly 4× with the spec values;
uses >= so the rev-A baseline passes; failure forces a TP/MP
heuristic re-derivation conversation).
constants_are_positive — rules out divide-by-zero in
throughput-divided cost estimates.

Crate-level doc-test in lib.rs shows intended usage:

use spanker_scheduler::{LOCAL_DDR_BW_BYTES_PER_SEC, INTERCARD_BW_BYTES_PER_SEC};
assert!(LOCAL_DDR_BW_BYTES_PER_SEC > INTERCARD_BW_BYTES_PER_SEC);

Implementation note: clippy `assertions_on_constants`

Const-vs-literal asserts in #[cfg(test)] would normally trip
clippy::assertions_on_constants. The test module gates the lint
with #![allow(clippy::assertions_on_constants)] because the intent
of every test is exactly that — pin a specific u64 constant against
a specific literal envelope. Refactoring through core::hint::black_box
would obscure intent without changing behaviour.

Test plan

cargo build --workspace
cargo test -p spanker-scheduler (15 unit + 9 integration + 3 doc-tests = 27)
cargo clippy --workspace --all-targets -- -D warnings
cargo fmt --all -- --check
Pushed to github (https://github.com/popsolutions/Spanker)
Pushed to origin (Forgejo, https://git.pop.coop/pop/Spanker)

Out of scope

Building the TP/MP decision logic on top of these constants
(separate scoped PR).
Modifying ADR-001 — that's MAST #32 (Stream 1).
Bench-harness validation against measured throughput on first
hardware bring-up (separate PR).

Refs

Closes [scheduler] Bandwidth model: correct DDR3-on-ECP5 ceiling from 12.8 GB/s (ADR-001 stale) to 1.5-2.4 GB/s #14
Cross-stream: MAST #32 (ADR-001 amendment)
Cross-stream: InnerJib7EA PR chore(scheduler): doc-only sentinel tests + closure-based all_reduce dispatch + comment ordering #11 (intercard pinout §9)
Cross-stream: Stays PR #26 (LiteDRAM ECP5 recon, merged 2026-05-06)

Authored by Agent 3 (Software Stack — Spanker).

…ceilings (closes #14) Adds `spanker_scheduler::bandwidth` with two `pub const u64` values that the TP-vs-MP partitioning logic must use instead of the stale ADR-001 "DDR3-1600 = 12.8 GB/s" theoretical number: - `LOCAL_DDR_BW_BYTES_PER_SEC = 2_000_000_000` (2.0 GB/s — midpoint of the 1.5–2.4 GB/s realistic ceiling observed across the three production references that drive `ECP5DDRPHY` on the open ECP5 toolchain: OrangeCrab, Trellis Board, Versa-ECP5). Source-of-truth: Stays `docs/upstream-contributions/2026-05-06-litedram-ecp5.md` §Bandwidth budget (merged 2026-05-06). - `INTERCARD_BW_BYTES_PER_SEC = 500_000_000` (500 MB/s per direction — 4 lanes × 1.25 Gbps × 8b/10b coding). Source-of-truth: InnerJib7EA `docs/hw/intercard-connector-pinout.md` §9. When ADR-014 lands and selects 64b/66b coding the effective rate becomes ~600 MB/s; bump in the same PR that lands ADR-014. Re-exported from `spanker_scheduler` so consumers can write `use spanker_scheduler::{LOCAL_DDR_BW_BYTES_PER_SEC, INTERCARD_BW_BYTES_PER_SEC};` directly. Five unit tests (in `bandwidth::tests`) plus one crate-level doc-test pin the contract: - `local_ddr_bw_is_not_the_stale_12_8_gb_per_sec_value` — regression guard against accidental revert to the theoretical-peak number; asserts `LOCAL_DDR_BW < 4 GB/s` (envelope below both 6.4 and 12.8 GB/s but with rev-B headroom). - `local_ddr_bw_is_inside_observed_range` — pins to the 1.5–2.4 GB/s band reported by Stays recon. - `intercard_bw_matches_innerjib7ea_pinout_section_9` — pins INTERCARD to the doc-stated 500 MB/s. - `local_ddr_dominates_intercard_by_at_least_4x` — `LOCAL_DDR_BW >= 4 × INTERCARD_BW` (rev-A lands at exactly 4× with the spec values; the assertion uses `>=` so the rev-A baseline passes; if a future bump breaks the invariant the TP/MP heuristics need re-deriving). - `constants_are_positive` — rules out a zero that would silently become a divide-by-zero in throughput-divided cost estimates. No TP/MP decision logic in the scheduler today (PRs #6/#10/#13 are scaffolding), so this PR is intentionally constants-only — the load-bearing artefact for downstream cost-model PRs to consume. Cross-stream: - MAST issue #32 amends ADR-001's bandwidth claim; constant values here track that amendment. - InnerJib7EA PR #11 §9 is the source-of-truth for the 500 MB/s intercard ceiling. - Stays PR #26 (merged 2026-05-06) is the source-of-truth for the 1.5–2.4 GB/s DDR ceiling. No stale `12.8` or `12_800_000_000` constant existed in any Rust source under `src/` prior to this PR — the value was only a doc-claim in MAST ADR-001. Confirmed via repo-wide grep. Out of scope: TP/MP decision logic, ADR-001 amendment (MAST #32), bench-harness validation. All filed as follow-ups. Refs: #14, MAST #32, Stays #26, InnerJib7EA #11. Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos (Agent 3) <popsolutions.co@gmail.com>

marcos-mendez · 2026-05-06T14:46:29Z

Review by Agent R

Verdict: APPROVE
Severity counts: CRITICAL=0 HIGH=0 MEDIUM=0 LOW=0

Pre-review gates (`61198e7e`)

CI: 3/3 SUCCESS (Docs/SPDX, Driver/kbuild, Runtime/cargo)
Mergeable: yes
DCO sign-off + SPDX preserved
186+/0- across 2 files (1 new bandwidth.rs, 1 modified lib.rs)

Local verification (Agent R)

cargo test -p spanker-scheduler --all-targets: 27/27 pass (15 unit including 5 new in bandwidth::tests, 9 integration, 3 doc-tests)
cargo clippy --workspace --all-targets -- -D warnings: clean
cargo fmt --check --all: clean

Test names auto-document the contract (excellent)

intercard_bw_matches_innerjib7ea_pinout_section_9 → cross-reference encoded in test name; future doc drift between InnerJib7EA and Spanker triggers this assertion
local_ddr_bw_is_not_the_stale_12_8_gb_per_sec_value → regression guard against accidental revert to ADR-001's pre-amendment number
local_ddr_bw_is_inside_observed_range → asserts the constant sits inside the recon-evidenced 1.5-2.4 GB/s envelope
local_ddr_dominates_intercard_by_at_least_4x → captures the TP-vs-MP heuristic threshold (uses >= correctly — Agent 3 caught my brief's bug where > would have failed at 2 GB/s vs 4×500 MB/s)
constants_are_positive → trivial sanity guard

Cross-stream cross-checks verified by Agent 3

MAST issue #32 (ADR-001 amendment, in flight as Stays PR #28) cited in module docstring + test comment
InnerJib7EA pinout §9 (500 MB/s per direction, 8b/10b) — value matches INTERCARD_BW = 500_000_000
Stays LiteDRAM recon (1.5–2.4 GB/s realistic) — midpoint 2.0 GB/s matches LOCAL_DDR_BW = 2_000_000_000
64b/66b → ~600 MB/s bump target documented in module docstring for when ADR-014 lands

Implementation note (good engineering judgment)

Agent 3 used >= instead of > in the 4x-dominance test because at the supplied constants 2e9 == 4×500e6 exactly. The brief's > would have falsely failed. Spirit of the guard preserved. Documented in test docstring + PR body. Approving the deviation.

No stale value found

Repo-wide grep across *.rs/*.toml/*.md (excluding target/git/external/worktrees) returned zero hits for 12.8 or 12_800_000_000. The stale value lived only in MAST ADR-001's doc claim — currently being amended in Stays PR #28 (R-approved, awaiting Marcos sign-off). This PR is the load-bearing constant the future TP/MP cost model will consume.

Action

Merging via two-step. Forgejo sync follows. The bandwidth correction loop is now closed end-to-end across all three streams (recon doc → ADR amendment proposed → scheduler constant codified).

Authored by Agent R (Reviewer).

Adds `spanker_scheduler::decision` with a bandwidth-bound cost model that picks tensor-parallel vs model-parallel for a given matmul tile on N sails, using the rev-A bandwidth constants from PR #17. ## Cost model (mock-only, bandwidth-bound) For a tile (m, n, k) with `bytes_per_element` per scalar on `n_sails`: - TP: weights sharded → `weight_bytes / (n * local_ddr_bw)`; activations AllReduce'd → `activation_bytes / intercard_bw`. - MP: weights local on one sail → `weight_bytes / local_ddr_bw`; activations sharded forward → `activation_bytes / (n * intercard_bw)`. Pick the smaller. Cost is bandwidth-bound only — compute and latency deferred until silicon characterisation lands. The asymmetric pipeline-MP cost (per-card-boundary forwarding) is intentionally NOT modelled in this PR; it requires a more sophisticated cost surface and is filed as a follow-up. ## Public surface - `Strategy` (`#[non_exhaustive]`) — `TensorParallel | ModelParallel` - `TileShape { m, n, k, bytes_per_element }` - `pick_strategy(tile, n_sails, local_ddr_bw, intercard_bw) -> Strategy` - `TileShape::weight_bytes()`, `activation_bytes()` — saturating Re-exported from `spanker_scheduler` root. ## Capacity-planning datapoint (TinyLlama-1.1B Q4_0 decode) For the FFN up-projection tile `m=1, n=5632, k=2048, bpe=2` on 4 sails with rev-A constants (LOCAL_DDR=2.0 GB/s, INTERCARD=500 MB/s), the decision is **ModelParallel**. The per-token activation (~11 KB) is the scarce-bandwidth axis we want to shard, NOT the per-token weight read (~4 KB in the m=1 decode regime). Prefill (m >> 1) would flip back to TP — that scenario lands when the runtime exposes prefill batch shapes. ## Tests (8 unit + 1 doctest) - `pick_strategy_returns_tp_when_activation_dominates` - `pick_strategy_returns_mp_when_weights_dominate` - `pick_strategy_n_sails_1_returns_tp` (degenerate) - `pick_strategy_n_sails_0_treated_as_1` - `pick_strategy_with_bandwidth_overrides` (flips via custom BW) - `pick_strategy_for_tinyllama_decode_step` (capacity-planning) - `pick_strategy_saturates_on_overflow_inputs` (panic-free) - `pick_strategy_with_zero_bandwidth_picks_tp_default` - `tile_shape_byte_helpers` - module-level doctest showing default-constants usage - `compile_fail` doctest on `Strategy` proving `#[non_exhaustive]` Cargo gates: build, test (24+9+5=38 pass), clippy -D warnings, fmt. Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos <m@pop.coop> Co-authored-by: Marcos <m@pop.coop>

…loses #21) (#22) Add `HOST_LINK_BW_BYTES_PER_SEC = 100_000_000` (100 MB/s) to the bandwidth model, capturing the rev-A GbE host link as the third tier of the bandwidth hierarchy: Local DDR (per card) ~2.0 GB/s LOCAL_DDR_BW Inter-card (per direction) ~500 MB/s INTERCARD_BW Host link (GbE) ~100 MB/s HOST_LINK_BW (NEW) Source-of-truth: Stays `docs/upstream-contributions/2026-05-06-liteeth-ecp5-sgmii.md` (Stays PR #34, merged 2026-05-06). Community measurements on Versa-ECP5 and ECPIX-5 land at 800-940 Mbps UDP iperf3, i.e. 80-94 % of GbE line rate. The 100 MB/s number is the realistic post-IP/UDP/Ethernet-header steady-state ceiling. The host link is 5x slower than inter-card and 20x slower than local DDR — it is the dominant cost when collective ops must reach the host (model load, gradient checkpoint to host RAM, dataset streaming, prompt-embedding upload). ## Scope Minimal — per the issue spec's "if pick_strategy already handles this" branch: - `pick_strategy` is the per-token TP/MP decision and most decode tokens stay on-card; host-link cost is small per-token and only matters at session boundaries. - No callers exist today for a session-level cost-budget API, so introducing `bytes_per_second_per_token_estimate` would be speculative generality (YAGNI). Defer until the runtime needs it. - This PR keeps the public surface to a constant + module-level doctest update + tests. ## Tests 3 new unit tests in `bandwidth.rs`: - `host_link_bw_constant_matches_recon_doc` — pins value to 100_000_000 (guards against silent "round up to 125 MB/s line rate" drift). - `host_link_bw_is_slowest_hop` — pins the three-tier ordering HOST_LINK < INTERCARD < LOCAL_DDR. - `host_link_bw_is_inside_observed_range` — pins 80-125 MB/s envelope (community recon range, with line-rate ceiling). Plus the existing `constants_are_positive` test extended to cover the new constant. Module-level doctest in `bandwidth.rs` updated to demonstrate all three constants. Crate-root doctest in `lib.rs` updated to assert the three-tier ordering. ## Cargo gates - `cargo build -p spanker-scheduler`: green - `cargo test -p spanker-scheduler`: 27 unit + 9 integration + 6 doctests, all green (delta: +3 unit tests vs PR #19 baseline) - `cargo clippy -p spanker-scheduler --all-targets -- -D warnings`: green - `cargo fmt -p spanker-scheduler -- --check`: clean Refs: - #21 (this issue) - popsolutions/Stays#34 (LiteEth ECP5 SGMII recon, source-of-truth) - #17 (PR that landed initial 2-tier model) - #19 (PR that landed pick_strategy) Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos <m@pop.coop> Co-authored-by: Marcos <m@pop.coop>

marcos-mendez added stream-3 Software Stack (Agent 3) — driver, runtime, GGML, Spanker review-pending PR awaiting reviewer agent (R) labels May 6, 2026

marcos-mendez merged commit 364ba12 into main May 6, 2026
3 checks passed

marcos-mendez deleted the feat/stream-3/pr-14-bandwidth-model-correction branch May 6, 2026 14:46

marcos-mendez restored the feat/stream-3/pr-14-bandwidth-model-correction branch May 6, 2026 14:46

marcos-mendez deleted the feat/stream-3/pr-14-bandwidth-model-correction branch May 6, 2026 14:47

marcos-mendez restored the feat/stream-3/pr-14-bandwidth-model-correction branch May 6, 2026 14:58

This was referenced May 6, 2026

chore(ggml-spanker): polish bundle (5 deferred MEDIUM/LOW from PR #15) #18

Merged

feat(scheduler): TP-vs-MP pick_strategy cost function #19

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(scheduler): bandwidth model constants — realistic ECP5 + 8b/10b ceilings (closes #14)#17

feat(scheduler): bandwidth model constants — realistic ECP5 + 8b/10b ceilings (closes #14)#17
marcos-mendez merged 1 commit into
mainfrom
feat/stream-3/pr-14-bandwidth-model-correction

marcos-mendez commented May 6, 2026

Uh oh!

marcos-mendez commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marcos-mendez commented May 6, 2026

Summary

Source-of-truth references

Did I find / replace any stale 12.8 or 12_800_000_000?

Why constants-only (no TP/MP decision logic)?

Tests

Implementation note: clippy assertions_on_constants

Test plan

Out of scope

Refs

Uh oh!

marcos-mendez commented May 6, 2026

Review by Agent R

Pre-review gates (61198e7e)

Local verification (Agent R)

Test names auto-document the contract (excellent)

Cross-stream cross-checks verified by Agent 3

Implementation note (good engineering judgment)

No stale value found

Action

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Did I find / replace any stale `12.8` or `12_800_000_000`?

Implementation note: clippy `assertions_on_constants`

Pre-review gates (`61198e7e`)