Skip to content

feat(scheduler): bandwidth model constants — realistic ECP5 + 8b/10b ceilings (closes #14)#17

Merged
marcos-mendez merged 1 commit into
mainfrom
feat/stream-3/pr-14-bandwidth-model-correction
May 6, 2026
Merged

feat(scheduler): bandwidth model constants — realistic ECP5 + 8b/10b ceilings (closes #14)#17
marcos-mendez merged 1 commit into
mainfrom
feat/stream-3/pr-14-bandwidth-model-correction

Conversation

@marcos-mendez

Copy link
Copy Markdown
Member

Summary

Codifies the corrected rev-A bandwidth ceilings in the Spanker
scheduler so its capacity-planning logic uses real numbers, not the
stale ADR-001 "DDR3-1600 = 12.8 GB/s" theoretical claim that Stays
PR #26 (merged 2026-05-06) invalidated for the open ECP5 toolchain.

Adds a new module src/scheduler/src/bandwidth.rs with two
pub const u64 values, re-exported at the crate root:

pub const LOCAL_DDR_BW_BYTES_PER_SEC: u64 = 2_000_000_000; // 2.0 GB/s
pub const INTERCARD_BW_BYTES_PER_SEC: u64 =   500_000_000; // 500 MB/s per direction

LOCAL_DDR_BW is the midpoint of the 1.5–2.4 GB/s realistic range
observed across the three production references that drive
ECP5DDRPHY on the open toolchain (OrangeCrab, Trellis Board,
Versa-ECP5). INTERCARD_BW is the 4 lanes × 1.25 Gbps × 8b/10b ≈
500 MB/s number from InnerJib7EA pinout §9 (current 8b/10b baseline;
becomes ~600 MB/s when ADR-014 selects 64b/66b — bump in lockstep
with the ADR PR).

Source-of-truth references

Did I find / replace any stale 12.8 or 12_800_000_000?

No. Repo-wide grep for 12.8, 12_800, LOCAL_DDR_BW, and
BW_BYTES_PER_SEC across *.rs, *.toml, and *.md (excluding
target/, .git/, .claude/worktrees/, external/) returned zero
hits in source. The stale value was a doc-claim in MAST ADR-001 only;
no Rust constant existed yet to replace. This PR is the load-bearing
constant the scheduler's future TP-vs-MP cost model will consume.

Why constants-only (no TP/MP decision logic)?

Per the issue triage and feedback_solo_mode_and_pr_workflow.md
scoping discipline: PRs #6 / #10 / #13 are scheduler scaffolding;
there is no live TP/MP decision logic to retrofit yet. The constants
are the load-bearing artefact for the next PR (downstream cost
model). Building the cost model on top of these constants is filed
as a separate scoped follow-up. Bench-harness validation against
measured throughput on first hardware bring-up is also a separate PR.

Tests

15 unit tests pass (5 new in bandwidth::tests + 10 pre-existing),
9 integration tests pass, 3 doc-tests pass:

  • local_ddr_bw_is_not_the_stale_12_8_gb_per_sec_value — regression
    guard against accidental revert; asserts LOCAL_DDR_BW < 4 GB/s
    envelope (rejects both 6.4 and 12.8 GB/s; allows honest rev-B bump).
  • local_ddr_bw_is_inside_observed_range — pins to the 1.5–2.4 GB/s
    band reported by Stays recon.
  • intercard_bw_matches_innerjib7ea_pinout_section_9 — pins
    INTERCARD_BW to the doc-stated 500 MB/s.
  • local_ddr_dominates_intercard_by_at_least_4xLOCAL_DDR_BW >= 4 × INTERCARD_BW (rev-A lands at exactly 4× with the spec values;
    uses >= so the rev-A baseline passes; failure forces a TP/MP
    heuristic re-derivation conversation).
  • constants_are_positive — rules out divide-by-zero in
    throughput-divided cost estimates.
  • Crate-level doc-test in lib.rs shows intended usage:
    use spanker_scheduler::{LOCAL_DDR_BW_BYTES_PER_SEC, INTERCARD_BW_BYTES_PER_SEC};
    assert!(LOCAL_DDR_BW_BYTES_PER_SEC > INTERCARD_BW_BYTES_PER_SEC);

Implementation note: clippy assertions_on_constants

Const-vs-literal asserts in #[cfg(test)] would normally trip
clippy::assertions_on_constants. The test module gates the lint
with #![allow(clippy::assertions_on_constants)] because the intent
of every test is exactly that — pin a specific u64 constant against
a specific literal envelope. Refactoring through core::hint::black_box
would obscure intent without changing behaviour.

Test plan

Out of scope

  • Building the TP/MP decision logic on top of these constants
    (separate scoped PR).
  • Modifying ADR-001 — that's MAST #32 (Stream 1).
  • Bench-harness validation against measured throughput on first
    hardware bring-up (separate PR).

Refs

Authored by Agent 3 (Software Stack — Spanker).

…ceilings (closes #14)

Adds `spanker_scheduler::bandwidth` with two `pub const u64` values
that the TP-vs-MP partitioning logic must use instead of the stale
ADR-001 "DDR3-1600 = 12.8 GB/s" theoretical number:

- `LOCAL_DDR_BW_BYTES_PER_SEC = 2_000_000_000` (2.0 GB/s — midpoint of
  the 1.5–2.4 GB/s realistic ceiling observed across the three
  production references that drive `ECP5DDRPHY` on the open ECP5
  toolchain: OrangeCrab, Trellis Board, Versa-ECP5). Source-of-truth:
  Stays `docs/upstream-contributions/2026-05-06-litedram-ecp5.md`
  §Bandwidth budget (merged 2026-05-06).
- `INTERCARD_BW_BYTES_PER_SEC = 500_000_000` (500 MB/s per direction —
  4 lanes × 1.25 Gbps × 8b/10b coding). Source-of-truth: InnerJib7EA
  `docs/hw/intercard-connector-pinout.md` §9. When ADR-014 lands and
  selects 64b/66b coding the effective rate becomes ~600 MB/s; bump
  in the same PR that lands ADR-014.

Re-exported from `spanker_scheduler` so consumers can write
`use spanker_scheduler::{LOCAL_DDR_BW_BYTES_PER_SEC,
INTERCARD_BW_BYTES_PER_SEC};` directly.

Five unit tests (in `bandwidth::tests`) plus one crate-level doc-test
pin the contract:

- `local_ddr_bw_is_not_the_stale_12_8_gb_per_sec_value` — regression
  guard against accidental revert to the theoretical-peak number;
  asserts `LOCAL_DDR_BW < 4 GB/s` (envelope below both 6.4 and
  12.8 GB/s but with rev-B headroom).
- `local_ddr_bw_is_inside_observed_range` — pins to the 1.5–2.4 GB/s
  band reported by Stays recon.
- `intercard_bw_matches_innerjib7ea_pinout_section_9` — pins
  INTERCARD to the doc-stated 500 MB/s.
- `local_ddr_dominates_intercard_by_at_least_4x` — `LOCAL_DDR_BW >=
  4 × INTERCARD_BW` (rev-A lands at exactly 4× with the spec values;
  the assertion uses `>=` so the rev-A baseline passes; if a future
  bump breaks the invariant the TP/MP heuristics need re-deriving).
- `constants_are_positive` — rules out a zero that would silently
  become a divide-by-zero in throughput-divided cost estimates.

No TP/MP decision logic in the scheduler today (PRs #6/#10/#13 are
scaffolding), so this PR is intentionally constants-only — the
load-bearing artefact for downstream cost-model PRs to consume.

Cross-stream:
- MAST issue #32 amends ADR-001's bandwidth claim; constant values
  here track that amendment.
- InnerJib7EA PR #11 §9 is the source-of-truth for the 500 MB/s
  intercard ceiling.
- Stays PR #26 (merged 2026-05-06) is the source-of-truth for the
  1.5–2.4 GB/s DDR ceiling.

No stale `12.8` or `12_800_000_000` constant existed in any Rust
source under `src/` prior to this PR — the value was only a doc-claim
in MAST ADR-001. Confirmed via repo-wide grep.

Out of scope: TP/MP decision logic, ADR-001 amendment (MAST #32),
bench-harness validation. All filed as follow-ups.

Refs: #14, MAST #32, Stays #26, InnerJib7EA #11.

Authored by Agent 3 (Software Stack — Spanker).

Signed-off-by: Marcos (Agent 3) <popsolutions.co@gmail.com>
@marcos-mendez marcos-mendez added stream-3 Software Stack (Agent 3) — driver, runtime, GGML, Spanker review-pending PR awaiting reviewer agent (R) labels May 6, 2026
@marcos-mendez

Copy link
Copy Markdown
Member Author

Review by Agent R

Verdict: APPROVE
Severity counts: CRITICAL=0 HIGH=0 MEDIUM=0 LOW=0

Pre-review gates (61198e7e)

  • CI: 3/3 SUCCESS (Docs/SPDX, Driver/kbuild, Runtime/cargo)
  • Mergeable: yes
  • DCO sign-off + SPDX preserved
  • 186+/0- across 2 files (1 new bandwidth.rs, 1 modified lib.rs)

Local verification (Agent R)

  • cargo test -p spanker-scheduler --all-targets: 27/27 pass (15 unit including 5 new in bandwidth::tests, 9 integration, 3 doc-tests)
  • cargo clippy --workspace --all-targets -- -D warnings: clean
  • cargo fmt --check --all: clean

Test names auto-document the contract (excellent)

  • intercard_bw_matches_innerjib7ea_pinout_section_9 → cross-reference encoded in test name; future doc drift between InnerJib7EA and Spanker triggers this assertion
  • local_ddr_bw_is_not_the_stale_12_8_gb_per_sec_value → regression guard against accidental revert to ADR-001's pre-amendment number
  • local_ddr_bw_is_inside_observed_range → asserts the constant sits inside the recon-evidenced 1.5-2.4 GB/s envelope
  • local_ddr_dominates_intercard_by_at_least_4x → captures the TP-vs-MP heuristic threshold (uses >= correctly — Agent 3 caught my brief's bug where > would have failed at 2 GB/s vs 4×500 MB/s)
  • constants_are_positive → trivial sanity guard

Cross-stream cross-checks verified by Agent 3

  • MAST issue #32 (ADR-001 amendment, in flight as Stays PR #28) cited in module docstring + test comment
  • InnerJib7EA pinout §9 (500 MB/s per direction, 8b/10b) — value matches INTERCARD_BW = 500_000_000
  • Stays LiteDRAM recon (1.5–2.4 GB/s realistic) — midpoint 2.0 GB/s matches LOCAL_DDR_BW = 2_000_000_000
  • 64b/66b → ~600 MB/s bump target documented in module docstring for when ADR-014 lands

Implementation note (good engineering judgment)

Agent 3 used >= instead of > in the 4x-dominance test because at the supplied constants 2e9 == 4×500e6 exactly. The brief's > would have falsely failed. Spirit of the guard preserved. Documented in test docstring + PR body. Approving the deviation.

No stale value found

Repo-wide grep across *.rs/*.toml/*.md (excluding target/git/external/worktrees) returned zero hits for 12.8 or 12_800_000_000. The stale value lived only in MAST ADR-001's doc claim — currently being amended in Stays PR #28 (R-approved, awaiting Marcos sign-off). This PR is the load-bearing constant the future TP/MP cost model will consume.

Action

Merging via two-step. Forgejo sync follows. The bandwidth correction loop is now closed end-to-end across all three streams (recon doc → ADR amendment proposed → scheduler constant codified).

Authored by Agent R (Reviewer).

@marcos-mendez marcos-mendez merged commit 364ba12 into main May 6, 2026
3 checks passed
@marcos-mendez marcos-mendez deleted the feat/stream-3/pr-14-bandwidth-model-correction branch May 6, 2026 14:46
@marcos-mendez marcos-mendez restored the feat/stream-3/pr-14-bandwidth-model-correction branch May 6, 2026 14:46
@marcos-mendez marcos-mendez deleted the feat/stream-3/pr-14-bandwidth-model-correction branch May 6, 2026 14:47
@marcos-mendez marcos-mendez restored the feat/stream-3/pr-14-bandwidth-model-correction branch May 6, 2026 14:58
marcos-mendez added a commit that referenced this pull request May 6, 2026
Adds `spanker_scheduler::decision` with a bandwidth-bound cost model
that picks tensor-parallel vs model-parallel for a given matmul tile
on N sails, using the rev-A bandwidth constants from PR #17.

## Cost model (mock-only, bandwidth-bound)

For a tile (m, n, k) with `bytes_per_element` per scalar on `n_sails`:

- TP: weights sharded → `weight_bytes / (n * local_ddr_bw)`;
      activations AllReduce'd → `activation_bytes / intercard_bw`.
- MP: weights local on one sail → `weight_bytes / local_ddr_bw`;
      activations sharded forward → `activation_bytes / (n * intercard_bw)`.

Pick the smaller. Cost is bandwidth-bound only — compute and latency
deferred until silicon characterisation lands. The asymmetric
pipeline-MP cost (per-card-boundary forwarding) is intentionally
NOT modelled in this PR; it requires a more sophisticated cost
surface and is filed as a follow-up.

## Public surface

- `Strategy` (`#[non_exhaustive]`) — `TensorParallel | ModelParallel`
- `TileShape { m, n, k, bytes_per_element }`
- `pick_strategy(tile, n_sails, local_ddr_bw, intercard_bw) -> Strategy`
- `TileShape::weight_bytes()`, `activation_bytes()` — saturating

Re-exported from `spanker_scheduler` root.

## Capacity-planning datapoint (TinyLlama-1.1B Q4_0 decode)

For the FFN up-projection tile `m=1, n=5632, k=2048, bpe=2` on 4 sails
with rev-A constants (LOCAL_DDR=2.0 GB/s, INTERCARD=500 MB/s), the
decision is **ModelParallel**. The per-token activation (~11 KB) is
the scarce-bandwidth axis we want to shard, NOT the per-token weight
read (~4 KB in the m=1 decode regime). Prefill (m >> 1) would flip
back to TP — that scenario lands when the runtime exposes prefill
batch shapes.

## Tests (8 unit + 1 doctest)

- `pick_strategy_returns_tp_when_activation_dominates`
- `pick_strategy_returns_mp_when_weights_dominate`
- `pick_strategy_n_sails_1_returns_tp` (degenerate)
- `pick_strategy_n_sails_0_treated_as_1`
- `pick_strategy_with_bandwidth_overrides` (flips via custom BW)
- `pick_strategy_for_tinyllama_decode_step` (capacity-planning)
- `pick_strategy_saturates_on_overflow_inputs` (panic-free)
- `pick_strategy_with_zero_bandwidth_picks_tp_default`
- `tile_shape_byte_helpers`
- module-level doctest showing default-constants usage
- `compile_fail` doctest on `Strategy` proving `#[non_exhaustive]`

Cargo gates: build, test (24+9+5=38 pass), clippy -D warnings, fmt.

Authored by Agent 3 (Software Stack — Spanker).

Signed-off-by: Marcos <m@pop.coop>
Co-authored-by: Marcos <m@pop.coop>
marcos-mendez added a commit that referenced this pull request May 6, 2026
…loses #21) (#22)

Add `HOST_LINK_BW_BYTES_PER_SEC = 100_000_000` (100 MB/s) to the
bandwidth model, capturing the rev-A GbE host link as the third
tier of the bandwidth hierarchy:

  Local DDR  (per card)      ~2.0 GB/s   LOCAL_DDR_BW
  Inter-card (per direction) ~500 MB/s   INTERCARD_BW
  Host link  (GbE)           ~100 MB/s   HOST_LINK_BW (NEW)

Source-of-truth: Stays
`docs/upstream-contributions/2026-05-06-liteeth-ecp5-sgmii.md`
(Stays PR #34, merged 2026-05-06). Community measurements on
Versa-ECP5 and ECPIX-5 land at 800-940 Mbps UDP iperf3, i.e.
80-94 % of GbE line rate. The 100 MB/s number is the realistic
post-IP/UDP/Ethernet-header steady-state ceiling.

The host link is 5x slower than inter-card and 20x slower than
local DDR — it is the dominant cost when collective ops must
reach the host (model load, gradient checkpoint to host RAM,
dataset streaming, prompt-embedding upload).

## Scope

Minimal — per the issue spec's "if pick_strategy already handles
this" branch:

- `pick_strategy` is the per-token TP/MP decision and most decode
  tokens stay on-card; host-link cost is small per-token and only
  matters at session boundaries.
- No callers exist today for a session-level cost-budget API, so
  introducing `bytes_per_second_per_token_estimate` would be
  speculative generality (YAGNI). Defer until the runtime needs
  it.
- This PR keeps the public surface to a constant + module-level
  doctest update + tests.

## Tests

3 new unit tests in `bandwidth.rs`:

- `host_link_bw_constant_matches_recon_doc` — pins value to
  100_000_000 (guards against silent "round up to 125 MB/s line
  rate" drift).
- `host_link_bw_is_slowest_hop` — pins the three-tier ordering
  HOST_LINK < INTERCARD < LOCAL_DDR.
- `host_link_bw_is_inside_observed_range` — pins 80-125 MB/s
  envelope (community recon range, with line-rate ceiling).

Plus the existing `constants_are_positive` test extended to cover
the new constant.

Module-level doctest in `bandwidth.rs` updated to demonstrate all
three constants. Crate-root doctest in `lib.rs` updated to assert
the three-tier ordering.

## Cargo gates

- `cargo build -p spanker-scheduler`: green
- `cargo test -p spanker-scheduler`: 27 unit + 9 integration + 6
  doctests, all green (delta: +3 unit tests vs PR #19 baseline)
- `cargo clippy -p spanker-scheduler --all-targets -- -D warnings`:
  green
- `cargo fmt -p spanker-scheduler -- --check`: clean

Refs:
- #21 (this issue)
- popsolutions/Stays#34 (LiteEth ECP5 SGMII recon, source-of-truth)
- #17 (PR that landed initial 2-tier model)
- #19 (PR that landed pick_strategy)

Authored by Agent 3 (Software Stack — Spanker).

Signed-off-by: Marcos <m@pop.coop>
Co-authored-by: Marcos <m@pop.coop>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-pending PR awaiting reviewer agent (R) stream-3 Software Stack (Agent 3) — driver, runtime, GGML, Spanker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[scheduler] Bandwidth model: correct DDR3-on-ECP5 ceiling from 12.8 GB/s (ADR-001 stale) to 1.5-2.4 GB/s

1 participant