feat(scheduler): TP-vs-MP pick_strategy cost function #19

Merged: marcos-mendez merged 1 commit into main from feat/stream-3/pr-XX-tp-mp-decision-cost-function on May 6, 2026

@marcos-mendez

Member

Summary

  • Adds a spanker_scheduler::decision module with a bandwidth-bound cost function that picks TP vs MP for a matmul tile on N sails, using the rev-A bandwidth constants from PR #17 (bandwidth model constants with realistic ECP5 + 8b/10b ceilings; closes #14).
  • Public API: Strategy { TensorParallel, ModelParallel } (#[non_exhaustive]), TileShape { m, n, k, bytes_per_element }, pick_strategy(tile, n_sails, local_ddr_bw, intercard_bw) -> Strategy. Re-exported from spanker_scheduler root.
  • Capacity-planning datapoint: TinyLlama-1.1B Q4_0 decode (m=1, n=5632, k=2048, bpe=2) on 4 sails with rev-A constants picks ModelParallel — the per-token activation (~11 KB) is the scarce-bandwidth axis we want to shard.

Cost model

For a tile (m, n, k) with bytes_per_element per scalar on n_sails:

  • TP: weights sharded across sails → weight_bytes / (n_sails * local_ddr_bw); activations AllReduce'd over intercard → activation_bytes / intercard_bw.
  • MP: weights local to one sail → weight_bytes / local_ddr_bw; activations sharded forward → activation_bytes / (n_sails * intercard_bw).

Pick the smaller. Bandwidth-bound only — compute and latency deferred until silicon characterisation. The asymmetric pipeline-MP per-card-boundary forwarding cost is intentionally NOT modelled in this PR; follow-up below.
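
As a rough illustration of the comparison described above, here is a minimal Rust sketch. It is not the crate's implementation: `pick_strategy_sketch` is a hypothetical name, it takes byte counts directly instead of a `TileShape`, and it assumes the two terms of each strategy are summed (the PR does not say whether they are summed or overlapped). The zero-bandwidth and `n_sails = 0` handling mirror the behaviour the test names suggest, which is also an assumption.

```rust
/// Strategy chosen by the sketch (mirrors the two variants named in this PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    TensorParallel,
    ModelParallel,
}

/// Hypothetical stand-in for `pick_strategy`, taking byte counts directly.
/// Bandwidths are in bytes per second.
fn pick_strategy_sketch(
    weight_bytes: u64,
    activation_bytes: u64,
    n_sails: u32,
    local_ddr_bw: u64,
    intercard_bw: u64,
) -> Strategy {
    // Assumption: a zero bandwidth falls back to TP instead of dividing by zero.
    if local_ddr_bw == 0 || intercard_bw == 0 {
        return Strategy::TensorParallel;
    }
    // Assumption: n_sails = 0 is treated as 1.
    let n = n_sails.max(1) as f64;
    let (w, a) = (weight_bytes as f64, activation_bytes as f64);
    let (ddr, ic) = (local_ddr_bw as f64, intercard_bw as f64);

    // TP: weights sharded across sails, activations AllReduce'd over intercard.
    let tp_secs = w / (n * ddr) + a / ic;
    // MP: weights local to one sail, activations sharded forward.
    let mp_secs = w / ddr + a / (n * ic);

    // Ties (for example n_sails = 1) go to TP.
    if mp_secs < tp_secs {
        Strategy::ModelParallel
    } else {
        Strategy::TensorParallel
    }
}

fn main() {
    // TinyLlama decode datapoint: 4096 B weight read and 11264 B activation
    // (the ~4 KB and ~11 KB quoted in this PR), 4 sails, 2.0 GB/s local DDR,
    // 500 MB/s intercard.
    let choice = pick_strategy_sketch(4096, 11_264, 4, 2_000_000_000, 500_000_000);
    println!("{:?}", choice); // prints "ModelParallel"
}
```

With these inputs the TP total is about 23.0 µs and the MP total about 7.7 µs, so the sketch agrees with the ModelParallel datapoint reported in the summary.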

Test plan

  • cargo build -p spanker-scheduler clean
  • cargo test -p spanker-scheduler — 24 unit + 9 integration + 5 doctests = 38 pass
  • cargo clippy -p spanker-scheduler --all-targets -- -D warnings clean
  • cargo fmt -p spanker-scheduler --check clean
  • 9 unit tests covering: degenerate (n=0, n=1), activation-dominated → TP, weight-dominated → MP, bandwidth overrides flip the answer, TinyLlama decode capacity-planning datapoint, overflow saturation, zero-bandwidth guard, byte-helper smoke
  • #[non_exhaustive] regression doctest pinning the semver-evolution guard on Strategy
  • Module-level doctest demonstrating default-constants usage

Follow-up

  • File issue: model an asymmetric pipeline-MP cost surface (per-card-boundary forward cost = (n_sails - 1) * activation_bytes / intercard_bw) so MP is correctly penalised for deep pipelines on bandwidth-starved intercard links.
  • File issue: prefill (m >> 1) regime — confirm the cost model flips back to TP for prefill tiles when the runtime exposes prefill batch shapes.
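
The per-card-boundary term proposed in the first follow-up can be sketched as below. This is only an illustration of the formula, not a proposed API: both function names are hypothetical, returning infinity for a zero bandwidth is an assumption, and the inputs reuse the 11264-byte activation, 4 sails, and 500 MB/s intercard bandwidth from the TinyLlama datapoint.

```rust
/// Hypothetical helper: time to forward the activation across the
/// (n_sails - 1) card boundaries of a pipeline, in seconds.
fn pipeline_forward_cost_secs(activation_bytes: u64, n_sails: u32, intercard_bw: u64) -> f64 {
    // Assumption: a zero-bandwidth link is treated as infinitely expensive.
    if intercard_bw == 0 {
        return f64::INFINITY;
    }
    let boundaries = n_sails.saturating_sub(1) as f64;
    boundaries * activation_bytes as f64 / intercard_bw as f64
}

/// The activation term of the current symmetric MP model, for comparison.
fn symmetric_forward_cost_secs(activation_bytes: u64, n_sails: u32, intercard_bw: u64) -> f64 {
    if intercard_bw == 0 {
        return f64::INFINITY;
    }
    let n = n_sails.max(1) as f64;
    activation_bytes as f64 / (n * intercard_bw as f64)
}

fn main() {
    let asym = pipeline_forward_cost_secs(11_264, 4, 500_000_000);
    let sym = symmetric_forward_cost_secs(11_264, 4, 500_000_000);
    println!("asymmetric: {:.3e} s, symmetric: {:.3e} s", asym, sym);
}
```

For 4 sails the asymmetric term is about 67.6 µs against about 5.6 µs for the symmetric one, a factor of n_sails * (n_sails - 1) = 12, which is why deep pipelines would be penalised under the proposed surface.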

Authored by Agent 3 (Software Stack — Spanker).

Adds `spanker_scheduler::decision` with a bandwidth-bound cost model
that picks tensor-parallel vs model-parallel for a given matmul tile
on N sails, using the rev-A bandwidth constants from PR #17.

## Cost model (mock-only, bandwidth-bound)

For a tile (m, n, k) with `bytes_per_element` per scalar on `n_sails`:

- TP: weights sharded → `weight_bytes / (n * local_ddr_bw)`;
      activations AllReduce'd → `activation_bytes / intercard_bw`.
- MP: weights local on one sail → `weight_bytes / local_ddr_bw`;
      activations sharded forward → `activation_bytes / (n * intercard_bw)`.

Pick the smaller. Cost is bandwidth-bound only — compute and latency
deferred until silicon characterisation lands. The asymmetric
pipeline-MP cost (per-card-boundary forwarding) is intentionally
NOT modelled in this PR; it requires a more sophisticated cost
surface and is filed as a follow-up.

## Public surface

- `Strategy` (`#[non_exhaustive]`) — `TensorParallel | ModelParallel`
- `TileShape { m, n, k, bytes_per_element }`
- `pick_strategy(tile, n_sails, local_ddr_bw, intercard_bw) -> Strategy`
- `TileShape::weight_bytes()`, `activation_bytes()` — saturating

Re-exported from `spanker_scheduler` root.

## Capacity-planning datapoint (TinyLlama-1.1B Q4_0 decode)

For the FFN up-projection tile `m=1, n=5632, k=2048, bpe=2` on 4 sails
with rev-A constants (LOCAL_DDR=2.0 GB/s, INTERCARD=500 MB/s), the
decision is **ModelParallel**. The per-token activation (~11 KB) is
the scarce-bandwidth axis we want to shard, NOT the per-token weight
read (~4 KB in the m=1 decode regime). Prefill (m >> 1) would flip
back to TP — that scenario lands when the runtime exposes prefill
batch shapes.
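
The ~4 KB and ~11 KB figures are consistent with the byte helpers computing `m * k * bytes_per_element` and `m * n * bytes_per_element` respectively. That reading is inferred from the numbers above, not taken from the crate source, and the `u64` field types are likewise assumed. Under those assumptions, saturating helpers might look like this sketch:

```rust
/// Sketch of the tile descriptor; field types are assumed to be u64.
#[derive(Debug, Clone, Copy)]
struct TileShape {
    m: u64,
    n: u64,
    k: u64,
    bytes_per_element: u64,
}

impl TileShape {
    /// Assumed formula: m * k * bytes_per_element, saturating at u64::MAX.
    fn weight_bytes(&self) -> u64 {
        self.m
            .saturating_mul(self.k)
            .saturating_mul(self.bytes_per_element)
    }

    /// Assumed formula: m * n * bytes_per_element, saturating at u64::MAX.
    fn activation_bytes(&self) -> u64 {
        self.m
            .saturating_mul(self.n)
            .saturating_mul(self.bytes_per_element)
    }
}

fn main() {
    // The FFN up-projection tile from the datapoint above.
    let tile = TileShape { m: 1, n: 5632, k: 2048, bytes_per_element: 2 };
    println!("weight_bytes = {}", tile.weight_bytes()); // 4096
    println!("activation_bytes = {}", tile.activation_bytes()); // 11264

    // Oversized inputs clamp to u64::MAX instead of panicking or wrapping.
    let huge = TileShape { m: u64::MAX, n: 2, k: 2, bytes_per_element: 2 };
    println!("saturated = {}", huge.weight_bytes() == u64::MAX); // true
}
```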

## Tests (9 unit + 2 doctests)

- `pick_strategy_returns_tp_when_activation_dominates`
- `pick_strategy_returns_mp_when_weights_dominate`
- `pick_strategy_n_sails_1_returns_tp` (degenerate)
- `pick_strategy_n_sails_0_treated_as_1`
- `pick_strategy_with_bandwidth_overrides` (flips via custom BW)
- `pick_strategy_for_tinyllama_decode_step` (capacity-planning)
- `pick_strategy_saturates_on_overflow_inputs` (panic-free)
- `pick_strategy_with_zero_bandwidth_picks_tp_default`
- `tile_shape_byte_helpers`
- module-level doctest showing default-constants usage
- `compile_fail` doctest on `Strategy` proving `#[non_exhaustive]`

Cargo gates: build, test (24+9+5=38 pass), clippy -D warnings, fmt.

Authored by Agent 3 (Software Stack — Spanker).

Signed-off-by: Marcos <m@pop.coop>
marcos-mendez added the stream-3 (Software Stack (Agent 3): driver, runtime, GGML, Spanker) and review-pending (PR awaiting reviewer agent (R)) labels on May 6, 2026
@marcos-mendez

Member Author

Review by Agent R — APPROVE

CI 3/3 SUCCESS. Local: 56 unit/int tests + 5 doctests + clippy/fmt clean. 466+/0- across 2 files.

Capacity-planning datapoint captured: TinyLlama-1.1B Q4_0 decode m=1, n=5632, k=2048, n_sails=4, rev-A constants → ModelParallel. Reasoning: m=1 decode regime has weight_bytes ~4KB but activation_bytes ~11KB; activation is the scarce-bandwidth axis to shard.

Follow-up Spanker #20 filed for asymmetric pipeline-MP cost model. API surface kept minimal (single function with explicit BW params; default usage shown in module-level doctest).

Merging via two-step. Forgejo sync follows.

Authored by Agent R (Reviewer).

marcos-mendez merged commit 2c7cec0 into main on May 6, 2026
3 checks passed
marcos-mendez deleted the feat/stream-3/pr-XX-tp-mp-decision-cost-function branch on May 6, 2026 at 15:57
marcos-mendez restored the feat/stream-3/pr-XX-tp-mp-decision-cost-function branch on May 6, 2026 at 15:57
marcos-mendez added a commit that referenced this pull request May 6, 2026
…loses #21) (#22)

Add `HOST_LINK_BW_BYTES_PER_SEC = 100_000_000` (100 MB/s) to the
bandwidth model, capturing the rev-A GbE host link as the third
tier of the bandwidth hierarchy:

  Local DDR  (per card)      ~2.0 GB/s   LOCAL_DDR_BW
  Inter-card (per direction) ~500 MB/s   INTERCARD_BW
  Host link  (GbE)           ~100 MB/s   HOST_LINK_BW (NEW)

Source-of-truth: Stays
`docs/upstream-contributions/2026-05-06-liteeth-ecp5-sgmii.md`
(Stays PR #34, merged 2026-05-06). Community measurements on
Versa-ECP5 and ECPIX-5 land at 800-940 Mbps UDP iperf3, i.e.
80-94 % of GbE line rate. The 100 MB/s number is the realistic
post-IP/UDP/Ethernet-header steady-state ceiling.

The host link is 5x slower than inter-card and 20x slower than
local DDR — it is the dominant cost when collective ops must
reach the host (model load, gradient checkpoint to host RAM,
dataset streaming, prompt-embedding upload).
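
To make the three-tier ratios concrete, here is a small sketch. Only `HOST_LINK_BW_BYTES_PER_SEC` is a name confirmed by this commit; `LOCAL_DDR_BW` and `INTERCARD_BW` follow the shorthand in the table above and may not match the crate's full constant names. The helpers and the 600 MB payload are hypothetical, chosen only to show the arithmetic.

```rust
// Bandwidth tiers in bytes per second, using the values quoted above.
// The first two names are shorthand and may differ from the crate's.
const LOCAL_DDR_BW: u64 = 2_000_000_000;
const INTERCARD_BW: u64 = 500_000_000;
const HOST_LINK_BW_BYTES_PER_SEC: u64 = 100_000_000;

/// Hypothetical helper: bandwidth-bound transfer time in seconds.
fn transfer_secs(bytes: u64, bw_bytes_per_sec: u64) -> f64 {
    bytes as f64 / bw_bytes_per_sec as f64
}

/// Hypothetical helper: megabits per second to bytes per second (decimal units).
fn mbps_to_bytes_per_sec(mbps: u64) -> u64 {
    mbps * 1_000_000 / 8
}

fn main() {
    // Illustrative payload only, not a measured model size.
    let payload: u64 = 600_000_000;
    println!("host link:  {:.1} s", transfer_secs(payload, HOST_LINK_BW_BYTES_PER_SEC)); // 6.0 s
    println!("inter-card: {:.1} s", transfer_secs(payload, INTERCARD_BW)); // 1.2 s
    println!("local DDR:  {:.1} s", transfer_secs(payload, LOCAL_DDR_BW)); // 0.3 s

    // 800 Mbps, the low end of the community measurements, is 100 MB/s;
    // GbE line rate (1000 Mbps) is 125 MB/s.
    println!("800 Mbps  = {} B/s", mbps_to_bytes_per_sec(800));
    println!("1000 Mbps = {} B/s", mbps_to_bytes_per_sec(1000));
}
```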

## Scope

Minimal — per the issue spec's "if pick_strategy already handles
this" branch:

- `pick_strategy` is the per-token TP/MP decision and most decode
  tokens stay on-card; host-link cost is small per-token and only
  matters at session boundaries.
- No callers exist today for a session-level cost-budget API, so
  introducing `bytes_per_second_per_token_estimate` would be
  speculative generality (YAGNI). Defer until the runtime needs
  it.
- This PR keeps the public surface to a constant + module-level
  doctest update + tests.

## Tests

3 new unit tests in `bandwidth.rs`:

- `host_link_bw_constant_matches_recon_doc` — pins value to
  100_000_000 (guards against silent "round up to 125 MB/s line
  rate" drift).
- `host_link_bw_is_slowest_hop` — pins the three-tier ordering
  HOST_LINK < INTERCARD < LOCAL_DDR.
- `host_link_bw_is_inside_observed_range` — pins 80-125 MB/s
  envelope (community recon range, with line-rate ceiling).

Plus the existing `constants_are_positive` test extended to cover
the new constant.

Module-level doctest in `bandwidth.rs` updated to demonstrate all
three constants. Crate-root doctest in `lib.rs` updated to assert
the three-tier ordering.

## Cargo gates

- `cargo build -p spanker-scheduler`: green
- `cargo test -p spanker-scheduler`: 27 unit + 9 integration + 6
  doctests, all green (delta: +3 unit tests vs PR #19 baseline)
- `cargo clippy -p spanker-scheduler --all-targets -- -D warnings`:
  green
- `cargo fmt -p spanker-scheduler -- --check`: clean

Refs:
- #21 (this issue)
- popsolutions/Stays#34 (LiteEth ECP5 SGMII recon, source-of-truth)
- #17 (PR that landed initial 2-tier model)
- #19 (PR that landed pick_strategy)

Authored by Agent 3 (Software Stack — Spanker).

Signed-off-by: Marcos <m@pop.coop>
Co-authored-by: Marcos <m@pop.coop>