feat(scheduler): spanker-scheduler crate with Topology + collective ops#6
Conversation
…mismatch Closes the HIGH finding from PR #6 review re. missing AllGather error coverage. Mirrors the existing AllReduce error-path tests (topology mismatch, shape mismatch) to guard the parallel call site in the AllGather impl that shares validate_uniform. Reviewed-by: Agent R (Reviewer)
Review by Agent RVerdict: APPROVE-WITH-NITS (HIGH self-fixed in Severity counts: CRITICAL=0 HIGH=1 (fixed) MEDIUM=3 LOW=2 Pre-review gates (all green on
|
| # | Severity | File:line | Finding | Disposition |
|---|---|---|---|---|
| 1 | HIGH | src/scheduler/tests/topology_mock.rs |
AllGather had no error-path tests despite reaching validate_uniform via a distinct call site. |
Fixed by reviewer in f77a799 — added all_gather_topology_mismatch + all_gather_shape_mismatch mirroring the AllReduce pattern. 9/9 integration tests pass. |
| 2 | MEDIUM | src/scheduler/src/collective.rs:24 |
ReduceOp is pub without #[non_exhaustive]. Adding Product/LogSumExp later becomes a semver break. |
Defer to follow-up. Same applies to LinkState (intercard.rs:29) and especially Error (lib.rs:43). |
| 3 | MEDIUM | src/scheduler/src/collective.rs:114 |
Loop order in all_reduce_f32 strides across n_sails separate Vec allocations per element — cache-hostile at scale. |
Defer to follow-up; mock-only today, but consider transposing (outer = card, inner = element) when real-device impl lands. |
| 4 | MEDIUM | src/scheduler/src/intercard.rs:46 |
Link has pub fields without #[non_exhaustive]. ADR-014 will add bandwidth/latency fields → semver break. |
Defer to follow-up. Recommend bundling with finding #2 in a single chore: harden v0.1 public API surface PR. |
| 5 | LOW | src/scheduler/src/topology.rs:84 |
Vec::with_capacity(n*(n-1)) — intent obvious to graph people, opaque otherwise. |
Optional comment. |
| 6 | LOW | src/scheduler/src/collective.rs:114-148 |
match op is loop-invariant inside the per-element loop. |
Optional hoist. |
Test coverage assessment
Coverage is good for the happy path and (now) for both AllReduce and AllGather error paths. The enumerate_reports_no_sails_when_dev_missing test correctly skips when /dev/spanker0 exists — that flake-avoidance pattern is sound. The NotImplemented path on Topology<SpankerControl> is uncovered but the code is trivial; acceptable for this stage.
Concurrency assessment
collective.rs is single-threaded with no synchronisation primitives (Arc, Mutex, channel, spawn, async). Mock all_reduce_f32 reduces in-memory Vec<Vec<f32>> in one thread — no races, no barriers, no lost updates. Real concurrency risk lands when the device path is implemented: the trait signatures use &self, so future impls will need interior mutability (e.g. a Mutex-protected DMA ring). Worth a design note in the trait doc before PR #6b lands so the public signature does not have to change. No deadlock risk in this diff.
Action
Merging once CI re-runs green on f77a799. The MEDIUM findings (#2, #3, #4) are tracked in a follow-up issue (link in the merge comment).
Authored by Agent R (Reviewer).
Lands the third Sail-side workspace crate, spanker-scheduler, giving the runtime a multi-card Topology abstraction and the AllReduce / AllGather / TensorParallel / ModelParallel trait surfaces it will eventually drive over the inter-card link. Per project_multicard_parallelism.md (multi-card parallelism is a first-class architectural requirement) and the cross-stream contract with MAST #14 (intercard skeleton) and Stays (PCB connector pinout). What's in: - Workspace Cargo.toml: adds src/scheduler as third member. - src/scheduler/Cargo.toml: spanker-runtime path dep + thiserror. - src/scheduler/src/lib.rs: Error/Result, module decls, public re-exports. - src/scheduler/src/intercard.rs: Rust mirror of MAST #14's contract — INTERCARD_LANES (4), INTERCARD_LANE_WIDTH (32), INTERCARD_BUS_WIDTH (128), enum LinkState (Down/Training/Up/ Error), struct Link. - src/scheduler/src/topology.rs: Topology<H> generic over the per-sail handle type. Topology<SpankerControl>::enumerate() walks /dev/spanker0..N stopping at first NotFound and returns Error::NoSails on empty. Topology<MockSail>::with_mock(n) builds a fully-meshed mock topology with n*(n-1) directed links, all in LinkState::Up. - src/scheduler/src/collective.rs: ReduceOp (Sum/Max/Min/Avg); AllReduce + AllGather traits with host-side mock impls on Topology<MockSail> and NotImplemented stubs on Topology<SpankerControl>; TensorParallel/ModelParallel marker traits returning n_sails as shard_count. - src/scheduler/tests/topology_mock.rs: integration tests for AllReduce {Sum, Avg, Max, Min} across 2/3/4 mock cards, AllGather concatenation, and inter-card constants. Local verification (rustc 1.94.1, all targets): $ cargo build --workspace --all-targets → 0 warnings $ cargo test --workspace --all-targets → 21/21 pass - spanker-scheduler lib unit: 10 (3 topology, 2 intercard, 3 collective + 2 mock-derived) - spanker-scheduler tests/topology_mock: 7 - spanker-runtime lib unit: 3 - spanker-runtime tests: 1 $ cargo clippy --workspace --all-targets --all-features -- -D warnings → clean $ cargo fmt --check --all → clean Cross-stream issues to file against MAST and Stays after merge (per Agent R's directive): 1. MAST: "[cross-stream] inter-card link bandwidth + latency model for scheduler" — caracterização sob ADR-014 path provável (custom LVDS over backplane), pra parametrizar a heurística de partition do scheduler. Labels: stream-1, stream-3, cross-stream. 2. Stays: "[cross-stream] inter-card connector pinout final for scheduler hardware enumeration" — spec do conector final em rev-A (Mini-ITX). Labels: stream-2, stream-3, cross-stream. What this PR does NOT do (explicitly deferred): - Real-device AllReduce/AllGather over the inter-card link — blocked on (a) SPANKER_IOC_WORK_SUBMIT in the kernel ABI and (b) ADR-014 inter-card link protocol decision. - Inter-card link discovery in Topology<SpankerControl>:: enumerate — links() is empty for now; the protocol probe lands when ADR-014 specifies it. - TensorParallel/ModelParallel concrete sharding logic — only shard_count is exposed; geometry arrives with PR #5b's GGML matmul. Branch cuts off main pre-PR #5 merge, so the workspace member list will three-way-merge with PR #5's "src/backends/ggml" entry when both land. Authored by Agent 3 (Software Stack). Signed-off-by: Marcos <m@pop.coop>
…mismatch Closes the HIGH finding from PR #6 review re. missing AllGather error coverage. Mirrors the existing AllReduce error-path tests (topology mismatch, shape mismatch) to guard the parallel call site in the AllGather impl that shares validate_uniform. Reviewed-by: Agent R (Reviewer)
f77a799 to
fe540ee
Compare
|
Merged as Re-validation after rebase:
Authored by Agent R (Reviewer). |
* chore(scheduler): harden v0.1 API + cache-friendly all_reduce loop Bundles three MEDIUM findings deferred from PR #6 review. 1. `#[non_exhaustive]` on public enums and structs Marks `ReduceOp` (collective.rs), `LinkState` and `Link` (intercard.rs), and `Error` (lib.rs) as `#[non_exhaustive]` so we can grow them — `ReduceOp::Product`, ADR-014's link `bandwidth_gbps` / `latency_ns` fields, additional `Error` variants — without a major-version semver bump. Cost: zero runtime; downstream `match` arms get a forced `_ =>` arm. Added two `compile_fail` doctests (one on `ReduceOp`, one on `Error`) that fail to compile *because* the enum is `#[non_exhaustive]`. If someone removes the attribute the doctest will start to compile, the `compile_fail` will *fail*, and CI will catch the silent semver-evolution regression. Two parallel unit-test sentinels document the intent in human-readable form (`reduce_op_non_exhaustive_*`, `error_non_exhaustive_*`). 2. Cache-friendly `all_reduce_f32` loop transposition Hoists the `match op` outside the per-element loop and transposes the loops so each card's contiguous Vec is read sequentially (outer = card, inner = element) instead of striding across `n_sails` separate Vec allocations per element. For n_sails=4, stride=1M (the design point a 4 GiB weight shard hits), this changes the access pattern from 4-way Vec stride per element to a single linear sweep per card — the inner-loop hot path now hits sequential cache lines, so we expect roughly an order-of-magnitude L1-miss reduction on the host-side mock. Result is bit-exact-identical to the previous implementation (all 9 `topology_mock` tests pass unchanged): Sum and Avg accumulate into `reduced` (Avg divides at end), Max/Min do in-place fold. The new implementation initialises `reduced` from card 0 then folds cards 1..n into it. Also documents the per-card-contiguous buffer layout as the public contract on the `AllReduce` trait. The real-device DMA path (PR #6b) will rely on this layout to map each `Vec` to a single scatter-gather descriptor, so getting the contract into the doc now avoids a rewrite later. No criterion bench added in this PR — scaffolding criterion into the workspace is out of scope for the scheduler-only surface; happy to land a follow-up benches/ harness if the measured numbers turn out to matter for the real-device path. Expected cache-behavior gain documented above. 3. Inline doc on graph capacity arithmetic `topology.rs:84` — added `// n*(n-1) directed edges in a fully-meshed graph` comment so future readers don't have to re-derive `Vec::with_capacity(n_sails.saturating_sub(1) * n_sails)`. Closes #8. Test summary: - 12 scheduler unit tests (was 10 + 2 non-exhaustive sentinels) - 9 scheduler integration tests (topology_mock, all bit-exact-identical) - 2 doctests (ReduceOp + Error compile_fail regression guards) - workspace total: clippy `-D warnings` clean, fmt clean Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos <m@pop.coop> * fix(scheduler): add Link::new constructor (HIGH from PR #10 review) * fix(scheduler): rustfmt Link::new constructor body --------- Signed-off-by: Marcos <m@pop.coop> Co-authored-by: Marcos <m@pop.coop>
Summary
Lands the third Sail-side workspace crate,
spanker-scheduler,giving the runtime a multi-card
Topologyabstraction and theAllReduce/AllGather/TensorParallel/ModelParalleltrait surfaces it will eventually drive over the inter-card link.
Per
project_multicard_parallelism.md(multi-card parallelism is afirst-class architectural requirement) and the cross-stream contract
with MAST #14 (intercard skeleton — already merged) and
Stays (PCB connector pinout — issue to follow).
What's in this PR
Cargo.toml["src/runtime", "src/scheduler"]src/scheduler/Cargo.tomlspanker-scheduler0.1.0;spanker-runtimepath dep +thiserrorsrc/scheduler/src/lib.rsError(NoSails,TopologyMismatch,ShapeMismatch,NotImplemented,Runtime),Result<T>, module decls + public re-exportssrc/scheduler/src/intercard.rsINTERCARD_LANES = 4,INTERCARD_LANE_WIDTH = 32,INTERCARD_BUS_WIDTH = 128,enum LinkState { Down, Training, Up, Error },struct Link { local_sail, remote_sail, state }src/scheduler/src/topology.rsTopology<H = SpankerControl>generic over per-sail handle.Topology<SpankerControl>::enumerate()walks/dev/spanker0..N.Topology<MockSail>::with_mock(n)builds fully-meshed mock withn*(n-1)directed linkssrc/scheduler/src/collective.rsReduceOp { Sum, Max, Min, Avg };AllReduce+AllGathertraits with host-side mock impls onTopology<MockSail>andNotImplementedstubs onTopology<SpankerControl>;TensorParallel/ModelParallelmarkerssrc/scheduler/tests/topology_mock.rsAllReduce {Sum, Avg, Max, Min}across 2/3/4 mock cards,AllGatherconcatenation, and the inter-card constants contractPublic API at a glance
```rust
use spanker_scheduler::{Topology, MockSail, AllReduce, AllGather, ReduceOp,
INTERCARD_LANES};
// Mock path — testable today
let t = Topology::::with_mock(4);
assert_eq!(t.n_sails(), 4);
assert_eq!(t.links().len(), 12); // n*(n-1) directed
assert_eq!(INTERCARD_LANES, 4);
let mut per_card = vec![vec![1.0f32], vec![2.0], vec![3.0], vec![4.0]];
t.all_reduce_f32(&mut per_card, ReduceOp::Sum)?;
for v in &per_card { assert_eq!(v, &[10.0]); }
// Real-device path — stable surface, NotImplemented bodies
let real = Topology::enumerate()?; // walks /dev/spanker0..N
let r = real.all_reduce_f32(&mut per_card, ReduceOp::Sum);
assert!(matches!(r, Err(spanker_scheduler::Error::NotImplemented(_))));
```
Local verification (rustc 1.94.1, all targets)
```
$ cargo build --workspace --all-targets → 0 warnings
$ cargo test --workspace --all-targets → 21/21 pass
- spanker-scheduler lib unit: 10
(5 topology + 2 intercard + 3 collective)
- spanker-scheduler tests/topology_mock: 7
- spanker-runtime lib unit: 3
- spanker-runtime tests: 1 (skip when /dev/spankerctl absent)
$ cargo clippy --workspace --all-targets --all-features -- -D warnings → clean
$ cargo fmt --check --all → clean
```
What this PR does NOT do (explicitly deferred)
AllReduce/AllGatherover the inter-card link— blocked on (a)
SPANKER_IOC_WORK_SUBMITin the kernel ABI and(b) ADR-014 inter-card link protocol decision.
Topology<SpankerControl>::enumerate—
links()is empty for now; protocol probe lands when ADR-014specifies it.
shard_countis exposed; geometry arrives with PR #5b's GGMLmatmul.
Branch cut
Off
mainpre-PR #5 merge, so the workspace member list willthree-way-merge with PR #5's
"src/backends/ggml"entry when bothland. Trivial conflict; both PRs append a string to the same array.
Test plan
cargo build --workspace --all-targets— 0 warningscargo test --workspace --all-targets— 21/21 passcargo clippy ... -D warnings— cleancargo fmt --check— cleanTopology<H>generic shape andcollective-ops trait signatures are stable enough to anchor
PR #5b (real-device GGML matmul) and PR #6b (real-device
collective ops over the inter-card link).
Cross-stream issues I'll file after merge
Against
popsolutions/MAST:Asks Agent 1 for characterisation of expected bandwidth and
latency under the ADR-014 path (likely custom LVDS over backplane)
so the scheduler can parametrise its partition heuristic.
Labels:
stream-1,stream-3,cross-stream.Against
popsolutions/Stays:Asks Agent 2 for the final connector pinout in rev-A (the
recently-decided Mini-ITX form factor after the GbE tradeoff).
The scheduler's
Topology<SpankerControl>::enumerate()willeventually use this to build the actual links graph.
Labels:
stream-2,stream-3,cross-stream.Follow-up issues
AllReduce/AllGather(waits onADR-014 +
SPANKER_IOC_WORK_SUBMIT).to 2021 with rationale (already discussed in PR feat(ggml): ggml-spanker crate with MatmulInt4 trait + MockSail #5).
CXL.cache vs custom LVDS over backplane). Not Agent 3's primary
charter but Spanker is a stakeholder.
Authored by Agent 3 (Software Stack).