chore(scheduler): harden v0.1 API + cache-friendly all_reduce loop by marcos-mendez · Pull Request #10 · popsolutions/Spanker

marcos-mendez · 2026-05-06T04:51:00Z

Closes #8.

Bundles the three MEDIUM findings deferred from PR #6 review.

Summary

1. `#[non_exhaustive]` on public enums and structs

Marks ReduceOp, LinkState, Link, and Error as
#[non_exhaustive] so we can grow them — ReduceOp::Product,
ADR-014's link bandwidth_gbps / latency_ns fields,
additional Error variants — without a major-version semver
bump. Cost: zero runtime; downstream match arms get a forced
_ => arm.

Regression coverage:

2 compile_fail doctests (one on ReduceOp, one on Error)
that fail to compile because the enum is #[non_exhaustive].
If someone removes the attribute the doctest will start to
compile, the compile_fail will fail, and CI will catch the
silent semver-evolution regression.
2 unit-test sentinels that document the intent in
human-readable form (reduce_op_non_exhaustive_*,
error_non_exhaustive_*).

2. Cache-friendly `all_reduce_f32` loop transposition

Hoists match op outside the per-element loop (one branch
per call instead of one per element).
Transposes the loops so each card's contiguous Vec is read
sequentially (outer = card, inner = element) instead of
striding across n_sails separate Vec allocations per
element.
Documents the per-card-contiguous buffer layout as the public
contract on the AllReduce trait — the real-device DMA path
(PR #6b) will rely on this layout to map each Vec to a single
scatter-gather descriptor.

For n_sails=4, stride=1M (the design point a 4 GiB weight shard
hits), the access pattern goes from 4-way Vec stride per element
to a single linear sweep per card — expected ~order-of-magnitude
L1-miss reduction on the host-side mock.

Result is bit-exact-identical to the previous implementation
(all 9 topology_mock tests pass unchanged).

No criterion bench added in this PR — scaffolding criterion into
the workspace is out of scope for the scheduler-only surface;
happy to land a follow-up benches/ harness if the measured
numbers turn out to matter for the real-device path.

3. Inline doc on graph capacity arithmetic

topology.rs:84 — added // n*(n-1) directed edges in a fully-meshed graph comment so future readers don't have to
re-derive Vec::with_capacity(n_sails.saturating_sub(1) * n_sails).

Test plan

cargo build --workspace — clean
cargo test --workspace --all-targets — 12 scheduler unit (was 10 + 2 sentinels) + 9 topology_mock (unchanged) + 3 runtime unit + 1 version_smoke + 2 mock_matmul + ggml unit, all green
cargo test --workspace --doc — 2 new compile_fail doctests pass
cargo clippy --workspace --all-targets -- -D warnings — clean
cargo fmt --check --all — clean

Out of scope

Bumping crate version (kept at 0.1.0 — #[non_exhaustive]
exists precisely to evolve without bumping)
Real-device Topology<SpankerControl> impl paths (still
gated to SPANKER_IOC_WORK_SUBMIT, deferred to PR #6b)
Modifying any non-scheduler crate

Authored by Agent 3 (Software Stack — Spanker).

Bundles three MEDIUM findings deferred from PR #6 review. 1. `#[non_exhaustive]` on public enums and structs Marks `ReduceOp` (collective.rs), `LinkState` and `Link` (intercard.rs), and `Error` (lib.rs) as `#[non_exhaustive]` so we can grow them — `ReduceOp::Product`, ADR-014's link `bandwidth_gbps` / `latency_ns` fields, additional `Error` variants — without a major-version semver bump. Cost: zero runtime; downstream `match` arms get a forced `_ =>` arm. Added two `compile_fail` doctests (one on `ReduceOp`, one on `Error`) that fail to compile *because* the enum is `#[non_exhaustive]`. If someone removes the attribute the doctest will start to compile, the `compile_fail` will *fail*, and CI will catch the silent semver-evolution regression. Two parallel unit-test sentinels document the intent in human-readable form (`reduce_op_non_exhaustive_*`, `error_non_exhaustive_*`). 2. Cache-friendly `all_reduce_f32` loop transposition Hoists the `match op` outside the per-element loop and transposes the loops so each card's contiguous Vec is read sequentially (outer = card, inner = element) instead of striding across `n_sails` separate Vec allocations per element. For n_sails=4, stride=1M (the design point a 4 GiB weight shard hits), this changes the access pattern from 4-way Vec stride per element to a single linear sweep per card — the inner-loop hot path now hits sequential cache lines, so we expect roughly an order-of-magnitude L1-miss reduction on the host-side mock. Result is bit-exact-identical to the previous implementation (all 9 `topology_mock` tests pass unchanged): Sum and Avg accumulate into `reduced` (Avg divides at end), Max/Min do in-place fold. The new implementation initialises `reduced` from card 0 then folds cards 1..n into it. Also documents the per-card-contiguous buffer layout as the public contract on the `AllReduce` trait. The real-device DMA path (PR #6b) will rely on this layout to map each `Vec` to a single scatter-gather descriptor, so getting the contract into the doc now avoids a rewrite later. No criterion bench added in this PR — scaffolding criterion into the workspace is out of scope for the scheduler-only surface; happy to land a follow-up benches/ harness if the measured numbers turn out to matter for the real-device path. Expected cache-behavior gain documented above. 3. Inline doc on graph capacity arithmetic `topology.rs:84` — added `// n*(n-1) directed edges in a fully-meshed graph` comment so future readers don't have to re-derive `Vec::with_capacity(n_sails.saturating_sub(1) * n_sails)`. Closes #8. Test summary: - 12 scheduler unit tests (was 10 + 2 non-exhaustive sentinels) - 9 scheduler integration tests (topology_mock, all bit-exact-identical) - 2 doctests (ReduceOp + Error compile_fail regression guards) - workspace total: clippy `-D warnings` clean, fmt clean Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos <m@pop.coop>

marcos-mendez · 2026-05-06T04:56:10Z

Review by Agent R

Verdict: APPROVE-WITH-NITS (HIGH self-fixed in bedbf54d, MEDIUM/LOW deferred to follow-up)

Severity counts: CRITICAL=0 HIGH=1 (fixed) MEDIUM=2 (deferred) LOW=1 (deferred)

Pre-review gates (`1fcdfe53` then `bedbf54d`)

CI: 3/3 SUCCESS on 1fcdfe53 (Docs/SPDX, Driver/kbuild, Runtime/cargo)
Mergeable: yes (verified pre-fix)
DCO sign-off present
Local cargo verification on 1fcdfe53: test --workspace --all-targets → 9 topology_mock + ... = all green; test --workspace --doc → 2 compile_fail doctests pass for the right reason; clippy -- -D warnings clean; fmt --check clean
Will re-run gates on the post-fix bedbf54d before merge

Findings

#	Severity	File:line	Disposition
1	HIGH	`src/scheduler/src/intercard.rs:58-66`	`Link` is `#[non_exhaustive]` but had no public constructor — soft-bricked from downstream construction. Doc-comment promised `Link::new` that didn't exist. Fixed by reviewer in `bedbf54d` — added `pub fn new(local_sail, remote_sail, state) -> Self` with doc explaining future-evolution path.
2	MEDIUM	`src/scheduler/src/collective.rs:163-180,186-`	The `*_non_exhaustive_requires_catchall_downstream` unit tests always pass (allowed-unreachable-pattern annotation). Doctest is the load-bearing guard. Tests are documentation-only. → defer to follow-up
3	MEDIUM	`src/scheduler/src/collective.rs:86`	`let rest = per_card.iter().skip(1)` consumed once in match — fragile if a future variant tries to reuse. Closure-based dispatch would be cleaner extension point. → defer to follow-up
4	LOW	`src/scheduler/src/topology.rs:288`	Comment says `n(n-1)`, code is `(n-1)n`. Trivial reorder for readability. → defer to follow-up

Algorithm equivalence assessment

Bit-exact-identical to pre-PR for all 4 ReduceOp variants (Sum/Avg/Max/Min). FP rounding sequence preserved (card 0 first, then 1..n in same order). Edge cases verified: n_cards=0 (early return), n_cards=1 (empty rest iterator + clone-back), stride=0 (empty inner loops). The match op is genuinely hoisted to function-body top level, outside both card and element loops.

non_exhaustive coverage assessment

All 4 attributes correctly placed on pub enum/pub struct declarations. In-crate match op correctly omits _ => (in-crate matches stay exhaustive under #[non_exhaustive]). The Link constructor gap (HIGH #1) was the only enforcement-vs-usability hole; now closed.

compile_fail doctest assessment

Both doctests (ReduceOp + Error) enumerate every current variant without _ => and rely on rustc E0004 in the simulated downstream context — fail for the documented reason. Inverse CI guard is sound (removing #[non_exhaustive] would break the doctest's compile_fail and break CI).

Action

Will verify headRefOid matches bedbf54d + CI green before merge (per resquash protocol). Two-step merge + Forgejo sync. Follow-up issue covering MEDIUM #2/#3 + LOW #4 filed in parallel.

Authored by Agent R (Reviewer).

…p + comment reorder (#13) Bundles three deferred MEDIUM/LOW findings from the Agent R review of PR #10. None were merge-blocking; this is a focused cleanup. 1. Sentinel unit tests deleted (Option A). The two `*_non_exhaustive_requires_catchall_downstream` unit tests in `collective.rs` carried `#[allow(unreachable_patterns)]`, which made them pass even if `#[non_exhaustive]` was removed from the enum — i.e. they never actually guarded the attribute. The `compile_fail` doctests on `ReduceOp` and `Error` are the load-bearing guards (they invoke `rustc` and assert the downstream-shape match without a `_ =>` arm fails to compile), so deleting the unit tests removes dead code without losing coverage. A short comment block at the prior site documents why they are gone and points readers to the doctests. 2. Closure-based all_reduce dispatch. Replaces the per-arm `match op { Sum|Avg => ..., Max => ..., Min => ... }` — which consumed `let rest = per_card.iter() .skip(1)` inside a single arm and would surface a confusing "value used after move" error if a future variant tried to reuse `rest` — with a `fn(&mut f32, f32)` combinator chosen once up-front, followed by a single generic outer-card / inner-element fold. Avg still divides after the fold so the accumulation order matches the previous Sum path, keeping the output bit-exact-identical to PR #10. Verified by re-running all 9 `tests/topology_mock.rs` cases (Sum/Avg/Max/Min) — all pass with no numerical drift. 3. Topology capacity expression reordered. `Vec::with_capacity(n_sails.saturating_sub(1) * n_sails)` is now `Vec::with_capacity(n_sails * n_sails.saturating_sub(1))` so the left-to-right reading order matches the `n*(n-1)` comment directly above it. Commutativity makes this a no-op numerically; this is purely readability. Verification: - `cargo build --workspace` clean - `cargo test --workspace --all-targets`: - spanker_scheduler unit: 10 passed (was 12; -2 deleted sentinels) - tests/topology_mock.rs: 9 passed (bit-exact on PR #10 baseline) - all other crates unchanged - `cargo test --workspace --doc`: 2 compile_fail doctests pass - `cargo clippy --workspace --all-targets -- -D warnings` clean - `cargo fmt --check --all` clean SPDX headers preserved. Crate version stays 0.1.0. Closes #11. Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos <m@pop.coop> Co-authored-by: Marcos <m@pop.coop>

marcos-mendez added stream-3 Software Stack (Agent 3) — driver, runtime, GGML, Spanker review-pending PR awaiting reviewer agent (R) labels May 6, 2026

fix(scheduler): add Link::new constructor (HIGH from PR #10 review)

bedbf54

marcos-mendez mentioned this pull request May 6, 2026

chore(scheduler): doc-only sentinel tests + closure-based all_reduce dispatch + comment ordering #11

Closed

fix(scheduler): rustfmt Link::new constructor body

547e403

marcos-mendez merged commit 3fb796c into main May 6, 2026
3 checks passed

marcos-mendez deleted the feat/stream-3/pr-8-api-hardening-and-cache-friendly-all-reduce branch May 6, 2026 04:57

marcos-mendez mentioned this pull request May 6, 2026

chore(scheduler): closure-based all_reduce dispatch + sentinel cleanup + comment reorder #13

Merged

8 tasks

marcos-mendez mentioned this pull request May 6, 2026

feat(scheduler): bandwidth model constants — realistic ECP5 + 8b/10b ceilings (closes #14) #17

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(scheduler): harden v0.1 API + cache-friendly all_reduce loop#10

chore(scheduler): harden v0.1 API + cache-friendly all_reduce loop#10
marcos-mendez merged 3 commits into
mainfrom
feat/stream-3/pr-8-api-hardening-and-cache-friendly-all-reduce

marcos-mendez commented May 6, 2026

Uh oh!

marcos-mendez commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marcos-mendez commented May 6, 2026

Summary

1. #[non_exhaustive] on public enums and structs

2. Cache-friendly all_reduce_f32 loop transposition

3. Inline doc on graph capacity arithmetic

Test plan

Out of scope

Uh oh!

marcos-mendez commented May 6, 2026

Review by Agent R

Pre-review gates (1fcdfe53 then bedbf54d)

Findings

Algorithm equivalence assessment

non_exhaustive coverage assessment

compile_fail doctest assessment

Action

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. `#[non_exhaustive]` on public enums and structs

2. Cache-friendly `all_reduce_f32` loop transposition

Pre-review gates (`1fcdfe53` then `bedbf54d`)