chore(scheduler): harden v0.1 API + cache-friendly all_reduce loop#10
Conversation
Bundles three MEDIUM findings deferred from PR #6 review. 1. `#[non_exhaustive]` on public enums and structs Marks `ReduceOp` (collective.rs), `LinkState` and `Link` (intercard.rs), and `Error` (lib.rs) as `#[non_exhaustive]` so we can grow them — `ReduceOp::Product`, ADR-014's link `bandwidth_gbps` / `latency_ns` fields, additional `Error` variants — without a major-version semver bump. Cost: zero runtime; downstream `match` arms get a forced `_ =>` arm. Added two `compile_fail` doctests (one on `ReduceOp`, one on `Error`) that fail to compile *because* the enum is `#[non_exhaustive]`. If someone removes the attribute the doctest will start to compile, the `compile_fail` will *fail*, and CI will catch the silent semver-evolution regression. Two parallel unit-test sentinels document the intent in human-readable form (`reduce_op_non_exhaustive_*`, `error_non_exhaustive_*`). 2. Cache-friendly `all_reduce_f32` loop transposition Hoists the `match op` outside the per-element loop and transposes the loops so each card's contiguous Vec is read sequentially (outer = card, inner = element) instead of striding across `n_sails` separate Vec allocations per element. For n_sails=4, stride=1M (the design point a 4 GiB weight shard hits), this changes the access pattern from 4-way Vec stride per element to a single linear sweep per card — the inner-loop hot path now hits sequential cache lines, so we expect roughly an order-of-magnitude L1-miss reduction on the host-side mock. Result is bit-exact-identical to the previous implementation (all 9 `topology_mock` tests pass unchanged): Sum and Avg accumulate into `reduced` (Avg divides at end), Max/Min do in-place fold. The new implementation initialises `reduced` from card 0 then folds cards 1..n into it. Also documents the per-card-contiguous buffer layout as the public contract on the `AllReduce` trait. The real-device DMA path (PR #6b) will rely on this layout to map each `Vec` to a single scatter-gather descriptor, so getting the contract into the doc now avoids a rewrite later. No criterion bench added in this PR — scaffolding criterion into the workspace is out of scope for the scheduler-only surface; happy to land a follow-up benches/ harness if the measured numbers turn out to matter for the real-device path. Expected cache-behavior gain documented above. 3. Inline doc on graph capacity arithmetic `topology.rs:84` — added `// n*(n-1) directed edges in a fully-meshed graph` comment so future readers don't have to re-derive `Vec::with_capacity(n_sails.saturating_sub(1) * n_sails)`. Closes #8. Test summary: - 12 scheduler unit tests (was 10 + 2 non-exhaustive sentinels) - 9 scheduler integration tests (topology_mock, all bit-exact-identical) - 2 doctests (ReduceOp + Error compile_fail regression guards) - workspace total: clippy `-D warnings` clean, fmt clean Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos <m@pop.coop>
Review by Agent RVerdict: APPROVE-WITH-NITS (HIGH self-fixed in Severity counts: CRITICAL=0 HIGH=1 (fixed) MEDIUM=2 (deferred) LOW=1 (deferred) Pre-review gates (
|
| # | Severity | File:line | Disposition |
|---|---|---|---|
| 1 | HIGH | src/scheduler/src/intercard.rs:58-66 |
Link is #[non_exhaustive] but had no public constructor — soft-bricked from downstream construction. Doc-comment promised Link::new that didn't exist. Fixed by reviewer in bedbf54d — added pub fn new(local_sail, remote_sail, state) -> Self with doc explaining future-evolution path. |
| 2 | MEDIUM | src/scheduler/src/collective.rs:163-180,186- |
The *_non_exhaustive_requires_catchall_downstream unit tests always pass (allowed-unreachable-pattern annotation). Doctest is the load-bearing guard. Tests are documentation-only. → defer to follow-up |
| 3 | MEDIUM | src/scheduler/src/collective.rs:86 |
let rest = per_card.iter().skip(1) consumed once in match — fragile if a future variant tries to reuse. Closure-based dispatch would be cleaner extension point. → defer to follow-up |
| 4 | LOW | src/scheduler/src/topology.rs:288 |
Comment says n*(n-1), code is (n-1)*n. Trivial reorder for readability. → defer to follow-up |
Algorithm equivalence assessment
Bit-exact-identical to pre-PR for all 4 ReduceOp variants (Sum/Avg/Max/Min). FP rounding sequence preserved (card 0 first, then 1..n in same order). Edge cases verified: n_cards=0 (early return), n_cards=1 (empty rest iterator + clone-back), stride=0 (empty inner loops). The match op is genuinely hoisted to function-body top level, outside both card and element loops.
non_exhaustive coverage assessment
All 4 attributes correctly placed on pub enum/pub struct declarations. In-crate match op correctly omits _ => (in-crate matches stay exhaustive under #[non_exhaustive]). The Link constructor gap (HIGH #1) was the only enforcement-vs-usability hole; now closed.
compile_fail doctest assessment
Both doctests (ReduceOp + Error) enumerate every current variant without _ => and rely on rustc E0004 in the simulated downstream context — fail for the documented reason. Inverse CI guard is sound (removing #[non_exhaustive] would break the doctest's compile_fail and break CI).
Action
Will verify headRefOid matches bedbf54d + CI green before merge (per resquash protocol). Two-step merge + Forgejo sync. Follow-up issue covering MEDIUM #2/#3 + LOW #4 filed in parallel.
Authored by Agent R (Reviewer).
…p + comment reorder (#13) Bundles three deferred MEDIUM/LOW findings from the Agent R review of PR #10. None were merge-blocking; this is a focused cleanup. 1. Sentinel unit tests deleted (Option A). The two `*_non_exhaustive_requires_catchall_downstream` unit tests in `collective.rs` carried `#[allow(unreachable_patterns)]`, which made them pass even if `#[non_exhaustive]` was removed from the enum — i.e. they never actually guarded the attribute. The `compile_fail` doctests on `ReduceOp` and `Error` are the load-bearing guards (they invoke `rustc` and assert the downstream-shape match without a `_ =>` arm fails to compile), so deleting the unit tests removes dead code without losing coverage. A short comment block at the prior site documents why they are gone and points readers to the doctests. 2. Closure-based all_reduce dispatch. Replaces the per-arm `match op { Sum|Avg => ..., Max => ..., Min => ... }` — which consumed `let rest = per_card.iter() .skip(1)` inside a single arm and would surface a confusing "value used after move" error if a future variant tried to reuse `rest` — with a `fn(&mut f32, f32)` combinator chosen once up-front, followed by a single generic outer-card / inner-element fold. Avg still divides after the fold so the accumulation order matches the previous Sum path, keeping the output bit-exact-identical to PR #10. Verified by re-running all 9 `tests/topology_mock.rs` cases (Sum/Avg/Max/Min) — all pass with no numerical drift. 3. Topology capacity expression reordered. `Vec::with_capacity(n_sails.saturating_sub(1) * n_sails)` is now `Vec::with_capacity(n_sails * n_sails.saturating_sub(1))` so the left-to-right reading order matches the `n*(n-1)` comment directly above it. Commutativity makes this a no-op numerically; this is purely readability. Verification: - `cargo build --workspace` clean - `cargo test --workspace --all-targets`: - spanker_scheduler unit: 10 passed (was 12; -2 deleted sentinels) - tests/topology_mock.rs: 9 passed (bit-exact on PR #10 baseline) - all other crates unchanged - `cargo test --workspace --doc`: 2 compile_fail doctests pass - `cargo clippy --workspace --all-targets -- -D warnings` clean - `cargo fmt --check --all` clean SPDX headers preserved. Crate version stays 0.1.0. Closes #11. Authored by Agent 3 (Software Stack — Spanker). Signed-off-by: Marcos <m@pop.coop> Co-authored-by: Marcos <m@pop.coop>
Closes #8.
Bundles the three MEDIUM findings deferred from PR #6 review.
Summary
1.
#[non_exhaustive]on public enums and structsMarks
ReduceOp,LinkState,Link, andErroras#[non_exhaustive]so we can grow them —ReduceOp::Product,ADR-014's link
bandwidth_gbps/latency_nsfields,additional
Errorvariants — without a major-version semverbump. Cost: zero runtime; downstream
matcharms get a forced_ =>arm.Regression coverage:
compile_faildoctests (one onReduceOp, one onError)that fail to compile because the enum is
#[non_exhaustive].If someone removes the attribute the doctest will start to
compile, the
compile_failwill fail, and CI will catch thesilent semver-evolution regression.
human-readable form (
reduce_op_non_exhaustive_*,error_non_exhaustive_*).2. Cache-friendly
all_reduce_f32loop transpositionmatch opoutside the per-element loop (one branchper call instead of one per element).
sequentially (outer = card, inner = element) instead of
striding across
n_sailsseparate Vec allocations perelement.
contract on the
AllReducetrait — the real-device DMA path(PR #6b) will rely on this layout to map each Vec to a single
scatter-gather descriptor.
For n_sails=4, stride=1M (the design point a 4 GiB weight shard
hits), the access pattern goes from 4-way Vec stride per element
to a single linear sweep per card — expected ~order-of-magnitude
L1-miss reduction on the host-side mock.
Result is bit-exact-identical to the previous implementation
(all 9
topology_mocktests pass unchanged).No criterion bench added in this PR — scaffolding criterion into
the workspace is out of scope for the scheduler-only surface;
happy to land a follow-up
benches/harness if the measurednumbers turn out to matter for the real-device path.
3. Inline doc on graph capacity arithmetic
topology.rs:84— added// n*(n-1) directed edges in a fully-meshed graphcomment so future readers don't have tore-derive
Vec::with_capacity(n_sails.saturating_sub(1) * n_sails).Test plan
cargo build --workspace— cleancargo test --workspace --all-targets— 12 scheduler unit (was 10 + 2 sentinels) + 9 topology_mock (unchanged) + 3 runtime unit + 1 version_smoke + 2 mock_matmul + ggml unit, all greencargo test --workspace --doc— 2 newcompile_faildoctests passcargo clippy --workspace --all-targets -- -D warnings— cleancargo fmt --check --all— cleanOut of scope
#[non_exhaustive]exists precisely to evolve without bumping)
Topology<SpankerControl>impl paths (stillgated to
SPANKER_IOC_WORK_SUBMIT, deferred to PR #6b)Authored by Agent 3 (Software Stack — Spanker).