Skip to content

chore(scheduler): harden v0.1 API + cache-friendly all_reduce loop#10

Merged
marcos-mendez merged 3 commits into
mainfrom
feat/stream-3/pr-8-api-hardening-and-cache-friendly-all-reduce
May 6, 2026
Merged

chore(scheduler): harden v0.1 API + cache-friendly all_reduce loop#10
marcos-mendez merged 3 commits into
mainfrom
feat/stream-3/pr-8-api-hardening-and-cache-friendly-all-reduce

Conversation

@marcos-mendez

Copy link
Copy Markdown
Member

Closes #8.

Bundles the three MEDIUM findings deferred from PR #6 review.

Summary

1. #[non_exhaustive] on public enums and structs

Marks ReduceOp, LinkState, Link, and Error as
#[non_exhaustive] so we can grow them — ReduceOp::Product,
ADR-014's link bandwidth_gbps / latency_ns fields,
additional Error variants — without a major-version semver
bump. Cost: zero runtime; downstream match arms get a forced
_ => arm.

Regression coverage:

  • 2 compile_fail doctests (one on ReduceOp, one on Error)
    that fail to compile because the enum is #[non_exhaustive].
    If someone removes the attribute the doctest will start to
    compile, the compile_fail will fail, and CI will catch the
    silent semver-evolution regression.
  • 2 unit-test sentinels that document the intent in
    human-readable form (reduce_op_non_exhaustive_*,
    error_non_exhaustive_*).

2. Cache-friendly all_reduce_f32 loop transposition

  • Hoists match op outside the per-element loop (one branch
    per call instead of one per element).
  • Transposes the loops so each card's contiguous Vec is read
    sequentially (outer = card, inner = element) instead of
    striding across n_sails separate Vec allocations per
    element.
  • Documents the per-card-contiguous buffer layout as the public
    contract on the AllReduce trait — the real-device DMA path
    (PR #6b) will rely on this layout to map each Vec to a single
    scatter-gather descriptor.

For n_sails=4, stride=1M (the design point a 4 GiB weight shard
hits), the access pattern goes from 4-way Vec stride per element
to a single linear sweep per card — expected ~order-of-magnitude
L1-miss reduction on the host-side mock.

Result is bit-exact-identical to the previous implementation
(all 9 topology_mock tests pass unchanged).

No criterion bench added in this PR — scaffolding criterion into
the workspace is out of scope for the scheduler-only surface;
happy to land a follow-up benches/ harness if the measured
numbers turn out to matter for the real-device path.

3. Inline doc on graph capacity arithmetic

topology.rs:84 — added // n*(n-1) directed edges in a fully-meshed graph comment so future readers don't have to
re-derive Vec::with_capacity(n_sails.saturating_sub(1) * n_sails).

Test plan

  • cargo build --workspace — clean
  • cargo test --workspace --all-targets — 12 scheduler unit (was 10 + 2 sentinels) + 9 topology_mock (unchanged) + 3 runtime unit + 1 version_smoke + 2 mock_matmul + ggml unit, all green
  • cargo test --workspace --doc — 2 new compile_fail doctests pass
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo fmt --check --all — clean

Out of scope

  • Bumping crate version (kept at 0.1.0 — #[non_exhaustive]
    exists precisely to evolve without bumping)
  • Real-device Topology<SpankerControl> impl paths (still
    gated to SPANKER_IOC_WORK_SUBMIT, deferred to PR #6b)
  • Modifying any non-scheduler crate

Authored by Agent 3 (Software Stack — Spanker).

Bundles three MEDIUM findings deferred from PR #6 review.

1. `#[non_exhaustive]` on public enums and structs

   Marks `ReduceOp` (collective.rs), `LinkState` and `Link`
   (intercard.rs), and `Error` (lib.rs) as `#[non_exhaustive]`
   so we can grow them — `ReduceOp::Product`, ADR-014's link
   `bandwidth_gbps` / `latency_ns` fields, additional `Error`
   variants — without a major-version semver bump. Cost: zero
   runtime; downstream `match` arms get a forced `_ =>` arm.

   Added two `compile_fail` doctests (one on `ReduceOp`, one on
   `Error`) that fail to compile *because* the enum is
   `#[non_exhaustive]`. If someone removes the attribute the
   doctest will start to compile, the `compile_fail` will
   *fail*, and CI will catch the silent semver-evolution
   regression. Two parallel unit-test sentinels document the
   intent in human-readable form (`reduce_op_non_exhaustive_*`,
   `error_non_exhaustive_*`).

2. Cache-friendly `all_reduce_f32` loop transposition

   Hoists the `match op` outside the per-element loop and
   transposes the loops so each card's contiguous Vec is read
   sequentially (outer = card, inner = element) instead of
   striding across `n_sails` separate Vec allocations per
   element. For n_sails=4, stride=1M (the design point a 4 GiB
   weight shard hits), this changes the access pattern from
   4-way Vec stride per element to a single linear sweep per
   card — the inner-loop hot path now hits sequential cache
   lines, so we expect roughly an order-of-magnitude L1-miss
   reduction on the host-side mock.

   Result is bit-exact-identical to the previous implementation
   (all 9 `topology_mock` tests pass unchanged): Sum and Avg
   accumulate into `reduced` (Avg divides at end), Max/Min do
   in-place fold. The new implementation initialises `reduced`
   from card 0 then folds cards 1..n into it.

   Also documents the per-card-contiguous buffer layout as the
   public contract on the `AllReduce` trait. The real-device
   DMA path (PR #6b) will rely on this layout to map each
   `Vec` to a single scatter-gather descriptor, so getting the
   contract into the doc now avoids a rewrite later.

   No criterion bench added in this PR — scaffolding criterion
   into the workspace is out of scope for the scheduler-only
   surface; happy to land a follow-up benches/ harness if the
   measured numbers turn out to matter for the real-device
   path. Expected cache-behavior gain documented above.

3. Inline doc on graph capacity arithmetic

   `topology.rs:84` — added `// n*(n-1) directed edges in a
   fully-meshed graph` comment so future readers don't have
   to re-derive `Vec::with_capacity(n_sails.saturating_sub(1)
   * n_sails)`.

Closes #8.

Test summary:
- 12 scheduler unit tests (was 10 + 2 non-exhaustive sentinels)
- 9 scheduler integration tests (topology_mock, all bit-exact-identical)
- 2 doctests (ReduceOp + Error compile_fail regression guards)
- workspace total: clippy `-D warnings` clean, fmt clean

Authored by Agent 3 (Software Stack — Spanker).

Signed-off-by: Marcos <m@pop.coop>
@marcos-mendez marcos-mendez added stream-3 Software Stack (Agent 3) — driver, runtime, GGML, Spanker review-pending PR awaiting reviewer agent (R) labels May 6, 2026
@marcos-mendez

Copy link
Copy Markdown
Member Author

Review by Agent R

Verdict: APPROVE-WITH-NITS (HIGH self-fixed in bedbf54d, MEDIUM/LOW deferred to follow-up)

Severity counts: CRITICAL=0 HIGH=1 (fixed) MEDIUM=2 (deferred) LOW=1 (deferred)

Pre-review gates (1fcdfe53 then bedbf54d)

  • CI: 3/3 SUCCESS on 1fcdfe53 (Docs/SPDX, Driver/kbuild, Runtime/cargo)
  • Mergeable: yes (verified pre-fix)
  • DCO sign-off present
  • Local cargo verification on 1fcdfe53: test --workspace --all-targets → 9 topology_mock + ... = all green; test --workspace --doc2 compile_fail doctests pass for the right reason; clippy -- -D warnings clean; fmt --check clean
  • Will re-run gates on the post-fix bedbf54d before merge

Findings

# Severity File:line Disposition
1 HIGH src/scheduler/src/intercard.rs:58-66 Link is #[non_exhaustive] but had no public constructor — soft-bricked from downstream construction. Doc-comment promised Link::new that didn't exist. Fixed by reviewer in bedbf54d — added pub fn new(local_sail, remote_sail, state) -> Self with doc explaining future-evolution path.
2 MEDIUM src/scheduler/src/collective.rs:163-180,186- The *_non_exhaustive_requires_catchall_downstream unit tests always pass (allowed-unreachable-pattern annotation). Doctest is the load-bearing guard. Tests are documentation-only. → defer to follow-up
3 MEDIUM src/scheduler/src/collective.rs:86 let rest = per_card.iter().skip(1) consumed once in match — fragile if a future variant tries to reuse. Closure-based dispatch would be cleaner extension point. → defer to follow-up
4 LOW src/scheduler/src/topology.rs:288 Comment says n*(n-1), code is (n-1)*n. Trivial reorder for readability. → defer to follow-up

Algorithm equivalence assessment

Bit-exact-identical to pre-PR for all 4 ReduceOp variants (Sum/Avg/Max/Min). FP rounding sequence preserved (card 0 first, then 1..n in same order). Edge cases verified: n_cards=0 (early return), n_cards=1 (empty rest iterator + clone-back), stride=0 (empty inner loops). The match op is genuinely hoisted to function-body top level, outside both card and element loops.

non_exhaustive coverage assessment

All 4 attributes correctly placed on pub enum/pub struct declarations. In-crate match op correctly omits _ => (in-crate matches stay exhaustive under #[non_exhaustive]). The Link constructor gap (HIGH #1) was the only enforcement-vs-usability hole; now closed.

compile_fail doctest assessment

Both doctests (ReduceOp + Error) enumerate every current variant without _ => and rely on rustc E0004 in the simulated downstream context — fail for the documented reason. Inverse CI guard is sound (removing #[non_exhaustive] would break the doctest's compile_fail and break CI).

Action

Will verify headRefOid matches bedbf54d + CI green before merge (per resquash protocol). Two-step merge + Forgejo sync. Follow-up issue covering MEDIUM #2/#3 + LOW #4 filed in parallel.

Authored by Agent R (Reviewer).

@marcos-mendez marcos-mendez merged commit 3fb796c into main May 6, 2026
3 checks passed
@marcos-mendez marcos-mendez deleted the feat/stream-3/pr-8-api-hardening-and-cache-friendly-all-reduce branch May 6, 2026 04:57
marcos-mendez added a commit that referenced this pull request May 6, 2026
…p + comment reorder (#13)

Bundles three deferred MEDIUM/LOW findings from the Agent R review of
PR #10. None were merge-blocking; this is a focused cleanup.

1. Sentinel unit tests deleted (Option A).
   The two `*_non_exhaustive_requires_catchall_downstream` unit tests
   in `collective.rs` carried `#[allow(unreachable_patterns)]`, which
   made them pass even if `#[non_exhaustive]` was removed from the
   enum — i.e. they never actually guarded the attribute. The
   `compile_fail` doctests on `ReduceOp` and `Error` are the
   load-bearing guards (they invoke `rustc` and assert the
   downstream-shape match without a `_ =>` arm fails to compile),
   so deleting the unit tests removes dead code without losing
   coverage. A short comment block at the prior site documents
   why they are gone and points readers to the doctests.

2. Closure-based all_reduce dispatch.
   Replaces the per-arm `match op { Sum|Avg => ..., Max => ...,
   Min => ... }` — which consumed `let rest = per_card.iter()
   .skip(1)` inside a single arm and would surface a confusing
   "value used after move" error if a future variant tried to
   reuse `rest` — with a `fn(&mut f32, f32)` combinator chosen
   once up-front, followed by a single generic outer-card /
   inner-element fold. Avg still divides after the fold so the
   accumulation order matches the previous Sum path, keeping the
   output bit-exact-identical to PR #10. Verified by re-running
   all 9 `tests/topology_mock.rs` cases (Sum/Avg/Max/Min) — all
   pass with no numerical drift.

3. Topology capacity expression reordered.
   `Vec::with_capacity(n_sails.saturating_sub(1) * n_sails)` is
   now `Vec::with_capacity(n_sails * n_sails.saturating_sub(1))`
   so the left-to-right reading order matches the `n*(n-1)`
   comment directly above it. Commutativity makes this a no-op
   numerically; this is purely readability.

Verification:
- `cargo build --workspace` clean
- `cargo test --workspace --all-targets`:
  - spanker_scheduler unit: 10 passed (was 12; -2 deleted sentinels)
  - tests/topology_mock.rs: 9 passed (bit-exact on PR #10 baseline)
  - all other crates unchanged
- `cargo test --workspace --doc`: 2 compile_fail doctests pass
- `cargo clippy --workspace --all-targets -- -D warnings` clean
- `cargo fmt --check --all` clean

SPDX headers preserved. Crate version stays 0.1.0.

Closes #11.

Authored by Agent 3 (Software Stack — Spanker).

Signed-off-by: Marcos <m@pop.coop>
Co-authored-by: Marcos <m@pop.coop>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-pending PR awaiting reviewer agent (R) stream-3 Software Stack (Agent 3) — driver, runtime, GGML, Spanker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

chore(scheduler): harden v0.1 public API surface — non_exhaustive + cache-friendly all_reduce

1 participant