chore(scheduler): harden v0.1 public API surface — non_exhaustive + cache-friendly all_reduce

## Context

Bundles three MEDIUM findings from the Agent R review of PR #6 (scheduler crate) that were deferred from merge. None blocked the merge but all three should land before any external consumer (kernel-userspace bridge, GGML backend) starts importing this crate, otherwise we eat semver pain.

## Tasks

### 1. `#[non_exhaustive]` on public enums and structs (MEDIUM-2 + MEDIUM-4)

Add `#[non_exhaustive]` to:
- `ReduceOp` (`src/scheduler/src/collective.rs:24`) — future variants like `Product`, `LogSumExp`, `BitwiseOr` are likely
- `LinkState` (`src/scheduler/src/intercard.rs:29`) — protocol still TBD per ADR-014
- `Link` struct (`src/scheduler/src/intercard.rs:46`) — bandwidth_gbps / latency_ns will be added when ADR-014 lands
- `Error` enum (`src/scheduler/src/lib.rs:43`) — library Error enums almost always benefit from this

Rationale: avoids breaking-change semver bumps when these types grow. Cost: zero runtime; downstream `match` arms get a forced `_ =>` catch-all.

### 2. Cache-friendly `all_reduce_f32` loop order (MEDIUM-3)

In `src/scheduler/src/collective.rs:114`, the `for i in 0..stride` outer loop strides across `n_sails` separate `Vec` allocations per element. For a 4 GiB weight shard (n_sails=4) this is cache-hostile.

Suggested transposition:
- Hoist the `match op` outside the per-element loop (compute closure or function pointer once per call)
- Outer loop = card; inner loop = element (so each card's contiguous Vec is read sequentially before moving to the next)
- Add a doc note on the `AllReduce` trait that callers provide per-card contiguous buffers — fixes the layout contract for the real-device impl too

This is mock-only today but the trait signature is the public contract, so getting the layout intent into the doc now avoids a rewrite in PR #6b.

### 3. (Bonus) Inline doc on graph capacity arithmetic (LOW-5)

`src/scheduler/src/topology.rs:84`: `Vec::with_capacity(n_sails.saturating_sub(1) * n_sails)` — add a one-line `// n*(n-1) directed edges in a fully-meshed graph` comment.

## Acceptance

- All three changes in one PR titled `chore(scheduler): harden v0.1 public API surface`
- New unit test that asserts a `match ReduceOp` without `_ =>` arm fails to compile (use `#[deny(non_exhaustive_omitted_patterns)]` or a doctest demonstrating semver-evolution)
- `cargo bench` (if added) shows the loop transposition improves throughput on n_sails=4, stride=1M; if no bench yet, document expected cache behavior in the commit message

## Refs

- Agent R review on PR #6: https://github.com/popsolutions/Spanker/pull/6#issuecomment-4384997856
- ADR-014 (inter-card link protocol) — open against MAST

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(scheduler): harden v0.1 public API surface — non_exhaustive + cache-friendly all_reduce #8

Context

Tasks

1. `#[non_exhaustive]` on public enums and structs (MEDIUM-2 + MEDIUM-4)

2. Cache-friendly `all_reduce_f32` loop order (MEDIUM-3)

3. (Bonus) Inline doc on graph capacity arithmetic (LOW-5)

Acceptance

Refs

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

chore(scheduler): harden v0.1 public API surface — non_exhaustive + cache-friendly all_reduce #8

Description

Context

Tasks

1. #[non_exhaustive] on public enums and structs (MEDIUM-2 + MEDIUM-4)

2. Cache-friendly all_reduce_f32 loop order (MEDIUM-3)

3. (Bonus) Inline doc on graph capacity arithmetic (LOW-5)

Acceptance

Refs

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. `#[non_exhaustive]` on public enums and structs (MEDIUM-2 + MEDIUM-4)

2. Cache-friendly `all_reduce_f32` loop order (MEDIUM-3)