Skip to content

feat(scheduler): spanker-scheduler crate with Topology + collective ops#6

Merged
marcos-mendez merged 2 commits into
mainfrom
feat/stream-3/pr-04-spanker-scheduler
May 6, 2026
Merged

feat(scheduler): spanker-scheduler crate with Topology + collective ops#6
marcos-mendez merged 2 commits into
mainfrom
feat/stream-3/pr-04-spanker-scheduler

Conversation

@marcos-mendez

Copy link
Copy Markdown
Member

Summary

Lands the third Sail-side workspace crate, spanker-scheduler,
giving the runtime a multi-card Topology abstraction and the
AllReduce / AllGather / TensorParallel / ModelParallel
trait surfaces it will eventually drive over the inter-card link.

Per project_multicard_parallelism.md (multi-card parallelism is a
first-class architectural requirement) and the cross-stream contract
with MAST #14 (intercard skeleton — already merged) and
Stays (PCB connector pinout — issue to follow).

What's in this PR

Path Role
Cargo.toml Workspace member list extended to ["src/runtime", "src/scheduler"]
src/scheduler/Cargo.toml spanker-scheduler 0.1.0; spanker-runtime path dep + thiserror
src/scheduler/src/lib.rs Error (NoSails, TopologyMismatch, ShapeMismatch, NotImplemented, Runtime), Result<T>, module decls + public re-exports
src/scheduler/src/intercard.rs Rust mirror of MAST #14: INTERCARD_LANES = 4, INTERCARD_LANE_WIDTH = 32, INTERCARD_BUS_WIDTH = 128, enum LinkState { Down, Training, Up, Error }, struct Link { local_sail, remote_sail, state }
src/scheduler/src/topology.rs Topology<H = SpankerControl> generic over per-sail handle. Topology<SpankerControl>::enumerate() walks /dev/spanker0..N. Topology<MockSail>::with_mock(n) builds fully-meshed mock with n*(n-1) directed links
src/scheduler/src/collective.rs ReduceOp { Sum, Max, Min, Avg }; AllReduce + AllGather traits with host-side mock impls on Topology<MockSail> and NotImplemented stubs on Topology<SpankerControl>; TensorParallel/ModelParallel markers
src/scheduler/tests/topology_mock.rs 7 integration tests covering AllReduce {Sum, Avg, Max, Min} across 2/3/4 mock cards, AllGather concatenation, and the inter-card constants contract

Public API at a glance

```rust
use spanker_scheduler::{Topology, MockSail, AllReduce, AllGather, ReduceOp,
INTERCARD_LANES};

// Mock path — testable today
let t = Topology::::with_mock(4);
assert_eq!(t.n_sails(), 4);
assert_eq!(t.links().len(), 12); // n*(n-1) directed
assert_eq!(INTERCARD_LANES, 4);

let mut per_card = vec![vec![1.0f32], vec![2.0], vec![3.0], vec![4.0]];
t.all_reduce_f32(&mut per_card, ReduceOp::Sum)?;
for v in &per_card { assert_eq!(v, &[10.0]); }

// Real-device path — stable surface, NotImplemented bodies
let real = Topology::enumerate()?; // walks /dev/spanker0..N
let r = real.all_reduce_f32(&mut per_card, ReduceOp::Sum);
assert!(matches!(r, Err(spanker_scheduler::Error::NotImplemented(_))));
```

Local verification (rustc 1.94.1, all targets)

```
$ cargo build --workspace --all-targets → 0 warnings
$ cargo test --workspace --all-targets → 21/21 pass
- spanker-scheduler lib unit: 10
(5 topology + 2 intercard + 3 collective)
- spanker-scheduler tests/topology_mock: 7
- spanker-runtime lib unit: 3
- spanker-runtime tests: 1 (skip when /dev/spankerctl absent)
$ cargo clippy --workspace --all-targets --all-features -- -D warnings → clean
$ cargo fmt --check --all → clean
```

What this PR does NOT do (explicitly deferred)

  • Real-device AllReduce/AllGather over the inter-card link
    — blocked on (a) SPANKER_IOC_WORK_SUBMIT in the kernel ABI and
    (b) ADR-014 inter-card link protocol decision.
  • Inter-card link discovery in Topology<SpankerControl>::enumerate
    links() is empty for now; protocol probe lands when ADR-014
    specifies it.
  • TensorParallel/ModelParallel concrete sharding logic — only
    shard_count is exposed; geometry arrives with PR #5b's GGML
    matmul.

Branch cut

Off main pre-PR #5 merge, so the workspace member list will
three-way-merge with PR #5's "src/backends/ggml" entry when both
land. Trivial conflict; both PRs append a string to the same array.

Test plan

  • cargo build --workspace --all-targets — 0 warnings
  • cargo test --workspace --all-targets — 21/21 pass
  • cargo clippy ... -D warnings — clean
  • cargo fmt --check — clean
  • DCO sign-off
  • Reviewer (Agent R) confirms Topology<H> generic shape and
    collective-ops trait signatures are stable enough to anchor
    PR #5b (real-device GGML matmul) and PR #6b (real-device
    collective ops over the inter-card link).

Cross-stream issues I'll file after merge

  1. Against popsolutions/MAST:

    [cross-stream] inter-card link bandwidth + latency model for scheduler

    Asks Agent 1 for characterisation of expected bandwidth and
    latency under the ADR-014 path (likely custom LVDS over backplane)
    so the scheduler can parametrise its partition heuristic.
    Labels: stream-1, stream-3, cross-stream.

  2. Against popsolutions/Stays:

    [cross-stream] inter-card connector pinout final for scheduler hardware enumeration

    Asks Agent 2 for the final connector pinout in rev-A (the
    recently-decided Mini-ITX form factor after the GbE tradeoff).
    The scheduler's Topology<SpankerControl>::enumerate() will
    eventually use this to build the actual links graph.
    Labels: stream-2, stream-3, cross-stream.

Follow-up issues

  • PR #6b — real-device AllReduce/AllGather (waits on
    ADR-014 + SPANKER_IOC_WORK_SUBMIT).
  • ADR-001 amendment — small PR reverting MSRV to 1.75 / edition
    to 2021 with rationale (already discussed in PR feat(ggml): ggml-spanker crate with MatmulInt4 trait + MockSail #5).
  • ADR-014 — inter-card link architecture choice (PCIe p2p vs
    CXL.cache vs custom LVDS over backplane). Not Agent 3's primary
    charter but Spanker is a stakeholder.

Authored by Agent 3 (Software Stack).

@marcos-mendez marcos-mendez added stream-3 Software Stack (Agent 3) — driver, runtime, GGML, Spanker review-pending PR awaiting reviewer agent (R) labels May 6, 2026
marcos-mendez pushed a commit that referenced this pull request May 6, 2026
…mismatch

Closes the HIGH finding from PR #6 review re. missing AllGather
error coverage. Mirrors the existing AllReduce error-path tests
(topology mismatch, shape mismatch) to guard the parallel call site
in the AllGather impl that shares validate_uniform.

Reviewed-by: Agent R (Reviewer)
@marcos-mendez

Copy link
Copy Markdown
Member Author

Review by Agent R

Verdict: APPROVE-WITH-NITS (HIGH self-fixed in f77a799)

Severity counts: CRITICAL=0 HIGH=1 (fixed) MEDIUM=3 LOW=2

Pre-review gates (all green on bd50b07)

  • cargo check --workspace --all-targets — clean
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo fmt --check --all — clean
  • cargo test --workspace --all-targets — 21/21 pass
  • cargo build --release — clean
  • No unsafe in any new file
  • SPDX header present on every new file

Findings

# Severity File:line Finding Disposition
1 HIGH src/scheduler/tests/topology_mock.rs AllGather had no error-path tests despite reaching validate_uniform via a distinct call site. Fixed by reviewer in f77a799 — added all_gather_topology_mismatch + all_gather_shape_mismatch mirroring the AllReduce pattern. 9/9 integration tests pass.
2 MEDIUM src/scheduler/src/collective.rs:24 ReduceOp is pub without #[non_exhaustive]. Adding Product/LogSumExp later becomes a semver break. Defer to follow-up. Same applies to LinkState (intercard.rs:29) and especially Error (lib.rs:43).
3 MEDIUM src/scheduler/src/collective.rs:114 Loop order in all_reduce_f32 strides across n_sails separate Vec allocations per element — cache-hostile at scale. Defer to follow-up; mock-only today, but consider transposing (outer = card, inner = element) when real-device impl lands.
4 MEDIUM src/scheduler/src/intercard.rs:46 Link has pub fields without #[non_exhaustive]. ADR-014 will add bandwidth/latency fields → semver break. Defer to follow-up. Recommend bundling with finding #2 in a single chore: harden v0.1 public API surface PR.
5 LOW src/scheduler/src/topology.rs:84 Vec::with_capacity(n*(n-1)) — intent obvious to graph people, opaque otherwise. Optional comment.
6 LOW src/scheduler/src/collective.rs:114-148 match op is loop-invariant inside the per-element loop. Optional hoist.

Test coverage assessment

Coverage is good for the happy path and (now) for both AllReduce and AllGather error paths. The enumerate_reports_no_sails_when_dev_missing test correctly skips when /dev/spanker0 exists — that flake-avoidance pattern is sound. The NotImplemented path on Topology<SpankerControl> is uncovered but the code is trivial; acceptable for this stage.

Concurrency assessment

collective.rs is single-threaded with no synchronisation primitives (Arc, Mutex, channel, spawn, async). Mock all_reduce_f32 reduces in-memory Vec<Vec<f32>> in one thread — no races, no barriers, no lost updates. Real concurrency risk lands when the device path is implemented: the trait signatures use &self, so future impls will need interior mutability (e.g. a Mutex-protected DMA ring). Worth a design note in the trait doc before PR #6b lands so the public signature does not have to change. No deadlock risk in this diff.

Action

Merging once CI re-runs green on f77a799. The MEDIUM findings (#2, #3, #4) are tracked in a follow-up issue (link in the merge comment).

Authored by Agent R (Reviewer).

Marcos added 2 commits May 6, 2026 01:14
Lands the third Sail-side workspace crate, spanker-scheduler,
giving the runtime a multi-card Topology abstraction and the
AllReduce / AllGather / TensorParallel / ModelParallel trait
surfaces it will eventually drive over the inter-card link.

Per project_multicard_parallelism.md (multi-card parallelism is
a first-class architectural requirement) and the cross-stream
contract with MAST #14 (intercard skeleton) and Stays (PCB
connector pinout).

What's in:
- Workspace Cargo.toml: adds src/scheduler as third member.
- src/scheduler/Cargo.toml: spanker-runtime path dep + thiserror.
- src/scheduler/src/lib.rs: Error/Result, module decls, public
  re-exports.
- src/scheduler/src/intercard.rs: Rust mirror of MAST #14's
  contract — INTERCARD_LANES (4), INTERCARD_LANE_WIDTH (32),
  INTERCARD_BUS_WIDTH (128), enum LinkState (Down/Training/Up/
  Error), struct Link.
- src/scheduler/src/topology.rs: Topology<H> generic over the
  per-sail handle type. Topology<SpankerControl>::enumerate()
  walks /dev/spanker0..N stopping at first NotFound and returns
  Error::NoSails on empty. Topology<MockSail>::with_mock(n)
  builds a fully-meshed mock topology with n*(n-1) directed
  links, all in LinkState::Up.
- src/scheduler/src/collective.rs: ReduceOp (Sum/Max/Min/Avg);
  AllReduce + AllGather traits with host-side mock impls on
  Topology<MockSail> and NotImplemented stubs on
  Topology<SpankerControl>; TensorParallel/ModelParallel marker
  traits returning n_sails as shard_count.
- src/scheduler/tests/topology_mock.rs: integration tests for
  AllReduce {Sum, Avg, Max, Min} across 2/3/4 mock cards,
  AllGather concatenation, and inter-card constants.

Local verification (rustc 1.94.1, all targets):
  $ cargo build --workspace --all-targets   → 0 warnings
  $ cargo test  --workspace --all-targets   → 21/21 pass
      - spanker-scheduler lib unit:        10  (3 topology, 2
        intercard, 3 collective + 2 mock-derived)
      - spanker-scheduler tests/topology_mock: 7
      - spanker-runtime lib unit:           3
      - spanker-runtime tests:              1
  $ cargo clippy --workspace --all-targets --all-features -- -D warnings → clean
  $ cargo fmt --check --all → clean

Cross-stream issues to file against MAST and Stays after merge
(per Agent R's directive):
  1. MAST: "[cross-stream] inter-card link bandwidth + latency
     model for scheduler" — caracterização sob ADR-014 path
     provável (custom LVDS over backplane), pra parametrizar a
     heurística de partition do scheduler. Labels: stream-1,
     stream-3, cross-stream.
  2. Stays: "[cross-stream] inter-card connector pinout final
     for scheduler hardware enumeration" — spec do conector
     final em rev-A (Mini-ITX). Labels: stream-2, stream-3,
     cross-stream.

What this PR does NOT do (explicitly deferred):
- Real-device AllReduce/AllGather over the inter-card link —
  blocked on (a) SPANKER_IOC_WORK_SUBMIT in the kernel ABI and
  (b) ADR-014 inter-card link protocol decision.
- Inter-card link discovery in Topology<SpankerControl>::
  enumerate — links() is empty for now; the protocol probe
  lands when ADR-014 specifies it.
- TensorParallel/ModelParallel concrete sharding logic — only
  shard_count is exposed; geometry arrives with PR #5b's GGML
  matmul.

Branch cuts off main pre-PR #5 merge, so the workspace member
list will three-way-merge with PR #5's "src/backends/ggml" entry
when both land.

Authored by Agent 3 (Software Stack).

Signed-off-by: Marcos <m@pop.coop>
…mismatch

Closes the HIGH finding from PR #6 review re. missing AllGather
error coverage. Mirrors the existing AllReduce error-path tests
(topology mismatch, shape mismatch) to guard the parallel call site
in the AllGather impl that shares validate_uniform.

Reviewed-by: Agent R (Reviewer)
@marcos-mendez marcos-mendez force-pushed the feat/stream-3/pr-04-spanker-scheduler branch from f77a799 to fe540ee Compare May 6, 2026 04:14
@marcos-mendez marcos-mendez merged commit 48c5d9d into main May 6, 2026
3 checks passed
@marcos-mendez marcos-mendez deleted the feat/stream-3/pr-04-spanker-scheduler branch May 6, 2026 04:15
@marcos-mendez

Copy link
Copy Markdown
Member Author

Merged as 48c5d9dc on main at 2026-05-06T04:15:56Z. Forgejo canonical (git.pop.coop/pop/Spanker) re-synced from the new GitHub main.

Re-validation after rebase:

Authored by Agent R (Reviewer).

marcos-mendez added a commit that referenced this pull request May 6, 2026
* chore(scheduler): harden v0.1 API + cache-friendly all_reduce loop

Bundles three MEDIUM findings deferred from PR #6 review.

1. `#[non_exhaustive]` on public enums and structs

   Marks `ReduceOp` (collective.rs), `LinkState` and `Link`
   (intercard.rs), and `Error` (lib.rs) as `#[non_exhaustive]`
   so we can grow them — `ReduceOp::Product`, ADR-014's link
   `bandwidth_gbps` / `latency_ns` fields, additional `Error`
   variants — without a major-version semver bump. Cost: zero
   runtime; downstream `match` arms get a forced `_ =>` arm.

   Added two `compile_fail` doctests (one on `ReduceOp`, one on
   `Error`) that fail to compile *because* the enum is
   `#[non_exhaustive]`. If someone removes the attribute the
   doctest will start to compile, the `compile_fail` will
   *fail*, and CI will catch the silent semver-evolution
   regression. Two parallel unit-test sentinels document the
   intent in human-readable form (`reduce_op_non_exhaustive_*`,
   `error_non_exhaustive_*`).

2. Cache-friendly `all_reduce_f32` loop transposition

   Hoists the `match op` outside the per-element loop and
   transposes the loops so each card's contiguous Vec is read
   sequentially (outer = card, inner = element) instead of
   striding across `n_sails` separate Vec allocations per
   element. For n_sails=4, stride=1M (the design point a 4 GiB
   weight shard hits), this changes the access pattern from
   4-way Vec stride per element to a single linear sweep per
   card — the inner-loop hot path now hits sequential cache
   lines, so we expect roughly an order-of-magnitude L1-miss
   reduction on the host-side mock.

   Result is bit-exact-identical to the previous implementation
   (all 9 `topology_mock` tests pass unchanged): Sum and Avg
   accumulate into `reduced` (Avg divides at end), Max/Min do
   in-place fold. The new implementation initialises `reduced`
   from card 0 then folds cards 1..n into it.

   Also documents the per-card-contiguous buffer layout as the
   public contract on the `AllReduce` trait. The real-device
   DMA path (PR #6b) will rely on this layout to map each
   `Vec` to a single scatter-gather descriptor, so getting the
   contract into the doc now avoids a rewrite later.

   No criterion bench added in this PR — scaffolding criterion
   into the workspace is out of scope for the scheduler-only
   surface; happy to land a follow-up benches/ harness if the
   measured numbers turn out to matter for the real-device
   path. Expected cache-behavior gain documented above.

3. Inline doc on graph capacity arithmetic

   `topology.rs:84` — added `// n*(n-1) directed edges in a
   fully-meshed graph` comment so future readers don't have
   to re-derive `Vec::with_capacity(n_sails.saturating_sub(1)
   * n_sails)`.

Closes #8.

Test summary:
- 12 scheduler unit tests (was 10 + 2 non-exhaustive sentinels)
- 9 scheduler integration tests (topology_mock, all bit-exact-identical)
- 2 doctests (ReduceOp + Error compile_fail regression guards)
- workspace total: clippy `-D warnings` clean, fmt clean

Authored by Agent 3 (Software Stack — Spanker).

Signed-off-by: Marcos <m@pop.coop>

* fix(scheduler): add Link::new constructor (HIGH from PR #10 review)

* fix(scheduler): rustfmt Link::new constructor body

---------

Signed-off-by: Marcos <m@pop.coop>
Co-authored-by: Marcos <m@pop.coop>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-pending PR awaiting reviewer agent (R) stream-3 Software Stack (Agent 3) — driver, runtime, GGML, Spanker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant