Background
`popsolutions/Spanker` PR #6 (`feat(scheduler): spanker-scheduler crate
with Topology + collective ops`,
popsolutions/Spanker#6) lands the multi-card
`Topology` abstraction and the `AllReduce` / `AllGather` trait
surfaces. The mock implementation reduces host-side, but the real
implementation will dispatch collective ops over the inter-card link
specified by the (still-pending) ADR-014.
To plan partitioning intelligently — i.e. when to keep weights co-resident
on a single Sail vs. shard across N Sails — the scheduler needs a
characterisation of the inter-card link from MAST.
Ask
Provide a bandwidth + latency model for the inter-card link as it would
behave under the most likely ADR-014 path (custom LVDS over the
backplane connector specified by Stays). Specifically:
- Sustained bandwidth at `INTERCARD_LANES = 4`,
`INTERCARD_LANE_WIDTH = 32`, `INTERCARD_BUS_WIDTH = 128`
(4 lanes × 32 = 128), expressed as the expected effective payload
throughput in GB/s after framing, training overhead, and CRC.
- Round-trip latency for a small (e.g. 64-byte) packet between
adjacent Sails in the topology, measured at the link MAC.
- Saturation behaviour — does the link sustain peak under
continuous traffic, or does it back off under thermal /
error-rate pressure?
- AllReduce-shaped workload estimate — given the above, an
approximate cost in microseconds for an AllReduce of N MiB across
K cards in a fully-meshed topology.
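
The four quantities above could be tied together with a simple latency-bandwidth (alpha-beta) model. What follows is a minimal sketch under stated assumptions, not the MAST model: the clock rate, overhead fraction, and one-way latency are placeholders, `INTERCARD_BUS_WIDTH` is assumed to mean 128 bits moved per clock, and the names `LinkModel` and `allreduce_us` are hypothetical. It also assumes a reduce-scatter followed by an all-gather in which every card exchanges its chunk with all K-1 peers concurrently over dedicated links, which a fully-meshed topology would permit in principle.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class LinkModel:
    """Alpha-beta model of one inter-card link. All defaults are placeholders."""

    bus_width_bits: int = 128         # INTERCARD_BUS_WIDTH; ASSUMED to be bits per clock
    clock_hz: float = 1.0e9           # ASSUMPTION: not specified anywhere in this issue
    overhead_fraction: float = 0.2    # ASSUMPTION: framing + training + CRC share
    one_way_latency_us: float = 1.0   # ASSUMPTION: per-message latency (alpha)

    def payload_bytes_per_s(self) -> float:
        """Effective payload throughput after subtracting protocol overhead."""
        raw = self.bus_width_bits / 8 * self.clock_hz
        return raw * (1.0 - self.overhead_fraction)

    def transfer_us(self, nbytes: float) -> float:
        """Time to move nbytes over one link: alpha + nbytes / beta."""
        return self.one_way_latency_us + nbytes / self.payload_bytes_per_s() * 1e6


def allreduce_us(link: LinkModel, n_mib: float, k_cards: int) -> float:
    """Estimate AllReduce cost for n_mib MiB across k_cards fully-meshed cards.

    Two phases (reduce-scatter, then all-gather). In each phase a card sends
    one chunk of size N/K to every peer at the same time, so a phase costs a
    single chunk transfer. Reduction compute time is ignored.
    """
    if k_cards < 1:
        raise ValueError("k_cards must be >= 1")
    if k_cards == 1:
        return 0.0  # nothing to exchange on a single card
    chunk_bytes = n_mib * MIB / k_cards
    return 2.0 * link.transfer_us(chunk_bytes)
```

Calling `allreduce_us(LinkModel(), n_mib=8, k_cards=4)` would return an estimate in microseconds for the placeholder parameters. Once MAST supplies real figures, only the `LinkModel` fields would need to change. Back-off under thermal or error-rate pressure is not modelled here; it could be approximated by lowering the effective throughput.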
Source can be:
- ... testbench, or
- ... rate, with documented assumptions.
A short spec under `MAST/docs/intercard/link-model.md` is the natural
home.
Why this matters
Without these numbers, the scheduler's partition heuristic (when to
shard vs. keep co-resident) is guessing. Wrong guesses lead to
collective-op storms that dominate end-to-end latency on real
hardware — exactly the silent regression `feedback_testing.md`
warns against.
Priority
Blocks PR #6b (real-device collective ops). Not a blocker for PR #6
itself (which is honestly scoped to mock-only).
Acceptance
- A spec under `MAST/docs/intercard/link-model.md` (or equivalent)
with the four numbers above.
- A reference Python helper or table that Spanker can import for
the partition heuristic.
- Updates to the model when ADR-014 lands and the actual protocol
is pinned.
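
As an illustration of how Spanker might consume such a helper, here is a minimal sketch of a shard-vs-co-resident decision. Everything in it is hypothetical: the `PartitionInputs` fields, the `should_shard` name, and the cost rule (per-step compute divides evenly across cards, plus one AllReduce per step) are assumptions made for the example, not the scheduler's actual heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionInputs:
    """Inputs to the hypothetical partition decision."""

    weight_bytes: int          # total size of the weights to place
    sail_memory_bytes: int     # memory available on one Sail
    compute_us_single: float   # per-step compute time on a single Sail
    allreduce_us: float        # per-step collective cost from the link model
    k_cards: int               # number of Sails available for sharding


def should_shard(p: PartitionInputs) -> bool:
    """Return True if sharding across k_cards is expected to beat one Sail."""
    if p.k_cards < 2:
        return False  # nothing to shard across
    if p.weight_bytes > p.sail_memory_bytes:
        return True   # weights cannot be co-resident on one Sail
    # ASSUMPTION: compute scales ideally with card count, one AllReduce per step.
    sharded_us = p.compute_us_single / p.k_cards + p.allreduce_us
    return sharded_us < p.compute_us_single
```

The point of the sketch is that the decision flips on `allreduce_us`: if the link estimate is too optimistic, the heuristic shards workloads that would have run faster co-resident.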
Filed by Agent 3 (Software Stack) after merging Spanker PR #6.