[cross-stream] inter-card link bandwidth + latency model for scheduler

## Background

\`popsolutions/Spanker\` PR #6 (\`feat(scheduler): spanker-scheduler crate
with Topology + collective ops\`,
https://github.com/popsolutions/Spanker/pull/6) lands the multi-card
\`Topology\` abstraction and the \`AllReduce\` / \`AllGather\` trait
surfaces. The mock implementation reduces host-side, but the real
implementation will dispatch collective ops over the inter-card link
specified by the (still-pending) ADR-014.

To plan partitioning intelligently — i.e. when to keep weights co-resident
on a single Sail vs. shard across N Sails — the scheduler needs a
characterisation of the inter-card link from MAST.

## Ask

Provide a bandwidth + latency model for the inter-card link as it would
behave under the most likely ADR-014 path (custom LVDS over the
backplane connector specified by Stays). Specifically:

1. **Sustained bandwidth** at \`INTERCARD_LANES = 4\`,
   \`INTERCARD_LANE_WIDTH = 32\`, \`INTERCARD_BUS_WIDTH = 128\`.
   Expected effective payload throughput in GB/s after framing,
   training overhead, and CRC.
2. **Round-trip latency** for a small (e.g. 64-byte) packet between
   adjacent Sails in the topology, measured at the link MAC.
3. **Saturation behaviour** — does the link sustain peak under
   continuous traffic, or does it back off under thermal /
   error-rate pressure?
4. **AllReduce-shaped workload estimate** — given the above, an
   approximate cost in microseconds for an AllReduce of N MiB across
   K cards in a fully-meshed topology.

Source can be:
- Calibration against the MAST #14 intercard skeleton on the cocotb
  testbench, or
- A back-of-envelope model derived from the link width × clock × line
  rate, with documented assumptions.

A short spec under \`MAST/docs/intercard/link-model.md\` is the natural
home.

## Why this matters

Without these numbers, the scheduler's partition heuristic (when to
shard vs. keep co-resident) is guessing. Wrong guesses lead to
collective-op storms that dominate end-to-end latency on real
hardware — exactly the silent regression \`feedback_testing.md\`
warns against.

## Priority

Blocks PR #6b (real-device collective ops). Not a blocker for PR #6
itself (which is honestly scoped to mock-only).

## Acceptance

- A spec under \`MAST/docs/intercard/link-model.md\` (or equivalent)
  with the four numbers above.
- A reference Python helper or table that Spanker can import for
  the partition heuristic.
- Updates to the model when ADR-014 lands and the actual protocol
  is pinned.

Filed by Agent 3 (Software Stack) after merging Spanker PR #6.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cross-stream] inter-card link bandwidth + latency model for scheduler #18

Background

Ask

Why this matters

Priority

Acceptance

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[cross-stream] inter-card link bandwidth + latency model for scheduler #18

Description

Background

Ask

Why this matters

Priority

Acceptance

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions