Skip to content

[cross-stream] inter-card link bandwidth + latency model for scheduler #18

@marcos-mendez

Description

@marcos-mendez

Background

`popsolutions/Spanker` PR #6 (`feat(scheduler): spanker-scheduler crate
with Topology + collective ops`,
popsolutions/Spanker#6) lands the multi-card
`Topology` abstraction and the `AllReduce` / `AllGather` trait
surfaces. The mock implementation reduces host-side, but the real
implementation will dispatch collective ops over the inter-card link
specified by the (still-pending) ADR-014.

To plan partitioning intelligently — i.e. when to keep weights co-resident
on a single Sail vs. shard across N Sails — the scheduler needs a
characterisation of the inter-card link from MAST.

Ask

Provide a bandwidth + latency model for the inter-card link as it would
behave under the most likely ADR-014 path (custom LVDS over the
backplane connector specified by Stays). Specifically:

  1. Sustained bandwidth at `INTERCARD_LANES = 4`,
    `INTERCARD_LANE_WIDTH = 32`, `INTERCARD_BUS_WIDTH = 128`.
    Expected effective payload throughput in GB/s after framing,
    training overhead, and CRC.
  2. Round-trip latency for a small (e.g. 64-byte) packet between
    adjacent Sails in the topology, measured at the link MAC.
  3. Saturation behaviour — does the link sustain peak under
    continuous traffic, or does it back off under thermal /
    error-rate pressure?
  4. AllReduce-shaped workload estimate — given the above, an
    approximate cost in microseconds for an AllReduce of N MiB across
    K cards in a fully-meshed topology.

Source can be:

A short spec under `MAST/docs/intercard/link-model.md` is the natural
home.

Why this matters

Without these numbers, the scheduler's partition heuristic (when to
shard vs. keep co-resident) is guessing. Wrong guesses lead to
collective-op storms that dominate end-to-end latency on real
hardware — exactly the silent regression `feedback_testing.md`
warns against.

Priority

Blocks PR #6b (real-device collective ops). Not a blocker for PR #6
itself (which is honestly scoped to mock-only).

Acceptance

  • A spec under `MAST/docs/intercard/link-model.md` (or equivalent)
    with the four numbers above.
  • A reference Python helper or table that Spanker can import for
    the partition heuristic.
  • Updates to the model when ADR-014 lands and the actual protocol
    is pinned.

Filed by Agent 3 (Software Stack) after merging Spanker PR #6.

Metadata

Metadata

Assignees

No one assigned

    Labels

    cross-streamTouches multiple streams — coordination neededstream-1RTL Architect (Agent 1) — SystemVerilog, cocotb, MAST primarystream-3Software Stack (Agent 3) — driver, runtime, GGML, Spanker

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions