Context
PR #19 ships pick_strategy with a symmetric bandwidth-bound cost model: both TP and MP charge activation_bytes / intercard_bw (sharded by n_sails for MP). This is good enough for the rev-A capacity-planning datapoint (TinyLlama Q4_0 decode → MP) but it understates MP's cost for deep pipeline-parallel placements where every card-boundary forward step pays an intercard hop.
Proposal
For pipeline-MP across p partitions of the model, each forward step pays (p - 1) * activation_bytes / intercard_bw for the chained forwards. The current pick_strategy only counts the sharded activation transfer, not the chain.
Add either:
- A Strategy::PipelineParallel { partitions: u32 } variant with the per-boundary cost, or
- A more careful MP cost that scales with n_sails - 1 when the tile spans a pipeline boundary.
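As a rough illustration of the first option, the sketch below adds the variant and charges only the intercard transfer term per strategy. Apart from Strategy::PipelineParallel { partitions: u32 }, every name here (TensorParallel, ModelParallel, Link, transfer_cost) is an assumption made for the example and may not match what PR #19 actually defines. Treating "sharded by n_sails" as a division by n_sails is also an assumption.

```rust
/// Sketch only: variant names other than `PipelineParallel` are assumed.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Strategy {
    TensorParallel,
    ModelParallel,
    PipelineParallel { partitions: u32 },
}

/// Hypothetical bundle of the inputs the transfer term needs.
struct Link {
    activation_bytes: f64, // bytes moved per forward step
    intercard_bw: f64,     // bytes per second between cards
    n_sails: u32,          // shard count used by the MP term
}

impl Strategy {
    /// Intercard transfer time in seconds for one forward step.
    /// Only the transfer term is modelled; compute cost is out of scope.
    fn transfer_cost(self, link: &Link) -> f64 {
        let base = link.activation_bytes / link.intercard_bw;
        match self {
            Strategy::TensorParallel => base,
            // Assumed reading of "sharded by n_sails": divide by the shard count.
            Strategy::ModelParallel => base / link.n_sails.max(1) as f64,
            // p partitions have p - 1 card boundaries, each paying one full hop.
            Strategy::PipelineParallel { partitions } => {
                partitions.saturating_sub(1) as f64 * base
            }
        }
    }
}
```

With this shape, an 8-way pipeline pays seven times the single-hop cost, and a single partition pays nothing, which matches the (p - 1) factor in the proposal.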
Acceptance
- TinyLlama decode datapoint still picks MP (the activation chain is short on small models).
- A 70B-class model with 8-way pipeline at the same intercard bandwidth flips toward TP because the chain cost dominates.
- New test in decision::tests pinning the boundary where the pipeline-cost penalty flips the answer.
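One way the boundary test could be framed is to search for the smallest partition count at which the chained-forward penalty makes MP costlier than TP. The helper below is hypothetical: flip_partitions and its parameters are not part of the existing code, and tp_cost and mp_sharded_cost stand in for whatever pick_strategy already computes for each strategy.

```rust
/// Hypothetical helper: smallest partition count `p` in 1..=max_partitions
/// for which the MP cost plus the chain penalty exceeds the TP cost.
/// Returns None if MP stays cheaper (or equal) across the whole range.
fn flip_partitions(
    tp_cost: f64,
    mp_sharded_cost: f64,
    hop_cost: f64,
    max_partitions: u32,
) -> Option<u32> {
    (1..=max_partitions)
        // p partitions add (p - 1) intercard hops on top of the sharded transfer.
        .find(|&p| mp_sharded_cost + (p - 1) as f64 * hop_cost > tp_cost)
}
```

A test in decision::tests could then assert the exact partition count returned for the 70B-class inputs, and assert None for the TinyLlama inputs, so that any change to the cost model that moves the boundary fails loudly.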
Follow-up references