
Fix NCCL cost model non-monotonicity in Ring correction factor #425

Merged
fmassa merged 1 commit into main from fmassa/improve_nccl_cost_model
Apr 19, 2026
Conversation

@fmassa fmassa commented Apr 19, 2026

The Ring LL128 bandwidth ramp correction, after depth-scaling exponentiation for 8+ node configurations, could produce adjacent correction ratios above 2x. This made a larger message appear cheaper than a smaller one — e.g., a 2 MB all-gather was priced at 444.79 us while a 1 MB all-gather cost 448.62 us on 8 nodes. Real nccl-tests benchmarks confirm strict monotonicity at these sizes.

The fix precomputes the depth-scaled correction table with cumulative clamping, ensuring each entry is at most 2x the previous. The table is cached per node count via @functools.cache.

Accuracy validation

Validated against nccl-tests on H100 NVSwitch hardware across 2–256 GPUs, all collective types. Average absolute error by message size threshold:

Single node (intra-node NVSwitch):

  ┌─────────────┬────────┬────────┬─────────┐
  │   Config    │ ≥1 MB  │ ≥64 MB │ ≥512 MB │
  ├─────────────┼────────┼────────┼─────────┤
  │ AG/RS 2 GPU │ 20%    │ 8-10%  │ 2-3%    │
  ├─────────────┼────────┼────────┼─────────┤
  │ AG/RS 4 GPU │ 11%    │ 5%     │ 1%      │
  ├─────────────┼────────┼────────┼─────────┤
  │ AG/RS 8 GPU │ 4-6%   │ 1-3%   │ <1%     │
  ├─────────────┼────────┼────────┼─────────┤
  │ AR 2-8 GPU  │ 10-20% │ 4-8%   │ 1-3%    │
  └─────────────┴────────┴────────┴─────────┘

Multi-node (inter-node):

  ┌──────────────────────────┬────────┬────────┬─────────┐
  │          Config          │ ≥1 MB  │ ≥64 MB │ ≥512 MB │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AG/RS 2 nodes (16 GPU)   │ 16-20% │ 6-11%  │ 1-8%    │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AG/RS 4 nodes (32 GPU)   │ 3-4%   │ 2-3%   │ 1-2%    │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AG/RS 8 nodes (64 GPU)   │ 4%     │ 3%     │ 2%      │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AG/RS 16 nodes (128 GPU) │ 5-7%   │ 4-5%   │ 2-3%    │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AG/RS 32 nodes (256 GPU) │ 8%     │ 10-12% │ 8-14%   │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AR 2-32 nodes            │ 2-11%  │ 1-9%   │ 1-9%    │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ A2A 2-32 nodes           │ 1-8%   │ 1-4%   │ 1-5%    │
  └──────────────────────────┴────────┴────────┴─────────┘

The model is within 5% for most large-message configurations (≥512 MB), which is the regime that matters for the ILP solver's sharding decisions. Higher errors at small sizes come from the latency-dominated regime, where fixed per-collective overhead rather than bandwidth sets the cost.

Changes

  • nccl_cost_model.py: Added _depth_scaled_ring_correction(n_nodes) with @functools.cache, cumulative 2x clamping across all table entries. Simplified _nccl_algo_time to index the cached table.
  • test_nccl_cost_model.py: Added TestRingCorrectionMonotonicity with 9 regression tests — table 2x property verification and allgather/reduce-scatter monotonicity checks for 8/16/32 nodes across 16 KB–256 MB.
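A monotonicity regression of this shape can be sketched as follows. The real tests exercise the model itself; `allgather_cost_us` and its constants here are hypothetical stand-ins for illustration:

```python
# Hypothetical stand-in for the model's all-gather cost in microseconds.
# The real tests call into nccl_cost_model.py; this toy alpha-beta cost is
# monotone by construction and only illustrates the check's structure.
def allgather_cost_us(size_bytes: int, n_nodes: int) -> float:
    latency_us = 30.0 * n_nodes      # fixed per-hop overhead (illustrative)
    bw_bytes_per_us = 40_000.0       # illustrative inter-node bandwidth
    return latency_us + size_bytes / bw_bytes_per_us


def check_monotone(n_nodes: int) -> bool:
    """Costs must be non-decreasing over power-of-two sizes 16 KB..256 MB."""
    sizes = [16 * 1024 << i for i in range(15)]  # 16 KB up to 256 MB
    costs = [allgather_cost_us(s, n_nodes) for s in sizes]
    return all(a <= b for a, b in zip(costs, costs[1:]))
```

The PR's bug was exactly a failure of this property at 8 nodes (2 MB priced below 1 MB), so sweeping 8/16/32 nodes across the size range pins the fix.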

Authored with Claude.

meta-cla bot added the CLA Signed label Apr 19, 2026
@fmassa fmassa merged commit 3da86e4 into main Apr 19, 2026
9 of 11 checks passed
@fmassa fmassa deleted the fmassa/improve_nccl_cost_model branch April 19, 2026 19:10