
Fix NCCL cost model non-monotonicity in Ring correction factor #425

Merged
fmassa merged 1 commit into main from fmassa/improve_nccl_cost_model
Apr 19, 2026
Conversation

@fmassa fmassa commented Apr 19, 2026

The Ring LL128 bandwidth ramp correction, after depth-scaling exponentiation for 8+ node configurations, could produce adjacent correction ratios above 2x. This made a larger message appear cheaper than a smaller one — e.g., a 2 MB all-gather was priced at 444.79 us while a 1 MB all-gather cost 448.62 us on 8 nodes. Real nccl-tests benchmarks confirm strict monotonicity at these sizes.

The fix precomputes the depth-scaled correction table with cumulative clamping, ensuring each entry is at most 2x the previous. The table is cached per node count via @functools.cache.

Accuracy validation

Validated against nccl-tests on H100 NVSwitch hardware across 2–256 GPUs, all collective types. Average absolute error by message size threshold:

Single node (intra-node NVSwitch):

  ┌─────────────┬────────┬────────┬─────────┐
  │   Config    │ ≥1 MB  │ ≥64 MB │ ≥512 MB │
  ├─────────────┼────────┼────────┼─────────┤
  │ AG/RS 2 GPU │ 20%    │ 8-10%  │ 2-3%    │
  ├─────────────┼────────┼────────┼─────────┤
  │ AG/RS 4 GPU │ 11%    │ 5%     │ 1%      │
  ├─────────────┼────────┼────────┼─────────┤
  │ AG/RS 8 GPU │ 4-6%   │ 1-3%   │ <1%     │
  ├─────────────┼────────┼────────┼─────────┤
  │ AR 2-8 GPU  │ 10-20% │ 4-8%   │ 1-3%    │
  └─────────────┴────────┴────────┴─────────┘

Multi-node (inter-node):

  ┌──────────────────────────┬────────┬────────┬─────────┐
  │          Config          │ ≥1 MB  │ ≥64 MB │ ≥512 MB │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AG/RS 2 nodes (16 GPU)   │ 16-20% │ 6-11%  │ 1-8%    │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AG/RS 4 nodes (32 GPU)   │ 3-4%   │ 2-3%   │ 1-2%    │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AG/RS 8 nodes (64 GPU)   │ 4%     │ 3%     │ 2%      │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AG/RS 16 nodes (128 GPU) │ 5-7%   │ 4-5%   │ 2-3%    │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AG/RS 32 nodes (256 GPU) │ 8%     │ 10-12% │ 8-14%   │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ AR 2-32 nodes            │ 2-11%  │ 1-9%   │ 1-9%    │
  ├──────────────────────────┼────────┼────────┼─────────┤
  │ A2A 2-32 nodes           │ 1-8%   │ 1-4%   │ 1-5%    │
  └──────────────────────────┴────────┴────────┴─────────┘

The model is within 5% for most large-message configurations (≥512 MB), which is the regime that matters for the ILP solver's sharding decisions. Higher errors at small sizes come from the latency-dominated regime, where fixed per-collective overhead rather than bandwidth sets the cost.

Changes

  • nccl_cost_model.py: Added _depth_scaled_ring_correction(n_nodes) with @functools.cache, cumulative 2x clamping across all table entries. Simplified _nccl_algo_time to index the cached table.
  • test_nccl_cost_model.py: Added TestRingCorrectionMonotonicity with 9 regression tests — table 2x property verification and allgather/reduce-scatter monotonicity checks for 8/16/32 nodes across 16 KB–256 MB.
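A monotonicity regression of this shape can be sketched as follows. The real tests exercise the model itself; `allgather_cost_us` and its constants here are hypothetical stand-ins for illustration:

```python
# Hypothetical stand-in for the model's all-gather cost in microseconds.
# The real tests call into nccl_cost_model.py; this toy alpha-beta cost is
# monotone by construction and only illustrates the check's structure.
def allgather_cost_us(size_bytes: int, n_nodes: int) -> float:
    latency_us = 30.0 * n_nodes      # fixed per-hop overhead (illustrative)
    bw_bytes_per_us = 40_000.0       # illustrative inter-node bandwidth
    return latency_us + size_bytes / bw_bytes_per_us


def check_monotone(n_nodes: int) -> bool:
    """Costs must be non-decreasing over power-of-two sizes 16 KB..256 MB."""
    sizes = [16 * 1024 << i for i in range(15)]  # 16 KB up to 256 MB
    costs = [allgather_cost_us(s, n_nodes) for s in sizes]
    return all(a <= b for a, b in zip(costs, costs[1:]))
```

The PR's bug was exactly a failure of this property at 8 nodes (2 MB priced below 1 MB), so sweeping 8/16/32 nodes across the size range pins the fix.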

Authored with Claude.

meta-cla bot added the CLA Signed label Apr 19, 2026
@fmassa fmassa merged commit 3da86e4 into main Apr 19, 2026
9 of 11 checks passed
@fmassa fmassa deleted the fmassa/improve_nccl_cost_model branch April 19, 2026 19:10