Fix NCCL cost model non-monotonicity in Ring correction factor #425
Merged
Conversation
The Ring LL128 bandwidth ramp correction, after depth-scaling exponentiation for 8+ node configurations, could produce adjacent correction ratios above 2x. This made a larger message appear cheaper than a smaller one: e.g., a 2 MB all-gather was priced at 444.79 us while a 1 MB all-gather cost 448.62 us on 8 nodes. Real nccl-tests benchmarks confirm strict monotonicity at these sizes.

The fix precomputes the depth-scaled correction table with cumulative clamping, ensuring each entry is at most 2x the previous. The table is cached per node count via `@functools.cache`.

Accuracy validation

Validated against nccl-tests on H100 NVSwitch hardware across 2–256 GPUs, all collective types. Average absolute error by message size threshold:

Single node (intra-node NVSwitch):

| Config      | ≥1 MB  | ≥64 MB | ≥512 MB |
|-------------|--------|--------|---------|
| AG/RS 2 GPU | 20%    | 8-10%  | 2-3%    |
| AG/RS 4 GPU | 11%    | 5%     | 1%      |
| AG/RS 8 GPU | 4-6%   | 1-3%   | <1%     |
| AR 2-8 GPU  | 10-20% | 4-8%   | 1-3%    |

Multi-node (inter-node):

| Config                   | ≥1 MB  | ≥64 MB | ≥512 MB |
|--------------------------|--------|--------|---------|
| AG/RS 2 nodes (16 GPU)   | 16-20% | 6-11%  | 1-8%    |
| AG/RS 4 nodes (32 GPU)   | 3-4%   | 2-3%   | 1-2%    |
| AG/RS 8 nodes (64 GPU)   | 4%     | 3%     | 2%      |
| AG/RS 16 nodes (128 GPU) | 5-7%   | 4-5%   | 2-3%    |
| AG/RS 32 nodes (256 GPU) | 8%     | 10-12% | 8-14%   |
| AR 2-32 nodes            | 2-11%  | 1-9%   | 1-9%    |
| A2A 2-32 nodes           | 1-8%   | 1-4%   | 1-5%    |

The model is within 5% for most large-message configurations (≥512 MB), which is the regime that matters for the ILP solver's sharding decisions. Higher errors at small sizes come from the latency-dominated regime, where fixed overhead dominates.

Changes

- `nccl_cost_model.py`: Added `_depth_scaled_ring_correction(n_nodes)` with `@functools.cache`, applying cumulative 2x clamping across all table entries. Simplified `_nccl_algo_time` to index the cached table.
- `test_nccl_cost_model.py`: Added `TestRingCorrectionMonotonicity` with 9 regression tests: table 2x property verification, plus all-gather/reduce-scatter monotonicity checks for 8/16/32 nodes across 16 KB–256 MB.

Authored with Claude.