[pull] main from pytorch:main#912
Merged
pull[bot] merged 22 commits into MPACT-ORG:main from pytorch:main on Feb 27, 2026
Conversation
The `global_opaque_types` unordered_set in schema_type_parser.cpp was accessed without any synchronization. RCCL initialization spawns background C++ threads that trigger JIT schema parsing (calling isRegisteredOpaqueType), which reads `global_opaque_types`, while the main Python thread concurrently writes to the set via registerOpaqueType at module import time.
This read+write data race on std::unordered_set is undefined behavior and can corrupt the hash table, causing a thread to spin indefinitely in hash collision resolution. Because the hang occurs inside a RCCL collective on a C++ thread, Python-level per-test timeouts cannot fire, so the entire distributed test shard runs until the 270-minute job-level timeout is hit.
Fix: protect all access to global_opaque_types with a mutex. The mutex itself uses a function-static to ensure thread-safe initialization (C++11 guarantee). This also covers any future callers of registerOpaqueType that run at import time concurrently with collective initialization.
```
Python thread (main):
  import torch.distributed.tensor
  -> _register_distributed_opaque_types()
  -> torch._C._register_opaque_type("torch._C._distributed_c10d.ProcessGroup")
  -> registerOpaqueType()
  -> unordered_set::insert()   (on `global_opaque_types`)

RCCL C++ thread:
  init_process_group()
  -> isRegisteredOpaqueType()
  -> unordered_set::find()     (on `global_opaque_types`)
```
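The locking scheme described above can be sketched as a minimal standalone model. The names mirror the PR description (`registerOpaqueType`, `isRegisteredOpaqueType`, `global_opaque_types`), but this is a hedged sketch, not the actual PyTorch source:

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

namespace {

// Sketch of the guarded global set; not the real schema_type_parser.cpp.
std::unordered_set<std::string>& opaqueTypes() {
  static std::unordered_set<std::string> global_opaque_types;
  return global_opaque_types;
}

// Function-static mutex: C++11 guarantees thread-safe initialization,
// so the lock itself is safe even for the very first caller.
std::mutex& opaqueTypesMutex() {
  static std::mutex m;
  return m;
}

}  // namespace

void registerOpaqueType(const std::string& qualName) {
  std::lock_guard<std::mutex> guard(opaqueTypesMutex());
  opaqueTypes().insert(qualName);  // writer: Python import path
}

bool isRegisteredOpaqueType(const std::string& qualName) {
  std::lock_guard<std::mutex> guard(opaqueTypesMutex());
  // reader: JIT schema parsing on RCCL background threads
  return opaqueTypes().count(qualName) != 0;
}
```

Because every reader and writer takes the same lock, the unordered_set's buckets can no longer be observed mid-rehash, which is what previously let a reader spin in collision resolution.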
Pull Request resolved: #175694
Approved by: https://github.com/jeffdaily
Fixes #168208

### Summary
Disable the low-precision (fp16) check in `test_tield_kernel_fusion` to stop intermittent CI failures, especially on ROCm. The failures come from the low-precision (fp16) comparison, not the main float32 check. Benchmark fusion can choose different fusion strategies between runs, and those small differences show up more often when comparing fp16 results.

### Fix
Set `check_lowp=False` to skip the fp16 comparison. This follows the pattern used in other inductor tests that disable low-precision checks due to fusion precision issues (e.g., `test_cumsum_zero_dim`, `test_cumprod_zero_dim`, `test_split_cumsum_low_prec`).

### Testing
`PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_benchmark_fusion.py BenchmarkFusionGpuTest.test_tield_kernel_fusion_cuda` - Test passed on MI300.

Pull Request resolved: #175840
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: #174347 Approved by: https://github.com/pytorchbot
We cannot run all the test suites in the `default` config using just 2 shards; it times out. However, due to insufficient Navi31 CI capacity, we cannot increase the shards per run either. Hence, reducing the test set to core unit tests unconditionally. Test plan: [rocm-navi31](https://github.com/pytorch/pytorch/actions/runs/22411229116/job/64893098527?pr=175770) passed Pull Request resolved: #175770 Approved by: https://github.com/jeffdaily
This reverts commit c5ca4f7. Reverted #175769 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#175769 (comment)))
Add a custom lowering for aten.addcdiv that uses FMA (fused multiply-add) to match the precision of ATen's CUDA kernel. The lowering computes `self + value * (tensor1 / tensor2)`:
- For value == 1: `self + tensor1 / tensor2` (simple add)
- For value != 1: `fma(value, tensor1 / tensor2, self)`

Uses truediv, which internally uses div_rn (round-to-nearest division) when `config.eager_numerics.division_rounding` is True, ensuring the division result is properly rounded before the FMA. This allows addcdiv operations to be fused into Triton kernels while maintaining bitwise parity with eager execution. Authored with Claude. Pull Request resolved: #174912 Approved by: https://github.com/v0i0
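The two code paths of the lowering can be illustrated with a hypothetical scalar model (the real lowering emits Triton IR; `addcdiv_scalar` is an illustration name, not a PyTorch function):

```cpp
#include <cmath>

// Scalar model of the addcdiv lowering described above:
// round-to-nearest division first, then either a plain add (value == 1)
// or a single fused multiply-add (value != 1).
float addcdiv_scalar(float self, float t1, float t2, float value) {
  float q = t1 / t2;                 // div_rn: correctly rounded division
  if (value == 1.0f)
    return self + q;                 // value == 1: simple add
  return std::fmaf(value, q, self);  // value != 1: fma(value, q, self)
}
```

The key property is that `fmaf` rounds only once after the multiply-add, which is what lets the fused Triton kernel match the bit pattern of the eager CUDA kernel.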
This PR modifies all of the binary workflows to use the proper working directory .../pytorch/pytorch rather than .../pytorch/pytorch/pytorch. Pull Request resolved: #174290 Approved by: https://github.com/atalman Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Summary: Extract the wrapper function signature generation code into reusable _get_cpp_param_type and _write_wrapper_signature methods on DeferredTritonCallWrapper, which will be reused by the lazy kernel compilation. Pull Request resolved: #175414 Approved by: https://github.com/PaulZhang12
Summary: Replace ''' with """ in cpp_wrapper_src, so that later lazy-kernel compilation can emit Triton kernel source code which contains '''. Pull Request resolved: #175415 Approved by: https://github.com/PaulZhang12 ghstack dependencies: #175414
…5415)" This reverts commit 1f66f34. Reverted #175415 on behalf of https://github.com/desertfire due to lint error ([comment](#175415 (comment)))
#175414)" This reverts commit 14b6f21. Reverted #175414 on behalf of https://github.com/desertfire due to lint error ([comment](#175414 (comment)))
… break (#174435) Fixes #167009 When FSDP2 is applied to child modules (like `nn.Linear` layers) and then to the root module, torch.compile fails to trace the sharded `nn.Linear` layers: the linear operations run eagerly and are not captured in the compiled graphs. I tested enabling `config.wrap_top_frame`, but this does not fix the issue because it only adds an extra frame, while this graph break can happen in lower frames. Perf benefits depend on the use case. If the child module is a simple op (e.g. `nn.Linear`), no improvement is expected. However, if it is more substantial, such as a `nn.TransformerEncoder`, compiling that region may be more beneficial. Pull Request resolved: #174435 Approved by: https://github.com/anijain2305, https://github.com/Lucaskabela
Fwd fix for #174936 Though it uncovered an interesting issue with atomic ops and non-contiguous outputs (this is why float16 types are not tested right now) Pull Request resolved: #175991 Approved by: https://github.com/huydhn, https://github.com/atalman
**Summary:**
This PR optimizes the `radixSelect` kernel on ROCm by reducing synchronization overhead when aggregating radix counts across warps. The previous implementation used 3 block-level `__syncthreads()` calls plus atomic operations on 4 radix buckets (contended by all warps). The new implementation uses 2 `__syncthreads()` calls with no atomic contention, reducing synchronization overhead and improving performance.
**Background:**
The `radixSelect` algorithm finds the k-th element by iteratively uncovering its bit pattern through multiple passes over the data. Each pass determines 2 bits of the top-k value's bit pattern (up to 16 passes for float32). Each iteration involves:
1. Counting input elements that match the already uncovered pattern
2. Grouping them by radix bucket (4 buckets per iteration)
3. Aggregating counts across all warps
4. Broadcasting the aggregated counts back to all threads
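The passes above can be modeled single-threaded on the CPU. This is a hypothetical sketch for 8-bit keys with 2-bit digits (4 passes), not the GPU kernel: each pass counts the candidates still matching the uncovered prefix, grouped into 4 radix buckets, and then fixes two more bits of the k-th largest value.

```cpp
#include <cstdint>
#include <vector>

// Find the k-th largest (k is 1-based) 8-bit value by uncovering
// 2 bits per pass, mirroring the radixSelect structure described above.
uint8_t radix_select_kth_largest(const std::vector<uint8_t>& data, int k) {
  uint8_t prefix = 0, mask = 0;  // bits of the answer uncovered so far
  for (int shift = 6; shift >= 0; shift -= 2) {  // 4 passes of 2 bits
    int counts[4] = {0, 0, 0, 0};
    for (uint8_t v : data)
      if ((v & mask) == prefix)          // step 1: match uncovered pattern
        counts[(v >> shift) & 3]++;      // step 2: group by radix bucket
    for (int d = 3; d >= 0; --d) {       // scan buckets, largest first
      if (k <= counts[d]) {              // answer lives in bucket d
        prefix |= uint8_t(d) << shift;   // fix two more bits
        mask |= uint8_t(3) << shift;
        break;
      }
      k -= counts[d];                    // skip the whole bucket
    }
  }
  return prefix;
}
```

Steps 3 and 4 (aggregating counts across warps and broadcasting them back) are exactly what collapses to the single `counts` array in this serial model; they are where the GPU kernel spends its synchronization budget.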
**Previous Implementation:**
The original sequence for each iteration was:
```cpp
initialize smem[RadixSize] to 0
__syncthreads()                    // Sync 1
count within warp
if (lane_id == 0) {
  atomicAdd(&smem[i], counts[i])   // Atomic contention on 4 buckets
}
__syncthreads()                    // Sync 2
read back total counts from smem
__syncthreads()                    // Sync 3
```
This involved **3 synchronizations** and **atomic contention** on 4 buckets from all warps.
**Changes:**
* **Warp-level reduction without atomics:**
- Each warp's lane 0 writes its counts to a dedicated location in shared memory
- Warp 0's lanes perform parallel reduction: each lane reduces one radix bin across all warps
- This eliminates atomic contention while maintaining correctness
* **Double-buffering for concurrent iterations:**
- Observation: Due to block-level synchronization, at most two consecutive iterations can be in-flight simultaneously
- When threads are in "section 2" (post-sync) of iteration `i`, other threads can only reach "section 1" (pre-sync) of iteration `i+1` and wait there
- We use `buffer_index` (0 or 1) to alternate between two shared memory segments, allowing safe concurrent execution
- This enables removing the first and last `__syncthreads()` calls: two of the original three synchronizations go away, and one is added for the new warp-level aggregation step that replaces the atomics, so the net count drops from 3 to 2 per iteration
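The new aggregation step can be sketched as a CPU model (hypothetical sizes: 8 warps, 4 radix buckets; in the kernel the column sums run in parallel, one per lane of warp 0):

```cpp
#include <array>

constexpr int NUM_WARPS = 8;
constexpr int RADIX_SIZE = 4;

// smem[w][r]: warp w's private count for radix bucket r. Each warp's
// lane 0 writes its row without touching any other warp's row, so no
// atomics are needed.
using Smem = std::array<std::array<int, RADIX_SIZE>, NUM_WARPS>;

std::array<int, RADIX_SIZE> aggregate(const Smem& smem) {
  std::array<int, RADIX_SIZE> totals{};
  // In the kernel, lane r of warp 0 owns bucket r and sums its column
  // across all warps; this serial loop stands in for those lanes.
  for (int r = 0; r < RADIX_SIZE; ++r)
    for (int w = 0; w < NUM_WARPS; ++w)
      totals[r] += smem[w][r];
  return totals;
}
```

Because each warp writes a disjoint row and each reducing lane reads a disjoint column, the only ordering requirement is one barrier between the writes and the reduction, which is how the atomicAdd contention disappears.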
**Performance:**
Measured on AMD MI350 (gfx950) using single-block TopK operator, where RadixSelect accounts for ~80% of total latency for typical workloads.
- **Smaller datatypes (bfloat16, float16):** 4-5% improvement on smaller inputs, ~1% on larger inputs
- **float32:** Similar improvements, slightly less pronounced
- **Average improvement:** ~2% (weighted by larger input latencies)
**Testing:**
- Verified correctness across multiple data types (float32, float16, bfloat16) and input shapes
- Tested with various K values to ensure correct behavior across all iteration counts
- Performance benchmarks included below
<img width="2307" height="1537" alt="topk_latency_comparison" src="https://github.com/user-attachments/assets/1d0b8428-055a-4fa9-b3b0-31427021adf9" />
**Testing code:**
- benchmark code: [code](https://github.com/user-attachments/files/24484540/benchmark.py)
Pull Request resolved: #174837
Approved by: https://github.com/jeffdaily
…5684)" This reverts commit 2931704. Reverted #175684 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, see D94404441 ([comment](#175684 (comment)))
…de HOP (#175329)" This reverts commit 8a65c3e. Reverted #175329 on behalf of https://github.com/jeanschmidt due to Breaks internal signals, see D94376303 ([comment](#175329 (comment)))
Summary: The consolidation script runs slowly on remote mounts: an identical model takes ~7s on local disk but ~2h on a remote mount. With these changes the script now runs in 4s locally and 35s with the mount. The problem lay with the use of mmap; each read of a tensor shard results in a page fault, after which the page must be read over the network, resulting in a large number of network reads.
1. Explicit reads instead of mmap: replaced with f.seek and f.read
2. File handle caching: lifting the open call outside
3. Metadata caching: compute _get_dcp_custom_metadata once per shard and reuse the result

Test Plan: Diffed consolidated output files from 3b and 70b models against the original benchmark script; no changes in output. Timing results: <img width="754" height="306" alt="image" src="https://github.com/user-attachments/assets/4613583e-18f7-40a3-b9c7-bbedf9aa2155" /> Pull Request resolved: #175762 Approved by: https://github.com/santhanamhari, https://github.com/ankitageorge
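The first two changes (explicit seek+read, cached file handles) can be sketched as follows; `read_tensor_bytes` and the cache are hypothetical illustration names, not the script's actual API:

```cpp
#include <fstream>
#include <map>
#include <string>
#include <vector>

// One cached handle per shard file; each tensor's bytes are fetched with
// an explicit seek+read. Unlike mmap, this issues a single sized read per
// tensor instead of one page fault per touched page, which is what makes
// the difference on a network mount.
static std::map<std::string, std::ifstream> g_handles;  // handle cache

std::vector<char> read_tensor_bytes(const std::string& path,
                                    std::streamoff offset, size_t length) {
  auto it = g_handles.find(path);
  if (it == g_handles.end())  // open once, reuse for every shard read
    it = g_handles.emplace(path, std::ifstream(path, std::ios::binary)).first;
  std::vector<char> buf(length);
  it->second.seekg(offset);
  it->second.read(buf.data(), static_cast<std::streamsize>(length));
  return buf;
}
```

The third change follows the same theme: computing `_get_dcp_custom_metadata` once per shard removes another per-tensor round trip over the mount.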
Adding the space makes the warning clearer IMO Pull Request resolved: #175998 Approved by: https://github.com/soulitzer
…5676)" This reverts commit fa1ac62. Reverted #175676 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see [D94699653](https://www.internalfb.com/diff/D94699653) ([comment](#175676 (comment)))
This reverts commit 2958025. Reverted #175765 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see D94699609 ([comment](#175765 (comment)))
#174751)" This reverts commit 1b9046a. Reverted #174751 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see D94699526 ([comment](#174933 (comment)))
This reverts commit ba59c42. Reverted #174933 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see D94699526 ([comment](#174933 (comment)))
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )