[pull] main from pytorch:main#912
Merged
pull[bot] merged 22 commits into MPACT-ORG:main from pytorch:main on Feb 27, 2026
Conversation
The `global_opaque_types` unordered_set in schema_type_parser.cpp was accessed without any synchronization. RCCL initialization spawns background C++ threads that trigger JIT schema parsing (calling isRegisteredOpaqueType), which reads `global_opaque_types`, while the main Python thread concurrently writes to the set via registerOpaqueType at module import time.
This read+write data race on std::unordered_set is undefined behavior and can corrupt the hash table, causing a thread to spin indefinitely in hash collision resolution. Because the hang occurs inside a RCCL collective on a C++ thread, Python-level per-test timeouts cannot fire, so the entire distributed test shard runs until the 270-minute job-level timeout is hit.
Fix: protect all access to global_opaque_types with a mutex. The mutex itself uses a function-static to ensure thread-safe initialization (C++11 guarantee). This also covers any future callers of registerOpaqueType that run at import time concurrently with collective initialization.
```
Python thread (main):
  import torch.distributed.tensor
  -> _register_distributed_opaque_types()
  -> torch._C._register_opaque_type("torch._C._distributed_c10d.ProcessGroup")
  -> registerOpaqueType()
  -> unordered_set::insert()   (on `global_opaque_types`)

RCCL C++ thread:
  init_process_group()
  -> isRegisteredOpaqueType()
  -> unordered_set::find()     (on `global_opaque_types`)
```
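The locking scheme described above can be sketched as a minimal standalone model. The names mirror the PR description (`registerOpaqueType`, `isRegisteredOpaqueType`, `global_opaque_types`), but this is a hedged sketch, not the actual PyTorch source:

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

namespace {

// Sketch of the guarded global set; not the real schema_type_parser.cpp.
std::unordered_set<std::string>& opaqueTypes() {
  static std::unordered_set<std::string> global_opaque_types;
  return global_opaque_types;
}

// Function-static mutex: C++11 guarantees thread-safe initialization,
// so the lock itself is safe even for the very first caller.
std::mutex& opaqueTypesMutex() {
  static std::mutex m;
  return m;
}

}  // namespace

void registerOpaqueType(const std::string& qualName) {
  std::lock_guard<std::mutex> guard(opaqueTypesMutex());
  opaqueTypes().insert(qualName);  // writer: Python import path
}

bool isRegisteredOpaqueType(const std::string& qualName) {
  std::lock_guard<std::mutex> guard(opaqueTypesMutex());
  // reader: JIT schema parsing on RCCL background threads
  return opaqueTypes().count(qualName) != 0;
}
```

Because every reader and writer takes the same lock, the unordered_set's buckets can no longer be observed mid-rehash, which is what previously let a reader spin in collision resolution.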
Pull Request resolved: #175694
Approved by: https://github.com/jeffdaily
Fixes #168208

### Summary
Disable the low-precision (fp16) check in `test_tield_kernel_fusion` to stop intermittent CI failures, especially on ROCm. The failures come from the low-precision (fp16) comparison, not the main float32 check. Benchmark fusion can choose different fusion strategies between runs, and those small differences show up more often when comparing fp16 results.

### Fix
Set `check_lowp=False` to skip the fp16 comparison. This follows the pattern used in other inductor tests that disable low-precision checks due to fusion precision issues (e.g., `test_cumsum_zero_dim`, `test_cumprod_zero_dim`, `test_split_cumsum_low_prec`).

### Testing
`PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_benchmark_fusion.py BenchmarkFusionGpuTest.test_tield_kernel_fusion_cuda` - Test passed on MI300.

Pull Request resolved: #175840
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: #174347 Approved by: https://github.com/pytorchbot
We cannot run all the test suites in the `default` config using just 2 shards; it times out. However, due to insufficient Navi31 CI capacity, we cannot increase the shards per run either. Hence, reducing the test set to core unit tests unconditionally. Test plan: [rocm-navi31](https://github.com/pytorch/pytorch/actions/runs/22411229116/job/64893098527?pr=175770) passed Pull Request resolved: #175770 Approved by: https://github.com/jeffdaily
This reverts commit c5ca4f7. Reverted #175769 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#175769 (comment)))
Add a custom lowering for aten.addcdiv that uses FMA (fused multiply-add) to match the precision of ATen's CUDA kernel. The lowering computes `self + value * (tensor1 / tensor2)`:
- For value == 1: `self + tensor1 / tensor2` (simple add)
- For value != 1: `fma(value, tensor1 / tensor2, self)`

Uses truediv, which internally uses div_rn (round-to-nearest division) when `config.eager_numerics.division_rounding` is True, ensuring the division result is properly rounded before the FMA. This allows addcdiv operations to be fused into Triton kernels while maintaining bitwise parity with eager execution. Authored with Claude. Pull Request resolved: #174912 Approved by: https://github.com/v0i0
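The two code paths of the lowering can be illustrated with a hypothetical scalar model (the real lowering emits Triton IR; `addcdiv_scalar` is an illustration name, not a PyTorch function):

```cpp
#include <cmath>

// Scalar model of the addcdiv lowering described above:
// round-to-nearest division first, then either a plain add (value == 1)
// or a single fused multiply-add (value != 1).
float addcdiv_scalar(float self, float t1, float t2, float value) {
  float q = t1 / t2;                 // div_rn: correctly rounded division
  if (value == 1.0f)
    return self + q;                 // value == 1: simple add
  return std::fmaf(value, q, self);  // value != 1: fma(value, q, self)
}
```

The key property is that `fmaf` rounds only once after the multiply-add, which is what lets the fused Triton kernel match the bit pattern of the eager CUDA kernel.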
This PR modifies all of the binary workflows to use the proper working directory .../pytorch/pytorch rather than .../pytorch/pytorch/pytorch. Pull Request resolved: #174290 Approved by: https://github.com/atalman Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Summary: Extract the wrapper function signature generation code into reusable _get_cpp_param_type and _write_wrapper_signature methods on DeferredTritonCallWrapper, which will be reused by the lazy kernel compilation. Pull Request resolved: #175414 Approved by: https://github.com/PaulZhang12
Summary: Replace ''' with """ in cpp_wrapper_src, so that later lazy-kernel compilation can emit Triton kernel source code which contains '''. Pull Request resolved: #175415 Approved by: https://github.com/PaulZhang12 ghstack dependencies: #175414
…5415)" This reverts commit 1f66f34. Reverted #175415 on behalf of https://github.com/desertfire due to lint error ([comment](#175415 (comment)))
#175414)" This reverts commit 14b6f21. Reverted #175414 on behalf of https://github.com/desertfire due to lint error ([comment](#175414 (comment)))
… break (#174435) Fixes #167009 When FSDP2 is applied to child modules (like `nn.Linear` layers) and then to the root module, torch.compile fails to trace the sharded `nn.Linear` layers: the linear operations run eagerly and are not captured in the compiled graphs. I tested enabling `config.wrap_top_frame`, but this does not fix the issue because it only adds an extra frame, while this graph break can happen in lower frames. Perf benefits depend on the use case. If the child module is a simple op (e.g. `nn.Linear`), no improvement is expected. However, if it is more substantial, such as a `nn.TransformerEncoder`, compiling that region may be more beneficial. Pull Request resolved: #174435 Approved by: https://github.com/anijain2305, https://github.com/Lucaskabela
Fwd fix for #174936 Though it uncovered an interesting issue with atomic ops and non-contiguous outputs (this is why float16 types are not tested right now) Pull Request resolved: #175991 Approved by: https://github.com/huydhn, https://github.com/atalman
**Summary:**
This PR optimizes the `radixSelect` kernel on ROCm by reducing synchronization overhead when aggregating radix counts across warps. The previous implementation used 3 block-level `__syncthreads()` calls plus atomic operations on 4 radix buckets (contended by all warps). The new implementation uses 2 `__syncthreads()` calls with no atomic contention, reducing synchronization overhead and improving performance.
**Background:**
The `radixSelect` algorithm finds the k-th element by iteratively uncovering its bit pattern through multiple passes over the data. Each pass determines 2 bits of the top-k value's bit pattern (up to 16 passes for float32). Each iteration involves:
1. Counting input elements that match the already uncovered pattern
2. Grouping them by radix bucket (4 buckets per iteration)
3. Aggregating counts across all warps
4. Broadcasting the aggregated counts back to all threads
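The passes above can be modeled single-threaded on the CPU. This is a hypothetical sketch for 8-bit keys with 2-bit digits (4 passes), not the GPU kernel: each pass counts the candidates still matching the uncovered prefix, grouped into 4 radix buckets, and then fixes two more bits of the k-th largest value.

```cpp
#include <cstdint>
#include <vector>

// Find the k-th largest (k is 1-based) 8-bit value by uncovering
// 2 bits per pass, mirroring the radixSelect structure described above.
uint8_t radix_select_kth_largest(const std::vector<uint8_t>& data, int k) {
  uint8_t prefix = 0, mask = 0;  // bits of the answer uncovered so far
  for (int shift = 6; shift >= 0; shift -= 2) {  // 4 passes of 2 bits
    int counts[4] = {0, 0, 0, 0};
    for (uint8_t v : data)
      if ((v & mask) == prefix)          // step 1: match uncovered pattern
        counts[(v >> shift) & 3]++;      // step 2: group by radix bucket
    for (int d = 3; d >= 0; --d) {       // scan buckets, largest first
      if (k <= counts[d]) {              // answer lives in bucket d
        prefix |= uint8_t(d) << shift;   // fix two more bits
        mask |= uint8_t(3) << shift;
        break;
      }
      k -= counts[d];                    // skip the whole bucket
    }
  }
  return prefix;
}
```

Steps 3 and 4 (aggregating counts across warps and broadcasting them back) are exactly what collapses to the single `counts` array in this serial model; they are where the GPU kernel spends its synchronization budget.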
**Previous Implementation:**
The original sequence for each iteration was:
```cpp
initialize smem[RadixSize] to 0
__syncthreads()                    // Sync 1
count within warp
if (lane_id == 0) {
  atomicAdd(&smem[i], counts[i])   // Atomic contention on 4 buckets
}
__syncthreads()                    // Sync 2
read back total counts from smem
__syncthreads()                    // Sync 3
```
This involved **3 synchronizations** and **atomic contention** on 4 buckets from all warps.
**Changes:**
* **Warp-level reduction without atomics:**
- Each warp's lane 0 writes its counts to a dedicated location in shared memory
- Warp 0's lanes perform parallel reduction: each lane reduces one radix bin across all warps
- This eliminates atomic contention while maintaining correctness
* **Double-buffering for concurrent iterations:**
- Observation: Due to block-level synchronization, at most two consecutive iterations can be in-flight simultaneously
- When threads are in "section 2" (post-sync) of iteration `i`, other threads can only reach "section 1" (pre-sync) of iteration `i+1` and wait there
- We use `buffer_index` (0 or 1) to alternate between two shared memory segments, allowing safe concurrent execution
- This enables removing the first and last `__syncthreads()` calls: two of the original three synchronizations go away, and one is added for the new warp-level aggregation step that replaces the atomics, so the net count drops from 3 to 2 per iteration
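The new aggregation step can be sketched as a CPU model (hypothetical sizes: 8 warps, 4 radix buckets; in the kernel the column sums run in parallel, one per lane of warp 0):

```cpp
#include <array>

constexpr int NUM_WARPS = 8;
constexpr int RADIX_SIZE = 4;

// smem[w][r]: warp w's private count for radix bucket r. Each warp's
// lane 0 writes its row without touching any other warp's row, so no
// atomics are needed.
using Smem = std::array<std::array<int, RADIX_SIZE>, NUM_WARPS>;

std::array<int, RADIX_SIZE> aggregate(const Smem& smem) {
  std::array<int, RADIX_SIZE> totals{};
  // In the kernel, lane r of warp 0 owns bucket r and sums its column
  // across all warps; this serial loop stands in for those lanes.
  for (int r = 0; r < RADIX_SIZE; ++r)
    for (int w = 0; w < NUM_WARPS; ++w)
      totals[r] += smem[w][r];
  return totals;
}
```

Because each warp writes a disjoint row and each reducing lane reads a disjoint column, the only ordering requirement is one barrier between the writes and the reduction, which is how the atomicAdd contention disappears.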
**Performance:**
Measured on AMD MI350 (gfx950) using single-block TopK operator, where RadixSelect accounts for ~80% of total latency for typical workloads.
- **Smaller datatypes (bfloat16, float16):** 4-5% improvement on smaller inputs, ~1% on larger inputs
- **float32:** Similar improvements, slightly less pronounced
- **Average improvement:** ~2% (weighted by larger input latencies)
**Testing:**
- Verified correctness across multiple data types (float32, float16, bfloat16) and input shapes
- Tested with various K values to ensure correct behavior across all iteration counts
- Performance benchmarks included below
<img width="2307" height="1537" alt="topk_latency_comparison" src="https://github.com/user-attachments/assets/1d0b8428-055a-4fa9-b3b0-31427021adf9" />
**Testing code:**
- benchmark code: [code](https://github.com/user-attachments/files/24484540/benchmark.py)
Pull Request resolved: #174837
Approved by: https://github.com/jeffdaily
…5684)" This reverts commit 2931704. Reverted #175684 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, see D94404441 ([comment](#175684 (comment)))
…de HOP (#175329)" This reverts commit 8a65c3e. Reverted #175329 on behalf of https://github.com/jeanschmidt due to Breaks internal signals, see D94376303 ([comment](#175329 (comment)))
Summary: The consolidation script runs slowly on remote mounts: an identical model takes ~7s on local disk but ~2h on a remote mount. With these changes the script now runs in 4s locally and 35s with the mount. The problem lay with the use of mmap; each read of a tensor shard results in a page fault, after which the page must be read over the network, resulting in a large number of network reads.
1. Explicit reads instead of mmap: replaced with f.seek and f.read
2. File handle caching: lifting the open call outside
3. Metadata caching: compute _get_dcp_custom_metadata once per shard and reuse the result

Test Plan: Diffed consolidated output files from 3b and 70b models against the original benchmark script; no changes in output. Timing results: <img width="754" height="306" alt="image" src="https://github.com/user-attachments/assets/4613583e-18f7-40a3-b9c7-bbedf9aa2155" /> Pull Request resolved: #175762 Approved by: https://github.com/santhanamhari, https://github.com/ankitageorge
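The first two changes (explicit seek+read, cached file handles) can be sketched as follows; `read_tensor_bytes` and the cache are hypothetical illustration names, not the script's actual API:

```cpp
#include <fstream>
#include <map>
#include <string>
#include <vector>

// One cached handle per shard file; each tensor's bytes are fetched with
// an explicit seek+read. Unlike mmap, this issues a single sized read per
// tensor instead of one page fault per touched page, which is what makes
// the difference on a network mount.
static std::map<std::string, std::ifstream> g_handles;  // handle cache

std::vector<char> read_tensor_bytes(const std::string& path,
                                    std::streamoff offset, size_t length) {
  auto it = g_handles.find(path);
  if (it == g_handles.end())  // open once, reuse for every shard read
    it = g_handles.emplace(path, std::ifstream(path, std::ios::binary)).first;
  std::vector<char> buf(length);
  it->second.seekg(offset);
  it->second.read(buf.data(), static_cast<std::streamsize>(length));
  return buf;
}
```

The third change follows the same theme: computing `_get_dcp_custom_metadata` once per shard removes another per-tensor round trip over the mount.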
Adding the space makes the warning clearer IMO Pull Request resolved: #175998 Approved by: https://github.com/soulitzer
…5676)" This reverts commit fa1ac62. Reverted #175676 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see [D94699653](https://www.internalfb.com/diff/D94699653) ([comment](#175676 (comment)))
This reverts commit 2958025. Reverted #175765 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see D94699609 ([comment](#175765 (comment)))
#174751)" This reverts commit 1b9046a. Reverted #174751 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see D94699526 ([comment](#174933 (comment)))
This reverts commit ba59c42. Reverted #174933 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see D94699526 ([comment](#174933 (comment)))
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )