[Revert] Revert PR #79240 and #79170 #79249
try:
    from paddle.distributed.communication import deep_ep
except ImportError:
    deep_ep = None
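The guarded import above leaves `deep_ep` set to `None` when the extension is not built, so any call site needs a check before use. A minimal sketch of that pattern follows; `optional_import` and `require` are hypothetical helpers written for illustration and are not part of Paddle.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package or a missing submodule both end up here.
        return None


def require(module: Optional[ModuleType], name: str) -> ModuleType:
    """Raise a clear error instead of an AttributeError on None."""
    if module is None:
        raise RuntimeError(f"{name} is not available in this build")
    return module
```

A caller could then write `deep_ep = optional_import("paddle.distributed.communication.deep_ep")` and call `require(deep_ep, "deep_ep")` at the point of use, so a build without DeepEP fails with an explicit message rather than on attribute access.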
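CI report generated from the following code (refreshed every 30 minutes):
1 Required jobs: 42/46 passed
2 Failure details
🟡 Slice / Slice test: flaky (confidence: medium). Analyzer: generic analysis (fallback)
Failed cases:
Key logs:
Suggested fix:
Related changes:
⚪ Linux-CPU / Build and test: unknown (confidence: low). Analyzer: generic analysis (fallback)
Failed cases:
Key logs:
Suggested fix:
Related changes: The 4 CMake files changed in this PR were checked; no direct link to the Linux-CPU Test failure has been established so far.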
Codecov Report
❌ Your patch status has failed because the patch coverage (29.62%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.
@@ Coverage Diff @@
## develop #79249 +/- ##
==========================================
Coverage ? 29.62%
==========================================
Files ? 1
Lines ? 27
Branches ? 0
==========================================
Hits ? 8
Misses ? 19
Partials ? 0
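The reported patch coverage is consistent with Hits / Lines from the table above: 8 / 27 ≈ 0.2963. The sketch below reproduces that arithmetic. Truncating to two decimals is an assumption made only to match the displayed 29.62%; it is not documented Codecov behavior, and the function names are hypothetical.

```python
def patch_coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated to two decimals via integer math."""
    if lines <= 0:
        raise ValueError("lines must be positive")
    # hits * 10000 // lines yields the percentage in hundredths, truncated.
    return (hits * 10000 // lines) / 100


def meets_target(hits: int, lines: int, target_percent: float) -> bool:
    """True when the truncated coverage reaches the target percentage."""
    return patch_coverage_percent(hits, lines) >= target_percent
```

With the values from the report, `patch_coverage_percent(8, 27)` gives 29.62 and `meets_target(8, 27, 90.0)` is false, which matches the failed patch status.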
6023938 to fe3c6ca
107188b to 1a6b59e
1a6b59e to 2c009d1
2c009d1 to a145a29
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-06-05 20:17:41
📋 Review summary
PR overview: Reverts PR #79240 and #79170, restoring the deep_ep / FP8 / NVSHMEM CMake build configuration to its previous state and fixing the deep_ep import error.
Scope of changes: cmake/third_party.cmake, paddle/fluid/CMakeLists.txt, paddle/fluid/distributed/collective/CMakeLists.txt, paddle/fluid/pybind/CMakeLists.txt
Impact tags: [Environment Adaptation] [Distributed Strategy]
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | paddle/fluid/pybind/CMakeLists.txt:153 | Uses CUDA_ARCH_BIN rather than COMPILED_CUDA_ARCHS, which is inconsistent with collective/CMakeLists.txt. In non-Manual mode or on sm_100+, deep_ep is compiled but pybind does not link it, so the import fails |
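Status of historical findings
| Finding | Issue | Status |
|---|---|---|
| F1 | In dualpipev.py, get_event_from_calc_stream and related functions are dead code | |
| F2 | CUDA_ARCH_BIN is empty in non-Manual mode, which causes WITH_NVSHMEM to be turned off incorrectly | |
| F3 | The if(NOT WITH_XPU) guard is missing, so under XPU builds NVSHMEM availability is affected by the GPU/ARCH constraints | |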
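📝 PR convention check
The previously cached suggestions targeted an earlier version (the DualPipeV overlap fix) and no longer apply to the current Revert PR; new suggestions follow.
In the current title, [Revert] Revert PR #79240 and #79170, the tag [Revert] is not in Paddle's official tag enumeration (PR Category / PR Types). Replacing it with an official tag is recommended.
Suggested title (can be copied directly):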
[Environment Adaptation] Revert PR #79240 and #79170
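The title check described here can be sketched as a membership test on the leading bracketed tag. The tag values are copied from the PR Category and PR Types lists in the PR template quoted in this review; the function names are hypothetical, and this is not the bot's actual implementation.

```python
import re
from typing import Optional

# Values taken from the PR template's "PR Category" and "PR Types" comments.
PR_CATEGORIES = {
    "User Experience", "Execute Infrastructure", "Operator Mechanism", "CINN",
    "Custom Device", "Performance Optimization", "Distributed Strategy",
    "Parameter Server", "Communication Library", "Auto Parallel", "Inference",
    "Environment Adaptation",
}
PR_TYPES = {
    "New features", "Bug fixes", "Improvements", "Performance", "BC Breaking",
    "Deprecations", "Docs", "Devs", "Not User Facing", "Security", "Others",
}
OFFICIAL_TAGS = PR_CATEGORIES | PR_TYPES


def leading_tag(title: str) -> Optional[str]:
    """Return the text inside the first [..] at the start of the title."""
    match = re.match(r"\s*\[([^\]]+)\]", title)
    return match.group(1).strip() if match else None


def has_official_tag(title: str) -> bool:
    """True when the title starts with a tag from the template's lists."""
    return leading_tag(title) in OFFICIAL_TAGS
```

Under this sketch the current title is rejected because `Revert` is in neither list, while the suggested title passes because `Environment Adaptation` is a PR Category.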
Suggested PR description (can be copied directly):
### PR Category
<!-- One of [ User Experience | Execute Infrastructure | Operator Mechanism | CINN | Custom Device | Performance Optimization | Distributed Strategy | Parameter Server | Communication Library | Auto Parallel | Inference | Environment Adaptation ] -->
Environment Adaptation
### PR Types
<!-- One of [ New features | Bug fixes | Improvements | Performance | BC Breaking | Deprecations | Docs | Devs | Not User Facing | Security | Others ] -->
Others
### Description
<!-- Describe what you've done -->
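Revert PR #79240 and #79170 to fix the deep_ep import error.
Reverted changes:
- cmake/third_party.cmake: restore the original WITH_NVSHMEM / flashattn decision logic
- paddle/fluid/CMakeLists.txt: restore the FP8 module build conditions
- paddle/fluid/distributed/collective/CMakeLists.txt: restore the DeepEP build
- paddle/fluid/pybind/CMakeLists.txt: restore the deep_ep pybind bindings
### Does this cause a precision change
<!-- one of the following [ Yes | No ]-->
No
Overall assessment
The purpose of this revert is clear (fixing the deep_ep import error). One point needs attention: pybind/CMakeLists.txt uses CUDA_ARCH_BIN while collective/CMakeLists.txt uses COMPILED_CUDA_ARCHS, so in non-Manual mode or on sm_100+ architectures there is a risk that deep_ep is compiled but not linked into pybind. The historical F2/F3 issues are reintroduced by the revert and should be fixed in a follow-up PR.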
# set(PYBIND_DEPS ${PYBIND_DEPS} nvshmem)
# endif()
#endif()
string(FIND "${CUDA_ARCH_BIN}" "90" ARCH_BIN_CONTAINS_90)
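🟡 Suggestion: CUDA_ARCH_BIN is inconsistent with the COMPILED_CUDA_ARCHS variable used in collective/CMakeLists.txt.
paddle/fluid/distributed/collective/CMakeLists.txt uses COMPILED_CUDA_ARCHS with the regex 9[0-9]|1[0-9][0-9] to decide whether to compile deep_ep (covering sm90 through sm100+), whereas this file uses CUDA_ARCH_BIN with FIND "90". As a result:
- In a non-Manual-mode build, CUDA_ARCH_BIN is empty → deep_ep has already been compiled, but pybind does not link it here, so import deep_ep fails at runtime because the symbols cannot be found.
- When the target architecture is sm_100 (Blackwell), CUDA_ARCH_BIN="100" does not contain "90" → collective compiles deep_ep, but pybind does not link it, which likewise causes the import to fail.
Suggested change, consistent with collective/CMakeLists.txt: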
string(REGEX MATCH "9[0-9]|1[0-9][0-9]" ARCH_BIN_CONTAINS_GE_90 "${COMPILED_CUDA_ARCHS}")
if(WITH_GPU AND ARCH_BIN_CONTAINS_GE_90)
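The mismatch described in this comment can be reproduced outside CMake. The sketch below models the two checks in Python: `str.find` stands in for `string(FIND ...)`, which yields -1 when the substring is absent, and `re.search` stands in for `string(REGEX MATCH ...)`. The function names and the sample architecture strings are assumptions for illustration, not values taken from a real build.

```python
import re

# Same pattern the review quotes from collective/CMakeLists.txt.
GE_90_PATTERN = re.compile(r"9[0-9]|1[0-9][0-9]")


def pybind_links_deep_ep(cuda_arch_bin: str) -> bool:
    """Model of the pybind check: substring search for "90"."""
    return cuda_arch_bin.find("90") != -1


def collective_builds_deep_ep(compiled_cuda_archs: str) -> bool:
    """Model of the collective check: regex match for sm90 and above."""
    return GE_90_PATTERN.search(compiled_cuda_archs) is not None


def is_mismatched(cuda_arch_bin: str, compiled_cuda_archs: str) -> bool:
    """True when deep_ep would be built but not linked into pybind."""
    return (collective_builds_deep_ep(compiled_cuda_archs)
            and not pybind_links_deep_ep(cuda_arch_bin))


# Hypothetical inputs covering the two cases named in the review.
print(is_mismatched("", "90"))      # non-Manual mode: CUDA_ARCH_BIN empty
print(is_mismatched("100", "100"))  # sm_100 target
print(is_mismatched("90", "90"))    # sm_90 in Manual mode: checks agree
```

Both problem cases report a mismatch, while the sm_90 Manual-mode case does not, which is why applying the same variable and pattern in both files removes the inconsistency.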
PR Category
Environment Adaptation
PR Types
Others
Description
Revert PR #79240 and #79170
Does this cause a precision change
No