[Revert] Revert PR #79240 and #79170 #79249

Merged
risemeup1 merged 1 commit into PaddlePaddle:develop from risemeup1:fix_deep_ep_import_bug on Jun 8, 2026

Conversation

risemeup1 (Contributor) commented Jun 4, 2026

PR Category

Environment Adaptation

PR Types

Others

Description

Revert PR #79240 and #79170.

Does this cause a precision change?

CLAassistant commented Jun 4, 2026

CLA assistant check
All committers have signed the CLA.

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ risemeup1
❌ name


name seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

PaddlePaddle-bot

This comment was marked as outdated.

Comment on lines 28 to 31
try:
    from paddle.distributed.communication import deep_ep
except ImportError:
    deep_ep = None

Member

Can this block be cleaned up now?

SigureMo previously approved these changes Jun 4, 2026

PaddlePaddle-bot commented Jun 4, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-05 22:22:28

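The CI report is generated from the following code (updated every 30 minutes):
PR commit: a145a29 | Merge base: 29ae6bb (branch: develop)

1 Required tasks: 42/46 passed

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 79 (0) | 79 | 73 | 3 | 2 | 0 | 1 |

| Task | Error type | Confidence | Log |
|---|---|---|---|
| Slice / Slice test | Flaky | Medium | Job |
| Linux-CPU / Build and test | Unknown | Low | Job |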

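2 Failure details

🟡 Slice / Slice test: flaky (confidence: medium)

Analyzer: generic analysis (fallback)

Failed cases:

| Case | Error summary |
|---|---|
| Getitem - forward - Slice - Slice with Step - float16 - paddle | The Slice benchmark regressed by about 15.5% against the baseline, which failed the slice test |

Key logs: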

slice测试失败, 存在性能下降case, 失败case性能变化: {'Getitem - forward - Slice - Slice with Step - float16 - paddle': -0.1554505282353383}
File "/paddle/PaddleTest/framework/slice_benchmark/run.py", line 164, in ci_test
  raise Exception("slice测试失败")
Exception: slice测试失败
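  • Root cause summary: regression in a single Slice performance benchmark case
    The log shows that a single float16 slice-with-step forward case in the Paddle Slice benchmark regressed by about 15.5% relative to the baseline, so ci_test() in PaddleTest/framework/slice_benchmark/run.py raised an exception. The PR's changed files were read and the related context searched: this PR only modifies CMake build switches related to NVSHMEM, FlashAttention, FP8, and DeepEP, and does not touch the Slice benchmark, the Slice Python API, or the Slice kernel implementation. The current evidence points to benchmark fluctuation or instability rather than a functional regression introduced directly by this PR.

Suggested fixes:

  1. The performance benchmark appears to be flaky; rerun Slice / Slice test first to verify.
  2. If the failure persists after the rerun, retest and profile Getitem - forward - Slice - Slice with Step - float16 - paddle locally; this PR currently has no clear Slice-related code to fix.

Related changes: cmake/third_party.cmake, paddle/fluid/CMakeLists.txt, paddle/fluid/distributed/collective/CMakeLists.txt, paddle/fluid/pybind/CMakeLists.txt (all related to build switches, DeepEP, FP8, FlashAttention, and NVSHMEM; no direct file-level link to the Slice benchmark was found)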
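
The failure mode described above (a benchmark gate that raises once any case slows down too much) can be sketched as follows. This is only an illustration under stated assumptions: the `find_regressions` helper and the `-0.10` threshold are hypothetical and are not taken from `slice_benchmark/run.py`; only the case name and the `-0.1554505282353383` figure come from the log.

```python
# Hypothetical sketch of a relative-performance gate; NOT the real run.py logic.
# HYPOTHETICAL_THRESHOLD is an assumed cutoff chosen only for illustration.
HYPOTHETICAL_THRESHOLD = -0.10


def find_regressions(perf_changes, threshold=HYPOTHETICAL_THRESHOLD):
    """Return the cases whose relative change is more negative than threshold."""
    return {case: change for case, change in perf_changes.items() if change < threshold}


# Relative change reported in the CI log (negative means slower than baseline).
perf_changes = {
    "Getitem - forward - Slice - Slice with Step - float16 - paddle": -0.1554505282353383,
}

regressed = find_regressions(perf_changes)
for case, change in regressed.items():
    print(f"{case}: {change:.1%}")
```

With a single-case gate like this, one noisy measurement is enough to fail the job, which is consistent with the report's advice to rerun before investigating further.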

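⚪ Linux-CPU / Build and test: unknown (confidence: low)

Analyzer: generic analysis (fallback)

Failed cases:

| Case | Error summary |
|---|---|
| Test step | After the Build step succeeded, the Test step failed with exit code 8; no specific failing case or error text has been obtained yet |

Key logs: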

failed_steps: ["Test"]
[FAILURE]: Process completed with exit code 8.
gh run view --log-failed: run 27013811308 is still in progress; logs will be available when it is complete
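  • Root cause summary: Test logs are not yet available
    The MCP deep-log extraction could not retrieve this Job's error_snippet/step_log, and pulling the GitHub failed log directly also reported that the workflow run is still in progress. For now it can only be confirmed that the Linux-CPU Test step failed; the specific unit test or source location cannot be identified. Based on the PR diff, no direct link has been found between this PR's CMake build-switch changes and the CPU Test failure.

Suggested fixes:

  1. Wait for the workflow run to finish, then re-trigger this analysis or rerun the Job to obtain the full Test step log.
  2. Confidence is low, so modifying code directly is not recommended at this point; rely on the specific failing cases in the complete log once it is available.

Related changes: the 4 CMake files changed in this PR were checked, and no direct link to the Linux-CPU Test failure has been established.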

codecov-commenter commented Jun 4, 2026

Codecov Report

❌ Patch coverage is 29.62963% with 19 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@29ae6bb). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...addle/distributed/fleet/meta_parallel/dualpipev.py | 29.62% | 19 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (29.62%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #79249   +/-   ##
==========================================
  Coverage           ?   29.62%           
==========================================
  Files              ?        1           
  Lines              ?       27           
  Branches           ?        0           
==========================================
  Hits               ?        8           
  Misses             ?       19           
  Partials           ?        0           
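
As a quick check of the figures in the table above, the patch coverage is simply hits divided by total lines. The sketch below recomputes it from the reported counts; the variable names are illustrative, and the 90% target is the one quoted in the report.

```python
# Figures copied from the Codecov table above.
hits = 8
misses = 19
lines = hits + misses  # 27, matching the "Lines" row

patch_coverage = hits / lines * 100
target = 90.0  # target coverage quoted in the report

print(f"patch coverage: {patch_coverage:.5f}%")  # patch coverage: 29.62963%
print("meets target:", patch_coverage >= target)  # meets target: False
```

The "29.62%" shown elsewhere in the report looks like the same value truncated to two decimals rather than rounded.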

risemeup1 force-pushed the fix_deep_ep_import_bug branch from 6023938 to fe3c6ca on June 5, 2026 07:04
risemeup1 changed the title from "Fix deep ep import bug" to "Remove deep_ep dependency from DualPipeV and inline stream event utilities" on Jun 5, 2026
risemeup1 force-pushed the fix_deep_ep_import_bug branch from 107188b to 1a6b59e on June 5, 2026 07:21
SigureMo previously approved these changes Jun 5, 2026
risemeup1 force-pushed the fix_deep_ep_import_bug branch from 1a6b59e to 2c009d1 on June 5, 2026 11:56
risemeup1 changed the title from "Remove deep_ep dependency from DualPipeV and inline stream event utilities" to "[Revert] revert PR #79240 和 #79170" on Jun 5, 2026
SigureMo changed the title from "[Revert] revert PR #79240 和 #79170" to "[Revert] Revert PR #79240 and #79170" on Jun 5, 2026
risemeup1 force-pushed the fix_deep_ep_import_bug branch from 2c009d1 to a145a29 on June 5, 2026 12:05

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-05 20:17:41

📋 Review summary

PR overview: reverts PR #79240 and #79170, restoring the deep_ep / FP8 / NVSHMEM CMake build configuration to its previous state and fixing the deep_ep import error.
Change scope: cmake/third_party.cmake, paddle/fluid/CMakeLists.txt, paddle/fluid/distributed/collective/CMakeLists.txt, paddle/fluid/pybind/CMakeLists.txt
Impact tags: [Environment Adaptation] [Distributed Strategy]

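Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | paddle/fluid/pybind/CMakeLists.txt:153 | Uses CUDA_ARCH_BIN instead of COMPILED_CUDA_ARCHS, which is inconsistent with collective/CMakeLists.txt; outside Manual mode or on sm_100+, deep_ep is compiled but pybind does not link it, so the import fails |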

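Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | Functions such as get_event_from_calc_stream in dualpipev.py are dead code | ⚠️ Still present (this PR does not touch that file) |
| F2 | CUDA_ARCH_BIN is empty outside Manual mode, so WITH_NVSHMEM is incorrectly turned off | ⚠️ Still present (the revert restored the same problematic code) |
| F3 | Missing if(NOT WITH_XPU) guard; in XPU builds, NVSHMEM availability is affected by the GPU/ARCH constraints | ⚠️ Still present |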

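📝 PR convention check

The previously cached suggestions targeted an earlier version (the DualPipeV overlap fix) and no longer apply to the current Revert PR; new suggestions follow.

In the current title "[Revert] Revert PR #79240 and #79170", [Revert] is not among Paddle's official tag values (PR Category / PR Types); consider replacing it with an official tag.

Suggested title (can be copied directly):

  • [Environment Adaptation] Revert PR #79240 and #79170
Suggested PR description (can be copied directly)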
### PR Category
<!-- One of [ User Experience | Execute Infrastructure | Operator Mechanism | CINN | Custom Device | Performance Optimization | Distributed Strategy | Parameter Server | Communication Library | Auto Parallel | Inference | Environment Adaptation ] -->
Environment Adaptation

### PR Types
<!-- One of [ New features | Bug fixes | Improvements | Performance | BC Breaking | Deprecations | Docs | Devs | Not User Facing | Security | Others ] -->
Others

### Description
<!-- Describe what you've done -->
Revert PR #79240 and #79170 to fix the deep_ep import error.

Reverted changes:
- cmake/third_party.cmake: restore the original WITH_NVSHMEM / flashattn condition logic
- paddle/fluid/CMakeLists.txt: restore the FP8 module build condition
- paddle/fluid/distributed/collective/CMakeLists.txt: restore the DeepEP build
- paddle/fluid/pybind/CMakeLists.txt: restore the deep_ep pybind binding

### Does this cause a precision change?
<!-- one of the following [ Yes | No ]-->

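Overall assessment

The purpose of this revert is clear (fixing the deep_ep import error). Note that the CUDA_ARCH_BIN used in pybind/CMakeLists.txt is inconsistent with the COMPILED_CUDA_ARCHS used in collective/CMakeLists.txt, so outside Manual mode or on sm_100+ architectures there is a risk that deep_ep is compiled but pybind does not link it. The earlier F2/F3 issues are reintroduced by the revert and should be fixed in a follow-up PR.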

# set(PYBIND_DEPS ${PYBIND_DEPS} nvshmem)
# endif()
#endif()
string(FIND "${CUDA_ARCH_BIN}" "90" ARCH_BIN_CONTAINS_90)

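🟡 Suggestion: CUDA_ARCH_BIN is inconsistent with the COMPILED_CUDA_ARCHS used in collective/CMakeLists.txt.

paddle/fluid/distributed/collective/CMakeLists.txt uses COMPILED_CUDA_ARCHS with the regex 9[0-9]|1[0-9][0-9] to decide whether to compile deep_ep (covering sm90 through sm100+), whereas this file uses CUDA_ARCH_BIN with FIND "90". As a result:

  1. In a non-Manual build, CUDA_ARCH_BIN is empty, so deep_ep has been compiled but pybind does not link it here, and import deep_ep fails at runtime because symbols cannot be found;
  2. When the target architecture is sm_100 (Blackwell), CUDA_ARCH_BIN="100" does not contain "90", so collective compiles deep_ep but pybind does not link it, which also makes the import fail.

Suggested change, to stay consistent with collective/CMakeLists.txt: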

string(REGEX MATCH "9[0-9]|1[0-9][0-9]" ARCH_BIN_CONTAINS_GE_90 "${COMPILED_CUDA_ARCHS}")
if(WITH_GPU AND ARCH_BIN_CONTAINS_GE_90)
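
To make the two failure scenarios concrete, here is a small Python model of the two checks. It is a sketch under stated assumptions, not the actual CMake logic: the function names are hypothetical, the pybind side is assumed to treat "substring 90 found" as the link condition, and the example architecture strings are illustrative. Only the regex and the `"90"` substring come from the review comment.

```python
import re

# Regex quoted in the review for the collective/CMakeLists.txt check.
GE_90_PATTERN = re.compile(r"9[0-9]|1[0-9][0-9]")


def pybind_links_deep_ep(cuda_arch_bin: str) -> bool:
    """Model of the pybind check (assumed): is "90" a substring of CUDA_ARCH_BIN?"""
    return cuda_arch_bin.find("90") != -1


def collective_builds_deep_ep(compiled_cuda_archs: str) -> bool:
    """Model of the collective check: does the regex match COMPILED_CUDA_ARCHS?"""
    return GE_90_PATTERN.search(compiled_cuda_archs) is not None


def is_mismatch(cuda_arch_bin: str, compiled_cuda_archs: str) -> bool:
    """True when deep_ep would be built but the pybind side would not link it."""
    return collective_builds_deep_ep(compiled_cuda_archs) and not pybind_links_deep_ep(cuda_arch_bin)


# (CUDA_ARCH_BIN, COMPILED_CUDA_ARCHS) pairs; the values are illustrative.
scenarios = {
    "non-Manual build, sm_90": ("", "90"),
    "Manual build, sm_100": ("100", "100"),
    "Manual build, sm_90": ("90", "90"),
    "Manual build, sm_80": ("80", "80"),
}

for name, (arch_bin, compiled) in scenarios.items():
    print(f"{name}: mismatch={is_mismatch(arch_bin, compiled)}")
```

Under these assumptions the first two scenarios are mismatches and the last two are not, which mirrors cases 1 and 2 in the review comment. Driving both checks from the same variable and the same pattern removes the mismatch by construction.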

risemeup1 merged commit 15f9805 into PaddlePaddle:develop on Jun 8, 2026
139 of 144 checks passed

5 participants