Enhance Deterministic training and AccProber for Qwen 3.5 by jayhenry · Pull Request #1850 · InternLM/xtuner

jayhenry · 2026-05-27T13:20:42Z

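Background

When reproducing Qwen3.5 35B MoE / VL runs deterministically under torch.compile, FSDP, and the FLA/Triton kernel paths, results can be affected by Inductor's dynamic rblock selection, by Triton autotune/cache, and by how the prober records tensors inside compiled regions. The symptoms are gradients that differ across runs, or differences that are hard to localize.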
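Why a different reduction block size can change the result at the bit level is easiest to see in isolation. The following is a pure-Python illustration, not Inductor's kernel: `blocked_sum` and the sample values are hypothetical, chosen so that rounding is visible in float64.

```python
def sequential_sum(values):
    """Accumulate left to right, rounding after every addition."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def blocked_sum(values, block):
    """Sum each block first, then sum the per-block partial results."""
    partials = [
        sequential_sum(values[i:i + block])
        for i in range(0, len(values), block)
    ]
    return sequential_sum(partials)


# Same inputs, two different block sizes.
values = [1e16, 1.0, -1e16, 1.0]
one_block = blocked_sum(values, 4)   # a single block covering all elements
two_blocks = blocked_sum(values, 2)  # two blocks of two elements each

print(one_block, two_blocks)  # the two results are not equal
```

The inputs are identical in both calls; only the grouping changes. A reduction kernel that picks a different block size between runs is in the same position.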

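The goal of this change is to move the determinism switches that have been verified to work into the training entry point, and to let AccProber reliably record summaries of intermediate tensors from key Qwen3.5 modules under the compile path.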

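Conclusion

The current verified conclusion: Qwen3.5 text-only training and multimodal training are both deterministic, with or without torch.compile.

This PR consolidates the determinism settings confirmed effective during reproduction into the product entry point, and adds AccProber coverage and reproduction scripts so that regressions can be verified and localized going forward.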

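Main changes

Unify the set_deterministic() entry point:
- Set CUBLAS_WORKSPACE_CONFIG
- Disable TORCHINDUCTOR_DYNAMIC_SCALE_RBLOCK and, in sync, torch._inductor.config.dynamic_scale_rblock
  - torch.use_deterministic_algorithms(True) only collapses the initial reduction candidates into a single config;
  - dynamic rblock can still append a launcher with a smaller R*_BLOCK after precompile, and the runtime benchmark then chooses among the launchers. Once different ranks/runs pick different reduction tilings, the floating-point accumulation order changes, and the resulting gradients may differ bitwise.
- SFT Trainer and RL TrainingWorker share this entry point, which ensures it takes effect before torch.compile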
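A minimal sketch of what such an entry point might look like, for orientation only. The function name `set_deterministic_sketch`, the workspace value `:4096:8` (one of the values PyTorch documents for deterministic cuBLAS), and the call to `torch.use_deterministic_algorithms(True)` inside the function are assumptions; the actual set_deterministic() in this PR may differ.

```python
import os


def set_deterministic_sketch():
    """Apply determinism switches; call this before any torch.compile."""
    # Assumed value; the PR only states that the variable is set.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    # Covers the case where Inductor's config has not been imported yet.
    os.environ["TORCHINDUCTOR_DYNAMIC_SCALE_RBLOCK"] = "0"

    try:
        import torch
        import torch._inductor.config as inductor_config
    except ImportError:
        # torch (or Inductor) is unavailable; only the env vars were set.
        return False

    torch.use_deterministic_algorithms(True)
    # The env var is read when the config module is first imported, so the
    # attribute is also set directly in case that import already happened.
    if hasattr(inductor_config, "dynamic_scale_rblock"):
        inductor_config.dynamic_scale_rblock = False
    return True


applied_to_torch = set_deterministic_sketch()
```

Setting both the environment variable and the config attribute mirrors the "in sync" wording above: whichever is consulted later sees the disabled value.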
Pin FLA/Triton autotune behavior when XTUNER_DETERMINISTIC=true:
- Patch triton.autotune before FLA is imported
- Always use the first autotune config
- Disable autotune result caching, reducing nondeterminism from differing cache / autotune choices
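The pinning idea can be sketched without Triton installed. Below, `pin_first_config` wraps any autotune-style decorator factory so that only the first config survives; `fake_autotune` is a hypothetical stand-in used purely for demonstration, and the `cache_results` keyword is assumed to match the Triton version in use. The PR's real patch may be structured differently.

```python
import functools


def pin_first_config(autotune_fn):
    """Wrap an autotune-style decorator factory to keep only configs[0]."""

    @functools.wraps(autotune_fn)
    def patched(configs, key, **kwargs):
        # Turn off result caching only if the caller asked for it.
        if "cache_results" in kwargs:
            kwargs["cache_results"] = False
        return autotune_fn(configs=list(configs)[:1], key=key, **kwargs)

    return patched


def fake_autotune(configs, key, cache_results=False):
    """Stand-in for an autotune decorator factory (not Triton's API)."""

    def decorator(fn):
        fn.configs = list(configs)
        fn.cache_results = cache_results
        return fn

    return decorator


pinned_autotune = pin_first_config(fake_autotune)


@pinned_autotune(configs=["BLOCK=64", "BLOCK=128"], key=["n"], cache_results=True)
def kernel():
    return "ran"


print(kernel.configs, kernel.cache_results)
```

With a single config there is nothing left to benchmark, so the choice cannot vary between ranks or runs. The wrapper has to be installed before the kernels are decorated, which is why the patch must run before FLA is imported.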
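Improve AccProber compatibility with torch.compile:
- Record lightweight tensor stats instead: tensor_sum / shape / dtype / first_10
- Inside the compiled forward, only buffer tensor information; flush everything at the eager boundary
- Reduce graph breaks and GPU memory usage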
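The buffer-then-flush pattern can be illustrated with a small stand-in. `BufferedProberSketch` is hypothetical and uses plain Python lists in place of torch tensors; the real AccProber operates on tensors and its interface is not shown in this PR description.

```python
import json
import math


class BufferedProberSketch:
    """Buffers lightweight stats during forward, serializes them on flush."""

    def __init__(self):
        self._pending = []  # filled inside the (compiled) forward
        self.dumped = []    # JSON lines written at the eager boundary

    def record(self, name, flat_values, shape, dtype):
        # Keep only a small summary, never the full tensor.
        self._pending.append({
            "name": name,
            "tensor_sum": math.fsum(flat_values),
            "shape": list(shape),
            "dtype": dtype,
            "first_10": list(flat_values[:10]),
        })

    def flush(self):
        # Serialization happens here, outside the hot path.
        count = len(self._pending)
        self.dumped.extend(json.dumps(e, sort_keys=True) for e in self._pending)
        self._pending.clear()
        return count


prober = BufferedProberSketch()
prober.record("q_norm", [float(i) for i in range(12)], (3, 4), "float32")
flushed = prober.flush()
```

Deferring the I/O to `flush` keeps the recording call cheap, which is the property that matters when the call site sits inside a compiled region.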
Extend prober coverage of the key Qwen3.5 paths:
- _Linear
- MoEMLP shared experts
- GatedDeltaNet
- causal_conv1d
- chunk_gated_delta_rule
- FusedRMSNormGated
- Keep the existing recording paths for MHA / MLA / MoE / loss, etc.
Add Qwen3.5 determinism reproduction scripts:
- text-only compile: record gradients and compare bitwise hashes
- VL path: reproduce and compare gradient determinism under different TRITON_CACHE_DIR values
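A bitwise hash comparison of this kind can be sketched as follows. The helper names, the shard key `rank0/layer0.weight`, and the use of SHA-256 over float64 bytes are all assumptions for illustration; the PR's scripts hash real per-rank gradient shards and may use a different scheme.

```python
import hashlib
from array import array


def bitwise_hash(values):
    """Hash the raw float64 bytes, so any single-bit change is detected."""
    return hashlib.sha256(array("d", values).tobytes()).hexdigest()


def mismatched_shards(run_a, run_b):
    """Return the sorted names of shards whose hashes differ between runs."""
    names = run_a.keys() | run_b.keys()
    return sorted(n for n in names if run_a.get(n) != run_b.get(n))


# 0.1 + 0.2 and 0.3 are numerically close but not bitwise equal.
run_1 = {"rank0/layer0.weight": bitwise_hash([0.1 + 0.2, 1.0])}
run_2 = {"rank0/layer0.weight": bitwise_hash([0.3, 1.0])}
run_3 = {"rank0/layer0.weight": bitwise_hash([0.1 + 0.2, 1.0])}

print(mismatched_shards(run_1, run_3))  # identical runs
print(mismatched_shards(run_1, run_2))  # differs in the last bits
```

Hashing bytes rather than comparing with a tolerance is what makes the check bitwise: a tolerance-based comparison would accept `run_2` as equal to `run_1`.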
Add AccProber integration tests:
- Cover forward record with and without compile
- Cover internal tensors of GatedDeltaNet
- Cover MoE shared experts
- Cover the prober dump for MHA under fullgraph

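Testing and verification

pytest -v -s tests/profiler/test_prober.py
Qwen3.5 text-only compile training: per-rank grad shard bitwise hashes match across two runs
Qwen3.5-VL training: per-rank grad shard bitwise hashes match across two runs using different TRITON_CACHE_DIR values

The current conclusion is therefore that both Qwen3.5 text-only and multimodal training are deterministic.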

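Scope of impact

The default path does not change behavior. The new logic (pinning Triton autotune, disabling Inductor rblock, and so on) only takes effect when XTUNER_DETERMINISTIC=true; AccProber likewise only records after an explicit setup/profile step.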
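The opt-in gate amounts to an environment check along these lines. `deterministic_requested` is a hypothetical helper, and the case-insensitive parsing is an assumption; the description only shows the literal value `true`.

```python
import os


def deterministic_requested(environ=None):
    """Return True only when XTUNER_DETERMINISTIC is explicitly 'true'."""
    env = os.environ if environ is None else environ
    # Assumed parsing: unset or any other value leaves the default path alone.
    return env.get("XTUNER_DETERMINISTIC", "false").strip().lower() == "true"


print(deterministic_requested({}))
print(deterministic_requested({"XTUNER_DETERMINISTIC": "true"}))
```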

… capture Updated the test for MultiHeadAttention to ensure that the forward method is compiled with fullgraph=True and that the prober captures q_norm and k_norm tensors correctly. Removed unnecessary warm-up steps to prevent dead code elimination and clarified the test's mechanism and expectations regarding graph breaks and tensor recording.


jayhenry and others added 13 commits May 27, 2026 08:09

adjust to new attn outputs format

9316c98

Adjust Qwen35 GDN and Add test

6eddf9d

[Enhance] Add integration tests for AccProber with torch.compile support

bce9399

Updated the record_tensor method to store tensor sum, shape, dtype,…

ac33f67

… and first 10 elements, enhancing compatibility with torch.compile.

Record internal tensors for GatedDeltaNet and shared_experts in MoEMLP

2360c4a

reduce AccProber gpu mem

a35cd1e

Pin FLA Triton autotuning when XTUNER_DETERMINISTIC for reproducible …

95a02fc

…runs. Assert autotuner before mutating configs to satisfy static analysis. Co-authored-by: Cursor <cursoragent@cursor.com>

fix vl determ of FLA autotune patch v2 version

36d1234

[WIP] add min reproduce test, but sometimes not reproduce

edf2ce0

Adapt determinism prober tests to current main

61e6cb2

Complete merge-main determinism follow-ups

84b6909

clean files

9d7eeed

jayhenry requested a review from HAOCHENYE May 27, 2026 13:42

jayhenry added 3 commits May 27, 2026 13:51

add more comments

f5b4e03

DeterministicDDPTestCase use new set_deterministic func

7a2db99

move autotun patch to xtuner.v1 init

26416f8

CyCle1024 reviewed May 28, 2026


Comment thread xtuner/_testing/utils.py Outdated

only call set_deterministic in test

13d8a00

jayhenry wants to merge 17 commits into InternLM:main from jayhenry:qwen35-determ
