Enhance Deterministic training and AccProber for Qwen 3.5 #1850
Open
jayhenry wants to merge 17 commits into
… capture Updated the test for MultiHeadAttention to ensure that the forward method is compiled with fullgraph=True and that the prober captures q_norm and k_norm tensors correctly. Removed unnecessary warm-up steps to prevent dead code elimination and clarified the test's mechanism and expectations regarding graph breaks and tensor recording.
… and first 10 elements, enhancing compatibility with torch.compile.
…runs. Assert autotuner before mutating configs to satisfy static analysis. Co-authored-by: Cursor <cursoragent@cursor.com>
CyCle1024
reviewed
May 28, 2026
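Background
When reproducing Qwen3.5 35B MoE / VL training deterministically on the torch.compile, FSDP, and FLA/Triton kernel paths, results can be affected by Inductor's dynamic rblock selection, Triton autotune/cache, and the way the prober records inside compiled regions. The outcome is gradients that differ across runs, or differences that are hard to localize. This change consolidates the determinism switches that have been verified to work into the training entry point, and lets AccProber reliably record intermediate tensor summaries for key Qwen3.5 modules on the compile path.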
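Conclusion
Current verification shows that Qwen3.5 text-only training and multimodal training are both deterministic, with or without torch.compile. This PR consolidates the determinism settings confirmed during reproduction into the product entry point, and adds AccProber coverage and a reproduction script so that determinism can be verified continuously and regressions can be localized.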
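Main changes
- Unify the set_deterministic() entry point:
  - CUBLAS_WORKSPACE_CONFIG
  - TORCHINDUCTOR_DYNAMIC_SCALE_RBLOCK, with torch._inductor.config.dynamic_scale_rblock turned off in sync
  - takes effect before torch.compile
- Pin FLA/Triton autotune behavior when XTUNER_DETERMINISTIC=true:
  - triton.autotune
- Improve AccProber's torch.compile compatibility:
  - tensor_sum / shape / dtype / first_10
- Extend prober coverage of key Qwen3.5 paths:
  - _Linear
  - MoEMLP shared experts
  - GatedDeltaNet
  - causal_conv1d
  - chunk_gated_delta_rule
  - FusedRMSNormGated
- Add a Qwen3.5 determinism reproduction script:
  - reproduction and comparison of gradient determinism under TRITON_CACHE_DIR
- Add AccProber integration tests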
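As a rough illustration of the first item, the sketch below shows one way a gated determinism entry point could set the two environment variables and mirror the rblock setting onto the live Inductor config. It is not the PR's set_deterministic() implementation: the function names, the way XTUNER_DETERMINISTIC is parsed, and the ":4096:8" workspace value are assumptions made for this example.

```python
import os


def deterministic_requested(env):
    """Return True when XTUNER_DETERMINISTIC is set to 'true' (assumed parsing)."""
    return env.get("XTUNER_DETERMINISTIC", "").strip().lower() == "true"


def apply_deterministic_env(env):
    """Set the determinism-related environment variables on a dict-like env."""
    # cuBLAS needs a fixed workspace config for deterministic GEMMs.
    # ":4096:8" is an assumed choice; a value already present is kept.
    env.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    # Inductor reads this variable when its config module is first imported.
    env["TORCHINDUCTOR_DYNAMIC_SCALE_RBLOCK"] = "0"
    return env


def sync_inductor_config():
    """Mirror the setting onto the live Inductor config if torch is available."""
    try:
        import torch._inductor.config as inductor_config
        inductor_config.dynamic_scale_rblock = False
    except (ImportError, AttributeError):
        # torch is missing, or this torch build has no such option.
        return False
    return True


def set_deterministic_sketch(env=None):
    """Hypothetical entry point; call it before any torch.compile invocation."""
    env = os.environ if env is None else env
    if not deterministic_requested(env):
        return False  # default path: leave everything untouched
    apply_deterministic_env(env)
    sync_inductor_config()
    return True
```

Setting both the environment variable and the config attribute covers the case where torch._inductor.config was already imported before the entry point ran, which is presumably why the description mentions turning them off in sync.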
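Testing and verification
- pytest -v -s tests/profiler/test_prober.py
- Two runs under TRITON_CACHE_DIR produce identical per-rank grad shard bitwise hashes

The current conclusion is therefore that determinism can be guaranteed for both Qwen3.5 text-only and multimodal training.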
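To make the bitwise-hash comparison concrete, here is a minimal sketch of hashing one rank's gradient shard. It is not the PR's reproduction script: grad_shard_hash is a hypothetical helper, and gradients are modeled as plain float lists packed as float32, whereas a real script would hash the raw bytes of each tensor shard.

```python
import hashlib
from array import array


def grad_shard_hash(named_grads):
    """Hash parameter names and float32 gradient bytes in a fixed order."""
    digest = hashlib.sha256()
    for name in sorted(named_grads):  # fixed order so the hash is reproducible
        digest.update(name.encode("utf-8"))
        # Pack as float32 so the hash reflects the exact bit pattern.
        digest.update(array("f", named_grads[name]).tobytes())
    return digest.hexdigest()


# Two runs with bit-identical gradients on this rank.
run_a = {"layer.weight": [0.1, -0.25, 3.0], "layer.bias": [0.0]}
run_b = {"layer.bias": [0.0], "layer.weight": [0.1, -0.25, 3.0]}

# A run whose first gradient differs by 1e-7, which is a different float32.
run_c = {"layer.weight": [0.1000001, -0.25, 3.0], "layer.bias": [0.0]}

same = grad_shard_hash(run_a) == grad_shard_hash(run_b)
differs = grad_shard_hash(run_a) != grad_shard_hash(run_c)
```

A bitwise hash is stricter than a tolerance check such as allclose: any single-bit difference on any rank changes the digest, which is the property a determinism test needs.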
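Scope of impact
The default path does not change behavior. The new logic, such as pinning Triton autotune and disabling Inductor's dynamic rblock scaling, only takes effect when XTUNER_DETERMINISTIC=true; AccProber likewise records only after an explicit setup/profile step.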
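The opt-in recording behavior can be pictured with a small stand-in. MiniProber below is a hypothetical class, not AccProber's API; it only illustrates that nothing is recorded until setup is called, and that each record holds the tensor_sum / shape / dtype / first_10 summary fields named in this PR. Values are modeled as flat float lists rather than tensors.

```python
class MiniProber:
    """Hypothetical stand-in that records summaries only after setup()."""

    def __init__(self):
        self.enabled = False
        self.records = []

    def setup(self):
        # Explicit opt-in: until this is called, record() does nothing.
        self.enabled = True

    def record(self, name, values, dtype="float32"):
        if not self.enabled:
            return
        values = list(values)
        self.records.append({
            "name": name,
            "tensor_sum": sum(values),
            "shape": (len(values),),
            "dtype": dtype,
            "first_10": values[:10],
        })


prober = MiniProber()
prober.record("q_norm", [1.0, 2.0, 3.0])  # ignored: setup() not called yet
prober.setup()
prober.record("k_norm", [1.0, 2.0, 3.0])  # recorded
```

Storing a compact summary instead of the full tensor keeps the records small enough to diff between two runs, while a sum plus the first few elements still catches most numerical divergence.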