Distributed HCCL harness and examples for three runtimes #307
PKUZHOU wants to merge 1 commit into hw-native-sys:main
Conversation
Summary of changes (Gemini Code Assist): This pull request introduces a robust framework for distributed kernel execution on Ascend NPUs, leveraging HCCL for inter-device communication. It provides a flexible Python-based orchestration layer to manage the lifecycle of distributed tasks across the various `simpler` runtimes, from compilation to verification. The core C++ worker is generic, abstracting away device-specific setup and focusing on executing kernels efficiently in a multi-card environment.
Code Review
This PR introduces a complete distributed execution framework to the `simpler` project — a well-structured and significant addition. It includes a generic C++ worker, Python-based orchestration scripts, and examples for three different runtimes. The overall design is clean and honors the stated "zero intrusion" principle toward the existing codebase.

The review found several areas for improvement, mainly around correctness, maintainability, and efficiency:

- The default dtype handling in the Python runner can silently corrupt data.
- The C++ worker performs an unnecessary memory copy.
- The C++ worker uses magic numbers, and its handling of HCCL-internal data structures is fragile.
- The example kernel hard-codes a rank-count limit, which can produce incorrect results.
- There is substantial code duplication across the example directories (e.g. golden.py and treduce_kernel.cpp are identical in all three examples); these shared files should be extracted to a common location.

Addressing these points will make the new distributed framework more robust and maintainable. Overall, this is an excellent contribution.
Review comments (outdated, now resolved) were left on:

- examples/a2a3/host_build_graph/treduce_distributed/kernels/aiv/treduce_kernel.cpp
- examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/kernels/kernel_config.py (two comments)
- validate distributed buffer metadata and simplify output verification
- support explicit device selection in run_example.py and ci.sh for CI
- shrink treduce examples to 4 ranks, remove stale config, and guard invalid rank/root values
- rename the per-rank helper to distributed_worker.py and document the buffer layout

Made-with: Cursor

- add backend-agnostic `comm_*` host APIs plus a2a3/a5 hardware and sim implementations so distributed runs share one communication abstraction
- add Python bindings, distributed runner orchestration, and per-rank worker support to drive multi-rank examples through `run_example.py`
- add distributed treduce examples for all three runtimes and fold in the PR hw-native-sys#307 review fixes for CI-friendly rank counts, explicit device selection, and stronger validation

Made-with: Cursor
#303
## Distributed extension: implementation summary

### Architecture overview

The refactor follows Simpler's existing three-layer architecture (Platform C → Python bindings → Runner/Examples), integrating distributed communication into the existing framework through a backend-agnostic `comm_*` API. It replaces the previous standalone `distributed/` directory (a 932-line monolithic C++ worker plus a separate runner).

```mermaid
graph TD
  subgraph runner["Runner layer"]
    RE["run_example.py<br/>detects DISTRIBUTED_CONFIG"]
    DCR["DistributedCodeRunner<br/>compile · data prep · launch · verify"]
  end
  subgraph worker["Worker layer (N processes)"]
    DW["_distributed_worker.py<br/>per-rank process"]
  end
  subgraph python["Python bindings"]
    BD["bindings.py<br/>comm_* ctypes wrappers"]
  end
  subgraph platform["Platform layer (libhost_runtime.so)"]
    CH["comm.h — 5 generic C functions"]
    HCCL["comm_hccl.cpp<br/>hardware HCCL+RDMA"]
    SIM["comm_sim.cpp<br/>simulation, POSIX shm"]
  end
  RE -->|"distributed"| DCR
  DCR -->|"spawns N"| DW
  DW --> BD
  BD --> CH
  CH --> HCCL
  CH --> SIM
```

## 1. Platform layer (C/C++)
### 1.1 Backend-agnostic API — comm.h

comm.h defines 5 core functions; every backend implements the same interface:

| Function | Signature |
| --- | --- |
| `comm_init` | `CommHandle comm_init(int rank, int nranks, const char* rootinfo_path)` |
| `comm_alloc_windows` | `int comm_alloc_windows(CommHandle h, size_t win_size, uint64_t* device_ctx_out)` — returns the `CommDeviceContext` device pointer via `device_ctx_out` |
| `comm_get_local_window_base` | `int comm_get_local_window_base(CommHandle h, uint64_t* base_out)` |
| `comm_barrier` | `int comm_barrier(CommHandle h)` |
| `comm_destroy` | `int comm_destroy(CommHandle h)` |

File locations:

- src/a2a3/platform/include/host/comm.h
- src/a5/platform/include/host/comm.h

### 1.2 Device-side context — CommDeviceContext

A kernel accesses another rank's data by computing the remote address from `windowsIn[pe]`. This struct is the ABI contract between the host and the device kernel.

File locations:

- src/a2a3/platform/include/common/comm_context.h
- src/a5/platform/include/common/comm_context.h

### 1.3 The two backend implementations
Both backends export the same `comm_*` symbols; the backend is selected by link-time polymorphism — no runtime overhead, no `#ifdef`.

HCCL backend (hardware):

- src/{a2a3,a5}/platform/onboard/host/comm_hccl.cpp (476 lines)
- `HcclCommInitRootInfo` → `HcclAllocComResourceByTiling` → parses the MESH/RING context and extracts the `windowsIn` addresses
- links against hccl and hccl_fwk

Simulation backend:

- src/{a2a3,a5}/platform/sim/host/comm_sim.cpp (199 lines)
- `shm_open` + `mmap` create cross-process POSIX shared memory; all rank processes map the same physical memory region
- uses `__atomic` builtins, which work safely on mmap'd shared memory
- links against rt (POSIX shm)

### 1.4 CMakeLists.txt changes

- src/a2a3/platform/onboard/host/CMakeLists.txt — adds comm_hccl.cpp to the sources; links hccl and hccl_fwk
- src/a2a3/platform/sim/host/CMakeLists.txt — adds comm_sim.cpp to the sources; links rt
- src/a5/platform/onboard/host/CMakeLists.txt and src/a5/platform/sim/host/CMakeLists.txt — the same changes for a5

## 2. Python bindings layer
python/bindings.py adds 5 ctypes wrapper functions for the `comm_*` symbols, plus the corresponding module-level Python wrappers.
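As an illustration of what such ctypes bindings look like, here is a minimal sketch. The function names match the `comm_*` API above, but the exact argtypes, the `bind_comm` helper, and the `COMM_SIGNATURES` table are assumptions reconstructed from the comm.h signatures — the real bindings.py may be organized differently.

```python
import ctypes

# Hypothetical signature table for the five comm_* entry points; the argtypes
# are inferred from the comm.h prototypes quoted above, not copied from bindings.py.
COMM_SIGNATURES = {
    "comm_init": ([ctypes.c_int, ctypes.c_int, ctypes.c_char_p], ctypes.c_void_p),
    "comm_alloc_windows": ([ctypes.c_void_p, ctypes.c_size_t,
                            ctypes.POINTER(ctypes.c_uint64)], ctypes.c_int),
    "comm_get_local_window_base": ([ctypes.c_void_p,
                                    ctypes.POINTER(ctypes.c_uint64)], ctypes.c_int),
    "comm_barrier": ([ctypes.c_void_p], ctypes.c_int),
    "comm_destroy": ([ctypes.c_void_p], ctypes.c_int),
}

def bind_comm(lib):
    """Attach argtypes/restype to each comm_* symbol of an already-loaded CDLL."""
    for name, (argtypes, restype) in COMM_SIGNATURES.items():
        fn = getattr(lib, name)
        fn.argtypes = argtypes
        fn.restype = restype
    return lib

# Usage, on a machine where the library has been built:
#   lib = bind_comm(ctypes.CDLL("libhost_runtime.so"))
#   handle = lib.comm_init(rank, nranks, rootinfo_path.encode())
```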
## 3. Runner layer

### 3.1 _distributed_worker.py (254 lines) — per-rank worker

Replaces the original C++ distributed_worker. Each rank runs as an independent Python process. Its CLI arguments are generated automatically by DistributedCodeRunner, including `--win-buffer`, `--dev-buffer`, `--arg`, `--kernel-bin`, and others.

### 3.2 distributed_code_runner.py (435 lines) — orchestrator

The DistributedCodeRunner class wraps the full test flow:

- compile() — builds Host/AICPU/AICore in parallel via RuntimeBuilder + KernelCompiler; compiles the orchestration and the kernel
- prepare_data() — golden.generate_distributed_inputs() writes per-rank .bin input files
- run() — spawns one _distributed_worker.py subprocess per rank
- verify() — checks outputs against golden.compute_golden() within tolerance
- run_all() — runs the whole pipeline end to end

### 3.3 run_example.py (374 lines) — unified entry point

run_example.py auto-detects DISTRIBUTED_CONFIG in kernel_config.py:

- DISTRIBUTED_CONFIG present → imports DistributedCodeRunner and routes to the distributed flow
- otherwise → the existing CodeRunner flow

Users need not care whether a run is single-card or multi-card; the same command works for both. A new `--nranks` argument can override the default rank count.

## 4. Example: Distributed TREDUCE
Each of the three runtime variants (host_build_graph, aicpu_build_graph, tensormap_and_ringbuffer) ships its own treduce_distributed example.

### 4.1 Directory layout
### 4.2 DISTRIBUTED_CONFIG definition
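The actual config snippet did not survive the page scrape, so here is a hypothetical sketch of what a DISTRIBUTED_CONFIG could look like, assembled purely from the field descriptions that follow. The key names, buffer layout, and the `nranks`/`root` entries are assumptions; consult a real kernel_config.py for the authoritative shape.

```python
# Hypothetical DISTRIBUTED_CONFIG, reconstructed from the field descriptions
# below; exact keys and values in the real kernel_config.py may differ.
DISTRIBUTED_CONFIG = {
    "nranks": 8,                                   # number of rank processes (assumed key)
    "root": 0,                                     # root rank (assumed key)
    "buffers": [
        {"name": "input", "placement": "window",   # RDMA window, visible to all ranks
         "size": 256 * 4},                         # 256 float32 elements
        {"name": "output", "placement": "device",  # rank-local device_malloc
         "size": 256 * 4},
    ],
    # Arguments for the orchestration function; the runner substitutes the
    # special tokens nranks, root, and deviceCtx at launch time.
    "args": ["input", "output", "nranks", "root", "deviceCtx"],
}
```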
Field descriptions:

- `placement: "window"` — the buffer is allocated in the RDMA window region (accessible by every rank)
- `placement: "device"` — the buffer is allocated via device_malloc (rank-local)
- `args` — the argument list passed to the orchestration function; special tokens: nranks, root, deviceCtx (the device pointer to CommDeviceContext)

### 4.3 Algorithm

- Each rank fills `input[i] = i + rank × 100` (256 float32 values)
- The TREDUCE instruction reads from every rank's RDMA window and sums the values
- Expected result: `output[i] = 8i + 100 × (8 × 7 / 2) = 8i + 2800`

### 4.4 Kernel core logic
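The kernel source itself is not reproduced here, but its effect can be modeled in a few lines of plain Python: an element-wise sum over all ranks' window contents, matching the golden formula from the algorithm section. This is a functional model only, not the device kernel.

```python
NRANKS, N = 8, 256  # rank count and element count from the algorithm above

def make_input(rank):
    # Per-rank input pattern: input[i] = i + rank * 100
    return [i + rank * 100 for i in range(N)]

def treduce_model(all_inputs):
    # Functional model of the TREDUCE step: element-wise sum across every
    # rank's window (the device kernel performs this with remote reads).
    return [sum(vals) for vals in zip(*all_inputs)]

out = treduce_model([make_input(r) for r in range(NRANKS)])
# output[i] = 8*i + 100 * (0 + 1 + ... + 7) = 8*i + 2800
assert all(out[i] == 8 * i + 2800 for i in range(N))
```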
### 4.5 Run command
### 4.6 Passing test output
## 5. File inventory

Platform C/C++ files:

- src/a2a3/platform/include/host/comm.h
- src/a2a3/platform/include/common/comm_context.h — the CommDeviceContext struct
- src/a2a3/platform/onboard/host/comm_hccl.cpp
- src/a2a3/platform/sim/host/comm_sim.cpp
- src/a5/platform/... (the same 4 files for a5)

Python files:

- python/bindings.py
- examples/scripts/_distributed_worker.py
- examples/scripts/distributed_code_runner.py
- examples/scripts/run_example.py

Example files (3 runtimes × 4 files each), under host_build_graph, aicpu_build_graph, and tensormap_and_ringbuffer.

Removed: the old distributed/ directory.

## 6. Extensibility
Adding a new communication backend (e.g. MPI) only requires:

- writing comm_mpi.cpp, implementing the 5 functions of comm.h
- adding the source file and its link libraries in CMakeLists.txt
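To make the small size of that contract concrete, here is an in-process Python model of the five entry points. The method names mirror comm.h; the class itself, the `bytearray` windows, and the shared `_windows` dict are illustrative stand-ins (a real backend is C++ and would return the CommDeviceContext pointer from alloc_windows).

```python
# Illustrative in-process model of the five-function comm contract from
# comm.h. A real new backend (e.g. comm_mpi.cpp) implements the same five
# entry points in C++; everything below is a sketch, not project code.
class InProcessComm:
    _windows = {}  # stands in for the per-rank RDMA / shm windows

    def __init__(self, rank, nranks):      # comm_init
        self.rank, self.nranks = rank, nranks

    def alloc_windows(self, win_size):     # comm_alloc_windows
        InProcessComm._windows[self.rank] = bytearray(win_size)
        return 0  # on hardware this also hands back the CommDeviceContext pointer

    def local_window_base(self):           # comm_get_local_window_base
        return InProcessComm._windows[self.rank]

    def barrier(self):                     # comm_barrier; trivial in one process
        return 0

    def destroy(self):                     # comm_destroy
        InProcessComm._windows.pop(self.rank, None)
        return 0
```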