Distributed HCCL harness and examples for three runtimes #307

Open
PKUZHOU wants to merge 1 commit into hw-native-sys:main from PKUZHOU:main

Conversation


@PKUZHOU PKUZHOU commented Mar 17, 2026

#303

Summary of the distributed extension implementation

Architecture overview

The refactor follows Simpler's existing three-layer architecture (Platform C → Python Bindings → Runner/Examples). Distributed communication is integrated into the existing framework through a backend-agnostic comm_* API, replacing the former standalone distributed/ directory (a 932-line monolithic C++ worker plus a separate runner).

graph TD
    subgraph runner["Runner layer"]
        RE["run_example.py<br/>detects DISTRIBUTED_CONFIG"]
        DCR["DistributedCodeRunner<br/>compile · data prep · launch · verify"]
    end

    subgraph worker["Worker layer (N processes)"]
        DW["_distributed_worker.py<br/>per-rank process"]
    end

    subgraph python["Python Bindings"]
        BD["bindings.py<br/>comm_* ctypes wrappers"]
    end

    subgraph platform["Platform layer (libhost_runtime.so)"]
        CH["comm.h — 5 generic C functions"]
        HCCL["comm_hccl.cpp<br/>hardware HCCL+RDMA"]
        SIM["comm_sim.cpp<br/>simulation POSIX shm"]
    end

    RE -->|"distributed"| DCR
    DCR -->|"spawns N"| DW
    DW --> BD
    BD --> CH
    CH --> HCCL
    CH --> SIM

1. Platform layer (C/C++)

1.1 Backend-agnostic API — comm.h

Defines 5 core functions; every backend implements the same interface:

Function  Signature  Purpose
comm_init  CommHandle comm_init(int rank, int nranks, const char* rootinfo_path)  Initialize communication and return a handle
comm_alloc_windows  int comm_alloc_windows(CommHandle h, size_t win_size, uint64_t* device_ctx_out)  Allocate RDMA windows; outputs the CommDeviceContext device pointer
comm_get_local_window_base  int comm_get_local_window_base(CommHandle h, uint64_t* base_out)  Get this rank's window base address
comm_barrier  int comm_barrier(CommHandle h)  Synchronization barrier across all ranks
comm_destroy  int comm_destroy(CommHandle h)  Release communication resources

File locations:

  • src/a2a3/platform/include/host/comm.h
  • src/a5/platform/include/host/comm.h

1.2 Device-side context — CommDeviceContext

static constexpr uint32_t COMM_MAX_RANK_NUM = 64;

struct CommDeviceContext {
    uint64_t workSpace;
    uint64_t workSpaceSize;
    uint32_t rankId;
    uint32_t rankNum;
    uint64_t winSize;
    uint64_t windowsIn[COMM_MAX_RANK_NUM];
    uint64_t windowsOut[COMM_MAX_RANK_NUM];
};

The kernel computes remote addresses via windowsIn[pe] to access other ranks' data. This struct is the ABI contract between the host and the device kernel.

File locations:

  • src/a2a3/platform/include/common/comm_context.h
  • src/a5/platform/include/common/comm_context.h
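For illustration, the host side could mirror this ABI with a ctypes structure. The sketch below is an assumption derived from the field list above, not part of the actual bindings:

```python
import ctypes

COMM_MAX_RANK_NUM = 64  # must match the C constant

class CommDeviceContext(ctypes.Structure):
    # Field order and widths must match the C struct exactly (ABI contract).
    _fields_ = [
        ("workSpace",     ctypes.c_uint64),
        ("workSpaceSize", ctypes.c_uint64),
        ("rankId",        ctypes.c_uint32),
        ("rankNum",       ctypes.c_uint32),
        ("winSize",       ctypes.c_uint64),
        ("windowsIn",     ctypes.c_uint64 * COMM_MAX_RANK_NUM),
        ("windowsOut",    ctypes.c_uint64 * COMM_MAX_RANK_NUM),
    ]

ctx = CommDeviceContext(rankId=1, rankNum=4, winSize=1 << 20)
ctx.windowsIn[0] = 0x10000000
# A remote address is the peer's window base plus the local offset:
remote_addr = ctx.windowsIn[0] + 256
```

With default alignment the struct is 1056 bytes (32 bytes of scalars plus two 64-entry uint64 arrays), which any host-side mirror would need to preserve.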

1.3 Two backend implementations

Both backends export the same comm_* symbols and are selected via link-time polymorphism — no runtime overhead, no #ifdef.

HCCL backend (real hardware)

  • File: src/{a2a3,a5}/platform/onboard/host/comm_hccl.cpp (476 lines)
  • Core mechanism: ACL init → RootInfo file exchange → HcclCommInitRootInfo → HcclAllocComResourceByTiling → MESH/RING context parsing to extract windowsIn addresses
  • Link dependencies: hccl, hccl_fwk

Simulation backend

  • File: src/{a2a3,a5}/platform/sim/host/comm_sim.cpp (199 lines)
  • Core mechanism: shm_open + mmap create cross-process POSIX shared memory; every rank process maps the same physical memory region
  • Shared memory layout:
+-----------------------------+
| SharedHeader (4096 bytes)   |  ← sync primitives (atomic barrier)
+-----------------------------+
| Rank 0 window region        |  ← per_rank_win_size bytes
+-----------------------------+
| Rank 1 window region        |
+-----------------------------+
| ...                         |
+-----------------------------+
| Rank N-1 window region      |
+-----------------------------+
  • Barrier synchronization uses GCC __atomic builtins, which work safely on mmap'd shared memory
  • Link dependency: rt (POSIX shm)
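The layout above implies each rank's window sits at a fixed offset behind the header. A minimal sketch of the offset arithmetic (function names are hypothetical, not taken from comm_sim.cpp):

```python
HEADER_SIZE = 4096  # SharedHeader region, per the layout above

def rank_window_offset(rank: int, per_rank_win_size: int) -> int:
    # Byte offset of a rank's window region inside the shared-memory segment.
    return HEADER_SIZE + rank * per_rank_win_size

def total_segment_size(nranks: int, per_rank_win_size: int) -> int:
    # Header plus one window region per rank.
    return HEADER_SIZE + nranks * per_rank_win_size
```

Because every process maps the same segment, these offsets double as the cross-rank "remote addresses" in simulation.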

1.4 CMakeLists.txt changes

File  Change
src/a2a3/platform/onboard/host/CMakeLists.txt  Add comm_hccl.cpp to the sources; link hccl hccl_fwk
src/a2a3/platform/sim/host/CMakeLists.txt  Add comm_sim.cpp to the sources; link rt
src/a5/platform/onboard/host/CMakeLists.txt  Same as a2a3 onboard
src/a5/platform/sim/host/CMakeLists.txt  Same as a2a3 sim

2. Python Bindings layer

python/bindings.py gains 5 new ctypes wrapper functions:

# ctypes signatures
lib.comm_init.argtypes = [c_int, c_int, c_char_p]
lib.comm_init.restype = c_void_p

lib.comm_alloc_windows.argtypes = [c_void_p, c_size_t, POINTER(c_uint64)]
lib.comm_alloc_windows.restype = c_int

lib.comm_get_local_window_base.argtypes = [c_void_p, POINTER(c_uint64)]
lib.comm_get_local_window_base.restype = c_int

lib.comm_barrier.argtypes = [c_void_p]
lib.comm_barrier.restype = c_int

lib.comm_destroy.argtypes = [c_void_p]
lib.comm_destroy.restype = c_int

The corresponding module-level Python wrappers:

def comm_init(rank: int, nranks: int, rootinfo_path: str) -> int: ...
def comm_alloc_windows(handle: int, win_size: int) -> int: ...
def comm_get_local_window_base(handle: int) -> int: ...
def comm_barrier(handle: int) -> None: ...
def comm_destroy(handle: int) -> None: ...
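Taken together, a per-rank lifecycle built on these wrappers might look like the sketch below. The `_FakeComm` stub and all return values are hypothetical stand-ins for the real bindings module:

```python
def rank_lifecycle(comm, rank: int, nranks: int, rootinfo: str, win_size: int) -> int:
    handle = comm.comm_init(rank, nranks, rootinfo)
    device_ctx = comm.comm_alloc_windows(handle, win_size)  # device ptr to CommDeviceContext
    local_base = comm.comm_get_local_window_base(handle)
    comm.comm_barrier(handle)   # all ranks ready before launching kernels
    # ... stage inputs at local_base, run kernels, read back outputs ...
    comm.comm_barrier(handle)   # all ranks done before teardown
    comm.comm_destroy(handle)
    return local_base

class _FakeComm:
    # Minimal stand-in so the sketch runs without hardware or the real bindings.
    def comm_init(self, rank, nranks, rootinfo): return 0xABC
    def comm_alloc_windows(self, handle, win_size): return 0xDEF
    def comm_get_local_window_base(self, handle): return 0x1000
    def comm_barrier(self, handle): pass
    def comm_destroy(self, handle): pass

base = rank_lifecycle(_FakeComm(), rank=0, nranks=4,
                      rootinfo="/tmp/rootinfo", win_size=1 << 20)
```

The two barriers bracket kernel execution: the first guarantees every rank's inputs are staged before any remote read, the second that no rank tears down its window while a peer is still reading.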

3. Runner layer

3.1 _distributed_worker.py (254 lines) — per-rank worker

Replaces the original C++ distributed_worker. Each rank runs as an independent Python process with this execution flow:

1. bind_host_binary() → set_device()
2. comm_init → comm_alloc_windows → comm_get_local_window_base
3. Allocate buffers
   ├── window placement: local_base + offset (inside the RDMA window)
   └── device placement: device_malloc (rank-local)
4. Load inputs from .bin files → copy_to_device
5. comm_barrier (ensures all ranks' data is ready)
6. Runtime.initialize → launch_runtime → Runtime.finalize
7. comm_barrier → copy_from_device → save outputs to .bin
8. comm_destroy

CLI arguments are generated automatically by DistributedCodeRunner, including --win-buffer, --dev-buffer, --arg, and --kernel-bin.
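Step 3 of the flow above — choosing where a buffer lives — can be sketched as a small dispatch (names and the exact shape are hypothetical):

```python
def alloc_buffer(spec: dict, local_base: int, window_offsets: dict, device_malloc):
    # "window" buffers live at a fixed offset inside this rank's RDMA window,
    # so every peer can derive the remote address; "device" buffers are
    # ordinary rank-local device allocations invisible to other ranks.
    if spec["placement"] == "window":
        return local_base + window_offsets[spec["name"]]
    return device_malloc(spec["size"])

window_addr = alloc_buffer({"name": "input", "placement": "window"},
                           local_base=0x1000, window_offsets={"input": 256},
                           device_malloc=lambda size: 0xBEEF)
device_addr = alloc_buffer({"name": "output", "placement": "device", "size": 1024},
                           local_base=0x1000, window_offsets={},
                           device_malloc=lambda size: 0xBEEF)
```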

3.2 distributed_code_runner.py (435 lines) — orchestrator

The DistributedCodeRunner class encapsulates the full test flow:

Method  Purpose
compile()  Reuses RuntimeBuilder + KernelCompiler to build Host/AICPU/AICore in parallel and compile the orchestration and kernels
prepare_data()  Calls golden.generate_distributed_inputs() to generate per-rank .bin files
run()  Cleans up stale shared-memory segments, then spawns a _distributed_worker.py subprocess per rank
verify()  Reads outputs from the root rank and checks them against golden.compute_golden() within tolerance
run_all()  Chains the four steps above and returns a boolean result

3.3 run_example.py (374 lines) — unified entry point

Automatically detects DISTRIBUTED_CONFIG in kernel_config.py:

  • DISTRIBUTED_CONFIG present → imports DistributedCodeRunner and routes to the distributed flow
  • Otherwise → standard CodeRunner flow

The same command works whether the run is single-card or multi-card:

python examples/scripts/run_example.py \
  -k examples/a2a3/host_build_graph/treduce_distributed/kernels \
  -g examples/a2a3/host_build_graph/treduce_distributed/golden.py \
  -p a2a3sim

A new --nranks flag overrides the default rank count.
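The detection logic amounts to a single attribute check; the snippet below is a simplified, hypothetical version of what run_example.py does:

```python
import types

def is_distributed(cfg_module) -> bool:
    # Route to the distributed runner only when kernel_config defines
    # DISTRIBUTED_CONFIG; otherwise fall through to the standard CodeRunner.
    return getattr(cfg_module, "DISTRIBUTED_CONFIG", None) is not None

# Stand-ins for an imported kernel_config module:
distributed_cfg = types.SimpleNamespace(DISTRIBUTED_CONFIG={"nranks": 8})
plain_cfg = types.SimpleNamespace()
```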


4. Example: Distributed TREDUCE

Each of the three runtime variants (host_build_graph, aicpu_build_graph, tensormap_and_ringbuffer) has its own treduce_distributed example.

4.1 Directory layout

examples/a2a3/<runtime>/treduce_distributed/
├── golden.py                          # generates per-rank inputs + golden verification
└── kernels/
    ├── kernel_config.py               # DISTRIBUTED_CONFIG definition
    ├── aiv/
    │   └── treduce_kernel.cpp         # AIV TREDUCE kernel
    └── orchestration/
        └── treduce_orch.cpp           # orchestration function (differs per runtime)

4.2 DISTRIBUTED_CONFIG definition

DISTRIBUTED_CONFIG = {
    "nranks": 8,
    "root": 0,
    "comm_include_dirs": ["tests/npu/a2a3/comm/st/testcase"],
    "win_sync_prefix": 256,        # bytes reserved at the head of the window
    "buffers": [
        {"name": "input",  "dtype": "float32", "count": 256, "placement": "window"},
        {"name": "output", "dtype": "float32", "count": 256, "placement": "device"},
    ],
    "inputs": ["input"],           # buffers loaded from .bin files
    "outputs": ["output"],         # buffers saved after execution
    "args": ["input", "output", "nranks", "root", "deviceCtx"],
}

Field notes:

  • placement: "window" — the buffer is allocated in the RDMA window region (accessible from all ranks)
  • placement: "device" — the buffer is allocated via device_malloc (rank-local)
  • args — the argument list passed to the orchestration function; special tokens: nranks, root, deviceCtx (a device pointer to CommDeviceContext)
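A config of this shape lends itself to simple up-front validation. The checks below are a hypothetical sketch mirroring the field rules above, not the actual runner code:

```python
VALID_PLACEMENTS = {"window", "device"}
SPECIAL_ARGS = {"nranks", "root", "deviceCtx"}

def validate_config(cfg: dict) -> None:
    # Reject impossible rank/root combinations early.
    if cfg["nranks"] <= 0:
        raise ValueError("nranks must be positive")
    if not 0 <= cfg["root"] < cfg["nranks"]:
        raise ValueError("root must be a valid rank")
    buffer_names = {b["name"] for b in cfg["buffers"]}
    for b in cfg["buffers"]:
        if b["placement"] not in VALID_PLACEMENTS:
            raise ValueError(f"unknown placement: {b['placement']}")
    # Every arg token must be a declared buffer or a special token.
    for arg in cfg["args"]:
        if arg not in buffer_names and arg not in SPECIAL_ARGS:
            raise ValueError(f"unknown arg token: {arg}")

cfg = {
    "nranks": 8, "root": 0,
    "buffers": [{"name": "input", "placement": "window"},
                {"name": "output", "placement": "device"}],
    "args": ["input", "output", "nranks", "root", "deviceCtx"],
}
validate_config(cfg)  # a well-formed config passes silently
```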

4.3 Algorithm

  • 8 ranks; each rank contributes input[i] = i + rank × 100 (256 float32 values)
  • The root rank (rank 0) uses the PTO TREDUCE instruction to read data from every rank's RDMA window and sum it
  • Expected output: output[i] = 8i + 100 × (0 + 1 + … + 7) = 8i + 2800
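The expected output can be double-checked with a few lines of plain Python (a standalone re-derivation, not the project's golden.py):

```python
COUNT, NRANKS = 256, 8

def rank_input(rank: int):
    # Each rank contributes input[i] = i + rank * 100.
    return [float(i + rank * 100) for i in range(COUNT)]

def golden_output():
    # Element-wise sum over all ranks: 8*i + 100*(0+1+...+7) = 8*i + 2800.
    out = [0.0] * COUNT
    for rank in range(NRANKS):
        for i, v in enumerate(rank_input(rank)):
            out[i] += v
    return out
```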

4.4 Kernel core logic

// Address translation: compute the remote pointer into rank pe's window
template <typename T>
__gm__ T *CommRemotePtr(__gm__ CommDeviceContext *ctx, __gm__ T *localPtr, int pe) {
    uint64_t localBase = ctx->windowsIn[ctx->rankId];
    uint64_t offset = (uint64_t)localPtr - localBase;
    return (__gm__ T *)(ctx->windowsIn[pe] + offset);
}

// Root rank performs the reduce
if (my_rank == root) {
    for (int i = 0; i < actual_nranks; ++i) {
        __gm__ float* remoteInput = CommRemotePtr(commCtx, input, i);
        tensors[i] = Global(remoteInput, shape, stride);
    }
    pto::comm::TREDUCE(pg, outputG, accTile, recvTile, ReduceOp::Sum);
}

4.5 Run commands

# simulation (no hardware required)
python examples/scripts/run_example.py \
  -k examples/a2a3/host_build_graph/treduce_distributed/kernels \
  -g examples/a2a3/host_build_graph/treduce_distributed/golden.py \
  -p a2a3sim

# real hardware
python examples/scripts/run_example.py \
  -k examples/a2a3/host_build_graph/treduce_distributed/kernels \
  -g examples/a2a3/host_build_graph/treduce_distributed/golden.py \
  -p a2a3

4.6 Passing test output

[INFO] Detected DISTRIBUTED_CONFIG — using distributed runner
[INFO] === Phase 1: Building runtime ===
[INFO] Compiling AICore, AICPU, Host in parallel...
[INFO] === Phase 2: Compiling orchestration ===
[INFO] === Phase 3: Compiling kernels ===
[INFO] === Phase 4: Saving artifacts ===
[INFO] === Launching 8 workers ===
[INFO] Rank 0: OK
[INFO] Rank 1: OK
...
[INFO] Rank 7: OK
[INFO] VERIFY PASSED: output — 256 elements correct
[INFO]   Sample: [2800.0, 2808.0, 2816.0, 2824.0, 2832.0]
============================================================
TEST PASSED
============================================================

5. File inventory

Platform C/C++ files

File  Lines  Description
src/a2a3/platform/include/host/comm.h  93  Backend-agnostic comm API declarations
src/a2a3/platform/include/common/comm_context.h  31  CommDeviceContext struct
src/a2a3/platform/onboard/host/comm_hccl.cpp  476  HCCL backend implementation
src/a2a3/platform/sim/host/comm_sim.cpp  199  Simulation backend implementation (POSIX shm)
src/a5/platform/... (same 4 files as above)  —  a5 platform mirror

Python files

File  Lines  Description
python/bindings.py  760  comm_* ctypes wrappers (plus existing bindings)
examples/scripts/_distributed_worker.py  254  Per-rank Python worker
examples/scripts/distributed_code_runner.py  435  Distributed test orchestrator
examples/scripts/run_example.py  374  Unified entry point (with DISTRIBUTED_CONFIG detection)

Example files (3 runtimes × 4 files)

Runtime  kernel_config.py  treduce_kernel.cpp  treduce_orch.cpp  golden.py
host_build_graph  48 lines  88 lines  56 lines  42 lines
aicpu_build_graph  49 lines  88 lines  38 lines  42 lines
tensormap_and_ringbuffer  47 lines  88 lines  53 lines  42 lines

Removed

Directory  Description
distributed/  The former standalone distributed directory (~1,500 lines); its functionality is now integrated into the three-layer architecture

6. Extensibility

Adding a new communication backend (e.g. MPI) requires only:

  1. A new comm_mpi.cpp implementing the 5 functions declared in comm.h
  2. Adding the source file and link libraries to the corresponding CMakeLists.txt
  3. No changes to the Python layer, the Runner layer, or kernel code

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust framework for distributed kernel execution on Ascend NPUs, leveraging HCCL for inter-device communication. It provides a flexible Python-based orchestration layer to manage the lifecycle of distributed tasks across various simpler runtimes, from compilation to verification. The core C++ worker is designed to be generic, abstracting away the complexities of device-specific setup and focusing on executing kernels efficiently in a multi-card environment.

Highlights

  • Distributed Execution Framework: Introduced a new distributed extension for simpler, enabling multi-card kernel execution on Ascend NPU environments using HCCL for RDMA communication.
  • Multi-Runtime Support: Implemented support for three simpler runtimes: host_build_graph, aicpu_build_graph, and tensormap_and_ringbuffer, allowing flexibility in orchestration and scheduling.
  • Generic Worker Process: Developed a generic C++ distributed_worker executable that handles HCCL initialization, resource allocation, and delegates kernel execution to simpler's runtime via dynamic loading, making it case-agnostic.
  • Python Orchestration Layer: Added a Python DistributedRunner class and CLI entry point (run_distributed_example.py) to orchestrate the compilation, data preparation, worker launching, and result verification for distributed examples.
  • Distributed TREDUCE Examples: Provided comprehensive examples for distributed TREDUCE (collective reduce sum) across 8 cards, validated for all three supported runtimes, demonstrating the framework's capabilities.
  • Zero Intrusion Design: Ensured that the new distributed functionality is implemented without modifying existing core simpler source code, maintaining backward compatibility and modularity.


Changelog
  • distributed/CMakeLists.txt
    • Added CMake configuration for building the distributed_worker executable, including finding necessary Ascend CANN SDK libraries like ACL and HCCL.
  • distributed/README.md
    • Added comprehensive documentation in Chinese, detailing the architecture, supported runtimes, relationship to existing simpler code, directory structure, quick start guide, CLI parameters, and instructions for adding new distributed kernels.
  • distributed/include/hccl_context.h
    • Added the HcclDeviceContext structure definition, which describes the device-side communication context for HCCL, including RDMA window addresses.
  • distributed/python/distributed_runner.py
    • Added the DistributedRunner Python class, responsible for compiling artifacts, preparing per-rank input data, building the C++ worker, launching distributed processes, and verifying results against golden references.
  • distributed/python/run_distributed_example.py
    • Added a Python CLI entry point for running distributed simpler examples, providing options for kernel directories, golden files, platform, number of ranks, and various execution control flags.
  • distributed/src/distributed_worker.cpp
    • Added the distributed_worker C++ executable, which handles per-device processes, including ACL/HCCL initialization, RootInfo exchange, HCCL resource allocation, dynamic loading of simpler's runtime, and execution of kernels based on CLI arguments.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/golden.py
    • Added a Python script defining the golden reference for distributed TREDUCE, including functions to generate per-rank inputs and compute the expected golden output for verification.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/kernels/aiv/treduce_kernel.cpp
    • Added an AIV kernel implementation for distributed TREDUCE, utilizing PTO communication instructions and HcclDeviceContext for remote memory access.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/kernels/kernel_config.py
    • Added kernel configuration for the aicpu_build_graph runtime, specifying orchestration source, kernel details, runtime parameters, and distributed configuration including buffer layouts and arguments.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/kernels/orchestration/treduce_orch.cpp
    • Added a C++ orchestration plugin for the aicpu_build_graph runtime, which reads arguments and builds a single AIV task using the aicpu_build_api.
  • examples/a2a3/host_build_graph/treduce_distributed/golden.py
    • Added a Python script defining the golden reference for distributed TREDUCE, including functions to generate per-rank inputs and compute the expected golden output for verification.
  • examples/a2a3/host_build_graph/treduce_distributed/kernels/aiv/treduce_kernel.cpp
    • Added an AIV kernel implementation for distributed TREDUCE, utilizing PTO communication instructions and HcclDeviceContext for remote memory access.
  • examples/a2a3/host_build_graph/treduce_distributed/kernels/kernel_config.py
    • Added kernel configuration for the host_build_graph runtime, specifying orchestration source, kernel details, runtime parameters, and distributed configuration including buffer layouts and arguments.
  • examples/a2a3/host_build_graph/treduce_distributed/kernels/orchestration/treduce_orch.cpp
    • Added a C++ orchestration function for the host_build_graph runtime, which processes device pointers and scalars to create a single AIV task.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/golden.py
    • Added a Python script defining the golden reference for distributed TREDUCE, including functions to generate per-rank inputs and compute the expected golden output for verification.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/kernels/aiv/treduce_kernel.cpp
    • Added an AIV kernel implementation for distributed TREDUCE, utilizing PTO communication instructions and HcclDeviceContext for remote memory access.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/kernels/kernel_config.py
    • Added kernel configuration for the tensormap_and_ringbuffer runtime, specifying orchestration source, kernel details, runtime parameters, and distributed configuration including buffer layouts and arguments.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/kernels/orchestration/treduce_orch.cpp
    • Added a C++ orchestration function for the tensormap_and_ringbuffer runtime, which uses the PTO2 API to submit an AIV task with scalar parameters.
Activity
  • No specific human activity (comments, reviews, progress updates) has been recorded for this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces a complete distributed execution framework for the simpler project — a well-structured and significant addition. It includes a generic C++ worker, Python-based orchestration scripts, and examples for three different runtimes. The overall design is clean and follows the "zero intrusion" principle toward the existing codebase.

My review found several areas for improvement, mainly around correctness, maintainability, and efficiency. Key points:

  • Default data-type handling in the Python runner could cause silent data corruption.
  • The C++ worker performs unnecessary memory copies.
  • The C++ worker uses magic numbers and handles HCCL-internal data structures in a fragile way.
  • The example kernels hard-code a rank-count limit, which could produce incorrect results.
  • There is substantial code duplication across the example directories (e.g. golden.py and treduce_kernel.cpp are identical in all three examples). Consider extracting these shared files to a common location.

Addressing these issues will improve the robustness and maintainability of this new distributed framework. Overall, an excellent contribution.

PKUZHOU pushed a commit to PKUZHOU/simpler that referenced this pull request Mar 18, 2026
- validate distributed buffer metadata and simplify output verification
- support explicit device selection in run_example.py and ci.sh for CI
- shrink treduce examples to 4 ranks, remove stale config, and guard invalid rank/root values
- rename the per-rank helper to distributed_worker.py and document buffer layout

Made-with: Cursor
- add backend-agnostic `comm_*` host APIs plus a2a3/a5 hardware and sim implementations so distributed runs share one communication abstraction
- add Python bindings, distributed runner orchestration, and per-rank worker support to drive multi-rank examples through `run_example.py`
- add distributed treduce examples for all three runtimes and fold in the PR hw-native-sys#307 review fixes for CI-friendly rank counts, explicit device selection, and stronger validation

Made-with: Cursor
