feat(ascend): op-simple group — Add, Mul, Cast, Cat, Matmul, Gemm, Linear #65

Merged
voltjia merged 7 commits into master from feat/ascend-op-simple
Apr 22, 2026
Conversation

zhangyue207 (Collaborator) commented Apr 18, 2026

Summary

Seven foundational Ascend operators — Add, Mul, Cast, Cat, Matmul, Gemm,
Linear — implemented via the ACLNN API set.

This is part 2 of 4 in the Ascend operator split (part 1 =
feat/ascend-framework-pr). Each category PR ships its operators as an
atomic unit: src/base/<op>.h declaration + src/ascend/<op>/*.h Ascend
impl + src/cpu/<op>/<op>.h CPU reference + tests/test_<op>.py.

Depends on: feat/ascend-framework-pr must merge first (shared
framework headers, generator fixes, CI fixes, and test infra).

Operators

| op | impl | src/base/<op>.h |
|---|---|---|
| Add | aclnnAdd | MODIFY (exists on master) |
| Mul | aclnnMul | NEW |
| Cast | aclnnCast | NEW |
| Cat | aclnnCat | NEW |
| Matmul | aclnnMatmul | NEW (replaces master's mat_mul.h — class renamed MatMul → Matmul) |
| Gemm | aclnnMm | MODIFY (exists on master) — also carries the cached-executor / workspace-pool rework used by all ACLNN operators |
| Linear | aclnnMatmul + optional bias | NEW |

Kernels are header-only under src/ascend/<op>/kernel.h; the build picks
them up automatically through the Ascend glob in src/CMakeLists.txt.

CPU reference implementations

src/cpu/{cast,cat,linear,mul}/ added as reference implementations for
the new ops. add, gemm, and matmul already had CPU references on
master (mat_mul.h → matmul.h rename handled in this PR).

Removed

src/base/mat_mul.h — the old MatMul class had no implementation on
any backend. Replaced by the new Matmul class in src/base/matmul.h.

Verification

  • python3 .ci/run.py --local --gpu-id <N> (Ascend 910B + CANN 8.5.1):
    3435 passed / 1746 skipped / 0 failed
  • Tests for ops not shipped in this PR (CausalSoftmax / RmsNorm / etc.)
    skip cleanly via the framework PR's skip_op_without_platform_impl
    autouse fixture.

Test plan

  • python3 .ci/run.py --local
  • clang-format passes locally
  • CUDA / Metax / Cambricon / Moore / Iluvatar regressions (CI-verified)

zhangyue207 (Collaborator, Author):

merge test:

[gw0] [ 99%] PASSED tests/test_swiglu.py::test_swiglu[npu-dtype2-0.01-0.005-1-shape3-input_strides3-gate_strides3-out_strides3] 
tests/test_swiglu.py::test_swiglu[npu-dtype2-0.01-0.005-1-shape4-None-None-None] 
[gw0] [ 99%] PASSED tests/test_swiglu.py::test_swiglu[npu-dtype2-0.01-0.005-1-shape4-None-None-None] 
tests/test_swiglu.py::test_swiglu[npu-dtype2-0.01-0.005-1-shape5-input_strides5-gate_strides5-out_strides5] 
[gw0] [ 99%] PASSED tests/test_swiglu.py::test_swiglu[npu-dtype2-0.01-0.005-1-shape5-input_strides5-gate_strides5-out_strides5] 
tests/test_swiglu.py::test_swiglu[npu-dtype2-0.01-0.005-1-shape6-None-None-None] 
[gw0] [ 99%] PASSED tests/test_swiglu.py::test_swiglu[npu-dtype2-0.01-0.005-1-shape6-None-None-None] 
tests/test_swiglu.py::test_swiglu[npu-dtype2-0.01-0.005-1-shape7-input_strides7-gate_strides7-out_strides7] 
[gw0] [100%] PASSED tests/test_swiglu.py::test_swiglu[npu-dtype2-0.01-0.005-1-shape7-input_strides7-gate_strides7-out_strides7] 

----------- generated xml file: /workspace/results/test-results.xml ------------
===================== 3767 passed, 1664 skipped in 45.95s ======================
========== Summary ==========
[warn] job ascend_npu: container exited with 137 (likely docker teardown SIGKILL after clean pytest); junit XML reports no failures — treating as success
EXIT=0

Ziminli (Collaborator) left a comment:


Overall there are some formatting issues and design problems with the custom_kernel folder; detailed feedback on the same issues is in #64.

Comment threads:
  • scripts/generate_wrappers.py (outdated)
  • src/ascend/cat/kernel.h (outdated)
  • src/ascend/custom/add_rms_norm/op_host/add_rms_norm.cpp
  • src/ascend/custom/rms_norm/op_host/rms_norm.cpp
  • src/ascend/custom/cmake/config_ascend.cmake
  • src/ascend/custom/add_rms_norm/op_host/add_rms_norm.cpp (outdated)
zhangyue207 force-pushed the feat/ascend-op-simple branch 13 times, most recently from 78a0628 to 7eeec7a on April 21, 2026 06:17
…near

Seven foundational Ascend operators:

| op | impl |
|---|---|
| Add | aclnnAdd |
| Mul | aclnnMul |
| Cast | aclnnCast |
| Cat | aclnnCat |
| Matmul | aclnnMatmul |
| Gemm | aclnnMm (also carries the cached-executor / workspace-pool rework) |
| Linear | aclnnMatmul + optional bias |

Also ships:
- `src/base/<op>.h` for the 5 new ops (cast/cat/linear/matmul/mul);
  `add.h` and `gemm.h` existed on master and are updated in-place
- `src/cpu/<op>/<op>.h` reference impls for cast/cat/linear/mul (add/gemm/matmul
  had CPU refs on master already)
- `tests/test_<op>.py` for each operator (add and gemm have MODIFY diffs;
  others are new)
zhangyue207 force-pushed the feat/ascend-op-simple branch from 7eeec7a to 7649042 on April 21, 2026 06:40
zhangyue added 4 commits April 21, 2026 16:15
…caches

- `add/kernel.h`: swap destroy() → release() on in_cache_/oth_cache_/out_cache_
  and drop aclDestroyAclOpExecutor (both are referenced by the Repeatable
  executor; destroying them causes double-free at shutdown per the pattern
  documented in common.h and commit 64c367c).
- `cat/kernel.h`: release all in_caches_[i] in the destructor; without it,
  ~AclTensorCache() on vector teardown double-frees descriptors held by
  tensor_list_ / executor_.
- Also group the alpha_* storage members with blank lines to match file
  convention.
…entation_indices`

Replaces hardcoded `(0, 1)` / `(0, 1, 2)` tuples in test_add, test_gemm,
test_rms_norm, test_swiglu with a union over the locally-available devices'
active implementation indices.

New helper `tests.utils.all_active_implementation_indices(op_cls)` iterates
only `get_available_devices()`, to avoid `DispatchFunc` hitting `std::abort`
on device types outside the build's `ActiveDevices` set.

Effect on Ascend CI: skipped-test count drops from 3246 to 1686 — impl=1
(`cuBLASLt`) no longer parametrized when no CUDA device is visible, and
RmsNorm/Swiglu's custom-kernel slot drops out of the matrix on op-simple
where the framework layer hasn't merged the AscendC impl yet.
Replaces the per-test `@pytest.mark.parametrize("implementation_index", ...)`
+ runtime `if impl not in active_indices: skip` pattern with a single hook in
`conftest.pytest_generate_tests` that emits only the (device, impl) pairs
actually active on each device.

Rationale: kernel dispatch is per-device, so the cross-device union (the
previous `all_active_implementation_indices` helper) polluted the matrix with
impls that the selected device can't run — runtime-skipped noise. Joint
generation keeps each matrix cell semantically meaningful: "this device has
this impl, so run it".

- `tests/conftest.py`: when both `device` and `implementation_index` are in
  fixturenames, emit pairs via `op_cls.active_implementation_indices(dev)`;
  fall back to a skipped placeholder (`id="skip"`) when no device has an
  active impl, avoiding `[NOTSET-...]` test IDs.
- `tests/{test_add,test_gemm,test_rms_norm,test_swiglu}.py`: drop the hardcoded
  `implementation_index` parametrize decorator and the runtime `active_indices`
  guard — conftest now handles both.
- `tests/utils.py`: remove the `all_active_implementation_indices` helper
  (superseded by per-device generation in conftest).

Same test outcome on Ascend CI (1935 passed / 1686 skipped), but the remaining
skips are now semantically mandatory (uint dtypes unsupported by `torch_npu`,
Gemm impl=2 SFINAE-only workaround, op missing an Ascend impl on op-simple
pending PR #66) rather than mechanism artifacts.
…undant fixture

Post-review cleanup of the joint-parametrize refactor (1dd288f):

- Extract `_op_class_from_module` as a shared helper; `skip_op_without_platform_impl` fixture now calls it instead of re-deriving the snake→pascal class name inline.
- Short-circuit the fixture when `implementation_index` is already in callspec — `pytest_generate_tests` has already pruned empty-impl pairs, so per-case `active_implementation_indices` calls are wasted work.
- Drop `try/except ImportError` inside the helper — collection has already imported `infini.ops` via test modules; masking a real import failure only turns it into a cryptic NOTSET fixture.
- Drop the `devices[0] if devices else "cpu"` fallback — `get_available_devices()` always includes `"cpu"`, making the `else` arm unreachable.
zhangyue207 (Collaborator, Author):

ascend:

[gw0] [ 99%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype1-0.001-0.001-shape5-input_strides5-gate_strides5-out_strides5] 
tests/test_swiglu.py::test_swiglu[skip-dtype1-0.001-0.001-shape6-None-None-None] 
[gw0] [ 99%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype1-0.001-0.001-shape6-None-None-None] 
tests/test_swiglu.py::test_swiglu[skip-dtype1-0.001-0.001-shape7-input_strides7-gate_strides7-out_strides7] 
[gw0] [ 99%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype1-0.001-0.001-shape7-input_strides7-gate_strides7-out_strides7] 
tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape0-None-None-None] 
[gw0] [ 99%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape0-None-None-None] 
tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape1-input_strides1-gate_strides1-out_strides1] 
[gw0] [ 99%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape1-input_strides1-gate_strides1-out_strides1] 
tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape2-None-None-None] 
[gw0] [ 99%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape2-None-None-None] 
tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape3-input_strides3-gate_strides3-out_strides3] 
[gw0] [ 99%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape3-input_strides3-gate_strides3-out_strides3] 
tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape4-None-None-None] 
[gw0] [ 99%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape4-None-None-None] 
tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape5-input_strides5-gate_strides5-out_strides5] 
[gw0] [ 99%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape5-input_strides5-gate_strides5-out_strides5] 
tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape6-None-None-None] 
[gw0] [ 99%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape6-None-None-None] 
tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape7-input_strides7-gate_strides7-out_strides7] 
[gw0] [100%] SKIPPED tests/test_swiglu.py::test_swiglu[skip-dtype2-0.01-0.005-shape7-input_strides7-gate_strides7-out_strides7] 

----------- generated xml file: /workspace/results/test-results.xml ------------
===================== 1935 passed, 1686 skipped in 32.37s ======================
========== Summary ==========
[warn] job ascend_npu: container exited with 137 (likely docker teardown SIGKILL after clean pytest); junit XML reports no failures — treating as success

nvidia:

=========================== short test summary info ============================
FAILED tests/test_cast.py::test_cast[cuda-input_dtype0-out_dtype0-0.001-0.001-shape0-None-None]
FAILED tests/test_linear.py::test_linear[cuda-dtype1-0.01-0.01-False-False-True-a_shape2-b_shape2-out_shape2]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-False-False-False-a_shape0-b_shape0-out_shape0]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-True-False-False-a_shape1-b_shape1-out_shape1]
FAILED tests/test_matmul.py::test_matmul[cpu-dtype1-0.01-0.01-False-False-a_shape3-b_shape3-c_shape3]
FAILED tests/test_matmul.py::test_matmul[cuda-dtype2-0.01-0.01-False-True-a_shape3-b_shape3-c_shape3]
FAILED tests/test_mul.py::test_mul[cuda-dtype0-1e-07-1e-07-shape0-None-None-None]
FAILED tests/test_mul.py::test_mul[cuda-dtype5-0-0-shape0-None-None-None]
FAILED tests/test_cast.py::test_cast[cuda-input_dtype0-out_dtype0-0.001-0.001-shape1-input_strides1-out_strides1]
FAILED tests/test_cast.py::test_cast[cuda-input_dtype5-out_dtype5-0.01-0.005-shape0-None-None]
FAILED tests/test_cat.py::test_cat[cuda-dtype0-1e-07-1e-07-shapes0-0-out_shape0]
FAILED tests/test_linear.py::test_linear[cuda-dtype1-0.01-0.01-False-False-True-a_shape3-b_shape3-out_shape3]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-False-False-True-a_shape0-b_shape0-out_shape0]
FAILED tests/test_linear.py::test_linear[cuda-dtype1-0.01-0.01-False-False-True-a_shape0-b_shape0-out_shape0]
FAILED tests/test_matmul.py::test_matmul[cpu-dtype1-0.01-0.01-False-False-a_shape0-b_shape0-c_shape0]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-True-False-False-a_shape4-b_shape4-out_shape4]
FAILED tests/test_matmul.py::test_matmul[cuda-dtype2-0.01-0.01-False-True-a_shape2-b_shape2-c_shape2]
FAILED tests/test_mul.py::test_mul[cuda-dtype0-1e-07-1e-07-shape1-input_strides1-other_strides1-out_strides1]
FAILED tests/test_mul.py::test_mul[cuda-dtype2-0.01-0.005-shape8-input_strides8-other_strides8-out_strides8]
FAILED tests/test_mul.py::test_mul[cuda-dtype5-0-0-shape11-input_strides11-other_strides11-out_strides11]
FAILED tests/test_mul.py::test_mul[cuda-dtype8-0-0-shape10-None-None-None]
FAILED tests/test_cast.py::test_cast[cuda-input_dtype3-out_dtype3-0.01-0.005-shape2-None-None]
FAILED tests/test_cat.py::test_cat[cuda-dtype2-0.01-0.005-shapes1-1-out_shape1]
FAILED tests/test_linear.py::test_linear[cuda-dtype1-0.01-0.01-False-True-False-a_shape3-b_shape3-out_shape3]
FAILED tests/test_linear.py::test_linear[cuda-dtype2-0.01-0.01-True-False-False-a_shape2-b_shape2-out_shape2]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-True-False-False-a_shape0-b_shape0-out_shape0]
FAILED tests/test_matmul.py::test_matmul[cpu-dtype0-0.01-0.01-True-True-a_shape2-b_shape2-c_shape2]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-True-True-True-a_shape0-b_shape0-out_shape0]
FAILED tests/test_matmul.py::test_matmul[cuda-dtype2-0.01-0.01-False-False-a_shape1-b_shape1-c_shape1]
FAILED tests/test_mul.py::test_mul[cuda-dtype1-0.001-0.001-shape10-None-None-None]
FAILED tests/test_mul.py::test_mul[cuda-dtype0-1e-07-1e-07-shape2-input_strides2-None-None]
FAILED tests/test_mul.py::test_mul[cuda-dtype7-0-0-shape3-None-None-None]
FAILED tests/test_cast.py::test_cast[cuda-input_dtype2-out_dtype2-0.01-0.005-shape0-None-None]
========== 33 failed, 6225 passed, 2053 skipped in 117.96s (0:01:57) ===========
Stage 'test' failed with exit code 1
========== Summary ==========
job nvidia_gpu failed (exit code 1)

iluvatar:

=========================== short test summary info ============================
FAILED tests/test_cast.py::test_cast[cuda-input_dtype0-out_dtype0-0.001-0.001-shape0-None-None]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-False-False-False-a_shape0-b_shape0-out_shape0]
FAILED tests/test_linear.py::test_linear[cuda-dtype2-0.01-0.01-True-False-False-a_shape1-b_shape1-out_shape1]
FAILED tests/test_linear.py::test_linear[cuda-dtype1-0.01-0.01-True-True-False-a_shape3-b_shape3-out_shape3]
FAILED tests/test_linear.py::test_linear[cuda-dtype1-0.01-0.01-False-False-False-a_shape1-b_shape1-out_shape1]
FAILED tests/test_matmul.py::test_matmul[cpu-dtype1-0.01-0.01-True-False-a_shape1-b_shape1-c_shape1]
FAILED tests/test_matmul.py::test_matmul[cuda-dtype1-0.01-0.01-True-True-a_shape3-b_shape3-c_shape3]
FAILED tests/test_mul.py::test_mul[cuda-dtype0-1e-07-1e-07-shape0-None-None-None]
FAILED tests/test_mul.py::test_mul[cuda-dtype2-0.01-0.005-shape11-input_strides11-other_strides11-out_strides11]
FAILED tests/test_mul.py::test_mul[cuda-dtype5-0-0-shape10-None-None-None] - ...
FAILED tests/test_cast.py::test_cast[cuda-input_dtype0-out_dtype0-0.001-0.001-shape1-input_strides1-out_strides1]
FAILED tests/test_cat.py::test_cat[cuda-dtype0-1e-07-1e-07-shapes4-0-out_shape4]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-False-False-False-a_shape2-b_shape2-out_shape2]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-True-True-False-a_shape2-b_shape2-out_shape2]
FAILED tests/test_matmul.py::test_matmul[cpu-dtype1-0.01-0.01-False-True-a_shape2-b_shape2-c_shape2]
FAILED tests/test_linear.py::test_linear[cuda-dtype2-0.01-0.01-True-False-False-a_shape0-b_shape0-out_shape0]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-False-False-False-a_shape1-b_shape1-out_shape1]
FAILED tests/test_linear.py::test_linear[cuda-dtype1-0.01-0.01-True-False-True-a_shape3-b_shape3-out_shape3]
FAILED tests/test_matmul.py::test_matmul[cuda-dtype2-0.01-0.01-True-True-a_shape0-b_shape0-c_shape0]
FAILED tests/test_mul.py::test_mul[cuda-dtype0-1e-07-1e-07-shape1-input_strides1-other_strides1-out_strides1]
FAILED tests/test_mul.py::test_mul[cuda-dtype3-0-0-shape0-None-None-None] - w...
FAILED tests/test_mul.py::test_mul[cuda-dtype6-0-0-shape2-input_strides2-None-None]
FAILED tests/test_matmul.py::test_matmul[cpu-dtype2-0.01-0.01-False-True-a_shape2-b_shape2-c_shape2]
FAILED tests/test_cast.py::test_cast[cuda-input_dtype0-out_dtype0-0.001-0.001-shape2-None-None]
FAILED tests/test_causal_softmax.py::test_causal_softmax[cuda-dtype0-1e-05-1e-05-shape0-None-None]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-True-True-True-a_shape2-b_shape2-out_shape2]
FAILED tests/test_linear.py::test_linear[cuda-dtype2-0.01-0.01-False-True-True-a_shape0-b_shape0-out_shape0]
FAILED tests/test_linear.py::test_linear[cuda-dtype1-0.01-0.01-True-True-False-a_shape0-b_shape0-out_shape0]
FAILED tests/test_mul.py::test_mul[cuda-dtype0-1e-07-1e-07-shape2-input_strides2-None-None]
FAILED tests/test_mul.py::test_mul[cuda-dtype0-1e-07-1e-07-shape5-input_strides5-other_strides5-None]
FAILED tests/test_mul.py::test_mul[cuda-dtype6-0-0-shape9-input_strides9-other_strides9-out_strides9]
FAILED tests/test_matmul.py::test_matmul[cuda-dtype0-0.01-0.01-False-True-a_shape3-b_shape3-c_shape3]
FAILED tests/test_linear.py::test_linear[cuda-dtype2-0.01-0.01-True-False-False-a_shape3-b_shape3-out_shape3]
FAILED tests/test_linear.py::test_linear[cuda-dtype0-0.01-0.05-False-False-False-a_shape3-b_shape3-out_shape3]
FAILED tests/test_cat.py::test_cat[cuda-dtype0-1e-07-1e-07-shapes0-0-out_shape0]
=========== 35 failed, 5766 passed, 1072 skipped in 73.83s (0:01:13) ===========
Stage 'test' failed with exit code 1
========== Summary ==========
job iluvatar_gpu failed (exit code 1)

moore:

zhangyue@mccx:~/InfiniOps$ python3 .ci/run.py --local --gpu-id 0 --stage build
platform: moore
==> running job: moore_gpu
========== Setup ==========
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing /tmp/src
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Requirement already satisfied: pytest in /usr/local/lib/python3.10/dist-packages (from InfiniOps==0.1.0) (7.2.2)
Requirement already satisfied: pytest-cov in /usr/local/lib/python3.10/dist-packages (from InfiniOps==0.1.0) (7.1.0)
Requirement already satisfied: pytest-xdist in /usr/local/lib/python3.10/dist-packages (from InfiniOps==0.1.0) (3.8.0)
Requirement already satisfied: ruff in /usr/local/lib/python3.10/dist-packages (from InfiniOps==0.1.0) (0.15.7)
Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from InfiniOps==0.1.0) (2.5.0)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from InfiniOps==0.1.0) (6.0.2)
Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.10/dist-packages (from pytest->InfiniOps==0.1.0) (25.3.0)
Requirement already satisfied: iniconfig in /usr/local/lib/python3.10/dist-packages (from pytest->InfiniOps==0.1.0) (2.1.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from pytest->InfiniOps==0.1.0) (24.2)
Requirement already satisfied: pluggy<2.0,>=0.12 in /usr/local/lib/python3.10/dist-packages (from pytest->InfiniOps==0.1.0) (1.6.0)
Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /usr/local/lib/python3.10/dist-packages (from pytest->InfiniOps==0.1.0) (1.3.0)
Requirement already satisfied: tomli>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from pytest->InfiniOps==0.1.0) (2.2.1)
Requirement already satisfied: typing-extensions>=4.6.0 in /usr/local/lib/python3.10/dist-packages (from exceptiongroup>=1.0.0rc8->pytest->InfiniOps==0.1.0) (4.15.0)
Requirement already satisfied: coverage>=7.10.6 in /usr/local/lib/python3.10/dist-packages (from coverage[toml]>=7.10.6->pytest-cov->InfiniOps==0.1.0) (7.13.5)
Requirement already satisfied: execnet>=2.1 in /usr/local/lib/python3.10/dist-packages (from pytest-xdist->InfiniOps==0.1.0) (2.1.2)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch->InfiniOps==0.1.0) (3.19.1)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->InfiniOps==0.1.0) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->InfiniOps==0.1.0) (3.1.6)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch->InfiniOps==0.1.0) (2025.9.0)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch->InfiniOps==0.1.0) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch->InfiniOps==0.1.0) (1.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->InfiniOps==0.1.0) (3.0.2)
Building wheels for collected packages: InfiniOps
  Building wheel for InfiniOps (pyproject.toml): started
  Building wheel for InfiniOps (pyproject.toml): still running...
  Building wheel for InfiniOps (pyproject.toml): still running...
  Building wheel for InfiniOps (pyproject.toml): finished with status 'done'
  Created wheel for InfiniOps: filename=infiniops-0.1.0-cp310-cp310-linux_x86_64.whl size=618509 sha256=1a741ff896a5cfaf9e735b38db0130812ed36277c0a71caf64bc402af6805ef0
  Stored in directory: /tmp/pip-ephem-wheel-cache-zqc59rna/wheels/ac/4c/a5/78fe3376fbe0f633e8ad47ec3e677a6762cbf147a5e0195bab
Successfully built InfiniOps
Installing collected packages: InfiniOps
Successfully installed InfiniOps-0.1.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

[notice] A new release of pip is available: 25.2 -> 26.0.1
[notice] To update, run: python -m pip install --upgrade pip
========== Stage: build ==========
========== Summary ==========

cambricon:

[zhangyue@localhost InfiniOps]$ python .ci/run.py --local --stage build
platform: cambricon
==> running job: cambricon_gpu
========== Setup ==========
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Processing /tmp/src
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Requirement already satisfied: ruff in /usr/local/python3.10/lib/python3.10/site-packages (from InfiniOps==0.1.0) (0.15.7)
Requirement already satisfied: torch in /usr/local/python3.10/lib/python3.10/site-packages (from InfiniOps==0.1.0) (2.1.0)
Requirement already satisfied: pytest in /usr/local/python3.10/lib/python3.10/site-packages (from InfiniOps==0.1.0) (9.0.2)
Requirement already satisfied: pytest-cov in /usr/local/python3.10/lib/python3.10/site-packages (from InfiniOps==0.1.0) (7.1.0)
Requirement already satisfied: pytest-xdist in /usr/local/python3.10/lib/python3.10/site-packages (from InfiniOps==0.1.0) (3.8.0)
Requirement already satisfied: pyyaml in /usr/local/python3.10/lib/python3.10/site-packages (from InfiniOps==0.1.0) (5.3.1)
Requirement already satisfied: pygments>=2.7.2 in /usr/local/python3.10/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (2.19.2)
Requirement already satisfied: pluggy<2,>=1.5 in /usr/local/python3.10/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (1.6.0)
Requirement already satisfied: exceptiongroup>=1 in /usr/local/python3.10/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (1.3.0)
Requirement already satisfied: tomli>=1 in /usr/local/python3.10/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (2.4.0)
Requirement already satisfied: packaging>=22 in /usr/local/python3.10/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (25.0)
Requirement already satisfied: iniconfig>=1.0.1 in /usr/local/python3.10/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (2.3.0)
Requirement already satisfied: coverage[toml]>=7.10.6 in /usr/local/python3.10/lib/python3.10/site-packages (from pytest-cov->InfiniOps==0.1.0) (7.13.5)
Requirement already satisfied: execnet>=2.1 in /usr/local/python3.10/lib/python3.10/site-packages (from pytest-xdist->InfiniOps==0.1.0) (2.1.2)
Requirement already satisfied: sympy in /usr/local/python3.10/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (1.14.0)
Requirement already satisfied: fsspec in /usr/local/python3.10/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (2025.5.1)
Requirement already satisfied: typing-extensions in /usr/local/python3.10/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (4.14.0)
Requirement already satisfied: networkx in /usr/local/python3.10/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/python3.10/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (3.1.6)
Requirement already satisfied: filelock in /usr/local/python3.10/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (3.18.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/python3.10/lib/python3.10/site-packages (from jinja2->torch->InfiniOps==0.1.0) (3.0.2)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/python3.10/lib/python3.10/site-packages (from sympy->torch->InfiniOps==0.1.0) (1.3.0)
Building wheels for collected packages: InfiniOps
  Building wheel for InfiniOps (pyproject.toml): started
  Building wheel for InfiniOps (pyproject.toml): still running...
  Building wheel for InfiniOps (pyproject.toml): finished with status 'done'
  Created wheel for InfiniOps: filename=infiniops-0.1.0-cp310-cp310-linux_aarch64.whl size=276075 sha256=d63c6618589beb7b4740344b69b11262eae6475b75487b379dedf871aac90d42
  Stored in directory: /tmp/pip-ephem-wheel-cache-v2w22n8k/wheels/ac/4c/a5/78fe3376fbe0f633e8ad47ec3e677a6762cbf147a5e0195bab
Successfully built InfiniOps
Installing collected packages: InfiniOps
Successfully installed InfiniOps-0.1.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: There was an error checking the latest version of pip.
========== Stage: build ==========
========== Summary ==========

metax:

zhangyue@test:~/InfiniOps$ python3 .ci/run.py  --local --stage build
platform: metax
==> running job: metax_gpu
========== Setup ==========
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Processing /tmp/src
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Requirement already satisfied: pytest in /opt/conda/lib/python3.10/site-packages (from InfiniOps==0.1.0) (8.4.1)
Requirement already satisfied: pytest-cov in /opt/conda/lib/python3.10/site-packages (from InfiniOps==0.1.0) (7.1.0)
Requirement already satisfied: pytest-xdist in /opt/conda/lib/python3.10/site-packages (from InfiniOps==0.1.0) (3.8.0)
Requirement already satisfied: ruff in /opt/conda/lib/python3.10/site-packages (from InfiniOps==0.1.0) (0.15.7)
Requirement already satisfied: torch in /opt/conda/lib/python3.10/site-packages (from InfiniOps==0.1.0) (2.4.0+metax3.2.1.3)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.10/site-packages (from InfiniOps==0.1.0) (6.0.3)
Requirement already satisfied: exceptiongroup>=1 in /opt/conda/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (1.3.0)
Requirement already satisfied: iniconfig>=1 in /opt/conda/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (2.1.0)
Requirement already satisfied: packaging>=20 in /opt/conda/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (25.0)
Requirement already satisfied: pluggy<2,>=1.5 in /opt/conda/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (1.6.0)
Requirement already satisfied: pygments>=2.7.2 in /opt/conda/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (2.19.2)
Requirement already satisfied: tomli>=1 in /opt/conda/lib/python3.10/site-packages (from pytest->InfiniOps==0.1.0) (2.3.0)
Requirement already satisfied: coverage>=7.10.6 in /opt/conda/lib/python3.10/site-packages (from coverage[toml]>=7.10.6->pytest-cov->InfiniOps==0.1.0) (7.11.0)
Requirement already satisfied: execnet>=2.1 in /opt/conda/lib/python3.10/site-packages (from pytest-xdist->InfiniOps==0.1.0) (2.1.2)
Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (3.20.0)
Requirement already satisfied: typing-extensions>=4.8.0 in /opt/conda/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (4.15.0)
Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (1.14.0)
Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (3.4.2)
Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (3.1.6)
Requirement already satisfied: fsspec in /opt/conda/lib/python3.10/site-packages (from torch->InfiniOps==0.1.0) (2025.5.1)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2->torch->InfiniOps==0.1.0) (3.0.2)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/lib/python3.10/site-packages (from sympy->torch->InfiniOps==0.1.0) (1.3.0)
Building wheels for collected packages: InfiniOps
  Building wheel for InfiniOps (pyproject.toml): started
  Building wheel for InfiniOps (pyproject.toml): still running...
  Building wheel for InfiniOps (pyproject.toml): still running...
  Building wheel for InfiniOps (pyproject.toml): finished with status 'done'
  Created wheel for InfiniOps: filename=infiniops-0.1.0-cp310-cp310-linux_x86_64.whl size=791843 sha256=96c5b77525dbc526324ae52ba9fff744bbde6ba0a7b10612d702af04fe8fe630
  Stored in directory: /tmp/pip-ephem-wheel-cache-is6c9cut/wheels/ac/4c/a5/78fe3376fbe0f633e8ad47ec3e677a6762cbf147a5e0195bab
Successfully built InfiniOps
Installing collected packages: InfiniOps
Successfully installed InfiniOps-0.1.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
========== Stage: build ==========
========== Summary ==========

voltjia requested a review from Ziminli on April 22, 2026 01:41
Comment threads:
  • src/cpu/cast/cast.h (outdated)
  • src/cpu/linear/linear.h (outdated)
  • src/cpu/linear/linear.h (outdated)
zhangyue added 2 commits April 22, 2026 13:54
…ables in Linear

Per PR #65 review:

- `src/cpu/cast/cast.h`: replace nested `DispatchFunc(in_dtype, ...)` inside
  `DispatchFunc(out_dtype, ...)` with a single multi-dispatch call
  `DispatchFunc<kCpu, AllTypes, AllTypes>({in, out}, [](in_tag, out_tag) {...})`
  per the multi-dispatch idiom documented in `CONTRIBUTING.md`.
- `src/cpu/linear/linear.h`: rename PascalCase locals to snake_case:
  `A/B/Out/Bias` → `a_ptr/b_ptr/out_ptr/bias_ptr`,
  `A_batch/B_batch/Out_batch` → `a_batch/b_batch/out_batch`,
  `M/N/K` → `m/n/k` (matching master's `src/cpu/gemm/gemm.h` which already
  uses lowercase dim names `m_/n_/k_`).
- `if (bias_ptr && bias)` → `if (bias_ptr)` (line 75). `bias_ptr` is
  `nullptr` iff `!bias` by construction at line 38, so `&& bias` is dead.
- Remove `// Determine `m`, `n`, `k` from shapes and transpose flags.` —
  the three lines below literally do exactly that; self-describing now that
  names are snake_case.
voltjia merged commit 13cf84a into master on Apr 22, 2026
4 checks passed
voltjia deleted the feat/ascend-op-simple branch on April 22, 2026 06:31