feat(ascend): op-simple group — Add, Mul, Cast, Cat, Matmul, Gemm, Linear #65
Merged
Conversation
Collaborator
Author
merge test:
Ziminli
requested changes
Apr 20, 2026
78a0628 to 7eeec7a
…near

Seven foundational Ascend operators:

| op | impl |
|---|---|
| Add | aclnnAdd |
| Mul | aclnnMul |
| Cast | aclnnCast |
| Cat | aclnnCat |
| Matmul | aclnnMatmul |
| Gemm | aclnnMm (also carries the cached-executor / workspace-pool rework) |
| Linear | aclnnMatmul + optional bias |

Also ships:
- `src/base/<op>.h` for the 5 new ops (cast/cat/linear/matmul/mul); `add.h` and `gemm.h` existed on master and are updated in-place
- `src/cpu/<op>/<op>.h` reference impls for cast/cat/linear/mul (add/gemm/matmul had CPU refs on master already)
- `tests/test_<op>.py` for each operator (add and gemm have MODIFY diffs; others are new)
7eeec7a to 7649042
Ziminli
approved these changes
Apr 21, 2026
added 4 commits
April 21, 2026 16:15
…caches

- `add/kernel.h`: swap destroy() → release() on in_cache_/oth_cache_/out_cache_ and drop aclDestroyAclOpExecutor (both are referenced by the Repeatable executor; destroying them causes double-free at shutdown per the pattern documented in common.h and commit 64c367c).
- `cat/kernel.h`: release all in_caches_[i] in the destructor; without it, ~AclTensorCache() on vector teardown double-frees descriptors held by tensor_list_ / executor_.
- Also group the alpha_* storage members with blank lines to match file convention.
…entation_indices`

Replaces hardcoded `(0, 1)` / `(0, 1, 2)` tuples in test_add, test_gemm, test_rms_norm,
test_swiglu with a union over the locally-available devices' active implementation indices.
New helper `tests.utils.all_active_implementation_indices(op_cls)` only iterates
`get_available_devices()` to avoid `DispatchFunc::std::abort` on device types outside the
build's `ActiveDevices` set.

Effect on Ascend CI: skipped-test count drops from 3246 to 1686 — impl=1 (`cuBLASLt`) is no
longer parametrized when no CUDA device is visible, and RmsNorm/Swiglu's custom-kernel slot
drops out of the matrix on op-simple, where the framework layer hasn't merged the AscendC
impl yet.
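A minimal sketch of what that helper could look like, based only on the description above; where `get_available_devices()` lives and the exact return type are assumptions of the sketch, not taken from the PR:

```python
# tests/utils.py (sketch) -- not the PR's exact code.
# get_available_devices() is assumed to already exist in this module and to
# return the device strings visible to the local build, e.g. ["cpu", "ascend"].

def all_active_implementation_indices(op_cls):
    """Union of op_cls.active_implementation_indices(dev) over available devices.

    Iterating only get_available_devices() keeps dispatch from ever being
    queried for device types outside the build's ActiveDevices set, which
    would otherwise abort.
    """
    indices = set()
    for dev in get_available_devices():
        indices.update(op_cls.active_implementation_indices(dev))
    return sorted(indices)
```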
Replaces the per-test `@pytest.mark.parametrize("implementation_index", ...)`
+ runtime `if impl not in active_indices: skip` pattern with a single hook in
`conftest.pytest_generate_tests` that emits only the (device, impl) pairs
actually active on each device.
Rationale: kernel dispatch is per-device, so cross-device union (previous
`all_active_implementation_indices` helper) polluted the matrix with impls
that the selected device can't run — runtime-skipped noise. Joint generation
keeps the matrix to its semantic cell: "this device has this impl, so run it".
- `tests/conftest.py`: when both `device` and `implementation_index` are in
fixturenames, emit pairs via `op_cls.active_implementation_indices(dev)`;
fall back to a skipped placeholder (`id="skip"`) when no device has an
active impl, avoiding `[NOTSET-...]` test IDs.
- `tests/{test_add,test_gemm,test_rms_norm,test_swiglu}.py`: drop the hardcoded
`implementation_index` parametrize decorator and the runtime `active_indices`
guard — conftest now handles both.
- `tests/utils.py`: remove the `all_active_implementation_indices` helper
(superseded by per-device generation in conftest).
Same test outcome on Ascend CI (1935 passed / 1686 skipped), but the remaining
skips are now semantically mandatory (uint dtypes unsupported by `torch_npu`,
Gemm impl=2 SFINAE-only workaround, ops missing an ascend impl on op-simple
pending PR #66) rather than mechanism artifacts.
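A minimal sketch of a conftest hook matching the description above; how the op class is recovered from the test module name (here an inline snake_case → PascalCase lookup on `infini.ops`) and the import path of `get_available_devices()` are assumptions of the sketch, not the PR's exact code:

```python
# tests/conftest.py (sketch)
import pytest
import infini.ops as ops
from tests.utils import get_available_devices  # import path assumed


def pytest_generate_tests(metafunc):
    if not {"device", "implementation_index"} <= set(metafunc.fixturenames):
        return

    # Derive the op class from the test module name, e.g. test_rms_norm -> RmsNorm.
    snake = metafunc.module.__name__.rsplit(".", 1)[-1].removeprefix("test_")
    op_cls = getattr(ops, "".join(part.capitalize() for part in snake.split("_")))

    # Emit only the (device, impl) pairs actually active on each local device.
    pairs = [
        (dev, impl)
        for dev in get_available_devices()
        for impl in op_cls.active_implementation_indices(dev)
    ]
    if pairs:
        metafunc.parametrize(("device", "implementation_index"), pairs)
    else:
        # No device has an active impl: a single skipped placeholder keeps the
        # test ID readable ("[skip]") instead of "[NOTSET-...]".
        metafunc.parametrize(
            ("device", "implementation_index"),
            [pytest.param(None, None, id="skip",
                          marks=pytest.mark.skip(reason="no active implementation"))],
        )
```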
…undant fixture

Post-review cleanup of the joint-parametrize refactor (1dd288f):
- Extract `_op_class_from_module` as a shared helper; the `skip_op_without_platform_impl` fixture now calls it instead of re-deriving the snake→pascal class name inline.
- Short-circuit the fixture when `implementation_index` is already in callspec — `pytest_generate_tests` has already pruned empty-impl pairs, so per-case `active_implementation_indices` calls are wasted work.
- Drop `try/except ImportError` inside the helper — collection has already imported `infini.ops` via the test modules; masking a real import failure only turns it into a cryptic NOTSET fixture.
- Drop the `devices[0] if devices else "cpu"` fallback — `get_available_devices()` always includes `"cpu"`, making the `else` arm unreachable.
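A minimal sketch of the post-cleanup shape described above; the fixture's signature (including the `device` argument) and body are assumptions beyond what the commit message states:

```python
# tests/conftest.py (sketch of the cleanup)
import pytest
import infini.ops as ops  # already imported during collection via the test modules


def _op_class_from_module(module_name):
    # Shared snake_case -> PascalCase derivation, e.g. "test_rms_norm" -> RmsNorm.
    # No try/except ImportError: a real import failure should surface loudly
    # rather than degrade into a cryptic NOTSET fixture.
    snake = module_name.rsplit(".", 1)[-1].removeprefix("test_")
    return getattr(ops, "".join(part.capitalize() for part in snake.split("_")))


@pytest.fixture(autouse=True)
def skip_op_without_platform_impl(request, device):
    callspec = getattr(request.node, "callspec", None)
    if callspec is not None and "implementation_index" in callspec.params:
        # pytest_generate_tests already pruned (device, impl) pairs with no
        # active implementation, so re-checking here is wasted work.
        return
    op_cls = _op_class_from_module(request.module.__name__)
    if not op_cls.active_implementation_indices(device):
        pytest.skip(f"{op_cls.__name__} has no implementation on {device}")
```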
Collaborator
Author
ascend | nvidia | iluvatar | moore | cambricon | metax
Ziminli
requested changes
Apr 22, 2026
added 2 commits
April 22, 2026 13:54
…ables in Linear

Per PR #65 review:
- `src/cpu/cast/cast.h`: replace nested `DispatchFunc(in_dtype, ...)` inside `DispatchFunc(out_dtype, ...)` with a single multi-dispatch call `DispatchFunc<kCpu, AllTypes, AllTypes>({in, out}, [](in_tag, out_tag) {...})` per the multi-dispatch idiom documented in `CONTRIBUTING.md`.
- `src/cpu/linear/linear.h`: rename PascalCase locals to snake_case: `A/B/Out/Bias` → `a_ptr/b_ptr/out_ptr/bias_ptr`, `A_batch/B_batch/Out_batch` → `a_batch/b_batch/out_batch`, `M/N/K` → `m/n/k` (matching master's `src/cpu/gemm/gemm.h`, which already uses lowercase dim names `m_/n_/k_`).
- `if (bias_ptr && bias)` → `if (bias_ptr)` (line 75). `bias_ptr` is `nullptr` iff `!bias` by construction at line 38, so `&& bias` is dead.
- Remove the `// Determine m, n, k from shapes and transpose flags.` comment — the three lines below it literally do exactly that; self-describing now that names are snake_case.
Ziminli
approved these changes
Apr 22, 2026
Summary
Seven foundational Ascend operators — Add, Mul, Cast, Cat, Matmul, Gemm,
Linear — implemented via the ACLNN API set.
This is part 2 of 4 in the Ascend operator split (part 1 = feat/ascend-framework-pr).
Each category PR ships its operators as an atomic unit:
src/base/<op>.h declaration + src/ascend/<op>/*.h Ascend impl +
src/cpu/<op>/<op>.h CPU reference + tests/test_<op>.py.
Depends on: feat/ascend-framework-pr must merge first (shared framework headers, generator fixes, CI fixes, and test infra).
Operators

| op | impl |
|---|---|
| Add | aclnnAdd |
| Mul | aclnnMul |
| Cast | aclnnCast |
| Cat | aclnnCat |
| Matmul | aclnnMatmul (src/base/mat_mul.h — class renamed MatMul → Matmul) |
| Gemm | aclnnMm |
| Linear | aclnnMatmul + optional bias |

Each operator is declared in src/base/<op>.h. Kernels are header-only under
src/ascend/<op>/kernel.h; the build picks them up automatically through the
Ascend glob in src/CMakeLists.txt.

CPU reference implementations
src/cpu/{cast,cat,linear,mul}/ added as reference implementations for the new ops.
add, gemm, and matmul already had CPU references on master (mat_mul.h → matmul.h rename handled in this PR).
Removed
src/base/mat_mul.h — the old MatMul class had no implementation on any backend.
Replaced by the new Matmul class in src/base/matmul.h.

Verification
python3 .ci/run.py --local --gpu-id <N> (Ascend 910B + CANN 8.5.1): 3435 passed / 1746 skipped / 0 failed.
Operators without a platform implementation skip cleanly via the framework PR's
skip_op_without_platform_impl autouse fixture.
Test plan
- python3 .ci/run.py --local
- clang-format passes locally