
Conversation


@jberchtold-nvidia (Collaborator) commented Jan 15, 2026

Description

Depends on #2502

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

phu0ngng and others added 30 commits December 3, 2025 13:07
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
- Add FP8 scale_inv pointer handling in nvte_grouped_gemm for proper FP8 GEMM
- Fix random padding in tests to ensure 16-byte alignment for all dtypes
- Reorder GroupedGemmSetupWorkspace members for natural alignment
- Remove debug prints

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
- Remove unused alignment parameter from GroupedGemmSetupWorkspace::from_buffers
- Simplify select_grouped_operand by removing dead code branches
- Add GroupedOperandSelection.tensor field to avoid passing tensor separately
- Extract set_fp8_scale_pointers and init_matrix_layouts helpers
- Add safety check for FP8 on Hopper column-wise fallback
- Support NULL C tensor when beta=0 (uses D as placeholder)
- Remove unused get_scale_inv() from test
- Add use_null_c test parameter and test case
- Fix documentation: alpha/beta are single element tensors only

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
- Change alpha/beta from single values to per-matrix arrays
- Validate alpha/beta have exactly num_tensors elements
- Update kernel to index alpha_ptr[idx] and beta_ptr[idx]
- Move alpha/beta validation to validate_grouped_gemm_inputs
- Update tests to use per-matrix alpha/beta arrays
- Update documentation

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@jberchtold-nvidia marked this pull request as draft January 15, 2026 19:52

greptile-apps bot commented Jan 15, 2026

Greptile Summary

This PR adds support for batched einsum operations and grouped GEMM without device-to-host memory copies, enabling efficient Mixture-of-Experts (MoE) implementations with per-expert FP8 quantization in JAX.
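
The per-expert semantics can be sketched in plain JAX; the shapes and names below are illustrative, and the real entry point (transformer_engine/jax/einsum.py) additionally threads per-expert quantizer sets through each dense call:

    import jax
    import jax.numpy as jnp

    E, B, C, M, H = 4, 2, 8, 16, 32  # experts, batch, capacity, model, hidden

    x = jnp.ones((E, B, C, M), jnp.bfloat16)  # per-expert activations
    w = jnp.ones((E, M, H), jnp.bfloat16)     # per-expert weights

    # einsum("EBCM,EMH->EBCH", x, w): one independent GEMM per expert,
    # expressed as a dense matmul vmapped over the leading E axis.
    per_expert = lambda xi, wi: jnp.einsum("BCM,MH->BCH", xi, wi)
    out = jax.vmap(per_expert)(x, w)
    assert out.shape == (E, B, C, H)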

Key Changes

  • Grouped GEMM Implementation: New nvte_grouped_gemm C++ API using cuBLAS 13.1+ for batched matrix operations, with a GPU-side setup kernel to avoid D2H memcpy overhead
  • Einsum Support: JAX einsum implementation using the vmap+dense pattern for MoE workloads, supporting per-expert quantization with QuantizerSet arrays
  • Quantization Updates: Refactored grouped quantization to use batched primitives; changed scale initialization from jnp.empty to jnp.ones to prevent uninitialized-memory issues (see the sketch after this list)
  • Architecture Requirements: Grouped GEMM requires Blackwell (SM100+) and cuBLAS 13.1+, with version checks in place
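
On the scale-initialization point, a minimal sketch of why ones is the safer default (shapes are illustrative; the actual change lives in transformer_engine/jax/cpp_extensions/quantization.py):

    import jax.numpy as jnp

    num_experts = 4
    # Before: scale = jnp.empty((num_experts,), jnp.float32)
    # A scale read before the quantize kernel writes it is not guaranteed
    # to hold anything meaningful.
    # After: 1.0 is the multiplicative identity, so an unwritten scale
    # behaves as a no-op instead of corrupting dequantized values.
    scale = jnp.ones((num_experts,), jnp.float32)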

Testing

Comprehensive test coverage includes:

  • C++ unit tests for grouped GEMM with various shape configurations (uniform, varying first/last dims)
  • JAX tests for einsum with MoE patterns, gradients, and multiple FP8 recipes (a gradient check of this shape is sketched after this list)
  • Validation of forward/backward passes with numerical accuracy checks
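
The gradient checks follow the usual JAX pattern; in this sketch jnp.einsum stands in for the PR's einsum so the snippet runs standalone:

    import jax
    import jax.numpy as jnp

    def loss(x, w):
        # stand-in for the PR's einsum("EBCM,EMH->EBCH", ...)
        return jnp.einsum("EBCM,EMH->EBCH", x, w).sum()

    k1, k2 = jax.random.split(jax.random.PRNGKey(0))
    x = jax.random.normal(k1, (2, 3, 8, 16))
    w = jax.random.normal(k2, (2, 16, 32))

    # Exercises the backward pass through both operands, as the JAX tests do.
    gx, gw = jax.grad(loss, argnums=(0, 1))(x, w)
    assert gx.shape == x.shape and gw.shape == w.shape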

Temporary Workarounds

  • quantization.cpp:405-406: Memset to zero uninitialized buffer portions when over-allocated (noted as temporary fix needing investigation)
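
The idea of the workaround, in illustrative JAX terms (the real fix is the byte-offset cudaMemsetAsync call quoted in the review comment below):

    import jax.numpy as jnp

    max_rows, used_rows, n = 8, 5, 4        # illustrative sizes
    # NaN stands in for whatever an uninitialized over-allocated buffer holds.
    buf = jnp.full((max_rows, n), jnp.nan)
    # Zero only the unwritten tail; the valid rows are left untouched.
    buf = buf.at[used_rows:].set(0.0)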

Confidence Score: 4/5

  • This PR is safe to merge with minor style improvements recommended
  • The implementation is well-tested, with comprehensive unit tests for both the C++ and JAX layers. Architecture checks enforce the SM100+ requirement for grouped GEMM. The main concerns are the temporary workarounds (memset for over-allocated buffers), which should be monitored, and a few opportunities for code-style improvement. The core logic appears sound, with proper validation and error handling.
  • Pay close attention to transformer_engine/jax/csrc/extensions/quantization.cpp for the temporary memset workaround

Important Files Changed

  • transformer_engine/jax/csrc/extensions/quantization.cpp: memset for the quantization buffer now zeroes only the uninitialized portion, avoiding unnecessary overhead
  • transformer_engine/common/gemm/cublaslt_grouped_gemm.cu: new grouped GEMM implementation using cuBLAS 13.1+, with SM100+ architecture checks, comprehensive error handling, and a GPU-side setup kernel
  • transformer_engine/jax/einsum.py: new einsum implementation using vmap+dense for MoE with per-expert quantization; validates NN layout and a single batch dimension
  • transformer_engine/jax/cpp_extensions/quantization.py: scale initialization changed from empty to ones; batcher refactored onto the general batcher implementation
  • transformer_engine/jax/cpp_extensions/gemm.py: grouped GEMM support added, with new primitives, workspace management, and batched quantization integration
  • tests/jax/test_einsum.py: comprehensive test suite for einsum with MoE operations, gradients, and multiple FP8 recipes
Sequence Diagram

sequenceDiagram
    participant User as JAX User Code
    participant Einsum as einsum()
    participant Dense as dense()
    participant GEMM as GemmPrimitive
    participant Quant as GroupedQuantize
    participant CUDA as cuBLAS/CUDA

    User->>Einsum: einsum("EBCM,EMH->EBCH", x, w, quantizer_sets)
    Einsum->>Einsum: Parse equation & validate NN layout
    Einsum->>Einsum: Stack quantizer_sets into pytree
    
    Einsum->>Dense: vmap(dense_with_quantizer) over batch dim E
    
    loop For each expert (vmapped)
        Dense->>Quant: grouped_quantize(x, quantizer_i)
        Quant->>CUDA: GroupedQuantizeFFI (batched)
        CUDA-->>Quant: quantized tensors + scales
        
        Dense->>Quant: grouped_quantize(w, quantizer_i)
        Quant->>CUDA: GroupedQuantizeFFI (batched)
        CUDA-->>Quant: quantized tensors + scales
        
        Dense->>GEMM: gemm(x_q, w_q, scales)
        GEMM->>CUDA: nvte_grouped_gemm (if batched)
        Note over CUDA: GPU-side setup kernel<br/>No D2H memcpy
        CUDA->>CUDA: cublasLtMatmul (grouped)
        CUDA-->>GEMM: output
        GEMM-->>Dense: result
    end
    
    Dense-->>Einsum: vmapped outputs
    Einsum-->>User: final result
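
The "Stack quantizer_sets into pytree" step uses the standard JAX idiom of combining per-expert objects leaf-wise so vmap can map over the leading axis; the dict fields below are illustrative placeholders, not the actual QuantizerSet structure:

    import jax
    import jax.numpy as jnp

    # Four per-expert "quantizer sets" with identical pytree structure.
    sets = [{"scale": jnp.ones(()), "amax": jnp.zeros(())} for _ in range(4)]

    # Stack corresponding leaves so each leaf gains a leading expert axis.
    stacked = jax.tree_util.tree_map(lambda *leaves: jnp.stack(leaves), *sets)
    assert stacked["scale"].shape == (4,)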


@greptile-apps bot left a comment


24 files reviewed, 1 comment


Comment on lines +405 to +406
    cudaMemsetAsync(outputs->untyped_data() + used_output_size, 0,
                    outputs->size_bytes() - used_output_size, stream);

style: potential pointer arithmetic issue with untyped data

The expression outputs->untyped_data() + used_output_size does arithmetic on the untyped pointer; if untyped_data() returns void*, this compiles only as a compiler extension that treats the pointer as byte-addressed (char*-like). The explicit char* cast in the suggestion below is more portable. Also verify that used_output_size is calculated in bytes, not elements.

Suggested change

    - cudaMemsetAsync(outputs->untyped_data() + used_output_size, 0,
    -                 outputs->size_bytes() - used_output_size, stream);
    + size_t used_output_size = (sum_group_sizes * non_group_m) * n * output_dtype_bytes;
    + char* output_base = static_cast<char*>(outputs->untyped_data());
    + cudaMemsetAsync(output_base + used_output_size, 0, outputs->size_bytes() - used_output_size, stream);
