[TPU][Pallas]Fix example/cross_entropy.py on Pallas TPU#2019

Open
yarongmu-google wants to merge 6 commits intopytorch:mainfrom
yarongmu-google:fix-pallas-dtype-mapping-clean
Conversation

@yarongmu-google
Collaborator

The kernel currently hits two common issues that need support:

  1. Long types are not supported in Pallas/Mosaic (XLA does support them, but Helion doesn't go through XLA).
  2. Directly indexing into vectors on HBM.

#2 is the bigger fix here. More about it:

The issue was that evaluating hl.load(logits_flat, [flat_indices]) maps to random reads from HBM, which TensorCores do not support. By changing the Helion code to apply the label == v_chunk_index boolean mask directly across the logits rows as they stream sequentially into VMEM, we eliminate the 1D sparse gather entirely.
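The masked rewrite can be sketched in plain NumPy (shapes and names here are illustrative, not the Helion kernel itself) to show why the two formulations compute the same target logits:

```python
import numpy as np

# Hypothetical shapes; any (N, V) works.
N, V = 4, 8
rng = np.random.default_rng(0)
logits = rng.standard_normal((N, V)).astype(np.float32)
labels = rng.integers(0, V, size=N)

# Gather formulation (what hl.load(logits_flat, [flat_indices]) expresses):
# random reads into a flattened buffer, which Pallas TensorCore kernels
# do not support.
flat_indices = np.arange(N) * V + labels
target_gather = logits.reshape(-1)[flat_indices]

# Masked formulation: compare each column index against the label and
# reduce. Every element of the dense block is touched in order, so the
# access pattern is a plain streaming read instead of a sparse gather.
col_index = np.arange(V)                 # plays the role of v_chunk_index
mask = labels[:, None] == col_index      # (N, V) boolean, one True per row
target_masked = np.where(mask, logits, 0.0).sum(axis=1)

assert np.allclose(target_gather, target_masked)
```

Because exactly one position per row is True, the masked sum reproduces the gathered value while only ever reading the dense block sequentially.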

The updated cross_entropy.py is now verified mathematically correct, fully functional, and autotunes on the TPU v7s on smaller shapes.

After this PR:

=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion               0.3826       1.10x          
torch                0.4208       1.00x (ref)    
=================================================================

Note: this PR depends on pytorch/pytorch#180252

@norx1991
Contributor

FYI, there is #1950 for the long type part.

@yarongmu-google
Collaborator Author

> FYI, there is #1950 for the long type part.

Thanks. Any idea why those type-mapping classes were not updated? Were they actually not needed?

@norx1991
Contributor

norx1991 commented Apr 15, 2026

> FYI, there is #1950 for the long type part.
>
> Thanks. Any idea why those type-mapping classes were not updated? Were they actually not needed?

The idea is that if a long type is not really needed on TPU, the user can use the newly added LONG_INT_TYPE. If it is really needed, then casting will not help, so the type mapping does not need to be updated; we reject 64-bit data types directly.
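A minimal host-side sketch of that contract (the names and the cast are illustrative; LONG_INT_TYPE itself comes from #1950): when label values fit in 32 bits, the caller narrows them before the kernel ever sees a rejected 64-bit dtype.

```python
import numpy as np

# Labels often arrive as int64 ("long"), which Pallas/Mosaic rejects.
# When the values fit, the host code can narrow them up front; when they
# truly need 64 bits, no cast can help and rejection is the right outcome.
labels = np.array([3, 17, 999], dtype=np.int64)
assert labels.max() <= np.iinfo(np.int32).max

labels_i32 = labels.astype(np.int32)  # safe: values verified to fit
```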

@yarongmu-google yarongmu-google marked this pull request as ready for review April 15, 2026 18:10
…errors and fix zero division in block size calculation
…py.py to avoid unaligned HBM gather

This optimizes the cross_entropy kernel to be hardware agnostic. By computing the target logits via a boolean mask over the streaming dense block, it stays entirely within TensorCore/VMEM boundaries on TPU and remains perfectly coalesced on GPU, eliminating the unaligned 1D HBM gather that Pallas TC kernels do not natively support without SC DMA staging.
@yarongmu-google yarongmu-google force-pushed the fix-pallas-dtype-mapping-clean branch from 0374f54 to 0c1a4b3 Compare April 16, 2026 00:21
@AmesingFlank
Contributor

I'd recommend creating a new example instead of modifying the existing one. Ideally, we would make the compiler smart enough that, even for the original example, it could generate masked indexing that works on TPU, so there's value in keeping that example around.

@yarongmu-google
Collaborator Author

> FYI, there is #1950 for the long type part.
>
> Thanks. Any idea why those type-mapping classes were not updated? Were they actually not needed?
>
> The idea is that if a long type is not really needed on TPU, the user can use the newly added LONG_INT_TYPE. If it is really needed, then casting will not help, so the type mapping does not need to be updated; we reject 64-bit data types directly.

Gotcha. Reverted.

@yarongmu-google yarongmu-google force-pushed the fix-pallas-dtype-mapping-clean branch from b4c4ad8 to 29ec3ba Compare April 16, 2026 00:49
Labels

CLA Signed This label is managed by the Meta Open Source bot.
