[TPU][Pallas] Fix example/cross_entropy.py on Pallas TPU #2002
yarongmu-google wants to merge 11 commits into pytorch:main from
Conversation
…errors and fix zero division in block size calculation
…py.py to avoid unaligned HBM gather

This optimizes the cross_entropy kernel to be hardware agnostic. By computing the target logits via a boolean mask over the streaming dense block, the kernel stays entirely within TensorCore/VMEM boundaries on TPU and is perfectly coalesced on GPU, eliminating the unaligned 1D HBM gather, which Pallas TC kernels do not natively support without SC DMA staging.
force-pushed from 7836006 to 54c65a0
# Flatten logits once at the beginning
logits_flat = logits.view(-1)
Why are we changing the kernel? Shouldn't we make the existing kernel work?
Making the existing kernel work means we have to use a fori_loop.
I avoided using a dynamic gather (e.g., hl.load([flat_indices])) because it forces the kernel out of the emit_pipeline execution model and performs an indirect, data-dependent read from HBM based on the label indices. TPUs are heavily optimized for large, contiguous memory bursts rather than sparse, 4-byte random accesses. By loading the full, contiguous rows of logits and using a boolean mask (chunk_logits * mask) to extract the target logit, we deliberately trade extremely cheap ALU operations for predictable, dense memory access. This lets the compiler keep the kernel in emit_pipeline mode, maximizing HBM bandwidth utilization and overlapping our memory loads with the masking compute.
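The mask-instead-of-gather trade can be sketched outside the kernel. This is not the Helion kernel itself, just a NumPy illustration of the access-pattern idea (the function name is made up for this example):

```python
import numpy as np

def target_logits_via_mask(logits, labels):
    """Extract logits[i, labels[i]] without a data-dependent gather.

    A gather would issue sparse, label-dependent reads; here we build a
    boolean one-hot mask over the dense class dimension and reduce, so
    every memory access stays dense and contiguous.
    """
    n, v = logits.shape
    class_ids = np.arange(v)[None, :]        # (1, V) column indices
    mask = class_ids == labels[:, None]      # (N, V) one-hot mask
    return np.sum(np.where(mask, logits, 0.0), axis=-1)

logits = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
labels = np.array([2, 0])
print(target_logits_via_mask(logits, labels))  # -> [3. 4.]
```

The extra multiplies/selects over the dense block are cheap ALU work; the win is that the compiler never has to leave the pipelined, dense-load execution model.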
Cleaned up PR in #2019
print("\n[DEBUG TRACE] Executing Pallas Kernel")
print("Original Order (Tensors):")
Sorry, this PR was messed up when syncing with upstream and contains extra files. I will abandon it and create a clean one based on the current upstream main: #2018
Replaced by #2019
The kernel currently has 2 common issues that need support:
Fix #2 is the bigger fix here. More about it:
The issue was that evaluating hl.load(logits_flat, [flat_indices]) maps to random reads from HBM, which TensorCores do not support. By changing the Helion code to apply the label == v_chunk_index boolean mask directly to the logits_rows that stream sequentially into VMEM, we eliminated the 1D sparse gather entirely.
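The per-chunk masking fits naturally into a streaming cross-entropy loop. A NumPy sketch of that structure (function name and chunk size are illustrative; the real kernel streams chunks through VMEM rather than slicing an array):

```python
import numpy as np

def cross_entropy_chunked(logits, labels, chunk=128):
    """Cross-entropy where the vocab axis is processed in dense chunks.

    Each chunk of columns is loaded densely; the target logit is picked
    out with a boolean mask over the chunk's column ids, and a running
    (max, sum) pair gives a numerically stable logsumexp.
    """
    n, v = logits.shape
    target = np.zeros(n)           # accumulated target logits
    m = np.full(n, -np.inf)        # running max for stable logsumexp
    s = np.zeros(n)                # running sum of exp(logit - m)
    for start in range(0, v, chunk):
        block = logits[:, start:start + chunk]           # dense "load"
        ids = np.arange(start, start + block.shape[1])   # chunk column ids
        mask = ids[None, :] == labels[:, None]           # label in chunk?
        target += np.sum(np.where(mask, block, 0.0), axis=1)
        new_m = np.maximum(m, block.max(axis=1))
        s = s * np.exp(m - new_m) + np.exp(block - new_m[:, None]).sum(axis=1)
        m = new_m
    return (m + np.log(s)) - target   # logsumexp(logits) - target logit
```

Because each chunk is consumed as a dense block, the loop body is exactly the kind of load-then-mask compute that can overlap with the next chunk's DMA in a pipelined kernel.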
The updated cross_entropy.py is now verified mathematically correct, fully functional, and autotunes on TPU v7 for smaller shapes.
After this PR:
Note: this PR depends on pytorch/pytorch#180252