Submit prefill #2
Open
tomasruizt wants to merge 16 commits into main
Conversation
Split the inlined FLA Triton kernels into fla_prefill_code.py (1763 lines), keeping the competition entry points in prefill_kernel.py. Fixes COMPILE_ERROR on the eval server, where fla is not installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
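For context, a minimal sketch of the import layout this split enables; the module path and symbol name below are assumptions, not necessarily what fla_prefill_code.py exports:

```python
# prefill_kernel.py -- competition entry point (sketch)
try:
    # use the installed library when developing locally
    from fla.ops.gated_delta_rule import chunk_gated_delta_rule
except ImportError:
    # eval server has no `fla`: fall back to the vendored Triton kernels
    from fla_prefill_code import chunk_gated_delta_rule

def prefill(q, k, v, g, beta):
    """Delegates to the (possibly vendored) kernel."""
    return chunk_gated_delta_rule(q, k, v, g, beta)
```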
Allows targeting specific workloads (e.g. idx=53 for B=64) to compare kernel performance across batch sizes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
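A sketch of what the targeting flag might look like; the workload table and timing helper are placeholders, not the repo's actual benchmark code:

```python
import argparse

# placeholder table: the real benchmark has 54 decode workloads
WORKLOADS = [{"idx": i, "B": b} for i, b in enumerate([1, 2, 4, 8, 16, 32, 64] * 8)][:54]

def measure(w):  # placeholder for the CUPTI timing path
    print(f"workload {w['idx']}: B={w['B']}")

parser = argparse.ArgumentParser()
parser.add_argument("--idx", type=int, default=None,
                    help="benchmark only this workload (e.g. 53 for B=64)")
args = parser.parse_args()

for w in (WORKLOADS if args.idx is None else [WORKLOADS[args.idx]]):
    measure(w)
```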
CUPTI measurements across all 54 decode workloads show fla-recurrent wins at B<=4 (1.30x at B=1) but loses at B>=16 (1.43x slower at B=64). Crossover at B=8. This explains the 0.96x competition result. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
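A minimal dispatch sketch of what these numbers imply; the callables and threshold are placeholders, not the repo's API:

```python
CROSSOVER_BATCH = 8  # measured: fla-recurrent wins at B<=4, loses at B>=16

def pick_decode_kernel(batch_size: int, fla_recurrent, fi_baseline):
    """Route small batches to fla-recurrent, large ones to fi-baseline."""
    return fla_recurrent if batch_size <= CROSSOVER_BATCH else fi_baseline
```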
NCU at B=64 shows fla-recurrent executes 2.2x more instructions than fi-baseline (13.4M vs 6M) due to BV=8 creating 16 V-tiles per head, each redundantly loading q/k/gates. Batch-adaptive BV is the fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Autotune sweep on B200 (CUPTI) found BV=8 always wins, but warp count is the key lever: fewer warps reduce contention at high occupancy. B=64 improved 18.18 -> 15.01 us (17%), crossover moved from B=8 to B=16. Also fix pre-existing ruff E402/E731 lint issues in kernel.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
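Expressed as a Triton autotune table, the sweep's conclusion looks roughly like this; config values are illustrative, and the real kernel body lives in fla_prefill_code.py:

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        # BV=8 always won; only the warp count is swept
        triton.Config({"BV": 8}, num_warps=1),
        triton.Config({"BV": 8}, num_warps=2),
        triton.Config({"BV": 8}, num_warps=4),
    ],
    key=["B"],  # re-tune per batch size, since the best warp count changes with it
)
@triton.jit
def recurrent_decode_kernel(q, k, v, o, B, BV: tl.constexpr):
    pass  # body elided; see the vendored kernels
```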
BV=128 (18.18 us) and fla-tma (34.88 us) both worse than BV=8 with adaptive warps (15.01 us) at B=64. Remaining 1.18x gap to fi-baseline is from 2.2x instruction count difference due to redundant q/k loads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The prefill wrapper was computing gates (exp, softplus, sigmoid) as separate PyTorch ops, launching ~8 elementwise kernels before the cumsum. The new fused_gate_cumsum_kernel computes g and beta from raw a, b, dt_bias, A_log inputs and applies cumsum in one pass. Workload 0 (seq_len=6): 1742 -> 1201 us median (31% faster, RTX 3090). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
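A hedged sketch of that fusion, assuming the usual gated-delta-rule gating math (g = -exp(A_log) * softplus(a + dt_bias), beta = sigmoid(b)); the repo's actual kernel may differ in layout:

```python
import triton
import triton.language as tl

@triton.jit
def fused_gate_cumsum_kernel(
    a_ptr, b_ptr,            # raw gate inputs, [T, H]
    dt_bias_ptr, A_log_ptr,  # per-head parameters, [H]
    g_ptr, beta_ptr,         # outputs, [T, H]
    T, H,
    BT: tl.constexpr,        # power of 2 >= T
):
    h = tl.program_id(0)                 # one program per head
    t = tl.arange(0, BT)
    mask = t < T

    a = tl.load(a_ptr + t * H + h, mask=mask, other=0.0).to(tl.float32)
    b = tl.load(b_ptr + t * H + h, mask=mask, other=0.0).to(tl.float32)
    dt_bias = tl.load(dt_bias_ptr + h).to(tl.float32)
    A_log = tl.load(A_log_ptr + h).to(tl.float32)

    # softplus(a + dt_bias), numerically stable for large x
    x = a + dt_bias
    sp = tl.where(x > 20.0, x, tl.log(1.0 + tl.exp(x)))
    g = -tl.exp(A_log) * sp              # per-step log-decay
    beta = tl.sigmoid(b)

    g_cum = tl.cumsum(g, axis=0)         # fused cumsum over the sequence
    tl.store(g_ptr + t * H + h, g_cum, mask=mask)
    tl.store(beta_ptr + t * H + h, beta, mask=mask)
```

Launched with grid (H,) and BT = triton.next_power_of_2(seq_len), the whole gate pipeline becomes a single launch instead of ~8 elementwise kernels plus a separate cumsum.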
93% of prefill wall time is GPU idle (launch overhead), not compute. Main bottleneck: 51 kernel launches per forward pass with 82-150 us gaps before each Triton kernel. Remaining opportunities: caching prepare_chunk_indices and CUDA graphs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
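A minimal sketch of the CUDA-graph direction mentioned here, using the standard torch.cuda.graph API; the forward function and buffer shapes are placeholders for the real 51-launch pass:

```python
import torch

def forward(x):  # placeholder for the real 51-kernel forward pass
    return torch.relu(x) * 2.0

static_x = torch.randn(64, 128, device="cuda")

# warm up outside capture: Triton autotuning must not run inside the graph
for _ in range(3):
    forward(static_x)
torch.cuda.synchronize()

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = forward(static_x)

def fast_forward(x):
    static_x.copy_(x)   # graphs replay fixed addresses, so copy in-place
    graph.replay()
    return static_out
```

Replaying the graph issues all captured kernels in one call, which is what removes the 82-150 us per-launch gaps.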
Add a SOLUTION env var to run_modal.py that loads a pre-existing solution JSON from the dataset instead of packing from source. New targets:
make modal-official-prefill           # our solution on B200
make modal-official-prefill-baseline  # FlashInfer baseline on B200
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
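A hedged sketch of the switch, with pack_from_source() as a placeholder for the repo's actual packer:

```python
import json
import os

def pack_from_source() -> dict:
    # placeholder: the real packer bundles the kernel sources into JSON
    return {"files": {}}

def load_solution() -> dict:
    path = os.environ.get("SOLUTION")
    if path:
        with open(path) as f:
            return json.load(f)   # pre-existing solution from the dataset
    return pack_from_source()     # default behavior: pack from source
```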
Set TRITON_CACHE_DIR to /data/cache/triton on the persistent Modal volume so compiled kernels and autotune configs survive container restarts. Applied to all Modal remote functions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
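TRITON_CACHE_DIR is Triton's standard cache-location env var; the key detail is setting it before Triton is imported. A minimal sketch, assuming the Modal volume is mounted at /data:

```python
import os

CACHE = "/data/cache/triton"            # persistent Modal volume path
os.makedirs(CACHE, exist_ok=True)
os.environ["TRITON_CACHE_DIR"] = CACHE  # must be set before `import triton`

import triton  # noqa: E402 -- intentionally after the env var
```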
Wraps flashinfer.gdn_prefill.chunk_gated_delta_rule (SM90+ only) as a prefill-fi-baseline algo for comparing against our prefill-fla-chunk on Modal B200 via:
make prefill-fi-timing-modal
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
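A sketch of what such a wrapper could look like; only the module path comes from this commit, the call signature is an assumption:

```python
import torch

def prefill_fi_baseline(q, k, v, g, beta):
    major, _ = torch.cuda.get_device_capability()
    if major < 9:
        raise RuntimeError("fi-baseline requires SM90+ (Hopper/Blackwell)")
    from flashinfer.gdn_prefill import chunk_gated_delta_rule
    return chunk_gated_delta_rule(q, k, v, g, beta)  # signature assumed
```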
Prefill CUPTI on B200: fi-baseline 185 us vs fla-chunk 1536 us (8.3x gap from launch overhead). Added Makefile targets and Modal notes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aphs B200 CUPTI: fi-baseline 185 us vs fla-chunk 1536 us (8.3x gap). Next steps: (1) cache prepare_chunk_indices (~260 us), (2) CUDA graph capture (~1100 us), (3) kernel fusion (longer-term). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
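For step (1), a minimal memoization sketch: prepare_chunk_indices depends only on the sequence lengths and chunk size, which repeat across forward passes. The vendored import and the wrapped signature are assumptions:

```python
import torch
from functools import lru_cache
from fla_prefill_code import prepare_chunk_indices  # vendored helper (assumed)

@lru_cache(maxsize=64)
def _cached_chunk_indices(cu_seqlens_key: tuple, chunk_size: int):
    cu_seqlens = torch.tensor(cu_seqlens_key, device="cuda", dtype=torch.int32)
    return prepare_chunk_indices(cu_seqlens, chunk_size)

def get_chunk_indices(cu_seqlens: torch.Tensor, chunk_size: int):
    # tensors are unhashable; key on their values, which repeat across steps
    return _cached_chunk_indices(tuple(cu_seqlens.tolist()), chunk_size)
```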
Force-pushed from 42b9fda to 43ed859
No description provided.