
Microbenchmarking and CI performance regression test#478

Open
matthiasdiener wants to merge 19 commits into dev from mdiener/ci-microbench

Conversation

Contributor

@matthiasdiener matthiasdiener commented Mar 10, 2026

Description

Open questions:

  • How should performance-related environment variables be handled? (e.g., using Triton kernels for some operations, ck_tile for grouped GEMM, ...)
  • Do we need to rebuild the PR branch after perf testing is done?

Partly addresses https://github.com/ROCm/frameworks-internal/issues/15863

Microbenchmarking, not just for CI.

  • Implemented 6 microbenchmarks (attention, fp16 gemm, fp8 gemm, fp16 grouped gemm, normalization, casting)
  • Implemented performance regression test for CI:
    1. Run the benchmarks on the PR branch
    2. Check out the base branch, rebuild it, and re-run the benchmarks
    3. Compare results between the base branch and the PR branch
    4. Report the comparison as a PR comment, but don't fail CI if performance regresses
  • Could be expanded to non-CI use cases (like nightly performance regression tests)
    • If the additional CI time (currently ~5 minutes) is an issue, these tests could run only as part of a Level 3 CI run
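
The four CI steps above can be sketched roughly as follows. This is a minimal illustration only, not the actual workflow: `run_cmd`, `run_benchmarks`, and the rebuild command are hypothetical stand-ins, and speedup is assumed to be base time / PR time (so >1.0x means the PR is faster), matching the report tables below.

```python
import statistics
from typing import Callable


def perf_regression_check(run_cmd: Callable[[str], None],
                          run_benchmarks: Callable[[], dict[str, float]],
                          base_ref: str = "dev") -> dict[str, float]:
    # 1. Run the benchmarks on the (already built) PR branch.
    pr_times = run_benchmarks()
    # 2. Check out the base branch, rebuild it, and re-run the benchmarks.
    #    (The rebuild command here is a placeholder.)
    run_cmd(f"git checkout {base_ref}")
    run_cmd("pip install -e .")
    base_times = run_benchmarks()
    # 3. Per-case speedup: base time / PR time.
    speedups = [base_times[k] / pr_times[k] for k in pr_times]
    # 4. Summary stats for the PR comment; CI does not fail on regressions.
    return {"median": statistics.median(speedups),
            "min": min(speedups),
            "max": max(speedups)}
```

Injecting `run_cmd` and `run_benchmarks` as callables keeps the comparison logic testable without touching git or the GPU.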

TODOs:

  • add attention, normalization, casting
  • print commit in header of comment

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Added a set of 6 microbenchmarks (attention, fp16 gemm, fp8 gemm, fp16 grouped gemm, normalization, casting)
  • Created a CI test that compares microbenchmark performance between the PR branch and the base branch

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Mar 10, 2026
@matthiasdiener matthiasdiener force-pushed the mdiener/ci-microbench branch 2 times, most recently from d9f25f2 to ce0775a Compare March 10, 2026 20:28
@matthiasdiener matthiasdiener force-pushed the mdiener/ci-microbench branch from ce0775a to 8a0ea47 Compare March 10, 2026 22:12

@github-actions

github-actions bot commented Mar 11, 2026

Performance Regression Report

MI325

PR commit: ddd17d4 | Base: dev | 2026-03-17 18:30:52 CDT

Benchmark suite Median speedup Min speedup Max speedup
benchmark_attention 1.000x 0.650x 1.390x
benchmark_casting 0.998x 0.920x 1.034x
benchmark_gemm 1.001x 0.322x 2.361x
benchmark_gemm_fp8 0.986x 0.285x 1.610x
benchmark_grouped_gemm 1.000x 0.439x 2.438x
benchmark_normalization 1.006x 0.633x 2.066x
benchmark_attention (median 1.000x, min 0.650x, max 1.390x)
Case batch seq_len num_q_heads num_kv_heads head_dim TE Forward Base TE Forward PR TE Forward Speedup TE Backward Base TE Backward PR TE Backward Speedup
Llama3-8B/TP1 2 1024 32 8 128 257.82 256.21 0.994x 180.08 198.09 1.100x
Llama3-8B/TP1 2 2048 32 8 128 385.63 384.63 0.997x 236.46 236.36 1.000x
Llama3-8B/TP1 2 4096 32 8 128 543.04 543.30 1.000x 275.26 275.64 1.001x
Llama3-8B/TP1 2 8192 32 8 128 705.72 709.27 1.005x 373.42 372.25 0.997x
Llama3-8B/TP8 2 1024 4 1 128 34.92 34.50 0.988x 22.25 21.58 0.970x
Llama3-8B/TP8 2 2048 4 1 128 125.61 124.96 0.995x 91.81 89.84 0.979x
Llama3-8B/TP8 2 4096 4 1 128 305.17 306.69 1.005x 255.60 255.78 1.001x
Llama3-8B/TP8 2 8192 4 1 128 406.33 523.55 1.288x 338.50 288.46 0.852x
Llama3-70B/TP8 2 1024 8 1 128 69.69 69.41 0.996x 44.22 44.54 1.007x
Llama3-70B/TP8 2 2048 8 1 128 236.62 238.34 1.007x 188.51 187.98 0.997x
Llama3-70B/TP8 2 4096 8 1 128 496.23 322.34 0.650x 238.89 300.79 1.259x
Llama3-70B/TP8 2 8192 8 1 128 617.34 616.97 0.999x 353.27 269.61 0.763x
Llama3-405B/TP8 2 1024 16 1 128 140.00 138.95 0.992x 89.30 124.16 1.390x
Llama3-405B/TP8 2 2048 16 1 128 364.79 365.62 1.002x 237.26 238.10 1.004x
Llama3-405B/TP8 2 4096 16 1 128 527.40 526.97 0.999x 258.35 240.51 0.931x
Llama3-405B/TP8 2 8192 16 1 128 697.06 697.91 1.001x 361.86 363.01 1.003x
Qwen2.5-7B/TP1 2 1024 28 4 128 230.63 234.43 1.016x 182.59 181.72 0.995x
Qwen2.5-7B/TP1 2 2048 28 4 128 404.59 405.18 1.001x 272.98 271.30 0.994x
Qwen2.5-7B/TP1 2 4096 28 4 128 590.77 517.93 0.877x 290.83 291.70 1.003x
Qwen2.5-7B/TP1 2 8192 28 4 128 707.55 687.44 0.972x 370.92 371.85 1.003x
Qwen2.5-72B/TP8 2 1024 8 1 128 67.26 68.40 1.017x 63.04 44.08 0.699x
Qwen2.5-72B/TP8 2 2048 8 1 128 242.45 238.28 0.983x 232.54 156.39 0.673x
Qwen2.5-72B/TP8 2 4096 8 1 128 493.67 494.51 1.002x 259.64 259.79 1.001x
Qwen2.5-72B/TP8 2 8192 8 1 128 615.70 616.16 1.001x 340.26 352.74 1.037x
benchmark_casting (median 0.998x, min 0.920x, max 1.034x)
Case M hidden_size dtype_str Cast GB/s Base Cast GB/s PR Cast GB/s Speedup
Llama3-8B/BF16-to-FP8-E4M3 1024 4096 BF16-to-FP8-E4M3 731.70 704.60 0.963x
Llama3-8B/BF16-to-FP8-E4M3 2048 4096 BF16-to-FP8-E4M3 1254.70 1250.10 0.996x
Llama3-8B/BF16-to-FP8-E4M3 4096 4096 BF16-to-FP8-E4M3 2160.70 2220.80 1.028x
Llama3-8B/BF16-to-FP8-E4M3 8192 4096 BF16-to-FP8-E4M3 2583.80 2569.50 0.994x
Llama3-8B/FP8-E4M3-to-BF16 1024 4096 FP8-E4M3-to-BF16 1115.30 1026.00 0.920x
Llama3-8B/FP8-E4M3-to-BF16 2048 4096 FP8-E4M3-to-BF16 2365.20 2337.20 0.988x
Llama3-8B/FP8-E4M3-to-BF16 4096 4096 FP8-E4M3-to-BF16 3631.80 3662.40 1.008x
Llama3-8B/FP8-E4M3-to-BF16 8192 4096 FP8-E4M3-to-BF16 3916.40 3737.90 0.954x
Llama3-8B/BF16-to-FP8-E5M2 1024 4096 BF16-to-FP8-E5M2 747.60 750.00 1.003x
Llama3-8B/BF16-to-FP8-E5M2 2048 4096 BF16-to-FP8-E5M2 1260.10 1250.20 0.992x
Llama3-8B/BF16-to-FP8-E5M2 4096 4096 BF16-to-FP8-E5M2 2221.30 2217.50 0.998x
Llama3-8B/BF16-to-FP8-E5M2 8192 4096 BF16-to-FP8-E5M2 2489.00 2573.20 1.034x
Llama3-8B/FP8-E5M2-to-BF16 1024 4096 FP8-E5M2-to-BF16 1167.00 1195.00 1.024x
Llama3-8B/FP8-E5M2-to-BF16 2048 4096 FP8-E5M2-to-BF16 2411.00 2411.60 1.000x
Llama3-8B/FP8-E5M2-to-BF16 4096 4096 FP8-E5M2-to-BF16 3701.80 3693.20 0.998x
Llama3-8B/FP8-E5M2-to-BF16 8192 4096 FP8-E5M2-to-BF16 3962.00 3980.50 1.005x
Llama3-70B/BF16-to-FP8-E4M3 1024 8192 BF16-to-FP8-E4M3 1230.40 1233.00 1.002x
Llama3-70B/BF16-to-FP8-E4M3 2048 8192 BF16-to-FP8-E4M3 2298.20 2310.10 1.005x
Llama3-70B/BF16-to-FP8-E4M3 4096 8192 BF16-to-FP8-E4M3 2682.70 2576.40 0.960x
Llama3-70B/BF16-to-FP8-E4M3 8192 8192 BF16-to-FP8-E4M3 1747.30 1723.50 0.986x
Llama3-70B/FP8-E4M3-to-BF16 1024 8192 FP8-E4M3-to-BF16 2368.60 2244.90 0.948x
Llama3-70B/FP8-E4M3-to-BF16 2048 8192 FP8-E4M3-to-BF16 3685.50 3692.10 1.002x
Llama3-70B/FP8-E4M3-to-BF16 4096 8192 FP8-E4M3-to-BF16 4001.20 3972.50 0.993x
Llama3-70B/FP8-E4M3-to-BF16 8192 8192 FP8-E4M3-to-BF16 4310.00 4264.40 0.989x
Llama3-70B/BF16-to-FP8-E5M2 1024 8192 BF16-to-FP8-E5M2 1173.10 1170.30 0.998x
Llama3-70B/BF16-to-FP8-E5M2 2048 8192 BF16-to-FP8-E5M2 2295.70 2306.90 1.005x
Llama3-70B/BF16-to-FP8-E5M2 4096 8192 BF16-to-FP8-E5M2 2682.40 2576.90 0.961x
Llama3-70B/BF16-to-FP8-E5M2 8192 8192 BF16-to-FP8-E5M2 1761.80 1678.20 0.953x
Llama3-70B/FP8-E5M2-to-BF16 1024 8192 FP8-E5M2-to-BF16 2414.50 2253.20 0.933x
Llama3-70B/FP8-E5M2-to-BF16 2048 8192 FP8-E5M2-to-BF16 3677.70 3674.90 0.999x
Llama3-70B/FP8-E5M2-to-BF16 4096 8192 FP8-E5M2-to-BF16 4029.50 3958.10 0.982x
Llama3-70B/FP8-E5M2-to-BF16 8192 8192 FP8-E5M2-to-BF16 4307.00 4271.90 0.992x
Llama3-405B/BF16-to-FP8-E4M3 1024 16384 BF16-to-FP8-E4M3 2063.50 2076.70 1.006x
Llama3-405B/BF16-to-FP8-E4M3 2048 16384 BF16-to-FP8-E4M3 2323.90 2321.60 0.999x
Llama3-405B/BF16-to-FP8-E4M3 4096 16384 BF16-to-FP8-E4M3 1858.90 1836.20 0.988x
Llama3-405B/BF16-to-FP8-E4M3 8192 16384 BF16-to-FP8-E4M3 1183.70 1178.40 0.996x
Llama3-405B/FP8-E4M3-to-BF16 1024 16384 FP8-E4M3-to-BF16 3685.10 3698.70 1.004x
Llama3-405B/FP8-E4M3-to-BF16 2048 16384 FP8-E4M3-to-BF16 4006.40 4030.60 1.006x
Llama3-405B/FP8-E4M3-to-BF16 4096 16384 FP8-E4M3-to-BF16 4426.90 4412.00 0.997x
Llama3-405B/FP8-E4M3-to-BF16 8192 16384 FP8-E4M3-to-BF16 3714.70 3748.10 1.009x
Llama3-405B/BF16-to-FP8-E5M2 1024 16384 BF16-to-FP8-E5M2 2067.20 2096.30 1.014x
Llama3-405B/BF16-to-FP8-E5M2 2048 16384 BF16-to-FP8-E5M2 2320.10 2335.40 1.007x
Llama3-405B/BF16-to-FP8-E5M2 4096 16384 BF16-to-FP8-E5M2 1855.60 1834.70 0.989x
Llama3-405B/BF16-to-FP8-E5M2 8192 16384 BF16-to-FP8-E5M2 1183.80 1175.30 0.993x
Llama3-405B/FP8-E5M2-to-BF16 1024 16384 FP8-E5M2-to-BF16 3697.50 3692.20 0.999x
Llama3-405B/FP8-E5M2-to-BF16 2048 16384 FP8-E5M2-to-BF16 3997.10 4031.70 1.009x
Llama3-405B/FP8-E5M2-to-BF16 4096 16384 FP8-E5M2-to-BF16 4427.20 4429.10 1.000x
Llama3-405B/FP8-E5M2-to-BF16 8192 16384 FP8-E5M2-to-BF16 3692.60 3740.70 1.013x
Qwen2.5-7B/BF16-to-FP8-E4M3 1024 3584 BF16-to-FP8-E4M3 505.90 503.60 0.995x
Qwen2.5-7B/BF16-to-FP8-E4M3 2048 3584 BF16-to-FP8-E4M3 967.80 957.80 0.990x
Qwen2.5-7B/BF16-to-FP8-E4M3 4096 3584 BF16-to-FP8-E4M3 1404.30 1409.20 1.003x
Qwen2.5-7B/BF16-to-FP8-E4M3 8192 3584 BF16-to-FP8-E4M3 2593.90 2608.70 1.006x
Qwen2.5-7B/FP8-E4M3-to-BF16 1024 3584 FP8-E4M3-to-BF16 1000.80 1012.10 1.011x
Qwen2.5-7B/FP8-E4M3-to-BF16 2048 3584 FP8-E4M3-to-BF16 2081.80 1999.70 0.961x
Qwen2.5-7B/FP8-E4M3-to-BF16 4096 3584 FP8-E4M3-to-BF16 3708.30 3714.50 1.002x
Qwen2.5-7B/FP8-E4M3-to-BF16 8192 3584 FP8-E4M3-to-BF16 3864.60 3862.80 1.000x
Qwen2.5-7B/BF16-to-FP8-E5M2 1024 3584 BF16-to-FP8-E5M2 505.70 503.60 0.996x
Qwen2.5-7B/BF16-to-FP8-E5M2 2048 3584 BF16-to-FP8-E5M2 968.00 959.20 0.991x
Qwen2.5-7B/BF16-to-FP8-E5M2 4096 3584 BF16-to-FP8-E5M2 1406.20 1408.70 1.002x
Qwen2.5-7B/BF16-to-FP8-E5M2 8192 3584 BF16-to-FP8-E5M2 2575.50 2594.60 1.007x
Qwen2.5-7B/FP8-E5M2-to-BF16 1024 3584 FP8-E5M2-to-BF16 1038.10 998.30 0.962x
Qwen2.5-7B/FP8-E5M2-to-BF16 2048 3584 FP8-E5M2-to-BF16 2111.80 2033.60 0.963x
Qwen2.5-7B/FP8-E5M2-to-BF16 4096 3584 FP8-E5M2-to-BF16 3715.70 3724.60 1.002x
Qwen2.5-7B/FP8-E5M2-to-BF16 8192 3584 FP8-E5M2-to-BF16 3895.90 3881.40 0.996x
Qwen2.5-72B/BF16-to-FP8-E4M3 1024 8192 BF16-to-FP8-E4M3 1242.90 1243.80 1.001x
Qwen2.5-72B/BF16-to-FP8-E4M3 2048 8192 BF16-to-FP8-E4M3 2292.30 2307.50 1.007x
Qwen2.5-72B/BF16-to-FP8-E4M3 4096 8192 BF16-to-FP8-E4M3 2690.00 2577.50 0.958x
Qwen2.5-72B/BF16-to-FP8-E4M3 8192 8192 BF16-to-FP8-E4M3 1781.50 1748.70 0.982x
Qwen2.5-72B/FP8-E4M3-to-BF16 1024 8192 FP8-E4M3-to-BF16 2415.80 2333.40 0.966x
Qwen2.5-72B/FP8-E4M3-to-BF16 2048 8192 FP8-E4M3-to-BF16 3688.30 3697.70 1.003x
Qwen2.5-72B/FP8-E4M3-to-BF16 4096 8192 FP8-E4M3-to-BF16 3996.00 3997.00 1.000x
Qwen2.5-72B/FP8-E4M3-to-BF16 8192 8192 FP8-E4M3-to-BF16 4299.10 4295.80 0.999x
Qwen2.5-72B/BF16-to-FP8-E5M2 1024 8192 BF16-to-FP8-E5M2 1182.50 1174.30 0.993x
Qwen2.5-72B/BF16-to-FP8-E5M2 2048 8192 BF16-to-FP8-E5M2 2295.40 2310.90 1.007x
Qwen2.5-72B/BF16-to-FP8-E5M2 4096 8192 BF16-to-FP8-E5M2 2699.50 2582.60 0.957x
Qwen2.5-72B/BF16-to-FP8-E5M2 8192 8192 BF16-to-FP8-E5M2 1784.60 1746.70 0.979x
Qwen2.5-72B/FP8-E5M2-to-BF16 1024 8192 FP8-E5M2-to-BF16 2393.60 2315.80 0.967x
Qwen2.5-72B/FP8-E5M2-to-BF16 2048 8192 FP8-E5M2-to-BF16 3703.50 3698.20 0.999x
Qwen2.5-72B/FP8-E5M2-to-BF16 4096 8192 FP8-E5M2-to-BF16 4009.30 4007.20 0.999x
Qwen2.5-72B/FP8-E5M2-to-BF16 8192 8192 FP8-E5M2-to-BF16 4419.90 4283.00 0.969x
benchmark_gemm (median 1.001x, min 0.322x, max 2.361x)
Case M N K dtype TE Forward Base TE Forward PR TE Forward Speedup TE Backward Base TE Backward PR TE Backward Speedup
Llama3-8B/TP1-QKV 1024 6144 4096 torch.bfloat16 591.68 540.88 0.914x 565.78 578.25 1.022x
Llama3-8B/TP1-AttnOut 1024 4096 4096 torch.bfloat16 553.06 551.29 0.997x 524.17 526.60 1.005x
Llama3-8B/TP1-GateUp 1024 28672 4096 torch.bfloat16 746.97 747.83 1.001x 416.28 414.18 0.995x
Llama3-8B/TP1-Down 1024 4096 14336 torch.bfloat16 582.34 581.93 0.999x 668.69 666.49 0.997x
Llama3-8B/TP1-QKV 2048 6144 4096 torch.bfloat16 677.20 680.06 1.004x 655.94 632.32 0.964x
Llama3-8B/TP1-AttnOut 2048 4096 4096 torch.bfloat16 564.90 565.15 1.000x 658.52 650.44 0.988x
Llama3-8B/TP1-GateUp 2048 28672 4096 torch.bfloat16 648.45 731.64 1.128x 535.40 544.40 1.017x
Llama3-8B/TP1-Down 2048 4096 14336 torch.bfloat16 644.70 644.60 1.000x 707.26 707.21 1.000x
Llama3-8B/TP1-QKV 4096 6144 4096 torch.bfloat16 682.29 728.15 1.067x 726.50 709.37 0.976x
Llama3-8B/TP1-AttnOut 4096 4096 4096 torch.bfloat16 677.77 701.16 1.035x 735.75 733.85 0.997x
Llama3-8B/TP1-GateUp 4096 28672 4096 torch.bfloat16 675.27 758.75 1.124x 810.45 760.78 0.939x
Llama3-8B/TP1-Down 4096 4096 14336 torch.bfloat16 748.83 642.09 0.857x 782.79 848.88 1.084x
Llama3-8B/TP1-QKV 8192 6144 4096 torch.bfloat16 744.54 743.87 0.999x 737.18 680.03 0.922x
Llama3-8B/TP1-AttnOut 8192 4096 4096 torch.bfloat16 730.74 735.51 1.007x 751.13 743.40 0.990x
Llama3-8B/TP1-GateUp 8192 28672 4096 torch.bfloat16 755.67 769.35 1.018x 769.86 728.81 0.947x
Llama3-8B/TP1-Down 8192 4096 14336 torch.bfloat16 710.39 712.41 1.003x 770.32 733.15 0.952x
Llama3-8B/TP8-QKV 1024 768 4096 torch.bfloat16 152.18 151.90 0.998x 57.00 55.64 0.976x
Llama3-8B/TP8-AttnOut 1024 4096 512 torch.bfloat16 101.15 101.61 1.005x 36.73 37.60 1.024x
Llama3-8B/TP8-GateUp 1024 3584 4096 torch.bfloat16 523.70 524.64 1.002x 275.69 278.79 1.011x
Llama3-8B/TP8-Down 1024 4096 1792 torch.bfloat16 355.82 359.89 1.011x 125.42 126.47 1.008x
Llama3-8B/TP8-QKV 2048 768 4096 torch.bfloat16 299.05 300.83 1.006x 111.31 110.81 0.996x
Llama3-8B/TP8-AttnOut 2048 4096 512 torch.bfloat16 203.38 203.75 1.002x 74.00 74.44 1.006x
Llama3-8B/TP8-GateUp 2048 3584 4096 torch.bfloat16 591.88 583.00 0.985x 619.07 622.05 1.005x
Llama3-8B/TP8-Down 2048 4096 1792 torch.bfloat16 549.18 561.59 1.023x 265.03 272.37 1.028x
Llama3-8B/TP8-QKV 4096 768 4096 torch.bfloat16 492.65 497.52 1.010x 223.78 229.66 1.026x
Llama3-8B/TP8-AttnOut 4096 4096 512 torch.bfloat16 399.21 407.65 1.021x 145.15 136.23 0.939x
Llama3-8B/TP8-GateUp 4096 3584 4096 torch.bfloat16 708.43 709.10 1.001x 702.34 720.58 1.026x
Llama3-8B/TP8-Down 4096 4096 1792 torch.bfloat16 642.92 639.09 0.994x 590.33 598.71 1.014x
Llama3-8B/TP8-QKV 8192 768 4096 torch.bfloat16 591.15 588.95 0.996x 507.42 509.56 1.004x
Llama3-8B/TP8-AttnOut 8192 4096 512 torch.bfloat16 489.20 501.70 1.026x 332.85 329.49 0.990x
Llama3-8B/TP8-GateUp 8192 3584 4096 torch.bfloat16 764.35 765.54 1.002x 728.82 734.48 1.008x
Llama3-8B/TP8-Down 8192 4096 1792 torch.bfloat16 608.34 654.30 1.076x 701.91 667.24 0.951x
Llama3-70B/TP8-QKV 1024 1280 8192 torch.bfloat16 417.74 414.70 0.993x 198.45 193.12 0.973x
Llama3-70B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 406.15 411.43 1.013x 145.32 146.47 1.008x
Llama3-70B/TP8-GateUp 1024 7168 8192 torch.bfloat16 700.72 697.09 0.995x 498.79 652.31 1.308x
Llama3-70B/TP8-Down 1024 8192 3584 torch.bfloat16 580.39 597.17 1.029x 570.68 565.70 0.991x
Llama3-70B/TP8-QKV 2048 1280 8192 torch.bfloat16 498.32 500.43 1.004x 439.08 454.43 1.035x
Llama3-70B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 554.32 541.12 0.976x 316.73 329.96 1.042x
Llama3-70B/TP8-GateUp 2048 7168 8192 torch.bfloat16 758.46 754.91 0.995x 607.20 742.45 1.223x
Llama3-70B/TP8-Down 2048 8192 3584 torch.bfloat16 699.38 691.39 0.989x 514.35 656.10 1.276x
Llama3-70B/TP8-QKV 4096 1280 8192 torch.bfloat16 519.74 527.64 1.015x 614.42 610.07 0.993x
Llama3-70B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 599.72 598.24 0.998x 609.34 597.44 0.980x
Llama3-70B/TP8-GateUp 4096 7168 8192 torch.bfloat16 326.94 771.77 2.361x 2374.71 764.81 0.322x
Llama3-70B/TP8-Down 4096 8192 3584 torch.bfloat16 696.83 695.70 0.998x 739.45 739.89 1.001x
Llama3-70B/TP8-QKV 8192 1280 8192 torch.bfloat16 732.43 731.77 0.999x 676.91 678.74 1.003x
Llama3-70B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 608.01 601.17 0.989x 657.81 671.50 1.021x
Llama3-70B/TP8-GateUp 8192 7168 8192 torch.bfloat16 795.07 795.29 1.000x 780.07 778.58 0.998x
Llama3-70B/TP8-Down 8192 8192 3584 torch.bfloat16 707.68 709.46 1.003x 761.20 761.80 1.001x
Llama3-405B/TP8-QKV 1024 2304 16384 torch.bfloat16 547.37 539.73 0.986x 603.38 601.90 0.998x
Llama3-405B/TP8-AttnOut 1024 16384 2048 torch.bfloat16 564.40 566.00 1.003x 537.46 536.20 0.998x
Llama3-405B/TP8-GateUp 1024 13312 16384 torch.bfloat16 671.74 679.53 1.012x 708.74 700.37 0.988x
Llama3-405B/TP8-Down 1024 16384 6656 torch.bfloat16 661.08 660.55 0.999x 517.57 515.50 0.996x
Llama3-405B/TP8-QKV 2048 2304 16384 torch.bfloat16 593.80 593.45 0.999x 688.76 686.37 0.997x
Llama3-405B/TP8-AttnOut 2048 16384 2048 torch.bfloat16 616.14 614.20 0.997x 646.91 651.50 1.007x
Llama3-405B/TP8-GateUp 2048 13312 16384 torch.bfloat16 664.29 671.27 1.011x 745.92 705.88 0.946x
Llama3-405B/TP8-Down 2048 16384 6656 torch.bfloat16 694.89 693.52 0.998x 640.86 702.29 1.096x
Llama3-405B/TP8-QKV 4096 2304 16384 torch.bfloat16 702.97 709.60 1.009x 626.73 719.54 1.148x
Llama3-405B/TP8-AttnOut 4096 16384 2048 torch.bfloat16 702.37 703.01 1.001x 518.03 600.56 1.159x
Llama3-405B/TP8-GateUp 4096 13312 16384 torch.bfloat16 621.31 658.78 1.060x 788.88 762.88 0.967x
Llama3-405B/TP8-Down 4096 16384 6656 torch.bfloat16 735.20 729.72 0.993x 713.31 716.14 1.004x
Llama3-405B/TP8-QKV 8192 2304 16384 torch.bfloat16 623.53 611.59 0.981x 735.79 744.59 1.012x
Llama3-405B/TP8-AttnOut 8192 16384 2048 torch.bfloat16 706.78 705.36 0.998x 748.39 681.01 0.910x
Llama3-405B/TP8-GateUp 8192 13312 16384 torch.bfloat16 679.56 668.72 0.984x 712.11 776.83 1.091x
Llama3-405B/TP8-Down 8192 16384 6656 torch.bfloat16 725.54 724.35 0.998x 733.33 717.45 0.978x
Qwen2.5-7B/TP1-QKV 1024 4608 3584 torch.bfloat16 560.14 563.73 1.006x 304.27 313.21 1.029x
Qwen2.5-7B/TP1-AttnOut 1024 3584 3584 torch.bfloat16 508.56 496.35 0.976x 233.33 235.40 1.009x
Qwen2.5-7B/TP1-GateUp 1024 37888 3584 torch.bfloat16 647.35 645.31 0.997x 592.55 594.47 1.003x
Qwen2.5-7B/TP1-Down 1024 3584 18944 torch.bfloat16 597.84 603.16 1.009x 657.60 662.62 1.008x
Qwen2.5-7B/TP1-QKV 2048 4608 3584 torch.bfloat16 636.82 637.18 1.001x 606.07 598.09 0.987x
Qwen2.5-7B/TP1-AttnOut 2048 3584 3584 torch.bfloat16 574.44 584.08 1.017x 561.73 512.25 0.912x
Qwen2.5-7B/TP1-GateUp 2048 37888 3584 torch.bfloat16 627.21 638.95 1.019x 532.69 527.39 0.990x
Qwen2.5-7B/TP1-Down 2048 3584 18944 torch.bfloat16 618.05 620.19 1.003x 705.68 708.02 1.003x
Qwen2.5-7B/TP1-QKV 4096 4608 3584 torch.bfloat16 645.12 638.90 0.990x 719.66 722.77 1.004x
Qwen2.5-7B/TP1-AttnOut 4096 3584 3584 torch.bfloat16 694.93 664.31 0.956x 677.88 671.21 0.990x
Qwen2.5-7B/TP1-GateUp 4096 37888 3584 torch.bfloat16 680.47 686.40 1.009x 725.63 729.91 1.006x
Qwen2.5-7B/TP1-Down 4096 3584 18944 torch.bfloat16 617.73 715.70 1.159x 844.02 773.91 0.917x
Qwen2.5-7B/TP1-QKV 8192 4608 3584 torch.bfloat16 696.76 705.02 1.012x 736.71 733.10 0.995x
Qwen2.5-7B/TP1-AttnOut 8192 3584 3584 torch.bfloat16 743.13 745.12 1.003x 610.92 717.88 1.175x
Qwen2.5-7B/TP1-GateUp 8192 37888 3584 torch.bfloat16 675.14 706.40 1.046x 785.37 767.06 0.977x
Qwen2.5-7B/TP1-Down 8192 3584 18944 torch.bfloat16 675.69 673.23 0.996x 733.91 734.10 1.000x
Qwen2.5-72B/TP8-QKV 1024 1280 8192 torch.bfloat16 409.01 409.16 1.000x 190.95 189.72 0.994x
Qwen2.5-72B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 404.01 414.30 1.025x 146.48 146.40 0.999x
Qwen2.5-72B/TP8-GateUp 1024 7392 8192 torch.bfloat16 642.62 642.71 1.000x 612.83 614.72 1.003x
Qwen2.5-72B/TP8-Down 1024 8192 3696 torch.bfloat16 611.45 604.58 0.989x 479.04 485.55 1.014x
Qwen2.5-72B/TP8-QKV 2048 1280 8192 torch.bfloat16 497.06 496.35 0.999x 446.73 442.49 0.991x
Qwen2.5-72B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 541.94 528.69 0.976x 322.38 327.45 1.016x
Qwen2.5-72B/TP8-GateUp 2048 7392 8192 torch.bfloat16 676.61 678.88 1.003x 699.78 696.51 0.995x
Qwen2.5-72B/TP8-Down 2048 8192 3696 torch.bfloat16 627.88 626.47 0.998x 577.34 577.04 0.999x
Qwen2.5-72B/TP8-QKV 4096 1280 8192 torch.bfloat16 522.99 526.33 1.006x 608.79 607.74 0.998x
Qwen2.5-72B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 590.00 595.77 1.010x 594.40 592.51 0.997x
Qwen2.5-72B/TP8-GateUp 4096 7392 8192 torch.bfloat16 712.77 713.02 1.000x 723.54 660.74 0.913x
Qwen2.5-72B/TP8-Down 4096 8192 3696 torch.bfloat16 701.61 699.06 0.996x 672.36 674.50 1.003x
Qwen2.5-72B/TP8-QKV 8192 1280 8192 torch.bfloat16 716.06 718.70 1.004x 669.16 676.93 1.012x
Qwen2.5-72B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 601.04 601.07 1.000x 656.66 668.00 1.017x
Qwen2.5-72B/TP8-GateUp 8192 7392 8192 torch.bfloat16 754.89 757.19 1.003x 702.16 692.87 0.987x
Qwen2.5-72B/TP8-Down 8192 8192 3696 torch.bfloat16 593.87 712.95 1.201x 769.46 702.40 0.913x
benchmark_gemm_fp8 (median 0.986x, min 0.285x, max 1.610x)
Case M N K dtype FP8 Forward Base FP8 Forward PR FP8 Forward Speedup FP8 Backward Base FP8 Backward PR FP8 Backward Speedup
Llama3-8B/TP1-QKV 1024 6144 4096 torch.bfloat16 378.14 380.23 1.006x 396.68 222.57 0.561x
Llama3-8B/TP1-AttnOut 1024 4096 4096 torch.bfloat16 261.48 258.23 0.988x 180.05 149.01 0.828x
Llama3-8B/TP1-GateUp 1024 28672 4096 torch.bfloat16 441.97 446.80 1.011x 940.07 948.99 1.009x
Llama3-8B/TP1-Down 1024 4096 14336 torch.bfloat16 418.10 419.15 1.003x 938.10 737.39 0.786x
Llama3-8B/TP1-QKV 2048 6144 4096 torch.bfloat16 688.50 654.45 0.951x 927.69 841.70 0.907x
Llama3-8B/TP1-AttnOut 2048 4096 4096 torch.bfloat16 516.39 507.14 0.982x 890.86 343.41 0.385x
Llama3-8B/TP1-GateUp 2048 28672 4096 torch.bfloat16 703.37 698.54 0.993x 991.17 1000.88 1.010x
Llama3-8B/TP1-Down 2048 4096 14336 torch.bfloat16 644.41 656.80 1.019x 1026.81 1000.97 0.975x
Llama3-8B/TP1-QKV 4096 6144 4096 torch.bfloat16 860.17 855.63 0.995x 1006.47 1008.00 1.002x
Llama3-8B/TP1-AttnOut 4096 4096 4096 torch.bfloat16 756.96 750.96 0.992x 994.86 627.90 0.631x
Llama3-8B/TP1-GateUp 4096 28672 4096 torch.bfloat16 891.83 872.75 0.979x 1077.84 1074.88 0.997x
Llama3-8B/TP1-Down 4096 4096 14336 torch.bfloat16 776.55 777.02 1.001x 1209.00 1205.00 0.997x
Llama3-8B/TP1-QKV 8192 6144 4096 torch.bfloat16 978.53 972.66 0.994x 1052.30 925.52 0.880x
Llama3-8B/TP1-AttnOut 8192 4096 4096 torch.bfloat16 885.61 957.24 1.081x 1133.74 1074.68 0.948x
Llama3-8B/TP1-GateUp 8192 28672 4096 torch.bfloat16 1084.89 1095.76 1.010x 1098.03 1068.04 0.973x
Llama3-8B/TP1-Down 8192 4096 14336 torch.bfloat16 859.54 782.10 0.910x 1195.36 1279.06 1.070x
Llama3-8B/TP8-QKV 1024 768 4096 torch.bfloat16 46.68 45.77 0.981x 49.17 30.54 0.621x
Llama3-8B/TP8-AttnOut 1024 4096 512 torch.bfloat16 31.01 23.09 0.745x 26.42 18.70 0.708x
Llama3-8B/TP8-GateUp 1024 3584 4096 torch.bfloat16 216.19 212.29 0.982x 396.30 206.62 0.521x
Llama3-8B/TP8-Down 1024 4096 1792 torch.bfloat16 107.89 105.90 0.982x 90.11 102.33 1.136x
Llama3-8B/TP8-QKV 2048 768 4096 torch.bfloat16 92.12 90.78 0.985x 71.60 86.40 1.207x
Llama3-8B/TP8-AttnOut 2048 4096 512 torch.bfloat16 60.85 60.43 0.993x 51.91 39.90 0.769x
Llama3-8B/TP8-GateUp 2048 3584 4096 torch.bfloat16 427.92 420.70 0.983x 358.77 276.59 0.771x
Llama3-8B/TP8-Down 2048 4096 1792 torch.bfloat16 212.98 209.85 0.985x 179.69 51.16 0.285x
Llama3-8B/TP8-QKV 4096 768 4096 torch.bfloat16 181.83 180.56 0.993x 153.02 98.67 0.645x
Llama3-8B/TP8-AttnOut 4096 4096 512 torch.bfloat16 121.14 119.86 0.989x 102.16 66.09 0.647x
Llama3-8B/TP8-GateUp 4096 3584 4096 torch.bfloat16 746.10 734.26 0.984x 748.24 468.74 0.626x
Llama3-8B/TP8-Down 4096 4096 1792 torch.bfloat16 418.60 385.64 0.921x 359.98 231.24 0.642x
Llama3-8B/TP8-QKV 8192 768 4096 torch.bfloat16 359.00 353.96 0.986x 303.24 194.26 0.641x
Llama3-8B/TP8-AttnOut 8192 4096 512 torch.bfloat16 239.68 235.41 0.982x 198.77 130.20 0.655x
Llama3-8B/TP8-GateUp 8192 3584 4096 torch.bfloat16 882.32 857.75 0.972x 853.81 1082.94 1.268x
Llama3-8B/TP8-Down 8192 4096 1792 torch.bfloat16 678.54 690.40 1.017x 764.79 474.50 0.620x
Llama3-70B/TP8-QKV 1024 1280 8192 torch.bfloat16 149.68 147.15 0.983x 126.81 80.49 0.635x
Llama3-70B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 119.57 117.04 0.979x 99.28 63.90 0.644x
Llama3-70B/TP8-GateUp 1024 7168 8192 torch.bfloat16 427.00 426.75 0.999x 804.59 584.35 0.726x
Llama3-70B/TP8-Down 1024 8192 3584 torch.bfloat16 415.27 407.05 0.980x 349.06 223.73 0.641x
Llama3-70B/TP8-QKV 2048 1280 8192 torch.bfloat16 297.18 292.71 0.985x 248.77 159.93 0.643x
Llama3-70B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 236.61 232.61 0.983x 196.43 127.03 0.647x
Llama3-70B/TP8-GateUp 2048 7168 8192 torch.bfloat16 610.89 502.97 0.823x 902.80 1453.88 1.610x
Llama3-70B/TP8-Down 2048 8192 3584 torch.bfloat16 700.47 713.50 1.019x 733.98 451.37 0.615x
Llama3-70B/TP8-QKV 4096 1280 8192 torch.bfloat16 542.17 546.10 1.007x 506.34 315.65 0.623x
Llama3-70B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 463.04 458.40 0.990x 391.22 273.45 0.699x
Llama3-70B/TP8-GateUp 4096 7168 8192 torch.bfloat16 841.38 820.94 0.976x 1151.14 1186.75 1.031x
Llama3-70B/TP8-Down 4096 8192 3584 torch.bfloat16 882.45 842.82 0.955x 1026.79 1054.82 1.027x
Llama3-70B/TP8-QKV 8192 1280 8192 torch.bfloat16 443.03 446.33 1.007x 962.61 955.33 0.992x
Llama3-70B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 675.66 670.79 0.993x 546.66 544.59 0.996x
Llama3-70B/TP8-GateUp 8192 7168 8192 torch.bfloat16 863.77 860.41 0.996x 1236.13 1158.05 0.937x
Llama3-70B/TP8-Down 8192 8192 3584 torch.bfloat16 976.25 972.75 0.996x 1045.82 1051.59 1.006x
Llama3-405B/TP8-QKV 1024 2304 16384 torch.bfloat16 488.99 492.35 1.007x 439.79 433.71 0.986x
Llama3-405B/TP8-AttnOut 1024 16384 2048 torch.bfloat16 454.92 464.26 1.021x 384.34 381.92 0.994x
Llama3-405B/TP8-GateUp 1024 13312 16384 torch.bfloat16 479.77 482.22 1.005x 1055.76 1047.58 0.992x
Llama3-405B/TP8-Down 1024 16384 6656 torch.bfloat16 466.09 467.39 1.003x 925.94 921.43 0.995x
Llama3-405B/TP8-QKV 2048 2304 16384 torch.bfloat16 674.19 649.05 0.963x 951.73 954.59 1.003x
Llama3-405B/TP8-AttnOut 2048 16384 2048 torch.bfloat16 706.04 701.10 0.993x 814.34 819.39 1.006x
Llama3-405B/TP8-GateUp 2048 13312 16384 torch.bfloat16 689.36 700.17 1.016x 1124.65 1040.51 0.925x
Llama3-405B/TP8-Down 2048 16384 6656 torch.bfloat16 665.66 559.19 0.840x 1046.41 1222.82 1.169x
Llama3-405B/TP8-QKV 4096 2304 16384 torch.bfloat16 599.33 606.85 1.013x 1133.79 1107.40 0.977x
Llama3-405B/TP8-AttnOut 4096 16384 2048 torch.bfloat16 824.37 826.55 1.003x 823.46 829.47 1.007x
Llama3-405B/TP8-GateUp 4096 13312 16384 torch.bfloat16 835.02 828.55 0.992x 1180.13 1212.87 1.028x
Llama3-405B/TP8-Down 4096 16384 6656 torch.bfloat16 860.97 857.20 0.996x 1136.42 1138.12 1.001x
Llama3-405B/TP8-QKV 8192 2304 16384 torch.bfloat16 741.53 740.34 0.998x 1145.48 1037.09 0.905x
Llama3-405B/TP8-AttnOut 8192 16384 2048 torch.bfloat16 969.80 969.06 0.999x 882.59 810.85 0.919x
Llama3-405B/TP8-GateUp 8192 13312 16384 torch.bfloat16 945.27 974.19 1.031x 1296.04 1276.34 0.985x
Llama3-405B/TP8-Down 8192 16384 6656 torch.bfloat16 913.89 982.00 1.075x 1208.61 1154.32 0.955x
Qwen2.5-7B/TP1-QKV 1024 4608 3584 torch.bfloat16 216.95 215.66 0.994x 168.68 113.82 0.675x
Qwen2.5-7B/TP1-AttnOut 1024 3584 3584 torch.bfloat16 171.68 168.54 0.982x 141.76 88.56 0.625x
Qwen2.5-7B/TP1-GateUp 1024 37888 3584 torch.bfloat16 464.55 465.51 1.002x 915.88 923.65 1.008x
Qwen2.5-7B/TP1-Down 1024 3584 18944 torch.bfloat16 397.47 400.37 1.007x 985.68 660.54 0.670x
Qwen2.5-7B/TP1-QKV 2048 4608 3584 torch.bfloat16 438.89 430.90 0.982x 335.96 222.31 0.662x
Qwen2.5-7B/TP1-AttnOut 2048 3584 3584 torch.bfloat16 340.03 334.53 0.984x 272.67 173.02 0.635x
Qwen2.5-7B/TP1-GateUp 2048 37888 3584 torch.bfloat16 606.30 676.60 1.116x 1020.92 935.10 0.916x
Qwen2.5-7B/TP1-Down 2048 3584 18944 torch.bfloat16 605.81 596.32 0.984x 1001.46 1020.38 1.019x
Qwen2.5-7B/TP1-QKV 4096 4608 3584 torch.bfloat16 780.96 443.45 0.568x 737.00 575.81 0.781x
Qwen2.5-7B/TP1-AttnOut 4096 3584 3584 torch.bfloat16 618.43 632.15 1.022x 573.33 348.09 0.607x
Qwen2.5-7B/TP1-GateUp 4096 37888 3584 torch.bfloat16 879.99 894.78 1.017x 1058.47 1053.23 0.995x
Qwen2.5-7B/TP1-Down 4096 3584 18944 torch.bfloat16 731.04 727.08 0.995x 1158.43 1051.43 0.908x
Qwen2.5-7B/TP1-QKV 8192 4608 3584 torch.bfloat16 906.38 921.77 1.017x 1066.33 1057.70 0.992x
Qwen2.5-7B/TP1-AttnOut 8192 3584 3584 torch.bfloat16 844.81 852.78 1.009x 1020.12 784.32 0.769x
Qwen2.5-7B/TP1-GateUp 8192 37888 3584 torch.bfloat16 988.73 990.40 1.002x 1046.09 1071.67 1.024x
Qwen2.5-7B/TP1-Down 8192 3584 18944 torch.bfloat16 826.23 823.97 0.997x 1142.17 1225.21 1.073x
Qwen2.5-72B/TP8-QKV 1024 1280 8192 torch.bfloat16 138.15 135.14 0.978x 107.97 68.12 0.631x
Qwen2.5-72B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 110.14 107.58 0.977x 87.93 52.25 0.594x
Qwen2.5-72B/TP8-GateUp 1024 7392 8192 torch.bfloat16 389.01 389.72 1.002x 821.18 800.21 0.974x
Qwen2.5-72B/TP8-Down 1024 8192 3696 torch.bfloat16 330.70 334.50 1.011x 348.97 204.55 0.586x
Qwen2.5-72B/TP8-QKV 2048 1280 8192 torch.bfloat16 272.61 267.24 0.980x 218.58 135.16 0.618x
Qwen2.5-72B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 218.03 214.46 0.984x 173.37 107.57 0.620x
Qwen2.5-72B/TP8-GateUp 2048 7392 8192 torch.bfloat16 637.36 630.76 0.990x 983.71 1040.77 1.058x
Qwen2.5-72B/TP8-Down 2048 8192 3696 torch.bfloat16 327.51 483.54 1.476x 1393.90 456.44 0.327x
Qwen2.5-72B/TP8-QKV 4096 1280 8192 torch.bfloat16 526.68 315.77 0.600x 435.24 321.30 0.738x
Qwen2.5-72B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 425.69 424.82 0.998x 344.86 213.62 0.619x
Qwen2.5-72B/TP8-GateUp 4096 7392 8192 torch.bfloat16 806.53 794.63 0.985x 1090.67 1111.11 1.019x
Qwen2.5-72B/TP8-Down 4096 8192 3696 torch.bfloat16 643.16 640.06 0.995x 1008.94 1007.99 0.999x
Qwen2.5-72B/TP8-QKV 8192 1280 8192 torch.bfloat16 433.62 312.92 0.722x 938.52 1233.54 1.314x
Qwen2.5-72B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 653.31 655.29 1.003x 532.67 453.51 0.851x
Qwen2.5-72B/TP8-GateUp 8192 7392 8192 torch.bfloat16 849.70 856.01 1.007x 1240.85 1234.23 0.995x
Qwen2.5-72B/TP8-Down 8192 8192 3696 torch.bfloat16 836.43 838.64 1.003x 962.93 964.94 1.002x
benchmark_grouped_gemm (median 1.000x, min 0.439x, max 2.438x)
Case B M N K dtype TE (CK_Tile) Forward Base TE (CK_Tile) Forward PR TE (CK_Tile) Forward Speedup TE (CK_Tile) Backward Base TE (CK_Tile) Backward PR TE (CK_Tile) Backward Speedup
DSV2-Lite-GateUP 2 512 2816 2048 torch.bfloat16 136.61 136.55 1.000x 146.06 146.00 1.000x
DSV2-Lite-Down 2 512 2048 1408 torch.bfloat16 94.93 94.59 0.996x 147.73 147.64 0.999x
DSV2-Lite-GateUP 2 1024 2816 2048 torch.bfloat16 107.97 263.21 2.438x 241.01 246.45 1.023x
DSV2-Lite-Down 2 1024 2048 1408 torch.bfloat16 184.01 183.88 0.999x 248.51 247.66 0.997x
DSV2-Lite-GateUP 2 2048 2816 2048 torch.bfloat16 443.08 194.70 0.439x 346.09 353.65 1.022x
DSV2-Lite-Down 2 2048 2048 1408 torch.bfloat16 339.55 341.79 1.007x 361.04 361.20 1.000x
DSV2-Lite-GateUP 2 4096 2816 2048 torch.bfloat16 451.10 452.78 1.004x 453.46 435.64 0.961x
DSV2-Lite-Down 2 4096 2048 1408 torch.bfloat16 535.49 535.92 1.001x 387.76 385.58 0.994x
DSV2-Lite-GateUP 4 512 2816 2048 torch.bfloat16 261.10 261.21 1.000x 224.11 223.33 0.997x
DSV2-Lite-Down 4 512 2048 1408 torch.bfloat16 182.07 182.05 1.000x 223.48 223.01 0.998x
DSV2-Lite-GateUP 4 1024 2816 2048 torch.bfloat16 433.24 433.93 1.002x 337.89 337.33 0.998x
DSV2-Lite-Down 4 1024 2048 1408 torch.bfloat16 334.81 333.68 0.997x 329.57 329.23 0.999x
DSV2-Lite-GateUP 4 2048 2816 2048 torch.bfloat16 425.44 424.19 0.997x 437.01 437.32 1.001x
DSV2-Lite-Down 4 2048 2048 1408 torch.bfloat16 523.87 525.98 1.004x 337.00 348.13 1.033x
DSV2-Lite-GateUP 4 4096 2816 2048 torch.bfloat16 570.78 563.79 0.988x 446.51 446.61 1.000x
DSV2-Lite-Down 4 4096 2048 1408 torch.bfloat16 544.30 554.23 1.018x 412.63 410.77 0.995x
DSV2-Lite-GateUP 8 512 2816 2048 torch.bfloat16 423.34 419.98 0.992x 335.54 332.86 0.992x
DSV2-Lite-Down 8 512 2048 1408 torch.bfloat16 325.58 324.71 0.997x 320.35 319.85 0.998x
DSV2-Lite-GateUP 8 1024 2816 2048 torch.bfloat16 407.18 405.29 0.995x 440.04 441.57 1.003x
DSV2-Lite-Down 8 1024 2048 1408 torch.bfloat16 504.75 501.55 0.994x 364.48 364.15 0.999x
DSV2-Lite-GateUP 8 2048 2816 2048 torch.bfloat16 567.09 567.78 1.001x 495.92 474.38 0.957x
DSV2-Lite-Down 8 2048 2048 1408 torch.bfloat16 536.33 536.70 1.001x 459.26 459.74 1.001x
DSV2-Lite-GateUP 8 4096 2816 2048 torch.bfloat16 611.51 615.57 1.007x 514.81 515.23 1.001x
DSV2-Lite-Down 8 4096 2048 1408 torch.bfloat16 570.06 570.33 1.000x 499.43 496.88 0.995x
DSV2-GateUP 5 512 3072 5120 torch.bfloat16 356.91 360.34 1.010x 422.01 410.45 0.973x
DSV2-Down 5 512 5120 1536 torch.bfloat16 443.37 440.61 0.994x 254.27 253.67 0.998x
DSV2-GateUP 5 1024 3072 5120 torch.bfloat16 606.34 606.51 1.000x 480.05 485.13 1.011x
DSV2-Down 5 1024 5120 1536 torch.bfloat16 459.41 460.46 1.002x 388.05 388.54 1.001x
DSV2-GateUP 5 2048 3072 5120 torch.bfloat16 579.84 579.33 0.999x 559.41 559.62 1.000x
DSV2-Down 5 2048 5120 1536 torch.bfloat16 585.94 586.28 1.001x 538.69 539.68 1.002x
DSV2-GateUP 5 4096 3072 5120 torch.bfloat16 612.67 611.57 0.998x 567.12 573.83 1.012x
DSV2-Down 5 4096 5120 1536 torch.bfloat16 590.31 589.30 0.998x 558.32 558.86 1.001x
DSV2-GateUP 10 512 3072 5120 torch.bfloat16 526.42 524.88 0.997x 418.98 417.29 0.996x
DSV2-Down 10 512 5120 1536 torch.bfloat16 446.07 446.85 1.002x 347.50 348.18 1.002x
DSV2-GateUP 10 1024 3072 5120 torch.bfloat16 493.04 490.06 0.994x 519.32 514.08 0.990x
DSV2-Down 10 1024 5120 1536 torch.bfloat16 548.76 543.97 0.991x 482.00 482.63 1.001x
DSV2-GateUP 10 2048 3072 5120 torch.bfloat16 596.15 596.18 1.000x 527.45 528.58 1.002x
DSV2-Down 10 2048 5120 1536 torch.bfloat16 544.78 546.27 1.003x 535.06 534.90 1.000x
DSV2-GateUP 10 4096 3072 5120 torch.bfloat16 654.99 653.80 0.998x 579.65 578.41 0.998x
DSV2-Down 10 4096 5120 1536 torch.bfloat16 567.31 569.53 1.004x 555.25 556.35 1.002x
DSV2-GateUP 20 512 3072 5120 torch.bfloat16 516.01 515.30 0.999x 438.67 437.76 0.998x
DSV2-Down 20 512 5120 1536 torch.bfloat16 490.73 495.44 1.010x 421.15 421.12 1.000x
DSV2-GateUP 20 1024 3072 5120 torch.bfloat16 544.96 541.73 0.994x 513.68 512.47 0.998x
DSV2-Down 20 1024 5120 1536 torch.bfloat16 512.41 512.31 1.000x 472.34 472.73 1.001x
DSV2-GateUP 20 2048 3072 5120 torch.bfloat16 641.16 641.96 1.001x 542.03 556.34 1.026x
DSV2-Down 20 2048 5120 1536 torch.bfloat16 556.16 557.08 1.002x 508.18 507.29 0.998x
DSV2-GateUP 20 4096 3072 5120 torch.bfloat16 662.95 663.15 1.000x 563.79 556.14 0.986x
DSV2-Down 20 4096 5120 1536 torch.bfloat16 546.03 548.11 1.004x 551.84 564.05 1.022x
DSV3-GateUP 8 512 4096 7168 torch.bfloat16 561.83 561.96 1.000x 439.28 439.53 1.001x
DSV3-Down 8 512 7168 2048 torch.bfloat16 506.00 502.19 0.992x 363.25 363.13 1.000x
DSV3-GateUP 8 1024 4096 7168 torch.bfloat16 593.38 594.00 1.001x 537.98 537.92 1.000x
DSV3-Down 8 1024 7168 2048 torch.bfloat16 605.11 607.23 1.004x 512.19 513.62 1.003x
DSV3-GateUP 8 2048 4096 7168 torch.bfloat16 629.67 619.95 0.985x 566.65 567.10 1.001x
DSV3-Down 8 2048 7168 2048 torch.bfloat16 628.29 630.02 1.003x 555.61 516.41 0.929x
DSV3-GateUP 8 4096 4096 7168 torch.bfloat16 656.86 682.34 1.039x 574.16 573.92 1.000x
DSV3-Down 8 4096 7168 2048 torch.bfloat16 639.12 641.34 1.003x 552.53 573.03 1.037x
DSV3-GateUP 16 512 4096 7168 torch.bfloat16 544.52 543.91 0.999x 450.89 462.98 1.027x
DSV3-Down 16 512 7168 2048 torch.bfloat16 535.66 535.10 0.999x 439.26 439.21 1.000x
DSV3-GateUP 16 1024 4096 7168 torch.bfloat16 567.41 538.63 0.949x 524.70 524.45 1.000x
DSV3-Down 16 1024 7168 2048 torch.bfloat16 594.85 595.96 1.002x 490.64 490.53 1.000x
DSV3-GateUP 16 2048 4096 7168 torch.bfloat16 669.68 638.09 0.953x 555.96 556.79 1.001x
DSV3-Down 16 2048 7168 2048 torch.bfloat16 626.21 626.69 1.001x 528.65 550.39 1.041x
DSV3-GateUP 16 4096 4096 7168 torch.bfloat16 657.35 670.19 1.020x 583.75 579.27 0.992x
DSV3-Down 16 4096 7168 2048 torch.bfloat16 632.78 632.01 0.999x 582.36 572.27 0.983x
DSV3-GateUP 32 512 4096 7168 torch.bfloat16 538.44 538.55 1.000x 426.59 426.21 0.999x
DSV3-Down 32 512 7168 2048 torch.bfloat16 526.51 459.17 0.872x 402.05 426.37 1.060x
DSV3-GateUP 32 1024 4096 7168 torch.bfloat16 644.04 642.56 0.998x 506.65 514.44 1.015x
DSV3-Down 32 1024 7168 2048 torch.bfloat16 588.36 587.85 0.999x 472.44 473.26 1.002x
DSV3-GateUP 32 2048 4096 7168 torch.bfloat16 656.46 641.24 0.977x 554.86 559.71 1.009x
DSV3-Down 32 2048 7168 2048 torch.bfloat16 613.63 613.77 1.000x 545.77 554.32 1.016x
DSV3-GateUP 32 4096 4096 7168 torch.bfloat16 667.88 674.54 1.010x 571.30 571.35 1.000x
DSV3-Down 32 4096 7168 2048 torch.bfloat16 599.49 610.80 1.019x 574.82 575.19 1.001x
Grok-V2-GateUP 1 512 32768 8192 torch.bfloat16 514.60 515.22 1.001x 553.62 555.90 1.004x
Grok-V2-Down 1 512 8192 16384 torch.bfloat16 480.71 483.43 1.006x 570.76 577.15 1.011x
Grok-V2-GateUP 1 1024 32768 8192 torch.bfloat16 614.69 614.51 1.000x 646.34 605.78 0.937x
Grok-V2-Down 1 1024 8192 16384 torch.bfloat16 534.74 533.95 0.999x 645.96 645.59 0.999x
Grok-V2-GateUP 1 2048 32768 8192 torch.bfloat16 659.29 659.61 1.000x 746.55 747.23 1.001x
Grok-V2-Down 1 2048 8192 16384 torch.bfloat16 605.26 605.95 1.001x 718.30 716.29 0.997x
Grok-V2-GateUP 1 4096 32768 8192 torch.bfloat16 724.20 724.67 1.001x 761.82 762.57 1.001x
Grok-V2-Down 1 4096 8192 16384 torch.bfloat16 626.54 627.74 1.002x 743.74 746.09 1.003x
benchmark_normalization (median 1.006x, min 0.633x, max 2.066x)
Case | M | hidden_size | dtype | TE Forward GB/s Base | TE Forward GB/s PR | TE Forward Speedup | TE Backward GB/s Base | TE Backward GB/s PR | TE Backward Speedup
Llama3-8B/RMSNorm 1024 4096 torch.bfloat16 609.30 578.90 0.950x 312.20 644.60 2.065x
Llama3-8B/RMSNorm 2048 4096 torch.bfloat16 1205.50 1163.00 0.965x 643.80 1329.80 2.066x
Llama3-8B/RMSNorm 4096 4096 torch.bfloat16 2449.20 2366.20 0.966x 2595.00 2612.60 1.007x
Llama3-8B/RMSNorm 8192 4096 torch.bfloat16 3812.50 3856.40 1.012x 4284.00 4211.50 0.983x
Llama3-8B/LayerNorm 1024 4096 torch.bfloat16 528.60 491.40 0.930x 561.00 591.70 1.055x
Llama3-8B/LayerNorm 2048 4096 torch.bfloat16 1045.80 1020.30 0.976x 1111.10 1144.10 1.030x
Llama3-8B/LayerNorm 4096 4096 torch.bfloat16 2031.30 2021.20 0.995x 2286.40 2294.30 1.003x
Llama3-8B/LayerNorm 8192 4096 torch.bfloat16 3497.10 3520.10 1.007x 4018.80 3071.20 0.764x
Llama3-70B/RMSNorm 1024 8192 torch.bfloat16 1215.30 1176.60 0.968x 1320.10 844.50 0.640x
Llama3-70B/RMSNorm 2048 8192 torch.bfloat16 2402.30 2365.50 0.985x 2684.10 1698.30 0.633x
Llama3-70B/RMSNorm 4096 8192 torch.bfloat16 3086.30 3054.70 0.990x 3292.10 3278.60 0.996x
Llama3-70B/RMSNorm 8192 8192 torch.bfloat16 3639.30 3637.10 0.999x 3560.70 3486.10 0.979x
Llama3-70B/LayerNorm 1024 8192 torch.bfloat16 1034.70 1024.60 0.990x 657.30 556.50 0.847x
Llama3-70B/LayerNorm 2048 8192 torch.bfloat16 2054.00 2042.30 0.994x 731.70 737.80 1.008x
Llama3-70B/LayerNorm 4096 8192 torch.bfloat16 2970.00 2961.50 0.997x 640.90 678.80 1.059x
Llama3-70B/LayerNorm 8192 8192 torch.bfloat16 3524.30 3538.60 1.004x 599.10 625.10 1.043x
Llama3-405B/RMSNorm 1024 16384 torch.bfloat16 664.60 681.30 1.025x 536.60 519.80 0.969x
Llama3-405B/RMSNorm 2048 16384 torch.bfloat16 714.30 718.80 1.006x 522.70 535.30 1.024x
Llama3-405B/RMSNorm 4096 16384 torch.bfloat16 729.40 726.20 0.996x 455.10 464.00 1.020x
Llama3-405B/RMSNorm 8192 16384 torch.bfloat16 608.40 608.50 1.000x 477.20 479.80 1.005x
Llama3-405B/LayerNorm 1024 16384 torch.bfloat16 672.50 681.50 1.013x 614.40 624.50 1.016x
Llama3-405B/LayerNorm 2048 16384 torch.bfloat16 673.40 672.10 0.998x 597.90 637.10 1.066x
Llama3-405B/LayerNorm 4096 16384 torch.bfloat16 701.00 701.40 1.001x 522.90 548.70 1.049x
Llama3-405B/LayerNorm 8192 16384 torch.bfloat16 577.60 582.30 1.008x 606.40 607.30 1.001x
Qwen2.5-7B/RMSNorm 1024 3584 torch.bfloat16 522.70 525.70 1.006x 181.50 296.20 1.632x
Qwen2.5-7B/RMSNorm 2048 3584 torch.bfloat16 1049.10 1036.50 0.988x 377.00 588.70 1.562x
Qwen2.5-7B/RMSNorm 4096 3584 torch.bfloat16 2108.70 2067.30 0.980x 734.20 997.60 1.359x
Qwen2.5-7B/RMSNorm 8192 3584 torch.bfloat16 2905.40 2927.20 1.008x 1521.10 1640.50 1.078x
Qwen2.5-7B/LayerNorm 1024 3584 torch.bfloat16 444.50 447.00 1.006x 146.90 248.20 1.690x
Qwen2.5-7B/LayerNorm 2048 3584 torch.bfloat16 902.80 899.30 0.996x 296.40 498.60 1.682x
Qwen2.5-7B/LayerNorm 4096 3584 torch.bfloat16 1767.00 1768.60 1.001x 583.20 891.80 1.529x
Qwen2.5-7B/LayerNorm 8192 3584 torch.bfloat16 2531.80 2573.90 1.017x 1223.00 1436.60 1.175x
Qwen2.5-72B/RMSNorm 1024 8192 torch.bfloat16 1168.60 1188.60 1.017x 396.20 672.10 1.696x
Qwen2.5-72B/RMSNorm 2048 8192 torch.bfloat16 2348.80 2354.40 1.002x 816.50 1374.00 1.683x
Qwen2.5-72B/RMSNorm 4096 8192 torch.bfloat16 3032.00 3021.70 0.997x 1770.80 2928.60 1.654x
Qwen2.5-72B/RMSNorm 8192 8192 torch.bfloat16 3568.60 3604.10 1.010x 3541.80 3523.90 0.995x
Qwen2.5-72B/LayerNorm 1024 8192 torch.bfloat16 1039.00 1006.90 0.969x 319.90 543.40 1.699x
Qwen2.5-72B/LayerNorm 2048 8192 torch.bfloat16 2085.00 2058.50 0.987x 645.70 741.00 1.148x
Qwen2.5-72B/LayerNorm 4096 8192 torch.bfloat16 2987.20 2967.70 0.993x 643.30 679.60 1.056x
Qwen2.5-72B/LayerNorm 8192 8192 torch.bfloat16 3434.70 3502.90 1.020x 592.70 626.00 1.056x

MI355

PR commit: ddd17d4 | Base: dev | 2026-03-17 18:30:52 CDT

Benchmark suite | Median speedup | Min speedup | Max speedup
benchmark_attention 1.000x 0.969x 1.014x
benchmark_casting 0.998x 0.898x 1.138x
benchmark_gemm 1.001x 0.935x 1.094x
benchmark_gemm_fp8 0.988x 0.401x 1.038x
benchmark_grouped_gemm 0.999x 0.912x 1.134x
benchmark_normalization 0.993x 0.399x 1.490x
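The per-suite median/min/max figures in the summary above can be derived from the paired Base/PR measurements in the detail tables. A minimal sketch of that aggregation, assuming the comparison script works on `(base, pr)` value pairs where higher is better (so speedup = PR / Base); the function name is illustrative, not the PR's actual implementation:

```python
import statistics

def summarize_speedups(cases):
    """cases: list of (base_value, pr_value) pairs for one benchmark suite.
    Speedup is PR / Base, so values above 1.0x mean the PR branch improved."""
    speedups = [pr / base for base, pr in cases]
    return (statistics.median(speedups), min(speedups), max(speedups))

# Example with made-up measurements:
med, lo, hi = summarize_speedups([(100.0, 101.0), (200.0, 190.0), (50.0, 55.0)])
print(f"median {med:.3f}x, min {lo:.3f}x, max {hi:.3f}x")
```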
benchmark_attention (median 1.000x, min 0.969x, max 1.014x)
Case | batch | seq_len | num_q_heads | num_kv_heads | head_dim | TE Forward Base | TE Forward PR | TE Forward Speedup | TE Backward Base | TE Backward PR | TE Backward Speedup
Llama3-8B/TP1 2 1024 32 8 128 298.06 298.56 1.002x 246.01 238.32 0.969x
Llama3-8B/TP1 2 2048 32 8 128 531.54 535.00 1.007x 273.78 273.40 0.999x
Llama3-8B/TP1 2 4096 32 8 128 681.26 681.83 1.001x 297.55 298.62 1.004x
Llama3-8B/TP1 2 8192 32 8 128 943.78 946.38 1.003x 392.95 393.58 1.002x
Llama3-8B/TP8 2 1024 4 1 128 38.23 37.85 0.990x 38.08 38.14 1.002x
Llama3-8B/TP8 2 2048 4 1 128 151.71 149.89 0.988x 146.00 146.67 1.005x
Llama3-8B/TP8 2 4096 4 1 128 421.15 417.61 0.992x 289.94 290.16 1.001x
Llama3-8B/TP8 2 8192 4 1 128 697.50 700.77 1.005x 312.90 313.96 1.003x
Llama3-70B/TP8 2 1024 8 1 128 75.79 75.91 1.002x 76.97 75.59 0.982x
Llama3-70B/TP8 2 2048 8 1 128 304.14 301.02 0.990x 270.62 274.53 1.014x
Llama3-70B/TP8 2 4096 8 1 128 615.32 616.78 1.002x 301.91 301.68 0.999x
Llama3-70B/TP8 2 8192 8 1 128 752.84 759.49 1.009x 351.40 353.16 1.005x
Llama3-405B/TP8 2 1024 16 1 128 153.61 152.41 0.992x 152.84 153.13 1.002x
Llama3-405B/TP8 2 2048 16 1 128 494.52 496.16 1.003x 277.21 277.06 0.999x
Llama3-405B/TP8 2 4096 16 1 128 667.38 670.21 1.004x 298.69 298.46 0.999x
Llama3-405B/TP8 2 8192 16 1 128 900.01 900.16 1.000x 382.68 382.27 0.999x
Qwen2.5-7B/TP1 2 1024 28 4 128 266.49 262.94 0.987x 221.11 222.98 1.008x
Qwen2.5-7B/TP1 2 2048 28 4 128 491.33 491.07 0.999x 247.61 247.99 1.002x
Qwen2.5-7B/TP1 2 4096 28 4 128 685.47 685.32 1.000x 302.15 302.30 1.000x
Qwen2.5-7B/TP1 2 8192 28 4 128 953.99 952.98 0.999x 395.01 394.68 0.999x
Qwen2.5-72B/TP8 2 1024 8 1 128 76.69 75.99 0.991x 76.62 75.80 0.989x
Qwen2.5-72B/TP8 2 2048 8 1 128 306.02 302.08 0.987x 271.20 274.05 1.011x
Qwen2.5-72B/TP8 2 4096 8 1 128 618.80 617.37 0.998x 301.97 302.10 1.000x
Qwen2.5-72B/TP8 2 8192 8 1 128 752.92 756.74 1.005x 353.63 353.38 0.999x
benchmark_casting (median 0.998x, min 0.898x, max 1.138x)
Case | M | hidden_size | dtype_str | Cast GB/s Base | Cast GB/s PR | Cast Speedup
Llama3-8B/BF16-to-FP8-E4M3 1024 4096 BF16-to-FP8-E4M3 764.70 756.30 0.989x
Llama3-8B/BF16-to-FP8-E4M3 2048 4096 BF16-to-FP8-E4M3 1224.70 1232.50 1.006x
Llama3-8B/BF16-to-FP8-E4M3 4096 4096 BF16-to-FP8-E4M3 2145.90 2178.90 1.015x
Llama3-8B/BF16-to-FP8-E4M3 8192 4096 BF16-to-FP8-E4M3 1738.20 1746.90 1.005x
Llama3-8B/FP8-E4M3-to-BF16 1024 4096 FP8-E4M3-to-BF16 1160.80 1092.60 0.941x
Llama3-8B/FP8-E4M3-to-BF16 2048 4096 FP8-E4M3-to-BF16 2545.40 2392.40 0.940x
Llama3-8B/FP8-E4M3-to-BF16 4096 4096 FP8-E4M3-to-BF16 4642.80 4641.40 1.000x
Llama3-8B/FP8-E4M3-to-BF16 8192 4096 FP8-E4M3-to-BF16 5711.60 5694.70 0.997x
Llama3-8B/BF16-to-FP8-E5M2 1024 4096 BF16-to-FP8-E5M2 764.10 763.20 0.999x
Llama3-8B/BF16-to-FP8-E5M2 2048 4096 BF16-to-FP8-E5M2 1221.40 1234.10 1.010x
Llama3-8B/BF16-to-FP8-E5M2 4096 4096 BF16-to-FP8-E5M2 2159.10 2178.30 1.009x
Llama3-8B/BF16-to-FP8-E5M2 8192 4096 BF16-to-FP8-E5M2 1742.10 1750.10 1.005x
Llama3-8B/FP8-E5M2-to-BF16 1024 4096 FP8-E5M2-to-BF16 1249.40 1122.30 0.898x
Llama3-8B/FP8-E5M2-to-BF16 2048 4096 FP8-E5M2-to-BF16 2542.20 2353.60 0.926x
Llama3-8B/FP8-E5M2-to-BF16 4096 4096 FP8-E5M2-to-BF16 4713.90 4636.80 0.984x
Llama3-8B/FP8-E5M2-to-BF16 8192 4096 FP8-E5M2-to-BF16 5684.70 5697.40 1.002x
Llama3-70B/BF16-to-FP8-E4M3 1024 8192 BF16-to-FP8-E4M3 1248.20 1253.10 1.004x
Llama3-70B/BF16-to-FP8-E4M3 2048 8192 BF16-to-FP8-E4M3 2254.30 2260.60 1.003x
Llama3-70B/BF16-to-FP8-E4M3 4096 8192 BF16-to-FP8-E4M3 1584.10 1802.80 1.138x
Llama3-70B/BF16-to-FP8-E4M3 8192 8192 BF16-to-FP8-E4M3 1903.50 1903.10 1.000x
Llama3-70B/FP8-E4M3-to-BF16 1024 8192 FP8-E4M3-to-BF16 2533.20 2439.10 0.963x
Llama3-70B/FP8-E4M3-to-BF16 2048 8192 FP8-E4M3-to-BF16 4719.90 4698.20 0.995x
Llama3-70B/FP8-E4M3-to-BF16 4096 8192 FP8-E4M3-to-BF16 5733.00 5705.20 0.995x
Llama3-70B/FP8-E4M3-to-BF16 8192 8192 FP8-E4M3-to-BF16 6595.90 6578.60 0.997x
Llama3-70B/BF16-to-FP8-E5M2 1024 8192 BF16-to-FP8-E5M2 1253.10 1261.70 1.007x
Llama3-70B/BF16-to-FP8-E5M2 2048 8192 BF16-to-FP8-E5M2 2257.00 2269.50 1.006x
Llama3-70B/BF16-to-FP8-E5M2 4096 8192 BF16-to-FP8-E5M2 1812.90 1801.70 0.994x
Llama3-70B/BF16-to-FP8-E5M2 8192 8192 BF16-to-FP8-E5M2 1902.80 1871.40 0.983x
Llama3-70B/FP8-E5M2-to-BF16 1024 8192 FP8-E5M2-to-BF16 2548.60 2453.50 0.963x
Llama3-70B/FP8-E5M2-to-BF16 2048 8192 FP8-E5M2-to-BF16 4702.40 4717.80 1.003x
Llama3-70B/FP8-E5M2-to-BF16 4096 8192 FP8-E5M2-to-BF16 5736.00 5727.50 0.999x
Llama3-70B/FP8-E5M2-to-BF16 8192 8192 FP8-E5M2-to-BF16 6451.10 6457.60 1.001x
Llama3-405B/BF16-to-FP8-E4M3 1024 16384 BF16-to-FP8-E4M3 2267.60 2261.30 0.997x
Llama3-405B/BF16-to-FP8-E4M3 2048 16384 BF16-to-FP8-E4M3 1674.70 1669.70 0.997x
Llama3-405B/BF16-to-FP8-E4M3 4096 16384 BF16-to-FP8-E4M3 1866.20 1867.10 1.000x
Llama3-405B/BF16-to-FP8-E4M3 8192 16384 BF16-to-FP8-E4M3 1080.60 1091.40 1.010x
Llama3-405B/FP8-E4M3-to-BF16 1024 16384 FP8-E4M3-to-BF16 4635.60 4611.80 0.995x
Llama3-405B/FP8-E4M3-to-BF16 2048 16384 FP8-E4M3-to-BF16 5696.40 5685.10 0.998x
Llama3-405B/FP8-E4M3-to-BF16 4096 16384 FP8-E4M3-to-BF16 6602.00 6505.70 0.985x
Llama3-405B/FP8-E4M3-to-BF16 8192 16384 FP8-E4M3-to-BF16 5371.20 5383.90 1.002x
Llama3-405B/BF16-to-FP8-E5M2 1024 16384 BF16-to-FP8-E5M2 2268.10 2269.90 1.001x
Llama3-405B/BF16-to-FP8-E5M2 2048 16384 BF16-to-FP8-E5M2 1673.80 1670.60 0.998x
Llama3-405B/BF16-to-FP8-E5M2 4096 16384 BF16-to-FP8-E5M2 1860.20 1863.20 1.002x
Llama3-405B/BF16-to-FP8-E5M2 8192 16384 BF16-to-FP8-E5M2 1076.80 1083.20 1.006x
Llama3-405B/FP8-E5M2-to-BF16 1024 16384 FP8-E5M2-to-BF16 4643.10 4630.50 0.997x
Llama3-405B/FP8-E5M2-to-BF16 2048 16384 FP8-E5M2-to-BF16 5709.10 5703.70 0.999x
Llama3-405B/FP8-E5M2-to-BF16 4096 16384 FP8-E5M2-to-BF16 6578.80 6595.10 1.002x
Llama3-405B/FP8-E5M2-to-BF16 8192 16384 FP8-E5M2-to-BF16 5379.10 5371.30 0.999x
Qwen2.5-7B/BF16-to-FP8-E4M3 1024 3584 BF16-to-FP8-E4M3 711.50 684.10 0.961x
Qwen2.5-7B/BF16-to-FP8-E4M3 2048 3584 BF16-to-FP8-E4M3 1187.40 1195.50 1.007x
Qwen2.5-7B/BF16-to-FP8-E4M3 4096 3584 BF16-to-FP8-E4M3 2155.40 2156.50 1.001x
Qwen2.5-7B/BF16-to-FP8-E4M3 8192 3584 BF16-to-FP8-E4M3 2538.90 2645.00 1.042x
Qwen2.5-7B/FP8-E4M3-to-BF16 1024 3584 FP8-E4M3-to-BF16 1109.70 1078.90 0.972x
Qwen2.5-7B/FP8-E4M3-to-BF16 2048 3584 FP8-E4M3-to-BF16 2205.50 2084.20 0.945x
Qwen2.5-7B/FP8-E4M3-to-BF16 4096 3584 FP8-E4M3-to-BF16 4411.40 4350.20 0.986x
Qwen2.5-7B/FP8-E4M3-to-BF16 8192 3584 FP8-E4M3-to-BF16 5542.90 5526.80 0.997x
Qwen2.5-7B/BF16-to-FP8-E5M2 1024 3584 BF16-to-FP8-E5M2 711.50 688.30 0.967x
Qwen2.5-7B/BF16-to-FP8-E5M2 2048 3584 BF16-to-FP8-E5M2 1195.20 1195.50 1.000x
Qwen2.5-7B/BF16-to-FP8-E5M2 4096 3584 BF16-to-FP8-E5M2 2151.90 2157.80 1.003x
Qwen2.5-7B/BF16-to-FP8-E5M2 8192 3584 BF16-to-FP8-E5M2 2540.10 2647.00 1.042x
Qwen2.5-7B/FP8-E5M2-to-BF16 1024 3584 FP8-E5M2-to-BF16 1048.60 1077.50 1.028x
Qwen2.5-7B/FP8-E5M2-to-BF16 2048 3584 FP8-E5M2-to-BF16 2245.20 2116.50 0.943x
Qwen2.5-7B/FP8-E5M2-to-BF16 4096 3584 FP8-E5M2-to-BF16 4392.40 4353.40 0.991x
Qwen2.5-7B/FP8-E5M2-to-BF16 8192 3584 FP8-E5M2-to-BF16 5548.20 5512.70 0.994x
Qwen2.5-72B/BF16-to-FP8-E4M3 1024 8192 BF16-to-FP8-E4M3 1251.50 1255.90 1.004x
Qwen2.5-72B/BF16-to-FP8-E4M3 2048 8192 BF16-to-FP8-E4M3 2256.70 2264.90 1.004x
Qwen2.5-72B/BF16-to-FP8-E4M3 4096 8192 BF16-to-FP8-E4M3 1811.40 1805.90 0.997x
Qwen2.5-72B/BF16-to-FP8-E4M3 8192 8192 BF16-to-FP8-E4M3 1902.90 1871.90 0.984x
Qwen2.5-72B/FP8-E4M3-to-BF16 1024 8192 FP8-E4M3-to-BF16 2540.70 2473.90 0.974x
Qwen2.5-72B/FP8-E4M3-to-BF16 2048 8192 FP8-E4M3-to-BF16 4730.20 4719.70 0.998x
Qwen2.5-72B/FP8-E4M3-to-BF16 4096 8192 FP8-E4M3-to-BF16 5728.30 5735.20 1.001x
Qwen2.5-72B/FP8-E4M3-to-BF16 8192 8192 FP8-E4M3-to-BF16 6587.30 6529.40 0.991x
Qwen2.5-72B/BF16-to-FP8-E5M2 1024 8192 BF16-to-FP8-E5M2 1252.70 1258.20 1.004x
Qwen2.5-72B/BF16-to-FP8-E5M2 2048 8192 BF16-to-FP8-E5M2 2258.40 2267.30 1.004x
Qwen2.5-72B/BF16-to-FP8-E5M2 4096 8192 BF16-to-FP8-E5M2 1813.80 1802.50 0.994x
Qwen2.5-72B/BF16-to-FP8-E5M2 8192 8192 BF16-to-FP8-E5M2 1902.40 1873.90 0.985x
Qwen2.5-72B/FP8-E5M2-to-BF16 1024 8192 FP8-E5M2-to-BF16 2557.20 2488.80 0.973x
Qwen2.5-72B/FP8-E5M2-to-BF16 2048 8192 FP8-E5M2-to-BF16 4714.20 4714.40 1.000x
Qwen2.5-72B/FP8-E5M2-to-BF16 4096 8192 FP8-E5M2-to-BF16 5743.10 5715.60 0.995x
Qwen2.5-72B/FP8-E5M2-to-BF16 8192 8192 FP8-E5M2-to-BF16 6603.50 6591.70 0.998x
benchmark_gemm (median 1.001x, min 0.935x, max 1.094x)
Case | M | N | K | dtype | TE Forward Base | TE Forward PR | TE Forward Speedup | TE Backward Base | TE Backward PR | TE Backward Speedup
Llama3-8B/TP1-QKV 1024 6144 4096 torch.bfloat16 1031.56 1044.11 1.012x 759.28 764.70 1.007x
Llama3-8B/TP1-AttnOut 1024 4096 4096 torch.bfloat16 806.21 817.18 1.014x 474.12 469.24 0.990x
Llama3-8B/TP1-GateUp 1024 28672 4096 torch.bfloat16 1235.92 1218.47 0.986x 1218.56 1232.06 1.011x
Llama3-8B/TP1-Down 1024 4096 14336 torch.bfloat16 1090.59 1084.23 0.994x 1097.33 1091.06 0.994x
Llama3-8B/TP1-QKV 2048 6144 4096 torch.bfloat16 1312.34 1303.96 0.994x 1054.29 1061.04 1.006x
Llama3-8B/TP1-AttnOut 2048 4096 4096 torch.bfloat16 722.97 675.64 0.935x 1035.42 1014.34 0.980x
Llama3-8B/TP1-GateUp 2048 28672 4096 torch.bfloat16 1296.36 1295.50 0.999x 1364.19 1364.80 1.000x
Llama3-8B/TP1-Down 2048 4096 14336 torch.bfloat16 1172.73 1243.84 1.061x 1263.38 1218.60 0.965x
Llama3-8B/TP1-QKV 4096 6144 4096 torch.bfloat16 1347.05 1329.77 0.987x 1243.39 1249.03 1.005x
Llama3-8B/TP1-AttnOut 4096 4096 4096 torch.bfloat16 1458.15 1458.33 1.000x 1377.25 1375.15 0.998x
Llama3-8B/TP1-GateUp 4096 28672 4096 torch.bfloat16 1552.05 1553.90 1.001x 1426.28 1406.21 0.986x
Llama3-8B/TP1-Down 4096 4096 14336 torch.bfloat16 1596.08 1588.77 0.995x 1349.14 1344.52 0.997x
Llama3-8B/TP1-QKV 8192 6144 4096 torch.bfloat16 1533.34 1537.14 1.002x 1261.16 1263.60 1.002x
Llama3-8B/TP1-AttnOut 8192 4096 4096 torch.bfloat16 1531.56 1522.12 0.994x 1440.07 1437.99 0.999x
Llama3-8B/TP1-GateUp 8192 28672 4096 torch.bfloat16 1531.03 1540.89 1.006x 1431.77 1441.42 1.007x
Llama3-8B/TP1-Down 8192 4096 14336 torch.bfloat16 1572.67 1566.68 0.996x 1380.70 1384.38 1.003x
Llama3-8B/TP8-QKV 1024 768 4096 torch.bfloat16 163.76 161.84 0.988x 85.91 88.00 1.024x
Llama3-8B/TP8-AttnOut 1024 4096 512 torch.bfloat16 106.91 107.50 1.006x 58.26 57.43 0.986x
Llama3-8B/TP8-GateUp 1024 3584 4096 torch.bfloat16 730.61 726.25 0.994x 400.66 408.95 1.021x
Llama3-8B/TP8-Down 1024 4096 1792 torch.bfloat16 375.78 373.99 0.995x 200.18 203.59 1.017x
Llama3-8B/TP8-QKV 2048 768 4096 torch.bfloat16 321.54 320.20 0.996x 174.12 174.79 1.004x
Llama3-8B/TP8-AttnOut 2048 4096 512 torch.bfloat16 216.19 213.79 0.989x 113.80 115.11 1.012x
Llama3-8B/TP8-GateUp 2048 3584 4096 torch.bfloat16 917.40 918.18 1.001x 973.01 990.20 1.018x
Llama3-8B/TP8-Down 2048 4096 1792 torch.bfloat16 757.40 745.23 0.984x 406.19 411.89 1.014x
Llama3-8B/TP8-QKV 4096 768 4096 torch.bfloat16 647.06 638.18 0.986x 343.95 347.77 1.011x
Llama3-8B/TP8-AttnOut 4096 4096 512 torch.bfloat16 432.88 439.18 1.015x 232.59 234.34 1.008x
Llama3-8B/TP8-GateUp 4096 3584 4096 torch.bfloat16 1249.15 1246.90 0.998x 1379.01 1378.43 1.000x
Llama3-8B/TP8-Down 4096 4096 1792 torch.bfloat16 1340.86 1351.01 1.008x 780.96 760.32 0.974x
Llama3-8B/TP8-QKV 8192 768 4096 torch.bfloat16 1011.14 1046.41 1.035x 753.62 740.72 0.983x
Llama3-8B/TP8-AttnOut 8192 4096 512 torch.bfloat16 864.57 861.49 0.996x 455.36 456.11 1.002x
Llama3-8B/TP8-GateUp 8192 3584 4096 torch.bfloat16 1331.39 1340.78 1.007x 1410.76 1401.87 0.994x
Llama3-8B/TP8-Down 8192 4096 1792 torch.bfloat16 1384.58 1401.17 1.012x 1118.15 1109.03 0.992x
Llama3-70B/TP8-QKV 1024 1280 8192 torch.bfloat16 490.69 497.37 1.014x 291.37 292.00 1.002x
Llama3-70B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 434.30 432.02 0.995x 223.56 227.47 1.017x
Llama3-70B/TP8-GateUp 1024 7168 8192 torch.bfloat16 1071.37 1090.79 1.018x 1015.08 1024.66 1.009x
Llama3-70B/TP8-Down 1024 8192 3584 torch.bfloat16 1166.13 1168.12 1.002x 851.74 854.46 1.003x
Llama3-70B/TP8-QKV 2048 1280 8192 torch.bfloat16 754.44 762.55 1.011x 630.23 620.15 0.984x
Llama3-70B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 885.39 893.65 1.009x 453.19 453.42 1.001x
Llama3-70B/TP8-GateUp 2048 7168 8192 torch.bfloat16 1405.45 1397.94 0.995x 1292.32 1292.95 1.000x
Llama3-70B/TP8-Down 2048 8192 3584 torch.bfloat16 1457.79 1453.66 0.997x 1163.03 1179.82 1.014x
Llama3-70B/TP8-QKV 4096 1280 8192 torch.bfloat16 1025.43 1005.08 0.980x 1022.05 1010.27 0.988x
Llama3-70B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 1217.94 1220.35 1.002x 787.22 787.92 1.001x
Llama3-70B/TP8-GateUp 4096 7168 8192 torch.bfloat16 1430.36 1430.71 1.000x 1382.28 1389.21 1.005x
Llama3-70B/TP8-Down 4096 8192 3584 torch.bfloat16 1500.80 1505.96 1.003x 1381.77 1368.31 0.990x
Llama3-70B/TP8-QKV 8192 1280 8192 torch.bfloat16 1303.64 1305.61 1.002x 1041.07 1052.20 1.011x
Llama3-70B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 1222.76 1216.39 0.995x 1032.74 1058.58 1.025x
Llama3-70B/TP8-GateUp 8192 7168 8192 torch.bfloat16 1409.23 1393.83 0.989x 1420.60 1409.07 0.992x
Llama3-70B/TP8-Down 8192 8192 3584 torch.bfloat16 1528.03 1538.20 1.007x 1404.21 1406.70 1.002x
Llama3-405B/TP8-QKV 1024 2304 16384 torch.bfloat16 904.21 898.17 0.993x 1060.39 1036.51 0.977x
Llama3-405B/TP8-AttnOut 1024 16384 2048 torch.bfloat16 1374.74 1355.05 0.986x 944.64 946.03 1.001x
Llama3-405B/TP8-GateUp 1024 13312 16384 torch.bfloat16 1247.84 1250.03 1.002x 1266.71 1309.21 1.034x
Llama3-405B/TP8-Down 1024 16384 6656 torch.bfloat16 1557.68 1556.10 0.999x 1104.87 1098.97 0.995x
Llama3-405B/TP8-QKV 2048 2304 16384 torch.bfloat16 942.92 939.66 0.997x 1158.83 1144.05 0.987x
Llama3-405B/TP8-AttnOut 2048 16384 2048 torch.bfloat16 1405.33 1393.94 0.992x 1232.41 1250.84 1.015x
Llama3-405B/TP8-GateUp 2048 13312 16384 torch.bfloat16 1249.54 1235.35 0.989x 1396.91 1373.03 0.983x
Llama3-405B/TP8-Down 2048 16384 6656 torch.bfloat16 1563.50 1572.38 1.006x 1266.70 1275.01 1.007x
Llama3-405B/TP8-QKV 4096 2304 16384 torch.bfloat16 1311.34 1309.76 0.999x 1178.44 1179.30 1.001x
Llama3-405B/TP8-AttnOut 4096 16384 2048 torch.bfloat16 1459.78 1444.47 0.990x 1383.79 1359.37 0.982x
Llama3-405B/TP8-GateUp 4096 13312 16384 torch.bfloat16 1253.45 1254.54 1.001x 1424.14 1430.10 1.004x
Llama3-405B/TP8-Down 4096 16384 6656 torch.bfloat16 1563.16 1553.43 0.994x 1269.96 1281.38 1.009x
Llama3-405B/TP8-QKV 8192 2304 16384 torch.bfloat16 1196.13 1192.27 0.997x 1197.02 1197.63 1.001x
Llama3-405B/TP8-AttnOut 8192 16384 2048 torch.bfloat16 1444.72 1462.17 1.012x 1435.09 1405.15 0.979x
Llama3-405B/TP8-GateUp 8192 13312 16384 torch.bfloat16 1305.87 1295.48 0.992x 1433.19 1434.02 1.001x
Llama3-405B/TP8-Down 8192 16384 6656 torch.bfloat16 1562.92 1548.92 0.991x 1329.06 1312.97 0.988x
Qwen2.5-7B/TP1-QKV 1024 4608 3584 torch.bfloat16 834.45 839.43 1.006x 454.28 461.59 1.016x
Qwen2.5-7B/TP1-AttnOut 1024 3584 3584 torch.bfloat16 658.91 650.06 0.987x 350.44 359.77 1.027x
Qwen2.5-7B/TP1-GateUp 1024 37888 3584 torch.bfloat16 1039.15 1057.32 1.017x 1105.16 1092.41 0.988x
Qwen2.5-7B/TP1-Down 1024 3584 18944 torch.bfloat16 1135.23 1151.80 1.015x 929.26 941.09 1.013x
Qwen2.5-7B/TP1-QKV 2048 4608 3584 torch.bfloat16 1123.78 1145.57 1.019x 948.76 1038.27 1.094x
Qwen2.5-7B/TP1-AttnOut 2048 3584 3584 torch.bfloat16 950.35 993.29 1.045x 778.31 772.29 0.992x
Qwen2.5-7B/TP1-GateUp 2048 37888 3584 torch.bfloat16 1207.31 1177.14 0.975x 1240.33 1205.00 0.972x
Qwen2.5-7B/TP1-Down 2048 3584 18944 torch.bfloat16 1317.45 1314.54 0.998x 1103.15 1116.11 1.012x
Qwen2.5-7B/TP1-QKV 4096 4608 3584 torch.bfloat16 1176.15 1168.29 0.993x 1331.64 1348.98 1.013x
Qwen2.5-7B/TP1-AttnOut 4096 3584 3584 torch.bfloat16 1288.68 1299.03 1.008x 1263.45 1263.49 1.000x
Qwen2.5-7B/TP1-GateUp 4096 37888 3584 torch.bfloat16 1229.85 1236.07 1.005x 1349.78 1346.28 0.997x
Qwen2.5-7B/TP1-Down 4096 3584 18944 torch.bfloat16 1397.97 1397.86 1.000x 1302.88 1297.94 0.996x
Qwen2.5-7B/TP1-QKV 8192 4608 3584 torch.bfloat16 1292.97 1334.65 1.032x 1318.98 1309.79 0.993x
Qwen2.5-7B/TP1-AttnOut 8192 3584 3584 torch.bfloat16 1324.98 1306.60 0.986x 1223.76 1234.15 1.008x
Qwen2.5-7B/TP1-GateUp 8192 37888 3584 torch.bfloat16 1423.73 1427.85 1.003x 1318.67 1314.88 0.997x
Qwen2.5-7B/TP1-Down 8192 3584 18944 torch.bfloat16 1397.80 1397.58 1.000x 1288.21 1289.39 1.001x
Qwen2.5-72B/TP8-QKV 1024 1280 8192 torch.bfloat16 493.81 486.35 0.985x 286.99 293.77 1.024x
Qwen2.5-72B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 432.26 434.02 1.004x 223.05 229.12 1.027x
Qwen2.5-72B/TP8-GateUp 1024 7392 8192 torch.bfloat16 1074.15 1065.03 0.992x 1057.86 1059.16 1.001x
Qwen2.5-72B/TP8-Down 1024 8192 3696 torch.bfloat16 607.86 584.66 0.962x 761.73 784.62 1.030x
Qwen2.5-72B/TP8-QKV 2048 1280 8192 torch.bfloat16 762.10 741.45 0.973x 620.88 645.59 1.040x
Qwen2.5-72B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 878.78 884.70 1.007x 454.10 465.87 1.026x
Qwen2.5-72B/TP8-GateUp 2048 7392 8192 torch.bfloat16 1336.88 1346.57 1.007x 1315.67 1324.50 1.007x
Qwen2.5-72B/TP8-Down 2048 8192 3696 torch.bfloat16 1301.85 1310.54 1.007x 1046.29 1038.17 0.992x
Qwen2.5-72B/TP8-QKV 4096 1280 8192 torch.bfloat16 1032.78 1009.95 0.978x 1017.51 1018.11 1.001x
Qwen2.5-72B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 1213.82 1223.14 1.008x 793.37 791.32 0.997x
Qwen2.5-72B/TP8-GateUp 4096 7392 8192 torch.bfloat16 1390.24 1387.97 0.998x 1360.55 1358.70 0.999x
Qwen2.5-72B/TP8-Down 4096 8192 3696 torch.bfloat16 1345.45 1339.22 0.995x 1263.22 1261.03 0.998x
Qwen2.5-72B/TP8-QKV 8192 1280 8192 torch.bfloat16 1319.24 1309.42 0.993x 1045.93 1056.10 1.010x
Qwen2.5-72B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 1240.01 1249.01 1.007x 1000.39 1012.30 1.012x
Qwen2.5-72B/TP8-GateUp 8192 7392 8192 torch.bfloat16 1294.76 1291.01 0.997x 1362.93 1368.38 1.004x
Qwen2.5-72B/TP8-Down 8192 8192 3696 torch.bfloat16 1373.41 1366.86 0.995x 1336.98 1333.71 0.998x
benchmark_gemm_fp8 (median 0.988x, min 0.401x, max 1.038x)
Case | M | N | K | dtype | FP8 Forward Base | FP8 Forward PR | FP8 Forward Speedup | FP8 Backward Base | FP8 Backward PR | FP8 Backward Speedup
Llama3-8B/TP1-QKV 1024 6144 4096 torch.bfloat16 416.89 407.31 0.977x 365.49 350.64 0.959x
Llama3-8B/TP1-AttnOut 1024 4096 4096 torch.bfloat16 286.54 273.39 0.954x 242.76 234.63 0.967x
Llama3-8B/TP1-GateUp 1024 28672 4096 torch.bfloat16 548.57 550.57 1.004x 1639.83 1702.61 1.038x
Llama3-8B/TP1-Down 1024 4096 14336 torch.bfloat16 446.49 445.96 0.999x 1540.76 1493.60 0.969x
Llama3-8B/TP1-QKV 2048 6144 4096 torch.bfloat16 846.58 809.82 0.957x 1432.76 680.41 0.475x
Llama3-8B/TP1-AttnOut 2048 4096 4096 torch.bfloat16 562.43 548.45 0.975x 1002.77 462.02 0.461x
Llama3-8B/TP1-GateUp 2048 28672 4096 torch.bfloat16 862.47 855.86 0.992x 1796.27 1792.72 0.998x
Llama3-8B/TP1-Down 2048 4096 14336 torch.bfloat16 713.95 715.87 1.003x 1802.56 1776.26 0.985x
Llama3-8B/TP1-QKV 4096 6144 4096 torch.bfloat16 1290.63 1271.35 0.985x 1840.90 1555.63 0.845x
Llama3-8B/TP1-AttnOut 4096 4096 4096 torch.bfloat16 1130.59 1089.03 0.963x 2051.26 905.90 0.442x
Llama3-8B/TP1-GateUp 4096 28672 4096 torch.bfloat16 1460.17 1467.62 1.005x 2154.53 2144.49 0.995x
Llama3-8B/TP1-Down 4096 4096 14336 torch.bfloat16 1014.90 1014.04 0.999x 2200.18 2194.11 0.997x
Llama3-8B/TP1-QKV 8192 6144 4096 torch.bfloat16 1771.48 1764.10 0.996x 1777.98 1780.02 1.001x
Llama3-8B/TP1-AttnOut 8192 4096 4096 torch.bfloat16 1516.59 1516.08 1.000x 1790.38 1780.72 0.995x
Llama3-8B/TP1-GateUp 8192 28672 4096 torch.bfloat16 1904.04 1901.36 0.999x 2185.83 2184.78 1.000x
Llama3-8B/TP1-Down 8192 4096 14336 torch.bfloat16 1287.22 1283.07 0.997x 2400.04 2391.69 0.997x
Llama3-8B/TP8-QKV 1024 768 4096 torch.bfloat16 51.30 49.23 0.960x 93.27 41.38 0.444x
Llama3-8B/TP8-AttnOut 1024 4096 512 torch.bfloat16 34.16 32.75 0.959x 61.68 27.38 0.444x
Llama3-8B/TP8-GateUp 1024 3584 4096 torch.bfloat16 236.53 227.33 0.961x 425.33 189.95 0.447x
Llama3-8B/TP8-Down 1024 4096 1792 torch.bfloat16 119.07 113.56 0.954x 213.74 94.65 0.443x
Llama3-8B/TP8-QKV 2048 768 4096 torch.bfloat16 102.02 97.55 0.956x 184.19 80.43 0.437x
Llama3-8B/TP8-AttnOut 2048 4096 512 torch.bfloat16 67.98 64.81 0.953x 109.52 53.81 0.491x
Llama3-8B/TP8-GateUp 2048 3584 4096 torch.bfloat16 472.76 454.97 0.962x 835.85 376.37 0.450x
Llama3-8B/TP8-Down 2048 4096 1792 torch.bfloat16 234.85 227.06 0.967x 421.16 186.50 0.443x
Llama3-8B/TP8-QKV 4096 768 4096 torch.bfloat16 201.66 194.86 0.966x 366.72 161.88 0.441x
Llama3-8B/TP8-AttnOut 4096 4096 512 torch.bfloat16 135.06 131.60 0.974x 241.29 107.82 0.447x
Llama3-8B/TP8-GateUp 4096 3584 4096 torch.bfloat16 943.94 900.94 0.954x 1673.36 746.75 0.446x
Llama3-8B/TP8-Down 4096 4096 1792 torch.bfloat16 475.53 456.38 0.960x 843.53 369.42 0.438x
Llama3-8B/TP8-QKV 8192 768 4096 torch.bfloat16 397.84 386.86 0.972x 717.45 318.65 0.444x
Llama3-8B/TP8-AttnOut 8192 4096 512 torch.bfloat16 266.86 259.11 0.971x 481.47 211.50 0.439x
Llama3-8B/TP8-GateUp 8192 3584 4096 torch.bfloat16 1287.71 1277.83 0.992x 1730.15 1763.56 1.019x
Llama3-8B/TP8-Down 8192 4096 1792 torch.bfloat16 934.26 899.90 0.963x 1184.55 735.82 0.621x
Llama3-70B/TP8-QKV 1024 1280 8192 torch.bfloat16 158.92 131.89 0.830x 282.81 141.65 0.501x
Llama3-70B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 133.15 129.72 0.974x 232.85 105.77 0.454x
Llama3-70B/TP8-GateUp 1024 7168 8192 torch.bfloat16 406.32 407.55 1.003x 1273.88 1300.77 1.021x
Llama3-70B/TP8-Down 1024 8192 3584 torch.bfloat16 463.08 449.43 0.971x 817.41 362.30 0.443x
Llama3-70B/TP8-QKV 2048 1280 8192 torch.bfloat16 332.69 324.45 0.975x 590.52 262.93 0.445x
Llama3-70B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 264.66 257.66 0.974x 470.04 206.70 0.440x
Llama3-70B/TP8-GateUp 2048 7168 8192 torch.bfloat16 766.83 777.71 1.014x 2069.73 2079.82 1.005x
Llama3-70B/TP8-Down 2048 8192 3584 torch.bfloat16 916.93 898.27 0.980x 1346.79 719.00 0.534x
Llama3-70B/TP8-QKV 4096 1280 8192 torch.bfloat16 626.26 626.45 1.000x 1185.91 518.39 0.437x
Llama3-70B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 520.77 508.45 0.976x 924.05 411.15 0.445x
Llama3-70B/TP8-GateUp 4096 7168 8192 torch.bfloat16 1040.12 1047.30 1.007x 2391.48 2396.51 1.002x
Llama3-70B/TP8-Down 4096 8192 3584 torch.bfloat16 1614.24 1604.80 0.994x 1550.40 1487.51 0.959x
Llama3-70B/TP8-QKV 8192 1280 8192 torch.bfloat16 571.96 565.82 0.989x 1728.83 1740.09 1.007x
Llama3-70B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 1044.13 1023.40 0.980x 690.52 690.23 1.000x
Llama3-70B/TP8-GateUp 8192 7168 8192 torch.bfloat16 1305.54 1309.23 1.003x 2321.50 2334.78 1.006x
Llama3-70B/TP8-Down 8192 8192 3584 torch.bfloat16 1995.64 2010.59 1.007x 1615.81 1619.20 1.002x
Llama3-405B/TP8-QKV 1024 2304 16384 torch.bfloat16 490.70 479.74 0.978x 891.25 470.88 0.528x
Llama3-405B/TP8-AttnOut 1024 16384 2048 torch.bfloat16 514.71 498.74 0.969x 848.12 391.64 0.462x
Llama3-405B/TP8-GateUp 1024 13312 16384 torch.bfloat16 579.98 577.80 0.996x 2366.75 2352.56 0.994x
Llama3-405B/TP8-Down 1024 16384 6656 torch.bfloat16 552.99 548.30 0.992x 1292.16 1300.64 1.007x
Llama3-405B/TP8-QKV 2048 2304 16384 torch.bfloat16 529.44 535.51 1.011x 1768.64 1536.04 0.868x
Llama3-405B/TP8-AttnOut 2048 16384 2048 torch.bfloat16 980.45 978.98 0.999x 965.67 743.78 0.770x
Llama3-405B/TP8-GateUp 2048 13312 16384 torch.bfloat16 901.21 899.02 0.998x 2624.82 2603.50 0.992x
Llama3-405B/TP8-Down 2048 16384 6656 torch.bfloat16 958.46 956.77 0.998x 1603.38 1592.63 0.993x
Llama3-405B/TP8-QKV 4096 2304 16384 torch.bfloat16 696.02 706.30 1.015x 2133.76 2121.95 0.994x
Llama3-405B/TP8-AttnOut 4096 16384 2048 torch.bfloat16 1396.06 1398.51 1.002x 1178.14 1162.43 0.987x
Llama3-405B/TP8-GateUp 4096 13312 16384 torch.bfloat16 1150.06 1151.27 1.001x 2754.81 2744.39 0.996x
Llama3-405B/TP8-Down 4096 16384 6656 torch.bfloat16 1445.70 1444.09 0.999x 1772.21 1764.32 0.996x
Llama3-405B/TP8-QKV 8192 2304 16384 torch.bfloat16 875.19 878.05 1.003x 2450.26 2451.96 1.001x
Llama3-405B/TP8-AttnOut 8192 16384 2048 torch.bfloat16 1895.28 1878.29 0.991x 1463.08 1464.39 1.001x
Llama3-405B/TP8-GateUp 8192 13312 16384 torch.bfloat16 1642.99 1643.30 1.000x 2818.38 2875.81 1.020x
Llama3-405B/TP8-Down 8192 16384 6656 torch.bfloat16 1781.32 1776.56 0.997x 1955.93 1950.84 0.997x
Qwen2.5-7B/TP1-QKV 1024 4608 3584 torch.bfloat16 245.20 239.18 0.975x 432.45 191.51 0.443x
Qwen2.5-7B/TP1-AttnOut 1024 3584 3584 torch.bfloat16 189.76 184.46 0.972x 334.31 148.58 0.444x
Qwen2.5-7B/TP1-GateUp 1024 37888 3584 torch.bfloat16 511.07 509.79 0.997x 1260.04 1254.61 0.996x
Qwen2.5-7B/TP1-Down 1024 3584 18944 torch.bfloat16 420.57 419.81 0.998x 1358.67 1353.00 0.996x
Qwen2.5-7B/TP1-QKV 2048 4608 3584 torch.bfloat16 486.59 472.44 0.971x 858.86 377.57 0.440x
Qwen2.5-7B/TP1-AttnOut 2048 3584 3584 torch.bfloat16 378.81 365.92 0.966x 668.03 291.22 0.436x
Qwen2.5-7B/TP1-GateUp 2048 37888 3584 torch.bfloat16 791.49 790.37 0.999x 1549.38 1545.83 0.998x
Qwen2.5-7B/TP1-Down 2048 3584 18944 torch.bfloat16 611.19 601.36 0.984x 1675.32 1698.05 1.014x
Qwen2.5-7B/TP1-QKV 4096 4608 3584 torch.bfloat16 974.46 935.89 0.960x 1692.07 728.21 0.430x
Qwen2.5-7B/TP1-AttnOut 4096 3584 3584 torch.bfloat16 713.30 727.26 1.020x 1313.12 563.62 0.429x
Qwen2.5-7B/TP1-GateUp 4096 37888 3584 torch.bfloat16 1159.85 1160.52 1.001x 1717.77 1715.36 0.999x
Qwen2.5-7B/TP1-Down 4096 3584 18944 torch.bfloat16 906.83 903.52 0.996x 1976.89 1979.59 1.001x
Qwen2.5-7B/TP1-QKV 8192 4608 3584 torch.bfloat16 1374.94 1364.16 0.992x 1517.27 1528.44 1.007x
Qwen2.5-7B/TP1-AttnOut 8192 3584 3584 torch.bfloat16 1264.20 1249.55 0.988x 1626.07 1194.42 0.735x
Qwen2.5-7B/TP1-GateUp 8192 37888 3584 torch.bfloat16 1896.52 1900.24 1.002x 1792.03 1776.86 0.992x
Qwen2.5-7B/TP1-Down 8192 3584 18944 torch.bfloat16 1112.18 1110.88 0.999x 2114.34 2109.42 0.998x
Qwen2.5-72B/TP8-QKV 1024 1280 8192 torch.bfloat16 146.27 142.46 0.974x 266.87 113.15 0.424x
Qwen2.5-72B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 122.70 118.23 0.964x 214.73 91.38 0.426x
Qwen2.5-72B/TP8-GateUp 1024 7392 8192 torch.bfloat16 385.64 382.08 0.991x 1041.60 1047.27 1.005x
Qwen2.5-72B/TP8-Down 1024 8192 3696 torch.bfloat16 401.29 400.32 0.998x 840.16 336.64 0.401x
Qwen2.5-72B/TP8-QKV 2048 1280 8192 torch.bfloat16 307.29 293.19 0.954x 531.19 228.12 0.429x
Qwen2.5-72B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 245.11 233.32 0.952x 421.54 181.18 0.430x
Qwen2.5-72B/TP8-GateUp 2048 7392 8192 torch.bfloat16 735.21 735.24 1.000x 1758.15 1741.20 0.990x
Qwen2.5-72B/TP8-Down 2048 8192 3696 torch.bfloat16 662.87 662.19 0.999x 1068.35 729.50 0.683x
Qwen2.5-72B/TP8-QKV 4096 1280 8192 torch.bfloat16 596.74 577.85 0.968x 1057.04 447.00 0.423x
Qwen2.5-72B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 483.70 465.44 0.962x 847.84 358.43 0.423x
Qwen2.5-72B/TP8-GateUp 4096 7392 8192 torch.bfloat16 983.66 990.29 1.007x 1957.53 1957.75 1.000x
Qwen2.5-72B/TP8-Down 4096 8192 3696 torch.bfloat16 962.46 963.35 1.001x 1324.31 1329.60 1.004x
Qwen2.5-72B/TP8-QKV 8192 1280 8192 torch.bfloat16 560.82 554.91 0.989x 1485.29 1465.89 0.987x
Qwen2.5-72B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 935.32 924.77 0.989x 698.43 699.77 1.002x
Qwen2.5-72B/TP8-GateUp 8192 7392 8192 torch.bfloat16 1215.17 1220.05 1.004x 2158.54 2146.23 0.994x
Qwen2.5-72B/TP8-Down 8192 8192 3696 torch.bfloat16 1246.18 1242.25 0.997x 1452.22 1446.39 0.996x
benchmark_grouped_gemm (median 0.999x, min 0.912x, max 1.134x)
Case B M N K dtype TE (CK_Tile) Forward Base TE (CK_Tile) Forward PR TE (CK_Tile) Forward Speedup TE (CK_Tile) Backward Base TE (CK_Tile) Backward PR TE (CK_Tile) Backward Speedup
DSV2-Lite-GateUP 2 512 2816 2048 torch.bfloat16 241.31 241.50 1.001x 165.25 174.21 1.054x
DSV2-Lite-Down 2 512 2048 1408 torch.bfloat16 157.97 157.66 0.998x 164.25 163.75 0.997x
DSV2-Lite-GateUP 2 1024 2816 2048 torch.bfloat16 470.04 472.15 1.004x 317.46 300.52 0.947x
DSV2-Lite-Down 2 1024 2048 1408 torch.bfloat16 311.59 309.83 0.994x 293.90 294.03 1.000x
DSV2-Lite-GateUP 2 2048 2816 2048 torch.bfloat16 836.18 816.91 0.977x 542.45 535.42 0.987x
DSV2-Lite-Down 2 2048 2048 1408 torch.bfloat16 588.03 593.84 1.010x 487.69 488.22 1.001x
DSV2-Lite-GateUP 2 4096 2816 2048 torch.bfloat16 862.06 854.77 0.992x 821.46 829.75 1.010x
DSV2-Lite-Down 2 4096 2048 1408 torch.bfloat16 804.83 913.05 1.134x 535.44 535.77 1.001x
DSV2-Lite-GateUP 4 512 2816 2048 torch.bfloat16 464.77 466.73 1.004x 298.10 297.98 1.000x
DSV2-Lite-Down 4 512 2048 1408 torch.bfloat16 303.65 303.31 0.999x 271.82 247.78 0.912x
DSV2-Lite-GateUP 4 1024 2816 2048 torch.bfloat16 802.06 800.88 0.999x 509.17 503.26 0.988x
DSV2-Lite-Down 4 1024 2048 1408 torch.bfloat16 584.50 588.16 1.006x 457.76 459.09 1.003x
DSV2-Lite-GateUP 4 2048 2816 2048 torch.bfloat16 841.69 837.09 0.995x 758.58 760.67 1.003x
DSV2-Lite-Down 4 2048 2048 1408 torch.bfloat16 915.53 915.39 1.000x 506.16 503.30 0.994x
DSV2-Lite-GateUP 4 4096 2816 2048 torch.bfloat16 1035.61 1032.06 0.997x 814.15 814.28 1.000x
DSV2-Lite-Down 4 4096 2048 1408 torch.bfloat16 976.74 968.55 0.992x 617.04 611.33 0.991x
DSV2-Lite-GateUP 8 512 2816 2048 torch.bfloat16 811.88 803.25 0.989x 497.07 496.35 0.999x
DSV2-Lite-Down 8 512 2048 1408 torch.bfloat16 562.85 551.32 0.980x 437.47 435.13 0.995x
DSV2-Lite-GateUP 8 1024 2816 2048 torch.bfloat16 811.96 819.62 1.009x 765.54 779.72 1.019x
DSV2-Lite-Down 8 1024 2048 1408 torch.bfloat16 871.38 875.18 1.004x 487.58 512.78 1.052x
DSV2-Lite-GateUP 8 2048 2816 2048 torch.bfloat16 1009.45 1011.40 1.002x 840.10 847.07 1.008x
DSV2-Lite-Down 8 2048 2048 1408 torch.bfloat16 936.19 934.18 0.998x 623.49 625.78 1.004x
DSV2-Lite-GateUP 8 4096 2816 2048 torch.bfloat16 1018.90 1015.76 0.997x 874.56 873.36 0.999x
DSV2-Lite-Down 8 4096 2048 1408 torch.bfloat16 994.19 994.77 1.001x 687.93 692.68 1.007x
DSV2-GateUP 5 512 3072 5120 torch.bfloat16 758.10 748.07 0.987x 651.82 649.44 0.996x
DSV2-Down 5 512 5120 1536 torch.bfloat16 805.76 782.73 0.971x 309.70 309.18 0.998x
DSV2-GateUP 5 1024 3072 5120 torch.bfloat16 1055.65 1115.60 1.057x 726.08 738.29 1.017x
DSV2-Down 5 1024 5120 1536 torch.bfloat16 860.51 840.11 0.976x 530.47 526.90 0.993x
DSV2-GateUP 5 2048 3072 5120 torch.bfloat16 1117.45 1107.97 0.992x 791.38 788.22 0.996x
DSV2-Down 5 2048 5120 1536 torch.bfloat16 862.90 864.92 1.002x 801.96 794.04 0.990x
DSV2-GateUP 5 4096 3072 5120 torch.bfloat16 1150.31 1146.38 0.997x 893.54 895.39 1.002x
DSV2-Down 5 4096 5120 1536 torch.bfloat16 975.09 960.36 0.985x 833.25 830.31 0.996x
DSV2-GateUP 10 512 3072 5120 torch.bfloat16 1005.55 983.90 0.978x 643.24 639.69 0.994x
DSV2-Down 10 512 5120 1536 torch.bfloat16 833.59 807.19 0.968x 491.60 485.48 0.988x
DSV2-GateUP 10 1024 3072 5120 torch.bfloat16 1055.26 1053.50 0.998x 751.88 751.79 1.000x
DSV2-Down 10 1024 5120 1536 torch.bfloat16 831.97 826.24 0.993x 760.98 759.32 0.998x
DSV2-GateUP 10 2048 3072 5120 torch.bfloat16 1117.79 1101.13 0.985x 853.62 861.91 1.010x
DSV2-Down 10 2048 5120 1536 torch.bfloat16 923.67 929.06 1.006x 814.38 820.78 1.008x
DSV2-GateUP 10 4096 3072 5120 torch.bfloat16 1144.04 1135.59 0.993x 909.52 908.71 0.999x
DSV2-Down 10 4096 5120 1536 torch.bfloat16 987.12 984.72 0.998x 870.50 872.74 1.003x
DSV2-GateUP 20 512 3072 5120 torch.bfloat16 969.08 977.88 1.009x 632.85 635.43 1.004x
DSV2-Down 20 512 5120 1536 torch.bfloat16 770.09 762.78 0.991x 629.63 629.36 1.000x
DSV2-GateUP 20 1024 3072 5120 torch.bfloat16 1033.21 1038.29 1.005x 794.20 791.82 0.997x
DSV2-Down 20 1024 5120 1536 torch.bfloat16 876.91 872.56 0.995x 750.70 742.67 0.989x
DSV2-GateUP 20 2048 3072 5120 torch.bfloat16 1087.90 1080.25 0.993x 870.51 870.66 1.000x
DSV2-Down 20 2048 5120 1536 torch.bfloat16 953.93 952.25 0.998x 818.09 823.26 1.006x
DSV2-GateUP 20 4096 3072 5120 torch.bfloat16 1157.31 1155.49 0.998x 891.78 898.10 1.007x
DSV2-Down 20 4096 5120 1536 torch.bfloat16 989.75 991.38 1.002x 858.18 860.10 1.002x
DSV3-GateUP 8 512 4096 7168 torch.bfloat16 1050.53 1059.97 1.009x 687.06 685.03 0.997x
DSV3-Down 8 512 7168 2048 torch.bfloat16 909.32 901.57 0.991x 517.39 519.84 1.005x
DSV3-GateUP 8 1024 4096 7168 torch.bfloat16 1133.68 1134.79 1.001x 812.56 811.88 0.999x
DSV3-Down 8 1024 7168 2048 torch.bfloat16 963.17 971.34 1.008x 820.30 827.47 1.009x
DSV3-GateUP 8 2048 4096 7168 torch.bfloat16 1181.85 1179.63 0.998x 926.65 921.78 0.995x
DSV3-Down 8 2048 7168 2048 torch.bfloat16 1053.68 1049.93 0.996x 885.84 879.64 0.993x
DSV3-GateUP 8 4096 4096 7168 torch.bfloat16 1210.41 1201.53 0.993x 960.94 962.05 1.001x
DSV3-Down 8 4096 7168 2048 torch.bfloat16 1083.47 1085.41 1.002x 934.58 934.96 1.000x
DSV3-GateUP 16 512 4096 7168 torch.bfloat16 1040.65 1030.43 0.990x 668.79 669.72 1.001x
DSV3-Down 16 512 7168 2048 torch.bfloat16 901.13 900.01 0.999x 697.83 695.37 0.996x
DSV3-GateUP 16 1024 4096 7168 torch.bfloat16 1127.35 1119.62 0.993x 835.60 834.86 0.999x
DSV3-Down 16 1024 7168 2048 torch.bfloat16 1005.05 1003.60 0.999x 813.51 806.73 0.992x
DSV3-GateUP 16 2048 4096 7168 torch.bfloat16 1165.20 1166.80 1.001x 919.75 919.29 0.999x
DSV3-Down 16 2048 7168 2048 torch.bfloat16 1053.64 1051.81 0.998x 886.32 884.69 0.998x
DSV3-GateUP 16 4096 4096 7168 torch.bfloat16 1207.24 1209.12 1.002x 940.29 946.10 1.006x
DSV3-Down 16 4096 7168 2048 torch.bfloat16 1078.18 1077.07 0.999x 909.47 914.80 1.006x
DSV3-GateUP 32 512 4096 7168 torch.bfloat16 1016.13 1017.20 1.001x 664.62 661.71 0.996x
DSV3-Down 32 512 7168 2048 torch.bfloat16 927.19 906.34 0.978x 665.70 671.98 1.009x
DSV3-GateUP 32 1024 4096 7168 torch.bfloat16 1106.20 1102.44 0.997x 809.68 809.12 0.999x
DSV3-Down 32 1024 7168 2048 torch.bfloat16 976.05 981.99 1.006x 793.37 795.26 1.002x
DSV3-GateUP 32 2048 4096 7168 torch.bfloat16 1157.14 1155.39 0.998x 889.82 884.26 0.994x
DSV3-Down 32 2048 7168 2048 torch.bfloat16 1018.36 1024.81 1.006x 873.57 874.36 1.001x
DSV3-GateUP 32 4096 4096 7168 torch.bfloat16 1181.02 1182.44 1.001x 908.43 907.72 0.999x
DSV3-Down 32 4096 7168 2048 torch.bfloat16 1049.28 1047.24 0.998x 903.45 881.54 0.976x
Grok-V2-GateUP 1 512 32768 8192 torch.bfloat16 1016.40 1012.02 0.996x 885.37 898.65 1.015x
Grok-V2-Down 1 512 8192 16384 torch.bfloat16 770.72 764.61 0.992x 910.54 933.63 1.025x
Grok-V2-GateUP 1 1024 32768 8192 torch.bfloat16 1400.59 1427.92 1.020x 1220.00 1232.17 1.010x
Grok-V2-Down 1 1024 8192 16384 torch.bfloat16 1132.22 1171.32 1.035x 1207.33 1225.05 1.015x
Grok-V2-GateUP 1 2048 32768 8192 torch.bfloat16 1446.72 1448.37 1.001x 1374.03 1371.83 0.998x
Grok-V2-Down 1 2048 8192 16384 torch.bfloat16 1485.19 1483.65 0.999x 1338.63 1354.75 1.012x
Grok-V2-GateUP 1 4096 32768 8192 torch.bfloat16 1460.68 1475.09 1.010x 1415.70 1414.40 0.999x
Grok-V2-Down 1 4096 8192 16384 torch.bfloat16 1501.04 1499.14 0.999x 1401.87 1399.71 0.998x
benchmark_normalization (median 0.993x, min 0.399x, max 1.490x)
Case M hidden_size dtype TE Forward GB/s Base TE Forward GB/s PR TE Forward GB/s Speedup TE Backward GB/s Base TE Backward GB/s PR TE Backward GB/s Speedup
Llama3-8B/RMSNorm 1024 4096 torch.bfloat16 638.60 624.90 0.979x 485.60 723.50 1.490x
Llama3-8B/RMSNorm 2048 4096 torch.bfloat16 1305.50 1272.80 0.975x 1425.70 1455.90 1.021x
Llama3-8B/RMSNorm 4096 4096 torch.bfloat16 2599.10 2535.90 0.976x 2952.10 2924.70 0.991x
Llama3-8B/RMSNorm 8192 4096 torch.bfloat16 5199.00 5077.80 0.977x 5496.30 5679.60 1.033x
Llama3-8B/LayerNorm 1024 4096 torch.bfloat16 552.20 555.60 1.006x 624.50 633.10 1.014x
Llama3-8B/LayerNorm 2048 4096 torch.bfloat16 1075.00 1110.80 1.033x 1271.70 1262.10 0.992x
Llama3-8B/LayerNorm 4096 4096 torch.bfloat16 2223.10 2195.00 0.987x 2508.00 2549.20 1.016x
Llama3-8B/LayerNorm 8192 4096 torch.bfloat16 4425.10 4423.10 1.000x 5060.20 5065.30 1.001x
Llama3-70B/RMSNorm 1024 8192 torch.bfloat16 1307.50 1294.10 0.990x 1448.60 1450.50 1.001x
Llama3-70B/RMSNorm 2048 8192 torch.bfloat16 2637.30 2578.10 0.978x 2786.50 2957.50 1.061x
Llama3-70B/RMSNorm 4096 8192 torch.bfloat16 4389.30 4381.20 0.998x 4916.80 4935.30 1.004x
Llama3-70B/RMSNorm 8192 8192 torch.bfloat16 4993.60 5036.40 1.009x 5392.30 5367.10 0.995x
Llama3-70B/LayerNorm 1024 8192 torch.bfloat16 1100.70 1124.10 1.021x 730.40 730.30 1.000x
Llama3-70B/LayerNorm 2048 8192 torch.bfloat16 2210.50 2195.90 0.993x 747.90 739.50 0.989x
Llama3-70B/LayerNorm 4096 8192 torch.bfloat16 4224.90 4258.50 1.008x 644.60 609.30 0.945x
Llama3-70B/LayerNorm 8192 8192 torch.bfloat16 4865.30 4836.00 0.994x 579.90 588.60 1.015x
Llama3-405B/RMSNorm 1024 16384 torch.bfloat16 660.50 651.70 0.987x 556.20 547.20 0.984x
Llama3-405B/RMSNorm 2048 16384 torch.bfloat16 705.90 707.90 1.003x 541.70 541.10 0.999x
Llama3-405B/RMSNorm 4096 16384 torch.bfloat16 725.70 736.10 1.014x 455.20 452.10 0.993x
Llama3-405B/RMSNorm 8192 16384 torch.bfloat16 581.40 581.80 1.001x 487.00 489.20 1.005x
Llama3-405B/LayerNorm 1024 16384 torch.bfloat16 675.60 691.60 1.024x 643.00 623.30 0.969x
Llama3-405B/LayerNorm 2048 16384 torch.bfloat16 690.30 690.80 1.001x 606.70 561.50 0.925x
Llama3-405B/LayerNorm 4096 16384 torch.bfloat16 702.70 696.50 0.991x 512.50 518.50 1.012x
Llama3-405B/LayerNorm 8192 16384 torch.bfloat16 562.60 563.80 1.002x 576.20 563.10 0.977x
Qwen2.5-7B/RMSNorm 1024 3584 torch.bfloat16 493.10 549.00 1.113x 398.80 253.20 0.635x
Qwen2.5-7B/RMSNorm 2048 3584 torch.bfloat16 1142.60 1104.60 0.967x 689.00 499.00 0.724x
Qwen2.5-7B/RMSNorm 4096 3584 torch.bfloat16 2252.90 2214.20 0.983x 1127.90 1013.50 0.899x
Qwen2.5-7B/RMSNorm 8192 3584 torch.bfloat16 3070.90 3066.90 0.999x 1819.80 1663.20 0.914x
Qwen2.5-7B/LayerNorm 1024 3584 torch.bfloat16 481.50 479.50 0.996x 348.70 218.50 0.627x
Qwen2.5-7B/LayerNorm 2048 3584 torch.bfloat16 960.40 943.90 0.983x 639.10 439.40 0.688x
Qwen2.5-7B/LayerNorm 4096 3584 torch.bfloat16 1900.00 1869.40 0.984x 1050.80 868.40 0.826x
Qwen2.5-7B/LayerNorm 8192 3584 torch.bfloat16 2688.60 2703.50 1.006x 1581.20 1549.50 0.980x
Qwen2.5-72B/RMSNorm 1024 8192 torch.bfloat16 1284.00 1248.70 0.973x 1442.30 576.10 0.399x
Qwen2.5-72B/RMSNorm 2048 8192 torch.bfloat16 2618.00 2506.80 0.958x 2917.20 1206.50 0.414x
Qwen2.5-72B/RMSNorm 4096 8192 torch.bfloat16 4393.80 4379.10 0.997x 4933.10 2493.20 0.505x
Qwen2.5-72B/RMSNorm 8192 8192 torch.bfloat16 5035.00 4997.90 0.993x 5387.60 5318.40 0.987x
Qwen2.5-72B/LayerNorm 1024 8192 torch.bfloat16 1108.30 1090.60 0.984x 745.10 486.50 0.653x
Qwen2.5-72B/LayerNorm 2048 8192 torch.bfloat16 2238.60 2231.80 0.997x 750.00 744.20 0.992x
Qwen2.5-72B/LayerNorm 4096 8192 torch.bfloat16 4254.00 4264.80 1.003x 647.00 612.20 0.946x
Qwen2.5-72B/LayerNorm 8192 8192 torch.bfloat16 4741.10 4861.60 1.025x 581.40 589.60 1.014x

@matthiasdiener matthiasdiener changed the title [proof-of-concept] CI performance regression microbenchmarking CI performance regression microbenchmarking Mar 16, 2026
@matthiasdiener matthiasdiener changed the title CI performance regression microbenchmarking Microbenchmarking and CI performance regression test Mar 16, 2026
@matthiasdiener matthiasdiener marked this pull request as ready for review March 16, 2026 15:57
def main():
parser = argparse.ArgumentParser(description="Compare benchmark CSVs")
parser.add_argument("base_csv", help="Base branch CSV")
parser.add_argument("pr_csv", help="PR branch CSV")
Collaborator
It is not necessarily a PR comparison. The script can compare results of two different branches, two different machines, or even run-to-run variations.
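As a rough sketch, the kind of comparison the script performs could look like this (the CSV layout, column names, and the `compare_csvs` helper are illustrative assumptions, not the actual script):

```python
import csv
import statistics


def compare_csvs(base_path, pr_path, key_cols=("Case", "M", "N", "K"),
                 value_col="tflops"):
    """Join two benchmark CSVs on the case columns and compute per-case speedups."""
    def load(path):
        with open(path, newline="") as f:
            return {tuple(row[c] for c in key_cols): float(row[value_col])
                    for row in csv.DictReader(f)}

    base, pr = load(base_path), load(pr_path)
    # Values are throughputs, so speedup > 1.0 means the second run is faster.
    speedups = {k: pr[k] / base[k] for k in base.keys() & pr.keys()}
    vals = list(speedups.values())
    # Summary triple like the report header: "(median 0.999x, min 0.912x, max 1.134x)"
    return speedups, (statistics.median(vals), min(vals), max(vals))
```

Nothing in the sketch assumes the two inputs came from a base and a PR branch; any two runs with matching case columns can be diffed this way.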

EOF
)"

- name: Performance regression check
Collaborator
Microbenchmarking hardly needs to run on every PR commit, and it does not need to run on PRs at all; it could run on pushes to the dev branch.
Also, could the action store results in the GHA cache and read them for comparison, instead of making two runs?

Contributor
Are we okay with comparing across machines? Especially when our CI pool changes, we might find differences that aren't easy to attribute to either the code change or the machine.

Collaborator
It is a very good question. CI machines are not ideal for benchmarking anyway, because even if the runner has exclusive access to the GPU, overall system load may vary significantly. On the other hand, hosts of the same type are expected to be identical.

DETAILS="perf_results/reports/details.md"
[ -f "$SUMMARY" ] || exit 0

SECTION_START="<!-- perf-section:${DISPLAY_NAME} -->"
Collaborator
There is going to be only one section: it compares two different runs on the same machine.

Contributor Author
There are two sections, one per runner (MI325 and MI355); both runners post to the same PR comment. The section markers let whichever runner finishes second insert its results without overwriting the first runner's section. Without them, whichever runner finishes last would clobber the other's report.

If you prefer, we could use two separate comments, one for each runner; I implemented the single-comment approach to keep the PR thread a bit cleaner.

Collaborator
I missed that it writes to PR comment. Why is it needed there?
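The marker-based upsert described in this thread could be sketched as follows (the end-marker format and the `upsert_section` helper are illustrative assumptions; only the start marker appears in the workflow snippet):

```python
def upsert_section(comment_body: str, name: str, content: str) -> str:
    """Insert or replace one runner's section in a shared PR comment body.

    Each runner owns the region between its start/end markers, so two
    runners can update the same comment without clobbering each other.
    """
    start = f"<!-- perf-section:{name} -->"
    end = f"<!-- perf-section-end:{name} -->"
    section = f"{start}\n{content}\n{end}"
    if start in comment_body and end in comment_body:
        # Replace only this runner's region; everything else is preserved.
        head, rest = comment_body.split(start, 1)
        _, tail = rest.split(end, 1)
        return head + section + tail
    # First post from this runner: append a fresh section.
    return comment_body.rstrip() + "\n\n" + section
```

Applied twice with different names, the function builds a comment with two independent sections; applied again with an existing name, it rewrites only that section.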

@Micky774 mentioned this pull request Mar 16, 2026
Contributor

@alextmagro left a comment
Hi Matthias, these look good! Are we able to expand this YAML to run benchmarks in C++, or are we limited to Python with this setup?

import transformer_engine.pytorch as te

# Sweep parameters
BATCH_SIZE = 2
Contributor
Do we need an additional batch size to try, or is 2 sufficient for attention?
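If more batch sizes turn out to be worthwhile, the `BATCH_SIZE` constant could become a sweep dimension. A minimal sketch, with hypothetical names and sweep values:

```python
import itertools

# Hypothetical sweep dimensions; the benchmark currently fixes BATCH_SIZE = 2.
BATCH_SIZES = (1, 2, 8)
SEQ_LENS = (1024, 4096)


def attention_cases():
    """Yield (case_name, batch_size, seq_len) tuples for the attention sweep."""
    for bs, seq in itertools.product(BATCH_SIZES, SEQ_LENS):
        yield (f"attn/bs{bs}-seq{seq}", bs, seq)
```

The trade-off is CI time: each extra batch size multiplies the number of attention cases, so any expansion should be weighed against the current ~5-minute budget mentioned in the description.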

Comment on lines +42 to +45
("Llama3-8B/TP1", 32, 8, 128, 1),
("Llama3-8B/TP8", 32, 8, 128, 8),
("Llama3-70B/TP8", 64, 8, 128, 8),
("Llama3-405B/TP8", 128, 8, 128, 8),
Contributor
Nit: let's rename these to 3.1 for accuracy.

@Micky774
Contributor

@ipanfilo @alextmagro I also wanted to mention that I have a proposed implementation of ASV as an alternative benchmarking approach, which might be worth a look as a point of comparison. Matthias and I have talked a bit about our two approaches and the pros and cons of each. I've left a small write-up of some of the benefits in the PR body, and I'll let Matthias present his own opinion for posterity.

In short, I prefer ASV since it provides a lot of important/critical functionality for free (e.g. machine indexing, commit-based regression tracking, static visualization/dashboards, statistical sampling, data storage/parsing), and ensures we don't have a strong maintenance or development burden since it is a widely-adopted OSS project.
