feat: parallel JIT warmup for faster server startup #2338

Draft · sunway513 wants to merge 1 commit into test_dsfp8_no_ck from feat/parallel-jit-warmup


Conversation

@sunway513 (Collaborator)

Summary

  • Adds warmup_jit.py — a standalone script that pre-builds all AITER JIT modules in parallel using ThreadPoolExecutor
  • Leverages existing PREBUILD_THREAD_NUM mechanism to split CPU budget across concurrent ninja builds
  • Gracefully handles build failures (e.g. CK-dependent modules in CK-free mode)
  • Supports --exclude to skip expensive/unnecessary modules per workload
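The core approach described above can be sketched as follows. This is a hypothetical illustration, not the actual `warmup_jit.py`: the `build_module` body, the exact `PREBUILD_THREAD_NUM` handling, and the function names are assumptions.

```python
# Sketch of parallel JIT warmup: fan out module builds across a thread pool,
# split the CPU budget between concurrent ninja builds, and tolerate failures.
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def build_module(name: str) -> None:
    """Placeholder for triggering one AITER JIT/ninja build."""
    print(f"building {name}")  # the real script would invoke the JIT compile here

def warmup(modules, parallel=4, exclude=()):
    # Divide the CPU budget so `parallel` concurrent ninja builds do not
    # oversubscribe the machine (the existing PREBUILD_THREAD_NUM mechanism).
    os.environ["PREBUILD_THREAD_NUM"] = str(max(1, (os.cpu_count() or 1) // parallel))
    targets = [m for m in modules if m not in set(exclude)]
    built, failed = [], []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(build_module, m): m for m in targets}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                fut.result()
                built.append(name)
            except Exception as err:  # e.g. a CK-dependent module in CK-free mode
                failed.append((name, err))
    return built, failed
```

A failed build is recorded rather than aborting the run, so one CK-dependent module cannot block warmup of the rest.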

Benchmark (MI355, 8x GPU, 256 CPU threads, 23 modules)

| Config | Total time | Speedup |
|---|---|---|
| Sequential (parallel=1) | 943.8s | baseline |
| Parallel (parallel=4) | 606.9s | 1.55x |
| Parallel=4, excl. aiter_operator | ~56s (vs ~390s) | ~7x for the remaining modules |

module_aiter_operator (551s) dominates both runs as a single massive compilation unit. The 20 other built modules together drop from ~390s to ~56s with parallel=4.
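A quick sanity check of the reported speedups, using only the figures quoted above:

```python
# All figures are taken from the benchmark table; this just verifies that the
# quoted speedups are internally consistent.
sequential_total = 943.8   # seconds, parallel=1, all 23 modules
parallel_total = 606.9     # seconds, parallel=4
serial_bottleneck = 551.0  # module_aiter_operator, one huge compilation unit
other_sequential = 390.0   # remaining modules, sequential
other_parallel = 56.0      # remaining modules, parallel=4

overall_speedup = sequential_total / parallel_total  # ~1.55x
others_speedup = other_sequential / other_parallel   # ~7x
```

Because the 551s `module_aiter_operator` build is serial in both configurations, it caps the overall speedup at ~1.55x even though the parallelizable portion speeds up ~7x, which is why excluding it from warmup pays off so heavily.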

Usage

```shell
# Pre-warm all modules with 4-way parallelism, skip the expensive optional one
python warmup_jit.py --parallel 4 --exclude module_aiter_operator

# Then start the server — zero JIT overhead
ENABLE_CK=0 AITER_USE_OPUS_MOE_SORTING=1 \
  python -m atom.entrypoints.openai_server --model /data/DeepSeek-R1-0528 \
  --kv_cache_dtype fp8 -tp 8
```

Test plan

  • Sequential baseline validated (943.8s, 21 built, 2 failed CK-free)
  • Parallel=4 validated (606.9s, same 21 built, 2 failed)
  • Build failures handled gracefully (fmha_v3_fwd/varlen_fwd in CK-free)
  • Integration test: warmup then ATOM server startup with DeepSeek-R1
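The test plan treats the two CK-free failures as expected. A launch wrapper could encode that expectation explicitly; the sketch below is hypothetical, and the exact varlen module name (`fmha_v3_varlen_fwd`) is an assumption expanded from "fmha_v3_fwd/varlen_fwd" above.

```python
# Hypothetical guard: after warmup, allow only the known CK-dependent
# failures before proceeding to server startup.
EXPECTED_CK_FREE_FAILURES = {"fmha_v3_fwd", "fmha_v3_varlen_fwd"}

def check_warmup(built, failed):
    """Return True when every failure is a known CK-dependent module."""
    unexpected = {name for name, _ in failed} - EXPECTED_CK_FREE_FAILURES
    if unexpected:
        raise RuntimeError(f"unexpected JIT warmup failures: {sorted(unexpected)}")
    return True
```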

Closes #2336

🤖 Generated with Claude Code

Add warmup_jit.py that pre-builds all JIT modules in parallel using
ThreadPoolExecutor, leveraging the existing PREBUILD_THREAD_NUM
mechanism to split CPU budget across concurrent ninja builds.

Benchmark on MI355 (256 CPU threads, 23 modules):
- Sequential: 943.8s
- Parallel (4 threads): 606.9s (1.55x overall)
- Excluding module_aiter_operator: 390s -> 56s (7x speedup)

Usage:
  python warmup_jit.py --parallel 4 --exclude module_aiter_operator
  python warmup_jit.py --clean --sequential  # baseline

Closes #2336
@github-actions (Contributor)

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 2338 --add-label <label>`
