feat: parallel JIT warmup for faster server startup #2338

Draft · sunway513 wants to merge 1 commit into test_dsfp8_no_ck from feat/parallel-jit-warmup


Conversation

@sunway513 (Collaborator)

Summary

  • Adds warmup_jit.py — a standalone script that pre-builds all AITER JIT modules in parallel using ThreadPoolExecutor
  • Leverages existing PREBUILD_THREAD_NUM mechanism to split CPU budget across concurrent ninja builds
  • Gracefully handles build failures (e.g. CK-dependent modules in CK-free mode)
  • Supports --exclude to skip expensive/unnecessary modules per workload
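The core approach described above can be sketched as follows. This is a hypothetical illustration, not the actual `warmup_jit.py`: the `build_module` body, the exact `PREBUILD_THREAD_NUM` handling, and the function names are assumptions.

```python
# Sketch of parallel JIT warmup: fan out module builds across a thread pool,
# split the CPU budget between concurrent ninja builds, and tolerate failures.
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def build_module(name: str) -> None:
    """Placeholder for triggering one AITER JIT/ninja build."""
    print(f"building {name}")  # the real script would invoke the JIT compile here

def warmup(modules, parallel=4, exclude=()):
    # Divide the CPU budget so `parallel` concurrent ninja builds do not
    # oversubscribe the machine (the existing PREBUILD_THREAD_NUM mechanism).
    os.environ["PREBUILD_THREAD_NUM"] = str(max(1, (os.cpu_count() or 1) // parallel))
    targets = [m for m in modules if m not in set(exclude)]
    built, failed = [], []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(build_module, m): m for m in targets}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                fut.result()
                built.append(name)
            except Exception as err:  # e.g. a CK-dependent module in CK-free mode
                failed.append((name, err))
    return built, failed
```

A failed build is recorded rather than aborting the run, so one CK-dependent module cannot block warmup of the rest.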

Benchmark (MI355, 8x GPU, 256 CPU threads, 23 modules)

| Config | Total time | Speedup |
|---|---|---|
| Sequential (parallel=1) | 943.8s | baseline |
| Parallel (parallel=4) | 606.9s | 1.55x |
| Parallel=4, excl. aiter_operator | ~56s (vs ~390s) | ~7x for the remaining modules |

module_aiter_operator (551s) dominates both runs as a single massive compilation unit. The 20 other built modules together drop from ~390s to ~56s with parallel=4.
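A quick sanity check of the reported speedups, using only the figures quoted above:

```python
# All figures are taken from the benchmark table; this just verifies that the
# quoted speedups are internally consistent.
sequential_total = 943.8   # seconds, parallel=1, all 23 modules
parallel_total = 606.9     # seconds, parallel=4
serial_bottleneck = 551.0  # module_aiter_operator, one huge compilation unit
other_sequential = 390.0   # remaining modules, sequential
other_parallel = 56.0      # remaining modules, parallel=4

overall_speedup = sequential_total / parallel_total  # ~1.55x
others_speedup = other_sequential / other_parallel   # ~7x
```

Because the 551s `module_aiter_operator` build is serial in both configurations, it caps the overall speedup at ~1.55x even though the parallelizable portion speeds up ~7x, which is why excluding it from warmup pays off so heavily.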

Usage

```shell
# Pre-warm all modules with 4-way parallelism, skip the expensive optional one
python warmup_jit.py --parallel 4 --exclude module_aiter_operator

# Then start the server — zero JIT overhead
ENABLE_CK=0 AITER_USE_OPUS_MOE_SORTING=1 \
  python -m atom.entrypoints.openai_server --model /data/DeepSeek-R1-0528 \
  --kv_cache_dtype fp8 -tp 8
```

Test plan

  • Sequential baseline validated (943.8s, 21 built, 2 failed CK-free)
  • Parallel=4 validated (606.9s, same 21 built, 2 failed)
  • Build failures handled gracefully (fmha_v3_fwd/varlen_fwd in CK-free)
  • Integration test: warmup then ATOM server startup with DeepSeek-R1
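The test plan treats the two CK-free failures as expected. A launch wrapper could encode that expectation explicitly; the sketch below is hypothetical, and the exact varlen module name (`fmha_v3_varlen_fwd`) is an assumption expanded from "fmha_v3_fwd/varlen_fwd" above.

```python
# Hypothetical guard: after warmup, allow only the known CK-dependent
# failures before proceeding to server startup.
EXPECTED_CK_FREE_FAILURES = {"fmha_v3_fwd", "fmha_v3_varlen_fwd"}

def check_warmup(built, failed):
    """Return True when every failure is a known CK-dependent module."""
    unexpected = {name for name, _ in failed} - EXPECTED_CK_FREE_FAILURES
    if unexpected:
        raise RuntimeError(f"unexpected JIT warmup failures: {sorted(unexpected)}")
    return True
```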

Closes #2336

🤖 Generated with Claude Code

Add warmup_jit.py that pre-builds all JIT modules in parallel using
ThreadPoolExecutor, leveraging the existing PREBUILD_THREAD_NUM
mechanism to split CPU budget across concurrent ninja builds.

Benchmark on MI355 (256 CPU threads, 23 modules):
- Sequential: 943.8s
- Parallel (4 threads): 606.9s (1.55x overall)
- Excluding module_aiter_operator: 390s -> 56s (7x speedup)

Usage:
  python warmup_jit.py --parallel 4 --exclude module_aiter_operator
  python warmup_jit.py --clean --sequential  # baseline

Closes #2336
@github-actions (Contributor)

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 2338 --add-label <label>`
