[Plugin][Recipe] Refine and add recipes for OOT design #343
zejunchen-zejun wants to merge 15 commits into main from
Conversation
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Pull request overview
This PR reorganizes and expands the documentation for running ATOM as a vLLM out-of-tree (OOT) plugin backend by moving the core guide under recipes/atom_vllm/, adding an animated “injection flow” visualization, and introducing per-model launch recipes.
Changes:
- Move/replace the vLLM OOT plugin backend guide into recipes/atom_vllm/ with a clearer explanation of plugin entry points and integration flow.
- Add model-specific OOT run recipes (DeepSeek-R1, GLM-4, GPT-OSS, Kimi-K2, Qwen3-235B).
- Add an animated HTML diagram illustrating the vLLM→ATOM plugin injection sequence.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| recipes/vLLM-ATOM-OOT-Plugin-Backend.md | Removes the previous top-level OOT backend recipe (now effectively relocated). |
| recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md | New consolidated OOT guide with entry points + launch instructions + flow reference. |
| recipes/atom_vllm/DeepSeek-R1.md | New DeepSeek-R1 OOT launch recipe. |
| recipes/atom_vllm/GLM-4.md | New GLM-4-MoE OOT launch recipe. |
| recipes/atom_vllm/GPT-OSS.md | New GPT-OSS OOT launch recipe. |
| recipes/atom_vllm/Kimi-K2-Thinking.md | New Kimi-K2-Thinking OOT launch recipe (uses --trust-remote-code). |
| recipes/atom_vllm/Qwen-235B.md | New Qwen3-235B OOT launch recipe with relevant env toggles. |
| recipes/atom_vllm/atom_vllm_oot_injection.html | New animated visualization of the plugin registration/injection flow. |
Comments suppressed due to low confidence (1)
recipes/vLLM-ATOM-OOT-Plugin-Backend.md:1
- This file is being removed/moved, but the repo still references recipes/vLLM-ATOM-OOT-Plugin-Backend.md (e.g., README links to it). As-is, those links will break. Consider keeping a small stub at the old path that links/redirects to recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md, and/or update all references in the same PR.
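The suggested stub could be as small as a one-line pointer kept at the old path. A minimal sketch, assuming the paths from the review comment; the stub's wording is illustrative:

```shell
# Create a redirect stub at the old recipe path (body text is illustrative).
mkdir -p recipes
cat > recipes/vLLM-ATOM-OOT-Plugin-Backend.md <<'EOF'
# Moved

This guide has moved to
[recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md](atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md).
EOF
```

A stub like this keeps old README and external links working while the relocated guide remains the single source of truth.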
Pull request overview
This PR reorganizes and expands documentation for running ATOM as a vLLM out-of-tree (OOT) plugin backend by moving the main guide into recipes/atom_vllm/, adding an architecture/flow explanation (with an animated SVG), and introducing per-model OOT launch recipes.
Changes:
- Move/replace the vLLM OOT plugin backend guide under recipes/atom_vllm/ with an updated plugin-flow explanation and profiling instructions.
- Add new OOT recipes for several large models (Qwen3-235B, Kimi-K2-Thinking, GPT-OSS, GLM-4, DeepSeek-R1).
- Add an animated SVG illustrating the plugin injection/execution flow.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| recipes/vLLM-ATOM-OOT-Plugin-Backend.md | Removes the old top-level OOT plugin backend recipe doc (content relocated). |
| recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md | New/updated canonical OOT plugin backend guide, including plugin entry-point flow and optional profiling. |
| recipes/atom_vllm/Qwen-235B.md | Adds a Qwen3-235B OOT launch + accuracy validation recipe. |
| recipes/atom_vllm/Kimi-K2-Thinking.md | Adds a Kimi-K2-Thinking OOT launch + accuracy validation recipe (trust-remote-code). |
| recipes/atom_vllm/GPT-OSS.md | Adds a GPT-OSS-120B OOT launch + accuracy validation recipe with an accuracy warning note. |
| recipes/atom_vllm/GLM-4.md | Adds a GLM-4-MoE OOT launch + accuracy validation recipe. |
| recipes/atom_vllm/DeepSeek-R1.md | Adds DeepSeek-R1 OOT recipes for FP8 and MXFP4 checkpoints. |
| recipes/atom_vllm/atom_vllm_oot_injection.svg | Adds an animated diagram of the vLLM↔ATOM OOT injection/execution flow. |
Hi, @PerryZhang01 @gbyu-amd @XiaobingSuper
Pull request overview
This PR reorganizes and expands the documentation for running ATOM as a vLLM out-of-tree (OOT) plugin backend by moving the main integration recipe into a dedicated recipes/atom_vllm/ folder, adding per-model run recipes, and including a visual execution-flow diagram.
Changes:
- Move/replace the vLLM OOT plugin backend recipe into recipes/atom_vllm/ with an updated integration explanation.
- Add model-specific OOT run recipes (Qwen3-235B, Kimi-K2, GPT-OSS, GLM-4, DeepSeek-R1).
- Add an SVG diagram illustrating the vLLM+ATOM OOT execution flow.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| recipes/vLLM-ATOM-OOT-Plugin-Backend.md | Removes the old top-level vLLM OOT plugin backend recipe (content relocated). |
| recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md | New canonical OOT integration doc (entry points, flow, supported models, run instructions). |
| recipes/atom_vllm/Qwen-235B.md | New model-specific recipe for running Qwen3-235B via vLLM OOT platform. |
| recipes/atom_vllm/Kimi-K2-Thinking.md | New model-specific recipe for running Kimi-K2-Thinking via vLLM OOT platform. |
| recipes/atom_vllm/GPT-OSS.md | New model-specific recipe for running GPT-OSS-120B via vLLM OOT platform. |
| recipes/atom_vllm/GLM-4.md | New model-specific recipe for running GLM-4-MoE via vLLM OOT platform. |
| recipes/atom_vllm/DeepSeek-R1.md | New model-specific recipe for running DeepSeek-R1 via vLLM OOT platform. |
| recipes/atom_vllm/atom_vllm_oot_injection.svg | Adds an execution-flow diagram used by the new integration doc. |
Pull request overview
This PR reorganizes and expands the documentation for running ATOM as a vLLM out-of-tree (OOT) plugin backend by moving the main guide into a dedicated recipes/atom_vllm/ section, adding model-specific run recipes, and including an execution-flow diagram. It also removes an older “known issue” warning from the plugin config generator.
Changes:
- Move/replace the vLLM OOT plugin backend guide into recipes/atom_vllm/ with a more detailed explanation of plugin entrypoints, execution flow, and supported models.
- Add several model-specific OOT recipes (DeepSeek-R1, GLM-4-MoE, GPT-OSS, Kimi-K2, Qwen3-235B) plus an SVG flow illustration.
- Remove the max_num_batched_tokens "known issue" warning block from atom/plugin/config.py.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| recipes/vLLM-ATOM-OOT-Plugin-Backend.md | Removes the previous top-level vLLM OOT backend recipe doc (content relocated). |
| recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md | New primary OOT backend guide: plugin mechanism, entrypoints, execution flow, supported models, and usage. |
| recipes/atom_vllm/atom_vllm_oot_injection.svg | Adds an animated execution-flow diagram referenced by the new guide. |
| recipes/atom_vllm/Qwen-235B.md | Adds an OOT recipe for running Qwen3-235B-A22B. |
| recipes/atom_vllm/Kimi-K2-Thinking.md | Adds an OOT recipe for running Kimi-K2-Thinking (trust-remote-code). |
| recipes/atom_vllm/GPT-OSS.md | Adds an OOT recipe for running GPT-OSS-120B with a TP8 accuracy caution. |
| recipes/atom_vllm/GLM-4.md | Adds an OOT recipe for running GLM-4-MoE checkpoints. |
| recipes/atom_vllm/DeepSeek-R1.md | Adds an OOT recipe for running DeepSeek-R1 FP8/MXFP4. |
| atom/plugin/config.py | Removes a warning related to a prior fused_moe illegal-memory-access “known issue” threshold. |
Pull request overview
Moves the vLLM out-of-tree (OOT) plugin backend documentation into the Sphinx docs site and adds model-specific vLLM OOT launch recipes, plus an execution-flow SVG for the guide.
Changes:
- Relocates the vLLM OOT plugin backend guide from recipes/ into docs/ and wires it into the Sphinx documentation index.
- Adds model-specific vLLM OOT launch recipes (Qwen3-235B, Kimi-K2, GPT-OSS, GLM-4, DeepSeek-R1) under recipes/atom_vllm/.
- Updates the top-level README link and adds SVG assets for the vLLM OOT execution flow.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| recipes/vLLM-ATOM-OOT-Plugin-Backend.md | Removes the old standalone recipe guide (content moved into docs). |
| recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md | Adds a stub pointing readers to the new docs guide. |
| recipes/atom_vllm/Qwen-235B.md | Adds a Qwen3-235B vLLM OOT launch + lm_eval recipe. |
| recipes/atom_vllm/Kimi-K2-Thinking.md | Adds a Kimi-K2 vLLM OOT launch + lm_eval recipe (trust-remote-code). |
| recipes/atom_vllm/GPT-OSS.md | Adds a GPT-OSS vLLM OOT launch + lm_eval recipe (notes TP8 accuracy issue). |
| recipes/atom_vllm/GLM-4.md | Adds a GLM-4-MoE vLLM OOT launch + lm_eval recipe. |
| recipes/atom_vllm/DeepSeek-R1.md | Adds a DeepSeek-R1 vLLM OOT launch + lm_eval recipe (FP8 + MXFP4). |
| recipes/atom_vllm/atom_vllm_oot_injection.svg | Adds an execution-flow SVG copy under recipes. |
| README.md | Updates the framework integration link to point at the new docs guide. |
| docs/vllm_plugin_backend_guide.md | Adds the dedicated vLLM OOT plugin backend guide for the docs site. |
| docs/index.rst | Adds the new vLLM guide into the Sphinx toctree and index page. |
| docs/assets/atom_vllm_oot_injection.svg | Adds the SVG used by the docs guide. |
| atom/plugin/config.py | Removes a plugin-mode warning about large max_num_batched_tokens and fused_moe illegal memory access. |
Could you help review this PR? It adds the OOT recipes and the associated documentation. Thank you.
```bash
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--max-model-len 16384 \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
`2>&1 | tee log.serve.log &`
Remove this last line; it's not necessary.
```bash
--host localhost \
--port 8000 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
```
Remove this from the default command. Instead, add a description noting that users who want expert parallelism can add --enable-expert-parallel.
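One way a recipe could present EP as opt-in: assemble the argument string conditionally and document the toggle. The flag names come from the snippet above; the ENABLE_EP variable and everything else is illustrative, and the command is only echoed here, not executed:

```shell
# Default serve arguments; expert parallelism (EP) is appended only on request.
CMD="--host localhost --port 8000 --tensor-parallel-size 8"
# Set ENABLE_EP=1 before launching to turn on expert parallelism.
if [ "${ENABLE_EP:-0}" = "1" ]; then
  CMD="$CMD --enable-expert-parallel"
fi
echo "$CMD"
```

Keeping EP out of the default keeps the baseline command portable, while a one-line note tells users how to opt in.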
The vLLM OOT plugin backend keeps the standard vLLM CLI, server APIs, and general usage flow compatible with upstream vLLM. For general server options and API usage, refer to the [official vLLM documentation](https://docs.vllm.ai/en/latest/).

```bash
export ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1
```
Please add a comment here so users understand the flag. Does it help fuse qk_norm, qk_rope, and the FP8 block-scale quant? What about the other quant schemes? @gbyu-amd knows more details.
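A sketch of the kind of inline comment being requested. The description is the reviewer's unconfirmed understanding of the flag and should be corrected by the maintainers:

```shell
# Unconfirmed: enables fusing qk_norm and qk_rope with the FP8 block-scale
# cache quantization into one kernel; leave unset (or 0) to disable.
export ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1
```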
```bash
--model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
Remove this line, and please attach the accuracy results here for reference.
```bash
--model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
```bash
--host localhost \
--port 8000 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
```
Remove this from the default command. Instead, add a description noting that users who want expert parallelism can add --enable-expert-parallel.
```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
```bash
--host localhost \
--port 8000 \
--trust-remote-code \
--tensor-parallel-size 4 \
```
Could you clarify here that both TP4 and TP8 work fine?
```bash
--model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
```bash
--model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
```bash
--model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
```bash
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
We'd better put the reference accuracy results here.
No description provided.