
[Plugin][Recipe] Refine and add recipes for OOT design#343

Open
zejunchen-zejun wants to merge 15 commits into main from zejun/update_oot_recipe

Conversation

@zejunchen-zejun
Contributor

No description provided.

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copilot AI review requested due to automatic review settings March 16, 2026 07:44
Contributor

Copilot AI left a comment


Pull request overview

This PR reorganizes and expands the documentation for running ATOM as a vLLM out-of-tree (OOT) plugin backend by moving the core guide under recipes/atom_vllm/, adding an animated “injection flow” visualization, and introducing per-model launch recipes.

Changes:

  • Move/replace the vLLM OOT plugin backend guide into recipes/atom_vllm/ with a clearer explanation of plugin entry points and integration flow.
  • Add model-specific OOT run recipes (DeepSeek-R1, GLM-4, GPT-OSS, Kimi-K2, Qwen3-235B).
  • Add an animated HTML diagram illustrating the vLLM→ATOM plugin injection sequence.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| recipes/vLLM-ATOM-OOT-Plugin-Backend.md | Removes the previous top-level OOT backend recipe (now effectively relocated). |
| recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md | New consolidated OOT guide with entry points + launch instructions + flow reference. |
| recipes/atom_vllm/DeepSeek-R1.md | New DeepSeek-R1 OOT launch recipe. |
| recipes/atom_vllm/GLM-4.md | New GLM-4-MoE OOT launch recipe. |
| recipes/atom_vllm/GPT-OSS.md | New GPT-OSS OOT launch recipe. |
| recipes/atom_vllm/Kimi-K2-Thinking.md | New Kimi-K2-Thinking OOT launch recipe (uses --trust-remote-code). |
| recipes/atom_vllm/Qwen-235B.md | New Qwen3-235B OOT launch recipe with relevant env toggles. |
| recipes/atom_vllm/atom_vllm_oot_injection.html | New animated visualization of the plugin registration/injection flow. |
Comments suppressed due to low confidence (1)

recipes/vLLM-ATOM-OOT-Plugin-Backend.md:1

  • This file is being removed/moved, but the repo still references recipes/vLLM-ATOM-OOT-Plugin-Backend.md (e.g., README links to it). As-is, those links will break. Consider keeping a small stub at the old path that links/redirects to recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md, and/or update all references in the same PR.


Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copilot AI review requested due to automatic review settings March 16, 2026 08:56
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR reorganizes and expands documentation for running ATOM as a vLLM out-of-tree (OOT) plugin backend by moving the main guide into recipes/atom_vllm/, adding an architecture/flow explanation (with an animated SVG), and introducing per-model OOT launch recipes.

Changes:

  • Move/replace the vLLM OOT plugin backend guide under recipes/atom_vllm/ with updated plugin-flow explanation and profiling instructions.
  • Add new OOT recipes for several large models (Qwen3-235B, Kimi-K2-Thinking, GPT-OSS, GLM-4, DeepSeek-R1).
  • Add an animated SVG illustrating the plugin injection/execution flow.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| recipes/vLLM-ATOM-OOT-Plugin-Backend.md | Removes the old top-level OOT plugin backend recipe doc (content relocated). |
| recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md | New/updated canonical OOT plugin backend guide, including plugin entry-point flow and optional profiling. |
| recipes/atom_vllm/Qwen-235B.md | Adds a Qwen3-235B OOT launch + accuracy validation recipe. |
| recipes/atom_vllm/Kimi-K2-Thinking.md | Adds a Kimi-K2-Thinking OOT launch + accuracy validation recipe (trust-remote-code). |
| recipes/atom_vllm/GPT-OSS.md | Adds a GPT-OSS-120B OOT launch + accuracy validation recipe with an accuracy warning note. |
| recipes/atom_vllm/GLM-4.md | Adds a GLM-4-MoE OOT launch + accuracy validation recipe. |
| recipes/atom_vllm/DeepSeek-R1.md | Adds DeepSeek-R1 OOT recipes for FP8 and MXFP4 checkpoints. |
| recipes/atom_vllm/atom_vllm_oot_injection.svg | Adds an animated diagram of the vLLM↔ATOM OOT injection/execution flow. |


Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copilot AI review requested due to automatic review settings March 16, 2026 09:05
@zejunchen-zejun
Contributor Author

Hi, @PerryZhang01 @gbyu-amd @XiaobingSuper
Could you help check the recipes here? We need to make sure the recipes are correct for users.
Please pay special attention to the vLLM launch-server commands.
Thank you

Contributor

Copilot AI left a comment


Pull request overview

This PR reorganizes and expands the documentation for running ATOM as a vLLM out-of-tree (OOT) plugin backend by moving the main integration recipe into a dedicated recipes/atom_vllm/ folder, adding per-model run recipes, and including a visual execution-flow diagram.

Changes:

  • Move/replace the vLLM OOT plugin backend recipe into recipes/atom_vllm/ with updated integration explanation.
  • Add model-specific OOT run recipes (Qwen3-235B, Kimi-K2, GPT-OSS, GLM-4, DeepSeek-R1).
  • Add an SVG diagram illustrating the vLLM+ATOM OOT execution flow.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| recipes/vLLM-ATOM-OOT-Plugin-Backend.md | Removes the old top-level vLLM OOT plugin backend recipe (content relocated). |
| recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md | New canonical OOT integration doc (entry points, flow, supported models, run instructions). |
| recipes/atom_vllm/Qwen-235B.md | New model-specific recipe for running Qwen3-235B via vLLM OOT platform. |
| recipes/atom_vllm/Kimi-K2-Thinking.md | New model-specific recipe for running Kimi-K2-Thinking via vLLM OOT platform. |
| recipes/atom_vllm/GPT-OSS.md | New model-specific recipe for running GPT-OSS-120B via vLLM OOT platform. |
| recipes/atom_vllm/GLM-4.md | New model-specific recipe for running GLM-4-MoE via vLLM OOT platform. |
| recipes/atom_vllm/DeepSeek-R1.md | New model-specific recipe for running DeepSeek-R1 via vLLM OOT platform. |
| recipes/atom_vllm/atom_vllm_oot_injection.svg | Adds an execution-flow diagram used by the new integration doc. |


Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copilot AI review requested due to automatic review settings March 17, 2026 03:58
Contributor

Copilot AI left a comment


Pull request overview

This PR reorganizes and expands the documentation for running ATOM as a vLLM out-of-tree (OOT) plugin backend by moving the main guide into a dedicated recipes/atom_vllm/ section, adding model-specific run recipes, and including an execution-flow diagram. It also removes an older “known issue” warning from the plugin config generator.

Changes:

  • Move/replace the vLLM OOT plugin backend guide into recipes/atom_vllm/ with a more detailed explanation of plugin entrypoints, execution flow, and supported models.
  • Add several model-specific OOT recipes (DeepSeek-R1, GLM-4-MoE, GPT-OSS, Kimi-K2, Qwen3-235B) plus an SVG flow illustration.
  • Remove the max_num_batched_tokens “known issue” warning block from atom/plugin/config.py.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| recipes/vLLM-ATOM-OOT-Plugin-Backend.md | Removes the previous top-level vLLM OOT backend recipe doc (content relocated). |
| recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md | New primary OOT backend guide: plugin mechanism, entrypoints, execution flow, supported models, and usage. |
| recipes/atom_vllm/atom_vllm_oot_injection.svg | Adds an animated execution-flow diagram referenced by the new guide. |
| recipes/atom_vllm/Qwen-235B.md | Adds an OOT recipe for running Qwen3-235B-A22B. |
| recipes/atom_vllm/Kimi-K2-Thinking.md | Adds an OOT recipe for running Kimi-K2-Thinking (trust-remote-code). |
| recipes/atom_vllm/GPT-OSS.md | Adds an OOT recipe for running GPT-OSS-120B with a TP8 accuracy caution. |
| recipes/atom_vllm/GLM-4.md | Adds an OOT recipe for running GLM-4-MoE checkpoints. |
| recipes/atom_vllm/DeepSeek-R1.md | Adds an OOT recipe for running DeepSeek-R1 FP8/MXFP4. |
| atom/plugin/config.py | Removes a warning related to a prior fused_moe illegal-memory-access "known issue" threshold. |


wuhuikx and others added 2 commits March 17, 2026 13:29
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copilot AI review requested due to automatic review settings March 17, 2026 07:56
Contributor

Copilot AI left a comment


Pull request overview

Moves the vLLM out-of-tree (OOT) plugin backend documentation into the Sphinx docs site and adds model-specific vLLM OOT launch recipes, plus an execution-flow SVG for the guide.

Changes:

  • Relocates the vLLM OOT plugin backend guide from recipes/ into docs/ and wires it into the Sphinx documentation index.
  • Adds model-specific vLLM OOT launch recipes (Qwen3-235B, Kimi-K2, GPT-OSS, GLM-4, DeepSeek-R1) under recipes/atom_vllm/.
  • Updates top-level README link and adds SVG assets for the vLLM OOT execution flow.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| recipes/vLLM-ATOM-OOT-Plugin-Backend.md | Removes the old standalone recipe guide (content moved into docs). |
| recipes/atom_vllm/vLLM-ATOM-OOT-Plugin-Backend.md | Adds a stub pointing readers to the new docs guide. |
| recipes/atom_vllm/Qwen-235B.md | Adds a Qwen3-235B vLLM OOT launch + lm_eval recipe. |
| recipes/atom_vllm/Kimi-K2-Thinking.md | Adds a Kimi-K2 vLLM OOT launch + lm_eval recipe (trust-remote-code). |
| recipes/atom_vllm/GPT-OSS.md | Adds a GPT-OSS vLLM OOT launch + lm_eval recipe (notes TP8 accuracy issue). |
| recipes/atom_vllm/GLM-4.md | Adds a GLM-4-MoE vLLM OOT launch + lm_eval recipe. |
| recipes/atom_vllm/DeepSeek-R1.md | Adds a DeepSeek-R1 vLLM OOT launch + lm_eval recipe (FP8 + MXFP4). |
| recipes/atom_vllm/atom_vllm_oot_injection.svg | Adds an execution-flow SVG copy under recipes. |
| README.md | Updates the framework integration link to point at the new docs guide. |
| docs/vllm_plugin_backend_guide.md | Adds the dedicated vLLM OOT plugin backend guide for the docs site. |
| docs/index.rst | Adds the new vLLM guide into the Sphinx toctree and index page. |
| docs/assets/atom_vllm_oot_injection.svg | Adds the SVG used by the docs guide. |
| atom/plugin/config.py | Removes a plugin-mode warning about large max_num_batched_tokens and fused_moe illegal memory access. |


Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copilot AI review requested due to automatic review settings March 18, 2026 09:05
Contributor

Copilot AI left a comment


Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

@zejunchen-zejun
Contributor Author

Hi, @valarLip @gyohuangxin

Could you help review this PR? It adds the OOT recipes and the associated documentation.

Thank you

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
```bash
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--max-model-len 16384 \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
Contributor


```bash
2>&1 | tee log.serve.log &
```

Remove the last line; it's not necessary.

```bash
--host localhost \
--port 8000 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
```
Contributor


Remove it from the default command. Instead, add a description noting that users who want to enable expert parallelism (EP) can add --enable-expert-parallel.
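A minimal sketch of what the suggested default command could look like. All flags are copied from the snippets quoted in this PR; the model path is a placeholder, and the script only assembles and prints the command rather than launching a server:

```shell
#!/bin/sh
# Sketch: default launch command with --enable-expert-parallel left out,
# per the review suggestion. MODEL is a placeholder path.
MODEL="/path/to/model"
CMD="vllm serve ${MODEL} \
  --host localhost \
  --port 8000 \
  --tensor-parallel-size 8 \
  --compilation-config '{\"cudagraph_mode\": \"FULL_AND_PIECEWISE\"}' \
  --no-enable-prefix-caching"

# Users who want expert parallelism append the flag themselves:
# CMD="${CMD} --enable-expert-parallel"

echo "${CMD}"
```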

The vLLM OOT plugin backend keeps the standard vLLM CLI, server APIs, and general usage flow compatible with upstream vLLM. For general server options and API usage, refer to the [official vLLM documentation](https://docs.vllm.ai/en/latest/).

```bash
export ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1
```
Contributor


Please add a comment here so users understand what the flag does — it helps fuse qk_norm, qk_rope, and the FP8 block-scale quant, right? What about the other quant schemes? @gbyu-amd knows more details.
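One possible shape for the requested inline comment, as a sketch. The wording is a guess at the flag's behavior based on its name and this review thread; the exact fusion and quant coverage should be confirmed with @gbyu-amd:

```shell
# Assumed behavior (to be confirmed): fuse the qk_norm + qk_rope ops with the
# KV-cache quant step into a single pass. Opt-in via environment variable.
export ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1
echo "fusion flag set to: ${ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION}"
```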

```bash
--model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
Contributor


Remove this line, and please attach the accuracy results here for reference.
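A sketch of the cleaned-up evaluation invocation with the final `tee` line dropped, as requested. Only arguments quoted in this PR are used; the `lm_eval` entrypoint name is assumed from the log filename in the snippet, and `model`/`url`/`task` are placeholder values. The script only assembles and prints the command:

```shell
#!/bin/sh
# Sketch: lm_eval invocation without the trailing `tee` redirection.
# model, url, and task are placeholders for illustration only.
model="placeholder-model"
url="http://localhost:8000/v1"
task="placeholder-task"
CMD="lm_eval \
  --model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
  --tasks ${task} \
  --num_fewshot 3"
echo "${CMD}"
```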

```bash
--model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
Contributor


remove

```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
Contributor


remove

```bash
--host localhost \
--port 8000 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
```
Contributor


Remove it from the default command. Instead, add a description noting that users who want to enable expert parallelism (EP) can add --enable-expert-parallel.

```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
Contributor


remove

```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
Contributor


remove

```bash
--host localhost \
--port 8000 \
--trust-remote-code \
--tensor-parallel-size 4 \
```
Contributor


Can you help clarify here that both TP4 and TP8 work fine?

```bash
--model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
Contributor


remove

```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
Contributor


remove

```bash
--model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
Contributor


remove

```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
Contributor


remove

```bash
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
2>&1 | tee log.serve.log &
```
Contributor


remove

```bash
--model_args model=${model},base_url=${url},num_concurrent=16,max_retries=3,tokenized_requests=False \
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
Contributor


remove

```bash
--tasks ${task} \
--num_fewshot 3 \
2>&1 | tee log.lmeval.log
```
Contributor


We'd better put the reference accuracy numbers here.
