[Feat] Support PD Disaggregation via CRD Operator #841
Vivo50E wants to merge 8 commits into `vllm-project:main` from
Conversation
Summary of Changes

Hello @Vivo50E, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the vLLM production stack by introducing native Kubernetes Custom Resource Definition (CRD) support for Prefill-Decode (PD) disaggregated serving. This change allows users to define a 2-Prefill, 2-Decode (2P2D) topology declaratively within a single VLLMRuntime resource, streamlining the deployment and management of complex, high-throughput model serving architectures. The operator now provisions and manages distinct prefill and decode components, facilitating efficient KV cache transfer and robust routing.
Code Review
This pull request introduces significant new functionality by adding first-class CRD support for Prefill-Decode (PD) disaggregation, which is a great step towards more declarative and Kubernetes-native management of vLLM deployments. The changes are extensive, touching Helm charts, the operator controller, CRD definitions, and the router application. The introduction of the TopologySpec in the VLLMRuntime CRD is a clean approach to managing the disaggregated setup, and the accompanying test updates are thorough.
My review has identified a few key areas for improvement. There is a critical issue regarding inconsistent HTTP client usage (aiohttp vs. httpx) in the router's service discovery, which will likely lead to runtime errors. Additionally, for production readiness, the operator image should be sourced from an official organization repository instead of a personal one. I've also noted a couple of potentially extraneous sample files that might need to be removed for clarity. Overall, this is a strong contribution, and addressing these points will further enhance its quality and robustness.
ruizhang0101 left a comment:
Hi, thanks for the contribution :)
I have several concerns about this PR:
- For CRD support, why are there Helm-related changes, such as the Helm GitHub workflow and Helm templates?
- Most of the code looks outdated; it effectively reverts some recent PRs. I would suggest a re-implementation based on the current version.
@ruizhang0101, thanks for reviewing. I drafted this PR a long time ago and haven't had much time to revisit it since then. I'll rebase my changes onto the latest version and address your comments.
Hi, thanks for the prompt update and all the contribution! This version looks much better. I left some comments; please let me know if there are any questions.
Also, one very IMPORTANT requirement for merging this PR: could you show that it works as expected by following the tutorial? It should also not break the original functionality.
Force-pushed from 9ec9991 to e6caf27
Hi, could you fix the CI issue? Let me know if you have any questions :)
Signed-off-by: Yiqi Xue <xuey666@gmail.com>
…t and improved error handling Signed-off-by: Yiqi Xue <xuey666@gmail.com>
…ency Signed-off-by: Yiqi Xue <xuey666@gmail.com>
…hunk processing Signed-off-by: Yiqi Xue <xuey666@gmail.com>
Signed-off-by: Yiqi Xue <xuey666@gmail.com>
Signed-off-by: Yiqi Xue <xuey666@gmail.com>
Signed-off-by: Yiqi Xue <xuey666@gmail.com>
Signed-off-by: Yiqi Xue <xuey666@gmail.com> Made-with: Cursor
Hi @ruizhang0101, I pushed a fix to ensure the NIXL path is only activated when nixl_proxy_host is configured. I've run the CI tests locally (static-discovery, k8s-discovery, helm chart Two-Pods-Minimal-Example) and they all pass. Also verified CRD-based 2P2D disaggregated prefill works end-to-end locally, with the router correctly splitting traffic to prefill/decode nodes and the ZMQ proxy running on the configured port.
@ruizhang0101 The CI failures (Secure-Minimal-Example, CRD-Validation, k8s-discovery-e2e-test) all show minikube host: Stopped at the very first step — this appears to be a runner environment issue rather than a code problem. Could you re-trigger the CI? Thanks |
@ruizhang0101 The CI checks have all passed now. Could you please take another look and help move this PR forward when you have a moment? Thanks a lot!
ruizhang0101 left a comment:
I don't think it is a good idea to make this breaking change to the spec schema for PD. It introduces a lot of confusion around the existing spec, and it also duplicates much of the reconcile logic. The Helm chart solution for PD is to assign a "role" to each model runtime and route based on that role. I would suggest the same pattern here: use different CRs to represent prefill and decode.
Also, this PR is too large. I would suggest separate PRs to address this problem:
- PR1: PD routing logic
- PR2: CRD deployment for PD

Aside from that, I have attached some comments. Please take a look when you have time.
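For illustration, the role-based pattern the reviewer describes might look like the sketch below. This is hypothetical: the apiVersion, field names, and label conventions here are assumptions for the sake of the example, not the actual VLLMRuntime schema in this repository.

```yaml
# Two separate CRs, one per role, instead of one CR with a topology block.
apiVersion: production-stack.vllm.ai/v1alpha1   # hypothetical group/version
kind: VLLMRuntime
metadata:
  name: llama-prefill
  labels:
    role: prefill          # router selects prefill pods by this label
spec:
  replicas: 2
---
apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRuntime
metadata:
  name: llama-decode
  labels:
    role: decode           # router selects decode pods by this label
spec:
  replicas: 2
```

With this split, the reconcile loop handles every VLLMRuntime identically, and only the router needs to be role-aware.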
```go
// Legacy fields (used when EnablePDDisaggregation=false)
// Model configuration
Model ModelSpec `json:"model,omitempty"`
```
For model, vllmConfig, and deployment config: they were required before, so could you also make them required here?
```python
logger.warning(f"ZMQ: unknown message format: {msg_dict}")
continue
req_id = msg.req_id
self._finished_reqs.add(req_id)
```
It would be better to have a TTL on this set, since the number of requests can be huge; otherwise this will cause an OOM in the router pod.
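A minimal sketch of the suggested TTL bookkeeping, using only the standard library (the name `TTLSet` is hypothetical, not from the PR): entries expire after a fixed window, so the set stays bounded regardless of request volume.

```python
import time
from collections import OrderedDict


class TTLSet:
    """Set-like container whose entries expire after `ttl` seconds,
    bounding memory for long-running router pods."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        # req_id -> insertion timestamp; oldest entries stay at the front
        self._items: "OrderedDict[str, float]" = OrderedDict()

    def _evict(self, now: float) -> None:
        # Pop from the front until the first unexpired entry.
        while self._items:
            req_id, ts = next(iter(self._items.items()))
            if now - ts < self.ttl:
                break
            self._items.popitem(last=False)

    def add(self, req_id: str) -> None:
        now = time.monotonic()
        self._evict(now)
        self._items[req_id] = now
        self._items.move_to_end(req_id)

    def __contains__(self, req_id: str) -> bool:
        self._evict(time.monotonic())
        return req_id in self._items
```

Replacing the plain `set` with something like this keeps the `req_id in finished_reqs` checks unchanged while capping memory.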
```go
}

// Check if PD disaggregation is enabled
log.Info("Checking PD disaggregation flag", "Name", vllmRuntime.Name, "EnablePDDisaggregation", vllmRuntime.Spec.EnablePDDisaggregation)
```
I think these PD logs should be at debug level instead of info.
```python
app.state.request_stats_monitor = get_request_stats_monitor()
app.state.router = get_routing_logic()
app.state.request_rewriter = get_request_rewriter()
app.state.args = args
```
It is not recommended to stash all the args on app state; consider using a dataclass or named fields.
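A sketch of the suggested approach, with hypothetical field names (the router's actual settings differ): a frozen dataclass makes the dependency on CLI args explicit and typo-safe, instead of passing the whole argparse namespace around.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouterSettings:
    """Typed subset of the CLI args the request path actually needs,
    instead of stashing the whole argparse namespace on app.state."""
    nixl_proxy_host: Optional[str] = None  # hypothetical field names
    nixl_proxy_port: int = 7500
    request_timeout_s: float = 300.0


# During startup, build it once from the parsed args, e.g.:
# app.state.settings = RouterSettings(
#     nixl_proxy_host=args.nixl_proxy_host,
#     nixl_proxy_port=args.nixl_proxy_port,
# )
```

Handlers then read `app.state.settings.nixl_proxy_port` and get attribute errors at construction time rather than deep inside a request.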
```python
if self.event_loop_ready.is_set() and self.event_loop is not None:
    try:
        # Track all models we've ever seen
```
Please revert this since it is not relevant to this PR
```python
    Initialize aiohttp client sessions for prefill and decode endpoints.

    This must be called from an async context during app startup.
    """
    logger.info(
```
These should not be info logs
Hi @ruizhang0101, thanks for the feedback!
Summary
Add first-class CRD support for Prefill-Decode (PD) disaggregated serving in the production stack operator. Previously, PD disaggregation was only configurable via Helm values with manual multi-modelSpec setup. This PR enables a declarative xPyD topology (e.g., 2P2D) through a single VLLMRuntime resource, using NIXL for point-to-point GPU KV cache transfer.
Key Changes
Operator / CRD
- Adds `enablePDDisaggregation` and `topology` (prefill/decode) fields to the VLLMRuntime CRD

Router (NIXL-based disaggregated_prefill routing)
- Adds a new `route_disaggregated_prefill_nixl_request` alongside the existing `route_disaggregated_prefill_request` (LMCache shared storage mode). The NIXL path is automatically selected when the ZMQ proxy is active (`hasattr(app.state, 'zmq_proxy')`), preserving backward compatibility for Helm-based disagg deployments.
- `_prepare_nixl_prefill_request()` — tokenization (NIXL requires token IDs), `disagg_spec` construction with decode node IP, `kv_transfer_params` injection
- `_convert_completion_chunk_to_chat()` — converts `/v1/completions` SSE chunks to `chat.completion.chunk` format
- `_clean_completion_chunk()` — strips extra fields (`prompt_token_ids`, `token_ids`) from completion chunks
- ZMQ proxy (`zmq_proxy.py`) for KV transfer completion notifications
- `wait_decode_kv_ready()` with 10s timeout for recompute fallback
- `ProxyNotif` messages arrived but the router expected `NixlMsg` — added fallback dict decoding and removed a fatal `break`
- `NixlMsg` import: add a fallback chain for LMCache 0.3.13+ compatibility
- Fix: `StaticServiceDiscovery` used `httpx.AsyncClient` while the rest of the codebase uses `aiohttp.ClientSession`
- New `--nixl-peer-host/port`, `--nixl-proxy-host/port` CLI args
- Add `zmq`/`msgspec` deps to `pyproject.toml`

Docs
- `25-disagg-prefill-crd-enabled.md`

Component Diagram (2P2D)
```mermaid
graph TB
  Client([Client])
  subgraph "VLLMRouter CR"
    R[Router Pod]
    ZMQ[ZMQ PULL :7500]
  end
  subgraph "VLLMRuntime CR (enablePDDisaggregation: true)"
    subgraph "topology.prefill (replicas: 2)"
      P1["Prefill Pod 1<br/><i>vLLM + LMCache + NIXL</i><br/>kv_producer / sender"]
      P2["Prefill Pod 2<br/><i>vLLM + LMCache + NIXL</i><br/>kv_producer / sender"]
    end
    subgraph "topology.decode (replicas: 2)"
      D1["Decode Pod 1<br/><i>vLLM + LMCache + NIXL</i><br/>kv_consumer / receiver"]
      D2["Decode Pod 2<br/><i>vLLM + LMCache + NIXL</i><br/>kv_consumer / receiver"]
    end
  end
  Client -->|"① request"| R
  R -->|"② prefill"| P1 & P2
  P1 & P2 -.->|"③ NIXL KV transfer"| D1 & D2
  P1 & P2 -->|"④ ZMQ notify"| ZMQ
  R -->|"⑤ decode"| D1 & D2
  D1 & D2 -->|"⑥ tokens"| R
  R -->|"⑦ response"| Client
```

Relation to PR #669
The router-side NIXL KV transfer logic builds on #669 ([Feat][PD] latest PD support from LMCache with NIXL by @kobe0938). Components originating from #669:
- `src/vllm_router/services/request_service/request.py` — `route_disaggregated_prefill_nixl_request` flow
- `src/vllm_router/app.py` — ZMQ task lifecycle in FastAPI lifespan
- `src/vllm_router/parsers/parser.py` — NIXL CLI args
- `src/vllm_router/service_discovery.py` — prefill/decode client session management
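The SSE chunk conversion listed under Key Changes (`_convert_completion_chunk_to_chat()` plus `_clean_completion_chunk()`) can be sketched as below. This is a hypothetical simplification, not the PR's actual implementation; it only shows the shape of the rewrapping and which NIXL-internal fields get dropped.

```python
def convert_completion_chunk_to_chat(chunk: dict) -> dict:
    """Rewrap a /v1/completions SSE chunk as a chat.completion.chunk,
    dropping internal fields like prompt_token_ids / token_ids."""
    choices = []
    for choice in chunk.get("choices", []):
        choices.append({
            "index": choice.get("index", 0),
            # Completions use "text"; chat streaming uses a "delta" object.
            "delta": {"content": choice.get("text", "")},
            "finish_reason": choice.get("finish_reason"),
        })
    return {
        "id": chunk.get("id"),
        "object": "chat.completion.chunk",
        "created": chunk.get("created"),
        "model": chunk.get("model"),
        "choices": choices,
        # prompt_token_ids / token_ids intentionally not copied over
    }
```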
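The `wait_decode_kv_ready()` timeout-with-recompute-fallback pattern mentioned in Key Changes can be sketched with stdlib asyncio (the class name `KVReadyTracker` and the exact API are assumptions, not the PR's code):

```python
import asyncio


class KVReadyTracker:
    """Per-request events: the ZMQ listener calls notify() when a prefill
    node reports KV transfer completion; the routing path waits with a
    timeout and falls back to recompute on expiry."""

    def __init__(self) -> None:
        self._events: dict[str, asyncio.Event] = {}

    def _event(self, req_id: str) -> asyncio.Event:
        return self._events.setdefault(req_id, asyncio.Event())

    def notify(self, req_id: str) -> None:
        # Called from the ZMQ notification handler.
        self._event(req_id).set()

    async def wait_decode_kv_ready(self, req_id: str, timeout: float = 10.0) -> bool:
        try:
            await asyncio.wait_for(self._event(req_id).wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False  # caller falls back to recomputing the prefill
        finally:
            self._events.pop(req_id, None)  # avoid unbounded growth
```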
Test plan
- Operator unit tests (`go test --ginkgo.focus="VLLMRuntime"`)
- Local end-to-end 2P2D run with `lmcache/vllm-openai:latest` (vLLM 0.15.0, LMCache 0.3.13, NIXL 0.9.0) and model `meta-llama/Llama-3.2-3B-Instruct` — 4/4 tests pass
- Commits signed with `-s` when doing `git commit`
- PR title prefixes used: `[Bugfix]`, `[Feat]`, and `[CI]`

Detailed Checklist
Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Feat]` for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
- `[Router]` for changes to the `vllm_router` (e.g., routing algorithm, router observability, etc.).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Use `pre-commit` to format your code. See `README.md` for installation.

DCO and Signed-off-by
When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO.

Using `-s` with `git commit` will automatically add this header.

What to Expect for the Reviews
We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.