
[Feat] Support PD Disaggregation via CRD Operator#841

Open
Vivo50E wants to merge 8 commits into vllm-project:main from Vivo50E:pd_crd

Conversation

@Vivo50E

@Vivo50E Vivo50E commented Feb 19, 2026

Summary

Add first-class CRD support for Prefill-Decode (PD) disaggregated serving in the production stack operator. Previously, PD disaggregation was only configurable via Helm values with manual multi-modelSpec setup. This PR enables a declarative xPyD topology (e.g., 2P2D) through a single VLLMRuntime resource, using NIXL for point-to-point GPU KV cache transfer.

Key Changes

Operator / CRD

  • Add enablePDDisaggregation and topology (prefill/decode) fields to VLLMRuntime CRD
  • Operator controller creates separate Deployments, Services, and PVCs for prefill and decode node pools
  • Add LMCache/NIXL environment variable injection for KV transfer config
  • Add xPyD sample YAMLs (VLLMRuntime + VLLMRouter) and unit tests

Router (NIXL-based disaggregated_prefill routing)

Adds a new route_disaggregated_prefill_nixl_request alongside the existing route_disaggregated_prefill_request (LMCache shared storage mode). The NIXL path is automatically selected when ZMQ proxy is active (hasattr(app.state, 'zmq_proxy')), preserving backward compatibility for Helm-based disagg deployments.
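The path selection can be sketched as below. This is an illustrative skeleton, not the PR's code: the handler bodies and the `dispatch` name are hypothetical stand-ins, and only the `hasattr(app.state, 'zmq_proxy')` check mirrors the described selection logic.

```python
import asyncio
from types import SimpleNamespace

# Hypothetical stand-ins for the two real handlers named above.
async def route_disaggregated_prefill_nixl_request(state, request):
    return "nixl"  # NIXL point-to-point KV transfer path

async def route_disaggregated_prefill_request(state, request):
    return "lmcache-storage"  # legacy LMCache shared-storage path

async def dispatch(app_state, request):
    # The ZMQ proxy is only started for NIXL-configured deployments,
    # so its presence on app.state selects the NIXL path; Helm-based
    # disagg deployments never set it and keep the old behavior.
    if hasattr(app_state, "zmq_proxy"):
        return await route_disaggregated_prefill_nixl_request(app_state, request)
    return await route_disaggregated_prefill_request(app_state, request)

print(asyncio.run(dispatch(SimpleNamespace(zmq_proxy=object()), {})))  # nixl
print(asyncio.run(dispatch(SimpleNamespace(), {})))  # lmcache-storage
```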

  • _prepare_nixl_prefill_request() — tokenization (NIXL requires token IDs), disagg_spec construction with decode node IP, kv_transfer_params injection
  • _convert_completion_chunk_to_chat() — converts /v1/completions SSE chunks to chat.completion.chunk format
  • _clean_completion_chunk() — strips extra fields (prompt_token_ids, token_ids) from completion chunks
  • Add ZMQ proxy module (zmq_proxy.py) for KV transfer completion notifications
  • Add wait_decode_kv_ready() with 10s timeout for recompute fallback
  • Fix ZMQ message decode: LMCache 0.3.13+ sends ProxyNotif but router expected NixlMsg — added fallback dict decoding and removed fatal break
  • Fix NixlMsg import: add fallback chain for LMCache 0.3.13+ compatibility
  • Fix HTTP client consistency: StaticServiceDiscovery used httpx.AsyncClient while rest of codebase uses aiohttp.ClientSession
  • Add --nixl-peer-host/port, --nixl-proxy-host/port CLI args
  • Add zmq/msgspec deps to pyproject.toml
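The `wait_decode_kv_ready()` timeout-plus-fallback behavior described above can be sketched as follows. Only the function name and the 10 s default come from the PR; the `KVNotifier` class, `on_zmq_notify`, and the internals are an illustrative assumption of how a notification set plus timeout could be wired.

```python
import asyncio

class KVNotifier:
    """Illustrative sketch: a set of finished request IDs fed by the
    ZMQ PULL task, plus a bounded wait before falling back to recompute."""

    def __init__(self):
        self._finished_reqs = set()
        self._event = asyncio.Event()

    def on_zmq_notify(self, req_id):
        # Called when a prefill pod reports that the NIXL KV transfer
        # for req_id has completed.
        self._finished_reqs.add(req_id)
        self._event.set()

    async def wait_decode_kv_ready(self, req_id, timeout=10.0):
        # Returns False on timeout so the caller can proceed with
        # kv_load_failure_policy=recompute instead of hanging forever.
        async def _wait():
            while req_id not in self._finished_reqs:
                self._event.clear()
                await self._event.wait()

        try:
            await asyncio.wait_for(_wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False

async def demo():
    n = KVNotifier()
    # Simulate the ZMQ notification arriving 50 ms into the wait.
    asyncio.get_running_loop().call_later(0.05, n.on_zmq_notify, "req-1")
    ready = await n.wait_decode_kv_ready("req-1", timeout=1.0)
    timed_out = await n.wait_decode_kv_ready("req-2", timeout=0.1)
    return ready, timed_out

print(asyncio.run(demo()))  # (True, False)
```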

Docs

  • Add tutorial 25-disagg-prefill-crd-enabled.md

Component Diagram (2P2D)

graph TB
    Client([Client])

    subgraph "VLLMRouter CR"
        R[Router Pod]
        ZMQ[ZMQ PULL :7500]
    end

    subgraph "VLLMRuntime CR (enablePDDisaggregation: true)"
        subgraph "topology.prefill (replicas: 2)"
            P1["Prefill Pod 1<br/><i>vLLM + LMCache + NIXL</i><br/>kv_producer / sender"]
            P2["Prefill Pod 2<br/><i>vLLM + LMCache + NIXL</i><br/>kv_producer / sender"]
        end
        subgraph "topology.decode (replicas: 2)"
            D1["Decode Pod 1<br/><i>vLLM + LMCache + NIXL</i><br/>kv_consumer / receiver"]
            D2["Decode Pod 2<br/><i>vLLM + LMCache + NIXL</i><br/>kv_consumer / receiver"]
        end
    end

    Client -->|"① request"| R
    R -->|"② prefill"| P1 & P2
    P1 & P2 -.->|"③ NIXL KV transfer"| D1 & D2
    P1 & P2 -->|"④ ZMQ notify"| ZMQ
    R -->|"⑤ decode"| D1 & D2
    D1 & D2 -->|"⑥ tokens"| R
    R -->|"⑦ response"| Client

Relation to PR #669

The router-side NIXL KV transfer logic builds on #669 ([Feat][PD] latest PD support from LMCache with NIXL by @kobe0938). Components originating from #669:

  • src/vllm_router/services/request_service/request.py — route_disaggregated_prefill_nixl_request flow
  • src/vllm_router/app.py — ZMQ task lifecycle in FastAPI lifespan
  • src/vllm_router/parsers/parser.py — nixl CLI args
  • src/vllm_router/service_discovery.py — prefill/decode client session management

Test plan

  • Go operator controller unit tests (6/6 passing via go test --ginkgo.focus="VLLMRuntime")
  • Pre-commit checks all passing
  • End-to-end manual test on Minikube + 4 GPUs (4x A16) with lmcache/vllm-openai:latest (vLLM 0.15.0, LMCache 0.3.13, NIXL 0.9.0), model meta-llama/Llama-3.2-3B-Instruct — 4/4 tests pass

  • Make sure the code changes pass the pre-commit checks.
  • Sign-off your commit by using -s when doing git commit
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Vivo50E, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the vLLM production stack by introducing native Kubernetes Custom Resource Definition (CRD) support for Prefill-Decode (PD) disaggregated serving. This change allows users to define a 2-Prefill, 2-Decode (2P2D) topology declaratively within a single VLLMRuntime resource, streamlining the deployment and management of complex, high-throughput model serving architectures. The operator now intelligently provisions and manages distinct prefill and decode components, facilitating efficient KV cache transfer and robust routing.

Highlights

  • CRD Operator Support for Prefill-Decode (PD) Disaggregation: Introduced first-class CRD support for Prefill-Decode (PD) disaggregated serving, enabling declarative 2-Prefill, 2-Decode (2P2D) topology through a single VLLMRuntime resource, moving away from manual Helm value configurations.
  • VLLMRuntime CRD Enhancements: Added enablePDDisaggregation and topology fields to the VLLMRuntime CRD, allowing the operator to create separate Deployments, Services, and PVCs for prefill and decode node pools.
  • VLLMRouter Integration for PD: Extended VLLMRouter to support PD-specific arguments, including Nixl proxy and peer host/port configurations, and conditionally starts a ZMQ task only when routing logic is disaggregated_prefill.
  • Improved ZMQ Message Decoding: Implemented fixes for ZMQ proxy message decoding, addressing issues where LMCache sent ProxyNotif while the router expected NixlMsg, preventing crashes and adding fallback dictionary decoding.
  • KV Transfer Timeout: Added a 10-second timeout to wait_decode_kv_ready() to ensure decode operations proceed via kv_load_failure_policy=recompute if KV transfer notifications are delayed.
  • New Dockerfile and Documentation: Included a new Dockerfile.pd for building PD-capable router images and comprehensive documentation, including a tutorial for end-to-end 2P2D CRD deployment and troubleshooting guides.


Changelog
  • .github/values-06-session-routing.yaml
    • Updated model specifications to use llama-prefill and llama-decode names.
    • Adjusted lmcacheConfig for Prefill-Decode (PD) settings.
    • Updated router image and removed some router resource and probe configurations.
  • .github/values-07-prefix-routing.yaml
    • Updated model specifications to use llama-prefill and llama-decode names.
    • Adjusted lmcacheConfig for Prefill-Decode (PD) settings.
    • Updated router image and removed some router resource and probe configurations.
  • .github/values-08-roundrobin-routing.yaml
    • Updated model specifications to use llama-prefill and llama-decode names.
    • Adjusted lmcacheConfig for Prefill-Decode (PD) settings.
    • Updated router image and removed some router resource and probe configurations.
  • .github/values-09-kvaware-routing.yaml
    • Updated model specifications to use llama-prefill and llama-decode names.
    • Adjusted lmcacheConfig for Prefill-Decode (PD) settings.
    • Updated router image and removed some router resource and probe configurations.
  • .github/values-10-disagg-prefill.yaml
    • Updated model names and image tags for prefill and decode nodes.
    • Modified lmcacheConfig for PD disaggregation, including Nixl proxy and peer settings.
    • Updated router image and added Nixl peer/proxy host/port arguments to the router configuration.
  • docker/Dockerfile.pd
    • Added a new Dockerfile for building a PD-capable router image, including ZMQ, msgspec, and httpx dependencies.
  • docs/source/developer_guide/docker.rst
    • Updated Docker build command to reference the new Dockerfile.pd.
  • helm/templates/deployment-router.yaml
    • Added conditional Nixl peer/proxy host/port arguments to the router deployment.
    • Included new PD-related container ports (7100-7500) for Nixl communication.
  • helm/templates/deployment-vllm-multi.yaml
    • Modified kv-transfer-config for LMCacheConnectorV1 to include rpcPort and skip_last_n_tokens.
    • Added new environment variables for Python hashing and multiprocessing (PYTHONHASHSEED, VLLM_ENABLE_V1_MULTIPROCESSING, VLLM_WORKER_MULTIPROC_METHOD).
    • Removed LMCACHE_LOCAL_CPU and LMCACHE_MAX_LOCAL_CPU_SIZE from PD-specific configuration.
    • Added PD-related container ports (7100-7500) to vLLM deployments.
  • operator/api/v1alpha1/vllmrouter_types.go
    • Extended RoutingLogic enum to include kvaware, prefixaware, and disaggregated_prefill.
  • operator/api/v1alpha1/vllmruntime_types.go
    • Added EnablePDDisaggregation flag to control PD mode.
    • Introduced TopologySpec with Prefill and Decode NodeConfigs for PD deployments.
    • Added new fields to LMCacheConfig for PD-specific parameters, including KVRole, LocalCPU, Nixl settings, RPCPort, and SkipLastNTokens.
  • operator/api/v1alpha1/zz_generated.deepcopy.go
    • Added deepcopy methods for the newly introduced NodeConfig and TopologySpec types.
  • operator/config/crd/bases/production-stack.vllm.ai_vllmrouters.yaml
    • Updated the CRD schema for VLLMRouter to reflect the new RoutingLogic enum values.
  • operator/config/crd/bases/production-stack.vllm.ai_vllmruntimes.yaml
    • Updated the CRD schema for VLLMRuntime to include enablePDDisaggregation, topology, and the expanded lmCacheConfig fields.
  • operator/config/crd/bases/sample.yaml
    • Added a new sample CRD definition for LLMInference with a detailed PD topology configuration.
  • operator/config/default.yaml
    • Updated CRD schemas for VLLMRouter and VLLMRuntime to align with new PD features.
    • Changed the operator image and pull policy.
  • operator/config/default/kustomization.yaml
    • Modified kustomization to use manager_pull_policy_patch.yaml instead of manager_image_patch.yaml.
  • operator/config/default/manager_image_patch.yaml
    • Updated the operator image to xueey7/production-stack-operator:latest.
  • operator/config/default/manager_pull_policy_patch.yaml
    • Added a new patch file to set the manager's imagePullPolicy to IfNotPresent.
  • operator/config/manager/kustomization.yaml
    • Added image customization for the controller, specifying production-stack-operator with latest tag.
  • operator/config/samples/kustomization.yaml
    • Added new sample YAMLs for PD runtime (production-stack_v1alpha1_vllmruntime_pd.yaml) and router (production-stack_v1alpha1_vllmrouter_pd.yaml).
  • operator/config/samples/production-stack_v1alpha1_cacheserver.yaml
    • Updated the cache server image tag to latest-nightly.
  • operator/config/samples/production-stack_v1alpha1_vllmrouter_pd.yaml
    • Added a new sample VLLMRouter configuration specifically for PD disaggregation, including disaggregated_prefill routing logic and Nixl parameters.
  • operator/config/samples/production-stack_v1alpha1_vllmruntime_2p2d.yaml
    • Added a new sample VLLMRuntime configuration for a 2P2D disaggregated setup, detailing prefill and decode node configurations.
  • operator/config/samples/sample.yaml
    • Added a new sample LLMInference CRD definition with detailed model, topology, KV, routing, autoscaling, rollout, and observability configurations.
  • operator/internal/controller/cacheserver_controller.go
    • Updated the cache server command to use /opt/venv/bin/lmcache_server.
  • operator/internal/controller/vllmrouter_controller.go
    • Modified router deployment to use a new buildRouterContainerPorts function.
    • Added Nixl ports to the router service for disaggregated_prefill routing logic.
    • Introduced buildRouterContainerPorts function to dynamically add ports based on routing logic.
  • operator/internal/controller/vllmruntime_controller.go
    • Implemented reconcilePDDisaggregation logic to manage separate prefill and decode resources.
    • Added helper functions: reconcilePrefillResources, reconcileDecodeResources, serviceForNode, pvcForNode, deploymentForNode, updatePDStatus, buildVLLMArgsForNode, buildEnvironmentVariablesForNode, buildResourceRequirementsForNode, buildVolumesAndMountsForNode, and buildSidecarContainerForNode.
    • Modified serviceNeedsUpdate to skip updates for PD disaggregation mode.
    • Adjusted VLLM_USE_V1 environment variable logic for compatibility with LMCacheConnectorV1.
  • operator/internal/controller/vllmruntime_controller_test.go
    • Added new test contexts to cover PD disaggregation mode and legacy mode.
    • Included assertions for correct creation of prefill/decode deployments and services, and validation of LMCache environment variables in PD mode.
  • proposals/pd-disagg-crd-support.md
    • Added a new proposal document outlining the summary, motivation, goals, and implementation plan for PD disaggregation CRD support.
  • src/vllm_router/app.py
    • Imported DisaggregatedPrefillRouter and ZMQ task management functions (start_zmq_task, stop_zmq_task).
    • Set the event_loop early for service discovery.
    • Conditionally started and stopped the ZMQ task within the FastAPI lifespan manager based on whether DisaggregatedPrefillRouter is active.
  • src/vllm_router/parsers/parser.py
    • Added new command-line arguments for configuring Nixl peer and proxy host/port settings.
  • src/vllm_router/requirements.txt
    • Added httpx as a new dependency.
  • src/vllm_router/routers/main_router.py
    • Modified route_chat_completion and route_completion to delegate to route_disaggregated_prefill_request when DisaggregatedPrefillRouter is in use.
  • src/vllm_router/service_discovery.py
    • Updated client session initialization to use httpx.AsyncClient instead of aiohttp.ClientSession for prefill and decode endpoints.
    • Adjusted client session initialization logic to be called when the event loop is available.
  • src/vllm_router/services/request_service/request.py
    • Introduced ZMQ-related functions (zmq_pull_server, start_zmq_task, stop_zmq_task) for KV transfer notifications.
    • Implemented wait_decode_kv_ready with a timeout for KV transfer completion.
    • Refactored route_disaggregated_prefill_request to handle prompt tokenization, construct kv_transfer_params with dynamic receiver host, send requests to prefiller, and stream responses from the decode service, including format conversion for chat completions.
    • Added fallback decoding for ZMQ messages to handle ProxyNotif and NixlMsg compatibility.
  • tests/e2e/run-static-discovery-routing-test.sh
    • Added Nixl peer and proxy host/port arguments to the router startup command.
  • tutorials/25-disagg-prefill-crd-enabled.md
    • Added a new comprehensive tutorial document detailing how to deploy and manage the vLLM production stack using the CRD operator, focusing on basic and advanced (2P2D) disaggregated prefill deployments.
  • tutorials/assets/values-16-disagg-prefill.yaml
    • Updated model names, image tags, and Nixl configuration parameters for both prefill and decode nodes in the disaggregated prefill setup.
  • utils/get_helm.sh
    • Added a new utility script for fetching and installing Helm.
  • uv.lock
    • Updated Python package versions and hashes for aiofile, aiohttp, awscrt, cufile-python, hf-xet, huggingface-hub, lmcache, setuptools, setuptools-scm, tokenizers, and transformers.
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/functionality-helm-chart.yml
    • .github/workflows/router-e2e-test.yml

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant new functionality by adding first-class CRD support for Prefill-Decode (PD) disaggregation, which is a great step towards more declarative and Kubernetes-native management of vLLM deployments. The changes are extensive, touching Helm charts, the operator controller, CRD definitions, and the router application. The introduction of the TopologySpec in the VLLMRuntime CRD is a clean approach to managing the disaggregated setup, and the accompanying test updates are thorough.

My review has identified a few key areas for improvement. There is a critical issue regarding inconsistent HTTP client usage (aiohttp vs. httpx) in the router's service discovery, which will likely lead to runtime errors. Additionally, for production readiness, the operator image should be sourced from an official organization repository instead of a personal one. I've also noted a couple of potentially extraneous sample files that might need to be removed for clarity. Overall, this is a strong contribution, and addressing these points will further enhance its quality and robustness.

Comment thread src/vllm_router/service_discovery.py Outdated
Comment thread operator/config/default.yaml
Comment thread operator/config/samples/sample.yaml Outdated
@Vivo50E Vivo50E changed the title from [WIP][Feat] Support PD Disaggregation via CRD Operator to [Feat] Support PD Disaggregation via CRD Operator on Feb 19, 2026
@Vivo50E Vivo50E marked this pull request as ready for review February 19, 2026 22:19
Collaborator

@ruizhang0101 ruizhang0101 left a comment


Hi, Thanks for the contribution :))

I have several concerns about this PR:

  1. For CRD support, why are there Helm-related changes, such as the Helm GitHub workflows and Helm templates?
  2. Most of the code seems out of date; it effectively "reverts" some recent PRs. I would suggest a re-implementation based on the current version.

Comment thread docker/Dockerfile.pd Outdated
Comment thread docs/source/developer_guide/docker.rst
Comment thread src/vllm_router/routers/main_router.py Outdated
Comment thread src/vllm_router/routers/main_router.py Outdated
Comment thread src/vllm_router/services/request_service/request.py Outdated
Comment thread src/vllm_router/services/request_service/request.py Outdated
Comment thread src/vllm_router/services/request_service/request.py Outdated
Comment thread src/vllm_router/services/request_service/request.py Outdated
Comment thread src/vllm_router/services/request_service/request.py
Comment thread src/vllm_router/service_discovery.py Outdated
@Vivo50E
Author

Vivo50E commented Feb 21, 2026

Hi, Thanks for the contribution :))

I have several concerns about this PR:

  1. For CRD support, why are there Helm-related changes, such as the Helm GitHub workflows and Helm templates?
  2. Most of the code seems out of date; it effectively "reverts" some recent PRs. I would suggest a re-implementation based on the current version.

@ruizhang0101, thanks for reviewing. I drafted this PR a long time ago and haven't had much time to revisit it since then. I'll rebase my changes onto the latest version and address your comments :>

Collaborator

@ruizhang0101 ruizhang0101 left a comment


Hi, thanks for the prompt update and all the contribution! This version looks much better. I left some comments; please let me know if there are any questions.

Also, one very IMPORTANT thing for this PR to be merged: could you show that it works as expected when following the tutorial? It also should not break the original functionality.

Comment thread operator/config/default/manager_pull_policy_patch.yaml Outdated
Comment thread operator/config/samples/production-stack_v1alpha1_vllmrouter_pd.yaml Outdated
Comment thread operator/config/samples/production-stack_v1alpha1_vllmrouter_1p1d.yaml Outdated
Comment thread operator/config/samples/production-stack_v1alpha1_vllmruntime_2p2d.yaml Outdated
Comment thread operator/config/samples/production-stack_v1alpha1_vllmruntime_2p2d.yaml Outdated
Comment thread operator/internal/controller/vllmruntime_controller.go Outdated
Comment thread src/vllm_router/parsers/parser.py Outdated
Comment thread src/vllm_router/services/request_service/request.py
Comment thread src/vllm_router/services/request_service/request.py Outdated
Comment thread src/vllm_router/services/request_service/request.py
@Vivo50E Vivo50E force-pushed the pd_crd branch 3 times, most recently from 9ec9991 to e6caf27 on March 11, 2026 23:47
@Vivo50E Vivo50E requested a review from ruizhang0101 March 12, 2026 04:39
@ruizhang0101
Collaborator

Hi, Could you fix the CI issue? Lemme know if you have any questions :)))

Vivo50E added 8 commits April 5, 2026 19:57
Signed-off-by: Yiqi Xue <xuey666@gmail.com>
…t and improved error handling

Signed-off-by: Yiqi Xue <xuey666@gmail.com>
…ency

Signed-off-by: Yiqi Xue <xuey666@gmail.com>
…hunk processing

Signed-off-by: Yiqi Xue <xuey666@gmail.com>
Signed-off-by: Yiqi Xue <xuey666@gmail.com>
Signed-off-by: Yiqi Xue <xuey666@gmail.com>
Signed-off-by: Yiqi Xue <xuey666@gmail.com>
Signed-off-by: Yiqi Xue <xuey666@gmail.com>
Made-with: Cursor
@Vivo50E
Author

Vivo50E commented Apr 5, 2026

Hi @ruizhang0101, I pushed a fix to ensure the NIXL path is only activated when nixl_proxy_host is configured. I've run the CI tests locally (static-discovery, k8s-discovery, helm chart Two-Pods-Minimal-Example) and they all pass. Also verified CRD-based 2P2D disaggregated prefill works end-to-end locally with the router correctly splitting traffic to prefill/decode nodes and ZMQ proxy running on the configured port.
Could you please approve and trigger the CI workflow? Thanks!

@Vivo50E
Author

Vivo50E commented Apr 6, 2026

Hi @ruizhang0101, I pushed a fix to ensure the NIXL path is only activated when nixl_proxy_host is configured. I've run the CI tests locally (static-discovery, k8s-discovery, helm chart Two-Pods-Minimal-Example) and they all pass. Also verified CRD-based 2P2D disaggregated prefill works end-to-end locally with the router correctly splitting traffic to prefill/decode nodes and ZMQ proxy running on the configured port. Could you please approve and trigger the CI workflow? Thanks!

@ruizhang0101 The CI failures (Secure-Minimal-Example, CRD-Validation, k8s-discovery-e2e-test) all show minikube host: Stopped at the very first step — this appears to be a runner environment issue rather than a code problem. Could you re-trigger the CI? Thanks

@Vivo50E
Author

Vivo50E commented Apr 9, 2026

@ruizhang0101 The CI checks have all passed now. Could you please take another look and help move this PR forward when you have a moment? Thanks a lot!

Collaborator

@ruizhang0101 ruizhang0101 left a comment


I think it is not a good idea to make this breaking change to the spec schema for PD. It introduces a lot of confusion around the existing spec, and it duplicates much of the reconcile logic. The Helm chart solution for PD assigns a "role" to each model runtime and routes based on that role; I would suggest the same pattern here, using a different CR to represent prefill/decode.

Also, this PR is too large; I would suggest separate PRs to address this:
PR1: PD routing logic
PR2: CRD deployment for PD

Aside from that, I have attached some comments. Please take a look when you have time.

Comment thread operator/config/default.yaml

// Legacy fields (used when EnablePDDisaggregation=false)
// Model configuration
Model ModelSpec `json:"model,omitempty"`
Collaborator

For model, vllmconfig, and deployment config: they were required before, so could you also make them required here?

logger.warning(f"ZMQ: unknown message format: {msg_dict}")
continue
req_id = msg.req_id
self._finished_reqs.add(req_id)
Collaborator

It would be better to put a TTL on this set: the number of requests can be huge, and growing it unboundedly will eventually OOM the router pod.
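A TTL-bounded container along the lines the reviewer suggests could look like the sketch below. Nothing here is from the PR: the `TTLSet` name, the TTL and size parameters, and the eviction policy are all illustrative assumptions.

```python
import time
from collections import OrderedDict

class TTLSet:
    """Illustrative sketch of an expiring, size-capped set of req IDs,
    so the router's finished-request tracking cannot grow without bound."""

    def __init__(self, ttl_seconds=60.0, max_size=100_000):
        self._ttl = ttl_seconds
        self._max = max_size
        self._items = OrderedDict()  # req_id -> expiry timestamp

    def add(self, req_id):
        self._evict()
        self._items[req_id] = time.monotonic() + self._ttl
        self._items.move_to_end(req_id)
        # Hard cap so a notification flood cannot OOM the router pod.
        while len(self._items) > self._max:
            self._items.popitem(last=False)

    def __contains__(self, req_id):
        self._evict()
        return req_id in self._items

    def _evict(self):
        now = time.monotonic()
        # Insertion order plus a constant TTL means expiries are
        # non-decreasing, so we can stop at the first unexpired entry.
        while self._items:
            _, expiry = next(iter(self._items.items()))
            if expiry > now:
                break
            self._items.popitem(last=False)

s = TTLSet(ttl_seconds=0.05, max_size=2)
s.add("a"); s.add("b"); s.add("c")
print("a" in s, "b" in s, "c" in s)  # False True True (size cap evicted "a")
```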

}

// Check if PD disaggregation is enabled
log.Info("Checking PD disaggregation flag", "Name", vllmRuntime.Name, "EnablePDDisaggregation", vllmRuntime.Spec.EnablePDDisaggregation)
Collaborator

I think these PD logs should be debug level instead of info

Comment thread src/vllm_router/app.py
app.state.request_stats_monitor = get_request_stats_monitor()
app.state.router = get_routing_logic()
app.state.request_rewriter = get_request_rewriter()
app.state.args = args
Collaborator

It is not recommended to stash all the args; consider using a dataclass or named fields.


if self.event_loop_ready.is_set() and self.event_loop is not None:
try:
# Track all models we've ever seen
Collaborator

Please revert this since it is not relevant to this PR

Initialize aiohttp client sessions for prefill and decode endpoints.
This must be called from an async context during app startup.
"""
logger.info(
Collaborator

These should not be info logs

@Vivo50E
Author

Vivo50E commented Apr 13, 2026

I think it is not a good idea to make this breaking change to the spec schema for PD. It introduces a lot of confusion around the existing spec, and it duplicates much of the reconcile logic. The Helm chart solution for PD assigns a "role" to each model runtime and routes based on that role; I would suggest the same pattern here, using a different CR to represent prefill/decode.

Also, this PR is too large; I would suggest separate PRs to address this: PR1: PD routing logic, PR2: CRD deployment for PD.

Aside from that, I have attached some comments. Please take a look when you have time.

Hi @ruizhang0101, thanks for the feedback!
Agreed on both points. I've split the work as suggested:
PR1 (#913): PD routing logic only — NIXL-based disaggregated_prefill path + ZMQ proxy. No operator/CRD changes. All E2E routing tests pass locally.
PR2 (follow-up): CRD deployment for PD, redesigned using separate CRs for prefill/decode nodes per your suggestion.
I've also addressed the inline review comments. Please take a look when you have time!
