[Feat] Add NIXL-based disaggregated prefill routing support #913
Open
Vivo50E wants to merge 1 commit into vllm-project:main
Code Review
This pull request introduces support for NIXL-based disaggregated prefill routing, incorporating a ZMQ proxy for KV transfer notifications and new CLI configuration options. The review identifies several critical issues: a logic error in session management that breaks load balancing and risks connection errors, a memory leak in the ZMQ proxy due to unbounded storage of request IDs, and an inefficient busy-wait loop. Additionally, there is a redundant event loop assignment and a PEP 8 violation regarding inline imports.
Signed-off-by: Yiqi Xue <xuey666@gmail.com>
Author
Hi @ruizhang0101, just a gentle nudge on this PR review.
Summary
This PR implements the router-side PD routing logic for disaggregated prefill using NIXL for
point-to-point GPU KV cache transfer. This is split from #841 per maintainer feedback to keep
routing logic and CRD deployment in separate PRs.
Motivation
The existing `disaggregated_prefill` routing logic relied on LMCache shared storage for KV transfer. This PR adds a NIXL-based path that works with direct peer-to-peer KV transfer, enabling lower-latency PD disaggregation without shared storage infrastructure.
Key Changes
New NIXL routing path (`request.py`)
- `route_disaggregated_prefill_nixl_request()`: handles the full NIXL prefill/decode flow. It is selected automatically when `nixl_proxy_host` is configured, preserving backward compatibility with the existing shared-storage path.
- `_prepare_nixl_prefill_request()`: tokenizes the prompt (required by NIXL), constructs `disagg_spec` with decode node IP/ports, and injects `kv_transfer_params`.
- `_convert_completion_chunk_to_chat()` / `_clean_completion_chunk()`: format conversion helpers for `/v1/completions` → `chat.completion.chunk` SSE.

ZMQ proxy (`zmq_proxy.py`, new file)
- `ZmqProxy` class: ZMQ PULL server that receives KV transfer completion notifications from prefill nodes.
- `asyncio.Event` suspends decode coroutines until KV is ready, which avoids busy-wait polling.
- `_cleanup_loop` evicts stale entries to prevent OOM.
- `finished_req_ttl` and `cleanup_interval` are configurable via constructor and CLI args (`--nixl-finished-req-ttl`, `--nixl-cleanup-interval`).
- `NixlMsg` and `ProxyNotif` (LMCache 0.3.13+ compatibility) are both handled via fallback dict decoding.
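The event-based wait and TTL eviction described for the ZMQ proxy can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the PR's `ZmqProxy`: the class `KvReadyRegistry` and its method names are hypothetical, the ZMQ PULL socket is left out, and `mark_finished` stands in for whatever the receive loop would call when a prefill node reports completion.

```python
import asyncio
import time
from typing import Dict, Optional


class KvReadyRegistry:
    """Hypothetical helper: tracks request IDs whose KV transfer has finished."""

    def __init__(self, finished_req_ttl: float = 120.0) -> None:
        # The default TTL is an assumed value, not taken from the PR.
        self.finished_req_ttl = finished_req_ttl
        self._events: Dict[str, asyncio.Event] = {}
        self._finished_at: Dict[str, float] = {}

    def _event_for(self, req_id: str) -> asyncio.Event:
        return self._events.setdefault(req_id, asyncio.Event())

    def mark_finished(self, req_id: str) -> None:
        """Record a completion notification and wake any waiting coroutine."""
        self._finished_at[req_id] = time.monotonic()
        self._event_for(req_id).set()

    async def wait_until_ready(self, req_id: str, timeout: float) -> bool:
        """Suspend until the notification arrives; return False on timeout."""
        try:
            await asyncio.wait_for(self._event_for(req_id).wait(), timeout)
            return True
        except asyncio.TimeoutError:
            # Drop the unused event so abandoned waits do not accumulate.
            self._events.pop(req_id, None)
            return False

    def evict_stale(self, now: Optional[float] = None) -> int:
        """Remove finished entries older than the TTL; return how many were removed."""
        now = time.monotonic() if now is None else now
        stale = [
            req_id
            for req_id, finished in self._finished_at.items()
            if now - finished > self.finished_req_ttl
        ]
        for req_id in stale:
            del self._finished_at[req_id]
            self._events.pop(req_id, None)
        return len(stale)
```

In a real proxy, a background task would presumably call something like `evict_stale()` every `cleanup_interval` seconds, and the decode path would await `wait_until_ready()` before forwarding the request to the decode node.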
App lifecycle (`app.py`)
- Takes effect only when `DisaggregatedPrefillRouter` is active and `nixl_proxy_host` is configured.
- `NixlConfig` dataclass is stored on `app.state.nixl_config` instead of the raw argparse namespace.
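One way to picture the typed-config idea is a frozen dataclass built once from the parsed CLI namespace. The sketch below is an assumption-laden illustration: the field names, the expanded flag spellings (for example `--nixl-peer-init-port`), and the default values are all hypothetical and may differ from the PR's actual `NixlConfig` and parser.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NixlConfig:
    """Hypothetical field layout; the PR's dataclass may differ."""

    peer_host: Optional[str]
    peer_init_port: Optional[int]
    peer_alloc_port: Optional[int]
    proxy_host: Optional[str]
    proxy_port: Optional[int]
    finished_req_ttl: float
    cleanup_interval: float

    @property
    def enabled(self) -> bool:
        # The NIXL path is opt-in: it is active only when a proxy host is given.
        return self.proxy_host is not None


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Flag spellings and defaults below are assumptions for illustration.
    parser.add_argument("--nixl-peer-host", default=None)
    parser.add_argument("--nixl-peer-init-port", type=int, default=None)
    parser.add_argument("--nixl-peer-alloc-port", type=int, default=None)
    parser.add_argument("--nixl-proxy-host", default=None)
    parser.add_argument("--nixl-proxy-port", type=int, default=None)
    parser.add_argument("--nixl-finished-req-ttl", type=float, default=120.0)
    parser.add_argument("--nixl-cleanup-interval", type=float, default=60.0)
    return parser


def nixl_config_from_args(args: argparse.Namespace) -> NixlConfig:
    """Copy only the NIXL-related values out of the namespace."""
    return NixlConfig(
        peer_host=args.nixl_peer_host,
        peer_init_port=args.nixl_peer_init_port,
        peer_alloc_port=args.nixl_peer_alloc_port,
        proxy_host=args.nixl_proxy_host,
        proxy_port=args.nixl_proxy_port,
        finished_req_ttl=args.nixl_finished_req_ttl,
        cleanup_interval=args.nixl_cleanup_interval,
    )
```

Storing such an object on application state keeps request handlers from depending on the full argparse namespace and makes the set of NIXL settings explicit.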
CLI args (`parser.py`)
- `--nixl-peer-host/init-port/alloc-port`: decode node NIXL endpoint
- `--nixl-proxy-host/port`: ZMQ proxy bind address
- `--nixl-finished-req-ttl` / `--nixl-cleanup-interval`: tunable TTL for KV-ready entries

Service discovery (`service_discovery.py`)
- `initialize_client_sessions()`: fixed to skip re-initialization if a session is already set. This prevents a `RuntimeError` from closing an active `ClientSession` when new engines are discovered via `_add_engine`. Note: the current implementation supports 1P1D only; xPyD (multiple prefill/decode nodes) is left for a follow-up PR.
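The skip-if-already-set guard can be illustrated with a small stand-alone sketch. `StubSession` and `PdClients` are hypothetical names used only here; `StubSession` stands in for an HTTP client session so the example runs without third-party packages, and the real `initialize_client_sessions()` likely has a different signature.

```python
from typing import Optional


class StubSession:
    """Stand-in for an HTTP client session (e.g. aiohttp.ClientSession)."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url
        self.closed = False

    def close(self) -> None:
        self.closed = True


class PdClients:
    """Hypothetical holder for the single prefill and decode sessions (1P1D)."""

    def __init__(self) -> None:
        self.prefill_client: Optional[StubSession] = None
        self.decode_client: Optional[StubSession] = None

    def initialize_client_sessions(self, prefill_url: str, decode_url: str) -> None:
        # Guard: keep any session that is already set instead of closing and
        # recreating it, so in-flight requests are not cut off when a new
        # engine is discovered later.
        if self.prefill_client is None:
            self.prefill_client = StubSession(prefill_url)
        if self.decode_client is None:
            self.decode_client = StubSession(decode_url)
```

A side effect of this guard, consistent with the 1P1D note above, is that engines discovered after the first initialization are not used until multi-node support is added.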
Dependencies (`pyproject.toml`)
- `pyzmq>=27.0.0` and `msgspec>=0.19.0` added to core dependencies

Architecture (1P1D)
Test plan
- Routing logics: `roundrobin`, `prefixaware`, `kvaware`, `disaggregated_prefill`, `session`
- Image `lmcache/vllm-openai:latest` (vLLM 0.15.0, LMCache 0.3.13, NIXL 0.9.0), model `meta-llama/Llama-3.2-3B-Instruct`: the router correctly routes prefill to prefill nodes and decode to decode nodes, and the ZMQ proxy receives KV-ready signals on the configured port
Backward Compatibility
The NIXL path is opt-in: it is only activated when `--nixl-proxy-host` is provided. Without this flag, the router falls back to the original `route_disaggregated_prefill_request` (LMCache shared-storage mode), so existing Helm-based PD deployments are unaffected.
Known Limitations
- 1P1D only: the current implementation uses a single `prefill_client` and a single `decode_client`, so only one prefill node and one decode node are used even if more are available. xPyD support (e.g., 2P2D load balancing across multiple prefill/decode nodes) is left for a follow-up PR.
Relation to #841 / #669
Router-side NIXL logic builds on #669 (`[Feat][PD] latest PD support from LMCache with NIXL`), rebased and refactored onto current main. CRD operator changes from #841 will be submitted as a follow-up PR.
BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE
- Use `-s` when doing `git commit`.
- Prefix the PR title with its type, such as `[Bugfix]`, `[Feat]`, and `[CI]`.

Detailed Checklist (Click to Expand)
Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Feat]` for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
- `[Router]` for changes to the `vllm_router` (e.g., routing algorithm, router observability, etc.).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Use `pre-commit` to format your code. See `README.md` for installation.

DCO and Signed-off-by
When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO.
Using `-s` with `git commit` will automatically add this header.

What to Expect for the Reviews
We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng or ApostaC.