[Feat] Add NIXL-based disaggregated prefill routing support #913

Open
Vivo50E wants to merge 1 commit into vllm-project:main from Vivo50E:pd_routing

Vivo50E commented Apr 12, 2026

Summary

This PR implements the router-side PD routing logic for disaggregated prefill using NIXL for
point-to-point GPU KV cache transfer. This is split from #841 per maintainer feedback to keep
routing logic and CRD deployment in separate PRs.

Motivation

The existing disaggregated_prefill routing logic relied on LMCache shared storage for KV
transfer. This PR adds a NIXL-based path that works with direct peer-to-peer KV transfer,
enabling lower-latency PD disaggregation without shared storage infrastructure.

Key Changes

New NIXL routing path (request.py)

  • route_disaggregated_prefill_nixl_request() — handles the full NIXL prefill/decode flow,
    automatically selected when nixl_proxy_host is configured, preserving backward compatibility
    with the existing shared-storage path
  • _prepare_nixl_prefill_request() — tokenizes the prompt (required by NIXL), constructs
    disagg_spec with decode node IP/ports, and injects kv_transfer_params
  • _convert_completion_chunk_to_chat() / _clean_completion_chunk() — format conversion
    helpers from /v1/completions chunks to chat.completion.chunk SSE
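
The chunk conversion in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the names `convert_completion_chunk_to_chat` and `to_sse` are hypothetical, and the field mapping assumes the usual OpenAI-style streaming shapes (generated text in `choices[].text` for completions, in `choices[].delta.content` for chat chunks).

```python
import json
import time
from typing import Any, Dict


def convert_completion_chunk_to_chat(chunk: Dict[str, Any]) -> Dict[str, Any]:
    """Map a /v1/completions streaming chunk onto the chat.completion.chunk shape."""
    choices = []
    for choice in chunk.get("choices", []):
        text = choice.get("text", "")
        choices.append(
            {
                "index": choice.get("index", 0),
                # Chat chunks carry generated text in delta.content, not in text.
                "delta": {"content": text} if text else {},
                "finish_reason": choice.get("finish_reason"),
            }
        )
    return {
        "id": chunk.get("id", ""),
        "object": "chat.completion.chunk",
        "created": chunk.get("created", int(time.time())),
        "model": chunk.get("model", ""),
        "choices": choices,
    }


def to_sse(payload: Dict[str, Any]) -> str:
    """Serialize one payload as a single server-sent event."""
    return f"data: {json.dumps(payload)}\n\n"


# Illustrative input only; real chunks come from the decode node.
example = {
    "id": "cmpl-1",
    "object": "text_completion",
    "created": 0,
    "model": "demo",
    "choices": [{"index": 0, "text": "Hi", "finish_reason": None}],
}
converted = convert_completion_chunk_to_chat(example)
sse_line = to_sse(converted)
```

The sketch leaves out whatever `_clean_completion_chunk()` strips, since the description does not say which fields it removes.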

ZMQ proxy (zmq_proxy.py, new file)

  • ZmqProxy class: ZMQ PULL server that receives KV transfer completion notifications from
    prefill nodes
  • Uses asyncio.Event to suspend decode coroutines until KV is ready — avoids busy-wait polling
  • TTL-based cleanup (_cleanup_loop) to evict stale entries and prevent OOM
  • finished_req_ttl and cleanup_interval are configurable via constructor and CLI args
    (--nixl-finished-req-ttl, --nixl-cleanup-interval)
  • Handles both NixlMsg and ProxyNotif (LMCache 0.3.13+ compatibility) via fallback dict
    decoding
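
The event-based wait and TTL eviction described above might look roughly like this. It is a sketch under assumptions, not the contents of zmq_proxy.py: `KvReadyRegistry` and its methods are hypothetical names, the ZMQ PULL socket is replaced by a timer that simulates a notification, and only `finished_req_ttl` is taken from the PR text.

```python
import asyncio
import time
from typing import Dict, Optional, Tuple


class KvReadyRegistry:
    """Hypothetical stand-in for the bookkeeping a ZmqProxy would need."""

    def __init__(self, finished_req_ttl: float = 120.0) -> None:
        self.finished_req_ttl = finished_req_ttl
        self._events: Dict[str, asyncio.Event] = {}
        self._finished_at: Dict[str, float] = {}

    def _event_for(self, req_id: str) -> asyncio.Event:
        return self._events.setdefault(req_id, asyncio.Event())

    def mark_finished(self, req_id: str) -> None:
        """Record that a prefill node reported KV transfer completion."""
        self._finished_at[req_id] = time.monotonic()
        self._event_for(req_id).set()

    async def wait_until_ready(self, req_id: str, timeout: float) -> bool:
        """Suspend the caller until the notification arrives; no polling loop."""
        try:
            await asyncio.wait_for(self._event_for(req_id).wait(), timeout)
            return True
        except asyncio.TimeoutError:
            # Drop the unused event so abandoned waits do not accumulate.
            self._events.pop(req_id, None)
            return False

    def evict_stale(self, now: Optional[float] = None) -> int:
        """Remove finished entries older than the TTL; returns how many were removed."""
        now = time.monotonic() if now is None else now
        stale = [r for r, t in self._finished_at.items()
                 if now - t > self.finished_req_ttl]
        for req_id in stale:
            self._finished_at.pop(req_id, None)
            self._events.pop(req_id, None)
        return len(stale)


async def demo() -> Tuple[bool, bool, int]:
    registry = KvReadyRegistry(finished_req_ttl=0.0)
    loop = asyncio.get_running_loop()
    # Simulate the prefill node's notification arriving 10 ms later.
    loop.call_later(0.01, registry.mark_finished, "req-1")
    ready = await registry.wait_until_ready("req-1", timeout=1.0)
    missed = await registry.wait_until_ready("req-2", timeout=0.01)
    evicted = registry.evict_stale(now=time.monotonic() + 1.0)
    return ready, missed, evicted


result = asyncio.run(demo())
```

A periodic task that calls `evict_stale()` every `cleanup_interval` seconds would play the role of `_cleanup_loop` in this sketch.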

App lifecycle (app.py)

  • ZMQ proxy is started/stopped within FastAPI lifespan, only when DisaggregatedPrefillRouter
    is active and nixl_proxy_host is configured
  • NIXL config exposed via NixlConfig dataclass on app.state.nixl_config instead of raw
    argparse namespace
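
A rough sketch of the conditional start/stop described above, using only the standard library so it runs without FastAPI. Everything here is an assumption for illustration: the `NixlConfig` field names, the `router_is_disagg` flag, and `FakeProxy` are hypothetical, and `SimpleNamespace` stands in for the FastAPI app object. The only shape borrowed from FastAPI is that a lifespan is an async context manager receiving the app.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional, Tuple


@dataclass(frozen=True)
class NixlConfig:
    # Field names are assumptions that mirror the CLI flags in this PR.
    proxy_host: Optional[str] = None
    proxy_port: int = 7500
    finished_req_ttl: float = 120.0
    cleanup_interval: float = 60.0


class FakeProxy:
    """Stand-in for ZmqProxy; only tracks whether it is running."""

    def __init__(self) -> None:
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def stop(self) -> None:
        self.running = False


@asynccontextmanager
async def lifespan(app):
    cfg = app.state.nixl_config
    proxy = None
    # Start the proxy only for the disaggregated router with NIXL configured.
    if app.state.router_is_disagg and cfg.proxy_host:
        proxy = FakeProxy()
        await proxy.start()
    app.state.zmq_proxy = proxy
    try:
        yield
    finally:
        if proxy is not None:
            await proxy.stop()


async def run_once(cfg: NixlConfig) -> Tuple[bool, bool]:
    state = SimpleNamespace(nixl_config=cfg, router_is_disagg=True)
    app = SimpleNamespace(state=state)
    async with lifespan(app):
        started = app.state.zmq_proxy is not None and app.state.zmq_proxy.running
    stopped = app.state.zmq_proxy is None or not app.state.zmq_proxy.running
    return started, stopped


with_nixl = asyncio.run(run_once(NixlConfig(proxy_host="0.0.0.0")))
without_nixl = asyncio.run(run_once(NixlConfig()))
```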

CLI args (parser.py)

  • --nixl-peer-host/init-port/alloc-port — decode node NIXL endpoint
  • --nixl-proxy-host/port — ZMQ proxy bind address
  • --nixl-finished-req-ttl / --nixl-cleanup-interval — tunable TTL for KV-ready entries
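
For orientation, the flags above could be declared along these lines. This is a sketch, not parser.py: the expansion of the shorthand into `--nixl-peer-init-port`, `--nixl-peer-alloc-port`, and `--nixl-proxy-port` is an assumption, as are all types and default values (7500 is borrowed from the architecture diagram).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="router-sketch")
    # Decode node NIXL endpoint (flag spellings assumed from the shorthand).
    parser.add_argument("--nixl-peer-host", default=None)
    parser.add_argument("--nixl-peer-init-port", type=int, default=None)
    parser.add_argument("--nixl-peer-alloc-port", type=int, default=None)
    # ZMQ proxy bind address; default port is an assumption.
    parser.add_argument("--nixl-proxy-host", default=None)
    parser.add_argument("--nixl-proxy-port", type=int, default=7500)
    # TTL tuning for KV-ready entries; default values are assumptions.
    parser.add_argument("--nixl-finished-req-ttl", type=float, default=120.0)
    parser.add_argument("--nixl-cleanup-interval", type=float, default=60.0)
    return parser


args = build_parser().parse_args(
    ["--nixl-proxy-host", "0.0.0.0", "--nixl-finished-req-ttl", "30"]
)
# The NIXL path is opt-in: it is active only when a proxy host is given.
nixl_enabled = args.nixl_proxy_host is not None
```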

Service discovery (service_discovery.py)

  • initialize_client_sessions(): fixed to skip re-initialization if session already set —
    prevents RuntimeError from closing an active ClientSession when new engines are discovered
    via _add_engine. Note: current implementation supports 1P1D only; xPyD (multiple
    prefill/decode nodes) is left for a follow-up PR.
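
The skip-if-already-set guard can be illustrated as below. This is a simplified sketch and not the code in service_discovery.py: `EngineClients` and `FakeSession` are hypothetical, and `FakeSession` stands in for an aiohttp `ClientSession` so the example runs without network dependencies. The attribute names `prefill_client` and `decode_client` are taken from the Known Limitations section.

```python
from typing import Callable, Optional


class FakeSession:
    """Stand-in for an HTTP client session; only tracks whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class EngineClients:
    def __init__(self, session_factory: Callable[[], FakeSession] = FakeSession) -> None:
        self._session_factory = session_factory
        self.prefill_client: Optional[FakeSession] = None
        self.decode_client: Optional[FakeSession] = None

    def initialize_client_sessions(self) -> bool:
        """Create sessions once; later calls leave live sessions untouched."""
        if self.prefill_client is not None and self.decode_client is not None:
            # Re-discovery must not close or replace a session that is in use.
            return False
        if self.prefill_client is None:
            self.prefill_client = self._session_factory()
        if self.decode_client is None:
            self.decode_client = self._session_factory()
        return True


clients = EngineClients()
first_call = clients.initialize_client_sessions()
original = clients.prefill_client
second_call = clients.initialize_client_sessions()  # e.g. triggered by a new engine
same_session = clients.prefill_client is original
```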

Dependencies (pyproject.toml)

  • Added pyzmq>=27.0.0 and msgspec>=0.19.0 to core dependencies

Architecture (1P1D)

Client
  │ ① request
  ▼
Router Pod
  ├─② prefill request──► Prefill Pod (vLLM + LMCache + NIXL)
  │                            │ ③ NIXL KV transfer ──► Decode Pod
  │                            └─④ ZMQ notify ────────► ZMQ PULL :7500
  └─⑤ decode request──► Decode Pod (vLLM + LMCache + NIXL)
                              └─⑥ streaming tokens ──► Client
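
The ordering in the diagram can be traced with a small simulation. All functions here are hypothetical stubs: no HTTP, NIXL, or ZMQ is involved, and whether the real router waits for the notification before or while issuing the decode request is an assumption of this sketch.

```python
import asyncio
from typing import AsyncIterator, List


async def send_prefill(kv_ready: asyncio.Event, log: List[str]) -> None:
    log.append("2:prefill")
    await asyncio.sleep(0.01)  # stands in for prefill compute and step 3 (KV transfer)
    log.append("4:notify")
    kv_ready.set()             # stands in for the ZMQ notification reaching the router


async def stream_decode(log: List[str]) -> AsyncIterator[str]:
    log.append("5:decode")
    for token in ("Hello", " world"):
        yield token            # step 6: tokens streamed back to the client


async def route(log: List[str]) -> List[str]:
    kv_ready = asyncio.Event()
    prefill_task = asyncio.create_task(send_prefill(kv_ready, log))
    # The decode request is held back until the KV cache is reported ready.
    await kv_ready.wait()
    tokens = [token async for token in stream_decode(log)]
    await prefill_task
    return tokens


log: List[str] = []
tokens = asyncio.run(route(log))
```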

Test plan

  • All static-discovery E2E routing tests pass locally (roundrobin, prefixaware, kvaware,
    disaggregated_prefill, session)
  • NIXL path verified end-to-end on Minikube (4× A16) with lmcache/vllm-openai:latest
    (vLLM 0.15.0, LMCache 0.3.13, NIXL 0.9.0), model meta-llama/Llama-3.2-3B-Instruct:
    the router correctly routes prefill to prefill nodes and decode to decode nodes, and the
    ZMQ proxy receives KV-ready signals on the configured port

Backward Compatibility

The NIXL path is opt-in: it is only activated when --nixl-proxy-host is provided. Without
this flag, the router falls back to the original route_disaggregated_prefill_request (LMCache
shared-storage mode), so existing Helm-based PD deployments are unaffected.

Known Limitations

  • 1P1D only: the current implementation initializes a single prefill_client and a single
    decode_client, so only one prefill node and one decode node are used even if more are
    available. xPyD support (e.g., 2P2D load balancing across multiple prefill/decode nodes) is
    left for a follow-up PR.

Relation to #841 / #669

Router-side NIXL logic builds on #669 ([Feat][PD] latest PD support from LMCache with NIXL),
rebased and refactored onto current main. CRD operator changes from #841 will be submitted as a
follow-up PR.

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign off your commit by using -s with git commit.
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].

Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well documented so that future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng or ApostaC.

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces support for NIXL-based disaggregated prefill routing, incorporating a ZMQ proxy for KV transfer notifications and new CLI configuration options. The review identifies several critical issues: a logic error in session management that breaks load balancing and risks connection errors, a memory leak in the ZMQ proxy due to unbounded storage of request IDs, and an inefficient busy-wait loop. Additionally, there is a redundant event loop assignment and a PEP 8 violation regarding inline imports.

Comment thread src/vllm_router/service_discovery.py Outdated
Comment thread src/vllm_router/services/request_service/zmq_proxy.py Outdated
Comment thread src/vllm_router/services/request_service/zmq_proxy.py Outdated
Comment thread src/vllm_router/app.py Outdated
Comment thread src/vllm_router/services/request_service/request.py Outdated
Vivo50E force-pushed the pd_routing branch 2 times, most recently from 9122eb6 to dee4d75 on April 12, 2026 05:44
Vivo50E marked this pull request as ready for review on April 12, 2026 19:03
Vivo50E force-pushed the pd_routing branch 2 times, most recently from 72fb80b to a6ab626 on April 12, 2026 19:38
Signed-off-by: Yiqi Xue <xuey666@gmail.com>

Vivo50E commented Apr 23, 2026

Hi @ruizhang0101, just a gentle nudge on this PR review.
