Steering + capture on the v2 model runner #185
…to the v2 model runner
… fallback guard); fix global-steering row for untracked requests
Also validated cross-node on 2×RTX 3090 (Ray, NCCL over bond0): steering passes under both TP=2 and PP=2; global …
…le-node, TP/PP, preemption, streaming)
Full validation matrix (GPU, 2×RTX 3090)
Extended the validation well beyond the initial pass. All on the v2 runner (…).
Steering
Capture
Distributed (Ray, NCCL/bond0)
Hard edges
Coverage: Qwen3-0.6B and gemma-3-4b-it (v2 supports gemma-3; the port is not Qwen3-specific), plus CPU glue tests for the v2-specific projection logic.
Out of scope / not exercised: spec-decode, DP, async-dispatch overload policies, and the store serve path (dormant by design, …).
…preemption resume
Follow-up fix: capture registration on re-add paths
Discovered while auditing whether the "untested" items actually touch the port's diff. Two paths re-enter …
The runtime change threads …
GPU-validated on Qwen3-0.6B (RTX 3090, v2 runner), clean before/after:
CPU glue tests added for the three branches (fresh / streaming re-add / preemption resume) plus the non-capturer rank; the full v2 glue suite passes (19 tests).
Also confirmed the async-dispatch overload policies (…
Commit: 40133ec
What
Ports the activation steering and capture control planes to the experimental v2 GPU model runner (`vllm/v1/worker/gpu/model_runner.py`), so they work there the same way they do on the v1 runner.
Why
The steering/capture data plane (the `apply_steering` / `capture_residual` custom ops, per-layer buffers, kernels, `SteeringManager` / `CaptureManager` / `CaptureStepGate` / `ActivationStore`) already lives in `model_executor/` + `v1/capture/` and is shared by both runners. Both runners load the same model, so the in-forward hooks already fire on v2 and safely no-op when nothing drives them. Only the runner-side control plane was v1-only. Without it, turning on the v2 runner with steering/capture configured silently did nothing.
How
Two v2-native modules wire the existing subsystems into v2's lifecycle (`add_requests` / `finish_requests` / `execute_model` / `sample_tokens`); the v1 runner is untouched.
- `gpu/capture_runner_mixin.py` (`CaptureRunnerMixin`): manager/gate/store init, request register/finalize, the per-step force-eager decision (client-spec captures only; global specs ride the cudagraph-safe persistent-buffer path), gather-plan build, and draining results onto `ModelRunnerOutput.capture_results`.
- `gpu/steering_runner_mixin.py` (`SteeringRunnerMixin`): subclasses the v1 `SteeringModelRunnerMixin`, reusing init / layer discovery / validation / the public RPC API / `_resolve_request_steering` unchanged, and overrides only the three methods that touched v1-runner state. It keeps its own per-request state (v2 retains no `CachedRequestState`), drives prefill→decode transitions and the per-token steering index, and needs no force-eager (persistent buffers are cudagraph-safe).
`gpu_worker.py` already forwards the steering RPCs.
Design notes: `docs/design/v2_runner_steering_capture.md`.
Validation
CPU unit tests for the v2 glue (`tests/v1/worker/test_gpu_v2_{steering,capture}_glue.py`, 14 tests).
GPU-validated on Qwen3-0.6B (RTX 3090, TP1/PP1), `VLLM_USE_V2_MODEL_RUNNER=1`:
… (`post_attn`, layer 5, `last_prompt`) delivers one `(1, hidden)` bf16 row to a consumer's `on_capture`.
Not yet exercised on GPU (mirrors v1 but unverified here): TP>1 / PP>1, per-request inline steering and named modules, capture prefix-cache reuse, preemption resume, spec-decode token layout.
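To make the capture check above concrete, here is a minimal sketch of what a `last_prompt` capture amounts to: taking the final prompt token's row out of a layer's `(num_tokens, hidden)` activations and handing a `(1, hidden)` slice to a consumer callback. This is an illustration, not the port's code. The helper `select_last_prompt_row`, the `ToyConsumer` class and the `on_capture` argument list are assumptions made for the example (only the callback name comes from the description), and plain Python lists stand in for bf16 tensors.

```python
from typing import List, Tuple

Row = List[float]


def select_last_prompt_row(hidden_states: List[Row], num_prompt_tokens: int) -> List[Row]:
    """Return the last prompt token's activations as a (1, hidden) slice.

    hidden_states holds one row per token, with prompt tokens first.
    """
    if not 0 < num_prompt_tokens <= len(hidden_states):
        raise ValueError("num_prompt_tokens must be in 1..len(hidden_states)")
    last = num_prompt_tokens - 1
    # Slicing (rather than indexing) keeps the leading dimension of size 1.
    return hidden_states[last:last + 1]


class ToyConsumer:
    """Hypothetical consumer; the real on_capture signature may differ."""

    def __init__(self) -> None:
        self.received: List[Tuple[int, str, List[Row]]] = []

    def on_capture(self, layer: int, hook: str, rows: List[Row]) -> None:
        self.received.append((layer, hook, rows))


# Three tokens with hidden size 3; the first two tokens are the prompt.
hidden = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
consumer = ToyConsumer()
consumer.on_capture(5, "post_attn", select_last_prompt_row(hidden, num_prompt_tokens=2))
```

The slice keeps the row dimension, so the consumer always sees a two-dimensional result even when exactly one position is captured.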
Notes
Includes an interim commit that added a fail-closed v1 fallback when steering/capture is configured, and a later commit that removes it now that the port exists — the two net out, so the PR diff is purely additive.
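As an illustration of the wiring described under How, the sketch below shows the general shape of a runner-side control-plane mixin: register on add, finalize on finish, decide per step whether eager execution is needed, and drain results onto the step output. Every class and method name here (`ToyCaptureMixin`, `ToyRunner`, `ToyRequest`, `ToyOutput`, `needs_eager_step`, and so on) is hypothetical, and the string results are placeholders; the actual `CaptureRunnerMixin` may be organised quite differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a scheduled request."""
    req_id: str
    capture_spec: Optional[str] = None  # client-supplied capture spec, if any


@dataclass
class ToyOutput:
    """Hypothetical stand-in for the runner's per-step output."""
    capture_results: Dict[str, List[str]] = field(default_factory=dict)


class ToyCaptureMixin:
    """Control plane only; the data plane is assumed to exist elsewhere."""

    def init_capture(self) -> None:
        self._active: Dict[str, str] = {}         # req_id -> client capture spec
        self._pending: Dict[str, List[str]] = {}  # finalized results awaiting drain

    def register_requests(self, requests: List[ToyRequest]) -> None:
        for req in requests:
            if req.capture_spec is not None:
                self._active[req.req_id] = req.capture_spec

    def needs_eager_step(self) -> bool:
        # In this sketch only client-spec captures force eager execution.
        return bool(self._active)

    def finalize_requests(self, req_ids: List[str]) -> None:
        for req_id in req_ids:
            spec = self._active.pop(req_id, None)
            if spec is not None:
                self._pending[req_id] = [f"captured:{spec}"]

    def drain_into(self, output: ToyOutput) -> None:
        output.capture_results.update(self._pending)
        self._pending = {}


class ToyRunner(ToyCaptureMixin):
    """Toy runner whose lifecycle methods delegate to the mixin."""

    def __init__(self) -> None:
        self.init_capture()

    def add_requests(self, requests: List[ToyRequest]) -> None:
        self.register_requests(requests)

    def finish_requests(self, req_ids: List[str]) -> None:
        self.finalize_requests(req_ids)

    def execute_model(self) -> ToyOutput:
        output = ToyOutput()
        self.drain_into(output)
        return output


runner = ToyRunner()
runner.add_requests([ToyRequest("a", "post_attn"), ToyRequest("b")])
eager_while_active = runner.needs_eager_step()
runner.finish_requests(["a", "b"])
step_output = runner.execute_model()
```

Keeping the bookkeeping in a mixin means the runner's lifecycle methods only delegate, which is one way to leave an existing runner untouched while adding a control plane next to it.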