Steering + capture on the v2 model runner #185

Open: RhizoNymph wants to merge 6 commits into feat/integration from feat/v2-model-runner

@RhizoNymph

Owner

What

Ports the activation steering and capture control planes to the experimental v2 GPU model runner (vllm/v1/worker/gpu/model_runner.py), so they work there the same way they do on the v1 runner.

Why

The steering/capture data plane (the apply_steering / capture_residual custom ops, per-layer buffers, kernels, SteeringManager/CaptureManager/CaptureStepGate/ActivationStore) already lives in model_executor/ + v1/capture/ and is shared by both runners — both load the same model, so the in-forward hooks already fire on v2 and safely no-op when nothing drives them. Only the runner-side control plane was v1-only. Without it, turning on the v2 runner with steering/capture configured silently did nothing.

How

Two v2-native modules wire the existing subsystems into v2's lifecycle (add_requests / finish_requests / execute_model / sample_tokens); the v1 runner is untouched.

  • gpu/capture_runner_mixin.py (CaptureRunnerMixin): manager/gate/store init, request register/finalize, the per-step force-eager decision (client-spec captures only — global specs ride the cudagraph-safe persistent-buffer path), gather-plan build, and draining results onto ModelRunnerOutput.capture_results.
  • gpu/steering_runner_mixin.py (SteeringRunnerMixin): subclasses the v1 SteeringModelRunnerMixin, reusing init / layer discovery / validation / the public RPC API / _resolve_request_steering unchanged, and overrides only the three methods that touched v1-runner state. It keeps its own per-request state (v2 retains no CachedRequestState), drives prefill→decode transitions and the per-token steering index, and needs no force-eager (persistent buffers are cudagraph-safe). gpu_worker.py already forwards the steering RPCs.
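The per-step force-eager decision described above can be sketched roughly as follows. This is only an illustration of the rule "client-spec captures force eager, global specs do not"; `CaptureRequest`, its `source` field, and `needs_force_eager` are hypothetical names, not the identifiers used in `capture_runner_mixin.py`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CaptureRequest:
    """Hypothetical stand-in for a scheduled request that may capture."""
    request_id: str
    source: str  # "client" (per-request spec) or "global" (assumed labels)


def needs_force_eager(scheduled: Iterable[CaptureRequest]) -> bool:
    """Return True if this step must bypass cudagraph replay.

    Only client-spec captures force eager execution; global specs are
    assumed to write into persistent buffers, which are cudagraph-safe.
    """
    return any(req.source == "client" for req in scheduled)


if __name__ == "__main__":
    step = [CaptureRequest("a", "global"), CaptureRequest("b", "client")]
    print(needs_force_eager(step))
```

Steering never enters this decision: per the description, its persistent buffers are cudagraph-safe, so a step with only steered requests can still replay a captured graph.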

Design notes: docs/design/v2_runner_steering_capture.md.

Validation

CPU unit tests for the v2 glue (tests/v1/worker/test_gpu_v2_{steering,capture}_glue.py, 14 tests).

GPU-validated on Qwen3-0.6B (RTX 3090, TP1/PP1), VLLM_USE_V2_MODEL_RUNNER=1:

  • Steering, eager and cudagraph: global vectors shift the output; clearing restores the exact baseline (confirms the persistent-buffer path is cudagraph-safe — no force-eager).
  • Capture, eager: a client-spec request (post_attn, layer 5, last_prompt) delivers one (1, hidden) bf16 row to a consumer's on_capture.

Not yet exercised on GPU (mirrors v1 but unverified here): TP>1 / PP>1, per-request inline steering and named modules, capture prefix-cache reuse, preemption resume, spec-decode token layout.

Notes

Includes an interim commit that added a fail-closed v1 fallback when steering/capture is configured, and a later commit that removes it now that the port exists — the two net out, so the PR diff is purely additive.

@RhizoNymph

Owner Author

Also validated cross-node on 2×RTX 3090 (Ray, NCCL over bond0): steering passes under both TP=2 and PP=2 — global set_steering_vectors shifts output and clear restores baseline; per-worker results confirm the rank-replication invariant (TP: both ranks register; PP: stage 1 correctly reports no owned steered layers via locally_owned_layers).

@RhizoNymph

Owner Author

Full validation matrix (GPU, 2×RTX 3090)

Extended the validation well beyond the initial pass. All on the v2 runner (VLLM_USE_V2_MODEL_RUNNER=1); see docs/design/v2_runner_steering_capture.md for details.

Steering

  • Global (set_steering_vectors), per-request inline (SamplingParams.steering_vectors), named-module (register_steering_modules + steering_module_ref), and per-request scale (0 → baseline, 1 → steered).
  • Mixed batch: a steered and an unsteered request together — the unsteered output is byte-identical to baseline (per-request rows don't cross-contaminate).
  • Decode-only / prefill→decode transition (lazy decode-config registration at the boundary); prefill-only tier.
  • All three hook points (pre_attn / post_attn / post_mlp); eager and cudagraph (persistent buffers → no force-eager); chunked prefill.
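One way to picture why mixed batches don't cross-contaminate is a per-token index into a table of steering rows. The sketch below is a toy, pure-Python model of that idea, assuming row 0 of the table is all zeros for unsteered tokens; the table layout, `build_table`, and `apply_steering` here are illustrative assumptions, not the actual custom op or buffer format.

```python
from typing import List, Sequence

Vector = List[float]


def build_table(vectors: Sequence[Vector], scales: Sequence[float],
                hidden: int) -> List[Vector]:
    """Row 0 is all zeros (unsteered); row i + 1 is vectors[i] * scales[i]."""
    table = [[0.0] * hidden]
    for vec, scale in zip(vectors, scales):
        table.append([scale * x for x in vec])
    return table


def apply_steering(hidden_states: Sequence[Vector], table: Sequence[Vector],
                   token_index: Sequence[int]) -> List[Vector]:
    """Add each token's selected table row to its hidden state."""
    return [
        [h + s for h, s in zip(row, table[idx])]
        for row, idx in zip(hidden_states, token_index)
    ]


if __name__ == "__main__":
    table = build_table([[1.0, -1.0]], [1.0], hidden=2)
    # Tokens 0 and 2 belong to an unsteered request, token 1 to a steered one.
    print(apply_steering([[0.5, 0.5], [0.5, 0.5], [2.0, 3.0]], table, [0, 1, 0]))
```

In this model a per-request scale of 0 yields a zero row, so the request behaves like the baseline, which matches the "0 → baseline, 1 → steered" check above.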

Capture

  • Client-spec eager and under cudagraph (the force-eager gate fires for the capturing step); global-spec persistent-buffer path.
  • Position selectors: last_prompt, all_generated, all, explicit index list.
  • Filesystem + logging consumers, multiple consumers together; activation-store write path (block-hash wiring); server-side prefix-cache reuse (all_generated reuses, all_prompt recaptures — matches v1).
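For readers unfamiliar with the position selectors, here is a minimal sketch of how they could map to token positions. The semantics are assumptions inferred from the selector names (for example, that `last_prompt` means the final prompt token), and `resolve_positions` is a hypothetical helper, not a function in the capture subsystem.

```python
from typing import List, Sequence, Union

Selector = Union[str, Sequence[int]]


def resolve_positions(selector: Selector, prompt_len: int,
                      total_len: int) -> List[int]:
    """Map a selector to token positions in [0, total_len).

    total_len counts prompt tokens plus tokens generated so far.
    """
    if isinstance(selector, str):
        if selector == "last_prompt":
            return [prompt_len - 1] if prompt_len > 0 else []
        if selector == "all_prompt":
            return list(range(prompt_len))
        if selector == "all_generated":
            return list(range(prompt_len, total_len))
        if selector == "all":
            return list(range(total_len))
        raise ValueError(f"unknown selector: {selector!r}")
    # Explicit index list: keep only positions that exist (an assumption;
    # the real code may reject out-of-range indices instead).
    return [i for i in selector if 0 <= i < total_len]


if __name__ == "__main__":
    print(resolve_positions("last_prompt", prompt_len=4, total_len=6))
    print(resolve_positions("all_generated", prompt_len=4, total_len=6))
```

Under this reading, `last_prompt` always resolves to a single position, which is consistent with the one `(1, hidden)` row delivered in the initial capture test.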

Distributed (Ray, NCCL/bond0)

  • Steering under TP=2 and PP=2 (rank-replication and per-stage locally_owned_layers confirmed).
  • Capture under TP=2 (only TP rank 0 writes) and PP=2 (stage 0 → its layer, stage 1 → its layer).
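The distributed capture behaviour above boils down to two conditions per worker. A rough sketch, assuming each worker knows its TP rank and the set of layers its PP stage owns; `should_write_capture` is a hypothetical helper, and only the `locally_owned_layers` name comes from this PR.

```python
from typing import AbstractSet


def should_write_capture(tp_rank: int, layer: int,
                         locally_owned_layers: AbstractSet[int]) -> bool:
    """Decide whether this worker emits the captured row for `layer`.

    Only TP rank 0 writes, so replicated activations are not emitted once
    per rank, and a PP stage only writes layers it actually owns.
    """
    return tp_rank == 0 and layer in locally_owned_layers


if __name__ == "__main__":
    stage0_layers = set(range(0, 14))   # illustrative split, not measured
    stage1_layers = set(range(14, 28))
    print(should_write_capture(0, 5, stage0_layers))   # stage 0, TP rank 0
    print(should_write_capture(1, 5, stage0_layers))   # stage 0, TP rank 1
    print(should_write_capture(0, 5, stage1_layers))   # stage 1 does not own layer 5
```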

Hard edges

  • Preemption resume: 248 preemption events, all steered requests stayed steered; capture under preemption: 72 events, 24/24 requests delivered.
  • Streaming re-add (async streaming-input session): re-add branch fires, output steered.

Coverage: Qwen3-0.6B and gemma-3-4b-it (v2 supports gemma-3; the port is not Qwen3-specific). Plus CPU glue tests for the v2-specific projection logic.

Out of scope / not exercised: spec-decode, DP, async-dispatch overload policies, and the store serve path (dormant by design — all_prompt recaptures on v1 too).

@RhizoNymph

Owner Author

Follow-up fix: capture registration on re-add paths

Discovered while auditing whether the "untested" items actually touch the port's diff. Two paths re-enter _capture_add_request for an id the capture manager already holds, and the manager raises on duplicate ids (already registered) — caught as a request error. Steering already handled this symmetric case; capture didn't.

  • Streaming re-add: add_requests → _remove_request (which does not touch capture state) → re-add of a still-live request with a grown prompt. The stale prior-chunk registration is now discarded (gate drop + unregister_request, no finalize) and the request is re-registered against the new prompt.
  • Preemption resume: on v2 the scheduler folds scheduled_resumed_reqs into scheduled_new_reqs, so a resumed request flows through _capture_add_request with was_present=False while its registration intentionally survived preemption. It's now kept as-is (skip re-registration), preserving rows captured before preemption.

The runtime change threads was_present (the _remove_request return) into _capture_add_request to distinguish a live re-add from a fresh admit / resume.
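The three branches can be summarised as a small decision function. This is a sketch of the logic as described in this comment, not the code in the commit; `CaptureAddAction` and `decide_capture_add` are hypothetical names, and `already_registered` stands in for whatever lookup the capture manager exposes.

```python
from enum import Enum


class CaptureAddAction(Enum):
    REGISTER = "register"   # fresh admit: no prior registration
    REPLACE = "replace"     # streaming re-add: drop stale entry, re-register
    KEEP = "keep"           # preemption resume: registration survived


def decide_capture_add(was_present: bool,
                       already_registered: bool) -> CaptureAddAction:
    """Pick the capture action for a request entering the runner.

    was_present mirrors the _remove_request return: True means the request
    was still live in runner state when it was re-added.
    """
    if not already_registered:
        return CaptureAddAction.REGISTER
    if was_present:
        # Live re-add with a grown prompt: the old registration is stale.
        return CaptureAddAction.REPLACE
    # Not in runner state but still registered: resumed after preemption,
    # so rows captured before preemption should be preserved.
    return CaptureAddAction.KEEP


if __name__ == "__main__":
    print(decide_capture_add(was_present=False, already_registered=False))
    print(decide_capture_add(was_present=True, already_registered=True))
    print(decide_capture_add(was_present=False, already_registered=True))
```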

GPU-validated on Qwen3-0.6B (RTX 3090, v2 runner), clean before/after:

  • Streaming session: pre-fix logged capture request '...' is already registered on each re-add; post-fix zero, capture delivered.
  • Preemption (24 capturing requests, 64-block KV cache, forced eviction): pre-fix logged 20 already registered rejections; post-fix 0, 24/24 captures delivered.

CPU glue tests added for the three branches (fresh / streaming re-add / preemption resume) plus the non-capturer rank; the full v2 glue suite now reports 19 passed.

Also confirmed the async-dispatch overload policies (spill/drop/block) are not touched by the port — they live in the runner-agnostic transport shared with v1, so they stay out of scope here.

Commit: 40133ec
