Steering + capture on the v2 model runner by RhizoNymph · Pull Request #185 · RhizoNymph/vllm

RhizoNymph · 2026-06-19T00:29:26Z

What

Ports the activation steering and capture control planes to the experimental v2 GPU model runner (vllm/v1/worker/gpu/model_runner.py), so they work there the same way they do on the v1 runner.

Why

The steering/capture data plane (the apply_steering / capture_residual custom ops, per-layer buffers, kernels, SteeringManager/CaptureManager/CaptureStepGate/ActivationStore) already lives in model_executor/ + v1/capture/ and is shared by both runners — both load the same model, so the in-forward hooks already fire on v2 and safely no-op when nothing drives them. Only the runner-side control plane was v1-only. Without it, turning on the v2 runner with steering/capture configured silently did nothing.

How

Two v2-native modules wire the existing subsystems into v2's lifecycle (add_requests / finish_requests / execute_model / sample_tokens); the v1 runner is untouched.

gpu/capture_runner_mixin.py (CaptureRunnerMixin): manager/gate/store init, request register/finalize, the per-step force-eager decision (client-spec captures only — global specs ride the cudagraph-safe persistent-buffer path), gather-plan build, and draining results onto ModelRunnerOutput.capture_results.
gpu/steering_runner_mixin.py (SteeringRunnerMixin): subclasses the v1 SteeringModelRunnerMixin, reusing init / layer discovery / validation / the public RPC API / _resolve_request_steering unchanged, and overrides only the three methods that touched v1-runner state. It keeps its own per-request state (v2 retains no CachedRequestState), drives prefill→decode transitions and the per-token steering index, and needs no force-eager (persistent buffers are cudagraph-safe). gpu_worker.py already forwards the steering RPCs.
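
The per-step force-eager decision in CaptureRunnerMixin can be sketched as below. This is a minimal illustration, not the PR's code: `CaptureSpec`, its `source` field, and `should_force_eager` are hypothetical names. Only the rule itself (client-spec captures force eager, global specs stay on the cudagraph-safe persistent-buffer path) comes from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CaptureSpec:
    """Hypothetical stand-in for a capture spec; not the real vLLM type."""
    source: str                       # "client" (per-request) or "global"
    request_id: Optional[str] = None  # only set for client specs


def should_force_eager(scheduled_ids: Iterable[str],
                       specs: Iterable[CaptureSpec]) -> bool:
    """Return True if any request scheduled this step has a client-spec capture.

    Global specs never force eager here: they are assumed to write into
    persistent buffers that remain valid under cudagraph replay.
    """
    scheduled = set(scheduled_ids)
    return any(
        spec.source == "client" and spec.request_id in scheduled
        for spec in specs
    )
```

With this shape, a step that schedules only non-capturing requests keeps its cudagraph, and only the step that actually carries a client-spec capture pays the eager cost.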

Design notes: docs/design/v2_runner_steering_capture.md.

Validation

CPU unit tests for the v2 glue (tests/v1/worker/test_gpu_v2_{steering,capture}_glue.py, 14 tests).

GPU-validated on Qwen3-0.6B (RTX 3090, TP1/PP1), VLLM_USE_V2_MODEL_RUNNER=1:

Steering, eager and cudagraph: global vectors shift the output; clearing restores the exact baseline (confirms the persistent-buffer path is cudagraph-safe — no force-eager).
Capture, eager: a client-spec request (post_attn, layer 5, last_prompt) delivers one (1, hidden) bf16 row to a consumer's on_capture.

Not yet exercised on GPU (mirrors v1 but unverified here): TP>1 / PP>1, per-request inline steering and named modules, capture prefix-cache reuse, preemption resume, spec-decode token layout.

Notes

Includes an interim commit that added a fail-closed v1 fallback when steering/capture is configured, and a later commit that removes it now that the port exists — the two net out, so the PR diff is purely additive.


RhizoNymph · 2026-06-19T02:10:33Z

Also validated cross-node on 2×RTX 3090 (Ray, NCCL over bond0): steering passes under both TP=2 and PP=2 — global set_steering_vectors shifts output and clear restores baseline; per-worker results confirm the rank-replication invariant (TP: both ranks register; PP: stage 1 correctly reports no owned steered layers via locally_owned_layers).


RhizoNymph · 2026-06-19T06:18:57Z

Full validation matrix (GPU, 2×RTX 3090)

Extended the validation well beyond the initial pass. All on the v2 runner (VLLM_USE_V2_MODEL_RUNNER=1); see docs/design/v2_runner_steering_capture.md for details.

Steering

Global (set_steering_vectors) and per-request inline (SamplingParams.steering_vectors), named-module (register_steering_modules + steering_module_ref), and per-request scale (0 → baseline, 1 → steered).
Mixed batch: a steered and an unsteered request together — the unsteered output is byte-identical to baseline (per-request rows don't cross-contaminate).
Decode-only / prefill→decode transition (lazy decode-config registration at the boundary); prefill-only tier.
All three hook points (pre_attn / post_attn / post_mlp); eager and cudagraph (persistent buffers → no force-eager); chunked prefill.
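
The mixed-batch result above follows from indexing steering per token. A minimal sketch of that idea, in plain Python rather than the real custom op: the table layout (row 0 reserved as an all-zero "unsteered" row) and the function name `apply_steering_rows` are assumptions made for illustration, not the PR's actual buffer layout.

```python
def apply_steering_rows(hidden, steering_table, steering_index):
    """Add each token's selected steering row to its hidden state.

    hidden:         list of per-token hidden vectors
    steering_table: list of steering rows; row 0 is assumed to be all zeros
    steering_index: for each token, the row of steering_table to add
    """
    return [
        [h + s for h, s in zip(token, steering_table[row])]
        for token, row in zip(hidden, steering_index)
    ]


# Three tokens in one batch: tokens 0 and 2 belong to a steered request,
# token 1 belongs to an unsteered request and therefore points at row 0.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
table = [[0.0, 0.0], [0.5, -0.5]]
index = [1, 0, 1]
steered = apply_steering_rows(hidden, table, index)
```

Because the unsteered token only ever adds the zero row, its hidden state is unchanged regardless of what other requests in the batch are doing.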

Capture

Client-spec eager and under cudagraph (the force-eager gate fires for the capturing step); global-spec persistent-buffer path.
Position selectors: last_prompt, all_generated, all, explicit index list.
Filesystem + logging consumers, multiple consumers together; activation-store write path (block-hash wiring); server-side prefix-cache reuse (all_generated reuses, all_prompt recaptures — matches v1).
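
For readers unfamiliar with the position selectors, the sketch below shows one plausible way to resolve them into token indices. The selector names come from this PR; the exact semantics, the function `resolve_positions`, and the choice to drop out-of-range explicit indices are assumptions, and the real implementation may differ (for example by rejecting bad indices).

```python
def resolve_positions(selector, num_prompt_tokens, num_generated_tokens):
    """Map a position selector to absolute token indices (assumed semantics)."""
    total = num_prompt_tokens + num_generated_tokens
    if isinstance(selector, (list, tuple)):
        # Explicit index list: keep only indices that exist (an assumption).
        return [i for i in selector if 0 <= i < total]
    if selector == "last_prompt":
        return [num_prompt_tokens - 1] if num_prompt_tokens > 0 else []
    if selector == "all_prompt":
        return list(range(num_prompt_tokens))
    if selector == "all_generated":
        return list(range(num_prompt_tokens, total))
    if selector == "all":
        return list(range(total))
    raise ValueError(f"unknown position selector: {selector!r}")
```

Under these semantics, last_prompt always yields a single position, which is consistent with the single (1, hidden) row reported in the first validation pass.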

Distributed (Ray, NCCL/bond0)

Steering under TP=2 and PP=2 (rank-replication and per-stage locally_owned_layers confirmed).
Capture under TP=2 (only TP rank 0 writes) and PP=2 (stage 0 → its layer, stage 1 → its layer).
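
The distributed invariants can be illustrated with a small sketch. It assumes an even contiguous split of layers across pipeline stages, which may not match vLLM's actual partitioning; `stage_layer_range`, `owned_layers`, and `is_capture_writer` are hypothetical helpers, and only the name locally_owned_layers and the "TP rank 0 writes" rule come from the PR.

```python
def stage_layer_range(num_layers, pp_size, pp_rank):
    """Contiguous layer range for one pipeline stage (assumed even split)."""
    base, extra = divmod(num_layers, pp_size)
    start = pp_rank * base + min(pp_rank, extra)
    end = start + base + (1 if pp_rank < extra else 0)
    return range(start, end)


def owned_layers(requested_layers, num_layers, pp_size, pp_rank):
    """Subset of requested layers that live on this stage."""
    local = stage_layer_range(num_layers, pp_size, pp_rank)
    return [layer for layer in requested_layers if layer in local]


def is_capture_writer(tp_rank):
    """Only TP rank 0 writes captured rows; other ranks hold replicas."""
    return tp_rank == 0
```

With 28 layers and PP=2 in this sketch, a request targeting layer 5 is owned by stage 0 and stage 1 reports an empty list, which is the shape of the per-stage result described above.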

Hard edges

Preemption resume: 248 preemption events, all steered requests stayed steered; capture under preemption: 72 events, 24/24 requests delivered.
Streaming re-add (async streaming-input session): re-add branch fires, output steered.

Coverage: Qwen3-0.6B and gemma-3-4b-it (v2 supports gemma-3; the port is not Qwen3-specific). Plus CPU glue tests for the v2-specific projection logic.

Out of scope / not exercised: spec-decode, DP, async-dispatch overload policies, and the store serve path (dormant by design — all_prompt recaptures on v1 too).


RhizoNymph · 2026-06-20T02:40:43Z

Follow-up fix: capture registration on re-add paths

Discovered while auditing whether the "untested" items actually touch the port's diff. Two paths re-enter _capture_add_request for an id the capture manager already holds, and the manager raises on duplicate ids (already registered) — caught as a request error. Steering already handled this symmetric case; capture didn't.

Streaming re-add — add_requests → _remove_request (which does not touch capture state) → re-add of a still-live request with a grown prompt. The stale prior-chunk registration is now discarded (gate drop + unregister_request, no finalize) and re-registered against the new prompt.
Preemption resume — on v2 the scheduler folds scheduled_resumed_reqs into scheduled_new_reqs, so a resumed request flows through _capture_add_request with was_present=False while its registration intentionally survived preemption. It's now kept as-is (skip re-registration), preserving rows captured before preemption.

The runtime change threads was_present (the _remove_request return) into _capture_add_request to distinguish a live re-add from a fresh admit / resume.
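
The three branches can be summarized as a small decision function. This is a sketch of the logic as described in this comment, not the actual body of _capture_add_request: the enum, the function name `decide_capture_add`, and the `already_registered` flag (standing in for "the capture manager already holds this id") are hypothetical.

```python
from enum import Enum


class CaptureAddAction(Enum):
    REGISTER = "register"  # fresh admit: no prior registration exists
    REFRESH = "refresh"    # streaming re-add: drop stale state, re-register
    KEEP = "keep"          # preemption resume: keep the surviving registration


def decide_capture_add(was_present: bool,
                       already_registered: bool) -> CaptureAddAction:
    """Pick what to do when a request reaches the capture add path.

    was_present:        the runner still tracked the request when it was
                        re-added (the _remove_request return value)
    already_registered: the capture manager already holds this request id
    """
    if not already_registered:
        return CaptureAddAction.REGISTER
    if was_present:
        # Live request re-added with a grown prompt: the old registration
        # refers to the prior chunk and must be replaced.
        return CaptureAddAction.REFRESH
    # Not tracked by the runner but still registered: resumed after
    # preemption, so rows captured earlier must be preserved.
    return CaptureAddAction.KEEP
```

The key point is that `already_registered` alone cannot separate the two re-entry paths; `was_present` is what tells a streaming re-add apart from a preemption resume.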

GPU-validated on Qwen3-0.6B (RTX 3090, v2 runner), clean before/after:

Streaming session: pre-fix, the log showed "capture request '...' is already registered" on each re-add; post-fix, the message no longer appears and the capture is delivered.
Preemption (24 capturing requests, 64-block KV cache, forced eviction): pre-fix, the log showed 20 "already registered" rejections; post-fix, 0 rejections and 24/24 captures delivered.

Added CPU glue tests for the three branches (fresh / streaming re-add / preemption resume) plus the non-capturer rank; the full v2 glue suite passes (19 tests).

Also confirmed the async-dispatch overload policies (spill/drop/block) are not touched by the port — they live in the runner-agnostic transport shared with v1, so they stay out of scope here.

Commit: 40133ec


RhizoNymph added 3 commits June 18, 2026 17:07

df1630c fix(config): fall back to v1 model runner when steering or capture is configured
2e680d3 feat(v2-runner): port activation steering and capture control planes to the v2 model runner
40becc7 feat(v2-runner): enable steering and capture on v2 (remove interim v1 fallback guard); fix global-steering row for untracked requests
edec905 docs(v2-runner): record full steering/capture validation matrix (single-node, TP/PP, preemption, streaming)
40133ec fix(v2-runner): refresh capture registration on streaming re-add and preemption resume
4646f0c docs(v2-runner): record steering-under-APC and capture TP/PP-cudagraph validation


RhizoNymph wants to merge 6 commits into feat/integration from feat/v2-model-runner
