feat(steering): dynamic steering — activation-conditioned steering by RhizoNymph · Pull Request #180 · RhizoNymph/vllm

RhizoNymph · 2026-06-17T07:50:43Z

Draft. Ties activation capture to activation steering so activations decide when/how to steer. Three controller tiers (async → sync → in-graph), each configuring the one below. Design authority: docs/design/dynamic_steering.md (+ dynamic_steering_apc_notification.md, dynamic_steering_row_gating.md).

What's here

Phase 0 — async transport. In-process steering_action_queue (bounded, decode-tier-only validation), drained at the top of _update_steering_buffers.
Phase 1a — sync consumers + per-request actuation. execution="sync" consumer axis (every TP rank, 1-step latency); dynamic-override row pool (pure routing); observability + GET /v1/steering/dynamic; event-based on_step timing.
Phase 1b — gain primitives. Per-row strength scale (§5.3) + dedicated-gather dynamic additive tier (§5.4).
Phase 2 — in-graph monitor. Graph-safe monitor op computes a per-token gate sigmoid(sharpness·(residual·probe − threshold)) and modulates the §5.4 tier same-forward; and per-request rows (decode-only, prefill protected via a decode mask) when gate_rows is set.
APC correctness. Worker→scheduler effective-decode-steering-signature notification so decode KV produced under dynamic steering is not falsely reused — resolves the streaming-continuation prefix-cache hole.
Example controller emit_mode = scale | monitor.
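
The Phase 2 gate formula can be illustrated in plain Python. This is a minimal sketch, assuming `residual·probe` is a dot product over the hidden dimension; `monitor_gate` and its argument names are illustrative and are not the signature of the branch's monitor op.

```python
import math


def monitor_gate(residual, probe, threshold, sharpness):
    """Per-token gate: sigmoid(sharpness * (residual . probe - threshold))."""
    # Assumption: residual . probe is a plain dot product over the hidden dim.
    score = sum(r * p for r, p in zip(residual, probe))
    z = sharpness * (score - threshold)
    # Numerically stable sigmoid (avoids overflow in exp for large |z|).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# One gate value per token in the batch (toy 2-dim residuals).
residuals = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
probe = [2.0, -2.0]
gates = [monitor_gate(r, probe, threshold=0.0, sharpness=4.0) for r in residuals]
```

With a saturated threshold (for example ±1e6, as the gating e2e tests use) the gate collapses to exactly 0.0 or 1.0, which is what lets a test switch the monitor on or off by threshold sign alone.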

Status / validation

GPU-validated on gemma4-31B: tp=1 (per-request actuation, tier, APC reuse), tp=2 cross-node (rank-replication + APC re-keying), pp=2, active in-graph monitor (tier + row gating), and row-gating kernel/op/cudagraph parity. Extensive CPU suites.

Notes for review

Includes a fix for a pre-existing decode-only per-request steering short-circuit deadlock (also proposed standalone against the base in fix(steering): decode-only per-request steering dropped by nothing-active short-circuit #178).
Deferred: model_runner_v2 integration (upstream dev-flag-gated).

RhizoNymph · 2026-06-17T21:29:16Z

End-to-end verification summary

Everything on this branch has now been validated end to end on GPU (RTX 3090, gemma-4-31B-it-Q4_K_S GGUF, hidden 5376, 60 layers — gemma4 is the only architecture carrying both capture taps and steering hooks). Below is the full picture: component-level checks, engine-level e2e, parallelism, and the three previously-open consumer-loop gaps, all now closed.

Methodology

Steering is verified via logprobs / num_cached_tokens / direct state inspection, never raw output-token equality — output-token comparison is ambiguous on high-confidence prompts, and capture records the pre-steering residual so it can't witness steering. The per-request e2e tests use a within-run target-vs-control technique (steer one of two identical concurrent requests; compare against the in-batch control) because two identical greedy prompts in one batch are not bitwise identical deep in generation (batched reductions are position-dependent, diverging from FP noise ~token 22). NOISE_FLOOR=10 separates real (early) steering from that floor. All GPU tests force VLLM_USE_FLASHINFER_SAMPLER=0 and VLLM_WORKER_MULTIPROC_METHOD=spawn.
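
The target-vs-control check described above can be sketched as follows. This is an illustration only: `first_diff` and `steered_early` are hypothetical helper names, and the sequences stand for whatever per-step values the tests compare (this text does not say whether those are token IDs or logprobs). Only `NOISE_FLOOR = 10` comes from the description.

```python
NOISE_FLOOR = 10  # value quoted in the methodology above


def first_diff(target, control):
    """Index of the first position where the two sequences differ, else None."""
    for i, (a, b) in enumerate(zip(target, control)):
        if a != b:
            return i
    return None


def steered_early(target, control, noise_floor=NOISE_FLOOR):
    """True when the target diverges from the in-batch control at or before the floor."""
    idx = first_diff(target, control)
    return idx is not None and idx <= noise_floor


control = list(range(30))
# Real steering: divergence shows up within the first few steps.
steered = control[:2] + [999] * 28
# FP noise: identical prompts drift apart only deep into generation.
noisy = control[:22] + [999] * 8
```

Here `steered_early(steered, control)` holds while `steered_early(noisy, control)` does not, so late FP-noise divergence is not mistaken for steering.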

Component / kernel (CPU + standalone GPU)

~312 steering CPU tests pass; ruff clean.
Per-row scale kernel (out = hidden + table[r]·scale[r]): exact at 1.0 / 0.5 / 0.0; cudagraph capture.
Dynamic additive tier (§5.4, + dvec·token_scales[t]): per-token gate math, prefill-zero (decode-only), free gain changes, cudagraph capture.
In-graph monitor (steering_monitor): real Triton kernel matches fp32 eager across N=1..256 (bf16 max|Δ| ≤ 4e-5); a hand-built CUDA graph capturing monitor → apply_steering, replayed across steps with different gains, reproduces eager (rel ≤ 5e-3) — proving the in-place token_scales mutation is visible to the later steer op within the same graph, the per-step overwrite is the reset (no cross-step accumulation), and prefill stays tier-free.
Per-request row gating (steering_row_gate + steering_decode_mask, table[r]·scale[r]·row_gate[t]): kernel parity (bf16 ≤ 6e-3, fp32 ~1e-7), monitor gate_rows gates decode rows while prefill rows stay exactly 1.0 (cache safety), CUDA-graph replay, engine cudagraph boot with the 8-arg apply + 7-arg monitor ops.
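
For reviewers, the arithmetic checked by the kernel tests above can be written as a plain-Python reference. This is a sketch under assumptions: it assumes the per-request row term and the dynamic additive tier simply sum into one output, and the function name and argument list are hypothetical (the real op's 8-argument signature is not shown here).

```python
def apply_steering_reference(hidden, table, scale, row_index,
                             row_gate, dvec, token_scales):
    """out[t] = hidden[t] + table[r]*scale[r]*row_gate[t] + dvec*token_scales[t].

    r = row_index[t] is the steering-table row assigned to token t.
    Assumption: both additive terms are applied in a single pass.
    """
    out = []
    for t, h in enumerate(hidden):
        r = row_index[t]
        row_coeff = scale[r] * row_gate[t]   # per-row strength times per-token gate
        tier_coeff = token_scales[t]         # per-token gain of the additive tier
        out.append([
            h_i + table[r][i] * row_coeff + dvec[i] * tier_coeff
            for i, h_i in enumerate(h)
        ])
    return out


hidden = [[1.0, 2.0], [3.0, 4.0]]
table = [[0.0, 0.0], [10.0, 20.0]]  # row 0 is a no-op row in this toy setup
scale = [1.0, 0.5]
row_index = [1, 1]
row_gate = [1.0, 0.0]       # token 0 ungated (prefill-like), token 1 gated off
dvec = [1.0, 1.0]
token_scales = [0.0, 2.0]   # tier is zero on token 0 (prefill-zero), active on token 1
out = apply_steering_reference(hidden, table, scale, row_index,
                               row_gate, dvec, token_scales)
```

Token 0 receives only the half-strength row term and token 1 only the tier term, mirroring the prefill-zero and row-gate cases listed above.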

Engine-level e2e (GPU)

One-step actuation latency + per-request targeting (test_dynamic_steering_e2e.py): an override emitted at step N changes the target's output starting at N+1, never token 0; the in-batch control is untouched. first_diff=2.
In-graph monitor, active path (engaged vs disengaged by threshold sign only): engaged diverges at token 1, disengaged == unsteered baseline — the monitor gate provably controls the tier through the full controller → SteeringMonitorUpdate → manager → kernel path.
APC steering-aware prefix caching (test_apc_steering_e2e.py): a continuation of a dynamically steered request reuses 0 of its prior decode KV (override-keyed blocks not falsely reused), while a continuation of an unsteered request reuses normally (64/75) — the worker→scheduler effective-decode-signature notification is correct.
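
The APC result above follows from folding the effective decode-steering signature into the block key. The toy model below illustrates that idea only; the hashing scheme, key layout, and the names `block_key`, `chain_keys`, and `count_hits` are assumptions for illustration, not the scheduler's actual implementation.

```python
import hashlib


def block_key(parent_key, token_ids, steering_signature=0):
    """Hash a KV block from its parent key, its tokens, and a steering signature."""
    h = hashlib.sha256()
    h.update(parent_key)
    h.update(repr((tuple(token_ids), steering_signature)).encode())
    return h.digest()


def chain_keys(blocks, signatures):
    """Keys for a sequence of blocks, each chained to its predecessor."""
    keys, parent = [], b""
    for tokens, sig in zip(blocks, signatures):
        parent = block_key(parent, tokens, sig)
        keys.append(parent)
    return keys


def count_hits(cache, keys):
    """Number of leading blocks found in the cache (prefix reuse)."""
    hits = 0
    for key in keys:
        if key not in cache:
            break
        hits += 1
    return hits


blocks = [[1, 2], [3, 4], [5, 6]]  # block 0 is prompt, blocks 1-2 were decoded

# First request decoded under a dynamic override (signature 7 is arbitrary).
steered_cache = set(chain_keys(blocks, [0, 7, 7]))
# First request decoded without steering.
plain_cache = set(chain_keys(blocks, [0, 0, 0]))

# The continuation looks the same tokens up as unsteered prefill.
lookup = chain_keys(blocks, [0, 0, 0])
steered_hits = count_hits(steered_cache, lookup)
plain_hits = count_hits(plain_cache, lookup)
```

In this toy run the continuation of the steered request reuses only the prompt block (none of the decode blocks), while the continuation of the unsteered request reuses all three.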

Parallelism

tp=1: all of the above.
tp=2 cross-node (Ray, NCCL over bond0): rank-replication smoke (identical steering tables / coherent steered output across ranks) and APC re-keying (steered continuation cached=0, unsteered=64) — the rank-0-canonical notification holds under TP.
pp=2 cross-node: engine + per-request decode steering on both pipeline stages.

Three previously-open consumer-loop gaps — now closed (GPU)

gate_rows end-to-end (test_steering_gating_e2e.py::test_row_gate_*): a consumer installs an override row + a saturated-threshold (±1e6) gate_rows=True monitor. Gate ON → target's per-request row applied (early divergence in [1,10]); gate OFF → row suppressed (target tracks the control past the noise floor). Confirms the monitor gates the per-request row term, not just the tier, through a real consumer loop.
req_id-keyed scale end-to-end (::test_req_id_scale_*): override + SteeringScaleUpdate(req_id=, scale=0) emitted in the same step (override first so the runner resolves req_id → dyn_id in the in-order apply). scale=0 suppresses exactly the target's row; the unscaled override diverges early.
Async transport end-to-end (test_async_steering_e2e.py, AsyncTierExample): a global-tier SteeringVectorUpdate submitted through the action queue from on_capture, exercising queue → drain → _apply_steering_actions.

Behavioral finding from gap 3: on_capture fires when a request finalizes (after its output is emitted), so an async update can never steer its own request — it steers a subsequent one. The first version of this test (single-request cross-load) correctly failed (base == steered), surfacing exactly this; the test was rewritten to repeat the prompt in one engine (gen[0] baseline, gen[1..] steered). The AsyncTierExample docstring + README were corrected accordingly (they had implied in-request "1–3 step latency"). For same-request exactly-one-step latency, use a sync on_step consumer.
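
The timing consequence can be reproduced with a small simulation. This is a sketch under assumptions: `BoundedActionQueue` and `run_requests` are hypothetical stand-ins for the action queue and engine loop, and rejecting submissions when the queue is full is an assumed overflow policy (the description only says the queue is bounded).

```python
from collections import deque


class BoundedActionQueue:
    """Toy bounded queue: submit() rejects when full, drain() empties it."""

    def __init__(self, maxsize):
        self._items = deque()
        self._maxsize = maxsize

    def submit(self, action):
        if len(self._items) >= self._maxsize:
            return False  # assumed policy: drop the action when the queue is full
        self._items.append(action)
        return True

    def drain(self):
        drained = list(self._items)
        self._items.clear()
        return drained


def run_requests(num_requests, steps_per_request, queue):
    """Run requests back to back; record which vector was active at each step."""
    active_vector = None
    log = []
    for req in range(num_requests):
        seen = []
        for _ in range(steps_per_request):
            # The queue is drained at the top of every step.
            for action in queue.drain():
                active_vector = action
            seen.append(active_vector)
        log.append(seen)
        # on_capture fires only when the request finalizes, after its output.
        queue.submit(f"vector-from-req-{req}")
    return log


log = run_requests(num_requests=3, steps_per_request=3,
                   queue=BoundedActionQueue(maxsize=4))
```

Request 0 runs entirely unsteered, and every later request is steered by the update its predecessor submitted at finalize, which matches the gen[0] baseline, gen[1..] steered layout of the rewritten test.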

Pre-existing bug found + fixed during validation

A decode-only static per-request steering request (prefill_hash==0, decode_hash!=0) was silently dropped at all parallelisms (predating this branch) — the nothing-active short-circuit returned before the prefill→decode transition that registers the decode config. Fixed via a batch_has_per_request_steering guard + regression test. Ported to the base branches (PRs #178 → feat/integration, #179 → feat/steering).
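
The shape of the bug and the guard can be shown with a toy model. Everything here is a simplification for illustration: `Request`, `update_steering`, and the `registered` set are hypothetical, and only the field names `prefill_hash` / `decode_hash` and the guard name `batch_has_per_request_steering` come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prefill_hash: int  # 0 means no prefill steering
    decode_hash: int   # 0 means no decode steering


def batch_has_per_request_steering(batch):
    """True if any request in the batch carries a per-request steering config."""
    return any(r.prefill_hash != 0 or r.decode_hash != 0 for r in batch)


def update_steering(registered, batch, guarded):
    """Return True if the update ran, False if it short-circuited.

    Toy model: 'nothing active' means no config has been registered yet.
    """
    nothing_active = len(registered) == 0
    if nothing_active and not (guarded and batch_has_per_request_steering(batch)):
        return False  # early return happens before any registration below
    for req in batch:
        if req.decode_hash != 0:
            # Stand-in for the prefill->decode transition registering the config.
            registered.add(req.decode_hash)
    return True


decode_only = [Request(prefill_hash=0, decode_hash=42)]

unguarded_state = set()
ran_unguarded = update_steering(unguarded_state, decode_only, guarded=False)

guarded_state = set()
ran_guarded = update_steering(guarded_state, decode_only, guarded=True)
```

Without the guard the decode-only request never registers its config, so the short-circuit keeps firing on every step; with the guard the first update runs and the config is registered.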

Out of scope (not covered here)

model_runner_v2 steering integration — upstream dev-flag-gated.
The base/prefill allow_cache_unsafe_phases escape hatch — deliberate, cache-unsafe, no example (caller owns invalidation).

RhizoNymph added 30 commits June 11, 2026 03:11

feat(steering): in-process dynamic steering action queue drained per …

3ef4057

…step

feat(capture): dynamic steering controller example plugin (probe-gate…

0af6ced

…d decode steering)

docs(steering): dynamic steering design (phases 0-2, determinism + ca…

5570113

…che analysis)

test(steering): real-manager integration tests for dynamic action queue

eef9780

docs(steering): rework phase 1 around sync/async consumer execution axis

335fc84

docs(steering): record phase 1 decisions (per-request first, dynamic …

949b96c

…tier, packed banks)

feat(capture): sync consumer execution axis with all-rank slim manage…

9d345f8

…r and on_step

feat(steering): dynamic-override row pool for per-request decode stee…

35b8f7b

…ring

feat(steering): dynamic steering observability + sync plugin migratio…

4276c96

…n with per-request actuation

docs(steering): record tp=1 GPU validation of phase 1a; fix plugin bu…

8d3767f

…ild backend

fix(examples): use setuptools.build_meta backend in minimal_plugin an…

0f4945c

…d activation_reward_producer Co-authored-by: Claude

feat(steering): event-based on_step timing for honest sync-consumer cost

95f2914

docs(steering): record GPU validation of event-based on_step timing +…

9e02feb

… throughput A/B

feat(steering): sync-consumer warmup hook + engine-level dynamic-over…

42a77bc

…ride e2e test

test(steering): force spawn start method in dynamic-steering e2e test

c4e36dd

test(steering): assert dynamic-override e2e within-run to dodge batch…

a889bc9

…ed FP nondeterminism

fix(test): move NOISE_FLOOR constant above decorators

fb265f4

docs(steering): record warmup-hook and engine-level e2e GPU validation

5c5637a

feat(steering): dynamic additive tier (populate-folding) — compose wi…

906bb3e

…th operator decode steering

docs(steering): lock in policy expressiveness contract across control…

c7782d2

…ler tiers

feat(steering): per-row strength scale tensor (kernel + manager + sca…

050068b

…le action)

feat(steering): dedicated-gather dynamic tier (per-token gate + kerne…

4205263

…l term), replace populate-folding

feat(steering): in-graph monitor op — per-token gate from probe (Phas…

4dc3c75

…e 2 M1)

feat(steering): monitor config in manager + runner populate/short-cir…

5003751

…cuit (Phase 2 M2)

feat(steering): monitor action + dispatch + status + kernel warmup (P…

4538d62

…hase 2 M3)

docs(steering): Phase 2 in-graph monitor design — resolve reset/gatin…

0f70497

…g open items

docs(steering): record Phase 2 monitor CPU + GPU validation (node2)

6cec618

feat(steering): example controller emit_mode=scale|monitor (cheap tie…

6bc09a0

…r-gain + in-graph probe)

docs(steering): plan for worker->scheduler APC steering-signature not…

be882de

…ification

feat(steering): effective decode steering signature in manager (APC n…

b29e6e7

…otification M1)

RhizoNymph added 14 commits June 16, 2026 20:39

feat(steering): worker->scheduler decode-signature notification for A…

98b9ce0

…PC correctness

docs(steering): record as-built APC notification (forward-only per-bl…

1a262a1

…ock keying)

test(steering): APC steered-continuation e2e (GPU-validated) + record…

6e014d8

… validation

fix(steering): decode-only per-request steering dropped by nothing-ac…

f7afef9

…tive short-circuit

docs(steering): record tp2/pp2 validation + decode-only short-circuit…

65b4674

… fix

docs(steering): plan for per-request-row token gating (Phase 2 follow…

cecb5a8

…-up)

feat(steering): per-token row gate buffer + kernel term (row gating M1)

4d61c97

feat(steering): in-graph monitor gates per-request rows (decode mask,…

0336243

… gate_rows; row gating M2)

test(steering): row-gating GPU parity + record implementation/validat…

e100f5c

…ion (M3)

feat(steering): req_id-keyed per-request scale (SteeringScaleUpdate.r…

7e01f11

…eq_id)

examples(steering): minimal single-purpose consumers, one per configu…

4891d18

…ration

test(steering): engine-level e2e for row-gate, req_id scale, async tr…

0d86ccb

…ansport

test(steering): async e2e steers a later request (on_capture fires at…

2b4ca85

… finalize)

docs(steering): record row-gate/req_id-scale/async e2e GPU validation…

898831e

…; clarify async finalize timing

fix(steering): guard sync-consumer gpu-timing read with query() so a …

5656a05

…no-sync on_step can't crash the engine

RhizoNymph changed the title ~~feat(steering): dynamic steering — activation-conditioned steering (Phases 0–2 + APC notification)~~ feat(steering): dynamic steering — activation-conditioned steering Jun 18, 2026

RhizoNymph added 2 commits June 19, 2026 02:28

perf(steering): fuse in-graph monitor gate into apply_steering (non-m…

1021417

…utating, same-hook) to avoid cudagraph FULL-graph downgrade

fix: make direct_register_custom_op idempotent (cold torch.compile re…

bff4426

…-registers externally-plugged ops e.g. gguf, hitting torch's hard duplicate-registration error)

RhizoNymph wants to merge 47 commits into feat/integration from feat/dynamic-steering
