diff --git a/docs/design/capture_consumers.md b/docs/design/capture_consumers.md index 463f0ecf6651..b8290ae09a34 100644 --- a/docs/design/capture_consumers.md +++ b/docs/design/capture_consumers.md @@ -117,7 +117,7 @@ VllmInternalRequestId = NewType("VllmInternalRequestId", str) CaptureKey = tuple[VllmInternalRequestId, int, str] # (request id, layer index, hook name) -HookName = Literal["pre_attn", "post_attn", "post_mlp", "mlp_in", "mlp_out"] +HookName = Literal["pre_attn", "post_attn", "post_block", "mlp_in", "mlp_out"] PositionSelector = ( Literal["last_prompt", "all_prompt", "all_generated", "all"] | list[int] @@ -678,7 +678,7 @@ Writer details (`writer.py`): - TP / PP / EP / DP are all accepted for the replicated residual hooks (no parallel-size rejection). See [Capture Consumers under Parallelism](capture_parallelism.md). -- Every hook name is in `{pre_attn, post_attn, post_mlp, mlp_in, +- Every hook name is in `{pre_attn, post_attn, post_block, mlp_in, mlp_out}`. - Every resolved layer is in `[0, num_hidden_layers)`, the **global** layer count (admission validates the full layer space; the runner then diff --git a/docs/design/capture_parallelism.md b/docs/design/capture_parallelism.md index 40a9d3ae8faa..4965e6a8129c 100644 --- a/docs/design/capture_parallelism.md +++ b/docs/design/capture_parallelism.md @@ -18,7 +18,7 @@ guide see [Capture Consumers](../features/capture_consumers.md). ## TL;DR - **The capturable hooks are replicated.** The three hooks that fire - today — `pre_attn`, `post_attn`, `post_mlp` — read the residual + today — `pre_attn`, `post_attn`, `post_block` — read the residual stream *after* the TP all-reduce and the MoE combine, so the tensor is full `hidden_size`, **byte-identical on every TP and every EP rank**. 
For these hooks, TP/EP support is a *rank gate*, not a @@ -52,10 +52,10 @@ downstream of the reducing collectives: (`vllm/model_executor/layers/linear.py:1558-1559`), so attention- and MLP-output projections produce full `hidden_size` on every TP rank. - MoE paths all-gather/all-reduce before the residual add - (`vllm/model_executor/models/deepseek_v2.py:384`), so `post_mlp` on an + (`vllm/model_executor/models/deepseek_v2.py:384`), so `post_block` on an EP rank also sees the full residual. -Hence `pre_attn` / `post_attn` / `post_mlp` are `[num_rows, +Hence `pre_attn` / `post_attn` / `post_block` are `[num_rows, hidden_size]` and identical across the TP×EP plane of a PP stage. What is **genuinely sharded** (and not captured today): @@ -166,7 +166,7 @@ declared `location` can select between them. filesystem consumer writes to a path keyed by **global** layer index + request_id on a **shared** mount. - The on-disk layout merges naturally: stage 0 writes - `…/req/12_post_mlp.bin`, stage 1 writes `…/req/40_post_mlp.bin`, no + `…/req/12_post_block.bin`, stage 1 writes `…/req/40_post_block.bin`, no collision. 
The `packed`/`sharded` layouts (one file per request / per tag) cannot merge by global layer index alone, so under PP each stage writes its **own** file keyed by stage rank diff --git a/docs/features/capture_consumers.md b/docs/features/capture_consumers.md index 4d1a2a2ac59f..85184c8dec31 100644 --- a/docs/features/capture_consumers.md +++ b/docs/features/capture_consumers.md @@ -290,7 +290,7 @@ llm = LLM( model="meta-llama/Llama-3-8B", capture_consumers=[ {"name": "filesystem", "params": {"root": "/tmp/captures"}}, - {"name": "logging", "params": {"hooks": {"post_mlp": [0]}}}, + {"name": "logging", "params": {"hooks": {"post_block": [0]}}}, ], ) ``` @@ -323,7 +323,7 @@ sampling_params = SamplingParams( "filesystem": FilesystemCaptureRequest( request_id="probe_0001", tag="mnist-probe-v1", - hooks={"post_mlp": [12]}, + hooks={"post_block": [12]}, positions="last_prompt", ), }, @@ -355,7 +355,7 @@ response = httpx.post( "filesystem": { "request_id": "probe_train_0001", "tag": "capital-probe", - "hooks": {"post_mlp": [12, 16, 20, 24]}, + "hooks": {"post_block": [12, 16, 20, 24]}, "positions": "last_prompt", "layout": "packed", }, @@ -392,7 +392,7 @@ sampling_params = SamplingParams( "filesystem": FilesystemCaptureRequest( request_id="req1", tag="demo", - hooks={"post_mlp": [0]}, + hooks={"post_block": [0]}, positions="last_prompt", ), }, @@ -428,7 +428,7 @@ response body as `capture_results`, mirroring the structure above. ## Parallelism Capturing the residual-stream hooks (`pre_attn`, `post_attn`, -`post_mlp`) is supported under **tensor, pipeline, expert, and data +`post_block`) is supported under **tensor, pipeline, expert, and data parallelism** for worker-location consumers — including the built-in `filesystem` consumer. 
How it works: diff --git a/docs/features/steering.md b/docs/features/steering.md index 729ed36f4d87..ae4003f631d8 100644 --- a/docs/features/steering.md +++ b/docs/features/steering.md @@ -50,7 +50,7 @@ Also supported: - Global steering through HTTP endpoints - Per-request steering through `SamplingParams` - Three additive tiers (base / prefill-specific / decode-specific) -- Three hook points: `pre_attn`, `post_attn`, `post_mlp` +- Three hook points: `pre_attn`, `post_attn`, `post_block` - Phase-aware scheduler admission for per-request steering - Prefix-cache separation for different prefill steering configs - Continuous batching @@ -109,7 +109,7 @@ activation that is discarded immediately afterward. | --- | --- | | `pre_attn` | Residual stream before attention | | `post_attn` | Residual stream after attention | -| `post_mlp` | Residual stream after MLP | +| `post_block` | Residual stream after MLP | For supported models, these hooks are wired directly into each decoder layer's forward path. Unused hook points are zero-valued no-ops. 
@@ -157,7 +157,7 @@ curl -X POST http://localhost:8000/v1/steering/set \ -H "Content-Type: application/json" \ -d '{ "vectors": { - "post_mlp": { + "post_block": { "15": {"vector": [0.1, 0.2], "scale": 2.0} } }, @@ -209,7 +209,7 @@ packed_hook = { requests.post( "http://localhost:8000/v1/steering/set", - json={"vectors": {"post_mlp": packed_hook}}, + json={"vectors": {"post_block": packed_hook}}, ) ``` @@ -248,7 +248,7 @@ params = SamplingParams( max_tokens=64, temperature=0.0, steering_vectors={ - "post_mlp": { + "post_block": { 15: {"vector": [0.1, 0.2], "scale": 2.0}, }, }, @@ -288,7 +288,7 @@ vec = np.random.standard_normal(2560).astype(np.float16) stacked = np.stack([vec], axis=0) # (num_layers, hidden_size) base = { - "post_mlp": { + "post_block": { "dtype": str(stacked.dtype), # "float16" | "float32" | "float64" "shape": list(stacked.shape), "layer_indices": [15], @@ -335,7 +335,7 @@ The JSON file uses the same three-tier format as the global steering API: ```json { "vectors": { - "post_mlp": { + "post_block": { "15": [0.1, 0.2, 0.3], "20": {"vector": [0.4, 0.5, 0.6], "scale": 2.0} } @@ -357,7 +357,7 @@ startup cost: ```json { "vectors": { - "post_mlp": { + "post_block": { "dtype": "float32", "shape": [2, 2560], "layer_indices": [15, 20], @@ -379,7 +379,7 @@ curl -X POST http://localhost:8000/v1/steering/modules/register \ -d '{ "name": "creativity", "vectors": { - "post_mlp": {"15": [0.1, 0.2, 0.3]} + "post_block": {"15": [0.1, 0.2, 0.3]} } }' @@ -412,7 +412,7 @@ requests.post( json={ "name": "creativity", "vectors": { - "post_mlp": { + "post_block": { "dtype": str(stacked.dtype), "shape": list(stacked.shape), "layer_indices": [15], @@ -454,7 +454,7 @@ response = client.chat.completions.create( extra_body={ "steering_name": "creativity", "steering_vectors": { - "post_mlp": {15: [0.05, 0.1, 0.15]}, + "post_block": {15: [0.05, 0.1, 0.15]}, }, }, ) @@ -546,7 +546,7 @@ Returns per-layer hook-point availability aggregated across TP × PP ranks: ```bash curl 
http://localhost:8000/v1/steering/layers -# {"layers": {"0": {"hook_points": ["post_mlp"]}, "1": {"hook_points": ["post_mlp", "pre_attn"]}, ...}} +# {"layers": {"0": {"hook_points": ["post_block"]}, "1": {"hook_points": ["post_block", "pre_attn"]}, ...}} ``` Useful to confirm which layers of the loaded model are steerable before diff --git a/examples/capture_consumers/activation_reward_producer/README.md b/examples/capture_consumers/activation_reward_producer/README.md index 093c94454933..723c9487f7c3 100644 --- a/examples/capture_consumers/activation_reward_producer/README.md +++ b/examples/capture_consumers/activation_reward_producer/README.md @@ -48,7 +48,7 @@ llm = LLM( "name": "activation_reward", "params": { "layer": 12, - "hook": "post_mlp", + "hook": "post_block", "vector_path": "/models/happy/sadness.pt", "position_slice": {"start": 10, "end": None, "stride": 1}, "scale": 5.0, @@ -64,7 +64,7 @@ llm = LLM( ```bash vllm serve meta-llama/Llama-3-8B \ - --capture-consumers activation_reward:layer=12,hook=post_mlp,vector_path=/models/happy/sadness.pt,scale=5.0,nonlinearity=tanh + --capture-consumers activation_reward:layer=12,hook=post_block,vector_path=/models/happy/sadness.pt,scale=5.0,nonlinearity=tanh ``` ### Parameters @@ -72,7 +72,7 @@ vllm serve meta-llama/Llama-3-8B \ | Field | Type | Default | Purpose | | --- | --- | --- | --- | | `layer` | `int` | required | Layer index to capture at. | -| `hook` | `str` | required | One of `pre_attn`, `post_attn`, `post_mlp`, `mlp_in`, `mlp_out`. | +| `hook` | `str` | required | One of `pre_attn`, `post_attn`, `post_block`, `mlp_in`, `mlp_out`. | | `vector_path` | `str` | required | Path to a `.pt` file holding a 1-D tensor of shape `(hidden_size,)`. L2-normalized at load. | | `position_slice` | `dict` | `{start: 10, end: null, stride: 1}` | Applied to the `all_generated` span before mean-pooling. | | `scale` | `float` | `1.0` | Multiplicative factor on the raw cosine. 
| @@ -120,7 +120,7 @@ llm = LLM( "name": "activation_reward", "instance_name": "sadness_reward", "params": { - "layer": 12, "hook": "post_mlp", + "layer": 12, "hook": "post_block", "vector_path": "/models/happy/sadness.pt", }, }, diff --git a/examples/capture_consumers/activation_reward_producer/activation_reward_producer/__init__.py b/examples/capture_consumers/activation_reward_producer/activation_reward_producer/__init__.py index 5bc33c61c6da..79dcf74d4b28 100644 --- a/examples/capture_consumers/activation_reward_producer/activation_reward_producer/__init__.py +++ b/examples/capture_consumers/activation_reward_producer/activation_reward_producer/__init__.py @@ -35,7 +35,7 @@ from vllm.v1.capture.types import CaptureContext -_HOOK_NAMES = frozenset({"pre_attn", "post_attn", "post_mlp", "mlp_in", "mlp_out"}) +_HOOK_NAMES = frozenset({"pre_attn", "post_attn", "post_block", "mlp_in", "mlp_out"}) _NONLIN = { "tanh": math.tanh, "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)), diff --git a/examples/capture_consumers/activation_reward_producer/test.py b/examples/capture_consumers/activation_reward_producer/test.py index 98bbe4b0c7d7..2568a96be4dc 100644 --- a/examples/capture_consumers/activation_reward_producer/test.py +++ b/examples/capture_consumers/activation_reward_producer/test.py @@ -63,7 +63,7 @@ def test_payload_shape_and_lifecycle(tmp: Path) -> None: _mock_config(), { "layer": 12, - "hook": "post_mlp", + "hook": "post_block", "vector_path": str(vec_path), "position_slice": {"start": 2, "end": None, "stride": 1}, "scale": 5.0, @@ -73,11 +73,11 @@ def test_payload_shape_and_lifecycle(tmp: Path) -> None: # Validator returns the pinned spec. spec = producer.validate_client_spec({}, _ctx()) - assert spec.hooks == {"post_mlp": [12]} + assert spec.hooks == {"post_block": [12]} assert spec.positions == "all_generated" # Two chunks across two steps; total 6 rows; slice starts at 2. 
- key = (VllmInternalRequestId("req-1"), 12, "post_mlp") + key = (VllmInternalRequestId("req-1"), 12, "post_block") chunk_a = CaptureChunk( key=key, tensor=torch.randn(3, HIDDEN), @@ -127,12 +127,12 @@ def test_empty_window_payload(tmp: Path) -> None: _mock_config(), { "layer": 0, - "hook": "post_mlp", + "hook": "post_block", "vector_path": str(vec_path), "position_slice": {"start": 100, "end": None, "stride": 1}, }, ) - key = (VllmInternalRequestId("short"), 0, "post_mlp") + key = (VllmInternalRequestId("short"), 0, "post_block") producer.submit_chunk( CaptureChunk( key=key, @@ -156,9 +156,9 @@ def test_no_chunks_partial_error(tmp: Path) -> None: producer = ActivationRewardProducer( _mock_config(), - {"layer": 0, "hook": "post_mlp", "vector_path": str(vec_path)}, + {"layer": 0, "hook": "post_block", "vector_path": str(vec_path)}, ) - key = (VllmInternalRequestId("ghost"), 0, "post_mlp") + key = (VllmInternalRequestId("ghost"), 0, "post_block") producer.submit_finalize(CaptureFinalize(key=key)) result = producer.get_result(key) assert result.status == "partial_error" @@ -172,7 +172,7 @@ def test_non_empty_client_spec_rejected(tmp: Path) -> None: producer = ActivationRewardProducer( _mock_config(), - {"layer": 0, "hook": "post_mlp", "vector_path": str(vec_path)}, + {"layer": 0, "hook": "post_block", "vector_path": str(vec_path)}, ) try: producer.validate_client_spec({"layer": 99}, _ctx()) @@ -189,7 +189,7 @@ def test_tp_pp_rejected(tmp: Path) -> None: producer = ActivationRewardProducer( _mock_config(), - {"layer": 0, "hook": "post_mlp", "vector_path": str(vec_path)}, + {"layer": 0, "hook": "post_block", "vector_path": str(vec_path)}, ) try: producer.validate_client_spec({}, _ctx(tp=2)) @@ -207,7 +207,7 @@ def test_bad_layer_rejected(tmp: Path) -> None: try: ActivationRewardProducer( _mock_config(), - {"layer": NUM_LAYERS + 5, "hook": "post_mlp", "vector_path": str(vec_path)}, + {"layer": NUM_LAYERS + 5, "hook": "post_block", "vector_path": str(vec_path)}, ) except 
ValueError as e: assert "out of range" in str(e) @@ -223,7 +223,7 @@ def test_vector_hidden_size_mismatch(tmp: Path) -> None: try: ActivationRewardProducer( _mock_config(), - {"layer": 0, "hook": "post_mlp", "vector_path": str(vec_path)}, + {"layer": 0, "hook": "post_block", "vector_path": str(vec_path)}, ) except ValueError as e: assert "hidden_size" in str(e) diff --git a/examples/capture_consumers/minimal_plugin/my_plugin/__init__.py b/examples/capture_consumers/minimal_plugin/my_plugin/__init__.py index 65b02837ae9d..250a0a4a9d13 100644 --- a/examples/capture_consumers/minimal_plugin/my_plugin/__init__.py +++ b/examples/capture_consumers/minimal_plugin/my_plugin/__init__.py @@ -28,7 +28,7 @@ def __init__(self, vllm_config: Any, params: dict[str, Any]) -> None: def global_capture_spec(self) -> CaptureSpec: return CaptureSpec( - hooks={"post_mlp": self._layers}, + hooks={"post_block": self._layers}, positions="last_prompt", ) diff --git a/examples/capture_consumers/plugin_authoring.md b/examples/capture_consumers/plugin_authoring.md index 42ddff282624..1603cbd80f8a 100644 --- a/examples/capture_consumers/plugin_authoring.md +++ b/examples/capture_consumers/plugin_authoring.md @@ -40,7 +40,7 @@ class MyConsumer(CaptureConsumer): def global_capture_spec(self) -> CaptureSpec: return CaptureSpec( - hooks={"post_mlp": self._layers}, + hooks={"post_block": self._layers}, positions="last_prompt", ) @@ -86,7 +86,7 @@ pattern — the consumer always needs the same data. 
```python def global_capture_spec(self) -> CaptureSpec: return CaptureSpec( - hooks={"post_mlp": [0, 15, 31]}, + hooks={"post_block": [0, 15, 31]}, positions="last_prompt", ) ``` @@ -104,7 +104,7 @@ class FlexConsumer(CaptureConsumer): reads_client_spec = True def validate_client_spec(self, raw_spec, ctx): - hooks = raw_spec.get("hooks", {"post_mlp": list(range(ctx.num_hidden_layers))}) + hooks = raw_spec.get("hooks", {"post_block": list(range(ctx.num_hidden_layers))}) positions = raw_spec.get("positions", "all_prompt") return CaptureSpec(hooks=hooks, positions=positions) ``` @@ -197,7 +197,7 @@ import torch consumer = MyConsumer(MagicMock(), {"layers": [0]}) adapter = _BatchedAdapter(consumer) -key = (VllmInternalRequestId("test-req"), 0, "post_mlp") +key = (VllmInternalRequestId("test-req"), 0, "post_block") adapter.submit_chunk(CaptureChunk( key=key, @@ -247,7 +247,7 @@ class SumConsumer(CaptureConsumer): def global_capture_spec(self) -> CaptureSpec: return CaptureSpec( - hooks={"post_mlp": self._layers}, + hooks={"post_block": self._layers}, positions="last_prompt", ) diff --git a/examples/online_serving/openai_steering_client.py b/examples/online_serving/openai_steering_client.py index bdb30fabc54f..99a575a33c0a 100644 --- a/examples/online_serving/openai_steering_client.py +++ b/examples/online_serving/openai_steering_client.py @@ -80,7 +80,7 @@ def main() -> None: # Base steering applied to both prefill and decode. 
base = { - "post_mlp": pack_hook( + "post_block": pack_hook( {15: rng.standard_normal(HIDDEN_SIZE).astype(PACK_DTYPE)}, # Per-layer scales: the server multiplies row-by-row without # re-encoding the bytes, so the same vector can be reused at diff --git a/tests/compile/passes/ir/test_inplace_functionalization.py b/tests/compile/passes/ir/test_inplace_functionalization.py index 1e8d5662162f..f19a00b56a1f 100644 --- a/tests/compile/passes/ir/test_inplace_functionalization.py +++ b/tests/compile/passes/ir/test_inplace_functionalization.py @@ -369,7 +369,7 @@ def __init__(self, hidden_size=32, intermediate_size=128): ) # Post-MLP norm - self.post_mlp_norm = nn.Parameter(torch.ones(hidden_size, dtype=torch.bfloat16)) + self.post_block_norm = nn.Parameter(torch.ones(hidden_size, dtype=torch.bfloat16)) def forward(self, x: torch.Tensor): # Attention block with residual @@ -391,7 +391,7 @@ def forward(self, x: torch.Tensor): # Fused add + norm (maybe_inplace: residual1 is donated) normed2, residual2 = ops.fused_add_rms_norm.maybe_inplace( - mlp_out, residual1, self.post_mlp_norm, 1e-5 + mlp_out, residual1, self.post_block_norm, 1e-5 ) return normed2, residual2 diff --git a/tests/entrypoints/openai/test_capture_protocol.py b/tests/entrypoints/openai/test_capture_protocol.py index 03a823c0a8c7..9102c14f1475 100644 --- a/tests/entrypoints/openai/test_capture_protocol.py +++ b/tests/entrypoints/openai/test_capture_protocol.py @@ -61,7 +61,7 @@ def test_capture_request_field_accepted(self) -> None: "filesystem": { "request_id": "r1", "tag": "t1", - "hooks": {"post_mlp": [0]}, + "hooks": {"post_block": [0]}, "positions": "last_prompt", }, }, @@ -226,13 +226,13 @@ def test_populated_dict_builds_response_models(self) -> None: final = _FakeFinal( capture_results={ "fs": CaptureResult( - key=("r1", 0, "post_mlp"), + key=("r1", 0, "post_block"), status="ok", error=None, payload=["/tmp/a.bin", "/tmp/a.json"], ), "log": CaptureResult( - key=("r1", 0, "post_mlp"), + key=("r1", 0, 
"post_block"), status="partial_error", error="dropped", payload=None, @@ -280,7 +280,7 @@ class _FakeConsumerAccepts(CaptureConsumer): def validate_client_spec(self, raw_spec, ctx): # type: ignore[override] # Returns a valid CaptureSpec derived from the raw payload. - return CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") + return CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") def on_capture(self, key, tensor, sidecar): # pragma: no cover - unused pass @@ -414,7 +414,7 @@ def test_happy_path_mutates_sampling_params_to_spec(self, monkeypatch) -> None: stub._capture_consumers = {"filesystem": consumer} sp = SamplingParams( - capture={"filesystem": {"tag": "t", "hooks": {"post_mlp": [0]}}} + capture={"filesystem": {"tag": "t", "hooks": {"post_block": [0]}}} ) result = admit( @@ -460,7 +460,7 @@ class _Capturing(CaptureConsumer): def validate_client_spec(self, raw_spec, ctx): # type: ignore[override] received.append(ctx) - return CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") + return CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") def on_capture(self, key, tensor, sidecar): # pragma: no cover pass diff --git a/tests/entrypoints/openai/test_steering_modules.py b/tests/entrypoints/openai/test_steering_modules.py index fa50dafd009c..ea1900ae2337 100644 --- a/tests/entrypoints/openai/test_steering_modules.py +++ b/tests/entrypoints/openai/test_steering_modules.py @@ -41,12 +41,12 @@ def test_both_empty_returns_none(self): def test_first_none_second_has_data(self): spec: SteeringVectorSpec = { - "post_mlp": {14: [1.0, 2.0, 3.0]}, + "post_block": {14: [1.0, 2.0, 3.0]}, } result = merge_steering_specs(None, spec) assert result is not None # Values should be pre-scaled (scale=1.0 for bare list) - assert result["post_mlp"][14].tolist() == [1.0, 2.0, 3.0] + assert result["post_block"][14].tolist() == [1.0, 2.0, 3.0] def test_first_has_data_second_none(self): spec: SteeringVectorSpec = { @@ -57,70 +57,70 @@ def 
test_first_has_data_second_none(self): assert result["pre_attn"][5].tolist() == [0.5, 0.6] def test_non_overlapping_hooks_both_preserved(self): - a: SteeringVectorSpec = {"post_mlp": {14: [1.0, 2.0]}} + a: SteeringVectorSpec = {"post_block": {14: [1.0, 2.0]}} b: SteeringVectorSpec = {"pre_attn": {10: [3.0, 4.0]}} result = merge_steering_specs(a, b) assert result is not None - assert result["post_mlp"][14].tolist() == [1.0, 2.0] + assert result["post_block"][14].tolist() == [1.0, 2.0] assert result["pre_attn"][10].tolist() == [3.0, 4.0] def test_non_overlapping_layers_same_hook(self): - a: SteeringVectorSpec = {"post_mlp": {14: [1.0, 2.0]}} - b: SteeringVectorSpec = {"post_mlp": {15: [3.0, 4.0]}} + a: SteeringVectorSpec = {"post_block": {14: [1.0, 2.0]}} + b: SteeringVectorSpec = {"post_block": {15: [3.0, 4.0]}} result = merge_steering_specs(a, b) assert result is not None - assert result["post_mlp"][14].tolist() == [1.0, 2.0] - assert result["post_mlp"][15].tolist() == [3.0, 4.0] + assert result["post_block"][14].tolist() == [1.0, 2.0] + assert result["post_block"][15].tolist() == [3.0, 4.0] def test_overlapping_hook_layer_added(self): - a: SteeringVectorSpec = {"post_mlp": {14: [1.0, 2.0, 3.0]}} - b: SteeringVectorSpec = {"post_mlp": {14: [0.5, 0.5, 0.5]}} + a: SteeringVectorSpec = {"post_block": {14: [1.0, 2.0, 3.0]}} + b: SteeringVectorSpec = {"post_block": {14: [0.5, 0.5, 0.5]}} result = merge_steering_specs(a, b) assert result is not None - assert result["post_mlp"][14].tolist() == [1.5, 2.5, 3.5] + assert result["post_block"][14].tolist() == [1.5, 2.5, 3.5] def test_overlapping_with_scaled_entries(self): a: SteeringVectorSpec = { - "post_mlp": { + "post_block": { 14: {"vector": [1.0, 2.0], "scale": 2.0}, } } b: SteeringVectorSpec = { - "post_mlp": { + "post_block": { 14: {"vector": [3.0, 4.0], "scale": 0.5}, } } result = merge_steering_specs(a, b) assert result is not None # a scaled: [2.0, 4.0], b scaled: [1.5, 2.0], sum: [3.5, 6.0] - assert 
result["post_mlp"][14].tolist() == [3.5, 6.0] + assert result["post_block"][14].tolist() == [3.5, 6.0] def test_one_scaled_one_bare(self): a: SteeringVectorSpec = { - "post_mlp": { + "post_block": { 14: {"vector": [1.0, 2.0], "scale": 3.0}, } } b: SteeringVectorSpec = { - "post_mlp": { + "post_block": { 14: [0.5, 0.5], } } result = merge_steering_specs(a, b) assert result is not None # a scaled: [3.0, 6.0], b scaled: [0.5, 0.5], sum: [3.5, 6.5] - assert result["post_mlp"][14].tolist() == [3.5, 6.5] + assert result["post_block"][14].tolist() == [3.5, 6.5] def test_passthrough_entry_is_prescaled(self): """Non-overlapping scaled entry should still be pre-scaled.""" spec: SteeringVectorSpec = { - "post_mlp": { + "post_block": { 14: {"vector": [1.0, 2.0], "scale": 0.5}, } } result = merge_steering_specs(spec, None) assert result is not None - assert result["post_mlp"][14].tolist() == [0.5, 1.0] + assert result["post_block"][14].tolist() == [0.5, 1.0] # --------------------------------------------------------------------------- @@ -138,15 +138,15 @@ def test_empty_dict_returns_none(self): assert _convert_layer_keys({}, field_name="vectors") is None def test_converts_string_keys_to_int(self): - spec = {"post_mlp": {"14": [1.0, 2.0], "15": [3.0, 4.0]}} + spec = {"post_block": {"14": [1.0, 2.0], "15": [3.0, 4.0]}} result = _convert_layer_keys(spec, field_name="vectors") assert result is not None - assert 14 in result["post_mlp"] - assert 15 in result["post_mlp"] - assert result["post_mlp"][14] == [1.0, 2.0] + assert 14 in result["post_block"] + assert 15 in result["post_block"] + assert result["post_block"][14] == [1.0, 2.0] def test_rejects_non_dict_layers(self): - spec = {"post_mlp": "not_a_dict"} + spec = {"post_block": "not_a_dict"} with pytest.raises(ValueError, match="must be a JSON object mapping"): _convert_layer_keys(spec, field_name="vectors") @@ -168,19 +168,19 @@ async def test_register_and_get(self): registry = SteeringModuleRegistry() await registry.register( 
name="test_mod", - vectors={"post_mlp": {14: [1.0, 2.0]}}, + vectors={"post_block": {14: [1.0, 2.0]}}, ) module = registry.get("test_mod") assert module is not None assert module.name == "test_mod" - assert module.vectors == {"post_mlp": {14: [1.0, 2.0]}} + assert module.vectors == {"post_block": {14: [1.0, 2.0]}} @pytest.mark.asyncio async def test_register_overwrites_existing(self): registry = SteeringModuleRegistry() await registry.register( name="mod", - vectors={"post_mlp": {14: [1.0]}}, + vectors={"post_block": {14: [1.0]}}, ) await registry.register( name="mod", @@ -189,14 +189,14 @@ async def test_register_overwrites_existing(self): module = registry.get("mod") assert module is not None assert "pre_attn" in module.vectors - assert "post_mlp" not in module.vectors + assert "post_block" not in module.vectors @pytest.mark.asyncio async def test_unregister_existing_returns_true(self): registry = SteeringModuleRegistry() await registry.register( name="mod", - vectors={"post_mlp": {14: [1.0]}}, + vectors={"post_block": {14: [1.0]}}, ) assert await registry.unregister("mod") is True assert registry.get("mod") is None @@ -213,9 +213,9 @@ def test_get_nonexistent_returns_none(self): @pytest.mark.asyncio async def test_list_modules_sorted(self): registry = SteeringModuleRegistry() - await registry.register("charlie", vectors={"post_mlp": {0: [1.0]}}) - await registry.register("alpha", vectors={"post_mlp": {0: [1.0]}}) - await registry.register("bravo", vectors={"post_mlp": {0: [1.0]}}) + await registry.register("charlie", vectors={"post_block": {0: [1.0]}}) + await registry.register("alpha", vectors={"post_block": {0: [1.0]}}) + await registry.register("bravo", vectors={"post_block": {0: [1.0]}}) assert registry.list_modules() == ["alpha", "bravo", "charlie"] @pytest.mark.asyncio @@ -244,7 +244,7 @@ async def test_register_unknown_layer_index_raises(self): with pytest.raises(ValueError, match="unknown layer index 99"): await registry.register( name="bad_layer", - 
vectors={"post_mlp": {99: [1.0]}}, + vectors={"post_block": {99: [1.0]}}, ) @pytest.mark.asyncio @@ -253,7 +253,7 @@ async def test_register_malformed_entry_raises(self): with pytest.raises(TypeError): await registry.register( name="bad_entry", - vectors={"post_mlp": {0: "not_a_list_or_dict"}}, + vectors={"post_block": {0: "not_a_list_or_dict"}}, ) @pytest.mark.asyncio @@ -264,7 +264,7 @@ async def test_register_invalid_vector_contents_raise(self): await registry.register( name="bad_values", vectors={ - "post_mlp": { + "post_block": { 0: { "vector": ["bad", 1.0], "scale": 1.0, @@ -277,7 +277,7 @@ async def test_register_invalid_vector_contents_raise(self): await registry.register( name="bad_scale", vectors={ - "post_mlp": { + "post_block": { 0: { "vector": [1.0, 2.0], "scale": math.nan, @@ -292,7 +292,7 @@ async def test_register_invalid_vector_contents_raise(self): async def test_load_from_file_valid_json(self): registry = SteeringModuleRegistry() data = { - "vectors": {"post_mlp": {"14": [0.1, 0.2, 0.3]}}, + "vectors": {"post_block": {"14": [0.1, 0.2, 0.3]}}, "prefill_vectors": {"pre_attn": {"5": [0.4, 0.5, 0.6]}}, "decode_vectors": None, } @@ -306,7 +306,7 @@ async def test_load_from_file_valid_json(self): assert module is not None assert module.name == "loaded" # Layer keys should be ints - assert 14 in module.vectors["post_mlp"] + assert 14 in module.vectors["post_block"] assert 5 in module.prefill_vectors["pre_attn"] assert module.decode_vectors is None finally: @@ -323,7 +323,7 @@ async def test_load_from_file_converts_string_keys(self): registry = SteeringModuleRegistry() data = { "vectors": { - "post_mlp": {"0": [1.0], "99": [2.0]}, + "post_block": {"0": [1.0], "99": [2.0]}, }, } with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f: @@ -334,10 +334,10 @@ async def test_load_from_file_converts_string_keys(self): await registry.load_from_file("conv_keys", tmp_path) module = registry.get("conv_keys") assert module is not None - assert 
0 in module.vectors["post_mlp"] - assert 99 in module.vectors["post_mlp"] + assert 0 in module.vectors["post_block"] + assert 99 in module.vectors["post_block"] # String keys should NOT be present - assert "0" not in module.vectors["post_mlp"] + assert "0" not in module.vectors["post_block"] finally: os.unlink(tmp_path) @@ -359,7 +359,7 @@ async def test_load_from_file_invalid_vector_contents_raise(self): registry = SteeringModuleRegistry() data = { "vectors": { - "post_mlp": { + "post_block": { "14": { "vector": [1.0, "bad"], "scale": 1.0, @@ -382,7 +382,7 @@ async def test_load_from_file_rejects_non_dict_hook_payload(self): registry = SteeringModuleRegistry() data = { "vectors": { - "post_mlp": [1.0, 2.0], + "post_block": [1.0, 2.0], }, } with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f: @@ -415,7 +415,7 @@ async def test_load_from_file_packed_tier(self): "data": base64.b64encode(stacked.tobytes()).decode("ascii"), } data = { - "vectors": {"post_mlp": packed_hook}, + "vectors": {"post_block": packed_hook}, } with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f: json.dump(data, f) @@ -426,7 +426,7 @@ async def test_load_from_file_packed_tier(self): await registry.load_from_file("packed", tmp_path) module = registry.get("packed") assert module is not None - stored = module.vectors["post_mlp"][14] + stored = module.vectors["post_block"][14] assert isinstance(stored, list) assert [round(v, 5) for v in stored] == [ round(float(x), 5) for x in vec @@ -451,7 +451,7 @@ async def test_load_from_file_packed_with_scales(self): "data": base64.b64encode(stacked.tobytes()).decode("ascii"), "scales": [3.0], } - data = {"vectors": {"post_mlp": packed_hook}} + data = {"vectors": {"post_block": packed_hook}} with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f: json.dump(data, f) tmp_path = f.name @@ -459,7 +459,7 @@ async def test_load_from_file_packed_with_scales(self): try: registry = 
SteeringModuleRegistry() await registry.load_from_file("packed_scaled", tmp_path) - stored = registry.get("packed_scaled").vectors["post_mlp"][14] + stored = registry.get("packed_scaled").vectors["post_block"][14] assert [round(v, 5) for v in stored] == [3.0, 6.0] finally: os.unlink(tmp_path) @@ -478,7 +478,7 @@ async def test_load_from_file_packed_malformed_raises(self): "layer_indices": [14], "data": base64.b64encode(b"\x00" * 8).decode("ascii"), } - data = {"vectors": {"post_mlp": bad_hook}} + data = {"vectors": {"post_block": bad_hook}} with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f: json.dump(data, f) tmp_path = f.name @@ -502,7 +502,7 @@ async def test_init_app_state_only_sets_registry_when_steering_enabled(): engine_client.renderer = MagicMock() engine_client.io_processor = MagicMock() engine_client.collective_rpc = AsyncMock( - return_value=[{0: ["post_mlp"], 1: ["post_mlp"]}] + return_value=[{0: ["post_block"], 1: ["post_block"]}] ) args = Namespace( @@ -613,14 +613,14 @@ async def test_known_name_no_inline(self): registry = SteeringModuleRegistry() await registry.register( "my_mod", - vectors={"post_mlp": {14: [1.0, 2.0]}}, + vectors={"post_block": {14: [1.0, 2.0]}}, prefill_vectors={"pre_attn": {5: [0.5, 0.6]}}, ) v, p, d, err = registry.resolve_for_request("my_mod", None, None, None) assert err is None # Vectors are pre-scaled (scale=1.0 bare lists) assert v is not None - assert v["post_mlp"][14].tolist() == [1.0, 2.0] + assert v["post_block"][14].tolist() == [1.0, 2.0] assert p is not None assert p["pre_attn"][5].tolist() == [0.5, 0.6] assert d is None @@ -630,27 +630,27 @@ async def test_known_name_with_inline_merge(self): registry = SteeringModuleRegistry() await registry.register( "base", - vectors={"post_mlp": {14: [1.0, 2.0]}}, + vectors={"post_block": {14: [1.0, 2.0]}}, ) - inline: SteeringVectorSpec = {"post_mlp": {14: [0.5, 0.5]}} + inline: SteeringVectorSpec = {"post_block": {14: [0.5, 0.5]}} v, p, d, err = 
registry.resolve_for_request("base", inline, None, None) assert err is None assert v is not None - assert v["post_mlp"][14].tolist() == [1.5, 2.5] + assert v["post_block"][14].tolist() == [1.5, 2.5] @pytest.mark.asyncio async def test_named_one_tier_inline_different_tier(self): registry = SteeringModuleRegistry() await registry.register( "named", - vectors={"post_mlp": {14: [1.0, 2.0]}}, + vectors={"post_block": {14: [1.0, 2.0]}}, ) inline_prefill: SteeringVectorSpec = {"pre_attn": {5: [0.3, 0.4]}} v, p, d, err = registry.resolve_for_request("named", None, inline_prefill, None) assert err is None # Named vectors tier assert v is not None - assert v["post_mlp"][14].tolist() == [1.0, 2.0] + assert v["post_block"][14].tolist() == [1.0, 2.0] # Inline prefill tier assert p is not None assert p["pre_attn"][5].tolist() == [0.3, 0.4] @@ -660,8 +660,8 @@ async def test_named_one_tier_inline_different_tier(self): @pytest.mark.asyncio async def test_error_message_lists_available_modules(self): registry = SteeringModuleRegistry() - await registry.register("a", vectors={"post_mlp": {0: [1.0]}}) - await registry.register("b", vectors={"post_mlp": {0: [1.0]}}) + await registry.register("a", vectors={"post_block": {0: [1.0]}}) + await registry.register("b", vectors={"post_block": {0: [1.0]}}) _, _, _, err = registry.resolve_for_request("missing", None, None, None) assert err is not None assert "['a', 'b']" in err @@ -671,10 +671,10 @@ async def test_dimension_mismatch_returns_error(self): registry = SteeringModuleRegistry() await registry.register( "named", - vectors={"post_mlp": {14: [1.0, 2.0]}}, + vectors={"post_block": {14: [1.0, 2.0]}}, ) - inline: SteeringVectorSpec = {"post_mlp": {14: [0.5]}} + inline: SteeringVectorSpec = {"post_block": {14: [0.5]}} v, p, d, err = registry.resolve_for_request("named", inline, None, None) assert v is None diff --git a/tests/entrypoints/openai/test_steering_protocol.py b/tests/entrypoints/openai/test_steering_protocol.py index 
d8f16889446b..6e43b6bc61fd 100644 --- a/tests/entrypoints/openai/test_steering_protocol.py +++ b/tests/entrypoints/openai/test_steering_protocol.py @@ -129,7 +129,7 @@ def test_legacy_scaled_dict_rejected(self): with pytest.raises(ValidationError): _make_chat( steering_vectors={ - "post_mlp": {10: {"vector": [0.4, 0.5, 0.6], "scale": 2.0}} + "post_block": {10: {"vector": [0.4, 0.5, 0.6], "scale": 2.0}} } ) @@ -160,11 +160,11 @@ def test_to_sampling_params_none_when_absent(self): def test_per_row_scales_applied_at_unpack(self): packed = { - "post_mlp": _pack({10: [1.0] * _HIDDEN}, scales=[2.0]), + "post_block": _pack({10: [1.0] * _HIDDEN}, scales=[2.0]), } req = _make_chat(steering_vectors=packed) sp = req.to_sampling_params(max_tokens=100, default_sampling_params={}) - assert sp.steering_vectors["post_mlp"][10].tolist() == pytest.approx( + assert sp.steering_vectors["post_block"][10].tolist() == pytest.approx( [2.0] * _HIDDEN ) @@ -213,7 +213,7 @@ def test_legacy_scaled_dict_rejected(self): with pytest.raises(ValidationError): _make_completion( steering_vectors={ - "post_mlp": {10: {"vector": [0.4, 0.5, 0.6], "scale": 2.0}} + "post_block": {10: {"vector": [0.4, 0.5, 0.6], "scale": 2.0}} } ) @@ -243,11 +243,11 @@ def test_to_sampling_params_none_when_absent(self): def test_per_row_scales_applied_at_unpack(self): packed = { - "post_mlp": _pack({10: [1.0] * _HIDDEN}, scales=[2.0]), + "post_block": _pack({10: [1.0] * _HIDDEN}, scales=[2.0]), } req = _make_completion(steering_vectors=packed) sp = req.to_sampling_params(max_tokens=100) - assert sp.steering_vectors["post_mlp"][10].tolist() == pytest.approx( + assert sp.steering_vectors["post_block"][10].tolist() == pytest.approx( [2.0] * _HIDDEN ) diff --git a/tests/entrypoints/serve/steering/test_api_router.py b/tests/entrypoints/serve/steering/test_api_router.py index 493324767bf7..214089c7c8b7 100644 --- a/tests/entrypoints/serve/steering/test_api_router.py +++ b/tests/entrypoints/serve/steering/test_api_router.py @@ 
-477,7 +477,7 @@ class TestNormalizeSpec: def test_normalize_spec_drops_empty_hook(self): """Hooks whose layer dict is empty are dropped from the result. - An input like ``{"post_mlp": {}}`` is functionally + An input like ``{"post_block": {}}`` is functionally equivalent to omitting the hook entirely: no layers and no vectors would be applied. Keeping the empty hook in the normalized spec would produce a truthy-but-empty entry that diff --git a/tests/entrypoints/serve/steering/test_api_router_distributed.py b/tests/entrypoints/serve/steering/test_api_router_distributed.py index a1e6179e3734..b16ded0c8585 100644 --- a/tests/entrypoints/serve/steering/test_api_router_distributed.py +++ b/tests/entrypoints/serve/steering/test_api_router_distributed.py @@ -143,7 +143,7 @@ class TestErrorConsolidation: def test_size_mismatch_single_400(self, client, engine): """SteeringVectorError from any rank → single 400 with clean message.""" engine.collective_rpc.side_effect = SteeringVectorError( - "Rank 1: Layer 0 (post_mlp): expected vector of size 128, got 2" + "Rank 1: Layer 0 (post_block): expected vector of size 128, got 2" ) resp = client.post("/v1/steering/set", json=_vecs({0: [1.0, 2.0]})) assert resp.status_code == 400 @@ -153,7 +153,7 @@ def test_size_mismatch_single_400(self, client, engine): def test_non_finite_single_400(self, client, engine): engine.collective_rpc.side_effect = RuntimeError( - "Rank 0: Layer 0 (post_mlp): steering vector contains " + "Rank 0: Layer 0 (post_block): steering vector contains " "non-finite values (NaN or Infinity)" ) resp = client.post("/v1/steering/set", json=_vecs({0: [1.0, 2.0]})) @@ -169,31 +169,31 @@ class TestDeepMergeStatus: def test_merges_disjoint(self): result = deep_merge_status( [ - {0: {"post_mlp": {"norm": 1.0}}}, - {5: {"post_mlp": {"norm": 2.5}}}, + {0: {"post_block": {"norm": 1.0}}}, + {5: {"post_block": {"norm": 2.5}}}, ] ) assert result == { - 0: {"post_mlp": {"norm": 1.0}}, - 5: {"post_mlp": {"norm": 2.5}}, + 0: 
{"post_block": {"norm": 1.0}}, + 5: {"post_block": {"norm": 2.5}}, } def test_merges_identical_tp_duplicates(self): """TP ranks report identical state — merge must not raise.""" result = deep_merge_status( [ - {0: {"post_mlp": {"norm": 1.0}}}, - {0: {"post_mlp": {"norm": 1.0}}}, + {0: {"post_block": {"norm": 1.0}}}, + {0: {"post_block": {"norm": 1.0}}}, ] ) - assert result == {0: {"post_mlp": {"norm": 1.0}}} + assert result == {0: {"post_block": {"norm": 1.0}}} def test_raises_on_divergence(self): with pytest.raises(RuntimeError, match="divergence"): deep_merge_status( [ - {0: {"post_mlp": {"norm": 1.0}}}, - {0: {"post_mlp": {"norm": 2.0}}}, + {0: {"post_block": {"norm": 1.0}}}, + {0: {"post_block": {"norm": 2.0}}}, ] ) @@ -206,8 +206,8 @@ def test_handles_empty_inputs(self): class TestGetSteeringDivergence: def test_divergence_surfaces_as_500(self, client, engine): engine.collective_rpc.return_value = [ - {0: {"post_mlp": {"norm": 1.0}}}, - {0: {"post_mlp": {"norm": 2.0}}}, + {0: {"post_block": {"norm": 1.0}}}, + {0: {"post_block": {"norm": 2.0}}}, ] resp = client.get("/v1/steering") assert resp.status_code == 500 @@ -221,17 +221,17 @@ class TestGetSteeringLayers: def test_merges_hook_points_across_workers(self, client, engine): """PP-disjoint layers + TP-identical hooks are merged correctly.""" engine.collective_rpc.return_value = [ - {0: ["post_mlp"], 1: ["post_mlp", "pre_attn"]}, - {0: ["post_mlp"], 1: ["post_mlp", "pre_attn"]}, - {2: ["post_mlp"], 3: ["post_mlp"]}, - {2: ["post_mlp"], 3: ["post_mlp"]}, + {0: ["post_block"], 1: ["post_block", "pre_attn"]}, + {0: ["post_block"], 1: ["post_block", "pre_attn"]}, + {2: ["post_block"], 3: ["post_block"]}, + {2: ["post_block"], 3: ["post_block"]}, ] resp = client.get("/v1/steering/layers") assert resp.status_code == 200 layers = resp.json()["layers"] assert set(layers.keys()) == {"0", "1", "2", "3"} - assert layers["1"]["hook_points"] == ["post_mlp", "pre_attn"] - assert layers["2"]["hook_points"] == ["post_mlp"] + 
assert layers["1"]["hook_points"] == ["post_block", "pre_attn"] + assert layers["2"]["hook_points"] == ["post_block"] def test_empty_worker_results(self, client, engine): engine.collective_rpc.return_value = [{}, {}] diff --git a/tests/entrypoints/serve/steering/test_protocol.py b/tests/entrypoints/serve/steering/test_protocol.py index 5f586246b2d7..1d98b098a94d 100644 --- a/tests/entrypoints/serve/steering/test_protocol.py +++ b/tests/entrypoints/serve/steering/test_protocol.py @@ -11,19 +11,19 @@ class TestSetSteeringRequest: """Validate SetSteeringRequest Pydantic model.""" def test_basic_vectors(self): - req = SetSteeringRequest(vectors={"post_mlp": {0: [1.0, 2.0], 5: [3.0, 4.0]}}) + req = SetSteeringRequest(vectors={"post_block": {0: [1.0, 2.0], 5: [3.0, 4.0]}}) assert req.vectors is not None - assert req.vectors["post_mlp"][0] == [1.0, 2.0] + assert req.vectors["post_block"][0] == [1.0, 2.0] assert req.prefill_vectors is None assert req.decode_vectors is None assert req.replace is False def test_with_co_located_scale(self): req = SetSteeringRequest( - vectors={"post_mlp": {0: {"vector": [1.0, 2.0], "scale": 2.5}}}, + vectors={"post_block": {0: {"vector": [1.0, 2.0], "scale": 2.5}}}, ) assert req.vectors is not None - entry = req.vectors["post_mlp"][0] + entry = req.vectors["post_block"][0] assert isinstance(entry, dict) assert entry["vector"] == [1.0, 2.0] assert entry["scale"] == 2.5 @@ -36,7 +36,7 @@ def test_replace_flag(self): assert req.replace is True def test_replace_defaults_false(self): - req = SetSteeringRequest(vectors={"post_mlp": {0: [1.0]}}) + req = SetSteeringRequest(vectors={"post_block": {0: [1.0]}}) assert req.replace is False def test_empty_vectors_allowed(self): @@ -54,22 +54,22 @@ def test_all_fields_none_by_default(self): def test_string_keys_coerced_to_int(self): """JSON dict keys are strings; Pydantic should coerce to int.""" req = SetSteeringRequest.model_validate( - {"vectors": {"post_mlp": {"0": [1.0, 2.0]}}} + {"vectors": 
{"post_block": {"0": [1.0, 2.0]}}} ) assert req.vectors is not None - assert 0 in req.vectors["post_mlp"] + assert 0 in req.vectors["post_block"] def test_full_request(self): req = SetSteeringRequest( vectors={ "pre_attn": {0: [1.0, 0.5]}, - "post_mlp": {3: [0.0, 1.0]}, + "post_block": {3: [0.0, 1.0]}, }, prefill_vectors={ "pre_attn": {0: {"vector": [0.1, 0.2], "scale": 2.0}}, }, decode_vectors={ - "post_mlp": {3: [0.5, 0.5]}, + "post_block": {3: [0.5, 0.5]}, }, replace=True, ) @@ -84,7 +84,7 @@ def test_multiple_hook_points(self): vectors={ "pre_attn": {0: [1.0]}, "post_attn": {0: [2.0]}, - "post_mlp": {0: [3.0]}, + "post_block": {0: [3.0]}, } ) assert req.vectors is not None @@ -106,7 +106,7 @@ def test_unknown_field_rejected(self): with pytest.raises(pydantic.ValidationError): SetSteeringRequest( - vectors={"post_mlp": {0: [1.0, 2.0]}}, + vectors={"post_block": {0: [1.0, 2.0]}}, scales={0: 2.5}, ) @@ -117,7 +117,7 @@ def test_unknown_field_via_model_validate(self): with pytest.raises(pydantic.ValidationError): SetSteeringRequest.model_validate( { - "vectors": {"post_mlp": {"0": [1.0]}}, + "vectors": {"post_block": {"0": [1.0]}}, "scales": {"0": 2.5}, } ) diff --git a/tests/entrypoints/serve/steering/test_worker_steering.py b/tests/entrypoints/serve/steering/test_worker_steering.py index 89d652571b85..c71916e6c251 100644 --- a/tests/entrypoints/serve/steering/test_worker_steering.py +++ b/tests/entrypoints/serve/steering/test_worker_steering.py @@ -3,7 +3,7 @@ """Unit tests for steering model-runner mixin methods using a mock model. All hook-point-aware tests use the default hook point -(``post_mlp``) unless testing multi-hook-point behaviour. +(``post_block``) unless testing multi-hook-point behaviour. Tests cover three-tier steering (base, prefill, decode) and co-located scale format (bare list vs dict with scale). 
@@ -22,7 +22,7 @@ from vllm.v1.worker.worker_base import WorkerBase # Shorthand for test readability -_HP = DEFAULT_HOOK_POINT.value # "post_mlp" +_HP = DEFAULT_HOOK_POINT.value # "post_block" class FakeDecoderLayer(nn.Module): @@ -31,9 +31,9 @@ class FakeDecoderLayer(nn.Module): def __init__(self, layer_idx: int, hidden_size: int, max_steering_configs: int = 0): super().__init__() self.layer_idx = layer_idx - # Default hook point buffers (post_mlp) — table + index only + # Default hook point buffers (post_block) — table + index only self.register_buffer( - "steering_table_post_mlp", + "steering_table_post_block", torch.zeros(max_steering_configs + 2, hidden_size), persistent=False, ) diff --git a/tests/v1/capture/consumers/filesystem/test_coalescing.py b/tests/v1/capture/consumers/filesystem/test_coalescing.py index 363269fe51f8..b8f93f2e05d8 100644 --- a/tests/v1/capture/consumers/filesystem/test_coalescing.py +++ b/tests/v1/capture/consumers/filesystem/test_coalescing.py @@ -52,7 +52,7 @@ def _drive( for r in range(num_requests): req = f"req_{r:04d}" steps = rng.randint(1, max_steps) - layer, hook = rng.randint(0, 5), "post_mlp" + layer, hook = rng.randint(0, 5), "post_block" d = root / req d.mkdir(parents=True, exist_ok=True) bp = d / f"{layer}_{hook}.bin" diff --git a/tests/v1/capture/consumers/filesystem/test_consumer.py b/tests/v1/capture/consumers/filesystem/test_consumer.py index e9099a6f8ce1..c224054b90ad 100644 --- a/tests/v1/capture/consumers/filesystem/test_consumer.py +++ b/tests/v1/capture/consumers/filesystem/test_consumer.py @@ -388,12 +388,12 @@ def test_parallel_sizes_accepted_for_residual_hooks( raw = FilesystemCaptureRequest( request_id="par-req", tag="par-tag", - hooks={"post_mlp": [0, 1, 2]}, + hooks={"post_block": [0, 1, 2]}, positions="last_prompt", ) spec = consumer.validate_client_spec(raw, ctx) assert isinstance(spec, CaptureSpec) - assert spec.hooks["post_mlp"] == [0, 1, 2] + assert spec.hooks["post_block"] == [0, 1, 2] finally: 
consumer.shutdown(timeout=5.0) @@ -412,7 +412,7 @@ def test_layouts_accepted_under_pipeline_parallelism(self, layout: str) -> None: raw = FilesystemCaptureRequest( request_id="pp-accepts", tag="pp-accepts", - hooks={"post_mlp": [0, 1]}, + hooks={"post_block": [0, 1]}, positions="last_prompt", layout=layout, ) @@ -505,7 +505,7 @@ def test_ok_after_finalize(self, tmp_path: pathlib.Path) -> None: consumer = _make_consumer(tmp_path) try: request_id = VllmInternalRequestId("result-test") - key: CaptureKey = (request_id, 1, "post_mlp") + key: CaptureKey = (request_id, 1, "post_block") tensor = torch.randn(1, 4, dtype=torch.float32) consumer.submit_chunk( @@ -548,7 +548,7 @@ def test_returns_ok_after_finalize(self, tmp_path: pathlib.Path) -> None: consumer = _make_consumer(tmp_path) try: request_id = VllmInternalRequestId("wait-ok") - key: CaptureKey = (request_id, 2, "post_mlp") + key: CaptureKey = (request_id, 2, "post_block") tensor = torch.randn(1, 4, dtype=torch.float32) consumer.submit_chunk( diff --git a/tests/v1/capture/consumers/filesystem/test_packed.py b/tests/v1/capture/consumers/filesystem/test_packed.py index e9f04aed81bb..b48c1cc5e3cb 100644 --- a/tests/v1/capture/consumers/filesystem/test_packed.py +++ b/tests/v1/capture/consumers/filesystem/test_packed.py @@ -100,10 +100,10 @@ def _write_packed( class TestReader: def test_per_file_round_trip(self, tmp_path: pathlib.Path) -> None: arr = np.arange(2 * 8, dtype=np.float32).reshape(2, 8) - _write_per_file(tmp_path / "req-1", 3, "post_mlp", arr, "float32") - entry = read_per_file(tmp_path / "req-1" / "3_post_mlp.bin") + _write_per_file(tmp_path / "req-1", 3, "post_block", arr, "float32") + entry = read_per_file(tmp_path / "req-1" / "3_post_block.bin") assert entry.layer == 3 - assert entry.hook == "post_mlp" + assert entry.hook == "post_block" assert entry.dtype == "float32" np.testing.assert_array_equal(entry.array, arr) @@ -111,44 +111,44 @@ def test_packed_round_trip(self, tmp_path: pathlib.Path) -> None: a 
= np.random.randn(4, 16).astype(np.float32) b = np.random.randn(1, 16).astype(np.float32) c = np.random.randn(7, 16).astype(np.float32) - tensors = [(0, "post_mlp", a), (5, "post_mlp", b), (5, "post_attn", c)] + tensors = [(0, "post_block", a), (5, "post_block", b), (5, "post_attn", c)] _write_packed(tmp_path / "req-2", tensors, "float32") got = read_packed(tmp_path / "req-2") - assert set(got) == {(0, "post_mlp"), (5, "post_mlp"), (5, "post_attn")} - np.testing.assert_array_equal(got[(0, "post_mlp")].array, a) - np.testing.assert_array_equal(got[(5, "post_mlp")].array, b) + assert set(got) == {(0, "post_block"), (5, "post_block"), (5, "post_attn")} + np.testing.assert_array_equal(got[(0, "post_block")].array, a) + np.testing.assert_array_equal(got[(5, "post_block")].array, b) np.testing.assert_array_equal(got[(5, "post_attn")].array, c) def test_packed_accepts_index_or_bin_or_dir(self, tmp_path: pathlib.Path) -> None: arr = np.random.randn(2, 4).astype(np.float32) - _write_packed(tmp_path / "r", [(1, "post_mlp", arr)], "float32") + _write_packed(tmp_path / "r", [(1, "post_block", arr)], "float32") for target in ( tmp_path / "r", tmp_path / "r" / PACKED_INDEX_NAME, tmp_path / "r" / PACKED_BIN_NAME, ): got = read_packed(target) - np.testing.assert_array_equal(got[(1, "post_mlp")].array, arr) + np.testing.assert_array_equal(got[(1, "post_block")].array, arr) def test_read_request_autodetects_layout(self, tmp_path: pathlib.Path) -> None: # per_file dir pf = tmp_path / "pf" - _write_per_file(pf, 0, "post_mlp", np.ones((2, 4), np.float32), "float32") - _write_per_file(pf, 1, "post_mlp", np.zeros((3, 4), np.float32), "float32") + _write_per_file(pf, 0, "post_block", np.ones((2, 4), np.float32), "float32") + _write_per_file(pf, 1, "post_block", np.zeros((3, 4), np.float32), "float32") got_pf = read_request(pf) - assert set(got_pf) == {(0, "post_mlp"), (1, "post_mlp")} + assert set(got_pf) == {(0, "post_block"), (1, "post_block")} # packed dir pk = tmp_path / "pk" - 
_write_packed(pk, [(0, "post_mlp", np.ones((2, 4), np.float32))], "float32") + _write_packed(pk, [(0, "post_block", np.ones((2, 4), np.float32))], "float32") got_pk = read_request(pk) - assert set(got_pk) == {(0, "post_mlp")} + assert set(got_pk) == {(0, "post_block")} def test_bfloat16_returns_uint16(self, tmp_path: pathlib.Path) -> None: # bf16 is stored as raw uint16; the reader returns it as uint16. raw = np.array([1, 2, 3, 4], dtype=np.uint16).reshape(2, 2) - _write_per_file(tmp_path / "bf", 0, "post_mlp", raw, "bfloat16") - entry = read_per_file(tmp_path / "bf" / "0_post_mlp.bin") + _write_per_file(tmp_path / "bf", 0, "post_block", raw, "bfloat16") + entry = read_per_file(tmp_path / "bf" / "0_post_block.bin") assert entry.dtype == "bfloat16" assert entry.array.dtype == np.uint16 np.testing.assert_array_equal(entry.array, raw) @@ -156,7 +156,7 @@ def test_bfloat16_returns_uint16(self, tmp_path: pathlib.Path) -> None: def test_truncated_packed_raises(self, tmp_path: pathlib.Path) -> None: arr = np.random.randn(4, 8).astype(np.float32) d = tmp_path / "trunc" - _write_packed(d, [(0, "post_mlp", arr)], "float32") + _write_packed(d, [(0, "post_block", arr)], "float32") # Corrupt: truncate the bin so the entry's bytes are missing. (d / PACKED_BIN_NAME).write_bytes(b"\x00\x00") try: @@ -276,70 +276,70 @@ def test_packed_round_trip(self, tmp_path: pathlib.Path) -> None: req = "req-pk" c = _consumer(tmp_path) try: - _register_packed(c, req, {"post_mlp": [0, 2]}) - # (0,post_mlp) spans 2 steps; (2,post_mlp) one step. Submit + _register_packed(c, req, {"post_block": [0, 2]}) + # (0,post_block) spans 2 steps; (2,post_block) one step. Submit # interleaved across keys to exercise per-chunk indexing. 
a0 = torch.randn(2, 8, dtype=torch.float32) a1 = torch.randn(3, 8, dtype=torch.float32) b0 = torch.randn(1, 8, dtype=torch.float32) - c.submit_chunk(_chunk(req, 0, "post_mlp", a0, 0)) - c.submit_chunk(_chunk(req, 2, "post_mlp", b0, 0)) - c.submit_chunk(_chunk(req, 0, "post_mlp", a1, 1)) + c.submit_chunk(_chunk(req, 0, "post_block", a0, 0)) + c.submit_chunk(_chunk(req, 2, "post_block", b0, 0)) + c.submit_chunk(_chunk(req, 0, "post_block", a1, 1)) for layer in (0, 2): - c.submit_finalize(_finalize(req, layer, "post_mlp")) + c.submit_finalize(_finalize(req, layer, "post_block")) - key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_mlp") + key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_block") assert _wait(c, key0).status == "ok" req_dir = tmp_path / "t" / req assert (req_dir / PACKED_BIN_NAME).exists() assert (req_dir / PACKED_INDEX_NAME).exists() - assert not list(req_dir.glob("*_post_mlp.bin")), "no per-file bins" + assert not list(req_dir.glob("*_post_block.bin")), "no per-file bins" got = read_request(req_dir) - assert set(got) == {(0, "post_mlp"), (2, "post_mlp")} + assert set(got) == {(0, "post_block"), (2, "post_block")} np.testing.assert_array_equal( - got[(0, "post_mlp")].array, torch.cat([a0, a1]).numpy() + got[(0, "post_block")].array, torch.cat([a0, a1]).numpy() ) - np.testing.assert_array_equal(got[(2, "post_mlp")].array, b0.numpy()) + np.testing.assert_array_equal(got[(2, "post_block")].array, b0.numpy()) finally: c.shutdown(timeout=5.0) def test_submit_chunk_batch_round_trip(self, tmp_path: pathlib.Path) -> None: # Batched submit: a step's worth of (layer) chunks handed over in # one call must produce the same packed file as per-chunk submits. - # Two steps batched; (0,post_mlp) spans both, (2,post_mlp) only + # Two steps batched; (0,post_block) spans both, (2,post_block) only # step 0 — concatenation order must follow submission order. 
req = "req-batch" c = _consumer(tmp_path) try: - _register_packed(c, req, {"post_mlp": [0, 2]}) + _register_packed(c, req, {"post_block": [0, 2]}) a0 = torch.randn(2, 8, dtype=torch.float32) b0 = torch.randn(1, 8, dtype=torch.float32) a1 = torch.randn(3, 8, dtype=torch.float32) # step 0: both layers, in one batch c.submit_chunk_batch( - [_chunk(req, 0, "post_mlp", a0, 0), _chunk(req, 2, "post_mlp", b0, 0)] + [_chunk(req, 0, "post_block", a0, 0), _chunk(req, 2, "post_block", b0, 0)] ) # step 1: only layer 0 - c.submit_chunk_batch([_chunk(req, 0, "post_mlp", a1, 1)]) + c.submit_chunk_batch([_chunk(req, 0, "post_block", a1, 1)]) for layer in (0, 2): - c.submit_finalize(_finalize(req, layer, "post_mlp")) + c.submit_finalize(_finalize(req, layer, "post_block")) - key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_mlp") + key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_block") assert _wait(c, key0).status == "ok" req_dir = tmp_path / "t" / req # One packed file for the whole request, no per-file bins. 
assert (req_dir / PACKED_BIN_NAME).exists() - assert not list(req_dir.glob("*_post_mlp.bin")) + assert not list(req_dir.glob("*_post_block.bin")) got = read_request(req_dir) - assert set(got) == {(0, "post_mlp"), (2, "post_mlp")} + assert set(got) == {(0, "post_block"), (2, "post_block")} np.testing.assert_array_equal( - got[(0, "post_mlp")].array, torch.cat([a0, a1]).numpy() + got[(0, "post_block")].array, torch.cat([a0, a1]).numpy() ) - np.testing.assert_array_equal(got[(2, "post_mlp")].array, b0.numpy()) + np.testing.assert_array_equal(got[(2, "post_block")].array, b0.numpy()) finally: c.shutdown(timeout=5.0) @@ -352,9 +352,9 @@ def test_batch_matches_per_chunk_bytes(self, tmp_path: pathlib.Path) -> None: def run(req: str, batched: bool) -> bytes: c = _consumer(tmp_path) try: - _register_packed(c, req, {"post_mlp": layers}) + _register_packed(c, req, {"post_block": layers}) step_chunks = [ - _chunk(req, layer, "post_mlp", tensors[layer], 0) + _chunk(req, layer, "post_block", tensors[layer], 0) for layer in layers ] if batched: @@ -363,8 +363,8 @@ def run(req: str, batched: bool) -> bytes: for ch in step_chunks: c.submit_chunk(ch) for layer in layers: - c.submit_finalize(_finalize(req, layer, "post_mlp")) - key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_mlp") + c.submit_finalize(_finalize(req, layer, "post_block")) + key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_block") assert _wait(c, key0).status == "ok" return (tmp_path / "t" / req / PACKED_BIN_NAME).read_bytes() finally: @@ -378,19 +378,19 @@ def test_finalize_aggregation_waits_for_all_keys( req = "req-agg" c = _consumer(tmp_path) try: - _register_packed(c, req, {"post_mlp": [0, 1]}) - c.submit_chunk(_chunk(req, 0, "post_mlp", torch.randn(2, 8), 0)) - c.submit_chunk(_chunk(req, 1, "post_mlp", torch.randn(2, 8), 0)) + _register_packed(c, req, {"post_block": [0, 1]}) + c.submit_chunk(_chunk(req, 0, "post_block", torch.randn(2, 8), 0)) + c.submit_chunk(_chunk(req, 1, "post_block", 
torch.randn(2, 8), 0)) # Finalize only the first key — packed file must NOT publish. - c.submit_finalize(_finalize(req, 0, "post_mlp")) + c.submit_finalize(_finalize(req, 0, "post_block")) time.sleep(0.2) req_dir = tmp_path / "t" / req assert not (req_dir / PACKED_INDEX_NAME).exists(), ( "packed index published before all keys finalized" ) # Finalize the last expected key — now it publishes. - c.submit_finalize(_finalize(req, 1, "post_mlp")) - key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_mlp") + c.submit_finalize(_finalize(req, 1, "post_block")) + key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_block") assert _wait(c, key0).status == "ok" assert (req_dir / PACKED_INDEX_NAME).exists() finally: @@ -402,14 +402,14 @@ def test_zero_chunk_key(self, tmp_path: pathlib.Path) -> None: req = "req-zero" c = _consumer(tmp_path) try: - _register_packed(c, req, {"post_mlp": [0, 1]}) - c.submit_chunk(_chunk(req, 0, "post_mlp", torch.randn(2, 8), 0)) + _register_packed(c, req, {"post_block": [0, 1]}) + c.submit_chunk(_chunk(req, 0, "post_block", torch.randn(2, 8), 0)) for layer in (0, 1): # key 1 had no chunk - c.submit_finalize(_finalize(req, layer, "post_mlp")) - key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_mlp") + c.submit_finalize(_finalize(req, layer, "post_block")) + key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_block") assert _wait(c, key0).status == "ok" got = read_request(tmp_path / "t" / req) - assert set(got) == {(0, "post_mlp")} + assert set(got) == {(0, "post_block")} finally: c.shutdown(timeout=5.0) @@ -421,19 +421,19 @@ def test_per_file_default_unchanged(self, tmp_path: pathlib.Path) -> None: raw = FilesystemCaptureRequest( request_id=req, tag="t", - hooks={"post_mlp": [0]}, + hooks={"post_block": [0]}, positions="last_prompt", ) c.validate_client_spec(raw, _ctx(req)) t0 = torch.randn(2, 8, dtype=torch.float32) - c.submit_chunk(_chunk(req, 0, "post_mlp", t0, 0)) - c.submit_finalize(_finalize(req, 0, "post_mlp")) - key0: 
CaptureKey = (VllmInternalRequestId(req), 0, "post_mlp") + c.submit_chunk(_chunk(req, 0, "post_block", t0, 0)) + c.submit_finalize(_finalize(req, 0, "post_block")) + key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_block") assert _wait(c, key0).status == "ok" req_dir = tmp_path / "t" / req - assert (req_dir / "0_post_mlp.bin").exists() + assert (req_dir / "0_post_block.bin").exists() assert not (req_dir / PACKED_INDEX_NAME).exists() - entry = read_per_file(req_dir / "0_post_mlp.bin") + entry = read_per_file(req_dir / "0_post_block.bin") np.testing.assert_array_equal(entry.array, t0.numpy()) assert entry.dtype == "float32" # sidecar now self-describing finally: @@ -451,21 +451,21 @@ def test_dict_spec_layout_packed(self, tmp_path: pathlib.Path) -> None: { "request_id": req, "tag": "t", - "hooks": {"post_mlp": [0, 1]}, + "hooks": {"post_block": [0, 1]}, "positions": "last_prompt", "layout": "packed", }, _ctx(req), ) for layer in (0, 1): - c.submit_chunk(_chunk(req, layer, "post_mlp", torch.randn(2, 8), 0)) + c.submit_chunk(_chunk(req, layer, "post_block", torch.randn(2, 8), 0)) for layer in (0, 1): - c.submit_finalize(_finalize(req, layer, "post_mlp")) - key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_mlp") + c.submit_finalize(_finalize(req, layer, "post_block")) + key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_block") assert _wait(c, key0).status == "ok" req_dir = tmp_path / "t" / req assert (req_dir / PACKED_INDEX_NAME).exists() - assert set(read_request(req_dir)) == {(0, "post_mlp"), (1, "post_mlp")} + assert set(read_request(req_dir)) == {(0, "post_block"), (1, "post_block")} finally: c.shutdown(timeout=5.0) @@ -478,17 +478,17 @@ def test_dict_spec_defaults_per_file(self, tmp_path: pathlib.Path) -> None: { "request_id": req, "tag": "t", - "hooks": {"post_mlp": [0]}, + "hooks": {"post_block": [0]}, "positions": "last_prompt", }, _ctx(req), ) - c.submit_chunk(_chunk(req, 0, "post_mlp", torch.randn(2, 8), 0)) - 
c.submit_finalize(_finalize(req, 0, "post_mlp")) - key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_mlp") + c.submit_chunk(_chunk(req, 0, "post_block", torch.randn(2, 8), 0)) + c.submit_finalize(_finalize(req, 0, "post_block")) + key0: CaptureKey = (VllmInternalRequestId(req), 0, "post_block") assert _wait(c, key0).status == "ok" req_dir = tmp_path / "t" / req - assert (req_dir / "0_post_mlp.bin").exists() + assert (req_dir / "0_post_block.bin").exists() assert not (req_dir / PACKED_INDEX_NAME).exists() finally: c.shutdown(timeout=5.0) @@ -507,7 +507,7 @@ def test_invalid_layout_rejected(self, tmp_path: pathlib.Path) -> None: { "request_id": req, "tag": "t", - "hooks": {"post_mlp": [0]}, + "hooks": {"post_block": [0]}, "positions": "last_prompt", "layout": "bogus", }, @@ -539,7 +539,7 @@ def test_two_stage_split_and_merge(self, tmp_path: pathlib.Path) -> None: # back into the request's full layer set. req = "req-pp" total = 4 - hooks = {"post_mlp": [0, 1, 2, 3]} + hooks = {"post_block": [0, 1, 2, 3]} tensors = {layer: torch.randn(2, 8, dtype=torch.float32) for layer in range(4)} c0 = _pp_consumer(tmp_path, pp_size=2, pp_rank=0, total_layers=total) c1 = _pp_consumer(tmp_path, pp_size=2, pp_rank=1, total_layers=total) @@ -548,16 +548,16 @@ def test_two_stage_split_and_merge(self, tmp_path: pathlib.Path) -> None: c1.validate_client_spec(_pp_raw(req, hooks), _ctx(req)) # The manager only feeds each stage its owned layers. 
for layer in (0, 1): - c0.submit_chunk(_chunk(req, layer, "post_mlp", tensors[layer], 0)) + c0.submit_chunk(_chunk(req, layer, "post_block", tensors[layer], 0)) for layer in (0, 1): - c0.submit_finalize(_finalize(req, layer, "post_mlp")) + c0.submit_finalize(_finalize(req, layer, "post_block")) for layer in (2, 3): - c1.submit_chunk(_chunk(req, layer, "post_mlp", tensors[layer], 0)) + c1.submit_chunk(_chunk(req, layer, "post_block", tensors[layer], 0)) for layer in (2, 3): - c1.submit_finalize(_finalize(req, layer, "post_mlp")) + c1.submit_finalize(_finalize(req, layer, "post_block")) - assert _wait(c0, (VllmInternalRequestId(req), 0, "post_mlp")).status == "ok" - assert _wait(c1, (VllmInternalRequestId(req), 2, "post_mlp")).status == "ok" + assert _wait(c0, (VllmInternalRequestId(req), 0, "post_block")).status == "ok" + assert _wait(c1, (VllmInternalRequestId(req), 2, "post_block")).status == "ok" req_dir = tmp_path / "t" / req # Per-stage files exist; the pp-agnostic packed.json does not. @@ -569,10 +569,10 @@ def test_two_stage_split_and_merge(self, tmp_path: pathlib.Path) -> None: assert not (req_dir / PACKED_BIN_NAME).exists() got = read_request(req_dir) - assert set(got) == {(layer, "post_mlp") for layer in range(4)} + assert set(got) == {(layer, "post_block") for layer in range(4)} for layer in range(4): np.testing.assert_array_equal( - got[(layer, "post_mlp")].array, tensors[layer].numpy() + got[(layer, "post_block")].array, tensors[layer].numpy() ) finally: c0.shutdown(timeout=5.0) @@ -588,7 +588,7 @@ def test_stage_owning_no_layers_writes_nothing( c1 = _pp_consumer(tmp_path, pp_size=2, pp_rank=1, total_layers=4) try: spec = c1.validate_client_spec( - _pp_raw(req, {"post_mlp": [0, 1]}), _ctx(req) + _pp_raw(req, {"post_block": [0, 1]}), _ctx(req) ) assert isinstance(spec, CaptureSpec) # No accumulation state for a stage that owns none of the layers. 
@@ -604,8 +604,8 @@ def test_expected_keys_filtered_to_local_slice( req = "req-pp-expected" c1 = _pp_consumer(tmp_path, pp_size=2, pp_rank=1, total_layers=4) try: - c1.validate_client_spec(_pp_raw(req, {"post_mlp": [0, 1, 2, 3]}), _ctx(req)) + c1.validate_client_spec(_pp_raw(req, {"post_block": [0, 1, 2, 3]}), _ctx(req)) state = c1._packed_states[req] - assert state.expected_keys == {(2, "post_mlp"), (3, "post_mlp")} + assert state.expected_keys == {(2, "post_block"), (3, "post_block")} finally: c1.shutdown(timeout=5.0) diff --git a/tests/v1/capture/consumers/filesystem/test_sharded.py b/tests/v1/capture/consumers/filesystem/test_sharded.py index c724a04b0b0d..ae0b548be513 100644 --- a/tests/v1/capture/consumers/filesystem/test_sharded.py +++ b/tests/v1/capture/consumers/filesystem/test_sharded.py @@ -86,17 +86,17 @@ def test_round_trip_multi_request(self, tmp_path: pathlib.Path) -> None: 0, 0, [ - ("reqA", 0, "post_mlp", a), - ("reqB", 0, "post_mlp", b), - ("reqA", 1, "post_mlp", c), + ("reqA", 0, "post_block", a), + ("reqB", 0, "post_block", b), + ("reqA", 1, "post_block", c), ], "float32", ) got = read_sharded(tag) assert set(got) == {"reqA", "reqB"} - np.testing.assert_array_equal(got["reqA"][(0, "post_mlp")].array, a) - np.testing.assert_array_equal(got["reqA"][(1, "post_mlp")].array, c) - np.testing.assert_array_equal(got["reqB"][(0, "post_mlp")].array, b) + np.testing.assert_array_equal(got["reqA"][(0, "post_block")].array, a) + np.testing.assert_array_equal(got["reqA"][(1, "post_block")].array, c) + np.testing.assert_array_equal(got["reqB"][(0, "post_block")].array, b) def test_request_spanning_two_shards(self, tmp_path: pathlib.Path) -> None: # reqA L0 has rows in seq 0 then seq 1 (sealed mid-request); reader @@ -104,11 +104,11 @@ def test_request_spanning_two_shards(self, tmp_path: pathlib.Path) -> None: tag = tmp_path / "t" a0 = np.arange(2 * 8, dtype=np.float32).reshape(2, 8) a1 = (np.arange(3 * 8, dtype=np.float32) + 100).reshape(3, 8) - 
_write_shard(tag, 0, 0, [("reqA", 0, "post_mlp", a0)], "float32") - _write_shard(tag, 0, 1, [("reqA", 0, "post_mlp", a1)], "float32") + _write_shard(tag, 0, 0, [("reqA", 0, "post_block", a0)], "float32") + _write_shard(tag, 0, 1, [("reqA", 0, "post_block", a1)], "float32") got = read_sharded(tag) np.testing.assert_array_equal( - got["reqA"][(0, "post_mlp")].array, np.concatenate([a0, a1]) + got["reqA"][(0, "post_block")].array, np.concatenate([a0, a1]) ) @@ -182,7 +182,7 @@ def test_many_requests_one_shard_round_trip(self, tmp_path: pathlib.Path) -> Non expected: dict[str, dict[tuple[int, str], np.ndarray]] = {} try: for rid in reqs: - _register(c, rid, {"post_mlp": [0, 1]}) + _register(c, rid, {"post_block": [0, 1]}) # Interleave chunks across requests and layers; 2 steps each. tensors: dict = {} for step in range(2): @@ -190,15 +190,15 @@ def test_many_requests_one_shard_round_trip(self, tmp_path: pathlib.Path) -> Non for layer in (0, 1): t = torch.randn(2, 8, dtype=torch.float32) tensors.setdefault((rid, layer), []).append(t) - c.submit_chunk(_chunk(rid, layer, "post_mlp", t, step)) + c.submit_chunk(_chunk(rid, layer, "post_block", t, step)) for rid in reqs: for layer in (0, 1): - c.submit_finalize(_finalize(rid, layer, "post_mlp")) - expected.setdefault(rid, {})[(layer, "post_mlp")] = torch.cat( + c.submit_finalize(_finalize(rid, layer, "post_block")) + expected.setdefault(rid, {})[(layer, "post_block")] = torch.cat( tensors[(rid, layer)] ).numpy() # results are ok before seal (data captured, readable after seal) - r = _wait(c, (VllmInternalRequestId("req0"), 0, "post_mlp")) + r = _wait(c, (VllmInternalRequestId("req0"), 0, "post_block")) assert r is not None and r.status == "ok" assert r.payload and all("shard-" in p for p in r.payload) finally: @@ -209,8 +209,8 @@ def test_many_requests_one_shard_round_trip(self, tmp_path: pathlib.Path) -> Non for rid in reqs: for layer in (0, 1): np.testing.assert_array_equal( - got[rid][(layer, "post_mlp")].array, - 
expected[rid][(layer, "post_mlp")], + got[rid][(layer, "post_block")].array, + expected[rid][(layer, "post_block")], ) def test_size_based_sealing_rotates(self, tmp_path: pathlib.Path) -> None: @@ -218,14 +218,14 @@ def test_size_based_sealing_rotates(self, tmp_path: pathlib.Path) -> None: # Each row is 8*4=32 bytes; cap at 200 bytes -> seal every ~6 rows. c = _consumer(tmp_path, num_shards=1, shard_max_bytes=200) try: - _register(c, "r", {"post_mlp": [0]}) + _register(c, "r", {"post_block": [0]}) tensors = [] for step in range(20): t = torch.randn(1, 8, dtype=torch.float32) tensors.append(t) - c.submit_chunk(_chunk("r", 0, "post_mlp", t, step)) - c.submit_finalize(_finalize("r", 0, "post_mlp")) - assert _wait(c, (VllmInternalRequestId("r"), 0, "post_mlp")).status == "ok" + c.submit_chunk(_chunk("r", 0, "post_block", t, step)) + c.submit_finalize(_finalize("r", 0, "post_block")) + assert _wait(c, (VllmInternalRequestId("r"), 0, "post_block")).status == "ok" finally: c.shutdown(timeout=5.0) tag = tmp_path / "t" @@ -233,7 +233,7 @@ def test_size_based_sealing_rotates(self, tmp_path: pathlib.Path) -> None: assert len(shards) >= 2, f"expected rotation into multiple shards, got {shards}" got = read_sharded(tag) np.testing.assert_array_equal( - got["r"][(0, "post_mlp")].array, torch.cat(tensors).numpy() + got["r"][(0, "post_block")].array, torch.cat(tensors).numpy() ) @@ -280,7 +280,7 @@ def test_two_stage_shards_merge(self, tmp_path: pathlib.Path) -> None: # stage seals its own shard-pp{rank} files; read_sharded merges by # request across both, recovering the full layer set. 
req = "req" - hooks = {"post_mlp": [0, 1, 2, 3]} + hooks = {"post_block": [0, 1, 2, 3]} tensors = {layer: torch.randn(2, 8, dtype=torch.float32) for layer in range(4)} c0 = _pp_consumer(tmp_path, pp_rank=0, num_shards=1) c1 = _pp_consumer(tmp_path, pp_rank=1, num_shards=1) @@ -288,11 +288,11 @@ def test_two_stage_shards_merge(self, tmp_path: pathlib.Path) -> None: _register(c0, req, hooks) _register(c1, req, hooks) for layer in (0, 1): - c0.submit_chunk(_chunk(req, layer, "post_mlp", tensors[layer], 0)) - c0.submit_finalize(_finalize(req, layer, "post_mlp")) + c0.submit_chunk(_chunk(req, layer, "post_block", tensors[layer], 0)) + c0.submit_finalize(_finalize(req, layer, "post_block")) for layer in (2, 3): - c1.submit_chunk(_chunk(req, layer, "post_mlp", tensors[layer], 0)) - c1.submit_finalize(_finalize(req, layer, "post_mlp")) + c1.submit_chunk(_chunk(req, layer, "post_block", tensors[layer], 0)) + c1.submit_finalize(_finalize(req, layer, "post_block")) finally: c0.shutdown(timeout=5.0) # seal each stage's open shard c1.shutdown(timeout=5.0) @@ -303,10 +303,10 @@ def test_two_stage_shards_merge(self, tmp_path: pathlib.Path) -> None: assert sorted(p.name for p in tag.glob("shard-pp01-*.bin")) got = read_sharded(tag) assert set(got) == {req} - assert set(got[req]) == {(layer, "post_mlp") for layer in range(4)} + assert set(got[req]) == {(layer, "post_block") for layer in range(4)} for layer in range(4): np.testing.assert_array_equal( - got[req][(layer, "post_mlp")].array, tensors[layer].numpy() + got[req][(layer, "post_block")].array, tensors[layer].numpy() ) def test_stage_owning_no_layers_creates_no_state( @@ -315,7 +315,7 @@ def test_stage_owning_no_layers_creates_no_state( req = "req-skip" c1 = _pp_consumer(tmp_path, pp_rank=1, num_shards=1) try: - _register(c1, req, {"post_mlp": [0, 1]}) # all on stage 0 + _register(c1, req, {"post_block": [0, 1]}) # all on stage 0 assert req not in c1._sharded_requests finally: c1.shutdown(timeout=5.0) diff --git 
a/tests/v1/capture/consumers/test_logging.py b/tests/v1/capture/consumers/test_logging.py index cae5c094657d..811ebab33eb5 100644 --- a/tests/v1/capture/consumers/test_logging.py +++ b/tests/v1/capture/consumers/test_logging.py @@ -26,7 +26,7 @@ _LOGGER_NAME = "vllm.capture.logging" -def _key(req_id: str = "req-1", layer: int = 0, hook: str = "post_mlp") -> CaptureKey: +def _key(req_id: str = "req-1", layer: int = 0, hook: str = "post_block") -> CaptureKey: return (VllmInternalRequestId(req_id), layer, hook) @@ -60,7 +60,7 @@ def test_construction(): """LoggingConsumer constructs without error when given valid params.""" consumer = LoggingConsumer( _MOCK_CONFIG, - {"hooks": {"post_mlp": [0]}, "positions": "last_prompt"}, + {"hooks": {"post_block": [0]}, "positions": "last_prompt"}, ) assert consumer is not None @@ -75,11 +75,11 @@ def test_global_capture_spec_returns_configured_spec(): and positions.""" consumer = LoggingConsumer( _MOCK_CONFIG, - {"hooks": {"post_mlp": [0, 1], "pre_attn": [2]}, "positions": "all"}, + {"hooks": {"post_block": [0, 1], "pre_attn": [2]}, "positions": "all"}, ) spec = consumer.global_capture_spec() assert isinstance(spec, CaptureSpec) - assert spec.hooks == {"post_mlp": [0, 1], "pre_attn": [2]} + assert spec.hooks == {"post_block": [0, 1], "pre_attn": [2]} assert spec.positions == "all" @@ -93,7 +93,7 @@ def test_on_capture_logs_key_rows_dtype(caplog: pytest.LogCaptureFixture): dtype.""" consumer = LoggingConsumer( _MOCK_CONFIG, - {"hooks": {"post_mlp": [0]}}, + {"hooks": {"post_block": [0]}}, ) key = _key() tensor = torch.randn(5, 16) @@ -117,7 +117,7 @@ def test_custom_level_debug(caplog: pytest.LogCaptureFixture): """Construct with level='DEBUG', verify log message at DEBUG level.""" consumer = LoggingConsumer( _MOCK_CONFIG, - {"hooks": {"post_mlp": [0]}, "level": "DEBUG"}, + {"hooks": {"post_block": [0]}, "level": "DEBUG"}, ) key = _key() tensor = torch.randn(3, 8) @@ -138,7 +138,7 @@ def test_default_positions(): """Construct 
without positions param, verify default is 'last_prompt'.""" consumer = LoggingConsumer( _MOCK_CONFIG, - {"hooks": {"post_mlp": [0]}}, + {"hooks": {"post_block": [0]}}, ) spec = consumer.global_capture_spec() assert spec.positions == "last_prompt" diff --git a/tests/v1/capture/test_consumer_base.py b/tests/v1/capture/test_consumer_base.py index 63e447629fb4..4ae2cd5daf04 100644 --- a/tests/v1/capture/test_consumer_base.py +++ b/tests/v1/capture/test_consumer_base.py @@ -31,7 +31,7 @@ # --------------------------------------------------------------------------- -def _key(req_id: str = "req-1", layer: int = 3, hook: str = "post_mlp") -> CaptureKey: +def _key(req_id: str = "req-1", layer: int = 3, hook: str = "post_block") -> CaptureKey: return (VllmInternalRequestId(req_id), layer, hook) @@ -165,7 +165,7 @@ def test_multiple_keys_finalize_independently(): adapter = _BatchedAdapter(consumer) key_a = _key("req-a", layer=1, hook="pre_attn") - key_b = _key("req-b", layer=5, hook="post_mlp") + key_b = _key("req-b", layer=5, hook="post_block") adapter.submit_chunk( CaptureChunk( @@ -330,7 +330,7 @@ def __init__(self) -> None: self.sums: dict[CaptureKey, float] = {} def global_capture_spec(self) -> CaptureSpec | None: - return CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") + return CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") def on_capture( self, @@ -347,7 +347,7 @@ def test_hello_world_consumer_through_batched_adapter(): spec = consumer.global_capture_spec() assert spec is not None - assert spec.hooks == {"post_mlp": [0]} + assert spec.hooks == {"post_block": [0]} assert spec.positions == "last_prompt" key = _key() diff --git a/tests/v1/capture/test_driver_bridge.py b/tests/v1/capture/test_driver_bridge.py index 6b468a140ef6..7a15f08af024 100644 --- a/tests/v1/capture/test_driver_bridge.py +++ b/tests/v1/capture/test_driver_bridge.py @@ -40,7 +40,7 @@ # --------------------------------------------------------------------------- -def 
_key(req_id: str = "req-1", layer: int = 3, hook: str = "post_mlp") -> CaptureKey: +def _key(req_id: str = "req-1", layer: int = 3, hook: str = "post_block") -> CaptureKey: return (VllmInternalRequestId(req_id), layer, hook) @@ -158,7 +158,7 @@ def test_multiple_keys_finalize_independently(): shim = _DriverQueueShim(event_q, result_q, timeout=5.0) key_a = _key("req-a", layer=1, hook="pre_attn") - key_b = _key("req-b", layer=5, hook="post_mlp") + key_b = _key("req-b", layer=5, hook="post_block") shim.submit_chunk( CaptureChunk( diff --git a/tests/v1/capture/test_manager.py b/tests/v1/capture/test_manager.py index 251c587058ef..5ab123c6ca69 100644 --- a/tests/v1/capture/test_manager.py +++ b/tests/v1/capture/test_manager.py @@ -60,7 +60,7 @@ def _make_manager( if specs is None: specs = ( CaptureSpec( - hooks={"post_mlp": [0, 1]}, + hooks={"post_block": [0, 1]}, positions="last_prompt", ), ) * len(sinks) @@ -134,10 +134,10 @@ def test_register_build_dispatch_finalize(self): ) plan = mgr.build_step_plan(view) - # The spec asks for post_mlp at layers [0, 1] and "last_prompt" + # The spec asks for post_block at layers [0, 1] and "last_prompt" # which is position 9 for a 10-token prompt. 
- assert (0, "post_mlp") in plan.gather_indices - assert (1, "post_mlp") in plan.gather_indices + assert (0, "post_block") in plan.gather_indices + assert (1, "post_block") in plan.gather_indices assert len(plan.entries) == 2 # one entry per layer for entry in plan.entries: @@ -190,7 +190,7 @@ def _make_buffer_manager( sinks = (_make_sink(),) if specs is None: specs = ( - CaptureSpec(hooks={"post_mlp": [0, 1]}, positions="last_prompt"), + CaptureSpec(hooks={"post_block": [0, 1]}, positions="last_prompt"), ) * len(sinks) mgr = CaptureManager( consumers=sinks, @@ -206,7 +206,7 @@ def _make_buffer_manager( class TestGlobalSpecBufferPath: def test_buffers_allocated_for_global_keys(self): mgr, _ = _make_buffer_manager(max_num_tokens=16) - assert mgr._global_keys == frozenset({(0, "post_mlp"), (1, "post_mlp")}) + assert mgr._global_keys == frozenset({(0, "post_block"), (1, "post_block")}) for key in mgr._global_keys: buf = mgr._global_buffers[key] assert buf.shape == (16, HIDDEN_SIZE) @@ -235,10 +235,10 @@ def test_build_step_plan_routes_global_keys_to_global_gather(self): plan = mgr.build_step_plan(view) # Global keys take the buffer path, not the dynamic in-hook gather. assert plan.gather_indices == {} - assert (0, "post_mlp") in plan.global_gather_indices - assert (1, "post_mlp") in plan.global_gather_indices + assert (0, "post_block") in plan.global_gather_indices + assert (1, "post_block") in plan.global_gather_indices # last_prompt of a 10-token prompt is absolute row 9. 
- assert plan.global_gather_indices[(0, "post_mlp")].tolist() == [9] + assert plan.global_gather_indices[(0, "post_block")].tolist() == [9] assert len(plan.entries) == 2 def test_on_hook_copies_full_residual_into_buffer(self): @@ -255,13 +255,13 @@ def test_on_hook_copies_full_residual_into_buffer(self): hidden = torch.arange(10 * HIDDEN_SIZE, dtype=MODEL_DTYPE).reshape( 10, HIDDEN_SIZE ) - mgr.on_hook(0, "post_mlp", hidden) - buf = mgr._global_buffers[(0, "post_mlp")] + mgr.on_hook(0, "post_block", hidden) + buf = mgr._global_buffers[(0, "post_block")] # The full residual is copied (fixed-shape, graph-safe), not gathered. torch.testing.assert_close(buf[:10], hidden) # on_hook must not populate scratch for global keys (host does that # post-forward in _materialize_global_keys). - assert (0, "post_mlp") not in mgr._step_plan.scratch_gpu + assert (0, "post_block") not in mgr._step_plan.scratch_gpu def test_materialize_dispatch_finalize_via_buffer(self): mgr, (sink,) = _make_buffer_manager(max_num_tokens=16) @@ -279,8 +279,8 @@ def test_materialize_dispatch_finalize_via_buffer(self): hidden = torch.arange(10 * HIDDEN_SIZE, dtype=MODEL_DTYPE).reshape( 10, HIDDEN_SIZE ) - mgr.on_hook(0, "post_mlp", hidden) - mgr.on_hook(1, "post_mlp", hidden + 1000.0) + mgr.on_hook(0, "post_block", hidden) + mgr.on_hook(1, "post_block", hidden + 1000.0) mgr.dispatch_step_captures(plan) mgr._drain_dispatch_queue() @@ -308,7 +308,7 @@ def test_global_and_client_keys_coexist(self): mgr, _ = _make_buffer_manager( sinks=(global_sink, client_sink), specs=( - CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt"), + CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt"), None, # consumer 1 has no global spec — client-driven ), max_num_tokens=16, @@ -323,15 +323,15 @@ def test_global_and_client_keys_coexist(self): ) plan = mgr.build_step_plan(view) # Global key on the buffer path; client key on the dynamic path. 
- assert (0, "post_mlp") in plan.global_gather_indices + assert (0, "post_block") in plan.global_gather_indices assert (2, "pre_attn") in plan.gather_indices - assert (0, "post_mlp") not in plan.gather_indices + assert (0, "post_block") not in plan.gather_indices hidden = torch.arange(10 * HIDDEN_SIZE, dtype=MODEL_DTYPE).reshape( 10, HIDDEN_SIZE ) # Global key: full-residual copy. Client key: dynamic gather (eager). - mgr.on_hook(0, "post_mlp", hidden) + mgr.on_hook(0, "post_block", hidden) mgr.on_hook(2, "pre_attn", hidden + 1000.0) # The client key's scratch was populated by the dynamic gather. assert (2, "pre_attn") in plan.scratch_gpu @@ -353,7 +353,7 @@ def test_union_gather_both_dispatched(self): sink0 = _make_sink("sink0") sink1 = _make_sink("sink1") spec = CaptureSpec( - hooks={"post_mlp": [0]}, + hooks={"post_block": [0]}, positions="last_prompt", ) @@ -371,7 +371,7 @@ def test_union_gather_both_dispatched(self): ) plan = mgr.build_step_plan(view) - # Union: only one entry for (layer=0, post_mlp, pos=9), but the + # Union: only one entry for (layer=0, post_block, pos=9), but the # consumer_mask should have bits 0 and 1 set. assert len(plan.entries) == 1 entry = plan.entries[0] @@ -388,8 +388,8 @@ def test_union_gather_both_dispatched(self): def test_different_layers_produce_separate_entries(self): sink0 = _make_sink("sink0") sink1 = _make_sink("sink1") - spec0 = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") - spec1 = CaptureSpec(hooks={"post_mlp": [1]}, positions="last_prompt") + spec0 = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") + spec1 = CaptureSpec(hooks={"post_block": [1]}, positions="last_prompt") mgr, _ = _make_manager( sinks=(sink0, sink1), @@ -405,8 +405,8 @@ def test_different_layers_produce_separate_entries(self): ) plan = mgr.build_step_plan(view) - # Two entries: (layer=0, post_mlp) for consumer 0, - # (layer=1, post_mlp) for consumer 1. 
+ # Two entries: (layer=0, post_block) for consumer 0, + # (layer=1, post_block) for consumer 1. assert len(plan.entries) == 2 masks = {e.layer: e.consumer_mask for e in plan.entries} assert masks[0] == 0b01 # only consumer 0 @@ -421,7 +421,7 @@ def test_different_layers_produce_separate_entries(self): class TestPerRequestClientSpec: def test_client_spec_overrides_global(self): sink = _make_sink() - global_spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") + global_spec = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") client_spec = CaptureSpec(hooks={"pre_attn": [2]}, positions="all_prompt") mgr, _ = _make_manager(sinks=(sink,), specs=(global_spec,)) @@ -438,20 +438,20 @@ def test_client_spec_overrides_global(self): # Should use client spec: pre_attn at layer 2, all_prompt = [0..4]. assert (2, "pre_attn") in plan.gather_indices - assert (0, "post_mlp") not in plan.gather_indices + assert (0, "post_block") not in plan.gather_indices assert len(plan.entries) == 5 def test_client_spec_for_specific_consumer_only(self): """Only consumer 1 gets a client spec; consumer 0 uses global.""" sink0 = _make_sink("sink0") sink1 = _make_sink("sink1") - global0 = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") + global0 = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") mgr, _ = _make_manager( sinks=(sink0, sink1), specs=(global0, None), ) - client1 = CaptureSpec(hooks={"post_mlp": [0]}, positions="all_prompt") + client1 = CaptureSpec(hooks={"post_block": [0]}, positions="all_prompt") mgr.register_request( "r1", client_specs={1: client1}, @@ -467,8 +467,8 @@ def test_client_spec_for_specific_consumer_only(self): plan = mgr.build_step_plan(view) # Consumer 0 wants position 4 (last_prompt), consumer 1 wants [0..4]. - # The union at (layer=0, post_mlp) should be [0, 1, 2, 3, 4]. - assert (0, "post_mlp") in plan.gather_indices + # The union at (layer=0, post_block) should be [0, 1, 2, 3, 4]. 
+ assert (0, "post_block") in plan.gather_indices assert len(plan.entries) == 5 # positions 0,1,2,3,4 # Position 4 should have both consumers' bits. @@ -493,7 +493,7 @@ def test_failing_submit_chunk_does_not_block_other_consumer(self): sink1 = _make_sink("sink1") sink0.submit_chunk.side_effect = RuntimeError("sink0 exploded") - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") mgr, _ = _make_manager( sinks=(sink0, sink1), specs=(spec, spec), @@ -522,7 +522,7 @@ def test_failing_submit_finalize_does_not_block_other_consumer(self): sink1 = _make_sink("sink1") sink0.submit_finalize.side_effect = RuntimeError("finalize boom") - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") mgr, _ = _make_manager( sinks=(sink0, sink1), specs=(spec, spec), @@ -568,10 +568,10 @@ class TestFinalizeResults: def test_returns_dict_keyed_by_consumer_index(self): sink0 = _make_sink("sink0") sink1 = _make_sink("sink1") - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") # Make sink0 return a specific result. 
- expected_key = (VllmInternalRequestId("r1"), 0, "post_mlp") + expected_key = (VllmInternalRequestId("r1"), 0, "post_block") sink0.wait_for_result.return_value = CaptureResult( key=expected_key, status="ok", payload={"path": "/tmp/test"} ) @@ -600,11 +600,11 @@ def test_finalize_unknown_request_returns_empty(self): def test_finalize_aggregates_all_keys_and_preserves_payloads(self): sink = _make_sink("sink0") spec = CaptureSpec( - hooks={"post_mlp": [0, 1]}, + hooks={"post_block": [0, 1]}, positions="last_prompt", ) - key0 = (VllmInternalRequestId("r1"), 0, "post_mlp") - key1 = (VllmInternalRequestId("r1"), 1, "post_mlp") + key0 = (VllmInternalRequestId("r1"), 0, "post_block") + key1 = (VllmInternalRequestId("r1"), 1, "post_block") payload0 = {"path": "/tmp/layer0"} payload1 = {"path": "/tmp/layer1"} @@ -636,11 +636,11 @@ def _wait_for_result(key: CaptureKey, timeout: float) -> CaptureResult: def test_finalize_uses_worst_key_result(self): sink = _make_sink("sink0") spec = CaptureSpec( - hooks={"post_mlp": [0, 1]}, + hooks={"post_block": [0, 1]}, positions="last_prompt", ) - key0 = (VllmInternalRequestId("r1"), 0, "post_mlp") - key1 = (VllmInternalRequestId("r1"), 1, "post_mlp") + key0 = (VllmInternalRequestId("r1"), 0, "post_block") + key1 = (VllmInternalRequestId("r1"), 1, "post_block") def _wait_for_result(key: CaptureKey, timeout: float) -> CaptureResult: if key == key0: @@ -672,8 +672,8 @@ def _wait_for_result(key: CaptureKey, timeout: float) -> CaptureResult: def test_finalize_timeout_becomes_error(self): sink = _make_sink("sink0") - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") - key = (VllmInternalRequestId("r1"), 0, "post_mlp") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") + key = (VllmInternalRequestId("r1"), 0, "post_block") sink.wait_for_result.return_value = None mgr, _ = _make_manager(sinks=(sink,), specs=(spec,)) @@ -688,9 +688,9 @@ def test_finalize_timeout_becomes_error(self): class 
TestAggregateCaptureResults: def test_prefers_error_over_partial_error_over_ok(self): - key_ok = (VllmInternalRequestId("r1"), 0, "post_mlp") - key_partial = (VllmInternalRequestId("r1"), 1, "post_mlp") - key_error = (VllmInternalRequestId("r1"), 2, "post_mlp") + key_ok = (VllmInternalRequestId("r1"), 0, "post_block") + key_partial = (VllmInternalRequestId("r1"), 1, "post_block") + key_error = (VllmInternalRequestId("r1"), 2, "post_block") result = _aggregate_capture_results( [ @@ -720,7 +720,7 @@ def test_prefers_error_over_partial_error_over_ok(self): } def test_single_result_preserves_payload_shape(self): - key = (VllmInternalRequestId("r1"), 0, "post_mlp") + key = (VllmInternalRequestId("r1"), 0, "post_block") payload = ["/tmp/capture.bin"] result = _aggregate_capture_results( @@ -767,7 +767,7 @@ def test_finalize_after_unregister_returns_empty(self): class TestPositionExpansion: def test_last_prompt(self): - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") mgr, _ = _make_manager(specs=(spec,)) mgr.register_request("r1", client_specs=None, num_prompt_tokens=10) @@ -782,7 +782,7 @@ def test_last_prompt(self): assert positions == [9] def test_all_prompt(self): - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="all_prompt") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="all_prompt") mgr, _ = _make_manager(specs=(spec,)) mgr.register_request("r1", client_specs=None, num_prompt_tokens=5) @@ -797,7 +797,7 @@ def test_all_prompt(self): assert positions == [0, 1, 2, 3, 4] def test_all_generated(self): - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="all_generated") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="all_generated") mgr, _ = _make_manager(specs=(spec,)) mgr.register_request("r1", client_specs=None, num_prompt_tokens=5) @@ -824,7 +824,7 @@ def test_all_generated(self): assert positions == [5] def test_all(self): - spec = 
CaptureSpec(hooks={"post_mlp": [0]}, positions="all") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="all") mgr, _ = _make_manager(specs=(spec,)) mgr.register_request("r1", client_specs=None, num_prompt_tokens=3) @@ -851,7 +851,7 @@ def test_all(self): assert positions == [3] def test_explicit_list(self): - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions=[2, 7]) + spec = CaptureSpec(hooks={"post_block": [0]}, positions=[2, 7]) mgr, _ = _make_manager(specs=(spec,)) mgr.register_request("r1", client_specs=None, num_prompt_tokens=10) @@ -874,7 +874,7 @@ def test_explicit_list(self): class TestStepWindowIntersection: def test_positions_outside_window_excluded(self): """Explicit list with some positions outside the current window.""" - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions=[0, 5, 9]) + spec = CaptureSpec(hooks={"post_block": [0]}, positions=[0, 5, 9]) mgr, _ = _make_manager(specs=(spec,)) mgr.register_request("r1", client_specs=None, num_prompt_tokens=10) @@ -891,7 +891,7 @@ def test_positions_outside_window_excluded(self): def test_all_prompt_only_captures_scheduled_window(self): """all_prompt is [0..9] but window [0, 3) only captures 0,1,2.""" - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="all_prompt") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="all_prompt") mgr, _ = _make_manager(specs=(spec,)) mgr.register_request("r1", client_specs=None, num_prompt_tokens=10) @@ -907,7 +907,7 @@ def test_all_prompt_only_captures_scheduled_window(self): def test_decode_step_window(self): """During decode, the window is [N, N+1) for one token.""" - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="all") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="all") mgr, _ = _make_manager(specs=(spec,)) mgr.register_request("r1", client_specs=None, num_prompt_tokens=5) @@ -1036,7 +1036,7 @@ def test_client_spec_out_of_range_raises(self): mgr.register_request( "r1", client_specs={ - 99: CaptureSpec(hooks={"post_mlp": 
[0]}, positions="last_prompt") + 99: CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") }, num_prompt_tokens=10, ) @@ -1047,7 +1047,7 @@ def test_layer_out_of_range_raises(self): mgr.register_request( "r1", client_specs={ - 0: CaptureSpec(hooks={"post_mlp": [999]}, positions="last_prompt") + 0: CaptureSpec(hooks={"post_block": [999]}, positions="last_prompt") }, num_prompt_tokens=10, ) @@ -1094,7 +1094,7 @@ def _captured_layers(mgr: CaptureManager) -> set[int]: class TestLocalLayerRangeFiltering: def test_first_stage_keeps_only_its_layers(self): - spec = CaptureSpec(hooks={"post_mlp": [2, 6]}, positions="last_prompt") + spec = CaptureSpec(hooks={"post_block": [2, 6]}, positions="last_prompt") mgr, sink = _make_pp_manager((0, 4), spec) assert _captured_layers(mgr) == {2} # Finalize touches only the in-range layer (layer 2), not layer 6. @@ -1105,17 +1105,17 @@ def test_first_stage_keeps_only_its_layers(self): assert finalized_layers == {2} def test_second_stage_keeps_only_its_layers(self): - spec = CaptureSpec(hooks={"post_mlp": [2, 6]}, positions="last_prompt") + spec = CaptureSpec(hooks={"post_block": [2, 6]}, positions="last_prompt") mgr, _ = _make_pp_manager((4, 8), spec) assert _captured_layers(mgr) == {6} def test_none_range_keeps_all_layers(self): - spec = CaptureSpec(hooks={"post_mlp": [2, 6]}, positions="last_prompt") + spec = CaptureSpec(hooks={"post_block": [2, 6]}, positions="last_prompt") mgr, _ = _make_pp_manager(None, spec) assert _captured_layers(mgr) == {2, 6} def test_all_layers_out_of_local_range_inactive(self): - spec = CaptureSpec(hooks={"post_mlp": [6, 7]}, positions="last_prompt") + spec = CaptureSpec(hooks={"post_block": [6, 7]}, positions="last_prompt") mgr, _ = _make_pp_manager((0, 4), spec) mgr.register_request("r1", client_specs=None, num_prompt_tokens=10) # No requested layer lives on this stage → request not registered. 
@@ -1125,7 +1125,7 @@ def test_all_layers_out_of_local_range_inactive(self): def test_out_of_global_range_still_raises_per_stage(self): # A genuinely out-of-range layer is rejected even though it is also # outside this stage's local slice. - spec = CaptureSpec(hooks={"post_mlp": [100]}, positions="last_prompt") + spec = CaptureSpec(hooks={"post_block": [100]}, positions="last_prompt") mgr, _ = _make_pp_manager((0, 4), spec) with pytest.raises(ValueError, match="out of range"): mgr.register_request("r1", client_specs=None, num_prompt_tokens=10) @@ -1133,7 +1133,7 @@ def test_out_of_global_range_still_raises_per_stage(self): def test_partial_hook_layers_filtered(self): # Multiple hooks, each split across the stage boundary. spec = CaptureSpec( - hooks={"post_mlp": [1, 5], "post_attn": [3, 7]}, + hooks={"post_block": [1, 5], "post_attn": [3, 7]}, positions="last_prompt", ) mgr, _ = _make_pp_manager((0, 4), spec) @@ -1145,11 +1145,11 @@ def test_partial_hook_layers_filtered(self): num_scheduled_tokens=[10], ) plan = mgr.build_step_plan(view) - assert set(plan.gather_indices) == {(1, "post_mlp"), (3, "post_attn")} + assert set(plan.gather_indices) == {(1, "post_block"), (3, "post_attn")} @pytest.mark.parametrize("bad_range", [(-1, 4), (4, 2), (0, 9)]) def test_invalid_local_range_rejected(self, bad_range): - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") with pytest.raises(ValueError, match="local_layer_range"): _make_pp_manager(bad_range, spec) @@ -1161,7 +1161,7 @@ def test_invalid_local_range_rejected(self, bad_range): def _result(req: str, layer: int, status: str = "ok", payload=None) -> CaptureResult: return CaptureResult( - key=(VllmInternalRequestId(req), layer, "post_mlp"), + key=(VllmInternalRequestId(req), layer, "post_block"), status=status, payload=payload, ) @@ -1237,8 +1237,8 @@ def test_none_target_is_noop(self): class TestFinalizeAsync: def 
test_callback_receives_aggregated_results(self): sink = _make_sink("sink0") - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") - key = (VllmInternalRequestId("r1"), 0, "post_mlp") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") + key = (VllmInternalRequestId("r1"), 0, "post_block") sink.wait_for_result.return_value = CaptureResult( key=key, status="ok", payload={"path": "/tmp/x"} ) @@ -1270,8 +1270,8 @@ def test_does_not_block_the_caller(self): # The caller (model-runner step thread) must return before the # blocking wait_for_result completes. sink = _make_sink("sink0") - spec = CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt") - key = (VllmInternalRequestId("r1"), 0, "post_mlp") + spec = CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt") + key = (VllmInternalRequestId("r1"), 0, "post_block") entered = threading.Event() release = threading.Event() diff --git a/tests/v1/capture/test_multi_consumer_runner.py b/tests/v1/capture/test_multi_consumer_runner.py index 63a1a29600d2..1429b21e5c29 100644 --- a/tests/v1/capture/test_multi_consumer_runner.py +++ b/tests/v1/capture/test_multi_consumer_runner.py @@ -82,7 +82,7 @@ def test_two_consumers_both_see_captures(tmp_path: pathlib.Path) -> None: # is what ``build_consumers`` does for ``CaptureConsumer`` subclasses. 
recording = _RecordingConsumer( _FakeVllmConfig(), - params={"hooks": {"post_mlp": [1]}, "positions": "last_prompt"}, + params={"hooks": {"post_block": [1]}, "positions": "last_prompt"}, ) recording_sink = _BatchedAdapter(recording) @@ -109,7 +109,7 @@ def test_two_consumers_both_see_captures(tmp_path: pathlib.Path) -> None: req_id = "req-multi-1" fs_client_spec = CaptureSpec( - hooks={"post_mlp": [1]}, + hooks={"post_block": [1]}, positions="last_prompt", ) @@ -136,7 +136,7 @@ def test_two_consumers_both_see_captures(tmp_path: pathlib.Path) -> None: ) plan = mgr.build_step_plan(batch_view) hidden = torch.arange(32, dtype=torch.float32).reshape(4, 8) - mgr.on_hook(1, "post_mlp", hidden) + mgr.on_hook(1, "post_block", hidden) mgr.dispatch_step_captures(plan) # Finalize — indexed by consumer index. @@ -152,12 +152,12 @@ def test_two_consumers_both_see_captures(tmp_path: pathlib.Path) -> None: # Give filesystem writer time to flush before asserting the on-disk # result status. - _wait_for_filesystem_result(fs_consumer, (req_id, 1, "post_mlp")) + _wait_for_filesystem_result(fs_consumer, (req_id, 1, "post_block")) # Recording consumer received the capture via ``on_capture``. assert len(recording.captured) == 1 rec_key, rec_shape = recording.captured[0] - assert rec_key == (VllmInternalRequestId(req_id), 1, "post_mlp") + assert rec_key == (VllmInternalRequestId(req_id), 1, "post_block") # "last_prompt" at num_prompt_tokens=4 → one row, hidden_size=8. 
assert rec_shape == (1, 8) @@ -174,7 +174,7 @@ class _FailingConsumer(CaptureConsumer): def global_capture_spec(self) -> CaptureSpec: return CaptureSpec( - hooks={"post_mlp": [0]}, + hooks={"post_block": [0]}, positions="last_prompt", ) @@ -184,7 +184,7 @@ def on_capture(self, key, tensor, sidecar): failing = _FailingConsumer(_FakeVllmConfig(), params={}) recording = _RecordingConsumer( _FakeVllmConfig(), - params={"hooks": {"post_mlp": [0]}, "positions": "last_prompt"}, + params={"hooks": {"post_block": [0]}, "positions": "last_prompt"}, ) failing_sink = _BatchedAdapter(failing) recording_sink = _BatchedAdapter(recording) @@ -216,7 +216,7 @@ def on_capture(self, key, tensor, sidecar): ) plan = mgr.build_step_plan(batch_view) hidden = torch.zeros((2, 4), dtype=torch.float32) - mgr.on_hook(0, "post_mlp", hidden) + mgr.on_hook(0, "post_block", hidden) mgr.dispatch_step_captures(plan) indexed = mgr.finalize_request("req-isolated") diff --git a/tests/v1/capture/test_plan.py b/tests/v1/capture/test_plan.py index a2ed004f5cb4..7ecab1845c35 100644 --- a/tests/v1/capture/test_plan.py +++ b/tests/v1/capture/test_plan.py @@ -66,7 +66,7 @@ def test_single_consumer_mask(self): entry = CapturePositionEntry( request_id="r1", layer=0, - hook="post_mlp", + hook="post_block", logical_pos=9, scratch_row=0, step_index=0, @@ -79,7 +79,7 @@ def test_multi_consumer_mask(self): entry = CapturePositionEntry( request_id="r1", layer=0, - hook="post_mlp", + hook="post_block", logical_pos=9, scratch_row=0, step_index=0, @@ -111,7 +111,7 @@ def test_consumer_mask_zero_means_no_consumer(self): entry = CapturePositionEntry( request_id="r1", layer=0, - hook="post_mlp", + hook="post_block", logical_pos=0, scratch_row=0, step_index=0, @@ -131,14 +131,14 @@ def test_gather_indices_dtype_and_shape(self): indices = torch.tensor([0, 3, 7], dtype=torch.int64) scratch = torch.empty((3, 16), dtype=torch.float32) plan = StepCapturePlan( - gather_indices={(0, "post_mlp"): indices}, - scratch_gpu={(0, 
"post_mlp"): scratch}, - scratch_dtype={(0, "post_mlp"): torch.float32}, + gather_indices={(0, "post_block"): indices}, + scratch_gpu={(0, "post_block"): scratch}, + scratch_dtype={(0, "post_block"): torch.float32}, entries=[], ) - assert plan.gather_indices[(0, "post_mlp")].dtype == torch.int64 - assert plan.gather_indices[(0, "post_mlp")].shape == (3,) - assert plan.scratch_gpu[(0, "post_mlp")].shape == (3, 16) + assert plan.gather_indices[(0, "post_block")].dtype == torch.int64 + assert plan.gather_indices[(0, "post_block")].shape == (3,) + assert plan.scratch_gpu[(0, "post_block")].shape == (3, 16) def test_empty_plan(self): plan = StepCapturePlan( @@ -155,23 +155,23 @@ def test_multiple_layer_hook_pairs(self): plan = StepCapturePlan( gather_indices={ (0, "pre_attn"): torch.tensor([0], dtype=torch.int64), - (0, "post_mlp"): torch.tensor([0, 1], dtype=torch.int64), - (1, "post_mlp"): torch.tensor([2], dtype=torch.int64), + (0, "post_block"): torch.tensor([0, 1], dtype=torch.int64), + (1, "post_block"): torch.tensor([2], dtype=torch.int64), }, scratch_gpu={ (0, "pre_attn"): torch.empty((1, 8)), - (0, "post_mlp"): torch.empty((2, 8)), - (1, "post_mlp"): torch.empty((1, 8)), + (0, "post_block"): torch.empty((2, 8)), + (1, "post_block"): torch.empty((1, 8)), }, scratch_dtype={ (0, "pre_attn"): torch.float32, - (0, "post_mlp"): torch.float32, - (1, "post_mlp"): torch.float32, + (0, "post_block"): torch.float32, + (1, "post_block"): torch.float32, }, entries=[], ) assert len(plan.gather_indices) == 3 - assert plan.scratch_gpu[(0, "post_mlp")].shape[0] == 2 + assert plan.scratch_gpu[(0, "post_block")].shape[0] == 2 def test_request_errors_default_empty(self): plan = StepCapturePlan( diff --git a/tests/v1/capture/test_runner_integration.py b/tests/v1/capture/test_runner_integration.py index 4eecf133be96..3f4c657332ac 100644 --- a/tests/v1/capture/test_runner_integration.py +++ b/tests/v1/capture/test_runner_integration.py @@ -101,7 +101,7 @@ def 
test_filesystem_consumer_end_to_end_via_manager(tmp_path: pathlib.Path) -> N # ``_register_capture_request`` resolves via # ``validate_client_spec``. client_spec = CaptureSpec( - hooks={"post_mlp": [1]}, + hooks={"post_block": [1]}, positions="last_prompt", ) @@ -127,10 +127,10 @@ def test_filesystem_consumer_end_to_end_via_manager(tmp_path: pathlib.Path) -> N ) plan = mgr.build_step_plan(batch_view) - # Simulate ``on_hook`` firing: for the single (layer=1, hook=post_mlp) + # Simulate ``on_hook`` firing: for the single (layer=1, hook=post_block) # key, populate the scratch with a known tensor. hidden = torch.arange(24, dtype=torch.float32).reshape(3, 8) - mgr.on_hook(1, "post_mlp", hidden) + mgr.on_hook(1, "post_block", hidden) # Drain. mgr.dispatch_step_captures(plan) @@ -139,11 +139,11 @@ def test_filesystem_consumer_end_to_end_via_manager(tmp_path: pathlib.Path) -> N assert list(results.keys()) == [0] # Wait for the writer pool to flush. - _wait_for_status(consumer, (req_id, 1, "post_mlp")) + _wait_for_status(consumer, (req_id, 1, "post_block")) consumer.shutdown() # Verify the expected file exists under the consumer's layout. 
- bin_path = tmp_path / "default" / req_id / "1_post_mlp.bin" + bin_path = tmp_path / "default" / req_id / "1_post_block.bin" sidecar_path = bin_path.with_suffix(".json") assert bin_path.exists(), f"missing bin file {bin_path}" assert sidecar_path.exists(), f"missing sidecar {sidecar_path}" @@ -152,7 +152,7 @@ def test_filesystem_consumer_end_to_end_via_manager(tmp_path: pathlib.Path) -> N sidecar = json.loads(sidecar_path.read_text()) assert sidecar["request_id"] == req_id assert sidecar["layer"] == 1 - assert sidecar["hook"] == "post_mlp" + assert sidecar["hook"] == "post_block" # --------------------------------------------------------------------------- @@ -176,7 +176,7 @@ def test_manager_admission_error_yields_error_result() -> None: mgr = CaptureManager( consumers=(sink,), - consumer_specs=(CaptureSpec(hooks={"post_mlp": [0]}, positions="last_prompt"),), + consumer_specs=(CaptureSpec(hooks={"post_block": [0]}, positions="last_prompt"),), num_hidden_layers=2, hidden_size=4, model_dtype=torch.float32, @@ -230,7 +230,7 @@ def test_filesystem_consumer_byte_for_byte_matches_writer( ) tensor = torch.arange(16, dtype=torch.float32).reshape(2, 8) - key = (VllmInternalRequestId("req-gold"), 3, "post_mlp") + key = (VllmInternalRequestId("req-gold"), 3, "post_block") consumer.submit_chunk( CaptureChunk( @@ -253,10 +253,10 @@ def test_filesystem_consumer_byte_for_byte_matches_writer( }, ) ) - _wait_for_status(consumer, ("req-gold", 3, "post_mlp")) + _wait_for_status(consumer, ("req-gold", 3, "post_block")) consumer.shutdown() - consumer_bin = consumer_root / "gold" / "req-gold" / "3_post_mlp.bin" + consumer_bin = consumer_root / "gold" / "req-gold" / "3_post_block.bin" assert consumer_bin.exists() consumer_bytes = consumer_bin.read_bytes() @@ -265,14 +265,14 @@ def test_filesystem_consumer_byte_for_byte_matches_writer( writer_root.mkdir() writer = ActivationWriter(writer_root, num_threads=1) try: - writer_bin = writer_root / "gold" / "req-gold" / "3_post_mlp.bin" + 
writer_bin = writer_root / "gold" / "req-gold" / "3_post_block.bin" writer_bin.parent.mkdir(parents=True, exist_ok=True) writer.submit( WriteTask( path=writer_bin, payload=bytes(tensor.numpy().tobytes()), append=True, - key=("req-gold", 3, "post_mlp"), + key=("req-gold", 3, "post_block"), ) ) writer.submit( @@ -285,14 +285,14 @@ def test_filesystem_consumer_byte_for_byte_matches_writer( "shape": [2, 8], "dtype": "float32", }, - key=("req-gold", 3, "post_mlp"), + key=("req-gold", 3, "post_block"), ) ) # Spin until writer finalizes. deadline = time.monotonic() + 5.0 while time.monotonic() < deadline: - result = writer.get_result(("req-gold", 3, "post_mlp")) + result = writer.get_result(("req-gold", 3, "post_block")) if result is not None and result.status in ("ok", "error"): break time.sleep(0.005) @@ -396,14 +396,14 @@ def test_pipeline_parallel_two_stage_shared_fs(tmp_path: pathlib.Path) -> None: its ``CaptureManager`` is built with the *global* layer count and the stage's *local* ``[start, end)`` slice, and both write to the same root (the shared mount). A client spec spanning both stages - (``post_mlp`` at layers 1 and 3 of a 4-layer model) must land exactly + (``post_block`` at layers 1 and 3 of a 4-layer model) must land exactly one file per layer under its global-layer path, with each stage writing only the layers it owns — the Option-A merge the engine then unions at the result level. """ GLOBAL = 4 req_id = "req-pp" - client_spec = CaptureSpec(hooks={"post_mlp": [1, 3]}, positions="last_prompt") + client_spec = CaptureSpec(hooks={"post_block": [1, 3]}, positions="last_prompt") def _drive_stage(local_range: tuple[int, int], owned_layer: int) -> None: consumer = FilesystemConsumer( @@ -438,15 +438,15 @@ def _drive_stage(local_range: tuple[int, int], owned_layer: int) -> None: ) plan = mgr.build_step_plan(batch_view) # Only this stage's owned layer is planned. 
- assert set(plan.gather_indices) == {(owned_layer, "post_mlp")} + assert set(plan.gather_indices) == {(owned_layer, "post_block")} hidden = torch.arange(24, dtype=torch.float32).reshape(3, 8) # Firing the other stage's layer is a no-op on this manager. - mgr.on_hook(owned_layer, "post_mlp", hidden) + mgr.on_hook(owned_layer, "post_block", hidden) mgr.dispatch_step_captures(plan) results = mgr.finalize_request(req_id) assert list(results.keys()) == [0] - _wait_for_status(consumer, (req_id, owned_layer, "post_mlp")) + _wait_for_status(consumer, (req_id, owned_layer, "post_block")) consumer.shutdown() # Stage 0 owns global layers [0, 2) → captures layer 1. @@ -457,4 +457,4 @@ def _drive_stage(local_range: tuple[int, int], owned_layer: int) -> None: req_dir = tmp_path / "default" / req_id written = sorted(p.name for p in req_dir.glob("*.bin")) # Exactly one file per requested layer, keyed by the GLOBAL layer index. - assert written == ["1_post_mlp.bin", "3_post_mlp.bin"] + assert written == ["1_post_block.bin", "3_post_block.bin"] diff --git a/tests/v1/capture/test_sampling_params.py b/tests/v1/capture/test_sampling_params.py index 6aed9057679e..9981d1e2b85e 100644 --- a/tests/v1/capture/test_sampling_params.py +++ b/tests/v1/capture/test_sampling_params.py @@ -37,7 +37,7 @@ def test_empty_dict_is_accepted(self) -> None: def test_dict_with_string_keys_is_accepted(self) -> None: spec = { - "filesystem": {"tag": "t", "hooks": {"post_mlp": [0]}}, + "filesystem": {"tag": "t", "hooks": {"post_block": [0]}}, "logging": {"level": "INFO"}, } params = SamplingParams(capture=spec) diff --git a/tests/v1/capture/test_step_gate.py b/tests/v1/capture/test_step_gate.py index 17c2b45b221c..505c069b654f 100644 --- a/tests/v1/capture/test_step_gate.py +++ b/tests/v1/capture/test_step_gate.py @@ -78,7 +78,7 @@ def test_extract_selectors_none_and_empty(): def test_extract_selectors_dict_spec(): - raw = {"filesystem": {"hooks": {"post_mlp": "all"}, "positions": "last_prompt"}} + raw = 
{"filesystem": {"hooks": {"post_block": "all"}, "positions": "last_prompt"}} assert _extract_selectors(raw) == ["last_prompt"] diff --git a/tests/v1/capture/test_types.py b/tests/v1/capture/test_types.py index 7fd2e5a6b045..325dda5f0239 100644 --- a/tests/v1/capture/test_types.py +++ b/tests/v1/capture/test_types.py @@ -20,7 +20,7 @@ ) -def _key(req_id: str = "req-1", layer: int = 3, hook: str = "post_mlp") -> CaptureKey: +def _key(req_id: str = "req-1", layer: int = 3, hook: str = "post_block") -> CaptureKey: return (VllmInternalRequestId(req_id), layer, hook) @@ -31,15 +31,15 @@ def test_capture_key_is_a_three_tuple(): req_id, layer, hook = key assert req_id == "req-1" assert layer == 3 - assert hook == "post_mlp" + assert hook == "post_block" def test_capture_spec_is_frozen(): spec = CaptureSpec( - hooks={"post_mlp": [1, 2, 3]}, + hooks={"post_block": [1, 2, 3]}, positions="last_prompt", ) - assert spec.hooks == {"post_mlp": [1, 2, 3]} + assert spec.hooks == {"post_block": [1, 2, 3]} assert spec.positions == "last_prompt" with pytest.raises(dataclasses.FrozenInstanceError): diff --git a/tests/v1/core/test_steering_hash_determinism.py b/tests/v1/core/test_steering_hash_determinism.py index 551677f203b0..07a7c77c8e8b 100644 --- a/tests/v1/core/test_steering_hash_determinism.py +++ b/tests/v1/core/test_steering_hash_determinism.py @@ -33,47 +33,47 @@ def test_empty_and_none_hash_zero(self): assert _hash({}, module_ref=None) == 0 def test_identical_specs_hash_equal(self): - a = {"post_mlp": {0: [1.0, 2.0, 3.0]}} - b = {"post_mlp": {0: [1.0, 2.0, 3.0]}} + a = {"post_block": {0: [1.0, 2.0, 3.0]}} + b = {"post_block": {0: [1.0, 2.0, 3.0]}} assert _hash(a) == _hash(b) def test_dict_insertion_order_does_not_matter(self): a = { - "post_mlp": {0: [1.0, 2.0], 1: [3.0, 4.0]}, + "post_block": {0: [1.0, 2.0], 1: [3.0, 4.0]}, "pre_attn": {5: [5.0, 6.0]}, } # Same data, different insertion orders. 
b: dict = {} b["pre_attn"] = {5: [5.0, 6.0]} - b["post_mlp"] = {} - b["post_mlp"][1] = [3.0, 4.0] - b["post_mlp"][0] = [1.0, 2.0] + b["post_block"] = {} + b["post_block"][1] = [3.0, 4.0] + b["post_block"][0] = [1.0, 2.0] assert _hash(a) == _hash(b) def test_different_vector_values_hash_different(self): - a = {"post_mlp": {0: [1.0, 2.0, 3.0]}} - b = {"post_mlp": {0: [1.0, 2.0, 3.1]}} + a = {"post_block": {0: [1.0, 2.0, 3.0]}} + b = {"post_block": {0: [1.0, 2.0, 3.1]}} assert _hash(a) != _hash(b) def test_different_layer_indices_hash_different(self): - a = {"post_mlp": {0: [1.0, 2.0, 3.0]}} - b = {"post_mlp": {1: [1.0, 2.0, 3.0]}} + a = {"post_block": {0: [1.0, 2.0, 3.0]}} + b = {"post_block": {1: [1.0, 2.0, 3.0]}} assert _hash(a) != _hash(b) def test_different_hook_points_hash_different(self): - a = {"post_mlp": {0: [1.0, 2.0, 3.0]}} + a = {"post_block": {0: [1.0, 2.0, 3.0]}} b = {"pre_attn": {0: [1.0, 2.0, 3.0]}} assert _hash(a) != _hash(b) def test_fits_in_int64(self): - a = {"post_mlp": {0: [1.0] * 1024}} + a = {"post_block": {0: [1.0] * 1024}} h = _hash(a) assert 0 <= h < 2**63, f"Hash {h} outside signed int64 range" def test_module_ref_changes_hash(self): """A module ref folds into the hash; same vectors + different ``(name, scale)`` tuples must produce different hashes.""" - a = {"post_mlp": {0: [1.0, 2.0, 3.0]}} + a = {"post_block": {0: [1.0, 2.0, 3.0]}} h_no_ref = _hash(a) h_ref_foo = _hash(a, module_ref=("foo", 1.0)) h_ref_bar = _hash(a, module_ref=("bar", 1.0)) @@ -94,7 +94,7 @@ def test_module_ref_default_matches_explicit_none(self): """``module_ref=None`` must reduce to the original inline-only hash bit-for-bit so existing prefix-cache reuse is preserved. """ - a = {"post_mlp": {0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}} + a = {"post_block": {0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}} # Default arg. h_default = hash_steering_config(a) # Explicit None. 
@@ -107,7 +107,7 @@ def test_module_ref_identical_specs_hash_equal(self): produce the same hash regardless of when (or whether) the worker-side registry has been populated. The hash is a pure function of the reference, not the resolved vectors.""" - inline = {"post_mlp": {14: [0.1, 0.2]}} + inline = {"post_block": {14: [0.1, 0.2]}} ref = ("foo", 1.0) first = _hash(inline, module_ref=ref) second = _hash(inline, module_ref=ref) @@ -125,7 +125,7 @@ def test_across_processes(self): script = ( "from vllm.config.steering_types import hash_steering_config; " "print(hash_steering_config(" - "{'post_mlp': {0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}}" + "{'post_block': {0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}}" "))" ) first = subprocess.check_output([sys.executable, "-c", script]) @@ -134,5 +134,5 @@ def test_across_processes(self): f"Hash differs across processes: {first!r} vs {second!r}" ) # And matches the in-process hash. - in_process = _hash({"post_mlp": {0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}}) + in_process = _hash({"post_block": {0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}}) assert int(first.strip()) == in_process diff --git a/tests/v1/executor/test_executor.py b/tests/v1/executor/test_executor.py index 525dc2ee407a..59b556ba09e6 100644 --- a/tests/v1/executor/test_executor.py +++ b/tests/v1/executor/test_executor.py @@ -174,7 +174,7 @@ def _result(req, layer): from vllm.v1.capture.types import CaptureResult, VllmInternalRequestId return CaptureResult( - key=(VllmInternalRequestId(req), layer, "post_mlp"), + key=(VllmInternalRequestId(req), layer, "post_block"), status="ok", ) diff --git a/tests/v1/test_request_steering.py b/tests/v1/test_request_steering.py index c0e1e15b197a..757257a8cc99 100644 --- a/tests/v1/test_request_steering.py +++ b/tests/v1/test_request_steering.py @@ -30,8 +30,8 @@ # Helpers # --------------------------------------------------------------------------- -STEERING_A = {"post_mlp": {0: [1.0, 2.0]}} -STEERING_B = {"post_mlp": {0: [99.0, 100.0]}} 
+STEERING_A = {"post_block": {0: [1.0, 2.0]}} +STEERING_B = {"post_block": {0: [99.0, 100.0]}} init_none_hash(sha256_cbor) diff --git a/tests/v1/test_steering_inline_packed.py b/tests/v1/test_steering_inline_packed.py index 89a613e8e591..6e7aebaf2be2 100644 --- a/tests/v1/test_steering_inline_packed.py +++ b/tests/v1/test_steering_inline_packed.py @@ -40,37 +40,37 @@ def test_torch_dtype_mapping(self): assert _torch_dtype_to_pack_dtype(torch.bfloat16) == np.dtype(np.float32) def test_pack_steering_for_dtype_bare_list(self): - spec = {"post_mlp": {0: [1.0, 2.0, 3.0]}} + spec = {"post_block": {0: [1.0, 2.0, 3.0]}} out = pack_steering_for_dtype(spec, np.float32) assert out is not None - arr = out["post_mlp"][0] + arr = out["post_block"][0] assert arr.dtype == np.float32 assert arr.tolist() == [1.0, 2.0, 3.0] def test_pack_steering_for_dtype_with_scale(self): - spec = {"post_mlp": {0: {"vector": [1.0, 2.0], "scale": 3.0}}} + spec = {"post_block": {0: {"vector": [1.0, 2.0], "scale": 3.0}}} out = pack_steering_for_dtype(spec, np.float32) assert out is not None - assert out["post_mlp"][0].tolist() == [3.0, 6.0] + assert out["post_block"][0].tolist() == [3.0, 6.0] def test_pack_effective_steering_resolves_then_casts(self): - base = {"post_mlp": {0: [1.0, 2.0]}} - prefill = {"post_mlp": {0: [10.0, 20.0]}} + base = {"post_block": {0: [1.0, 2.0]}} + prefill = {"post_block": {0: [10.0, 20.0]}} out = pack_effective_steering(base, prefill, np.float32) assert out is not None # 1.0+10.0=11.0, 2.0+20.0=22.0 - assert out["post_mlp"][0].dtype == np.float32 - assert out["post_mlp"][0].tolist() == [11.0, 22.0] + assert out["post_block"][0].dtype == np.float32 + assert out["post_block"][0].tolist() == [11.0, 22.0] def test_pack_effective_steering_handles_none(self): assert pack_effective_steering(None, None, np.float32) is None assert pack_effective_steering({}, {}, np.float32) is None def test_pack_dtype_fp16_loses_some_precision_but_preserves_shape(self): - spec = {"post_mlp": {0: 
list(range(16))}} + spec = {"post_block": {0: list(range(16))}} out = pack_steering_for_dtype(spec, np.float16) assert out is not None - arr = out["post_mlp"][0] + arr = out["post_block"][0] assert arr.dtype == np.float16 assert arr.shape == (16,) # fp16 represents small ints exactly. @@ -99,7 +99,7 @@ def test_named_only_is_noop(self): def test_inline_packs_and_clears_originals(self): sp = SamplingParams( max_tokens=1, - steering_vectors={"post_mlp": {0: [1.0, 2.0]}}, + steering_vectors={"post_block": {0: [1.0, 2.0]}}, ) maybe_pack_inline_steering_for_request(sp, torch.float32) assert sp.steering_vectors is None @@ -108,11 +108,11 @@ def test_inline_packs_and_clears_originals(self): assert sp._effective_prefill_steering_packed is not None assert sp._effective_decode_steering_packed is not None # Both phases resolve to the same result when only base is set. - assert sp._effective_prefill_steering_packed["post_mlp"][0].tolist() == [ + assert sp._effective_prefill_steering_packed["post_block"][0].tolist() == [ 1.0, 2.0, ] - assert sp._effective_decode_steering_packed["post_mlp"][0].tolist() == [ + assert sp._effective_decode_steering_packed["post_block"][0].tolist() == [ 1.0, 2.0, ] @@ -120,16 +120,16 @@ def test_inline_packs_and_clears_originals(self): def test_phase_specific_resolves_per_phase(self): sp = SamplingParams( max_tokens=1, - steering_vectors={"post_mlp": {0: [1.0, 2.0]}}, - prefill_steering_vectors={"post_mlp": {0: [10.0, 20.0]}}, - decode_steering_vectors={"post_mlp": {0: [100.0, 200.0]}}, + steering_vectors={"post_block": {0: [1.0, 2.0]}}, + prefill_steering_vectors={"post_block": {0: [10.0, 20.0]}}, + decode_steering_vectors={"post_block": {0: [100.0, 200.0]}}, ) maybe_pack_inline_steering_for_request(sp, torch.float32) - assert sp._effective_prefill_steering_packed["post_mlp"][0].tolist() == [ + assert sp._effective_prefill_steering_packed["post_block"][0].tolist() == [ 11.0, 22.0, ] - assert 
sp._effective_decode_steering_packed["post_mlp"][0].tolist() == [ + assert sp._effective_decode_steering_packed["post_block"][0].tolist() == [ 101.0, 202.0, ] @@ -137,7 +137,7 @@ def test_phase_specific_resolves_per_phase(self): def test_idempotent_when_already_packed(self): sp = SamplingParams( max_tokens=1, - steering_vectors={"post_mlp": {0: [1.0, 2.0]}}, + steering_vectors={"post_block": {0: [1.0, 2.0]}}, ) maybe_pack_inline_steering_for_request(sp, torch.float32) first = sp._effective_prefill_steering_packed @@ -148,12 +148,12 @@ def test_idempotent_when_already_packed(self): def test_effective_steering_returns_packed_after_pack(self): sp = SamplingParams( max_tokens=1, - steering_vectors={"post_mlp": {0: [1.0, 2.0]}}, + steering_vectors={"post_block": {0: [1.0, 2.0]}}, ) maybe_pack_inline_steering_for_request(sp, torch.float32) # The cached_property fallback should now return packed values. assert sp.effective_prefill_steering is not None - assert sp.effective_prefill_steering["post_mlp"][0].tolist() == [1.0, 2.0] + assert sp.effective_prefill_steering["post_block"][0].tolist() == [1.0, 2.0] # --------------------------------------------------------------------------- @@ -165,7 +165,7 @@ class TestHashDeterminism: def test_packed_request_hash_matches_unpacked(self): """A packed and unpacked submission of the same logical request must produce the same prefix-cache hash.""" - vectors = {"post_mlp": {0: [1.0, 2.0, 3.0]}} + vectors = {"post_block": {0: [1.0, 2.0, 3.0]}} sp_unpacked = SamplingParams(max_tokens=1, steering_vectors=vectors) unpacked_hash = sp_unpacked.prefill_steering_config_hash @@ -177,10 +177,10 @@ def test_packed_request_hash_matches_unpacked(self): def test_different_vectors_different_hash(self): sp_a = SamplingParams( - max_tokens=1, steering_vectors={"post_mlp": {0: [1.0, 2.0]}} + max_tokens=1, steering_vectors={"post_block": {0: [1.0, 2.0]}} ) sp_b = SamplingParams( - max_tokens=1, steering_vectors={"post_mlp": {0: [1.0, 3.0]}} + 
max_tokens=1, steering_vectors={"post_block": {0: [1.0, 3.0]}} ) maybe_pack_inline_steering_for_request(sp_a, torch.float32) maybe_pack_inline_steering_for_request(sp_b, torch.float32) @@ -197,7 +197,7 @@ def test_packed_field_round_trips_through_msgspec(self): """Packed ndarrays survive msgspec encode/decode with dtype + values.""" sp_in = SamplingParams( max_tokens=1, - steering_vectors={"post_mlp": {0: [1.0, 2.0, 3.0]}}, + steering_vectors={"post_block": {0: [1.0, 2.0, 3.0]}}, ) maybe_pack_inline_steering_for_request(sp_in, torch.float32) assert sp_in._effective_prefill_steering_packed is not None @@ -208,15 +208,15 @@ def test_packed_field_round_trips_through_msgspec(self): sp_out = dec.decode(bufs) assert sp_out._effective_prefill_steering_packed is not None - out_arr = sp_out._effective_prefill_steering_packed["post_mlp"][0] - in_arr = sp_in._effective_prefill_steering_packed["post_mlp"][0] + out_arr = sp_out._effective_prefill_steering_packed["post_block"][0] + in_arr = sp_in._effective_prefill_steering_packed["post_block"][0] assert isinstance(out_arr, np.ndarray) assert out_arr.dtype == in_arr.dtype assert np.array_equal(out_arr, in_arr) def test_packed_payload_smaller_than_unpacked(self): """Sanity: the packed wire form is smaller than the unpacked one.""" - vectors = {"post_mlp": {i: [float(j) for j in range(2560)] for i in range(34)}} + vectors = {"post_block": {i: [float(j) for j in range(2560)] for i in range(34)}} sp_unpacked = SamplingParams(max_tokens=1, steering_vectors=vectors) sp_packed = SamplingParams(max_tokens=1, steering_vectors=vectors) maybe_pack_inline_steering_for_request(sp_packed, torch.float32) diff --git a/tests/v1/test_steering_types.py b/tests/v1/test_steering_types.py index ee7bdfedd312..e7ff5e1eaf36 100644 --- a/tests/v1/test_steering_types.py +++ b/tests/v1/test_steering_types.py @@ -370,10 +370,10 @@ def test_mismatched_base_prefill_raises(self): ): SamplingParams( steering_vectors={ - "post_mlp": {15: [1.0, 2.0]}, + 
"post_block": {15: [1.0, 2.0]}, }, prefill_steering_vectors={ - "post_mlp": {15: [1.0]}, + "post_block": {15: [1.0]}, }, ) @@ -386,10 +386,10 @@ def test_mismatched_base_decode_raises(self): ): SamplingParams( steering_vectors={ - "post_mlp": {0: [1.0, 2.0, 3.0]}, + "post_block": {0: [1.0, 2.0, 3.0]}, }, decode_steering_vectors={ - "post_mlp": {0: [1.0, 2.0]}, + "post_block": {0: [1.0, 2.0]}, }, ) @@ -399,10 +399,10 @@ def test_matching_dimensions_pass(self): params = SamplingParams( steering_vectors={ - "post_mlp": {0: [1.0, 2.0]}, + "post_block": {0: [1.0, 2.0]}, }, prefill_steering_vectors={ - "post_mlp": {0: [3.0, 4.0]}, + "post_block": {0: [3.0, 4.0]}, }, ) assert params.steering_vectors is not None @@ -415,10 +415,10 @@ def test_non_overlapping_different_dims_pass(self): params = SamplingParams( steering_vectors={ - "post_mlp": {0: [1.0, 2.0]}, + "post_block": {0: [1.0, 2.0]}, }, prefill_steering_vectors={ - "post_mlp": {1: [1.0]}, + "post_block": {1: [1.0]}, }, ) assert params.steering_vectors is not None @@ -434,10 +434,10 @@ def test_mismatched_prefill_decode_without_base_raises(self): ): SamplingParams( prefill_steering_vectors={ - "post_mlp": {0: [1.0, 2.0]}, + "post_block": {0: [1.0, 2.0]}, }, decode_steering_vectors={ - "post_mlp": {0: [1.0]}, + "post_block": {0: [1.0]}, }, ) @@ -448,10 +448,10 @@ def test_non_overlapping_prefill_decode_pass(self): params = SamplingParams( prefill_steering_vectors={ - "post_mlp": {0: [1.0, 2.0]}, + "post_block": {0: [1.0, 2.0]}, }, decode_steering_vectors={ - "post_mlp": {1: [1.0]}, + "post_block": {1: [1.0]}, }, ) assert params.prefill_steering_vectors is not None @@ -464,10 +464,10 @@ def test_matching_prefill_decode_without_base_pass(self): params = SamplingParams( prefill_steering_vectors={ - "post_mlp": {0: [1.0, 2.0]}, + "post_block": {0: [1.0, 2.0]}, }, decode_steering_vectors={ - "post_mlp": {0: [3.0, 4.0]}, + "post_block": {0: [3.0, 4.0]}, }, ) assert params.prefill_steering_vectors is not None @@ -483,12 
+483,12 @@ def test_mismatched_prefill_decode_scaled_entry_raises(self): ): SamplingParams( prefill_steering_vectors={ - "post_mlp": { + "post_block": { 0: {"vector": [1.0, 2.0], "scale": 0.5}, }, }, decode_steering_vectors={ - "post_mlp": {0: [1.0]}, + "post_block": {0: [1.0]}, }, ) @@ -509,7 +509,7 @@ def test_extra_key_in_steering_vectors_raises(self): with pytest.raises(ValueError, match="unexpected keys"): SamplingParams( steering_vectors={ - "post_mlp": { + "post_block": { 0: {"vector": [1.0, 2.0], "scale": 1.0, "typo": "bad"}, }, }, @@ -522,7 +522,7 @@ def test_extra_key_in_prefill_steering_vectors_raises(self): with pytest.raises(ValueError, match="unexpected keys"): SamplingParams( prefill_steering_vectors={ - "post_mlp": { + "post_block": { 0: {"vector": [1.0], "scale": 1.0, "extra": 42}, }, }, @@ -535,7 +535,7 @@ def test_extra_key_in_decode_steering_vectors_raises(self): with pytest.raises(ValueError, match="unexpected keys"): SamplingParams( decode_steering_vectors={ - "post_mlp": { + "post_block": { 0: {"vector": [1.0], "scale": 1.0, "foo": 1, "bar": 2}, }, }, diff --git a/tests/v1/worker/test_steering_manager.py b/tests/v1/worker/test_steering_manager.py index d382314a12da..21dab67d5be8 100644 --- a/tests/v1/worker/test_steering_manager.py +++ b/tests/v1/worker/test_steering_manager.py @@ -23,8 +23,8 @@ HIDDEN_SIZE = 8 MAX_CONFIGS = 4 -_HP = DEFAULT_HOOK_POINT.value # "post_mlp" -_TABLE_ATTR = "steering_table_post_mlp" +_HP = DEFAULT_HOOK_POINT.value # "post_block" +_TABLE_ATTR = "steering_table_post_block" # --------------------------------------------------------------------------- diff --git a/tests/v1/worker/test_steering_manager_ownership.py b/tests/v1/worker/test_steering_manager_ownership.py index e074d2c849b3..ea4393409e56 100644 --- a/tests/v1/worker/test_steering_manager_ownership.py +++ b/tests/v1/worker/test_steering_manager_ownership.py @@ -21,7 +21,7 @@ HIDDEN_SIZE = 8 MAX_CONFIGS = 4 -_HP = DEFAULT_HOOK_POINT.value # "post_mlp" +_HP = 
DEFAULT_HOOK_POINT.value # "post_block"
 _TABLE_ATTR = HOOK_POINT_TABLE_ATTR[DEFAULT_HOOK_POINT]
diff --git a/tests/v1/worker/test_steering_named_resolve_cache.py b/tests/v1/worker/test_steering_named_resolve_cache.py
index b75e28220091..6d7ff3b34155 100644
--- a/tests/v1/worker/test_steering_named_resolve_cache.py
+++ b/tests/v1/worker/test_steering_named_resolve_cache.py
@@ -74,8 +74,8 @@ class TestNamedCacheFastPath:
     def test_scale_one_no_overrides_returns_cached(self):
         """scale=1.0 + no inline → cache hit; output equals slow-path output."""
         mixin = _StubMixin()
-        base = _spec("post_mlp", {0: [1.0, 2.0]})
-        prefill = _spec("post_mlp", {1: [3.0, 4.0]})
+        base = _spec("post_block", {0: [1.0, 2.0]})
+        prefill = _spec("post_block", {1: [3.0, 4.0]})
         mixin.register_steering_modules(
             {"m": {"vectors": base, "prefill_vectors": prefill}},
             replace=True,
@@ -89,9 +89,9 @@ def test_scale_one_no_overrides_returns_cached(self):
     def test_decode_phase_resolves_separately(self):
         """The cache holds (prefill, decode) — decode must use its slot."""
         mixin = _StubMixin()
-        base = _spec("post_mlp", {0: [1.0, 2.0]})
-        prefill = _spec("post_mlp", {0: [10.0, 20.0]})
-        decode = _spec("post_mlp", {0: [100.0, 200.0]})
+        base = _spec("post_block", {0: [1.0, 2.0]})
+        prefill = _spec("post_block", {0: [10.0, 20.0]})
+        decode = _spec("post_block", {0: [100.0, 200.0]})
         mixin.register_steering_modules(
             {
                 "m": {
@@ -106,25 +106,25 @@ def test_decode_phase_resolves_separately(self):
         fast_prefill = mixin._resolve_request_steering(sp, "prefill")
         fast_decode = mixin._resolve_request_steering(sp, "decode")
-        assert fast_prefill["post_mlp"][0].tolist() == [11.0, 22.0]
-        assert fast_decode["post_mlp"][0].tolist() == [101.0, 202.0]
+        assert fast_prefill["post_block"][0].tolist() == [11.0, 22.0]
+        assert fast_decode["post_block"][0].tolist() == [101.0, 202.0]
     def test_scaled_fast_path_multiplies_cached(self):
         """scale=0.5 + no inline → fast path returns cached * 0.5."""
         mixin = _StubMixin()
-        base = _spec("post_mlp", {0: [2.0, 4.0]})
+        base = _spec("post_block", {0: [2.0, 4.0]})
         mixin.register_steering_modules({"m": {"vectors": base}}, replace=True)
         sp = SamplingParams(steering_module_ref=("m", 0.5))
         fast = mixin._resolve_request_steering(sp, "prefill")
         # base resolved alone: [2.0, 4.0]; scaled by 0.5: [1.0, 2.0]
-        assert fast["post_mlp"][0].tolist() == [1.0, 2.0]
+        assert fast["post_block"][0].tolist() == [1.0, 2.0]
     def test_scaled_fast_path_matches_slow_path(self):
         """For scale!=1.0, fast and slow paths must agree numerically."""
         mixin = _StubMixin()
-        base = _spec("post_mlp", {0: [1.0, 2.0], 1: [3.0, 4.0]})
-        prefill = _spec("post_mlp", {0: [10.0, 20.0]})
+        base = _spec("post_block", {0: [1.0, 2.0], 1: [3.0, 4.0]})
+        prefill = _spec("post_block", {0: [10.0, 20.0]})
         mixin.register_steering_modules(
             {"m": {"vectors": base, "prefill_vectors": prefill}},
             replace=True,
@@ -149,14 +149,14 @@ def test_scaled_fast_path_matches_slow_path(self):
     def test_decode_only_module_returns_none_for_prefill(self):
         """If module has only decode_vectors and no base, prefill returns None."""
         mixin = _StubMixin()
-        decode = _spec("post_mlp", {0: [1.0, 2.0]})
+        decode = _spec("post_block", {0: [1.0, 2.0]})
         mixin.register_steering_modules({"m": {"decode_vectors": decode}}, replace=True)
         sp = SamplingParams(steering_module_ref=("m", 1.0))
         assert mixin._resolve_request_steering(sp, "prefill") is None
         decoded = mixin._resolve_request_steering(sp, "decode")
         assert decoded is not None
-        assert decoded["post_mlp"][0].tolist() == [1.0, 2.0]
+        assert decoded["post_block"][0].tolist() == [1.0, 2.0]
 # ---------------------------------------------------------------------------
@@ -168,9 +168,9 @@ class TestInlineOverrideFallback:
     def test_inline_base_falls_through(self):
         """Inline ``steering_vectors`` forces the merge path."""
         mixin = _StubMixin()
-        base = _spec("post_mlp", {0: [1.0, 2.0]})
+        base = _spec("post_block", {0: [1.0, 2.0]})
         mixin.register_steering_modules({"m": {"vectors": base}}, replace=True)
-        inline = _spec("post_mlp", {0: [10.0, 20.0]})
+        inline = _spec("post_block", {0: [10.0, 20.0]})
         sp = SamplingParams(
             steering_module_ref=("m", 1.0),
             steering_vectors=inline,
@@ -179,14 +179,14 @@ def test_inline_base_falls_through(self):
         result = mixin._resolve_request_steering(sp, "prefill")
         assert result is not None
         # base + inline = [1.0, 2.0] + [10.0, 20.0] = [11.0, 22.0]
-        assert result["post_mlp"][0].tolist() == [11.0, 22.0]
+        assert result["post_block"][0].tolist() == [11.0, 22.0]
     def test_inline_phase_falls_through(self):
         """Inline ``prefill_steering_vectors`` forces the merge path."""
         mixin = _StubMixin()
-        base = _spec("post_mlp", {0: [1.0, 2.0]})
+        base = _spec("post_block", {0: [1.0, 2.0]})
         mixin.register_steering_modules({"m": {"vectors": base}}, replace=True)
-        inline_prefill = _spec("post_mlp", {1: [5.0, 5.0]})
+        inline_prefill = _spec("post_block", {1: [5.0, 5.0]})
         sp = SamplingParams(
             steering_module_ref=("m", 1.0),
             prefill_steering_vectors=inline_prefill,
@@ -195,8 +195,8 @@ def test_inline_phase_falls_through(self):
         result = mixin._resolve_request_steering(sp, "prefill")
         assert result is not None
         # Layer 0 from base, layer 1 from inline_prefill.
-        assert result["post_mlp"][0].tolist() == [1.0, 2.0]
-        assert result["post_mlp"][1].tolist() == [5.0, 5.0]
+        assert result["post_block"][0].tolist() == [1.0, 2.0]
+        assert result["post_block"][1].tolist() == [5.0, 5.0]
 # ---------------------------------------------------------------------------
@@ -208,12 +208,12 @@ class TestCacheLifecycle:
     def test_register_replace_clears_cache(self):
         mixin = _StubMixin()
         mixin.register_steering_modules(
-            {"a": {"vectors": _spec("post_mlp", {0: [1.0]})}}, replace=True
+            {"a": {"vectors": _spec("post_block", {0: [1.0]})}}, replace=True
         )
         assert "a" in mixin._steering_module_resolved_cache
         mixin.register_steering_modules(
-            {"b": {"vectors": _spec("post_mlp", {0: [2.0]})}}, replace=True
+            {"b": {"vectors": _spec("post_block", {0: [2.0]})}}, replace=True
         )
         assert "a" not in mixin._steering_module_resolved_cache
         assert "b" in mixin._steering_module_resolved_cache
@@ -222,8 +222,8 @@ def test_unregister_drops_cache_entry(self):
         mixin = _StubMixin()
         mixin.register_steering_modules(
             {
-                "a": {"vectors": _spec("post_mlp", {0: [1.0]})},
-                "b": {"vectors": _spec("post_mlp", {0: [2.0]})},
+                "a": {"vectors": _spec("post_block", {0: [1.0]})},
+                "b": {"vectors": _spec("post_block", {0: [2.0]})},
             },
             replace=True,
         )
diff --git a/tests/v1/worker/test_steering_pre_materialize.py b/tests/v1/worker/test_steering_pre_materialize.py
index fcfb70a94b87..120f8d6add1c 100644
--- a/tests/v1/worker/test_steering_pre_materialize.py
+++ b/tests/v1/worker/test_steering_pre_materialize.py
@@ -55,8 +55,8 @@ def __init__(self, max_configs: int = 8):
 def _spec(layer_to_vec: dict[int, list[float]]) -> dict:
-    """Build a single-hook SteeringVectorSpec on hook ``post_mlp``."""
-    return {"post_mlp": dict(layer_to_vec)}
+    """Build a single-hook SteeringVectorSpec on hook ``post_block``."""
+    return {"post_block": dict(layer_to_vec)}
 def _module_payload(
@@ -390,7 +390,7 @@ def test_re_register_drops_stale_pin_then_pin_again(self):
         # Verify contents match the *new* spec by checking the manager
         # stored the new vector in its config_vectors map.
         stored = stub._steering_manager.config_vectors[(named_only_h, "prefill")]
-        layer0_t = stored["post_mlp"][0].squeeze(0)
+        layer0_t = stored["post_block"][0].squeeze(0)
         assert layer0_t.tolist() == [9.0, 9.0, 9.0, 9.0]
     def test_replace_true_drops_all_prior_pins(self):
diff --git a/vllm/entrypoints/openai/chat_completion/protocol.py b/vllm/entrypoints/openai/chat_completion/protocol.py
index 22f63ff6f715..d64353257a58 100644
--- a/vllm/entrypoints/openai/chat_completion/protocol.py
+++ b/vllm/entrypoints/openai/chat_completion/protocol.py
@@ -492,7 +492,7 @@ class ChatCompletionRequest(OpenAIBaseModel):
     steering_vectors: SteeringVectorSpecPacked | None = Field(
         default=None,
         description="Per-request activation steering vectors keyed by hook "
-        "point name (pre_attn, post_attn, post_mlp). Each hook carries one "
+        "point name (pre_attn, post_attn, post_block). Each hook carries one "
         "base64-encoded (num_layers, hidden_size) blob plus a sibling "
         "layer_indices list (and optional per-row scales).",
     )
diff --git a/vllm/entrypoints/openai/completion/protocol.py b/vllm/entrypoints/openai/completion/protocol.py
index 2561d4e05a6a..5daaf784717d 100644
--- a/vllm/entrypoints/openai/completion/protocol.py
+++ b/vllm/entrypoints/openai/completion/protocol.py
@@ -217,7 +217,7 @@ class CompletionRequest(OpenAIBaseModel):
     steering_vectors: SteeringVectorSpecPacked | None = Field(
         default=None,
         description="Per-request activation steering vectors keyed by hook "
-        "point name (pre_attn, post_attn, post_mlp). Each hook carries one "
+        "point name (pre_attn, post_attn, post_block). Each hook carries one "
         "base64-encoded (num_layers, hidden_size) blob plus a sibling "
         "layer_indices list (and optional per-row scales).",
     )
diff --git a/vllm/entrypoints/openai/steering/registry.py b/vllm/entrypoints/openai/steering/registry.py
index 874161ab8847..3de8cf7aec36 100644
--- a/vllm/entrypoints/openai/steering/registry.py
+++ b/vllm/entrypoints/openai/steering/registry.py
@@ -153,7 +153,7 @@ async def load_from_file(self, name: str, path: str) -> None:
         Each tier in the JSON file may be either the legacy shape::
-            {"vectors": {"post_mlp": {"14": [0.1, ...]}}}
+            {"vectors": {"post_block": {"14": [0.1, ...]}}}
         (string layer keys are converted to int) or the binary-wire
         ``SteeringVectorSpecPacked`` shape (base64-encoded ``data`` field
diff --git a/vllm/entrypoints/serve/steering/protocol.py b/vllm/entrypoints/serve/steering/protocol.py
index 18cac7cea69b..2d97d5c8a23d 100644
--- a/vllm/entrypoints/serve/steering/protocol.py
+++ b/vllm/entrypoints/serve/steering/protocol.py
@@ -23,7 +23,7 @@ class SetSteeringRequest(BaseModel):
         default=None,
         description="Base steering vectors applied to both prefill and "
         "decode phases. Keyed by hook point name (pre_attn, post_attn, "
-        "post_mlp). Each hook's value is either a legacy layer map "
+        "post_block). Each hook's value is either a legacy layer map "
         "({layer_idx: list[float] | {\"vector\": [...], \"scale\": float}}) "
         "or a binary-wire SteeringHookPacked blob (base64-encoded "
         "(num_layers, hidden_size) buffer + layer_indices + dtype/shape, "
diff --git a/vllm/model_executor/layers/activation_capture.py b/vllm/model_executor/layers/activation_capture.py
index 1838abb7e5c8..c35e9fd86f0d 100644
--- a/vllm/model_executor/layers/activation_capture.py
+++ b/vllm/model_executor/layers/activation_capture.py
@@ -45,7 +45,7 @@ _HOOK_NAME_TO_ID: dict[str, int] = {
     "pre_attn": 0,
     "post_attn": 1,
-    "post_mlp": 2,
+    "post_block": 2,
     "mlp_in": 3,
     "mlp_out": 4,
 }
diff --git a/vllm/model_executor/layers/steering.py b/vllm/model_executor/layers/steering.py
index 6c65fe332cd1..bb547b35401d 100644
--- a/vllm/model_executor/layers/steering.py
+++ b/vllm/model_executor/layers/steering.py
@@ -36,7 +36,7 @@ class SteeringHookPoint(str, Enum):
     POST_ATTN = "post_attn"
     """Steer the residual skip tensor in the post-attention region."""
-    POST_MLP = "post_mlp"
+    POST_BLOCK = "post_block"
     """Steer the residual skip tensor in the post-MLP region."""
@@ -44,7 +44,7 @@ class SteeringHookPoint(str, Enum):
 HOOK_POINT_TABLE_ATTR: dict[SteeringHookPoint, str] = {
     SteeringHookPoint.PRE_ATTN: "steering_table_pre_attn",
     SteeringHookPoint.POST_ATTN: "steering_table_post_attn",
-    SteeringHookPoint.POST_MLP: "steering_table_post_mlp",
+    SteeringHookPoint.POST_BLOCK: "steering_table_post_block",
 }
 # Per-hook ``any-active`` flag attribute names. The flag is a single-element
@@ -59,7 +59,7 @@ class SteeringHookPoint(str, Enum):
 # Valid hook point string values for validation.
 VALID_HOOK_POINT_NAMES: frozenset[str] = frozenset(hp.value for hp in SteeringHookPoint)
-DEFAULT_HOOK_POINT = SteeringHookPoint.POST_MLP
+DEFAULT_HOOK_POINT = SteeringHookPoint.POST_BLOCK
 def register_steering_buffers(
diff --git a/vllm/model_executor/models/AXK1.py b/vllm/model_executor/models/AXK1.py
index f2fdc629f5a2..b3196a1d2438 100644
--- a/vllm/model_executor/models/AXK1.py
+++ b/vllm/model_executor/models/AXK1.py
@@ -649,7 +649,7 @@ def __init__(
         self.post_attention_layernorm = RMSNorm(
             config.hidden_size, eps=config.rms_norm_eps
         )
-        self.post_mlp_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_block_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
         self.routed_scaling_factor = config.routed_scaling_factor
     def _is_layer_sparse(self) -> bool:
@@ -701,7 +701,7 @@ def forward(
         hidden_states = self.mlp(hidden_states)
         if self.is_layer_sparse:
-            hidden_states = self.post_mlp_layernorm(hidden_states)
+            hidden_states = self.post_block_layernorm(hidden_states)
         if isinstance(self.mlp, AXK1MLP) and hidden_states.dtype == torch.float16:
             # Fix FP16 overflow
@@ -710,7 +710,7 @@ def forward(
             # The scaling of AXK1MOE output would be done in the forward
             # of AXK1MOE
             hidden_states *= 1.0 / self.routed_scaling_factor
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/afmoe.py b/vllm/model_executor/models/afmoe.py
index 2216e4948bd9..1aa111add2ae 100644
--- a/vllm/model_executor/models/afmoe.py
+++ b/vllm/model_executor/models/afmoe.py
@@ -339,7 +339,7 @@ def __init__(
             config.hidden_size, eps=config.rms_norm_eps
         )
         self.pre_mlp_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
-        self.post_mlp_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_block_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
     def forward(
         self,
@@ -364,7 +364,7 @@ def forward(
             hidden_states, residual
         )
         hidden_states = self.mlp(hidden_states)
-        hidden_states = self.post_mlp_layernorm(hidden_states)  # ffn norm b
+        hidden_states = self.post_block_layernorm(hidden_states)  # ffn norm b
         return hidden_states, residual
diff --git a/vllm/model_executor/models/apertus.py b/vllm/model_executor/models/apertus.py
index 234818b38307..c2bb8a9c7452 100644
--- a/vllm/model_executor/models/apertus.py
+++ b/vllm/model_executor/models/apertus.py
@@ -337,7 +337,7 @@ def forward(
         hidden_states, residual = self.feedforward_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/arcee.py b/vllm/model_executor/models/arcee.py
index cf1653db4eef..3d9dd49bb0e7 100644
--- a/vllm/model_executor/models/arcee.py
+++ b/vllm/model_executor/models/arcee.py
@@ -195,7 +195,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/arctic.py b/vllm/model_executor/models/arctic.py
index ccdf6c862b1a..a9c6f6d2fe63 100644
--- a/vllm/model_executor/models/arctic.py
+++ b/vllm/model_executor/models/arctic.py
@@ -401,7 +401,7 @@ def forward(
             hidden_states = self.block_sparse_moe(hidden_states)
             hidden_states = residual_attn + hidden_states
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states
diff --git a/vllm/model_executor/models/baichuan.py b/vllm/model_executor/models/baichuan.py
index 80a7c299f40e..e70e2a60cc1b 100644
--- a/vllm/model_executor/models/baichuan.py
+++ b/vllm/model_executor/models/baichuan.py
@@ -293,7 +293,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/commandr.py b/vllm/model_executor/models/commandr.py
index c40d7d5fb439..1f8dfea5d7f5 100644
--- a/vllm/model_executor/models/commandr.py
+++ b/vllm/model_executor/models/commandr.py
@@ -297,7 +297,7 @@ def forward(
         )
         hidden_states = hidden_states + hidden_states_mlp
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states, residual
diff --git a/vllm/model_executor/models/deepseek_v2.py b/vllm/model_executor/models/deepseek_v2.py
index cb3ac5c39cbb..1de61b805cc2 100644
--- a/vllm/model_executor/models/deepseek_v2.py
+++ b/vllm/model_executor/models/deepseek_v2.py
@@ -1214,7 +1214,7 @@ def forward(
             # The scaling of DeepseekV2MOE output would be done in the forward
             # of DeepseekV2MOE
             hidden_states *= 1.0 / self.routed_scaling_factor
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/dots1.py b/vllm/model_executor/models/dots1.py
index 16d6d7eee2d9..1ddbe318451e 100644
--- a/vllm/model_executor/models/dots1.py
+++ b/vllm/model_executor/models/dots1.py
@@ -356,7 +356,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/ernie45_moe.py b/vllm/model_executor/models/ernie45_moe.py
index ec62e532eaa3..4e74d7b1efec 100644
--- a/vllm/model_executor/models/ernie45_moe.py
+++ b/vllm/model_executor/models/ernie45_moe.py
@@ -421,7 +421,7 @@ def forward(
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/exaone.py b/vllm/model_executor/models/exaone.py
index c10a5a234361..07ed518fc7ca 100644
--- a/vllm/model_executor/models/exaone.py
+++ b/vllm/model_executor/models/exaone.py
@@ -314,7 +314,7 @@ def forward(
         hidden_states, residual = self.ln_2(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/exaone4.py b/vllm/model_executor/models/exaone4.py
index c9db3f8bfd1a..fada21d69cbb 100644
--- a/vllm/model_executor/models/exaone4.py
+++ b/vllm/model_executor/models/exaone4.py
@@ -314,7 +314,7 @@ def forward(
         hidden_states = self.post_feedforward_layernorm(hidden_states)
         hidden_states = residual + hidden_states
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states, residual
diff --git a/vllm/model_executor/models/exaone_moe.py b/vllm/model_executor/models/exaone_moe.py
index 5f47817fe155..56681f38c753 100644
--- a/vllm/model_executor/models/exaone_moe.py
+++ b/vllm/model_executor/models/exaone_moe.py
@@ -259,7 +259,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/falcon.py b/vllm/model_executor/models/falcon.py
index 974aaa2a18ba..4b76b1eea30a 100644
--- a/vllm/model_executor/models/falcon.py
+++ b/vllm/model_executor/models/falcon.py
@@ -384,7 +384,7 @@ def forward(
                 mlp_output += mlp_bias
         output = mlp_output + residual
-        output = apply_layer_steering(self, output, SteeringHookPoint.POST_MLP)
+        output = apply_layer_steering(self, output, SteeringHookPoint.POST_BLOCK)
         return output
diff --git a/vllm/model_executor/models/flex_olmo.py b/vllm/model_executor/models/flex_olmo.py
index 84b5a86708ef..4b04d593c159 100644
--- a/vllm/model_executor/models/flex_olmo.py
+++ b/vllm/model_executor/models/flex_olmo.py
@@ -167,7 +167,7 @@ def forward(
         hidden_states = self.post_feedforward_layernorm(hidden_states)
         hidden_states = residual + hidden_states
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states, None
diff --git a/vllm/model_executor/models/gemma.py b/vllm/model_executor/models/gemma.py
index 16f1326d7250..e8385fcd6870 100644
--- a/vllm/model_executor/models/gemma.py
+++ b/vllm/model_executor/models/gemma.py
@@ -278,7 +278,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/gemma2.py b/vllm/model_executor/models/gemma2.py
index 0abcf9e9c256..f57ff9a15499 100644
--- a/vllm/model_executor/models/gemma2.py
+++ b/vllm/model_executor/models/gemma2.py
@@ -268,7 +268,7 @@ def forward(
             hidden_states, residual
         )
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         hidden_states = self.post_feedforward_layernorm(hidden_states)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/gemma3.py b/vllm/model_executor/models/gemma3.py
index 74043f4adcbc..8a66d94e5104 100644
--- a/vllm/model_executor/models/gemma3.py
+++ b/vllm/model_executor/models/gemma3.py
@@ -330,7 +330,7 @@ def forward(
         )
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         hidden_states = self.post_feedforward_layernorm(hidden_states)
diff --git a/vllm/model_executor/models/gemma3n.py b/vllm/model_executor/models/gemma3n.py
index 7235ca2a533d..4e613338c614 100644
--- a/vllm/model_executor/models/gemma3n.py
+++ b/vllm/model_executor/models/gemma3n.py
@@ -578,7 +578,7 @@ def forward(
         attn_ffw_norm = self.post_feedforward_layernorm(attn_ffw)
         attn_ffw_laurel_gated = attn_laurel + attn_ffw_norm
         attn_ffw_laurel_gated = apply_layer_steering(
-            self, attn_ffw_laurel_gated, SteeringHookPoint.POST_MLP
+            self, attn_ffw_laurel_gated, SteeringHookPoint.POST_BLOCK
         )
         # ActUp (connect).
diff --git a/vllm/model_executor/models/gemma4.py b/vllm/model_executor/models/gemma4.py
index b98024095e1b..8e73d689c898 100644
--- a/vllm/model_executor/models/gemma4.py
+++ b/vllm/model_executor/models/gemma4.py
@@ -769,9 +769,9 @@ def forward(
         hidden_states = self.post_feedforward_layernorm(hidden_states)
         hidden_states = hidden_states + residual
-        maybe_capture_residual(hidden_states, self.layer_idx, "post_mlp")
+        maybe_capture_residual(hidden_states, self.layer_idx, "post_block")
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         # Apply PLE (Per-Layer Embedding) if configured
diff --git a/vllm/model_executor/models/glm4.py b/vllm/model_executor/models/glm4.py
index 24e7d00cd0ae..b2888f53daa6 100644
--- a/vllm/model_executor/models/glm4.py
+++ b/vllm/model_executor/models/glm4.py
@@ -209,7 +209,7 @@ def __init__(
         self.post_self_attn_layernorm = RMSNorm(
             config.hidden_size, eps=config.rms_norm_eps
         )
-        self.post_mlp_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_block_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
     def forward(
         self,
@@ -235,8 +235,8 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        hidden_states = self.post_mlp_layernorm(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        hidden_states = self.post_block_layernorm(hidden_states)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/glm4_moe.py b/vllm/model_executor/models/glm4_moe.py
index d1e8779b3664..9753fab7d00e 100644
--- a/vllm/model_executor/models/glm4_moe.py
+++ b/vllm/model_executor/models/glm4_moe.py
@@ -405,7 +405,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/glm4_moe_lite.py b/vllm/model_executor/models/glm4_moe_lite.py
index 3cc07a8ed9bd..2c8899990a0e 100644
--- a/vllm/model_executor/models/glm4_moe_lite.py
+++ b/vllm/model_executor/models/glm4_moe_lite.py
@@ -219,7 +219,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/gpt_neox.py b/vllm/model_executor/models/gpt_neox.py
index 7522744d1600..0e0d522c2d52 100644
--- a/vllm/model_executor/models/gpt_neox.py
+++ b/vllm/model_executor/models/gpt_neox.py
@@ -212,7 +212,7 @@ def forward(
                 self, hidden_states, SteeringHookPoint.POST_ATTN
             )
             hidden_states = apply_layer_steering(
-                self, hidden_states, SteeringHookPoint.POST_MLP
+                self, hidden_states, SteeringHookPoint.POST_BLOCK
             )
         else:
             # pseudocode:
@@ -226,7 +226,7 @@ def forward(
             mlp_output = self.mlp(mlp_input)
             hidden_states = mlp_output + attn_output
             hidden_states = apply_layer_steering(
-                self, hidden_states, SteeringHookPoint.POST_MLP
+                self, hidden_states, SteeringHookPoint.POST_BLOCK
             )
         return hidden_states
diff --git a/vllm/model_executor/models/granite.py b/vllm/model_executor/models/granite.py
index 86a8e8465e4d..2d6d72c98223 100644
--- a/vllm/model_executor/models/granite.py
+++ b/vllm/model_executor/models/granite.py @@ -274,7 +274,7 @@ def forward( hidden_states = self.mlp(hidden_states) hidden_states = residual + hidden_states * self.residual_multiplier hidden_states = apply_layer_steering( - self, hidden_states, SteeringHookPoint.POST_MLP + self, hidden_states, SteeringHookPoint.POST_BLOCK ) return hidden_states diff --git a/vllm/model_executor/models/granitemoe.py b/vllm/model_executor/models/granitemoe.py index 5cb7133aec45..06fefc6ed29c 100644 --- a/vllm/model_executor/models/granitemoe.py +++ b/vllm/model_executor/models/granitemoe.py @@ -308,7 +308,7 @@ def forward( hidden_states = self.block_sparse_moe(hidden_states) hidden_states = residual + hidden_states * self.residual_multiplier hidden_states = apply_layer_steering( - self, hidden_states, SteeringHookPoint.POST_MLP + self, hidden_states, SteeringHookPoint.POST_BLOCK ) return hidden_states diff --git a/vllm/model_executor/models/granitemoeshared.py b/vllm/model_executor/models/granitemoeshared.py index 5c4b60d8097c..3ac70b017837 100644 --- a/vllm/model_executor/models/granitemoeshared.py +++ b/vllm/model_executor/models/granitemoeshared.py @@ -172,7 +172,7 @@ def forward( del moe_hidden_states hidden_states = residual + hidden_states * self.residual_multiplier hidden_states = apply_layer_steering( - self, hidden_states, SteeringHookPoint.POST_MLP + self, hidden_states, SteeringHookPoint.POST_BLOCK ) return hidden_states diff --git a/vllm/model_executor/models/grok1.py b/vllm/model_executor/models/grok1.py index 0d74632363ad..8d84a7c9f551 100644 --- a/vllm/model_executor/models/grok1.py +++ b/vllm/model_executor/models/grok1.py @@ -452,7 +452,7 @@ def forward( else: hidden_states = self.moe_block(hidden_states) hidden_states = self.post_moe_norm(hidden_states) - residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP) + residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK) return hidden_states, residual diff --git 
a/vllm/model_executor/models/hunyuan_v1.py b/vllm/model_executor/models/hunyuan_v1.py index cb66dbe95c8d..038bef9cbdba 100644 --- a/vllm/model_executor/models/hunyuan_v1.py +++ b/vllm/model_executor/models/hunyuan_v1.py @@ -599,7 +599,7 @@ def forward( hidden_states, residual = self.post_attention_layernorm(hidden_states, residual) residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN) hidden_states = self.mlp(hidden_states) - residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP) + residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK) return hidden_states, residual, ori_kv_states diff --git a/vllm/model_executor/models/hyperclovax.py b/vllm/model_executor/models/hyperclovax.py index 7f54923505fa..46746ceca4f5 100644 --- a/vllm/model_executor/models/hyperclovax.py +++ b/vllm/model_executor/models/hyperclovax.py @@ -319,7 +319,7 @@ def forward( # The residual is added outside the layernorm function to apply muP. hidden_states = residual + hidden_states * self.residual_multiplier # muP hidden_states = apply_layer_steering( - self, hidden_states, SteeringHookPoint.POST_MLP + self, hidden_states, SteeringHookPoint.POST_BLOCK ) return hidden_states, residual diff --git a/vllm/model_executor/models/internlm2.py b/vllm/model_executor/models/internlm2.py index daf9a4863d22..d2948e4459c8 100644 --- a/vllm/model_executor/models/internlm2.py +++ b/vllm/model_executor/models/internlm2.py @@ -266,7 +266,7 @@ def forward( hidden_states, residual = self.ffn_norm(hidden_states, residual) residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN) hidden_states = self.feed_forward(hidden_states) - residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP) + residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK) return hidden_states, residual diff --git a/vllm/model_executor/models/internlm2_ve.py b/vllm/model_executor/models/internlm2_ve.py index 
49c027359d39..e4f4a431cb70 100644 --- a/vllm/model_executor/models/internlm2_ve.py +++ b/vllm/model_executor/models/internlm2_ve.py @@ -110,7 +110,7 @@ def forward( ).flatten() else: hidden_states = self.feed_forward(hidden_states) - residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP) + residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK) return hidden_states, residual diff --git a/vllm/model_executor/models/interns1_pro.py b/vllm/model_executor/models/interns1_pro.py index fdfc9cf89b5c..ab225a5bda8a 100644 --- a/vllm/model_executor/models/interns1_pro.py +++ b/vllm/model_executor/models/interns1_pro.py @@ -476,7 +476,7 @@ def forward( hidden_states, residual = self.post_attention_layernorm(hidden_states, residual) residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN) hidden_states = self.mlp(hidden_states) - residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP) + residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK) return hidden_states, residual diff --git a/vllm/model_executor/models/iquest_loopcoder.py b/vllm/model_executor/models/iquest_loopcoder.py index b06f1a70281e..5a1c9d2c20be 100644 --- a/vllm/model_executor/models/iquest_loopcoder.py +++ b/vllm/model_executor/models/iquest_loopcoder.py @@ -296,7 +296,7 @@ def forward( hidden_states = self.mlp(hidden_states) hidden_states = hidden_states + residual hidden_states = apply_layer_steering( - self, hidden_states, SteeringHookPoint.POST_MLP + self, hidden_states, SteeringHookPoint.POST_BLOCK ) return hidden_states diff --git a/vllm/model_executor/models/jais2.py b/vllm/model_executor/models/jais2.py index 42a5fc8c0414..236a617e1966 100644 --- a/vllm/model_executor/models/jais2.py +++ b/vllm/model_executor/models/jais2.py @@ -304,7 +304,7 @@ def forward( ) residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN) hidden_states = self.mlp(hidden_states) - residual = 
apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP) + residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK) return hidden_states, residual def get_quant_config(self, vllm_config: VllmConfig) -> QuantizationConfig | None: diff --git a/vllm/model_executor/models/kimi_linear.py b/vllm/model_executor/models/kimi_linear.py index 64f9c9935609..45f4f2647393 100644 --- a/vllm/model_executor/models/kimi_linear.py +++ b/vllm/model_executor/models/kimi_linear.py @@ -398,7 +398,7 @@ def forward( hidden_states, residual = self.post_attention_layernorm(hidden_states, residual) hidden_states = self.mlp(hidden_states) - residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP) + residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK) return hidden_states, residual diff --git a/vllm/model_executor/models/llama.py b/vllm/model_executor/models/llama.py index 5ff962cab4c9..94d8206fe9f5 100644 --- a/vllm/model_executor/models/llama.py +++ b/vllm/model_executor/models/llama.py @@ -352,7 +352,7 @@ def forward( hidden_states, residual = self.post_attention_layernorm(hidden_states, residual) residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN) hidden_states = self.mlp(hidden_states) - residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP) + residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK) return hidden_states, residual def get_quant_config(self, vllm_config: VllmConfig) -> QuantizationConfig | None: diff --git a/vllm/model_executor/models/llama4.py b/vllm/model_executor/models/llama4.py index 6560ae4e7cb2..8c33afc4b4e4 100644 --- a/vllm/model_executor/models/llama4.py +++ b/vllm/model_executor/models/llama4.py @@ -402,7 +402,7 @@ def forward( hidden_states, residual = self.post_attention_layernorm(hidden_states, residual) residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN) hidden_states = 
self.feed_forward(hidden_states) - residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP) + residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK) return hidden_states, residual diff --git a/vllm/model_executor/models/minicpm.py b/vllm/model_executor/models/minicpm.py index 0572c2b15df6..bfeab6b78769 100644 --- a/vllm/model_executor/models/minicpm.py +++ b/vllm/model_executor/models/minicpm.py @@ -418,7 +418,7 @@ def forward( self.config.scale_depth / math.sqrt(self.config.num_hidden_layers) ) hidden_states = apply_layer_steering( - self, hidden_states, SteeringHookPoint.POST_MLP + self, hidden_states, SteeringHookPoint.POST_BLOCK ) return hidden_states, None diff --git a/vllm/model_executor/models/minimax_m2.py b/vllm/model_executor/models/minimax_m2.py index 54f6ae32d0a7..64a2eb097784 100644 --- a/vllm/model_executor/models/minimax_m2.py +++ b/vllm/model_executor/models/minimax_m2.py @@ -340,7 +340,7 @@ def forward( residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN) hidden_states = self.block_sparse_moe(hidden_states) - residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP) + residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK) return hidden_states, residual diff --git a/vllm/model_executor/models/minimax_text_01.py b/vllm/model_executor/models/minimax_text_01.py index 1429e6f6c72c..602915879bb0 100644 --- a/vllm/model_executor/models/minimax_text_01.py +++ b/vllm/model_executor/models/minimax_text_01.py @@ -501,7 +501,7 @@ def forward( hidden_states = residual + hidden_states hidden_states = apply_layer_steering( - self, hidden_states, SteeringHookPoint.POST_MLP + self, hidden_states, SteeringHookPoint.POST_BLOCK ) return hidden_states, None diff --git a/vllm/model_executor/models/mistral.py b/vllm/model_executor/models/mistral.py index 566e4d3c0159..387dcf49233c 100644 --- a/vllm/model_executor/models/mistral.py +++ 
b/vllm/model_executor/models/mistral.py
@@ -206,7 +206,7 @@ def forward(
             hidden_states = hidden_states * (1 + self.ada_rms_norm_t_cond(t_cond))
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/mixtral.py b/vllm/model_executor/models/mixtral.py
index 7077ae2166e9..0d4d960d3e7d 100644
--- a/vllm/model_executor/models/mixtral.py
+++ b/vllm/model_executor/models/mixtral.py
@@ -313,7 +313,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.block_sparse_moe(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/molmo.py b/vllm/model_executor/models/molmo.py
index f4ba85a90ebe..e3e298f99d1e 100644
--- a/vllm/model_executor/models/molmo.py
+++ b/vllm/model_executor/models/molmo.py
@@ -664,7 +664,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
@@ -694,7 +694,7 @@ def forward(
         hidden_states = self.post_attention_layernorm(hidden_states)
         hidden_states = hidden_states + residual
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         residual = None
         return hidden_states, residual
diff --git
a/vllm/model_executor/models/molmo2.py b/vllm/model_executor/models/molmo2.py
index 52214da9f831..3729221c2d73 100644
--- a/vllm/model_executor/models/molmo2.py
+++ b/vllm/model_executor/models/molmo2.py
@@ -1140,7 +1140,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
@@ -1172,7 +1172,7 @@ def forward(
         hidden_states = self.post_attention_layernorm(hidden_states)
         hidden_states = hidden_states + residual
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         residual = None
         return hidden_states, residual
diff --git a/vllm/model_executor/models/nemotron.py b/vllm/model_executor/models/nemotron.py
index eea3702c450b..7fb440bbddea 100644
--- a/vllm/model_executor/models/nemotron.py
+++ b/vllm/model_executor/models/nemotron.py
@@ -312,7 +312,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/nemotron_nas.py b/vllm/model_executor/models/nemotron_nas.py
index 2a72295f7704..a04707529673 100644
--- a/vllm/model_executor/models/nemotron_nas.py
+++ b/vllm/model_executor/models/nemotron_nas.py
@@ -243,7 +243,7 @@ def forward(
         )
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual =
apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/olmo.py b/vllm/model_executor/models/olmo.py
index bc8a7655173a..8b1fab09ab8d 100644
--- a/vllm/model_executor/models/olmo.py
+++ b/vllm/model_executor/models/olmo.py
@@ -261,7 +261,7 @@ def forward(
         hidden_states = self.mlp(hidden_states)
         hidden_states = residual + hidden_states
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states
diff --git a/vllm/model_executor/models/olmo2.py b/vllm/model_executor/models/olmo2.py
index 5b21a88baea5..40b92f351066 100644
--- a/vllm/model_executor/models/olmo2.py
+++ b/vllm/model_executor/models/olmo2.py
@@ -300,7 +300,7 @@ def forward(
         hidden_states = self.post_feedforward_layernorm(hidden_states)
         hidden_states = residual + hidden_states
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states
diff --git a/vllm/model_executor/models/olmo_hybrid.py b/vllm/model_executor/models/olmo_hybrid.py
index 0847ac76e57c..65e106422980 100644
--- a/vllm/model_executor/models/olmo_hybrid.py
+++ b/vllm/model_executor/models/olmo_hybrid.py
@@ -839,7 +839,7 @@ def forward(
             hidden_states = self.mlp(hidden_states)
             hidden_states = residual + hidden_states
             hidden_states = apply_layer_steering(
-                self, hidden_states, SteeringHookPoint.POST_MLP
+                self, hidden_states, SteeringHookPoint.POST_BLOCK
             )
         else:
             residual = hidden_states
@@ -856,7 +856,7 @@ def forward(
             hidden_states = self.post_feedforward_layernorm(hidden_states)
             hidden_states = residual + hidden_states
             hidden_states = apply_layer_steering(
-                self, hidden_states, SteeringHookPoint.POST_MLP
+                self, hidden_states, SteeringHookPoint.POST_BLOCK
             )
         return hidden_states
diff
--git a/vllm/model_executor/models/olmoe.py b/vllm/model_executor/models/olmoe.py
index fca2b878ab1d..f13747172764 100644
--- a/vllm/model_executor/models/olmoe.py
+++ b/vllm/model_executor/models/olmoe.py
@@ -287,7 +287,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/openpangu.py b/vllm/model_executor/models/openpangu.py
index ab638971fa68..39f004405dc7 100644
--- a/vllm/model_executor/models/openpangu.py
+++ b/vllm/model_executor/models/openpangu.py
@@ -960,7 +960,7 @@ def __init__(
             self.pre_mlp_layernorm = RMSNorm(
                 config.hidden_size, eps=config.rms_norm_eps
             )
-            self.post_mlp_layernorm = RMSNorm(
+            self.post_block_layernorm = RMSNorm(
                 config.hidden_size, eps=config.rms_norm_eps
             )
@@ -1015,8 +1015,8 @@ def forward(
             hidden_states *= 1.0 / self.routed_scaling_factor
         if self.sandwich_norm:
-            hidden_states = self.post_mlp_layernorm(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+            hidden_states = self.post_block_layernorm(hidden_states)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/opt.py b/vllm/model_executor/models/opt.py
index 05c7d025fcbe..96263cb3f369 100644
--- a/vllm/model_executor/models/opt.py
+++ b/vllm/model_executor/models/opt.py
@@ -217,7 +217,7 @@ def forward(
         hidden_states, _ = self.fc2(hidden_states)
         hidden_states = residual + hidden_states
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         # 350m applies layer norm AFTER
attention
         if not self.do_layer_norm_before:
diff --git a/vllm/model_executor/models/orion.py b/vllm/model_executor/models/orion.py
index 9fdc26bf1181..992ac8bc2250 100644
--- a/vllm/model_executor/models/orion.py
+++ b/vllm/model_executor/models/orion.py
@@ -241,7 +241,7 @@ def forward(
         hidden_states = self.mlp(hidden_states)
         hidden_states = residual + hidden_states
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states
diff --git a/vllm/model_executor/models/ouro.py b/vllm/model_executor/models/ouro.py
index d295647de6c1..5f0b0e1b210f 100644
--- a/vllm/model_executor/models/ouro.py
+++ b/vllm/model_executor/models/ouro.py
@@ -302,7 +302,7 @@ def forward(
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
         hidden_states = self.post_attention_layernorm_2(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/persimmon.py b/vllm/model_executor/models/persimmon.py
index 5e3a557c365b..5e086b5d7d0e 100644
--- a/vllm/model_executor/models/persimmon.py
+++ b/vllm/model_executor/models/persimmon.py
@@ -259,7 +259,7 @@ def forward(
         hidden_states = hidden_states + residual
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         outputs = hidden_states
diff --git a/vllm/model_executor/models/phi.py b/vllm/model_executor/models/phi.py
index f278f6f9bc04..e9e35a2152fa 100644
--- a/vllm/model_executor/models/phi.py
+++ b/vllm/model_executor/models/phi.py
@@ -227,7 +227,7 @@ def forward(
         )
         hidden_states = attn_hidden_states + feed_forward_hidden_states
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+
self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states
diff --git a/vllm/model_executor/models/phimoe.py b/vllm/model_executor/models/phimoe.py
index 304cd92e0d69..04a58c15f3de 100644
--- a/vllm/model_executor/models/phimoe.py
+++ b/vllm/model_executor/models/phimoe.py
@@ -470,7 +470,7 @@ def forward(
         hidden_states = hidden_states + residual
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states, residual
diff --git a/vllm/model_executor/models/plamo2.py b/vllm/model_executor/models/plamo2.py
index deab6ea524d7..8098ef02819a 100644
--- a/vllm/model_executor/models/plamo2.py
+++ b/vllm/model_executor/models/plamo2.py
@@ -690,7 +690,7 @@ def __init__(
         self.pre_mixer_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
         self.post_mixer_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
         self.pre_mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
-        self.post_mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_block_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
     def forward(
         self,
@@ -729,9 +729,9 @@ def forward(
         # Fully Connected
         hidden_states, residual = self.pre_mlp_norm(hidden_states, residual)
         hidden_states = self.mlp(hidden_states)
-        hidden_states = self.post_mlp_norm(hidden_states)
+        hidden_states = self.post_block_norm(hidden_states)
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states, residual
@@ -1007,7 +1007,7 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):
                 loaded_weight += 1.0 / 5
             elif ".pre_mlp_norm" in name:
                 loaded_weight += 1.0
-            elif ".post_mlp_norm" in name:
+            elif ".post_block_norm" in name:
                 loaded_weight += 1.0 / (5**1.5)
             elif "model.norm.weight" in name:
                 loaded_weight += 1.0
diff --git a/vllm/model_executor/models/plamo3.py
b/vllm/model_executor/models/plamo3.py
index 637f88cfd340..a7097b847159 100644
--- a/vllm/model_executor/models/plamo3.py
+++ b/vllm/model_executor/models/plamo3.py
@@ -275,9 +275,9 @@ def __init__(
             self.pre_mlp_norm.weight,
             {"weight_loader": rms_norm_weight_loader(offset=1.0)},
         )
-        self.post_mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_block_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
         set_weight_attrs(
-            self.post_mlp_norm.weight,
+            self.post_block_norm.weight,
             {"weight_loader": rms_norm_weight_loader(offset=1.0 / (5**1.5))},
         )
@@ -305,9 +305,9 @@ def forward(
         # Fully Connected
         hidden_states, residual = self.pre_mlp_norm(hidden_states, residual)
         hidden_states = self.mlp(hidden_states)
-        hidden_states = self.post_mlp_norm(hidden_states)
+        hidden_states = self.post_block_norm(hidden_states)
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states, residual
diff --git a/vllm/model_executor/models/qwen2.py b/vllm/model_executor/models/qwen2.py
index 85b7dfc0bdc8..268fea5a6071 100644
--- a/vllm/model_executor/models/qwen2.py
+++ b/vllm/model_executor/models/qwen2.py
@@ -342,7 +342,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/qwen2_moe.py b/vllm/model_executor/models/qwen2_moe.py
index 0699f0a71358..c05aaf90391a 100644
--- a/vllm/model_executor/models/qwen2_moe.py
+++ b/vllm/model_executor/models/qwen2_moe.py
@@ -375,7 +375,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/qwen3.py b/vllm/model_executor/models/qwen3.py
index f023b0c8c75f..6e9980d5cf98 100644
--- a/vllm/model_executor/models/qwen3.py
+++ b/vllm/model_executor/models/qwen3.py
@@ -252,7 +252,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/qwen3_moe.py b/vllm/model_executor/models/qwen3_moe.py
index 2af8af43044d..bbaf3493f084 100644
--- a/vllm/model_executor/models/qwen3_moe.py
+++ b/vllm/model_executor/models/qwen3_moe.py
@@ -455,7 +455,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/qwen3_next.py b/vllm/model_executor/models/qwen3_next.py
index 7bd3323df58b..855eea6b33cb 100644
--- a/vllm/model_executor/models/qwen3_next.py
+++ b/vllm/model_executor/models/qwen3_next.py
@@ -468,7 +468,7 @@ def forward(
                     self.ffn_layer_scale.to(hidden_states.dtype) + 1
                 )
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual =
apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/seed_oss.py b/vllm/model_executor/models/seed_oss.py
index 5810f45e513c..94aaa4ee6ab3 100644
--- a/vllm/model_executor/models/seed_oss.py
+++ b/vllm/model_executor/models/seed_oss.py
@@ -276,7 +276,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/solar.py b/vllm/model_executor/models/solar.py
index 1d1a6c53b5db..4c30baee06ee 100644
--- a/vllm/model_executor/models/solar.py
+++ b/vllm/model_executor/models/solar.py
@@ -271,7 +271,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/stablelm.py b/vllm/model_executor/models/stablelm.py
index 0ccd2c98aa4b..dd89dbfe4813 100644
--- a/vllm/model_executor/models/stablelm.py
+++ b/vllm/model_executor/models/stablelm.py
@@ -236,7 +236,7 @@ def forward(
         hidden_states = self.mlp(hidden_states)
         hidden_states = residual + hidden_states
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states, residual
diff --git a/vllm/model_executor/models/starcoder2.py b/vllm/model_executor/models/starcoder2.py
index
6d470eda0e51..44074d2b4ded 100644
--- a/vllm/model_executor/models/starcoder2.py
+++ b/vllm/model_executor/models/starcoder2.py
@@ -239,7 +239,7 @@ def forward(
         hidden_states = self.mlp(hidden_states)
         hidden_states = residual + hidden_states
         hidden_states = apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states
diff --git a/vllm/model_executor/models/step1.py b/vllm/model_executor/models/step1.py
index eabc61cba139..5d70209d4865 100644
--- a/vllm/model_executor/models/step1.py
+++ b/vllm/model_executor/models/step1.py
@@ -266,7 +266,7 @@ def forward(
         hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
         residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_ATTN)
         hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
     def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
diff --git a/vllm/model_executor/models/step3_text.py b/vllm/model_executor/models/step3_text.py
index 1a5d9b699dc2..342b38278f5c 100644
--- a/vllm/model_executor/models/step3_text.py
+++ b/vllm/model_executor/models/step3_text.py
@@ -328,7 +328,7 @@ def forward(
             hidden_states = share_output + moe_output
         else:
             hidden_states = self.mlp(hidden_states)
-        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_MLP)
+        residual = apply_layer_steering(self, residual, SteeringHookPoint.POST_BLOCK)
         return hidden_states, residual
diff --git a/vllm/model_executor/models/step3p5.py b/vllm/model_executor/models/step3p5.py
index 06fa322f6972..82fb8c695802 100644
--- a/vllm/model_executor/models/step3p5.py
+++ b/vllm/model_executor/models/step3p5.py
@@ -562,7 +562,7 @@ def forward(
         ffn_output = self.mlp(hidden_states)
         hidden_states = ffn_output + residual
         hidden_states
= apply_layer_steering(
-            self, hidden_states, SteeringHookPoint.POST_MLP
+            self, hidden_states, SteeringHookPoint.POST_BLOCK
         )
         return hidden_states
diff --git a/vllm/sampling_params.py b/vllm/sampling_params.py
index 3f02d47ec315..c37c9afa1aac 100644
--- a/vllm/sampling_params.py
+++ b/vllm/sampling_params.py
@@ -342,7 +342,7 @@ class SamplingParams(
     steering_vectors: SteeringVectorSpec | None = None
     """Base steering vectors applied to both prefill and decode phases.
-    Keyed by hook point name (pre_attn, post_attn, post_mlp), then
+    Keyed by hook point name (pre_attn, post_attn, post_block), then
     layer index. Values are either bare ``list[float]`` (scale=1.0)
     or ``{"vector": [...], "scale": float}``."""
diff --git a/vllm/v1/capture/consumers/filesystem/validation.py b/vllm/v1/capture/consumers/filesystem/validation.py
index 5285686bda15..60816c15c330 100644
--- a/vllm/v1/capture/consumers/filesystem/validation.py
+++ b/vllm/v1/capture/consumers/filesystem/validation.py
@@ -66,7 +66,7 @@
 # so admission rejects them until they are wired; re-add here once
 # implemented.
 _VALID_HOOK_NAMES: frozenset[str] = frozenset(
-    ("pre_attn", "post_attn", "post_mlp")
+    ("pre_attn", "post_attn", "post_block")
 )
 _VALID_POSITION_KINDS: frozenset[str] = frozenset(
@@ -375,7 +375,7 @@ def validate_filesystem_request(
     _structural_validate(raw)
     # 2. Parallelism. The residual hooks captured today (pre_attn /
-    # post_attn / post_mlp) read the residual stream after the
+    # post_attn / post_block) read the residual stream after the
     # tensor-parallel all-reduce / MoE combine, so it is replicated and
     # full-width across the TP and EP planes; data parallelism partitions
     # requests across independent engine cores.
All four axes are
diff --git a/vllm/v1/capture/manager.py b/vllm/v1/capture/manager.py
index 6fe02d134593..a8f9656c6114 100644
--- a/vllm/v1/capture/manager.py
+++ b/vllm/v1/capture/manager.py
@@ -1461,7 +1461,7 @@ def _run_finalize(
                 dummy_key = (
                     VllmInternalRequestId(req_id),
                     0,
-                    "post_mlp",
+                    "post_block",
                 )
                 results[consumer_idx] = CaptureResult(
                     key=dummy_key,
diff --git a/vllm/v1/capture/plan.py b/vllm/v1/capture/plan.py
index 700d755ba570..9ca7164f44c2 100644
--- a/vllm/v1/capture/plan.py
+++ b/vllm/v1/capture/plan.py
@@ -62,7 +62,7 @@ class CapturePositionEntry:
     Attributes:
         request_id: The owning request's id.
         layer: Decoder-layer index.
-        hook: Hook-point name (e.g. ``"post_mlp"``).
+        hook: Hook-point name (e.g. ``"post_block"``).
         logical_pos: Absolute position in the request's token sequence.
         scratch_row: Index within the ``(layer, hook)``'s scratch tensor.
         step_index: Capture-step ordinal for this request.
diff --git a/vllm/v1/capture/types.py b/vllm/v1/capture/types.py
index 75313f703161..94c961f1edd7 100644
--- a/vllm/v1/capture/types.py
+++ b/vllm/v1/capture/types.py
@@ -42,7 +42,7 @@
 HookName = Literal[
     "pre_attn",
     "post_attn",
-    "post_mlp",
+    "post_block",
     "mlp_in",
     "mlp_out",
 ]
diff --git a/vllm/v1/worker/gpu_model_runner.py b/vllm/v1/worker/gpu_model_runner.py
index d2f2d487e3ab..ef52aaa058c3 100644
--- a/vllm/v1/worker/gpu_model_runner.py
+++ b/vllm/v1/worker/gpu_model_runner.py
@@ -558,7 +558,7 @@ def __init__(
         self._capture_step_gate = CaptureStepGate()
         # Capturer-rank gate. The replicated residual hooks
-        # (pre_attn/post_attn/post_mlp) read the residual stream after
+        # (pre_attn/post_attn/post_block) read the residual stream after
         # the tensor-parallel all-reduce / MoE combine, so it is
         # byte-identical across the tensor-parallel group within each
         # (data-parallel, pipeline) cell.
Exactly one rank — TP rank 0
diff --git a/vllm/v1/worker/steering_manager.py b/vllm/v1/worker/steering_manager.py
index 46539fb3f098..78dac194d2ea 100644
--- a/vllm/v1/worker/steering_manager.py
+++ b/vllm/v1/worker/steering_manager.py
@@ -305,7 +305,7 @@ def update_global_vectors(
         """Update cached global vector for a hook point and layer.
         Args:
-            hook_point: Hook point string (e.g. ``"post_mlp"``).
+            hook_point: Hook point string (e.g. ``"post_block"``).
             layer_idx: Layer index.
             vector: The global vector tensor.
             phase: ``"base"``, ``"prefill"``, or ``"decode"``.