capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs by maxsloef-goodfire · Pull Request #174 · RhizoNymph/vllm

maxsloef-goodfire · 2026-06-11T20:19:54Z

PR 3 of 3 — stacked on #175 (mechanical rename) and #176 (true block-output capture). The combined diff shown includes both until they merge — review the top commit only (capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs): 17 files, +303/−36. I'll rebase as the stack merges.

Hi! We've been evaluating this branch as the capture/steering backend for our interpretability pipelines at Goodfire (residual-stream extraction + readout over large model batches), and validated it end-to-end on B200 (qwen3-8b, 103 requests x 36 layers x all prompt positions, ground-truthed against HF transformers hidden states). The design is excellent — bounded overload policies, atomic publish, per-request opt-in — and this PR contains the fixes and two small features that came out of that validation.

Fixes

1. (moved to #176: true block-output capture.)

2. Force the v1 model runner when capture/steering is configured.
Capture/steering are wired into gpu_model_runner.py only; the v2 runner (auto-selected for DEFAULT_V2_MODEL_RUNNER_ARCHITECTURES) silently no-ops everything — requests 200, zero files, zero errors. Both features now register in _get_v2_model_runner_unsupported_features().

3. Five latent bugs in the capture-result reporting path.
capture_results was empty for every request, always:

the finalize callback captured the stash dict by reference at closure-creation time, but the per-step drain swaps that dict — results landed in an orphaned dict;
results only rode ModelRunnerOutputs, which stop when the engine idles — finalize (~seconds later) starved;
the async core client dropped EngineCoreOutputs carrying neither outputs nor stats;
the output handler only processed outputs inside the per-request slicing loop (zero outputs = zero delivery);
engine-internal request ids carry a random suffix vs the client-facing id.
Also: Path objects msgspec-round-trip into bytes across the engine IPC boundary; the response payload coercer now stringifies defensively.

4. fsync-on-NFS docs correction. The writer holds files open in its LRU FD cache, so close-time COMMIT semantics don't apply — each publish-time fsync is a synchronous round-trip (measured ~100 files/s; a 103-request x 36-layer per_file batch takes ~35s to flush). Added a "Durability vs. response timing" section.

Features

5. capture_wait: true (chat + completions). Holds the response until that request's captures are durable, then reports them in capture_results (requests that opt in without waiting now get status: "pending" instead of a silent omission). Validated: 103/103 at concurrency 64 release in ~36s per_file / 4.1s packed, all ok, files present at response time.

6. Boot-time global capture for the filesystem consumer. global_capture_spec() previously returned None, so disk capture required per-request opt-in. --capture-consumers "filesystem:root=...,global_hooks=post_block:all" now captures every request via the CUDA-graph-safe persistent-buffer path (layer syntax is CLI-shorthand-safe: all | a-b | i.j.k). Validated: 103/103 captured with no per-request field, graphs on.

Validation summary

configuration	end-to-end to durable	integrity
packed + capture_wait	4.1 s	100%, receipted
packed	2.5 s	100%
per_file (NFS)	12.4 s	100%

Compute is a non-factor for prefill-heavy capture (0.4s; eager == graphs), and the forced-eager penalty of per-request capture measured zero for this workload shape.

Test status: 396 passed in the capture/steering suites; the 4 failures in test_capture_protocol.py / TestPPGeometry are pre-existing on the base commit (test-stub ParallelConfig drift), identical with and without this change.

Known issues we noticed but did not address here: the extra_body.capture docs example only works via the openai-SDK (raw JSON needs top-level capture); vllm::capture_residual triggers a PyTorch aliasing UserWarning that becomes an error in torch 2.12 (needs a .clone() or op-schema fix); the multi-client (DP) routing of late capture results currently delivers to client 0 only.

(Split as requested into this stack: #175 carries the mechanical rename.)

🤖 Generated with Claude Code

…ange) Token-level rename across code, tests, docs, and examples. The hook fires after the mlp() call in program order, but in deferred-residual architectures the captured/steered `residual` does not yet include the MLP contribution (the add happens in the NEXT layer's fused add+norm) -- so 'post_mlp' misdescribes the dataflow. 'post_block' names the position in the layer, not a dataflow claim. No functional change: same tensors captured/steered as before. The semantic correction (capturing the true block output, residual + hidden) is stacked on top of this PR. Clients sending 'post_mlp' in capture specs must switch to 'post_block' (pre-release branch; no alias kept). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Introduce apply_block_steering(): captures residual + hidden_states (the true block output, what HF exposes as hidden_states[L+1]) instead of the bare pre-MLP-add residual, which was byte-identical to post_attn's capture in deferred-residual architectures. The sum is computed only when a capture manager is installed on the rank (static for the process lifetime, so torch.compile traces it as a constant branch); steering still applies to `residual` -- identical propagation to the old behavior. The 43 deferred-residual model files are a mechanical one-line call-site conversion + import each (verified: no other changes). The 30 materialized-stream models (opt/gpt_neox/gemma4/...) already hooked the materialized stream and need no change. Validated on qwen3-8b vs HF transformers: post_block[L] == hs[L+1] at mean cos 0.99995 (layers 0-34; HF's final tuple entry is post-final-norm so the last layer is not directly comparable). Stacked on the mechanical post_mlp->post_block rename. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…S fsync docs Fixes found while validating the capture path end-to-end on B200 (qwen3-8b, 103 requests x 36 layers x all prompt positions, ground-truthed against HF transformers hidden states): 2. Force the v1 model runner when capture/steering is configured. Both features are wired into the v1 runner only; the v2 runner (auto-selected for DEFAULT_V2_MODEL_RUNNER_ARCHITECTURES) silently no-ops every capture and steer. Now registered in _get_v2_model_runner_unsupported_features() -> v1 fallback + warning. 3. New: per-request `capture_wait` (chat + completions). Capture writes are asynchronous (on NFS, ~40s behind the response for a 103x36 per_file batch); with `capture_wait: true` the response is held until that request's captures are durable, then capture_results reports them. Making this work surfaced five latent bugs in the result path -- before these, capture_results was ALWAYS empty for every request: a. finalize callback captured the stash dict reference at closure-creation time, but the per-step drain swaps that dict -- results landed in an orphaned dict and vanished (gpu_model_runner). b. results only rode ModelRunnerOutputs, which stop when the engine goes idle -- finalize (~seconds later) starved. The engine-core idle loop now ticks (1s) and drains/emits late results batch-level (EngineCoreOutputs.late_capture_results). c. the async core client dropped EngineCoreOutputs carrying neither per-request outputs nor scheduler stats (core_client). d. the AsyncLLM output handler only processed outputs inside the per-request slicing loop -- zero outputs meant zero delivery. e. engine-internal request ids carry a random suffix vs the client-facing id -- late results are now indexed under both. Plus: results that cross the engine IPC boundary msgspec-round-trip Path objects into bytes; the response payload coercer now stringifies defensively. Requests that opt in but don't wait get capture_results={status: "pending"} instead of a silent omission. Validated: 103/103 requests at conc 64 release in ~36s with status "ok" and all 3,708 files durable at response time; 0 failures. 4. New: boot-time GLOBAL capture spec for the filesystem consumer (`global_hooks=post_block:all[,global_positions=...]`, CLI-shorthand-safe layer syntax: all | a-b | i.j.k). Previously global_capture_spec() returned None, so disk capture required per-request opt-in; dedicated batch-capture servers can now capture every request via the CUDA-graph-safe persistent-buffer path (no forced-eager steps). Validated: 103/103 requests captured with NO per-request capture field, graphs on, packed layout durable in seconds. 5. Correct the fsync-on-NFS guidance: the writer holds files open in its LRU FD cache, so close-time COMMIT semantics don't apply -- each publish-time fsync is a synchronous COMMIT round-trip (measured ~90 files/s). Added a "Durability vs. response timing" docs section: wait for the expected file set (or use capture_wait), never a fixed sleep. Test status: 396 passed in the capture/steering suites; the 4 failures in test_capture_protocol.py / TestPPGeometry are pre-existing on the base commit (test-stub ParallelConfig drift), identical with and without this change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

maxsloef-goodfire marked this pull request as draft June 11, 2026 20:24

maxsloef-goodfire force-pushed the capture-fixes branch from 6ad1795 to 23adc50 Compare June 11, 2026 20:28

maxsloef-goodfire mentioned this pull request Jun 11, 2026

rename: post_mlp -> post_block hook point (mechanical, no behavior change) #175

Draft

maxsloef-goodfire force-pushed the capture-fixes branch from 23adc50 to fe5b6e7 Compare June 11, 2026 20:42

maxsloef-goodfire mentioned this pull request Jun 11, 2026

capture: true block-output at the post_block hook #176

Draft

maxsloef-goodfire changed the title ~~capture: true block-output hook, v2-runner guard, capture_wait, global filesystem capture~~ capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs Jun 11, 2026

maxsloef-goodfire and others added 2 commits June 11, 2026 20:51

maxsloef-goodfire force-pushed the capture-fixes branch from fe5b6e7 to afacb62 Compare June 11, 2026 20:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs#174

capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs#174
maxsloef-goodfire wants to merge 3 commits into
RhizoNymph:feat/integrationfrom
maxsloef-goodfire:capture-fixes

maxsloef-goodfire commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

maxsloef-goodfire commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixes

Features

Validation summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

maxsloef-goodfire commented Jun 11, 2026 •

edited

Loading