capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs#174
Draft
maxsloef-goodfire wants to merge 3 commits into
Draft
capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs#174maxsloef-goodfire wants to merge 3 commits into
maxsloef-goodfire wants to merge 3 commits into
Conversation
…ange) Token-level rename across code, tests, docs, and examples. The hook fires after the mlp() call in program order, but in deferred-residual architectures the captured/steered `residual` does not yet include the MLP contribution (the add happens in the NEXT layer's fused add+norm) -- so 'post_mlp' misdescribes the dataflow. 'post_block' names the position in the layer, not a dataflow claim. No functional change: same tensors captured/steered as before. The semantic correction (capturing the true block output, residual + hidden) is stacked on top of this PR. Clients sending 'post_mlp' in capture specs must switch to 'post_block' (pre-release branch; no alias kept). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
6ad1795 to
23adc50
Compare
23adc50 to
fe5b6e7
Compare
Introduce apply_block_steering(): captures residual + hidden_states (the true block output, what HF exposes as hidden_states[L+1]) instead of the bare pre-MLP-add residual, which was byte-identical to post_attn's capture in deferred-residual architectures. The sum is computed only when a capture manager is installed on the rank (static for the process lifetime, so torch.compile traces it as a constant branch); steering still applies to `residual` -- identical propagation to the old behavior. The 43 deferred-residual model files are a mechanical one-line call-site conversion + import each (verified: no other changes). The 30 materialized-stream models (opt/gpt_neox/gemma4/...) already hooked the materialized stream and need no change. Validated on qwen3-8b vs HF transformers: post_block[L] == hs[L+1] at mean cos 0.99995 (layers 0-34; HF's final tuple entry is post-final-norm so the last layer is not directly comparable). Stacked on the mechanical post_mlp->post_block rename. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…S fsync docs
Fixes found while validating the capture path end-to-end on B200 (qwen3-8b,
103 requests x 36 layers x all prompt positions, ground-truthed against HF
transformers hidden states):
2. Force the v1 model runner when capture/steering is configured.
Both features are wired into the v1 runner only; the v2 runner
(auto-selected for DEFAULT_V2_MODEL_RUNNER_ARCHITECTURES) silently
no-ops every capture and steer. Now registered in
_get_v2_model_runner_unsupported_features() -> v1 fallback + warning.
3. New: per-request `capture_wait` (chat + completions). Capture writes are
asynchronous (on NFS, ~40s behind the response for a 103x36 per_file
batch); with `capture_wait: true` the response is held until that
request's captures are durable, then capture_results reports them.
Making this work surfaced five latent bugs in the result path -- before
these, capture_results was ALWAYS empty for every request:
a. finalize callback captured the stash dict reference at closure-creation
time, but the per-step drain swaps that dict -- results landed in an
orphaned dict and vanished (gpu_model_runner).
b. results only rode ModelRunnerOutputs, which stop when the engine goes
idle -- finalize (~seconds later) starved. The engine-core idle loop
now ticks (1s) and drains/emits late results batch-level
(EngineCoreOutputs.late_capture_results).
c. the async core client dropped EngineCoreOutputs carrying neither
per-request outputs nor scheduler stats (core_client).
d. the AsyncLLM output handler only processed outputs inside the
per-request slicing loop -- zero outputs meant zero delivery.
e. engine-internal request ids carry a random suffix vs the client-facing
id -- late results are now indexed under both.
Plus: results that cross the engine IPC boundary msgspec-round-trip
Path objects into bytes; the response payload coercer now stringifies
defensively. Requests that opt in but don't wait get
capture_results={status: "pending"} instead of a silent omission.
Validated: 103/103 requests at conc 64 release in ~36s with status "ok"
and all 3,708 files durable at response time; 0 failures.
4. New: boot-time GLOBAL capture spec for the filesystem consumer
(`global_hooks=post_block:all[,global_positions=...]`, CLI-shorthand-safe
layer syntax: all | a-b | i.j.k). Previously global_capture_spec()
returned None, so disk capture required per-request opt-in; dedicated
batch-capture servers can now capture every request via the
CUDA-graph-safe persistent-buffer path (no forced-eager steps).
Validated: 103/103 requests captured with NO per-request capture field,
graphs on, packed layout durable in seconds.
5. Correct the fsync-on-NFS guidance: the writer holds files open in its
LRU FD cache, so close-time COMMIT semantics don't apply -- each
publish-time fsync is a synchronous COMMIT round-trip (measured ~90
files/s). Added a "Durability vs. response timing" docs section: wait
for the expected file set (or use capture_wait), never a fixed sleep.
Test status: 396 passed in the capture/steering suites; the 4 failures in
test_capture_protocol.py / TestPPGeometry are pre-existing on the base
commit (test-stub ParallelConfig drift), identical with and without this
change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
fe5b6e7 to
afacb62
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi! We've been evaluating this branch as the capture/steering backend for our interpretability pipelines at Goodfire (residual-stream extraction + readout over large model batches), and validated it end-to-end on B200 (qwen3-8b, 103 requests x 36 layers x all prompt positions, ground-truthed against HF transformers hidden states). The design is excellent — bounded overload policies, atomic publish, per-request opt-in — and this PR contains the fixes and two small features that came out of that validation.
Fixes
1. (moved to #176: true block-output capture.)
2. Force the v1 model runner when capture/steering is configured.
Capture/steering are wired into
gpu_model_runner.pyonly; the v2 runner (auto-selected forDEFAULT_V2_MODEL_RUNNER_ARCHITECTURES) silently no-ops everything — requests 200, zero files, zero errors. Both features now register in_get_v2_model_runner_unsupported_features().3. Five latent bugs in the capture-result reporting path.
capture_resultswas empty for every request, always:ModelRunnerOutputs, which stop when the engine idles — finalize (~seconds later) starved;EngineCoreOutputscarrying neither outputs nor stats;Also:
Pathobjects msgspec-round-trip into bytes across the engine IPC boundary; the response payload coercer now stringifies defensively.4. fsync-on-NFS docs correction. The writer holds files open in its LRU FD cache, so close-time COMMIT semantics don't apply — each publish-time fsync is a synchronous round-trip (measured ~100 files/s; a 103-request x 36-layer per_file batch takes ~35s to flush). Added a "Durability vs. response timing" section.
Features
5.
capture_wait: true(chat + completions). Holds the response until that request's captures are durable, then reports them incapture_results(requests that opt in without waiting now getstatus: "pending"instead of a silent omission). Validated: 103/103 at concurrency 64 release in ~36s per_file / 4.1s packed, allok, files present at response time.6. Boot-time global capture for the filesystem consumer.
global_capture_spec()previously returnedNone, so disk capture required per-request opt-in.--capture-consumers "filesystem:root=...,global_hooks=post_block:all"now captures every request via the CUDA-graph-safe persistent-buffer path (layer syntax is CLI-shorthand-safe:all|a-b|i.j.k). Validated: 103/103 captured with no per-request field, graphs on.Validation summary
Compute is a non-factor for prefill-heavy capture (0.4s; eager == graphs), and the forced-eager penalty of per-request capture measured zero for this workload shape.
Test status: 396 passed in the capture/steering suites; the 4 failures in
test_capture_protocol.py/TestPPGeometryare pre-existing on the base commit (test-stub ParallelConfig drift), identical with and without this change.Known issues we noticed but did not address here: the
extra_body.capturedocs example only works via the openai-SDK (raw JSON needs top-levelcapture);vllm::capture_residualtriggers a PyTorch aliasing UserWarning that becomes an error in torch 2.12 (needs a.clone()or op-schema fix); the multi-client (DP) routing of late capture results currently delivers to client 0 only.(Split as requested into this stack: #175 carries the mechanical rename.)
🤖 Generated with Claude Code