Skip to content

capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs#174

Draft
maxsloef-goodfire wants to merge 3 commits into
RhizoNymph:feat/integrationfrom
maxsloef-goodfire:capture-fixes
Draft

capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs#174
maxsloef-goodfire wants to merge 3 commits into
RhizoNymph:feat/integrationfrom
maxsloef-goodfire:capture-fixes

Conversation

@maxsloef-goodfire

@maxsloef-goodfire maxsloef-goodfire commented Jun 11, 2026

Copy link
Copy Markdown

PR 3 of 3 — stacked on #175 (mechanical rename) and #176 (true block-output capture). The combined diff shown includes both until they merge — review the top commit only (capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs): 17 files, +303/−36. I'll rebase as the stack merges.

Hi! We've been evaluating this branch as the capture/steering backend for our interpretability pipelines at Goodfire (residual-stream extraction + readout over large model batches), and validated it end-to-end on B200 (qwen3-8b, 103 requests x 36 layers x all prompt positions, ground-truthed against HF transformers hidden states). The design is excellent — bounded overload policies, atomic publish, per-request opt-in — and this PR contains the fixes and two small features that came out of that validation.

Fixes

1. (moved to #176: true block-output capture.)

2. Force the v1 model runner when capture/steering is configured.
Capture/steering are wired into gpu_model_runner.py only; the v2 runner (auto-selected for DEFAULT_V2_MODEL_RUNNER_ARCHITECTURES) silently no-ops everything — requests 200, zero files, zero errors. Both features now register in _get_v2_model_runner_unsupported_features().

3. Five latent bugs in the capture-result reporting path.
capture_results was empty for every request, always:

  • the finalize callback captured the stash dict by reference at closure-creation time, but the per-step drain swaps that dict — results landed in an orphaned dict;
  • results only rode ModelRunnerOutputs, which stop when the engine idles — finalize (~seconds later) starved;
  • the async core client dropped EngineCoreOutputs carrying neither outputs nor stats;
  • the output handler only processed outputs inside the per-request slicing loop (zero outputs = zero delivery);
  • engine-internal request ids carry a random suffix vs the client-facing id.
    Also: Path objects msgspec-round-trip into bytes across the engine IPC boundary; the response payload coercer now stringifies defensively.

4. fsync-on-NFS docs correction. The writer holds files open in its LRU FD cache, so close-time COMMIT semantics don't apply — each publish-time fsync is a synchronous round-trip (measured ~100 files/s; a 103-request x 36-layer per_file batch takes ~35s to flush). Added a "Durability vs. response timing" section.

Features

5. capture_wait: true (chat + completions). Holds the response until that request's captures are durable, then reports them in capture_results (requests that opt in without waiting now get status: "pending" instead of a silent omission). Validated: 103/103 at concurrency 64 release in ~36s per_file / 4.1s packed, all ok, files present at response time.

6. Boot-time global capture for the filesystem consumer. global_capture_spec() previously returned None, so disk capture required per-request opt-in. --capture-consumers "filesystem:root=...,global_hooks=post_block:all" now captures every request via the CUDA-graph-safe persistent-buffer path (layer syntax is CLI-shorthand-safe: all | a-b | i.j.k). Validated: 103/103 captured with no per-request field, graphs on.

Validation summary

configuration end-to-end to durable integrity
packed + capture_wait 4.1 s 100%, receipted
packed 2.5 s 100%
per_file (NFS) 12.4 s 100%

Compute is a non-factor for prefill-heavy capture (0.4s; eager == graphs), and the forced-eager penalty of per-request capture measured zero for this workload shape.

Test status: 396 passed in the capture/steering suites; the 4 failures in test_capture_protocol.py / TestPPGeometry are pre-existing on the base commit (test-stub ParallelConfig drift), identical with and without this change.

Known issues we noticed but did not address here: the extra_body.capture docs example only works via the openai-SDK (raw JSON needs top-level capture); vllm::capture_residual triggers a PyTorch aliasing UserWarning that becomes an error in torch 2.12 (needs a .clone() or op-schema fix); the multi-client (DP) routing of late capture results currently delivers to client 0 only.

(Split as requested into this stack: #175 carries the mechanical rename.)

🤖 Generated with Claude Code

@maxsloef-goodfire maxsloef-goodfire marked this pull request as draft June 11, 2026 20:24
…ange)

Token-level rename across code, tests, docs, and examples. The hook fires
after the mlp() call in program order, but in deferred-residual
architectures the captured/steered `residual` does not yet include the
MLP contribution (the add happens in the NEXT layer's fused add+norm) --
so 'post_mlp' misdescribes the dataflow. 'post_block' names the position
in the layer, not a dataflow claim.

No functional change: same tensors captured/steered as before. The
semantic correction (capturing the true block output, residual + hidden)
is stacked on top of this PR. Clients sending 'post_mlp' in capture specs
must switch to 'post_block' (pre-release branch; no alias kept).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@maxsloef-goodfire maxsloef-goodfire changed the title capture: true block-output hook, v2-runner guard, capture_wait, global filesystem capture capture: v2-runner guard, capture_wait, global filesystem capture, NFS fsync docs Jun 11, 2026
maxsloef-goodfire and others added 2 commits June 11, 2026 20:51
Introduce apply_block_steering(): captures residual + hidden_states (the
true block output, what HF exposes as hidden_states[L+1]) instead of the
bare pre-MLP-add residual, which was byte-identical to post_attn's
capture in deferred-residual architectures. The sum is computed only when
a capture manager is installed on the rank (static for the process
lifetime, so torch.compile traces it as a constant branch); steering
still applies to `residual` -- identical propagation to the old
behavior.

The 43 deferred-residual model files are a mechanical one-line call-site
conversion + import each (verified: no other changes). The 30
materialized-stream models (opt/gpt_neox/gemma4/...) already hooked the
materialized stream and need no change.

Validated on qwen3-8b vs HF transformers: post_block[L] == hs[L+1] at
mean cos 0.99995 (layers 0-34; HF's final tuple entry is post-final-norm
so the last layer is not directly comparable).

Stacked on the mechanical post_mlp->post_block rename.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…S fsync docs

Fixes found while validating the capture path end-to-end on B200 (qwen3-8b,
103 requests x 36 layers x all prompt positions, ground-truthed against HF
transformers hidden states):

2. Force the v1 model runner when capture/steering is configured.
   Both features are wired into the v1 runner only; the v2 runner
   (auto-selected for DEFAULT_V2_MODEL_RUNNER_ARCHITECTURES) silently
   no-ops every capture and steer. Now registered in
   _get_v2_model_runner_unsupported_features() -> v1 fallback + warning.

3. New: per-request `capture_wait` (chat + completions). Capture writes are
   asynchronous (on NFS, ~40s behind the response for a 103x36 per_file
   batch); with `capture_wait: true` the response is held until that
   request's captures are durable, then capture_results reports them.
   Making this work surfaced five latent bugs in the result path -- before
   these, capture_results was ALWAYS empty for every request:
   a. finalize callback captured the stash dict reference at closure-creation
      time, but the per-step drain swaps that dict -- results landed in an
      orphaned dict and vanished (gpu_model_runner).
   b. results only rode ModelRunnerOutputs, which stop when the engine goes
      idle -- finalize (~seconds later) starved. The engine-core idle loop
      now ticks (1s) and drains/emits late results batch-level
      (EngineCoreOutputs.late_capture_results).
   c. the async core client dropped EngineCoreOutputs carrying neither
      per-request outputs nor scheduler stats (core_client).
   d. the AsyncLLM output handler only processed outputs inside the
      per-request slicing loop -- zero outputs meant zero delivery.
   e. engine-internal request ids carry a random suffix vs the client-facing
      id -- late results are now indexed under both.
   Plus: results that cross the engine IPC boundary msgspec-round-trip
   Path objects into bytes; the response payload coercer now stringifies
   defensively. Requests that opt in but don't wait get
   capture_results={status: "pending"} instead of a silent omission.
   Validated: 103/103 requests at conc 64 release in ~36s with status "ok"
   and all 3,708 files durable at response time; 0 failures.

4. New: boot-time GLOBAL capture spec for the filesystem consumer
   (`global_hooks=post_block:all[,global_positions=...]`, CLI-shorthand-safe
   layer syntax: all | a-b | i.j.k). Previously global_capture_spec()
   returned None, so disk capture required per-request opt-in; dedicated
   batch-capture servers can now capture every request via the
   CUDA-graph-safe persistent-buffer path (no forced-eager steps).
   Validated: 103/103 requests captured with NO per-request capture field,
   graphs on, packed layout durable in seconds.

5. Correct the fsync-on-NFS guidance: the writer holds files open in its
   LRU FD cache, so close-time COMMIT semantics don't apply -- each
   publish-time fsync is a synchronous COMMIT round-trip (measured ~90
   files/s). Added a "Durability vs. response timing" docs section: wait
   for the expected file set (or use capture_wait), never a fixed sleep.

Test status: 396 passed in the capture/steering suites; the 4 failures in
test_capture_protocol.py / TestPPGeometry are pre-existing on the base
commit (test-stub ParallelConfig drift), identical with and without this
change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant