feat(capture+steering): DeepSeek-V4 mHC capture and steering targets #181
Open
RhizoNymph wants to merge 9 commits into
feat(capture): support DeepSeek-V4 mHC activations as capture targets
…capture # Conflicts: # vllm/entrypoints/openai/chat_completion/serving.py # vllm/entrypoints/openai/completion/serving.py # vllm/v1/capture/__init__.py # vllm/v1/capture/manager.py
feat(steering): mHC multi-stream + sublayer steering for DeepSeek-V4
Adds DeepSeek-V4 manifold-hyperconnection (mHC) activations as both capture targets and steering targets. The capture side landed via #177 and the steering side via #183 (both now merged into `feat/mhc-capture`), so this PR integrates both into `feat/integration`, alongside the prefix-cache capture work already there.

mHC threads `hc_mult` parallel residual streams through each layer and mixes them with manifold-constrained (Sinkhorn) coefficients. The same hook points are exposed for capture (read) and steering (write); the string names are shared, so one identifier captures and steers the same tensor.

## Capture: hook points
New `mhc_*` hooks (DeepSeek-V4 only; rejected on other models):

| Hook | Shape | Notes |
|---|---|---|
| `mhc_streams_pre_attn` | `(n, hc_mult, hidden)` | |
| `mhc_streams_pre_mlp` | `(n, hc_mult, hidden)` | |
| `mhc_streams_final` | `(n, hc_mult, hidden)` | `hc_head` streams at the model tail (last PP rank) |
| `mhc_attn_post_mix` | `(n, hc_mult)` | |
| `mhc_ffn_post_mix` | `(n, hc_mult)` | |
| `mhc_attn_res_mix` | `(n, hc_mult, hc_mult)` | |
| `mhc_ffn_res_mix` | `(n, hc_mult, hc_mult)` | |

Reused standard hooks for V4's single-stream `(n, hidden)` bf16 tensors: `pre_attn` (pre-mixed attn input), `post_attn` (attn output), `mlp_in` (pre-mixed FFN input), `mlp_out` (FFN output). V4 has no single-stream `post_mlp`: the end-of-layer residual is multi-stream and is captured via `mhc_streams_*`.

`mhc_streams_final` is a model-level hook (it fires once at the tail, not per layer); its layer selector is ignored and normalized to the last layer, so callers just write `{"mhc_streams_final": "all"}`.

## Capture: changes required
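As a rough orientation for the per-hook schema this section introduces, here is a minimal sketch. The names `HookSchema` and `build_hook_schema` and the `width` / `dtype` / `logical_shape` fields come from this PR, but the body is an illustrative guess, not the code in the diff: dtypes are plain strings, `_entry` is a hypothetical helper, and the non-mHC hook set (`pre_attn`, `post_attn`, `post_mlp`) is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class HookSchema:
    width: int                      # flat elements per captured row
    dtype: str                      # e.g. "bfloat16" or "float32"
    logical_shape: Tuple[int, ...]  # per-row shape restored on read


def _entry(dtype: str, *shape: int) -> HookSchema:
    # Hypothetical helper: the flat width is the product of the logical shape.
    return HookSchema(math.prod(shape), dtype, tuple(shape))


def build_hook_schema(hidden: int, dtype: str,
                      hc_mult: Optional[int] = None) -> Dict[str, HookSchema]:
    if hc_mult is None:
        # Assumed standard hook set for non-mHC models.
        names = ("pre_attn", "post_attn", "post_mlp")
        return {n: _entry(dtype, hidden) for n in names}

    # V4 single-stream hooks: (hidden,) rows in the model dtype.
    schema = {n: _entry(dtype, hidden)
              for n in ("pre_attn", "post_attn", "mlp_in", "mlp_out")}
    # Multi-stream residuals: hc_mult streams per token, model dtype.
    for n in ("mhc_streams_pre_attn", "mhc_streams_pre_mlp", "mhc_streams_final"):
        schema[n] = _entry(dtype, hc_mult, hidden)
    # Mixing coefficients are fp32 regardless of the model dtype.
    for n in ("mhc_attn_post_mix", "mhc_ffn_post_mix"):
        schema[n] = _entry("float32", hc_mult)
    for n in ("mhc_attn_res_mix", "mhc_ffn_res_mix"):
        schema[n] = _entry("float32", hc_mult, hc_mult)
    return schema
```

A reader can then use `logical_shape` to turn a flat `(n, width)` chunk back into `(n, *logical_shape)`.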
The framework previously assumed every captured row was `(hidden_size,)` in the model dtype. mHC breaks both assumptions (wider streams, fp32 coefficients), so:

- `mhc_*` names added to `HookName` and the mirrored `_HOOK_NAME_TO_ID` table (new ids appended; existing ids unchanged so compiled graphs stay valid).
- `HookSchema(width, dtype, logical_shape)` + `build_hook_schema(hidden, dtype, hc_mult)`, replacing the single `hidden_size`/`model_dtype` assumption. Sourced from `hf_config.hc_mult`; non-mHC models get only the standard wired hooks. Carried on `CaptureContext.hook_schema` and built centrally in `build_capture_context` (admission) + the model runner.
- … `hc_mult*hidden`). Chunk metadata now carries `row_shape` and per-row `positions`.
- `mhc_*`/`mlp_in`/`mlp_out` are accepted only where the model actually taps them. Model-level hooks normalize their layer selector to the tail.
- … `row_shape` (reshape flat `(n, width)` back to e.g. `(n, hc_mult, hidden)`), per-entry `dtype` (so one packed file can mix bf16 streams + fp32 coefficients), and per-row `positions` + a `latest_per_position()` reader helper (for speculative-decode dedup). All fields are additive/optional, so existing captures round-trip unchanged.
- `maybe_capture_residual` taps added to the V4 decoder layer (nvidia+amd, fused-CUDA + native paths) and the model tail, plus `layer_idx` via `extract_layer_index`. Gated so they constant-fold out of the compiled graph when capture is disabled.

## Steering at mHC hooks
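To make the multi-stream steering path in this section concrete, here is a minimal pure-Python sketch of the flatten, gather/add, reshape sequence. The function names `apply_steering` and `apply_layer_steering_streams` are taken from this PR, but both bodies are stand-ins written for illustration: the real path runs a GPU kernel on tensors, while this version uses nested lists, and it assumes row 0 of the table is the zero sentinel.

```python
from typing import List

Row = List[float]


def apply_steering(flat: List[Row], table: List[Row],
                   row_index: List[int]) -> List[Row]:
    """Stand-in for the gather/add: out[t] = flat[t] + table[row_index[t]]."""
    return [[x + d for x, d in zip(row, table[idx])]
            for row, idx in zip(flat, row_index)]


def apply_layer_steering_streams(streams: List[List[Row]], table: List[Row],
                                 row_index: List[int]) -> List[List[Row]]:
    """Steer an (n, hc_mult, hidden) nested list with (hc_mult*hidden)-wide rows."""
    if not streams:
        return []
    hc_mult, hidden = len(streams[0]), len(streams[0][0])
    # Flatten each token's streams into one row of width hc_mult * hidden.
    flat = [[x for stream in tok for x in stream] for tok in streams]
    # The gather/add itself is width-agnostic: it only sees 2-D rows.
    steered = apply_steering(flat, table, row_index)
    # Split every row back into hc_mult streams of length hidden.
    return [[row[s * hidden:(s + 1) * hidden] for s in range(hc_mult)]
            for row in steered]


streams = [[[1.0, 1.0], [2.0, 2.0]],   # token 0: two streams, hidden = 2
           [[3.0, 3.0], [4.0, 4.0]]]   # token 1
table = [[0.0, 0.0, 0.0, 0.0],         # row 0: zero sentinel (assumed)
         [0.5, 0.5, 0.0, 0.0]]         # row 1: steers stream 0 only
out = apply_layer_steering_streams(streams, table, row_index=[1, 0])
```

Token 0 picks up the steering row and token 1 maps to the sentinel, so only the first stream of token 0 changes.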
Steering applies vectors at the same mHC hook points (the write/apply path), distinct from capturing them. The steering tables were hard-wired to one `hidden_size` width in the model dtype; this generalizes them to per-hook width, the steering analog of the capture `HookSchema`.

- `SteeringHookPoint`: `mlp_in`, `mlp_out` (single-stream) and `mhc_streams_pre_attn` / `mhc_streams_pre_mlp` / `mhc_streams_final` (multi-stream).
- `register_steering_buffers` takes a `hook_widths` map and defaults to exactly the three standard single-stream hooks at `hidden_size` (every existing model untouched). DeepSeek-V4 registers single-stream hooks at `hidden` and multi-stream residual hooks at `hc_mult*hidden`; `mhc_streams_final` only on the last layer.
- `apply_layer_steering_streams` flattens `(n, hc_mult, hidden)` → 2-D, runs the existing `apply_steering` gather/add (the kernel's hidden dim is a runtime arg), and reshapes back. The per-token row index, row layout, and `any_active` short-circuit are all width-agnostic.
- `populate_steering_tables` groups active tables by `(width, dtype)` so mixed widths coexist on one layer (a single group in the non-mHC case); the zero sentinel row is cached per width. The kernel warmup covers each distinct width.
- … `_steer_and_capture_mhc`; the fp32 mixing coefficients stay capture-only (routing weights, no steering semantics).
- … `hc_mult*hidden` (per-stream granularity); the packed blob and `SamplingParams` list-of-floats need no change. Coefficient (Sinkhorn-routing) steering is intentionally out of scope.

Backward compatibility: purely additive. Existing capture requests, CLI flags, and `.bin` payloads are unchanged; standard-model capture and steering hook validation are identical to before (steering still registers exactly the three standard hooks by default); the new sidecar fields are optional and read gracefully whether present or absent.

## Validation
… `mhc_streams_pre_attn` and a single-stream-targeted vector at `mhc_streams_pre_mlp` change the output (per-stream granularity confirmed); the per-width kernel warmup runs for both table widths, and CUDA graphs capture cleanly with steering enabled.
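For readers who want to reproduce the per-stream check, one plausible way to build a single-stream-targeted vector is sketched below. `single_stream_vector` is a hypothetical helper, not part of this PR, and it assumes the `hc_mult*hidden` steering row is laid out stream-major (the row-major flattening of `(hc_mult, hidden)`).

```python
from typing import List, Sequence


def single_stream_vector(direction: Sequence[float], stream: int,
                         hc_mult: int) -> List[float]:
    """Place `direction` in one stream's slot and zeros in every other stream."""
    hidden = len(direction)
    if not 0 <= stream < hc_mult:
        raise ValueError(f"stream must be in [0, {hc_mult}), got {stream}")
    vec = [0.0] * (hc_mult * hidden)
    # Assumed layout: stream s occupies elements [s*hidden, (s+1)*hidden).
    vec[stream * hidden:(stream + 1) * hidden] = [float(x) for x in direction]
    return vec


vec = single_stream_vector([1.0, 2.0], stream=1, hc_mult=3)
```

The resulting list has length `hc_mult*hidden` and is nonzero only in the chosen stream's slice, which matches the list-of-floats form the description says `SamplingParams` already accepts.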