feat(tune): surface-B GDN LoRA weight gradients + train_grad_full fields #202
Draft
ohdearquant wants to merge 3 commits into
…parity test

GDN-layer LoRA weight-gradient surface (gdn_backward) plus the full-model trainer path (train_grad_full, train_grad_layer23) and a LoRA forward parity test. Option-gated: the None path is byte-identical to prior behavior.

UNVERIFIED: single/multi-head gradcheck is written but not yet executed on a real model. The surface-B weight grads must pass gradcheck before this merges.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
E2E Parity Report
PASS: all 3 prompts match within first 3 tokens
- gdn_forward_save: unconditionally reset saved LoRA state (rank, scale, h_* caches, weight matrices) before the bound-only populate, so a reused GdnSaved from a prior LoRA call cannot leak stale state into a no-LoRA forward (gdn_backward gates its LoRA-grad path on saved.lora_rank)
- train_grad_full --save: branch on slot_kinds; GDN slots now save their five real modules (in_proj_qkv/z/b/a, out_proj) with loader-matched names and shapes instead of empty q_proj/v_proj; target_modules reflects the modules actually present (GQA / GDN / mixed)
- clippy under --features train-backward (not linted by default CI): factor the 10-slice lora_bound tuple into a LoraBound type alias; drop the unused GdnGrads import and the dead LoraParams::zeros alias

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
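The reset-before-populate fix to gdn_forward_save described above can be illustrated with a small sketch. This is not the PR's code: SavedLora, LoraBinding, save_lora_state and wants_lora_grads are hypothetical stand-ins, and the real GdnSaved carries more state (the h_* caches and weight matrices). The sketch only shows why the reset has to happen unconditionally, before the bound-only populate.

```rust
/// Hypothetical stand-in for the LoRA portion of the saved forward state.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct SavedLora {
    pub lora_rank: usize,
    pub lora_scale: f32,
    pub h_cache: Vec<f32>,
}

/// Hypothetical binding handed to the forward pass when adapters are active.
pub struct LoraBinding<'a> {
    pub rank: usize,
    pub scale: f32,
    pub h: &'a [f32],
}

/// Reset first, then populate only when a binding is present.
pub fn save_lora_state(saved: &mut SavedLora, bound: Option<&LoraBinding<'_>>) {
    // Unconditional reset: a reused buffer never carries state from a prior call.
    saved.lora_rank = 0;
    saved.lora_scale = 0.0;
    saved.h_cache.clear();

    // Bound-only populate.
    if let Some(b) = bound {
        saved.lora_rank = b.rank;
        saved.lora_scale = b.scale;
        saved.h_cache.extend_from_slice(b.h);
    }
}

/// Assumed shape of the gate: the backward takes its LoRA-grad path
/// only when the saved rank is non-zero.
pub fn wants_lora_grads(saved: &SavedLora) -> bool {
    saved.lora_rank > 0
}
```

If the reset lived inside the `if let Some(..)` branch instead, a LoRA call followed by a no-LoRA call on the same buffer would leave the rank non-zero, and the backward gate would fire on stale caches.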
train_grad_full --save now emits all five GDN LoRA modules, but validate_against only recognized in_proj_qkv/in_proj_z/out_proj, so a saved full-GDN adapter failed validation before it could load.

The forward pass already applies in_proj_b/in_proj_a LoRA (gdn_fused.rs) with d_out = linear_num_key_heads, and safetensors parsing + set_lora accept them; the only gap was these two match arms. Add them so the validation contract matches the forward and the trainer's saved output.

Adds a regression test covering all five GDN LoRA modules with config-derived dims, so the contract can't silently drift again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
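A minimal sketch of the validation contract this commit describes, with the two added arms. The names GdnDims, expected_d_out and validate_module are hypothetical and do not come from the PR. Only the in_proj_b/in_proj_a rule (d_out = linear_num_key_heads) is taken from the commit message. The other three output widths are passed in as plain fields because the commit does not state how they are derived.

```rust
/// Hypothetical config-derived dimensions. Only `linear_num_key_heads`
/// is named in the commit message; the other fields are placeholders
/// the caller supplies.
pub struct GdnDims {
    pub linear_num_key_heads: usize,
    pub qkv_out: usize,
    pub z_out: usize,
    pub hidden: usize,
}

/// Expected LoRA output width per GDN module, or None if unrecognized.
pub fn expected_d_out(module: &str, d: &GdnDims) -> Option<usize> {
    match module {
        "in_proj_qkv" => Some(d.qkv_out),
        "in_proj_z" => Some(d.z_out),
        "out_proj" => Some(d.hidden),
        // The two arms the commit adds: both use linear_num_key_heads.
        "in_proj_b" | "in_proj_a" => Some(d.linear_num_key_heads),
        _ => None,
    }
}

/// Reject unknown module names and mismatched output widths.
pub fn validate_module(module: &str, d_out: usize, d: &GdnDims) -> Result<(), String> {
    match expected_d_out(module, d) {
        Some(want) if want == d_out => Ok(()),
        Some(want) => Err(format!("{module}: d_out {d_out} != expected {want}")),
        None => Err(format!("unrecognized GDN LoRA module: {module}")),
    }
}
```

A regression test in the spirit of the one the commit mentions would loop over all five module names and assert that each validates against dims taken from the same config, so adding a module to the trainer without a matching arm fails loudly.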
What
Surface-B: LoRA weight gradients through GDN (gated delta-net) layers, plus the train_grad_full field fixes and a forward-parity test.
Why
Extends the micro-LoRA backward pass to compute weight gradients for GDN-layer LoRA adapters (previously only attention/FFN surfaces had them). The Option-gated None path is byte-identical to before — when GDN LoRA grads are not requested, behavior is unchanged.
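The Option-gating pattern can be sketched on a plain linear layer. This is an illustration under assumptions, not the GDN code: Lora and linear are hypothetical names, and the layout (row-major W of shape d_out x d_in, A of shape rank x d_in, B of shape d_out x rank) is chosen for the example. The point is that the None branch executes exactly the base arithmetic and nothing else, which is what makes the result identical to the pre-LoRA path.

```rust
/// Hypothetical LoRA adapter view: delta = scale * B (A x).
pub struct Lora<'a> {
    pub a: &'a [f32], // rank x d_in, row-major
    pub b: &'a [f32], // d_out x rank, row-major
    pub rank: usize,
    pub scale: f32,
}

/// y = W x, plus the LoRA delta only when an adapter is supplied.
pub fn linear(w: &[f32], x: &[f32], d_out: usize, lora: Option<&Lora<'_>>) -> Vec<f32> {
    let d_in = x.len();
    // Base path: identical instructions whether or not LoRA is requested.
    let mut y: Vec<f32> = (0..d_out)
        .map(|o| (0..d_in).map(|i| w[o * d_in + i] * x[i]).sum::<f32>())
        .collect();

    // Adapter path: skipped entirely for None, so no extra rounding occurs.
    if let Some(l) = lora {
        let h: Vec<f32> = (0..l.rank)
            .map(|r| (0..d_in).map(|i| l.a[r * d_in + i] * x[i]).sum::<f32>())
            .collect();
        for o in 0..d_out {
            let delta: f32 = (0..l.rank).map(|r| l.b[o * l.rank + r] * h[r]).sum();
            y[o] += l.scale * delta;
        }
    }
    y
}
```

Adding a zero delta unconditionally would usually give the same numbers, but gating on the Option avoids relying on that and keeps the no-adapter output bit-for-bit stable.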
Files
- crates/inference/src/attention/gdn_backward.rs (+978/-29, cfg-gated under train-backward)
- crates/tune/src/bin/train_grad_full.rs (+592/-132, includes the E0063 GDN-fields fix)
- crates/tune/src/bin/train_grad_layer23.rs
- crates/inference/src/backward/tape.rs, crates/inference/src/backward/ops.rs
- crates/inference/examples/diff_gdn_layer.rs (required-features = train-backward,f16)
- crates/inference/tests/lora_forward_parity_test.rs (+917, new; compiles under default features via internal cfg guards, CI-safe)
Verification
cargo build --release -p lattice-tune --bin train_grad_full --features train-backward builds clean. The train_grad_full E0063 (missing GDN gradient fields) was the second of the two regressions that motivated the PR-G gate. Built green in the integrated-tree gate.
The numerical gradient check for the GDN weight-grad surface has not been run (tracked: khive task fe42740c). The forward path and the build are verified; backward correctness is unverified. This PR is open for review of its structure and to keep the code on a branch, but the gradcheck must pass before merge. Marking ready-for-review on the non-gradient parts; holding the GDN-grad merge on that run.
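For reviewers unfamiliar with the term, a numerical gradient check compares analytic weight gradients against central finite differences of the loss. The sketch below does this for a plain LoRA linear layer, y = W x + scale * B (A x), with loss 0.5 * ||y||^2. It is not the PR's gradcheck and does not model GDN: LoraLinear, matvec and gradcheck are hypothetical names, and f64 is used here only to keep finite-difference noise low.

```rust
pub struct LoraLinear {
    pub d_in: usize,
    pub d_out: usize,
    pub rank: usize,
    pub scale: f64,
    pub w: Vec<f64>, // d_out x d_in, frozen
    pub a: Vec<f64>, // rank x d_in, trainable
    pub b: Vec<f64>, // d_out x rank, trainable
}

/// Row-major matrix-vector product; the column count is v.len().
fn matvec(m: &[f64], v: &[f64], rows: usize) -> Vec<f64> {
    let cols = v.len();
    (0..rows)
        .map(|r| (0..cols).map(|c| m[r * cols + c] * v[c]).sum::<f64>())
        .collect()
}

impl LoraLinear {
    pub fn forward(&self, x: &[f64]) -> Vec<f64> {
        let h = matvec(&self.a, x, self.rank);
        let base = matvec(&self.w, x, self.d_out);
        let delta = matvec(&self.b, &h, self.d_out);
        base.iter().zip(&delta).map(|(y, d)| y + self.scale * d).collect()
    }

    pub fn loss(&self, x: &[f64]) -> f64 {
        0.5 * self.forward(x).iter().map(|y| y * y).sum::<f64>()
    }

    /// Analytic gradients (dL/dA, dL/dB) for loss = 0.5 * ||y||^2, so dL/dy = y.
    pub fn lora_grads(&self, x: &[f64]) -> (Vec<f64>, Vec<f64>) {
        let h = matvec(&self.a, x, self.rank);
        let dy = self.forward(x);
        let mut da = vec![0.0; self.a.len()];
        let mut db = vec![0.0; self.b.len()];
        for o in 0..self.d_out {
            for r in 0..self.rank {
                db[o * self.rank + r] = self.scale * dy[o] * h[r];
            }
        }
        for r in 0..self.rank {
            let bt_dy: f64 = (0..self.d_out).map(|o| self.b[o * self.rank + r] * dy[o]).sum();
            for i in 0..self.d_in {
                da[r * self.d_in + i] = self.scale * bt_dy * x[i];
            }
        }
        (da, db)
    }
}

/// Largest scaled |numeric - analytic| error over every entry of A and B.
pub fn gradcheck(layer: &mut LoraLinear, x: &[f64], eps: f64) -> f64 {
    let (da, db) = layer.lora_grads(x);
    let mut worst = 0.0f64;
    for i in 0..da.len() {
        let orig = layer.a[i];
        layer.a[i] = orig + eps;
        let up = layer.loss(x);
        layer.a[i] = orig - eps;
        let down = layer.loss(x);
        layer.a[i] = orig; // restore before moving on
        let numeric = (up - down) / (2.0 * eps);
        worst = worst.max((numeric - da[i]).abs() / (1.0 + da[i].abs()));
    }
    for i in 0..db.len() {
        let orig = layer.b[i];
        layer.b[i] = orig + eps;
        let up = layer.loss(x);
        layer.b[i] = orig - eps;
        let down = layer.loss(x);
        layer.b[i] = orig;
        let numeric = (up - down) / (2.0 * eps);
        worst = worst.max((numeric - db[i]).abs() / (1.0 + db[i].abs()));
    }
    worst
}
```

A caller would build a small layer with non-trivial A and B, run gradcheck with a step such as 1e-5, and require the returned error to sit below a tolerance. A real GDN check would additionally have to cover the recurrent state and gating, which this sketch leaves out.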
Bench
Backward/training code is train-backward-gated, off the default decode path. make bench-compare's comparator errored assembling the delta (known two-worktree fragility; base benches ran clean); bench-neutral by construction.
Series
Part of the PR #193 engine-slice (finest split). All engine code lands on main; the macOS app surfaces a subset (Models + Chat) for v0.0.1.