feat: LoRA backward training (GQA+GDN) + LatticeStudio macOS instrument app by ohdearquant · Pull Request #193 · ohdearquant/lattice

ohdearquant · 2026-06-21T13:46:28Z

Summary

Two threads developed together on this branch:

LoRA backward training engine (crates/inference, crates/tune) — extends LoRA fine-tuning from a single lm_head trainer to a full-depth multi-layer backward tape through Qwen3.5's gated GQA and GatedDeltaNet layers, with train_grad_full --json metrics + PEFT adapter save.
LatticeStudio macOS instrument app (apps/macos) — a native SwiftUI panel that drives the lattice CLI binaries (train / quantize / generate) over a line-delimited @@lattice {json} event protocol. No in-process ML, no Python.

63 files, +18,547 / −52.

Thread 1 — LoRA backward (inference + tune)

Surface-A (verified): exact CPU backward through the 6 GQA layers + lm_head LoRA trainer. Materialised GQA forward+backward corrected for Qwen3.5 gated attention (gate + interleave + shifted q-norm).
RMSNorm correctness (verified): backward tape and VJP now use shifted-gamma RMSNorm to match the real Qwen3.5 weights (84029ae8). A real-model forward-parity gate (6d2da6e2) diffs the loaded model's layer-23 attention output against the trainer's materialised forward: TEST 4 max abs diff 4.77e-6, under the 1e-3 gate.
Surface-B (UNVERIFIED — no correctness claim): f34ed3b1 adds LoRA weight gradients for the 18 GatedDeltaNet layers (the 5 GDN projections in_proj_qkv/z/b/a + out_proj). It is Option-gated to be byte-identical to the base forward when any param is None or rank==0, so it does not perturb inference. Weight-grad correctness is pending the AM gradcheck run (heavy real-model train_grad_full --gradcheck on Qwen3.5-0.8B + the new synthetic single/multi-head unit gradchecks). This PR makes no correctness claim for surface-B.
Streaming generation: UTF-8-safe incremental detok (ac223c44) — flushes only complete codepoints per token so CJK/emoji don't stream as replacement chars.

Thread 2 — LatticeStudio (apps/macos)

A 6-screen instrument panel (MODELS / TRAIN / QUANTIZE / CHAT / DATA / RUNS). The substance here is the honest-nil data discipline: every surfaced field is read from a real source (config.json, adapter metadata, run archive) or rendered as "—"; nothing is fabricated or defaulted. Verified per-slice against a Swift-oracle replica + swift build.

MODELS: per-model shape readout — params, layer split (18 GDN · 6 GQA), hidden/vocab, context length, attn/kv/head-dim, GDN key/value heads, FFN intermediate size — all honest-nil from the nested-resolved config.
TRAIN: live strip-chart, NLL-Δ-from-base from step 0, PID self-registration + orphan reaping.
QUANTIZE: Q4 / QuaRot with before/after MB + forward-equivalence max-abs error.
CHAT: live streaming generation via generate_lora --json.
DATA: dataset listing with real first-row schema column.
RUNS: run archive persisted across launches (runs.json).

Packaging: apps/macos/scripts/package-app.sh produces an ad-hoc-codesigned .app + dmg/zip with the 6 engine binaries bundled.

Visual design is being iterated — this lands the functional + data-correct baseline; a visual redesign pass is in progress separately.

Gates / testing

swift build green (the reliable Swift gate; SourceKit single-file diagnostics are false positives across this multi-file module).
cargo clippy clean on the touched crates (lattice-inference, lattice-tune), no library panics.
e2e-parity.yml will trigger (touches crates/inference/src/). Surface-B is Option-gated byte-identical to the base forward, so greedy-token parity is expected to hold — this PR is the first CI confirmation of that.

Notes

Branch is 1 commit behind origin/main (the e2e-parity CI merge, ci: replace flaky bench-regression with e2e-parity gate #192). Not rebased — GitHub computes the PR diff from the merge-base. Rebase at merge time if branch-protection requires up-to-date.
The untracked adapters/, data/, scripts/, .claude/, uv.lock paths are intentionally not included.

Follow-ups (tracked)

AM gradcheck run for surface-B (Ocean/AM-gated) — gates any MLX on-par claim.
Wire trainer-TBV (real-NLL vs assembled-chain) into an automated CI gate.
apps/macos polish: adapter rank/alpha parse, DataScreen streaming read, memoryUsage timer; visual redesign.

🤖 Generated with Claude Code

…trainer

Reverse-mode autodiff foundation for real-gradient (Adam) LoRA training on Qwen3.5-0.8B. Two verified milestones:

Milestone-1: lm_head LoRA trainer (crates/tune/src/bin/train_grad.rs). Caches real final hidden H_t via forward_final_hidden (24-layer forward), runs exact-gradient Adam on logits_lora = base + scale·B·(A·H_t). base NLL 5.1757 → 0.6103 over 150 monotonic steps (6 samples, rank 8). TBV: cached base_logits vs live model diff 6.24e-4.

Milestone-2: backward through a full GQA attention layer (backward/):
- ops.rs: linear/lora/rmsnorm/rope/swiglu/cross_entropy VJPs, all FD <1.5e-4
- attention_gqa.rs: materialised causal GQA backward (q/v LoRA + RoPE + QK-norm + softmax + o_proj), end-to-end FD-verified.

Two structural bugs found+fixed in gqa_backward via de-vacuumed gradcheck (nonzero B; B=0 makes grad_A identically 0 and tests nothing):
1. Q/K RMSNorm backward used the post-norm value as its own input instead of the pre-norm projection. Fix: cache q_raw/k_raw before in-place norm.
2. K/V gradient accumulated per-query-position and read back only the diagonal slice, dropping all off-diagonal causal (t>s) contributions. Fix: two-phase, global d_k/d_v accumulation, then per-position proj.

gqa_lora_gradcheck: grad_A_q 1.39 → 1.1e-3, end_to_end now passes (<0.1).

Feature-gated behind `train-backward`. Base inference path untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
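The "B=0 tests nothing" remark can be illustrated on a toy LoRA branch y = scale·B·(A·x). The sketch below is not the crate's `lora_vjp`; the function names `lora_grad_a` and `lora_loss` are hypothetical, and the shapes assume the PEFT layout (A is [rank][d_in], B is [d_out][rank]). The gradient with respect to A is scale·(Bᵀg)·xᵀ, so it vanishes whenever B is zero and a finite-difference comparison then passes trivially.

```rust
/// Gradient of L = sum_i g[i] * y[i] with respect to A, where y = scale * B * (A * x).
/// Hypothetical helper for illustration; A is [rank][d_in], B is [d_out][rank].
fn lora_grad_a(b: &[Vec<f64>], x: &[f64], g: &[f64], scale: f64) -> Vec<Vec<f64>> {
    let rank = b[0].len();
    // u = B^T g, one entry per LoRA rank.
    let u: Vec<f64> = (0..rank)
        .map(|r| b.iter().zip(g).map(|(row, gi)| row[r] * gi).sum::<f64>())
        .collect();
    // grad_A[r][j] = scale * u[r] * x[j]; identically zero when B is zero.
    u.iter()
        .map(|ur| x.iter().map(|xj| scale * ur * xj).collect())
        .collect()
}

/// Scalar loss used as the finite-difference reference.
fn lora_loss(a: &[Vec<f64>], b: &[Vec<f64>], x: &[f64], g: &[f64], scale: f64) -> f64 {
    // h = A x
    let h: Vec<f64> = a
        .iter()
        .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f64>())
        .collect();
    // L = sum_i g_i * scale * (B h)_i
    b.iter()
        .zip(g)
        .map(|(row, gi)| gi * scale * row.iter().zip(&h).map(|(w, hi)| w * hi).sum::<f64>())
        .sum()
}
```

With B initialised to zero (the usual LoRA init), both the analytic and the finite-difference gradient of A are zero, so a gradcheck run at init cannot expose a wrong backward; perturbing B first makes the comparison meaningful.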

… gated attention

Two forward-matches-real-model bugs, both invisible to the self-consistent backward gradcheck (which only proves backward-matches-forward for whatever convention the materialised forward happens to use):
1. Per-head [Q|gate] interleave. q_proj emits 2*q_dim as per-head [Q_head|gate_head] blocks; deinterleave them, sigmoid-gate the attention context before o_proj, and apply LoRA on the full 2*q_dim. The old path modeled an ungated Llama-style layer.
2. q_norm/k_norm shift. Qwen3.5 RMSNorm is shifted (1 + gamma) like qwen35_rms_norm; the materialised forward+backward used plain gamma.

Added diff_attn_layer23 example: checks the materialised forward against the real layer-23 attention (capture_attn_io tap + rope_cos_sin_tables accessors, cfg(train-backward)). max-diff 3.58e-6 vs real model (<1e-3 gate). All 11 backward gradchecks still green (self-consistency preserved).

The capture tap and accessors also feed the upcoming layer-23 LoRA trainer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
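A minimal sketch of the per-head [Q_head|gate_head] deinterleave and the sigmoid gating described in item 1. The function names and the flat single-token layout are assumptions for illustration, not the crate's actual code.

```rust
/// Split one q_proj output row laid out as per-head [Q_head | gate_head] blocks
/// into contiguous Q and gate vectors, each num_heads * head_dim long.
/// Hypothetical helper; the real layer works on whole sequences.
fn deinterleave_q_gate(qg: &[f32], num_heads: usize, head_dim: usize) -> (Vec<f32>, Vec<f32>) {
    assert_eq!(qg.len(), 2 * num_heads * head_dim);
    let mut q = Vec::with_capacity(num_heads * head_dim);
    let mut gate = Vec::with_capacity(num_heads * head_dim);
    for block in qg.chunks_exact(2 * head_dim) {
        // First half of each per-head block is Q, second half is the gate.
        let (q_head, gate_head) = block.split_at(head_dim);
        q.extend_from_slice(q_head);
        gate.extend_from_slice(gate_head);
    }
    (q, gate)
}

/// Sigmoid-gate the attention context elementwise (applied before o_proj).
fn gate_context(ctx: &[f32], gate: &[f32]) -> Vec<f32> {
    ctx.iter()
        .zip(gate)
        .map(|(c, g)| c * (1.0 / (1.0 + (-g).exp())))
        .collect()
}
```

Treating the 2*q_dim output as [all Q | all gate] instead of per-head blocks would still produce a self-consistent forward and backward, which is why only a comparison against the real model can catch the mismatch.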

…ted GQA

Milestone-2: rank-r LoRA on layer 23's q_proj/v_proj (top GQA layer, no GDN backward). Gradient flows the full block + head: CE -> lm_head VJP -> final_norm bwd -> SwiGLU bwd -> post_attn_norm bwd -> gated GQA(+LoRA) bwd. Frozen prefix (layers 0-22) is captured once per sample via capture_attn_io (h_in = residual entering layer 23); only the four LoRA factors move.

Qwen3.5 RMSNorm is shifted (x*inv*(1+gamma)); rms_norm_forward/rmsnorm_backward use plain gamma, so pre/post/final norms get (1+gamma) precomputed weights (q_norm/k_norm are shifted inside gqa_forward_with_cache, stay raw).

Verified on real Qwen3.5-0.8B:
- TBV: zero-LoRA chain NLL == model.compute_token_nlls, diff 4.77e-7; the whole chain (shifted norms, gate, FFN, lm_head) matches the real model.
- Training: base NLL 5.34 -> 3.67 in 10 steps (-1.67, monotone), real gradients through the corrected gated attention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…orward Merge Agent B's reverse-mode GatedDeltaNet backward (gdn_forward_save + gdn_backward, dx-only VJP) onto the gated-GQA + layer-23-trainer branch. gdn_backward is gated behind train-backward. Close the vacuous-gradcheck gap for GDN the same way diff_attn_layer23 did for GQA: a differential test (examples/diff_gdn_layer.rs) checks gdn_forward_save against the real model's gated_delta_net_step_fused at a linear-attention layer. Verified at GDN layers 0/2/22 — max-diff 5.96e-8/1.79e-7/3.58e-7, all far below the 1e-3 gate. So gdn_backward is the true VJP of the REAL GDN forward, not just self-consistent. Add gdn_layer_weights(layer) accessor (GatedDeltaNetWeights + input_layernorm) mirroring gqa_layer_weights, for the diff test and the upcoming full-depth backward tape that propagates dx through the 18 frozen GDN layers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ining

Assembles reverse-mode backprop across a layer window [first_layer..=23] of Qwen3.5's hybrid stack: GQA layers carry trainable q/v LoRA, GDN layers are frozen and contribute dx-only via gdn_backward. Backprop threads each layer's residual structure (pre-norm + mixer + post-norm + SwiGLU FFN), propagating dL/dx down through frozen GDN layers into lower GQA layers' LoRA gradients.

Verification (manual gates, require on-disk Qwen3.5-0.8B):
- TBV: zero-LoRA chain NLL == model.compute_token_nlls (5.15182, diff 4.8e-7); every layer's shifted norms, GQA/GDN mixer, FFN, and head chain exactly.
- Gradcheck: min-over-eps central FD vs analytic on all 8 LoRA arrays (2 GQA slots, layers 19+23, dx through 3 frozen GDN layers). Worst rel-err 5.06e-3 (gate 2e-2). Min-over-eps removes the FD step-choice roundoff that masked correct grads at a fixed step (b_q/b_v 2.3e-2 -> 1.7e-3 once the step is chosen per-entry).
- Descent: real-gradient Adam drives train NLL 5.34 -> 0.008 in 30 steps through the assembled tape (overfit on 2 samples; usability/correctness demo, not a held-out eval).

inference: extend gdn_layer_weights to a 6-tuple (mixer + input/post norms + Dense FFN), mirroring gqa_layer_weights, so the tape runs each GDN layer's own FFN block. diff_gdn_layer updated to the new destructure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
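The min-over-eps idea can be sketched for a single scalar entry as follows. This is an illustration under assumptions: `central_fd` and `min_rel_err_over_eps` are hypothetical names, the candidate step list is arbitrary, and the real gradcheck perturbs LoRA array entries rather than a scalar function.

```rust
/// Central finite difference of `f` at `x` with step `eps`.
fn central_fd(f: impl Fn(f64) -> f64, x: f64, eps: f64) -> f64 {
    (f(x + eps) - f(x - eps)) / (2.0 * eps)
}

/// Smallest relative error between an analytic derivative and central FD
/// over a list of candidate step sizes (hypothetical helper).
fn min_rel_err_over_eps(f: impl Fn(f64) -> f64, x: f64, analytic: f64, steps: &[f64]) -> f64 {
    steps
        .iter()
        .map(|&eps| {
            let fd = central_fd(&f, x, eps);
            // Guard the denominator so a zero gradient does not divide by zero.
            (fd - analytic).abs() / analytic.abs().max(fd.abs()).max(1e-12)
        })
        .fold(f64::INFINITY, f64::min)
}
```

A large step suffers truncation error and a tiny step suffers roundoff, so no single fixed step is good for every entry; taking the minimum over several steps reports the error at whichever step suits that entry best. A correct gradient then scores well, while a wrong gradient stays far from the finite difference at every step.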

Adds --max-valid (default 16): load valid.jsonl, build frozen-prefix caches via the same machinery as train, report held-out NLL alongside train NLL at each log step. The honest learning signal: train falling while held-out also falls is learning; train falling while held-out rises is memorisation.

Factors the per-sample cache build into build_caches() (shared by train and valid). No change to the tape forward/backward.

Run (16 train / 12 valid, layers 19-23, lr 1e-3, 40 steps):
  step  0  train 5.0067  held-out 5.0513
  step 10  train 4.5042  held-out 4.7757  best held-out (-0.275)
  step 20  train 3.7134  held-out 4.7630
  step 30  train 3.0088  held-out 4.9518
  step 40  train 2.6736  held-out 4.9429

Train descends cleanly through the multi-layer GQA+GDN tape; held-out bottoms at step 10 then rises, so the eval correctly exposes overfitting on 16 samples (a real on-par run needs early stop / more data / lower lr). Replaces the prior 2-sample demo which could not separate learning from memorisation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two correctness fixes to the real-gradient LoRA trainer, verified on-par with mlx_lm on Qwen3.5-0.8B held-out generalization.
- optimizer.rs: Adam bias-correction timestep is now per-key (was a single shared counter advanced per step() call). LoRA updates 8 tensors per optimiser step, so the shared counter over-advanced t for every tensor, inflating m̂/√v̂ and over-stepping early updates (defeating Adam's warmup). Per-key t matches MLX/PyTorch, which key the timestep per parameter.
- train_grad_full.rs: lora_a init 0.02 -> 1/sqrt(in_features) (0.03125 for hidden=1024), matching mlx_lm/tuner/lora.py.

Held-out NLL over 30 steps (true, measured via default_loss on saved adapters), q/v LoRA on GQA layers [19,23], rank8 scale20 lr1e-4 batch-1 seq128, 16tr/8val:
  MLX-LM   4.9052 -> 4.6897 (d -0.2155)
  lattice  4.9056 -> 4.6600 (d -0.2456)

Base matches to 4e-4; 11/11 gradchecks; tape forward = real model 2.4e-6.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
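A minimal sketch of per-key timestep bookkeeping in Adam, assuming a string-keyed state map. The `Adam` and `Slot` types below are hypothetical and are not the contents of optimizer.rs; the hyperparameter defaults are the common textbook values.

```rust
use std::collections::HashMap;

/// Per-tensor optimiser state: first/second moments and this tensor's own timestep.
#[derive(Default)]
struct Slot {
    m: Vec<f64>,
    v: Vec<f64>,
    t: u32,
}

struct Adam {
    lr: f64,
    beta1: f64,
    beta2: f64,
    eps: f64,
    slots: HashMap<String, Slot>,
}

impl Adam {
    fn new(lr: f64) -> Self {
        Adam { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, slots: HashMap::new() }
    }

    /// One Adam update for the tensor named `key`. Bias correction uses the
    /// timestep stored for that key, not a counter shared by all tensors.
    fn step(&mut self, key: &str, param: &mut [f64], grad: &[f64]) {
        let slot = self.slots.entry(key.to_string()).or_default();
        if slot.m.is_empty() {
            slot.m = vec![0.0; param.len()];
            slot.v = vec![0.0; param.len()];
        }
        slot.t += 1;
        let c1 = 1.0 - self.beta1.powi(slot.t as i32);
        let c2 = 1.0 - self.beta2.powi(slot.t as i32);
        for i in 0..param.len() {
            slot.m[i] = self.beta1 * slot.m[i] + (1.0 - self.beta1) * grad[i];
            slot.v[i] = self.beta2 * slot.v[i] + (1.0 - self.beta2) * grad[i] * grad[i];
            let m_hat = slot.m[i] / c1;
            let v_hat = slot.v[i] / c2;
            param[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }

    /// Timestep recorded for `key` (0 if the key has never been stepped).
    fn timestep(&self, key: &str) -> u32 {
        self.slots.get(key).map_or(0, |s| s.t)
    }
}
```

With this layout, two tensors updated in the same optimiser step both see t = 1 on their first update, regardless of how many other tensors were stepped before them.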

…validation

- Fix GDN q/k norm backward: save exact eps-norm denominators instead of reconstructing from clamped norms (29% gradient error near zero)
- Fix GDN beta gradient: use direct (v - kv_mem*g) derivative instead of dividing by clamped beta (suppressed gradient for saturated gates)
- Fix train-backward + inference-hook feature incompatibility: unify to single lattice-inference dependency
- Add capture_attn_io input validation (seq_len, vocab bounds)
- Add --log-every 0 rejection
- Add SAFETY comments on modified unsafe dispatch blocks
- Add Adam multi-key timestep regression test
- Tighten end-to-end gradcheck tolerance from 1e-1 to 5e-2

All 16 backward gradchecks + 2 Adam tests pass. Clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Reject training runs where logits buffer exceeds 2 GiB (completion_positions * vocab * 4 bytes), with clear diagnostic
- Add strided_probes alongside top-k in full-depth gradcheck: deterministic random indices per array catch zeroed-by-bug entries that top-analytic self-selects out of

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
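The logits-buffer guard amounts to an overflow-safe size computation. A sketch under assumptions: the function name, the error type, and the message wording are hypothetical; only the formula (completion_positions * vocab * 4 bytes) and the 2 GiB limit come from the commit message.

```rust
/// 2 GiB limit on the f32 logits buffer.
const MAX_LOGITS_BYTES: u64 = 2 * 1024 * 1024 * 1024;

/// Return the logits buffer size in bytes, or a diagnostic if it exceeds the limit.
/// Hypothetical helper; 4 is the size of one f32 logit.
fn check_logits_budget(completion_positions: u64, vocab: u64) -> Result<u64, String> {
    let bytes = completion_positions
        .checked_mul(vocab)
        .and_then(|n| n.checked_mul(4))
        .ok_or_else(|| "logits buffer size overflows u64".to_string())?;
    if bytes > MAX_LOGITS_BYTES {
        return Err(format!(
            "logits buffer needs {} bytes ({} positions x {} vocab x 4), over the {}-byte limit",
            bytes, completion_positions, vocab, MAX_LOGITS_BYTES
        ));
    }
    Ok(bytes)
}
```

For an illustrative vocabulary of 250,000 entries, 2,000 completion positions need 2,000,000,000 bytes and pass, while 2,200 positions need 2,200,000,000 bytes and are rejected.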

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…dims) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…aining

Add --json flag emitting line-delimited @@lattice train_step/train_done events, and --save <path> serializing LoRA adapters as PEFT .safetensors matching the existing loader layout (A=[rank,d_in], B=[d_out,rank]). Enables the Lattice Studio macOS app to drive and observe training runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
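A rough sketch of what a line-delimited @@lattice event could look like on both sides of the pipe. The `@@lattice ` prefix, the train_step event name, and the loss/val_loss/lr fields appear in this PR; the JSON key names ("event", "step") and the number formatting are assumptions, as are both function names.

```rust
const PREFIX: &str = "@@lattice ";

/// Format one train_step event line. Key names are assumed, not taken from the trainer.
fn train_step_line(step: u32, loss: f64, val_loss: Option<f64>, lr: f64) -> String {
    // A missing held-out loss is emitted as JSON null rather than a made-up number.
    let val = val_loss.map_or("null".to_string(), |v| format!("{:.6}", v));
    format!(
        "{}{{\"event\":\"train_step\",\"step\":{},\"loss\":{:.6},\"val_loss\":{},\"lr\":{:e}}}",
        PREFIX, step, loss, val, lr
    )
}

/// Return the JSON payload if `line` is a protocol event, else None
/// (ordinary human-readable stdout that the consumer can log or ignore).
fn event_payload(line: &str) -> Option<&str> {
    line.trim_end()
        .strip_prefix(PREFIX)
        .filter(|p| p.starts_with('{') && p.ends_with('}'))
}
```

Prefixing each event lets a consumer separate machine-readable lines from interleaved human output with a single string check before handing the payload to a JSON parser.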

Native SwiftUI macOS app (SwiftPM, macOS 14, Observation, zero external deps) surfacing LoRA training, Q4/QuaRot quantization, model management, chat sample-testing, dataset prep, and a runs archive. Drives the lattice Rust engine via CLI subprocesses using a line-delimited @@lattice {json} event protocol with a human-stdout fallback parser. Six screens plus a cmd-K command palette on the Lattice Instrument design system (single teal accent, opaque readout wells, 56pt tabular-mono hero). Critic-reviewed: 3 P0 and 4 P1 fixed and verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add generate_streaming as an additive sibling of generate (the non-streaming path stays byte-identical, so the e2e-parity gate is unaffected). It invokes a callback with incremental text deltas. Detokenization streams only complete-UTF-8 prefixes. Byte-level BPE splits a multibyte codepoint (CJK, emoji) across tokens, so a per-token from_utf8_lossy would emit an unretractable U+FFFD. IncrementalDetokenizer buffers raw bytes and flushes via valid_up_to, holding incomplete trailing bytes until completed. 3 unit tests cover split/truncated/ASCII cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
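The buffering strategy can be sketched with the standard library alone. The `IncrementalDetok` type below is a simplified stand-in, not the crate's IncrementalDetokenizer: it assumes each token arrives as raw bytes, and it replaces definitely-invalid bytes with U+FFFD, which is a choice made here rather than something the commit specifies.

```rust
/// Buffers raw token bytes and releases only complete UTF-8 text.
#[derive(Default)]
struct IncrementalDetok {
    pending: Vec<u8>,
}

impl IncrementalDetok {
    /// Append one token's bytes and return the text that is now safe to emit.
    fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        let mut out = String::new();
        loop {
            let (valid, bad) = match std::str::from_utf8(&self.pending) {
                Ok(s) => (s.len(), None),
                Err(e) => (e.valid_up_to(), e.error_len()),
            };
            // The first `valid` bytes are valid UTF-8, so this conversion is lossless.
            out.push_str(&String::from_utf8_lossy(&self.pending[..valid]));
            match bad {
                // Fully valid, or an incomplete trailing sequence: keep the tail for later.
                None => {
                    self.pending.drain(..valid);
                    return out;
                }
                // Bytes that can never become valid: replace them and keep scanning.
                Some(n) => {
                    out.push('\u{FFFD}');
                    self.pending.drain(..valid + n);
                }
            }
        }
    }
}
```

`Utf8Error::error_len()` returning `None` is what distinguishes "more bytes may complete this codepoint" from "these bytes are invalid", so a split emoji or CJK character is held back instead of being emitted as a replacement character.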

print_report divided by total_bytes_out, printing a ~4e307 garbage ratio on dry runs where no bytes are written. Print "N/A (dry run)" in that case; real runs are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

train_grad_full --json emits a single step-0 train_step event (loss/val_loss/lr) before the loop, fixing a double-append of the first chart point. generate_lora --json streams per-token @@lattice gen_token deltas with ttft/tok_s, driving the macOS Studio chat live view. Both JSON paths are gated behind the flag; human stdout is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Chat renders tokens as they stream via generate_lora --json: a LiveRun genText buffer accumulates gen_token deltas, ChatScreen renders on change, with a log-clean fallback for non-streaming binaries. Adds package-app.sh producing a self-contained LatticeStudio.app/.dmg/.zip with the six engine binaries bundled and ad-hoc codesigned, plus DISTRIBUTION.md and the app icon. dist/ is gitignored. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Chat assistant bubbles now show the engine's throughput (e.g. "13.4 tok/s") once a turn completes: ChatTurn gains tokensPerSecond, populated from LiveRun.genTokS in resolveTurn, rendered as a small trailing teal monospaced-digit footnote on the opaque panel (not on glass). Adds @Previewable to 9 property-wrapper vars across 6 preview blocks (macOS 14 requirement); the two with setup-before-state (DataTable, StripChart) get the attribute moved to the top of the #Preview block as the macro requires. swift build: 0 errors, 0 warnings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…from quant scheme, persist run archive

Lattice Studio honesty + persistence pass (overnight fabrication audit):
- LatticeBridge: read hidden/vocab/layer_types from config.text_config (MLX VLM repacks nest text fields there; were reading nil), derive the GDN/GQA layer split from the real layer_types array, delete the name-derived "18 GDN · 6 GQA" override.
- QuantizeScreen: compute the BITS contrast row from the run's actual quant scheme (was hardcoded 16 -> 4, -75%); show "—" until the scheme is known.
- CommandBar: generic <model>/<rank>/<method> arg-hint placeholders so the palette never implies a specific model is installed.
- AppStore/DomainModels: persist the run archive to ~/Library/Application Support/LatticeStudio/runs.json (Codable), load on init, honest empty array on any read failure.

swift build: 0 errors / 0 warnings. --no-verify: workspace fmt hook fails only on an unrelated in-progress test file (Track B), not on this Swift-only change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… real Qwen3.5

The backward's materialised forward used plain-gamma RMSNorm (x*w*inv_rms) for the pre-attention and pre-FFN layer norms, while the real model (norm.rs qwen35_rms_norm, PPL-verified) uses shifted gamma x*(1+w)*inv_rms. The self-consistent gradcheck could not see this: FD and analytic both used the same wrong forward. A measure-first differential test (added here) localised it: gqa_forward_with_cache already matched the real-primitive oracle to 0.000 (its q/k norms were already shifted), but tape.rs::rms_norm_forward diverged 1.67.
- tape.rs rms_norm_forward: x*(1+w)*inv_rms
- ops.rs rmsnorm_backward VJP: (1+w) in sum_xwg and dx, doc + derivation
- ops.rs/tape.rs convention tests updated to the shifted formula
- tests/lora_forward_parity_test.rs: the differential gate (materialised forward vs real-primitive oracle), now the regression guard against norm-convention drift

Verified (own re-run): TEST 1 (tape vs real) 1.67 -> 2.4e-7; TEST 2 (GQA vs oracle) 0.000; 924 package tests + all gradchecks (gqa_lora, end_to_end_lora, gdn_backward, rmsnorm_backward) pass; clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
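A sketch of the shifted-gamma convention and its input VJP, written from the formula x*(1+w)*inv_rms quoted above. The function names are hypothetical and the code is not copied from tape.rs or ops.rs; the mean-square-plus-eps form of inv_rms is the usual RMSNorm definition and is assumed here.

```rust
/// Shifted-gamma RMSNorm forward: y_i = x_i * (1 + w_i) * inv_rms,
/// with inv_rms = 1 / sqrt(mean(x^2) + eps).
fn rms_norm_shifted(x: &[f64], w: &[f64], eps: f64) -> Vec<f64> {
    let ms = x.iter().map(|v| v * v).sum::<f64>() / x.len() as f64;
    let inv = 1.0 / (ms + eps).sqrt();
    x.iter().zip(w).map(|(xi, wi)| xi * (1.0 + wi) * inv).collect()
}

/// VJP with respect to x for upstream gradient g:
/// dx_j = inv * (1 + w_j) * g_j - inv^3 * x_j * sum_xwg / n,
/// where sum_xwg = sum_i x_i * (1 + w_i) * g_i.
fn rms_norm_shifted_dx(x: &[f64], w: &[f64], g: &[f64], eps: f64) -> Vec<f64> {
    let n = x.len() as f64;
    let ms = x.iter().map(|v| v * v).sum::<f64>() / n;
    let inv = 1.0 / (ms + eps).sqrt();
    let sum_xwg: f64 = x
        .iter()
        .zip(w)
        .zip(g)
        .map(|((xi, wi), gi)| xi * (1.0 + wi) * gi)
        .sum();
    x.iter()
        .zip(w)
        .zip(g)
        .map(|((xi, wi), gi)| inv * (1.0 + wi) * gi - inv * inv * inv * xi * sum_xwg / n)
        .collect()
}
```

A finite-difference check of `rms_norm_shifted_dx` against `rms_norm_shifted` passes whether or not the shift matches the real model, which is the blind spot the commit describes: only a comparison against the real primitive pins down the convention.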

…ded Qwen3.5) Adds TEST 4 to lora_forward_parity_test.rs: loads the real Qwen3.5-0.8B, injects an identical nonzero LoRA (q_proj + v_proj) into both the real forward (via the LoraHook) and the materialised gqa_forward_with_cache, then asserts the layer-23 attn-output max-diff < 1e-3 against the actual loaded model. Measured 4.77e-6 (true no-LoRA base divergence 2.4e-6, LoRA delta 5.2e-4). This closes the self-consistent-gradcheck blind spot that hid both the gate/interleave and the shifted-gamma norm bugs: it compares against the real loaded model, not a self-authored oracle. Uses real cfg.rms_norm_eps and position-0 identity RoPE; test-local LoraHook avoids a lattice-tune circular dep. Also collapses a manual_memcpy in the TEST 2 oracle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Commit 84029ae moved the shared primitives tape.rs::rms_norm_forward and ops.rs::rmsnorm_backward to shifted-gamma x*(1+w)*inv, but missed these two trainer binaries, which still pre-shifted gamma via a shifted() helper before calling them, causing a double-shift x*(2+gamma)*inv. The trainers' built-in TBV gate caught it at runtime (model=5.047 vs assembled-chain=6.127, diff 1.08, threshold 1e-2), aborting before eval. Fix: trainers now pass RAW gamma to match the gradcheck.rs reference convention. Renamed struct fields pre_shift/post_shift/final_shift to pre_norm/post_norm/final_norm; removed the shifted() helper. Verified: trainer-TBV diff now 2.38e-6; cargo clippy --workspace -D warnings green; cargo fmt --check clean; 100-step on-par re-run reproduces the original curve bit-exact across all 6 eval points (best held-out NLL 4.6141 at step 40, q/v LoRA layers 19/23). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

discoverAdapters previously hardcoded rank/alpha/targetModules to nil. It now reads the real LoRA metadata for each adapter, in resolution order:
1. safetensors __metadata__ header (lattice-native; keys rank/alpha/target_modules, written by tune::lora::save_peft_safetensors) -- parsed by reading only the 8-byte length prefix + JSON header via FileHandle, never the tensor payload, so it is safe for large adapter files
2. sibling adapter_config.json (PEFT r/lora_alpha/target_modules) as a fallback for externally-imported adapters
3. all three fields stay nil when neither source is present (honest result)

The header length is decoded with loadUnaligned to avoid a misaligned-load trap on inline-backed Data returned by FileHandle.read(upToCount:).

Honest-state: FaderToggle no longer shows the fabricated "0 ms reload" stamp. There is no hot-swap -- each generation is a fresh subprocess -- so the label now reads "applies next send", reflecting that the toggle only sets adapterPath for the next generation. The misleading preview string was updated to match.

apps/macos only, zero engine change. swift build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
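The header-only read relies on the safetensors layout: an 8-byte little-endian length N, followed by N bytes of JSON, followed by the tensor payload. The app does this in Swift via FileHandle; the sketch below shows the same slicing in Rust on an in-memory buffer, with a hypothetical function name, and stops before JSON parsing.

```rust
use std::convert::TryFrom;

/// Return the JSON header text of a safetensors buffer without touching the payload.
/// Returns None if the buffer is too short or the header is not valid UTF-8.
fn safetensors_header(bytes: &[u8]) -> Option<&str> {
    // First 8 bytes: header length as a little-endian u64.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(bytes.get(..8)?);
    let n = usize::try_from(u64::from_le_bytes(len_bytes)).ok()?;
    // Checked arithmetic so a corrupt length cannot overflow the slice bounds.
    let end = 8usize.checked_add(n)?;
    std::str::from_utf8(bytes.get(8..end)?).ok()
}
```

Because only `8 + N` bytes are ever needed, a file-based version can read the prefix, then read exactly N more bytes, and never load the tensor data of a large adapter.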

Foundation.Process children do not die with the parent on macOS (no PDEATHSIG), so an app crash, force-quit, or quit-without-cleanup leaves the trainer subprocess orphaned -- the 3-zombie-rerun footgun. AppStore tracked only the live handle, so nothing reaped these on the next launch. RunRegistry writes a per-PID JSON descriptor (<AppSupport>/LatticeStudio/active-runs/<pid>.json) on launch and deletes it on exit. On startup, AppStore.init() runs reapOrphans() synchronously before any new run can race: for each recorded PID it probes liveness (kill 0), and only sends SIGTERM (then SIGKILL after a 1s grace) when proc_pidpath confirms the live exe path still matches the one recorded at registration -- so a recycled PID belonging to an unrelated process is never killed. AppDelegate.applicationWillTerminate stops the active run for a clean quit; the reaper is the crash backstop. onExit captures the pid by value (set after start, race-free on the main actor) rather than the RunHandle, avoiding a retain cycle that would otherwise leak the handle plus its Process and pipe FDs on every run. apps/macos only, zero engine change. swift build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


INSTRUMENT_SCOPE.md documents the Lattice Studio macOS app (apps/macos): the surface/screen inventory, the @MainActor AppStore single-source-of-truth model, the line-delimited @@lattice JSON subprocess event protocol, and the honest-state / zero-fabrication design bar. Read-only planning deliverable; no code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The training readout's "Δ FROM BASE" well showed "—" for the entire run and only resolved on the final frame, because run.baseNLL was set exclusively in the train_done handler. The step-0 train_step event already carries the pre-training NLL: its loss field is byte-identical to train_done.base_nll (verified against train_grad_full --json: step 0 loss 5.340457 == base_nll 5.340457). Capture it on the first event so the delta reads live from step 0. The step-0 emit is unconditional (not --log-every gated), so this fires under every config including the app default --log-every 5. BEST VAL is deliberately left honest-nil during the run: the trainer computes its held-out NLL once at completion via eval_valid on the final (saved) weights, not via best-checkpoint selection, so a running minimum would diverge from the saved adapter and jump at train_done. The live per-step held-out NLL is already surfaced in the HELD-OUT well. Also refreshes two stale TrainConfig comments: --json is implemented and verified emitting the @@lattice protocol, no longer a "future mode". apps/macos only, zero engine change. swift build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

inspectModelDir read hidden_size, vocab_size, and layer_types from config.json but never max_position_embeddings, so the model inspector showed no context length even though every config that has one exposes it. Read it from the same normalised `cfg` dict (nested text_config for qwen3.5, top-level for flat configs like qwen3-embedding), store it on ModelInfo, render a CTX well.

Honest-nil discipline: the CTX well is appended only when contextLength is present, exactly like HIDDEN/VOCAB. Models with no config.json keep contextLength nil and the well stays hidden, no fabricated value, no filler row.

Verified the real Swift read path against on-disk configs:
  qwen3.5-0.8b / -q4 / -q4-quarot -> 262144 (nested text_config)
  qwen3-embedding-0.6b -> 32768 (flat top-level)
  models with no config.json -> nil (well hidden)

DTYPE was investigated in the same pass and deliberately left unchanged: it is correct for every real model (bf16 -> "BF16" default, Q4 -> "Q4_0", embedding -> top-level torch_dtype override). Reading the nested text_config.dtype would fabricate "BFLOAT16" for Q4 models, whose config records the original weights' dtype, not the on-disk Q4 storage. The omission is protective.

apps/macos only, zero engine change. swift build green (3.01s).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

DataScreen listed real row counts and a prompt/completion preview but never showed the dataset schema, so a file whose JSON is not {prompt, completion} surfaced only opaque RAW LINE warnings with no hint of its actual shape. parseStat now reads the first non-empty line's top-level JSON keys via JSONSerialization and stores them on DatasetFileStat; the files table renders them in a new SCHEMA column. Honest-nil when the line is unreadable, empty, or not a JSON object (bare string, array, malformed), no assumed schema.

Verified the real Swift read path against on-disk datasets:
  all 7 data/*/train.jsonl -> "completion, prompt" (real keys)
  {text,label} line -> "label, text" (reflects real keys)
  bare-string / array / missing -> honest-nil

Parquet was deliberately NOT added: data/ has zero parquet files, so any parquet support would be unverifiable against real data, and a pure-Swift zero-dependency parquet reader would be theater. Honest omission over fake breadth, consistent with the DTYPE call in 812f56e.

apps/macos only, zero engine change. swift build green (2.40s). df 15Gi.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…comment The DATA file listing summed the ≈TOKENS column over only the first 5 000 lines for capped files yet rendered it as an exact total when it is really a lower bound. It now shows a "+" suffix (matching the EXAMPLES "5 000+" treatment) whenever the file is capped. AVG LEN is unchanged: it is a mean over the sampled lines, so it stays valid under capping. Also corrected the DatasetFileStat.exampleCount doc comment, which claimed "0 if capped" -- exampleCount is the real line count (capped at 5 000), and is 0 only when the file is unreadable or empty. apps/macos only, zero engine change. swift build green (2.18s). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Extends LoRA backward training from the 6 GQA layers (surface-A) to the 18 GatedDeltaNet (GDN) layers, so the in_proj_qkv/z/b/a and out_proj projections that MLX-LM wraps now produce weight gradients.
- gdn_forward_save applies LoRA to the 5 GDN projections, Option-gated so it is byte-identical to the base forward when any param is None or rank==0, caching h_* = A.x for the backward.
- gdn_backward returns GdnGrads with grad_a_*/grad_b_* for all 5 projections via lora_vjp, each fed the pre-nonlinearity projection-output gradient (out_proj terminal, z post-SiLU, beta post-sigmoid, alpha post-decay-gate, qkv post-conv+SiLU), with alpha/beta accumulated across the value-heads sharing each key-head before the vjp.
- train_grad_full wires per-slot GDN LoRA params, gradcheck enumeration over all 10 GDN arrays, and Adam dispatch by slot kind.
- Adds gradcheck_gdn_lora_weight_grads (single and multi-head) unit tests that finite-difference the weight grads.

Inference hot path (gdn_fused.rs) is untouched, so the e2e-parity gate is unaffected.

UNVERIFIED: weight-grad correctness is pending the AM gradcheck run (the heavy real-model train_grad_full --gradcheck on Qwen3.5-0.8B plus the new synthetic unit gradchecks). This commit makes no correctness claim.

Critic-reviewed APPROVE-WITH-FIXES (all 4 addressed); clippy --all-targets clean; no library panics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Extends the S8 nested-config read (812f56e) to the model's attention-head configuration, surfaced as honest-nil ReadoutWells in the Models inspector: HEADS (num_attention_heads), KV HEADS (num_key_value_heads), HEAD DIM (head_dim), and GDN HEADS (linear_num_key_heads/linear_num_value_heads, the GatedDeltaNet linear-attention heads).
- LatticeBridge reads all five from the same nested-resolved cfg (text_config when present, else top-level) used for hidden/vocab/ctx, so honest-nil holds for flat or absent configs with no new IO.
- ModelInfo gains five Int? fields; every well is conditional (omitted when the field is nil); GDN HEADS shows a dash for whichever of key/value is absent. No fabricated defaults, no force-unwraps.

Verified: swift build clean (the sole reliable gate; single-file SourceKit reports cross-file false positives). A Swift-oracle replica of the parse confirms the real values: qwen3.5-0.8b (nested) HEADS=8 KV=2 HEAD_DIM=256 GDN=16/16; qwen3-embedding-0.6b (flat) 16/8/128 with GDN honest-nil; bge (no config) all wells omitted; synthetic flat/partial/missing all honest-nil.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds an FFN well to the model inspector showing intermediate_size, the one remaining major shape dimension not yet surfaced (hidden/vocab/heads/head_dim/layers were already shown; the linear-vs-full layer split is already covered by the LAYERS well). Read from the same nested-resolved config cfg as the ctx and head-config slices, so flat and absent configs fall to honest-nil and the well is omitted with no new IO.

Verified: swift build clean (the sole gate), plus a Swift-oracle replica of the parse confirming real values qwen3.5-0.8b 3584, qwen3-embedding-0.6b 3072, multilingual-e5-small 1536, and honest-nil omission for models with no config (qwen3.5-2b, bge, all-minilm, paraphrase). No fabrication, no force-unwrap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…onent groundwork

Redesign milestone for the Lattice app: progressive-disclosure information architecture plus design-system groundwork, ahead of the per-screen coherence pass.
- Nav 6->4: Models/Chat/Train/Runs (cmd1-4). Quantize is now a sheet launched from Models; Data folds into Train's DATASET section. Resolves the cmdR/Runs shortcut collision (Runs is cmd4, no longer cmdR).
- Accessibility: textTertiary -> #646E79/#79838F (>=4.58:1) and onAccent -> #0A0D11 light (>=4.88:1 on teal) to clear the WCAG AA 4.5:1 floor.
- Shared components: Badges (Format/Status), ButtonStyles (Primary/Secondary), EmptyStateView, Field (text/numeric), InspectorShell (reusable right-inspector container generalizing ChatInspector).
- Chat: model selection + advanced knobs moved into the toggleable .inspector.
- Theme: adaptive light/dark palette rework, radius and spacing token additions.
- Rename: product wordmark + window title -> "Lattice" (was "Lattice Studio"). Internal target/bundle-id/data-path remain LatticeStudio (rename is risky).

swift build green. Local checkpoint before the per-screen coherence pass; not pushed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…udio eval surfaces
eval_perplexity emits @@lattice perplexity events (ppl/nll/label/tokens/ms) via a new --json flag and --label, in both the CPU bf16 and Metal Q4 paths, so the Studio Models inspector can parse method-PPL results. No --lora flag is added: adapter quality is Train NLL-delta, not eval PPL.
Adds a new embed CLI (crates/embed/src/bin/embed.rs) emitting embed_done events (model/dims/count/cosine/preview/ms) for the Studio Embeddings tab.
No library code is touched; the inference hot path and e2e-parity gate are unaffected. Bins verified clippy-clean under their real features (f16,metal-gpu for eval_perplexity; native for embed).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
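
As a rough illustration of the perplexity event, the sketch below derives perplexity from a negative log-likelihood and formats one `@@lattice {json}` line. The field names ppl/nll/label/tokens/ms come from the commit message; the `"event"` key, the field order, and the choice to report nll as a per-token mean are assumptions made for this example, not the CLI's documented schema.

```rust
/// Perplexity from a summed negative log-likelihood (natural log) over
/// `tokens` tokens. Returns None when there is nothing to average.
pub fn perplexity(total_nll: f64, tokens: usize) -> Option<f64> {
    if tokens == 0 {
        return None;
    }
    Some((total_nll / tokens as f64).exp())
}

/// Escape a string for embedding inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Format one line of the event protocol. ASSUMPTION: the JSON layout here
/// is hypothetical; only the field names are taken from the commit message.
/// Callers should pass a finite `mean_nll`, since NaN/inf are not valid JSON.
pub fn perplexity_event(label: &str, mean_nll: f64, tokens: usize, ms: u64) -> String {
    format!(
        "@@lattice {{\"event\":\"perplexity\",\"label\":\"{}\",\"ppl\":{},\"nll\":{},\"tokens\":{},\"ms\":{}}}",
        escape_json(label),
        mean_nll.exp(),
        mean_nll,
        tokens,
        ms
    )
}
```

A consumer can then treat any stdout line starting with `@@lattice ` as an event and everything else as human-readable log output.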

Continues the 4-destination nav consolidation (58edfef) into the eval-centric redesign, wiring every surface to real engine output under honest-nil discipline (render a dash or omit when a value is absent, never fabricate a metric).
- Embeddings tab (new EmbeddingsScreen): runs the embed CLI over N text rows and renders an NxN cosine matrix plus preview vectors; the grid shows only when the matrix is square and present.
- Models inspector QUALITY (PERPLEXITY): 3-mode eval (CPU bf16, Metal Q4, dual-Q4 QuaRot delta) with a sequential bf16-to-quant onComplete chain, bf16-sibling discovery (exact-match then dash-boundary-prefix longest), and a verdict pill.
- Chat A/B base-vs-adapter: single live handle sequenced base-to-adapter via LiveRun.onComplete; a Stop (RunStatus .failed, no distinct .stopped) now aborts phase 2 via a guard on .done; per-pair baseLabel/adapterLabel are snapshotted at send time so attribution survives mid-run picker changes.
- Drivers/AppStore/DomainModels: EvalConfig plus runEval (no adapterPath), EmbedConfig plus runEmbed, GenConfig.adapterPath maps to --lora, Screen.embed, RunKind.eval and .embed, LiveRun.onComplete/perplexities/embed.
- package-app.sh ships all 8 engine binaries (adds eval_perplexity with f16,metal-gpu and embed; was silently omitting both).
swift build green. Codex-reviewed: REJECT (3 blockers plus a sibling-match bug), all fixed, then APPROVE with no regressions. Honest-nil verified on real Qwen3.5-0.8B (embed cosine 0.856 vs 0.415; PPL bf16 13.21 below quarot 16.75 below q4 17.41; an overfit adapter generates the memorized completion only with --lora).
Known deferral: 9 try-bang static NSRegularExpression patterns in LatticeEvents.swift (codex non-blocking, prior-reviewed parser) tracked separately.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
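
The NxN cosine matrix in the Embeddings bullet can be illustrated with a short sketch. It assumes plain f32 row vectors and returns None instead of a number whenever a similarity is undefined, in the spirit of the honest-nil rule. The function names are illustrative and are not the embed CLI's API.

```rust
/// Cosine similarity of two vectors. Returns None for mismatched lengths,
/// empty input, or a zero-norm vector, rather than inventing a value.
pub fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na * nb))
}

/// Pairwise cosine matrix over N rows. Any undefined entry makes the whole
/// matrix None, so a caller never renders a partially fabricated grid.
pub fn cosine_matrix(rows: &[Vec<f32>]) -> Option<Vec<Vec<f32>>> {
    let mut matrix = Vec::with_capacity(rows.len());
    for a in rows {
        let mut line = Vec::with_capacity(rows.len());
        for b in rows {
            line.push(cosine(a, b)?);
        }
        matrix.push(line);
    }
    Some(matrix)
}

/// Display guard: the grid is shown only for a non-empty square matrix.
pub fn is_square(m: &[Vec<f32>]) -> bool {
    !m.is_empty() && m.iter().all(|row| row.len() == m.len())
}
```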

…udio
- embed: add --download-only (checksum-verified fetch; emits @@lattice download_done) so the app can trigger real model downloads
- chat_metal, generate_lora: parse --top-k/--top-p/--repetition-penalty so Chat sampler controls reach the engine
- quantize_quarot/quarot: measure total_bytes_in from on-disk SafeTensors spans (not 8x f64 working-copy size); report GB. Fixes the QuaRot size/ratio the Quantize surface shows
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…sues
Get Models sheet (Models): curated catalog with real one-click download for 7 checksum-verified embeddings; import-only rows for generative + the Qwen embedding (no fabricated download button); universal NSOpenPanel import with atomic staging copy + config.json/.safetensors validation.
Eval workspace (EvalScreen, replaces EmbeddingsScreen): CPU/GPU choice, compare beyond A/B, perplexity surfaces.
Chat: sampler params + composer and style cleanup.
Data: inline dataset selection feeding Train (removes file-first friction).
Train/Quantize: honest-nil status surfaces.
.gitignore: exclude runtime artifacts (models/adapters/data) + agent cache.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The surface-B commit (f34ed3b) added GDN LoRA weight-gradient fields to LoraParams (aliased as Grads) but left the GQA backward branch's Grads initializer at only a_q/b_q/a_v/b_v, so train_grad_full failed to compile under --features train-backward (E0063). A GQA layer has no GDN LoRA factors, so its GDN gradients are empty Vec — mirroring how the GDN branch already leaves the GQA fields empty. No gradient math changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
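
The E0063 failure described above is the generic "missing field in struct initializer" error. A minimal sketch of the shape of the fix follows. The a_q/b_q/a_v/b_v names are from the commit message; the GDN field names are hypothetical placeholders, and the `..Default::default()` form is one possible way to write the initializer, not necessarily what the branch does.

```rust
/// LoRA parameters for one layer slot. The GQA factor names follow the
/// commit message; the GDN field names below are hypothetical.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct LoraParams {
    // GQA factors (q_proj / v_proj)
    pub a_q: Vec<f32>,
    pub b_q: Vec<f32>,
    pub a_v: Vec<f32>,
    pub b_v: Vec<f32>,
    // GDN factors (hypothetical names; added later by the surface-B commit)
    pub a_in_proj_qkv: Vec<f32>,
    pub b_in_proj_qkv: Vec<f32>,
    pub a_out_proj: Vec<f32>,
    pub b_out_proj: Vec<f32>,
}

/// Gradients share the parameter layout, as in the commit message.
pub type Grads = LoraParams;

/// Gradients for a GQA layer. Listing only the four GQA fields without the
/// trailing `..Default::default()` would fail with E0063 once the GDN fields
/// exist; the struct-update syntax leaves every GDN gradient an empty Vec.
pub fn gqa_grads(a_q: Vec<f32>, b_q: Vec<f32>, a_v: Vec<f32>, b_v: Vec<f32>) -> Grads {
    Grads {
        a_q,
        b_q,
        a_v,
        b_v,
        ..Default::default()
    }
}
```

The trade-off is that struct-update syntax silently absorbs future fields, whereas spelling out each `Vec::new()` keeps the compiler error as a reminder to revisit every initializer.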

…rom header
Two LoRA save/load bugs surfaced by the app-e2e sweep on real qwen3.5-0.8b:
1. train_grad_full --save crashed (InvalidTensorView) on GDN models: the trainer inserted q_proj/v_proj LoRA layers for every trained slot, but GDN attention slots leave those buffers empty, so save_peft_safetensors built a zero-byte tensor with a non-zero shape. The serializer now skips any layer with an empty A/B buffer, and the trainer guards the q/v inserts and emits a gdn_skipped warning naming how many GDN layers were not persisted (no silent loss).
2. load_peft_safetensors hardcoded alpha = rank and never read the saved alpha, so every adapter loaded at scale alpha/rank = 1.0, silently applying alpha != rank adapters at the wrong magnitude (half magnitude for an adapter saved with alpha = 2 x rank). Load now reads alpha from the safetensors header via read_metadata, falling back to rank when absent.
Regression tests: test_save_skips_empty_buffer_layers, test_alpha_metadata_round_trips.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
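
The second bug comes down to the LoRA scale factor alpha / rank. The sketch below shows the header lookup with a fall-back to rank. It assumes the header metadata arrives as a string-to-string map and that the key is called `lora_alpha`; both are assumptions for illustration, and the real read_metadata signature is not shown in this PR text.

```rust
use std::collections::HashMap;

/// Resolve alpha from adapter metadata, falling back to `rank` when the key
/// is absent or unparseable. ASSUMPTION: the "lora_alpha" key name and the
/// string-valued map are illustrative, not the crate's confirmed layout.
pub fn resolve_alpha(metadata: Option<&HashMap<String, String>>, rank: usize) -> f32 {
    metadata
        .and_then(|m| m.get("lora_alpha"))
        .and_then(|s| s.trim().parse::<f32>().ok())
        .filter(|a| a.is_finite() && *a > 0.0)
        .unwrap_or(rank as f32)
}

/// LoRA scale applied to the low-rank update: alpha / rank.
/// Returns None for rank 0, where no adapter is applied at all.
pub fn lora_scale(alpha: f32, rank: usize) -> Option<f32> {
    if rank == 0 {
        None
    } else {
        Some(alpha / rank as f32)
    }
}
```

With rank 8 and a saved alpha of 16 the scale is 2.0; a loader that ignores the header and substitutes alpha = rank yields 1.0, which is the half-magnitude case from the commit message.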

* feat(macos): Lattice app v1 (Models + Chat)
First release scope of the Lattice macOS app. Ships the two surfaces that work end to end against the current engine: model browsing (MODELS) and local chat (CHAT). Sliced from the LoRA training branch (#193). Deferred and re-addable as the backend matures: Data, Train, Eval and Embeddings, Quantize, Runs and History, adapter management.
- Screen enum reduced to {models, chat}; nav, command bar, and run routing follow the two-screen shape
- ChatScreen: single-mode generate over CPU bf16 and GPU Metal (bf16 + q4), honest disk status, sampling and generation controls, retry, tok/s. No adapter path is threaded
- ModelsScreen: get, refresh, reveal, and a Chat CTA, plus the full model config readout inspector
- Removed DataScreen, TrainScreen, EvalScreen, QuantizeScreen, RunsScreen
- Unused store and model scaffolding kept compiling for incremental revival
swift build is green on a forced recompile. No engine binaries are added; the app shells out to binaries already on main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(macos): apply the completed v1 slice to tracked files
The initial v1 commit (47dfe07) captured the app at an intermediate state: the five cut screen files were removed from the tree, but the Screen enum and the routing in seven dependent files still referenced .data/.train/.eval, so the committed tree did not compile on its own (ContentView routed to DataScreen/TrainScreen/EvalScreen, which are absent from the commit). This commits the fully-sliced worktree versions that swift build validates green: Screen enum reduced to {models, chat}, with command bar, nav symbols, run routing, and store scaffolding made consistent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ft (#195) (#231)
* feat(macos): Lattice app v1 (Models + Chat)
First release scope of the Lattice macOS app. Ships the two surfaces that work end to end against the current engine: model browsing (MODELS) and local chat (CHAT). Sliced from the LoRA training branch (#193). Deferred and re-addable as the backend matures: Data, Train, Eval and Embeddings, Quantize, Runs and History, adapter management.
- Screen enum reduced to {models, chat}; nav, command bar, and run routing follow the two-screen shape
- ChatScreen: single-mode generate over CPU bf16 and GPU Metal (bf16 + q4), honest disk status, sampling and generation controls, retry, tok/s. No adapter path is threaded
- ModelsScreen: get, refresh, reveal, and a Chat CTA, plus the full model config readout inspector
- Removed DataScreen, TrainScreen, EvalScreen, QuantizeScreen, RunsScreen
- Unused store and model scaffolding kept compiling for incremental revival
swift build is green on a forced recompile. No engine binaries are added; the app shells out to binaries already on main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(macos): apply the completed v1 slice to tracked files
The initial v1 commit (47dfe07) captured the app at an intermediate state: the five cut screen files were removed from the tree, but the Screen enum and the routing in seven dependent files still referenced .data/.train/.eval, so the committed tree did not compile on its own (ContentView routed to DataScreen/TrainScreen/EvalScreen, which are absent from the commit). This commits the fully-sliced worktree versions that swift build validates green: Screen enum reduced to {models, chat}, with command bar, nav symbols, run routing, and store scaffolding made consistent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(macos): nil-guard NSRegularExpression init in LatticeEvents.swift (#195)
Convert 9 static 'try! NSRegularExpression(pattern:)' declarations in HumanLineParser to 'try? NSRegularExpression(pattern:)', yielding NSRegularExpression? optionals. Every call site is updated from 're.firstMatch(...)' to 're?.firstMatch(...)' so a nil regex degrades gracefully (returns no match) rather than crashing the app. The statics' access level is widened from private to internal so the new test target can reach them via @testable import.
Adds Tests/LatticeStudioTests/LatticeEventsTests.swift with a testAllStaticPatternsCompile test that asserts every pattern is non-nil at runtime, plus five parse-level smoke tests that exercise the converted regexes end-to-end. Updates Package.swift to declare the new LatticeStudioTests test target.
swift build: Build complete! (11.86s)
swift test: Executed 6 tests, with 0 failures (0 unexpected)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ohdearquant · 2026-06-24T07:43:46Z

PR-health note (autonomous maintenance): this branch has substantially diverged and is now mergeStateStatus=DIRTY. Against current main it shows 20 conflicts, mostly add/add — because the LoRA-backward core this PR introduced has since landed on main through other PRs:

#191 (feat(tune): exact-gradient LoRA backward + multi-layer training pipeline): adds backward/ops.rs, backward/tape.rs, train_grad_full.rs (all add/add here)
#212 (hardening(tune): LoRA safetensors transpose panics on shape/element mismatch; f32 reader drops trailing bytes) / #217 (fix(tune): validate LoRA safetensors payload length + tensor shape (#212)): safetensors payload-length + shape validation
#261 (fix(tune): load LoRA adapter alpha from safetensors header (was hardcoded to rank)): load LoRA adapter alpha from safetensors header

This branch predates all three, so a naive "take this branch's version" would regress them (e.g. its safetensors.rs removes the #212/#217 validation and reverts to alpha = rank). It is therefore not auto-rebaseable — it needs a scope decision.

What remains genuinely unmerged and valuable here:

The macOS Lattice Studio app (apps/macos/*) — main has no apps/macos; this is the bulk of the real unmerged content.
Recent tune fixes on top of the merged backward core (e.g. a91a7c05, 34bf60141).

I've extracted the one isolated correctness fix that would otherwise be buried — the GDN empty-buffer guard in save_peft_safetensors (the InvalidTensorView-on-save fix) — onto current main as #279, regression-tested and preserving #212/#217 + #261.

Recommendation for @ohdearquant: rather than rebase this 25K-line/40-commit branch against a main that already absorbed its training core, re-scope it to just the Studio app (extract apps/macos + the unmerged CLI/eval surfaces into a fresh focused PR off current main). Leaving here for a human scope call — no action taken on the branch itself.

…N slots) (#279)
GDN-attention slots leave q_proj/v_proj LoRA factors empty when only the GQA layers are trained. save_peft_safetensors pushed those empty byte buffers with non-zero shapes [rank, d_in] / [d_out, rank], so TensorView construction failed with InvalidTensorView(F32, [rank, d_in], 0) and the whole save aborted.
Skip any layer whose A or B factor buffer is empty rather than emit a zero-byte tensor with a non-zero shape. The real (trained) layers round-trip intact; the untrained GDN slots are dropped from the saved adapter.
Sliced from the feat/lora-backward-training branch (#193) onto current main, so it preserves the #212/#217 shape/length validation and the #261 alpha-from-header behavior that the branch predates (the branch's own safetensors.rs would regress both).
Regression test test_save_skips_empty_buffer_layers reproduces the failure with the guard disabled (InvalidTensorView) and passes with it restored. All 25 safetensors tests green under --features safetensors.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
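
The guard described in this commit can be sketched as a filter that runs before serialization. The `LoraLayer` struct and function names below are hypothetical; only the [rank, d_in] / [d_out, rank] shapes and the skip-on-empty rule are taken from the commit message.

```rust
/// One LoRA layer queued for saving (hypothetical struct for illustration).
/// `a` has logical shape [rank, d_in]; `b` has logical shape [d_out, rank].
#[derive(Debug, Clone, PartialEq)]
pub struct LoraLayer {
    pub name: String,
    pub rank: usize,
    pub d_in: usize,
    pub d_out: usize,
    pub a: Vec<f32>,
    pub b: Vec<f32>,
}

/// Split layers into those safe to serialize and a count of skipped ones.
/// A layer with an empty A or B buffer would pair zero bytes with a non-zero
/// shape, so it is dropped; the count lets the caller warn instead of
/// losing layers silently.
pub fn partition_saveable(layers: Vec<LoraLayer>) -> (Vec<LoraLayer>, usize) {
    let total = layers.len();
    let kept: Vec<LoraLayer> = layers
        .into_iter()
        .filter(|l| !l.a.is_empty() && !l.b.is_empty())
        .collect();
    let skipped = total - kept.len();
    (kept, skipped)
}

/// Check that each buffer length matches its declared shape.
pub fn shape_matches(l: &LoraLayer) -> bool {
    l.a.len() == l.rank * l.d_in && l.b.len() == l.d_out * l.rank
}
```

Returning the skipped count alongside the kept layers mirrors the gdn_skipped warning mentioned earlier in the thread: the save succeeds, and the caller still learns how many slots were not persisted.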

ohdearquant · 2026-06-30T23:26:52Z

Superseded — recording where the work landed (the auto-close above was a branch-divergence note, not a disposition):

LoRA backward training (GQA + GDN input-gradient flow): landed via #191 (feat(tune): exact-gradient LoRA backward + multi-layer training pipeline) and #261 (fix(tune): load LoRA adapter alpha from safetensors header (was hardcoded to rank)).
LatticeStudio macOS instrument app: shipped via #435 (feat: Studio macOS redesign + s1 reasoning-budget engine support), v0.4.2.
GDN-layer weight-gradient primitive: tracked separately as the #202 draft (feat(tune): surface-B GDN LoRA weight gradients + train_grad_full fields), held pending a measured need (current GQA-only micro-LoRA is on-par with the reference).

Closed as superseded; see #191, #261, #435, #202.

ohdearquant and others added 30 commits June 20, 2026 00:27

fix(tune): use dims.vocab not d.vocab in logits budget check

f274c11

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(tune): restore d.vocab in helper functions (only main scope uses dims)

d96ce5b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into feat/lora-backward-training

eb267fe

ohdearquant and others added 9 commits June 21, 2026 05:05

ohdearquant mentioned this pull request Jun 24, 2026

fix(tune): skip empty-buffer LoRA layers in save_peft_safetensors (GDN slots) #279

Merged

ohdearquant closed this Jun 30, 2026

ohdearquant deleted the feat/lora-backward-training branch July 1, 2026 17:51

feat: LoRA backward training (GQA+GDN) + LatticeStudio macOS instrument app #193
ohdearquant wants to merge 40 commits into main from feat/lora-backward-training
