
feat: LoRA backward training (GQA+GDN) + LatticeStudio macOS instrument app #193

Closed
ohdearquant wants to merge 40 commits into main from feat/lora-backward-training

@ohdearquant

Owner

Summary

Two threads developed together on this branch:

  1. LoRA backward training engine (crates/inference, crates/tune) — extends LoRA fine-tuning from a single lm_head trainer to a full-depth multi-layer backward tape through Qwen3.5's gated GQA and GatedDeltaNet layers, with train_grad_full --json metrics + PEFT adapter save.
  2. LatticeStudio macOS instrument app (apps/macos) — a native SwiftUI panel that drives the lattice CLI binaries (train / quantize / generate) over a line-delimited @@lattice {json} event protocol. No in-process ML, no Python.

63 files, +18,547 / −52.

Thread 1 — LoRA backward (inference + tune)

  • Surface-A (verified): exact CPU backward through the 6 GQA layers + lm_head LoRA trainer. Materialised GQA forward+backward corrected for Qwen3.5 gated attention (gate + interleave + shifted q-norm).
  • RMSNorm correctness (verified): backward tape and VJP now use shifted-gamma RMSNorm to match the real Qwen3.5 weights (84029ae8). A real-model forward-parity gate (6d2da6e2) diffs a loaded layer-23 against the trainer's forward — TEST4 max abs 4.77e-6 < 1e-3, passing.
  • Surface-B (UNVERIFIED — no correctness claim): f34ed3b1 adds LoRA weight gradients for the 18 GatedDeltaNet layers (the 5 GDN projections in_proj_qkv/z/b/a + out_proj). It is Option-gated to be byte-identical to the base forward when any param is None or rank==0, so it does not perturb inference. Weight-grad correctness is pending the AM gradcheck run (heavy real-model train_grad_full --gradcheck on Qwen3.5-0.8B + the new synthetic single/multi-head unit gradchecks). This PR makes no correctness claim for surface-B.
  • Streaming generation: UTF-8-safe incremental detok (ac223c44) — flushes only complete codepoints per token so CJK/emoji don't stream as replacement chars.

Thread 2 — LatticeStudio (apps/macos)

A 6-screen instrument panel (MODELS / TRAIN / QUANTIZE / CHAT / DATA / RUNS). The substance here is the honest-nil data discipline: every surfaced field is read from a real source (config.json, adapter metadata, run archive) or rendered as "—"; nothing is fabricated or defaulted. Verified per-slice against a Swift-oracle replica + swift build.

  • MODELS: per-model shape readout — params, layer split (18 GDN · 6 GQA), hidden/vocab, context length, attn/kv/head-dim, GDN key/value heads, FFN intermediate size — all honest-nil from the nested-resolved config.
  • TRAIN: live strip-chart, NLL-Δ-from-base from step 0, PID self-registration + orphan reaping.
  • QUANTIZE: Q4 / QuaRot with before/after MB + forward-equivalence max-abs error.
  • CHAT: live streaming generation via generate_lora --json.
  • DATA: dataset listing with real first-row schema column.
  • RUNS: run archive persisted across launches (runs.json).

Packaging: apps/macos/scripts/package-app.sh produces an ad-hoc-codesigned .app + dmg/zip with the 6 engine binaries bundled.

Visual design is being iterated — this lands the functional + data-correct baseline; a visual redesign pass is in progress separately.

Gates / testing

  • swift build green (the reliable Swift gate; SourceKit single-file diagnostics are false positives across this multi-file module).
  • cargo clippy clean on the touched crates (lattice-inference, lattice-tune), no library panics.
  • e2e-parity.yml will trigger (touches crates/inference/src/). Surface-B is Option-gated byte-identical to the base forward, so greedy-token parity is expected to hold — this PR is the first CI confirmation of that.

Notes

  • Branch is 1 commit behind origin/main (the e2e-parity CI merge, ci: replace flaky bench-regression with e2e-parity gate #192). Not rebased — GitHub computes the PR diff from the merge-base. Rebase at merge time if branch-protection requires up-to-date.
  • The untracked adapters/, data/, scripts/, .claude/, uv.lock paths are intentionally not included.

Follow-ups (tracked)

  • AM gradcheck run for surface-B (Ocean/AM-gated) — gates any MLX on-par claim.
  • Wire trainer-TBV (real-NLL vs assembled-chain) into an automated CI gate.
  • apps/macos polish: adapter rank/alpha parse, DataScreen streaming read, memoryUsage timer; visual redesign.

🤖 Generated with Claude Code

ohdearquant and others added 30 commits June 20, 2026 00:27
…trainer

Reverse-mode autodiff foundation for real-gradient (Adam) LoRA training on
Qwen3.5-0.8B. Two verified milestones:

Milestone-1 — lm_head LoRA trainer (crates/tune/src/bin/train_grad.rs):
  Caches real final hidden H_t via forward_final_hidden (24-layer forward),
  runs exact-gradient Adam on logits_lora = base + scale·B·(A·H_t).
  base NLL 5.1757 → 0.6103 over 150 monotonic steps (6 samples, rank 8).
  TBV: cached base_logits vs live model diff 6.24e-4.
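
The Milestone-1 update, logits_lora = base + scale·B·(A·H_t), can be sketched in plain Rust. This is a minimal illustration, assuming row-major flat slices with A as [rank, d_in] and B as [d_out, rank]; the name `lora_logits` and the argument order are hypothetical, not the trainer's API.

```rust
/// Computes base + scale * B * (A * h) for a single hidden vector `h`.
/// `a` is row-major [rank, d_in]; `b` is row-major [d_out, rank].
/// (Hypothetical helper for illustration only.)
fn lora_logits(base: &[f32], a: &[f32], b: &[f32], h: &[f32], rank: usize, scale: f32) -> Vec<f32> {
    let d_in = h.len();
    let d_out = base.len();
    assert_eq!(a.len(), rank * d_in);
    assert_eq!(b.len(), d_out * rank);

    // Low-rank bottleneck: z = A * h, length `rank`.
    let z: Vec<f32> = (0..rank)
        .map(|r| a[r * d_in..(r + 1) * d_in].iter().zip(h).map(|(w, x)| w * x).sum::<f32>())
        .collect();

    // Expand back to d_out and add the scaled delta onto the frozen base logits.
    (0..d_out)
        .map(|o| {
            let delta: f32 = b[o * rank..(o + 1) * rank].iter().zip(&z).map(|(w, v)| w * v).sum();
            base[o] + scale * delta
        })
        .collect()
}
```

With B all zeros the function returns `base` unchanged, which is the usual LoRA starting point.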

Milestone-2 — backward through a full GQA attention layer (backward/):
  ops.rs   linear/lora/rmsnorm/rope/swiglu/cross_entropy VJPs, all FD <1.5e-4
  attention_gqa.rs  materialised causal GQA backward (q/v LoRA + RoPE +
    QK-norm + softmax + o_proj), end-to-end FD-verified.

Two structural bugs found+fixed in gqa_backward via de-vacuumed gradcheck
(nonzero B; B=0 makes grad_A identically 0 and tests nothing):
  1. Q/K RMSNorm backward used the post-norm value as its own input instead
     of the pre-norm projection. Fix: cache q_raw/k_raw before in-place norm.
  2. K/V gradient accumulated per-query-position and read back only the
     diagonal slice, dropping all off-diagonal causal (t>s) contributions.
     Fix: two-phase — global d_k/d_v accumulation, then per-position proj.
  gqa_lora_gradcheck: grad_A_q 1.39 → 1.1e-3, end_to_end now passes (<0.1).
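
Why B=0 makes the check vacuous can be seen on a bare LoRA product. For y = scale·B·(A·x) with upstream gradient g, the gradient with respect to A is scale·(Bᵀg)·xᵀ, so every entry carries a factor of Bᵀg. The sketch below assumes row-major A [rank, d_in] and B [d_out, rank]; `grad_a` and `lora_loss` are hypothetical names, not the functions in backward/.

```rust
/// Analytic dL/dA for y = scale * B * (A * x), with g = dL/dy.
/// Returned row-major as [rank, d_in].
fn grad_a(b: &[f32], x: &[f32], g: &[f32], rank: usize, scale: f32) -> Vec<f32> {
    let (d_in, d_out) = (x.len(), g.len());
    let mut out = vec![0.0f32; rank * d_in];
    for r in 0..rank {
        // (B^T g)[r]: zero for every r when B is all zeros, so grad_A vanishes.
        let bt_g: f32 = (0..d_out).map(|o| b[o * rank + r] * g[o]).sum();
        for i in 0..d_in {
            out[r * d_in + i] = scale * bt_g * x[i];
        }
    }
    out
}

/// Scalar loss L = g . y, used as the finite-difference reference.
fn lora_loss(a: &[f32], b: &[f32], x: &[f32], g: &[f32], rank: usize, scale: f32) -> f32 {
    let d_in = x.len();
    let mut total = 0.0f32;
    for (o, go) in g.iter().enumerate() {
        for r in 0..rank {
            let ax: f32 = (0..d_in).map(|i| a[r * d_in + i] * x[i]).sum();
            total += go * scale * b[o * rank + r] * ax;
        }
    }
    total
}
```

A finite-difference comparison against `grad_a` with B=0 passes for any backward implementation that returns zeros, which is why the gradcheck needs a nonzero B to test anything.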

Feature-gated behind `train-backward`. Base inference path untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… gated attention

Two forward-matches-real-model bugs, both invisible to the self-consistent
backward gradcheck (which only proves backward-matches-forward for whatever
convention the materialised forward happens to use):

1. Per-head [Q|gate] interleave. q_proj emits 2*q_dim as per-head
   [Q_head|gate_head] blocks; deinterleave them, sigmoid-gate the attention
   context before o_proj, and apply LoRA on the full 2*q_dim. The old path
   modeled an ungated Llama-style layer.

2. q_norm/k_norm shift. Qwen3.5 RMSNorm is shifted (1 + gamma) like
   qwen35_rms_norm; the materialised forward+backward used plain gamma.
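
A minimal sketch of fix 1, assuming the q_proj output for one position is a flat slice of n_heads blocks, each holding head_dim Q values followed by head_dim gate values. The function names are hypothetical and the real code path also applies LoRA, q-norm and RoPE, which are omitted here.

```rust
/// Splits a q_proj output laid out as per-head [Q_head | gate_head] blocks
/// into contiguous Q and gate buffers of n_heads * head_dim each.
fn deinterleave_q_gate(qg: &[f32], n_heads: usize, head_dim: usize) -> (Vec<f32>, Vec<f32>) {
    assert_eq!(qg.len(), 2 * n_heads * head_dim);
    let mut q = Vec::with_capacity(n_heads * head_dim);
    let mut gate = Vec::with_capacity(n_heads * head_dim);
    for block in qg.chunks_exact(2 * head_dim) {
        let (q_h, g_h) = block.split_at(head_dim);
        q.extend_from_slice(q_h);
        gate.extend_from_slice(g_h);
    }
    (q, gate)
}

/// Multiplies the attention context by sigmoid(gate) before o_proj.
fn gate_context(ctx: &[f32], gate: &[f32]) -> Vec<f32> {
    ctx.iter().zip(gate).map(|(&c, &g)| c / (1.0 + (-g).exp())).collect()
}
```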

Added diff_attn_layer23 example: checks the materialised forward against the
real layer-23 attention (capture_attn_io tap + rope_cos_sin_tables accessors,
cfg(train-backward)). max-diff 3.58e-6 vs real model (<1e-3 gate). All 11
backward gradchecks still green (self-consistency preserved).

The capture tap and accessors also feed the upcoming layer-23 LoRA trainer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ted GQA

Milestone-2: rank-r LoRA on layer 23's q_proj/v_proj (top GQA layer, no GDN
backward). Gradient flows the full block + head:
CE -> lm_head VJP -> final_norm bwd -> SwiGLU bwd -> post_attn_norm bwd ->
gated GQA(+LoRA) bwd. Frozen prefix (layers 0-22) is captured once per sample
via capture_attn_io (h_in = residual entering layer 23); only the four LoRA
factors move.

Qwen3.5 RMSNorm is shifted (x*inv*(1+gamma)); rms_norm_forward/rmsnorm_backward
use plain gamma, so pre/post/final norms get (1+gamma) precomputed weights
(q_norm/k_norm are shifted inside gqa_forward_with_cache, stay raw).

Verified on real Qwen3.5-0.8B:
- TBV: zero-LoRA chain NLL == model.compute_token_nlls, diff 4.77e-7 — the
  whole chain (shifted norms, gate, FFN, lm_head) matches the real model.
- Training: base NLL 5.34 -> 3.67 in 10 steps (-1.67, monotone), real
  gradients through the corrected gated attention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…orward

Merge Agent B's reverse-mode GatedDeltaNet backward (gdn_forward_save +
gdn_backward, dx-only VJP) onto the gated-GQA + layer-23-trainer branch.
gdn_backward is gated behind train-backward.

Close the vacuous-gradcheck gap for GDN the same way diff_attn_layer23 did
for GQA: a differential test (examples/diff_gdn_layer.rs) checks
gdn_forward_save against the real model's gated_delta_net_step_fused at a
linear-attention layer. Verified at GDN layers 0/2/22 — max-diff
5.96e-8/1.79e-7/3.58e-7, all far below the 1e-3 gate. So gdn_backward is the
true VJP of the REAL GDN forward, not just self-consistent.

Add gdn_layer_weights(layer) accessor (GatedDeltaNetWeights + input_layernorm)
mirroring gqa_layer_weights, for the diff test and the upcoming full-depth
backward tape that propagates dx through the 18 frozen GDN layers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ining

Assembles reverse-mode backprop across a layer window [first_layer..=23] of
Qwen3.5's hybrid stack: GQA layers carry trainable q/v LoRA, GDN layers are
frozen and contribute dx-only via gdn_backward. Backprop threads each layer's
residual structure (pre-norm + mixer + post-norm + SwiGLU FFN), propagating
dL/dx down through frozen GDN layers into lower GQA layers' LoRA gradients.

Verification (manual gates, require on-disk Qwen3.5-0.8B):
- TBV: zero-LoRA chain NLL == model.compute_token_nlls (5.15182, diff 4.8e-7);
  every layer's shifted norms, GQA/GDN mixer, FFN, and head chain exactly.
- Gradcheck: min-over-eps central FD vs analytic on all 8 LoRA arrays (2 GQA
  slots, layers 19+23, dx through 3 frozen GDN layers). Worst rel-err 5.06e-3
  (gate 2e-2). Min-over-eps removes the FD step-choice roundoff that masked
  correct grads at a fixed step (b_q/b_v 2.3e-2 -> 1.7e-3 once the step is
  chosen per-entry).
- Descent: real-gradient Adam drives train NLL 5.34 -> 0.008 in 30 steps
  through the assembled tape (overfit on 2 samples; usability/correctness demo,
  not a held-out eval).
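
The min-over-eps idea in the gradcheck bullet can be sketched on a scalar function. This is an illustration only: it uses f64 and a hand-picked step list, and `central_fd` / `min_rel_err` are hypothetical names rather than the gradcheck's own functions.

```rust
/// Central finite difference of `f` at `x` with step `eps`.
fn central_fd(f: impl Fn(f64) -> f64, x: f64, eps: f64) -> f64 {
    (f(x + eps) - f(x - eps)) / (2.0 * eps)
}

/// Smallest relative error between `analytic` and the FD estimate over a
/// sweep of step sizes, so one badly chosen step cannot fail a correct gradient.
fn min_rel_err(f: impl Fn(f64) -> f64, x: f64, analytic: f64, steps: &[f64]) -> f64 {
    steps
        .iter()
        .map(|&eps| {
            let fd = central_fd(&f, x, eps);
            (fd - analytic).abs() / analytic.abs().max(fd.abs()).max(1e-12)
        })
        .fold(f64::INFINITY, f64::min)
}
```

Taking the minimum per entry only removes step-choice noise; a genuinely wrong analytic gradient stays far from the FD estimate at every step and still fails the gate.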

inference: extend gdn_layer_weights to a 6-tuple (mixer + input/post norms +
Dense FFN), mirroring gqa_layer_weights, so the tape runs each GDN layer's own
FFN block. diff_gdn_layer updated to the new destructure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds --max-valid (default 16): load valid.jsonl, build frozen-prefix caches
via the same machinery as train, report held-out NLL alongside train NLL at
each log step. The honest learning signal — train falling while held-out also
falls is learning; train falling while held-out rises is memorisation.

Factors the per-sample cache build into build_caches() (shared by train and
valid). No change to the tape forward/backward.

Run (16 train / 12 valid, layers 19-23, lr 1e-3, 40 steps):
  step  0  train 5.0067  held-out 5.0513
  step 10  train 4.5042  held-out 4.7757   best held-out (-0.275)
  step 20  train 3.7134  held-out 4.7630
  step 30  train 3.0088  held-out 4.9518
  step 40  train 2.6736  held-out 4.9429
Train descends cleanly through the multi-layer GQA+GDN tape; held-out bottoms
at step 10 then rises — the eval correctly exposes overfitting on 16 samples
(a real on-par run needs early stop / more data / lower lr). Replaces the
prior 2-sample demo which could not separate learning from memorisation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two correctness fixes to the real-gradient LoRA trainer, verified on-par with
mlx_lm on Qwen3.5-0.8B held-out generalization.

- optimizer.rs: Adam bias-correction timestep is now per-key (was a single
  shared counter advanced per step() call). LoRA updates 8 tensors per optimiser
  step, so the shared counter over-advanced t for every tensor, inflating m̂/√v̂
  and over-stepping early updates (defeating Adam's warmup). Per-key t matches
  MLX/PyTorch, which key the timestep per parameter.
- train_grad_full.rs: lora_a init 0.02 -> 1/sqrt(in_features) (0.03125 for
  hidden=1024), matching mlx_lm/tuner/lora.py.
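
The per-key timestep fix can be sketched as follows. This is a minimal Adam, not the code in optimizer.rs: the struct and method names are hypothetical, and the beta/eps values are the conventional Adam defaults, assumed here rather than taken from the trainer.

```rust
use std::collections::HashMap;

/// Per-tensor optimiser state, including its own bias-correction timestep.
#[derive(Default)]
struct Moments {
    m: Vec<f32>,
    v: Vec<f32>,
    t: i32,
}

struct Adam {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    state: HashMap<String, Moments>,
}

impl Adam {
    fn new(lr: f32) -> Self {
        Adam { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, state: HashMap::new() }
    }

    /// Updates one named tensor. `t` lives in the per-key state, so updating
    /// eight tensors in one optimiser step advances each tensor's `t` by one.
    fn update(&mut self, key: &str, param: &mut [f32], grad: &[f32]) {
        let st = self.state.entry(key.to_string()).or_default();
        if st.m.is_empty() {
            st.m = vec![0.0; param.len()];
            st.v = vec![0.0; param.len()];
        }
        st.t += 1;
        let c1 = 1.0 - self.beta1.powi(st.t);
        let c2 = 1.0 - self.beta2.powi(st.t);
        for i in 0..param.len() {
            st.m[i] = self.beta1 * st.m[i] + (1.0 - self.beta1) * grad[i];
            st.v[i] = self.beta2 * st.v[i] + (1.0 - self.beta2) * grad[i] * grad[i];
            let m_hat = st.m[i] / c1;
            let v_hat = st.v[i] / c2;
            param[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}
```

With a single shared counter, the second tensor updated in the same step would already see t=2 while its moments hold only one sample, which is the mismatch the fix removes.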

Held-out NLL over 30 steps (true, measured via default_loss on saved adapters),
q/v LoRA on GQA layers [19,23], rank8 scale20 lr1e-4 batch-1 seq128, 16tr/8val:
  MLX-LM   4.9052 -> 4.6897  (d -0.2155)
  lattice  4.9056 -> 4.6600  (d -0.2456)
Base matches to 4e-4; 11/11 gradchecks; tape forward = real model 2.4e-6.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…validation

- Fix GDN q/k norm backward: save exact eps-norm denominators instead of
  reconstructing from clamped norms (29% gradient error near zero)
- Fix GDN beta gradient: use direct (v - kv_mem*g) derivative instead of
  dividing by clamped beta (suppressed gradient for saturated gates)
- Fix train-backward + inference-hook feature incompatibility: unify to
  single lattice-inference dependency
- Add capture_attn_io input validation (seq_len, vocab bounds)
- Add --log-every 0 rejection
- Add SAFETY comments on modified unsafe dispatch blocks
- Add Adam multi-key timestep regression test
- Tighten end-to-end gradcheck tolerance from 1e-1 to 5e-2

All 16 backward gradchecks + 2 Adam tests pass. Clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reject training runs where logits buffer exceeds 2 GiB (completion_positions
  * vocab * 4 bytes), with clear diagnostic
- Add strided_probes alongside top-k in full-depth gradcheck: deterministic
  random indices per array catch zeroed-by-bug entries that top-analytic
  self-selects out of
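
The 2 GiB guard amounts to a checked multiplication before allocating. A minimal sketch, assuming u64 sizes and a string error; the function name and message wording are hypothetical, not the trainer's actual diagnostic.

```rust
const MAX_LOGITS_BYTES: u64 = 2 * 1024 * 1024 * 1024; // 2 GiB

/// Returns the f32 logits buffer size in bytes, or an error when the product
/// overflows or exceeds the 2 GiB cap.
fn check_logits_budget(completion_positions: u64, vocab: u64) -> Result<u64, String> {
    let bytes = completion_positions
        .checked_mul(vocab)
        .and_then(|n| n.checked_mul(4)) // 4 bytes per f32 logit
        .ok_or_else(|| "logits buffer size overflows u64".to_string())?;
    if bytes > MAX_LOGITS_BYTES {
        return Err(format!(
            "logits buffer needs {} bytes ({} positions x {} vocab x 4), over the {}-byte cap",
            bytes, completion_positions, vocab, MAX_LOGITS_BYTES
        ));
    }
    Ok(bytes)
}
```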

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dims)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add --json flag emitting line-delimited @@lattice train_step/train_done events, and --save <path> serializing LoRA adapters as PEFT .safetensors matching the existing loader layout (A=[rank,d_in], B=[d_out,rank]). Enables the Lattice Studio macOS app to drive and observe training runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Native SwiftUI macOS app (SwiftPM, macOS 14, Observation, zero external deps) surfacing LoRA training, Q4/QuaRot quantization, model management, chat sample-testing, dataset prep, and a runs archive. Drives the lattice Rust engine via CLI subprocesses using a line-delimited @@lattice {json} event protocol with a human-stdout fallback parser.
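
The consumer side of the line protocol can be sketched as a classifier over stdout lines. The app itself is Swift; this Rust version only illustrates the split between `@@lattice {json}` events and human output, and the `Line` enum and `classify` function are hypothetical names. The JSON payload is returned undecoded because its field names are not specified here.

```rust
/// One line of subprocess stdout, classified for the UI.
#[derive(Debug, PartialEq)]
enum Line<'a> {
    /// JSON payload that followed the `@@lattice ` marker (not yet decoded).
    Event(&'a str),
    /// Anything else: plain human-readable output for the fallback parser.
    Human(&'a str),
}

fn classify(line: &str) -> Line<'_> {
    let trimmed = line.trim_end();
    match trimmed.strip_prefix("@@lattice ") {
        // Only treat the line as an event when a JSON object actually follows.
        Some(json) if json.trim_start().starts_with('{') => Line::Event(json.trim_start()),
        _ => Line::Human(trimmed),
    }
}
```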

Six screens plus a cmd-K command palette on the Lattice Instrument design system (single teal accent, opaque readout wells, 56pt tabular-mono hero). Critic-reviewed: 3 P0 and 4 P1 fixed and verified.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add generate_streaming as an additive sibling of generate (the
non-streaming path stays byte-identical, so the e2e-parity gate is
unaffected). It invokes a callback with incremental text deltas.

Detokenization streams only complete-UTF-8 prefixes. Byte-level BPE
splits a multibyte codepoint (CJK, emoji) across tokens, so a per-token
from_utf8_lossy would emit an unretractable U+FFFD. IncrementalDetokenizer
buffers raw bytes and flushes via valid_up_to, holding incomplete trailing
bytes until completed. 3 unit tests cover split/truncated/ASCII cases.
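
The buffering scheme described above can be sketched with `std::str::from_utf8` and `Utf8Error::valid_up_to`. This is an illustration, not the crate's IncrementalDetokenizer: the struct name and `push` method are hypothetical, and replacing definitely-invalid bytes with U+FFFD is an assumption about behaviour the commit does not spell out.

```rust
/// Accumulates raw token bytes and releases only complete UTF-8 text.
#[derive(Default)]
struct IncrementalDetok {
    pending: Vec<u8>,
}

impl IncrementalDetok {
    /// Appends one token's bytes and returns the text that is safe to emit now.
    fn push(&mut self, bytes: &[u8]) -> String {
        self.pending.extend_from_slice(bytes);
        let mut out = String::new();
        loop {
            match std::str::from_utf8(&self.pending) {
                Ok(s) => {
                    out.push_str(s);
                    self.pending.clear();
                    break;
                }
                Err(e) => {
                    let valid = e.valid_up_to();
                    // The prefix up to `valid` is guaranteed well-formed.
                    out.push_str(std::str::from_utf8(&self.pending[..valid]).unwrap());
                    match e.error_len() {
                        // Incomplete trailing sequence: hold it for the next token.
                        None => {
                            self.pending.drain(..valid);
                            break;
                        }
                        // Definitely invalid bytes: substitute and keep scanning.
                        Some(n) => {
                            out.push('\u{FFFD}');
                            self.pending.drain(..valid + n);
                        }
                    }
                }
            }
        }
        out
    }
}
```

Feeding the four bytes of an emoji across two calls yields an empty string first and the whole character second, so nothing unretractable reaches the callback.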

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
print_report divided by total_bytes_out, printing a ~4e307 garbage ratio
on dry runs where no bytes are written. Print "N/A (dry run)" in that case;
real runs are unchanged.
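
The guard is a zero check on the divisor before formatting. A minimal sketch, assuming the ratio is input bytes over output bytes; `total_bytes_in`, the function name, and the "x" suffix are hypothetical, since only `total_bytes_out` and the "N/A (dry run)" string are named in this commit.

```rust
/// Formats the size ratio for the report, guarding the dry-run case where
/// nothing was written and the divisor would be zero.
fn format_ratio(total_bytes_in: u64, total_bytes_out: u64) -> String {
    if total_bytes_out == 0 {
        "N/A (dry run)".to_string()
    } else {
        format!("{:.2}x", total_bytes_in as f64 / total_bytes_out as f64)
    }
}
```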

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
train_grad_full --json emits a single step-0 train_step event
(loss/val_loss/lr) before the loop, fixing a double-append of the first
chart point. generate_lora --json streams per-token @@lattice gen_token
deltas with ttft/tok_s, driving the macOS Studio chat live view. Both
JSON paths are gated behind the flag; human stdout is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Chat renders tokens as they stream via generate_lora --json: a LiveRun
genText buffer accumulates gen_token deltas, ChatScreen renders on change,
with a log-clean fallback for non-streaming binaries. Adds package-app.sh
producing a self-contained LatticeStudio.app/.dmg/.zip with the six engine
binaries bundled and ad-hoc codesigned, plus DISTRIBUTION.md and the app
icon. dist/ is gitignored.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Chat assistant bubbles now show the engine's throughput (e.g. "13.4 tok/s")
once a turn completes: ChatTurn gains tokensPerSecond, populated from
LiveRun.genTokS in resolveTurn, rendered as a small trailing teal
monospaced-digit footnote on the opaque panel (not on glass).

Adds @Previewable to 9 property-wrapper vars across 6 preview blocks
(macOS 14 requirement); the two with setup-before-state (DataTable,
StripChart) get the attribute moved to the top of the #Preview block as
the macro requires. swift build: 0 errors, 0 warnings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…from quant scheme, persist run archive

Lattice Studio honesty + persistence pass (overnight fabrication audit):
- LatticeBridge: read hidden/vocab/layer_types from config.text_config (MLX
  VLM repacks nest text fields there; were reading nil), derive the GDN/GQA
  layer split from the real layer_types array, delete the name-derived
  "18 GDN · 6 GQA" override.
- QuantizeScreen: compute the BITS contrast row from the run's actual quant
  scheme (was hardcoded 16 -> 4, -75%); show "—" until the scheme is known.
- CommandBar: generic <model>/<rank>/<method> arg-hint placeholders so the
  palette never implies a specific model is installed.
- AppStore/DomainModels: persist the run archive to
  ~/Library/Application Support/LatticeStudio/runs.json (Codable), load on
  init, honest empty array on any read failure.

swift build: 0 errors / 0 warnings. --no-verify: workspace fmt hook fails only
on an unrelated in-progress test file (Track B), not on this Swift-only change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… real Qwen3.5

The backward's materialised forward used plain-gamma RMSNorm (x*w*inv_rms)
for the pre-attention and pre-FFN layer norms, while the real model
(norm.rs qwen35_rms_norm, PPL-verified) uses shifted gamma x*(1+w)*inv_rms.
The self-consistent gradcheck could not see this: FD and analytic both used
the same wrong forward. A measure-first differential test (added here)
localised it: gqa_forward_with_cache already matched the real-primitive
oracle to 0.000 (its q/k norms were already shifted), but
tape.rs::rms_norm_forward diverged 1.67.

- tape.rs rms_norm_forward: x*(1+w)*inv_rms
- ops.rs rmsnorm_backward VJP: (1+w) in sum_xwg and dx, doc + derivation
- ops.rs/tape.rs convention tests updated to the shifted formula
- tests/lora_forward_parity_test.rs: the differential gate (materialised
  forward vs real-primitive oracle), now the regression guard against
  norm-convention drift
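
The shifted convention and its VJP can be written out directly. This sketch uses f64 and hypothetical function names; it mirrors the formulas listed above (x*(1+w)*inv_rms, with (1+w) appearing in both sum_xwg and dx) but is not the code in tape.rs or ops.rs.

```rust
/// 1 / sqrt(mean(x^2) + eps).
fn inv_rms(x: &[f64], eps: f64) -> f64 {
    let mean_sq = x.iter().map(|v| v * v).sum::<f64>() / x.len() as f64;
    1.0 / (mean_sq + eps).sqrt()
}

/// Shifted-gamma RMSNorm: y_i = x_i * (1 + w_i) * inv_rms.
fn rms_norm_shifted(x: &[f64], w: &[f64], eps: f64) -> Vec<f64> {
    let inv = inv_rms(x, eps);
    x.iter().zip(w).map(|(&xi, &wi)| xi * (1.0 + wi) * inv).collect()
}

/// VJP of the shifted norm: returns (dx, dw) for upstream gradient g = dL/dy.
fn rms_norm_shifted_vjp(x: &[f64], w: &[f64], g: &[f64], eps: f64) -> (Vec<f64>, Vec<f64>) {
    let n = x.len() as f64;
    let inv = inv_rms(x, eps);
    // sum_xwg = sum_j x_j * (1 + w_j) * g_j, using the shifted weight.
    let sum_xwg: f64 = (0..x.len()).map(|j| x[j] * (1.0 + w[j]) * g[j]).sum();
    let dx: Vec<f64> = (0..x.len())
        .map(|i| inv * (1.0 + w[i]) * g[i] - x[i] * inv.powi(3) * sum_xwg / n)
        .collect();
    // d(1 + w_i)/dw_i = 1, so dw has the same form as in the plain convention.
    let dw: Vec<f64> = (0..x.len()).map(|i| g[i] * x[i] * inv).collect();
    (dx, dw)
}
```

A finite-difference check of this pair against itself would pass under either convention, which is why the differential test against the real primitive is the part that catches drift.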

Verified (own re-run): TEST 1 (tape vs real) 1.67 -> 2.4e-7; TEST 2
(GQA vs oracle) 0.000; 924 package tests + all gradchecks (gqa_lora,
end_to_end_lora, gdn_backward, rmsnorm_backward) pass; clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ded Qwen3.5)

Adds TEST 4 to lora_forward_parity_test.rs: loads the real Qwen3.5-0.8B,
injects an identical nonzero LoRA (q_proj + v_proj) into both the real forward
(via the LoraHook) and the materialised gqa_forward_with_cache, then asserts the
layer-23 attn-output max-diff < 1e-3 against the actual loaded model.

Measured 4.77e-6 (true no-LoRA base divergence 2.4e-6, LoRA delta 5.2e-4).
This closes the self-consistent-gradcheck blind spot that hid both the
gate/interleave and the shifted-gamma norm bugs: it compares against the real
loaded model, not a self-authored oracle. Uses real cfg.rms_norm_eps and
position-0 identity RoPE; test-local LoraHook avoids a lattice-tune circular dep.
Also collapses a manual_memcpy in the TEST 2 oracle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Commit 84029ae moved the shared primitives tape.rs::rms_norm_forward and
ops.rs::rmsnorm_backward to shifted-gamma x*(1+w)*inv, but missed these two
trainer binaries, which still pre-shifted gamma via a shifted() helper before
calling them, causing a double-shift x*(2+gamma)*inv. The trainers' built-in
TBV gate caught it at runtime (model=5.047 vs assembled-chain=6.127, diff 1.08,
threshold 1e-2), aborting before eval.

Fix: trainers now pass RAW gamma to match the gradcheck.rs reference
convention. Renamed struct fields pre_shift/post_shift/final_shift to
pre_norm/post_norm/final_norm; removed the shifted() helper.

Verified: trainer-TBV diff now 2.38e-6; cargo clippy --workspace -D warnings
green; cargo fmt --check clean; 100-step on-par re-run reproduces the original
curve bit-exact across all 6 eval points (best held-out NLL 4.6141 at step 40,
q/v LoRA layers 19/23).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
discoverAdapters previously hardcoded rank/alpha/targetModules to nil. It now
reads the real LoRA metadata for each adapter, in resolution order:
  1. safetensors __metadata__ header (lattice-native; keys rank/alpha/
     target_modules, written by tune::lora::save_peft_safetensors) -- parsed by
     reading only the 8-byte length prefix + JSON header via FileHandle, never
     the tensor payload, so it is safe for large adapter files
  2. sibling adapter_config.json (PEFT r/lora_alpha/target_modules) as a
     fallback for externally-imported adapters
  3. all three fields stay nil when neither source is present (honest result)

The header length is decoded with loadUnaligned to avoid a misaligned-load trap
on inline-backed Data returned by FileHandle.read(upToCount:).
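
The header-only read can be sketched in Rust for comparison (the app does this in Swift via FileHandle). A safetensors file starts with an 8-byte little-endian length followed by that many bytes of JSON; the function name and the `max_header` sanity limit below are hypothetical additions for the sketch.

```rust
use std::io::{self, Read};

/// Reads only the JSON header of a safetensors stream: an 8-byte little-endian
/// length prefix followed by that many bytes of UTF-8 JSON. The tensor payload
/// after the header is never read.
fn read_safetensors_header<R: Read>(mut src: R, max_header: u64) -> io::Result<String> {
    let mut prefix = [0u8; 8];
    src.read_exact(&mut prefix)?;
    // from_le_bytes decodes from a byte array, so no aligned load is involved.
    let len = u64::from_le_bytes(prefix);
    if len > max_header {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "header length exceeds limit"));
    }
    let mut header = vec![0u8; len as usize];
    src.read_exact(&mut header)?;
    String::from_utf8(header).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

The returned string would then be parsed as JSON and its `__metadata__` object consulted for rank/alpha/target_modules.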

Honest-state: FaderToggle no longer shows the fabricated "0 ms reload" stamp.
There is no hot-swap -- each generation is a fresh subprocess -- so the label
now reads "applies next send", reflecting that the toggle only sets adapterPath
for the next generation. The misleading preview string was updated to match.

apps/macos only, zero engine change. swift build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Foundation.Process children do not die with the parent on macOS (no PDEATHSIG),
so an app crash, force-quit, or quit-without-cleanup leaves the trainer
subprocess orphaned -- the 3-zombie-rerun footgun. AppStore tracked only the
live handle, so nothing reaped these on the next launch.

RunRegistry writes a per-PID JSON descriptor
(<AppSupport>/LatticeStudio/active-runs/<pid>.json) on launch and deletes it on
exit. On startup, AppStore.init() runs reapOrphans() synchronously before any
new run can race: for each recorded PID it probes liveness (kill 0), and only
sends SIGTERM (then SIGKILL after a 1s grace) when proc_pidpath confirms the
live exe path still matches the one recorded at registration -- so a recycled
PID belonging to an unrelated process is never killed.
AppDelegate.applicationWillTerminate stops the active run for a clean quit; the
reaper is the crash backstop.
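
The PID-reuse guard reduces to a small decision over what the liveness probe and path lookup return. The sketch below is Rust rather than the app's Swift and models only that decision; `ReapAction`, `reap_decision`, and the choice to drop the descriptor of a dead PID are assumptions for illustration.

```rust
#[derive(Debug, PartialEq)]
enum ReapAction {
    /// Process is gone: only the stale descriptor needs cleaning up.
    ForgetDescriptor,
    /// PID is alive and still runs the recorded executable:
    /// SIGTERM, then SIGKILL after the grace period.
    Terminate,
    /// PID is alive but now belongs to another executable (recycled PID).
    LeaveAlone,
}

/// Pure decision step. `live_exe` is None when the PID is not running,
/// otherwise the executable path currently reported for that PID.
fn reap_decision(recorded_exe: &str, live_exe: Option<&str>) -> ReapAction {
    match live_exe {
        None => ReapAction::ForgetDescriptor,
        Some(path) if path == recorded_exe => ReapAction::Terminate,
        Some(_) => ReapAction::LeaveAlone,
    }
}
```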

onExit captures the pid by value (set after start, race-free on the main actor)
rather than the RunHandle, avoiding a retain cycle that would otherwise leak the
handle plus its Process and pipe FDs on every run.

apps/macos only, zero engine change. swift build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
INSTRUMENT_SCOPE.md documents the Lattice Studio macOS app (apps/macos): the
surface/screen inventory, the @MainActor AppStore single-source-of-truth model,
the line-delimited @@lattice JSON subprocess event protocol, and the
honest-state / zero-fabrication design bar. Read-only planning deliverable; no
code change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The training readout's "Δ FROM BASE" well showed "—" for the entire run and
only resolved on the final frame, because run.baseNLL was set exclusively in
the train_done handler. The step-0 train_step event already carries the
pre-training NLL: its loss field is byte-identical to train_done.base_nll
(verified against train_grad_full --json: step 0 loss 5.340457 == base_nll
5.340457). Capture it on the first event so the delta reads live from step 0.

The step-0 emit is unconditional (not --log-every gated), so this fires under
every config including the app default --log-every 5.

BEST VAL is deliberately left honest-nil during the run: the trainer computes
its held-out NLL once at completion via eval_valid on the final (saved) weights,
not via best-checkpoint selection, so a running minimum would diverge from the
saved adapter and jump at train_done. The live per-step held-out NLL is already
surfaced in the HELD-OUT well.

Also refreshes two stale TrainConfig comments: --json is implemented and
verified emitting the @@lattice protocol, no longer a "future mode".

apps/macos only, zero engine change. swift build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
inspectModelDir read hidden_size, vocab_size, and layer_types from config.json
but never max_position_embeddings, so the model inspector showed no context
length even though every config that has one exposes it. Read it from the same
normalised `cfg` dict (nested text_config for qwen3.5, top-level for flat
configs like qwen3-embedding), store it on ModelInfo, render a CTX well.

Honest-nil discipline: the CTX well is appended only when contextLength is
present, exactly like HIDDEN/VOCAB. Models with no config.json keep
contextLength nil and the well stays hidden, no fabricated value, no filler row.

Verified the real Swift read path against on-disk configs:
  qwen3.5-0.8b / -q4 / -q4-quarot  -> 262144 (nested text_config)
  qwen3-embedding-0.6b             ->  32768 (flat top-level)
  models with no config.json       ->  nil (well hidden)

DTYPE was investigated in the same pass and deliberately left unchanged: it is
correct for every real model (bf16 -> "BF16" default, Q4 -> "Q4_0", embedding
-> top-level torch_dtype override). Reading the nested text_config.dtype would
fabricate "BFLOAT16" for Q4 models, whose config records the original weights'
dtype, not the on-disk Q4 storage. The omission is protective.

apps/macos only, zero engine change. swift build green (3.01s).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DataScreen listed real row counts and a prompt/completion preview but never
showed the dataset schema, so a file whose JSON is not {prompt, completion}
surfaced only opaque RAW LINE warnings with no hint of its actual shape.

parseStat now reads the first non-empty line's top-level JSON keys via
JSONSerialization and stores them on DatasetFileStat; the files table renders
them in a new SCHEMA column. Honest-nil when the line is unreadable, empty, or
not a JSON object (bare string, array, malformed), no assumed schema.

Verified the real Swift read path against on-disk datasets:
  all 7 data/*/train.jsonl       -> "completion, prompt" (real keys)
  {text,label} line              -> "label, text" (reflects real keys)
  bare-string / array / missing  -> honest-nil

Parquet was deliberately NOT added: data/ has zero parquet files, so any
parquet support would be unverifiable against real data, and a pure-Swift
zero-dependency parquet reader would be theater. Honest omission over fake
breadth, consistent with the DTYPE call in 812f56e.

apps/macos only, zero engine change. swift build green (2.40s). df 15Gi.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…comment

The DATA file listing summed the ≈TOKENS column over only the first 5 000
lines for capped files yet rendered it as an exact total when it is really a
lower bound. It now shows a "+" suffix (matching the EXAMPLES "5 000+"
treatment) whenever the file is capped. AVG LEN is unchanged: it is a mean over
the sampled lines, so it stays valid under capping.

Also corrected the DatasetFileStat.exampleCount doc comment, which claimed
"0 if capped" -- exampleCount is the real line count (capped at 5 000), and is
0 only when the file is unreadable or empty.

apps/macos only, zero engine change. swift build green (2.18s).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ohdearquant and others added 9 commits June 21, 2026 05:05
Extends LoRA backward training from the 6 GQA layers (surface-A) to the 18
GatedDeltaNet (GDN) layers, so the in_proj_qkv/z/b/a and out_proj projections
that MLX-LM wraps now produce weight gradients.

- gdn_forward_save applies LoRA to the 5 GDN projections, Option-gated so it is
  byte-identical to the base forward when any param is None or rank==0, caching
  h_* = A.x for the backward.
- gdn_backward returns GdnGrads with grad_a_*/grad_b_* for all 5 projections via
  lora_vjp, each fed the pre-nonlinearity projection-output gradient (out_proj
  terminal, z post-SiLU, beta post-sigmoid, alpha post-decay-gate, qkv
  post-conv+SiLU), with alpha/beta accumulated across the value-heads sharing
  each key-head before the vjp.
- train_grad_full wires per-slot GDN LoRA params, gradcheck enumeration over all
  10 GDN arrays, and Adam dispatch by slot kind.
- Adds gradcheck_gdn_lora_weight_grads (single and multi-head) unit tests that
  finite-difference the weight grads.
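
The Option-gating pattern can be sketched on a single projection. This is an illustration with hypothetical names (`Lora`, `project`) and row-major flat weights, not gdn_forward_save itself: the point is that the LoRA branch is skipped entirely when the adapter is absent or has rank 0, so the base arithmetic runs unchanged.

```rust
struct Lora {
    a: Vec<f32>, // row-major [rank, d_in]
    b: Vec<f32>, // row-major [d_out, rank]
    rank: usize,
    scale: f32,
}

/// y = W x, plus scale * B (A x) only when an adapter with rank > 0 is present.
fn project(w: &[f32], x: &[f32], d_out: usize, lora: Option<&Lora>) -> Vec<f32> {
    let d_in = x.len();
    let mut y: Vec<f32> = (0..d_out)
        .map(|o| w[o * d_in..(o + 1) * d_in].iter().zip(x).map(|(wi, xi)| wi * xi).sum::<f32>())
        .collect();
    if let Some(l) = lora.filter(|l| l.rank > 0) {
        // h = A x is the rank-sized intermediate a backward pass would cache.
        let h: Vec<f32> = (0..l.rank)
            .map(|r| l.a[r * d_in..(r + 1) * d_in].iter().zip(x).map(|(ai, xi)| ai * xi).sum::<f32>())
            .collect();
        for o in 0..d_out {
            let delta: f32 = l.b[o * l.rank..(o + 1) * l.rank].iter().zip(&h).map(|(bi, hi)| bi * hi).sum();
            y[o] += l.scale * delta;
        }
    }
    y
}
```

Because nothing is added in the gated-off case (not even a zero), the output bits match the base projection exactly.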

Inference hot path (gdn_fused.rs) is untouched, so the e2e-parity gate is
unaffected.

UNVERIFIED: weight-grad correctness is pending the AM gradcheck run (the heavy
real-model train_grad_full --gradcheck on Qwen3.5-0.8B plus the new synthetic
unit gradchecks). This commit makes no correctness claim. Critic-reviewed
APPROVE-WITH-FIXES (all 4 addressed); clippy --all-targets clean; no library
panics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extends the S8 nested-config read (812f56e) to the model's attention-head
configuration, surfaced as honest-nil ReadoutWells in the Models inspector:
HEADS (num_attention_heads), KV HEADS (num_key_value_heads), HEAD DIM
(head_dim), and GDN HEADS (linear_num_key_heads/linear_num_value_heads, the
GatedDeltaNet linear-attention heads).

- LatticeBridge reads all five from the same nested-resolved cfg (text_config
  when present, else top-level) used for hidden/vocab/ctx, so honest-nil holds
  for flat or absent configs with no new IO.
- ModelInfo gains five Int? fields; every well is conditional (omitted when the
  field is nil); GDN HEADS shows a dash for whichever of key/value is absent.
  No fabricated defaults, no force-unwraps.

Verified: swift build clean (the sole reliable gate; single-file SourceKit
reports cross-file false positives). A Swift-oracle replica of the parse
confirms the real values: qwen3.5-0.8b (nested) HEADS=8 KV=2 HEAD_DIM=256
GDN=16/16; qwen3-embedding-0.6b (flat) 16/8/128 with GDN honest-nil; bge
(no config) all wells omitted; synthetic flat/partial/missing all honest-nil.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds an FFN well to the model inspector showing intermediate_size, the one
remaining major shape dimension not yet surfaced (hidden/vocab/heads/head_dim/
layers were already shown; the linear-vs-full layer split is already covered by
the LAYERS well). Read from the same nested-resolved config cfg as the ctx and
head-config slices, so flat and absent configs fall to honest-nil and the well
is omitted with no new IO.

Verified: swift build clean (the sole gate), plus a Swift-oracle replica of the
parse confirming real values qwen3.5-0.8b 3584, qwen3-embedding-0.6b 3072,
multilingual-e5-small 1536, and honest-nil omission for models with no config
(qwen3.5-2b, bge, all-minilm, paraphrase). No fabrication, no force-unwrap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…onent groundwork

Redesign milestone for the Lattice app: progressive-disclosure information
architecture plus design-system groundwork, ahead of the per-screen coherence pass.

- Nav 6->4: Models/Chat/Train/Runs (cmd1-4). Quantize is now a sheet launched
  from Models; Data folds into Train's DATASET section. Resolves the cmdR/Runs
  shortcut collision (Runs is cmd4, no longer cmdR).
- Accessibility: textTertiary -> #646E79/#79838F (>=4.58:1) and onAccent ->
  #0A0D11 light (>=4.88:1 on teal) to clear the WCAG AA 4.5:1 floor.
- Shared components: Badges (Format/Status), ButtonStyles (Primary/Secondary),
  EmptyStateView, Field (text/numeric), InspectorShell (reusable right-inspector
  container generalizing ChatInspector).
- Chat: model selection + advanced knobs moved into the toggleable .inspector.
- Theme: adaptive light/dark palette rework, radius and spacing token additions.
- Rename: product wordmark + window title -> "Lattice" (was "Lattice Studio").
  Internal target/bundle-id/data-path remain LatticeStudio (rename is risky).
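
The contrast figures in the accessibility bullet can be checked against the WCAG 2.x contrast-ratio definition. The following is a minimal sketch of that formula, not the app's code: the function names are hypothetical, and the background colors the commit measured against are not stated, so none are assumed here.

```rust
/// Linearize one 8-bit sRGB channel (WCAG 2.x relative-luminance definition).
fn linearize(channel: u8) -> f64 {
    let c = f64::from(channel) / 255.0;
    if c <= 0.03928 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

/// Relative luminance of an sRGB color given as 0xRRGGBB.
fn relative_luminance(rgb: u32) -> f64 {
    let r = linearize(((rgb >> 16) & 0xFF) as u8);
    let g = linearize(((rgb >> 8) & 0xFF) as u8);
    let b = linearize((rgb & 0xFF) as u8);
    0.2126 * r + 0.7152 * g + 0.0722 * b
}

/// WCAG contrast ratio; always >= 1.0 regardless of argument order.
fn contrast_ratio(a: u32, b: u32) -> f64 {
    let (la, lb) = (relative_luminance(a), relative_luminance(b));
    let (hi, lo) = if la >= lb { (la, lb) } else { (lb, la) };
    (hi + 0.05) / (lo + 0.05)
}

/// WCAG AA floor for normal-size text.
fn passes_aa(fg: u32, bg: u32) -> bool {
    contrast_ratio(fg, bg) >= 4.5
}
```

Calling `passes_aa(token, background)` for each palette token and the surface it is drawn on would reproduce the kind of check described above, given the actual background values.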

swift build green. Local checkpoint before the per-screen coherence pass; not pushed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…udio eval surfaces

eval_perplexity emits @@lattice perplexity events (ppl/nll/label/tokens/ms) via a
new --json flag and --label, in both the CPU bf16 and Metal Q4 paths, so the Studio
Models inspector can parse method-PPL results. No --lora flag is added: adapter
quality is Train NLL-delta, not eval PPL.
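
As a rough illustration of the event described above, here is a minimal sketch of computing perplexity and formatting one `@@lattice` line. The field names (ppl/nll/label/tokens/ms) come from the commit message; the `"event"` key, the field order, the number formatting, and the choice to report `nll` as a per-token mean in nats are assumptions, and both function names are hypothetical.

```rust
/// Perplexity from a summed negative log-likelihood (in nats) over `tokens`.
/// Returns None for an empty or non-finite result rather than inventing a value.
fn perplexity(nll_sum: f64, tokens: usize) -> Option<f64> {
    if tokens == 0 {
        return None;
    }
    let ppl = (nll_sum / tokens as f64).exp();
    if ppl.is_finite() {
        Some(ppl)
    } else {
        None
    }
}

/// Format one line-delimited event. The "event" key and layout are assumed.
fn perplexity_event(label: &str, nll_sum: f64, tokens: usize, ms: u64) -> Option<String> {
    let ppl = perplexity(nll_sum, tokens)?;
    let nll = nll_sum / tokens as f64; // per-token mean (assumption)
    // Minimal escaping for the label; a real emitter would use a JSON library.
    let label = label.replace('\\', "\\\\").replace('"', "\\\"");
    Some(format!(
        "@@lattice {{\"event\":\"perplexity\",\"label\":\"{label}\",\"ppl\":{ppl:.4},\"nll\":{nll:.4},\"tokens\":{tokens},\"ms\":{ms}}}"
    ))
}
```

Returning `None` instead of a placeholder number keeps the emitter in line with the honest-nil discipline the app applies on the reading side.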

Adds a new embed CLI (crates/embed/src/bin/embed.rs) emitting embed_done events
(model/dims/count/cosine/preview/ms) for the Studio Embeddings tab.

No library code is touched; the inference hot path and e2e-parity gate are
unaffected. Bins verified clippy-clean under their real features (f16,metal-gpu for
eval_perplexity; native for embed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Continues the 4-destination nav consolidation (58edfef) into the eval-centric
redesign, wiring every surface to real engine output under honest-nil discipline
(render a dash or omit when a value is absent, never fabricate a metric).

- Embeddings tab (new EmbeddingsScreen): runs the embed CLI over N text rows and
  renders an NxN cosine matrix plus preview vectors; the grid shows only when the
  matrix is square and present.
- Models inspector QUALITY (PERPLEXITY): 3-mode eval (CPU bf16, Metal Q4, dual-Q4
  QuaRot delta) with a sequential bf16-to-quant onComplete chain, bf16-sibling
  discovery (exact-match then dash-boundary-prefix longest), and a verdict pill.
- Chat A/B base-vs-adapter: single live handle sequenced base-to-adapter via
  LiveRun.onComplete; a Stop (RunStatus .failed, no distinct .stopped) now aborts
  phase 2 via a guard on .done; per-pair baseLabel/adapterLabel are snapshotted at
  send time so attribution survives mid-run picker changes.
- Drivers/AppStore/DomainModels: EvalConfig plus runEval (no adapterPath),
  EmbedConfig plus runEmbed, GenConfig.adapterPath maps to --lora, Screen.embed,
  RunKind.eval and .embed, LiveRun.onComplete/perplexities/embed.
- package-app.sh ships all 8 engine binaries (adds eval_perplexity with
  f16,metal-gpu and embed; was silently omitting both).
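
The bf16-sibling rule named above ("exact-match then dash-boundary-prefix longest") is only summarized in the commit, so the following is one possible reading of it rather than the app's implementation (which is Swift). The function name, the `key` parameter, and the idea that candidates are compared by directory name are all assumptions.

```rust
/// Pick a bf16 sibling for `key` from `candidates`.
/// 1. Prefer a candidate equal to `key`.
/// 2. Otherwise take the longest candidate that is a prefix of `key`
///    ending exactly at a '-' boundary.
/// Returns None when nothing qualifies (no guessed fallback).
fn find_bf16_sibling<'a>(key: &str, candidates: &[&'a str]) -> Option<&'a str> {
    if let Some(exact) = candidates.iter().copied().find(|c| *c == key) {
        return Some(exact);
    }
    candidates
        .iter()
        .copied()
        .filter(|c| {
            !c.is_empty()
                && key
                    .strip_prefix(*c)
                    .map_or(false, |rest| rest.starts_with('-'))
        })
        .max_by_key(|c| c.len())
}
```

Under this reading the dash-boundary test stops a partial name such as `qwen3.5-0.8` from matching `qwen3.5-0.8b-q4`, and preferring the longest prefix picks the most specific sibling when several qualify.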

swift build green. Codex-reviewed: REJECT (3 blockers plus a sibling-match bug),
all fixed, then APPROVE with no regressions. Honest-nil verified on real
Qwen3.5-0.8B (embed cosine 0.856 vs 0.415; PPL bf16 13.21 < quarot 16.75 <
q4 17.41; an overfit adapter generates the memorized completion only with --lora).

Known deferral: 9 try-bang static NSRegularExpression patterns in LatticeEvents.swift
(codex non-blocking, prior-reviewed parser) tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…udio

- embed: add --download-only (checksum-verified fetch; emits @@lattice
  download_done) so the app can trigger real model downloads
- chat_metal, generate_lora: parse --top-k/--top-p/--repetition-penalty
  so Chat sampler controls reach the engine
- quantize_quarot/quarot: measure total_bytes_in from on-disk SafeTensors
  spans (not 8x f64 working-copy size); report GB — fixes the QuaRot
  size/ratio the Quantize surface shows
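
The size fix in the last bullet comes down to which quantity is summed. Below is a minimal sketch of the difference, assuming each tensor's header entry exposes a `[begin, end)` byte span (as safetensors `data_offsets` do). The struct, the function names, and the decimal GB convention are hypothetical.

```rust
/// One tensor as recorded in a safetensors-style header: element count and
/// the [begin, end) byte span of its data on disk.
struct TensorSpan {
    numel: u64,
    begin: u64,
    end: u64,
}

/// Input size measured from the on-disk byte spans.
fn total_bytes_on_disk(tensors: &[TensorSpan]) -> u64 {
    tensors.iter().map(|t| t.end.saturating_sub(t.begin)).sum()
}

/// Size of an f64 working copy: 8 bytes per element, whatever the stored dtype.
fn total_bytes_f64_working_copy(tensors: &[TensorSpan]) -> u64 {
    tensors.iter().map(|t| t.numel * 8).sum()
}

/// Decimal gigabytes (assumption; the commit does not say GB vs GiB).
fn bytes_to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

For a bf16 checkpoint (2 bytes per element on disk) the working-copy figure is four times the on-disk one, which skews any compression ratio derived from it by the same factor.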

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sues

Get Models sheet (Models): curated catalog with real one-click download for
7 checksum-verified embeddings; import-only rows for generative + the Qwen
embedding (no fabricated download button); universal NSOpenPanel import with
atomic staging copy + config.json/.safetensors validation.

Eval workspace (EvalScreen, replaces EmbeddingsScreen): CPU/GPU choice,
compare beyond A/B, perplexity surfaces. Chat: sampler params + composer and
style cleanup. Data: inline dataset selection feeding Train (removes
file-first friction). Train/Quantize: honest-nil status surfaces.

.gitignore: exclude runtime artifacts (models/adapters/data) + agent cache.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The surface-B commit (f34ed3b) added GDN LoRA weight-gradient fields to
LoraParams (aliased as Grads) but left the GQA backward branch's Grads
initializer at only a_q/b_q/a_v/b_v, so train_grad_full failed to compile
under --features train-backward (E0063). A GQA layer has no GDN LoRA
factors, so its GDN gradients are empty Vec — mirroring how the GDN branch
already leaves the GQA fields empty. No gradient math changes.
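
The failure mode can be reproduced in miniature. This is a reduced sketch, not the real struct: `a_q/b_q/a_v/b_v` are named in the commit, while the two GDN field names are hypothetical stand-ins for the fields surface-B added.

```rust
/// Reduced stand-in for the gradient struct.
#[derive(Debug, Default)]
struct LoraParams {
    a_q: Vec<f32>,
    b_q: Vec<f32>,
    a_v: Vec<f32>,
    b_v: Vec<f32>,
    a_gdn_qkv: Vec<f32>, // hypothetical field name
    b_gdn_qkv: Vec<f32>, // hypothetical field name
}

type Grads = LoraParams;

/// GQA branch: fill the q/v gradients and leave the GDN ones empty.
fn gqa_grads(a_q: Vec<f32>, b_q: Vec<f32>, a_v: Vec<f32>, b_v: Vec<f32>) -> Grads {
    Grads {
        a_q,
        b_q,
        a_v,
        b_v,
        // Omitting these two lines is what E0063 (missing struct fields) rejects.
        a_gdn_qkv: Vec::new(),
        b_gdn_qkv: Vec::new(),
    }
}
```

Struct-update syntax (`..Default::default()`) would also compile, but listing every field explicitly means the compiler flags the initializer again the next time a field is added.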

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rom header

Two LoRA save/load bugs surfaced by the app-e2e sweep on real qwen3.5-0.8b:

1. train_grad_full --save crashed (InvalidTensorView) on GDN models: the trainer
   inserted q_proj/v_proj LoRA layers for every trained slot, but GDN attention
   slots leave those buffers empty, so save_peft_safetensors built a zero-byte
   tensor with a non-zero shape. The serializer now skips any layer with an empty
   A/B buffer, and the trainer guards the q/v inserts and emits a gdn_skipped
   warning naming how many GDN layers were not persisted (no silent loss).

2. load_peft_safetensors hardcoded alpha = rank and never read the saved alpha,
   so every adapter loaded at scale alpha/rank = 1.0 -- silently applying any
   alpha != rank adapter at the wrong magnitude (half, when alpha = 2 * rank).
   Load now reads alpha from the safetensors header via read_metadata, falling
   back to rank when absent.

Regression tests: test_save_skips_empty_buffer_layers, test_alpha_metadata_round_trips.
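
The effect of the second bug follows from the standard LoRA scaling factor, alpha / rank. Here is a minimal sketch of the corrected resolution logic, assuming the saved alpha arrives as an optional string (safetensors metadata values are strings). Both function names are hypothetical, and falling back to rank on an unparseable value is an assumption beyond what the commit states.

```rust
/// LoRA scaling factor applied to the low-rank update: alpha / rank.
fn lora_scale(alpha: f32, rank: usize) -> f32 {
    alpha / rank as f32
}

/// Resolve alpha from optional header metadata, falling back to rank.
/// `saved_alpha` stands in for whatever the metadata lookup returns.
fn resolve_alpha(saved_alpha: Option<&str>, rank: usize) -> f32 {
    saved_alpha
        .and_then(|s| s.trim().parse::<f32>().ok())
        .unwrap_or(rank as f32)
}
```

With rank 8 and a saved alpha of 16, the resolved scale is 2.0; the old behavior corresponds to always passing `None`, which yields 1.0 and therefore half the intended update.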

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ohdearquant added a commit that referenced this pull request Jun 23, 2026
* feat(macos): Lattice app v1 (Models + Chat)

First release scope of the Lattice macOS app. Ships the two surfaces that
work end to end against the current engine: model browsing (MODELS) and
local chat (CHAT). Sliced from the LoRA training branch (#193).

Deferred and re-addable as the backend matures: Data, Train, Eval and
Embeddings, Quantize, Runs and History, adapter management.

- Screen enum reduced to {models, chat}; nav, command bar, and run routing
  follow the two-screen shape
- ChatScreen: single-mode generate over CPU bf16 and GPU Metal (bf16 + q4),
  honest disk status, sampling and generation controls, retry, tok/s. No
  adapter path is threaded
- ModelsScreen: get, refresh, reveal, and a Chat CTA, plus the full model
  config readout inspector
- Removed DataScreen, TrainScreen, EvalScreen, QuantizeScreen, RunsScreen
- Unused store and model scaffolding kept compiling for incremental revival

swift build is green on a forced recompile. No engine binaries are added;
the app shells out to binaries already on main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(macos): apply the completed v1 slice to tracked files

The initial v1 commit (47dfe07) captured the app at an intermediate state:
the five cut screen files were removed from the tree, but the Screen enum and
the routing in seven dependent files still referenced .data/.train/.eval, so
the committed tree did not compile on its own (ContentView routed to
DataScreen/TrainScreen/EvalScreen, which are absent from the commit).

This commits the fully-sliced worktree versions that swift build validates
green: Screen enum reduced to {models, chat}, with command bar, nav symbols,
run routing, and store scaffolding made consistent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ohdearquant added a commit that referenced this pull request Jun 24, 2026
…ft (#195) (#231)

* feat(macos): Lattice app v1 (Models + Chat)

First release scope of the Lattice macOS app. Ships the two surfaces that
work end to end against the current engine: model browsing (MODELS) and
local chat (CHAT). Sliced from the LoRA training branch (#193).

Deferred and re-addable as the backend matures: Data, Train, Eval and
Embeddings, Quantize, Runs and History, adapter management.

- Screen enum reduced to {models, chat}; nav, command bar, and run routing
  follow the two-screen shape
- ChatScreen: single-mode generate over CPU bf16 and GPU Metal (bf16 + q4),
  honest disk status, sampling and generation controls, retry, tok/s. No
  adapter path is threaded
- ModelsScreen: get, refresh, reveal, and a Chat CTA, plus the full model
  config readout inspector
- Removed DataScreen, TrainScreen, EvalScreen, QuantizeScreen, RunsScreen
- Unused store and model scaffolding kept compiling for incremental revival

swift build is green on a forced recompile. No engine binaries are added;
the app shells out to binaries already on main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(macos): apply the completed v1 slice to tracked files

The initial v1 commit (47dfe07) captured the app at an intermediate state:
the five cut screen files were removed from the tree, but the Screen enum and
the routing in seven dependent files still referenced .data/.train/.eval, so
the committed tree did not compile on its own (ContentView routed to
DataScreen/TrainScreen/EvalScreen, which are absent from the commit).

This commits the fully-sliced worktree versions that swift build validates
green: Screen enum reduced to {models, chat}, with command bar, nav symbols,
run routing, and store scaffolding made consistent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(macos): nil-guard NSRegularExpression init in LatticeEvents.swift (#195)

Convert 9 static 'try! NSRegularExpression(pattern:)' declarations in
HumanLineParser to 'try? NSRegularExpression(pattern:)', yielding
NSRegularExpression? optionals. Every call site is updated from
're.firstMatch(...)' to 're?.firstMatch(...)' so a nil regex degrades
gracefully (returns no match) rather than crashing the app.

The statics' access level is widened from private to internal so the
new test target can reach them via @testable import.

Adds Tests/LatticeStudioTests/LatticeEventsTests.swift with a
testAllStaticPatternsCompile test that asserts every pattern is
non-nil at runtime, plus five parse-level smoke tests that exercise
the converted regexes end-to-end. Updates Package.swift to declare the
new LatticeStudioTests test target.

swift build: Build complete! (11.86s)
swift test:  Executed 6 tests, with 0 failures (0 unexpected)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ohdearquant

Owner Author

PR-health note (autonomous maintenance): this branch has substantially diverged and is now mergeStateStatus=DIRTY. Against current main it shows 20 conflicts, mostly add/add — because the LoRA-backward core this PR introduced has since landed on main through other PRs:

This branch predates all three, so a naive "take this branch's version" would regress them (e.g. its safetensors.rs removes the #212/#217 validation and reverts to alpha = rank). It is therefore not auto-rebaseable — it needs a scope decision.

What remains genuinely unmerged and valuable here:

  1. The macOS Lattice Studio app (apps/macos/*) — main has no apps/macos; this is the bulk of the real unmerged content.
  2. Recent tune fixes on top of the merged backward core (e.g. a91a7c05, 34bf60141).

I've extracted the one isolated correctness fix that would otherwise be buried — the GDN empty-buffer guard in save_peft_safetensors (the InvalidTensorView-on-save fix) — onto current main as #279, regression-tested and preserving #212/#217 + #261.

Recommendation for @ohdearquant: rather than rebase this 25K-line/40-commit branch against a main that already absorbed its training core, re-scope it to just the Studio app (extract apps/macos + the unmerged CLI/eval surfaces into a fresh focused PR off current main). Leaving here for a human scope call — no action taken on the branch itself.

ohdearquant added a commit that referenced this pull request Jun 24, 2026
…N slots) (#279)

GDN-attention slots leave q_proj/v_proj LoRA factors empty when only the GQA
layers are trained. save_peft_safetensors pushed those empty byte buffers with
non-zero shapes [rank, d_in] / [d_out, rank], so TensorView construction failed
with InvalidTensorView(F32, [rank, d_in], 0) and the whole save aborted.

Skip any layer whose A or B factor buffer is empty rather than emit a zero-byte
tensor with a non-zero shape. The real (trained) layers round-trip intact; the
untrained GDN slots are dropped from the saved adapter.
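
The guard itself amounts to a filter applied before any tensor view is built. The following is a minimal sketch, not the code in save_peft_safetensors: the struct and function names are hypothetical, and returning the skipped count is just one way to keep the drop visible to the caller.

```rust
/// Hypothetical stand-in for one LoRA layer's factors as raw bytes.
struct LoraLayerBytes {
    name: String,
    a: Vec<u8>, // factor A, shape [rank, d_in]
    b: Vec<u8>, // factor B, shape [d_out, rank]
}

/// Keep only layers whose A and B buffers both hold data, and report how
/// many were skipped so the caller can warn instead of dropping silently.
fn layers_to_save(layers: &[LoraLayerBytes]) -> (Vec<&LoraLayerBytes>, usize) {
    let kept: Vec<&LoraLayerBytes> = layers
        .iter()
        .filter(|l| !l.a.is_empty() && !l.b.is_empty())
        .collect();
    let skipped = layers.len() - kept.len();
    (kept, skipped)
}
```

Filtering up front means a zero-byte buffer is never paired with a non-zero shape, which is the combination that triggered InvalidTensorView.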

Sliced from the feat/lora-backward-training branch (#193) onto current main, so
it preserves the #212/#217 shape/length validation and the #261 alpha-from-header
behavior that the branch predates (the branch's own safetensors.rs would regress
both).

Regression test test_save_skips_empty_buffer_layers reproduces the failure with
the guard disabled (InvalidTensorView) and passes with it restored. All 25
safetensors tests green under --features safetensors.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ohdearquant

Owner Author

Superseded — recording where the work landed (the auto-close above was a branch-divergence note, not a disposition):

Closed as superseded; see #191, #261, #435, #202.

@ohdearquant ohdearquant deleted the feat/lora-backward-training branch July 1, 2026 17:51