Write-vector editing: component training, LoRA baseline, blog data export#462
Open
ocg-goodfire wants to merge 40 commits into dev
Conversation
Treat SPD components as rank-1 LoRA adapters and train specific V columns (read vectors) and/or U rows (write vectors) on arbitrary losses. Uses gradient hooks for per-column/row masking and snapshotted weight deltas so the model starts from exact target-model behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- spd/editing/compare.py: train_and_compare() for train + eval + per-token diff
- spd/editing/viz.py: render_edit_comparison() HTML heatmap for notebooks
- spd/editing/__init__.py: export new modules
- component_training_summary.md: dense writeup of write-vector editing results for handoff (analytical unembed replacement, blast radius sweep, geometry)
- pyproject.toml: exclude untracked dirs from ruff
…isons
- ComponentTrainer.reset(): restore original params + re-snapshot weight deltas, enabling multiple training runs without model reload
- LoRATrainer: rank-1 LoRA baseline with batched KL regularization, reset() support, and prepare_reg() for pre-batched reg sequences
- Both classes support the same forward_fn interface for measure_blast_radius()
Replace the stateful ComponentTrainer class with two functions:
- write_edit(model, comp_key, u_delta): context manager yielding a forward_fn with the rank-1 delta applied via hook. No snapshots, no gradient masking.
- train_write_delta(model, comp_key, train_seqs): returns a learned U delta tensor. Does not mutate the model.
The delta formulation (output += activation * u_delta) is simpler than the previous snapshot approach and eliminates the "new trainer absorbs edit" footgun.
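The delta formulation above can be sketched numerically: adding (x @ V_col) * u_delta to a layer's output is equivalent to folding a rank-1 update into the weight matrix. A minimal NumPy sketch (names and shapes assumed, not the actual spd code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 4, 3, 5
W = rng.normal(size=(d_in, d_out))    # base linear weight
V_col = rng.normal(size=(d_in,))      # component read vector (one V column)
u_delta = rng.normal(size=(d_out,))   # learned write-vector delta (one U row)
x = rng.normal(size=(batch, d_in))

# Hooked forward: add the rank-1 correction on top of the base output.
out_hooked = x @ W + np.outer(x @ V_col, u_delta)

# Equivalent view: fold the rank-1 delta into the weight matrix itself.
out_folded = x @ (W + np.outer(V_col, u_delta))

assert np.allclose(out_hooked, out_folded)
```

Because the two views agree exactly, the hook-based edit can later be rendered as a weight delta without changing behavior.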
…rgence
- load_model(): replaces EditableModel.from_wandb(), returns (ComponentModel, tok)
- get_ci(): free function, was EditableModel.get_ci()
- get_component_activations(): free function, was EditableModel method
- compare.py now takes ComponentModel directly instead of EditableModel
- Delete generate_token_divergence.py (old static viz script, unused)
- EditableModel kept in _editing.py for legacy use but no longer the primary API
Strip _editing.py to just the functions actually used: parse_component_key, load_model, get_ci, get_component_activations. Deleted: EditableModel, search_interpretations, generate, measure_kl, measure_token_probs, inspect_component, search_by_token_pmi, make_edit_fn, without_components, optimize_circuit, print_circuit, find_components_by_examples, and all supporting types/dataclasses. 826 lines removed. Nothing outside spd/editing/ imported any of this.
These are used in notebooks to find components by autointerp label regex.
- _editing.py: remove get_ci/get_component_activations (inlined in compare.py)
- component_trainer.py: share hook logic via _resolve_hook_args, fix docstring
- compare.py: remove train_and_compare/TrainResult wrapper, export compute_diffs directly. Inline CI/activation computation (3 lines, no helper indirection)
- lora_baseline.py: reg_seqs required in constructor (no prepare_reg, no None states), remove __call__, remove __future__ import, extract _compute_reg_baselines
- __init__.py: only export what exists
python -m spd.editing.export_blog_heatmap --out-dir /path/to/vpd-blog/data
Generates real KL heatmap data: SPD analytical (0 training examples) vs LoRA rank-1 (1000+ examples, λ=10 KL reg). 30 held-out examples each.
python -m spd.editing.generate_pareto_plots --out-dir figures/
Sweeps SPD analytical (α), SPD trained (n=1,4,8,16), and LoRA (n×λ) with an LLM-labeled emoticon/non-emoticon eval split. Outputs 3 PNG plots.
…mples
The previous approach sampled from h.0.mlp.c_fc:100 activation examples, which is biased toward that component's firing distribution. Now loads random sequences from the Pile validation split via the eval dataloader.
load_model now returns config. eval_dataloader(config) builds the dataloader from the run's dataset config instead of hardcoding dataset name/tokenizer/split.
- Both SPD and LoRA now measure KL against the same unedited base model. Previously LoRA used its own post-reset baselines (slightly perturbed).
- Add temperature=0 to haiku emoticon labeling for determinism.
- LLM-verify that VERIFIED_IDXS are still emoticons; assert if the harvest changed.
Previously collect_firings emitted one entry per (example, position) pair, so multi-firing sequences appeared multiple times. Now the unit is ActivationExample — each unique sequence appears once with all its fire positions. Eval/train split is on examples, surrounding KL measured once per example excluding all fire positions.
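The example-per-sequence grouping can be illustrated with a small self-contained sketch; ActivationExample and collect_examples here are hypothetical stand-ins for the real code:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class ActivationExample:
    seq_id: int
    fire_positions: list = field(default_factory=list)

def collect_examples(firings):
    """Group (seq_id, position) firings into one record per unique sequence."""
    by_seq = defaultdict(list)
    for seq_id, pos in firings:
        by_seq[seq_id].append(pos)
    return [ActivationExample(s, sorted(p)) for s, p in sorted(by_seq.items())]

firings = [(0, 3), (1, 2), (0, 7), (2, 5)]
examples = collect_examples(firings)

# Sequence 0 fired twice but appears once, carrying both positions.
assert len(examples) == 3
assert examples[0].fire_positions == [3, 7]
```

Splitting eval/train on these records (rather than on raw firings) guarantees a multi-firing sequence cannot land on both sides of the split.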
… pool
SPD trained and LoRA now both draw from train_pool[:n_ex] without hand-curated emoticon filtering. Removes the asymmetry where SPD got curated examples while LoRA got unfiltered data at large n.
The [32:82] split was a leftover from the cherry-picked VERIFIED_IDXS era. Now: eval = examples[:N], train = examples[N:]. Simple, no gaps.
Previously KL regularization ran on separate random Pile sequences, so LoRA wasn't penalized for bleeding into surrounding tokens of the examples it trained on. Now: CE at fire positions + KL at all other positions in the same forward pass. No separate reg sequences. LoRATrainer takes train_seqs at init and caches baselines per sequence.
Baselines are now list[Tensor] in the same order as the token list. No more fragile id(tensor) keying.
Baselines were computed with the LoRA hook active, meaning KL reg targeted "match LoRA at random init", not "match clean model". Now:
- base_probs cached before hook installation (clean model outputs)
- reset() removes the hook, re-caches, re-installs
- base_probs is list[Tensor] indexed by position (not an id()-keyed dict)
- Remove default lr/n_steps from train_write_delta and LoRATrainer (callers always pass explicitly)
- Assert hook not installed before computing baselines
- Remove n_ce/n_kl counters (always equal to len(idxs))
- Remove max(n_kl, 1) defensive guard
- Remove _forward_raw indirection (forward is sufficient)
- Clean up docstrings (no narrativised comments)
ComponentModel.__call__ with no mask_infos already does a plain target model forward + output extraction. No need for the private method or base_forward wrappers.
…pers
- generate_pareto_plots: exports pareto_data.json with all point data + meta
- export_blog_heatmap: reuse get_examples/get_probs/make_train_seqs from generate_pareto_plots instead of duplicating. Cleaner metadata in output.
- Remove eval_dataloader (unused indirection)
Single 2-panel figure (surrounding KL + global KL) with annotated points, matching the notebook exploration. Replaces the multi-n grid.
Catches the bug where a LoRA hook on the same linear contaminates SPD measurements. write_edit and train_write_delta now assert the target linear has no forward hooks before installing theirs.
The old write_edit added (x @ V_col) * u_delta on top of the existing layer output — the original U row was still active, so the edit was additive rather than a replacement. Now write_edit directly sets U[u_idx] = new_u and restores on exit. train_write_vector (renamed from train_write_delta) optimizes U[u_idx] directly via gradient descent, with optional KL reg on non-fire tokens. LoRA keeps hooks (consistent with how LoRA actually works).
try/finally ensures comp.U[u_idx] and comp.U.requires_grad are restored even if training is interrupted.
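The restore-on-exit pattern can be sketched with a toy NumPy write_edit (names assumed; the real version operates on a ComponentModel):

```python
from contextlib import contextmanager
import numpy as np

@contextmanager
def write_edit(U, u_idx, new_u):
    """Swap in one write vector (U row) for the duration of the block."""
    old_u = U[u_idx].copy()
    U[u_idx] = new_u
    try:
        yield
    finally:
        U[u_idx] = old_u  # restored even if the body raises

U = np.ones((4, 3))
with write_edit(U, 2, np.zeros(3)):
    assert U[2].sum() == 0.0  # edit active inside the block
assert U[2].sum() == 3.0      # original row restored on exit
```

The finally clause is what makes interrupted training runs safe: a KeyboardInterrupt mid-block still unwinds through the restore.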
comp.U[u_idx].copy_(new_u) fails when U requires grad. Use comp.U.data[u_idx] = new_u instead.
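The failure mode and workaround in isolation (a self-contained sketch; shapes are arbitrary):

```python
import torch

# U is a trainable parameter, as during write-vector training.
U = torch.randn(8, 4, requires_grad=True)
new_u = torch.zeros(4)

failed = False
try:
    U[0].copy_(new_u)  # in-place write to a view of a leaf that requires grad
except RuntimeError:
    failed = True
assert failed

U.data[0] = new_u  # .data bypasses autograd tracking, so the swap succeeds
assert torch.equal(U.data[0], new_u)
```

Writing through .data is safe here precisely because the swap is a deliberate, untracked parameter edit, not part of a gradient computation.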
write_edit: patches the target linear's weight matrix with the rank-1 delta V[:,c] ⊗ (new_u - old_u), since model(tokens) goes through target_model weights, not component U/V directly.
train_write_vector: uses a forward hook adding V_col ⊗ (u_param - old_u) so gradients flow through u_param. Hook is temporary (try/finally).
u_replaced directly swaps comp.U.data[u_idx] and runs through the component path (all-ones masks + weight delta). Clean and correct. train_write_vector removed — needs more thought on how to correctly route gradients through the component path during training.
The component path computes V^T @ U + weight_delta. If weight_delta is computed after changing U, the delta absorbs the edit (cancels out). Now: snapshot weight_deltas before swapping U, freeze them, use in every forward call.
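The absorption bug is easy to demonstrate with NumPy (shapes and names assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
C, d_in, d_out = 6, 4, 3
W = rng.normal(size=(d_in, d_out))   # target-model weight
V = rng.normal(size=(d_in, C))       # read vectors (columns)
U = rng.normal(size=(C, d_out))      # write vectors (rows)

delta = W - V @ U                    # weight delta snapshotted BEFORE the edit
U_edit = U.copy()
U_edit[2] += rng.normal(size=d_out)  # edit one write vector

# Correct: use the frozen snapshot, so the forward pass reflects the edit.
edited = V @ U_edit + delta
assert not np.allclose(edited, W)

# Bug: recomputing the delta after the edit makes it absorb (cancel) the edit.
stale = W - V @ U_edit
assert np.allclose(V @ U_edit + stale, W)
```

Algebraically, a recomputed delta W - V @ U_edit added back to V @ U_edit always reconstructs W exactly, so the edit has no effect on the forward pass.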
LoRATrainer is now just A/B params + hook + optimizer. No train_seqs, no base_probs, no reset(). Caller caches baselines before creating LoRA, passes (batch, baseline) per step. Context manager removes hook.
Accumulates CE + KL across the batch, single optimizer step. Callers sample mini-batches of 8 without replacement.
Pad sequences to max_len, single forward pass [B, max_len, vocab], masked CE at fire positions, KL at non-fire non-pad positions.
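The mask construction can be sketched as follows (PAD id, sequences, and fire positions are illustrative):

```python
import numpy as np

PAD = 0
seqs = [[5, 9, 7], [3, 8], [4, 6, 2, 1]]
fire_positions = [[1], [0], [2, 3]]  # target positions per sequence

max_len = max(len(s) for s in seqs)
batch = np.full((len(seqs), max_len), PAD)
valid_mask = np.zeros((len(seqs), max_len), dtype=bool)
fire_mask = np.zeros((len(seqs), max_len), dtype=bool)
for i, (s, fires) in enumerate(zip(seqs, fire_positions)):
    batch[i, : len(s)] = s
    valid_mask[i, : len(s)] = True  # length-based, so a real token id 0 is safe
    fire_mask[i, fires] = True

# CE is taken where fire_mask is True; KL at every other non-pad position.
kl_mask = valid_mask & ~fire_mask

assert fire_mask.sum() == 4
assert kl_mask.sum() == 5
```

A single padded forward pass then yields [B, max_len, vocab] logits, and the two masks select the CE and KL terms from the same tensor.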
u_replaced now computes ΔW = outer(new_u - old_u, V[:, idx]) and applies it as a forward hook on the target linear layer, instead of going through the component path with frozen weight deltas and all-ones masks. Both SPD and LoRA edits now follow the same pattern: render a weight delta, hook it onto the base model.
Also: LoRA train_step takes pre-padded tensor args instead of nested Python structures, fix vocab dimension bug, update all callers for new API.
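A self-contained PyTorch sketch of the rank-1 weight-delta hook pattern (toy shapes; not the actual u_replaced code):

```python
import torch
from torch import nn

torch.manual_seed(0)
d_in, d_out, C = 4, 3, 6
linear = nn.Linear(d_in, d_out, bias=False)
V = torch.randn(d_in, C)   # read vectors (columns)
U = torch.randn(C, d_out)  # write vectors (rows)

idx, new_u = 2, torch.zeros(d_out)
# Rank-1 weight delta from replacing one write vector.
dW = torch.outer(new_u - U[idx], V[:, idx])  # [d_out, d_in], same as linear.weight

def add_delta(module, inputs, output):
    # Equivalent to running the layer with weight + dW.
    return output + inputs[0] @ dW.T

x = torch.randn(5, d_in)
base = linear(x)
handle = linear.register_forward_hook(add_delta)
edited = linear(x)
handle.remove()

assert torch.allclose(edited, base + x @ dW.T)
```

Rendering both SPD and LoRA edits as weight-delta hooks keeps the comparison apples-to-apples: the base model's code path is identical in both conditions.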
- Add pad_train_seqs() to generate_pareto_plots.py, replacing inline padding in both generate_pareto_plots.py and export_blog_heatmap.py
- Use kl_per_token() in export_blog_heatmap.py instead of inline KL
- export_blog_heatmap.py now pre-pads once instead of allocating per step
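A per-token KL helper along these lines might look like the following (a sketch; the real kl_per_token presumably operates on torch tensors or logits):

```python
import numpy as np

def kl_per_token(p, q):
    """KL(p || q) at each position; p, q are [seq, vocab] probability arrays."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

p = np.array([[0.7, 0.3], [0.5, 0.5]])  # clean-model token distributions
q = np.array([[0.7, 0.3], [0.4, 0.6]])  # edited-model token distributions

kl = kl_per_token(p, q)
assert kl.shape == (2,)
assert abs(kl[0]) < 1e-12  # identical distributions give zero KL
assert kl[1] > 0           # any divergence gives positive KL
```

Centralizing this in one helper keeps the heatmap export and the Pareto sweep measuring divergence identically.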
- run_pareto_export.py: notebook script for SPD vs LoRA Pareto sweep + export
- figures/editing/: pareto plots (png+pdf), KL histograms, heatmap JSON data
- Fix mkdir parents=True in export_blog_heatmap.py
Description
Write-vector editing toolkit for the paper's editing section. Train or analytically set a component's write vector (U row) to redirect its output, trading off edit locality against on-target accuracy along a Pareto frontier.
New modules
- spd/editing/component_trainer.py: write_edit() context manager + train_write_delta(). Functional API, no stateful classes. Delta formulation: output += (x @ V_col) * U_delta.
- spd/editing/lora_baseline.py: LoRATrainer rank-1 LoRA baseline with batched KL regularization and reset() for sweeps without model reload.
- spd/editing/compare.py: train_and_compare() for per-token before/after diff collection.
- spd/editing/viz.py: interactive HTML heatmap renderer for notebooks.

Refactored
- Removed the EditableModel class and 800+ lines of unused editing code
- load_model(), get_ci(), get_component_activations(), search_interpretations() as free functions
- compare.py takes ComponentModel directly

Key results (emoticon → 'o' on Jose s-55ea3f9b)
- Analytical SPD (U = -3·unembed('o')/|emb|, 0 training examples): 86% P('o'), surrounding KL 0.013

Blog data
Real KL heatmap data exported to vpd-blog/data/training-heatmap{,-lora}.json (30 examples each, analytical SPD vs LoRA n=1058, λ=10).

Motivation and Context
Paper editing section: demonstrating that SPD components enable targeted, localized model edits that LoRA cannot match.
How Has This Been Tested?
Interactive notebook experiments with held-out evaluation. Pareto sweeps over n_examples × λ × U-scale. Convergence verified (CE + KL loss curves). All results in notebooks/2026-03-25-*.ipynb.

Does this PR introduce a breaking change?
Yes: EditableModel is deleted. No external consumers exist (checked via grep). Old notebooks that imported it will need updating, but they're .gitignored.

🤖 Generated with Claude Code