Write-vector editing: component training, LoRA baseline, blog data export#462

Open
ocg-goodfire wants to merge 40 commits into dev from paper/editing-section

Conversation

@ocg-goodfire
Collaborator

Description

Write-vector editing toolkit for the paper's editing section. Train or analytically set a component's write vector (U row) to redirect its output, achieving a Pareto-optimal trade-off between locality and on-target accuracy.

New modules

  • spd/editing/component_trainer.py — write_edit() context manager + train_write_delta(). Functional API, no stateful classes. Delta formulation: output += (x @ V_col) * U_delta.
  • spd/editing/lora_baseline.py — LoRATrainer, a rank-1 LoRA baseline with batched KL regularization and reset() for sweeps without model reload.
  • spd/editing/compare.py — train_and_compare() for per-token before/after diff collection.
  • spd/editing/viz.py — Interactive HTML heatmap renderer for notebooks.
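The delta formulation above can be sketched in plain NumPy. This is an illustrative sketch, not the spd API: names like `V_col`, `U_delta`, and the weight convention (`x @ W`, with `W` of shape `(d_in, d_out)`) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.normal(size=(d_in, d_out))       # base layer weight
V_col = rng.normal(size=(d_in,))         # component read vector (one V column)
U_delta = rng.normal(size=(d_out,))      # learned write-vector delta (one U row)

def base_forward(x):
    return x @ W

def edited_forward(x):
    # rank-1 additive edit: activation on the read direction, times the delta
    return base_forward(x) + np.outer(x @ V_col, U_delta)

# Equivalent weight-space view: patch W with the rank-1 outer product.
W_patched = W + np.outer(V_col, U_delta)

x = rng.normal(size=(2, d_in))
assert np.allclose(edited_forward(x), x @ W_patched)
```

The hook form (add to the output) and the weight-patch form (add `outer(V_col, U_delta)` to `W`) are mathematically identical, which is why later commits can move between the two freely.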

Refactored

  • Deleted EditableModel class and 800+ lines of unused editing code
  • Extracted load_model(), get_ci(), get_component_activations(), search_interpretations() as free functions
  • compare.py takes ComponentModel directly

Key results (emoticon → 'o' on Jose s-55ea3f9b)

  • Analytical replacement (U = -3·unembed('o')/|emb|, 0 training examples): 86% P('o'), surr KL 0.013
  • SPD trained (n=8, 100 steps): 96% P('o'), surr KL 0.015
  • LoRA rank-1 (n=1058, λ=10 KL reg, 300 steps): 100% P('o'), surr KL 0.30
  • SPD dominates the Pareto frontier on surrounding KL at every P('o') level
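The analytical replacement above needs no training at all; it can be sketched as follows. The scale -3 and the normalization mirror the formula in the results; `W_U` and `tok_id` are illustrative stand-ins, not the real model objects.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 8, 50
W_U = rng.normal(size=(d_model, vocab))   # stand-in unembedding matrix
tok_id = 7                                # stand-in for the 'o' token id

# U = -3 * unembed('o') / |emb|: the write vector becomes a scaled,
# normalized copy of the target token's unembedding direction.
emb = W_U[:, tok_id]
new_u = -3.0 * emb / np.linalg.norm(emb)

assert np.isclose(np.linalg.norm(new_u), 3.0)
```

Because `new_u` is computed in closed form, this point on the Pareto plot uses zero training examples.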

Blog data

Real KL heatmap data exported to vpd-blog/data/training-heatmap{,-lora}.json (30 examples each, analytical SPD vs LoRA n=1058 λ=10).

Motivation and Context

Paper editing section: demonstrating that SPD components enable targeted, localized model edits that LoRA cannot match.

How Has This Been Tested?

Interactive notebook experiments with held-out evaluation. Pareto sweeps over n_examples × λ × U-scale. Convergence verified (CE + KL loss curves). All results in notebooks/2026-03-25-*.ipynb.

Does this PR introduce a breaking change?

Yes — EditableModel is deleted. No external consumers exist (checked via grep). Old notebooks that imported it will need updating but they're .gitignored.

🤖 Generated with Claude Code

ocg-goodfire and others added 7 commits March 25, 2026 10:39
Treat SPD components as rank-1 LoRA adapters and train specific V columns
(read vectors) and/or U rows (write vectors) on arbitrary losses. Uses
gradient hooks for per-column/row masking and snapshotted weight deltas
so the model starts from exact target-model behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- spd/editing/compare.py: train_and_compare() for train + eval + per-token diff
- spd/editing/viz.py: render_edit_comparison() HTML heatmap for notebooks
- spd/editing/__init__.py: export new modules
- component_training_summary.md: dense writeup of write-vector editing results
  for handoff (analytical unembed replacement, blast radius sweep, geometry)
- pyproject.toml: exclude untracked dirs from ruff

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…isons

- ComponentTrainer.reset(): restore original params + re-snapshot weight deltas,
  enabling multiple training runs without model reload
- LoRATrainer: rank-1 LoRA baseline with batched KL regularization, reset()
  support, and prepare_reg() for pre-batched reg sequences
- Both classes support the same forward_fn interface for measure_blast_radius()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the stateful ComponentTrainer class with two functions:
- write_edit(model, comp_key, u_delta): context manager yielding a forward_fn
  with the rank-1 delta applied via hook. No snapshots, no gradient masking.
- train_write_delta(model, comp_key, train_seqs): returns a learned U delta
  tensor. Does not mutate the model.

The delta formulation (output += activation * u_delta) is simpler than the
previous snapshot approach and eliminates the "new trainer absorbs edit" footgun.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rgence

- load_model(): replaces EditableModel.from_wandb(), returns (ComponentModel, tok)
- get_ci(): free function, was EditableModel.get_ci()
- get_component_activations(): free function, was EditableModel method
- compare.py now takes ComponentModel directly instead of EditableModel
- Delete generate_token_divergence.py (old static viz script, unused)
- EditableModel kept in _editing.py for legacy but no longer the primary API

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strip _editing.py to just the functions actually used:
parse_component_key, load_model, get_ci, and get_component_activations.

Deleted: EditableModel, search_interpretations, generate, measure_kl,
measure_token_probs, inspect_component, search_by_token_pmi,
make_edit_fn, without_components, optimize_circuit, print_circuit,
find_components_by_examples, and all supporting types/dataclasses.

826 lines removed. Nothing outside spd/editing/ imported any of this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These are used in notebooks to find components by autointerp label regex.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ocg-goodfire ocg-goodfire changed the base branch from main to dev March 25, 2026 21:04
ocg-goodfire and others added 22 commits March 25, 2026 21:10
- _editing.py: remove get_ci/get_component_activations (inlined in compare.py)
- component_trainer.py: share hook logic via _resolve_hook_args, fix docstring
- compare.py: remove train_and_compare/TrainResult wrapper, export compute_diffs
  directly. Inline CI/activation computation (3 lines, no helper indirection)
- lora_baseline.py: reg_seqs required in constructor (no prepare_reg, no None states),
  remove __call__, remove __future__ import, extract _compute_reg_baselines
- __init__.py: only export what exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
python -m spd.editing.export_blog_heatmap --out-dir /path/to/vpd-blog/data

Generates real KL heatmap data: SPD analytical (0 training examples)
vs LoRA rank-1 (1000+ examples, λ=10 KL reg). 30 held-out examples each.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
python -m spd.editing.generate_pareto_plots --out-dir figures/

Sweeps SPD analytical (α), SPD trained (n=1,4,8,16), and LoRA (n×λ)
with LLM-labeled emoticon/non-emoticon eval split. Outputs 3 PNG plots.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mples

The previous approach sampled from h.0.mlp.c_fc:100 activation examples,
which is biased toward that component's firing distribution. Now loads
random sequences from the Pile validation split via the eval dataloader.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
load_model now returns config. eval_dataloader(config) builds the
dataloader from the run's dataset config instead of hardcoding
dataset name/tokenizer/split.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Both SPD and LoRA now measure KL against the same unedited base model.
  Previously LoRA used its own post-reset baselines (slightly perturbed).
- Add temperature=0 to haiku emoticon labeling for determinism.
- LLM-verify VERIFIED_IDXS are still emoticons; assert if harvest changed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously collect_firings emitted one entry per (example, position) pair,
so multi-firing sequences appeared multiple times. Now the unit is
ActivationExample — each unique sequence appears once with all its fire
positions. Eval/train split is on examples, surrounding KL measured once
per example excluding all fire positions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
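The regrouping this commit describes can be sketched as follows. The dataclass name matches the commit; the fields and the `collect_firings` signature are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ActivationExample:
    tokens: tuple                                   # one unique sequence
    fire_positions: list = field(default_factory=list)

def collect_firings(pairs):
    """pairs: iterable of (token_tuple, position). One entry per sequence."""
    by_seq = {}
    for tokens, pos in pairs:
        ex = by_seq.setdefault(tokens, ActivationExample(tokens))
        ex.fire_positions.append(pos)
    return list(by_seq.values())

pairs = [(("a", "b"), 0), (("a", "b"), 1), (("c",), 0)]
examples = collect_firings(pairs)
assert len(examples) == 2                # multi-firing sequence appears once
assert examples[0].fire_positions == [0, 1]
```

Splitting train/eval on these examples (rather than on (example, position) pairs) is what prevents the same sequence from leaking into both sides.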
… pool

SPD trained and LoRA now both draw from train_pool[:n_ex] without
hand-curated emoticon filtering. Removes the asymmetry where SPD got
curated examples while LoRA got unfiltered data at large n.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The [32:82] split was a leftover from the cherry-picked VERIFIED_IDXS era.
Now: eval = examples[:N], train = examples[N:]. Simple, no gaps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously KL regularization ran on separate random Pile sequences,
so LoRA wasn't penalized for bleeding into surrounding tokens of the
examples it trained on. Now: CE at fire positions + KL at all other
positions in the same forward pass. No separate reg sequences.

LoRATrainer takes train_seqs at init and caches baselines per sequence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Baselines are now list[Tensor] in the same order as the token list.
No more fragile id(tensor) keying.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Baselines were computed with the LoRA hook active, meaning KL reg
targeted "match LoRA at random init" not "match clean model". Now:
- base_probs cached before hook installation (clean model outputs)
- reset() removes hook, re-caches, re-installs
- base_probs is list[Tensor] indexed by position (not id()-keyed dict)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove default lr/n_steps from train_write_delta and LoRATrainer
  (callers always pass explicitly)
- Assert hook not installed before computing baselines
- Remove n_ce/n_kl counters (always equal len(idxs))
- Remove max(n_kl, 1) defensive guard
- Remove _forward_raw indirection (forward is sufficient)
- Clean up docstrings (no narrativised comments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ComponentModel.__call__ with no mask_infos already does a plain
target model forward + output extraction. No need for the private
method or base_forward wrappers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pers

- generate_pareto_plots: exports pareto_data.json with all point data + meta
- export_blog_heatmap: reuse get_examples/get_probs/make_train_seqs from
  generate_pareto_plots instead of duplicating. Cleaner metadata in output.
- Remove eval_dataloader (unused indirection)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single 2-panel figure (surrounding KL + global KL) with annotated points,
matching the notebook exploration. Replaces the multi-n grid.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Catches the bug where LoRA hook on the same linear contaminates SPD
measurements. write_edit and train_write_delta now assert the target
linear has no forward hooks before installing theirs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old write_edit added (x @ V_col) * u_delta on top of the existing
layer output — the original U row was still active, so the edit was
additive rather than a replacement. Now write_edit directly sets
U[u_idx] = new_u and restores on exit.

train_write_vector (renamed from train_write_delta) optimizes U[u_idx]
directly via gradient descent, with optional KL reg on non-fire tokens.

LoRA keeps hooks (consistent with how LoRA actually works).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
try/finally ensures comp.U[u_idx] and comp.U.requires_grad are
restored even if training is interrupted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ocg-goodfire and others added 11 commits March 26, 2026 15:19
comp.U[u_idx].copy_(new_u) fails when U requires grad. Use
comp.U.data[u_idx] = new_u instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
write_edit: patches target linear's weight matrix with the rank-1
delta V[:,c] ⊗ (new_u - old_u), since model(tokens) goes through
target_model weights, not component U/V directly.

train_write_vector: uses a forward hook adding V_col ⊗ (u_param - old_u)
so gradients flow through u_param. Hook is temporary (try/finally).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
u_replaced directly swaps comp.U.data[u_idx] and runs through the
component path (all-ones masks + weight delta). Clean and correct.

train_write_vector removed — needs more thought on how to correctly
route gradients through the component path during training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The component path computes V^T @ U + weight_delta. If weight_delta
is computed after changing U, the delta absorbs the edit (cancels out).
Now: snapshot weight_deltas before swapping U, freeze them, use in
every forward call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LoRATrainer is now just A/B params + hook + optimizer. No train_seqs,
no base_probs, no reset(). Caller caches baselines before creating
LoRA, passes (batch, baseline) per step. Context manager removes hook.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
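The stripped-down rank-1 adapter this commit describes can be sketched in NumPy. Names (`A`, `B`, `lora_forward`) and the zero-init convention are illustrative; the real LoRATrainer applies this via a forward hook.

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_out = 6, 4
W = rng.normal(size=(d_in, d_out))     # frozen base weight
A = rng.normal(size=(d_in,)) * 0.01    # LoRA "down" vector (trainable)
B = np.zeros(d_out)                    # LoRA "up" vector, zero-init (trainable)

def lora_forward(x):
    # rank-1 adapter: y = x @ W + (x @ A) * B
    return x @ W + np.outer(x @ A, B)

x = rng.normal(size=(3, d_in))
# zero-init B means the adapter starts as an exact no-op on the base model
assert np.allclose(lora_forward(x), x @ W)
```

Caching baselines before the hook is installed matters precisely because the adapter is only a no-op at init if `B` is zero; once training starts, a baseline captured through the hook would be contaminated.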
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Accumulates CE + KL across the batch, single optimizer step.
Callers sample mini-batches of 8 without replacement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pad sequences to max_len, single forward pass [B, max_len, vocab],
masked CE at fire positions, KL at non-fire non-pad positions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
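The masked loss over a padded batch can be sketched as below: CE at fire positions, KL against cached clean-model probabilities at non-fire, non-pad positions. All shapes and names are illustrative, and λ=10 echoes the sweep value from the description.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
B, L, V = 2, 5, 7
logits = rng.normal(size=(B, L, V))                 # edited model, one forward
base_probs = softmax(rng.normal(size=(B, L, V)))    # clean model, cached
targets = rng.integers(0, V, size=(B, L))
fire = np.zeros((B, L), bool); fire[:, 2] = True    # edit (CE) positions
pad = np.zeros((B, L), bool); pad[1, 4] = True      # padding positions

logp = np.log(softmax(logits))
ce = -logp[fire, targets[fire]].mean()              # on-target CE
kl_mask = ~fire & ~pad                              # everywhere else
p = base_probs[kl_mask]
kl = (p * (np.log(p) - logp[kl_mask])).sum(-1).mean()
loss = ce + 10.0 * kl                               # lambda = 10
```

Running CE and KL on the same padded forward pass is what makes the edit pay for bleeding into the surrounding tokens of its own training sequences.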
u_replaced now computes ΔW = outer(new_u - old_u, V[:, idx]) and applies it as
a forward hook on the target linear layer, instead of going through the component
path with frozen weight deltas and all-ones masks. Both SPD and LoRA edits now
follow the same pattern: render a weight delta, hook it onto the base model.

Also: LoRA train_step takes pre-padded tensor args instead of nested Python
structures, fix vocab dimension bug, update all callers for new API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
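The final replace-as-weight-delta pattern can be verified in a few lines of NumPy. Note the convention: with `x @ W` and `W` of shape `(d_in, d_out)`, the delta is `outer(V[:, idx], new_u - old_u)`; the commit's `outer(new_u - old_u, V[:, idx])` is the same delta in the transposed `W @ x` convention. Names are illustrative, not the spd API.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, C = 6, 5, 3
V = rng.normal(size=(d_in, C))            # read vectors (columns)
U = rng.normal(size=(C, d_out))           # write vectors (rows)
idx = 1
old_u = U[idx].copy()
new_u = rng.normal(size=(d_out,))

# Replacement, not addition: the delta subtracts the old row's contribution.
dW = np.outer(V[:, idx], new_u - old_u)

x = rng.normal(size=(4, d_in))
W_component = V @ U                       # component-path weight
edited = x @ (W_component + dW)

# Equivalent to swapping the U row directly:
U_swapped = U.copy()
U_swapped[idx] = new_u
assert np.allclose(edited, x @ (V @ U_swapped))
```

Rendering the edit as a single weight delta hooked onto the base model is what lets SPD and LoRA edits share one measurement path.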
- Add pad_train_seqs() to generate_pareto_plots.py, replacing inline
  padding in both generate_pareto_plots.py and export_blog_heatmap.py
- Use kl_per_token() in export_blog_heatmap.py instead of inline KL
- export_blog_heatmap.py now pre-pads once instead of allocating per step

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- run_pareto_export.py: notebook script for SPD vs LoRA Pareto sweep + export
- figures/editing/: pareto plots (png+pdf), KL histograms, heatmap JSON data
- Fix mkdir parents=True in export_blog_heatmap.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>