Write-vector editing: component training, LoRA baseline, blog data export#462

Open
ocg-goodfire wants to merge 40 commits into dev from paper/editing-section

Conversation

@ocg-goodfire
Collaborator

Description

Write-vector editing toolkit for the paper's editing section. Train or analytically set a component's write vector (U row) to redirect its output, achieving a Pareto-optimal trade-off between locality and on-target accuracy.

New modules

  • spd/editing/component_trainer.py — write_edit() context manager + train_write_delta(). Functional API, no stateful classes. Delta formulation: output += (x @ V_col) * U_delta.
  • spd/editing/lora_baseline.py — LoRATrainer, a rank-1 LoRA baseline with batched KL regularization and reset() for sweeps without model reload.
  • spd/editing/compare.py — train_and_compare() for per-token before/after diff collection.
  • spd/editing/viz.py — Interactive HTML heatmap renderer for notebooks.
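The delta formulation above can be sketched in plain NumPy. This is an illustrative sketch, not the spd API: names like `V_col`, `U_delta`, and the weight convention (`x @ W`, with `W` of shape `(d_in, d_out)`) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.normal(size=(d_in, d_out))       # base layer weight
V_col = rng.normal(size=(d_in,))         # component read vector (one V column)
U_delta = rng.normal(size=(d_out,))      # learned write-vector delta (one U row)

def base_forward(x):
    return x @ W

def edited_forward(x):
    # rank-1 additive edit: activation on the read direction, times the delta
    return base_forward(x) + np.outer(x @ V_col, U_delta)

# Equivalent weight-space view: patch W with the rank-1 outer product.
W_patched = W + np.outer(V_col, U_delta)

x = rng.normal(size=(2, d_in))
assert np.allclose(edited_forward(x), x @ W_patched)
```

The hook form (add to the output) and the weight-patch form (add `outer(V_col, U_delta)` to `W`) are mathematically identical, which is why later commits can move between the two freely.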

Refactored

  • Deleted EditableModel class and 800+ lines of unused editing code
  • Extracted load_model(), get_ci(), get_component_activations(), search_interpretations() as free functions
  • compare.py takes ComponentModel directly

Key results (emoticon → 'o' on Jose s-55ea3f9b)

  • Analytical replacement (U = -3·unembed('o')/|emb|, 0 training examples): 86% P('o'), surr KL 0.013
  • SPD trained (n=8, 100 steps): 96% P('o'), surr KL 0.015
  • LoRA rank-1 (n=1058, λ=10 KL reg, 300 steps): 100% P('o'), surr KL 0.30
  • SPD dominates the Pareto frontier on surrounding KL at every P('o') level
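The analytical replacement above needs no training at all; it can be sketched as follows. The scale -3 and the normalization mirror the formula in the results; `W_U` and `tok_id` are illustrative stand-ins, not the real model objects.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 8, 50
W_U = rng.normal(size=(d_model, vocab))   # stand-in unembedding matrix
tok_id = 7                                # stand-in for the 'o' token id

# U = -3 * unembed('o') / |emb|: the write vector becomes a scaled,
# normalized copy of the target token's unembedding direction.
emb = W_U[:, tok_id]
new_u = -3.0 * emb / np.linalg.norm(emb)

assert np.isclose(np.linalg.norm(new_u), 3.0)
```

Because `new_u` is computed in closed form, this point on the Pareto plot uses zero training examples.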

Blog data

Real KL heatmap data exported to vpd-blog/data/training-heatmap{,-lora}.json (30 examples each, analytical SPD vs LoRA n=1058 λ=10).

Motivation and Context

Paper editing section: demonstrating that SPD components enable targeted, localized model edits that LoRA cannot match.

How Has This Been Tested?

Interactive notebook experiments with held-out evaluation. Pareto sweeps over n_examples × λ × U-scale. Convergence verified (CE + KL loss curves). All results in notebooks/2026-03-25-*.ipynb.

Does this PR introduce a breaking change?

Yes — EditableModel is deleted. No external consumers exist (checked via grep). Old notebooks that imported it will need updating but they're .gitignored.

🤖 Generated with Claude Code

ocg-goodfire and others added 7 commits March 25, 2026 10:39
Treat SPD components as rank-1 LoRA adapters and train specific V columns
(read vectors) and/or U rows (write vectors) on arbitrary losses. Uses
gradient hooks for per-column/row masking and snapshotted weight deltas
so the model starts from exact target-model behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- spd/editing/compare.py: train_and_compare() for train + eval + per-token diff
- spd/editing/viz.py: render_edit_comparison() HTML heatmap for notebooks
- spd/editing/__init__.py: export new modules
- component_training_summary.md: dense writeup of write-vector editing results
  for handoff (analytical unembed replacement, blast radius sweep, geometry)
- pyproject.toml: exclude untracked dirs from ruff

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…isons

- ComponentTrainer.reset(): restore original params + re-snapshot weight deltas,
  enabling multiple training runs without model reload
- LoRATrainer: rank-1 LoRA baseline with batched KL regularization, reset()
  support, and prepare_reg() for pre-batched reg sequences
- Both classes support the same forward_fn interface for measure_blast_radius()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the stateful ComponentTrainer class with two functions:
- write_edit(model, comp_key, u_delta): context manager yielding a forward_fn
  with the rank-1 delta applied via hook. No snapshots, no gradient masking.
- train_write_delta(model, comp_key, train_seqs): returns a learned U delta
  tensor. Does not mutate the model.

The delta formulation (output += activation * u_delta) is simpler than the
previous snapshot approach and eliminates the "new trainer absorbs edit" footgun.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rgence

- load_model(): replaces EditableModel.from_wandb(), returns (ComponentModel, tok)
- get_ci(): free function, was EditableModel.get_ci()
- get_component_activations(): free function, was EditableModel method
- compare.py now takes ComponentModel directly instead of EditableModel
- Delete generate_token_divergence.py (old static viz script, unused)
- EditableModel kept in _editing.py for legacy but no longer the primary API

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strip _editing.py to just the functions actually used:
parse_component_key, load_model, get_ci, and get_component_activations.

Deleted: EditableModel, search_interpretations, generate, measure_kl,
measure_token_probs, inspect_component, search_by_token_pmi,
make_edit_fn, without_components, optimize_circuit, print_circuit,
find_components_by_examples, and all supporting types/dataclasses.

826 lines removed. Nothing outside spd/editing/ imported any of this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These are used in notebooks to find components by autointerp label regex.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ocg-goodfire ocg-goodfire changed the base branch from main to dev March 25, 2026 21:04
ocg-goodfire and others added 22 commits March 25, 2026 21:10
- _editing.py: remove get_ci/get_component_activations (inlined in compare.py)
- component_trainer.py: share hook logic via _resolve_hook_args, fix docstring
- compare.py: remove train_and_compare/TrainResult wrapper, export compute_diffs
  directly. Inline CI/activation computation (3 lines, no helper indirection)
- lora_baseline.py: reg_seqs required in constructor (no prepare_reg, no None states),
  remove __call__, remove __future__ import, extract _compute_reg_baselines
- __init__.py: only export what exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
python -m spd.editing.export_blog_heatmap --out-dir /path/to/vpd-blog/data

Generates real KL heatmap data: SPD analytical (0 training examples)
vs LoRA rank-1 (1000+ examples, λ=10 KL reg). 30 held-out examples each.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
python -m spd.editing.generate_pareto_plots --out-dir figures/

Sweeps SPD analytical (α), SPD trained (n=1,4,8,16), and LoRA (n×λ)
with LLM-labeled emoticon/non-emoticon eval split. Outputs 3 PNG plots.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mples

The previous approach sampled from h.0.mlp.c_fc:100 activation examples,
which is biased toward that component's firing distribution. Now loads
random sequences from the Pile validation split via the eval dataloader.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
load_model now returns config. eval_dataloader(config) builds the
dataloader from the run's dataset config instead of hardcoding
dataset name/tokenizer/split.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Both SPD and LoRA now measure KL against the same unedited base model.
  Previously LoRA used its own post-reset baselines (slightly perturbed).
- Add temperature=0 to haiku emoticon labeling for determinism.
- LLM-verify VERIFIED_IDXS are still emoticons; assert if harvest changed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously collect_firings emitted one entry per (example, position) pair,
so multi-firing sequences appeared multiple times. Now the unit is
ActivationExample — each unique sequence appears once with all its fire
positions. Eval/train split is on examples, surrounding KL measured once
per example excluding all fire positions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
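The regrouping this commit describes can be sketched as follows. The dataclass name matches the commit; the fields and the `collect_firings` signature are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ActivationExample:
    tokens: tuple                                   # one unique sequence
    fire_positions: list = field(default_factory=list)

def collect_firings(pairs):
    """pairs: iterable of (token_tuple, position). One entry per sequence."""
    by_seq = {}
    for tokens, pos in pairs:
        ex = by_seq.setdefault(tokens, ActivationExample(tokens))
        ex.fire_positions.append(pos)
    return list(by_seq.values())

pairs = [(("a", "b"), 0), (("a", "b"), 1), (("c",), 0)]
examples = collect_firings(pairs)
assert len(examples) == 2                # multi-firing sequence appears once
assert examples[0].fire_positions == [0, 1]
```

Splitting train/eval on these examples (rather than on (example, position) pairs) is what prevents the same sequence from leaking into both sides.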
… pool

SPD trained and LoRA now both draw from train_pool[:n_ex] without
hand-curated emoticon filtering. Removes the asymmetry where SPD got
curated examples while LoRA got unfiltered data at large n.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The [32:82] split was a leftover from the cherry-picked VERIFIED_IDXS era.
Now: eval = examples[:N], train = examples[N:]. Simple, no gaps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously KL regularization ran on separate random Pile sequences,
so LoRA wasn't penalized for bleeding into surrounding tokens of the
examples it trained on. Now: CE at fire positions + KL at all other
positions in the same forward pass. No separate reg sequences.

LoRATrainer takes train_seqs at init and caches baselines per sequence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Baselines are now list[Tensor] in the same order as the token list.
No more fragile id(tensor) keying.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Baselines were computed with the LoRA hook active, meaning KL reg
targeted "match LoRA at random init" not "match clean model". Now:
- base_probs cached before hook installation (clean model outputs)
- reset() removes hook, re-caches, re-installs
- base_probs is list[Tensor] indexed by position (not id()-keyed dict)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove default lr/n_steps from train_write_delta and LoRATrainer
  (callers always pass explicitly)
- Assert hook not installed before computing baselines
- Remove n_ce/n_kl counters (always equal len(idxs))
- Remove max(n_kl, 1) defensive guard
- Remove _forward_raw indirection (forward is sufficient)
- Clean up docstrings (no narrativised comments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ComponentModel.__call__ with no mask_infos already does a plain
target model forward + output extraction. No need for the private
method or base_forward wrappers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pers

- generate_pareto_plots: exports pareto_data.json with all point data + meta
- export_blog_heatmap: reuse get_examples/get_probs/make_train_seqs from
  generate_pareto_plots instead of duplicating. Cleaner metadata in output.
- Remove eval_dataloader (unused indirection)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single 2-panel figure (surrounding KL + global KL) with annotated points,
matching the notebook exploration. Replaces the multi-n grid.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Catches the bug where LoRA hook on the same linear contaminates SPD
measurements. write_edit and train_write_delta now assert the target
linear has no forward hooks before installing theirs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old write_edit added (x @ V_col) * u_delta on top of the existing
layer output — the original U row was still active, so the edit was
additive rather than a replacement. Now write_edit directly sets
U[u_idx] = new_u and restores on exit.

train_write_vector (renamed from train_write_delta) optimizes U[u_idx]
directly via gradient descent, with optional KL reg on non-fire tokens.

LoRA keeps hooks (consistent with how LoRA actually works).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
try/finally ensures comp.U[u_idx] and comp.U.requires_grad are
restored even if training is interrupted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ocg-goodfire and others added 11 commits March 26, 2026 15:19
comp.U[u_idx].copy_(new_u) fails when U requires grad. Use
comp.U.data[u_idx] = new_u instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
write_edit: patches target linear's weight matrix with the rank-1
delta V[:,c] ⊗ (new_u - old_u), since model(tokens) goes through
target_model weights, not component U/V directly.

train_write_vector: uses a forward hook adding V_col ⊗ (u_param - old_u)
so gradients flow through u_param. Hook is temporary (try/finally).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
u_replaced directly swaps comp.U.data[u_idx] and runs through the
component path (all-ones masks + weight delta). Clean and correct.

train_write_vector removed — needs more thought on how to correctly
route gradients through the component path during training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The component path computes V^T @ U + weight_delta. If weight_delta
is computed after changing U, the delta absorbs the edit (cancels out).
Now: snapshot weight_deltas before swapping U, freeze them, use in
every forward call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LoRATrainer is now just A/B params + hook + optimizer. No train_seqs,
no base_probs, no reset(). Caller caches baselines before creating
LoRA, passes (batch, baseline) per step. Context manager removes hook.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
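The stripped-down rank-1 adapter this commit describes can be sketched in NumPy. Names (`A`, `B`, `lora_forward`) and the zero-init convention are illustrative; the real LoRATrainer applies this via a forward hook.

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_out = 6, 4
W = rng.normal(size=(d_in, d_out))     # frozen base weight
A = rng.normal(size=(d_in,)) * 0.01    # LoRA "down" vector (trainable)
B = np.zeros(d_out)                    # LoRA "up" vector, zero-init (trainable)

def lora_forward(x):
    # rank-1 adapter: y = x @ W + (x @ A) * B
    return x @ W + np.outer(x @ A, B)

x = rng.normal(size=(3, d_in))
# zero-init B means the adapter starts as an exact no-op on the base model
assert np.allclose(lora_forward(x), x @ W)
```

Caching baselines before the hook is installed matters precisely because the adapter is only a no-op at init if `B` is zero; once training starts, a baseline captured through the hook would be contaminated.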
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Accumulates CE + KL across the batch, single optimizer step.
Callers sample mini-batches of 8 without replacement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pad sequences to max_len, single forward pass [B, max_len, vocab],
masked CE at fire positions, KL at non-fire non-pad positions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
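The masked loss over a padded batch can be sketched as below: CE at fire positions, KL against cached clean-model probabilities at non-fire, non-pad positions. All shapes and names are illustrative, and λ=10 echoes the sweep value from the description.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
B, L, V = 2, 5, 7
logits = rng.normal(size=(B, L, V))                 # edited model, one forward
base_probs = softmax(rng.normal(size=(B, L, V)))    # clean model, cached
targets = rng.integers(0, V, size=(B, L))
fire = np.zeros((B, L), bool); fire[:, 2] = True    # edit (CE) positions
pad = np.zeros((B, L), bool); pad[1, 4] = True      # padding positions

logp = np.log(softmax(logits))
ce = -logp[fire, targets[fire]].mean()              # on-target CE
kl_mask = ~fire & ~pad                              # everywhere else
p = base_probs[kl_mask]
kl = (p * (np.log(p) - logp[kl_mask])).sum(-1).mean()
loss = ce + 10.0 * kl                               # lambda = 10
```

Running CE and KL on the same padded forward pass is what makes the edit pay for bleeding into the surrounding tokens of its own training sequences.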
u_replaced now computes ΔW = outer(new_u - old_u, V[:, idx]) and applies it as
a forward hook on the target linear layer, instead of going through the component
path with frozen weight deltas and all-ones masks. Both SPD and LoRA edits now
follow the same pattern: render a weight delta, hook it onto the base model.

Also: LoRA train_step takes pre-padded tensor args instead of nested Python
structures, fix vocab dimension bug, update all callers for new API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
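The final replace-as-weight-delta pattern can be verified in a few lines of NumPy. Note the convention: with `x @ W` and `W` of shape `(d_in, d_out)`, the delta is `outer(V[:, idx], new_u - old_u)`; the commit's `outer(new_u - old_u, V[:, idx])` is the same delta in the transposed `W @ x` convention. Names are illustrative, not the spd API.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, C = 6, 5, 3
V = rng.normal(size=(d_in, C))            # read vectors (columns)
U = rng.normal(size=(C, d_out))           # write vectors (rows)
idx = 1
old_u = U[idx].copy()
new_u = rng.normal(size=(d_out,))

# Replacement, not addition: the delta subtracts the old row's contribution.
dW = np.outer(V[:, idx], new_u - old_u)

x = rng.normal(size=(4, d_in))
W_component = V @ U                       # component-path weight
edited = x @ (W_component + dW)

# Equivalent to swapping the U row directly:
U_swapped = U.copy()
U_swapped[idx] = new_u
assert np.allclose(edited, x @ (V @ U_swapped))
```

Rendering the edit as a single weight delta hooked onto the base model is what lets SPD and LoRA edits share one measurement path.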
- Add pad_train_seqs() to generate_pareto_plots.py, replacing inline
  padding in both generate_pareto_plots.py and export_blog_heatmap.py
- Use kl_per_token() in export_blog_heatmap.py instead of inline KL
- export_blog_heatmap.py now pre-pads once instead of allocating per step

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- run_pareto_export.py: notebook script for SPD vs LoRA Pareto sweep + export
- figures/editing/: pareto plots (png+pdf), KL histograms, heatmap JSON data
- Fix mkdir parents=True in export_blog_heatmap.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>