interp-lab

A mechanistic interpretability sandbox. Notebook-driven, learning-by-doing — not a product, not a library. The goal is to reverse-engineer small pieces of a language model by hand, not to use one.

The arc:

IOI — replicate the Indirect Object Identification circuit on GPT-2 small (Wang et al., 2022): activation patching, attention-head attribution, and the name-mover / S-inhibition / induction-head story.
SAE — train a small sparse autoencoder on a residual-stream layer and inspect the features it learns.

Target model is GPT-2 small (124M params, 12 layers, 12 heads, d_model 768) throughout. It's tiny on purpose: small enough to hold the whole thing in your head, big enough to do real circuits. This is interpretability — full activation access, raw fp32/fp16 weights — not inference serving, so there's no quantization or serving tooling here by design.

What is this actually about?

A transformer maps tokens → next-token logits, and the how is buried in millions of weights. Mechanistic interpretability is the project of recovering human-understandable algorithms ("circuits") from those weights.

The trick that makes it tractable: you don't read weights directly. You run the model with full instrumentation, then intervene on its internal activations and watch the output change. An activation you can change to break (or restore) a behavior is causally part of the mechanism behind it.

IOI is the canonical first circuit because the behavior is crisp and the metric is a single number. Given

"When John and Mary went to the store, John gave a drink to ___"

a competent model continues with " Mary" — the indirect object, mentioned once — and not " John", the subject, mentioned twice. The metric is the logit difference:

logit_diff = logit(" Mary") − logit(" John")

Positive ⇒ the behavior is present. One scalar you can watch move as you poke at the internals.
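A minimal sketch of the metric, assuming the final-position logits are already available as a mapping from token string to logit. The numbers are made up for illustration and `logit_diff` is a hypothetical helper, not a library function.

```python
def logit_diff(final_logits, io_token, s_token):
    """Logit of the indirect object minus logit of the subject."""
    return final_logits[io_token] - final_logits[s_token]

# Made-up final-position logits for three candidate next tokens.
toy_logits = {" Mary": 14.0, " John": 11.0, " the": 9.5}

diff = logit_diff(toy_logits, io_token=" Mary", s_token=" John")
print(f"logit_diff = {diff:+.2f}")  # positive means the IO is preferred
```

With a real model the lookup would be by token id at the last position of the logits tensor rather than by string; note the leading space in " Mary" and " John", since that is how the names appear as next tokens.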

The core technique: activation patching

This is ~90% of the IOI work, so it's worth stating up front.

You run two forward passes:

clean — the prompt above; the model wants " Mary". Logit diff is high.
corrupted — one thing changed (e.g. swap a name so the answer flips, or replace a name with a random token). Logit diff collapses.

Now the move: run the corrupted prompt, but at one chosen hook point — say the output of head 9.9 at the final token — overwrite that activation with the value it had in the clean run. Everything else stays corrupted. Read the logit diff.

If splicing that one clean activation back restores the answer, that activation was causally carrying the signal.
If nothing changes, it wasn't.

Sweep over every (layer, position) → a heatmap of where the behavior lives. Sweep over every (layer, head) → which heads implement it. The whole IOI paper is this one primitive applied at increasing resolution.
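The clean / corrupted / patched loop can be sketched on a toy stand-in before touching GPT-2. Everything here is hypothetical: the three "sites", their arithmetic, and the prompt fields are invented so the example runs with no dependencies. Only the control flow mirrors the procedure described above.

```python
def run(prompt, patch=None):
    """Toy 'model': three sites computed in order. Returns (logit_diff, cache).

    `patch` maps a site name to a value that overwrites that site's
    activation before anything downstream reads it.
    """
    patch = patch or {}
    cache = {}

    def site(name, value):
        # Record the activation, unless a patch overrides it.
        cache[name] = patch.get(name, value)
        return cache[name]

    dup = site("dup_detector", 1.0 if prompt["subject_repeated"] else 0.0)
    place = site("place_feature", 1.0 if prompt["place"] == "store" else 0.0)
    mover = site("name_mover", 3.0 * dup - 0.5)   # reads dup, writes the answer
    return mover + 0.0 * place, cache             # place never affects the output

clean = {"subject_repeated": True, "place": "store"}
corrupted = {"subject_repeated": False, "place": "park"}

clean_diff, clean_cache = run(clean)
corrupted_diff, _ = run(corrupted)

def restored_fraction(site_name):
    """0.0 = patch did nothing, 1.0 = patch fully restored the clean logit diff."""
    patched_diff, _ = run(corrupted, patch={site_name: clean_cache[site_name]})
    return (patched_diff - corrupted_diff) / (clean_diff - corrupted_diff)

# The sweep: patch each site in turn, one at a time.
sweep = {name: restored_fraction(name) for name in clean_cache}
print(sweep)
```

Normalizing by the clean-minus-corrupted gap is a design choice, not something the lab prescribes: it puts every site on the same 0-to-1 scale, which is what makes a heatmap readable.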

Why it's causal, not correlational: probing asks "does this activation correlate with the output?" Patching asks "if I set this activation to its clean value, does the behavior come back?" — an intervention, which is what makes the resulting circuit claims actually hold.

It works cleanly because of the residual stream: every layer reads from and writes to a shared [position, d_model] stream additively (the residual connections). Each attention head's contribution is just a vector added in, so you can surgically swap one head's output without disturbing the rest. That linearity is the whole reason patching is surgical instead of chaotic.
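A small numeric sketch of that additivity, using a made-up 3-dimensional stream and invented component names. It ignores the final layer norm that GPT-2 applies before unembedding, so in the real model the decomposition is approximate rather than exact.

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy d_model = 3. Each component writes one vector into the stream.
writes = {
    "embed":  [0.5, 0.0, 1.0],
    "head_a": [1.0, 2.0, 0.0],
    "head_b": [0.0, -1.0, 0.5],
}

residual = [0.0, 0.0, 0.0]
for vec in writes.values():
    residual = add(residual, vec)

# Made-up direction in the stream that raises logit(IO) - logit(S).
diff_direction = [1.0, 1.0, 0.0]

total = dot(residual, diff_direction)
per_component = {name: dot(vec, diff_direction) for name, vec in writes.items()}

# Swapping out one head's write changes only that head's term.
without_a = dict(writes, head_a=[0.0, 0.0, 0.0])
total_without_a = sum(dot(vec, diff_direction) for vec in without_a.values())

print(per_component, total, total_without_a)
```

Because the readout is a dot product with a sum of vectors, the per-component terms add up to the total, and removing one head shifts the total by exactly that head's term.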

What you'll find (GPT-2 small, IOI)

The circuit decomposes into a few head families that compose:

Duplicate-token / induction heads (early layers) — detect that "John" appeared twice.
S-inhibition heads (middle layers) — write a signal that suppresses the duplicated subject, so the answer-writers don't copy it.
Name-mover heads (late, ~L9–L10) — attend from the final position to the indirect object and copy it into the output. The actual answer-writers.

Roughly: detect the duplicate → inhibit the subject → move the remaining name.

Setup

python3 -m venv venv            # or: uv venv && source .venv/bin/activate
source venv/bin/activate
pip install -r requirements.txt
jupyter lab                     # or: jupyter notebook

A CUDA GPU is used automatically if present. Any modern card is wild overkill: GPT-2 small's weights are roughly 0.5 GB in fp32 (about half that in fp16), and the activations for prompts this short add little on top. Without a GPU it falls back to CPU, which is still fine for a 124M-parameter model, just slower. The first model load downloads the GPT-2 weights into the HuggingFace cache (~/.cache/huggingface).
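A back-of-envelope check of the memory footprint, using the nominal figures quoted earlier (124M parameters, 12 layers, d_model 768). The 15-token prompt length is an assumption for illustration, and the real numbers depend on how the library stores the weights.

```python
N_PARAMS = 124_000_000   # nominal GPT-2 small parameter count
N_LAYERS = 12
D_MODEL = 768
N_TOKENS = 15            # assumed prompt length, for illustration only

def mib(n_bytes):
    return n_bytes / 2**20

weights_fp32 = mib(N_PARAMS * 4)   # 4 bytes per fp32 parameter
weights_fp16 = mib(N_PARAMS * 2)   # 2 bytes per fp16 parameter

# One cached residual-stream tensor per layer, batch size 1, fp32.
resid_cache = mib(N_LAYERS * N_TOKENS * D_MODEL * 4)

print(f"weights fp32: {weights_fp32:.0f} MiB")
print(f"weights fp16: {weights_fp16:.0f} MiB")
print(f"resid_post cache, all layers: {resid_cache:.2f} MiB")
```

The weights dominate; a cached activation for a single short prompt is smaller by several orders of magnitude, which is why caching everything is affordable at this scale.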

Quick sanity check

Open notebooks/00_ioi_baseline.ipynb and run it top to bottom. It loads the model, runs the canonical prompt plus a few role-swapped variants, and confirms:

device: cuda
top predicted token: ' Mary'
logit_diff (IO − S): ≈ +3.2          → IOI behavior present
+3.17  IO=Mary  S=John     +2.48  IO=John  S=Mary
+3.79  IO=Alice S=Bob      +2.73  IO=Bob   S=Alice

All four role-swaps positive ⇒ the circuit tracks the indirect-object role, not the literal string "Mary". If that holds, the circuit you're about to take apart is genuinely there.
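One way the role-swapped variants could be generated, as a sketch. The template wording is taken from the canonical prompt above, but `make_prompt` and the exact way the notebook orders the names are assumptions, not the notebook's actual code.

```python
TEMPLATE = "When {a} and {b} went to the store, {s} gave a drink to"

def make_prompt(io, s):
    """Return (prompt, answer, distractor): `s` appears twice, `io` once."""
    prompt = TEMPLATE.format(a=s, b=io, s=s)
    return prompt, " " + io, " " + s

# (indirect object, subject) pairs matching the four rows of output above.
pairs = [("Mary", "John"), ("John", "Mary"), ("Alice", "Bob"), ("Bob", "Alice")]
dataset = [make_prompt(io, s) for io, s in pairs]

# Logit diffs copied from the sanity-check output above.
reported_diffs = [3.17, 2.48, 3.79, 2.73]
behavior_present = all(d > 0 for d in reported_diffs)

for (prompt, answer, _), d in zip(dataset, reported_diffs):
    print(f"{d:+.2f}  expects{answer!r}  {prompt}")
```

Swapping which name plays which role, while holding the template fixed, is what separates "tracks the role" from "likes the string Mary".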

The core API (this is most of what you'll use)

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # GPT-2 small + hooks

logits = model(prompt)                              # ordinary forward pass
logits, cache = model.run_with_cache(prompt)        # ...and keep every activation

cache["resid_post", 9]   # residual stream after layer 9   [batch, pos, d_model]
cache["pattern", 9]      # attention pattern, layer 9       [batch, head, q, k]
cache["z", 9]            # per-head output, layer 9         [batch, pos, head, d_head]

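# patch an activation mid-forward-pass: run the corrupted prompt, splice in one clean value.
# clean_prompt and corrupted_prompt must tokenize to the same length so positions line up.
_, clean_cache = model.run_with_cache(clean_prompt)
pos, head = -1, 9                                   # final token, head 9.9

def patch_z(act, hook):                             # the second argument must be named `hook`
    act[:, pos, head] = clean_cache["z", 9][:, pos, head]   # splice in clean value
    return act

patched_logits = model.run_with_hooks(
    corrupted_prompt, fwd_hooks=[("blocks.9.attn.hook_z", patch_z)]
)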

run_with_cache records the tape; run_with_hooks splices and replays it. Everything in IOI is built from those two.

Working through it (and actually learning, not just running cells)

Notebooks are numbered as a curriculum:

00 — baseline (included): confirm the behavior exists and the metric works. Always pin your metric before you start cutting.
01 — activation patching: localize the circuit by (layer, position).
02 — head attribution: narrow to the specific heads above.
03 — attention patterns: visualize the name-movers attending to the IO token, and trace how the families compose.
…then SAEs: a different question — not "what does this circuit do" but "what features does the residual stream encode in a human-readable basis."
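For the SAE phase, a dependency-free sketch of what a sparse autoencoder computes on one residual-stream vector. This assumes a common formulation (ReLU encoder, linear decoder, reconstruction error plus an L1 penalty on the features); the architecture the lab will actually train is not specified here, and the weights below are hand-picked, not learned.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into non-negative feature activations, then reconstruct it."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    feats = relu([a + b for a, b in zip(matvec(w_enc, centered), b_enc)])
    recon = [r + b for r, b in zip(matvec(w_dec, feats), b_dec)]
    return feats, recon

def sae_loss(x, feats, recon, l1_coeff):
    """Mean squared reconstruction error plus a sparsity penalty."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    return mse + l1_coeff * sum(feats)

# Toy sizes: d_model = 2, n_features = 3 (more features than dimensions).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # n_features x d_model
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # d_model x n_features
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
feats, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, feats, recon, l1_coeff=0.5)
print(feats, recon, loss)
```

Only one of the three features fires for this input, and the reconstruction is exact, so the whole loss comes from the sparsity term. Inspecting which inputs make each feature fire is the "human-readable basis" question in miniature.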

The way to actually internalize this, rather than nodding along:

Predict before you run. Before each patch, write down what you expect the logit diff to do. The gap between your prediction and the result is the lesson.
Break things on purpose. Change the names, lengthen the sentence, add a third name. A circuit you can break is one you understand.
Ablate, don't only patch. Zero a head out entirely; if the behavior survives, that head wasn't load-bearing — a useful way to falsify your own hypotheses.
Keep an "expected vs. observed" log in a markdown cell at the top of each notebook. That diff is the curriculum.
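A toy illustration of the ablation advice, with invented heads and an invented readout. The `max` is there on purpose: it models two redundant heads, which is one situation where a single-head ablation gives a misleadingly small effect.

```python
def logit_diff(head_outputs):
    """Toy readout: the stronger of two redundant heads, plus a minor head."""
    return max(head_outputs["mover"], head_outputs["backup"]) + head_outputs["minor"]

clean = {"mover": 3.0, "backup": 2.5, "minor": 0.25}

def zero_ablate(outputs, *names):
    """Copy the outputs with the named heads set to zero."""
    ablated = dict(outputs)
    for name in names:
        ablated[name] = 0.0
    return ablated

baseline = logit_diff(clean)

# Effect of knocking out each head alone: how far the logit diff drops.
effects = {name: baseline - logit_diff(zero_ablate(clean, name)) for name in clean}

# Knocking out both redundant heads together tells a different story.
joint_effect = baseline - logit_diff(zero_ablate(clean, "mover", "backup"))

print(effects, joint_effect)
```

Ablating the main head alone costs little because its backup takes over, while ablating both removes almost all of the signal. A surviving behavior falsifies "this head alone is necessary", not "this head does nothing".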

Layout

notebooks/         one notebook per investigation, numbered in order
requirements.txt   pinned environment
LICENSE            MIT

Flat and notebook-first on purpose. No premature abstraction — if something earns being pulled into a shared module, it will; until then it lives in the notebook that needs it.

Stack

transformer-lens — HookedTransformer, run_with_cache, hooks/patching. The core tool.
torch — CUDA if available.
jupyter.
sae-lens — added in the SAE phase.

Versions are pinned in requirements.txt.

Learning resources

ARENA — the IOI + SAE curriculum this lab mirrors, with full solutions and exercises. The best hands-on starting point.
Neel Nanda — A Comprehensive Mechanistic Interpretability Explainer & Glossary and 200 Concrete Open Problems in Mechanistic Interpretability for vocabulary and directions.
Wang et al. (2022) — Interpretability in the Wild: a Circuit for IOI in GPT-2 small. The paper being replicated; best read after notebook 02, when the heads it names will mean something.
transformer-lens — docs and its main_demo notebook for the full hook/cache API.

License

MIT — see LICENSE.
