Add data-specific QK component contribution plots#459

Open
lee-goodfire wants to merge 8 commits into feature/attn_plots from
feature/qk_c_datapoint_plots
Conversation

@lee-goodfire
Contributor

Description

Adds a new script, spd/scripts/plot_qk_c_datapoint/, that decomposes pre-softmax attention logits for individual dataset samples into per-(q_component, k_component) pair contributions at each key position.

For each (sample, query_pos, layer), produces a 4x2 grid plot (mean + 6 per-head subplots) showing:

  • Top-N component pair contributions as colored lines (ranked by peak abs contribution on the datapoint)
  • Sum over all components (black line)
  • Ground-truth pre-softmax logits from the target model (red dashed, weighted mode only)

Supports weighted mode (actual activation scaling) and binary mode (CI threshold gating).
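The identity the decomposition relies on is bilinearity of the QK form: if the component matrices sum to the full weights, the pre-softmax logit splits exactly into per-(q_component, k_component) terms. A minimal numpy sketch (all shapes and names here are illustrative, not the script's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_comp, seq = 16, 4, 3, 5

# Hypothetical per-component factors of W_Q and W_K that sum to the full weights.
Wq_c = rng.normal(size=(n_comp, d_model, d_head))
Wk_c = rng.normal(size=(n_comp, d_model, d_head))
W_Q, W_K = Wq_c.sum(0), Wk_c.sum(0)

x = rng.normal(size=(seq, d_model))  # residual-stream inputs
q_pos = seq - 1                      # fixed query position
scale = 1.0 / np.sqrt(d_head)

# contrib[i, j, k] = (x[q_pos] @ Wq_c[i]) . (x[k] @ Wk_c[j]) / sqrt(d_head)
q_parts = np.einsum("d,idh->ih", x[q_pos], Wq_c)   # (n_comp, d_head)
k_parts = np.einsum("kd,jdh->jkh", x, Wk_c)        # (n_comp, seq, d_head)
contrib = np.einsum("ih,jkh->ijk", q_parts, k_parts) * scale

# Summing over all pairs recovers the full pre-softmax logits for this query row.
logits = (x[q_pos] @ W_Q) @ (x @ W_K).T * scale
assert np.allclose(contrib.sum(axis=(0, 1)), logits)
```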

Motivation and Context

The existing plot_qk_c_attention_contributions script computes weight-only QK interactions averaged over data. This script instead validates the decomposition on specific datapoints, verifying that the sum of component-pair contributions matches the actual attention logits. The remaining residual (~0.4-0.6) is accounted for by the weight delta (the V@U reconstruction error, ~11% of the target weight norm).
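By the same linearity, the gap between reconstructed and target logits is exactly the logit contribution of the weight delta, which is what the residual check exploits. A toy sketch (variable names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, rank, seq = 8, 4, 3, 6

W_target = rng.normal(size=(d_model, d_head))   # target model's W_Q
V = rng.normal(size=(d_model, rank))
U = rng.normal(size=(rank, d_head))
W_recon = V @ U                                 # component reconstruction of W_Q
delta = W_target - W_recon                      # weight delta (reconstruction error)

x = rng.normal(size=(seq, d_model))
W_K = rng.normal(size=(d_model, d_head))
scale = 1.0 / np.sqrt(d_head)

logits_target = (x[0] @ W_target) @ (x @ W_K).T * scale
logits_recon = (x[0] @ W_recon) @ (x @ W_K).T * scale

# The decomposition residual equals the weight delta's own logit contribution.
delta_contrib = (x[0] @ delta) @ (x @ W_K).T * scale
assert np.allclose(logits_target - logits_recon, delta_contrib)
```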

How Has This Been Tested?

  • Ran on s-55ea3f9b across all 4 layers, 20 dataset samples each (80 plots total)
  • Verified decomposition residual matches weight delta via direct computation
  • Tested both weighted and binary modes
  • Type-checked with basedpyright, linted with ruff

Does this PR introduce a breaking change?

No — this is a new standalone analysis script with no changes to existing code.

ocg-goodfire and others added 8 commits March 18, 2026 13:23
* Add rich_examples autointerp strategy and compare tab

New autointerp strategy (rich_examples) that shows per-token CI and activation
values inline, letting the LLM judge evidence quality directly. Also adds an
Autointerp Compare tab to the app for side-by-side comparison of interpretation
results across different strategies/models/subruns.

Backend: 3 new endpoints for listing subruns, bulk headlines, and detail.
Frontend: SubrunSelector (multiselect chips), stacked SubrunInterpCard, two-panel
AutointerpComparer with full component data on the right panel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restrict Anthropic autointerp models and use structured outputs

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix rich_examples prompt: explain signed component activations

Adds explanation to the SPD decomposition description that component
activation sign is arbitrary (inner product with read direction) and
does not indicate suppression. Trims redundant legend text.

Also adds render_prompt.py script for iterating on prompt templates.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Expose snapshot_branch in spd-autointerp CLI

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Improve rich_examples prompt clarity

- Show raw text before annotated version in examples (helps with dense
  token sequences like code/LaTeX)
- Add explicit explanation of <<<token (ci:X, act:Y)>>> format
- Add "consider evidence critically" paragraph from dual_view

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Use XML blocks with raw + highlighted text in rich_examples examples

Replaces sanitized single-line format with:
  <example>
  <raw>...unmodified text...</raw>
  <highlighted>...<<<token (ci:X, act:Y)>>>...</highlighted>
  </example>

Adds AppTokenizer.get_raw_spans for LLM prompt rendering where actual
whitespace (newlines, indentation) is meaningful.
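A sketch of how one such example might be rendered (the helper below is hypothetical; only the `<example>`/`<raw>`/`<highlighted>` layout and the `<<<token (ci:X, act:Y)>>>` marker come from the format above):

```python
def render_example(tokens, cis, acts, ci_threshold=0.1):
    """Render one example as raw text plus a highlighted copy (hypothetical helper)."""
    raw = "".join(tokens)
    highlighted = "".join(
        f"<<<{tok} (ci:{ci:.2f}, act:{act:.2f})>>>" if ci > ci_threshold else tok
        for tok, ci, act in zip(tokens, cis, acts)
    )
    return f"<example>\n<raw>{raw}</raw>\n<highlighted>{highlighted}</highlighted>\n</example>"

print(render_example(["def ", "foo", "():"], [0.0, 0.9, 0.0], [0.0, -1.3, 0.0]))
```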

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Show all subruns in autointerp comparer, not just .done ones

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Add autointerp_subrun_id to scoring CLI and InterpRepo.open_subrun

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Remove confidence field from autointerp + improve act legend

Drops the confidence field entirely from InterpretationResult, all DB
schemas, JSON output schemas, prompts, API responses, and frontend UI.

Expands the act legend in rich_examples to explain that sign is
meaningful within a component's examples even though the global
convention is arbitrary — polarity may indicate distinct input patterns.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Decomposes pre-softmax attention logits for individual dataset samples into
per-(q_component, k_component) pair contributions at each key position.
Overlays the component sum with ground-truth logits from the target model
to validate the decomposition.

Top-N pairs are ranked by peak absolute contribution on each specific
datapoint (not harvest mean CI), with per-head visibility masking to
reduce clutter. Supports weighted and binary modes.
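The ranking described above can be sketched as follows (a toy stand-in, not the script's actual code):

```python
import numpy as np

def top_pairs(contrib: np.ndarray, n: int) -> list[tuple[int, int]]:
    """Rank (q_component, k_component) pairs by peak |contribution| over key positions.

    contrib has shape (n_q_comp, n_k_comp, seq) for one datapoint and query position.
    """
    peak = np.abs(contrib).max(axis=-1)            # (n_q_comp, n_k_comp)
    order = np.argsort(peak, axis=None)[::-1][:n]  # top-n flat indices, descending
    return [tuple(int(v) for v in np.unravel_index(i, peak.shape)) for i in order]

contrib = np.zeros((2, 2, 4))
contrib[1, 0, 2] = 5.0    # dominant pair
contrib[0, 1, 3] = -3.0   # negative peaks count via the absolute value
print(top_pairs(contrib, 2))  # → [(1, 0), (0, 1)]
```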

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move flash_attention disable out of per-sample loop (set once)
- Use set for sample index lookup
- Update module docstring to match current CLI
- Rewrite README to reflect current behavior (no harvest filtering,
  no validation plot, dataset samples, per-layer output dirs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>