Add data-specific QK component contribution plots#459

Open
lee-goodfire wants to merge 8 commits into feature/attn_plots from
feature/qk_c_datapoint_plots
Conversation

@lee-goodfire
Contributor

Description

Adds a new script, spd/scripts/plot_qk_c_datapoint/, that decomposes pre-softmax attention logits for individual dataset samples into per-(q_component, k_component) pair contributions at each key position.

For each (sample, query_pos, layer), produces a 4x2 grid plot (mean + 6 per-head subplots) showing:

  • Top-N component pair contributions as colored lines (ranked by peak abs contribution on the datapoint)
  • Sum over all components (black line)
  • Ground-truth pre-softmax logits from the target model (red dashed, weighted mode only)

Supports weighted mode (actual activation scaling) and binary mode (CI threshold gating).
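The identity the decomposition relies on is bilinearity of the QK form: if the component matrices sum to the full weights, the pre-softmax logit splits exactly into per-(q_component, k_component) terms. A minimal numpy sketch (all shapes and names here are illustrative, not the script's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_comp, seq = 16, 4, 3, 5

# Hypothetical per-component factors of W_Q and W_K that sum to the full weights.
Wq_c = rng.normal(size=(n_comp, d_model, d_head))
Wk_c = rng.normal(size=(n_comp, d_model, d_head))
W_Q, W_K = Wq_c.sum(0), Wk_c.sum(0)

x = rng.normal(size=(seq, d_model))  # residual-stream inputs
q_pos = seq - 1                      # fixed query position
scale = 1.0 / np.sqrt(d_head)

# contrib[i, j, k] = (x[q_pos] @ Wq_c[i]) . (x[k] @ Wk_c[j]) / sqrt(d_head)
q_parts = np.einsum("d,idh->ih", x[q_pos], Wq_c)   # (n_comp, d_head)
k_parts = np.einsum("kd,jdh->jkh", x, Wk_c)        # (n_comp, seq, d_head)
contrib = np.einsum("ih,jkh->ijk", q_parts, k_parts) * scale

# Summing over all pairs recovers the full pre-softmax logits for this query row.
logits = (x[q_pos] @ W_Q) @ (x @ W_K).T * scale
assert np.allclose(contrib.sum(axis=(0, 1)), logits)
```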

Motivation and Context

The existing plot_qk_c_attention_contributions script computes weight-only QK interactions averaged over data. This script instead validates the decomposition on specific datapoints, verifying that the sum of component-pair contributions matches the actual attention logits. The remaining residual (~0.4-0.6) is accounted for by the weight delta (the V@U reconstruction error, ~11% of the target weight norm).
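By the same linearity, the gap between reconstructed and target logits is exactly the logit contribution of the weight delta, which is what the residual check exploits. A toy sketch (variable names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, rank, seq = 8, 4, 3, 6

W_target = rng.normal(size=(d_model, d_head))   # target model's W_Q
V = rng.normal(size=(d_model, rank))
U = rng.normal(size=(rank, d_head))
W_recon = V @ U                                 # component reconstruction of W_Q
delta = W_target - W_recon                      # weight delta (reconstruction error)

x = rng.normal(size=(seq, d_model))
W_K = rng.normal(size=(d_model, d_head))
scale = 1.0 / np.sqrt(d_head)

logits_target = (x[0] @ W_target) @ (x @ W_K).T * scale
logits_recon = (x[0] @ W_recon) @ (x @ W_K).T * scale

# The decomposition residual equals the weight delta's own logit contribution.
delta_contrib = (x[0] @ delta) @ (x @ W_K).T * scale
assert np.allclose(logits_target - logits_recon, delta_contrib)
```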

How Has This Been Tested?

  • Ran on s-55ea3f9b across all 4 layers, 20 dataset samples each (80 plots total)
  • Verified decomposition residual matches weight delta via direct computation
  • Tested both weighted and binary modes
  • Type-checked with basedpyright, linted with ruff

Does this PR introduce a breaking change?

No — this is a new standalone analysis script with no changes to existing code.

ocg-goodfire and others added 8 commits March 18, 2026 13:23
* Add rich_examples autointerp strategy and compare tab

New autointerp strategy (rich_examples) that shows per-token CI and activation
values inline, letting the LLM judge evidence quality directly. Also adds an
Autointerp Compare tab to the app for side-by-side comparison of interpretation
results across different strategies/models/subruns.

Backend: 3 new endpoints for listing subruns, bulk headlines, and detail.
Frontend: SubrunSelector (multiselect chips), stacked SubrunInterpCard, two-panel
AutointerpComparer with full component data on the right panel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restrict Anthropic autointerp models and use structured outputs

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix rich_examples prompt: explain signed component activations

Adds explanation to the SPD decomposition description that component
activation sign is arbitrary (inner product with read direction) and
does not indicate suppression. Trims redundant legend text.

Also adds render_prompt.py script for iterating on prompt templates.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Expose snapshot_branch in spd-autointerp CLI

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Improve rich_examples prompt clarity

- Show raw text before annotated version in examples (helps with dense
  token sequences like code/LaTeX)
- Add explicit explanation of <<<token (ci:X, act:Y)>>> format
- Add "consider evidence critically" paragraph from dual_view

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Use XML blocks with raw + highlighted text in rich_examples examples

Replaces sanitized single-line format with:
  <example>
  <raw>...unmodified text...</raw>
  <highlighted>...<<<token (ci:X, act:Y)>>>...</highlighted>
  </example>

Adds AppTokenizer.get_raw_spans for LLM prompt rendering where actual
whitespace (newlines, indentation) is meaningful.
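A sketch of how one such example might be rendered (the helper below is hypothetical; only the `<example>`/`<raw>`/`<highlighted>` layout and the `<<<token (ci:X, act:Y)>>>` marker come from the format above):

```python
def render_example(tokens, cis, acts, ci_threshold=0.1):
    """Render one example as raw text plus a highlighted copy (hypothetical helper)."""
    raw = "".join(tokens)
    highlighted = "".join(
        f"<<<{tok} (ci:{ci:.2f}, act:{act:.2f})>>>" if ci > ci_threshold else tok
        for tok, ci, act in zip(tokens, cis, acts)
    )
    return f"<example>\n<raw>{raw}</raw>\n<highlighted>{highlighted}</highlighted>\n</example>"

print(render_example(["def ", "foo", "():"], [0.0, 0.9, 0.0], [0.0, -1.3, 0.0]))
```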

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Show all subruns in autointerp comparer, not just .done ones

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Add autointerp_subrun_id to scoring CLI and InterpRepo.open_subrun

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Remove confidence field from autointerp + improve act legend

Drops the confidence field entirely from InterpretationResult, all DB
schemas, JSON output schemas, prompts, API responses, and frontend UI.

Expands the act legend in rich_examples to explain that sign is
meaningful within a component's examples even though the global
convention is arbitrary — polarity may indicate distinct input patterns.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Decomposes pre-softmax attention logits for individual dataset samples into
per-(q_component, k_component) pair contributions at each key position.
Overlays the component sum with ground-truth logits from the target model
to validate the decomposition.

Top-N pairs are ranked by peak absolute contribution on each specific
datapoint (not harvest mean CI), with per-head visibility masking to
reduce clutter. Supports weighted and binary modes.
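The ranking described above can be sketched as follows (a toy stand-in, not the script's actual code):

```python
import numpy as np

def top_pairs(contrib: np.ndarray, n: int) -> list[tuple[int, int]]:
    """Rank (q_component, k_component) pairs by peak |contribution| over key positions.

    contrib has shape (n_q_comp, n_k_comp, seq) for one datapoint and query position.
    """
    peak = np.abs(contrib).max(axis=-1)            # (n_q_comp, n_k_comp)
    order = np.argsort(peak, axis=None)[::-1][:n]  # top-n flat indices, descending
    return [tuple(int(v) for v in np.unravel_index(i, peak.shape)) for i in order]

contrib = np.zeros((2, 2, 4))
contrib[1, 0, 2] = 5.0    # dominant pair
contrib[0, 1, 3] = -3.0   # negative peaks count via the absolute value
print(top_pairs(contrib, 2))  # → [(1, 0), (0, 1)]
```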

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move flash_attention disable out of per-sample loop (set once)
- Use set for sample index lookup
- Update module docstring to match current CLI
- Rewrite README to reflect current behavior (no harvest filtering,
  no validation plot, dataset samples, per-layer output dirs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>