Add data-specific QK component contribution plots #459
Open

lee-goodfire wants to merge 8 commits into feature/attn_plots from
Conversation
* Add rich_examples autointerp strategy and compare tab

  New autointerp strategy (rich_examples) that shows per-token CI and activation values inline, letting the LLM judge evidence quality directly. Also adds an Autointerp Compare tab to the app for side-by-side comparison of interpretation results across different strategies/models/subruns.

  Backend: 3 new endpoints for listing subruns, bulk headlines, and detail. Frontend: SubrunSelector (multiselect chips), stacked SubrunInterpCard, two-panel AutointerpComparer with full component data on the right panel.

* Restrict Anthropic autointerp models and use structured outputs

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix rich_examples prompt: explain signed component activations

  Adds explanation to the SPD decomposition description that component activation sign is arbitrary (inner product with read direction) and does not indicate suppression. Trims redundant legend text. Also adds render_prompt.py script for iterating on prompt templates.

* Expose snapshot_branch in spd-autointerp CLI

* Improve rich_examples prompt clarity

  - Show raw text before annotated version in examples (helps with dense token sequences like code/LaTeX)
  - Add explicit explanation of <<<token (ci:X, act:Y)>>> format
  - Add "consider evidence critically" paragraph from dual_view

* Use XML blocks with raw + highlighted text in rich_examples examples

  Replaces sanitized single-line format with:

  <example>
  <raw>...unmodified text...</raw>
  <highlighted>...<<<token (ci:X, act:Y)>>>...</highlighted>
  </example>

  Adds AppTokenizer.get_raw_spans for LLM prompt rendering where actual whitespace (newlines, indentation) is meaningful.

* Show all subruns in autointerp comparer, not just .done ones

* Add autointerp_subrun_id to scoring CLI and InterpRepo.open_subrun

* Remove confidence field from autointerp + improve act legend

  Drops the confidence field entirely from InterpretationResult, all DB schemas, JSON output schemas, prompts, API responses, and frontend UI. Expands the act legend in rich_examples to explain that sign is meaningful within a component's examples even though the global convention is arbitrary; polarity may indicate distinct input patterns.

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Decomposes pre-softmax attention logits for individual dataset samples into per-(q_component, k_component) pair contributions at each key position. Overlays the component sum with ground-truth logits from the target model to validate the decomposition. Top-N pairs are ranked by peak absolute contribution on each specific datapoint (not harvest mean CI), with per-head visibility masking to reduce clutter. Supports weighted and binary modes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
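To illustrate the decomposition itself, here is a minimal numpy sketch with made-up shapes and rank-1 components (the script's actual tensors, names, and shapes will differ): when the Q and K projections are written as sums of components, the pre-softmax logit at each (query_pos, key_pos) splits exactly into a sum over (q_component, k_component) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_comp, seq = 16, 8, 4, 5

# Hypothetical rank-1 components: W_Q = sum_c outer(V_q[c], U_q[c]), same for W_K.
V_q = rng.normal(size=(n_comp, d_model))
U_q = rng.normal(size=(n_comp, d_head))
V_k = rng.normal(size=(n_comp, d_model))
U_k = rng.normal(size=(n_comp, d_head))
W_Q = np.einsum("cd,ch->dh", V_q, U_q)
W_K = np.einsum("cd,ch->dh", V_k, U_k)

x = rng.normal(size=(seq, d_model))       # residual-stream activations
q, k = x @ W_Q, x @ W_K
full_logits = q @ k.T / np.sqrt(d_head)   # pre-softmax attention logits

# Per-component query/key vectors: (seq, n_comp, d_head)
q_c = np.einsum("sd,cd,ch->sch", x, V_q, U_q)
k_c = np.einsum("sd,cd,ch->sch", x, V_k, U_k)

# Per-(q_component, k_component) contribution at each (query_pos, key_pos):
# shape (n_comp, n_comp, seq, seq); summing over both component axes
# recovers full_logits exactly (here the weights are exactly the component
# sum; in the real setting a weight delta leaves a residual).
pair = np.einsum("qah,kbh->abqk", q_c, k_c) / np.sqrt(d_head)
assert np.allclose(pair.sum(axis=(0, 1)), full_logits)
```

The overlay check in the script corresponds to comparing `pair.sum(axis=(0, 1))` against the target model's logits; any gap is attributable to the weight delta.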
- Move flash_attention disable out of per-sample loop (set once)
- Use set for sample index lookup
- Update module docstring to match current CLI
- Rewrite README to reflect current behavior (no harvest filtering, no validation plot, dataset samples, per-layer output dirs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
New script `spd/scripts/plot_qk_c_datapoint/` that decomposes pre-softmax attention logits for individual dataset samples into per-(q_component, k_component) pair contributions at each key position. For each (sample, query_pos, layer), it produces a 4x2 grid plot (mean + 6 per-head subplots) showing the pair contributions at each key position, with the component sum overlaid against the target model's ground-truth logits.
Supports weighted mode (actual activation scaling) and binary mode (CI threshold gating).
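The two modes could be sketched roughly as follows (hypothetical function, names, and threshold for illustration; not the script's actual API): weighted mode scales each component by its actual activation on the datapoint, while binary mode applies a 0/1 gate at a CI threshold.

```python
import numpy as np

def component_scale(acts, ci, mode="weighted", ci_threshold=0.1):
    """Per-component scaling on one side of a QK pair (illustrative only).

    weighted: scale by the component's actual activation on this datapoint.
    binary:   include the component (scale 1.0) iff its CI clears a threshold.
    """
    if mode == "weighted":
        return np.asarray(acts, dtype=float)
    if mode == "binary":
        return (np.asarray(ci, dtype=float) >= ci_threshold).astype(float)
    raise ValueError(f"unknown mode: {mode!r}")

# A pair-contribution tensor indexed (q_comp, k_comp, ...) would then be
# scaled by the outer product of the two sides' per-component scales.
acts = np.array([1.3, -0.2, 0.8])
ci = np.array([0.9, 0.03, 0.5])
print(component_scale(acts, ci, mode="weighted"))  # the activations themselves
print(component_scale(acts, ci, mode="binary"))    # [1. 0. 1.]
```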
Motivation and Context
The existing `plot_qk_c_attention_contributions` script computes weight-only QK interactions averaged over data. This script validates the decomposition on specific datapoints, verifying that the sum of component pair contributions matches actual attention logits. The residual (~0.4-0.6) is accounted for by the weight delta (V@U reconstruction error, ~11% of target weight norm).

How Has This Been Tested?
Run on s-55ea3f9b across all 4 layers, 20 dataset samples each (80 plots total).

Does this PR introduce a breaking change?
No — this is a new standalone analysis script with no changes to existing code.
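For reference, the ~11% weight-delta figure quoted in the motivation reads as a relative reconstruction error of the component sum against the target weight. Assuming it is a relative Frobenius norm (the script may use a different norm), it would be computed along these lines, with illustrative arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_comp = 32, 16, 8

# Hypothetical target weight and a low-rank component reconstruction V @ U.
W_target = rng.normal(size=(d_in, d_out))
V = rng.normal(size=(d_in, n_comp))
U = rng.normal(size=(n_comp, d_out))
W_recon = V @ U

# Relative error: ||W_target - V @ U||_F / ||W_target||_F. This unexplained
# delta is what produces the residual in the logit overlay plots.
rel_err = np.linalg.norm(W_target - W_recon) / np.linalg.norm(W_target)
print(f"relative reconstruction error: {rel_err:.1%}")
```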