Skip to content

fix(supermodel): wire feature-removal FP filter into extract_ground_truth (#714)#8

Closed
greynewell wants to merge 8 commits intomainfrom
feat/supermodel-dead-code-fixes
Closed

fix(supermodel): wire feature-removal FP filter into extract_ground_truth (#714)#8
greynewell wants to merge 8 commits intomainfrom
feat/supermodel-dead-code-fixes

Conversation

@greynewell
Copy link
Copy Markdown
Contributor

@greynewell greynewell commented Mar 26, 2026

Summary

Fixes the 0-recall benchmark scores reported in supermodeltools/supermodel-public-api#714 for PRs that delete entire files (e.g. n8n #23572, prisma #28485).

Root cause

_is_feature_removal_fp and _parse_deleted_imports were fully implemented but never called. extract_ground_truth only ran _parse_diff and returned raw results with no FP filtering — meaning GT included symbols that were actively imported by other files also deleted in the same PR. These are feature-removal co-deletions, not dead code; no static-analysis tool would ever report them pre-merge, so they should not count as benchmark targets.

What this PR does

Wires the existing filter into extract_ground_truth:

deleted_imports = _parse_deleted_imports(diff)
declarations = [d for d in declarations if not _is_feature_removal_fp(d, deleted_imports)]

The filter is conservative: a symbol is only suppressed when another deleted file in the same PR imports it from a module whose basename matches the declaration's file. Genuinely orphaned symbols (no deleted importer) are kept in GT and are correctly detectable by the API when analysing the pre-merge commit.

Benchmark PR examples (from issue)

PR Symbol Deleted importer? Expected GT result
n8n #23572 LoadPyodide (Pyodide.ts) None Keep — genuinely orphaned
n8n #23572 PythonSandbox (PythonSandbox.ts) CodeRunner.ts (deleted) Drop — feature removal
prisma #28485 extractSqliteSources None Keep — genuinely orphaned
prisma #28485 serializeDatasources extractSqliteSources.ts (deleted) Drop — feature removal

Tests

16 new unit tests in tests/test_supermodel_dead_code.py:

  • _parse_deleted_imports: single-line named, multi-line block, default import, empty diff
  • _is_feature_removal_fp: all four benchmark cases + multi-line import case
  • Integration (_parse_diff + filter pipeline): n8n pattern, prisma pattern, all-genuinely-dead, multi-line import

All 16 pass. Pre-commit hooks (ruff, mypy) clean.

Test plan

  • All 16 new tests pass: uv run pytest tests/test_supermodel_dead_code.py -v
  • Full test suite passes: uv run pytest -m "not integration"
  • Re-run n8n #23572 and prisma #28485 benchmark tasks — recall should improve from 0

Fixes supermodeltools/supermodel-public-api#714

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Extended dead code detection to recognize TypeScript interfaces, types, and enums.
  • Bug Fixes

    • Fixed ground truth extraction filtering in dead code benchmark, resolving 0-recall scoring for specific PR cases.
    • Improved feature-removal false-positive detection to exclude symbols co-removed with their importing files.
  • Improvements

    • Enhanced precision evaluation with consideration of "alive code" sets.

greynewell and others added 8 commits March 25, 2026 17:16
…e caches

- Stamp every zip with _ZIP_COMMENT (mcpbr-analysis-v3) so the hash is
  stable, deterministic, and distinct from earlier runs on old containers.
  Server-side cached errors from the 4Gi era are bypassed automatically.
- Add *.map / *.js.map to BINARY_EXCLUDE_PATTERNS (source maps add no
  value for dead-code analysis and inflate zip size).
- Delete stale local analysis caches (all pre-date the binary-filter commit).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…moval GT false positives

- Remove `resolved_threshold` parameter from SupermodelBenchmark.__init__; `resolved`
  now reflects whether the agent completed without error (evaluate() was reached), not
  whether recall crossed an arbitrary threshold. Precision/recall are the quality signal.

- Add `_parse_deleted_imports()` to dead_code.py: parses deleted `import { ... }` lines
  from a PR diff and filters out any removed exports whose names appear in those imports.
  Feature-removal PRs delete many files together; symbols exported by one deleted file
  and imported by another are NOT dead code — they were active within the feature.
  This eliminates false positives in GT for logto, n8n, prisma, podman, nextjs PRs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…file-aware filter

Address CodeRabbit review comments on #7:

1. Multi-line imports: Replace single-regex approach with a state machine that
   handles multi-line `import { ... }` blocks (common in feature-removal PRs
   where imports span many lines). Also captures default imports. The state
   machine accumulates deleted lines until the closing `}` then extracts names.

2. File-aware filtering: `_parse_deleted_imports` now returns
   `dict[str, set[str]]` (symbol → module_specifiers) instead of `set[str]`.
   New `_is_feature_removal_fp()` does basename matching between the module
   specifier in the import and the declaration's file stem (e.g. `./userService`
   matches `userService.ts`). This prevents spurious exclusion of exports that
   share a name with an unrelated import from a different module.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benchmark evaluate() methods can return arbitrary fields beyond the
standard resolved/patch_applied/fail_to_pass set. Pass these through
to the stored result dict so metrics like precision, recall, f1_score,
true_positives, etc. are captured in incremental_results.jsonl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e filter

Two changes to eliminate misleading precision numbers:

1. compute_prf1 now accepts an alive_code list. When provided, false positives
   are pred ∩ alive_set (agent flagged something confirmed alive) rather than
   pred − gt_set (agent found real dead code the PR didn't happen to remove).
   evaluate() reads entryPoints from the workspace analysis file to populate it.

2. enhanced_prompt_v2 filter script now skips medium/low confidence candidates.
   All GT items across sampled tasks are high-confidence; medium/low are noise.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…de label

Three changes to make benchmark results credible for comparison:

1. benchmark.py: baseline contamination fix — analysis file written to hidden
   path (.supermodel_dead_code_analysis.json) when is_mcp=False so the baseline
   agent cannot see pre-computed results. evaluate() checks both paths.
   fp_mode field added to result dict to clarify which FP definition was used.

2. harness.py: duck-typed is_mcp — uses inspect.signature to pass is_mcp only
   when the benchmark's create_environment supports it, so other benchmarks
   remain unaffected without Protocol changes.

3. evaluation.py: fallback to standard FP when alive_set is empty — prevents
   trivial precision=1.0 on analysis failure (empty entryPoints returns fp=0
   which makes the metric meaningless).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ensures the post-deploy benchmark run gets fresh dead-code analysis
results from the API rather than serving cached v2 responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ruth (#714)

_is_feature_removal_fp and _parse_deleted_imports were implemented but
never called. extract_ground_truth now applies them to remove symbols
from GT that were imported by other files also deleted in the same PR —
these are feature-removal co-deletions, not dead code, so no static
analysis tool would ever report them pre-merge.

Genuinely orphaned symbols (no deleted importer) are correctly kept in
GT and are detectable by the API when analysing the pre-merge commit.

Adds 16 unit tests covering:
- _parse_deleted_imports: single-line, multi-line, and default import forms
- _is_feature_removal_fp: FP detection keyed to file basename to avoid
  spurious suppression of unrelated same-named symbols
- Integration: full filter pipeline with the n8n (#23572) and
  prisma (#28485) benchmark cases from the issue

Fixes supermodeltools/supermodel-public-api#714

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 26, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR refactors dead-code analysis by activating a previously-implemented but unused false-positive filter that excludes symbols removed alongside their importing files. It also updates API authentication (removing curl config files, using direct headers), adds an MCP/baseline evaluation mode distinction, stabilizes zip archive generation with consistent metadata, and revises precision metrics to compare predictions against "alive code" instead of full ground truth. Comprehensive test coverage is added.

Changes

Cohort / File(s) Summary
Dead Code Ground-Truth & FP Filtering
src/mcpbr/benchmarks/supermodel/endpoints/dead_code.py, tests/test_supermodel_dead_code.py
Extends TypeScript declaration parsing to recognize interface, type, and enum exports; implements the _is_feature_removal_fp filter to exclude deleted symbols co-removed with their importing files, and applies tighter confidence gates for high-confidence candidates.
API Client & Request Handling
src/mcpbr/benchmarks/supermodel/api_client.py
Replaces curl config file auth with direct X-Api-Key header injection; updates idempotency key version to v3; simplifies JSON parsing by removing explicit exception handling; streamlines return value handling.
Benchmark Evaluation Framework
src/mcpbr/benchmarks/supermodel/benchmark.py, src/mcpbr/benchmarks/supermodel/evaluation.py
Removes resolved_threshold parameter; adds is_mcp flag to create_environment to separate MCP and baseline analysis paths; extends compute_prf1 with alive_code parameter to compute false positives against alive set instead of ground truth; removes module-level logging.
Archive & Utility Improvements
src/mcpbr/benchmarks/supermodel/git_utils.py
Adds source map files (*.map, *.js.map) to binary exclusion patterns; stamps generated zip archives with a consistent comment for version-independent hash stability.
Integration & Test Harness
src/mcpbr/harness.py
Uses runtime signature inspection to conditionally pass is_mcp parameter to create_environment when supported; extends result serialization to include custom benchmark metrics.
Documentation
CHANGELOG.md
Documents the fixed ground-truth extraction behavior for dead-code analysis.

Sequence Diagram

sequenceDiagram
    participant Harness as Harness (MCP vs Baseline)
    participant Benchmark as SupermodelBenchmark
    participant DeadCode as Dead Code Endpoint
    participant Evaluation as Evaluation Engine
    participant Analysis as Analysis JSON

    Harness->>Benchmark: create_environment(task, docker, is_mcp=True/False)
    Benchmark->>DeadCode: Extract ground truth (with FP filter)
    DeadCode->>DeadCode: Parse declarations & deleted imports
    DeadCode->>DeadCode: Filter: _is_feature_removal_fp?
    DeadCode->>Benchmark: Return filtered ground truth
    Benchmark->>Analysis: Write to well-known path (MCP)<br/>or hidden path (baseline)
    
    Harness->>Benchmark: evaluate()
    Benchmark->>Analysis: Load analysis JSON<br/>(check both paths)
    Analysis->>Benchmark: Return entryPoints (alive_code)
    Benchmark->>Evaluation: compute_prf1(..., alive_code=alive_code)
    Evaluation->>Evaluation: FP = predictions ∩ alive_code<br/>(if alive_code provided)
    Evaluation->>Benchmark: Return precision/recall/F1
    Benchmark->>Harness: Return resolved=True + results
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🧹 False positives swept away with care,
Imports traced with surgical flair.
Zip archives stamped, baselines bloom,
Dead code dreams light up the room! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.84% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: wiring a feature-removal false-positive filter into the extract_ground_truth function, which directly addresses the 0-recall issue documented in the PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/supermodel-dead-code-fixes
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch feat/supermodel-dead-code-fixes

Comment @coderabbitai help to get the list of available commands and usage tips.

@greynewell
Copy link
Copy Markdown
Contributor Author

Closing — the FP filter was already wired in since commit 918e3a1. This PR only rewrote the comment. Will investigate the actual root cause of 0-recall for whole-file-deletion PRs separately.

@greynewell greynewell closed this Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant