Assess impact of swtbench non-test patch stripping on historical runs #718

@juanmichelini

Context

PR #711 changes benchmarks/swtbench/eval_infer.py to strip non-test diffs from model_patch during post-processing (keeping only files under tests/, test/, testing/, or matching test_*.py / *_test.py / conftest.py).
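For reference, the filter described above could look roughly like the sketch below. This is not the code from PR #711: the names `is_test_path` and `strip_non_test_diffs` are hypothetical, and it assumes that a `tests/`, `test/`, or `testing/` directory counts at any depth of the path and that `model_patch` is a git-style unified diff with `diff --git` headers.

```python
import fnmatch
import re
from pathlib import PurePosixPath

# Directory names and filename patterns taken from the description above.
TEST_DIRS = {"tests", "test", "testing"}
TEST_NAME_PATTERNS = ("test_*.py", "*_test.py", "conftest.py")

# Matches the per-file header of a git-style diff.
_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def is_test_path(path: str) -> bool:
    """Return True if the path looks like a test file (assumed rules)."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    # Assumption: a test directory at any depth qualifies.
    if any(part in TEST_DIRS for part in parts[:-1]):
        return True
    return any(fnmatch.fnmatchcase(parts[-1], pat) for pat in TEST_NAME_PATTERNS)


def strip_non_test_diffs(model_patch: str) -> str:
    """Keep only the per-file diff sections that touch test files."""
    kept = []
    keep = False
    for line in model_patch.splitlines(keepends=True):
        match = _DIFF_HEADER.match(line.rstrip("\r\n"))
        if match:
            # Decide per file section, based on the post-change path.
            keep = is_test_path(match.group(2))
        if keep:
            kept.append(line)
    return "".join(kept)
```

A patch that touches both `src/foo.py` and `tests/test_foo.py` would come out with only the `tests/test_foo.py` section, which is why re-evaluated scores can differ from the originally reported ones.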

This is a behavior change at evaluation time: any historical SWT-bench output.jsonl whose model_patch contains non-test diffs will, when re-evaluated under the new code, produce a different output.swtbench.jsonl and potentially different scores than what was originally reported. Existing output.swtbench.jsonl files on disk are not modified, but published numbers and the benchmark monitor are no longer directly comparable to future runs.

Goal

Quantify and document the impact of this change on historical SWT-bench runs, so we know:

  1. How much the headline resolved-rate moves for each past run (and per model/config).
  2. Whether previously-resolved instances regress (i.e. cases where the source-code diff was actually needed for the test patch to apply / pass).
  3. Whether the change is a monotonic improvement in practice, or whether there are pockets of regression that argue for a smarter strip (e.g. intersecting with the gold test_patch file set per instance, noted as a follow-up in PR #711).
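The three questions above reduce to a per-instance comparison of resolved status before and after the change. A minimal sketch of that comparison follows; it assumes the per-instance results have already been loaded into plain `dict[str, bool]` mappings (instance id to resolved flag). The loader, the `ImpactSummary` type, and `summarize_impact` are hypothetical and do not reflect the actual report format.

```python
from dataclasses import dataclass


@dataclass
class ImpactSummary:
    total: int                 # instances present in both evaluations
    rate_before: float         # resolved rate under the old post-processing
    rate_after: float          # resolved rate under the new post-processing
    regressed: list[str]       # resolved before, not resolved after
    newly_resolved: list[str]  # not resolved before, resolved after


def summarize_impact(before: dict[str, bool], after: dict[str, bool]) -> ImpactSummary:
    """Compare resolved flags for the instances shared by both runs."""
    ids = sorted(before.keys() & after.keys())
    total = len(ids)
    resolved_before = sum(before[i] for i in ids)
    resolved_after = sum(after[i] for i in ids)
    return ImpactSummary(
        total=total,
        rate_before=resolved_before / total if total else 0.0,
        rate_after=resolved_after / total if total else 0.0,
        regressed=[i for i in ids if before[i] and not after[i]],
        newly_resolved=[i for i in ids if not before[i] and after[i]],
    )
```

An empty `regressed` list for every historical run would support treating the change as monotonic; any non-empty list points at instances where the source-code diff was needed, which is the case the smarter strip is meant to handle.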

Suggested approach

Related

This issue was created by an AI agent (OpenHands) on behalf of @juanmichelini, per their request on PR #711.
