Assess impact of swtbench non-test patch stripping on historical runs

## Context

PR #711 changes `benchmarks/swtbench/eval_infer.py` to strip non-test diffs from `model_patch` during post-processing (keeping only files under `tests/`, `test/`, `testing/`, or matching `test_*.py` / `*_test.py` / `conftest.py`).
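The filter described above can be sketched roughly as follows. This is a minimal illustration, not the code from #711: the names `is_test_path` and `strip_non_test_diffs` are hypothetical, and it assumes that the test directories match at any depth in the path and that `model_patch` is a git-style unified diff with `diff --git` headers.

```python
import fnmatch
import re
from pathlib import PurePosixPath

# Assumption: directory names match at any depth, not only at the repo root.
TEST_DIRS = {"tests", "test", "testing"}
TEST_FILE_GLOBS = ("test_*.py", "*_test.py", "conftest.py")

# Matches the per-file header of a git-style unified diff.
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def is_test_path(path: str) -> bool:
    """Return True if the path looks like a test file under the rules above."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    # Any parent directory named tests/, test/ or testing/ qualifies.
    if any(part in TEST_DIRS for part in parts[:-1]):
        return True
    # Otherwise fall back to the file-name patterns.
    return any(fnmatch.fnmatchcase(parts[-1], glob) for glob in TEST_FILE_GLOBS)


def strip_non_test_diffs(patch: str) -> str:
    """Keep only the per-file sections of a patch that touch test files."""
    kept = []
    keep = False
    for line in patch.splitlines(keepends=True):
        match = DIFF_HEADER.match(line.rstrip("\r\n"))
        if match:
            # A new file section starts; decide based on the "b/" path.
            keep = is_test_path(match.group(2))
        if keep:
            kept.append(line)
    return "".join(kept)
```

A patch that touches both `src/foo.py` and `tests/test_foo.py` would come out of this sketch containing only the `tests/test_foo.py` section.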

This is a behavior change at evaluation time: a historical SWT-bench `output.jsonl` re-evaluated under the new code can produce a different `output.swtbench.jsonl`, and therefore different scores, than what was originally reported whenever its patches touch non-test files. Existing `output.swtbench.jsonl` files on disk are not modified, but published numbers and the benchmark monitor are no longer directly comparable to future runs.

## Goal

Quantify and document the impact of this change on historical SWT-bench runs, so we know:

1. How much the headline resolved-rate moves for each past run (and per model/config).
2. Whether previously-resolved instances regress (i.e. cases where the source-code diff was actually needed for the test patch to apply / pass).
3. Whether the change is a monotonic improvement in practice (no previously resolved instance is lost), or whether there are pockets of regression that argue for a smarter strip (e.g. intersecting with the gold `test_patch` file set per instance, noted as a follow-up in #711).
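The three questions above reduce to set arithmetic on the resolved instance IDs before and after the change. A minimal sketch, assuming the resolved sets have already been extracted from the two evaluation outputs (the `ResolvedDelta` type and `compare_resolved` helper are hypothetical names, not part of the benchmarks repo):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedDelta:
    before: int
    after: int
    newly_resolved: frozenset
    newly_regressed: frozenset

    @property
    def net(self) -> int:
        """Net change in the number of resolved instances."""
        return self.after - self.before

    @property
    def is_monotonic(self) -> bool:
        """True when no previously resolved instance was lost."""
        return not self.newly_regressed


def compare_resolved(before: set, after: set) -> ResolvedDelta:
    """Compare the resolved instance IDs of one run under old and new code."""
    return ResolvedDelta(
        before=len(before),
        after=len(after),
        newly_resolved=frozenset(after - before),
        newly_regressed=frozenset(before - after),
    )
```

Any run where `is_monotonic` is false is a candidate for closer inspection, since its `newly_regressed` instances are the ones where the stripped source diff may have been needed.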

## Suggested approach

- Pick a representative set of past runs from the benchmark monitor (mix of models and dates; include the `litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463` run referenced in #711).
- For each run, re-run `benchmarks/swtbench/eval_infer.py` on the existing `output.jsonl` with the new code and diff the resolved set against the originally reported set.
- Tabulate: total resolved before/after, newly resolved, newly regressed, net delta. Cross-tab by original patch shape (only-test / mixed / only-source / empty) — the same buckets used in #711.
- Decide whether the benchmark monitor needs a one-shot historical re-eval pass to keep numbers consistent, and whether any leaderboard-facing artifacts need annotation.
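The cross-tab by patch shape could be sketched as below. This is an illustration under stated assumptions: patches are passed in as a plain mapping from instance ID to patch text (how they are read from `output.jsonl` is left out), the test-file predicate is supplied by the caller, and `patch_shape` / `cross_tab` are hypothetical names. A patch with no `diff --git` headers is counted as empty here.

```python
import re
from collections import Counter
from typing import Callable, Dict, Set

# Captures the "b/" path of every file section in a git-style unified diff.
DIFF_HEADER = re.compile(r"^diff --git a/.+? b/(.+)$", re.MULTILINE)


def patch_shape(patch: str, is_test: Callable[[str], bool]) -> str:
    """Bucket a patch as only-test, mixed, only-source, or empty."""
    files = DIFF_HEADER.findall(patch or "")
    if not files:
        return "empty"
    flags = {is_test(path) for path in files}
    if flags == {True}:
        return "only-test"
    if flags == {False}:
        return "only-source"
    return "mixed"


def cross_tab(
    patches: Dict[str, str],
    before: Set[str],
    after: Set[str],
    is_test: Callable[[str], bool],
) -> Dict[str, Counter]:
    """Count resolved/regressed instances per original patch shape."""
    table: Dict[str, Counter] = {}
    for instance_id, patch in patches.items():
        row = table.setdefault(patch_shape(patch, is_test), Counter())
        row["total"] += 1
        row["resolved_before"] += instance_id in before
        row["resolved_after"] += instance_id in after
        row["newly_resolved"] += instance_id in after and instance_id not in before
        row["newly_regressed"] += instance_id in before and instance_id not in after
    return table
```

Shapes are computed from the original, unstripped patch so that the buckets stay comparable with those reported in #711.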

## Related

- PR #711 (the change itself)
- Issue #708 (option 2 of which #711 implements)

_This issue was created by an AI agent (OpenHands) on behalf of @juanmichelini, per [their request on PR #711](https://github.com/OpenHands/benchmarks/pull/711)._ 

Assess impact of swtbench non-test patch stripping on historical runs #718
