DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708) by juanmichelini · Pull Request #711 · OpenHands/benchmarks

juanmichelini · 2026-05-13T02:12:50Z

Addresses option 2 from #708.

What

Extends the existing setup-file strip in benchmarks/swtbench/eval_infer.py with a positive whitelist that keeps only test-file diffs in model_patch. SWT-bench scores nothing else, and non-test diffs in model_patch actively hurt the score because the agent's own source "fix" runs against the test and silences the F2P signal.

Changes

benchmarks/utils/patch_utils.py: new keep_only_test_files(git_patch) helper that mirrors the structure of the existing remove_files_from_patch. A file is kept iff its path lives under tests/, test/, testing/, or its basename matches test_*.py / *_test.py / conftest.py. Repo-root paths are intentionally rejected so agent-authored scratch files (reproduction.py, test_repro.py, FIX_SUMMARY.md, …) are dropped even when they look testy.
benchmarks/swtbench/eval_infer.py: call keep_only_test_files immediately after the existing remove_files_from_patch(setup_files) call. Same place, same shape as the tox.ini/pyproject.toml/setup.py strip.
tests/test_patch_utils_keep_test_files.py: unit tests covering tests-dir, testing-dir, conftest, suffix/prefix names, repo-root scratch rejection, build/docs rejection, and the all-non-test case.

Why

From run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463 (424 instances), cross-tabbing patch shape against the resolved set:

Patch category	# instances	Resolved	Rate
Only test files	22	10	45.5 %
Mixed test + source code	332	4	1.2 %
Only source code / no test	64	0	0 %
Empty	3	0	0 %
Total	424	14	3.3 %

Stripping non-test diffs converts most mixed patches into the "only test files" shape — the population that already resolves at 45.5 %. Not every mixed patch will recover (some tests are calibrated to the agent's own fix or import symbols the agent added), but even a conservative recovery is multiples of the current score.

How to test

No re-inference required. Run benchmarks/swtbench/eval_infer.py against the existing output.jsonl from the bad run; the regenerated output.swtbench.jsonl will have only test-file diffs and the harness will rescore accordingly.

Unit tests:

pytest tests/test_patch_utils_keep_test_files.py

Risks / caveats

Heuristic, not ground truth. The whitelist is path-based. A more precise version would intersect with the gold test_patch file set per instance; deferred to a follow-up to keep this PR minimal.
Doesn't fix the root cause — the agent still wastes tokens on source-code fixes. Option 1 (prompt change, opened separately as DRAFT: swtbench: tighten default prompt to discourage non-test edits (option 1 of #708) #710) is complementary, not an alternative.
Existing runs change behavior. Any past SWT-bench run rescored on this code will see different numbers. Numbers reported on the benchmark monitor will need a re-eval pass to be consistent. Existing output.swtbench.jsonl files in completed runs are untouched on disk; only future invocations of eval_infer.py change.
The new test file lives at tests/test_patch_utils_keep_test_files.py; happy to move/rename if the repo has a different convention I missed.

Refs #708.

This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini.

@juanmichelini can click here to continue refining the PR

SWT-bench only scores diffs against existing test files. Today's post-processing in eval_infer.py removes a small blacklist of setup files (pyproject.toml, tox.ini, setup.py) but leaves everything else alone. That means any source-code 'fix' the agent committed alongside its test, plus scratch artifacts like reproduction.py, FIX_SUMMARY.md, build/lib/*, and docs/ changes, all flow through into model_patch. Source-code diffs are particularly harmful: when SWT-bench applies model_patch and runs the new test, the test is now executing against the agent's own 'fixed' code rather than the buggy code, so the F2P signal is silenced and the instance is marked unresolved. Empirically (run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463, 424 instances): Only test files in patch : 22 instances, 10 resolved (45.5%) Mixed test + source code in patch : 332 instances, 4 resolved ( 1.2%) Only source code in patch : 64 instances, 0 resolved ( 0.0%) Empty patch : 3 instances, 0 resolved ( 0.0%) Total : 424 instances, 14 resolved ( 3.3%) This adds keep_only_test_files() in benchmarks/utils/patch_utils.py that mirrors remove_files_from_patch() and applies the inverse predicate: keep diffs whose target file lives under tests/, test/, testing/, or whose basename matches test_*.py / *_test.py / conftest.py. Repo-root paths are intentionally rejected so the agent's own root-level scratch test_repro.py files are dropped too. eval_infer.py applies this immediately after the existing setup-file strip, so all existing SWT-bench runs can be rescored from output.jsonl without re-running inference. See #708. Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

🟡 Acceptable - Pragmatic solution with clear heuristics that addresses a real problem (45.5% → 1.2% resolution rate drop for mixed patches). The implementation is well-tested and consistent with existing patch utilities, but has important caveats around backward compatibility and the use of heuristics vs. ground truth.

all-hands-bot · 2026-05-19T17:03:26Z

+                # a non-test diff that lands in model_patch can silence the
+                # F2P signal because the test then runs against the agent's
+                # own patched code instead of the buggy code.
+                git_patch = keep_only_test_files(git_patch)


🟠 Important: Missing Runtime Evidence - The PR description shows the problem exists (table with resolution rates), but doesn't show that applying this fix actually improves scores on a real evaluation run.

Please add an Evidence section to the PR description with:

The command used to run eval_infer.py on an existing bad run

Before/after resolved instance counts from the evaluation output

Confirmation that the score improved as expected

This proves the fix works in practice, not just in theory.

runtime evidence here https://openhands-eval-monitor.vercel.app/?run=swtbench%2Flitellm_proxy-openrouter-qwen-qwen3-coder-next%2F25976432719%2F&days=60&benchmark=swtbench

all-hands-bot · 2026-05-19T17:04:05Z

Code Review Summary

Taste Rating: 🟡 Acceptable - Works pragmatically but has caveats

[CRITICAL ISSUES]

None - the core logic is sound and well-tested.

[IMPROVEMENT OPPORTUNITIES]

Missing Runtime Evidence - The PR demonstrates the problem with data (45.5% vs 1.2% resolution rates for different patch types), but doesn't show that this fix actually improves scores when applied to a real evaluation run. See inline comment on eval_infer.py:217.

Backward Compatibility Impact - This changes evaluation semantics for all future and re-evaluated runs. Historical benchmark scores will not be directly comparable. See inline comment on eval_infer.py:217.

Heuristic vs Ground Truth - The implementation uses path-based heuristics rather than the gold test_patch file set mentioned in the PR description. The PR says a more precise version would use ground truth but doesn't explain why that approach wasn't taken. See inline comment on patch_utils.py:99.

[TESTING GAPS]

Tests are comprehensive, covering:

✅ Empty input handling
✅ Various directory structures (tests/, test/, testing/)
✅ Naming patterns (test_*.py, *_test.py, conftest.py)
✅ Root-level scratch file rejection
✅ Build and docs directory handling
✅ All-non-test edge case

Suggestion: Add test for singular test/ directory at root to document intentional behavior (see inline comment on patch_utils.py:87).

[RISK ASSESSMENT]

Overall PR: 🟡 MEDIUM RISK

Risk Factors:

Backward Compatibility: Changes evaluation behavior for all runs (past if re-evaluated, all future runs). Teams tracking scores over time will see discontinuity.
Heuristic Approach: Path-based filtering may have edge cases that ground-truth data wouldn't have. However, the heuristics are well-documented and tested.
Well-Defined Scope: Limited to SWT-bench eval post-processing, doesn't affect inference or other benchmarks.
Production Data: The PR includes analysis of 424 instances showing the problem exists and the expected benefit (45.5% resolution rate for test-only patches).

Recommendation: Safe to merge after addressing the evidence gap (run eval on a sample to confirm score improvement). Consider making this behavior opt-in via flag for one release cycle to give teams time to adjust their monitoring, then make it default.

VERDICT

✅ Worth merging (after evidence added) - Core logic is sound, approach is pragmatic, tests are solid. The backward compatibility impact is significant but acceptable for a benchmark improvement that addresses a real scoring artifact.

KEY INSIGHT:
This fix converts SWT-bench evaluation from "penalizing agents for attempting source fixes" to "scoring only their test contributions," which better aligns with the benchmark's intent. The 45.5% vs 1.2% resolution rate data suggests the current scoring is dominated by noise from source patches interfering with F2P signals.

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.

Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

juanmichelini · 2026-05-27T22:39:34Z

tested here https://openhands-eval-monitor.vercel.app/?run=swtbench%2Flitellm_proxy-openrouter-qwen-qwen3-coder-next%2F25976432719%2F&days=60&benchmark=swtbench

juanmichelini · 2026-05-27T22:42:03Z

Opened #718 to track assessing the impact on historical SWT-bench runs (re-evaluating representative past output.jsonls under the new code and tabulating delta vs. originally reported scores).

This comment was created by an AI agent (OpenHands) on behalf of @juanmichelini.

all-hands-bot · 2026-05-27T22:42:27Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Solid, well-motivated change. The data in the PR description makes the benefit compelling (45.5% vs 1.2% resolution rate), the implementation is consistent with the existing remove_files_from_patch pattern, and the test file covers the important edge cases well.

Two things worth addressing before merge:

test_ prefix matching is not restricted to .py files — test_requirements.txt, test_config.yaml, etc. would be kept in the patch. Whether this matters in practice on Python SWT-bench repos is debatable, but it's an unintentional asymmetry with _test.py (which does check the extension).
Diff-block iteration is duplicated between keep_only_test_files and remove_files_from_patch (~15 lines of identical logic). As more patch utilities accumulate this will become a maintenance hazard.

Both are suggestions, not blockers. The backward-compatibility and runtime-evidence concerns raised in the earlier review have been addressed (issue #718 created, evaluation run linked).

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot · 2026-05-27T22:46:33Z

+    )
+
+
+def keep_only_test_files(git_patch: str) -> str:


🟡 Suggestion: The diff-block extraction loop (finding start/end bounds per diff --git header) is identical to the one in remove_files_from_patch. Consider extracting a private helper to avoid drift:

def _split_diff_blocks(git_patch: str) -> list[str]: """Split a git patch string into individual per-file diff blocks.""" if not git_patch: return [] matches = list(re.finditer(r"diff --git [^\n]*\n", git_patch)) if not matches: return [git_patch] # or [] — depends on caller contract return [ git_patch[m.start() : (matches[i + 1].start() if i + 1 < len(matches) else len(git_patch))] for i, m in enumerate(matches) ]

Both remove_files_from_patch and keep_only_test_files can then reduce to a filter over _split_diff_blocks(git_patch). Not a blocker for this PR, but worth doing before a third caller appears.

same pattern but no need to extract

Co-authored-by: OpenHands Bot <contact@all-hands.dev>

all-hands-bot · 2026-05-30T15:34:26Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review

Overall Assessment

Well-motivated, data-driven change with clean implementation and solid test coverage. The quantitative evidence in the PR description (45.5 % resolution rate for test-only patches vs. 1.2 % for mixed, vs. 0 % for source-only) is the best possible justification for this kind of post-processing step.

The implementation correctly mirrors remove_files_from_patch, the .py extension is already guarded in the test_ prefix branch (the suggestion from a prior review was already in the code), and the root-level scratch-file exclusion is sound for SWT-bench's usage pattern.

Two fresh observations below that aren't covered by existing threads.

Observations

remove_files_from_patch is now logically redundant (eval_infer.py line 210)
pyproject.toml, tox.ini, and setup.py are not test files, so keep_only_test_files would strip them regardless. The first call is a no-op in the happy path. Keeping it is fine for explicit intent documentation, but a brief inline comment (e.g. # also covered by keep_only_test_files below) would prevent future readers from wondering why two filter calls exist in sequence.
Root-level conftest.py is silently excluded (patch_utils.py line 84)
if "/" not in path: return False rejects every repo-root file — including conftest.py at the project root, which is a standard pytest configuration point in many real Python repos. For SWT-bench the intent is correct (agent scratch files at root should be dropped), but if a benchmark instance's canonical test infrastructure includes a root-level conftest.py and the agent correctly patches it, those changes will be stripped and the fix silently lost. Worth adding a test_drops_root_level_conftest test case to make the trade-off explicit and guard against accidentally loosening the root-exclusion rule in the future.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot · 2026-05-30T15:39:56Z

@@ -205,6 +208,13 @@ def convert_to_swtbench_format(input_file: str, output_file: str) -> None:
                # postprocess git_patch
                setup_files = ["pyproject.toml", "tox.ini", "setup.py"]
                git_patch = remove_files_from_patch(git_patch, setup_files)


With keep_only_test_files applied two lines below, this call is now logically redundant — pyproject.toml, tox.ini, and setup.py are not test files and would be stripped by the new filter anyway. Keeping it is fine for clarity, but a short follow-up comment (e.g. # belt-and-suspenders; also covered by keep_only_test_files below) would help future readers understand why two successive filter calls exist here.

all-hands-bot · 2026-05-30T15:39:56Z

+    match the naming pattern (e.g. ``test_repro.py``), because in SWT-bench
+    those are agent-authored scratch files that the harness ignores.
+    """
+    if "/" not in path:


This guard intentionally rejects root-level files (agent scratch files like test_repro.py) — but it also silently drops a root-level conftest.py, which is a legitimate pytest fixture in many projects. If a SWT-bench instance has a root-level conftest.py that the agent correctly patches, those changes will be stripped. Consider adding a test_drops_root_level_conftest test case to document this trade-off explicitly and guard against someone loosening the guard in the future without realising the impact:

def test_drops_root_level_conftest(self): # Root conftest.py is treated as agent scratch, not a real test file. patch = _diff('conftest.py') + _diff('tests/conftest.py') out = keep_only_test_files(patch) assert 'conftest.py' not in out.split('\n')[0] # root dropped assert 'tests/conftest.py' in out # package-level kept

juanmichelini added build-swt-bench Build SWT-Bench images based on SDK version on this PR. investigation labels May 13, 2026 — with OpenHands AI

juanmichelini added the build-swt-bench Build SWT-Bench images based on SDK version on this PR. label May 13, 2026

This was referenced May 13, 2026

swtbench: qwen3-coder-next score is artificially low — agent writes source-code fix alongside the test (78% of patches) #708

Open

DRAFT: swtbench: tighten default prompt to discourage non-test edits (option 1 of #708) #710

Closed

juanmichelini marked this pull request as ready for review May 19, 2026 16:59

all-hands-bot reviewed May 19, 2026

View reviewed changes

juanmichelini enabled auto-merge (squash) May 27, 2026 22:39

juanmichelini requested a review from all-hands-bot May 27, 2026 22:41

juanmichelini mentioned this pull request May 27, 2026

Assess impact of swtbench non-test patch stripping on historical runs #718

Open

all-hands-bot reviewed May 27, 2026

View reviewed changes

Update benchmarks/utils/patch_utils.py

2acf094

Co-authored-by: OpenHands Bot <contact@all-hands.dev>

juanmichelini requested a review from all-hands-bot May 30, 2026 15:33

all-hands-bot reviewed May 30, 2026

View reviewed changes

Conversation

juanmichelini commented May 13, 2026

What

Changes

Why

How to test

Risks / caveats

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

all-hands-bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

juanmichelini May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

all-hands-bot commented May 19, 2026

Code Review Summary

[CRITICAL ISSUES]

[IMPROVEMENT OPPORTUNITIES]

[TESTING GAPS]

[RISK ASSESSMENT]

VERDICT

Uh oh!

juanmichelini commented May 27, 2026

Uh oh!

juanmichelini commented May 27, 2026

Uh oh!

all-hands-bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

all-hands-bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

juanmichelini May 30, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review

Overall Assessment

Observations

Uh oh!

all-hands-bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

all-hands-bot commented May 27, 2026 •

edited

Loading

all-hands-bot commented May 30, 2026 •

edited

Loading