DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708)#711
DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708)#711juanmichelini wants to merge 2 commits into
Conversation
SWT-bench only scores diffs against existing test files. Today's post-processing in eval_infer.py removes a small blacklist of setup files (pyproject.toml, tox.ini, setup.py) but leaves everything else alone. That means any source-code 'fix' the agent committed alongside its test, plus scratch artifacts like reproduction.py, FIX_SUMMARY.md, build/lib/*, and docs/ changes, all flow through into model_patch. Source-code diffs are particularly harmful: when SWT-bench applies model_patch and runs the new test, the test is now executing against the agent's own 'fixed' code rather than the buggy code, so the F2P signal is silenced and the instance is marked unresolved. Empirically (run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463, 424 instances): Only test files in patch : 22 instances, 10 resolved (45.5%) Mixed test + source code in patch : 332 instances, 4 resolved ( 1.2%) Only source code in patch : 64 instances, 0 resolved ( 0.0%) Empty patch : 3 instances, 0 resolved ( 0.0%) Total : 424 instances, 14 resolved ( 3.3%) This adds keep_only_test_files() in benchmarks/utils/patch_utils.py that mirrors remove_files_from_patch() and applies the inverse predicate: keep diffs whose target file lives under tests/, test/, testing/, or whose basename matches test_*.py / *_test.py / conftest.py. Repo-root paths are intentionally rejected so the agent's own root-level scratch test_repro.py files are dropped too. eval_infer.py applies this immediately after the existing setup-file strip, so all existing SWT-bench runs can be rescored from output.jsonl without re-running inference. See #708. Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Acceptable - Pragmatic solution with clear heuristics that addresses a real problem (45.5% → 1.2% resolution rate drop for mixed patches). The implementation is well-tested and consistent with existing patch utilities, but has important caveats around backward compatibility and the use of heuristics vs. ground truth.
| # a non-test diff that lands in model_patch can silence the | ||
| # F2P signal because the test then runs against the agent's | ||
| # own patched code instead of the buggy code. | ||
| git_patch = keep_only_test_files(git_patch) |
There was a problem hiding this comment.
🟠 Important: Missing Runtime Evidence - The PR description shows the problem exists (table with resolution rates), but doesn't show that applying this fix actually improves scores on a real evaluation run.
Please add an Evidence section to the PR description with:
- The command used to run
eval_infer.pyon an existing bad run - Before/after resolved instance counts from the evaluation output
- Confirmation that the score improved as expected
This proves the fix works in practice, not just in theory.
There was a problem hiding this comment.
Code Review SummaryTaste Rating: 🟡 Acceptable - Works pragmatically but has caveats [CRITICAL ISSUES]None - the core logic is sound and well-tested. [IMPROVEMENT OPPORTUNITIES]Missing Runtime Evidence - The PR demonstrates the problem with data (45.5% vs 1.2% resolution rates for different patch types), but doesn't show that this fix actually improves scores when applied to a real evaluation run. See inline comment on Backward Compatibility Impact - This changes evaluation semantics for all future and re-evaluated runs. Historical benchmark scores will not be directly comparable. See inline comment on Heuristic vs Ground Truth - The implementation uses path-based heuristics rather than the gold [TESTING GAPS]Tests are comprehensive, covering:
Suggestion: Add test for singular [RISK ASSESSMENT]Overall PR: 🟡 MEDIUM RISK Risk Factors:
Recommendation: Safe to merge after addressing the evidence gap (run eval on a sample to confirm score improvement). Consider making this behavior opt-in via flag for one release cycle to give teams time to adjust their monitoring, then make it default. VERDICT✅ Worth merging (after evidence added) - Core logic is sound, approach is pragmatic, tests are solid. The backward compatibility impact is significant but acceptable for a benchmark improvement that addresses a real scoring artifact. KEY INSIGHT:
|
|
Opened #718 to track assessing the impact on historical SWT-bench runs (re-evaluating representative past This comment was created by an AI agent (OpenHands) on behalf of @juanmichelini. |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Solid, well-motivated change. The data in the PR description makes the benefit compelling (45.5% vs 1.2% resolution rate), the implementation is consistent with the existing remove_files_from_patch pattern, and the test file covers the important edge cases well.
Two things worth addressing before merge:
-
test_prefix matching is not restricted to.pyfiles —test_requirements.txt,test_config.yaml, etc. would be kept in the patch. Whether this matters in practice on Python SWT-bench repos is debatable, but it's an unintentional asymmetry with_test.py(which does check the extension). -
Diff-block iteration is duplicated between
keep_only_test_filesandremove_files_from_patch(~15 lines of identical logic). As more patch utilities accumulate this will become a maintenance hazard.
Both are suggestions, not blockers. The backward-compatibility and runtime-evidence concerns raised in the earlier review have been addressed (issue #718 created, evaluation run linked).
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
| ) | ||
|
|
||
|
|
||
| def keep_only_test_files(git_patch: str) -> str: |
There was a problem hiding this comment.
🟡 Suggestion: The diff-block extraction loop (finding start/end bounds per diff --git header) is identical to the one in remove_files_from_patch. Consider extracting a private helper to avoid drift:
def _split_diff_blocks(git_patch: str) -> list[str]:
"""Split a git patch string into individual per-file diff blocks."""
if not git_patch:
return []
matches = list(re.finditer(r"diff --git [^\n]*\n", git_patch))
if not matches:
return [git_patch] # or [] — depends on caller contract
return [
git_patch[m.start() : (matches[i + 1].start() if i + 1 < len(matches) else len(git_patch))]
for i, m in enumerate(matches)
]Both remove_files_from_patch and keep_only_test_files can then reduce to a filter over _split_diff_blocks(git_patch). Not a blocker for this PR, but worth doing before a third caller appears.
There was a problem hiding this comment.
same pattern but no need to extract
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
Overall Assessment
Well-motivated, data-driven change with clean implementation and solid test coverage. The quantitative evidence in the PR description (45.5 % resolution rate for test-only patches vs. 1.2 % for mixed, vs. 0 % for source-only) is the best possible justification for this kind of post-processing step.
The implementation correctly mirrors remove_files_from_patch, the .py extension is already guarded in the test_ prefix branch (the suggestion from a prior review was already in the code), and the root-level scratch-file exclusion is sound for SWT-bench's usage pattern.
Two fresh observations below that aren't covered by existing threads.
Observations
-
remove_files_from_patchis now logically redundant (eval_infer.pyline 210)
pyproject.toml,tox.ini, andsetup.pyare not test files, sokeep_only_test_fileswould strip them regardless. The first call is a no-op in the happy path. Keeping it is fine for explicit intent documentation, but a brief inline comment (e.g.# also covered by keep_only_test_files below) would prevent future readers from wondering why two filter calls exist in sequence. -
Root-level
conftest.pyis silently excluded (patch_utils.pyline 84)
if "/" not in path: return Falserejects every repo-root file — includingconftest.pyat the project root, which is a standard pytest configuration point in many real Python repos. For SWT-bench the intent is correct (agent scratch files at root should be dropped), but if a benchmark instance's canonical test infrastructure includes a root-levelconftest.pyand the agent correctly patches it, those changes will be stripped and the fix silently lost. Worth adding atest_drops_root_level_conftesttest case to make the trade-off explicit and guard against accidentally loosening the root-exclusion rule in the future.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
| @@ -205,6 +208,13 @@ def convert_to_swtbench_format(input_file: str, output_file: str) -> None: | |||
| # postprocess git_patch | |||
| setup_files = ["pyproject.toml", "tox.ini", "setup.py"] | |||
| git_patch = remove_files_from_patch(git_patch, setup_files) | |||
There was a problem hiding this comment.
With keep_only_test_files applied two lines below, this call is now logically redundant — pyproject.toml, tox.ini, and setup.py are not test files and would be stripped by the new filter anyway. Keeping it is fine for clarity, but a short follow-up comment (e.g. # belt-and-suspenders; also covered by keep_only_test_files below) would help future readers understand why two successive filter calls exist here.
| match the naming pattern (e.g. ``test_repro.py``), because in SWT-bench | ||
| those are agent-authored scratch files that the harness ignores. | ||
| """ | ||
| if "/" not in path: |
There was a problem hiding this comment.
This guard intentionally rejects root-level files (agent scratch files like test_repro.py) — but it also silently drops a root-level conftest.py, which is a legitimate pytest fixture in many projects. If a SWT-bench instance has a root-level conftest.py that the agent correctly patches, those changes will be stripped. Consider adding a test_drops_root_level_conftest test case to document this trade-off explicitly and guard against someone loosening the guard in the future without realising the impact:
def test_drops_root_level_conftest(self):
# Root conftest.py is treated as agent scratch, not a real test file.
patch = _diff('conftest.py') + _diff('tests/conftest.py')
out = keep_only_test_files(patch)
assert 'conftest.py' not in out.split('\n')[0] # root dropped
assert 'tests/conftest.py' in out # package-level kept
Addresses option 2 from #708.
What
Extends the existing setup-file strip in
benchmarks/swtbench/eval_infer.pywith a positive whitelist that keeps only test-file diffs inmodel_patch. SWT-bench scores nothing else, and non-test diffs inmodel_patchactively hurt the score because the agent's own source "fix" runs against the test and silences the F2P signal.Changes
benchmarks/utils/patch_utils.py: newkeep_only_test_files(git_patch)helper that mirrors the structure of the existingremove_files_from_patch. A file is kept iff its path lives undertests/,test/,testing/, or its basename matchestest_*.py/*_test.py/conftest.py. Repo-root paths are intentionally rejected so agent-authored scratch files (reproduction.py,test_repro.py,FIX_SUMMARY.md, …) are dropped even when they look testy.benchmarks/swtbench/eval_infer.py: callkeep_only_test_filesimmediately after the existingremove_files_from_patch(setup_files)call. Same place, same shape as the tox.ini/pyproject.toml/setup.py strip.tests/test_patch_utils_keep_test_files.py: unit tests covering tests-dir, testing-dir, conftest, suffix/prefix names, repo-root scratch rejection, build/docs rejection, and the all-non-test case.Why
From run
litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463(424 instances), cross-tabbing patch shape against the resolved set:Stripping non-test diffs converts most mixed patches into the "only test files" shape — the population that already resolves at 45.5 %. Not every mixed patch will recover (some tests are calibrated to the agent's own fix or import symbols the agent added), but even a conservative recovery is multiples of the current score.
How to test
No re-inference required. Run
benchmarks/swtbench/eval_infer.pyagainst the existingoutput.jsonlfrom the bad run; the regeneratedoutput.swtbench.jsonlwill have only test-file diffs and the harness will rescore accordingly.Unit tests:
Risks / caveats
test_patchfile set per instance; deferred to a follow-up to keep this PR minimal.output.swtbench.jsonlfiles in completed runs are untouched on disk; only future invocations ofeval_infer.pychange.tests/test_patch_utils_keep_test_files.py; happy to move/rename if the repo has a different convention I missed.Refs #708.
This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini.
@juanmichelini can click here to continue refining the PR