Skip to content

DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708)#711

Open
juanmichelini wants to merge 2 commits into
mainfrom
swtbench-strip-non-test-files
Open

DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708)#711
juanmichelini wants to merge 2 commits into
mainfrom
swtbench-strip-non-test-files

Conversation

@juanmichelini
Copy link
Copy Markdown
Collaborator

Addresses option 2 from #708.

What

Extends the existing setup-file strip in benchmarks/swtbench/eval_infer.py with a positive whitelist that keeps only test-file diffs in model_patch. SWT-bench scores nothing else, and non-test diffs in model_patch actively hurt the score because the agent's own source "fix" runs against the test and silences the F2P signal.

Changes

  • benchmarks/utils/patch_utils.py: new keep_only_test_files(git_patch) helper that mirrors the structure of the existing remove_files_from_patch. A file is kept iff its path lives under tests/, test/, testing/, or its basename matches test_*.py / *_test.py / conftest.py. Repo-root paths are intentionally rejected so agent-authored scratch files (reproduction.py, test_repro.py, FIX_SUMMARY.md, …) are dropped even when they look testy.
  • benchmarks/swtbench/eval_infer.py: call keep_only_test_files immediately after the existing remove_files_from_patch(setup_files) call. Same place, same shape as the tox.ini/pyproject.toml/setup.py strip.
  • tests/test_patch_utils_keep_test_files.py: unit tests covering tests-dir, testing-dir, conftest, suffix/prefix names, repo-root scratch rejection, build/docs rejection, and the all-non-test case.

Why

From run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463 (424 instances), cross-tabbing patch shape against the resolved set:

Patch category # instances Resolved Rate
Only test files 22 10 45.5 %
Mixed test + source code 332 4 1.2 %
Only source code / no test 64 0 0 %
Empty 3 0 0 %
Total 424 14 3.3 %

Stripping non-test diffs converts most mixed patches into the "only test files" shape — the population that already resolves at 45.5 %. Not every mixed patch will recover (some tests are calibrated to the agent's own fix or import symbols the agent added), but even a conservative recovery is multiples of the current score.

How to test

No re-inference required. Run benchmarks/swtbench/eval_infer.py against the existing output.jsonl from the bad run; the regenerated output.swtbench.jsonl will have only test-file diffs and the harness will rescore accordingly.

Unit tests:

pytest tests/test_patch_utils_keep_test_files.py

Risks / caveats

  • Heuristic, not ground truth. The whitelist is path-based. A more precise version would intersect with the gold test_patch file set per instance; deferred to a follow-up to keep this PR minimal.
  • Doesn't fix the root cause — the agent still wastes tokens on source-code fixes. Option 1 (prompt change, opened separately as DRAFT: swtbench: tighten default prompt to discourage non-test edits (option 1 of #708) #710) is complementary, not an alternative.
  • Existing runs change behavior. Any past SWT-bench run rescored on this code will see different numbers. Numbers reported on the benchmark monitor will need a re-eval pass to be consistent. Existing output.swtbench.jsonl files in completed runs are untouched on disk; only future invocations of eval_infer.py change.
  • The new test file lives at tests/test_patch_utils_keep_test_files.py; happy to move/rename if the repo has a different convention I missed.

Refs #708.


This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini.

@juanmichelini can click here to continue refining the PR

SWT-bench only scores diffs against existing test files. Today's
post-processing in eval_infer.py removes a small blacklist of setup
files (pyproject.toml, tox.ini, setup.py) but leaves everything else
alone. That means any source-code 'fix' the agent committed alongside
its test, plus scratch artifacts like reproduction.py, FIX_SUMMARY.md,
build/lib/*, and docs/ changes, all flow through into model_patch.

Source-code diffs are particularly harmful: when SWT-bench applies
model_patch and runs the new test, the test is now executing against
the agent's own 'fixed' code rather than the buggy code, so the F2P
signal is silenced and the instance is marked unresolved.

Empirically (run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463,
424 instances):

  Only test files in patch          : 22  instances, 10 resolved (45.5%)
  Mixed test + source code in patch : 332 instances,  4 resolved ( 1.2%)
  Only source code in patch         : 64  instances,  0 resolved ( 0.0%)
  Empty patch                       :  3  instances,  0 resolved ( 0.0%)
  Total                             : 424 instances, 14 resolved ( 3.3%)

This adds keep_only_test_files() in benchmarks/utils/patch_utils.py
that mirrors remove_files_from_patch() and applies the inverse
predicate: keep diffs whose target file lives under tests/, test/,
testing/, or whose basename matches test_*.py / *_test.py /
conftest.py. Repo-root paths are intentionally rejected so the agent's
own root-level scratch test_repro.py files are dropped too.

eval_infer.py applies this immediately after the existing setup-file
strip, so all existing SWT-bench runs can be rescored from
output.jsonl without re-running inference. See #708.

Co-authored-by: openhands <openhands@all-hands.dev>
@juanmichelini juanmichelini added build-swt-bench Build SWT-Bench images based on SDK version on this PR. investigation labels May 13, 2026 — with OpenHands AI
@juanmichelini juanmichelini added the build-swt-bench Build SWT-Bench images based on SDK version on this PR. label May 13, 2026
@juanmichelini juanmichelini marked this pull request as ready for review May 19, 2026 16:59
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Pragmatic solution with clear heuristics that addresses a real problem (45.5% → 1.2% resolution rate drop for mixed patches). The implementation is well-tested and consistent with existing patch utilities, but has important caveats around backward compatibility and the use of heuristics vs. ground truth.

# a non-test diff that lands in model_patch can silence the
# F2P signal because the test then runs against the agent's
# own patched code instead of the buggy code.
git_patch = keep_only_test_files(git_patch)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important: Missing Runtime Evidence - The PR description shows the problem exists (table with resolution rates), but doesn't show that applying this fix actually improves scores on a real evaluation run.

Please add an Evidence section to the PR description with:

  • The command used to run eval_infer.py on an existing bad run
  • Before/after resolved instance counts from the evaluation output
  • Confirmation that the score improved as expected

This proves the fix works in practice, not just in theory.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread benchmarks/swtbench/eval_infer.py
Comment thread benchmarks/utils/patch_utils.py
Comment thread benchmarks/utils/patch_utils.py
@all-hands-bot
Copy link
Copy Markdown
Collaborator

Code Review Summary

Taste Rating: 🟡 Acceptable - Works pragmatically but has caveats


[CRITICAL ISSUES]

None - the core logic is sound and well-tested.


[IMPROVEMENT OPPORTUNITIES]

Missing Runtime Evidence - The PR demonstrates the problem with data (45.5% vs 1.2% resolution rates for different patch types), but doesn't show that this fix actually improves scores when applied to a real evaluation run. See inline comment on eval_infer.py:217.

Backward Compatibility Impact - This changes evaluation semantics for all future and re-evaluated runs. Historical benchmark scores will not be directly comparable. See inline comment on eval_infer.py:217.

Heuristic vs Ground Truth - The implementation uses path-based heuristics rather than the gold test_patch file set mentioned in the PR description. The PR says a more precise version would use ground truth but doesn't explain why that approach wasn't taken. See inline comment on patch_utils.py:99.


[TESTING GAPS]

Tests are comprehensive, covering:

  • ✅ Empty input handling
  • ✅ Various directory structures (tests/, test/, testing/)
  • ✅ Naming patterns (test_*.py, *_test.py, conftest.py)
  • ✅ Root-level scratch file rejection
  • ✅ Build and docs directory handling
  • ✅ All-non-test edge case

Suggestion: Add test for singular test/ directory at root to document intentional behavior (see inline comment on patch_utils.py:87).


[RISK ASSESSMENT]

Overall PR: 🟡 MEDIUM RISK

Risk Factors:

  • Backward Compatibility: Changes evaluation behavior for all runs (past if re-evaluated, all future runs). Teams tracking scores over time will see discontinuity.
  • Heuristic Approach: Path-based filtering may have edge cases that ground-truth data wouldn't have. However, the heuristics are well-documented and tested.
  • Well-Defined Scope: Limited to SWT-bench eval post-processing, doesn't affect inference or other benchmarks.
  • Production Data: The PR includes analysis of 424 instances showing the problem exists and the expected benefit (45.5% resolution rate for test-only patches).

Recommendation: Safe to merge after addressing the evidence gap (run eval on a sample to confirm score improvement). Consider making this behavior opt-in via flag for one release cycle to give teams time to adjust their monitoring, then make it default.


VERDICT

Worth merging (after evidence added) - Core logic is sound, approach is pragmatic, tests are solid. The backward compatibility impact is significant but acceptable for a benchmark improvement that addresses a real scoring artifact.

KEY INSIGHT:
This fix converts SWT-bench evaluation from "penalizing agents for attempting source fixes" to "scoring only their test contributions," which better aligns with the benchmark's intent. The 45.5% vs 1.2% resolution rate data suggests the current scoring is dominated by noise from source patches interfering with F2P signals.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

@juanmichelini
Copy link
Copy Markdown
Collaborator Author

Copy link
Copy Markdown
Collaborator Author

Opened #718 to track assessing the impact on historical SWT-bench runs (re-evaluating representative past output.jsonls under the new code and tabulating delta vs. originally reported scores).

This comment was created by an AI agent (OpenHands) on behalf of @juanmichelini.

Copy link
Copy Markdown
Collaborator

all-hands-bot commented May 27, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, well-motivated change. The data in the PR description makes the benefit compelling (45.5% vs 1.2% resolution rate), the implementation is consistent with the existing remove_files_from_patch pattern, and the test file covers the important edge cases well.

Two things worth addressing before merge:

  1. test_ prefix matching is not restricted to .py filestest_requirements.txt, test_config.yaml, etc. would be kept in the patch. Whether this matters in practice on Python SWT-bench repos is debatable, but it's an unintentional asymmetry with _test.py (which does check the extension).

  2. Diff-block iteration is duplicated between keep_only_test_files and remove_files_from_patch (~15 lines of identical logic). As more patch utilities accumulate this will become a maintenance hazard.

Both are suggestions, not blockers. The backward-compatibility and runtime-evidence concerns raised in the earlier review have been addressed (issue #718 created, evaluation run linked).

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread benchmarks/utils/patch_utils.py Outdated
)


def keep_only_test_files(git_patch: str) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: The diff-block extraction loop (finding start/end bounds per diff --git header) is identical to the one in remove_files_from_patch. Consider extracting a private helper to avoid drift:

def _split_diff_blocks(git_patch: str) -> list[str]:
    """Split a git patch string into individual per-file diff blocks."""
    if not git_patch:
        return []
    matches = list(re.finditer(r"diff --git [^\n]*\n", git_patch))
    if not matches:
        return [git_patch]  # or [] — depends on caller contract
    return [
        git_patch[m.start() : (matches[i + 1].start() if i + 1 < len(matches) else len(git_patch))]
        for i, m in enumerate(matches)
    ]

Both remove_files_from_patch and keep_only_test_files can then reduce to a filter over _split_diff_blocks(git_patch). Not a blocker for this PR, but worth doing before a third caller appears.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same pattern but no need to extract

Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Copy link
Copy Markdown
Collaborator

all-hands-bot commented May 30, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overall Assessment

Well-motivated, data-driven change with clean implementation and solid test coverage. The quantitative evidence in the PR description (45.5 % resolution rate for test-only patches vs. 1.2 % for mixed, vs. 0 % for source-only) is the best possible justification for this kind of post-processing step.

The implementation correctly mirrors remove_files_from_patch, the .py extension is already guarded in the test_ prefix branch (the suggestion from a prior review was already in the code), and the root-level scratch-file exclusion is sound for SWT-bench's usage pattern.

Two fresh observations below that aren't covered by existing threads.


Observations

  1. remove_files_from_patch is now logically redundant (eval_infer.py line 210)
    pyproject.toml, tox.ini, and setup.py are not test files, so keep_only_test_files would strip them regardless. The first call is a no-op in the happy path. Keeping it is fine for explicit intent documentation, but a brief inline comment (e.g. # also covered by keep_only_test_files below) would prevent future readers from wondering why two filter calls exist in sequence.

  2. Root-level conftest.py is silently excluded (patch_utils.py line 84)
    if "/" not in path: return False rejects every repo-root file — including conftest.py at the project root, which is a standard pytest configuration point in many real Python repos. For SWT-bench the intent is correct (agent scratch files at root should be dropped), but if a benchmark instance's canonical test infrastructure includes a root-level conftest.py and the agent correctly patches it, those changes will be stripped and the fix silently lost. Worth adding a test_drops_root_level_conftest test case to make the trade-off explicit and guard against accidentally loosening the root-exclusion rule in the future.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

@@ -205,6 +208,13 @@ def convert_to_swtbench_format(input_file: str, output_file: str) -> None:
# postprocess git_patch
setup_files = ["pyproject.toml", "tox.ini", "setup.py"]
git_patch = remove_files_from_patch(git_patch, setup_files)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With keep_only_test_files applied two lines below, this call is now logically redundant — pyproject.toml, tox.ini, and setup.py are not test files and would be stripped by the new filter anyway. Keeping it is fine for clarity, but a short follow-up comment (e.g. # belt-and-suspenders; also covered by keep_only_test_files below) would help future readers understand why two successive filter calls exist here.

match the naming pattern (e.g. ``test_repro.py``), because in SWT-bench
those are agent-authored scratch files that the harness ignores.
"""
if "/" not in path:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This guard intentionally rejects root-level files (agent scratch files like test_repro.py) — but it also silently drops a root-level conftest.py, which is a legitimate pytest fixture in many projects. If a SWT-bench instance has a root-level conftest.py that the agent correctly patches, those changes will be stripped. Consider adding a test_drops_root_level_conftest test case to document this trade-off explicitly and guard against someone loosening the guard in the future without realising the impact:

def test_drops_root_level_conftest(self):
    # Root conftest.py is treated as agent scratch, not a real test file.
    patch = _diff('conftest.py') + _diff('tests/conftest.py')
    out = keep_only_test_files(patch)
    assert 'conftest.py' not in out.split('\n')[0]  # root dropped
    assert 'tests/conftest.py' in out              # package-level kept

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build-swt-bench Build SWT-Bench images based on SDK version on this PR. investigation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants