Summary
On the recent swtbench run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463, the model resolves 14 / 424 (3.3 %) instances (monitor link). Analyzing the predicted patches shows that the score is almost entirely explained by the agent modifying source code alongside its test, which destroys the fail-to-pass (F2P) signal that SWT-bench scores on.
Evidence
Classifying every patch in output.swtbench.jsonl by which kinds of files it touches and cross-tabbing with output.report.json:
| Patch category | # instances | Resolved | Rate |
| --- | --- | --- | --- |
| Only test files (clean) | 22 | 10 | 45.5 % |
| Mixed: test files and source code | 332 | 4 | 1.2 % |
| Only source code / no test | 64 | 0 | 0 % |
| Empty | 3 | 0 | 0 % |
| Total | 424 | 14 | 3.3 % |
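As a rough illustration of how such a classification could be done, the sketch below buckets a unified diff by the files named in its `diff --git` headers. The helper names (`is_test_path`, `classify_patch`) and the test-path heuristic are assumptions made for this example; the rule actually used to build the table may differ, for instance in how root-level scratch files are treated.

```python
import re
from pathlib import PurePosixPath

# Assumption: a file is a "test file" if it lives under tests/, test/ or
# testing/, or is named test_*.py, *_test.py or conftest.py.
TEST_DIRS = {"tests", "test", "testing"}

# Matches the per-file header of a git-style unified diff.
# Paths containing spaces are not handled by this simple pattern.
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def is_test_path(path: str) -> bool:
    """Heuristic check for whether a repo-relative path is a test file."""
    p = PurePosixPath(path)
    name = p.name
    in_test_dir = any(part in TEST_DIRS for part in p.parts[:-1])
    has_test_name = (
        name == "conftest.py"
        or (name.startswith("test_") and name.endswith(".py"))
        or name.endswith("_test.py")
    )
    return in_test_dir or has_test_name


def classify_patch(patch: str) -> str:
    """Return 'empty', 'only_test', 'only_source' or 'mixed'."""
    paths = [m.group(2) for m in DIFF_HEADER.finditer(patch)]
    if not paths:
        return "empty"
    n_test = sum(is_test_path(p) for p in paths)
    if n_test == len(paths):
        return "only_test"
    if n_test == 0:
        return "only_source"
    return "mixed"
```

Counting the returned labels per instance and joining them with the resolved instance IDs would then give a cross-tab of the same shape as the table above.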
Two observations:
- The agent's tests are actually fine when left alone — 45 % solve rate on the "only test files" subset, comparable to what we'd expect for this model class.
- The same agent's solve rate drops to 1.2 % as soon as the patch also touches source. That's the expected SWT-bench failure mode: if the model's patch also "fixes" the bug, its own test no longer fails on the state with model_patch applied, so it earns no F2P credit.
78 % of patches (332 / 424) fall into the "mixed" bucket — that is what is dragging the score down.
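The failure mode can be made concrete with a toy model. The sketch below is a deliberately simplified picture of F2P scoring, not the actual SWT-bench harness: `buggy_add`, `fixed_add`, `model_test`, and `f2p_credit` are all hypothetical names, and the "repo state" is reduced to a single function.

```python
from typing import Callable

Add = Callable[[int, int], int]


def buggy_add(a: int, b: int) -> int:
    """Original repo state: contains the bug."""
    return a - b


def fixed_add(a: int, b: int) -> int:
    """State after a correct fix (gold patch, or the model's own fix)."""
    return a + b


def model_test(add: Add) -> bool:
    """The model-written test; True means the test passes."""
    return add(2, 3) == 5


def f2p_credit(test: Callable[[Add], bool], pre_state: Add, post_state: Add) -> bool:
    """A test earns credit only if it fails before the fix and passes after."""
    fails_before = not test(pre_state)
    passes_after = test(post_state)
    return fails_before and passes_after


# Test-only patch: the pre-state is still the buggy code, so the test fails first.
test_only = f2p_credit(model_test, buggy_add, fixed_add)

# Mixed patch: the model's source fix is already part of the pre-state,
# so the test passes before the gold patch and there is nothing to flip.
mixed = f2p_credit(model_test, fixed_add, fixed_add)
```

In this toy setup `test_only` is `True` and `mixed` is `False`, even though the test itself is identical in both cases.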
What the extra (non-test) changes look like
Across the 424 patches the non-test files modified are:
814 source code (e.g. sympy/core/basic.py, django/db/models/sql/query.py, …)
65 build/lib/... (1 instance, psf__requests-1142)
18 docs/ or doc/
7 reproduction.py / reproduction_script.py (repo root)
1 FIX_SUMMARY.md
+ 15 patches also contain root-level scratch test_*.py files alongside real tests
The dominant pattern is the agent producing a real source-code fix attempt in addition to the test.
For context, eval_infer.py (benchmarks/swtbench/eval_infer.py, line 206 at 2e01f40) already strips a small fixed list of setup files post-hoc:
setup_files = ["pyproject.toml", "tox.ini", "setup.py"]
git_patch = remove_files_from_patch(git_patch, setup_files)
…but it does nothing about source code or the scratch artifacts above.
Three options
I'd like to discuss before picking a direction — each has a different cost/risk profile.
Option 1 — Tighten the prompt for qwen3-coder-next only
Add a model-specific prompt override that makes it crystal-clear SWT-bench wants only a test patch and that modifying non-test files is harmful.
- Pros: Fixes the root cause (agent behavior), no eval-side trickery. Likely also helps families of models that share the same instruction-following shortcomings.
- Cons: Per-model prompt branching is a precedent we've tried to avoid; it does nothing for runs already completed that exhibit the same pattern; and measuring the effect requires a re-run of inference (~$$$).
- Scope: Touches benchmarks/swtbench/prompts/ and the prompt-selection logic; needs a re-run of inference.
Option 2 — Improve post-processing to strip all non-test files
Extend the existing remove_files_from_patch step in eval_infer.py with a positive whitelist: keep only diffs whose target file looks like a test (under tests/, test/, testing/, or named test_*.py / *_test.py / conftest.py; drop root-level files), and optionally intersect with the gold test_patch file set when the dataset row is available.
- Pros: Mechanically identical to the existing tox.ini/pyproject.toml/setup.py strip: a few lines in benchmarks/utils/patch_utils.py + one call site in eval_infer.py. No re-inference needed; we can rescore this exact run from output.jsonl. Generalizes to every current and future SWT-bench run.
- Cons: Doesn't fix the root cause — the agent is still wasting tokens (and being penalized) by writing source-code fixes. Not a guaranteed +X% lift: some mixed-patch tests are calibrated against the model's own fix (e.g. assertions match the model's expected behavior, or the test imports a symbol the model added in the source patch); stripping makes those tests fail on both buggy and gold and they still get no credit.
- Expected upside: The 45.5 % "only-test" rate is best read as an upper bound on what stripping can recover from the 332 mixed patches. A more realistic 10–15 % solve rate on that subset, added to the 10 test-only instances already resolved, gives (0.10 × 332 + 10) / 424 ≈ 10 % to (0.15 × 332 + 10) / 424 ≈ 14 % overall. Even a conservative recovery is a multiple of the current 3.3 %.
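A minimal sketch of what the Option 2 filter might look like, assuming patches are git-style unified diffs. `looks_like_test` and `keep_only_test_files` are hypothetical names for illustration and are not existing functions in benchmarks/utils/patch_utils.py; the optional intersection with the gold test_patch file set is left out.

```python
import re
from pathlib import PurePosixPath

# Assumption: these directory names mark test code, as listed in Option 2.
TEST_DIRS = {"tests", "test", "testing"}


def looks_like_test(path: str) -> bool:
    """Allowlist rule: test dirs or test-like names; root-level files are dropped."""
    p = PurePosixPath(path)
    if len(p.parts) < 2:
        # Root-level file (e.g. reproduction.py or a scratch test_*.py).
        return False
    name = p.name
    return (
        any(part in TEST_DIRS for part in p.parts[:-1])
        or name == "conftest.py"
        or (name.startswith("test_") and name.endswith(".py"))
        or name.endswith("_test.py")
    )


def keep_only_test_files(git_patch: str) -> str:
    """Return the patch with every per-file diff outside the allowlist removed."""
    # Split right before each "diff --git" header so each chunk is one file's diff.
    chunks = re.split(r"(?m)^(?=diff --git )", git_patch)
    kept = []
    for chunk in chunks:
        header = re.match(r"diff --git a/(\S+) b/(\S+)", chunk)
        if header and looks_like_test(header.group(2)):
            kept.append(chunk)
    return "".join(kept)
```

An allowlist is used here rather than extending the existing list of files to strip because the set of non-test files the agent may touch is open-ended, while the set of test locations is small. The trade-off is that test helpers in unconventional locations would be dropped as well.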
Option 3 — Leave the low result as-is
Treat the 3.3 % as an accurate signal that qwen3-coder-next is bad at SWT-bench in its current configuration, and not paper over it.
- Pros: Honest signal. Avoids per-model knobs and avoids the post-hoc-fix-up question of "are we measuring the agent or our scrubber?".
- Cons: We're knowingly leaving signal on the table — the same agent demonstrably solves 45 % of the instances it self-restrains on. The benchmark is supposed to measure "can the model write a good test for this bug"; right now we're measuring "can the model write a good test and resist the urge to also fix it", which is a different question.
Recommendation request
Option 2 is the cheapest and most general; it's a direct extension of code that already exists in the repo. Option 1 is more principled but more expensive and narrower. Happy to open a PR for whichever direction we pick (or any combination — Options 1 and 2 are not mutually exclusive).
This issue was filed by an AI agent (OpenHands) on behalf of @juanmichelini after analyzing the run artifacts.