swtbench: qwen3-coder-next score is artificially low — agent writes source-code fix alongside the test (78% of patches)

## Summary

On the recent `swtbench` run [`litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463`](https://results.eval.all-hands.dev/swtbench/litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463/results.tar.gz) the model resolves **14 / 424 (3.3 %)** instances ([monitor link](https://openhands-eval-monitor.vercel.app/?days=90&benchmark=swtbench&text=qwen&run=swtbench%2Flitellm_proxy-openrouter-qwen-qwen3-coder-next%2F24103435463)). Analyzing the predicted patches shows the score is almost entirely explained by the agent modifying *source code* alongside its test, which destroys the F2P signal SWT-bench scores on.

## Evidence

Classifying every patch in `output.swtbench.jsonl` by which kinds of files it touches and cross-tabbing with `output.report.json`:

| Patch category                              | # instances | Resolved | Rate      |
| ------------------------------------------- | ----------: | -------: | --------- |
| **Only test files** (clean)                 |          22 |       10 | **45.5 %** |
| Mixed: test files **and** source code       |         332 |        4 | 1.2 %     |
| Only source code / no test                  |          64 |        0 | 0 %       |
| Empty                                       |           3 |        0 | 0 %       |
| **Total**                                   |     **424** |   **14** | **3.3 %** |

Two observations:

1. The agent's tests are fine when left alone: the "only test files" subset resolves at **45.5 %** (10 / 22), comparable to what we'd expect for this model class.
2. The same agent's solve rate drops to **1.2 %** (4 / 332) as soon as the patch also touches source. That's the expected SWT-bench failure mode: if the model's patch also "fixes" the bug, its own test already passes with `model_patch` applied, before the gold fix, so it earns no F2P (fail-to-pass) credit.

**78 % of patches (332 / 424)** fall into the "mixed" bucket — that is what is dragging the score down.
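A classification like the one above could be reproduced with a sketch along these lines. The test-path heuristic, the `instance_id` field name, and the shape of `resolved_ids` are assumptions made for illustration; they are not necessarily the exact rules or schema behind the table.

```python
import re
from collections import Counter

# Heuristic for "looks like a test file" (an assumption, not the analysis's exact rule).
TEST_PATH = re.compile(
    r"(^|/)(tests?|testing)/|(^|/)(test_[^/]*|[^/]*_test|conftest)\.py$"
)
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def touched_files(patch: str) -> list:
    """Return the target ("b/") path of every per-file diff in a git patch."""
    return [m.group(2) for m in DIFF_HEADER.finditer(patch)]


def categorize(patch: str) -> str:
    """Bucket a patch by the kinds of files it touches."""
    files = touched_files(patch)
    if not files:
        return "empty"
    has_test = any(TEST_PATH.search(f) for f in files)
    has_other = any(not TEST_PATH.search(f) for f in files)
    if has_test and has_other:
        return "mixed"
    return "only_test" if has_test else "only_source"


def cross_tab(rows, resolved_ids):
    """Map category -> (total, resolved) over prediction rows.

    `rows` is an iterable of dicts with "instance_id" and "model_patch" keys
    (assumed field names); `resolved_ids` is a set of resolved instance ids.
    """
    total, resolved = Counter(), Counter()
    for row in rows:
        cat = categorize(row["model_patch"])
        total[cat] += 1
        if row["instance_id"] in resolved_ids:
            resolved[cat] += 1
    return {cat: (total[cat], resolved[cat]) for cat in total}
```

Feeding it the parsed lines of `output.swtbench.jsonl` and the resolved ids from `output.report.json` would yield per-bucket counts, though the exact numbers depend on how strictly "test file" is defined.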

### What the extra (non-test) changes look like

Across the 424 patches the non-test files modified are:

```
  814  source code (e.g. sympy/core/basic.py, django/db/models/sql/query.py, …)
   65  build/lib/...                              (1 instance, psf__requests-1142)
   18  docs/ or doc/
    7  reproduction.py / reproduction_script.py   (repo root)
    1  FIX_SUMMARY.md
    +  15 patches also contain root-level scratch test_*.py files alongside real tests
```

The dominant pattern is the agent producing a real source-code fix attempt in addition to the test.

For context, `eval_infer.py` already strips a small whitelist of non-source files post-hoc:

https://github.com/OpenHands/benchmarks/blob/2e01f409bc752c3ed107d3e036d18651afad6cf1/benchmarks/swtbench/eval_infer.py#L206

```python
setup_files = ["pyproject.toml", "tox.ini", "setup.py"]
git_patch = remove_files_from_patch(git_patch, setup_files)
```

…but it does nothing about source code or the scratch artifacts above.

## Three options

I'd like to discuss before picking a direction — each has a different cost/risk profile.

### Option 1 — Tighten the prompt for `qwen3-coder-next` only

Add a model-specific prompt override that makes it crystal-clear SWT-bench wants *only* a test patch and that modifying non-test files is harmful.

- **Pros:** Fixes the root cause (agent behavior), no eval-side trickery. Likely also helps families of models that share the same instruction-following shortcomings.
- **Cons:** Per-model prompt branching is a precedent we've tried to avoid; it doesn't help other runs that already exhibit the same pattern; and it requires a re-run to measure (~$$$).
- **Scope:** Touches `benchmarks/swtbench/prompts/` and the prompt-selection logic; needs a re-run of inference.

### Option 2 — Improve post-processing to strip all non-test files

Extend the existing `remove_files_from_patch` step in `eval_infer.py` with a *positive* whitelist: keep only diffs whose target file looks like a test (under `tests/`, `test/`, `testing/`, or named `test_*.py` / `*_test.py` / `conftest.py`; drop root-level files), and optionally intersect with the gold `test_patch` file set when the dataset row is available.
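A minimal sketch of such a positive whitelist, assuming the patch is a standard `git diff` with one `diff --git` header per file. `looks_like_test` and `keep_only_test_files` are hypothetical helpers written for this issue; they are not the existing `remove_files_from_patch`, whose signature is not reproduced here.

```python
import re

TEST_DIRS = {"tests", "test", "testing"}


def looks_like_test(path: str) -> bool:
    """Whitelist rule from the description above; root-level files are dropped."""
    parts = path.split("/")
    if len(parts) == 1:
        return False  # root-level file, e.g. a scratch test_foo.py
    name = parts[-1]
    in_test_dir = any(p in TEST_DIRS for p in parts[:-1])
    test_name = name == "conftest.py" or (
        name.endswith(".py")
        and (name.startswith("test_") or name.endswith("_test.py"))
    )
    return in_test_dir or test_name


def keep_only_test_files(patch: str, allowed=None) -> str:
    """Keep only per-file diffs whose target path looks like a test.

    `allowed` is an optional set of paths (e.g. the files touched by the gold
    `test_patch`) to intersect with; None disables that extra check.
    """
    # Split at every "diff --git" header, keeping each header with its chunk.
    chunks = re.split(r"(?m)^(?=diff --git )", patch)
    kept = []
    for chunk in chunks:
        m = re.match(r"diff --git a/(.+?) b/(.+)$", chunk, re.MULTILINE)
        if m is None:
            continue  # preamble or empty string before the first header
        path = m.group(2)
        if looks_like_test(path) and (allowed is None or path in allowed):
            kept.append(chunk)
    return "".join(kept)
```

Working on whole per-file chunks keeps every retained diff byte-for-byte intact, so the filtered patch should still apply cleanly as long as the kept files do not depend on the dropped ones.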

- **Pros:** Mechanically identical to the existing `tox.ini/pyproject.toml/setup.py` strip — a few lines in `benchmarks/utils/patch_utils.py` + one call site in `eval_infer.py`. **No re-inference needed**; we can rescore this exact run from `output.jsonl`. Generalizes to every current and future SWT-bench run.
- **Cons:** Doesn't fix the root cause: the agent is still wasting tokens (and being penalized) by writing source-code fixes. Not a guaranteed +X% lift: some mixed-patch tests are calibrated against the model's own fix (e.g. assertions match the model's expected behavior, or the test imports a symbol the model added in the source patch). Stripping makes those tests fail on both the buggy and the gold code, so they still get no credit.
- **Expected upside:** The 45.5 % "only-test" rate is a rough ceiling on what stripping can recover from the 332 mixed patches. A more realistic 10–15 % solve rate on that subset (about 33–50 instances, on top of the 10 clean resolutions) would lift the overall score to roughly 10–14 %. Even a conservative recovery is a multiple of the current 3.3 %.
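The arithmetic behind the overall estimate can be checked with a few lines. This sketch assumes the 10 clean resolutions stay resolved and that the assumed rate on the mixed subset already includes the 4 instances resolved today; `overall_rate` is a hypothetical helper.

```python
TOTAL = 424           # all instances in the run
CLEAN_RESOLVED = 10   # resolved instances in the "only test files" bucket
MIXED = 332           # patches touching both tests and source


def overall_rate(mixed_rate: float) -> float:
    """Overall resolve rate if the mixed subset resolved at `mixed_rate`."""
    return (CLEAN_RESOLVED + mixed_rate * MIXED) / TOTAL


# Assumed recovery scenarios on the mixed subset, plus the 10/22 ceiling.
for rate in (0.10, 0.15, 10 / 22):
    print(f"mixed {rate:.1%} -> overall {overall_rate(rate):.1%}")
```

The "only source" and "empty" buckets contribute nothing in any scenario, which is why the overall figure stays well below the rate assumed for the mixed subset.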

### Option 3 — Leave the low result as-is

Treat the 3.3 % as an accurate signal that `qwen3-coder-next` is bad at SWT-bench in its current configuration, and not paper over it.

- **Pros:** Honest signal. Avoids per-model knobs and avoids the post-hoc-fix-up question of "are we measuring the agent or our scrubber?".
- **Cons:** We're knowingly leaving signal on the table: the same agent demonstrably solves 45.5 % of the instances where it restricts itself to test files. The benchmark is supposed to measure "can the model write a good test for this bug"; right now we're measuring "can the model write a good test *and* resist the urge to also fix it", which is a different question.

## Recommendation request

Option 2 is the cheapest and most general; it's a direct extension of code that already exists in the repo. Option 1 is more principled but more expensive and narrower. Happy to open a PR for whichever direction we pick (or any combination — Options 1 and 2 are not mutually exclusive).

---

_This issue was filed by an AI agent (OpenHands) on behalf of @juanmichelini after analyzing the run artifacts._

