swtbench: qwen3-coder-next score is artificially low — agent writes source-code fix alongside the test (78% of patches) #708

@juanmichelini

Summary

On the recent swtbench run litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463 the model resolves 14 / 424 (3.3 %) instances (monitor link). Analyzing the predicted patches shows the score is almost entirely explained by the agent modifying source code alongside its test, which destroys the fail-to-pass (F2P) signal that SWT-bench scores on.

Evidence

Classifying every patch in output.swtbench.jsonl by which kinds of files it touches and cross-tabbing with output.report.json:

Patch category                       # instances   Resolved   Rate
Only test files (clean)                       22         10   45.5 %
Mixed: test files and source code            332          4    1.2 %
Only source code / no test                    64          0    0 %
Empty                                          3          0    0 %
Total                                        424         14    3.3 %
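A minimal sketch of how such a classification could be reproduced, assuming each prediction is available as a git-style unified diff string. The names `classify`, `is_test_file`, and `cross_tab` are hypothetical, the test-file heuristic is an assumption rather than the exact rule used for the table above, and the layout of output.swtbench.jsonl and output.report.json is not assumed here (the caller supplies plain Python objects).

```python
import re
from collections import Counter

# Matches the per-file header line of a git-style unified diff.
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def touched_files(patch: str) -> list:
    """Return the target path of every per-file diff in the patch."""
    return [m.group(2) for m in DIFF_HEADER.finditer(patch)]


def is_test_file(path: str) -> bool:
    """Heuristic (assumption): test directories or conventional test names."""
    parts = path.split("/")
    name = parts[-1]
    in_test_dir = any(p in {"tests", "test", "testing"} for p in parts[:-1])
    named_like_test = (
        name.startswith("test_") or name.endswith("_test.py") or name == "conftest.py"
    )
    return in_test_dir or named_like_test


def classify(patch: str) -> str:
    """Bucket a patch as empty, only_test, only_source, or mixed."""
    files = touched_files(patch)
    if not files:
        return "empty"
    has_test = any(is_test_file(f) for f in files)
    has_other = any(not is_test_file(f) for f in files)
    if has_test and has_other:
        return "mixed"
    return "only_test" if has_test else "only_source"


def cross_tab(patches: dict, resolved: set) -> dict:
    """Map each bucket to (number of instances, number resolved)."""
    totals, solved = Counter(), Counter()
    for instance_id, patch in patches.items():
        bucket = classify(patch)
        totals[bucket] += 1
        solved[bucket] += int(instance_id in resolved)
    return {bucket: (totals[bucket], solved[bucket]) for bucket in totals}
```

Here `patches` maps an instance id to its predicted patch and `resolved` is the set of resolved instance ids. Note that this heuristic counts a root-level test_*.py file as a test, so scratch files would need a stricter rule.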

Two observations:

  1. The agent's tests are actually fine when left alone: a 45.5 % solve rate on the "only test files" subset, comparable to what we'd expect for this model class.
  2. The same agent's solve rate drops to 1.2 % as soon as the patch also touches source. That's the expected SWT-bench failure mode: if the model's patch also "fixes" the bug, its own test no longer fails once model_patch is applied, so it earns no F2P credit.

78 % of patches (332 / 424) fall into the "mixed" bucket — that is what is dragging the score down.
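The failure mode in observation 2 can be illustrated with a toy model. This is a simplified sketch, not the SWT-bench harness: it assumes the generated test correctly reproduces the bug and that the gold fix applies cleanly on top of the model's patch. The function name `f2p_credit` is hypothetical.

```python
def f2p_credit(model_patch_fixes_bug: bool) -> bool:
    """Toy fail-to-pass check for a bug-reproducing test.

    Assumption: the test passes exactly when the bug is fixed.
    """
    # State 1: repository with only the model's patch applied.
    # The test passes here only if that patch already fixed the bug.
    passes_before = model_patch_fixes_bug
    # State 2: the gold fix is applied on top, so the bug is fixed either way.
    passes_after = True
    # Credit requires a failure before the gold fix and a pass after it.
    return (not passes_before) and passes_after


print(f2p_credit(model_patch_fixes_bug=False))  # test-only patch
print(f2p_credit(model_patch_fixes_bug=True))   # mixed patch with a working fix
```

Under these assumptions a test-only patch earns credit, while the same test shipped with a working source fix does not, because it never fails.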

What the extra (non-test) changes look like

Across the 424 patches the non-test files modified are:

  814  source code (e.g. sympy/core/basic.py, django/db/models/sql/query.py, …)
   65  build/lib/...                              (1 instance, psf__requests-1142)
   18  docs/ or doc/
    7  reproduction.py / reproduction_script.py   (repo root)
    1  FIX_SUMMARY.md
    +  15 patches also contain root-level scratch test_*.py files alongside real tests

The dominant pattern is the agent producing a real source-code fix attempt in addition to the test.

For context, eval_infer.py already strips a small whitelist of non-source files post-hoc:

setup_files = ["pyproject.toml", "tox.ini", "setup.py"]
git_patch = remove_files_from_patch(git_patch, setup_files)

…but it does nothing about source code or the scratch artifacts above.

Three options

I'd like to discuss before picking a direction — each has a different cost/risk profile.

Option 1 — Tighten the prompt for qwen3-coder-next only

Add a model-specific prompt override that makes it crystal-clear SWT-bench wants only a test patch and that modifying non-test files is harmful.

  • Pros: Fixes the root cause (agent behavior), no eval-side trickery. Likely also helps families of models that share the same instruction-following shortcomings.
  • Cons: Per-model prompt branching is a precedent we've tried to avoid; it doesn't help other runs that already exhibit the same pattern; and it requires a re-run to measure (~$$$).
  • Scope: Touches benchmarks/swtbench/prompts/ and the prompt-selection logic; needs a re-run of inference.

Option 2 — Improve post-processing to strip all non-test files

Extend the existing remove_files_from_patch step in eval_infer.py with a positive whitelist: keep only diffs whose target file looks like a test (under tests/, test/, testing/, or named test_*.py / *_test.py / conftest.py; drop root-level files), and optionally intersect with the gold test_patch file set when the dataset row is available.
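A rough sketch of what such a positive filter might look like, assuming git_patch is a git-style unified diff. The function names `looks_like_test` and `keep_only_test_files` and the `gold_test_files` parameter are hypothetical; this is not the existing remove_files_from_patch helper, whose implementation is not shown here.

```python
import re
from fnmatch import fnmatchcase
from typing import Optional

# Captures the target ("b/") path from a per-file diff header.
HEADER = re.compile(r"diff --git a/\S+ b/(\S+)")
TEST_DIRS = {"tests", "test", "testing"}


def looks_like_test(path: str) -> bool:
    """Apply the proposed rule: test dirs or test-like names, no root-level files."""
    parts = path.split("/")
    if len(parts) == 1:
        return False  # root-level files (scratch tests, reproduction.py) are dropped
    name = parts[-1]
    in_test_dir = any(p in TEST_DIRS for p in parts[:-1])
    named_like_test = (
        fnmatchcase(name, "test_*.py")
        or fnmatchcase(name, "*_test.py")
        or name == "conftest.py"
    )
    return in_test_dir or named_like_test


def keep_only_test_files(git_patch: str, gold_test_files: Optional[set] = None) -> str:
    """Keep only per-file diffs whose target looks like a test.

    If gold_test_files is given, additionally require the path to be in it.
    """
    kept, keep_current = [], False
    for line in git_patch.splitlines(keepends=True):
        if line.startswith("diff --git "):
            match = HEADER.match(line)
            path = match.group(1) if match else ""
            keep_current = looks_like_test(path) and (
                gold_test_files is None or path in gold_test_files
            )
        if keep_current:
            kept.append(line)
    return "".join(kept)
```

Filtering by allow-list rather than deny-list means unknown artifacts such as FIX_SUMMARY.md or build/lib/... are dropped by default, at the cost of also dropping legitimate tests that live in unconventional locations.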

  • Pros: Mechanically identical to the existing tox.ini/pyproject.toml/setup.py strip — a few lines in benchmarks/utils/patch_utils.py + one call site in eval_infer.py. No re-inference needed; we can rescore this exact run from output.jsonl. Generalizes to every current and future SWT-bench run.
  • Cons: Doesn't fix the root cause — the agent is still wasting tokens (and being penalized) by writing source-code fixes. Not a guaranteed +X% lift: some mixed-patch tests are calibrated against the model's own fix (e.g. assertions match the model's expected behavior, or the test imports a symbol the model added in the source patch); stripping makes those tests fail on both buggy and gold and they still get no credit.
  • Expected upside: The 45.5 % "only-test" rate is an upper bound on what stripping can recover from the 332 mixed patches; a more realistic 10–15 % solve rate on that subset would lift the overall score into roughly the 8–15 % range. Even a conservative recovery would be a multiple of the current score.
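The expected-upside estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch using only the counts from the Evidence table; it assumes the hypothetical recovery rate applies to the whole mixed subset (replacing the 4 mixed instances currently resolved) and that the other buckets stay unchanged.

```python
TOTAL = 424                   # all instances in the run
MIXED = 332                   # patches touching both tests and source
RESOLVED_OUTSIDE_MIXED = 10   # all from the "only test files" bucket
CLEAN_RATE = 10 / 22          # solve rate on the "only test files" bucket


def overall_rate(mixed_recovery_rate: float) -> float:
    """Overall solve rate if the mixed subset were solved at the given rate."""
    resolved = RESOLVED_OUTSIDE_MIXED + MIXED * mixed_recovery_rate
    return resolved / TOTAL


for rate in (4 / 332, 0.10, 0.15, CLEAN_RATE):
    print(f"{rate:.1%} on mixed -> {overall_rate(rate):.1%} overall")
# 1.2% on mixed -> 3.3% overall    (current run)
# 10.0% on mixed -> 10.2% overall
# 15.0% on mixed -> 14.1% overall
# 45.5% on mixed -> 38.0% overall  (upper bound)
```

Under these assumptions a 10–15 % recovery lands at about 10–14 % overall, which sits inside the range quoted above.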

Option 3 — Leave the low result as-is

Treat the 3.3 % as an accurate signal that qwen3-coder-next is bad at SWT-bench in its current configuration, and not paper over it.

  • Pros: Honest signal. Avoids per-model knobs and avoids the post-hoc-fix-up question of "are we measuring the agent or our scrubber?".
  • Cons: We're knowingly leaving signal on the table — the same agent demonstrably solves 45 % of the instances it self-restrains on. The benchmark is supposed to measure "can the model write a good test for this bug"; right now we're measuring "can the model write a good test and resist the urge to also fix it", which is a different question.

Recommendation request

Option 2 is the cheapest and most general; it's a direct extension of code that already exists in the repo. Option 1 is more principled but more expensive and narrower. Happy to open a PR for whichever direction we pick (or any combination — Options 1 and 2 are not mutually exclusive).


This issue was filed by an AI agent (OpenHands) on behalf of @juanmichelini after analyzing the run artifacts.
