Context
PR #711 changes benchmarks/swtbench/eval_infer.py to strip non-test diffs from model_patch during post-processing (keeping only files under tests/, test/, testing/, or matching test_*.py / *_test.py / conftest.py).
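The filtering rule described above can be illustrated with a minimal sketch. This is not the code from PR #711: the function names `is_test_path` and `strip_non_test_diffs` are hypothetical, the sketch assumes a git-style patch whose file sections start with `diff --git a/... b/...`, it assumes "under tests/" means any directory component (not only the repository root), and it does not handle paths containing spaces.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Directory names and filename patterns taken from the description above.
TEST_DIRS = {"tests", "test", "testing"}
TEST_NAME_PATTERNS = ("test_*.py", "*_test.py", "conftest.py")


def is_test_path(path: str) -> bool:
    """Return True if `path` counts as a test file under the stated rule.

    Assumption: a test directory may appear at any depth in the path.
    """
    p = PurePosixPath(path)
    if any(part in TEST_DIRS for part in p.parts[:-1]):
        return True
    return any(fnmatchcase(p.name, pat) for pat in TEST_NAME_PATTERNS)


def strip_non_test_diffs(model_patch: str) -> str:
    """Keep only the per-file sections of a git-style diff that touch test files."""
    # Split before every "diff --git" header; the lookahead keeps each
    # header attached to its own section.
    sections = re.split(r"(?m)^(?=diff --git )", model_patch)
    kept = []
    for section in sections:
        match = re.match(r"diff --git a/(\S+) b/(\S+)", section)
        # Sections without a recognizable header (e.g. leading text) are dropped.
        if match and is_test_path(match.group(2)):
            kept.append(section)
    return "".join(kept)
```

Under these assumptions, a patch that edits both `src/foo.py` and `tests/test_foo.py` is reduced to the `tests/test_foo.py` section only, which is why a run whose test patch depended on the source-code change could score differently after re-evaluation.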
This is a behavior change at evaluation time: any historical SWT-bench output.jsonl re-evaluated under the new code will produce different output.swtbench.jsonl and different scores than what was originally reported. Existing output.swtbench.jsonl files on disk are not modified, but published numbers and the benchmark monitor are no longer directly comparable to future runs.
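To make "different scores" concrete, the comparison between an originally reported result and a re-evaluated one might look like the sketch below. The record layout is an assumption: the field names `instance_id` and `resolved` are hypothetical and may not match the actual report format, and `compare_runs` is an illustrative helper, not part of the benchmarks repository.

```python
from __future__ import annotations

import json
from typing import Iterable


def resolved_ids(jsonl_lines: Iterable[str]) -> set[str]:
    """Collect ids of instances marked resolved.

    Assumption: each line is a JSON object with hypothetical
    `instance_id` and `resolved` fields.
    """
    ids: set[str] = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("resolved"):
            ids.add(record["instance_id"])
    return ids


def compare_runs(original: set[str], reevaluated: set[str], total: int) -> dict:
    """Summarize how the resolved set moved between two evaluations."""
    if total <= 0:
        raise ValueError("total must be a positive instance count")
    return {
        # Resolved originally, but not after stripping non-test diffs.
        "regressed": sorted(original - reevaluated),
        # Resolved only under the new evaluation code.
        "newly_resolved": sorted(reevaluated - original),
        "original_rate": len(original) / total,
        "new_rate": len(reevaluated) / total,
    }
```

In practice the two sets would come from the originally reported results and from a fresh evaluation of the same `output.jsonl`; the `regressed` list is the part that shows whether any source-code diff was actually needed.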
Goal
Quantify and document the impact of this change on historical SWT-bench runs, so we know:
- How much the headline resolved-rate moves for each past run (and per model/config).
- Whether previously-resolved instances regress (i.e. cases where the source-code diff was actually needed for the test patch to apply / pass).
- [...] test_patch file set per instance, noted as a follow-up in PR #711 ("DRAFT: swtbench: strip non-test files from model_patch in post-processing (option 2 of #708)")).
Suggested approach
- [...] litellm_proxy-openrouter-qwen-qwen3-coder-next/24103435463 run referenced in PR #711).
- For each run, re-run benchmarks/swtbench/eval_infer.py on the existing output.jsonl with the new code and diff the resolved set against the originally reported set.
- Decide whether the benchmark monitor needs a one-shot historical re-eval pass to keep numbers consistent, and whether any leaderboard-facing artifacts need annotation.
Related
This issue was created by an AI agent (OpenHands) on behalf of @juanmichelini, per their request on PR #711.