fix(swebenchmultimodal): skip evaluation when predictions file is empty #730
Open
juanmichelini wants to merge 1 commit into
When every inference attempt fails (e.g. LLM 5xx errors) the OpenHands
output.jsonl is empty, so the converted predictions file is also empty.
The SWE-Bench harness then prints "No instances to run." and exits 0
without calling make_run_report, so the per-model report.json is never
created.
The previous code unconditionally looked for that report after the
harness exited and raised:
FileNotFoundError: Expected report file not found: ....json.
SWE-Bench harness output naming may have changed.
That message hides the real failure (all inference attempts errored)
and falsely suggests an upstream harness regression.
Detect the empty-predictions case up front, log a clear warning, and
return None. main() already handles report_path=None by emitting
{"report_json": ""} and exiting cleanly.
Refs OpenHands/software-agent-sdk#3435
Co-authored-by: openhands <openhands@all-hands.dev>
Summary
When every inference attempt for a swebenchmultimodal run fails (for example, the LiteLLM proxy returns 5xx errors for every conversation), the OpenHands output.jsonl is empty and the converted SWE-Bench predictions file is empty too. The SWE-Bench harness then takes its if not dataset: print("No instances to run.") branch and exits cleanly without calling make_run_report, so the per-model report.json is never written.

Today, benchmarks/swebenchmultimodal/eval_infer.py proceeds to look for that report unconditionally and raises a misleading error:

FileNotFoundError: Expected report file not found: ....json. SWE-Bench harness output naming may have changed.

That hides the real failure (every inference attempt errored) and incorrectly suggests an upstream harness regression.
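The failure mode can be reproduced in miniature. The sketch below is not the code from eval_infer.py: find_report_unconditionally, demo_misleading_error, and the file names predictions.jsonl and model.report.json are hypothetical stand-ins, and the harness is simulated by simply never writing the report.

```python
import tempfile
from pathlib import Path


def find_report_unconditionally(report_path: Path) -> Path:
    """Hypothetical stand-in for the pre-fix lookup: assumes the report exists."""
    if not report_path.exists():
        raise FileNotFoundError(
            f"Expected report file not found: {report_path}. "
            "SWE-Bench harness output naming may have changed."
        )
    return report_path


def demo_misleading_error() -> str:
    """Simulate an empty predictions file and a harness that wrote no report."""
    with tempfile.TemporaryDirectory() as tmp:
        predictions = Path(tmp) / "predictions.jsonl"  # hypothetical name
        predictions.touch()  # zero bytes: every inference attempt failed
        report = Path(tmp) / "model.report.json"  # hypothetical name, never written
        try:
            find_report_unconditionally(report)
        except FileNotFoundError as exc:
            # The message blames harness naming, not the empty predictions file.
            return str(exc)
    return ""


if __name__ == "__main__":
    print(demo_misleading_error())
```

Nothing in the raised message mentions that the predictions file was empty, which is why the error points readers at the wrong component.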
Repro
Failing run from the eval monitor:
swebenchmultimodal/litellm_proxy-gemini-3-5-flash/26595627813
litellm_proxy/gemini-3.5-flash, eval_limit=1
Automattic__wp-calypso-25725 failed on all 4 retries with LLMServiceUnavailableError: Vertex_aiException InternalServerError - { code: 500 ... } while sending the multimodal message (Sending instruction with 1 valid images).
output.jsonl ended at 0 bytes → "Conversion complete: 0 entries converted, 0 errors" → harness exited 0 → eval script crashed with the bogus "harness output naming may have changed" error.
Fix
run_swebench_multimodal_evaluation now detects the empty-predictions case up front, logs a clear warning, and returns None.
main() already handles report_path is None: it skips component scoring, still runs generate_cost_report, and emits {"report_json": ""} on stdout. Pipeline output stays well-formed; the real failure (inference errors) is surfaced cleanly.
No change for the normal happy path: if predictions exist, behaviour is unchanged.
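A minimal sketch of the guard, under stated assumptions: predictions_file_is_empty, run_evaluation_sketch, and format_result are hypothetical names, not the functions in this PR, and the harness invocation is elided. Treating a missing file or a file of blank lines as empty is also an assumption; the actual check may differ.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def predictions_file_is_empty(predictions_path: Path) -> bool:
    """Return True if the file is missing or has no non-blank lines (assumed rule)."""
    if not predictions_path.exists():
        return True
    with predictions_path.open("r", encoding="utf-8") as fh:
        return not any(line.strip() for line in fh)


def run_evaluation_sketch(predictions_path: Path, report_path: Path) -> Optional[Path]:
    """Hypothetical wrapper: skip the harness when there is nothing to evaluate."""
    if predictions_file_is_empty(predictions_path):
        logger.warning(
            "Predictions file %s is empty; skipping evaluation. "
            "All inference attempts likely failed.",
            predictions_path,
        )
        return None
    # The real harness invocation would go here (elided in this sketch).
    if not report_path.exists():
        raise FileNotFoundError(f"Expected report file not found: {report_path}")
    return report_path


def format_result(report_path: Optional[Path]) -> str:
    """Mirror the described stdout contract: an empty string when there is no report."""
    return json.dumps({"report_json": str(report_path) if report_path else ""})
```

Checking before the harness runs keeps the "report not found" error reserved for cases where predictions existed and a report was genuinely expected.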
Out of scope / follow-up
The same defensive check would benefit benchmarks/swebench/eval_infer.py and other SWE-Bench-style variants (swebenchpro, swesmith, swebenchmultilingual, swtbench, etc.), which share the same pattern. Happy to follow up in a separate PR; this one stays focused on the reported regression.
Links
This PR was opened by an AI agent (OpenHands) on behalf of @juanmichelini.