fix(swebenchmultimodal): skip evaluation when predictions file is empty by juanmichelini · Pull Request #730 · OpenHands/benchmarks

juanmichelini · 2026-05-29T14:16:35Z

Summary

When every inference attempt for a swebenchmultimodal run fails (for example, the LiteLLM proxy returns 5xx errors for every conversation), the OpenHands output.jsonl is empty and the converted SWE-Bench predictions file is empty too. The SWE-Bench harness then takes its empty-dataset branch (if not dataset: print("No instances to run.")) and exits cleanly without calling make_run_report, so the per-model report.json is never written.

Today, benchmarks/swebenchmultimodal/eval_infer.py proceeds to look for that report unconditionally and raises a misleading error:

ERROR Script failed: Expected report file not found: .../OpenHands.<run_id>.json.
       SWE-Bench harness output naming may have changed.

That hides the real failure (every inference attempt errored) and incorrectly suggests an upstream harness regression.
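
To make the failure mode concrete, here is a toy model of the sequence described above. It is not the real harness or the real eval_infer.py code: toy_harness, find_report_unconditionally, and demo_empty_run are hypothetical names, and the only behaviour modelled is "exit 0 without writing a report when there is nothing to run, then look for the report anyway".

```python
import tempfile
from pathlib import Path


def toy_harness(predictions: list, report_path: Path) -> int:
    """Hypothetical stand-in for the harness: writes a report only if there is work."""
    if not predictions:
        print("No instances to run.")
        return 0  # clean exit, but no report file is written
    report_path.write_text("{}")
    return 0


def find_report_unconditionally(report_path: Path) -> Path:
    """Hypothetical stand-in for the old lookup: assumes a clean exit implies a report."""
    if not report_path.exists():
        raise FileNotFoundError(f"Expected report file not found: {report_path}")
    return report_path


def demo_empty_run() -> str:
    """Run the toy harness with zero predictions and describe what the lookup sees."""
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "report.json"
        exit_code = toy_harness([], report)
        try:
            find_report_unconditionally(report)
        except FileNotFoundError:
            return f"exit={exit_code}, report missing"
        return f"exit={exit_code}, report found"
```

The point of the sketch is that the harness exit code alone cannot distinguish "nothing to evaluate" from "evaluated successfully", so the missing report gets blamed on naming rather than on the empty input.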

Repro

Failing run from the eval monitor:

Run: swebenchmultimodal/litellm_proxy-gemini-3-5-flash/26595627813
Model: litellm_proxy/gemini-3.5-flash, eval_limit=1
The single instance Automattic__wp-calypso-25725 failed on all 4 retries with LLMServiceUnavailableError: Vertex_aiException InternalServerError - { code: 500 ... } while sending the multimodal message (Sending instruction with 1 valid images).
output.jsonl ended at 0 bytes → "Conversion complete: 0 entries converted, 0 errors" → harness exited 0 → eval script crashed with the bogus "harness output naming may have changed" error.

Fix

run_swebench_multimodal_evaluation now:

Counts non-empty lines in the converted predictions file before invoking the harness.
If there are 0 predictions, logs a clear warning explaining that every inference attempt likely failed, and returns None.
main() already handles the report_path is None case: it skips component scoring, still runs generate_cost_report, and emits {"report_json": ""} on stdout. Pipeline output stays well-formed, and the real failure (inference errors) is surfaced cleanly.

No change for the normal happy path: if predictions exist, behaviour is unchanged.
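
For readers who want the shape of the check without opening the diff, the following is a minimal sketch of the guard described above, not the code in this PR. count_predictions, run_evaluation_sketch, emit_result, and the run_harness callback are hypothetical names; only the behaviour (count non-empty lines, warn, return None, emit an empty report_json) comes from the description. What main() emits when a report does exist is an assumption here.

```python
import json
import logging
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def count_predictions(predictions_file: Path) -> int:
    """Count non-empty lines in a JSONL predictions file."""
    with predictions_file.open("r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def run_evaluation_sketch(
    predictions_file: Path,
    run_harness: Callable[[Path], Path],  # hypothetical: runs the harness, returns the report path
) -> Optional[Path]:
    """Skip the harness entirely when there is nothing to evaluate."""
    if count_predictions(predictions_file) == 0:
        logger.warning(
            "Predictions file %s is empty; every inference attempt likely failed. "
            "Skipping evaluation.",
            predictions_file,
        )
        return None
    return run_harness(predictions_file)


def emit_result(report_path: Optional[Path]) -> str:
    """Build the stdout payload; the empty string for None follows the PR description."""
    # Assumption: a real report path is emitted as a plain string.
    return json.dumps({"report_json": str(report_path) if report_path else ""})
```

Counting lines with line.strip() treats whitespace-only lines as empty, so a file containing only newlines is handled the same way as a 0-byte file.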

Out of scope / follow-up

The same defensive check would benefit benchmarks/swebench/eval_infer.py and other SWE-Bench-style variants (swebenchpro, swesmith, swebenchmultilingual, swtbench, etc.), which share the same pattern. Happy to follow up in a separate PR — keeping this one focused on the reported regression.

Links

Tracking issue: OpenHands/software-agent-sdk#3435 (swebenchmultimodal eval_infer.py: misleading FileNotFoundError when all inference attempts fail (empty predictions file))

This PR was opened by an AI agent (OpenHands) on behalf of @juanmichelini.


When every inference attempt fails (e.g. LLM 5xx errors) the OpenHands output.jsonl is empty, so the converted predictions file is also empty. The SWE-Bench harness then prints "No instances to run." and exits 0 without calling make_run_report, so the per-model report.json is never created.

The previous code unconditionally looked for that report after the harness exited and raised:

FileNotFoundError: Expected report file not found: ....json. SWE-Bench harness output naming may have changed.

That message hides the real failure (all inference attempts errored) and falsely suggests an upstream harness regression.

Detect the empty-predictions case up front, log a clear warning, and return None. main() already handles report_path=None by emitting {"report_json": ""} and exiting cleanly.

Refs OpenHands/software-agent-sdk#3435

Co-authored-by: openhands <openhands@all-hands.dev>

juanmichelini added the bug Something isn't working label May 29, 2026 — with OpenHands AI

juanmichelini mentioned this pull request May 29, 2026

swebenchmultimodal eval_infer.py: misleading FileNotFoundError when all inference attempts fail (empty predictions file) OpenHands/software-agent-sdk#3435

Open

juanmichelini wants to merge 1 commit into main from fix/swebenchmultimodal-empty-predictions
