fix(swebenchmultimodal): skip evaluation when predictions file is empty #730

Open
juanmichelini wants to merge 1 commit into main from fix/swebenchmultimodal-empty-predictions

@juanmichelini (Collaborator)

Summary

When every inference attempt for a swebenchmultimodal run fails (for example, the LiteLLM proxy returns 5xx errors for every conversation), the OpenHands output.jsonl is empty, and so is the converted SWE-Bench predictions file. The SWE-Bench harness then takes its empty-dataset branch (if not dataset: print("No instances to run.")) and exits cleanly without calling make_run_report, so the per-model report.json is never written.

Today, benchmarks/swebenchmultimodal/eval_infer.py proceeds to look for that report unconditionally and raises a misleading error:

ERROR Script failed: Expected report file not found: .../OpenHands.<run_id>.json.
       SWE-Bench harness output naming may have changed.

That hides the real failure (every inference attempt errored) and incorrectly suggests an upstream harness regression.

Repro

Failing run from the eval monitor:

  • Run: swebenchmultimodal/litellm_proxy-gemini-3-5-flash/26595627813
  • Model: litellm_proxy/gemini-3.5-flash, eval_limit=1
  • The single instance Automattic__wp-calypso-25725 failed on all 4 retries with LLMServiceUnavailableError: Vertex_aiException InternalServerError - { code: 500 ... } while sending the multimodal message (Sending instruction with 1 valid images).
  • output.jsonl ended at 0 bytes → "Conversion complete: 0 entries converted, 0 errors" → harness exited 0 → eval script crashed with the bogus "harness output naming may have changed" error.

Fix

run_swebench_multimodal_evaluation now:

  1. Counts non-empty lines in the converted predictions file before invoking the harness.
  2. If there are 0 predictions, logs a clear warning explaining that every inference attempt likely failed, and returns None.
  3. main() already handles report_path is None: it skips component scoring, still runs generate_cost_report, and emits {"report_json": ""} on stdout. Pipeline output stays well-formed, and the real failure (inference errors) is surfaced cleanly.

No change for the normal happy path: if predictions exist, behaviour is unchanged.
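
For reviewers who want the shape of the guard without opening the diff, here is a minimal sketch of the three steps above. It is not the code in this PR: count_predictions, run_evaluation_sketch, and format_result are hypothetical names, the harness invocation is replaced by a placeholder, and the returned report path is made up for illustration.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def count_predictions(predictions_file: Path) -> int:
    """Count non-empty lines in a JSONL predictions file."""
    with predictions_file.open("r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def run_evaluation_sketch(predictions_file: Path) -> Optional[Path]:
    """Hypothetical stand-in for run_swebench_multimodal_evaluation."""
    if count_predictions(predictions_file) == 0:
        # Nothing to evaluate: the harness would print "No instances to run."
        # and write no report, so bail out before invoking it.
        logger.warning(
            "Predictions file %s is empty; every inference attempt likely "
            "failed. Skipping the SWE-Bench harness.",
            predictions_file,
        )
        return None
    # Placeholder (assumption): the real function invokes the harness here
    # and then locates the report it wrote.
    return predictions_file.with_suffix(".report.json")


def format_result(report_path: Optional[Path]) -> str:
    """Build the stdout payload; an empty string stands in for no report."""
    return json.dumps({"report_json": str(report_path) if report_path else ""})
```

Counting stripped lines rather than checking the file size means a predictions file containing only blank lines or trailing newlines is also treated as empty.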

Out of scope / follow-up

The same defensive check would benefit benchmarks/swebench/eval_infer.py and other SWE-Bench-style variants (swebenchpro, swesmith, swebenchmultilingual, swtbench, etc.), which share the same pattern. Happy to follow up in a separate PR; this one stays focused on the reported regression.


This PR was opened by an AI agent (OpenHands) on behalf of @juanmichelini.

When every inference attempt fails (e.g. LLM 5xx errors) the OpenHands
output.jsonl is empty, so the converted predictions file is also empty.
The SWE-Bench harness then prints "No instances to run." and exits 0
without calling make_run_report, so the per-model report.json is never
created.

The previous code unconditionally looked for that report after the
harness exited and raised:

    FileNotFoundError: Expected report file not found: ....json.
    SWE-Bench harness output naming may have changed.

That message hides the real failure (all inference attempts errored)
and falsely suggests an upstream harness regression.

Detect the empty-predictions case up front, log a clear warning, and
return None. main() already handles report_path=None by emitting
{"report_json": ""} and exiting cleanly.

Refs OpenHands/software-agent-sdk#3435

Co-authored-by: openhands <openhands@all-hands.dev>
Labels: bug (Something isn't working)
