
SWE-bench eval_limit=500 stalls: all runtime pods stuck in pending, no progress for 45+ minutes #580

@simonrosenberg

Summary

SWE-bench ACP-claude eval with eval_limit=500 stalls completely — runtime pods never leave pending status, the orchestrator hangs with no log output for 45+ minutes, and 0/500 instances complete.

This affects every benchmark that uses source-minimal images built with SDK >= v1.15.0 (commit b99f15f), not just SWE-bench. Confirmed broken: swebench, swebenchmultimodal, commit0. Only GAIA (binary build) is unaffected.

Details

Triggered run:

  • K8s job: eval-23622798964-claude-4-6
  • SDK run: 23622781040
  • Eval run: 23622798964
  • CID: 2A6EC49A
  • Params: benchmark=swebench, model=claude-4.6-opus, agent_type=acp-claude, eval_limit=500, benchmarks_branch=fix/copy-python-runtime-for-eval-images

Other affected runs (same root cause):

| Job | Benchmark | Branch | Status |
|---|---|---|---|
| eval-23622889791-claude-son | swebenchmultimodal | fix/swebenchmultimodal-api-timeout | All 5 instances CreateContainerError |
| eval-23624504367-claude-son | swebenchmultimodal | fix/swebenchmultimodal-api-timeout | All 5 instances CreateContainerError |
| commit0 (deleted) | commit0 | unknown | All instances failed, job deleted |

A concurrent GAIA run on the same SDK is progressing fine: eval-23622799654-claude-4-6 uses b99f15f-gaia-binary images (different build path, not affected).

Observed Behavior

  1. Job starts, image build succeeds, 8 instances are launched at 00:01:28 UTC
  2. All 8 instances immediately fail with Runtime not yet ready (status: pending), runtime_id=None, session_id=None
  3. Retries also fail with the same error (32 total failures across 8 instances, all on retry 2-3)
  4. Last log line at 00:21:52 UTC — no output for 45+ minutes after that
  5. Progress bar stuck at 0/500
  6. runtime-pods namespace shows no pods at all: No resources found in runtime-pods namespace
  7. Orchestrator pod itself is healthy (Running, 1/1 READY)

Root Cause

SDK v1.15.0 broke all source-minimal eval images. PR OpenHands/software-agent-sdk#2567 changed the agent-server Dockerfile builder from --managed-python (self-contained venv) to --python-preference only-system (venv symlinks to system Python). The eval-base images don't have Python 3.13 at /usr/local/bin/, so the symlink is broken.

Every source-minimal runtime pod fails with:

CreateContainerError: exec: "/agent-server/.venv/bin/python":
stat /agent-server/.venv/bin/python: no such file or directory

Pod status logger confirms: 10 out of 13 runtime pods are in CreateContainerError (all source-minimal), while 3 GAIA pods (-binary build) run fine.
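
The failure mode described above, a venv whose bin/python is a symlink to an interpreter that is not present in the final image, can be reproduced and detected without a cluster. The following is a minimal sketch, not code from the SDK: the function names and the throwaway directory layout are hypothetical, and it only assumes that the venv interpreter is a symlink, as the root cause analysis states.

```python
import os
import tempfile
from pathlib import Path


def describe_interpreter(venv_dir: str) -> str:
    """Classify <venv_dir>/bin/python as 'ok', 'dangling-symlink', or 'missing'."""
    python = Path(venv_dir) / "bin" / "python"
    if python.exists():      # follows symlinks: True only if the target exists
        return "ok"
    if python.is_symlink():  # the link is there, but its target is not
        return "dangling-symlink"
    return "missing"


def demo() -> dict:
    """Build throwaway venv-like layouts (hypothetical paths) and classify each."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        # Case 1: the link target exists, as in an image that ships the interpreter.
        real = root / "usr-local-bin" / "python3.13"
        real.parent.mkdir(parents=True)
        real.write_text("#!/bin/sh\n")
        good_bin = root / "good-venv" / "bin"
        good_bin.mkdir(parents=True)
        os.symlink(real, good_bin / "python")
        results["good"] = describe_interpreter(str(good_bin.parent))

        # Case 2: the link target is absent, as in the broken eval images.
        bad_bin = root / "bad-venv" / "bin"
        bad_bin.mkdir(parents=True)
        os.symlink(root / "nowhere" / "python3.13", bad_bin / "python")
        results["bad"] = describe_interpreter(str(bad_bin.parent))

        # Case 3: no venv at all.
        results["absent"] = describe_interpreter(str(root / "no-venv"))
    return results


if __name__ == "__main__":
    print(demo())
```

A check of this kind, run inside a built image against /agent-server/.venv before the tag is pushed, would distinguish a dangling symlink from a venv that was never copied, which the stat error alone does not.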

The PR #578 fix is not sufficient. The Dockerfile.agent-layer changes are present on both branches (fix/copy-python-runtime-for-eval-images and fix/swebenchmultimodal-api-timeout) and the builds succeed, but pods still crash. The fix was validated against only one instance (sympy__sympy-23824); all other instances remain broken, likely because:

  1. Stale images cached in GHCR from pre-fix builds (tag b99f15f-* already existed), or
  2. Python 3.13 is not at /usr/local/bin/python3.13 in the builder (the assumed COPY path may be wrong)

Contributing factors

  • Cluster resource exhaustion: 4 concurrent evals saturate the 23-node runtime cluster (17/23 insufficient CPU, 19/23 insufficient memory)
  • Warm runtime pool broken: warm_runtimes.py crashes with relation "runtimes" does not exist — all sessions cold-start
  • Orchestrator hangs silently after retries exhaust instead of failing fast

Suggested Fix

Step 1: Verify the Python path in the builder image

docker run --rm ghcr.io/openhands/eval-builder:b99f15f ls -la /usr/local/bin/python*
docker run --rm ghcr.io/openhands/eval-builder:b99f15f readlink -f /agent-server/.venv/bin/python

If Python is not at /usr/local/bin/python3.13, update the COPY paths in Dockerfile.agent-layer to match the actual location.

Step 2: Force-rebuild images (purge GHCR cache)

The b99f15f-* image tags in GHCR are stale (built before the Dockerfile fix). Either:

  • Delete the stale tags from ghcr.io/openhands/eval-agent-server and re-trigger builds, or
  • Use a new SDK commit so the image tags are fresh (e.g. abc1234-sweb.eval.* has never been built, guaranteeing no cache hit)

Step 3: Merge the corrected fix to main

Once verified, merge PR #578 (with correct COPY paths) to main so ALL benchmark branches inherit the fix. Currently every feature branch needs the fix independently.

Step 4: Address cluster saturation (separate issue)

  • Add autoscaling or resource limits for concurrent eval jobs
  • Run the missing database migration for the runtimes table to restore the warm runtime pool
  • Add a circuit-breaker in the orchestrator to fail fast instead of hanging silently after retries exhaust
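
The fail-fast behaviour in the last bullet could look roughly like the sketch below. It is an illustration only: CircuitBreaker, EvalAborted, run_eval, and the start_runtime callback are hypothetical names, not the orchestrator's actual API, and the threshold of 8 consecutive failures is an assumed value chosen to match the 8 instances dispatched in this run.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


class EvalAborted(RuntimeError):
    """Raised when too many instances fail in a row to justify continuing."""


@dataclass
class CircuitBreaker:
    max_consecutive_failures: int = 8  # assumed threshold, not a real setting
    consecutive_failures: int = 0
    errors: List[str] = field(default_factory=list)

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self, error: str) -> None:
        self.consecutive_failures += 1
        self.errors.append(error)
        if self.consecutive_failures >= self.max_consecutive_failures:
            # Surface the last error loudly instead of going silent.
            raise EvalAborted(
                f"{self.consecutive_failures} consecutive instance failures; "
                f"last error: {error}"
            )


def run_eval(
    instance_ids: Iterable[str],
    start_runtime: Callable[[str], None],
    breaker: CircuitBreaker,
    max_retries: int = 3,
) -> List[str]:
    """Try each instance up to max_retries times; abort when the breaker trips."""
    completed = []
    for instance_id in instance_ids:
        last_error = None
        for _attempt in range(max_retries):
            try:
                start_runtime(instance_id)
            except RuntimeError as exc:
                last_error = str(exc)
            else:
                last_error = None
                break
        if last_error is None:
            breaker.record_success()
            completed.append(instance_id)
        else:
            breaker.record_failure(f"{instance_id}: {last_error}")
    return completed
```

With a breaker like this, a run in which every instance reports Runtime not yet ready (status: pending) would end with an explicit exception after the first batch rather than leaving the progress bar at 0/500.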

Timeline (UTC, 2026-03-27)

| Time | Event |
|---|---|
| 00:00:29 | SWE-bench eval starts, 10 workers |
| 00:01:15 | prune_db.py crashes: relation "runtimes" does not exist |
| 00:01:28 | 8 instances dispatched to runtime-api |
| 00:01:32 | warm_runtimes.py crashes: same DB error |
| 00:11:32 | All 8 instances timeout: Runtime not yet ready (status: pending), runtime_id=None |
| 00:21:52 | Retry round 2 also fails; orchestrator goes silent after this |
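
As a quick check on the timeline, the gaps between dispatch and the two failure rounds can be computed directly. The timestamps below are copied from the timeline; the helper name is hypothetical. Both gaps come out at about ten minutes, which would be consistent with a readiness timeout of roughly that length per attempt, though the logs quoted here do not state the timeout value.

```python
from datetime import datetime

# (time, event) pairs taken from the timeline in this issue.
TIMELINE = [
    ("00:01:28", "8 instances dispatched to runtime-api"),
    ("00:11:32", "all 8 instances time out"),
    ("00:21:52", "retry round 2 also fails"),
]


def gaps_in_seconds(events):
    """Return the elapsed seconds between each pair of consecutive events."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in events]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_in_seconds(TIMELINE))  # [604, 620]
```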
