SWE-bench eval_limit=500 stalls: all runtime pods stuck in pending, no progress for 45+ minutes

## Summary

SWE-bench ACP-claude eval with `eval_limit=500` stalls completely — runtime pods never leave `pending` status, the orchestrator hangs with no log output for 45+ minutes, and 0/500 instances complete.

**This affects ALL benchmarks using SDK >= v1.15.0 (commit `b99f15f`)**, not just SWE-bench. Confirmed broken: swebench, swebenchmultimodal, commit0. Only GAIA (binary build) is unaffected.

## Details

**Triggered run:**
- K8s job: `eval-23622798964-claude-4-6`
- SDK run: [23622781040](https://github.com/OpenHands/software-agent-sdk/actions/runs/23622781040)
- Eval run: [23622798964](https://github.com/OpenHands/evaluation/actions/runs/23622798964)
- CID: `2A6EC49A`
- Params: `benchmark=swebench`, `model=claude-4.6-opus`, `agent_type=acp-claude`, `eval_limit=500`, `benchmarks_branch=fix/copy-python-runtime-for-eval-images`

**Other affected runs (same root cause):**
| Job | Benchmark | Branch | Status |
|-----|-----------|--------|--------|
| `eval-23622889791-claude-son` | swebenchmultimodal | `fix/swebenchmultimodal-api-timeout` | All 5 instances CreateContainerError |
| `eval-23624504367-claude-son` | swebenchmultimodal | `fix/swebenchmultimodal-api-timeout` | All 5 instances CreateContainerError |
| commit0 (deleted) | commit0 | unknown | All instances failed, job deleted |

**Concurrent GAIA run (same SDK) is progressing fine** — `eval-23622799654-claude-4-6` uses `b99f15f-gaia-binary` images (different build path, not affected).

## Observed Behavior

1. Job starts, image build succeeds, 8 instances are launched at `00:01:28 UTC`
2. All 8 instances immediately fail with `Runtime not yet ready (status: pending)` — `runtime_id=None`, `session_id=None`
3. Retries fail with the same error (32 failures in total across the 8 instances, all on retry 2 or 3)
4. Last log line at `00:21:52 UTC` — no output for 45+ minutes after that
5. Progress bar stuck at `0/500`
6. `runtime-pods` namespace shows **no pods at all**: `No resources found in runtime-pods namespace`
7. Orchestrator pod itself is healthy (`Running`, `1/1 READY`)

## Root Cause

**SDK v1.15.0 broke all `source-minimal` eval images.** PR OpenHands/software-agent-sdk#2567 changed the agent-server Dockerfile builder from `--managed-python` (self-contained venv) to `--python-preference only-system` (venv symlinks to system Python). The eval-base images don't have Python 3.13 at `/usr/local/bin/`, so the symlink is broken.
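A dangling venv symlink of this kind can be detected with a few shell tests. The following is a minimal sketch, assuming a POSIX shell is available inside the image; `check_interpreter` is a hypothetical helper name and is not part of the SDK or the benchmarks repo.

```bash
# Classify a path as a usable interpreter, a dangling symlink, or missing.
# Prints "ok", "dangling -> <target>", or "missing".
check_interpreter() (
    path="$1"
    if [ -L "$path" ] && [ ! -e "$path" ]; then
        # The link itself exists, but the file it points to does not.
        echo "dangling -> $(readlink "$path")"
    elif [ -f "$path" ] && [ -x "$path" ]; then
        echo "ok"
    else
        echo "missing"
    fi
)
```

Run inside a failing `source-minimal` image, `check_interpreter /agent-server/.venv/bin/python` would be expected to report `dangling` if the diagnosis above is correct, and `missing` if the venv was never created at all.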

Every `source-minimal` runtime pod fails with:
```
CreateContainerError: exec: "/agent-server/.venv/bin/python":
stat /agent-server/.venv/bin/python: no such file or directory
```

Pod status logger confirms: **10 out of 13 runtime pods** are in `CreateContainerError` (all `source-minimal`), while 3 GAIA pods (`-binary` build) run fine.
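The same breakdown can be reproduced by hand from `kubectl` output. This is a rough sketch, not the pod status logger itself; `tally_pod_status` is a hypothetical helper, and it assumes the default `kubectl get pods --no-headers` column layout, where STATUS is the third column.

```bash
# Tally pods by STATUS from `kubectl get pods --no-headers` output on stdin.
# Prints "<count> <status>" lines, highest count first.
tally_pod_status() {
    awk '{ count[$3]++ } END { for (s in count) print count[s], s }' \
        | sort -k1,1nr -k2,2
}
```

Usage would look like `kubectl get pods -n runtime-pods --no-headers | tally_pod_status`. If the command is run with `-A`, the namespace becomes the first column and the field index in the `awk` program needs to change to `$4`.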

**The PR #578 fix is not sufficient.** The `Dockerfile.agent-layer` changes are present on both branches (`fix/copy-python-runtime-for-eval-images` and `fix/swebenchmultimodal-api-timeout`) and builds succeed, but pods still crash. The validation tested only 1 instance (`sympy__sympy-23824`). All other instances remain broken, likely for one of two reasons:
1. Stale images are cached in GHCR from pre-fix builds (the `b99f15f-*` tags already existed), or
2. Python 3.13 is not at `/usr/local/bin/python3.13` in the builder (the assumed COPY path may be wrong)

### Contributing factors

- **Cluster resource exhaustion**: 4 concurrent evals saturate the 23-node runtime cluster (17 of 23 nodes report insufficient CPU, 19 of 23 insufficient memory)
- **Warm runtime pool broken**: `warm_runtimes.py` crashes with `relation "runtimes" does not exist` — all sessions cold-start
- **Orchestrator hangs silently** once retries are exhausted instead of failing fast

## Suggested Fix

### Step 1: Verify the Python path in the builder image

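```bash
# Run through `sh -c` so the glob expands inside the container, not on the host;
# `--entrypoint sh` bypasses any entrypoint the image defines.
docker run --rm --entrypoint sh ghcr.io/openhands/eval-builder:b99f15f -c 'ls -la /usr/local/bin/python*'
docker run --rm --entrypoint sh ghcr.io/openhands/eval-builder:b99f15f -c 'readlink -f /agent-server/.venv/bin/python'
```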

If Python is not at `/usr/local/bin/python3.13`, update the COPY paths in `Dockerfile.agent-layer` to match the actual location.

### Step 2: Force-rebuild images (purge GHCR cache)

The `b99f15f-*` image tags in GHCR may be stale (built before the Dockerfile fix). Either:
- **Delete the stale tags** from `ghcr.io/openhands/eval-agent-server` and re-trigger builds, or
- **Use a new SDK commit** so the image tags are fresh (e.g. a tag with the placeholder prefix `abc1234-sweb.eval.*` has never been built, so it cannot hit the cache)

### Step 3: Merge the corrected fix to `main`

Once verified, merge PR #578 (with correct COPY paths) to `main` so ALL benchmark branches inherit the fix. Currently every feature branch needs the fix independently.

### Step 4: Address cluster saturation (separate issue)

- Add autoscaling or resource limits for concurrent eval jobs
- Run the missing database migration for the `runtimes` table to restore the warm runtime pool
- Add a circuit breaker in the orchestrator so it fails fast instead of hanging silently once retries are exhausted
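
The fail-fast behavior in the last item can be illustrated with a bounded retry wrapper. This is only a sketch of the idea in shell: the orchestrator's actual retry code is not shown in this issue, and `retry_or_fail` and its arguments are hypothetical.

```bash
# Run a command up to max_attempts times, sleeping `delay` seconds between tries.
# Returns 0 on the first success; otherwise logs to stderr and returns 1
# once the attempts run out, so the caller can abort instead of waiting.
retry_or_fail() (
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            exit 0
        fi
        echo "attempt $attempt/$max_attempts failed: $*" >&2
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    echo "giving up after $max_attempts attempts" >&2
    exit 1
)
```

The point of the design is the explicit non-zero exit and the final log line: a caller such as a K8s job can then be marked failed rather than staying `Running` with no output.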

## Timeline (UTC, 2026-03-27)

| Time | Event |
|------|-------|
| 00:00:29 | SWE-bench eval starts, 10 workers |
| 00:01:15 | `prune_db.py` crashes: `relation "runtimes" does not exist` |
| 00:01:28 | 8 instances dispatched to runtime-api |
| 00:01:32 | `warm_runtimes.py` crashes: same DB error |
| 00:11:32 | All 8 instances time out: `Runtime not yet ready (status: pending)`, `runtime_id=None` |
| 00:21:52 | Retry round 2 also fails; orchestrator goes silent after this |
