[codex] Extract shared Harbor benchmark runner by neubig · Pull Request #727 · OpenHands/benchmarks

neubig · 2026-05-28T15:26:57Z

Summary

add a shared benchmarks.utils.harbor helper for Harbor CLI checks, command construction, result discovery, and OpenHands-style JSONL conversion
route Terminal-Bench and SkillsBench through the shared helper while preserving their public wrapper functions
keep SkillsBench-specific dataset sync, skill injection, task normalization, and credential forwarding behavior in the SkillsBench wrapper

Closes #719.

Validation

uv run pytest tests/test_terminalbench.py tests/test_skillsbench_run_infer.py

neubig · 2026-05-28T15:44:54Z

@OpenHands /codereview

openhands-ai · 2026-05-28T15:46:06Z

Uh oh! There was an unexpected error starting the job :(

neubig · 2026-05-28T15:49:39Z

@OpenHands /codereview

Retrying because the previous OpenHands job failed to start.

openhands-ai · 2026-05-28T15:50:04Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-05-28T16:09:51Z

@OpenHands /codereview

Validation context for the re-review:

PR CI is green: pre-commit and tests succeeded on commit 057a3aa.
Local validation passed: uv run ruff check ..., uv run pyright benchmarks/utils/harbor.py, and uv run pytest tests/test_terminalbench.py tests/test_skillsbench_run_infer.py.
The smoke test on the stacked branch, #728 ([codex] Add Harbor covered benchmark mapping), deployed Terminal-Bench inference successfully via OpenHands/evaluation run 26586388116.

Note: the SDK run-eval.yml dispatcher accepted terminalbench, but the downstream eval-job.yml prerequisite switch rejected it before inference; the direct run-infer.yml path is the valid Terminal-Bench smoke path for now.

openhands-ai · 2026-05-28T16:10:46Z

I'm on it! neubig can track my progress at all-hands.dev

all-hands-bot · 2026-05-29T00:29:25Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

🟡 Acceptable — solid extraction of shared Harbor logic that eliminates meaningful duplication. A few things worth addressing before merge.

Summary

This PR correctly identifies that Terminal-Bench and SkillsBench share the same Harbor CLI invocation pattern and consolidates them into benchmarks/utils/harbor.py. The credential-mode enum is a clean way to capture the one real behavioral difference between the two callers. The public API of each benchmark's run_infer.py is preserved.

Issues

[CRITICAL ISSUES]

None.

[IMPROVEMENT OPPORTUNITIES]

run_harbor_evaluation: the env is None branching is duplicated

The function calls subprocess.run (or the injected run) in two independent blocks that differ only in whether env is forwarded. subprocess.run(cmd, env=None) is identical in behavior to subprocess.run(cmd) — passing None explicitly tells the child process to inherit the parent environment, which is the default. The env is None branch exists solely to avoid a cast, but the duplication compounds when the fallback-retry path copies the same pattern.

Simplest fix — remove the branching entirely:

result = cast(
    subprocess.CompletedProcess[str],
    run(cmd, capture_output=True, text=True, env=env),
)

env=None is a valid, meaningful argument to subprocess.run; no special case needed.
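The equivalence can be checked directly. The following is a minimal sketch, not code from this PR; the variable name HARBOR_DEMO_VAR is made up for the demonstration, and the child process is just the current Python interpreter rather than Harbor.

```python
import os
import subprocess
import sys

# HARBOR_DEMO_VAR is a hypothetical variable used only for this demonstration.
os.environ["HARBOR_DEMO_VAR"] = "inherited"

# Child process that prints the value it sees for the variable.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('HARBOR_DEMO_VAR', 'missing'))",
]

# Omitting env and passing env=None both inherit the parent environment.
default_env = subprocess.run(child, capture_output=True, text=True)
explicit_none = subprocess.run(child, capture_output=True, text=True, env=None)

# Passing a mapping replaces the environment the child sees.
custom_env = subprocess.run(
    child,
    capture_output=True,
    text=True,
    env={**os.environ, "HARBOR_DEMO_VAR": "overridden"},
)

print(default_env.stdout.strip())
print(explicit_none.stdout.strip())
print(custom_env.stdout.strip())
```

Because the first two calls behave the same, a single call site that always forwards env covers both the inherit and the override case.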

subprocess_run=subprocess.run passed explicitly by both callers

Both skillsbench/run_infer.py and terminalbench/run_infer.py pass subprocess_run=subprocess.run. The function signature already defaults to None and does run = subprocess_run or subprocess.run, so these two callers are passing the default value explicitly. Either remove the explicit pass from the call sites, or make the default subprocess.run directly (and remove the or subprocess.run fallback).

get_supported_task_filter_flag and get_supported_agent_name each invoke harbor run --help separately

SkillsBench calls both functions before run_harbor_evaluation, meaning it spawns harbor run --help twice in sequence. A single _probe_harbor_help(harbor_executable: str) -> str helper that returns the raw help text, then two thin parsers operating on that string, would halve the subprocess overhead and make both functions testable without spawning a subprocess.
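One possible shape for that split is sketched below. It is an illustration under stated assumptions, not the code in benchmarks/utils/harbor.py: the function names, the injectable run parameter, the sample help text, and the fallback values are all hypothetical. Only the flag names, the agent names, and the lookbehind pattern come from this review.

```python
import re
import subprocess
from types import SimpleNamespace
from typing import Any, Callable


def probe_harbor_help(
    harbor_executable: str,
    run: Callable[..., Any] = subprocess.run,
) -> str:
    """Return the raw `harbor run --help` text, or "" if the binary is missing."""
    try:
        result = run([harbor_executable, "run", "--help"], capture_output=True, text=True)
    except FileNotFoundError:
        return ""
    return f"{result.stdout}\n{result.stderr}"


def parse_task_filter_flag(help_text: str, default: str = "--task-name") -> str:
    """Pick the first supported flag; the default here is an assumption."""
    for flag in ("--include-task-name", "--task-name"):
        # Reject matches embedded in a longer word or flag on either side.
        if re.search(rf"(?<![\w-]){re.escape(flag)}(?![\w-])", help_text):
            return flag
    return default


def parse_agent_name(help_text: str, default_agent_name: str = "openhands") -> str:
    """Prefer openhands-sdk over openhands when both are advertised."""
    for name in ("openhands-sdk", "openhands"):
        if re.search(rf"(?<![\w-]){re.escape(name)}(?![\w-])", help_text):
            return name
    return default_agent_name


# Usage with a fake runner, so no Harbor installation is needed.
calls = []


def fake_run(cmd, *, capture_output, text):
    calls.append(cmd)
    # Made-up help text for the demonstration.
    stdout = "Options:\n  --include-task-name TEXT\n  --agent [openhands-sdk|oracle]\n"
    return SimpleNamespace(stdout=stdout, stderr="")


help_text = probe_harbor_help("harbor", run=fake_run)
task_flag = parse_task_filter_flag(help_text)
agent_name = parse_agent_name(help_text)
print(task_flag, agent_name, len(calls))
```

Both parsers take a plain string, so their branches can be exercised with literal help text while the probe runs only once.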

[TESTING GAPS]

No unit tests for benchmarks/utils/harbor.py

This PR introduces 318 lines of new logic in a shared utility, but the only test coverage comes from running the existing tests/test_terminalbench.py and tests/test_skillsbench_run_infer.py (mentioned in the PR description). Those tests exercise the wrappers, not the underlying helpers.

The most important behaviors to cover with unit tests:

get_supported_task_filter_flag: feed it a fake subprocess.run that returns controlled help text and verify the flag selection logic, including the regex boundary condition ((?<![\w-])--).
get_supported_agent_name: same pattern, verify openhands-sdk vs openhands priority.
run_harbor_evaluation: inject a mock subprocess_run, assert the command list constructed in AGENT_ENV_FLAGS and PROCESS_ENV modes, and verify the fallback-retry path triggers (or doesn't) under the right conditions.
convert_harbor_to_eval_output: supply a temp directory with synthetic result.json files and assert the output JSONL content.

These are all pure-Python paths that can be tested with tmp_path and a lambda in place of subprocess.run — no Harbor installation required.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This is a pure refactoring with no public API changes. Both benchmark wrappers' external function signatures are preserved. The credential-mode split correctly maps each caller's original behavior (SkillsBench used PROCESS_ENV; TerminalBench used AGENT_ENV_FLAGS). There is one pre-existing security concern: when AGENT_ENV_FLAGS is used, the LLM API key appears in the harbor run command that is logged at INFO level. Since AGENT_ENV_FLAGS is the TerminalBench mode, this behavior was already present in the original TerminalBench code and is not a regression introduced here. It is worth a follow-up issue to mask that value in the logged string.

VERDICT:
✅ Worth merging — core logic is sound. The env is None duplication and missing unit tests are the two items most worth addressing; the rest are suggestions.

KEY INSIGHT: subprocess.run(cmd, env=None) is identical to subprocess.run(cmd) — the env is None branch inside run_harbor_evaluation is unnecessary and doubles the amount of code that must be maintained when the call signature changes.

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing.

Re-request a review — the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

Was this review helpful? React with 👍 or 👎 to give feedback.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

neubig · 2026-05-30T04:23:20Z

@OpenHands /iterate

openhands-ai · 2026-05-30T04:23:44Z

I'm on it! neubig can track my progress at all-hands.dev

- extract _probe_harbor_run_help to deduplicate harbor run --help subprocess calls
- change subprocess_run default from None to subprocess.run and drop the or-fallback and cast
- collapse env-is-None branching: always pass env=env since subprocess.run accepts None natively
- redact LLM credential arguments before logging the harbor command
- update terminalbench test fake_run to accept env kwarg (now always passed)

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-30T04:36:36Z

Addressed all four review comments in d434121:

Extracted _probe_harbor_run_help — deduplicated the harbor run --help subprocess call that was duplicated in get_supported_task_filter_flag and get_supported_agent_name. Both functions now delegate to the shared helper (empty string on FileNotFoundError, preserving existing fallback behavior).
Credential redaction in logging — added safe_cmd that replaces LLM_*=... arguments following --ae with *** before logging. The raw cmd is unchanged and still passed to the subprocess.
Collapsed env is None branching — both the primary and fallback-retry subprocess calls now use subprocess_run(cmd, capture_output=True, text=True, env=env). subprocess.run(env=None) inherits the parent environment, so both branches were equivalent.
Fixed subprocess_run default — changed from None to subprocess.run, dropped the or subprocess.run fallback and all cast calls. The parameter is now non-optional and documented as a testing seam.
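The redaction step can be sketched as follows. This is a minimal illustration of the behavior described above, not the code from d434121; the helper name redact_agent_env is hypothetical, and keeping the variable name visible while masking only the value is an assumption.

```python
def redact_agent_env(cmd: list[str]) -> list[str]:
    """Return a copy of cmd that is safe to log; the input list is not modified."""
    safe: list[str] = []
    previous = ""
    for arg in cmd:
        # Only mask KEY=VALUE arguments that directly follow --ae and start with LLM_.
        if previous == "--ae" and arg.startswith("LLM_") and "=" in arg:
            key, _, _ = arg.partition("=")
            safe.append(f"{key}=***")
        else:
            safe.append(arg)
        previous = arg
    return safe


# The command below is illustrative; the key and model values are made up.
cmd = [
    "harbor", "run",
    "--ae", "LLM_API_KEY=sk-not-a-real-key",
    "--ae", "LLM_MODEL=example-model",
    "--ae", "OTHER_SETTING=1",
]
safe_cmd = redact_agent_env(cmd)
print(" ".join(safe_cmd))
```

Building a separate list for logging keeps the raw cmd intact for the subprocess call.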

All tests pass (30/30) and pre-commit is clean. Ready for another look. 🙏

This comment was created by an AI agent (OpenHands) on behalf of the PR author.

all-hands-bot · 2026-05-30T04:37:37Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Review: [codex] Extract shared Harbor benchmark runner

The refactoring is well-executed. All four issues from the prior review are resolved in the current HEAD: env=None branching collapsed, API-key redaction via safe_cmd, _probe_harbor_run_help extracted to avoid double subprocess invocations, and subprocess_run defaulting directly to subprocess.run. The HarborCredentialMode enum cleanly captures the single real behavioral difference between the two callers, and each wrapper correctly preserves its original semantics (probe_arg, task-flag negotiation, credential forwarding strategy, task-ID normalization).

Two items remain worth addressing before merge:

Testing gap (from prior review, still open): benchmarks/utils/harbor.py introduces 304 lines of shared logic — flag detection via regex, credential branching, fallback-retry, JSONL conversion — with no dedicated unit tests. The existing wrapper tests exercise these paths indirectly, but direct tests would catch regressions in the shared module without needing full benchmark execution. At a minimum, get_supported_task_filter_flag, get_supported_agent_name, and convert_harbor_to_eval_output have enough distinct branches to warrant targeted tests with a fake subprocess_run.

find_job_dir visibility: see inline comment.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

find_job_dir is only called internally by convert_harbor_to_eval_output; make it private (_find_job_dir) to avoid expanding the module's public API. Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-30T04:48:13Z

Also addressed the latest review comment in ef50709: renamed find_job_dir → _find_job_dir since it's only used internally by convert_harbor_to_eval_output within the same module.

All threads resolved. This comment was created by an AI agent (OpenHands) on behalf of the PR author.

all-hands-bot · 2026-05-30T04:49:21Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Review: [codex] Extract shared Harbor benchmark runner — Round 3

The refactoring is clean and the prior feedback has been addressed well. The ef507094 commit correctly makes _find_job_dir private, resolving the last open inline flag from round 2.

The implementation is otherwise in good shape: _probe_harbor_run_help is properly extracted to avoid double subprocess invocations, safe_cmd masks API keys in logs, subprocess_run defaults directly to subprocess.run, and the HarborCredentialMode enum cleanly captures the one real behavioral difference between the two callers.

One new issue and one carry-over remain:

🟠 New: _probe_harbor_run_help has no timeout — check_harbor_installed correctly wraps its subprocess call with timeout=10 and catches subprocess.TimeoutExpired, but _probe_harbor_run_help omits both. Both functions spawn Harbor subprocesses; if harbor run --help stalls (e.g., waiting on a pseudo-terminal or slow startup), the benchmark setup hangs indefinitely. The fix is a one-line addition — see inline comment.

🟠 Carry-over: Testing gap for benchmarks/utils/harbor.py — this has been raised in both prior reviews and has not been addressed. The module now contains 304 lines of shared logic including flag-detection regex, credential-mode branching, fallback-retry behavior, and JSONL conversion. Each of these paths has enough branches to warrant targeted unit tests. The subprocess_run seam is already in place, so tests require no mocking framework: pass a lambda in place of subprocess.run, control stdout/stderr, and assert the constructed command and returned path. The most important coverage targets remain:

get_supported_task_filter_flag — regex boundary conditions on (?<![\w-])--
get_supported_agent_name — openhands-sdk vs openhands priority
run_harbor_evaluation — AGENT_ENV_FLAGS vs PROCESS_ENV credential modes, and the fallback-retry path
convert_harbor_to_eval_output — synthetic result.json fixtures verifying JSONL output and error handling

Verdict: Core logic is correct. The _probe_harbor_run_help timeout is a straightforward fix; the testing gap is the more significant outstanding item before merge.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

…or_run_help

Add timeout=10 and catch subprocess.TimeoutExpired in _probe_harbor_run_help, consistent with check_harbor_installed. Prevents a hung harbor process from stalling benchmark startup indefinitely.

Co-authored-by: openhands <openhands@all-hands.dev>
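The effect of that pattern can be sketched without Harbor. This is a minimal illustration, not the PR's code; the helper name probe_help is hypothetical, and a sleeping Python child stands in for a stalled harbor run --help.

```python
import subprocess
import sys


def probe_help(cmd: list[str], timeout: float = 10.0) -> str:
    """Return the command's stdout, or "" if it is missing or does not finish in time."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # subprocess.run kills the child before raising TimeoutExpired.
        return ""
    return result.stdout


# A child that stalls for longer than the timeout.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
# A child that answers immediately.
fast = [sys.executable, "-c", "print('usage: demo')"]

slow_text = probe_help(slow, timeout=0.5)
fast_text = probe_help(fast)
print(repr(slow_text), repr(fast_text.strip()))
```

Returning an empty string on timeout lets the callers fall back to their defaults, the same way they do when the executable is not found.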

neubig · 2026-05-30T04:59:20Z

Addressed the timeout feedback in 7d11158 — added timeout=10 and subprocess.TimeoutExpired to _probe_harbor_run_help, consistent with check_harbor_installed. All threads resolved, CI is green. Ready for another look. 🙏

This comment was created by an AI agent (OpenHands) on behalf of the PR author.

all-hands-bot · 2026-05-30T05:00:18Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Review: [codex] Extract shared Harbor benchmark runner — Round 4

Verified at HEAD 7d11158f.

All six previously-flagged review threads are now resolved. The succession of fixes across commits d4341211 → ef507094 → 7d11158f has addressed every inline concern raised in rounds 1–3.

What's confirmed correct

_probe_harbor_run_help now has timeout=10 and catches subprocess.TimeoutExpired, matching the pattern used by check_harbor_installed. ✅
safe_cmd redaction correctly masks any --ae argument starting with LLM_ before logging. ✅
subprocess_run defaults directly to subprocess.run; the or subprocess.run fallback is gone. ✅
_find_job_dir is private (leading underscore). ✅
Credential mode differences between the two callers are cleanly captured via HarborCredentialMode — TerminalBench uses AGENT_ENV_FLAGS, SkillsBench uses PROCESS_ENV, preserving the behavioral split from the original code. ✅
retry_legacy_task_flag=True on SkillsBench and default (False) on TerminalBench correctly preserves the old per-benchmark behavior. ✅
convert_harbor_to_eval_output signature with optional canonicalize_instance_id cleanly supports both callers: TerminalBench uses the identity default; SkillsBench passes its own canonicalizer. ✅
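The canonicalizer hook in the last point can be illustrated with a simplified converter. This is a sketch under loud assumptions, not the real convert_harbor_to_eval_output: the function name, the trial-directory layout, the task_name field, and the output record fields are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def convert_results_to_jsonl(
    job_dir: Path,
    output_path: Path,
    canonicalize_instance_id: Callable[[str], str] = lambda task_id: task_id,
) -> int:
    """Write one JSON line per trial directory that contains a result.json."""
    count = 0
    with output_path.open("w", encoding="utf-8") as out:
        # Assumed layout: <job_dir>/<trial_dir>/result.json
        for result_file in sorted(job_dir.glob("*/result.json")):
            result = json.loads(result_file.read_text(encoding="utf-8"))
            record = {
                # "task_name" is a hypothetical field name for this sketch.
                "instance_id": canonicalize_instance_id(result["task_name"]),
                "error": result.get("error"),
            }
            out.write(json.dumps(record) + "\n")
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    job_dir = Path(tmp) / "job"
    for trial, task in [("trial-1", "Fix-Bug"), ("trial-2", "add-feature")]:
        (job_dir / trial).mkdir(parents=True)
        (job_dir / trial / "result.json").write_text(json.dumps({"task_name": task}))

    identity_out = Path(tmp) / "identity.jsonl"
    lowered_out = Path(tmp) / "lowered.jsonl"
    # One caller relies on the identity default, the other passes its own canonicalizer.
    convert_results_to_jsonl(job_dir, identity_out)
    convert_results_to_jsonl(job_dir, lowered_out, canonicalize_instance_id=str.lower)

    identity_ids = [json.loads(line)["instance_id"] for line in identity_out.read_text().splitlines()]
    lowered_ids = [json.loads(line)["instance_id"] for line in lowered_out.read_text().splitlines()]

print(identity_ids)
print(lowered_ids)
```

An optional callable with an identity default keeps the shared function free of benchmark-specific naming rules.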

Outstanding: testing gap — unchanged across all four rounds

benchmarks/utils/harbor.py is now 305 lines of shared logic with no dedicated unit test file. The existing test_terminalbench.py exercises convert_harbor_to_eval_output (directory discovery, error entries, multi-trial discovery), which is good coverage for that function. However, the following code paths have no test coverage at all:

| Function | Untested paths |
| --- | --- |
| `get_supported_task_filter_flag` | regex boundary detection (the lookbehind `(?<![\w-])` correctly rejecting flags prefixed with a word char); behavior when help text is empty (timeout/not found) |
| `get_supported_agent_name` | `openhands-sdk` vs `openhands` priority; fallback to `default_agent_name` on empty help text |
| `run_harbor_evaluation` | `AGENT_ENV_FLAGS` mode (`--ae` flags injected into `cmd`); `PROCESS_ENV` mode (keys set in `env` dict); fallback-retry path (`retry_legacy_task_flag=True` with `--task-name` rejection in stderr); `n_limit` and `task_ids` command inclusion |

The subprocess_run seam is already in place, so none of these require a mocking framework or a real Harbor binary. Illustrative sketch:

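# Import path follows the module location named in this PR.
from benchmarks.utils.harbor import HarborCredentialMode, run_harbor_evaluation

def test_run_harbor_credential_mode_agent_env(tmp_path):
    # `llm` is assumed to be an LLM configuration object defined elsewhere in the test module.
    captured = {}
    def fake_run(cmd, *, capture_output, text, env=None):
        # Record the command and environment instead of spawning Harbor.
        captured["cmd"] = cmd
        captured["env"] = env
        return type("R", (), {"returncode": 0, "stdout": "", "stderr": ""})()
    run_harbor_evaluation(llm, "my-dataset", str(tmp_path),
                          credential_mode=HarborCredentialMode.AGENT_ENV_FLAGS,
                          subprocess_run=fake_run)
    # AGENT_ENV_FLAGS puts credentials on the command line and leaves env untouched.
    assert any("LLM_API_KEY=" in part for part in captured["cmd"])
    assert captured["env"] is None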

Verdict: The extraction is clean and all mechanical issues are resolved. The module is ready for merge except for this persistent testing gap. Adding a tests/test_harbor.py covering the three untested functions above would close the last outstanding concern.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Tests cover:
- get_supported_task_filter_flag: --include-task-name priority, --task-name fallback, regex boundary rejection of embedded flags, empty help text
- get_supported_agent_name: openhands-sdk priority, openhands fallback, custom default, empty help text
- run_harbor_evaluation: AGENT_ENV_FLAGS mode (--ae flags, env=None), PROCESS_ENV mode (env dict, no --ae), task_ids+n_limit command inclusion, fallback-retry with --include-task-name

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-30T05:12:15Z

Added tests/test_harbor.py in 82386a9 covering the untested code paths flagged in the previous review:

get_supported_task_filter_flag: priority logic (--include-task-name > --task-name), regex boundary detection (rejects flags embedded in words), empty help text fallback (6 tests)
get_supported_agent_name: openhands-sdk priority, openhands fallback, custom default, empty help text (5 tests)
run_harbor_evaluation: AGENT_ENV_FLAGS mode (--ae flags injected, env=None), PROCESS_ENV mode (env dict set, no --ae), task_ids+n_limit command construction, fallback-retry with --include-task-name (4 tests)

All 45 tests pass (15 new + 30 existing), CI is green. Ready for another look. 🙏

This comment was created by an AI agent (OpenHands) on behalf of the PR author.

all-hands-bot · 2026-05-30T05:15:54Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Review: [codex] Extract shared Harbor benchmark runner — Round 5

Verified at HEAD 82386a99.

All six previously-flagged inline threads are resolved, and the new tests/test_harbor.py commit directly addresses the testing gap that triggered the Round 4 CHANGES_REQUESTED.

What's confirmed

Prior resolutions (rounds 1–4) — still correct at HEAD:

_probe_harbor_run_help has timeout=10 and catches subprocess.TimeoutExpired, matching check_harbor_installed. ✅
safe_cmd correctly masks any --ae argument starting with LLM_ before logging; the raw cmd is never logged. ✅
subprocess_run defaults directly to subprocess.run; no or subprocess.run fallback. ✅
_find_job_dir is private (leading underscore). ✅
HarborCredentialMode cleanly encodes the one real behavioral difference: TerminalBench uses AGENT_ENV_FLAGS, SkillsBench uses PROCESS_ENV. ✅
retry_legacy_task_flag=True on SkillsBench, default False on TerminalBench — both correctly preserve original per-benchmark semantics. ✅

New commit (82386a99) — testing gap resolved:

TestGetSupportedTaskFilterFlag (6 cases): covers --include-task-name priority, --task-name fallback, both-flags present, neither flag, empty help text, and the regex lookbehind that rejects word-prefixed flags. ✅
TestGetSupportedAgentName (5 cases): covers openhands-sdk vs openhands priority, custom default, and empty help text. ✅
TestRunHarborEvaluationCredentialModes (2 cases): AGENT_ENV_FLAGS injects --ae flags and leaves env=None; PROCESS_ENV sets env dict and emits no --ae flags. ✅
TestRunHarborEvaluationTaskFiltering (2 cases): task IDs and n_limit appear in the command; the fallback-retry correctly replaces --task-name with --include-task-name on the "No such option" error path. ✅
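The fallback-retry path in the last item can be sketched in isolation. This is an illustration, not the logic in run_harbor_evaluation: the function name is hypothetical, and the exit code and exact error text returned by the fake runner are assumptions. Only the two flag names and the "No such option" trigger come from this review.

```python
import subprocess
from types import SimpleNamespace
from typing import Any, Callable


def run_with_task_flag_fallback(
    base_cmd: list[str],
    task_id: str,
    subprocess_run: Callable[..., Any] = subprocess.run,
) -> Any:
    """Try --task-name first; retry once with --include-task-name if it is rejected."""
    result = subprocess_run(
        [*base_cmd, "--task-name", task_id], capture_output=True, text=True, env=None
    )
    rejected = (
        result.returncode != 0
        and "No such option" in result.stderr
        and "--task-name" in result.stderr
    )
    if rejected:
        result = subprocess_run(
            [*base_cmd, "--include-task-name", task_id],
            capture_output=True,
            text=True,
            env=None,
        )
    return result


calls: list[list[str]] = []


def fake_run(cmd, *, capture_output, text, env=None):
    # Fake runner that rejects the legacy flag and accepts the newer one.
    calls.append(cmd)
    if "--task-name" in cmd:
        return SimpleNamespace(returncode=2, stdout="", stderr="Error: No such option: --task-name")
    return SimpleNamespace(returncode=0, stdout="ok", stderr="")


final = run_with_task_flag_fallback(["harbor", "run"], "task-1", subprocess_run=fake_run)
print(len(calls), calls[-1], final.returncode)
```

Checking both the return code and the stderr text keeps unrelated failures from triggering a retry.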

Minor observations (non-blocking)

convert_harbor_to_eval_output is not exercised in test_harbor.py, but the existing test_terminalbench.py covers directory discovery, error entries, and multi-trial results — sufficient coverage.
check_harbor_installed has no unit test. The function is eight lines (one subprocess call + error handling) and is an acceptable gap at this stage.
Both skillsbench/run_infer.py and terminalbench/run_infer.py pass subprocess_run=subprocess.run explicitly despite it being the default. Redundant, but it makes the testing seam visible to future readers and is harmless.

Verdict

The refactoring is correct, the public API of each wrapper is preserved, and the outstanding testing gap has been resolved with targeted, well-structured tests. ✅ Ready to merge.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

openhands-ai · 2026-05-30T05:26:40Z

Summary — PR #727 Iteration Complete ✅

Branch: codex/shared-harbor-runner | PR: #727

What Was Done

The /iterate request triggered a full review-and-fix loop. The automated review bot (all-hands-bot) left 4 inline comments, which were addressed across multiple commits:

Commit d4341211 — Addressed all 4 original inline comments:

✅ Extracted _probe_harbor_run_help() to deduplicate the harbor run --help subprocess call that was duplicated across get_supported_task_filter_flag and get_supported_agent_name
✅ Changed subprocess_run default from Callable[..., Any] | None = None to Callable[..., Any] = subprocess.run, dropped the or subprocess.run fallback and all cast() calls
✅ Collapsed the redundant env is None branching — now always passes env=env since subprocess.run(env=None) inherits the parent environment (identical to omitting env)
✅ Added safe_cmd redaction — replaces LLM_*=... arguments following --ae with *** before logging

Commit ef507094 — Bot's second review: renamed find_job_dir → _find_job_dir (private, only called internally)

Commit 7d11158f — Bot's third review: added timeout=10 and caught subprocess.TimeoutExpired in _probe_harbor_run_help, consistent with check_harbor_installed

Commit 82386a99 — Bot's fourth review (CHANGES_REQUESTED for testing gap): added tests/test_harbor.py with 15 tests covering all previously-untested paths:

get_supported_task_filter_flag: priority, fallback, regex boundary, empty help text
get_supported_agent_name: sdk priority, fallback, custom default, empty help text
run_harbor_evaluation: AGENT_ENV_FLAGS mode, PROCESS_ENV mode, task filtering, fallback-retry

Final State

| Verification Layer | Result |
| --- | --- |
| CI (`pre-commit` + `tests`) | ✅ 2/2 PASSED |
| Total tests | ✅ 45 passed (30 existing + 15 new) |
| Review bot | ✅ APPROVED on `82386a99` |
| Unresolved threads | ✅ 0 of 6 |
| Mergeable | ✅ MERGEABLE |

All changes are focused and directly address the review feedback — no extraneous changes.

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-30T20:47:43Z

Validation update for #727:

Unit/CI checks are green on the latest head (3cc4af8a).
I ran the shared Harbor runner through CI using MiniMax M2.7 on TerminalBench with the evaluation-side DinD fix branch:
- Inference run: https://github.com/OpenHands/evaluation/actions/runs/26693940327
- Inference archive: https://results.eval.all-hands.dev/terminalbench/litellm_proxy-minimax-MiniMax-M2-7/26693940327/infer-output.tar.gz
- Eval-harness run: https://github.com/OpenHands/evaluation/actions/runs/26694297279
- Eval-harness archive: https://results.eval.all-hands.dev/terminalbench/litellm_proxy-minimax-MiniMax-M2-7/26694297279/results.tar.gz
This exercised Harbor task selection, OpenHands SDK agent setup/execution, output conversion, and eval-harness reporting. The sampled task is reported as an error because Harbor/TerminalBench did not emit reward.txt/reward.json, but the benchmark wrapper plumbing produced the expected output.jsonl and output.report.json.
During validation I found two compatibility issues and fixed/PR’d them:
- This PR now detects the Harbor task-filter flag instead of assuming legacy --task-name.
- Evaluation infra PR OpenHands/evaluation#567 enables DinD for TerminalBench inference and updates the TerminalBench Harbor runtime.

This comment was created by an AI agent (OpenHands) on behalf of the user.

neubig mentioned this pull request May 28, 2026

[codex] Add Harbor covered benchmark mapping #728

Closed

Extract shared Harbor benchmark runner

057a3aa

neubig force-pushed the codex/shared-harbor-runner branch from cc30da7 to 057a3aa May 28, 2026 15:33

neubig marked this pull request as ready for review May 28, 2026 15:34

neubig requested review from all-hands-bot and openhands-agent May 28, 2026 15:37

neubig added the review-this label May 28, 2026

neubig added review-this and removed review-this labels May 29, 2026

all-hands-bot reviewed May 29, 2026

Comment thread benchmarks/utils/harbor.py Outdated

Comment thread benchmarks/utils/harbor.py Outdated

Comment thread benchmarks/utils/harbor.py Outdated

Comment thread benchmarks/utils/harbor.py Outdated

neubig requested a review from all-hands-bot May 30, 2026 04:36

all-hands-bot reviewed May 30, 2026

Comment thread benchmarks/utils/harbor.py Outdated

chore: address PR review feedback (#727) - make find_job_dir private

ef50709

neubig requested a review from all-hands-bot May 30, 2026 04:48

all-hands-bot reviewed May 30, 2026

Comment thread benchmarks/utils/harbor.py

neubig requested a review from all-hands-bot May 30, 2026 04:59

all-hands-bot requested changes May 30, 2026

neubig requested a review from all-hands-bot May 30, 2026 05:14

all-hands-bot approved these changes May 30, 2026

terminalbench: detect Harbor task filter flag

3cc4af8

Co-authored-by: openhands <openhands@all-hands.dev>

neubig merged commit b72c63d into main May 30, 2026
2 checks passed

neubig deleted the codex/shared-harbor-runner branch May 30, 2026 20:47

neubig mentioned this pull request May 30, 2026

[codex] Add Harbor covered benchmark mapping #731

Open
