[codex] Extract shared Harbor benchmark runner#727
Conversation
cc30da7 to
057a3aa
Compare
|
@OpenHands /codereview |
|
Uh oh! There was an unexpected error starting the job :( |
|
@OpenHands /codereview Retrying because the previous OpenHands job failed to start. |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
@OpenHands /codereview Validation context for the re-review:
Note: the SDK |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Acceptable — solid extraction of shared Harbor logic that eliminates meaningful duplication. A few things worth addressing before merge.
Summary
This PR correctly identifies that Terminal-Bench and SkillsBench share the same Harbor CLI invocation pattern and consolidates them into benchmarks/utils/harbor.py. The credential-mode enum is a clean way to capture the one real behavioral difference between the two callers. The public API of each benchmark's run_infer.py is preserved.
Issues
[CRITICAL ISSUES]
None.
[IMPROVEMENT OPPORTUNITIES]
run_harbor_evaluation — env is None branching is duplicated twice
The function calls subprocess.run (or the injected run) in two independent blocks that differ only in whether env is forwarded. subprocess.run(cmd, env=None) is identical in behavior to subprocess.run(cmd) — passing None explicitly tells the child process to inherit the parent environment, which is the default. The env is None branch exists solely to avoid a cast, but the duplication compounds when the fallback-retry path copies the same pattern.
Simplest fix — remove the branching entirely:
result = cast(
subprocess.CompletedProcess[str],
run(cmd, capture_output=True, text=True, env=env),
)env=None is a valid, meaningful argument to subprocess.run; no special case needed.
subprocess_run=subprocess.run passed explicitly by both callers
Both skillsbench/run_infer.py and terminalbench/run_infer.py pass subprocess_run=subprocess.run. The function signature already defaults to None and does run = subprocess_run or subprocess.run, so these two callers are passing the default value explicitly. Either remove the explicit pass from the call sites, or make the default subprocess.run directly (and remove the or subprocess.run fallback).
get_supported_task_filter_flag and get_supported_agent_name each invoke harbor run --help separately
SkillsBench calls both functions before run_harbor_evaluation, meaning it spawns harbor run --help twice in sequence. A single _probe_harbor_help(harbor_executable: str) -> str helper that returns the raw help text, then two thin parsers operating on that string, would halve the subprocess overhead and make both functions testable without touching the filesystem.
[TESTING GAPS]
No unit tests for benchmarks/utils/harbor.py
This PR introduces 318 lines of new logic in a shared utility, but the only test coverage is running the existing test_terminalbench.py and test_skillsbench.py (mentioned in the PR description). Those tests exercise the wrappers, not the underlying helpers.
The most important behaviors to cover with unit tests:
get_supported_task_filter_flag: feed it a fakesubprocess.runthat returns controlled help text and verify the flag selection logic, including the regex boundary condition ((?<![\w-])--).get_supported_agent_name: same pattern, verifyopenhands-sdkvsopenhandspriority.run_harbor_evaluation: inject a mocksubprocess_run, assert the command list constructed inAGENT_ENV_FLAGSandPROCESS_ENVmodes, and verify the fallback-retry path triggers (or doesn't) under the right conditions.convert_harbor_to_eval_output: supply a temp directory with syntheticresult.jsonfiles and assert the output JSONL content.
These are all pure-Python paths that can be tested with tmp_path and a lambda in place of subprocess.run — no Harbor installation required.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
This is a pure refactoring with no public API changes. Both benchmark wrappers' external function signatures are preserved. The credential-mode split correctly maps each caller's original behavior (SkillsBench used PROCESS_ENV; TerminalBench used AGENT_ENV_FLAGS). The one pre-existing security concern — the LLM API key appearing in the harbor run command logged at INFO level when AGENT_ENV_FLAGS is used — was already present in the original SkillsBench code and is not a regression introduced here. It is worth a follow-up issue to mask that value in the logged string.
VERDICT:
✅ Worth merging — core logic is sound. The env is None duplication and missing unit tests are the two items most worth addressing; the rest are suggestions.
KEY INSIGHT: subprocess.run(cmd, env=None) is identical to subprocess.run(cmd) — the env is None branch inside run_harbor_evaluation is unnecessary and doubles the amount of code that must be maintained when the call signature changes.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing.- Re-request a review — the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
Was this review helpful? React with 👍 or 👎 to give feedback.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
@OpenHands /iterate |
|
I'm on it! neubig can track my progress at all-hands.dev |
- extract _probe_harbor_run_help to deduplicate harbor run --help subprocess calls - change subprocess_run default from None to subprocess.run and drop the or-fallback and cast - collapse env-is-None branching: always pass env=env since subprocess.run accepts None natively - redact LLM credential arguments before logging the harbor command - update terminalbench test fake_run to accept env kwarg (now always passed) Co-authored-by: openhands <openhands@all-hands.dev>
|
Addressed all four review comments in d434121:
All tests pass (30/30) and pre-commit is clean. Ready for another look. 🙏 This comment was created by an AI agent (OpenHands) on behalf of the PR author. |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Review: [codex] Extract shared Harbor benchmark runner
The refactoring is well-executed. All four issues from the prior review are resolved in the current HEAD: env=None branching collapsed, API-key redaction via safe_cmd, _probe_harbor_run_help extracted to avoid double subprocess invocations, and subprocess_run defaulting directly to subprocess.run. The HarborCredentialMode enum cleanly captures the single real behavioral difference between the two callers, and each wrapper correctly preserves its original semantics (probe_arg, task-flag negotiation, credential forwarding strategy, task-ID normalization).
Two items remain worth addressing before merge:
Testing gap (from prior review, still open): benchmarks/utils/harbor.py introduces 304 lines of shared logic — flag detection via regex, credential branching, fallback-retry, JSONL conversion — with no dedicated unit tests. The existing wrapper tests exercise these paths indirectly, but direct tests would catch regressions in the shared module without needing full benchmark execution. At a minimum, get_supported_task_filter_flag, get_supported_agent_name, and convert_harbor_to_eval_output have enough distinct branches to warrant targeted tests with a fake subprocess_run.
find_job_dir visibility: see inline comment.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
find_job_dir is only called internally by convert_harbor_to_eval_output; make it private (_find_job_dir) to avoid expanding the module's public API. Co-authored-by: openhands <openhands@all-hands.dev>
|
Also addressed the latest review comment in ef50709: renamed All threads resolved. This comment was created by an AI agent (OpenHands) on behalf of the PR author. |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Review: [codex] Extract shared Harbor benchmark runner — Round 3
The refactoring is clean and the prior feedback has been addressed well. The ef507094 commit correctly makes _find_job_dir private, resolving the last open inline flag from round 2.
The implementation is otherwise in good shape: _probe_harbor_run_help is properly extracted to avoid double subprocess invocations, safe_cmd masks API keys in logs, subprocess_run defaults directly to subprocess.run, and the HarborCredentialMode enum cleanly captures the one real behavioral difference between the two callers.
One new issue and one carry-over remain:
🟠 New: _probe_harbor_run_help has no timeout — check_harbor_installed correctly wraps its subprocess call with timeout=10 and catches subprocess.TimeoutExpired, but _probe_harbor_run_help omits both. Both functions spawn Harbor subprocesses; if harbor run --help stalls (e.g., waiting on a pseudo-terminal or slow startup), the benchmark setup hangs indefinitely. The fix is a one-line addition — see inline comment.
🟠 Carry-over: Testing gap for benchmarks/utils/harbor.py — this has been raised in both prior reviews and has not been addressed. The module now contains 304 lines of shared logic including flag-detection regex, credential-mode branching, fallback-retry behavior, and JSONL conversion. Each of these paths has enough branches to warrant targeted unit tests. The subprocess_run seam is already in place, so tests require no mocking framework: pass a lambda in place of subprocess.run, control stdout/stderr, and assert the constructed command and returned path. The most important coverage targets remain:
get_supported_task_filter_flag— regex boundary conditions on(?<![\w-])--get_supported_agent_name—openhands-sdkvsopenhandspriorityrun_harbor_evaluation—AGENT_ENV_FLAGSvsPROCESS_ENVcredential modes, and the fallback-retry pathconvert_harbor_to_eval_output— syntheticresult.jsonfixtures verifying JSONL output and error handling
Verdict: Core logic is correct. The _probe_harbor_run_help timeout is a straightforward fix; the testing gap is the more significant outstanding item before merge.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
…or_run_help Add timeout=10 and catch subprocess.TimeoutExpired in _probe_harbor_run_help, consistent with check_harbor_installed. Prevents a hung harbor process from stalling benchmark startup indefinitely. Co-authored-by: openhands <openhands@all-hands.dev>
|
Addressed the timeout feedback in 7d11158 — added This comment was created by an AI agent (OpenHands) on behalf of the PR author. |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Review: [codex] Extract shared Harbor benchmark runner — Round 4
Verified at HEAD 7d11158f.
All six previously-flagged review threads are now resolved. The succession of fixes across commits d4341211 → ef507094 → 7d11158f has addressed every inline concern raised in rounds 1–3.
What's confirmed correct
_probe_harbor_run_helpnow hastimeout=10and catchessubprocess.TimeoutExpired, matching the pattern used bycheck_harbor_installed. ✅safe_cmdredaction correctly masks any--aeargument starting withLLM_before logging. ✅subprocess_rundefaults directly tosubprocess.run; theor subprocess.runfallback is gone. ✅_find_job_diris private (leading underscore). ✅- Credential mode differences between the two callers are cleanly captured via
HarborCredentialMode— TerminalBench usesAGENT_ENV_FLAGS, SkillsBench usesPROCESS_ENV, preserving the behavioral split from the original code. ✅ retry_legacy_task_flag=Trueon SkillsBench and default (False) on TerminalBench correctly preserves the old per-benchmark behavior. ✅convert_harbor_to_eval_outputsignature with optionalcanonicalize_instance_idcleanly supports both callers: TerminalBench uses the identity default; SkillsBench passes its own canonicalizer. ✅
Outstanding: testing gap — unchanged across all four rounds
benchmarks/utils/harbor.py is now 305 lines of shared logic with no dedicated unit test file. The existing test_terminalbench.py exercises convert_harbor_to_eval_output (directory discovery, error entries, multi-trial discovery), which is good coverage for that function. However, the following code paths have no test coverage at all:
| Function | Untested paths |
|---|---|
get_supported_task_filter_flag |
regex boundary detection (the lookbehind (?<![\w-]) correctly rejecting flags prefixed with a word char); behavior when help text is empty (timeout/not found) |
get_supported_agent_name |
openhands-sdk vs openhands priority; fallback to default_agent_name on empty help text |
run_harbor_evaluation |
AGENT_ENV_FLAGS mode (--ae flags injected into cmd); PROCESS_ENV mode (keys set in env dict); fallback-retry path (retry_legacy_task_flag=True with --task-name rejection in stderr); n_limit and task_ids command inclusion |
The subprocess_run seam is already in place, so none of these require a mocking framework or a real Harbor binary. Illustrative sketch:
def test_run_harbor_credential_mode_agent_env(tmp_path):
captured = {}
def fake_run(cmd, *, capture_output, text, env=None):
captured["cmd"] = cmd
captured["env"] = env
return type("R", (), {"returncode": 0, "stdout": "", "stderr": ""})()
run_harbor_evaluation(llm, "my-dataset", str(tmp_path),
credential_mode=HarborCredentialMode.AGENT_ENV_FLAGS,
subprocess_run=fake_run)
assert any("LLM_API_KEY=" in part for part in captured["cmd"])
assert captured["env"] is NoneVerdict: The extraction is clean and all mechanical issues are resolved. The module is ready for merge except for this persistent testing gap. Adding a tests/test_harbor.py covering the four untested areas above would close the last outstanding concern.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Tests cover: - get_supported_task_filter_flag: --include-task-name priority, --task-name fallback, regex boundary rejection of embedded flags, empty help text - get_supported_agent_name: openhands-sdk priority, openhands fallback, custom default, empty help text - run_harbor_evaluation: AGENT_ENV_FLAGS mode (--ae flags, env=None), PROCESS_ENV mode (env dict, no --ae), task_ids+n_limit command inclusion, fallback-retry with --include-task-name Co-authored-by: openhands <openhands@all-hands.dev>
|
Added
All 45 tests pass (15 new + 30 existing), CI is green. Ready for another look. 🙏 This comment was created by an AI agent (OpenHands) on behalf of the PR author. |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Review: [codex] Extract shared Harbor benchmark runner — Round 5
Verified at HEAD 82386a99.
All six previously-flagged inline threads are resolved, and the new tests/test_harbor.py commit directly addresses the testing gap that triggered the Round 4 CHANGES_REQUESTED.
What's confirmed
Prior resolutions (rounds 1–4) — still correct at HEAD:
_probe_harbor_run_helphastimeout=10and catchessubprocess.TimeoutExpired, matchingcheck_harbor_installed. ✅safe_cmdcorrectly masks any--aeargument starting withLLM_before logging; the rawcmdis never logged. ✅subprocess_rundefaults directly tosubprocess.run; noor subprocess.runfallback. ✅_find_job_diris private (leading underscore). ✅HarborCredentialModecleanly encodes the one real behavioral difference: TerminalBench usesAGENT_ENV_FLAGS, SkillsBench usesPROCESS_ENV. ✅retry_legacy_task_flag=Trueon SkillsBench, defaultFalseon TerminalBench — both correctly preserve original per-benchmark semantics. ✅
New commit (82386a99) — testing gap resolved:
TestGetSupportedTaskFilterFlag(6 cases): covers--include-task-namepriority,--task-namefallback, both-flags present, neither flag, empty help text, and the regex lookbehind that rejects word-prefixed flags. ✅TestGetSupportedAgentName(5 cases): coversopenhands-sdkvsopenhandspriority, custom default, and empty help text. ✅TestRunHarborEvaluationCredentialModes(2 cases):AGENT_ENV_FLAGSinjects--aeflags and leavesenv=None;PROCESS_ENVsetsenvdict and emits no--aeflags. ✅TestRunHarborEvaluationTaskFiltering(2 cases): task IDs andn_limitappear in the command; the fallback-retry correctly replaces--task-namewith--include-task-nameon the "No such option" error path. ✅
Minor observations (non-blocking)
convert_harbor_to_eval_outputis not exercised intest_harbor.py, but the existingtest_terminalbench.pycovers directory discovery, error entries, and multi-trial results — sufficient coverage.check_harbor_installedhas no unit test. The function is eight lines (one subprocess call + error handling) and is an acceptable gap at this stage.- Both
skillsbench/run_infer.pyandterminalbench/run_infer.pypasssubprocess_run=subprocess.runexplicitly despite it being the default. Redundant, but it makes the testing seam visible to future readers and is harmless.
Verdict
The refactoring is correct, the public API of each wrapper is preserved, and the outstanding testing gap has been resolved with targeted, well-structured tests. ✅ Ready to merge.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Summary — PR #727 Iteration Complete ✅Branch: What Was DoneThe Commit
Commit Commit Commit
Final State
All changes are focused and directly address the review feedback — no extraneous changes. |
Co-authored-by: openhands <openhands@all-hands.dev>
|
Validation update for #727:
This comment was created by an AI agent (OpenHands) on behalf of the user. |
Summary
benchmarks.utils.harborhelper for Harbor CLI checks, command construction, result discovery, and OpenHands-style JSONL conversionCloses #719.
Validation
uv run pytest tests/test_terminalbench.py tests/test_skillsbench_run_infer.py