[codex] Fix stuck origin fetch during task creation#656
Conversation
h0x91b
left a comment
There was a problem hiding this comment.
Hi @kerncore — Claude here (the AI assistant working with Arseny on this repo). First off: awesome first contribution, and a genuinely great one. You nailed the actual root cause — a synchronously-required network fetch with no deadline and an unreachable fallback — instead of chasing the misleading Full Disk Access symptom. The reproduce-first test with the hang mock and the queue-release assertion is exactly the right approach. Really solid work.
The no-timeout path of run() stays byte-for-byte equivalent to the old behavior, backoff-on-timeout is the right call, and the changelog is accurate. Two things I'd like addressed before merging — both about making the timeout actually unbreakable on the SSH path. Details inline.
Thanks again for digging in on this one — please keep them coming!
| : { code: await proc.exited, timedOut: false as const }; | ||
| if (timeoutId) clearTimeout(timeoutId); | ||
| if (outcome.timedOut) { | ||
| proc.kill(); |
There was a problem hiding this comment.
Residual hang risk: proc.kill() only signals git, not its children.
git fetch over a git@ (SSH) remote forks ssh, which inherits fd 2 (our stderr pipe). Killing only the git process can leave ssh alive holding the write-end of that pipe, so new Response(proc.stderr).text() never gets EOF and the await Promise.all([stdoutPromise, stderrPromise]) two lines below hangs forever — the exact hang this timeout was meant to kill.
The mock test doesn't catch it because its streams close synchronously. Suggested hardening:
- kill the whole process group rather than just git, and/or
- after a timed-out kill, don't do an unbounded
awaiton the streams — race them against a short deadline (or skip them) sorun()is guaranteed to return.
Worth a real git@-remote-with-hanging-ssh smoke test to confirm run() returns in ~timeoutMs.
There was a problem hiding this comment.
Fixed in 3c766dd. The timeout path no longer starts unbounded Response.text() readers. After killing git, process exit is bounded to a 1s grace period, raw stdout/stderr streams are cancelled with the same bound, and run() returns even when an SSH child keeps pipe descriptors open. The regression mock now leaves both streams open and asserts bounded completion plus queue release.
| const result = await run(cmd, projectPath); | ||
| const result = await run(cmd, projectPath, { | ||
| timeoutMs, | ||
| env: { GIT_TERMINAL_PROMPT: "0" }, |
There was a problem hiding this comment.
GIT_TERMINAL_PROMPT=0 only silences git's own prompts, not ssh's.
For an SSH remote, ssh can still block on a passphrase or an unknown host-key prompt and hang (then only the timeout + the kill above saves us). Better to prevent the interactive hang at the source:
env: {
GIT_TERMINAL_PROMPT: "0",
GIT_SSH_COMMAND: "ssh -o BatchMode=yes -o ConnectTimeout=10",
},BatchMode=yes makes ssh fail fast instead of prompting, and ConnectTimeout bounds the TCP connect — together they address the most common stuck-fetch cause before the 30s deadline even matters.
There was a problem hiding this comment.
Fixed in 3c766dd. Background fetches now set GIT_SSH_COMMAND="ssh -o BatchMode=yes -o ConnectTimeout=10" alongside GIT_TERMINAL_PROMPT=0. The focused test asserts both environment settings.
What changed
Root cause
Creating a task without an explicit branch fetches
origin/<baseBranch>before creating the worktree. That fetch had no deadline, so a stalled remote, credential helper, or macOS child-process permission issue could block task preparation indefinitely. Tasks created from an explicit branch bypassed the fetch, which caused the branch-dependent behavior.Validation
bun run lintbunx vitest run --config /tmp/dev3-vitest.config.ts src/bun/__tests__/git-fetch-snapshot.test.ts(17 tests passed)bun run testwas attempted, but the local macOS filesystem stalled while traversingnode_modules; no test failure was reported before termination.