Skip to content

[codex] Fix stuck origin fetch during task creation#656

Open
kerncore wants to merge 4 commits into
h0x91b:mainfrom
kerncore:codex/timeout-stuck-origin-fetch
Open

[codex] Fix stuck origin fetch during task creation#656
kerncore wants to merge 4 commits into
h0x91b:mainfrom
kerncore:codex/timeout-stuck-origin-fetch

Conversation

@kerncore

Copy link
Copy Markdown

What changed

  • Add a 30-second deadline to remote git fetches used before creating task worktrees.
  • Disable interactive Git credential prompts for these background fetches.
  • Terminate a timed-out fetch so the per-repository fetch queue can continue.
  • Fall back to the local base branch through the existing worktree creation path.
  • Add a regression test covering process termination and queue release.

Root cause

Creating a task without an explicit branch fetches origin/<baseBranch> before creating the worktree. That fetch had no deadline, so a stalled remote, credential helper, or macOS child-process permission issue could block task preparation indefinitely. Tasks created from an explicit branch bypassed the fetch, which caused the branch-dependent behavior.

Validation

  • bun run lint
  • bunx vitest run --config /tmp/dev3-vitest.config.ts src/bun/__tests__/git-fetch-snapshot.test.ts (17 tests passed)
  • bun run test was attempted, but the local macOS filesystem stalled while traversing node_modules; no test failure was reported before termination.

@h0x91b h0x91b left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kerncore — Claude here (the AI assistant working with Arseny on this repo). First off: awesome first contribution, and a genuinely great one. You nailed the actual root cause — a synchronously-required network fetch with no deadline and an unreachable fallback — instead of chasing the misleading Full Disk Access symptom. The reproduce-first test with the hang mock and the queue-release assertion is exactly the right approach. Really solid work.

The no-timeout path of run() stays byte-for-byte equivalent to the old behavior, backoff-on-timeout is the right call, and the changelog is accurate. Two things I'd like addressed before merging — both about making the timeout actually unbreakable on the SSH path. Details inline.

Thanks again for digging in on this one — please keep them coming!

Comment thread src/bun/git.ts
: { code: await proc.exited, timedOut: false as const };
if (timeoutId) clearTimeout(timeoutId);
if (outcome.timedOut) {
proc.kill();

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Residual hang risk: proc.kill() only signals git, not its children.

git fetch over a git@ (SSH) remote forks ssh, which inherits fd 2 (our stderr pipe). Killing only the git process can leave ssh alive holding the write-end of that pipe, so new Response(proc.stderr).text() never gets EOF and the await Promise.all([stdoutPromise, stderrPromise]) two lines below hangs forever — the exact hang this timeout was meant to kill.

The mock test doesn't catch it because its streams close synchronously. Suggested hardening:

  • kill the whole process group rather than just git, and/or
  • after a timed-out kill, don't do an unbounded await on the streams — race them against a short deadline (or skip them) so run() is guaranteed to return.

Worth a real git@-remote-with-hanging-ssh smoke test to confirm run() returns in ~timeoutMs.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3c766dd. The timeout path no longer starts unbounded Response.text() readers. After killing git, process exit is bounded to a 1s grace period, raw stdout/stderr streams are cancelled with the same bound, and run() returns even when an SSH child keeps pipe descriptors open. The regression mock now leaves both streams open and asserts bounded completion plus queue release.

Comment thread src/bun/git.ts Outdated
const result = await run(cmd, projectPath);
const result = await run(cmd, projectPath, {
timeoutMs,
env: { GIT_TERMINAL_PROMPT: "0" },

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GIT_TERMINAL_PROMPT=0 only silences git's own prompts, not ssh's.

For an SSH remote, ssh can still block on a passphrase or an unknown host-key prompt and hang (then only the timeout + the kill above saves us). Better to prevent the interactive hang at the source:

env: {
  GIT_TERMINAL_PROMPT: "0",
  GIT_SSH_COMMAND: "ssh -o BatchMode=yes -o ConnectTimeout=10",
},

BatchMode=yes makes ssh fail fast instead of prompting, and ConnectTimeout bounds the TCP connect — together they address the most common stuck-fetch cause before the 30s deadline even matters.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3c766dd. Background fetches now set GIT_SSH_COMMAND="ssh -o BatchMode=yes -o ConnectTimeout=10" alongside GIT_TERMINAL_PROMPT=0. The focused test asserts both environment settings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants