Upstream: Copilot CLI headless server breaks when global CLI cleans shared native module directory #392

@PureWeen

Summary

When the Copilot CLI headless server (spawned by PolyPilot) and a separate global Copilot CLI installation run concurrently, they share the same native module directory (~/.copilot/pkg/darwin-arm64/). The global CLI can clean up the version directory that the headless server loaded at startup, causing all subsequent shell spawns to fail with posix_spawn failed: No such file or directory.

This is a race condition in the CLI's native module management (one process deletes files another process still depends on) and should be reported upstream to the Copilot CLI team.

Root Cause Analysis

The shared directory

Both CLI installations use ~/.copilot/pkg/darwin-arm64/{version}/prebuilds/darwin-arm64/ to store platform-specific native modules:

  • pty.node — terminal multiplexing (Node.js native addon)
  • spawn-helper — binary used by pty.node to spawn shell processes via posix_spawn()
  • keytar.node — credential storage

The conflict

  1. PolyPilot's bundled CLI (Mach-O binary, v1.0.6-0) starts the headless server. At startup, it extracts/loads native modules from ~/.copilot/pkg/darwin-arm64/1.0.2/. The pty.node file is loaded into memory via dlopen().

  2. Global Homebrew CLI (/opt/homebrew/bin/copilot, Node.js-based) starts separately. It extracts its own version (0.0.420) to ~/.copilot/pkg/darwin-arm64/0.0.420/ and deletes the old 1.0.2/ directory as part of version cleanup.

  3. The headless server's pty.node is still loaded in memory (Unix keeps a deleted file's contents alive for as long as a process has it open or memory-mapped), so the server process itself continues running fine.

  4. However, when pty.node tries to spawn a new shell, it calls posix_spawn() on spawn-helper at the original path ~/.copilot/pkg/darwin-arm64/1.0.2/prebuilds/darwin-arm64/spawn-helper, which no longer exists on disk.

  5. All shell spawns fail with: posix_spawn failed: No such file or directory
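
The mechanism in steps 3 to 5 can be reproduced without the CLI. The sketch below is an illustration only, not the CLI's code. It assumes a POSIX system, uses a throwaway temp directory in place of ~/.copilot/pkg/, uses mmap as a stand-in for dlopen(), and creates dummy files named pty.node and spawn-helper. The function name reproduce_stale_helper_path is hypothetical.

```python
import mmap
import os
import shutil
import subprocess
import tempfile


def reproduce_stale_helper_path():
    """Show that a mapped file survives deletion while a path-based spawn fails."""
    root = tempfile.mkdtemp()
    version_dir = os.path.join(root, "1.0.2")  # stand-in for the version directory
    os.makedirs(version_dir)
    module_path = os.path.join(version_dir, "pty.node")
    helper_path = os.path.join(version_dir, "spawn-helper")

    with open(module_path, "wb") as f:
        f.write(b"native module bytes")  # dummy content, not a real addon
    with open(helper_path, "w") as f:
        f.write("#!/bin/sh\nexit 0\n")  # dummy helper script
    os.chmod(helper_path, 0o755)

    # "Server" maps the module at startup (mmap stands in for dlopen()).
    with open(module_path, "rb") as f:
        mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # "Global CLI" removes the old version directory.
    shutil.rmtree(version_dir)

    # The mapping is still readable even though the file is unlinked.
    still_mapped = mapping[:6] == b"native"

    # Spawning the helper by its original path now fails with ENOENT,
    # which Python surfaces as FileNotFoundError.
    try:
        subprocess.run([helper_path], check=True)
        spawn_error = None
    except FileNotFoundError as exc:
        spawn_error = exc

    mapping.close()
    shutil.rmtree(root)
    return still_mapped, spawn_error
```

Calling reproduce_stale_helper_path() should return a true first element and a FileNotFoundError as the second, mirroring a server that stays healthy while every spawn fails.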

Evidence

# The headless server (started 8:18 AM) still has the deleted files loaded:
$ lsof -p 10381 | grep pkg
copilot 10381 user txt REG ... /Users/user/.copilot/pkg/darwin-arm64/1.0.2/prebuilds/darwin-arm64/pty.node
copilot 10381 user txt REG ... /Users/user/.copilot/pkg/darwin-arm64/1.0.2/prebuilds/darwin-arm64/keytar.node

# But the directory is gone from disk:
$ ls ~/.copilot/pkg/darwin-arm64/
0.0.420/    # only version remaining

# The 1.0.2 directory was cleaned up, taking spawn-helper with it
$ ls ~/.copilot/pkg/darwin-arm64/1.0.2
ls: No such file or directory

# Workers hit this error within ~14 seconds of dispatch:
posix_spawn failed: No such file or directory

Timeline

Time      Event
8:18 AM   PolyPilot starts headless server (PID 10381) → loads pty.node from darwin-arm64/1.0.2/
8:27 AM   Global CLI starts (copilot --yolo, PID 14029) → extracts darwin-arm64/0.0.420/, cleans up 1.0.2/
Later     All new shell spawns from headless server fail; spawn-helper path is gone

Impact

  • Multi-agent orchestration completely broken — all workers fail immediately when they try to use shell tools
  • Single sessions affected too — any session trying to run shell commands fails
  • Silent failure — the server itself appears healthy (responds to API requests, creates sessions), but every tool execution that needs a shell dies
  • No auto-recovery — the server must be killed and restarted to pick up the new native module path

Suggested Upstream Fix

The CLI's native module cleanup logic should:

  1. Check for open file handles before deleting version directories (e.g., check for inuse.*.lock files or use lsof)
  2. Use atomic replacement instead of delete-then-create (rename the directory rather than deleting it)
  3. Not clean up other versions' platform-specific directories — only manage its own version
  4. Or: use version-isolated paths that include the CLI binary's own version in the path, so different CLI versions never share native module directories

Alternatively, the spawn-helper path could be resolved relative to the loaded pty.node file descriptor rather than the original filesystem path, making it resilient to directory deletion.
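
As an illustration of the first suggestion, here is a minimal sketch of a cleanup pass that skips version directories still in use. It is not the CLI's actual logic. It assumes a POSIX system and assumes lock files are named inuse.<pid>.lock inside each version directory; that naming and all function names are hypothetical.

```python
import os
import re
import shutil

# Assumed lock file naming: inuse.<pid>.lock
LOCK_PATTERN = re.compile(r"^inuse\.(\d+)\.lock$")


def pid_is_alive(pid):
    """Probe a PID with signal 0, which checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def version_in_use(version_dir):
    """True if any lock file in the directory names a live process."""
    for name in os.listdir(version_dir):
        match = LOCK_PATTERN.match(name)
        if match and pid_is_alive(int(match.group(1))):
            return True
    return False


def cleanup_old_versions(platform_dir, own_version):
    """Remove other version directories unless a live process holds a lock."""
    removed = []
    for name in sorted(os.listdir(platform_dir)):
        path = os.path.join(platform_dir, name)
        if name == own_version or not os.path.isdir(path):
            continue
        if version_in_use(path):
            continue  # another process still depends on this version
        shutil.rmtree(path)
        removed.append(name)
    return removed
```

A window remains between the lock check and the removal, so this narrows the race rather than eliminating it; version-isolated paths (suggestion 4) avoid the shared state altogether.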

PolyPilot Workarounds (Planned)

  • Detect posix_spawn errors in worker results → auto-restart headless server → retry
  • Server health probe before orchestrator dispatch
  • Possibly isolate native module directory to ~/.polypilot/pkg/ if the CLI supports an env var for this
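
The first workaround could look roughly like the sketch below. It is not PolyPilot code: run_task and restart_server are hypothetical callables supplied by the caller, and the worker result is assumed to be a string containing the error text.

```python
SPAWN_ERROR = "posix_spawn failed"  # error text reported by workers


def run_with_server_restart(run_task, restart_server, max_restarts=1):
    """Run a task; on a posix_spawn error, restart the server and retry.

    Returns (result, restarts_performed).
    """
    restarts = 0
    while True:
        result = run_task()
        if SPAWN_ERROR not in result:
            return result, restarts
        if restarts >= max_restarts:
            return result, restarts  # give up and surface the error
        restart_server()  # a fresh server loads the current native module path
        restarts += 1
```

Capping restarts keeps a persistent failure (for example, a missing helper in the new version directory as well) from turning into a restart loop.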

Environment

  • macOS (Darwin, arm64)
  • PolyPilot bundled CLI: Mach-O binary v1.0.6-0 (from GitHub.Copilot.SDK NuGet)
  • Global CLI: Homebrew install, Node.js-based, v1.0.6-0
  • Native module versions: 1.0.2 (loaded by server) vs 0.0.420 (extracted by global CLI)

Related

  • Workers report: All Workers Failed — Shell Environment Broken
  • Error occurs within ~14 seconds of worker dispatch
  • Resource limits are NOT the issue: 591/122880 FDs, 4 active sessions, 6 child processes

Labels: bug
