feat: implement ssh_remote and colab_drive experiment modes #17

Open
Pseudoataraxia wants to merge 4 commits into aiming-lab:main from Pseudoataraxia:feat/ssh-remote-and-colab-drive

@Pseudoataraxia

Summary

  • Implement ssh_remote experiment mode: execute experiments on remote GPU servers via SSH, with optional Docker-over-SSH for full container isolation
  • Implement colab_drive experiment mode: async execution via Google Drive sync with a Colab worker notebook — no SSH tunnel needed
  • Fix bug where llm.acp.timeout_sec was parsed from config but never passed to ACPClient, causing all ACP calls to use the hardcoded 600s default

Details

ssh_remote mode

  • New SshRemoteSandbox class implementing SandboxProtocol
  • Two execution modes: bare Python (with HOME override + unshare --net sandboxing) or Docker-over-SSH (full container isolation)
  • Extended SshRemoteConfig with user, port, key_path, remote_python, setup_commands, and Docker options
  • Explicit WARNING when unshare is unavailable instead of silent fallback
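
The bare Python path described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual SshRemoteSandbox code: the config field names come from the description, while the defaults, the helper name build_bare_python_command, and the have_unshare flag are hypothetical.

```python
import os
import shlex
import sys
from dataclasses import dataclass


@dataclass
class SshRemoteConfig:
    # Field names follow the PR description; the defaults are assumptions.
    host: str
    user: str = ""
    port: int = 22
    key_path: str = ""
    remote_python: str = "python3"


def build_bare_python_command(cfg, remote_workdir, script, have_unshare):
    """Return an argv list (hypothetical helper) that runs `script` remotely."""
    ssh = ["ssh", "-p", str(cfg.port), "-o", "BatchMode=yes"]
    if cfg.key_path:
        ssh += ["-i", os.path.expanduser(cfg.key_path)]
    target = f"{cfg.user}@{cfg.host}" if cfg.user else cfg.host

    run = f"{shlex.quote(cfg.remote_python)} {shlex.quote(script)}"
    if have_unshare:
        # Run inside a fresh network namespace, so no network access.
        run = f"unshare --net {run}"
    else:
        # No silent fallback: tell the user isolation is missing.
        print(
            "WARNING: unshare is unavailable on the remote host; "
            "experiment code will have network access",
            file=sys.stderr,
        )

    workdir = shlex.quote(remote_workdir)
    # HOME override keeps the experiment away from the real home directory.
    remote = f"cd {workdir} && HOME={workdir} {run}"
    return ssh + [target, remote]
```

The returned list can be passed to subprocess.run() without a local shell. Whether the remote host has unshare would be probed separately, for example by running `command -v unshare` over the same SSH connection.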

colab_drive mode

  • New ColabDriveSandbox class implementing SandboxProtocol
  • Protocol: pipeline writes to pending/, Colab worker polls and executes, results written to done/
  • Includes ready-to-use Colab worker notebook template (auto-generated on first run)
  • Configurable poll interval and timeout for Google Drive sync latency
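
A minimal sketch of the pipeline side of this protocol, assuming a locally synced Drive folder. The pending/ and done/ directory names come from the description; the per-task layout (main.py, result.json), the task ID scheme, and both function names are assumptions made for illustration.

```python
import json
import time
import uuid
from pathlib import Path


def submit_task(drive_root: Path, code: str) -> str:
    """Write a task into pending/ for the Colab worker to pick up."""
    task_id = uuid.uuid4().hex
    task_dir = drive_root / "pending" / task_id
    task_dir.mkdir(parents=True)
    (task_dir / "main.py").write_text(code)  # file name is an assumption
    return task_id


def wait_for_result(drive_root: Path, task_id: str,
                    poll_interval_sec: float = 30,
                    timeout_sec: float = 3600) -> dict:
    """Poll done/ until the worker's result appears or the timeout expires."""
    result_file = drive_root / "done" / task_id / "result.json"
    deadline = time.monotonic() + timeout_sec
    while True:
        if result_file.exists():
            try:
                return json.loads(result_file.read_text())
            except json.JSONDecodeError:
                pass  # file may be only partially synced; try again
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result for task {task_id} after {timeout_sec}s")
        time.sleep(poll_interval_sec)
```

The generous default poll interval and timeout mirror the config example and reflect that Google Drive sync can lag well behind local writes.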

ACP timeout fix

  • ACPClient.from_rc_config() was not passing timeout_sec from config, so all ACP calls used the hardcoded 600s default regardless of config
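
The shape of the fix can be illustrated with stand-in classes. The real ACPClient and its config object have more fields and possibly a different from_rc_config() signature; AcpConfig and everything except the timeout_sec name and the 600s default are assumptions here.

```python
from dataclasses import dataclass

DEFAULT_TIMEOUT_SEC = 600  # the hardcoded default mentioned in the PR


@dataclass
class AcpConfig:
    # Stand-in for the parsed llm.acp config section.
    timeout_sec: int = DEFAULT_TIMEOUT_SEC


@dataclass
class ACPClient:
    # Stand-in for the real client; only the timeout is modeled.
    timeout_sec: int = DEFAULT_TIMEOUT_SEC

    @classmethod
    def from_rc_config(cls, acp_cfg: AcpConfig) -> "ACPClient":
        # Before the fix the equivalent of `return cls()` dropped the
        # configured value, so every call used DEFAULT_TIMEOUT_SEC.
        return cls(timeout_sec=acp_cfg.timeout_sec)
```

With this passthrough, a config value such as timeout_sec: 1800 reaches the client instead of being silently replaced by 600.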

Test plan

  • 28 new unit tests covering command building, connectivity checks, mocked execution, factory integration, and the ACP timeout fix
  • All 1251 existing tests pass (no regressions)
  • End-to-end verified: SSH sandbox on macOS localhost with real ssh + scp
  • End-to-end verified: Colab Drive sandbox with simulated worker (submit → poll → collect)
  • Not yet tested on a real remote GPU server or real Google Colab instance

Config examples

# SSH remote (bare Python)
experiment:
  mode: "ssh_remote"
  ssh_remote:
    host: "gpu-server.lab.edu"
    user: "researcher"
    key_path: "~/.ssh/id_rsa"
    gpu_ids: [0]

# SSH remote (Docker-over-SSH, most secure)
experiment:
  mode: "ssh_remote"
  ssh_remote:
    host: "gpu-server.lab.edu"
    user: "researcher"
    use_docker: true
    docker_network_policy: "none"
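
For orientation, a Docker-over-SSH invocation matching the example above might be assembled like this. It is a sketch only: the mapping of docker_network_policy to `--network`, the /workspace mount point, the image parameter, and the function name are all assumptions rather than the PR's implementation.

```python
import shlex


def build_docker_over_ssh_command(host, user, image, remote_workdir, script,
                                  network_policy="none", gpu_ids=()):
    """Return an argv list that runs `script` in a container on the remote host."""
    docker = [
        "docker", "run", "--rm",
        "--network", network_policy,           # "none" disables networking
        "-v", f"{remote_workdir}:/workspace",  # mount the experiment files
        "-w", "/workspace",
    ]
    if gpu_ids:
        ids = ",".join(str(i) for i in gpu_ids)
        # Docker expects literal double quotes around a device list.
        docker += ["--gpus", f'"device={ids}"']
    docker += [image, "python3", script]
    # Quote each part so the remote shell sees the same argv.
    remote = " ".join(shlex.quote(part) for part in docker)
    return ["ssh", f"{user}@{host}", remote]
```

Quoting every element with shlex.quote matters because the remote side re-parses the command through a shell, unlike the local argv list.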

# Colab via Google Drive (no SSH needed)
experiment:
  mode: "colab_drive"
  colab_drive:
    drive_root: "~/Library/CloudStorage/GoogleDrive-you@gmail.com/My Drive/researchclaw"
    poll_interval_sec: 30
    timeout_sec: 3600

🤖 Generated with Claude Code

Pseudoataraxia and others added 4 commits March 16, 2026 23:30
Add two new experiment execution backends:

- ssh_remote: execute experiments on remote GPU servers via SSH,
  with optional Docker-over-SSH for full container isolation.
  Includes basic sandboxing (HOME override + unshare --net).

- colab_drive: async execution via Google Drive sync. A Colab
  notebook polls a shared Drive folder for tasks, executes them,
  and writes results back. No SSH tunnel needed.

Also fixes a bug where llm.acp.timeout_sec was parsed from config
but never passed to ACPClient, causing all ACP calls to use the
hardcoded 600s default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Covers command building, connectivity checks, mocked execution flows,
Google Drive async submit-poll-collect, factory integration, and
the ACP timeout_sec passthrough fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…mode

Previously, if unshare was not available on the remote host (e.g. macOS),
the sandbox silently fell back to running without network isolation.
Now it checks for unshare first and prints a clear WARNING to stderr,
so users know their experiment code has network access.

Verified end-to-end on macOS localhost SSH.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document the execution flow, protocol, and result schema for
_execute(), _submit_and_wait(), and _collect_result().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>