diff --git a/.claude/context/state-sync-development.md b/.claude/context/state-sync-development.md
new file mode 100644
index 000000000..932575ee3
--- /dev/null
+++ b/.claude/context/state-sync-development.md
@@ -0,0 +1,101 @@
+# State-Sync Development Context
+
+**When to load:** Working on state-sync scripts, Dockerfile, workspace hydration, or S3 sync logic
+
+## Quick Reference
+
+- **Language:** Bash (POSIX-ish, requires bash for arrays and `${var//pattern/}`)
+- **Base image:** Alpine 3.21
+- **Tools:** rclone (S3 sync), git (repo operations), jq (JSON parsing), sqlite3 (WAL checkpoint), bash
+- **Primary files:** `components/runners/state-sync/hydrate.sh`, `components/runners/state-sync/sync.sh`
+- **Spec:** [components/runners/state-sync/spec/spec.md](../../components/runners/state-sync/spec/spec.md)
+
+## Critical Rules
+
+### Input Sanitization
+
+All user-provided path components MUST be stripped to `[a-zA-Z0-9-]`:
+
+```bash
+NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}"
+SESSION_NAME="${SESSION_NAME//[^a-zA-Z0-9-]/}"
+```
+
+Used in S3 and filesystem paths; prevents path traversal.
+
+### Credential Handling
+
+**NEVER log tokens.** The git credential helper writes tokens only to stdout via git credential protocol. It does not echo them.
+
+**ALWAYS strip credentials from persisted URLs:**
+
+```bash
+remote_url=$(echo "${remote_url}" | sed 's|://[^@]*@|://|')
+```
+
+This runs before writing `metadata.json` for git backups.
+
+**Protect rclone config:** The config file contains S3 credentials and MUST be written with `chmod 600`.
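The three rules above (sanitize path components, strip credentials from persisted URLs, protect the rclone config) can be sketched together as follows. This is a minimal sketch, not the code in `hydrate.sh` or `sync.sh`: the function names `sanitize_component`, `strip_url_credentials`, and `write_rclone_config` and the remote name `s3` are assumptions made for illustration. The parameter expansion, the `sed` expression, mode 600, provider `Other`, and ACL `private` are taken from this change.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Strip every character outside [a-zA-Z0-9-] from a path component.
# (Hypothetical helper; the scripts apply the expansion inline.)
sanitize_component() {
  local value="$1"
  printf '%s' "${value//[^a-zA-Z0-9-]/}"
}

# Remove embedded "user:token@" credentials from a remote URL.
# URLs without credentials pass through unchanged.
strip_url_credentials() {
  printf '%s\n' "$1" | sed 's|://[^@]*@|://|'
}

# Write an rclone config that only the owner can read.
# The remote name "s3" is an assumption for this sketch.
write_rclone_config() {
  local config_path="$1"
  mkdir -p "$(dirname "${config_path}")"
  (
    # Restrictive umask so the file is never briefly world-readable.
    umask 077
    cat > "${config_path}" <<EOF
[s3]
type = s3
provider = Other
access_key_id = ${AWS_ACCESS_KEY_ID}
secret_access_key = ${AWS_SECRET_ACCESS_KEY}
endpoint = ${S3_ENDPOINT:-http://minio.ambient-code.svc:9000}
acl = private
EOF
  )
  # Explicit mode in case the file already existed with wider permissions.
  chmod 600 "${config_path}"
}
```

Setting the umask before the write and calling `chmod 600` afterwards covers both a fresh file and a pre-existing one; whether the real scripts do both is not stated here.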
+
+### Error Handling
+
+- `set -e` at script start (both scripts)
+- `set +e` before git clone loops — clone failures are non-fatal
+- `trap 'final_sync' SIGTERM SIGINT` in sync.sh — ensures final backup on shutdown
+- Individual operation failures log warnings and continue; the scripts do not exit on non-critical errors
+
+### Permissions
+
+The 777 permissions on workspace directories are intentional (cross-container UID mismatch, SELinux/SCC fallback). See spec Workspace Structure > Permissions for full rationale.
+
+### S3 Operations
+
+- All S3 access via rclone with `--config /tmp/.config/rclone/rclone.conf`
+- Sync uses `--checksum` (content-based, not timestamp-based); hydrate uses `rclone copy` without checksum
+- Sync passes `--max-size ${MAX_SYNC_SIZE}` — rclone skips individual files exceeding this limit
+- `--copy-links` to follow symlinks
+- `--fast-list` to reduce API calls
+- Hydrate uses 8 transfers (download), sync uses 4 (upload)
+
+## Testing
+
+No automated test suite exists. Validate changes manually:
+
+1. Deploy to a kind cluster: `make kind-up LOCAL_IMAGES=true`
+2. Create a session — verify hydrate logs show workspace creation and repo cloning
+3. Wait for sync cycle — verify S3 contains expected paths (`kubectl exec` into MinIO or use `mc` CLI)
+4. Delete the session pod and recreate — verify state is restored from S3
+5. Test ephemeral mode — remove S3 credentials, verify hydrate succeeds without persistence
+
+Edge cases to test:
+- Private repo without credentials (should warn, not fail)
+- Workflow with invalid subpath (should fall back to full repo)
+- Large workspace exceeding MAX_SYNC_SIZE (should warn, sync anyway)
+- SIGTERM during sync (should complete final sync before exit)
+
+## Common Tasks
+
+### Adding a new sync path
+
+1. Add to `SYNC_PATHS` array in both `hydrate.sh` and `sync.sh`
+2. Add `mkdir -p` and permission setup in `hydrate.sh`
+3. Verify the path is not covered by an exclude pattern
+
+### Adding a new env var
+
+1. Add to the configuration section at the top of the script
+2. Apply sanitization if the value is used in filesystem or S3 paths
+3. Document in `spec/spec.md` under Inputs
+
+### Changing the base image
+
+1. Update `Dockerfile`
+2. Verify all required packages are available (`rclone`, `git`, `jq`, `bash`, `sqlite`)
+3. Test that `stat -c%s` works (GNU coreutils syntax; macOS `stat` differs)
+
+## Key Files
+
+- `hydrate.sh` — init container entrypoint
+- `sync.sh` — sidecar entrypoint
+- `Dockerfile` — container definition
+- `spec/spec.md` — behavioral specification
diff --git a/BOOKMARKS.md b/BOOKMARKS.md
index 6d87eaf97..46d5869e1 100644
--- a/BOOKMARKS.md
+++ b/BOOKMARKS.md
@@ -59,6 +59,10 @@ NextJS patterns, Shadcn UI usage, React Query data fetching, component guideline
 
 Auth flows, RBAC enforcement, token handling, container security patterns.
 
+### [State-Sync Development Context](.claude/context/state-sync-development.md)
+
+Shell scripting conventions, security constraints, and testing approach for state-sync.
+
 ---
 
 ## Code Patterns
@@ -107,6 +111,10 @@ Operator development, watch patterns, reconciliation loop.
 
 Python runner development, Claude Code SDK integration.
 
+### [State-Sync Spec](components/runners/state-sync/spec/spec.md)
+
+Behavioral specification for the session state persistence sidecar.
+
 ### [Public API README](components/public-api/README.md)
 
 Stateless gateway design, token forwarding, input validation.
diff --git a/components/runners/state-sync/spec/spec.md b/components/runners/state-sync/spec/spec.md
new file mode 100644
index 000000000..497e4f29b
--- /dev/null
+++ b/components/runners/state-sync/spec/spec.md
@@ -0,0 +1,225 @@
+# State-Sync Specification
+
+Session state persistence for the Ambient Code Platform. Ensures workspace data survives pod restarts by synchronizing workspace contents to and from S3-compatible object storage.
+
+## Operational Modes
+
+### Init (hydrate)
+
+Runs as a Kubernetes init container before the runner starts. Prepares the workspace:
+
+1. **Create workspace structure** — directories for framework state, artifacts, file uploads, and repositories
+2. **Set permissions** — ownership to uid 1001 (runner user), with 777 fallbacks for cross-container access
+3. **Download prior session state** — if S3 is configured and prior state exists, download framework state, artifacts, and file uploads
+4. **Fetch git credentials** — retrieve GitHub/GitLab tokens from the backend API using the session's bot token
+5. **Install credential helper** — a shell-based git credential helper that maps host patterns to the appropriate token (GitHub or GitLab)
+6. **Clone repositories** — iterate `REPOS_JSON`, clone each repo to `/workspace/repos/{name}` on the specified branch (or default branch)
+7. **Clone workflow** — if `ACTIVE_WORKFLOW_GIT_URL` is set, clone the workflow repo and optionally extract a subpath
+8. **Restore git state** — if S3 contains a `repo-state/` backup, restore branches from bundles, apply uncommitted/staged patches, and verify HEAD matches expectations
+9. **Final permissions** — re-apply ownership and permissions after all downloads and clones
+
+### Sidecar (sync)
+
+Runs alongside the runner container for the lifetime of the session pod. Periodically uploads workspace state:
+
+1. **Wait for workspace population** — 30-second initial delay after pod start
+2. **Sync loop** — every `SYNC_INTERVAL` seconds (default 60):
+   - Check total sync size against `MAX_SYNC_SIZE`
+   - Checkpoint any SQLite WAL files in the framework data directory (defensive — databases are created by the framework runtime and are opaque to state-sync)
+   - Upload framework state, artifacts, and file uploads to S3 via rclone
+   - Write sync metadata (timestamp, session info, paths synced)
+3. **Periodic git backup** — every `REPO_BACKUP_INTERVAL` sync cycles (default 5), back up git repo state:
+   - Create bundles with all refs
+   - Capture uncommitted and staged changes as patches
+   - Write metadata (remote URL with credentials stripped, branch, HEAD SHA, local branches)
+   - Upload to S3 under `repo-state/`
+4. **Graceful shutdown** — on SIGTERM, perform one final git backup + sync before exiting
+
+## Inputs
+
+### Required for persistence
+
+| Variable | Description |
+|---|---|
+| `AWS_ACCESS_KEY_ID` | S3 access key |
+| `AWS_SECRET_ACCESS_KEY` | S3 secret key |
+
+If either is missing, state-sync operates in **ephemeral mode**: hydrate creates the workspace structure but skips S3; sync sleeps indefinitely.
+
+### Session identity
+
+| Variable | Default | Description |
+|---|---|---|
+| `NAMESPACE` | `default` | Kubernetes namespace (sanitized to `[a-zA-Z0-9-]`) |
+| `SESSION_NAME` | `unknown` | Session identifier (sanitized to `[a-zA-Z0-9-]`) |
+
+### S3 configuration
+
+| Variable | Default | Description |
+|---|---|---|
+| `S3_ENDPOINT` | `http://minio.ambient-code.svc:9000` | S3-compatible endpoint URL |
+| `S3_BUCKET` | `ambient-sessions` | Bucket name |
+
+### Framework configuration
+
+| Variable | Default | Description |
+|---|---|---|
+| `RUNNER_STATE_DIR` | `.claude` | Relative path under `/workspace/` for framework state |
+
+### Repository configuration
+
+| Variable | Default | Description |
+|---|---|---|
+| `REPOS_JSON` | (empty) | JSON array of `{url, branch, name}` objects |
+
+### Workflow configuration
+
+| Variable | Default | Description |
+|---|---|---|
+| `ACTIVE_WORKFLOW_GIT_URL` | (empty) | Git URL of the workflow repository |
+| `ACTIVE_WORKFLOW_BRANCH` | `main` | Branch to clone |
+| `ACTIVE_WORKFLOW_PATH` | (empty) | Subpath within the repo to extract |
+
+### Credential sources
+
+| Variable | Description |
+|---|---|
+| `GITHUB_TOKEN` | GitHub personal access token (if pre-set, skips backend fetch) |
+| `GITLAB_TOKEN` | GitLab access token (if pre-set, skips backend fetch) |
+| `BACKEND_API_URL` | Backend API base URL for credential fetch |
+| `BOT_TOKEN` | Authentication token for backend API calls |
+| `PROJECT_NAME` | Project name for credential endpoint path |
+
+### Sync tuning (sidecar only)
+
+| Variable | Default | Description |
+|---|---|---|
+| `SYNC_INTERVAL` | `60` | Seconds between sync cycles |
+| `MAX_SYNC_SIZE` | `1073741824` | Maximum total sync size in bytes (1 GB) |
+| `REPO_BACKUP_INTERVAL` | `5` | Back up git repos every Nth sync cycle |
+
+## Workspace Structure
+
+Hydration produces:
+
+```
+/workspace/
+  {RUNNER_STATE_DIR}/     # Framework state (e.g., .claude/)
+    debug/                # Debug logs (created only when RUNNER_STATE_DIR is ".claude"; excluded from sync regardless)
+  artifacts/              # Output files created by the agent
+  file-uploads/           # User-uploaded files
+  repos/
+    {repo-name}/          # Cloned repositories
+  workflows/
+    {workflow-name}/      # Cloned workflow (or extracted subpath)
+```
+
+### Permissions
+
+The runner container runs as uid 1001 (non-root). The init container runs as root.
+
+| Path | Permissions | Rationale |
+|---|---|---|
+| `{RUNNER_STATE_DIR}/` | 777 | Framework SDK requires write access; group-based permissions don't work across containers with different UIDs |
+| `artifacts/` | 755 | Runner user owns, standard access |
+| `file-uploads/` | 777 | Content sidecar (uid 1001) must write; init container (root) creates |
+| `repos/` | 777 | Runtime repo additions via `clone_repo_at_runtime`; containers may not share groups |
+
+Ownership is set to `1001:0` via `chown` first. The 777 fallback handles environments where `chown` fails (SELinux, OpenShift SCCs with forced fsGroup).
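The chown-then-mode sequence described under Permissions can be sketched as below. This is a minimal sketch under stated assumptions, not the code in `hydrate.sh`: the function names `prepare_dir` and `prepare_workspace` are hypothetical, and applying the mode non-recursively and unconditionally after the `chown` attempt is an assumption. The uid/gid `1001:0` and the per-directory modes come from the table above.

```shell
#!/usr/bin/env bash
set -euo pipefail

RUNNER_UID=1001   # runner user, per the spec
RUNNER_GID=0

# Create a directory, try to hand it to the runner user, then set its mode.
# A failed chown is logged and ignored so that the mode acts as the fallback.
prepare_dir() {
  local dir="$1" mode="$2"
  mkdir -p "${dir}"
  if ! chown "${RUNNER_UID}:${RUNNER_GID}" "${dir}" 2>/dev/null; then
    echo "WARN: chown failed for ${dir}; relying on mode ${mode}" >&2
  fi
  chmod "${mode}" "${dir}"
}

# Lay out the workspace using the modes from the Permissions table.
prepare_workspace() {
  local root="$1" state_dir="${2:-.claude}"
  prepare_dir "${root}/${state_dir}" 777
  prepare_dir "${root}/artifacts" 755
  prepare_dir "${root}/file-uploads" 777
  prepare_dir "${root}/repos" 777
}
```

In an init container this would be invoked as something like `prepare_workspace /workspace "${RUNNER_STATE_DIR:-.claude}"`. Treating the `chown` failure as a warning keeps hydration going on clusters where ownership changes are blocked.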
+
+## S3 Storage Layout
+
+```
+s3://{bucket}/{namespace}/{session_name}/
+  {RUNNER_STATE_DIR}/       # Framework state files
+  artifacts/                # Agent output files
+  file-uploads/             # User-uploaded files
+  repo-state/
+    {repo-name}/
+      repo.bundle           # Git bundle with all refs
+      uncommitted.patch     # Uncommitted tracked changes
+      staged.patch          # Staged changes
+      metadata.json         # Remote URL, branch, HEAD SHA, local branches, timestamp
+  metadata.json             # Sync metadata (last sync time, session info, paths synced)
+```
+
+### Sync exclusions
+
+The following patterns are excluded from S3 sync:
+
+- `repos/**` — git handles this separately via bundles
+- `node_modules/**`, `.venv/**`, `__pycache__/**`, `*.pyc` — dependency artifacts
+- `.cache/**`, `target/**`, `dist/**`, `build/**` — build artifacts
+- `.git/**` — git internals (bundled separately)
+- `debug/**` — debug logs with symlinks that break rclone
+
+## Behavioral Invariants
+
+1. **Repo clone failures are non-fatal.** Individual repository clone failures MUST log a warning and continue. Other repos and the rest of workspace initialization MUST proceed.
+
+2. **S3 unavailability does not block workspace creation.** If S3 credentials are missing or the endpoint is unreachable, hydration MUST create the workspace structure and exit successfully. The session operates in ephemeral mode.
+
+3. **Credentials never appear in logs or persisted metadata.** The git credential helper writes tokens only to stdout in git credential protocol format. `backup_git_repos` strips embedded credentials from remote URLs before writing `metadata.json` (via `sed 's|://[^@]*@|://|'`).
+
+4. **Final sync on shutdown.** The sidecar MUST trap SIGTERM and perform a complete git backup + workspace sync before exiting. This is the primary mechanism for preserving uncommitted work.
+
+5. **SQLite WAL checkpoint before sync.** Before uploading framework state, all `.db` files MUST be checkpointed (`PRAGMA wal_checkpoint(TRUNCATE)`) to ensure consistent backups. The `.db` files are created by the framework runtime (e.g., Claude Code CLI) and their contents are opaque to state-sync.
+
+6. **Sync size enforcement.** Total sync size MUST be checked against `MAX_SYNC_SIZE` before each cycle. If exceeded, a warning is logged but sync proceeds. Additionally, rclone enforces `--max-size` per-file — individual files exceeding `MAX_SYNC_SIZE` are silently skipped by rclone.
+
+7. **Input sanitization.** `NAMESPACE` and `SESSION_NAME` MUST be stripped to `[a-zA-Z0-9-]` to prevent path traversal in both S3 paths and local filesystem paths.
+
+8. **Rclone config protection.** The rclone configuration file (which contains S3 credentials) MUST be written with mode 600.
+
+## Failure Modes
+
+| Scenario | Behavior |
+|---|---|
+| S3 not configured (missing credentials) | Hydrate: creates workspace, exits 0. Sync: sleeps forever (keeps sidecar alive). |
+| S3 unreachable | Hydrate: workspace created without prior state, exits 0. Sync: logs error, retries next interval. |
+| Repo clone fails (auth, network, etc.) | Warning logged, other repos continue. |
+| Workflow clone fails | Warning logged, no workflow available. |
+| Workflow subpath not found | Warning logged, falls back to entire cloned repo. |
+| Git bundle fetch fails during restore | Warning logged, repo stays at freshly-cloned state. |
+| Patch apply fails during restore | Warning logged (likely merge conflicts), repo stays at bundle state. |
+| HEAD SHA mismatch after restore | Warning logged (diverged state), no corrective action taken. |
+| Sync size exceeds MAX_SYNC_SIZE | Warning logged, sync proceeds anyway. |
+
+## Interfaces
+
+### Operator
+
+The Kubernetes operator configures state-sync by setting environment variables on the init container and sidecar container specs. The operator controls:
+- Session identity (`NAMESPACE`, `SESSION_NAME`)
+- S3 credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
+- Repository configuration (`REPOS_JSON`)
+- Workflow configuration (`ACTIVE_WORKFLOW_GIT_URL`, `ACTIVE_WORKFLOW_BRANCH`, `ACTIVE_WORKFLOW_PATH`)
+- Framework selection (`RUNNER_STATE_DIR`)
+- Backend API access (`BACKEND_API_URL`, `BOT_TOKEN`, `PROJECT_NAME`)
+
+### Runner container
+
+Reads the `/workspace/` directory structure created by hydration. Expects:
+- Repos cloned to `/workspace/repos/{name}`
+- Framework state directory at `/workspace/{RUNNER_STATE_DIR}`
+- Artifacts directory at `/workspace/artifacts`
+- File uploads at `/workspace/file-uploads`
+
+### S3 / MinIO
+
+All S3 operations use rclone. Configuration:
+- Provider type: `Other` (S3-compatible), ACL: `private`
+- Sync (upload) uses `--checksum` for content-based comparison; hydrate (download) uses `rclone copy` without checksum
+- Transfers: 8 (hydrate download), 4 (sync upload)
+- `--fast-list` and `--copy-links` enabled
+
+### Backend API
+
+The init container fetches git credentials from `{BACKEND_API_URL}/projects/{PROJECT_NAME}/agentic-sessions/{SESSION_NAME}/credentials/{provider}` using `BOT_TOKEN` for authentication. Providers: `github`, `gitlab`. Tokens are only fetched if not already present in the environment.
+
+## Container
+
+- **Base image:** Alpine 3.21
+- **Installed packages:** rclone, git, jq, bash, sqlite
+- **Entrypoint:** `/usr/local/bin/sync.sh` (sidecar mode)
+- **Init container usage:** overrides entrypoint to `/usr/local/bin/hydrate.sh`
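The sidecar behaviour this spec describes for the `sync.sh` entrypoint (ephemeral mode, initial delay, sync loop, git backup every Nth cycle, final sync on shutdown) can be sketched as a control-flow skeleton. This is a minimal sketch under stated assumptions, not the actual script: `sync_workspace`, `run_cycle`, `is_ephemeral`, `interruptible_sleep`, and `main` are hypothetical names, the stub bodies only print a marker, and the ordering of git backup before sync within a cycle is an assumption. `final_sync`, `backup_git_repos`, and the variable names and defaults come from the spec.

```shell
#!/usr/bin/env bash
set -euo pipefail

SYNC_INTERVAL="${SYNC_INTERVAL:-60}"
REPO_BACKUP_INTERVAL="${REPO_BACKUP_INTERVAL:-5}"

# Stubs standing in for the real work (rclone upload, git bundles).
sync_workspace()   { echo "sync"; }
backup_git_repos() { echo "git-backup"; }

# Ephemeral mode applies when either S3 credential is missing.
is_ephemeral() {
  [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]
}

# One sync cycle; the git backup runs on every Nth cycle.
run_cycle() {
  local cycle="$1"
  if (( cycle % REPO_BACKUP_INTERVAL == 0 )); then
    backup_git_repos
  fi
  sync_workspace
}

# Shutdown handler: one last git backup plus sync, then exit.
final_sync() {
  backup_git_repos
  sync_workspace
  exit 0
}

# Sleep in the background and wait on it, so a trapped signal
# interrupts the wait instead of being deferred until sleep ends.
interruptible_sleep() {
  sleep "$1" &
  wait $! || true
}

main() {
  if is_ephemeral; then
    # No persistence: keep the sidecar alive without syncing.
    while true; do interruptible_sleep 3600; done
  fi
  trap 'final_sync' SIGTERM SIGINT
  interruptible_sleep 30   # initial delay for workspace population
  local cycle=0
  while true; do
    cycle=$(( cycle + 1 ))
    run_cycle "${cycle}" || echo "WARN: sync cycle ${cycle} failed" >&2
    interruptible_sleep "${SYNC_INTERVAL}"
  done
}
```

Calling `main` at the end of the script would start the loop. Backgrounding `sleep` and waiting on it matters because bash defers trap handlers until the current foreground command finishes, which would otherwise delay the final sync by up to one full interval.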