CCimen · Apr 6, 2026 · Apr 3, 2026 · Apr 4, 2026
diff --git a/.gitignore b/.gitignore
@@ -62,3 +62,41 @@ safetypluginclone/.claude/.state/
 
 # Local refactor checklist
 refactor-plan.md
+
+# GSD runtime/state noise (keep curated project docs + milestone docs committed)
+.gsd/STATE.md
+.gsd/activity/
+.gsd/auto.lock
+.gsd/completed-units.json
+.gsd/completed-units-*.json
+.gsd/event-log.jsonl
+.gsd/gsd.db
+.gsd/gsd.db-*
+.gsd/journal/
+.gsd/metrics.json
+.gsd/OVERRIDES.md
+.gsd/PREFERENCES.md
+.gsd/recovery/
+.gsd/reports/
+.gsd/runtime/
+.gsd/state-manifest.json
+.gsd/worktrees/
+.bg-shell/
+coverage.json
+READMEGSD.md
+
+# ── GSD baseline (auto-generated) ──
+.gsd-id
+*.code-workspace
+.env
+.env.*
+!.env.example
+node_modules/
+.next/
+*.pyc
+target/
+vendor/
+*.log
+coverage/
+.cache/
+tmp/
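Git evaluates ignore patterns in order, and within one file the last matching pattern decides. That is why `!.env.example` has to follow `.env.*` for the example file to stay tracked. The sketch below models only that ordering rule for basename patterns using `fnmatch`. It deliberately ignores directory-only patterns (trailing `/`), anchoring, `**`, and git's rule that a file cannot be re-included when a parent directory is excluded, so treat it as an illustration rather than a gitignore implementation.

```python
from fnmatch import fnmatchcase

# Simplified model of .gitignore evaluation for basename patterns only.
# Real git also handles directory-only patterns (trailing "/"), anchoring,
# "**", and the rule that files under an excluded directory cannot be re-included.
PATTERNS = [".env", ".env.*", "!.env.example", "*.log", "*.pyc"]


def is_ignored(basename: str, patterns: list[str] = PATTERNS) -> bool:
    ignored = False
    for raw in patterns:
        negated = raw.startswith("!")
        pattern = raw[1:] if negated else raw
        if fnmatchcase(basename, pattern):
            # Last matching pattern wins: a negation re-includes the file.
            ignored = not negated
    return ignored


for name in [".env", ".env.local", ".env.example", "app.log", "main.py"]:
    print(f"{name:15} ignored={is_ignored(name)}")
```

For checks against the real rules, `git check-ignore -v <path>` reports which pattern, file, and line matched.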
diff --git a/.gsd/DECISIONS.md b/.gsd/DECISIONS.md
diff --git a/.gsd/KNOWLEDGE.md b/.gsd/KNOWLEDGE.md
diff --git a/.gsd/PROJECT.md b/.gsd/PROJECT.md
@@ -0,0 +1,114 @@
+# Sandboxed Coding CLI (SCC)
+
+## What the project is
+SCC is a governed runtime for coding agents. It lets organizations run approved agents inside portable sandboxes with explicit policy, team-level configuration, safer defaults, and runtime-enforced controls that are explainable to security reviewers.
+
+## What the project is not
+- not a new general-purpose coding agent
+- not a wrapper that stays Claude-only forever
+- not a Docker Desktop-only product
+- not a fake security story built on advisory naming
+- not a proprietary skills ecosystem
+
+## Current v1 product target
+The v1 target is a clean architecture on top of `scc-sync-1.7.3` that supports Claude Code and Codex through the same provider-neutral core, portable OCI runtimes, enforced web egress, and a shared runtime safety engine.
+
+## Strategic success condition
+A security or platform team can approve SCC because its governance model, runtime enforcement, and diagnostics are understandable and inspectable, while developers can switch providers and team contexts without rebuilding their world. The implementation should also become easier to change over time, not more brittle.
+
+## Cross-cutting engineering priority
+- Maximize maintainability, clean architecture, and clean code while delivering milestones.
+- Prefer smaller cohesive modules, typed seams, and composition-root boundaries over growing central orchestrators.
+- When a slice touches a large or fragile file, plan the smallest safe extraction that improves testability and future changeability.
+- Pair refactors with characterization or contract tests so maintainability work stays measurable.
+
+## Milestone history
+
+### M001 — Provider-Neutral Launch Boundary ✅
+Established typed contracts (core/contracts.py), the AgentProvider protocol, and a provider-neutral seam for launch, runtime, network, safety, and audit planning.
+
+### M002 — Provider-Neutral Launch Pipeline ✅
+Made AgentProvider and AgentLaunchSpec part of the real launch path. Claude settings are adapter-owned. Codex is a first-class provider. Preflight validation, a durable JSONL audit sink, and an application-owned support bundle converged. Launch-wizard resume logic was extracted to typed helpers.
+
+### M003 — Portable Runtime And Enforced Web Egress ✅
+Delivered portable OCI sandbox backend (no Docker Desktop dependency) with topology-enforced web egress via Squid proxy sidecar, provider destination validation, operator diagnostics, and docs truthfulness guardrails. +178 net new tests (3464 total).
+
+### M004 — Cross-Agent Runtime Safety ✅
+Delivered shared safety policy and verdict engine, runtime wrapper baseline, provider-specific safety adapters, fail-closed policy loader, safety audit reader, doctor safety-policy check, and `scc support safety-audit` CLI command. +289 net new tests (3790 total).
+
+### M005 — Architecture Quality, Strictness, And Hardening ✅
+Delivered architecture-quality hardening: module decomposition (15 files split), a typed governed-artifact model hierarchy, a provider-neutral bundle resolution/rendering pipeline, 100% branch coverage on pipeline modules, D023 portable artifact rendering, and 18 truthfulness guardrail tests. Final: 4486 tests.
+
+### M006 — Provider Selection UX and End-to-End Codex Launch ✅
+SCC became a genuine multi-provider runtime. Users choose Claude or Codex via config or CLI flag (`scc provider show/set`, `scc start --provider codex`), validated against org/team policy. Provider identity flows through container naming, volume naming, session identity, machine-readable outputs (dry-run JSON, support bundle, session list). CodexAgentRunner adapter with Codex-specific image, settings, and argv. Provider-aware branding ("Sandboxed Coding CLI"), doctor image check with exact build commands, and 16 coexistence proofs. 153 new tests, 4643 total, zero regressions.
+
+### M007 — Provider Neutralization, Operator Truthfulness, and Legacy Claude Cleanup ✅
+Eliminated Claude assumptions from shared/core/operator paths. ProviderRuntimeSpec replaces 5 scattered dicts. Settings serialization is provider-owned (rendered_bytes, not dict). Config layering is provider-native (Claude home-scoped, Codex workspace-scoped). Unknown providers fail closed. Auth readiness is adapter-owned via auth_check() on AgentProvider. Runtime permission normalization. Config freshness guarantee on every fresh launch. Doctor is provider-aware with --provider flag and categorized output. Core constants stripped to product-level only. 32 truthfulness guardrail tests. 166 net new tests, 4820 total.
+
+### M008 — Cross-Flow Consistency, Reliability, and Maintainability Hardening ✅
+Consolidated five duplicated launch preflight sequences into one shared module. S01: shared preflight module with typed LaunchReadiness model, flow.py and flow_interactive.py migrated, 7 structural guardrail tests. S02: auth vocabulary truthfulness (three-tier distinction), Docker Desktop removed from active paths, provider adapter dispatch consolidated via shared get_agent_provider() helper, 15 new guardrail tests. S03: 106 edge-case and regression-guard tests covering workspace persistence, resume-after-drift, setup idempotency, and error message quality. Auth bootstrap exception wrapping. Legacy Docker Desktop module documentation. 294 net new tests (5114 total), zero regressions.
+
+### M009 — Preflight Convergence and Auth Bootstrap Unification ✅
+All five launch sites (flow.py, flow_interactive.py, worktree_commands.py, orchestrator_handlers.py, and the start command) now use collect_launch_readiness() + ensure_launch_ready() through the shared preflight module. ensure_launch_ready() now calls bootstrap_auth() when auth is missing, closing a silent gap. auth_bootstrap.py was reduced to a deprecated redirect. Auth messaging is centralized in preflight._ensure_auth(). Setup's _render_provider_status uses _three_tier_status(), so the onboarding panel and the completion summary show the same four-state readiness vocabulary. D048 superseded by D049. 3 net new tests (5117 total).
+
+## Milestone order
+1. ~~M001 — Provider-Neutral Launch Boundary~~ ✅
+2. ~~M002 — Provider-Neutral Launch Pipeline~~ ✅
+3. ~~M003 — Portable Runtime And Enforced Web Egress~~ ✅
+4. ~~M004 — Cross-Agent Runtime Safety~~ ✅
+5. ~~M005 — Architecture Quality, Strictness, And Hardening~~ ✅
+6. ~~M006 — Provider Selection UX and End-to-End Codex Launch~~ ✅
+7. ~~M007 — Provider Neutralization, Operator Truthfulness, and Legacy Claude Cleanup~~ ✅
+8. ~~M008 — Cross-Flow Consistency, Reliability, and Maintainability Hardening~~ ✅
+9. ~~M009 — Preflight Convergence and Auth Bootstrap Unification~~ ✅
+
+## Requirement status
+- **R001: maintainability in touched high-churn areas** — ✅ validated. Advanced through all nine milestones.
+
+## Current verification baseline
+- `uv run ruff check` ✅
+- `uv run mypy src/scc_cli` ✅ (303 files, 0 issues)
+- `uv run pytest -q` ✅ (5117 passed, 23 skipped, 2 xfailed)
+- Zero files in src/scc_cli/ exceed 1100 lines
+- One file in the 800–1100-line zone is justified (compute_effective_config.py at 852 lines, 93% coverage)
+
+## Known deferred items
+- Wizard cast cleanup (23 casts in wizard.py/flow_interactive.py) — deferred per D018
+- Legacy module coverage (docker_sandbox_runtime 30%, overall 74%) — deprioritized per D017/D021 user overrides
+- Portable MCP stdio transport support — requires additional source metadata
+- Live bundle registry integration — renderers write metadata references only
+- Dashboard provider switching TUI feature (dashboard 'a' key)
+- Container labels (scc.provider=<id>) for external tooling discovery
+- Image build/push pipeline for scc-agent-codex
+- Podman support on the same SandboxRuntime contracts
+- `scc auth login/status/logout` commands — model supports them via auth_check()
+- Fine-grained volume splitting (auth-only vs ephemeral) for enterprise data-retention (D036)
+- start_claude parameter rename to start_agent in worktree_commands.py (deferred from M008/S01)
+- WorkContext.provider_id threading through _record_session_and_context (deferred from M008/S01)
+- Delete auth_bootstrap.py entirely after updating test consumers to use preflight directly
+
+## Key architecture invariants
+- `bootstrap.py` is the sole composition root for adapter symbols consumed outside `scc_cli.adapters`.
+- `AgentLaunchSpec.env` stays empty for file-based providers; provider config travels via `artifact_paths`.
+- The canonical provider-adapter characterization shape is: capability metadata, clean-spec, settings-artifact, and env-is-clean.
+- Adding a provider to `DefaultAdapters` still requires the same four touch points: adapter file, bootstrap wiring, fake adapters factory, and inline test constructions.
+- Provider-core destination validation belongs before launch, not as a runtime surprise.
+- RuntimeProbe protocol is the canonical detection surface for runtime capabilities; no consumer outside the adapter layer should call docker.check_docker_available() directly.
+- Bootstrap probes runtime at construction time and selects OciSandboxRuntime or DockerSandboxRuntime based on preferred_backend.
+- OciSandboxRuntime is imported only in bootstrap.py; application layer uses SandboxRuntime protocol.
+- Enforced web-egress uses internal Docker network + dual-homed Squid proxy sidecar as the hard enforcement boundary (D014).
+- Safety engine is provider-neutral: DefaultSafetyEngine in core orchestrates shell tokenizer + git rules + network tool rules. Fail-closed semantics.
+- SafetyPolicy loader is fail-closed: any parse failure → default block policy. Uses raw org config (not NormalizedOrgConfig).
+- Provider safety adapters are pure UX/audit wrappers with zero verdict logic — the engine is the single source of safety truth.
+- Import boundary guard (test_import_boundaries.py) mechanically enforces layer separation via AST scanning.
+- **Launch preflight is fully unified via commands/launch/preflight.py (D046, D049):** resolve_launch_provider() → collect_launch_readiness() → ensure_launch_ready() is the canonical three-function sequence used by all five launch sites. ensure_launch_ready() calls bootstrap_auth() when auth is missing. Auth messaging lives in _ensure_auth() only.
+- Renderers return fragment dicts for caller-owned merge — they do not write shared config files (settings.local.json, .mcp.json) directly.
+- **ProviderRuntimeSpec** (frozen dataclass in `core/contracts.py`) is the single source of truth for provider runtime details. **PROVIDER_REGISTRY** in `core/provider_registry.py` maps provider_id → spec.
+- Unknown, forbidden, or unavailable providers fail closed in active launch logic — never silently fall back to Claude.
+- **AgentRunner owns settings serialization format**: `build_settings()` produces `rendered_bytes: bytes` + `path` + `suffix`, not dict.
+- **Product name is 'SCC — Sandboxed Coding CLI'** consistently across README, pyproject.toml, CLI branding, D045, and all user-facing surfaces.
+- **Auth vocabulary is three-tier truthful**: 'auth cache present' (file exists), 'image available' (container image present), 'launch-ready' (both). No surface uses 'connected' or standalone 'ready' to describe partial state. All setup surfaces (onboarding panel and completion summary) use the single _three_tier_status() helper.
+- **Docker Desktop references** are confined to docker/, adapters/, core/errors.py, and doctor/ layers only. Active user-facing commands/ paths use 'Docker' or 'container runtime'.
+- **Provider adapter dispatch** uses a shared `get_agent_provider(adapters, provider_id)` helper in dependencies.py — no hardcoded per-site dispatch dicts.
+- **40+ guardrail tests** across test_docs_truthfulness.py, test_auth_vocabulary_guardrail.py, test_lifecycle_inventory_consistency.py, and test_launch_preflight_guardrail.py mechanically prevent regression.
+- **Auth bootstrap exception wrapping** in ensure_launch_ready/_ensure_auth: raw exceptions from bootstrap_auth() become ProviderNotReadyError with actionable guidance; already-typed ProviderNotReadyError passes through unchanged.
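A minimal sketch of how the fail-closed registry lookup, the three-function preflight sequence, and the exception-wrapping invariant fit together. Only the names `ProviderRuntimeSpec`, `PROVIDER_REGISTRY`, `LaunchReadiness`, `resolve_launch_provider`, `collect_launch_readiness`, `ensure_launch_ready`, and `ProviderNotReadyError` come from the invariants above. Every field, every signature, the registry contents, the `allowed` parameter, and the injected `probe`/`bootstrap_auth` callables are hypothetical stand-ins, not the real `commands/launch/preflight.py` API. Image-availability handling is omitted.

```python
from dataclasses import dataclass
from typing import Callable


class ProviderNotReadyError(Exception):
    """Typed launch failure that carries actionable guidance."""


@dataclass(frozen=True)
class ProviderRuntimeSpec:
    # Hypothetical fields; the real spec lives in core/contracts.py.
    provider_id: str
    display_name: str
    image: str


# Hypothetical contents mirroring PROVIDER_REGISTRY: provider_id -> spec.
PROVIDER_REGISTRY: dict[str, ProviderRuntimeSpec] = {
    "claude": ProviderRuntimeSpec("claude", "Claude Code", "scc-agent-claude"),
    "codex": ProviderRuntimeSpec("codex", "Codex", "scc-agent-codex"),
}


@dataclass(frozen=True)
class LaunchReadiness:
    spec: ProviderRuntimeSpec
    auth_cache_present: bool
    image_available: bool

    @property
    def launch_ready(self) -> bool:
        # "launch-ready" only when both lower tiers hold.
        return self.auth_cache_present and self.image_available


def resolve_launch_provider(requested: str, allowed: set[str]) -> ProviderRuntimeSpec:
    # Unknown or forbidden providers fail closed: no silent fallback to Claude.
    spec = PROVIDER_REGISTRY.get(requested)
    if spec is None or requested not in allowed:
        raise ProviderNotReadyError(f"provider {requested!r} is unknown or not allowed")
    return spec


def collect_launch_readiness(
    spec: ProviderRuntimeSpec,
    probe: Callable[[ProviderRuntimeSpec], tuple[bool, bool]],
) -> LaunchReadiness:
    # probe returns (auth cache present, image available) for the provider.
    auth_ok, image_ok = probe(spec)
    return LaunchReadiness(spec, auth_cache_present=auth_ok, image_available=image_ok)


def ensure_launch_ready(
    readiness: LaunchReadiness,
    bootstrap_auth: Callable[[ProviderRuntimeSpec], None],
) -> None:
    if readiness.auth_cache_present:
        return
    try:
        bootstrap_auth(readiness.spec)
    except ProviderNotReadyError:
        raise  # already typed: pass through unchanged
    except Exception as exc:
        # Raw failures become a typed error with guidance for the operator.
        raise ProviderNotReadyError(
            f"{readiness.spec.display_name} auth bootstrap failed: {exc}; "
            "re-run the provider's login flow and try again"
        ) from exc
```

Injecting the probe and bootstrap callables is one way to keep the sequence testable without a container runtime; the real module may wire these through adapters instead.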
diff --git a/.gsd/REQUIREMENTS.md b/.gsd/REQUIREMENTS.md
@@ -0,0 +1,29 @@
+# Requirements
+
+This file is the explicit capability and coverage contract for the project.
+
+## Validated
+
+### R001 — SCC changes must improve maintainability by keeping touched areas cohesive, testable, and easier to change, especially when work crosses oversized or high-churn files.
+- Class: non-functional
+- Status: validated
+- Description: SCC changes must improve maintainability by keeping touched areas cohesive, testable, and easier to change, especially when work crosses oversized or high-churn files.
+- Why it matters: Maintainability directly drives testability, consistency, and the long-term cost and safety of future provider/runtime changes.
+- Source: user-feedback
+- Primary owning slice: architecture
+- Supporting slices: M002/S03, M002/S05
+- Validation: Proof from M005: Zero files >1100 lines (from 3 at 1665/1493/1336), 15 MANDATORY-SPLIT files decomposed, 3 boundary violations repaired, 31 import boundary tests pass, typed governed-artifact model hierarchy adopted, provider-neutral bundle pipeline with 100% branch coverage (resolver + both renderers), D023 portable artifact rendering implemented, file/function size guardrails pass without xfail, 18 truthfulness tests, 4486 total tests passing. Exit gate: `uv run ruff check` (0 errors), `uv run mypy src/scc_cli` (289 files, 0 issues), `uv run pytest --rootdir "$PWD" -q` (4486 passed, 23 skipped, 2 xfailed).
+- Notes: Validated by M002/S05, substantially strengthened by M005. M005 delivered: module decomposition (S02), typed config models (S03), governed-artifact pipeline (S04), 100% pipeline coverage (S05), diagnostics/truthfulness/guardrails (S06), D023 portable artifact rendering (S07). Wizard cast cleanup deferred (D018). Legacy module coverage targets deferred per D017/D021 user overrides directing work toward team-pack architecture.
+
+## Traceability
+
+| ID | Class | Status | Primary owner | Supporting | Proof |
+|---|---|---|---|---|---|
+| R001 | non-functional | validated | architecture | M002/S03, M002/S05 | Proof from M005: Zero files >1100 lines (from 3 at 1665/1493/1336), 15 MANDATORY-SPLIT files decomposed, 3 boundary violations repaired, 31 import boundary tests pass, typed governed-artifact model hierarchy adopted, provider-neutral bundle pipeline with 100% branch coverage (resolver + both renderers), D023 portable artifact rendering implemented, file/function size guardrails pass without xfail, 18 truthfulness tests, 4486 total tests passing. Exit gate: `uv run ruff check` (0 errors), `uv run mypy src/scc_cli` (289 files, 0 issues), `uv run pytest --rootdir "$PWD" -q` (4486 passed, 23 skipped, 2 xfailed). |
+
+## Coverage Summary
+
+- Active requirements: 0
+- Mapped to slices: 0
+- Validated: 1 (R001)
+- Unmapped active requirements: 0
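The "zero files over 1100 lines" proof in R001 is the kind of claim a size guardrail test can enforce mechanically. A minimal sketch, assuming a plain physical line count per `.py` file; `oversized_files`, the test name, and the `src/scc_cli` path resolution are illustrative, and the project's actual file/function size guardrail may count differently.

```python
from pathlib import Path

MAX_LINES = 1100  # threshold taken from the R001 proof


def oversized_files(root: Path, max_lines: int = MAX_LINES) -> list[tuple[str, int]]:
    """Return (relative path, line count) for every .py file above max_lines."""
    offenders = []
    for path in sorted(root.rglob("*.py")):
        with path.open(encoding="utf-8") as fh:
            count = sum(1 for _ in fh)
        if count > max_lines:
            offenders.append((path.relative_to(root).as_posix(), count))
    return offenders


def test_no_file_exceeds_limit() -> None:
    # Hypothetical location relative to the repo root; adjust to the layout.
    offenders = oversized_files(Path("src/scc_cli"))
    assert not offenders, f"files over {MAX_LINES} lines: {offenders}"
```

Reporting every offender with its count, rather than failing on the first one, keeps the failure message useful when a refactor pushes several files over the limit at once.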
diff --git a/.gsd/RUNTIME.md b/.gsd/RUNTIME.md
@@ -0,0 +1,30 @@
+# RUNTIME.md
+
+## Canonical implementation root
+- `scc-sync-1.7.3` is the only writable repo for this work.
+- The original dirty `scc` tree is archival and rollback evidence only.
+
+## Runtime assumptions for v1
+- Plain OCI backend first.
+- Docker Engine / OrbStack / Colima-style Docker CLIs are the first runtime targets.
+- Podman follows on the same contracts after the first Claude/Codex vertical slice is stable.
+- Windows support is WSL-first if needed.
+
+## Verification commands
+- `uv run ruff check`
+- `uv run mypy src/scc_cli`
+- `uv run pytest`
+
+## Expected runtime deliverables
+- `scc-base`
+- `scc-agent-claude`
+- `scc-agent-codex`
+- `scc-egress-proxy`
+
+## Enforced egress topology
+- agent container on internal-only network
+- egress proxy as the only component with internal + external attachment
+- no host networking
+- deny IP literals by default
+- deny loopback, private, link-local, and metadata endpoints by default
+- proxy ACL evaluates requested host and resolved IP/CIDR
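The default-deny rules above can be made concrete with Python's standard `ipaddress` module. A minimal sketch, assuming the proxy has already resolved the requested host to one address; `egress_allowed`, `is_ip_literal`, the lowercase host allowlist, and the explicit metadata address set are illustrative and are not the project's Squid ACL configuration.

```python
import ipaddress

# Cloud metadata endpoints (illustrative, not exhaustive). 169.254.169.254 is
# also link-local, but it is listed explicitly so the intent stays visible.
METADATA_ADDRESSES = {
    ipaddress.ip_address("169.254.169.254"),
    ipaddress.ip_address("fd00:ec2::254"),
}


def is_ip_literal(host: str) -> bool:
    try:
        ipaddress.ip_address(host.strip("[]"))  # accept bracketed IPv6 too
    except ValueError:
        return False
    return True


def egress_allowed(requested_host: str, resolved_ip: str, allowed_hosts: set[str]) -> bool:
    """Check both the requested host and the resolved IP; deny by default."""
    if is_ip_literal(requested_host):
        return False  # IP literals are denied by default
    if requested_host.lower() not in allowed_hosts:  # allowlist assumed lowercase
        return False
    addr = ipaddress.ip_address(resolved_ip)
    if addr in METADATA_ADDRESSES:
        return False
    if addr.is_loopback or addr.is_private or addr.is_link_local:
        return False  # an allowed name resolving inward is still denied
    return True
```

Checking the resolved address as well as the name is what stops an allowlisted hostname that resolves to a private or metadata address from reaching it.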
diff --git a/.gsd/milestones/M001-CONTEXT.md b/.gsd/milestones/M001-CONTEXT.md
@@ -0,0 +1,25 @@
+# M001-CONTEXT.md
+
+# Locked decisions for M001
+
+## Non-negotiables
+- No long-term backward compatibility in core after the one-time migration.
+- No Docker Desktop dependency in the architecture.
+- No provider-specific logic in core contracts.
+- No enforcement language that overclaims what is actually enforced.
+- No widening of effective egress outside org policy and delegated team policy.
+
+## Primary objective
+Create the cleanest possible foundation for later runtime and provider work. Do not rush into Podman, Pi, OpenCode, or enterprise dashboards before the baseline and typed architecture are sound.
+
+## Canonical references
+- `CONSTITUTION.md`
+- `PLAN.md`
+- `.gsd/REQUIREMENTS.md`
+- `specs/01-repo-baseline-and-migration.md`
+- `specs/02-control-plane-and-types.md`
+- `specs/03-provider-boundary.md`
+- `specs/07-verification-and-quality-gates.md`
+
+## Notes
+This milestone is intentionally quality-first. It should reduce ambiguity, provider leakage, and orchestration risk before any major feature expansion.
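One way to read the "no widening of effective egress" non-negotiable: a team can select only from what the org has delegated, so the effective set never exceeds org policy. A minimal sketch using plain host sets; `OrgEgressPolicy`, its `baseline_hosts`/`delegated_hosts` fields, and `effective_egress` are hypothetical and do not describe SCC's real policy model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrgEgressPolicy:
    baseline_hosts: frozenset[str]   # allowed for every team
    delegated_hosts: frozenset[str]  # teams may opt into these, nothing else


def effective_egress(org: OrgEgressPolicy, team_requested: set[str]) -> frozenset[str]:
    # A team request can only select from what the org delegated; anything
    # else is dropped rather than added, so the result never exceeds org policy.
    return org.baseline_hosts | (frozenset(team_requested) & org.delegated_hosts)


org = OrgEgressPolicy(
    baseline_hosts=frozenset({"pypi.org"}),
    delegated_hosts=frozenset({"github.com", "registry.npmjs.org"}),
)
team = {"github.com", "evil.example"}
print(sorted(effective_egress(org, team)))  # ['github.com', 'pypi.org']
```

Silently dropping the undelegated host keeps the sketch short; a stricter design would reject the team config with an explicit error so the narrowing is visible to operators.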
diff --git a/.gsd/milestones/M001-RESEARCH.md b/.gsd/milestones/M001-RESEARCH.md
@@ -0,0 +1,17 @@
+# M001-RESEARCH.md
+
+# Baseline findings to preserve during refactor
+
+## Codebase reality from prior review
+- Provider abstraction is still too Claude-shaped.
+- Error and exit-code contracts need alignment.
+- Launch and flow orchestration remain larger than they should be.
+- Application/config boundaries still rely too heavily on raw dictionaries.
+- Runtime detection is still name-based instead of capability-based.
+- Complexity guardrails exist but are not yet enforced strongly enough.
+
+## Why Milestone 0 / M001 must come first
+If the codebase moves directly into multi-runtime and multi-provider work without a green synced baseline and typed contracts, the product will accumulate more provider leakage and more misleading security surfaces.
+
+## Research conclusion
+The best first step is not new runtime code. It is repo truth, vocabulary cleanup, typed core seams, and characterization coverage.
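The finding that runtime detection is name-based rather than capability-based can be illustrated by contrasting the two approaches. `RuntimeInfo` is one of the typed contracts M001 plans to define, but the fields below, `name_based_is_supported`, and `missing_capabilities` are hypothetical. The point is only that decisions key off what the probed runtime can do, not what it is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeInfo:
    # Hypothetical fields describing what a probed runtime can actually do.
    name: str  # informational only; never used for decisions below
    cli_available: bool
    daemon_reachable: bool
    supports_internal_networks: bool


def name_based_is_supported(info: RuntimeInfo) -> bool:
    # The fragile pattern: keyed on a product name, so it rejects working
    # runtimes (OrbStack, Colima) and accepts broken ones.
    return "docker desktop" in info.name.lower()


def missing_capabilities(info: RuntimeInfo, require_enforced_egress: bool) -> list[str]:
    """Return the capabilities a launch needs but the runtime lacks."""
    missing = []
    if not info.cli_available:
        missing.append("cli_available")
    if not info.daemon_reachable:
        missing.append("daemon_reachable")
    if require_enforced_egress and not info.supports_internal_networks:
        missing.append("supports_internal_networks")
    return missing
```

Returning the list of missing capabilities, rather than a bare boolean, also gives diagnostics something concrete to report.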
diff --git a/.gsd/milestones/M001-ROADMAP.md b/.gsd/milestones/M001-ROADMAP.md
@@ -0,0 +1,28 @@
+# M001-ROADMAP.md
+
+# Milestone M001 — Baseline Freeze And Typed Foundation
+
+## Outcome
+The project has a single authoritative repo root, a green migrated baseline, typed control-plane direction, and the first characterization/contract tests needed for safe refactoring.
+
+## Slices
+- [ ] Freeze the archived dirty `scc` tree and make `scc-sync-1.7.3` the only writable root
+- [ ] Normalize local docs, configs, tests, and terminology to the new truthful network vocabulary
+- [ ] Re-run the full verification gate on the synced repo and capture the baseline
+- [ ] Add characterization tests around current Claude launch, resume, config inheritance, and safety-net behavior
+- [ ] Define typed core contracts: `AgentProvider`, `AgentLaunchSpec`, `RuntimeInfo`, `NetworkPolicyPlan`, `SafetyPolicy`, `SafetyVerdict`, and `AuditEvent`
+- [ ] Align `SCCError`, exit-code mapping, and human/JSON output contracts
+- [ ] Record accepted decisions and update specs so follow-on work does not invent hidden compatibility or provider leaks
+
+## Dependencies
+- none
+
+## Risk level
+High
+
+## Done when
+- `scc-sync-1.7.3` is the only implementation root in active use
+- no stale compatibility aliases remain in planned core surfaces
+- the baseline is green
+- characterization coverage exists for the most fragile current behavior
+- the typed control-plane contracts are written down and accepted
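For the slice that aligns `SCCError`, exit-code mapping, and human/JSON output, one option is to let each error class carry its own exit code and render both output modes from the same fields. A minimal sketch: only `SCCError` comes from the roadmap; the subclasses `ConfigError` and `PolicyViolationError`, the exit-code values, the JSON keys, and `handle` are hypothetical.

```python
import json


class SCCError(Exception):
    """Base error: one exit code per class, one message for every output mode."""

    exit_code = 1  # hypothetical generic failure code

    def __init__(self, message: str, suggestion: str = "") -> None:
        super().__init__(message)
        self.message = message
        self.suggestion = suggestion

    def to_human(self) -> str:
        text = f"Error: {self.message}"
        if self.suggestion:
            text += f"\nHint: {self.suggestion}"
        return text

    def to_json(self) -> str:
        return json.dumps(
            {
                "error": type(self).__name__,
                "message": self.message,
                "suggestion": self.suggestion or None,
                "exit_code": self.exit_code,
            }
        )


class ConfigError(SCCError):
    exit_code = 2  # hypothetical


class PolicyViolationError(SCCError):
    exit_code = 3  # hypothetical


def handle(error: SCCError, json_output: bool) -> tuple[str, int]:
    """Return the rendered output and the process exit code for one error."""
    rendered = error.to_json() if json_output else error.to_human()
    return rendered, error.exit_code
```

Because both renderers read the same attributes, the human and JSON contracts cannot drift apart, and a test can check the exit-code table by iterating over the subclasses.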