fix(install): Carl-grade install reliability — close the broken-merge gap #968
Open
joelteply wants to merge 7 commits into canary from fix/install-carl-mac-windows
7 commits
7a43bd3 docs: Carl-grade CI plan — close the broken-merge gap
2071eae fix(install/E): widget-server /health gate + refuse-to-open-on-fail (…
f9fe2b7 feat(ci/A): carl-install-smoke — runs Carl's exact install command + …
9d2e8bb fix(install/F): friendlier failures — phase-named errors with 1-line …
7f77359 docs(plan): correct B.1/B.2 — Mac is architecturally hybrid (Metal bl…
6cc0ee1 fix(install/G): UI URL says :9003 not :9000 (bootstrap.sh + install.ps1)
662b7da fix(install/G): stream cargo build output during first-build (no more…
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,99 @@ | ||
| # Carl-install smoke — runs the EXACT install command Carl runs, then | ||
| # verifies the page Carl opens after install actually serves usable HTML. | ||
| # | ||
| # Closes the gap that let #950 merge with the Mac install path doing a | ||
| # hidden 5-15min Rust source build despite the README claiming "Docker- | ||
| # first: no compilation needed." Existing CI gates (verify-architectures, | ||
| # verify-after-rebuild, validate, install-and-run-gate) all passed because | ||
| # they validate image presence + revision label + service health on a | ||
| # CI-only docker compose. They never exercised `curl install.sh | bash`. | ||
| # | ||
| # Status: ADVISORY for the first week of operation (per docs/CARL-CI-PLAN.md | ||
| # rollout section). Once we have <2% false-fail rate over 1 week, flip to | ||
| # REQUIRED via the PrimaryBranches ruleset PUT. Until then, this workflow | ||
| # runs but doesn't block merge — letting us tune the smoke without locking | ||
| # the merge button on flakes. | ||
|
|
||
| name: Carl Install Smoke | ||
|
|
||
| on: | ||
| pull_request: | ||
| branches: [canary, main] | ||
| paths: | ||
| # Run when anything that affects Carl's install path changes. | ||
| # No need to re-run on TS-only widget changes that don't touch | ||
| # install/docker; those are covered by other gates. | ||
| - 'install.sh' | ||
| - 'install.ps1' | ||
| - 'setup.sh' | ||
| - 'bootstrap.sh' | ||
| - 'src/scripts/install*.sh' | ||
| - 'src/scripts/lib/install-common.sh' | ||
| - 'docker/**' | ||
| - 'docker-compose*.yml' | ||
| - 'src/.dockerignore' | ||
| - 'src/workers/.dockerignore' | ||
| - 'scripts/ci/carl-install-smoke.sh' | ||
| - '.github/workflows/carl-install-smoke.yml' | ||
| push: | ||
| branches: [canary, main] | ||
| # Manual trigger so anyone can validate Carl's path against any branch | ||
| # without opening a throwaway PR. | ||
| workflow_dispatch: | ||
| inputs: | ||
| install_ref: | ||
| description: 'Git ref to fetch install.sh from (sha / branch / tag)' | ||
| required: false | ||
| default: '' | ||
|
|
||
| jobs: | ||
| carl-install-smoke-amd64: | ||
| name: carl-install-smoke (linux/amd64) | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 30 | ||
| permissions: | ||
| contents: read | ||
| packages: read | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| with: | ||
| # PR HEAD, not the synthetic merge commit. Otherwise github.sha | ||
| # is the merge commit, and the install.sh we'd fetch from | ||
| # raw.githubusercontent.com wouldn't be the one in this PR. Same | ||
| # rationale as docker-images.yml's ref pattern. | ||
| ref: ${{ github.event.pull_request.head.sha || github.sha }} | ||
| # Smoke uses the local script directly; no need for full history. | ||
| fetch-depth: 1 | ||
|
|
||
| - name: Set up Docker Buildx | ||
| uses: docker/setup-buildx-action@v3 | ||
|
|
||
| - name: Login to ghcr.io (so install.sh can pull pre-built images) | ||
| run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin | ||
|
|
||
| - name: Run carl-install smoke | ||
| env: | ||
| # Pass the PR HEAD sha so the smoke fetches the install.sh from | ||
| # THIS PR (not main). Outside a PR context, falls back to the manual | ||
| # workflow_dispatch input, then to github.sha. | ||
| CARL_INSTALL_REF: ${{ github.event.pull_request.head.sha || inputs.install_ref || github.sha }} | ||
| # 25-min cap on the docker-only install. Hybrid (Mac source-build) | ||
| # path would exceed this — by design, that's the gate firing on | ||
| # the README/install mismatch. | ||
| CARL_INSTALL_TIMEOUT_SEC: '1500' | ||
| # Generous health wait — model-init can take 3-5min on cold pull. | ||
| CARL_HEALTH_TIMEOUT_SEC: '300' | ||
| # CI shouldn't leave docker compose stacks running. | ||
| SKIP_TEARDOWN: '0' | ||
| run: bash scripts/ci/carl-install-smoke.sh | ||
|
|
||
| - name: Upload install + page artifacts on failure | ||
| if: failure() | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: carl-install-debug-${{ github.event.pull_request.head.sha || github.sha }} | ||
| path: | | ||
| /tmp/carl-smoke-*.install.log | ||
| /tmp/carl-smoke-*.page.html | ||
| retention-days: 7 | ||
| if-no-files-found: ignore |
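The `CARL_HEALTH_TIMEOUT_SEC` budget in the workflow above implies a poll-until-healthy loop somewhere in the smoke script, which is not shown in this diff. Below is a minimal sketch of such a loop. The helper names `wait_until` and `health_gate` are hypothetical, and the `http://localhost:9003/health` URL is taken from the plan document; the real `scripts/ci/carl-install-smoke.sh` may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: wait_until and health_gate are illustrative names,
# not functions known to exist in the repository.

# wait_until TIMEOUT_SEC INTERVAL_SEC CMD [ARGS...]
# Re-runs CMD until it exits 0 or TIMEOUT_SEC has elapsed.
# Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout_sec="$1" interval_sec="$2"
  shift 2
  local deadline=$(( SECONDS + timeout_sec ))
  while true; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval_sec"
  done
}

# health_gate: polls the widget-server health endpoint using the timeout
# the workflow exports, and prints a readable message on failure.
# HEALTH_URL is an assumed override variable for local runs.
health_gate() {
  local url="${HEALTH_URL:-http://localhost:9003/health}"
  local timeout_sec="${CARL_HEALTH_TIMEOUT_SEC:-300}"
  if wait_until "$timeout_sec" 2 curl -fsS --max-time 5 "$url"; then
    echo "widget-server healthy at $url"
  else
    echo "widget-server did not become healthy within ${timeout_sec}s" >&2
    echo "inspect container output with: docker compose logs" >&2
    return 1
  fi
}
```

Calling `health_gate` blocks for up to the configured timeout. The probe is passed to `wait_until` as a command so the loop itself can be exercised locally with `true` or `false` instead of a live server.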
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,230 @@ | ||||||
| # Carl-Grade CI: closing the broken-merge gap | ||||||
|
|
||||||
| **Status:** plan / in-progress on `fix/install-carl-mac-windows` | ||||||
| **Owner:** anvil (mac), green-022a (windows), bigmama-wsl (linux/cuda) | ||||||
| **Driver:** anvil | ||||||
|
|
||||||
| ## The problem we're solving | ||||||
|
|
||||||
| #950 merged with the install path on Mac doing a hidden 5-15min Rust source | ||||||
| build despite the README claiming "Docker-first: pulls pre-built images, no | ||||||
| compilation needed." The CI gates that exist today (verify-architectures, | ||||||
| verify-after-rebuild, validate, install-and-run-gate) verified: | ||||||
|
|
||||||
| - Multi-arch presence at `:pr-N` ✅ | ||||||
| - Per-arch revision label matches HEAD SHA ✅ | ||||||
| - TS/Rust compile clean ✅ | ||||||
| - docker-compose-up + widget-server health responds ✅ | ||||||
|
|
||||||
| What they did NOT catch: | ||||||
|
|
||||||
| - **Carl's actual install command** (`curl install.sh | bash`) was never | ||||||
| exercised by CI. | ||||||
| - **README claim** (no compilation needed) vs **install.sh behavior** | ||||||
| (5-15min Rust build on Mac) was never reconciled. | ||||||
| - **First chat message** the user would send was never validated to produce | ||||||
| a clean response (no `<tool_use>` XML, no vision hallucination). | ||||||
| - **Browser-loaded UI** was never verified to actually render and accept | ||||||
| user input through the same path Carl would use. | ||||||
|
|
||||||
| So #950 went green on its CI gates but Carl's install experience is | ||||||
| materially different from the README's promise. That's the gap this work | ||||||
| closes. | ||||||
|
|
||||||
| ## Design principles | ||||||
|
|
||||||
| 1. **Test the user's path, not a CI-only path.** The same `install.sh` that | ||||||
| Carl invokes from `curl ... | bash` runs in CI. No CI-only smoke | ||||||
| substitutes. | ||||||
|
|
||||||
| 2. **Test the user's first action, not just service health.** After install | ||||||
| succeeds, CI sends a chat message + an image, and asserts the response | ||||||
| reads like a non-broken product (no XML leak, no hallucination markers, | ||||||
| real Vision description). | ||||||
|
|
||||||
| 3. **Cross-platform from day one.** amd64-linux is mandatory; arm64-mac is | ||||||
| high-priority via self-hosted runner OR developer-pre-push gate; Windows | ||||||
| (via WSL2 or PowerShell) is third tier but not optional. | ||||||
|
|
||||||
| 4. **Conservative-by-default required-checks.** New gates added as REQUIRED | ||||||
| in the PrimaryBranches ruleset only after they demonstrate <2% false-fail | ||||||
| rate over 1 week. False positives erode trust faster than they protect. | ||||||
|
|
||||||
| 5. **Same script for CI and humans.** Per Joel 2026-04-23: "make your own | ||||||
| testing easy." Every gate is a one-line shell invocation any of us can | ||||||
| run locally in 30 seconds. | ||||||
|
|
||||||
| ## What lands in THIS PR | ||||||
|
|
||||||
| ### A. Carl-install validation in CI (the headline) | ||||||
|
|
||||||
| A new CI job `carl-install-and-chat-smoke` that: | ||||||
|
|
||||||
| 1. On a fresh ubuntu-latest GHA runner (amd64), does: | ||||||
| ``` | ||||||
| CONTINUUM_DIR=/tmp/carl-probe \ | ||||||
| bash <(curl -fsSL https://raw.githubusercontent.com/CambrianTech/continuum/$GITHUB_SHA/install.sh) | ||||||
| ``` | ||||||
| The actual install path Carl runs. | ||||||
|
|
||||||
| 2. Times the install (target: <15 min for the Carl-mode docker-only path). | ||||||
|
|
||||||
| 3. After install completes, hits `http://localhost:9003/health` (existing | ||||||
| health check, kept) PLUS a new `chat-smoke` script: | ||||||
| - POSTs a chat message ("hello, who are you?") via the REST API | ||||||
| - Waits up to 60s for a response | ||||||
| - Asserts response: no `<tool_use>` XML, no `<persona-name>:` prefix, | ||||||
| >100 chars, doesn't claim it cannot do something it actually can | ||||||
|
|
||||||
| 4. POSTs a chat message with an image attachment (test fixture | ||||||
| `test-data/images/image-2.jpg` — small, public CC0): | ||||||
| - Asserts Vision AI's response describes the actual image content | ||||||
| - Asserts non-vision personas EITHER skip the response OR honestly say | ||||||
| they cannot see images (no hallucinated content) | ||||||
|
|
||||||
| 5. Tears down. Captures docker logs on failure to GHA artifacts so we can | ||||||
| diagnose without re-running. | ||||||
|
|
||||||
| **Required check:** `carl-install-and-chat-smoke` becomes required for | ||||||
| canary→main promotion (after 1 week of <2% false-fail rate to confirm | ||||||
| stability). For PR→canary promotion, it's required from day one — canary | ||||||
| is where we discover regressions, that's its job. | ||||||
|
|
||||||
| ### B. Mac-mode install rationalization | ||||||
|
|
||||||
| **Update 2026-04-25 (anvil, after reading install.sh:118-123):** B.1 is | ||||||
| not a choice we have. Apple's hypervisor blocks GPU passthrough to | ||||||
| containers (confirmed by Docker Feb 2026, comment in install.sh). Mac | ||||||
| NEEDS to run continuum-core natively for Metal acceleration. The 5-15min | ||||||
| Rust build is architectural, not a bug. Going with B.2. | ||||||
|
|
||||||
| **B.2 (current plan):** README updated to admit the hybrid split: | ||||||
| - Linux: docker-first, no compilation (matches the existing README claim) | ||||||
| - Mac: docker for support services + native continuum-core for Metal | ||||||
| (~10min first build, incremental after; happens automatically as part | ||||||
| of `curl install.sh | bash` — no separate command, no env flag) | ||||||
|
|
||||||
| Implementation: | ||||||
| - README's headline install section gets a small per-platform table or | ||||||
| inline note explaining the wall-clock difference. | ||||||
| - install.sh prints an upfront banner on Mac estimating build time | ||||||
| (so Carl knows to expect ~10min, not ~3min). | ||||||
| - `--quiet` mode keeps existing behavior; just clearer messaging. | ||||||
|
|
||||||
| (Considered B.3: ship TWO install commands — install-mac.sh vs install.sh. | ||||||
| Rejected: more docs surface, more drift risk, fragments the support story. | ||||||
| One entry point with honest messaging beats two entry points with shorter | ||||||
| average time.) | ||||||
|
|
||||||
| ### C. Browser smoke test (puppeteer) | ||||||
|
|
||||||
| Within the same CI job, after install + chat-smoke pass: | ||||||
|
|
||||||
| 1. Launch headless Chrome via puppeteer | ||||||
| 2. Navigate to `http://localhost:9003/` | ||||||
| 3. Assert page loads (no chrome-error://) | ||||||
| 4. Type "hello" into the chat input | ||||||
| 5. Assert response renders within 30s | ||||||
| 6. Capture screenshot for the GHA artifact (so we have visual evidence) | ||||||
|
|
||||||
| Catches the chrome-error trap class of bug: when widget-server isn't ready | ||||||
| fast enough, the browser lands on a chrome-error:// page instead of the UI. | ||||||
|
|
||||||
| ### D. install.sh idempotence and friendly retry | ||||||
|
|
||||||
| When install.sh is interrupted partway (Carl Ctrl+C's, network drops), | ||||||
| re-running should resume from where it left off, not retry from scratch. | ||||||
| Specifically: | ||||||
|
|
||||||
| - Skip `git clone` if repo already at $CONTINUUM_DIR with correct origin | ||||||
| - Skip `docker compose pull` if all images present locally with current tags | ||||||
| - Skip prereq install steps that already report installed | ||||||
| - ONLY repeat the failed step + everything after it | ||||||
|
|
||||||
| Most of this is already in install.sh's check-then-install pattern; verify | ||||||
| end-to-end and document the resume behavior in the README. | ||||||
|
|
||||||
| ### E. Browser pre-open delay | ||||||
|
|
||||||
| install.sh currently opens the browser after compose-up returns. compose-up | ||||||
| returns when containers START, not when widget-server is HEALTHY. Result: | ||||||
| chrome-error trap when browser hits localhost:9003 0.5 sec before the | ||||||
| server is listening. | ||||||
|
|
||||||
| Fix: install.sh polls widget-server `/health` with a 60s timeout BEFORE | ||||||
| running `open http://localhost:9003/`. If health doesn't come up, print a | ||||||
| human-readable timeout message + log dump command instead of opening the | ||||||
| browser to an error. | ||||||
|
|
||||||
| ### F. Friendlier first-fail messaging | ||||||
|
|
||||||
| When install.sh fails (any phase), the error output should: | ||||||
| - Name the phase (`Phase 4/8: Python ML environment`) | ||||||
| - Show the actual failing command + its stderr | ||||||
| - Print 1-line guidance for that specific failure ("If pip install timed | ||||||
| out, retry: `python -m pip install --retries 5 ...`") | ||||||
| - Capture the full log to an easy-to-copy path (`/tmp/continuum-install-*.log`) | ||||||
|
|
||||||
| Carl shouldn't have to read the script source to understand what broke. | ||||||
|
|
||||||
| ## What does NOT land in this PR (deferred to follow-ups) | ||||||
|
|
||||||
| - **Self-hosted GPU runner** (bigmama's box as a GHA runner) — bigger | ||||||
| infra lift, do once Carl-install-and-chat-smoke is stable on amd64. | ||||||
| - **Persona-airc bridge** (#967) — separate value stream. | ||||||
| - **(d) tool_use XML parser fix** (#76) — the `chat-smoke` step in this PR | ||||||
| ASSERTS clean output, so #76 is now a hard prerequisite for the smoke | ||||||
| to pass. Decide: fix #76 first then ship this PR's smoke as required, or | ||||||
| ship the smoke as advisory until #76 lands. | ||||||
| - **Recipe substrate** (#71/#73) and **Phase C paging** — independent | ||||||
| workstreams, queued. | ||||||
|
|
||||||
| ## Rollout | ||||||
|
|
||||||
| 1. **This PR adds the smoke + the Mac-mode rationalization** to canary. | ||||||
| 2. CI runs the new smoke as ADVISORY (not blocking) for 1 week to gather | ||||||
| false-positive rate data. | ||||||
| 3. After 1 week of <2% false-fail, flip to REQUIRED via the PrimaryBranches | ||||||
| ruleset (gh api PUT). | ||||||
| 4. Canary→main promotion is gated on the smoke passing. | ||||||
| 5. New install regressions become impossible to merge without explicit | ||||||
| `--no-verify` (which the team's standing rule forbids per Joel). | ||||||
|
|
||||||
| ## Per-platform validation | ||||||
|
|
||||||
| | Platform | Validator | Notes | | ||||||
| |---|---|---| | ||||||
| | linux/amd64 | GHA runner (`ubuntu-latest`) | Always-on. Carl's dominant platform per HF data. | | ||||||
| | linux/amd64 + GPU | bigmama-wsl box, eventually self-hosted runner | Real Carl path; covers vision/persona functionality | | ||||||
| | darwin/arm64 | anvil mac (manual probe), eventually puppeteer-on-mac in CI | Dev's dominant platform | | ||||||
| | windows + WSL2 | green-022a (manual probe), bigmama-wsl secondary | Carl's secondary platform | | ||||||
| | windows native (powershell) | green-022a (manual probe via install.ps1) | New platform — rely on green's dogfood | | ||||||
|
|
||||||
| Each push to canary should have at least the linux/amd64 smoke green before | ||||||
| promotion. The other tiers are progressively-tightening. | ||||||
|
|
||||||
| ## Success criteria | ||||||
|
|
||||||
| - [ ] Carl-install-and-chat-smoke runs on every PR; passes for unchanged- | ||||||
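The response assertions listed under A.3 of the plan can be sketched as a small shell function. This is an illustration only: `check_chat_response` and its arguments are hypothetical names, and the persona name is passed in as a parameter because the plan gives only the placeholder `<persona-name>`. The check that a reply "doesn't claim it cannot do something it actually can" is left out, since it is not a mechanical string test.

```shell
#!/usr/bin/env bash
# Sketch only: check_chat_response is an illustrative name, not a
# function known to exist in the repository.

# check_chat_response TEXT [PERSONA_NAME]
# Prints the first violated rule and returns 1, or returns 0 if the
# reply passes the mechanical checks from the plan:
#   - no leaked <tool_use> XML
#   - no "<persona-name>:" speaker prefix
#   - more than 100 characters
check_chat_response() {
  local text="$1" persona="${2:-}"

  if [[ "$text" == *"<tool_use>"* ]]; then
    echo "FAIL: response leaks <tool_use> XML"
    return 1
  fi

  # Only check the prefix when a persona name was supplied.
  if [[ -n "$persona" && "$text" == "${persona}:"* ]]; then
    echo "FAIL: response starts with persona prefix '${persona}:'"
    return 1
  fi

  if (( ${#text} <= 100 )); then
    echo "FAIL: response is ${#text} chars, expected more than 100"
    return 1
  fi

  echo "OK"
  return 0
}
```

A smoke script could pipe the REST API reply into this function and fail the job on a non-zero return, printing the reason line into the CI log.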
This section reads as if the smoke job + Mac-mode changes are already included in the current PR (e.g., “What lands in THIS PR” / “This PR adds…”), but the PR currently appears to only add the plan doc. Consider rewording to “What’s planned on this branch” (or similar) to avoid misleading readers until the implementation lands.