DEMO ONLY: wire faultcat-explore into multi-node-tests CI by UtkarshBhatthere · Pull Request #731 · canonical/microceph

UtkarshBhatthere · 2026-05-13T13:32:42Z

⚠️ Do NOT merge. This PR exists to exercise canonical/faultcat PR #8 (feat/explore-mode) end-to-end against a real failing microceph CI job. Both demo changes are tagged FAULTCAT-DEMO so they can be ripped out cleanly. The PR is opened as a draft to make the do-not-merge status visible.

What this PR does

Disables the "Add 2 OSDs" step in the multi-node-tests job with if: false. The downstream "Test 3 osds present" / "Test crush rules" steps then fail naturally on the missing setup — this is a real diagnostics_gap shape: state that a later step expects was never set up by an earlier step.
Adds a new step right after the existing "Print logs for failure" step:
```
- name: FAULTCAT-DEMO faultcat explore
  if: failure()
  uses: canonical/faultcat/.github/actions/faultcat-explore@feat/explore-mode
  with:
    hints: microceph,lxd
    openrouter-api-key: ${{ secrets.OPENROUTER_API_KEY }}
    artifact-name: faultcat-probe-multi-node
    fail-on-explore-error: "false"
```
The composite action installs @earendil-works/pi-coding-agent and faultcat, drives Pi (fast tier, DeepSeek V3 via OpenRouter) to inspect the live failed multi-node test environment read-only, validates the output against the same evidence_bundle.v1 / findings.v1 / suggestions.v1 schemas as faultcat M1, scrubs secrets from the entire output tree, and uploads the result as a faultcat-probe-multi-node artifact.
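
For reference, the first demo change (disabling a step so that later steps fail on missing state) might look roughly like the sketch below. The step name comes from this PR; the `run:` body is a placeholder and is not copied from the real workflow.

```yaml
# Sketch only: the run body is a hypothetical placeholder.
- name: Add 2 OSDs
  if: false  # FAULTCAT-DEMO: remove this line to restore the step
  run: |
    echo "placeholder for the real OSD setup commands"
```

With `if: false` the step is skipped rather than failed, so the first red step in the job is whichever downstream step depends on the OSDs.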

What I want to learn from this CI run

Does the composite action install cleanly on the ubuntu-22.04 runner used by multi-node-tests?
Does Pi correctly identify the missing-OSDs cause from live microceph status / ceph -s / snap.microceph.osd logs, or does it get distracted?
Is the resulting artifact actually safe to publish on the public repo (scrubber output looks clean)?
Are the three baseline skills (microceph, lxd, _default) tight enough that Pi stays focused, or does it bloat?

Prerequisite

OPENROUTER_API_KEY needs to be set as a repo secret in canonical/microceph. The action reads ${{ secrets.OPENROUTER_API_KEY }}; if the secret is missing, the action will still run, but Pi will return empty content (there is no fallback model in this first cut). I have not added that secret. I would appreciate you adding it before kicking off the workflow, or letting me know, and I'll keep the PR in draft until it's in place.
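
A missing secret could be made visible in the job log with a small guard step placed before the explore step. This is a hypothetical addition, not part of this PR; the step name and warning text are made up for illustration. Secrets cannot be referenced directly in a step-level `if:`, so the sketch maps the secret into an environment variable and tests it in the shell.

```yaml
# Hypothetical guard step, not part of this PR.
- name: Warn if OPENROUTER_API_KEY is missing
  if: failure()
  env:
    OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
  run: |
    # An unset secret expands to an empty string.
    if [ -z "${OPENROUTER_API_KEY}" ]; then
      echo "::warning::OPENROUTER_API_KEY is not set; the explore step will likely return empty content"
    fi
```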

When this PR is no longer needed

After the explore action has been observed working over a few real CI failures, a follow-up PR can add the if: failure() step into more (or all) integration test jobs as a permanent hook. This PR's role ends as soon as we have evidence the action behaves correctly.

Test plan

CI fails as expected in the "Test 3 osds present" step (missing OSDs)
The FAULTCAT-DEMO faultcat explore step runs to completion (does not fail the job either way thanks to fail-on-explore-error: false)
faultcat-probe-multi-node artifact appears on the workflow run
Downloaded artifact contains scrubbed evidence.yaml, findings.yaml, and suggestions.yaml files whose content reflects the missing-OSDs failure shape
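
The artifact checks in this plan could in principle be automated with a follow-up job along the lines below. This is a sketch under assumptions: the job name, the `faultcat-probe` download directory, and the idea that the three files may sit anywhere inside the artifact tree are all hypothetical. It only checks that the files exist; whether the content is properly scrubbed still needs a human read.

```yaml
# Hypothetical follow-up job; not part of this PR.
check-faultcat-artifact:
  needs: multi-node-tests
  if: always()  # run even though multi-node-tests is expected to fail
  runs-on: ubuntu-22.04
  steps:
    - name: Download the probe artifact
      uses: actions/download-artifact@v4
      with:
        name: faultcat-probe-multi-node
        path: faultcat-probe
    - name: Check the three expected files are present
      run: |
        for f in evidence.yaml findings.yaml suggestions.yaml; do
          # The location inside the artifact is an assumption, so search the whole tree.
          if [ -z "$(find faultcat-probe -type f -name "$f")" ]; then
            echo "::error::missing $f in artifact"
            exit 1
          fi
        done
```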

🤖 Generated with Claude Code

⚠️ DO NOT MERGE. This branch deliberately breaks the Tests workflow to exercise faultcat's new CI-side explore mode end-to-end.

Two demo-only changes, both clearly tagged FAULTCAT-DEMO:

1. Disable the "Add 2 OSDs" step with `if: false`. Downstream steps that expect 3 OSDs ("Test 3 osds present", "Test crush rules") then fail naturally. This is a real diagnostics_gap shape (state expected by a later step was never set up by an earlier step).
2. Append an `if: failure()` step right after the existing "Print logs for failure" step that calls the new canonical/faultcat/.github/actions/faultcat-explore@feat/explore-mode composite action, passing hints=microceph,lxd and the OPENROUTER_API_KEY repo secret. The action installs pi-coding-agent and faultcat, runs `faultcat explore`, and uploads a scrubbed evidence bundle as the `faultcat-probe-multi-node` artifact.

Verifying that:

- The composite action installs cleanly on ubuntu-22.04 runners.
- Pi (DeepSeek V3 via OpenRouter) inspects the live failed multi-node test environment and emits schema-valid evidence + findings + suggestions.
- The scrubber leaves the artifact safe to publish on a public repo.
- The action does not fail the job (fail-on-explore-error=false), so the original test failure remains the surfaced one.

Before merging this PR (do not merge for the demo cycle):

- Remove the `if: false` from "Add 2 OSDs" so it runs normally.
- Remove the entire "FAULTCAT-DEMO faultcat explore" step.
- Once the action is stable, a follow-up PR can add the explore step back in as a permanent if: failure() hook across all integration jobs.

Requires `OPENROUTER_API_KEY` to be set as a repo secret in canonical/microceph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ltcat#8)

The first cycle of canonical/faultcat#8 demo CI surfaced two issues:

- pipx install of faultcat failed because ubuntu-22.04 ships Python 3.10.12 and faultcat requires >=3.11. Fixed in faultcat 693ae4c (setup-python@v5 + pipx --python).
- The composite action's `faultcat-ref` input defaulted to `main`, so the runner pip-installed the pre-merge main, not the explore-mode branch. Now passing `faultcat-ref: feat/explore-mode` explicitly.

Also tighten the demo feedback loop by disabling every Tests job except build-microceph (needed for the snap artifact) and multi-node-tests (the only job exercising the explore action) with `if: false # FAULTCAT-DEMO ...`. This reduces a >20-job matrix to two so each retry cycle takes minutes, not hours.

Both demo changes (the disabled jobs and the explore step) are tagged FAULTCAT-DEMO and must be removed before merging this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
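
The Python fix described above (setup-python@v5 plus `pipx --python`) lives in the faultcat repository, so the exact steps are not visible from this PR. A minimal sketch of what such composite-action steps could look like follows. The step names, the `python` step id, and in particular the `git+https://...` install spec are assumptions for illustration, not the contents of faultcat 693ae4c.

```yaml
# Sketch of composite-action steps; names and the install spec are assumptions.
- name: Set up Python 3.11
  id: python
  uses: actions/setup-python@v5
  with:
    python-version: "3.11"
- name: Install faultcat with the selected interpreter
  shell: bash  # composite-action run steps must declare a shell
  run: |
    # python-path is an output of actions/setup-python.
    pipx install --python "${{ steps.python.outputs.python-path }}" \
      "git+https://github.com/canonical/faultcat@${{ inputs.faultcat-ref }}"
```

Pointing pipx at the interpreter from setup-python keeps the install off the runner's system Python 3.10, which is what tripped the first cycle.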

…dd 2 OSDs'

Previous skip target ("Add 2 OSDs") put the demo failure at minute ~9 of the multi-node-tests job ("Test failure domain scale up"). Moving the skip to "Setup cluster" fails the job at minute ~5-6 ("Verify config" or "Add 2 OSDs" depending on which is reached first), which shortens the demo feedback loop on each iteration.

The failure shape is also more interesting for explore mode:

- Bootstrap succeeds → node-wrk0 has microceph
- Setup cluster is skipped → node-wrk1..3 never join
- `lxc exec node-wrk0 -- microceph status` shows a 1-node cluster where 4 were expected
- That is a real diagnostics_gap shape that the lxc/microceph skills in canonical/faultcat#8 0a713e1 now explicitly know how to reach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
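
To see the same failure shape by hand, without the explore action, a plain debug step could run the probe named above. This is a hypothetical step, not part of this PR; only the `lxc exec node-wrk0 -- microceph status` command comes from the commit message.

```yaml
# Hypothetical debug step, not part of this PR.
- name: Show cluster membership after failure
  if: failure()
  run: |
    # List the LXD containers, then ask the bootstrap node for its view of the cluster.
    lxc list
    lxc exec node-wrk0 -- microceph status
```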

multi-node-tests no longer depends on the build-microceph job. The ~3 min snap build is skipped; instead the multi-node-tests job uses `install_store reef/stable` (the same helper several other jobs use) to install the published microceph snap into each LXD container. This trims another ~3 min off each demo iteration.

The failure shape under test ("Setup cluster" skipped → cluster never forms past node-wrk0) is independent of which microceph version is installed, so a stored release is equivalent to a freshly-built one for this demo.

Like the other FAULTCAT-DEMO changes, this must be reverted before merging:

- Restore `needs: build-microceph` on multi-node-tests.
- Restore the "Download snap" step.
- Restore "Install local microceph snap" → install_multinode.
- Remove the `if: false` on build-microceph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`timeout-minutes` is not allowed inside the composite action manifest (rejected by GHA). The wall-clock cap belongs on the caller's `uses:` step instead, which is supported. 6 minutes = the action's internal 4.5-min ceiling plus install/cleanup overhead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
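
Putting the later commits together, the caller's step would look roughly like this. It is assembled from the inputs named in this PR (`faultcat-ref`, the 6-minute cap) rather than copied from the branch, so treat the exact ordering and values as a sketch.

```yaml
# Sketch assembled from the commit messages; not copied from the branch.
- name: FAULTCAT-DEMO faultcat explore
  if: failure()
  timeout-minutes: 6  # step-level cap; the composite manifest cannot set this itself
  uses: canonical/faultcat/.github/actions/faultcat-explore@feat/explore-mode
  with:
    hints: microceph,lxd
    faultcat-ref: feat/explore-mode
    openrouter-api-key: ${{ secrets.OPENROUTER_API_KEY }}
    artifact-name: faultcat-probe-multi-node
    fail-on-explore-error: "false"
```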

…plore-mode) Tests new explore-mode prompt: minimum 6 tool calls, fallback ladder when both lxc list and host PATH are empty, CI-context probes, no early exit on negative observations only.

…e-mode) Tests goal-driven stopping criterion + validator-enforced curiosity: no tool-call quota, info-gain heuristic, reject diagnostics_gap without CI-context probe, require alternatives_ruled_out for moderate/strong.

UtkarshBhatthere and others added 9 commits May 13, 2026 19:01

retrigger CI to pick up faultcat action fix (a272154 on canonical/fau…

edb1b17

…ltcat#8)

retrigger CI to pick up 5-min explore timeout (faultcat 525d3ed)

3d983eb

retrigger CI to pick up faultcat curiosity prompt (8b2c9d3 on feat/ex…

010cef7


UtkarshBhatthere wants to merge 9 commits into main from feat/faultcat-explore-demo
