test: fix flaky multi-node Test crush rules step (pool migration race) #741

Merged
UtkarshBhatthere merged 1 commit into main from fix/multinode-crush-rules-race on May 27, 2026

@UtkarshBhatthere (Contributor)

Summary

  • The `Test crush rules` step in the multi-node CI job races against `.mgr` pool creation and crush-rule migration after the failure-domain auto-switch flips the default rule from `microceph_auto_osd` (id 1) to `microceph_auto_host` (id 2).
  • A bare `ceph osd pool ls detail | grep -F "crush_rule 2"` exits 1 when either no pools are listed yet or the only pool still carries `crush_rule 1`.
  • Add a `wait_for_pool_crush_rule` helper (30 tries × 2s) and call it from the workflow step. The `microceph_auto_host` rule existence check is unchanged.
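
The two failure cases in the second bullet can be reproduced without a cluster. The following is a minimal sketch: the sample strings are abbreviated stand-ins for `ceph osd pool ls detail` output, not captured from a real run, and `check_rule_2` is a hypothetical name used only here.

```shell
# Illustrative only: the sample strings below are simplified stand-ins for
# `ceph osd pool ls detail` output, not copied from a real cluster.

# Hypothetical helper: run the same grep as the bare assertion on a string.
check_rule_2() {
    printf '%s' "$1" | grep -qF "crush_rule 2"
}

no_pools=""
stale_pool="pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 1"
migrated_pool="pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 2"

# Record grep's exit status for each case (0 = match, 1 = no match).
rc_empty=0;  check_rule_2 "$no_pools"      || rc_empty=$?
rc_stale=0;  check_rule_2 "$stale_pool"    || rc_stale=$?
rc_ok=0;     check_rule_2 "$migrated_pool" || rc_ok=$?

echo "no pools: $rc_empty, stale rule: $rc_stale, migrated: $rc_ok"
# prints "no pools: 1, stale rule: 1, migrated: 0"
```

Both pre-migration states are indistinguishable to the assertion: grep reports "no match" either way, so a single-shot check fails the step even though the cluster is healthy and still converging.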

Why this lands now

Recent multi-node failures observed across branches all fail at the same step:

  • feat/orchPlus, PR #721 (Enhance MicroCeph orchestrator support): run 26499172061, step `Test crush rules`; stderr shows that `osd pool ls detail` returned no `crush_rule 2`. Cluster state preceding the assertion: `pools: 0 pools, 0 pgs`.
  • megademo-robot runs 26424267655 and 26418196210: the robot-translated form of the same assertion fails with `STDOUT: : 1 != 0` (grep matched 0 lines).

The mgr daemon creates the .mgr pool asynchronously; tests should not assume it exists at any specific tick after the rule switch.

Test plan

  • CI multi-node job passes on first attempt
  • `Test crush rules` step logs `Found pool with crush_rule 2`
  • No new lint / shell warnings in `actionutils.sh`

🤖 Generated with Claude Code

The "Test crush rules" step in the multi-node CI job races against
mgr pool creation and crush-rule migration. After the failure-domain
auto-switch flips the default crush rule from osd (id 1) to host
(id 2), the mgr-created .mgr pool may not yet exist or its rule
migration may still be in flight when the assertion runs.

`ceph osd pool ls detail | grep -F "crush_rule 2"` then exits 1
because either no pools are listed, or the only listed pool still
has crush_rule 1.

Add a wait_for_pool_crush_rule helper to actionutils.sh that polls
`ceph osd pool ls detail` until a pool reaches the expected rule
(default 30 tries * 2s = 60s budget), and use it from the workflow
step. The crush rule existence check is unchanged.

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
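
The polling helper described in the commit message can be sketched roughly as follows. This is not the code merged in this PR: the argument order, the `CEPH_CMD` override, and the failure message are assumptions made for illustration. Only the helper name, the 30 × 2s default, and the `Found pool with crush_rule 2` log line come from the description above. The sketch also adds `grep -w`, which the original assertion does not use, so that rule 2 cannot match rule 20.

```shell
# Sketch only; not the helper merged in this PR.
# Assumed for illustration: argument order, CEPH_CMD, and the failure message.

# Command used to query the cluster. Overridable so the loop can be
# exercised without a real cluster (hypothetical knob).
CEPH_CMD="${CEPH_CMD:-ceph}"

# wait_for_pool_crush_rule RULE_ID [TRIES] [DELAY_SECONDS]
# Returns 0 as soon as any pool reports crush_rule RULE_ID,
# or 1 once the retry budget (default 30 tries * 2s) is spent.
wait_for_pool_crush_rule() {
    local rule_id="$1"
    local tries="${2:-30}"
    local delay="${3:-2}"
    local attempt=1
    local detail

    while [ "$attempt" -le "$tries" ]; do
        # A failing query (for example, cluster not ready) counts as "not yet".
        detail="$("$CEPH_CMD" osd pool ls detail 2>/dev/null)" || detail=""
        # -w stops "crush_rule 2" from also matching "crush_rule 20".
        if printf '%s\n' "$detail" | grep -qwF "crush_rule ${rule_id}"; then
            echo "Found pool with crush_rule ${rule_id}"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done

    echo "No pool reached crush_rule ${rule_id} after ${tries} tries" >&2
    return 1
}
```

Under these assumptions, a workflow step would call `wait_for_pool_crush_rule 2` in place of the bare `grep`, so an empty pool list or a pool still on `crush_rule 1` leads to another poll rather than an immediate failure.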
UtkarshBhatthere merged commit 99bf489 into main on May 27, 2026
48 checks passed
johnramsden added a commit to johnramsden/microceph that referenced this pull request on May 29, 2026
Replaces all 22 bash CI test jobs with Robot Framework 7.x suites that
produce structured HTML/XML reports, support selective suite execution,
and make failures easier to diagnose with inline keyword-level output.

Structure:
- tests/robot/resources/microceph_harness.resource — ~110 shared keywords
  (VM lifecycle, snap install, cluster bootstrap, OSD/RGW/NFS helpers)
- tests/robot/resources/streaming_process.py — real-time output for
  long-running processes (DSL, cephadm-adopt, wiping)
- 23 suite directories under tests/robot/, one per CI job:
  single-system-tests, multi-node-tests, availability-zone-tests,
  multi-node-tests-with-custom-microceph-ip, test-sequential-mon-host-refresh,
  test-maintenance-modes, loop-file-tests, wal-db-tests, upgrade-reef-tests,
  cluster-tests, rbd-replication-test, cephfs-replication-test, nfs-test,
  nfs-multinode-test, messenger-v2-tests, wiping-test, cephadm-adopt-test,
  dsl-functional-tests (6 parallel jobs), api-tests, static-checks, unit-tests
- robot.py / tox.ini — CLI wrapper and tox integration for local runs
- tests/scripts/: actionutils.sh idempotency fix, adoptutils.sh upstream
  fixes, test_dsl_functest.sh timeout hardening

Migration style:
- Inline reimplementation: bash logic rewritten as Robot/harness keywords
  (the majority — checked line-by-line for 1:1 parity)
- Direct bash execution: very long suites (DSL x6, cephadm-adopt, wiping,
  api-disk) run the original .sh unchanged via Run Streaming Process
- All flakiness fixes from upstream (canonical#737, canonical#741) incorporated; additional
  retry loops and polling guards added throughout

Assisted-by: claude-code:claude-sonnet-4-6
Assisted-by: claude-code:claude-opus-4-7
Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>