test: fix flaky multi-node Test crush rules step (pool migration race) by UtkarshBhatthere · Pull Request #741 · canonical/microceph

UtkarshBhatthere · 2026-05-27T09:10:52Z

Summary

The `Test crush rules` step in the multi-node CI job races against `.mgr` pool creation and crush-rule migration after the failure-domain auto-switch flips the default rule from `microceph_auto_osd` (id 1) to `microceph_auto_host` (id 2).
A bare `ceph osd pool ls detail | grep -F "crush_rule 2"` exits 1 when either no pools are listed yet, or the only pool still carries `crush_rule 1`.
Add a `wait_for_pool_crush_rule` helper (30 tries × 2s) and call it from the workflow step. The `microceph_auto_host` rule existence check is unchanged.
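For readers without the diff at hand, a polling helper of this kind could look roughly like the sketch below. It is not the code merged in this PR: the argument order, the timeout message, and the assumption that `ceph` is directly callable from the test shell are all illustrative. Only the helper name, the 30 × 2s default budget, and the `Found pool with crush_rule 2` log line come from the description.

```shell
# Hypothetical sketch of the polling helper; not the merged implementation.
# Usage: wait_for_pool_crush_rule [rule_id] [tries] [delay_seconds]
wait_for_pool_crush_rule() {
    local rule_id="${1:-2}"   # crush rule id a pool is expected to reach
    local tries="${2:-30}"    # default budget: 30 tries * 2s = 60s
    local delay="${3:-2}"
    local i

    for i in $(seq 1 "$tries"); do
        # -w keeps "crush_rule 2" from matching "crush_rule 20".
        # Assumes `ceph` is reachable as-is; the real CI may wrap the call.
        if ceph osd pool ls detail 2>/dev/null | grep -qwF "crush_rule ${rule_id}"; then
            echo "Found pool with crush_rule ${rule_id}"
            return 0
        fi
        sleep "$delay"
    done

    # Dump the last observed pool state to make a timeout easier to diagnose.
    echo "Timed out waiting for a pool with crush_rule ${rule_id}" >&2
    ceph osd pool ls detail >&2 || true
    return 1
}
```

Called with no arguments, the sketch waits up to 60 seconds for any pool to report `crush_rule 2`, which covers both the "no pools yet" and the "migration still in flight" cases.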

Why this lands now

Observed recent multi-node failures across branches all bottom out on the same step:

`feat/orchPlus` (PR #721, "Enhance MicroCeph orchestrator support"): run 26499172061, step `Test crush rules`; stderr shows `osd pool ls detail` returned no `crush_rule 2`. Cluster state preceding the assertion: `pools: 0 pools, 0 pgs`.
`megademo-robot` runs 26424267655 and 26418196210: the robot-translated form of the same assertion fails with `STDOUT: : 1 != 0` (grep matched 0 lines).

The mgr daemon creates the .mgr pool asynchronously; tests should not assume it exists at any specific tick after the rule switch.
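Both failure modes can be reproduced without a cluster by feeding the same `grep` from canned strings. The pool line below is an illustrative stand-in for `ceph osd pool ls detail` output, not a capture from the failing runs, and `check` is a hypothetical wrapper used only for this demonstration.

```shell
# Canned stand-ins for `ceph osd pool ls detail`; the pool line is illustrative.
no_pools=""
stale_rule="pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins"

check() {
    # Same assertion as the bare workflow step, fed from a string instead of ceph.
    printf '%s\n' "$1" | grep -qF "crush_rule 2"
}

# grep exits 1 whenever nothing matches, so both states fail the assertion.
check "$no_pools"   || echo "no pools yet: grep exit $?"
check "$stale_rule" || echo "rule still 1: grep exit $?"
```

Either state is transient, which is why a single-shot assertion is timing-dependent while a bounded poll is not.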

Test plan

CI multi-node job passes on first attempt
Test crush rules step logs Found pool with crush_rule 2
No new lint / shell warnings in actionutils.sh

🤖 Generated with Claude Code

The "Test crush rules" step in the multi-node CI job races against mgr pool creation and crush-rule migration. After the failure-domain auto-switch flips the default crush rule from osd (id 1) to host (id 2), the mgr-created .mgr pool may not yet exist or its rule migration may still be in flight when the assertion runs. `ceph osd pool ls detail | grep -F "crush_rule 2"` then exits 1 because either no pools are listed, or the only listed pool still has crush_rule 1.

Add a wait_for_pool_crush_rule helper to actionutils.sh that polls `ceph osd pool ls detail` until a pool reaches the expected rule (default 30 tries * 2s = 60s budget), and use it from the workflow step. The crush rule existence check is unchanged.

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>

Replaces all 22 bash CI test jobs with Robot Framework 7.x suites that produce structured HTML/XML reports, support selective suite execution, and make failures easier to diagnose with inline keyword-level output.

Structure:
- tests/robot/resources/microceph_harness.resource: ~110 shared keywords (VM lifecycle, snap install, cluster bootstrap, OSD/RGW/NFS helpers)
- tests/robot/resources/streaming_process.py: real-time output for long-running processes (DSL, cephadm-adopt, wiping)
- 23 suite directories under tests/robot/, one per CI job: single-system-tests, multi-node-tests, availability-zone-tests, multi-node-tests-with-custom-microceph-ip, test-sequential-mon-host-refresh, test-maintenance-modes, loop-file-tests, wal-db-tests, upgrade-reef-tests, cluster-tests, rbd-replication-test, cephfs-replication-test, nfs-test, nfs-multinode-test, messenger-v2-tests, wiping-test, cephadm-adopt-test, dsl-functional-tests (6 parallel jobs), api-tests, static-checks, unit-tests
- robot.py / tox.ini: CLI wrapper and tox integration for local runs
- tests/scripts/: actionutils.sh idempotency fix, adoptutils.sh upstream fixes, test_dsl_functest.sh timeout hardening

Migration style:
- Inline reimplementation: bash logic rewritten as Robot/harness keywords (the majority, checked line-by-line for 1:1 parity)
- Direct bash execution: very long suites (DSL x6, cephadm-adopt, wiping, api-disk) run the original .sh unchanged via Run Streaming Process
- All flakiness fixes from upstream (canonical#737, canonical#741) incorporated; additional retry loops and polling guards added throughout

Assisted-by: claude-code:claude-sonnet-4-6
Assisted-by: claude-code:claude-opus-4-7
Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>

UtkarshBhatthere merged commit 99bf489 into main May 27, 2026
48 checks passed

UtkarshBhatthere mentioned this pull request May 27, 2026

Tracking: Ceph Mgr orchestrator module (microceph-orch) #743

Open

UtkarshBhatthere merged 1 commit into main from fix/multinode-crush-rules-race
