ci: Add robot framework structure #732

Open
johnramsden wants to merge 2 commits into canonical:main from johnramsden:megademo-robot

johnramsden (Member) commented May 13, 2026

Description

Migrate MicroCeph's CI test suite from bash + GitHub Actions to Robot Framework. The new tests must be runnable locally, without any dependency on GitHub Actions.

See for context:

Type of change

  • Clean code (code refactor, test updates; does not introduce functional changes)

Contributor checklist

Please check that you have:

  • self-reviewed the code in this PR
  • added code comments, particularly in less straightforward areas
  • checked and added or updated relevant documentation
  • added or updated HTML meta descriptions for any new or modified documentation pages (see #643)
  • verified that page title and headings accurately represent page content for new or modified documentation pages
  • checked and added or updated relevant release notes
  • added tests to verify effectiveness of this change

johnramsden marked this pull request as draft May 13, 2026 14:17
johnramsden force-pushed the megademo-robot branch 4 times, most recently from 68f3a89 to ecacc58, May 21, 2026 18:04
johnramsden added a commit to johnramsden/microceph that referenced this pull request May 21, 2026
The previous 14-test-case structure called test_dsl_functest.sh once per
test case. Each call bootstraps its own fresh VMs/containers, so the job
took 1h+ and never completed (cancelled after 1h 16m on PR canonical#732).

The original CI ran run_dsl_full_tests as a single step, letting the
script manage all VM lifecycles internally (shared VMs for
baseline/validation/dryrun, isolated VMs for provision/cleanup/
consistency). Restore that behaviour with one test case and a 4-hour
timeout, matching the upstream contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
johnramsden added a commit to johnramsden/microceph that referenced this pull request May 28, 2026
johnramsden force-pushed the megademo-robot branch 2 times, most recently from d899910 to 2ca33a5, May 28, 2026 16:05
johnramsden added a commit to johnramsden/microceph that referenced this pull request May 28, 2026
johnramsden linked an issue May 28, 2026 that may be closed by this pull request
johnramsden marked this pull request as ready for review May 29, 2026 20:20
Replaces all 22 bash CI test jobs with Robot Framework 7.x suites that
produce structured HTML/XML reports, support selective suite execution,
and make failures easier to diagnose with inline keyword-level output.

Structure:
- tests/robot/resources/microceph_harness.resource — ~110 shared keywords
  (VM lifecycle, snap install, cluster bootstrap, OSD/RGW/NFS helpers)
- tests/robot/resources/streaming_process.py — real-time output for
  long-running processes (DSL, cephadm-adopt, wiping)
- 23 suite directories under tests/robot/, one per CI job:
  single-system-tests, multi-node-tests, availability-zone-tests,
  multi-node-tests-with-custom-microceph-ip, test-sequential-mon-host-refresh,
  test-maintenance-modes, loop-file-tests, wal-db-tests, upgrade-reef-tests,
  cluster-tests, rbd-replication-test, cephfs-replication-test, nfs-test,
  nfs-multinode-test, messenger-v2-tests, wiping-test, cephadm-adopt-test,
  dsl-functional-tests (6 parallel jobs), api-tests, static-checks, unit-tests
- robot.py / tox.ini — CLI wrapper and tox integration for local runs
- tests/scripts/: actionutils.sh idempotency fix, adoptutils.sh upstream
  fixes, test_dsl_functest.sh timeout hardening

Migration style:
- Inline reimplementation: bash logic rewritten as Robot/harness keywords
  (the majority — checked line-by-line for 1:1 parity)
- Direct bash execution: very long suites (DSL x6, cephadm-adopt, wiping,
  api-disk) run the original .sh unchanged via Run Streaming Process
- All flakiness fixes from upstream (canonical#737, canonical#741) incorporated; additional
  retry loops and polling guards added throughout

Assisted-by: claude-code:claude-sonnet-4-6
Assisted-by: claude-code:claude-opus-4-7
Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
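
For readers unfamiliar with the "Run Streaming Process" approach mentioned in the commit message above, here is a minimal Python sketch of how real-time output from a long-running script might be relayed. It is not the contents of the PR's streaming_process.py; the function name `run_streaming_process`, its parameters, and the timer-based timeout are assumptions made for illustration. Robot Framework exposes module-level functions of an imported Python library as keywords, so a function with this name would be callable as `Run Streaming Process`.

```python
import subprocess
import sys
import threading
from typing import List, Optional, Tuple


def run_streaming_process(
    argv: List[str], timeout: Optional[float] = None
) -> Tuple[int, List[str]]:
    """Run argv, echoing each output line as it arrives.

    Hypothetical helper: returns (exit code, captured lines). If `timeout`
    seconds elapse first, the child is killed and its exit code is non-zero.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
        bufsize=1,  # line-buffered on the reading side
    )
    # A timer kills the child if it outlives the timeout; reading stdout
    # then hits EOF and the loop below ends.
    timer = threading.Timer(timeout, proc.kill) if timeout else None
    if timer:
        timer.start()
    lines: List[str] = []
    try:
        assert proc.stdout is not None
        for line in proc.stdout:
            print(line, end="", flush=True)  # show progress immediately
            lines.append(line)
        return_code = proc.wait()
    finally:
        if timer:
            timer.cancel()
    return return_code, lines
```

A call such as `run_streaming_process(["bash", "tests/scripts/test_dsl_functest.sh"], timeout=4 * 3600)` would then print the script's output line by line instead of only after it exits; the script arguments and the timeout value here are illustrative only.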
Rewrites .github/workflows/tests.yml to invoke Robot Framework instead of
calling actionutils.sh functions directly:

- Each of the 22 test jobs now runs:
    python3 tests/robot/robot.py --snap-path <snap> --test-suite <suite>
- static-checks and unit-tests moved to checks.yml (run on every push,
  not just when a snap artifact is available)
- DSL functional tests split into 6 parallel jobs (baseline, validation,
  dryrun, provision, cleanup, consistency) to cut wall-clock time
- LXD initialisation made explicit; host dependency checks added
- Wiping test streams output from inside the outer VM (no nested KVM)
- bash -x tracing enabled for all DSL jobs to aid debugging

Assisted-by: claude-code:claude-sonnet-4-6
Assisted-by: claude-code:claude-opus-4-7
Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
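
The per-job invocation described in the commit message above (`python3 tests/robot/robot.py --snap-path <snap> --test-suite <suite>`) suggests a thin wrapper around the `robot` command. The following is a rough sketch of what such a wrapper could look like, not the PR's actual robot.py. The `SNAP_PATH` variable name, the `robot-results` output directory, and the function names are hypothetical; only the two command-line flags and the tests/robot/ layout come from the PR text.

```python
import argparse
import subprocess
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two flags named in the commit message."""
    parser = argparse.ArgumentParser(description="Run one Robot Framework suite")
    parser.add_argument("--snap-path", required=True, help="path to the snap under test")
    parser.add_argument("--test-suite", required=True, help="suite directory name")
    return parser.parse_args(argv)


def build_robot_command(
    snap_path: str,
    test_suite: str,
    robot_root: str = "tests/robot",    # suite directories live here per the PR
    output_dir: str = "robot-results",  # hypothetical report location
) -> List[str]:
    """Translate wrapper arguments into a `robot` command line."""
    return [
        "robot",
        # Pass the snap path to the suite as a variable (name is an assumption).
        "--variable", f"SNAP_PATH:{snap_path}",
        # Keep each suite's HTML/XML reports in its own directory.
        "--outputdir", str(Path(output_dir) / test_suite),
        str(Path(robot_root) / test_suite),
    ]


def main(argv: Optional[List[str]] = None) -> int:
    """Run the selected suite and return robot's exit code."""
    args = parse_args(argv)
    return subprocess.call(build_robot_command(args.snap_path, args.test_suite))
```

Keeping command construction in a pure function separates argument handling from process execution, so the mapping from flags to the `robot` invocation can be checked without installing a snap or starting any VMs.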
Development

Successfully merging this pull request may close these issues.

Integration tests cannot be run locally

1 participant