Enhance MicroCeph orchestrator support#721
Conversation
0035b85 to
c57755e
Compare
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Python 3.12 enum machinery resolves str()/repr() on PlacementSpec and
HostPlacementSpec via Enum.__str__/__repr__ inside ceph-mgr, causing
AttributeError on _name_/_value_repr_. Two callsites trigger this:
- f"...{vars(spec)}" debug logs invoke dict.__repr__ which recurses into
each value's __repr__, hitting PlacementSpec → HostPattern enum.
- str(h) in _get_placement_hosts falls through to Enum.__str__ on
HostPlacementSpec (NamedTuple) instances.
Replace 7 vars(spec) f-strings with static debug strings. Use h.hostname
directly when extracting placement hosts.
Also surface raw output in test-orch.sh assert_contains/assert_not_contains
on FAIL to make future diagnosis cheaper.
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
- get_inventory: drop the dead list_resources() client method and its misleading test mocks; document that the inventory reports OSD-backed disks only, since /1.0/resources is local-node-only and the socket client cannot proxy to peers. - _apply_service: report a race-free, honest summary "<svc>: enabled (requested placement: <hosts>)" instead of falsely claiming each placement host gained the service; per-host targeting is not supported via the unix socket so the actual placement cannot be confirmed. - _get_existing_service_hosts: accept a group_id so distinct NFS clusters are not conflated when checking whether a service already runs. - get_hosts: parse bracketed IPv6 "[addr]:port" addresses and leave bare IPv6 literals intact instead of truncating on the last colon. - service.py: tolerate non-JSON / error-key-less HTTP error bodies when translating microcluster exceptions. - _get_service_hostlist: use .get() for record fields, consistent with the rest of the module. - remove_service: log when a non-NFS service id is ignored. - test-orch.sh: quote $CEPH, drop the unused host_count assignment. - pyproject.toml: declare pytest as a dev dependency group. Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
c930643 to
d320224
Compare
There was a problem hiding this comment.
Pull request overview
This PR expands the MicroCeph Ceph-mgr orchestrator backend to support more orchestrator operations (inventory, daemon/service listing, apply/remove services, host removal, restart), and adds Python unit + integration coverage to CI to validate the new behavior.
Changes:
- Implement orchestrator backend support for inventory, daemon/service listing, service apply/remove, host removal, and restart actions.
- Add a Python unit test suite for the orchestrator module and client layer using Ceph-type stubs/mocks.
- Add a bash-based integration test script and run both unit + integration orchestrator tests in GitHub Actions.
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
tests/scripts/test-orch.sh |
Adds an end-to-end orchestrator CLI integration test for status/ls/ps/inventory/apply/rm/restart flows. |
microceph-orch/tests/test_module.py |
Unit tests covering orchestrator module behaviors (hosts/services/daemons/inventory/apply/rm/actions). |
microceph-orch/tests/test_client.py |
Unit tests for the MicroCeph client services layer (cluster + extended API). |
microceph-orch/tests/stubs.py |
Provides minimal Ceph/orchestrator type stubs for running tests outside the snap environment. |
microceph-orch/tests/conftest.py |
Installs mocked Ceph/snap-only modules into sys.modules and provides shared pytest fixtures. |
microceph-orch/src/microceph/module.py |
Implements the new orchestrator methods (parsing, filtering, inventory, apply/remove/restart). |
microceph-orch/src/microceph/client/service.py |
Hardens HTTP error-body parsing to avoid masking the real HTTP error on non-JSON responses. |
microceph-orch/src/microceph/client/cluster.py |
Makes client list methods robust to null/missing metadata and adds service enable/delete/restart calls. |
microceph-orch/pyproject.toml |
Adds a dev dependency group for pytest. |
.gitignore |
Ignores Python bytecode caches. |
.github/workflows/tests.yml |
Runs orchestrator Python unit tests and adds orchestrator integration tests to CI. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
`pip install pytest` alone left runtime deps like `requests` to runner preinstall, which is brittle. Install the package itself so declared dependencies resolve. Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com> Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
MicroCeph has no host-label concept. Previously labels were silently ignored, returning unfiltered hosts/devices to callers that expected a label filter. Raise NotImplementedError instead so the mismatch surfaces. Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com> Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
The preceding assert already accepts "enabled|already active". The second assert forced "enabled" only, which would false-negative when apply legitimately reports "already active" (idempotent re-apply after the test's earlier `orch rm rgw`). Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com> Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
ExtendedAPIService.list_services() may return records without group_id (and other optional keys). Direct dict indexing in list_daemons raised KeyError and broke `ceph orch ps`. Switch to .get() with empty-string defaults, consistent with how the same fields are read in _get_service_hostlist() and the inventory path. Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com> Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
sabaini
left a comment
There was a problem hiding this comment.
Hey @UtkarshBhatthere some notes inline
For the known limitations around SSL and remote targets I feel we'd need to warn users, or maybe even better hide the functionality behind a feature flag so users arent surprised by the behaviour
| name=service_name, | ||
| payload=payload, | ||
| wait=True, | ||
| ) |
There was a problem hiding this comment.
So for remote hosts targetting we'd need to pass hosts to enable_service() -- this is a known limitation afaict?
| ) | ||
|
|
||
| payload = json.dumps(rgw_params) if rgw_params else "{}" | ||
| return self._apply_service("rgw", spec, payload) |
There was a problem hiding this comment.
We're dropping the service_id so only bare rgw is being applied and service_id is dropped. Would suggest to log a warning and document the limitation
| def test_restart_service_with_id(self, orchestrator, mock_client): | ||
| result = orchestrator.service_action("restart", "nfs.mycluster") | ||
| assert result.exception is None | ||
| mock_client.services.restart_services.assert_called_once_with(["nfs"]) |
There was a problem hiding this comment.
Hm, do we actually have the ability to restart NFS?
Three issues raised in PR review:
1. apply_rgw silently drops spec.service_id. MicroCeph deploys a
single bare "rgw" service per node and does not support multiple
RGW realms with distinct service_ids on the same cluster. Log a
warning when a service_id is supplied so users get a clear signal
instead of confusing apparently-successful no-op behaviour, and
document the limitation on the docstring.
2. apply_mds has the same shape: per-filesystem MDS placement is not
supported and service_id is silently ignored. Same fix.
3. service_action("restart", ...) only the backend serviceWorkerTable
in microceph/ceph/services.go has entries for osd, mon and rgw;
any other service type would have surfaced an opaque
`no handler defined for service X` RemoteException from the
backend. Reject unsupported service types up front with a
NotImplementedError that names the supported set, and add a
RESTART_SUPPORTED_SERVICES constant so the boundary is explicit.
Tests:
- test_apply_rgw_service_id_warns
- test_apply_mds_service_id_warns
- test_restart_unsupported_service_type covers nfs/mds/mgr/rbd-mirror
- test_restart_service_with_id updated to use rgw.realm1 (was
nfs.mycluster, which is now correctly rejected)
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
Adds end-to-end per-host targeting for service enable / remove /
restart, plus a bare-name guard that aligns the orchestrator's
service-name semantics with MicroCeph's single-bare-instance-per-node
backend.
Transport: MicroCeph's HTTP API endpoints for service mutation all
declare `ProxyTarget: true` in the Go rest layer. The Python client
keeps using the local unix socket; the server-side proxyTarget
middleware in microcluster reads the new `?target=<member>` query
parameter and mTLS-forwards the request to that member's HTTPS
endpoint. No certs or extra config required in the mgr snap; ceph-mgr
and microcephd share `$SNAP_COMMON`.
Behavioural changes:
* `_apply_service`, `remove_service`, `service_action` fan out per
host. Already-running hosts are skipped on apply; remove and
restart discover hosts via `list_services()`. Any per-host failure
raises `RuntimeError` with partial-success context, instead of
silently mixing failure entries into a success result.
* `_get_existing_service_hosts` no longer swallows `RemoteException`
on `list_services()` — a backend failure now surfaces rather than
producing a false "no-op" success.
* New `_require_bare_service_name(svc_type, svc_id)` helper and a
`DOTTED_NAME_SUPPORTED_SERVICES = {"nfs"}` allowlist. Dotted names
on any other service (rgw.realm1, mds.fs1, mon.x, ...) are rejected
up front with `ValueError`. Previously rgw/mds warned-and-dropped;
mon/mgr/rbd-mirror/cephfs-mirror silently dropped. Same guard
applied on `remove_service` and `service_action`.
* `_apply_service` parameter renamed `service_name` -> `service_type`
to reflect that it always receives the bare type, never a dotted
name.
Client (`microceph-orch/src/microceph/client/cluster.py`):
* `enable_service`, `delete_service`, `restart_services`,
`delete_nfs_service` accept an optional `target: str | None`
kwarg. When set, it is forwarded as a `?target=` query param to
drive the server-side proxy. Guard uses `if target is not None`
so an empty string is no longer silently equivalent to None.
Tests:
* 94 unit tests pass (was 79 before this branch). New coverage:
per-host target assertion on every apply/remove/restart success
path; `_apply_service` partial-failure raises with context;
`_get_existing_service_hosts` exception propagation; dotted-name
rejection on rgw/mds/mon/mgr/rbd-mirror/cephfs-mirror; NFS
group_id no-match remove; `service_action` partial failure;
`reset_mock` per loop iter in unsupported-restart test.
* `side_effect` lambdas use `(*args, **kwargs)` to remain robust to
positional-vs-keyword call shapes.
Multi-node CI integration test
(`tests/scripts/test-orch.sh`):
* New section 6b applies rgw with `--placement="$first_host,$second_host"`
and verifies the daemon appears on the *second* host via
`orch ps --hostname` — exercising the unix-socket ->
proxyTarget -> remote mTLS path end to end.
* Section 9 verifies cluster-wide RGW removal (every host) instead
of just the local node.
* Section 8 asserts the new `Restarted mon on <host>` output shape
and that the restart enumerates each host in a multi-host cluster.
* Adds a guard test that `orch restart rgw.realm1` fails with
`does not support a service_id`.
Refs #743
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
Pre-merge polish after multi-agent review of the per-host targeting
work. No new feature surface; corrects how the orchestrator reports
failures and unsupported operations to the Ceph framework, and
hardens the apply path against a backend race.
Exception hierarchy
-------------------
Swap bare exception types for Ceph orchestrator types so the
framework renders them as proper operation errors rather than Python
tracebacks:
* `RuntimeError` (per-host fan-out failure) -> `OrchestratorError`
* `NotImplementedError` (unsupported action / unsupported service
type / unsupported InventoryFilter labels) ->
`OrchestratorValidationError`
`ValueError` is kept for input-shape errors (dotted-name guard, NFS
missing cluster_id) since those fire before any I/O.
Cephadm parity on "nothing to do"
---------------------------------
`remove_service` and `service_action("restart", ...)` now raise
`OrchestratorError("Service X is not running on any host")` when the
requested service has no live hosts, instead of returning a green
success summary. This matches cephadm: an operator asking to remove
or restart a service that does not exist should see an error so a
typo or stale assumption is caught, not a silent no-op.
TOCTOU tolerance on apply
-------------------------
`_apply_service` snapshots existing hosts once and then enables per
host. If the service races into existence between snapshot and
enable, the backend's `genericHospitalityCheck`
(`microceph/ceph/services_placement_generic.go`) returns
"<svc> service already active on host". This is the desired end
state, not a failure, so we now match that substring and treat the
host as enabled instead of appending to failures.
`describe_service` size
-----------------------
`ServiceDescription` now sets `size=len(hostlist)` alongside
`running`. MicroCeph has no separate spec store -- desired state is
the live running set -- so `ceph orch ls` was rendering RUNNING/-
(looks degraded). With `size` populated it renders RUNNING/SIZE
cleanly.
Integration test
----------------
* `tests/scripts/test-orch.sh` `second_host` awk now filters first
column to a hostname shape (`/^[A-Za-z]/`) so the trailing
"N hosts in cluster" footer row is no longer picked up as a
hostname on single-node CI.
* Section 10 (Remove NFS) now accepts either the new
"removed from <host>" success shape or the new
"not running on any host" error shape (NFS may legitimately have
failed to start on CI runners missing kernel modules).
Tests
-----
* 94 unit tests pass.
* Exception-class assertions updated:
- `isinstance(..., RuntimeError)` -> `OrchestratorError`
- `isinstance(..., NotImplementedError)` -> `OrchestratorValidationError`
* `test_remove_service_no_hosts`, `test_remove_service_nfs_group_id_no_match`,
`test_restart_no_hosts` now assert the new "not running on any host"
error shape instead of a string-summary success.
* `tests/stubs.py` and `tests/conftest.py` extended with
`OrchestratorError` and `OrchestratorValidationError` stubs and
bound into the injected `orchestrator` module so production-code
imports resolve to the right types under test.
* `ServiceDescription` stub gains a `size` keyword.
Refs #743
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Testing
Notes