Skip to content

Enhance MicroCeph orchestrator support#721

Open
UtkarshBhatthere wants to merge 11 commits into
mainfrom
feat/orchPlus
Open

Enhance MicroCeph orchestrator support#721
UtkarshBhatthere wants to merge 11 commits into
mainfrom
feat/orchPlus

Conversation

@UtkarshBhatthere
Copy link
Copy Markdown
Contributor

Summary

  • add MicroCeph orchestrator support for service inventory, daemon listing, service apply/remove, host removal, and restart actions
  • add Python unit tests for the orchestrator client/module using Ceph test stubs
  • add orchestrator unit and integration coverage to CI

Testing

  • PYTHONPATH=src:tests python3 -m pytest tests/ -v
  • bash -n tests/scripts/test-orch.sh

Notes

  • go test ./... was attempted locally but fails in this environment on go-dqlite CGO headers: DQLITE_SNAPSHOT_TRAILING_STATIC/DYNAMIC undefined
  • the SMB dashboard compatibility patch and python3-openssl change from the old branch are already present on main, so they were not included in this PR

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Python 3.12 enum machinery resolves str()/repr() on PlacementSpec and
HostPlacementSpec via Enum.__str__/__repr__ inside ceph-mgr, causing
AttributeError on _name_/_value_repr_. Two callsites trigger this:

- f"...{vars(spec)}" debug logs invoke dict.__repr__ which recurses into
  each value's __repr__, hitting PlacementSpec → HostPattern enum.
- str(h) in _get_placement_hosts falls through to Enum.__str__ on
  HostPlacementSpec (NamedTuple) instances.

Replace 7 vars(spec) f-strings with static debug strings. Use h.hostname
directly when extracting placement hosts.

Also surface raw output in test-orch.sh assert_contains/assert_not_contains
on FAIL to make future diagnosis cheaper.

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
- get_inventory: drop the dead list_resources() client method and its
  misleading test mocks; document that the inventory reports OSD-backed
  disks only, since /1.0/resources is local-node-only and the socket
  client cannot proxy to peers.
- _apply_service: report a race-free, honest summary
  "<svc>: enabled (requested placement: <hosts>)" instead of falsely
  claiming each placement host gained the service; per-host targeting is
  not supported via the unix socket so the actual placement cannot be
  confirmed.
- _get_existing_service_hosts: accept a group_id so distinct NFS clusters
  are not conflated when checking whether a service already runs.
- get_hosts: parse bracketed IPv6 "[addr]:port" addresses and leave bare
  IPv6 literals intact instead of truncating on the last colon.
- service.py: tolerate non-JSON / error-key-less HTTP error bodies when
  translating microcluster exceptions.
- _get_service_hostlist: use .get() for record fields, consistent with
  the rest of the module.
- remove_service: log when a non-NFS service id is ignored.
- test-orch.sh: quote $CEPH, drop the unused host_count assignment.
- pyproject.toml: declare pytest as a dev dependency group.

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands the MicroCeph Ceph-mgr orchestrator backend to support more orchestrator operations (inventory, daemon/service listing, apply/remove services, host removal, restart), and adds Python unit + integration coverage to CI to validate the new behavior.

Changes:

  • Implement orchestrator backend support for inventory, daemon/service listing, service apply/remove, host removal, and restart actions.
  • Add a Python unit test suite for the orchestrator module and client layer using Ceph-type stubs/mocks.
  • Add a bash-based integration test script and run both unit + integration orchestrator tests in GitHub Actions.

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/scripts/test-orch.sh Adds an end-to-end orchestrator CLI integration test for status/ls/ps/inventory/apply/rm/restart flows.
microceph-orch/tests/test_module.py Unit tests covering orchestrator module behaviors (hosts/services/daemons/inventory/apply/rm/actions).
microceph-orch/tests/test_client.py Unit tests for the MicroCeph client services layer (cluster + extended API).
microceph-orch/tests/stubs.py Provides minimal Ceph/orchestrator type stubs for running tests outside the snap environment.
microceph-orch/tests/conftest.py Installs mocked Ceph/snap-only modules into sys.modules and provides shared pytest fixtures.
microceph-orch/src/microceph/module.py Implements the new orchestrator methods (parsing, filtering, inventory, apply/remove/restart).
microceph-orch/src/microceph/client/service.py Hardens HTTP error-body parsing to avoid masking the real HTTP error on non-JSON responses.
microceph-orch/src/microceph/client/cluster.py Makes client list methods robust to null/missing metadata and adds service enable/delete/restart calls.
microceph-orch/pyproject.toml Adds a dev dependency group for pytest.
.gitignore Ignores Python bytecode caches.
.github/workflows/tests.yml Runs orchestrator Python unit tests and adds orchestrator integration tests to CI.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/scripts/test-orch.sh Outdated
Comment thread microceph-orch/src/microceph/module.py
Comment thread .github/workflows/tests.yml
`pip install pytest` alone left runtime deps like `requests` to runner
preinstall, which is brittle. Install the package itself so declared
dependencies resolve.

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
MicroCeph has no host-label concept. Previously labels were silently
ignored, returning unfiltered hosts/devices to callers that expected a
label filter. Raise NotImplementedError instead so the mismatch surfaces.

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
The preceding assert already accepts "enabled|already active". The
second assert forced "enabled" only, which would false-negative when
apply legitimately reports "already active" (idempotent re-apply after
the test's earlier `orch rm rgw`).

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.

Comment thread microceph-orch/src/microceph/module.py
ExtendedAPIService.list_services() may return records without
group_id (and other optional keys). Direct dict indexing in
list_daemons raised KeyError and broke `ceph orch ps`. Switch to
.get() with empty-string defaults, consistent with how the same
fields are read in _get_service_hostlist() and the inventory path.

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@sabaini sabaini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @UtkarshBhatthere some notes inline
For the known limitations around SSL and remote targets I feel we'd need to warn users, or maybe even better hide the functionality behind a feature flag so users arent surprised by the behaviour

Comment thread microceph-orch/src/microceph/module.py Outdated
name=service_name,
payload=payload,
wait=True,
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for remote hosts targetting we'd need to pass hosts to enable_service() -- this is a known limitation afaict?

)

payload = json.dumps(rgw_params) if rgw_params else "{}"
return self._apply_service("rgw", spec, payload)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're dropping the service_id so only bare rgw is being applied and service_id is dropped. Would suggest to log a warning and document the limitation

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar with mds

Comment thread microceph-orch/tests/test_module.py Outdated
def test_restart_service_with_id(self, orchestrator, mock_client):
result = orchestrator.service_action("restart", "nfs.mycluster")
assert result.exception is None
mock_client.services.restart_services.assert_called_once_with(["nfs"])
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, do we actually have the ability to restart NFS?

Three issues raised in PR review:

1. apply_rgw silently drops spec.service_id. MicroCeph deploys a
   single bare "rgw" service per node and does not support multiple
   RGW realms with distinct service_ids on the same cluster. Log a
   warning when a service_id is supplied so users get a clear signal
   instead of confusing apparently-successful no-op behaviour, and
   document the limitation on the docstring.

2. apply_mds has the same shape: per-filesystem MDS placement is not
   supported and service_id is silently ignored. Same fix.

3. service_action("restart", ...) only the backend serviceWorkerTable
   in microceph/ceph/services.go has entries for osd, mon and rgw;
   any other service type would have surfaced an opaque
   `no handler defined for service X` RemoteException from the
   backend. Reject unsupported service types up front with a
   NotImplementedError that names the supported set, and add a
   RESTART_SUPPORTED_SERVICES constant so the boundary is explicit.

Tests:
- test_apply_rgw_service_id_warns
- test_apply_mds_service_id_warns
- test_restart_unsupported_service_type covers nfs/mds/mgr/rbd-mirror
- test_restart_service_with_id updated to use rgw.realm1 (was
  nfs.mycluster, which is now correctly rejected)

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
Adds end-to-end per-host targeting for service enable / remove /
restart, plus a bare-name guard that aligns the orchestrator's
service-name semantics with MicroCeph's single-bare-instance-per-node
backend.

Transport: MicroCeph's HTTP API endpoints for service mutation all
declare `ProxyTarget: true` in the Go rest layer. The Python client
keeps using the local unix socket; the server-side proxyTarget
middleware in microcluster reads the new `?target=<member>` query
parameter and mTLS-forwards the request to that member's HTTPS
endpoint. No certs or extra config required in the mgr snap; ceph-mgr
and microcephd share `$SNAP_COMMON`.

Behavioural changes:

* `_apply_service`, `remove_service`, `service_action` fan out per
  host. Already-running hosts are skipped on apply; remove and
  restart discover hosts via `list_services()`. Any per-host failure
  raises `RuntimeError` with partial-success context, instead of
  silently mixing failure entries into a success result.
* `_get_existing_service_hosts` no longer swallows `RemoteException`
  on `list_services()` — a backend failure now surfaces rather than
  producing a false "no-op" success.
* New `_require_bare_service_name(svc_type, svc_id)` helper and a
  `DOTTED_NAME_SUPPORTED_SERVICES = {"nfs"}` allowlist. Dotted names
  on any other service (rgw.realm1, mds.fs1, mon.x, ...) are rejected
  up front with `ValueError`. Previously rgw/mds warned-and-dropped;
  mon/mgr/rbd-mirror/cephfs-mirror silently dropped. Same guard
  applied on `remove_service` and `service_action`.
* `_apply_service` parameter renamed `service_name` -> `service_type`
  to reflect that it always receives the bare type, never a dotted
  name.

Client (`microceph-orch/src/microceph/client/cluster.py`):

* `enable_service`, `delete_service`, `restart_services`,
  `delete_nfs_service` accept an optional `target: str | None`
  kwarg. When set, it is forwarded as a `?target=` query param to
  drive the server-side proxy. Guard uses `if target is not None`
  so an empty string is no longer silently equivalent to None.

Tests:

* 94 unit tests pass (was 79 before this branch). New coverage:
  per-host target assertion on every apply/remove/restart success
  path; `_apply_service` partial-failure raises with context;
  `_get_existing_service_hosts` exception propagation; dotted-name
  rejection on rgw/mds/mon/mgr/rbd-mirror/cephfs-mirror; NFS
  group_id no-match remove; `service_action` partial failure;
  `reset_mock` per loop iter in unsupported-restart test.
* `side_effect` lambdas use `(*args, **kwargs)` to remain robust to
  positional-vs-keyword call shapes.

Multi-node CI integration test
(`tests/scripts/test-orch.sh`):

* New section 6b applies rgw with `--placement="$first_host,$second_host"`
  and verifies the daemon appears on the *second* host via
  `orch ps --hostname` — exercising the unix-socket ->
  proxyTarget -> remote mTLS path end to end.
* Section 9 verifies cluster-wide RGW removal (every host) instead
  of just the local node.
* Section 8 asserts the new `Restarted mon on <host>` output shape
  and that the restart enumerates each host in a multi-host cluster.
* Adds a guard test that `orch restart rgw.realm1` fails with
  `does not support a service_id`.

Refs #743

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
Pre-merge polish after multi-agent review of the per-host targeting
work. No new feature surface; corrects how the orchestrator reports
failures and unsupported operations to the Ceph framework, and
hardens the apply path against a backend race.

Exception hierarchy
-------------------

Swap bare exception types for Ceph orchestrator types so the
framework renders them as proper operation errors rather than Python
tracebacks:

* `RuntimeError` (per-host fan-out failure) -> `OrchestratorError`
* `NotImplementedError` (unsupported action / unsupported service
  type / unsupported InventoryFilter labels) ->
  `OrchestratorValidationError`

`ValueError` is kept for input-shape errors (dotted-name guard, NFS
missing cluster_id) since those fire before any I/O.

Cephadm parity on "nothing to do"
---------------------------------

`remove_service` and `service_action("restart", ...)` now raise
`OrchestratorError("Service X is not running on any host")` when the
requested service has no live hosts, instead of returning a green
success summary. This matches cephadm: an operator asking to remove
or restart a service that does not exist should see an error so a
typo or stale assumption is caught, not a silent no-op.

TOCTOU tolerance on apply
-------------------------

`_apply_service` snapshots existing hosts once and then enables per
host. If the service races into existence between snapshot and
enable, the backend's `genericHospitalityCheck`
(`microceph/ceph/services_placement_generic.go`) returns
"<svc> service already active on host". This is the desired end
state, not a failure, so we now match that substring and treat the
host as enabled instead of appending to failures.

`describe_service` size
-----------------------

`ServiceDescription` now sets `size=len(hostlist)` alongside
`running`. MicroCeph has no separate spec store -- desired state is
the live running set -- so `ceph orch ls` was rendering RUNNING/-
(looks degraded). With `size` populated it renders RUNNING/SIZE
cleanly.

Integration test
----------------

* `tests/scripts/test-orch.sh` `second_host` awk now filters first
  column to a hostname shape (`/^[A-Za-z]/`) so the trailing
  "N hosts in cluster" footer row is no longer picked up as a
  hostname on single-node CI.
* Section 10 (Remove NFS) now accepts either the new
  "removed from <host>" success shape or the new
  "not running on any host" error shape (NFS may legitimately have
  failed to start on CI runners missing kernel modules).

Tests
-----

* 94 unit tests pass.
* Exception-class assertions updated:
  - `isinstance(..., RuntimeError)` -> `OrchestratorError`
  - `isinstance(..., NotImplementedError)` -> `OrchestratorValidationError`
* `test_remove_service_no_hosts`, `test_remove_service_nfs_group_id_no_match`,
  `test_restart_no_hosts` now assert the new "not running on any host"
  error shape instead of a string-summary success.
* `tests/stubs.py` and `tests/conftest.py` extended with
  `OrchestratorError` and `OrchestratorValidationError` stubs and
  bound into the injected `orchestrator` module so production-code
  imports resolve to the right types under test.
* `ServiceDescription` stub gains a `size` keyword.

Refs #743

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Assisted-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants