feat(ceph): decouple microceph from core and unify with storage framework#686
feat(ceph): decouple microceph from core and unify with storage framework#686gboutry wants to merge 1 commit into
Conversation
4fdf02d to
b5ebb4e
Compare
|
Overall seems good to me |
There was a problem hiding this comment.
Pull request overview
This PR decouples MicroCeph from Sunbeam’s core orchestration so the storage role can be deployed without automatically deploying MicroCeph, enabling future scenarios like remote/external Ceph backends.
Changes:
- Introduces a
CephProviderabstraction plus persistedCephDeploymentMode(microceph/none) in clusterd, and wires provider selection throughDeployment.get_ceph_provider(). - Moves MicroCeph step implementation into
sunbeam/features/microceph/*with a backward-compatible re-export shim insunbeam/steps/microceph.py. - Guards MicroCeph-specific behavior (bootstrap/join/remove/resize/upgrades/status/TLS/maintenance) behind
is_microceph_necessary()checks.
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| sunbeam-python/sunbeam/core/ceph.py | Adds Ceph mode persistence + provider interface (CephProvider, NoCephProvider, is_microceph_necessary, SetCephProviderStep). |
| sunbeam-python/sunbeam/core/deployment.py | Adds Deployment.get_ceph_provider() with lazy provider selection/caching. |
| sunbeam-python/sunbeam/features/microceph/steps.py | New home for MicroCeph step implementations previously in sunbeam/steps/microceph.py. |
| sunbeam-python/sunbeam/features/microceph/provider.py | Implements MicrocephProvider for tfvars/offer URLs/replica logic. |
| sunbeam-python/sunbeam/steps/microceph.py | Backward-compatible shim re-exporting MicroCeph symbols from the feature module. |
| sunbeam-python/sunbeam/steps/openstack.py | Switches Ceph tfvars to provider interface; gates traefik-rgw service patching on microceph mode. |
| sunbeam-python/sunbeam/steps/cinder_volume.py | Delegates Ceph-specific tfvars to CephProvider. |
| sunbeam-python/sunbeam/steps/cluster_status.py | Uses provider to decide whether to show a storage column/app in status. |
| sunbeam-python/sunbeam/provider/local/commands.py | Adds --no-microceph and persists mode via SetCephProviderStep; gates MicroCeph steps based on mode. |
| sunbeam-python/sunbeam/provider/local/deployment.py | Skips emitting microceph_config preseed section when microceph mode is disabled. |
| sunbeam-python/sunbeam/provider/maas/commands.py | Gates MicroCeph deploy/remove/destroy steps based on is_microceph_necessary(). |
| sunbeam-python/sunbeam/steps/upgrades/intra_channel.py | Makes MicroCeph upgrade steps conditional on microceph mode. |
| sunbeam-python/sunbeam/commands/resize.py | Makes MicroCeph-related resize steps conditional on microceph mode. |
| sunbeam-python/sunbeam/features/tls/{ca.py,common.py} | Only monitors traefik-rgw TLS app when microceph is enabled. |
| sunbeam-python/sunbeam/features/maintenance/{commands.py,checks.py} | Gates MicroCeph maintenance checks/actions on microceph mode. |
| sunbeam-python/tests/unit/sunbeam/steps/test_openstack.py | Updates unit tests to provide a MicrocephProvider via deployment.get_ceph_provider(). |
| sunbeam-python/tests/unit/sunbeam/steps/test_cinder_volume.py | Updates test expectations/patch paths for the new provider-based Ceph integration. |
| sunbeam-python/tests/unit/sunbeam/steps/test_microceph.py | Updates imports to new MicroCeph feature module location. |
| sunbeam-python/tests/unit/sunbeam/steps/test_maintenance.py | Updates MicroCeph APPLICATION import to new feature module location. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
9e73e99 to
2eeeec4
Compare
1d46094 to
9af282f
Compare
|
@hemanthnakkina force push is: rebasing on main, and rewording the telemetry feature error message not to resurface microceph implementation details |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 57 out of 57 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """Create instance of UpgradeStorageBackendCharms class. | ||
|
|
||
| Upgrades cinder-volume and storage backend charms managed by the | ||
| storage-backend Terraform plan. | ||
|
|
||
| :client: Client to connect to clusterdb | ||
| :jhelper: Helper for interacting with pylibjuju | ||
| :manifest: Manifest object | ||
| :model: Name of model containing charms. | ||
| """ | ||
| super().__init__( | ||
| "Upgrade Cinder Volume charm", | ||
| "Upgrading cinder-volume charm", | ||
| "Upgrade Storage Backend charms", | ||
| "Upgrading storage backend charms", | ||
| client, | ||
| tfhelper, | ||
| jhelper, | ||
| manifest, | ||
| model, | ||
| ["cinder-volume", "cinder-volume-ceph"], | ||
| CINDER_VOLUME_CONFIG_KEY, | ||
| ["cinder-volume"], | ||
| STORAGE_BACKEND_TFVAR_CONFIG_KEY, | ||
| 1200, |
There was a problem hiding this comment.
UpgradeStorageBackendCharms claims to upgrade “cinder-volume and storage backend charms”, but it only targets the cinder-volume charm. This means backend charms managed by the storage-backend plan (e.g., cinder-volume-ceph for internal Ceph, and any other enabled storage backends) won’t be upgraded by this step. Consider deriving the charm list from the storage backend tfvars / registered backends, or at least including cinder-volume-ceph when internal Ceph is enabled.
| backend_type = "internal-ceph" | ||
| display_name = "Internal Ceph" | ||
| generally_available = True |
There was a problem hiding this comment.
InternalCephBackend is documented as “not exposed via the sunbeam storage add CLI”, but generally_available = True means FeatureGateMixin.check_gated() will always return False, so this backend will be visible/registrable in the storage CLI like any other backend. To keep it internal-only, set generally_available = False (and avoid enabling it via feature gates), or override check_enabled()/CLI registration so it never registers user-facing commands.
| integrations: set[HypervisorIntegration] = set() | ||
|
|
||
| registered = client.cluster.get_storage_backends() | ||
|
|
There was a problem hiding this comment.
collect_hypervisor_integrations() doesn’t call _load_backends(). If the global backend registry is non-empty while _loaded is still False (e.g., partial/early registration), this method can silently miss backend implementations and return an incomplete integration set. Call self._load_backends() at the start of this method (consistent with get_backend()/register_cli_commands()).
7026cb5 to
43e3da3
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 58 out of 59 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| class ReapplyStorageBackendTerraformPlanStep(BaseStep): | ||
| """Reapply the storage-backend Terraform plan. | ||
|
|
||
| This step re-applies the storage-backend plan using the existing | ||
| Terraform variables stored in clusterd. It is used during upgrades | ||
| to pick up charm channel / revision changes without rebuilding the | ||
| full configuration from scratch. | ||
| """ | ||
|
|
||
| def __init__( | ||
| self, | ||
| deployment: Deployment, | ||
| client: Client, | ||
| tfhelper: TerraformHelper, | ||
| jhelper: JujuHelper, | ||
| manifest: Manifest, | ||
| model: str, | ||
| ): | ||
| super().__init__( | ||
| "Reapply Storage Backend Terraform plan", | ||
| "Reapplying Storage Backend Terraform plan", | ||
| ) | ||
| self.deployment = deployment | ||
| self.client = client | ||
| self.tfhelper = tfhelper | ||
| self.jhelper = jhelper | ||
| self.manifest = manifest | ||
| self.model = model | ||
|
|
||
| def is_skip(self, context: StepContext) -> Result: | ||
| """Skip when no storage backends are configured.""" | ||
| try: | ||
| tfvars = read_config(self.client, STORAGE_BACKEND_TFVAR_CONFIG_KEY) | ||
| except ConfigItemNotFoundException: | ||
| return Result(ResultType.SKIPPED, "No storage backends configured.") | ||
|
|
||
| if not tfvars.get("backends") and not tfvars.get("cinder-volumes"): | ||
| return Result(ResultType.SKIPPED, "No storage backends configured.") | ||
|
|
||
| return Result(ResultType.COMPLETED) | ||
|
|
||
| @tenacity.retry( | ||
| wait=tenacity.wait_fixed(60), | ||
| stop=tenacity.stop_after_delay(300), | ||
| retry=tenacity.retry_if_exception_type(TerraformStateLockedException), | ||
| retry_error_callback=friendly_terraform_lock_retry_callback, | ||
| ) | ||
| def run(self, context: StepContext) -> Result: | ||
| """Reapply the storage backend Terraform plan.""" | ||
| try: | ||
| tfvars = read_config(self.client, STORAGE_BACKEND_TFVAR_CONFIG_KEY) | ||
| except ConfigItemNotFoundException: | ||
| LOG.debug("No storage backend config found, nothing to reapply.") | ||
| return Result(ResultType.COMPLETED) | ||
|
|
||
| try: | ||
| self.tfhelper.update_tfvars_and_apply_tf( | ||
| self.client, | ||
| self.manifest, | ||
| tfvar_config=STORAGE_BACKEND_TFVAR_CONFIG_KEY, | ||
| override_tfvars=tfvars, | ||
| reporter=context.reporter, | ||
| ) |
There was a problem hiding this comment.
ReapplyStorageBackendTerraformPlanStep claims it will “pick up charm channel / revision changes”, but run() passes the already-stored tfvars as override_tfvars into update_tfvars_and_apply_tf(). That makes those values effectively “computed” and prevents manifest-driven updates (and these channel/revision fields are plain strings, not charm config dicts that get merged). Either rebuild/refresh the relevant fields from the current manifest before applying, or adjust the logic/docstring so it doesn’t imply manifest channel/revision updates that won’t occur.
| client = deployment.get_client() | ||
| storage_tfhelper = deployment.get_tfhelper(backend.tfplan) | ||
| jhelper = JujuHelper(deployment.juju_controller) | ||
| manifest = deployment.get_manifest() |
There was a problem hiding this comment.
CephFeature disable path loads the manifest via deployment.get_manifest() (ignoring self.user_manifest), while enable/on_join paths consistently use deployment.get_manifest(self.user_manifest). If the user provided a manifest override during enable, disable may compute different charm settings / plan inputs than enable did. Consider using deployment.get_manifest(self.user_manifest) consistently in _get_internal_ceph_disable_steps() and run_disable_plans().
| manifest = deployment.get_manifest() | |
| manifest = deployment.get_manifest(self.user_manifest) |
| SetCephMgrPoolSizeStep( | ||
| client, jhelper, deployment.openstack_machines_model | ||
| ), | ||
| ] | ||
| ) |
There was a problem hiding this comment.
remove_node() appends SetCephMgrPoolSizeStep unconditionally. With this PR, the storage role can exist without internal microceph, and SetCephMgrPoolSizeStep will run whenever storage nodes exist (its is_skip only checks storage node count), then fail if the microceph application isn’t deployed. Gate this step behind internal_ceph_enabled (same as RemoveMicrocephUnitsStep) or update SetCephMgrPoolSizeStep to skip when microceph isn’t present.
| SetCephMgrPoolSizeStep( | |
| client, jhelper, deployment.openstack_machines_model | |
| ), | |
| ] | |
| ) | |
| ] | |
| ) | |
| if internal_ceph_enabled: | |
| plan.append( | |
| SetCephMgrPoolSizeStep( | |
| client, jhelper, deployment.openstack_machines_model | |
| ), | |
| ) |
22d73fa to
4b7345b
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 61 out of 62 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| else: | ||
| microceph_content.extend(lines[2:]) | ||
| preseed_content.extend(microceph_content) | ||
| if is_internal_ceph_enabled(client): |
There was a problem hiding this comment.
generate_core_config() now gates MicroCeph preseed output on is_internal_ceph_enabled(client), but that helper reads clusterd config and can raise ClusterServiceUnavailableException (which is already handled for other load_answers() calls in this method). As written, preseed generation can crash when clusterd is temporarily unavailable. Consider wrapping this call in a try/except ClusterServiceUnavailableException and defaulting to showing the microceph section (or defaulting to DEFAULT_MODE) when the mode cannot be read, so preseed generation remains resilient.
| if is_internal_ceph_enabled(client): | |
| try: | |
| internal_ceph_enabled = is_internal_ceph_enabled(client) | |
| except ClusterServiceUnavailableException: | |
| internal_ceph_enabled = True | |
| if internal_ceph_enabled: |
| """Build the list of legacy Ceph resources to import.""" | ||
| # Use the known HA principal — legacy Ceph always deployed under | ||
| # cinder-volume (HA). Do NOT use next(iter(...)) which could | ||
| # pick a non-HA principal from another backend. | ||
| principal = PRINCIPAL_HA_APPLICATION | ||
| cinder_volume = tfvars["cinder-volumes"][principal] | ||
| backend = tfvars["backends"][INTERNAL_CEPH_BACKEND_NAME] |
There was a problem hiding this comment.
ImportCephResourcesToStorageFrameworkStep._build_imports() assumes the HA principal key (PRINCIPAL_HA_APPLICATION, i.e. "cinder-volume") exists in tfvars["cinder-volumes"], but is_skip() only checks that cinder-volumes is non-empty. If the stored config only contains a non-HA principal (or is otherwise missing the HA entry), this will raise KeyError and fail the upgrade step. Suggest updating is_skip() (or _build_imports()) to explicitly check for the HA principal entry and skip or error with a clearer message when it’s absent.
4b7345b to
6f263e0
Compare
…work Extract microceph management into a dedicated CephFeature (EnableDisableFeature) with its own enable/disable lifecycle, and introduce a CephProvider abstraction (MicrocephProvider / NoCephProvider) following the existing OVN provider pattern. This allows deployments to opt out of internal Ceph via --no-default-storage while retaining the storage role for third-party backends (PureStorage, Hitachi, etc.). Key changes: Architecture: - New sunbeam.core.ceph module with CephProvider ABC, deployment mode persistence (CephDeploymentMode.MICROCEPH / NONE), and shared helpers (is_internal_ceph_enabled, is_internal_ceph_enabled_feature_aware, set_ceph_feature_enabled_state, get_default_ceph_bootstrap_steps, etc.) - New sunbeam.features.ceph.feature.CephFeature managing the full microceph + internal-ceph backend lifecycle, with a get_bootstrap_deploy_steps helper consumed by the bootstrap/join plans - New sunbeam.features.microceph.provider.MicrocephProvider implementing CephProvider for local Ceph via microceph charm - New sunbeam.storage.backends.internal_ceph.backend.InternalCephBackend routing cinder-volume-ceph through the storage framework - Moved microceph steps from sunbeam.steps.microceph to sunbeam.features.microceph.steps (deleted dead re-export shim) - Deleted sunbeam.steps.cinder_volume: functionality absorbed by the storage framework (storage/steps.py) Storage framework enhancements: - Backends declare extra_integrations and hypervisor_integrations via typed dataclasses (BackendIntegration, HypervisorIntegration) - Terraform plans support dynamic backend/hypervisor integrations via for_each instead of hardcoded cinder-volume-ceph resources - DeploySpecificCinderVolumeStep and DestroySpecificCinderVolumeStep manage per-backend cinder-volume lifecycle; deploy always recomputes machine_ids from clusterd so scale-out places units on new storage nodes for every HA backend - New CheckStorageNodeRemovalStep / RemoveStorageMachineUnitsStep for safe node removal - register_storage_terraform_plan extracted as a module-level helper in storage/base.py with STORAGE_TFPLAN / STORAGE_TFPLAN_DIR constants, with StorageBackendBase.register_terraform_plan kept as a thin shim - Destroy step uses a setdefault-backed local variable so a missing tfvars['backends'] entry no longer raises KeyError Upgrade migration (storage_migration.py): - MigrateCinderVolumeToStorageFrameworkStep migrates legacy cinder-volume-plan state to the unified storage backend plan, merging into existing third-party backend config rather than overwriting it; _clear_old_state runs last so partial failures leave the legacy state intact and the step re-runs cleanly - ImportCephResourcesToStorageFrameworkStep imports Juju resources into the new terraform state using explicit HA principal lookup and clears StorageBackendLegacyImportIds from clusterd on success - BackfillCephFeatureStateStep marks existing internal Ceph deployments as feature-enabled and sets CephConfig.mode = MICROCEPH explicitly so upgraded clusters satisfy strict-equality checks - Terraform moved blocks preserve brownfield hypervisor and storage backend resources across the migration Bootstrap / join / remove: - Bootstrap and maas deploy accept --no-default-storage to skip internal Ceph - Microceph deploy + OSD config steps are inlined into local bootstrap plan1, local join plan4 (first-storage), and maas deploy plan2 before DeployControlPlaneStep so data.juju_offer.microceph resolves at terraform plan time - SetCephProviderStep runs in the pre-plan2 phase for maas deploy so plan assembly sees the correct CephConfig.mode - Local join runs ReapplyHypervisorOptionalIntegrationsStep in a post-ensure_default_ceph_feature plan5 so collect_hypervisor_integrations sees internal-ceph in clusterd and wires cinder-volume-ceph:ceph-access on first storage join - Node remove uses storage framework checks instead of legacy cinder-volume distribution checks - Feature join/depart hooks propagate exceptions to surface failures Disable flow safety: - Ceph disable uses phased execution: destroy backend first (mode stays MICROCEPH), then flip mode to NONE, then reapply control plane (sees NoCephProvider), then destroy microceph — minimising inconsistency window on partial failure - A ceph_disabling feature marker is set on entry and cleared only after the final phase; is_internal_ceph_enabled_feature_aware respects the marker so the transient window does not leak stale "enabled" state to readers, and retries re-enter each idempotent phase safely Telemetry: - Added --metrics-storage-offer option for S3-compatible metrics storage - Telemetry enable/disable updates cinder-volume notification flags across all storage backends Other: - Shared filesystem feature blocks enablement without internal Ceph - Shared filesystem, maintenance, and telemetry call sites use is_internal_ceph_enabled_feature_aware - Maintenance commands cache is_internal_ceph_enabled per invocation - Removed deprecated cinder_volume_tfhelper parameter from hypervisor - TerraformHelper gains import_resource() and improved error messages Assisted-By: Claude Code (claude-opus-4-6) Assisted-By: Codex (gpt-5-4) Signed-off-by: Guillaume Boutry <guillaume.boutry@canonical.com>
6f263e0 to
c99d65c
Compare
hemanthnakkina
left a comment
There was a problem hiding this comment.
I reviewed the code, LGTM.
Summary
Decouple microceph from the core sunbeam codebase and unify ceph-related storage through the feature framework. This enables the
storagerole to be deployed without microceph (e.g. remote Ceph, external S3 for Gnocchi) and routes all ceph storage management through a unifiedcephfeature and theStorageBackendBaseplugin system.Motivation
Microceph was deeply coupled throughout the core code, unlike other storage backends which follow the
StorageBackendBaseplugin framework. It was the only storage backend installed directly by sunbeam (triggered by thestoragerole), making it special in ways that prevented flexibility. This change enables:storagerole without microceph (via--no-default-storage)EnableDisableFeaturewith lifecycle hooksArchitecture
CephProvider abstraction (
sunbeam/core/ceph.py)CephDeploymentModeenum (MICROCEPH/NONE) persisted in clusterdCephProviderABC withMicrocephProviderandNoCephProviderimplementationsDeployment.get_ceph_provider()returns the appropriate provider (uncached — mode can change mid-process during enable/disable flows)is_internal_ceph_enabled(),set_ceph_feature_enabled_state(),ensure_default_ceph_feature(),is_internal_ceph_enabled_feature_aware()CephFeature (
sunbeam/features/ceph/feature.py)EnableDisableFeaturemanaging the full microceph + internal-ceph backend lifecycleenabledeploys microceph, configures OSDs, registers internal-ceph backend, deploys cinder-volume via storage framework, and reapplies control planedisableuses phased execution to minimise inconsistency on partial failure:on_joinhook deploys microceph on new storage nodes and reconciles storage backend placement (cinder-volume machine IDs, replica count, control plane)enable_default_storage()handles first-time Ceph setup during bootstrapInternalCephBackend (
sunbeam/storage/backends/internal_ceph/)StorageBackendBaseimplementation for cinder-volume-ceph as an HA subordinateBackendIntegration(microceph ceph relation) andHypervisorIntegration(ceph-access)sunbeam/steps/cinder_volume.pymoduleStorage framework enhancements (
sunbeam/storage/)extra_integrationsandhypervisor_integrationsvia typed dataclassesfor_eachinstead of hardcoded resourcesDeploySpecificCinderVolumeStep/DestroySpecificCinderVolumeStepmanage per-backend cinder-volume lifecycleCheckStorageNodeRemovalStep/RemoveStorageMachineUnitsStepfor safe node removalUpgrade migration (
sunbeam/steps/upgrades/storage_migration.py)MigrateCinderVolumeToStorageFrameworkStep: migrates legacy cinder-volume-plan state to the unified storage backend plan, merging into existing third-party backend config (PureStorage, Hitachi, etc.) rather than overwritingImportCephResourcesToStorageFrameworkStep: imports Juju resources into the new terraform state using explicit HA principal lookupBackfillCephFeatureStateStep: marks existing internal Ceph deployments as feature-enabled during refreshTelemetry S3 support
--metrics-storage-offeroption onsunbeam enable telemetryfor external S3 storage_has_metrics_storage()gates Gnocchi/Ceilometer deployment with three-tier check (instance state → cluster DB → storage nodes with internal ceph)Feature lifecycle hooks
on_join/on_departhooks onEnableDisableFeatureFeatureManagerdispatches to all enabled features; exceptions propagate to surface failuresKey design decisions
--no-default-storageMICROCEPH/NONEonlyStorageBackendBaseMICROCEPHPRINCIPAL_HA_APPLICATIONon_joindoes real work; silent failures cause driftTest plan
tox— unit, pep8, mypy)--no-default-storage, verify storage role works without microceph--metrics-storage-offer, verify Gnocchi uses S3