Skip to content

feat(ceph): decouple microceph from core and unify with storage framework#686

Draft
gboutry wants to merge 1 commit into
canonical:mainfrom
gboutry:feat/migrate-microceph
Draft

feat(ceph): decouple microceph from core and unify with storage framework#686
gboutry wants to merge 1 commit into
canonical:mainfrom
gboutry:feat/migrate-microceph

Conversation

@gboutry
Copy link
Copy Markdown
Collaborator

@gboutry gboutry commented Mar 1, 2026

Summary

Decouple microceph from the core sunbeam codebase and unify ceph-related storage through the feature framework. This enables the storage role to be deployed without microceph (e.g. remote Ceph, external S3 for Gnocchi) and routes all ceph storage management through a unified ceph feature and the StorageBackendBase plugin system.

Motivation

Microceph was deeply coupled throughout the core code, unlike other storage backends which follow the StorageBackendBase plugin framework. It was the only storage backend installed directly by sunbeam (triggered by the storage role), making it special in ways that prevented flexibility. This change enables:

  • Deploying the storage role without microceph (via --no-default-storage)
  • Managing ceph as a proper EnableDisableFeature with lifecycle hooks
  • Routing cinder-volume and ceph storage through the unified storage framework
  • Enabling telemetry (Gnocchi) with an external S3 backend when microceph is not configured
  • Safe upgrades that merge internal-ceph into existing third-party backend config

Architecture

CephProvider abstraction (sunbeam/core/ceph.py)

  • CephDeploymentMode enum (MICROCEPH / NONE) persisted in clusterd
  • CephProvider ABC with MicrocephProvider and NoCephProvider implementations
  • Deployment.get_ceph_provider() returns the appropriate provider (uncached — mode can change mid-process during enable/disable flows)
  • Shared helpers: is_internal_ceph_enabled(), set_ceph_feature_enabled_state(), ensure_default_ceph_feature(), is_internal_ceph_enabled_feature_aware()

CephFeature (sunbeam/features/ceph/feature.py)

  • EnableDisableFeature managing the full microceph + internal-ceph backend lifecycle
  • enable deploys microceph, configures OSDs, registers internal-ceph backend, deploys cinder-volume via storage framework, and reapplies control plane
  • disable uses phased execution to minimise inconsistency on partial failure:
    1. Destroy internal-ceph backend + cinder-volume (mode stays MICROCEPH)
    2. Flip mode to NONE
    3. Reapply control plane (sees NoCephProvider)
    4. Destroy microceph
  • on_join hook deploys microceph on new storage nodes and reconciles storage backend placement (cinder-volume machine IDs, replica count, control plane)
  • enable_default_storage() handles first-time Ceph setup during bootstrap

InternalCephBackend (sunbeam/storage/backends/internal_ceph/)

  • StorageBackendBase implementation for cinder-volume-ceph as an HA subordinate
  • Declares BackendIntegration (microceph ceph relation) and HypervisorIntegration (ceph-access)
  • Replaces the deleted sunbeam/steps/cinder_volume.py module

Storage framework enhancements (sunbeam/storage/)

  • Backends declare extra_integrations and hypervisor_integrations via typed dataclasses
  • Terraform plans support dynamic integrations via for_each instead of hardcoded resources
  • DeploySpecificCinderVolumeStep / DestroySpecificCinderVolumeStep manage per-backend cinder-volume lifecycle
  • CheckStorageNodeRemovalStep / RemoveStorageMachineUnitsStep for safe node removal

Upgrade migration (sunbeam/steps/upgrades/storage_migration.py)

  • MigrateCinderVolumeToStorageFrameworkStep: migrates legacy cinder-volume-plan state to the unified storage backend plan, merging into existing third-party backend config (PureStorage, Hitachi, etc.) rather than overwriting
  • ImportCephResourcesToStorageFrameworkStep: imports Juju resources into the new terraform state using explicit HA principal lookup
  • BackfillCephFeatureStateStep: marks existing internal Ceph deployments as feature-enabled during refresh

Telemetry S3 support

  • --metrics-storage-offer option on sunbeam enable telemetry for external S3 storage
  • S3 offer takes precedence over microceph; persisted in clusterd for re-enable flows
  • _has_metrics_storage() gates Gnocchi/Ceilometer deployment with three-tier check (instance state → cluster DB → storage nodes with internal ceph)

Feature lifecycle hooks

  • on_join / on_depart hooks on EnableDisableFeature
  • FeatureManager dispatches to all enabled features; exceptions propagate to surface failures

Key design decisions

Decision Choice Rationale
Flag --no-default-storage Abstracted from specific backend name
Mutability Mutable post-bootstrap Operator may add microceph later
Modes MICROCEPH / NONE only Remote ceph is a regular StorageBackendBase
Default MICROCEPH Backward compatibility
Disable ordering Destroy → flip mode → reapply CP → destroy microceph Minimises inconsistency window on partial failure
Migration merge Merge into existing tfvars Preserves third-party backends during upgrade
Import principal Explicit PRINCIPAL_HA_APPLICATION Never picks wrong principal from mixed-backend config
Provider caching Uncached Mode can change mid-process during enable/disable
Hook propagation Exceptions propagate on_join does real work; silent failures cause drift
Pattern OVN provider + storage framework Consistent with existing codebase patterns

Test plan

  • Unit tests pass (tox — unit, pep8, mypy)
  • New test coverage for: CephFeature (enable/disable/join/on_join reconciliation), storage migration (merge with existing backends, PureStorage + Hitachi scenarios), internal-ceph backend, feature lifecycle hooks, hypervisor integrations, local/MAAS provider helpers
  • Functional: bootstrap with --no-default-storage, verify storage role works without microceph
  • Functional: bootstrap with default (microceph), add storage node, verify cinder-volume placement updated
  • Functional: upgrade existing deployment with microceph + third-party backend, verify migration preserves both
  • Functional: enable/disable ceph feature, verify clean state transitions
  • Functional: enable telemetry with --metrics-storage-offer, verify Gnocchi uses S3

@gboutry gboutry changed the title Feat/migrate microceph feat: allow storage without microceph Mar 1, 2026
@gboutry gboutry force-pushed the feat/migrate-microceph branch from 4fdf02d to b5ebb4e Compare March 1, 2026 09:22
@hemanthnakkina hemanthnakkina requested a review from Copilot March 2, 2026 05:36
@hemanthnakkina
Copy link
Copy Markdown
Collaborator

Overall seems good to me

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR decouples MicroCeph from Sunbeam’s core orchestration so the storage role can be deployed without automatically deploying MicroCeph, enabling future scenarios like remote/external Ceph backends.

Changes:

  • Introduces a CephProvider abstraction plus persisted CephDeploymentMode (microceph/none) in clusterd, and wires provider selection through Deployment.get_ceph_provider().
  • Moves MicroCeph step implementation into sunbeam/features/microceph/* with a backward-compatible re-export shim in sunbeam/steps/microceph.py.
  • Guards MicroCeph-specific behavior (bootstrap/join/remove/resize/upgrades/status/TLS/maintenance) behind is_microceph_necessary() checks.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
sunbeam-python/sunbeam/core/ceph.py Adds Ceph mode persistence + provider interface (CephProvider, NoCephProvider, is_microceph_necessary, SetCephProviderStep).
sunbeam-python/sunbeam/core/deployment.py Adds Deployment.get_ceph_provider() with lazy provider selection/caching.
sunbeam-python/sunbeam/features/microceph/steps.py New home for MicroCeph step implementations previously in sunbeam/steps/microceph.py.
sunbeam-python/sunbeam/features/microceph/provider.py Implements MicrocephProvider for tfvars/offer URLs/replica logic.
sunbeam-python/sunbeam/steps/microceph.py Backward-compatible shim re-exporting MicroCeph symbols from the feature module.
sunbeam-python/sunbeam/steps/openstack.py Switches Ceph tfvars to provider interface; gates traefik-rgw service patching on microceph mode.
sunbeam-python/sunbeam/steps/cinder_volume.py Delegates Ceph-specific tfvars to CephProvider.
sunbeam-python/sunbeam/steps/cluster_status.py Uses provider to decide whether to show a storage column/app in status.
sunbeam-python/sunbeam/provider/local/commands.py Adds --no-microceph and persists mode via SetCephProviderStep; gates MicroCeph steps based on mode.
sunbeam-python/sunbeam/provider/local/deployment.py Skips emitting microceph_config preseed section when microceph mode is disabled.
sunbeam-python/sunbeam/provider/maas/commands.py Gates MicroCeph deploy/remove/destroy steps based on is_microceph_necessary().
sunbeam-python/sunbeam/steps/upgrades/intra_channel.py Makes MicroCeph upgrade steps conditional on microceph mode.
sunbeam-python/sunbeam/commands/resize.py Makes MicroCeph-related resize steps conditional on microceph mode.
sunbeam-python/sunbeam/features/tls/{ca.py,common.py} Only monitors traefik-rgw TLS app when microceph is enabled.
sunbeam-python/sunbeam/features/maintenance/{commands.py,checks.py} Gates MicroCeph maintenance checks/actions on microceph mode.
sunbeam-python/tests/unit/sunbeam/steps/test_openstack.py Updates unit tests to provide a MicrocephProvider via deployment.get_ceph_provider().
sunbeam-python/tests/unit/sunbeam/steps/test_cinder_volume.py Updates test expectations/patch paths for the new provider-based Ceph integration.
sunbeam-python/tests/unit/sunbeam/steps/test_microceph.py Updates imports to new MicroCeph feature module location.
sunbeam-python/tests/unit/sunbeam/steps/test_maintenance.py Updates MicroCeph APPLICATION import to new feature module location.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread sunbeam-python/sunbeam/core/ceph.py
Comment thread sunbeam-python/sunbeam/steps/upgrades/intra_channel.py
@gboutry gboutry force-pushed the feat/migrate-microceph branch from 9e73e99 to 2eeeec4 Compare March 2, 2026 10:29
@gboutry gboutry force-pushed the feat/migrate-microceph branch 2 times, most recently from 1d46094 to 9af282f Compare March 29, 2026 14:57
@gboutry
Copy link
Copy Markdown
Collaborator Author

gboutry commented Mar 29, 2026

@hemanthnakkina force push is: rebasing on main, and rewording the telemetry feature error message not to resurface microceph implementation details

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 57 out of 57 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +320 to 340
"""Create instance of UpgradeStorageBackendCharms class.

Upgrades cinder-volume and storage backend charms managed by the
storage-backend Terraform plan.

:client: Client to connect to clusterdb
:jhelper: Helper for interacting with pylibjuju
:manifest: Manifest object
:model: Name of model containing charms.
"""
super().__init__(
"Upgrade Cinder Volume charm",
"Upgrading cinder-volume charm",
"Upgrade Storage Backend charms",
"Upgrading storage backend charms",
client,
tfhelper,
jhelper,
manifest,
model,
["cinder-volume", "cinder-volume-ceph"],
CINDER_VOLUME_CONFIG_KEY,
["cinder-volume"],
STORAGE_BACKEND_TFVAR_CONFIG_KEY,
1200,
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UpgradeStorageBackendCharms claims to upgrade “cinder-volume and storage backend charms”, but it only targets the cinder-volume charm. This means backend charms managed by the storage-backend plan (e.g., cinder-volume-ceph for internal Ceph, and any other enabled storage backends) won’t be upgraded by this step. Consider deriving the charm list from the storage backend tfvars / registered backends, or at least including cinder-volume-ceph when internal Ceph is enabled.

Copilot uses AI. Check for mistakes.
Comment thread sunbeam-python/sunbeam/provider/local/commands.py
Comment on lines +49 to +51
backend_type = "internal-ceph"
display_name = "Internal Ceph"
generally_available = True
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InternalCephBackend is documented as “not exposed via the sunbeam storage add CLI”, but generally_available = True means FeatureGateMixin.check_gated() will always return False, so this backend will be visible/registrable in the storage CLI like any other backend. To keep it internal-only, set generally_available = False (and avoid enabling it via feature gates), or override check_enabled()/CLI registration so it never registers user-facing commands.

Copilot uses AI. Check for mistakes.
Comment on lines +346 to +349
integrations: set[HypervisorIntegration] = set()

registered = client.cluster.get_storage_backends()

Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect_hypervisor_integrations() doesn’t call _load_backends(). If the global backend registry is non-empty while _loaded is still False (e.g., partial/early registration), this method can silently miss backend implementations and return an incomplete integration set. Call self._load_backends() at the start of this method (consistent with get_backend()/register_cli_commands()).

Copilot uses AI. Check for mistakes.
@gboutry gboutry force-pushed the feat/migrate-microceph branch 3 times, most recently from 7026cb5 to 43e3da3 Compare March 30, 2026 22:26
@gboutry gboutry changed the title feat: allow storage without microceph feat(ceph): decouple microceph from core and unify with storage framework Mar 30, 2026
@gboutry gboutry requested a review from Copilot March 30, 2026 22:29
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 58 out of 59 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +941 to +1003
class ReapplyStorageBackendTerraformPlanStep(BaseStep):
"""Reapply the storage-backend Terraform plan.

This step re-applies the storage-backend plan using the existing
Terraform variables stored in clusterd. It is used during upgrades
to pick up charm channel / revision changes without rebuilding the
full configuration from scratch.
"""

def __init__(
self,
deployment: Deployment,
client: Client,
tfhelper: TerraformHelper,
jhelper: JujuHelper,
manifest: Manifest,
model: str,
):
super().__init__(
"Reapply Storage Backend Terraform plan",
"Reapplying Storage Backend Terraform plan",
)
self.deployment = deployment
self.client = client
self.tfhelper = tfhelper
self.jhelper = jhelper
self.manifest = manifest
self.model = model

def is_skip(self, context: StepContext) -> Result:
"""Skip when no storage backends are configured."""
try:
tfvars = read_config(self.client, STORAGE_BACKEND_TFVAR_CONFIG_KEY)
except ConfigItemNotFoundException:
return Result(ResultType.SKIPPED, "No storage backends configured.")

if not tfvars.get("backends") and not tfvars.get("cinder-volumes"):
return Result(ResultType.SKIPPED, "No storage backends configured.")

return Result(ResultType.COMPLETED)

@tenacity.retry(
wait=tenacity.wait_fixed(60),
stop=tenacity.stop_after_delay(300),
retry=tenacity.retry_if_exception_type(TerraformStateLockedException),
retry_error_callback=friendly_terraform_lock_retry_callback,
)
def run(self, context: StepContext) -> Result:
"""Reapply the storage backend Terraform plan."""
try:
tfvars = read_config(self.client, STORAGE_BACKEND_TFVAR_CONFIG_KEY)
except ConfigItemNotFoundException:
LOG.debug("No storage backend config found, nothing to reapply.")
return Result(ResultType.COMPLETED)

try:
self.tfhelper.update_tfvars_and_apply_tf(
self.client,
self.manifest,
tfvar_config=STORAGE_BACKEND_TFVAR_CONFIG_KEY,
override_tfvars=tfvars,
reporter=context.reporter,
)
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReapplyStorageBackendTerraformPlanStep claims it will “pick up charm channel / revision changes”, but run() passes the already-stored tfvars as override_tfvars into update_tfvars_and_apply_tf(). That makes those values effectively “computed” and prevents manifest-driven updates (and these channel/revision fields are plain strings, not charm config dicts that get merged). Either rebuild/refresh the relevant fields from the current manifest before applying, or adjust the logic/docstring so it doesn’t imply manifest channel/revision updates that won’t occur.

Copilot uses AI. Check for mistakes.
client = deployment.get_client()
storage_tfhelper = deployment.get_tfhelper(backend.tfplan)
jhelper = JujuHelper(deployment.juju_controller)
manifest = deployment.get_manifest()
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CephFeature disable path loads the manifest via deployment.get_manifest() (ignoring self.user_manifest), while enable/on_join paths consistently use deployment.get_manifest(self.user_manifest). If the user provided a manifest override during enable, disable may compute different charm settings / plan inputs than enable did. Consider using deployment.get_manifest(self.user_manifest) consistently in _get_internal_ceph_disable_steps() and run_disable_plans().

Suggested change
manifest = deployment.get_manifest()
manifest = deployment.get_manifest(self.user_manifest)

Copilot uses AI. Check for mistakes.
Comment on lines +1718 to +1722
SetCephMgrPoolSizeStep(
client, jhelper, deployment.openstack_machines_model
),
]
)
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove_node() appends SetCephMgrPoolSizeStep unconditionally. With this PR, the storage role can exist without internal microceph, and SetCephMgrPoolSizeStep will run whenever storage nodes exist (its is_skip only checks storage node count), then fail if the microceph application isn’t deployed. Gate this step behind internal_ceph_enabled (same as RemoveMicrocephUnitsStep) or update SetCephMgrPoolSizeStep to skip when microceph isn’t present.

Suggested change
SetCephMgrPoolSizeStep(
client, jhelper, deployment.openstack_machines_model
),
]
)
]
)
if internal_ceph_enabled:
plan.append(
SetCephMgrPoolSizeStep(
client, jhelper, deployment.openstack_machines_model
),
)

Copilot uses AI. Check for mistakes.
@gboutry gboutry force-pushed the feat/migrate-microceph branch 2 times, most recently from 22d73fa to 4b7345b Compare April 11, 2026 23:05
@gboutry gboutry requested a review from Copilot April 11, 2026 23:09
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 61 out of 62 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

else:
microceph_content.extend(lines[2:])
preseed_content.extend(microceph_content)
if is_internal_ceph_enabled(client):
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generate_core_config() now gates MicroCeph preseed output on is_internal_ceph_enabled(client), but that helper reads clusterd config and can raise ClusterServiceUnavailableException (which is already handled for other load_answers() calls in this method). As written, preseed generation can crash when clusterd is temporarily unavailable. Consider wrapping this call in a try/except ClusterServiceUnavailableException and defaulting to showing the microceph section (or defaulting to DEFAULT_MODE) when the mode cannot be read, so preseed generation remains resilient.

Suggested change
if is_internal_ceph_enabled(client):
try:
internal_ceph_enabled = is_internal_ceph_enabled(client)
except ClusterServiceUnavailableException:
internal_ceph_enabled = True
if internal_ceph_enabled:

Copilot uses AI. Check for mistakes.
Comment on lines +173 to +179
"""Build the list of legacy Ceph resources to import."""
# Use the known HA principal — legacy Ceph always deployed under
# cinder-volume (HA). Do NOT use next(iter(...)) which could
# pick a non-HA principal from another backend.
principal = PRINCIPAL_HA_APPLICATION
cinder_volume = tfvars["cinder-volumes"][principal]
backend = tfvars["backends"][INTERNAL_CEPH_BACKEND_NAME]
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImportCephResourcesToStorageFrameworkStep._build_imports() assumes the HA principal key (PRINCIPAL_HA_APPLICATION, i.e. "cinder-volume") exists in tfvars["cinder-volumes"], but is_skip() only checks that cinder-volumes is non-empty. If the stored config only contains a non-HA principal (or is otherwise missing the HA entry), this will raise KeyError and fail the upgrade step. Suggest updating is_skip() (or _build_imports()) to explicitly check for the HA principal entry and skip or error with a clearer message when it’s absent.

Copilot uses AI. Check for mistakes.
@gboutry gboutry force-pushed the feat/migrate-microceph branch from 4b7345b to 6f263e0 Compare April 12, 2026 07:05
…work

Extract microceph management into a dedicated CephFeature
(EnableDisableFeature) with its own enable/disable lifecycle, and
introduce a CephProvider abstraction (MicrocephProvider / NoCephProvider)
following the existing OVN provider pattern. This allows deployments to
opt out of internal Ceph via --no-default-storage while retaining the
storage role for third-party backends (PureStorage, Hitachi, etc.).

Key changes:

Architecture:
- New sunbeam.core.ceph module with CephProvider ABC, deployment mode
  persistence (CephDeploymentMode.MICROCEPH / NONE), and shared helpers
  (is_internal_ceph_enabled, is_internal_ceph_enabled_feature_aware,
  set_ceph_feature_enabled_state, get_default_ceph_bootstrap_steps,
  etc.)
- New sunbeam.features.ceph.feature.CephFeature managing the full
  microceph + internal-ceph backend lifecycle, with a
  get_bootstrap_deploy_steps helper consumed by the bootstrap/join
  plans
- New sunbeam.features.microceph.provider.MicrocephProvider implementing
  CephProvider for local Ceph via microceph charm
- New sunbeam.storage.backends.internal_ceph.backend.InternalCephBackend
  routing cinder-volume-ceph through the storage framework
- Moved microceph steps from sunbeam.steps.microceph to
  sunbeam.features.microceph.steps (deleted dead re-export shim)
- Deleted sunbeam.steps.cinder_volume: functionality absorbed by the
  storage framework (storage/steps.py)

Storage framework enhancements:
- Backends declare extra_integrations and hypervisor_integrations via
  typed dataclasses (BackendIntegration, HypervisorIntegration)
- Terraform plans support dynamic backend/hypervisor integrations via
  for_each instead of hardcoded cinder-volume-ceph resources
- DeploySpecificCinderVolumeStep and DestroySpecificCinderVolumeStep
  manage per-backend cinder-volume lifecycle; deploy always recomputes
  machine_ids from clusterd so scale-out places units on new storage
  nodes for every HA backend
- New CheckStorageNodeRemovalStep / RemoveStorageMachineUnitsStep for
  safe node removal
- register_storage_terraform_plan extracted as a module-level helper in
  storage/base.py with STORAGE_TFPLAN / STORAGE_TFPLAN_DIR constants,
  with StorageBackendBase.register_terraform_plan kept as a thin shim
- Destroy step uses a setdefault-backed local variable so a missing
  tfvars['backends'] entry no longer raises KeyError

Upgrade migration (storage_migration.py):
- MigrateCinderVolumeToStorageFrameworkStep migrates legacy
  cinder-volume-plan state to the unified storage backend plan,
  merging into existing third-party backend config rather than
  overwriting it; _clear_old_state runs last so partial failures leave
  the legacy state intact and the step re-runs cleanly
- ImportCephResourcesToStorageFrameworkStep imports Juju resources into
  the new terraform state using explicit HA principal lookup and
  clears StorageBackendLegacyImportIds from clusterd on success
- BackfillCephFeatureStateStep marks existing internal Ceph deployments
  as feature-enabled and sets CephConfig.mode = MICROCEPH explicitly so
  upgraded clusters satisfy strict-equality checks
- Terraform moved blocks preserve brownfield hypervisor and storage
  backend resources across the migration

Bootstrap / join / remove:
- Bootstrap and maas deploy accept --no-default-storage to skip
  internal Ceph
- Microceph deploy + OSD config steps are inlined into local bootstrap
  plan1, local join plan4 (first-storage), and maas deploy plan2
  before DeployControlPlaneStep so data.juju_offer.microceph resolves
  at terraform plan time
- SetCephProviderStep runs in the pre-plan2 phase for maas deploy so
  plan assembly sees the correct CephConfig.mode
- Local join runs ReapplyHypervisorOptionalIntegrationsStep in a
  post-ensure_default_ceph_feature plan5 so
  collect_hypervisor_integrations sees internal-ceph in clusterd and
  wires cinder-volume-ceph:ceph-access on first storage join
- Node remove uses storage framework checks instead of legacy
  cinder-volume distribution checks
- Feature join/depart hooks propagate exceptions to surface failures

Disable flow safety:
- Ceph disable uses phased execution: destroy backend first (mode stays
  MICROCEPH), then flip mode to NONE, then reapply control plane (sees
  NoCephProvider), then destroy microceph — minimising inconsistency
  window on partial failure
- A ceph_disabling feature marker is set on entry and cleared only
  after the final phase; is_internal_ceph_enabled_feature_aware
  respects the marker so the transient window does not leak stale
  "enabled" state to readers, and retries re-enter each idempotent
  phase safely

Telemetry:
- Added --metrics-storage-offer option for S3-compatible metrics storage
- Telemetry enable/disable updates cinder-volume notification flags
  across all storage backends

Other:
- Shared filesystem feature blocks enablement without internal Ceph
- Shared filesystem, maintenance, and telemetry call sites use
  is_internal_ceph_enabled_feature_aware
- Maintenance commands cache is_internal_ceph_enabled per invocation
- Removed deprecated cinder_volume_tfhelper parameter from hypervisor
- TerraformHelper gains import_resource() and improved error messages

Assisted-By: Claude Code (claude-opus-4-6)
Assisted-By: Codex (gpt-5-4)
Signed-off-by: Guillaume Boutry <guillaume.boutry@canonical.com>
@gboutry gboutry force-pushed the feat/migrate-microceph branch from 6f263e0 to c99d65c Compare April 12, 2026 07:33
Copy link
Copy Markdown
Collaborator

@hemanthnakkina hemanthnakkina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the code, LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants