Fix destroy timeout on openstack-machines model by mastier · Pull Request #728 · canonical/snap-openstack

mastier · 2026-03-25T15:38:39Z

When try to destroy-deployment the sunbeam is stuck on destroy openstack-machines model,
while the model is actually destroy it still perceives to be existing until reach timeout

           DEBUG    Starting step 'Destroy model openstack-machines'                                                                                                                                                             common.py:312
           DEBUG    Running step Destroy model openstack-machines                                                                                                                                                                common.py:331
           INFO     cli: juju destroy-model sunbeamuser/openstack-machines --no-prompt --destroy-storage                                                                                                                          _juju.py:308
⠋ Destroying model openstack-machines ...                                                                                                                                                                                                     
⠸ Destroying model openstack-machines ...                                                                                                                                                                                                     
⠦ Destroying model openstack-machines ...                                                                                                                                                                                                     
⠇ Destroying model openstack-machines ... [13:06:42] DEBUG    Error destroying model openstack-machines                                                                                                                                       
                              juju.py:1623                                                                                                                                                                                                    
                    Traceback (most recent call last):                                                                                                                                                                                        
                      File "/snap/openstack/945/lib/python3.12/site-packages/sunbeam/steps/juju.py", line 1618, in run                                                                                                                        
                        self.jhelper.wait_model_gone(                                                                                                                                                                                         
                      File "/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/juju.py", line 1290, in wait_model_gone                                                                                                             
                        raise TimeoutError(                                                                                                                                                                                                   
                    TimeoutError: Timed out while waiting for model 'openstack-machines' to be gone                                                                                                                                           
           DEBUG    Finished running step 'Destroy model openstack-machines'. Result: ResultType.FAILED                                                                                                                          common.py:334
Error: Timed out while waiting for model 'openstack-machines' to be gone

On the next run it will complain about openstack-machines missing

           DEBUG    Finished running step 'Initialize Terraform'. Result: ResultType.COMPLETED                                                                                                                                   common.py:334
           DEBUG    Starting step 'Destroy Hypervisor'                                                                                                                                                                           common.py:312
           DEBUG    Running command /snap/openstack/945/bin/terraform state pull                                                                                                                                              terraform.py:315
⠙ Destroying Hypervisor ...            DEBUG    Command finished. stderr=                                                                                                                                                                     
            terraform.py:325                                                                                                                                                                                                                  
           INFO     cli: juju models --controller foundations-maas --format json --all                                                                                                                                            _juju.py:308
⠸ Destroying Hypervisor ...            DEBUG    Model not found                                                                                                                                                                               
                steps.py:318                                                                                                                                                                                                                  
                    Traceback (most recent call last):                                                                                                                                                                                        
                      File "/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/steps.py", line 316, in is_skip                                                                                                                     
                        _has_juju_resources = len(self._list_applications(self.model)) > 0                                                                                                                                                    
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                         
                      File "/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/steps.py", line 288, in _list_applications                                                                                                          
                        _model = self.jhelper.get_model_status(model)                                                                                            
                        ...
                                  ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                        
                      File "/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/juju.py", line 321, in get_model                                                                                                                    
                        raise ModelNotFoundException(f"Model {model!r} not found")                                                                                                                                                            
                    sunbeam.core.juju.ModelNotFoundException: Model 'openstack-machines' not found                                                                                                                                            
           DEBUG    Skipping step Destroy Hypervisor                                                                                                                                                                             common.py:322
           DEBUG    Starting step 'Initialize Terraform'                                                                                                                                                                         common.py:312
           DEBUG    Running step Initialize Terraform

so even when re-add the model the situation from the first output repeats!

$ juju models
Controller: foundations-maas

Model            Cloud/Region  Type  Status     Machines  Cores  Units  Access  Last connection
openstack-infra  pc6b/default  maas  available         3     24  4      admin   2 hours ago
$ juju add-model openstack-machines pc6b
Added 'openstack-machines' model on pc6b/default with credential 'pc6b' for user 'sunbeamuser'

So I modified 2 files and added additional debugging and mount bound them
/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/juju.py
/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/steps/juju.py
the copies landed in:
/var/snap/openstack/common/juju.py
/var/snap/openstack/common/steps_juju.py

Here are the diffs (added .txt so github will allow me to attach them)

sunbeam_steps_juju.py.diff.txt
sunbeam_core_juju.py.diff.txt

This is the debug log output:
sunbeam-deploy.log

The problem is this function is called TWICE with the same argument,

when pulling terraform state

           DEBUG    Starting step 'Destroy Hypervisor'                                                                                                                                                                           common.py:312
           DEBUG    Running command /snap/openstack/945/bin/terraform state pull                                                                                                                                              terraform.py:315
⠙ Destroying Hypervisor ...            DEBUG    Command finished. stderr=                                                                                                                                                                     
           DEBUG    ===== Trying to look for model openstack-machines                                                                                                                                                              juju.py:321
           DEBUG    ===== models(self) Trying run CLI models --all                                                                                                                                                                 juju.py:308
           INFO     cli: juju models --controller foundations-maas --format json --all                                                                                                                                            _juju.py:308
⠸ Destroying Hypervisor ... [14:55:25] DEBUG    ===== Found openstack-machines sunbeamuser/openstack-machines f0c21a14-9acc-482e-8ff8-5a090b5fe451
           INFO     cli: juju status --model foundations-maas:sunbeamuser/openstack-machines --format json                                                                                                                        _juju.py:308
⠴ Destroying Hypervisor ...            DEBUG    Skipping step Destroy Hypervisor                                                                                                                                                              
           DEBUG    Starting step 'Initialize Terraform'

when waiting for the model to be destroyed , it never enter get_model

           DEBUG    Starting step 'Destroy model openstack-machines'                                                                                                                                                             common.py:312
           DEBUG    === model_exists running self.get_model(openstack-machines)                                                                                                                                                    juju.py:334
           DEBUG    Running step Destroy model openstack-machines                                                                                                                                                                common.py:331
           INFO     cli: juju destroy-model sunbeamuser/openstack-machines --no-prompt --destroy-storage                                                                                                                          _juju.py:308
⠇ Destroying model openstack-machines ... [14:55:30] DEBUG    === Running jhelper.wait_model_gone()                                                                                                                                           
           DEBUG    === wait_model_gone self.model_exists loop                                                                                                                                                                    juju.py:1294
           DEBUG    === model_exists running self.get_model(openstack-machines)                                                                                                                                                    juju.py:334
⠹ Destroying model openstack-machines ... [14:55:40] DEBUG    === time passed: 10.006787786027417                                                                                                                                             
           DEBUG    === model_exists running self.get_model(openstack-machines)                                                                                                                                                    juju.py:334
⠧ Destroying model openstack-machines ... [14:55:50] DEBUG    === time passed: 20.01368316099979

The function excerpt from the current 2024.1/stable
https://github.com/canonical/snap-openstack/blob/stable/2024.1/sunbeam-python/sunbeam/core/juju.py#L265-L274

    @functools.cache
   def get_model(self, model: str) -> "dict":
       """Fetch model.

       :model: Name of the model
       """
       for m in self.models():
           if model in (m["short-name"], m["name"], m["model-uuid"]):
               return m
       raise ModelNotFoundException(f"Model {model!r} not found")

gboutry

This is gonna be a performance hit for every other code path.

Easier to clear the cache in the destroy path, which is much less exercised than the other paths.

mastier · 2026-03-25T17:28:21Z

@gboutry very true. I could post the bug first on launchpad.

But you are right. We should not use the cached version in the destroy path. If you got some idea how to solve then great. Otherwise I will try to figure something out later.

EDIT: actually we could just add self.get_model.cache_clear() in JujuHelper's wait_model_gone

mastier · 2026-03-25T17:43:09Z

Created bug report for reference
https://bugs.launchpad.net/snap-openstack/+bug/2146354

gboutry · 2026-03-26T20:29:00Z

                    f"Timed out while waiting for model {model!r} to be gone"
                )
            time.sleep(MODEL_DELAY)
+            self.get_model.cache_clear()


I think this is not the right layer to apply this too.

I think we'd like to just modify the function:

def model_exists(self, model: str) -> bool: """Check if model exists. :model: Name of the model """ self.get_model.cache_clear() try: self.get_model(model) except JujuException: return False return True

Because for existence check, you always want to clear the cache.

@gboutry I suggested better solution with having seperate model_exists_live

but I see you might want always check without cache

Not sure how you would like to approach. For instance here:

class WriteJujuStatusStep(BaseStep, JujuStepHelper): """Get the status of the specified model.""" def __init__( self, jhelper: JujuHelper, model: str, file_path: Path, ): super().__init__("Write Model status", f"Recording status of model {model}") self.jhelper = jhelper self.model = model self.file_path = file_path def is_skip(self, context: StepContext) -> Result: """Determines if the step should be skipped or not. :return: ResultType.SKIPPED if the Step should be skipped, ResultType.COMPLETED or ResultType.FAILED otherwise """ if self.jhelper.model_exists(self.model): return Result(ResultType.COMPLETED) LOG.debug(f"Model {self.model} not found") return Result(ResultType.SKIPPED)

or here

class AddJujuModelStep(BaseStep): """Add model with configs.""" def __init__( self, jhelper: JujuHelper, model: str, cloud: str, credential: str | None = None, proxy_settings: dict | None = None, model_config: dict | None = None, ): super().__init__(f"Add model {model}", f"Adding model {model}") self.jhelper = jhelper self.model = model self.cloud = cloud self.credential = credential self.proxy_settings = proxy_settings or {} self.model_config = model_config or {} def is_skip(self, context: StepContext) -> Result: """Determines if the step should be skipped or not.""" if self.jhelper.model_exists(self.model): return Result(ResultType.SKIPPED) LOG.debug(f"Model {self.model} not found") return Result(ResultType.COMPLETED)

In the destroy obviously we would like get the live status

class DestroyJujuModelStep(BaseStep): def __init__( self, jhelper: JujuHelper, model: str, timeout: int = 1800, ): super().__init__(f"Destroy model {model}", f"Destroying model {model}") self.jhelper = jhelper self.model = model self.timeout = timeout def is_skip(self, context: StepContext) -> Result: """Determines if the step should be skipped or not.""" if self.jhelper.model_exists(self.model): return Result(ResultType.COMPLETED) LOG.debug(f"Model {self.model} not found") return Result(ResultType.SKIPPED) def run(self, context: StepContext) -> Result: """Add model with configs.""" try: self.jhelper.destroy_model(self.model, destroy_storage=True) self.jhelper.wait_model_gone( self.model, timeout=self.timeout, ) except Exception as e: LOG.debug(f"Error destroying model {self.model}", exc_info=True) return Result(ResultType.FAILED, str(e)) return Result(ResultType.COMPLETED)

@mastier indeed, that's why I think this is a better approach:

def model_exists(self, model: str) -> bool: """Check if model exists. :model: Name of the model """ self.get_model.cache_clear() try: self.get_model(model) except JujuException: return False return True

checking model_exists against a cached value is always wrong in my opinion. and we want to transparently give this capability.

Copilot

Pull request overview

This PR addresses Sunbeam getting stuck during destroy-deployment by preventing wait_model_gone() from relying on stale cached get_model() results after a Juju model has been destroyed.

Changes:

Clear the JujuHelper.get_model() functools cache inside the wait_model_gone() polling loop to force refreshed model discovery.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-27T02:28:41Z

                    f"Timed out while waiting for model {model!r} to be gone"
                )
            time.sleep(MODEL_DELAY)
+            self.get_model.cache_clear()


Clearing the global get_model cache only after time.sleep(MODEL_DELAY) means wait_model_gone will always wait at least one full delay interval even if the model is already gone but get_model() is cached. Consider clearing the cache before re-checking model_exists (e.g., at the start of the loop / before the while condition) so the function can exit immediately once the model is removed, and so the first existence check isn't based on stale cache state.

Copilot · 2026-03-27T02:28:41Z

                )
            time.sleep(MODEL_DELAY)
+            self.get_model.cache_clear()



This change fixes a cached-staleness edge case in wait_model_gone, but there’s no unit test exercising it. Please add a test that simulates get_model() returning a cached model that has since been removed (e.g., models() returning the model once and then returning an empty list) and asserts wait_model_gone returns without timing out and re-queries after a delay.

Suggested change

def test_wait_model_gone_cached_staleness(self) -> None:

"""Self-test for wait_model_gone cached-staleness edge case.

Simulates get_model() returning a cached model that has since been

removed by:

* Returning True once from model_exists (model appears to exist).

* Returning False on the next model_exists call (model is gone).

Verifies that:

* wait_model_gone does not time out.

* get_model.cache_clear() is invoked to invalidate the cache.

* model_exists is re-queried after a delay (time.sleep is called).

"""

calls: dict[str, int] = {

"model_exists": 0,

"cache_clear": 0,

"sleep": 0,

}

class _Dummy:

def __init__(self) -> None:

self._exists_calls = 0

class _GetModel:

def __init__(self, outer: "_Dummy") -> None:

self.outer = outer

def cache_clear(self) -> None:

calls["cache_clear"] += 1

self.get_model = _GetModel(self)

def model_exists(self, name: str) -> bool:

# First call: model appears to exist (stale cache).

# Second call: model no longer exists.

self._exists_calls += 1

calls["model_exists"] += 1

return self._exists_calls == 1

dummy = _Dummy()

# Patch time.monotonic and time.sleep locally to avoid real delays and

# to track that a delay occurs between queries.

original_monotonic = time.monotonic

original_sleep = time.sleep

monotonic_values = [0.0, 0.0, MODEL_DELAY + 1.0]

def _fake_monotonic() -> float:

if monotonic_values:

return monotonic_values.pop(0)

return monotonic_values[-1]

def _fake_sleep(_delay: float) -> None:

calls["sleep"] += 1

# No real sleeping in the self-test.

try:

time.monotonic = _fake_monotonic # type: ignore[assignment]

time.sleep = _fake_sleep # type: ignore[assignment]

# Call the real implementation as an unbound method, using the

# dummy object that provides the expected interface.

type(self).wait_model_gone(dummy, "test-model", timeout=MODEL_DELAY * 5)

finally:

time.monotonic = original_monotonic # type: ignore[assignment]

time.sleep = original_sleep # type: ignore[assignment]

# The model should have been queried at least twice: once when it

# "exists" and once after the cache is cleared when it is "gone".

assert calls["model_exists"] >= 2

# The cache should have been cleared at least once.

assert calls["cache_clear"] >= 1

# A delay should have occurred between queries.

assert calls["sleep"] >= 1

…stroy Extract _get_model_uncached() as the raw lookup logic, keep get_model() cached via functools.cache for normal use, and add model_exists_live() which always bypasses the cache. Use model_exists_live() in wait_model_gone() and in DestroyControlPlaneStep.run() where the model state is actively changing and cached results would be stale. Add unit tests covering _get_model_uncached, model_exists_live, and wait_model_gone including a regression test for the original bug. Fixes: LP#2146354 Co-authored-by: Claude (OpenCode AI) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

himax16

Left a comment that we can probably simplify _get_model_uncached. Feel free to ignore if this is a design choice.

Might want to change the

Co-authored-by: Claude (OpenCode AI) <noreply@anthropic.com>

to

Assisted-by: Claude <MODEL>

to conform with the AI standard of the repo.

himax16 · 2026-05-19T20:30:48Z

+        :model: Name of the model
+        """
+        try:
+            self._get_model_uncached(model)


The same functionality should be available by calling self.get_model.__wrapped__(model) using the @functools.cache wrapped function.

https://docs.python.org/3/library/functools.html

gboutry requested changes Mar 25, 2026

View reviewed changes

mastier force-pushed the fix-destroy branch from 634b2b8 to dd7efdf Compare March 25, 2026 17:36

mastier force-pushed the fix-destroy branch 7 times, most recently from 0f61c17 to 45c31f3 Compare March 25, 2026 18:43

gboutry requested changes Mar 26, 2026

View reviewed changes

hemanthnakkina requested a review from Copilot March 27, 2026 02:23

Copilot started reviewing on behalf of hemanthnakkina March 27, 2026 02:23 View session

Copilot AI reviewed Mar 27, 2026

View reviewed changes

mastier force-pushed the fix-destroy branch from 793d8e5 to 18ca250 Compare May 19, 2026 17:38

mastier force-pushed the fix-destroy branch from 18ca250 to e8246ce Compare May 19, 2026 17:45

mastier requested review from Copilot and gboutry May 19, 2026 18:15

Copilot started reviewing on behalf of mastier May 19, 2026 18:15 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

himax16 reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix destroy timeout on openstack-machines model#728

Fix destroy timeout on openstack-machines model#728
mastier wants to merge 1 commit into
canonical:mainfrom
mastier:fix-destroy

mastier commented Mar 25, 2026

Uh oh!

gboutry left a comment

Uh oh!

mastier commented Mar 25, 2026 •

edited

Loading

Uh oh!

mastier commented Mar 25, 2026

Uh oh!

gboutry Mar 26, 2026

Uh oh!

mastier May 20, 2026

Uh oh!

gboutry May 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

himax16 left a comment

Uh oh!

himax16 May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

+    def test_wait_model_gone_cached_staleness(self) -> None:
+        """Self-test for wait_model_gone cached-staleness edge case.
+        Simulates get_model() returning a cached model that has since been
+        removed by:
+        * Returning True once from model_exists (model appears to exist).
+        * Returning False on the next model_exists call (model is gone).
+        Verifies that:
+        * wait_model_gone does not time out.
+        * get_model.cache_clear() is invoked to invalidate the cache.
+        * model_exists is re-queried after a delay (time.sleep is called).
+        """
+        calls: dict[str, int] = {
+            "model_exists": 0,
+            "cache_clear": 0,
+            "sleep": 0,
+        }
+        class _Dummy:
+            def __init__(self) -> None:
+                self._exists_calls = 0
+                class _GetModel:
+                    def __init__(self, outer: "_Dummy") -> None:
+                        self.outer = outer
+                    def cache_clear(self) -> None:
+                        calls["cache_clear"] += 1
+                self.get_model = _GetModel(self)
+            def model_exists(self, name: str) -> bool:
+                # First call: model appears to exist (stale cache).
+                # Second call: model no longer exists.
+                self._exists_calls += 1
+                calls["model_exists"] += 1
+                return self._exists_calls == 1
+        dummy = _Dummy()
+        # Patch time.monotonic and time.sleep locally to avoid real delays and
+        # to track that a delay occurs between queries.
+        original_monotonic = time.monotonic
+        original_sleep = time.sleep
+        monotonic_values = [0.0, 0.0, MODEL_DELAY + 1.0]
+        def _fake_monotonic() -> float:
+            if monotonic_values:
+                return monotonic_values.pop(0)
+            return monotonic_values[-1]
+        def _fake_sleep(_delay: float) -> None:
+            calls["sleep"] += 1
+            # No real sleeping in the self-test.
+        try:
+            time.monotonic = _fake_monotonic  # type: ignore[assignment]
+            time.sleep = _fake_sleep  # type: ignore[assignment]
+            # Call the real implementation as an unbound method, using the
+            # dummy object that provides the expected interface.
+            type(self).wait_model_gone(dummy, "test-model", timeout=MODEL_DELAY * 5)
+        finally:
+            time.monotonic = original_monotonic  # type: ignore[assignment]
+            time.sleep = original_sleep  # type: ignore[assignment]
+        # The model should have been queried at least twice: once when it
+        # "exists" and once after the cache is cleared when it is "gone".
+        assert calls["model_exists"] >= 2
+        # The cache should have been cleared at least once.
+        assert calls["cache_clear"] >= 1
+        # A delay should have occurred between queries.
+        assert calls["sleep"] >= 1

Conversation

mastier commented Mar 25, 2026

Uh oh!

gboutry left a comment

Choose a reason for hiding this comment

Uh oh!

mastier commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mastier commented Mar 25, 2026

Uh oh!

gboutry Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

mastier May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gboutry May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

himax16 left a comment

Choose a reason for hiding this comment

Uh oh!

himax16 May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mastier commented Mar 25, 2026 •

edited

Loading