Skip to content

Fix destroy timeout on openstack-machines model#728

Open
mastier wants to merge 1 commit into
canonical:mainfrom
mastier:fix-destroy
Open

Fix destroy timeout on openstack-machines model#728
mastier wants to merge 1 commit into
canonical:mainfrom
mastier:fix-destroy

Conversation

@mastier
Copy link
Copy Markdown

@mastier mastier commented Mar 25, 2026

When try to destroy-deployment the sunbeam is stuck on destroy openstack-machines model,
while the model is actually destroy it still perceives to be existing until reach timeout

           DEBUG    Starting step 'Destroy model openstack-machines'                                                                                                                                                             common.py:312
           DEBUG    Running step Destroy model openstack-machines                                                                                                                                                                common.py:331
           INFO     cli: juju destroy-model sunbeamuser/openstack-machines --no-prompt --destroy-storage                                                                                                                          _juju.py:308
⠋ Destroying model openstack-machines ...                                                                                                                                                                                                     
⠸ Destroying model openstack-machines ...                                                                                                                                                                                                     
⠦ Destroying model openstack-machines ...                                                                                                                                                                                                     
⠇ Destroying model openstack-machines ... [13:06:42] DEBUG    Error destroying model openstack-machines                                                                                                                                       
                              juju.py:1623                                                                                                                                                                                                    
                    Traceback (most recent call last):                                                                                                                                                                                        
                      File "/snap/openstack/945/lib/python3.12/site-packages/sunbeam/steps/juju.py", line 1618, in run                                                                                                                        
                        self.jhelper.wait_model_gone(                                                                                                                                                                                         
                      File "/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/juju.py", line 1290, in wait_model_gone                                                                                                             
                        raise TimeoutError(                                                                                                                                                                                                   
                    TimeoutError: Timed out while waiting for model 'openstack-machines' to be gone                                                                                                                                           
           DEBUG    Finished running step 'Destroy model openstack-machines'. Result: ResultType.FAILED                                                                                                                          common.py:334
Error: Timed out while waiting for model 'openstack-machines' to be gone   

On the next run it will complain about openstack-machines missing

           DEBUG    Finished running step 'Initialize Terraform'. Result: ResultType.COMPLETED                                                                                                                                   common.py:334
           DEBUG    Starting step 'Destroy Hypervisor'                                                                                                                                                                           common.py:312
           DEBUG    Running command /snap/openstack/945/bin/terraform state pull                                                                                                                                              terraform.py:315
⠙ Destroying Hypervisor ...            DEBUG    Command finished. stderr=                                                                                                                                                                     
            terraform.py:325                                                                                                                                                                                                                  
           INFO     cli: juju models --controller foundations-maas --format json --all                                                                                                                                            _juju.py:308
⠸ Destroying Hypervisor ...            DEBUG    Model not found                                                                                                                                                                               
                steps.py:318                                                                                                                                                                                                                  
                    Traceback (most recent call last):                                                                                                                                                                                        
                      File "/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/steps.py", line 316, in is_skip                                                                                                                     
                        _has_juju_resources = len(self._list_applications(self.model)) > 0                                                                                                                                                    
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                         
                      File "/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/steps.py", line 288, in _list_applications                                                                                                          
                        _model = self.jhelper.get_model_status(model)                                                                                            
                        ...
                                  ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                        
                      File "/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/juju.py", line 321, in get_model                                                                                                                    
                        raise ModelNotFoundException(f"Model {model!r} not found")                                                                                                                                                            
                    sunbeam.core.juju.ModelNotFoundException: Model 'openstack-machines' not found                                                                                                                                            
           DEBUG    Skipping step Destroy Hypervisor                                                                                                                                                                             common.py:322
           DEBUG    Starting step 'Initialize Terraform'                                                                                                                                                                         common.py:312
           DEBUG    Running step Initialize Terraform                                        

so even when re-add the model the situation from the first output repeats!

$ juju models
Controller: foundations-maas

Model            Cloud/Region  Type  Status     Machines  Cores  Units  Access  Last connection
openstack-infra  pc6b/default  maas  available         3     24  4      admin   2 hours ago
$ juju add-model openstack-machines pc6b
Added 'openstack-machines' model on pc6b/default with credential 'pc6b' for user 'sunbeamuser'

So I modified 2 files and added additional debugging and mount bound them
/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/juju.py
/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/steps/juju.py
the copies landed in:
/var/snap/openstack/common/juju.py
/var/snap/openstack/common/steps_juju.py

Here are the diffs (added .txt so github will allow me to attach them)

sunbeam_steps_juju.py.diff.txt
sunbeam_core_juju.py.diff.txt

This is the debug log output:
sunbeam-deploy.log

The problem is this function is called TWICE with the same argument,

  1. when pulling terraform state
           DEBUG    Starting step 'Destroy Hypervisor'                                                                                                                                                                           common.py:312
           DEBUG    Running command /snap/openstack/945/bin/terraform state pull                                                                                                                                              terraform.py:315
⠙ Destroying Hypervisor ...            DEBUG    Command finished. stderr=                                                                                                                                                                     
           DEBUG    ===== Trying to look for model openstack-machines                                                                                                                                                              juju.py:321
           DEBUG    ===== models(self) Trying run CLI models --all                                                                                                                                                                 juju.py:308
           INFO     cli: juju models --controller foundations-maas --format json --all                                                                                                                                            _juju.py:308
⠸ Destroying Hypervisor ... [14:55:25] DEBUG    ===== Found openstack-machines sunbeamuser/openstack-machines f0c21a14-9acc-482e-8ff8-5a090b5fe451
           INFO     cli: juju status --model foundations-maas:sunbeamuser/openstack-machines --format json                                                                                                                        _juju.py:308
⠴ Destroying Hypervisor ...            DEBUG    Skipping step Destroy Hypervisor                                                                                                                                                              
           DEBUG    Starting step 'Initialize Terraform'                                                       
  1. when waiting for the model to be destroyed , it never enter get_model
           DEBUG    Starting step 'Destroy model openstack-machines'                                                                                                                                                             common.py:312
           DEBUG    === model_exists running self.get_model(openstack-machines)                                                                                                                                                    juju.py:334
           DEBUG    Running step Destroy model openstack-machines                                                                                                                                                                common.py:331
           INFO     cli: juju destroy-model sunbeamuser/openstack-machines --no-prompt --destroy-storage                                                                                                                          _juju.py:308
⠇ Destroying model openstack-machines ... [14:55:30] DEBUG    === Running jhelper.wait_model_gone()                                                                                                                                           
           DEBUG    === wait_model_gone self.model_exists loop                                                                                                                                                                    juju.py:1294
           DEBUG    === model_exists running self.get_model(openstack-machines)                                                                                                                                                    juju.py:334
⠹ Destroying model openstack-machines ... [14:55:40] DEBUG    === time passed: 10.006787786027417                                                                                                                                             
           DEBUG    === model_exists running self.get_model(openstack-machines)                                                                                                                                                    juju.py:334
⠧ Destroying model openstack-machines ... [14:55:50] DEBUG    === time passed: 20.01368316099979    

The function excerpt from the current 2024.1/stable
https://github.com/canonical/snap-openstack/blob/stable/2024.1/sunbeam-python/sunbeam/core/juju.py#L265-L274

    @functools.cache
   def get_model(self, model: str) -> "dict":
       """Fetch model.

       :model: Name of the model
       """
       for m in self.models():
           if model in (m["short-name"], m["name"], m["model-uuid"]):
               return m
       raise ModelNotFoundException(f"Model {model!r} not found")

Copy link
Copy Markdown
Collaborator

@gboutry gboutry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is gonna be a performance hit for every other code path.

Easier to clear the cache in the destroy path, which is much less exercised than the other paths.

@mastier
Copy link
Copy Markdown
Author

mastier commented Mar 25, 2026

@gboutry very true. I could post the bug first on launchpad.

But you are right. We should not use the cached version in the destroy path. If you got some idea how to solve then great. Otherwise I will try to figure something out later.

EDIT: actually we could just add self.get_model.cache_clear() in JujuHelper's wait_model_gone

@mastier
Copy link
Copy Markdown
Author

mastier commented Mar 25, 2026

Created bug report for reference
https://bugs.launchpad.net/snap-openstack/+bug/2146354

@mastier mastier force-pushed the fix-destroy branch 7 times, most recently from 0f61c17 to 45c31f3 Compare March 25, 2026 18:43
Comment thread sunbeam-python/sunbeam/core/juju.py Outdated
f"Timed out while waiting for model {model!r} to be gone"
)
time.sleep(MODEL_DELAY)
self.get_model.cache_clear()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not the right layer to apply this too.

I think we'd like to just modify the function:

    def model_exists(self, model: str) -> bool:
        """Check if model exists.

        :model: Name of the model
        """
        self.get_model.cache_clear()
        try:
            self.get_model(model)
        except JujuException:
            return False
        return True

Because for existence check, you always want to clear the cache.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gboutry I suggested better solution with having seperate model_exists_live

but I see you might want always check without cache

Not sure how you would like to approach. For instance here:

  class WriteJujuStatusStep(BaseStep, JujuStepHelper):                                                
      """Get the status of the specified model."""                                                    
                                                                                                      
      def __init__(                                                                                   
          self,                                                                                       
          jhelper: JujuHelper,                                                                        
          model: str,                                                                                 
          file_path: Path,                                                                            
      ):                                                                                              
          super().__init__("Write Model status", f"Recording status of model {model}")                
                                                                                                      
          self.jhelper = jhelper                                                                      
          self.model = model                                                                          
          self.file_path = file_path                                                                  
                                                                                                      
      def is_skip(self, context: StepContext) -> Result:                                              
          """Determines if the step should be skipped or not.                                         
                                                                                                      
          :return: ResultType.SKIPPED if the Step should be skipped,                                  
                   ResultType.COMPLETED or ResultType.FAILED otherwise                                
          """                                                                                         
          if self.jhelper.model_exists(self.model):                                                   
              return Result(ResultType.COMPLETED)                                                     
          LOG.debug(f"Model {self.model} not found")                                                  
          return Result(ResultType.SKIPPED)      

or here

  class AddJujuModelStep(BaseStep):                                                                   
      """Add model with configs."""                                                                   
                                                                                                      
      def __init__(                                                                                   
          self,                                                                                       
          jhelper: JujuHelper,                                                                        
          model: str,                                                                                 
          cloud: str,                                                                                 
          credential: str | None = None,                                                              
          proxy_settings: dict | None = None,                                                         
          model_config: dict | None = None,                                                           
      ):                                                                                              
          super().__init__(f"Add model {model}", f"Adding model {model}")                             
          self.jhelper = jhelper                                                                      
          self.model = model                                                                          
          self.cloud = cloud                                                                          
          self.credential = credential                                                                
          self.proxy_settings = proxy_settings or {}                                                  
          self.model_config = model_config or {}                                                      
                                                                                                      
      def is_skip(self, context: StepContext) -> Result:                                              
          """Determines if the step should be skipped or not."""                                      
          if self.jhelper.model_exists(self.model):                                                   
              return Result(ResultType.SKIPPED)                                                       
          LOG.debug(f"Model {self.model} not found")                                                  
          return Result(ResultType.COMPLETED)       

In the destroy obviously we would like get the live status

   class DestroyJujuModelStep(BaseStep):                                                               
      def __init__(                                                                                   
          self,                                                                                       
          jhelper: JujuHelper,                                                                        
          model: str,                                                                                 
          timeout: int = 1800,                                                                        
      ):                                                                                              
          super().__init__(f"Destroy model {model}", f"Destroying model {model}")                     
          self.jhelper = jhelper                                                                      
          self.model = model                                                                          
          self.timeout = timeout                                                                      
                                                                                                      
      def is_skip(self, context: StepContext) -> Result:                                              
          """Determines if the step should be skipped or not."""                                      
          if self.jhelper.model_exists(self.model):                                                   
              return Result(ResultType.COMPLETED)                                                     
          LOG.debug(f"Model {self.model} not found")                                                  
          return Result(ResultType.SKIPPED)                                                           
                                                                                                      
      def run(self, context: StepContext) -> Result:                                                  
          """Add model with configs."""                                                               
          try:                                                                                        
              self.jhelper.destroy_model(self.model, destroy_storage=True)                            
              self.jhelper.wait_model_gone(                                                           
                  self.model,                                                                         
                  timeout=self.timeout,                                                               
              )                                                                                       
          except Exception as e:                                                                      
              LOG.debug(f"Error destroying model {self.model}", exc_info=True)                        
              return Result(ResultType.FAILED, str(e))                                                
                                                                                                      
          return Result(ResultType.COMPLETED)                                                         

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mastier indeed, that's why I think this is a better approach:

    def model_exists(self, model: str) -> bool:
        """Check if model exists.

        :model: Name of the model
        """
        self.get_model.cache_clear()
        try:
            self.get_model(model)
        except JujuException:
            return False
        return True

checking model_exists against a cached value is always wrong in my opinion. and we want to transparently give this capability.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses Sunbeam getting stuck during destroy-deployment by preventing wait_model_gone() from relying on stale cached get_model() results after a Juju model has been destroyed.

Changes:

  • Clear the JujuHelper.get_model() functools cache inside the wait_model_gone() polling loop to force refreshed model discovery.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread sunbeam-python/sunbeam/core/juju.py Outdated
Comment on lines +1288 to +1294
f"Timed out while waiting for model {model!r} to be gone"
)
time.sleep(MODEL_DELAY)
self.get_model.cache_clear()
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearing the global get_model cache only after time.sleep(MODEL_DELAY) means wait_model_gone will always wait at least one full delay interval even if the model is already gone but get_model() is cached. Consider clearing the cache before re-checking model_exists (e.g., at the start of the loop / before the while condition) so the function can exit immediately once the model is removed, and so the first existence check isn't based on stale cache state.

Copilot uses AI. Check for mistakes.
)
time.sleep(MODEL_DELAY)
self.get_model.cache_clear()

Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change fixes a cached-staleness edge case in wait_model_gone, but there’s no unit test exercising it. Please add a test that simulates get_model() returning a cached model that has since been removed (e.g., models() returning the model once and then returning an empty list) and asserts wait_model_gone returns without timing out and re-queries after a delay.

Suggested change
def test_wait_model_gone_cached_staleness(self) -> None:
"""Self-test for wait_model_gone cached-staleness edge case.
Simulates get_model() returning a cached model that has since been
removed by:
* Returning True once from model_exists (model appears to exist).
* Returning False on the next model_exists call (model is gone).
Verifies that:
* wait_model_gone does not time out.
* get_model.cache_clear() is invoked to invalidate the cache.
* model_exists is re-queried after a delay (time.sleep is called).
"""
calls: dict[str, int] = {
"model_exists": 0,
"cache_clear": 0,
"sleep": 0,
}
class _Dummy:
def __init__(self) -> None:
self._exists_calls = 0
class _GetModel:
def __init__(self, outer: "_Dummy") -> None:
self.outer = outer
def cache_clear(self) -> None:
calls["cache_clear"] += 1
self.get_model = _GetModel(self)
def model_exists(self, name: str) -> bool:
# First call: model appears to exist (stale cache).
# Second call: model no longer exists.
self._exists_calls += 1
calls["model_exists"] += 1
return self._exists_calls == 1
dummy = _Dummy()
# Patch time.monotonic and time.sleep locally to avoid real delays and
# to track that a delay occurs between queries.
original_monotonic = time.monotonic
original_sleep = time.sleep
monotonic_values = [0.0, 0.0, MODEL_DELAY + 1.0]
def _fake_monotonic() -> float:
if monotonic_values:
return monotonic_values.pop(0)
return monotonic_values[-1]
def _fake_sleep(_delay: float) -> None:
calls["sleep"] += 1
# No real sleeping in the self-test.
try:
time.monotonic = _fake_monotonic # type: ignore[assignment]
time.sleep = _fake_sleep # type: ignore[assignment]
# Call the real implementation as an unbound method, using the
# dummy object that provides the expected interface.
type(self).wait_model_gone(dummy, "test-model", timeout=MODEL_DELAY * 5)
finally:
time.monotonic = original_monotonic # type: ignore[assignment]
time.sleep = original_sleep # type: ignore[assignment]
# The model should have been queried at least twice: once when it
# "exists" and once after the cache is cleared when it is "gone".
assert calls["model_exists"] >= 2
# The cache should have been cleared at least once.
assert calls["cache_clear"] >= 1
# A delay should have occurred between queries.
assert calls["sleep"] >= 1

Copilot uses AI. Check for mistakes.
…stroy

Extract _get_model_uncached() as the raw lookup logic, keep get_model()
cached via functools.cache for normal use, and add model_exists_live()
which always bypasses the cache.

Use model_exists_live() in wait_model_gone() and in
DestroyControlPlaneStep.run() where the model state is actively changing
and cached results would be stale.

Add unit tests covering _get_model_uncached, model_exists_live, and
wait_model_gone including a regression test for the original bug.

Fixes: LP#2146354
Co-authored-by: Claude (OpenCode AI) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown
Member

@himax16 himax16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment that we can probably simplify _get_model_uncached. Feel free to ignore if this is a design choice.

Might want to change the

Co-authored-by: Claude (OpenCode AI) <noreply@anthropic.com>

to

Assisted-by: Claude <MODEL>

to conform with the AI standard of the repo.

:model: Name of the model
"""
try:
self._get_model_uncached(model)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same functionality should be available by calling self.get_model.__wrapped__(model) using the @functools.cache wrapped function.

https://docs.python.org/3/library/functools.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants