Fix destroy timeout on openstack-machines model#728
Conversation
gboutry
left a comment
There was a problem hiding this comment.
This is gonna be a performance hit for every other code path.
Easier to clear the cache in the destroy path, which is much less exercised than the other paths.
|
@gboutry very true. I could post the bug first on launchpad. But you are right. We should not use the cached version in the destroy path. If you got some idea how to solve then great. Otherwise I will try to figure something out later. EDIT: actually we could just add |
|
Created bug report for reference |
0f61c17 to
45c31f3
Compare
| f"Timed out while waiting for model {model!r} to be gone" | ||
| ) | ||
| time.sleep(MODEL_DELAY) | ||
| self.get_model.cache_clear() |
There was a problem hiding this comment.
I think this is not the right layer to apply this too.
I think we'd like to just modify the function:
def model_exists(self, model: str) -> bool:
"""Check if model exists.
:model: Name of the model
"""
self.get_model.cache_clear()
try:
self.get_model(model)
except JujuException:
return False
return TrueBecause for existence check, you always want to clear the cache.
There was a problem hiding this comment.
@gboutry I suggested better solution with having seperate model_exists_live
but I see you might want always check without cache
Not sure how you would like to approach. For instance here:
class WriteJujuStatusStep(BaseStep, JujuStepHelper):
"""Get the status of the specified model."""
def __init__(
self,
jhelper: JujuHelper,
model: str,
file_path: Path,
):
super().__init__("Write Model status", f"Recording status of model {model}")
self.jhelper = jhelper
self.model = model
self.file_path = file_path
def is_skip(self, context: StepContext) -> Result:
"""Determines if the step should be skipped or not.
:return: ResultType.SKIPPED if the Step should be skipped,
ResultType.COMPLETED or ResultType.FAILED otherwise
"""
if self.jhelper.model_exists(self.model):
return Result(ResultType.COMPLETED)
LOG.debug(f"Model {self.model} not found")
return Result(ResultType.SKIPPED)
or here
class AddJujuModelStep(BaseStep):
"""Add model with configs."""
def __init__(
self,
jhelper: JujuHelper,
model: str,
cloud: str,
credential: str | None = None,
proxy_settings: dict | None = None,
model_config: dict | None = None,
):
super().__init__(f"Add model {model}", f"Adding model {model}")
self.jhelper = jhelper
self.model = model
self.cloud = cloud
self.credential = credential
self.proxy_settings = proxy_settings or {}
self.model_config = model_config or {}
def is_skip(self, context: StepContext) -> Result:
"""Determines if the step should be skipped or not."""
if self.jhelper.model_exists(self.model):
return Result(ResultType.SKIPPED)
LOG.debug(f"Model {self.model} not found")
return Result(ResultType.COMPLETED)
In the destroy obviously we would like get the live status
class DestroyJujuModelStep(BaseStep):
def __init__(
self,
jhelper: JujuHelper,
model: str,
timeout: int = 1800,
):
super().__init__(f"Destroy model {model}", f"Destroying model {model}")
self.jhelper = jhelper
self.model = model
self.timeout = timeout
def is_skip(self, context: StepContext) -> Result:
"""Determines if the step should be skipped or not."""
if self.jhelper.model_exists(self.model):
return Result(ResultType.COMPLETED)
LOG.debug(f"Model {self.model} not found")
return Result(ResultType.SKIPPED)
def run(self, context: StepContext) -> Result:
"""Add model with configs."""
try:
self.jhelper.destroy_model(self.model, destroy_storage=True)
self.jhelper.wait_model_gone(
self.model,
timeout=self.timeout,
)
except Exception as e:
LOG.debug(f"Error destroying model {self.model}", exc_info=True)
return Result(ResultType.FAILED, str(e))
return Result(ResultType.COMPLETED)
There was a problem hiding this comment.
@mastier indeed, that's why I think this is a better approach:
def model_exists(self, model: str) -> bool:
"""Check if model exists.
:model: Name of the model
"""
self.get_model.cache_clear()
try:
self.get_model(model)
except JujuException:
return False
return Truechecking model_exists against a cached value is always wrong in my opinion. and we want to transparently give this capability.
There was a problem hiding this comment.
Pull request overview
This PR addresses Sunbeam getting stuck during destroy-deployment by preventing wait_model_gone() from relying on stale cached get_model() results after a Juju model has been destroyed.
Changes:
- Clear the
JujuHelper.get_model()functools cache inside thewait_model_gone()polling loop to force refreshed model discovery.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| f"Timed out while waiting for model {model!r} to be gone" | ||
| ) | ||
| time.sleep(MODEL_DELAY) | ||
| self.get_model.cache_clear() |
There was a problem hiding this comment.
Clearing the global get_model cache only after time.sleep(MODEL_DELAY) means wait_model_gone will always wait at least one full delay interval even if the model is already gone but get_model() is cached. Consider clearing the cache before re-checking model_exists (e.g., at the start of the loop / before the while condition) so the function can exit immediately once the model is removed, and so the first existence check isn't based on stale cache state.
| ) | ||
| time.sleep(MODEL_DELAY) | ||
| self.get_model.cache_clear() | ||
|
|
There was a problem hiding this comment.
This change fixes a cached-staleness edge case in wait_model_gone, but there’s no unit test exercising it. Please add a test that simulates get_model() returning a cached model that has since been removed (e.g., models() returning the model once and then returning an empty list) and asserts wait_model_gone returns without timing out and re-queries after a delay.
| def test_wait_model_gone_cached_staleness(self) -> None: | |
| """Self-test for wait_model_gone cached-staleness edge case. | |
| Simulates get_model() returning a cached model that has since been | |
| removed by: | |
| * Returning True once from model_exists (model appears to exist). | |
| * Returning False on the next model_exists call (model is gone). | |
| Verifies that: | |
| * wait_model_gone does not time out. | |
| * get_model.cache_clear() is invoked to invalidate the cache. | |
| * model_exists is re-queried after a delay (time.sleep is called). | |
| """ | |
| calls: dict[str, int] = { | |
| "model_exists": 0, | |
| "cache_clear": 0, | |
| "sleep": 0, | |
| } | |
| class _Dummy: | |
| def __init__(self) -> None: | |
| self._exists_calls = 0 | |
| class _GetModel: | |
| def __init__(self, outer: "_Dummy") -> None: | |
| self.outer = outer | |
| def cache_clear(self) -> None: | |
| calls["cache_clear"] += 1 | |
| self.get_model = _GetModel(self) | |
| def model_exists(self, name: str) -> bool: | |
| # First call: model appears to exist (stale cache). | |
| # Second call: model no longer exists. | |
| self._exists_calls += 1 | |
| calls["model_exists"] += 1 | |
| return self._exists_calls == 1 | |
| dummy = _Dummy() | |
| # Patch time.monotonic and time.sleep locally to avoid real delays and | |
| # to track that a delay occurs between queries. | |
| original_monotonic = time.monotonic | |
| original_sleep = time.sleep | |
| monotonic_values = [0.0, 0.0, MODEL_DELAY + 1.0] | |
| def _fake_monotonic() -> float: | |
| if monotonic_values: | |
| return monotonic_values.pop(0) | |
| return monotonic_values[-1] | |
| def _fake_sleep(_delay: float) -> None: | |
| calls["sleep"] += 1 | |
| # No real sleeping in the self-test. | |
| try: | |
| time.monotonic = _fake_monotonic # type: ignore[assignment] | |
| time.sleep = _fake_sleep # type: ignore[assignment] | |
| # Call the real implementation as an unbound method, using the | |
| # dummy object that provides the expected interface. | |
| type(self).wait_model_gone(dummy, "test-model", timeout=MODEL_DELAY * 5) | |
| finally: | |
| time.monotonic = original_monotonic # type: ignore[assignment] | |
| time.sleep = original_sleep # type: ignore[assignment] | |
| # The model should have been queried at least twice: once when it | |
| # "exists" and once after the cache is cleared when it is "gone". | |
| assert calls["model_exists"] >= 2 | |
| # The cache should have been cleared at least once. | |
| assert calls["cache_clear"] >= 1 | |
| # A delay should have occurred between queries. | |
| assert calls["sleep"] >= 1 |
…stroy Extract _get_model_uncached() as the raw lookup logic, keep get_model() cached via functools.cache for normal use, and add model_exists_live() which always bypasses the cache. Use model_exists_live() in wait_model_gone() and in DestroyControlPlaneStep.run() where the model state is actively changing and cached results would be stale. Add unit tests covering _get_model_uncached, model_exists_live, and wait_model_gone including a regression test for the original bug. Fixes: LP#2146354 Co-authored-by: Claude (OpenCode AI) <noreply@anthropic.com>
himax16
left a comment
There was a problem hiding this comment.
Left a comment that we can probably simplify _get_model_uncached. Feel free to ignore if this is a design choice.
Might want to change the
Co-authored-by: Claude (OpenCode AI) <noreply@anthropic.com>
to
Assisted-by: Claude <MODEL>
to conform with the AI standard of the repo.
| :model: Name of the model | ||
| """ | ||
| try: | ||
| self._get_model_uncached(model) |
There was a problem hiding this comment.
The same functionality should be available by calling self.get_model.__wrapped__(model) using the @functools.cache wrapped function.
When try to destroy-deployment the sunbeam is stuck on destroy openstack-machines model,
while the model is actually destroy it still perceives to be existing until reach timeout
On the next run it will complain about
openstack-machinesmissingso even when re-add the model the situation from the first output repeats!
So I modified 2 files and added additional debugging and mount bound them
/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/juju.py/snap/openstack/945/lib/python3.12/site-packages/sunbeam/core/steps/juju.pythe copies landed in:
/var/snap/openstack/common/juju.py/var/snap/openstack/common/steps_juju.pyHere are the diffs (added .txt so github will allow me to attach them)
sunbeam_steps_juju.py.diff.txt
sunbeam_core_juju.py.diff.txt
This is the debug log output:
sunbeam-deploy.log
The problem is this function is called TWICE with the same argument,
The function excerpt from the current 2024.1/stable
https://github.com/canonical/snap-openstack/blob/stable/2024.1/sunbeam-python/sunbeam/core/juju.py#L265-L274