Skip to content

fix(bootstrap): auto-cleanup Docker resources on failed gateway deploy#464

Merged
drew merged 2 commits intomainfrom
463-fix-gateway-deploy-cleanup-orphaned-volume
Mar 19, 2026
Merged

fix(bootstrap): auto-cleanup Docker resources on failed gateway deploy#464
drew merged 2 commits intomainfrom
463-fix-gateway-deploy-cleanup-orphaned-volume

Conversation

@drew
Copy link
Collaborator

@drew drew commented Mar 18, 2026

Summary

  • Auto-cleanup Docker resources (volume, container, network, image) when deploy_gateway_with_logs() fails after resource creation, eliminating orphaned volumes that block retries
  • Update corrupted-state error diagnosis to reflect automatic cleanup and mark it retryable
  • Add 3 regression tests validating the new diagnosis behavior

Related Issue

Fixes #463

Eliminates the need for downstream workarounds like NemoClaw PR #337, which shells out to docker volume rm from JavaScript/bash scripts.

Changes

crates/openshell-bootstrap/src/lib.rs

  • Wrap post-resource-creation steps in deploy_gateway_with_logs() inside an async block
  • On failure, call destroy_gateway_resources() before propagating the error
  • Log cleanup attempts and warn if cleanup itself fails (with manual recovery command)

crates/openshell-bootstrap/src/errors.rs

  • Update diagnose_corrupted_state(): explanation now mentions automatic cleanup, retryable set to true, first recovery step is description-only ("cleanup was automatic"), removed manual docker volume rm step
  • Add 3 tests:
    • test_diagnose_corrupted_state_is_retryable_after_auto_cleanup — verifies retryable flag and explanation text
    • test_diagnose_corrupted_state_recovery_no_manual_volume_rm — verifies no docker volume rm in recovery steps, first step is description-only
    • test_diagnose_corrupted_state_fallback_step_includes_gateway_name — verifies gateway name interpolation in fallback command

Testing

  • cargo test -p openshell-bootstrap — 80/80 pass (77 existing + 3 new)
  • cargo fmt --all -- --check — clean
  • cargo clippy -p openshell-bootstrap — no new warnings

Checklist

  • Tests added for changed behavior
  • No secrets or credentials committed
  • Changes scoped to the issue

drew added 2 commits March 18, 2026 16:34
When deploy_gateway_with_logs() fails after creating Docker resources
(volume, container, network), the orphaned volume blocks subsequent
retries with 'Corrupted cluster state'. Wrap the post-resource-creation
steps in an async block and call destroy_gateway_resources() on failure
to leave the environment in a clean, retryable state.

Also update the corrupted state diagnosis to reflect that cleanup is now
automatic and mark the error as retryable.

Fixes #463
Verify that the corrupted state diagnosis correctly reflects automatic
cleanup: retryable flag is true, explanation mentions automatic cleanup,
recovery steps no longer include manual docker volume rm, and the
fallback command interpolates the gateway name.

Refs #463
@drew drew requested a review from a team as a code owner March 18, 2026 23:38
@drew drew added the test:e2e Requires end-to-end coverage label Mar 18, 2026
@drew drew merged commit 4878b9b into main Mar 19, 2026
13 checks passed
@drew drew deleted the 463-fix-gateway-deploy-cleanup-orphaned-volume branch March 19, 2026 04:24
Copy link

@stiliyana94 stiliyana94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Z

linuxdevel pushed a commit to linuxdevel/OpenShell that referenced this pull request Mar 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: failed gateway deploy leaves orphaned Docker volume, blocking retries

2 participants