fix(azure): default to checkpointable OS disks#111

Merged: steipete merged 3 commits into openclaw:main from jwmoss:fix/azure-checkpoint on May 16, 2026

@jwmoss commented on May 15, 2026

Summary

Makes Azure leases checkpointable by default. The interaction between Azure ephemeral OS disks and provider-native checkpoints showed that the old default could silently produce unusable checkpoint forks.

This follows two earlier changes: Azure leases from #39 and Azure provider-native checkpoints from #99.

#99 made Azure OS disk snapshots part of the native checkpoint path, but leases from #39 could default to ephemeral OS disks on supported D/F/E-family SKUs. Azure accepts a snapshot request for those VMs and reports success, but the resulting snapshot does not capture the live ephemeral OS disk state. This PR makes the checkpointable behavior the default instead.

  • Defaults direct Azure leases to managed StandardSSD_LRS OS disks.
  • Defaults brokered Azure leases to managed OS disks, including when azureOSDisk / CRABBOX_AZURE_OS_DISK is set to auto for compatibility.
  • Keeps explicit --azure-os-disk ephemeral / azure.osDisk: ephemeral for stateless leases that intentionally want local OS disks.
  • Preserves the checkpoint guard that refuses native Azure checkpoint creation from ephemeral OS disk leases.
  • Documents the behavior in the Azure, checkpoint, warmup/run, and configuration docs.
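
For reviewers who want the defaulting rule in one place, here is a minimal Go sketch of the resolution described in the list above. It is not the code from this PR: the type `OSDiskMode`, the function `resolveOSDisk`, and the choice to reject unknown values with an error are all assumptions made for illustration.

```go
package main

import "fmt"

// OSDiskMode is a hypothetical type for this sketch; the real crabbox
// types and names may differ.
type OSDiskMode string

const (
	OSDiskManaged   OSDiskMode = "managed"
	OSDiskEphemeral OSDiskMode = "ephemeral"
)

// resolveOSDisk maps a configured value (flag, config file, or env var)
// to the effective OS disk mode. An unset value and "auto" both resolve
// to managed; only an explicit "ephemeral" opts out.
// Rejecting unknown values is an assumption of this sketch.
func resolveOSDisk(configured string) (OSDiskMode, error) {
	switch configured {
	case "", "auto", "managed":
		return OSDiskManaged, nil
	case "ephemeral":
		return OSDiskEphemeral, nil
	default:
		return "", fmt.Errorf("unsupported azure os disk mode %q", configured)
	}
}

func main() {
	for _, v := range []string{"", "auto", "managed", "ephemeral"} {
		mode, err := resolveOSDisk(v)
		fmt.Printf("configured=%q effective=%s err=%v\n", v, mode, err)
	}
}
```

The point of the sketch is that `auto` stays an accepted input but no longer selects ephemeral disks on SKUs that support them.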

Actual binary runs

Direct Azure default OS disk

This was run without --azure-os-disk, using the direct Azure provider and the active Azure CLI login. Public IPs are omitted here.

./bin/crabbox warmup \
  --provider azure \
  --type Standard_D2ads_v6 \
  --market on-demand \
  --idle-timeout 10m \
  --ttl 30m \
  --timing-json

Observed output:

provisioning provider=azure lease=cbx_b7221565a064 slug=tidal-lobster class=beast preferred_type=Standard_D2ads_v6 location=eastus rg=crabbox-leases keep=true
provisioned lease=cbx_b7221565a064 server=crabbox-tidal-lobster-775f7392 type=Standard_D2ads_v6
leased cbx_b7221565a064 slug=tidal-lobster provider=azure server=crabbox-tidal-lobster-775f7392 type=Standard_D2ads_v6 ip=<public-ip> idle_timeout=10m0s expires=2026-05-15T22:54:01Z
ready ssh=crabbox@<public-ip>:2222 network=public workroot=/work/crabbox
warmup complete total=2m1.269s
{"provider":"azure","leaseId":"cbx_b7221565a064","slug":"tidal-lobster","syncMs":0,"syncSkipped":false,"commandMs":0,"totalMs":121269,"exitCode":0}

Azure VM OS disk verification:

az vm show \
  -g crabbox-leases \
  -n crabbox-tidal-lobster-775f7392 \
  --query "storageProfile.osDisk.{name:name,caching:caching,diffDiskSettings:diffDiskSettings,managedDisk:managedDisk.storageAccountType,managedDiskId:managedDisk.id}" \
  -o json

Observed output:

{
  "caching": "ReadWrite",
  "diffDiskSettings": null,
  "managedDisk": "StandardSSD_LRS",
  "managedDiskId": "/subscriptions/<redacted>/resourceGroups/crabbox-leases/providers/Microsoft.Compute/disks/crabbox-tidal-lobster-775f7392-osdisk",
  "name": "crabbox-tidal-lobster-775f7392-osdisk"
}
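
The same check can be done programmatically. The sketch below assumes the JSON shape produced by the `--query` expression above (with `managedDisk` aliased to the storage account type) and treats a null `diffDiskSettings` plus a non-empty storage account type as "managed OS disk". The names `osDiskView` and `isManagedOSDisk` are hypothetical and not part of crabbox.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// osDiskView mirrors the fields selected by the az vm show --query
// expression used in this PR description. Hypothetical helper type.
type osDiskView struct {
	Name             string `json:"name"`
	Caching          string `json:"caching"`
	DiffDiskSettings *struct {
		Option string `json:"option"`
	} `json:"diffDiskSettings"`
	ManagedDisk   string `json:"managedDisk"` // storage account type, per the query alias
	ManagedDiskID string `json:"managedDiskId"`
}

// isManagedOSDisk reports whether the queried VM uses a persistent
// managed OS disk: diffDiskSettings is null and a storage account
// type is present.
func isManagedOSDisk(raw []byte) (bool, error) {
	var v osDiskView
	if err := json.Unmarshal(raw, &v); err != nil {
		return false, fmt.Errorf("parse az vm show output: %w", err)
	}
	return v.DiffDiskSettings == nil && v.ManagedDisk != "", nil
}

func main() {
	// Shape of the observed output above, with IDs shortened.
	observed := []byte(`{
	  "caching": "ReadWrite",
	  "diffDiskSettings": null,
	  "managedDisk": "StandardSSD_LRS",
	  "name": "crabbox-tidal-lobster-775f7392-osdisk"
	}`)
	managed, err := isManagedOSDisk(observed)
	fmt.Println("managed:", managed, "err:", err)
}
```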

Cleanup:

./bin/crabbox stop --provider azure cbx_b7221565a064
az vm show -g crabbox-leases -n crabbox-tidal-lobster-775f7392 --query id -o tsv
az disk show -g crabbox-leases -n crabbox-tidal-lobster-775f7392-osdisk --query id -o tsv
./bin/crabbox inspect --provider azure --id cbx_b7221565a064 --json

Observed output:

deleted lease=cbx_b7221565a064 server=crabbox-tidal-lobster-775f7392 name=crabbox-tidal-lobster-775f7392
ERROR: (ResourceNotFound) The Resource 'Microsoft.Compute/virtualMachines/crabbox-tidal-lobster-775f7392' under resource group 'crabbox-leases' was not found.
ERROR: (NotFound) Disk crabbox-tidal-lobster-775f7392-osdisk is not found.
lease/server not found: cbx_b7221565a064

Brokered Azure default OS disk

This was run against a local Wrangler coordinator on 127.0.0.1:8787 with a temporary Azure service principal scoped to the crabbox-leases resource group. The Worker env intentionally set CRABBOX_AZURE_OS_DISK=auto to prove that broker-side auto now resolves to a managed OS disk. Tokens, service-principal credentials, subscription IDs, and public IPs are omitted here.

npx wrangler dev \
  --local \
  --ip 127.0.0.1 \
  --port 8787 \
  --env-file <temp-env-file> \
  --persist-to /tmp/crabbox-worker-broker-smoke-state \
  --log-level warn

Health check:

curl -fsS http://127.0.0.1:8787/v1/health

Observed output:

{"ok":true,"service":"crabbox-coordinator"}

Brokered warmup:

CRABBOX_COORDINATOR=http://127.0.0.1:8787 \
CRABBOX_COORDINATOR_TOKEN=<redacted> \
./bin/crabbox warmup \
  --provider azure \
  --type Standard_D2ads_v6 \
  --market on-demand \
  --idle-timeout 10m \
  --ttl 30m \
  --timing-json

Observed output:

coordinator lease class=beast preferred_type=Standard_D2ads_v6 keep=true slug=blue-hermit idle_timeout=10m0s ttl=30m0s
leased cbx_04c7b241b5b4 slug=blue-hermit server=0 type=Standard_D2ads_v6 ip=<public-ip> via coordinator
leased cbx_04c7b241b5b4 slug=blue-hermit provider=azure server=crabbox-blue-hermit-cc265b9b type=Standard_D2ads_v6 ip=<public-ip> idle_timeout=10m0s expires=2026-05-15T23:01:48Z
ready ssh=crabbox@<public-ip>:2222 network=public workroot=/work/crabbox
warmup complete total=1m54.61s
{"provider":"azure","leaseId":"cbx_04c7b241b5b4","slug":"blue-hermit","syncMs":0,"syncSkipped":false,"commandMs":0,"totalMs":114609,"exitCode":0}

Azure VM OS disk verification:

az vm show \
  -g crabbox-leases \
  -n crabbox-blue-hermit-cc265b9b \
  --query "storageProfile.osDisk.{name:name,caching:caching,diffDiskSettings:diffDiskSettings,managedDisk:managedDisk.storageAccountType,managedDiskId:managedDisk.id}" \
  -o json

Observed output:

{
  "caching": "ReadWrite",
  "diffDiskSettings": null,
  "managedDisk": "StandardSSD_LRS",
  "managedDiskId": "/subscriptions/<redacted>/resourceGroups/crabbox-leases/providers/Microsoft.Compute/disks/crabbox-blue-hermit-cc265b9b-osdisk",
  "name": "crabbox-blue-hermit-cc265b9b-osdisk"
}

Cleanup:

CRABBOX_COORDINATOR=http://127.0.0.1:8787 \
CRABBOX_COORDINATOR_TOKEN=<redacted> \
./bin/crabbox stop --provider azure cbx_04c7b241b5b4

az vm show -g crabbox-leases -n crabbox-blue-hermit-cc265b9b --query id -o tsv
az disk show -g crabbox-leases -n crabbox-blue-hermit-cc265b9b-osdisk --query id -o tsv

Observed output:

released lease=cbx_04c7b241b5b4 server=crabbox-blue-hermit-cc265b9b
ERROR: (ResourceNotFound) The Resource 'Microsoft.Compute/virtualMachines/crabbox-blue-hermit-cc265b9b' under resource group 'crabbox-leases' was not found.
ERROR: (ResourceNotFound) The Resource 'Microsoft.Compute/disks/crabbox-blue-hermit-cc265b9b-osdisk' under resource group 'crabbox-leases' was not found.

Temporary credential cleanup:

az role assignment delete --assignee <temp-app-id> --scope <crabbox-leases-rg-id>
az ad app delete --id <temp-app-id>
az ad sp show --id <temp-app-id> --query appId -o tsv

Observed output:

deleted temporary service principal <temp-app-id> and removed temp files
ERROR: Resource '<temp-app-id>' does not exist or one of its queried reference-property objects are not present.

Validation commands

go test ./internal/cli ./internal/providers/azure
npm test --prefix worker -- azure.test.ts config.test.ts
npm run check --prefix worker
npm run lint --prefix worker
npm run format:check --prefix worker
go build -trimpath -o bin/crabbox ./cmd/crabbox
git diff --check
./bin/crabbox warmup --help 2>&1 | rg -n "azure-os-disk"
./bin/crabbox run --help 2>&1 | rg -n "azure-os-disk"
git merge-tree --write-tree upstream/main HEAD

Notes for review

  • managed is now the default Azure OS disk mode.
  • auto is retained as an accepted value for compatibility, but resolves to managed so Azure matches AWS/GCP checkpoint expectations by default.
  • ephemeral is still available for users who explicitly want a local OS disk for stateless Azure leases.
  • Native Azure checkpoint creation still refuses leases with diffDiskSettings.option == Local, which is the only safe behavior for ephemeral OS disk VMs.
  • The current scripts/live-smoke.sh on main does not include an Azure provider branch, so the Azure proof above was run directly instead of through that script.
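
As an illustration of the checkpoint guard mentioned in these notes, a minimal Go sketch follows. It assumes a simplified view of the VM's OS disk and hypothetical names (`OSDisk`, `ErrEphemeralOSDisk`, `checkCheckpointable`); the actual guard in the Azure provider may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
)

// DiffDiskSettings and OSDisk are simplified, hypothetical stand-ins
// for the Azure VM storage profile fields relevant to the guard.
type DiffDiskSettings struct {
	Option string
}

type OSDisk struct {
	Name             string
	DiffDiskSettings *DiffDiskSettings
}

// ErrEphemeralOSDisk is returned when a native checkpoint is requested
// for a lease whose OS disk is ephemeral (diffDiskSettings.option == Local).
var ErrEphemeralOSDisk = errors.New("native checkpoint refused: ephemeral OS disk")

// checkCheckpointable refuses ephemeral OS disks, because a snapshot of
// such a VM would not capture the live OS disk state.
func checkCheckpointable(d OSDisk) error {
	if d.DiffDiskSettings != nil && d.DiffDiskSettings.Option == "Local" {
		return fmt.Errorf("os disk %q: %w", d.Name, ErrEphemeralOSDisk)
	}
	return nil
}

func main() {
	managed := OSDisk{Name: "managed-osdisk"}
	ephemeral := OSDisk{Name: "ephemeral-osdisk", DiffDiskSettings: &DiffDiskSettings{Option: "Local"}}
	fmt.Println(checkCheckpointable(managed))
	fmt.Println(checkCheckpointable(ephemeral))
}
```

Failing loudly here is the design choice the notes argue for: Azure itself reports success for the snapshot request, so the refusal has to happen on the crabbox side.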

@steipete merged commit 7823b10 into openclaw:main on May 16, 2026
6 checks passed
@jwmoss deleted the fix/azure-checkpoint branch on May 18, 2026 14:22