Skip to content

Clean up direct-engine state entries for remotely deleted resources removed from config#5496

Open
simonfaltum wants to merge 5 commits into
mainfrom
simonfaltum/b22-direct-stale-state
Open

Clean up direct-engine state entries for remotely deleted resources removed from config#5496
simonfaltum wants to merge 5 commits into
mainfrom
simonfaltum/b22-direct-stale-state

Conversation

@simonfaltum

Copy link
Copy Markdown
Member

Why

Found during a full-repo review of the CLI. With the direct engine, a resource that was deleted remotely and then removed from the bundle config could never leave the deployment state. The plan dropped its Delete entry, so the state cleanup never ran and the stale entry came back on every deploy. For pipelines and dashboards the same scenario was worse: the deploy hard-failed in metadata computation with "failed to compute relative path", and since the state never got cleaned, every following deploy failed the same way.

Changes

Before, deploying after a resource was deleted remotely and removed from config either left a stale state entry forever (jobs) or failed the deploy outright (pipelines, dashboards); now the deploy plans a delete for the stale entry, cleans it from state, and completes.

  • bundle/direct/bundle_plan.go: when planning a Delete and the remote read reports the resource is gone, keep the Delete entry instead of removing it from the plan. Delete() already tolerates missing resources and removes the entry from state when applied.
  • bundle/deployplan/plan.go: remove RemoveEntry, whose only caller was the code path above.
  • bundle/deploy/metadata/compute.go: add the missing l.File == "" guard to the pipelines and dashboards loops, matching the guard the jobs loop received in direct: handle remotely deleted resources #3710. State-only resources have no config location, and filepath.Rel errors on an empty path.

A side effect visible in test output: bundle destroy with the direct engine now lists a remotely deleted resource under "The following resources will be deleted", since its state entry is now explicitly cleaned instead of silently dropped.

Test plan

  • New acceptance test acceptance/bundle/resources/jobs/remote_delete/removed_from_config covering deploy, remote delete of a job and a pipeline, config removal, deploy twice, and summary, for both engines. With the direct engine the first deploy cleans the stale entries and the second deploy and summary are clean. The terraform variant also exercises the new metadata guard, since terraform keeps the gone resources in its state.
  • New unit test TestComputeMetadataMutatorStateOnlyResources verifying state-only jobs, pipelines, and dashboards are skipped without error.
  • Unit tests for bundle/direct, bundle/deployplan, and bundle/deploy/metadata.
  • Existing acceptance tests around the changed paths (jobs/remote_delete, volumes/remote-delete, volumes/remote-change-name, vector_search_indexes/drift, permissions, deployment/bind, deployment/unbind, bundle/destroy, bundle/generate, bundle/invariant, clusters, quality_monitors, registered_models, synced tables). Only jobs/remote_delete/destroy changed output (direct destroy now lists the gone job) and was regenerated with -update.
  • ./task fmt-q, ./task lint-q, ./task checks.

This pull request and its description were written by Isaac.

… from config

When a resource was deleted remotely and also removed from the bundle config, CalculatePlan dropped the Delete entry from the plan, so the stale state entry was never removed and reappeared on every deploy. Keep the Delete entry instead; Delete() already tolerates resources that are gone and removes the entry from state. Also add the missing location guard in metadata.Compute for pipelines and dashboards (jobs got it in #3710), which made the same scenario hard-fail the deploy.

Co-authored-by: Isaac
@simonfaltum simonfaltum requested a review from denik June 9, 2026 19:59
@simonfaltum simonfaltum temporarily deployed to test-trigger-is June 9, 2026 19:59 — with GitHub Actions Inactive
@simonfaltum simonfaltum temporarily deployed to test-trigger-is June 9, 2026 19:59 — with GitHub Actions Inactive
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Approval status: pending

/acceptance/bundle/ - needs approval

13 files changed
Suggested: @denik
Also eligible: @janniklasrose, @pietern, @shreyas-goenka, @lennartkats-db, @andrewnester, @anton-107

/bundle/ - needs approval

8 files changed
Suggested: @denik
Also eligible: @janniklasrose, @pietern, @shreyas-goenka, @lennartkats-db, @andrewnester, @anton-107

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

@eng-dev-ecosystem-bot

eng-dev-ecosystem-bot commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Integration test report

Commit: 232ca17

Run: 27410895224

Env 🟨​KNOWN 🔄​flaky 💚​RECOVERED 🙈​SKIP ✅​pass 🙈​skip Time
💚​ aws linux 7 15 264 974 9:27
🟨​ aws windows 7 15 266 972 16:33
🔄​ aws-ucws linux 4 7 15 356 888 28:30
💚​ aws-ucws windows 7 15 362 886 16:22
💚​ azure linux 1 17 267 972 9:43
💚​ azure windows 1 17 269 970 12:20
🔄​ azure-ucws linux 1 1 17 364 884 24:37
💚​ azure-ucws windows 1 17 367 882 12:42
💚​ gcp linux 1 17 263 975 12:38
💚​ gcp windows 1 17 265 973 12:10
27 interesting tests: 15 SKIP, 7 KNOWN, 5 flaky
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
🟨​ TestAccept 💚​R 🟨​K 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/invariant/no_drift 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🔄​ TestAccept/bundle/resources/model_serving_endpoints/basic 🙈​s 🙈​s 🔄​f ✅​p 🙈​s 🙈​s ✅​p ✅​p 🙈​s 🙈​s
🔄​ TestAccept/bundle/resources/model_serving_endpoints/basic/DATABRICKS_BUNDLE_ENGINE=direct 🔄​f ✅​p ✅​p ✅​p
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🔄​ TestAccept/bundle/resources/permissions/dashboards/create ✅​p ✅​p 🔄​f ✅​p ✅​p ✅​p ✅​p ✅​p 🙈​s 🙈​s
🔄​ TestAccept/bundle/resources/permissions/dashboards/create/DATABRICKS_BUNDLE_ENGINE=direct ✅​p ✅​p 🔄​f ✅​p ✅​p ✅​p ✅​p ✅​p
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 💚​R 🟨​K 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 💚​R 🟨​K 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 🟨​K 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/replace_existing 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/grants/select 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/ssh/connection 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🔄​ TestSyncIncrementalSyncFileToPythonNotebook ✅​p ✅​p ✅​p ✅​p ✅​p ✅​p 🔄​f ✅​p ✅​p ✅​p
Top 50 slowest tests (at least 2 minutes):
duration env testname
7:27 azure windows TestAccept
6:12 gcp windows TestAccept
5:48 aws-ucws windows TestAccept
5:34 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
5:15 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
5:09 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
5:03 azure-ucws windows TestAccept
5:01 azure-ucws linux TestAccept/bundle/deploy/files/no-snapshot-sync/DATABRICKS_BUNDLE_ENGINE=direct
4:57 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:49 aws-ucws linux TestAccept/bundle/resources/volumes/recreate/DATABRICKS_BUNDLE_ENGINE=direct
4:42 azure-ucws linux TestAccept/bundle/resources/model_serving_endpoints/basic/DATABRICKS_BUNDLE_ENGINE=direct
4:42 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:38 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:29 azure-ucws linux TestAccept/bundle/resources/volumes/recreate/DATABRICKS_BUNDLE_ENGINE=terraform
4:18 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:12 aws-ucws linux TestAccept/bundle/resources/model_serving_endpoints/basic/DATABRICKS_BUNDLE_ENGINE=terraform
4:03 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:53 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:44 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:41 aws-ucws linux TestAccept/bundle/deploy/files/no-snapshot-sync/DATABRICKS_BUNDLE_ENGINE=direct
3:34 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:31 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:24 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:23 aws-ucws linux TestAccept/bundle/deployment/bind/job/generate-and-bind/DATABRICKS_BUNDLE_ENGINE=direct
3:22 aws-ucws linux TestAccept/bundle/deploy/files/no-snapshot-sync/DATABRICKS_BUNDLE_ENGINE=terraform
3:14 azure-ucws linux TestAccept/bundle/resources/volumes/recreate/DATABRICKS_BUNDLE_ENGINE=direct
3:11 azure-ucws linux TestAccept/bundle/deployment/bind/job/generate-and-bind/DATABRICKS_BUNDLE_ENGINE=terraform
3:10 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:04 aws-ucws linux TestAccept/bundle/destroy/jobs-and-pipeline/DATABRICKS_BUNDLE_ENGINE=terraform
3:02 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:55 aws-ucws linux TestAccept/bundle/resources/volumes/recreate/DATABRICKS_BUNDLE_ENGINE=terraform
2:53 azure linux TestAccept
2:46 aws linux TestAccept
2:45 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:44 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:44 azure-ucws linux TestAccept
2:43 azure-ucws linux TestAccept/bundle/resources/permissions/dashboards/create/DATABRICKS_BUNDLE_ENGINE=direct
2:42 gcp linux TestAccept
2:40 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:36 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:35 azure-ucws linux TestAccept/bundle/deploy/mlops-stacks/DATABRICKS_BUNDLE_ENGINE=terraform
2:32 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:31 aws-ucws linux TestAccept/bundle/resources/permissions/dashboards/create/DATABRICKS_BUNDLE_ENGINE=direct
2:28 azure-ucws linux TestAccept/bundle/deployment/bind/job/generate-and-bind/DATABRICKS_BUNDLE_ENGINE=direct
2:28 aws-ucws linux TestAccept/bundle/resources/model_serving_endpoints/basic/DATABRICKS_BUNDLE_ENGINE=direct
2:26 azure-ucws linux TestAccept/bundle/run_as/job_default/DATABRICKS_BUNDLE_ENGINE=direct
2:26 azure-ucws linux TestAccept/bundle/resources/model_serving_endpoints/basic/DATABRICKS_BUNDLE_ENGINE=terraform
2:25 aws-ucws linux TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct
2:22 azure-ucws linux TestAccept/bundle/resources/alerts/with_file/DATABRICKS_BUNDLE_ENGINE=direct
2:14 aws-ucws linux TestAccept/bundle/deploy/mlops-stacks/DATABRICKS_BUNDLE_ENGINE=direct

Co-authored-by: Isaac
@simonfaltum

Copy link
Copy Markdown
Member Author

The direct engine now keeps the delete entry for a remotely deleted resource so its stale state entry gets cleaned up during destroy, which makes the destroy output of acceptance/bundle/resources/dashboards/delete-trashed-out-of-band legitimately diverge per engine. Split that section into out.destroy.terraform.txt and out.destroy.direct.txt per the repo convention, same as the existing jobs/remote_delete/destroy test.

@denik denik left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find, thanks!

Comment thread bundle/deploy/metadata/compute.go Outdated
// root
l := b.Config.GetLocation("resources.pipelines." + name)
if l.File == "" {
// b.Config.Resources.Pipelines may include a pipeline that only exists in state but not in config

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment, we iterate over a config there, so pipeline in question must exist in config.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, the wording was confusing. At this point b.Config.Resources is not only user config: deployCore runs statemgmt.Load right before metadata.Compute, and StateToBundle injects resources that exist only in state (just id and modified_status, no file location). The terraform variant of the removed_from_config test hits this, since terraform keeps the gone resources in its state. Same guard the jobs loop got in #3710. Reworded the comments in a01eb5d to explain this.

Comment thread bundle/direct/bundle_plan.go Outdated
// This is not an error during deletion, so don't return false here
}
// If the resource is already gone remotely, keep the Delete entry: Delete()
// tolerates missing resources and applying the entry removes it from the state.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to avoid unnecessary call though, for 2 reasons: 1) performance and 2) to reduce chance of failing there because of unexpected error.

Perhaps we should pass this info in plan entry to deploy. E.g. set gone: true on the entry (we cannot use remote == nil since it will also be nil in case of read error and in that case we do want to call delete).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related:

A side effect visible in test output: bundle destroy with the direct engine now lists a remotely deleted resource under "The following resources will be deleted", since its state entry is now explicitly cleaned instead of silently dropped.

This is not great because the message is not correct anymore (we know delete is going to be nop-op). If we track it like I suggest above, we can filter out such plan entries from the warning.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a01eb5d. Delete entries now get gone: true when the planning read returns 404, and apply then only removes the entry from state without calling delete. Read errors still leave gone unset, so delete is attempted as before. The removed_from_config test now also asserts that no delete API calls are made for the gone resources.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a01eb5d. Deploy and destroy approvals now filter out gone entries, so they are not listed as deletions, do not trigger approval prompts, and do not trip prevent_destroy. With that the destroy output no longer diverges per engine, so I reverted the out.destroy split for delete-trashed-out-of-band.

Per review feedback: planning now sets gone on a Delete entry when the
remote read returns 404. Applying such an entry removes it from state
without calling the delete API, and deploy/destroy approval prompts no
longer list it as a deletion or ask for approval, since nothing is
actually deleted.

With that the destroy output for remotely deleted resources matches the
terraform engine again, so the per-engine out.destroy split is reverted
for delete-trashed-out-of-band and the removed_from_config test now also
asserts that no delete API calls are made for gone resources.

Also reword the state-only resource comments in metadata.Compute to
explain the statemgmt.Load injection that makes the guards necessary.

Co-authored-by: Isaac
@simonfaltum

Copy link
Copy Markdown
Member Author

Addressed the review feedback in a01eb5d:

  • Delete entries for resources that no longer exist remotely are now marked gone: true in the plan. Apply removes them from state without calling the delete API, and deploy/destroy approvals skip them (no deletion listing, no approval prompt, no prevent_destroy error), since nothing is actually deleted.
  • With gone entries filtered from the destroy listing, the destroy output for delete-trashed-out-of-band no longer diverges per engine, so the out.destroy.<engine>.txt split from the earlier revision is reverted, and the jobs/remote_delete/destroy output is back to what main has.
  • removed_from_config now records requests and asserts that no delete API calls are made for the gone resources.
  • Reworded the state-only resource comments in metadata.Compute to explain the statemgmt.Load injection that makes the guards necessary.

Validation: ./task fmt-q, ./task lint-q, ./task checks, unit tests for bundle/deployplan, bundle/direct, bundle/deploy/metadata, bundle/phases, and the full TestAccept/bundle acceptance suite pass locally.

@simonfaltum simonfaltum requested a review from denik June 12, 2026 10:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants