
bundle/dms: gate deployment history recording on a dedicated env var#5666

Draft
shreyas-goenka wants to merge 30 commits into main from record-deployment-history-env




@shreyas-goenka shreyas-goenka commented Jun 19, 2026


Why

The experimental "record deployment history" feature (DMS) was gated on DATABRICKS_BUNDLE_MANAGED_STATE, a name that no longer matches what the feature does. This renames the opt-in env var to DATABRICKS_BUNDLE_RECORD_DEPLOYMENT_HISTORY so the gate reads as the feature it enables.

What

  • Rename the env helper env.ManagedState → env.RecordDeploymentHistory and the variable it reads (bundle/env/record_deployment_history.go, renamed from deployment_metadata.go).
  • Update the three call sites: bundle/deploy/lock/lock.go, bundle/statemgmt/state_pull.go, bundle/statemgmt/state_push.go.
  • Update the DMS acceptance EnvMatrix to the new name and regenerate out.test.toml.

Behavior is unchanged — only the env var name. output.txt files are untouched because the code path is identical.
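
For readers unfamiliar with this kind of opt-in gate, here is a minimal sketch of how such a helper might read the renamed variable. The function name `recordDeploymentHistoryEnabled`, the injected `lookup` parameter, and the rule that unset or unparsable values leave the feature off are assumptions for illustration; only the variable name comes from this PR, and the real `env.RecordDeploymentHistory` signature may differ.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Variable name taken from the PR description.
const recordDeploymentHistoryVar = "DATABRICKS_BUNDLE_RECORD_DEPLOYMENT_HISTORY"

// recordDeploymentHistoryEnabled is a hypothetical stand-in for the renamed
// env helper. The lookup function is injected so the gate can be exercised
// without touching the process environment.
// Assumption: unset, empty, or unparsable values leave the feature off.
func recordDeploymentHistoryEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup(recordDeploymentHistoryVar)
	if !ok {
		return false
	}
	enabled, err := strconv.ParseBool(v)
	return err == nil && enabled
}

func main() {
	fmt.Println("record deployment history:", recordDeploymentHistoryEnabled(os.LookupEnv))
}
```

Under these assumptions, setting only the old `DATABRICKS_BUNDLE_MANAGED_STATE` name would no longer enable the feature.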

Note

Based on main, but the branch is stacked on shreyas-goenka/deployment-metadata-service, so the diff currently includes the full DMS stack. It shrinks to just the env-var rename once the DMS branch merges to main. The rename is the last commit.

This pull request and its description were written by Isaac.

shreyas-goenka and others added 30 commits April 15, 2026 00:09
Add server-side deployment locking and state management via the
Deployment Metadata Service (DMS), gated behind DATABRICKS_BUNDLE_MANAGED_STATE=true.

Key changes:
- DeploymentLock interface with factory (DMS or filesystem based on env)
- DMS lock: version-based locking with heartbeat, operation reporting
- State read/write via ListResources/CreateOperation with per-resource state
- withDeploymentLock helper extracts lock boilerplate from deploy/destroy
- Temporary DMS client (libs/tmpdms) mirroring future SDK-generated code
- Mock DMS server for acceptance tests
- 6 acceptance tests covering deploy, destroy, plan, summary, sequential
  deploys, and adding resources with remote state

Co-authored-by: Isaac
LoadStateFromDMS is a state-loading function, not a lock function.
Moving it to statemgmt where it belongs alongside other state
management code.

Co-authored-by: Isaac
When we just created the deployment, LastVersionID is necessarily
empty so we can start at version "1" directly.

Co-authored-by: Isaac
Print requests inline in output.txt and clear remaining requests at
the end of each script so out.requests.txt is not generated.

Also update sequential-deploys test to add/remove resources across
deploys, asserting create and delete operations are captured.

Co-authored-by: Isaac
…erations

- Print DMS requests inline in output.txt via print_requests.py
- Update sequential-deploys to test create/delete across deploys
- Add protoLogs replacement to stabilize flaky telemetry timing
- Regenerate out.requests.txt golden files

Co-authored-by: Isaac
If CreateDeployment fails, the workspace should not contain a dangling
deployment ID pointing to a non-existent server record.

Co-authored-by: Isaac
The old lock.Acquire mutator checked for fs.ErrPermission and
fs.ErrNotExist and reported possible permission denied errors.
This was lost when refactoring to the DeploymentLock interface.

Co-authored-by: Isaac
Add print_requests.py cleanup at the end of each script to clear
remaining recorded requests, preventing out.requests.txt from being
generated as a golden file. DMS requests are already printed inline
in output.txt.

Co-authored-by: Isaac
Open out.requests.txt with explicit utf-8 encoding to handle
non-ASCII characters in request bodies.

Co-authored-by: Isaac
Regenerated with Python 3.11 after fixing the UnicodeDecodeError.
The output.txt files now contain the inline DMS request assertions
without the Python traceback errors.

Co-authored-by: Isaac
Keep resources.json maintained alongside the DMS deployment so users
have a backward path if they hit issues with the DMS-backed flow. Move
DMS-specific bookkeeping (the deployment_id that ties the bundle to a
server-side deployment record) into a sibling managed_service.json so
the two concerns stay cleanly separated.

A single async sender goroutine drains a buffered channel of operation
events; CRUD workers push onto the channel and continue. When the buffer
fills (capacity matches the worker pool), workers block on the send and
naturally back off — this is the only intended source of backpressure
on the worker pool.

Reporting is best-effort: a DMS API failure is logged and the sender
keeps draining. The deploy is no longer aborted when the audit-log
write fails. On a hard process crash, at most ~10 buffered events can
be lost (channel capacity).

Release() drains the reporter before completing the version so the audit
trail is as complete as possible on a clean shutdown.
…5406)

## Why

DMS-backed bundle deployments (run with
`DATABRICKS_BUNDLE_MANAGED_STATE=true DATABRICKS_BUNDLE_ENGINE=direct`)
never set `display_name` when creating the deployment record, so the
field is stored as `null`.

## What

Populate `DisplayName` from `bundle.Config.Bundle.Name` (i.e. the
`bundle.name` from `databricks.yml`) when issuing `CreateDeployment`.
This matches the human-readable label users already see in `databricks
bundle validate`.

## Tests

Existing `acceptance/bundle/dms/*` tests record the `CreateDeployment`
request body via `print_requests.py`; their `output.txt` files
regenerate to assert the new `display_name` field.

This pull request and its description were written by Isaac.

The deployment metadata service now accepts git provenance on a version
(origin_url, branch, commit) per databricks-eng/universe#2009991. Record
it on CreateVersion using the same values the CLI writes to metadata.json.

# Conflicts:
#	bundle/deploy/lock/acquire.go
#	bundle/statemgmt/state_push.go
#	cmd/bundle/utils/process.go
#	libs/testserver/fake_workspace.go
#	libs/testserver/server.go
… for determinism

Main's direct engine applies resources concurrently, so the order of recorded
CreateOperation requests varied between runs. Add --sort to print_requests.py
in the multi-resource DMS tests to make the recorded output deterministic.
Merging main changed several APIs the DMS code predates:
- WorkspaceClient now takes a ctx (workspace_filesystem.go).
- StateDB keeps a separate resource-key->ID index (stateIDs) that is
  authoritative during writes; Data.State is only reconstructed when the WAL
  is merged. LoadStateFromDMS wrote Data.State directly, leaving the index
  empty, so deletes failed with "missing in state". It now builds the
  database and calls OpenWithData, which populates the index.
- The inline operation reporter read the freshly-created resource ID and
  state from Data.State (stale during a deploy). It now reads the ID from
  GetResourceID and the state from the value just applied, so operations
  carry the real resource_id and state and the server round-trips them.

The SDK's JobDeployment/PipelineDeployment now carry deployment_id and
version_id (used to look up deployment metadata in the DMS). Stamp them onto
each job and pipeline so every resource records the deployment and the version
that produced it.

The IDs are only known after the deployment lock is acquired, so a new
deploy-phase mutator (AnnotateDeploymentVersion) sets them, running after the
lock and before the plan. The version is plumbed onto the bundle alongside the
deployment ID.

version_id changes on every deploy, so an ignore_local_changes rule keeps it
from triggering an update on its own; a real update still sends the current
version_id via the full-config Reset/EditPipeline. (Also adjusts isAborted to
errors.AsType for the Go 1.26 linter pulled in by the merge.)
…ion_id

Operations now carry the resource_id and full state (including the deployment
block with deployment_id/version_id), and the out.test.toml dump format changed
on main. sequential-deploys now shows the version_id rule working: deploy 2
bumps the version but the unchanged test_job records no operation.

## Changes
Set `display_name` on the DMS deployment version, using the bundle name
— the same value already recorded on the deployment.

The `Version` proto has a `display_name` field, but the `CreateVersion`
request never populated it, so every version came back with a null
`display_name` even though the deployment had one. This stamps it for
parity.

## Why
`display_name` is set on the deployment (from the bundle name) but was
missing on each version, leaving version records without a
human-readable label. Filling it in keeps deployment and version
metadata consistent.

## Tests
Updated the `bundle/dms` acceptance outputs and confirmed they pass.

This pull request and its description were written by Isaac, an AI
coding agent.

## Changes
Record the bundle target deployment mode on each DMS version. Adds a
`deployment_mode` field (and the `DEPLOYMENT_MODE_DEVELOPMENT` /
`DEPLOYMENT_MODE_PRODUCTION` enum) to `tmpdms.Version`, and sets it in
the `CreateVersion` request from `bundle.mode`.

Not set on the deployment: `Deployment.deployment_mode` is derived
server-side from the most recent version's mode (output-only), so the
CLI only sets it on the version. A target with no `mode` maps to an
empty value, which is omitted (the server treats it as unspecified) — we
don't fabricate a default.

## Why
The SDK's `bundle.Version` already carries `deployment_mode` ("captured
at the time of this version"), but the CLI never populated it, so every
version recorded a null mode. This stamps it so each version records
whether it was a development or production deployment.

## Tests
Added a unit test for the mode mapping (development / production /
unset). The `bundle/dms` acceptance outputs are unchanged because those
targets don't set a mode. Verified live against a workspace: a `mode:
development` target now records `deployment_mode:
DEPLOYMENT_MODE_DEVELOPMENT` on the created version.

This pull request and its description were written by Isaac, an AI
coding agent.

Stamp the deployment's workspace location onto the DMS Version, mirroring
the values the CLI already writes to metadata.json (see
bundle/deploy/metadata/compute.go): workspace root_path/file_path, the
sync root as file_path for source-linked deployments, and the workspace
git folder path. This makes the deployment metadata service an equivalent
source of truth for a bundle's workspace location, alongside the existing
git_info/display_name/target_name fields.

Co-authored-by: Isaac
Rename the experimental DMS opt-in env var from
DATABRICKS_BUNDLE_MANAGED_STATE to DATABRICKS_BUNDLE_RECORD_DEPLOYMENT_HISTORY
so it matches the feature's name (recording deployment history). The
env helper, all three call sites, and the DMS acceptance tests are
updated; behavior is unchanged.

Co-authored-by: Isaac
@shreyas-goenka shreyas-goenka changed the base branch from shreyas-goenka/deployment-metadata-service to main June 19, 2026 15:27
@eng-dev-ecosystem-bot


Integration test report

Commit: f186af0

Run: 27833767485

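| Env | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
| --- | --- | --- | --- | --- | --- | --- |
| 🔄 aws linux | 2 | 7 | 13 | 259 | 927 | 10:53 |
| 💚 aws windows | | 7 | 13 | 263 | 925 | 7:54 |
| 💚 aws-ucws linux | | 7 | 13 | 357 | 841 | 6:20 |
| 💚 aws-ucws windows | | 7 | 13 | 359 | 839 | 9:11 |
| 💚 azure linux | | 1 | 15 | 264 | 925 | 5:36 |
| 💚 azure windows | | 1 | 15 | 266 | 923 | 8:05 |
| 💚 azure-ucws linux | | 1 | 15 | 362 | 837 | 6:48 |
| 💚 azure-ucws windows | | 1 | 15 | 364 | 835 | 8:04 |
| 💚 gcp linux | | 1 | 15 | 260 | 928 | 6:09 |
| 💚 gcp windows | | 1 | 15 | 262 | 926 | 8:56 |
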
22 interesting tests: 13 SKIP, 7 RECOVERED, 2 flaky

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 💚 TestAccept | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 💚 R | 💚 R | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 💚 R | 💚 R | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/ssh/connection | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🔄 TestSecretsPutSecretBytesValue | 🔄 f | ✅ p | 🙈 s | 🙈 s | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p |
| 🔄 TestSecretsPutSecretStringValue | 🔄 f | ✅ p | 🙈 s | 🙈 s | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p |

Top 27 slowest tests (at least 2 minutes):

| duration | env | testname |
| --- | --- | --- |
| 5:00 | gcp windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 4:46 | aws linux | TestSecretsPutSecretStringValue |
| 4:24 | gcp linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 4:12 | gcp windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 4:00 | gcp linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:21 | azure windows | TestAccept |
| 3:18 | aws-ucws windows | TestAccept |
| 3:10 | aws-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:10 | aws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:07 | azure-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:06 | azure windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:04 | gcp windows | TestAccept |
| 3:01 | aws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:59 | azure-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:58 | aws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:55 | aws windows | TestAccept |
| 2:50 | azure linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:48 | azure windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:46 | aws-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:43 | azure linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:35 | aws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:34 | aws-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:30 | aws windows | TestSecretsPutSecretStringValue |
| 2:25 | azure-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:24 | azure-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:23 | azure-ucws windows | TestAccept |
| 2:20 | aws-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
