AIR CLI Integration: `air run` Command Pt. 1 - Add GPU accelerator type and compute config model by riddhibhagwat-db · Pull Request #5602 · databricks/cli

riddhibhagwat-db · 2026-06-15T03:43:27Z

Changes

Adds experimental/air/cmd/compute.go, which holds the gpuType model and the compute-block validation that the air run configuration layer depends on.
Specifically:

the training service accelerator types are added (GPU_1xA10, GPU_8xH100, GPU_1xH100)
parseGPUType resolves a YAML accelerator type string to a GPU type
gpusPerNode returns the per-node partition count based on the type name
computeConfig and validate() are the port of the Python ComputeConfig validators
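The pieces listed above could look roughly like the following sketch. It is not the code from compute.go: the constant names gpuType1xA10 and gpuType1xH100, the per-node count of 1 for the 1x types, and the parseGPUType error message are assumptions made for illustration. Only gpuType8xH100, its count of 8, and the "invalid GPU type" error text appear in the review diff.

```go
package main

import "fmt"

// gpuType is an accelerator type name as written in the run YAML.
type gpuType string

const (
	gpuType1xA10  gpuType = "GPU_1xA10"  // constant name assumed
	gpuType1xH100 gpuType = "GPU_1xH100" // constant name assumed
	gpuType8xH100 gpuType = "GPU_8xH100"
)

// parseGPUType accepts only an exact, case-sensitive match, so a typo or a
// legacy name such as "h100_80gb" is rejected instead of being guessed at.
func parseGPUType(s string) (gpuType, error) {
	switch g := gpuType(s); g {
	case gpuType1xA10, gpuType1xH100, gpuType8xH100:
		return g, nil
	}
	// Error wording is an assumption.
	return "", fmt.Errorf("unsupported accelerator type %q", s)
}

// gpusPerNode returns the number of GPUs in one node of the given type.
// The counts for the 1x types are inferred from the type names.
func gpusPerNode(g gpuType) (int, error) {
	switch g {
	case gpuType1xA10, gpuType1xH100:
		return 1, nil
	case gpuType8xH100:
		return 8, nil
	}
	// Defensive: not reachable for values produced by parseGPUType.
	return 0, fmt.Errorf("invalid GPU type %q", string(g))
}

func main() {
	for _, s := range []string{"GPU_8xH100", "gpu_8xh100", "h100_80gb", ""} {
		g, err := parseGPUType(s)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		n, _ := gpusPerNode(g)
		fmt.Printf("%s: %d GPU(s) per node\n", g, n)
	}
}
```

Keeping the match exact means a miscased or legacy name fails at parse time rather than silently mapping to a different accelerator.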

Why

This is the first, leaf-most piece of the air run port for the AIR CLI and the root of the config validation layer's dependencies. The compute piece depends on nothing else, so it lands first as a small, fully unit-tested unit.
Note that parsing is exact and case-sensitive, since a typo in the user's YAML could misroute the run. Additionally, only GPU_* training service types are supported: legacy MAPI types (e.g. h100_80gb) are intentionally dropped in this port, so no new runs can use the MAPI path. Rendering them in get is unaffected, since format.go keeps its own display map for historical runs.

Tests

Table-driven unit tests in compute_test.go: parseGPUType for valid types and rejected inputs (wrong casing, legacy types, unknown, empty); gpusPerNode counts plus its invalid-type error; and computeConfig.validate across valid configs and every failure mode (unknown/legacy type, non-positive count, non-multiple count, dual-pool conflict). go build, go test, and golangci-lint are clean.

Add the experimental `air` command group as the Go port surface for the Python `air` CLI. Every subcommand (run, status, list, logs, cancel, register-image) is registered as a stub that returns a not-implemented error; the real implementations land in later milestones. The package lives under experimental/air/cmd (imported as aircmd), matching the layout of the other experimental features (aitools, genie, postgres); cmd/experimental/ keeps only the dispatcher. TEST_PACKAGES in Taskfile.yml gains ./experimental/air/... so the unit tests keep running after the move. Includes unit tests for the command-tree wiring and the not-implemented stubs, plus an acceptance test exercising the stubs end-to-end. Co-authored-by: Isaac

Rename the run-details subcommand from `status` to `get`, matching the Python air CLI's current `air get run` naming (it replaced `get status`). Renames the file, constructor, command name, and updates the stub/help/unimplemented tests and goldens accordingly. Co-authored-by: Isaac

Implement the read-only run-details command (renamed from `status` to `get`). It fetches a job run via the Jobs API and renders the run's status, start time, duration, retries, experiment, accelerators, dashboard URL, MLflow deep-link, and a foreach/sweep summary. Output is the air-style {v, ts, data} JSON envelope under -o json, or a text view. Renames the command-level identifiers (status -> get) while keeping the run's "status" field/label. Adds format/mlflow/sweep/output helpers with unit tests and an acceptance test, and drops `get` from the not-implemented stub coverage. Co-authored-by: Isaac

Co-authored-by: Isaac

eng-dev-ecosystem-bot · 2026-06-15T04:19:20Z

Integration test report

Commit: 9efd3d1

Run: 27724619645

	Env	🔄flaky	💚RECOVERED	🙈SKIP	✅pass	🙈skip	Time
💚	aws linux		7	14	264	998	6:41
💚	aws windows		7	14	266	996	8:00
💚	aws-ucws linux		7	14	360	912	6:22
💚	aws-ucws windows		7	14	362	910	9:00
💚	azure linux		1	16	267	996	7:02
💚	azure windows		1	16	269	994	7:12
💚	azure-ucws linux		1	16	365	908	8:10
🔄	azure-ucws windows	2	1	16	365	906	8:23
💚	gcp linux		1	16	263	999	7:00
💚	gcp windows		1	16	265	997	9:11

23 interesting tests: 14 SKIP, 7 RECOVERED, 2 flaky

	Test Name	aws linux	aws windows	aws-ucws linux	aws-ucws windows	azure linux	azure windows	azure-ucws linux	azure-ucws windows	gcp linux	gcp windows
💚	TestAccept	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R
🙈	TestAccept/bundle/invariant/no_drift	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/permissions	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions	💚R	💚R	💚R	💚R	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct	💚R	💚R	💚R	💚R
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform	💚R	💚R	💚R	💚R
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions	💚R	💚R	💚R	💚R	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct	💚R	💚R	💚R	💚R
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform	💚R	💚R	💚R	💚R
🙈	TestAccept/bundle/resources/postgres_branches/basic	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/recreate	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/replace_existing	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/update_protected	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/without_branch_id	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_endpoints/basic	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_endpoints/recreate	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_projects/update_display_name	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/synced_database_tables/basic	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/ssh/connection	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🔄	TestFsCpFileToFile	✅p	✅p	✅p	✅p	✅p	✅p	✅p	🔄f	✅p	✅p
🔄	TestFsCpFileToFile/local_to_uc-volumes	🙈s	🙈s	✅p	✅p	🙈s	🙈s	✅p	🔄f	🙈s	🙈s

Top 29 slowest tests (at least 2 minutes):

duration	env	testname
4:10	gcp windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:10	gcp linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:10	gcp windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:00	gcp linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:28	azure-ucws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:26	aws-ucws windows	TestAccept
3:22	aws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:21	azure-ucws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:19	aws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:19	aws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:04	azure linux	TestSecretsPutSecretStringValue
3:04	aws-ucws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:04	azure-ucws linux	TestSecretsPutSecretStringValue
3:02	aws-ucws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:00	azure windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:59	aws-ucws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:48	azure windows	TestAccept
2:47	azure-ucws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:46	aws windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:45	aws-ucws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:44	azure linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:40	azure-ucws windows	TestAccept
2:39	gcp windows	TestAccept
2:38	aws windows	TestAccept
2:34	azure-ucws linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:34	azure windows	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:24	azure linux	TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:13	gcp linux	TestSecretsPutSecretStringValue
2:02	aws linux	TestSecretsPutSecretStringValue

pardis-beikzadeh-db · 2026-06-16T11:59:03Z

+	NodePoolID      string `yaml:"node_pool_id"`
+	PoolName        string `yaml:"pool_name"`


Hey, let's leave out any pool-related features from the Go port. cc @ben-hansen-db @maggiewang-db I'd cc Yu Peng but he doesn't have a -db GH account?

Removed, thanks!

pardis-beikzadeh-db · 2026-06-16T12:00:33Z

+	perNode, err := gpusPerNode(g)
+	if err != nil {
+		return err
+	}
+	if c.NumAccelerators%perNode != 0 {
+		return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d", c.AcceleratorType, perNode, c.NumAccelerators)
+	}


I'm of the opinion this kind of check should be done in the backend. @maggiewang-db @ben-hansen-db @vinchenzo-db wdyt? Can we do that easily using Training Service logic?

I think that based on the project milestones and as I discussed with Maggie yesterday, we want to port this in phases. As written in the project doc, we want to first port the run functionality directly as is (including the validation) and then move the validation & add handlers to the backend in milestone 3.

I agree. But my plan is to do that later in Milestone 3.2 after the initial lift and shift.
It needs some design to decide which validations to move to backend, which validations to keep in client

sounds good
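For context on what this thread is debating, here is a minimal sketch of the client-side check as it stands after pool fields were removed. The perNodeByType map is a hypothetical stand-in for parseGPUType plus gpusPerNode, and the accelerator_type YAML tag and the wording of the first two errors are assumptions. The multiple-of-per-node message is taken from the diff above.

```go
package main

import "fmt"

// computeConfig mirrors the compute block of the run YAML.
// The accelerator_type tag is assumed; num_accelerators appears in the diff.
type computeConfig struct {
	AcceleratorType string `yaml:"accelerator_type"`
	NumAccelerators int    `yaml:"num_accelerators"`
}

// perNodeByType is a hypothetical lookup standing in for
// parseGPUType + gpusPerNode; the 1x counts are inferred from the names.
var perNodeByType = map[string]int{
	"GPU_1xA10":  1,
	"GPU_1xH100": 1,
	"GPU_8xH100": 8,
}

// validate rejects unknown types, non-positive counts, and counts that do
// not fill whole nodes.
func (c computeConfig) validate() error {
	perNode, ok := perNodeByType[c.AcceleratorType]
	if !ok {
		return fmt.Errorf("unsupported compute.accelerator_type %q", c.AcceleratorType)
	}
	if c.NumAccelerators <= 0 {
		return fmt.Errorf("compute.num_accelerators must be positive, got %d", c.NumAccelerators)
	}
	if c.NumAccelerators%perNode != 0 {
		return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d",
			c.AcceleratorType, perNode, c.NumAccelerators)
	}
	return nil
}

func main() {
	cases := []computeConfig{
		{AcceleratorType: "GPU_8xH100", NumAccelerators: 16},
		{AcceleratorType: "GPU_8xH100", NumAccelerators: 12},
		{AcceleratorType: "GPU_1xA10", NumAccelerators: 0},
		{AcceleratorType: "h100_80gb", NumAccelerators: 8},
	}
	for _, c := range cases {
		fmt.Printf("%+v: %v\n", c, c.validate())
	}
}
```

Because each check needs only the config itself, the same rules could later move to the backend without changing the YAML shape, which is the Milestone 3 question raised above.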

Rename the RUN_ID arg placeholder to JOB_RUN_ID across get/logs/cancel to disambiguate it from other run identifiers. Hide the `logs --review` flag to match the Python CLI (help=argparse.SUPPRESS), and add the `-i` shorthand for `register-image --interactive-authenticate`. Co-authored-by: Isaac

maggiewang-db · 2026-06-17T05:52:16Z

+	case gpuType8xH100:
+		return 8, nil
+	}
+	return 0, fmt.Errorf("invalid GPU type %q", string(g))


Nit: By the time validate() reaches gpusPerNode(), parseGPUType() has already guaranteed g is valid.
It's ok to leave the code as is to be defensive. Just add a comment this shouldn't be reachable.

Added comment, thanks

The training-config block is command result data, but it was emitted via cmdio.LogString, which targets stderr. Write it to cmd.OutOrStdout() instead so it lands on stdout, matching the Python `air get`. Download/read failures stay on stderr as warnings. Co-authored-by: Isaac

`air get` derived Submitted and Duration from run-level start/end and truncated milliseconds to seconds. Port Python's _reported_attempt_timing so a retried run reports its latest attempt, and round to the nearest second to match Python's round(). Drops the run-level RunDuration shortcut, which diverged on retries. Co-authored-by: Isaac

mlflowURL resolved runs/get-output against Tasks[0], linking a retried run to its stale first attempt. Use the last task (latest attempt) to match Python (jobs_api_client.py:68). Co-authored-by: Isaac

…N with Python

In -o json mode, error paths now emit the structured error envelope ({v, ts, error:{code, kind, message, retryable}}) and exit non-zero, matching the Python air CLI's print_json_error instead of letting the framework print a bare "Error: ..." string. Covers invalid RUN_ID, run-not-found, backend failures, and client/auth failures (wrapped PreRunE).

Also align the success envelope with the Python CLI:
- dashboard_url: construct {host}/jobs/runs/{id}?o={workspace_id} (via CurrentWorkspaceID) instead of using the API's run_page_url
- started_at: datetime.isoformat() form ("+00:00" with microseconds), not RFC3339 "Z"
- duration_seconds: rounded half-to-even to match Python's round()
- use run-level start/end times for started_at and duration_seconds, dropping the last-attempt preference, which had no Python equivalent

Co-authored-by: Isaac
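As an illustration of the error envelope shape named in this commit message, the sketch below marshals {v, ts, error:{code, kind, message, retryable}} with encoding/json. Only the key names come from the message. The Go type names, the field types (integer v, string ts, boolean retryable), the version value 1, and the sample code/kind strings are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// errorDetail holds the keys listed under "error" in the commit message.
// Field types are assumed.
type errorDetail struct {
	Code      string `json:"code"`
	Kind      string `json:"kind"`
	Message   string `json:"message"`
	Retryable bool   `json:"retryable"`
}

// errorEnvelope is the top-level {v, ts, error} object. The integer version
// and string timestamp are assumptions.
type errorEnvelope struct {
	V     int         `json:"v"`
	TS    string      `json:"ts"`
	Error errorDetail `json:"error"`
}

// buildErrorJSON serializes one envelope; version 1 is a placeholder.
func buildErrorJSON(ts string, d errorDetail) ([]byte, error) {
	return json.Marshal(errorEnvelope{V: 1, TS: ts, Error: d})
}

func main() {
	// Sample values are hypothetical, not real CLI output.
	d := errorDetail{Code: "RUN_NOT_FOUND", Kind: "not_found", Message: "run 123 not found"}
	b, err := buildErrorJSON("2026-06-15T00:00:00+00:00", d)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// A real command would also return a non-zero exit status here.
	fmt.Println(string(b))
}
```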

Revert the run-level timing change from the previous commit: started_at and duration_seconds read from the last task's window again (reportedTiming), matching the released Python `air` output, which reports the latest attempt. The isoformat timestamp ("+00:00") and half-to-even rounding are kept. Co-authored-by: Isaac
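The two formatting rules kept by this commit can be sketched in Go as follows. The function and constant names are hypothetical. math.RoundToEven rounds ties to the nearest even integer, which is also how Python 3's round() treats ties, and the "-07:00" layout element prints UTC as "+00:00" rather than "Z". One known difference: this layout always prints six fractional digits, whereas Python's datetime.isoformat() omits them when the microsecond field is zero.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// isoformatLayout prints the offset as "+00:00" (never "Z") with six
// fractional digits. The name is hypothetical.
const isoformatLayout = "2006-01-02T15:04:05.000000-07:00"

// roundSeconds converts milliseconds to whole seconds, rounding exact
// halves to the nearest even second.
func roundSeconds(ms int64) int64 {
	return int64(math.RoundToEven(float64(ms) / 1000))
}

// isoformatUTC formats a Unix-millisecond timestamp in UTC.
func isoformatUTC(ms int64) string {
	return time.UnixMilli(ms).UTC().Format(isoformatLayout)
}

func main() {
	for _, ms := range []int64{1499, 1500, 2500, 3500} {
		fmt.Printf("%d ms rounds to %d s\n", ms, roundSeconds(ms))
	}
	fmt.Println(isoformatUTC(1500))
}
```

Plain truncation would report 2500 ms as 2 s and 3500 ms as 3 s. Half-to-even gives 2 s and 4 s, which is why the distinction shows up when comparing against Python output.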

The runs/get-output call passed run_id via the query-param arg and a nil request body, which this endpoint rejects with "expected a map", so the MLflow link was never produced for completed runs. Pass run_id through the request arg instead (the SDK serializes it to the query string for GET), which sends a valid body and returns the gen_ai_compute_output run info. Failed runs without MLflow output still yield no link: get-output 404s for them, so mlflowURL returns nil as before. Co-authored-by: Isaac

…output

Nest the run-status command under a `get` parent group so the command is `air get run JOB_RUN_ID`, mirroring the Python CLI (the JOB_RUN_ID arg name matches the sibling air commands and avoids confusion with the MLflow run id).

Align the text output with Python's `air get run`: lead with the dashboard link (hyperlinked, falling back to the bare URL off a terminal) followed by a gap, then the training config, then the status table. The table uses Python's field order, "N/A" for empty cells, a "2006-01-02 15:04 UTC" Submitted timestamp, and terminal hyperlinks on the Run ID, Experiment, and MLflow Run cells (the MLflow Run cell shows the run's name from the MLflow REST API). The JSON envelope is unchanged.

Also reformat the training-config YAML shown in text mode so multi-line fields (e.g. command) render as block literals instead of escaped one-liners.

Co-authored-by: Isaac

Add compute.go: the gpuType model and compute-block validation the upcoming `air run` config layer depends on. Defines the canonical GPU_* accelerator types, parseGPUType (exact, case-sensitive), gpusPerNode (partition counts), and computeConfig.validate (positive count, multiple-of-per-node, mutually exclusive node_pool_id/pool_name). Co-authored-by: Isaac

The training compute config no longer supports pool placement, so remove the node_pool_id and pool_name fields and the validation that rejected setting both. Co-authored-by: Isaac

riddhibhagwat-db added 5 commits June 12, 2026 20:40

experimental/air: rename stale TestBuildStatusData to TestBuildGetData

89042d0

Co-authored-by: Isaac

experimental/air: apply testifylint fixes in get/format tests

c99239c

Co-authored-by: Isaac

riddhibhagwat-db temporarily deployed to test-trigger-is June 15, 2026 03:44 — with GitHub Actions Inactive

riddhibhagwat-db requested review from maggiewang-db and simonfaltum June 15, 2026 03:45

riddhibhagwat-db requested a review from ben-hansen-db June 15, 2026 18:11

pardis-beikzadeh-db reviewed Jun 16, 2026

riddhibhagwat-db added 2 commits June 16, 2026 18:29

Merge branch 'main' into air-integration-m0

8a97e0f

riddhibhagwat-db temporarily deployed to test-trigger-is June 16, 2026 18:42 — with GitHub Actions Inactive

Merge branch 'main' into air-integration-m0

e04b698

maggiewang-db approved these changes Jun 17, 2026

pardis-beikzadeh-db approved these changes Jun 17, 2026

riddhibhagwat-db added 10 commits June 17, 2026 20:58

experimental/air: rename stale TestBuildStatusData to TestBuildGetData

3883791

Co-authored-by: Isaac

experimental/air: apply testifylint fixes in get/format tests

472a1fe

Co-authored-by: Isaac

experimental/air: link MLflow output to the latest attempt

d3bb64b

mlflowURL resolved runs/get-output against Tasks[0], linking a retried run to its stale first attempt. Use the last task (latest attempt) to match Python (jobs_api_client.py:68). Co-authored-by: Isaac

riddhibhagwat-db force-pushed the air-integration-m1-1 branch from 7af56f3 to a69e0d3 Compare June 17, 2026 21:03

riddhibhagwat-db added 2 commits June 17, 2026 21:24

experimental/air: drop node pool / pool name compute fields

62be1a1

The training compute config no longer supports pool placement, so remove the node_pool_id and pool_name fields and the validation that rejected setting both. Co-authored-by: Isaac

riddhibhagwat-db force-pushed the air-integration-m2-1 branch from f8477fc to 62be1a1 Compare June 17, 2026 21:26

riddhibhagwat-db changed the base branch from air-integration-m1-1 to air-cli June 17, 2026 22:39

riddhibhagwat-db changed the base branch from air-cli to air-integration-m1-1 June 17, 2026 22:40

Merge branch 'air-integration-m1-1' into air-integration-m2-1

9efd3d1

riddhibhagwat-db temporarily deployed to test-trigger-is June 17, 2026 22:46 — with GitHub Actions Inactive

Base automatically changed from air-integration-m1-1 to air-cli June 18, 2026 16:46

Merge branch 'air-cli' into air-integration-m2-1

6850d23

riddhibhagwat-db temporarily deployed to test-trigger-is June 18, 2026 16:50 — with GitHub Actions Inactive

riddhibhagwat-db merged commit f1601b2 into air-cli Jun 18, 2026
9 checks passed

riddhibhagwat-db deleted the air-integration-m2-1 branch June 18, 2026 16:51

riddhibhagwat-db merged 22 commits into air-cli from air-integration-m2-1

4 participants
