AIR CLI Integration: air run Command Pt. 1 - Add GPU accelerator type and compute config model #5602

Merged
riddhibhagwat-db merged 22 commits into air-cli from air-integration-m2-1 on Jun 18, 2026
@riddhibhagwat-db

Changes

Adds experimental/air/cmd/compute.go, which contains the gpuType model and the compute-block validation that the air run configuration layer depends on.
Specifically:

  • adds the training service accelerator types (GPU_1xA10, GPU_8xH100, GPU_1xH100)
  • parseGPUType resolves a YAML accelerator type string
  • gpusPerNode returns the per-node partition count based on the type name
  • computeConfig and validate() port the Python ComputeConfig validators
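
For readers without the diff open, the pieces listed above fit together roughly as follows. This is a minimal sketch assembled from the snippets quoted in the review threads, not the merged file: the constant names other than gpuType8xH100, the YAML tags, the per-node count of 1 for the 1x types, and the wording of the first two error messages are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// gpuType is a training-service accelerator type as written in the run YAML.
type gpuType string

// Only gpuType8xH100 appears in the quoted diff; the other two constant
// names are assumed for this sketch.
const (
	gpuType1xA10  gpuType = "GPU_1xA10"
	gpuType1xH100 gpuType = "GPU_1xH100"
	gpuType8xH100 gpuType = "GPU_8xH100"
)

// parseGPUType resolves a YAML accelerator type string with an exact,
// case-sensitive match, so "gpu_8xh100" and legacy names such as
// "h100_80gb" are rejected.
func parseGPUType(s string) (gpuType, error) {
	switch g := gpuType(s); g {
	case gpuType1xA10, gpuType1xH100, gpuType8xH100:
		return g, nil
	}
	return "", fmt.Errorf("unsupported accelerator type %q", s)
}

// gpusPerNode returns the per-node GPU count implied by the type name.
// The value 1 for the 1x types is an assumption read off the names.
func gpusPerNode(g gpuType) (int, error) {
	switch g {
	case gpuType1xA10, gpuType1xH100:
		return 1, nil
	case gpuType8xH100:
		return 8, nil
	}
	return 0, fmt.Errorf("invalid GPU type %q", string(g))
}

// computeConfig mirrors the compute block; the YAML tags are assumed.
type computeConfig struct {
	AcceleratorType string `yaml:"accelerator_type"`
	NumAccelerators int    `yaml:"num_accelerators"`
}

// validate checks the type, then the count, then the multiple-of-per-node rule.
func (c computeConfig) validate() error {
	g, err := parseGPUType(c.AcceleratorType)
	if err != nil {
		return err
	}
	if c.NumAccelerators <= 0 {
		return errors.New("compute.num_accelerators must be positive")
	}
	perNode, err := gpusPerNode(g)
	if err != nil {
		return err
	}
	if c.NumAccelerators%perNode != 0 {
		return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d", c.AcceleratorType, perNode, c.NumAccelerators)
	}
	return nil
}

func main() {
	for _, c := range []computeConfig{
		{"GPU_8xH100", 16}, // valid: two full nodes
		{"GPU_8xH100", 12}, // rejected: not a multiple of 8
		{"gpu_8xh100", 8},  // rejected: wrong casing
	} {
		fmt.Printf("%+v -> %v\n", c, c.validate())
	}
}
```

Running the sketch prints one line per config; only the first validates cleanly.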

Why

This is the first, leaf-most piece of the air run port for the AIR CLI and the root of the config validation layer's dependencies. The compute piece does not depend on anything else, so it lands first as a small, fully unit-tested unit.
Note that parsing is exact and case-sensitive, since a typo in the user's YAML could misroute the run. Additionally, we only support GPU_* training service types. Legacy MAPI types (e.g. h100_80gb) are intentionally unsupported in this port, so no new runs can use the MAPI path. Rendering them in get is unaffected since format.go keeps its own display map for historical runs.

Tests

Table-driven unit tests in compute_test.go: parseGPUType for valid types and rejected inputs (wrong casing, legacy types, unknown, empty); gpusPerNode counts plus its invalid-type error; and computeConfig.validate across valid configs and every failure mode (unknown/legacy type, non-positive count, non-multiple count, dual-pool conflict). go build, go test, and golangci-lint are clean.

Add the experimental `air` command group as the Go port surface for the
Python `air` CLI. Every subcommand (run, status, list, logs, cancel,
register-image) is registered as a stub that returns a not-implemented
error; the real implementations land in later milestones.

The package lives under experimental/air/cmd (imported as aircmd), matching
the layout of the other experimental features (aitools, genie, postgres);
cmd/experimental/ keeps only the dispatcher. TEST_PACKAGES in Taskfile.yml
gains ./experimental/air/... so the unit tests keep running after the move.

Includes unit tests for the command-tree wiring and the not-implemented
stubs, plus an acceptance test exercising the stubs end-to-end.

Co-authored-by: Isaac
Rename the run-details subcommand from `status` to `get`, matching the Python
air CLI's current `air get run` naming (it replaced `get status`). Renames the
file, constructor, command name, and updates the stub/help/unimplemented tests
and goldens accordingly.

Co-authored-by: Isaac
Implement the read-only run-details command (renamed from `status` to `get`).
It fetches a job run via the Jobs API and renders the run's status, start time,
duration, retries, experiment, accelerators, dashboard URL, MLflow deep-link,
and a foreach/sweep summary. Output is the air-style {v, ts, data} JSON envelope
under -o json, or a text view.

Renames the command-level identifiers (status -> get) while keeping the run's
"status" field/label. Adds format/mlflow/sweep/output helpers with unit tests
and an acceptance test, and drops `get` from the not-implemented stub coverage.

Co-authored-by: Isaac
@eng-dev-ecosystem-bot

eng-dev-ecosystem-bot commented Jun 15, 2026

Collaborator

Integration test report

Commit: 9efd3d1

Run: 27724619645

| Env | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|
| 💚 aws linux | | 7 | 14 | 264 | 998 | 6:41 |
| 💚 aws windows | | 7 | 14 | 266 | 996 | 8:00 |
| 💚 aws-ucws linux | | 7 | 14 | 360 | 912 | 6:22 |
| 💚 aws-ucws windows | | 7 | 14 | 362 | 910 | 9:00 |
| 💚 azure linux | | 1 | 16 | 267 | 996 | 7:02 |
| 💚 azure windows | | 1 | 16 | 269 | 994 | 7:12 |
| 💚 azure-ucws linux | | 1 | 16 | 365 | 908 | 8:10 |
| 🔄 azure-ucws windows | 2 | 1 | 16 | 365 | 906 | 8:23 |
| 💚 gcp linux | | 1 | 16 | 263 | 999 | 7:00 |
| 💚 gcp windows | | 1 | 16 | 265 | 997 | 9:11 |
23 interesting tests: 14 SKIP, 7 RECOVERED, 2 flaky
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
💚​ TestAccept 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/invariant/no_drift 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/replace_existing 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/ssh/connection 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🔄​ TestFsCpFileToFile ✅​p ✅​p ✅​p ✅​p ✅​p ✅​p ✅​p 🔄​f ✅​p ✅​p
🔄​ TestFsCpFileToFile/local_to_uc-volumes 🙈​s 🙈​s ✅​p ✅​p 🙈​s 🙈​s ✅​p 🔄​f 🙈​s 🙈​s
Top 29 slowest tests (at least 2 minutes):
duration env testname
4:10 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:10 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:10 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:00 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:28 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:26 aws-ucws windows TestAccept
3:22 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:21 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:19 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:19 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:04 azure linux TestSecretsPutSecretStringValue
3:04 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:04 azure-ucws linux TestSecretsPutSecretStringValue
3:02 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:00 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:59 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:48 azure windows TestAccept
2:47 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:46 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:45 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:44 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:40 azure-ucws windows TestAccept
2:39 gcp windows TestAccept
2:38 aws windows TestAccept
2:34 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:34 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:24 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:13 gcp linux TestSecretsPutSecretStringValue
2:02 aws linux TestSecretsPutSecretStringValue

Comment thread experimental/air/cmd/compute.go Outdated
Comment on lines +58 to +59
NodePoolID string `yaml:"node_pool_id"`
PoolName string `yaml:"pool_name"`

Hey, let's leave out any pool-related features from the Go port. cc @ben-hansen-db @maggiewang-db I'd cc Yu Peng but he doesn't have a -db GH account?

Author

Removed, thanks!

Comment on lines +73 to +79
perNode, err := gpusPerNode(g)
if err != nil {
	return err
}
if c.NumAccelerators%perNode != 0 {
	return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d", c.AcceleratorType, perNode, c.NumAccelerators)
}

I'm of the opinion this kind of check should be done in the backend. @maggiewang-db @ben-hansen-db @vinchenzo-db wdyt? Can we do that easily using Training Service logic?

Author

I think that based on the project milestones and as I discussed with Maggie yesterday, we want to port this in phases. As written in the project doc, we want to first port the run functionality directly as is (including the validation) and then move the validation & add handlers to the backend in milestone 3.

I agree. But my plan is to do that later in Milestone 3.2 after the initial lift and shift.
It needs some design to decide which validations to move to backend, which validations to keep in client

sounds good

Rename the RUN_ID arg placeholder to JOB_RUN_ID across get/logs/cancel to
disambiguate it from other run identifiers. Hide the `logs --review` flag to
match the Python CLI (help=argparse.SUPPRESS), and add the `-i` shorthand for
`register-image --interactive-authenticate`.

Co-authored-by: Isaac
case gpuType8xH100:
	return 8, nil
}
return 0, fmt.Errorf("invalid GPU type %q", string(g))

Nit: By the time validate() reaches gpusPerNode(), parseGPUType() has already guaranteed g is valid.
It's ok to leave the code as is to be defensive. Just add a comment that this shouldn't be reachable.

Author

Added comment, thanks

Implement the read-only run-details command (renamed from `status` to `get`).
It fetches a job run via the Jobs API and renders the run's status, start time,
duration, retries, experiment, accelerators, dashboard URL, MLflow deep-link,
and a foreach/sweep summary. Output is the air-style {v, ts, data} JSON envelope
under -o json, or a text view.

Renames the command-level identifiers (status -> get) while keeping the run's
"status" field/label. Adds format/mlflow/sweep/output helpers with unit tests
and an acceptance test, and drops `get` from the not-implemented stub coverage.

Co-authored-by: Isaac
The training-config block is command result data, but it was emitted via
cmdio.LogString, which targets stderr. Write it to cmd.OutOrStdout() instead so
it lands on stdout, matching the Python `air get`. Download/read failures stay
on stderr as warnings.

Co-authored-by: Isaac
`air get` derived Submitted and Duration from run-level start/end and truncated
milliseconds to seconds. Port Python's _reported_attempt_timing so a retried run
reports its latest attempt, and round to the nearest second to match Python's
round(). Drops the run-level RunDuration shortcut, which diverged on retries.

Co-authored-by: Isaac
mlflowURL resolved runs/get-output against Tasks[0], linking a retried run to its
stale first attempt. Use the last task (latest attempt) to match Python
(jobs_api_client.py:68).

Co-authored-by: Isaac
…N with Python

In -o json mode, error paths now emit the structured error envelope
({v, ts, error:{code, kind, message, retryable}}) and exit non-zero, matching
the Python air CLI's print_json_error instead of letting the framework print a
bare "Error: ..." string. Covers invalid RUN_ID, run-not-found, backend
failures, and client/auth failures (wrapped PreRunE).

Also align the success envelope with the Python CLI:
- dashboard_url: construct {host}/jobs/runs/{id}?o={workspace_id} (via
  CurrentWorkspaceID) instead of using the API's run_page_url
- started_at: datetime.isoformat() form ("+00:00" with microseconds), not
  RFC3339 "Z"
- duration_seconds: rounded half-to-even to match Python's round()
- use run-level start/end times for started_at and duration_seconds, dropping
  the last-attempt preference, which had no Python equivalent

Co-authored-by: Isaac
Revert the run-level timing change from the previous commit: started_at and
duration_seconds read from the last task's window again (reportedTiming),
matching the released Python `air` output, which reports the latest attempt.
The isoformat timestamp ("+00:00") and half-to-even rounding are kept.

Co-authored-by: Isaac
The runs/get-output call passed run_id via the query-param arg and a nil
request body, which this endpoint rejects with "expected a map", so the
MLflow link was never produced for completed runs. Pass run_id through the
request arg instead (the SDK serializes it to the query string for GET),
which sends a valid body and returns the gen_ai_compute_output run info.

Failed runs without MLflow output still yield no link: get-output 404s for
them, so mlflowURL returns nil as before.

Co-authored-by: Isaac
…output

Nest the run-status command under a `get` parent group so the command is
`air get run JOB_RUN_ID`, mirroring the Python CLI (the JOB_RUN_ID arg name
matches the sibling air commands and avoids confusion with the MLflow run id).

Align the text output with Python's `air get run`: lead with the dashboard
link (hyperlinked, falling back to the bare URL off a terminal) followed by a
gap, then the training config, then the status table. The table uses Python's
field order, "N/A" for empty cells, a "2006-01-02 15:04 UTC" Submitted
timestamp, and terminal hyperlinks on the Run ID, Experiment, and MLflow Run
cells (the MLflow Run cell shows the run's name from the MLflow REST API). The
JSON envelope is unchanged.

Also reformat the training-config YAML shown in text mode so multi-line
fields (e.g. command) render as block literals instead of escaped one-liners.

Co-authored-by: Isaac
Add compute.go: the gpuType model and compute-block validation the upcoming
`air run` config layer depends on. Defines the canonical GPU_* accelerator
types, parseGPUType (exact, case-sensitive), gpusPerNode (partition counts),
and computeConfig.validate (positive count, multiple-of-per-node, mutually
exclusive node_pool_id/pool_name).

Co-authored-by: Isaac
The training compute config no longer supports pool placement, so remove the
node_pool_id and pool_name fields and the validation that rejected setting both.

Co-authored-by: Isaac
@riddhibhagwat-db riddhibhagwat-db changed the base branch from air-integration-m1-1 to air-cli June 17, 2026 22:39
@riddhibhagwat-db riddhibhagwat-db changed the base branch from air-cli to air-integration-m1-1 June 17, 2026 22:40
Base automatically changed from air-integration-m1-1 to air-cli June 18, 2026 16:46
@riddhibhagwat-db riddhibhagwat-db merged commit f1601b2 into air-cli Jun 18, 2026
9 checks passed
@riddhibhagwat-db riddhibhagwat-db deleted the air-integration-m2-1 branch June 18, 2026 16:51
4 participants