AIR CLI Integration: Implement the air get command #5600
Add the experimental `air` command group as the Go port surface for the Python `air` CLI. Every subcommand (run, status, list, logs, cancel, register-image) is registered as a stub that returns a not-implemented error; the real implementations land in later milestones. The package lives under experimental/air/cmd (imported as aircmd), matching the layout of the other experimental features (aitools, genie, postgres); cmd/experimental/ keeps only the dispatcher. TEST_PACKAGES in Taskfile.yml gains ./experimental/air/... so the unit tests keep running after the move. Includes unit tests for the command-tree wiring and the not-implemented stubs, plus an acceptance test exercising the stubs end-to-end. Co-authored-by: Isaac
Rename the run-details subcommand from `status` to `get`, matching the Python air CLI's current `air get run` naming (it replaced `get status`). Renames the file, constructor, command name, and updates the stub/help/unimplemented tests and goldens accordingly. Co-authored-by: Isaac
air status command → air get command
Integration test report
Commit: 630cd91
29 interesting tests: 14 SKIP, 8 flaky, 7 RECOVERED
Top 24 slowest tests (at least 2 minutes):

cmdio.LogString(ctx, "Training Configuration:")
cmdio.LogString(ctx, string(content))
cmdio.LogString(ctx, "")
This helper function `LogString` writes to stderr instead of stdout, which is where the original Python code wrote this output: https://github.com/databricks/cli/blob/main/libs/cmdio/log.go#L14-L18
fixed, thanks for the catch!
runID, err := strconv.ParseInt(args[0], 10, 64)
if err != nil || runID <= 0 {
	return fmt.Errorf("invalid RUN_ID %q: must be a positive integer", args[0])
In JSON mode, does this return a plain Go error instead of the JSON envelope?
This and a few other places should return JSON if the --json flag is passed.
Fixed this, thanks for the catch!
@@ -0,0 +1,36 @@
=== get (text)
>>> [CLI] experimental air get 123
this looks different than the output from the wheel side right? can you add a before / after screenshot to the PR description for easy review?
it's ok if the match in format is "coming next" I just want to make sure I understand how big the diff is exactly.
pardis-beikzadeh-db left a comment
Independent review against the Python air source (handle_status + cli_display/json_output). The success JSON envelope shape, MLflow URL construction, sweep table, and YAML panel all port faithfully — nice work. A few divergences inline below (#1, the -o json error path, is the one I'd treat as the most important; the rest are retry/rounding correctness).
Two more are easiest to review visually — could you add before/after side-by-sides to the PR description showing the old air vs the new databricks experimental air get output for: (a) the text view of a run, and (b) -o json of a run, plus a not-found case? That lets us confirm at a glance:
- text field ordering (Retries/Duration order and MLflow/User placement differ from the Python table at `cli_display.py:249`), and
- the JSON `started_at` format: Python emits `…+00:00` via `.isoformat()` (`cli_entrypoint.py:1931`), while the Go side emits `…Z` via RFC3339 (`format.go:44`), which is a value change for strict consumers.
if err != nil {
	// The backend returns this when the run ID is unknown to the user.
	if errors.Is(err, apierr.ErrResourceDoesNotExist) {
		return fmt.Errorf("run %d not found: check the run ID and that it is a job run ID", runID)
In -o json mode the Python CLI emits a structured error envelope and exits 1 — print_json_error("NOT_FOUND"/"INTERNAL_ERROR", kind, msg, retryable) → {v, ts, error:{...}} (cli_entrypoint.py:2017-2022). Here RunE returns a bare error regardless of output mode, so the framework prints a plain Error: … string. A consumer parsing the JSON error envelope from air get --json would break. Consider rendering the error envelope when output is JSON. (This JSON not-found branch is also currently untested.)
	endMillis = time.Now().UnixMilli()
}

d := (endMillis - run.StartTime) / 1000
nit: Python rounds to the nearest second: round((end - started_ms) / 1000) (cli_entrypoint.py:1934). Integer / 1000 truncates here, so e.g. an 11,500 ms run reports 11 vs Python's 12. Suggest rounding, e.g. (endMillis - run.StartTime + 500) / 1000.
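To make the rounding difference concrete, the sketch below compares three integer-only conversions from milliseconds to seconds. The function names are illustrative and not taken from the PR, and all three assume a non-negative duration. Note that the suggested `(ms + 500) / 1000` rounds exact halves up, whereas Python 3's `round()` rounds exact halves to the nearest even integer, so the two agree on 11,500 ms but differ on 12,500 ms.

```go
package main

import "fmt"

// truncSeconds mirrors the integer division under review: the remainder is dropped.
func truncSeconds(ms int64) int64 { return ms / 1000 }

// roundHalfUpSeconds is the suggested (ms + 500) / 1000 form.
// Exact .5 ties always round up. Assumes ms >= 0.
func roundHalfUpSeconds(ms int64) int64 { return (ms + 500) / 1000 }

// roundHalfEvenSeconds follows Python 3's round(): exact .5 ties go to the
// even neighbour. Assumes ms >= 0.
func roundHalfEvenSeconds(ms int64) int64 {
	q, r := ms/1000, ms%1000
	switch {
	case r > 500:
		return q + 1
	case r == 500 && q%2 == 1:
		return q + 1
	default:
		return q
	}
}

func main() {
	for _, ms := range []int64{11500, 12500} {
		fmt.Printf("%d ms: trunc=%d half-up=%d half-even=%d\n",
			ms, truncSeconds(ms), roundHalfUpSeconds(ms), roundHalfEvenSeconds(ms))
	}
	// 11500 ms: trunc=11 half-up=12 half-even=12
	// 12500 ms: trunc=12 half-up=13 half-even=12
}
```

If byte-for-byte parity with the Python output matters, the half-to-even variant is the one that matches on ties; for most durations the two rounding forms give the same answer.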
Rename the RUN_ID arg placeholder to JOB_RUN_ID across get/logs/cancel to disambiguate it from other run identifiers. Hide the `logs --review` flag to match the Python CLI (help=argparse.SUPPRESS), and add the `-i` shorthand for `register-image --interactive-authenticate`. Co-authored-by: Isaac
adb8fb8 to 8e76b86
`air get` derived Submitted and Duration from run-level start/end and truncated milliseconds to seconds. Port Python's _reported_attempt_timing so a retried run reports its latest attempt, and round to the nearest second to match Python's round(). Drops the run-level RunDuration shortcut, which diverged on retries. Co-authored-by: Isaac
mlflowURL resolved runs/get-output against Tasks[0], linking a retried run to its stale first attempt. Use the last task (latest attempt) to match Python (jobs_api_client.py:68). Co-authored-by: Isaac
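The "latest attempt" selection described in the two commits above can be sketched as follows. The `task` type is a hypothetical, trimmed-down stand-in (the real SDK type has many more fields), and the sketch assumes, as the commit message does, that the last entry in the task list is the most recent attempt.

```go
package main

import "fmt"

// task is a hypothetical stand-in for one entry in a run's task list.
type task struct {
	RunID     int64
	StartTime int64 // epoch milliseconds
	EndTime   int64 // epoch milliseconds; 0 while still running
}

// latestAttempt returns the last task, which is assumed to be the most
// recent attempt. ok is false when the run has no tasks.
func latestAttempt(tasks []task) (t task, ok bool) {
	if len(tasks) == 0 {
		return task{}, false
	}
	return tasks[len(tasks)-1], true
}

func main() {
	tasks := []task{
		{RunID: 101, StartTime: 1000, EndTime: 5000}, // first attempt (failed)
		{RunID: 102, StartTime: 6000, EndTime: 9000}, // retry
	}
	if t, ok := latestAttempt(tasks); ok {
		// Indexing tasks[0] here would report the stale first attempt instead.
		fmt.Println("reporting attempt", t.RunID)
	}
}
```

Returning an `ok` flag keeps the empty-task case explicit, so callers can skip the MLflow link or timing fields rather than index into an empty slice.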
…N with Python
In -o json mode, error paths now emit the structured error envelope
({v, ts, error:{code, kind, message, retryable}}) and exit non-zero, matching
the Python air CLI's print_json_error instead of letting the framework print a
bare "Error: ..." string. Covers invalid RUN_ID, run-not-found, backend
failures, and client/auth failures (wrapped PreRunE).
Also align the success envelope with the Python CLI:
- dashboard_url: construct {host}/jobs/runs/{id}?o={workspace_id} (via
CurrentWorkspaceID) instead of using the API's run_page_url
- started_at: datetime.isoformat() form ("+00:00" with microseconds), not
RFC3339 "Z"
- duration_seconds: rounded half-to-even to match Python's round()
- use run-level start/end times for started_at and duration_seconds, dropping
the last-attempt preference, which had no Python equivalent
Co-authored-by: Isaac
Revert the run-level timing change from the previous commit: started_at and
duration_seconds read from the last task's window again (reportedTiming),
matching the released Python `air` output, which reports the latest attempt.
The isoformat timestamp ("+00:00") and half-to-even rounding are kept.
Co-authored-by: Isaac
The runs/get-output call passed run_id via the query-param arg and a nil request body, which this endpoint rejects with "expected a map", so the MLflow link was never produced for completed runs. Pass run_id through the request arg instead (the SDK serializes it to the query string for GET), which sends a valid body and returns the gen_ai_compute_output run info. Failed runs without MLflow output still yield no link: get-output 404s for them, so mlflowURL returns nil as before. Co-authored-by: Isaac
…output

Nest the run-status command under a `get` parent group so the command is `air get run JOB_RUN_ID`, mirroring the Python CLI (the JOB_RUN_ID arg name matches the sibling air commands and avoids confusion with the MLflow run id).

Align the text output with Python's `air get run`: lead with the dashboard link (hyperlinked, falling back to the bare URL off a terminal) followed by a gap, then the training config, then the status table. The table uses Python's field order, "N/A" for empty cells, a "2006-01-02 15:04 UTC" Submitted timestamp, and terminal hyperlinks on the Run ID, Experiment, and MLflow Run cells (the MLflow Run cell shows the run's name from the MLflow REST API). The JSON envelope is unchanged.

Also reformat the training-config YAML shown in text mode so multi-line fields (e.g. command) render as block literals instead of escaped one-liners.

Co-authored-by: Isaac
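Two of the text-mode details in the commit above, the hyperlink fallback and the Submitted timestamp, can be sketched with the standard library alone. The function names are hypothetical, terminal detection is reduced to a boolean the caller supplies, and treating a zero start time as an empty cell is an assumption made for the example. The escape sequence is the common OSC 8 terminal hyperlink form.

```go
package main

import (
	"fmt"
	"time"
)

// hyperlink wraps text in an OSC 8 terminal hyperlink when the output is an
// interactive terminal, and falls back to the bare URL otherwise.
func hyperlink(text, url string, interactive bool) string {
	if !interactive {
		return url
	}
	return "\x1b]8;;" + url + "\x1b\\" + text + "\x1b]8;;\x1b\\"
}

// submitted formats epoch milliseconds with the "2006-01-02 15:04 UTC"
// layout, using "N/A" for a missing value (assumed here to be ms <= 0).
func submitted(ms int64) string {
	if ms <= 0 {
		return "N/A"
	}
	return time.UnixMilli(ms).UTC().Format("2006-01-02 15:04 UTC")
}

func main() {
	url := "https://example.invalid/jobs/runs/123" // placeholder URL
	fmt.Println(hyperlink("123", url, false))      // piped output: bare URL
	fmt.Println(submitted(60000))                  // 1970-01-01 00:01 UTC
	fmt.Println(submitted(0))                      // N/A
}
```

Because "UTC" is not a Go layout token, it is emitted literally, so the time must be converted with `.UTC()` first for the label to be truthful.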
7af56f3 to a69e0d3
runID, err := strconv.ParseInt(args[0], 10, 64)
if err != nil || runID <= 0 {
	return renderError(ctx, cmd, "INVALID_ARGS", "PERMANENT", true,
INVALID_ARGS should not be a retryable error.
A malformed JOB_RUN_ID is a permanent user error, so the JSON error envelope should report retryable=false; retrying the same bad argument can never succeed. Co-authored-by: Isaac
love the thorough response! thanks for all the fixes. I even agreed w/ your reasoning on 1 final nit: I find it odd that the job link is called a "dashboard". Can we have that say sth else? Maybe just "Job Link"?
"Dashboard" was ambiguous; the link points at the Databricks Jobs run page, so "Job Link" is clearer. JSON output (dashboard_url) is unchanged. Co-authored-by: Isaac
Changes
Implements `databricks experimental air get RUN_ID`, the Go port of the Python `air get` command. It fetches the run via `Jobs.GetRun` and renders:
- …`User`), and the run's dashboard URL.
- …`jobs/runs/get-output` (the `gen_ai_compute_output` field is not modeled by the typed SDK, so it's fetched via a direct REST call).

Why
`get` is the first real command integrated from the air CLI, and it sets the conventions the rest of the CLI will follow. The `{v, ts, data}` envelope mirrors the Python CLI so existing machine consumers keep working. The implementation is a faithful port of `handle_status` + the `cli_display` helpers, verified field-by-field against the Python source:
- …`_display_foreach_sweep_status`) and the training-config panel (`_fetch_and_display_yaml_config`); JSON output omits both, exactly matching `air get <run> --json`.
- …`gen_ai_compute_output` field (direct REST call), and the MLflow link / YAML fetch are best-effort (logic matches the Python CLI).

Tests
- …`buildGetData`, and all template branches (single-run minimal/all-fields, sweep, sweep-with-no-tasks).
- …`unittest.mock` suite) cover `buildSweepInfo`, `printConfigYAML`, `mlflowURL` (over `httptest`, since it bypasses the typed SDK), and the `RunE` invalid-id / not-found branches.
- …`acceptance/experimental/air/get`) runs the command end-to-end against a stubbed Jobs API: text output, `-o json`, and an invalid run ID.

Manual verification outputs:
Successful run:


Failed run:

