AIR CLI Integration: air run Command Pt. 1 - Add GPU accelerator type and compute config model #5602
Add the experimental `air` command group as the Go port surface for the Python `air` CLI. Every subcommand (run, status, list, logs, cancel, register-image) is registered as a stub that returns a not-implemented error; the real implementations land in later milestones. The package lives under experimental/air/cmd (imported as aircmd), matching the layout of the other experimental features (aitools, genie, postgres); cmd/experimental/ keeps only the dispatcher. TEST_PACKAGES in Taskfile.yml gains ./experimental/air/... so the unit tests keep running after the move. Includes unit tests for the command-tree wiring and the not-implemented stubs, plus an acceptance test exercising the stubs end-to-end. Co-authored-by: Isaac
Rename the run-details subcommand from `status` to `get`, matching the Python air CLI's current `air get run` naming (it replaced `get status`). Renames the file, constructor, command name, and updates the stub/help/unimplemented tests and goldens accordingly. Co-authored-by: Isaac
Implement the read-only run-details command (renamed from `status` to `get`).
It fetches a job run via the Jobs API and renders the run's status, start time,
duration, retries, experiment, accelerators, dashboard URL, MLflow deep-link,
and a foreach/sweep summary. Output is the air-style {v, ts, data} JSON envelope
under -o json, or a text view.
Renames the command-level identifiers (status -> get) while keeping the run's
"status" field/label. Adds format/mlflow/sweep/output helpers with unit tests
and an acceptance test, and drops `get` from the not-implemented stub coverage.
Co-authored-by: Isaac
Integration test report
Commit: 9efd3d1
23 interesting tests: 14 SKIP, 7 RECOVERED, 2 flaky
Top 29 slowest tests (at least 2 minutes):
NodePoolID string `yaml:"node_pool_id"`
PoolName string `yaml:"pool_name"`
Hey, let's leave out any pool-related features from the Go port. cc @ben-hansen-db @maggiewang-db. I'd cc Yu Peng, but he doesn't have a -db GH account?
perNode, err := gpusPerNode(g)
if err != nil {
	return err
}
if c.NumAccelerators%perNode != 0 {
	return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d", c.AcceleratorType, perNode, c.NumAccelerators)
}
I'm of the opinion this kind of check should be done in the backend. @maggiewang-db @ben-hansen-db @vinchenzo-db wdyt? Can we do that easily using Training Service logic?
I think that based on the project milestones and as I discussed with Maggie yesterday, we want to port this in phases. As written in the project doc, we want to first port the run functionality directly as is (including the validation) and then move the validation & add handlers to the backend in milestone 3.
I agree. But my plan is to do that later in Milestone 3.2 after the initial lift and shift.
It needs some design to decide which validations to move to the backend and which to keep in the client.
Rename the RUN_ID arg placeholder to JOB_RUN_ID across get/logs/cancel to disambiguate it from other run identifiers. Hide the `logs --review` flag to match the Python CLI (help=argparse.SUPPRESS), and add the `-i` shorthand for `register-image --interactive-authenticate`. Co-authored-by: Isaac
case gpuType8xH100:
	return 8, nil
}
return 0, fmt.Errorf("invalid GPU type %q", string(g))
Nit: By the time validate() reaches gpusPerNode(), parseGPUType() has already guaranteed g is valid.
It's ok to leave the code as is to be defensive. Just add a comment that this shouldn't be reachable.
The training-config block is command result data, but it was emitted via cmdio.LogString, which targets stderr. Write it to cmd.OutOrStdout() instead so it lands on stdout, matching the Python `air get`. Download/read failures stay on stderr as warnings. Co-authored-by: Isaac
`air get` derived Submitted and Duration from run-level start/end and truncated milliseconds to seconds. Port Python's _reported_attempt_timing so a retried run reports its latest attempt, and round to the nearest second to match Python's round(). Drops the run-level RunDuration shortcut, which diverged on retries. Co-authored-by: Isaac
mlflowURL resolved runs/get-output against Tasks[0], linking a retried run to its stale first attempt. Use the last task (latest attempt) to match Python (jobs_api_client.py:68). Co-authored-by: Isaac
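The "latest attempt" selection described in the two commits above can be sketched as follows. This is a minimal illustration, not the PR's code: `taskAttempt` and `latestAttempt` are hypothetical names, the fields are illustrative rather than the Jobs API schema, and the sketch assumes the task list is ordered oldest attempt first.

```go
package main

import "fmt"

// taskAttempt is a hypothetical stand-in for one task entry of a job run.
// The field names are illustrative, not the Jobs API schema.
type taskAttempt struct {
	RunID       int64
	StartTimeMs int64
	EndTimeMs   int64
}

// latestAttempt returns the last task entry, assuming the list is ordered
// oldest attempt first. ok is false when the run has no tasks.
func latestAttempt(tasks []taskAttempt) (a taskAttempt, ok bool) {
	if len(tasks) == 0 {
		return taskAttempt{}, false
	}
	return tasks[len(tasks)-1], true
}

func main() {
	tasks := []taskAttempt{
		{RunID: 101, StartTimeMs: 1_000, EndTimeMs: 4_000},  // first attempt
		{RunID: 102, StartTimeMs: 5_000, EndTimeMs: 12_000}, // retry
	}
	if a, ok := latestAttempt(tasks); ok {
		// Timing and the MLflow lookup would both key off this entry.
		fmt.Println(a.RunID, a.EndTimeMs-a.StartTimeMs)
	}
}
```

Indexing `Tasks[0]` instead would pick run 101 here, which is the stale-first-attempt behavior the commit describes fixing.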
…N with Python
In -o json mode, error paths now emit the structured error envelope
({v, ts, error:{code, kind, message, retryable}}) and exit non-zero, matching
the Python air CLI's print_json_error instead of letting the framework print a
bare "Error: ..." string. Covers invalid RUN_ID, run-not-found, backend
failures, and client/auth failures (wrapped PreRunE).
Also align the success envelope with the Python CLI:
- dashboard_url: construct {host}/jobs/runs/{id}?o={workspace_id} (via
CurrentWorkspaceID) instead of using the API's run_page_url
- started_at: datetime.isoformat() form ("+00:00" with microseconds), not
RFC3339 "Z"
- duration_seconds: rounded half-to-even to match Python's round()
- use run-level start/end times for started_at and duration_seconds, dropping
the last-attempt preference, which had no Python equivalent
Co-authored-by: Isaac
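For illustration, the error envelope and dashboard URL shapes named in this commit could be modeled roughly as below. Only the JSON key names and the URL pattern come from the commit message; the Go type names, the field types, the envelope version value `1`, the `ts` format, and the example `code`/`kind` values are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// errorDetail mirrors the keys listed in the commit message.
// The Go types chosen for each key are assumptions.
type errorDetail struct {
	Code      string `json:"code"`
	Kind      string `json:"kind"`
	Message   string `json:"message"`
	Retryable bool   `json:"retryable"`
}

// errorEnvelope is the {v, ts, error} wrapper.
type errorEnvelope struct {
	V     int         `json:"v"`
	TS    string      `json:"ts"`
	Error errorDetail `json:"error"`
}

// renderError serializes the envelope. V=1 is an assumed version value.
func renderError(ts string, d errorDetail) (string, error) {
	b, err := json.Marshal(errorEnvelope{V: 1, TS: ts, Error: d})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// dashboardURL builds {host}/jobs/runs/{id}?o={workspace_id}.
func dashboardURL(host string, runID, workspaceID int64) string {
	return fmt.Sprintf("%s/jobs/runs/%d?o=%d", strings.TrimRight(host, "/"), runID, workspaceID)
}

func main() {
	// The ts format and the code/kind values here are placeholders.
	out, err := renderError(time.Now().UTC().Format(time.RFC3339), errorDetail{
		Code: "RUN_NOT_FOUND", Kind: "user", Message: "run 42 not found",
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(out)
	fmt.Println(dashboardURL("https://example.test/", 42, 7))
}
```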
Revert the run-level timing change from the previous commit: started_at and
duration_seconds read from the last task's window again (reportedTiming),
matching the released Python `air` output, which reports the latest attempt.
The isoformat timestamp ("+00:00") and half-to-even rounding are kept.
Co-authored-by: Isaac
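A rough sketch of the two formatting rules kept here, assuming millisecond epoch inputs. `isoFromMillis` and `durationSeconds` are hypothetical helper names. The timestamp helper follows Python's `datetime.isoformat()` behavior for a UTC-aware value, which prints six fractional digits only when the microsecond part is non-zero.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// isoFromMillis formats an epoch-millisecond timestamp the way Python's
// datetime.isoformat() renders a UTC-aware datetime: a "+00:00" offset and
// six fractional digits only when the microsecond part is non-zero.
func isoFromMillis(ms int64) string {
	t := time.UnixMilli(ms).UTC()
	if t.Nanosecond()/1000 == 0 {
		return t.Format("2006-01-02T15:04:05-07:00")
	}
	return t.Format("2006-01-02T15:04:05.000000-07:00")
}

// durationSeconds rounds a millisecond window to whole seconds using
// round-half-to-even, which is what Python's round() does for floats.
func durationSeconds(startMs, endMs int64) int64 {
	return int64(math.RoundToEven(float64(endMs-startMs) / 1000))
}

func main() {
	fmt.Println(isoFromMillis(1500))      // fractional part present
	fmt.Println(durationSeconds(0, 2500)) // 2.5 rounds to the even neighbor, 2
	fmt.Println(durationSeconds(0, 3500)) // 3.5 rounds to the even neighbor, 4
}
```

Truncating instead would report 2 and 3 for the last two cases, and round-half-up would report 3 and 4, so only half-to-even matches Python on both.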
The runs/get-output call passed run_id via the query-param arg and a nil request body, which this endpoint rejects with "expected a map", so the MLflow link was never produced for completed runs. Pass run_id through the request arg instead (the SDK serializes it to the query string for GET), which sends a valid body and returns the gen_ai_compute_output run info. Failed runs without MLflow output still yield no link: get-output 404s for them, so mlflowURL returns nil as before. Co-authored-by: Isaac
…output
Nest the run-status command under a `get` parent group so the command is `air get run JOB_RUN_ID`, mirroring the Python CLI (the JOB_RUN_ID arg name matches the sibling air commands and avoids confusion with the MLflow run id). Align the text output with Python's `air get run`: lead with the dashboard link (hyperlinked, falling back to the bare URL off a terminal) followed by a gap, then the training config, then the status table. The table uses Python's field order, "N/A" for empty cells, a "2006-01-02 15:04 UTC" Submitted timestamp, and terminal hyperlinks on the Run ID, Experiment, and MLflow Run cells (the MLflow Run cell shows the run's name from the MLflow REST API). The JSON envelope is unchanged. Also reformat the training-config YAML shown in text mode so multi-line fields (e.g. command) render as block literals instead of escaped one-liners. Co-authored-by: Isaac
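The text-mode conventions in this commit (terminal hyperlink with a bare-URL fallback, "N/A" for empty cells, the Submitted layout) might look like the sketch below. The helper names are hypothetical and the caller is assumed to know whether output is a terminal. The hyperlink uses the OSC 8 escape sequence, which is one common way terminals render links; the commit does not say which mechanism the PR uses.

```go
package main

import (
	"fmt"
	"time"
)

// hyperlink wraps text in an OSC 8 terminal hyperlink when writing to a
// terminal, and falls back to the bare URL otherwise.
func hyperlink(text, url string, isTTY bool) string {
	if !isTTY {
		return url
	}
	return "\x1b]8;;" + url + "\x1b\\" + text + "\x1b]8;;\x1b\\"
}

// orNA substitutes "N/A" for an empty table cell.
func orNA(s string) string {
	if s == "" {
		return "N/A"
	}
	return s
}

// submitted renders an epoch-millisecond timestamp with the layout quoted
// in the commit message. "UTC" is literal text in a Go layout string.
func submitted(ms int64) string {
	return time.UnixMilli(ms).UTC().Format("2006-01-02 15:04 UTC")
}

func main() {
	fmt.Println(hyperlink("Dashboard", "https://example.test/jobs/runs/42?o=7", false))
	fmt.Println(orNA(""))
	fmt.Println(submitted(0))
}
```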
7af56f3 to a69e0d3
Add compute.go: the gpuType model and compute-block validation the upcoming `air run` config layer depends on. Defines the canonical GPU_* accelerator types, parseGPUType (exact, case-sensitive), gpusPerNode (partition counts), and computeConfig.validate (positive count, multiple-of-per-node, mutually exclusive node_pool_id/pool_name). Co-authored-by: Isaac
The training compute config no longer supports pool placement, so remove the node_pool_id and pool_name fields and the validation that rejected setting both. Co-authored-by: Isaac
f8477fc to 62be1a1
Changes
Adds `experimental/air/cmd/compute.go`: the `gpuType` model and the `compute` block validation that the `air run` configuration layer depends on. Specifically:
- the `GPU_*` accelerator types (`GPU_1xA10`, `GPU_8xH100`, `GPU_1xH100`)
- `parseGPUType` resolves a YAML accelerator type string
- `gpusPerNode` is the per-node partition count, based on the type name
- `computeConfig` and `validate()` are the port of the Python `ComputeConfig` validators
Why
This is the first, leaf-most piece of the `air run` port for the AIR CLI and the root of the config validation layer's dependencies. The compute piece depends on nothing else, so it lands first as a small, fully unit-tested unit.
Parsing is exact and case-sensitive, since a typo in the user's YAML could misroute the run. Only `GPU_*` training service types are supported; legacy MAPI types (e.g. `h100_80gb`) are no longer supported and are intentionally deprecated in this port, so no new runs can use the MAPI path. Rendering them in `get` is unaffected, since format.go keeps its own display map for historical runs.
Tests
Table-driven unit tests in compute_test.go: parseGPUType for valid types and rejected inputs (wrong casing, legacy types, unknown, empty); gpusPerNode counts plus its invalid-type error; and computeConfig.validate across valid configs and every failure mode (unknown/legacy type, non-positive count, non-multiple count, dual-pool conflict). go build, go test, and golangci-lint are clean.
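To make the shape of the compute layer concrete, here is a minimal sketch reconstructed from the description and the diff fragments quoted in the review threads. It is not the PR's file. The constant names other than `gpuType8xH100`, the struct field layout, the order of checks, the non-positive-count message, and the per-node count of 1 for the `1x` types are assumptions; the 8-per-node value and the multiple-of error message come from the quoted diff. Pool fields are omitted, per the later commit that removed them.

```go
package main

import "fmt"

type gpuType string

const (
	gpuType1xA10  gpuType = "GPU_1xA10"  // constant name assumed
	gpuType1xH100 gpuType = "GPU_1xH100" // constant name assumed
	gpuType8xH100 gpuType = "GPU_8xH100"
)

// parseGPUType matches exactly and case-sensitively, so "gpu_8xh100" and
// legacy names such as "h100_80gb" are rejected.
func parseGPUType(s string) (gpuType, error) {
	switch g := gpuType(s); g {
	case gpuType1xA10, gpuType1xH100, gpuType8xH100:
		return g, nil
	}
	return "", fmt.Errorf("invalid GPU type %q", s)
}

// gpusPerNode returns the per-node count. The value 1 for the 1x types is
// inferred from the type names.
func gpusPerNode(g gpuType) (int, error) {
	switch g {
	case gpuType1xA10, gpuType1xH100:
		return 1, nil
	case gpuType8xH100:
		return 8, nil
	}
	return 0, fmt.Errorf("invalid GPU type %q", string(g))
}

type computeConfig struct {
	AcceleratorType string
	NumAccelerators int
}

// validate checks the type, then a positive count, then the per-node multiple.
func (c computeConfig) validate() error {
	g, err := parseGPUType(c.AcceleratorType)
	if err != nil {
		return err
	}
	if c.NumAccelerators <= 0 {
		return fmt.Errorf("compute.num_accelerators must be positive, got %d", c.NumAccelerators)
	}
	perNode, err := gpusPerNode(g)
	if err != nil {
		return err
	}
	if c.NumAccelerators%perNode != 0 {
		return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d", c.AcceleratorType, perNode, c.NumAccelerators)
	}
	return nil
}

func main() {
	fmt.Println(computeConfig{AcceleratorType: "GPU_8xH100", NumAccelerators: 16}.validate())
	fmt.Println(computeConfig{AcceleratorType: "GPU_8xH100", NumAccelerators: 12}.validate())
}
```

Keeping the parse step strict means a miscased or legacy type fails before any count arithmetic runs, which matches the misrouting concern stated in the description.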