experimental/ssh: clarify GPU compute provisioning during ssh connect startup
GPU_8xH100 serverless capacity takes ~10 minutes at P50 and ~30 minutes at
P90 to acquire, but `ssh connect` gave up after a hard 10-minute startup
timeout with an opaque error:
Error: failed to ensure that ssh server is running: failed to submit and
start ssh server job: timed out: waiting for task to start (current
state: PENDING)
Users read this as a service outage rather than compute still being
provisioned (see the Zillow report in #remote-development-help).
- Raise the startup timeout to 40 minutes when --accelerator is set,
keeping 10 minutes otherwise.
- Print an upfront notice that GPU provisioning can take 10-30 minutes,
and reflect provisioning in the spinner text.
- On startup timeout, append guidance to the error: the run ID and run
page URL, that compute is likely still provisioning, and that the run
was left in place so re-running the command connects once it starts.
Co-authored-by: Isaac
NEXT_CHANGELOG.md (1 addition & 0 deletions)
@@ -6,6 +6,7 @@
### CLI
* Show a once-per-day notice after a command when a newer CLI release is available, with a link to the release and the upgrade command for the detected install method. Suppressed for non-interactive/CI runs, JSON output, the Databricks Runtime, and development builds, and can be disabled with `DATABRICKS_CLI_DISABLE_UPDATE_CHECK` ([#5470](https://github.com/databricks/cli/pull/5470)).
+* `ssh connect`: Increase the SSH server startup timeout from 10 to 40 minutes for GPU accelerators, show "Waiting for compute to start" while compute spins up (with a notice for GPU accelerators that provisioning can take upwards of 10 minutes), and explain on timeout that the job run was left in place so re-running the command connects once compute is available.
### Bundles
* Remove API enum values and types that are still in development from the `databricks-bundles` Python package; these were never accepted by the backend ([#5484](https://github.com/databricks/cli/pull/5484)).
+	// GPU capacity is acquired on demand and routinely takes 10+ minutes; without
+	// this notice users assume a long PENDING wait means the service is down.
+	cmdio.LogString(ctx, fmt.Sprintf("Waiting for %s compute to be provisioned. This can take upwards of 10 minutes depending on capacity...", opts.Accelerator))
+	waitingMessage = fmt.Sprintf("Waiting for %s compute to be provisioned...", opts.Accelerator)
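As a rough sketch of how the spinner text might be chosen: the `waitingMessage` function below is a hypothetical stand-alone helper, and the default text is assumed from the changelog entry ("Waiting for compute to start"). Only the accelerator-specific format string is taken from the diff.

```go
package main

import "fmt"

// waitingMessage picks the spinner text. The default wording is an
// assumption based on the changelog entry; the accelerator-specific
// wording is copied from the diff.
func waitingMessage(accelerator string) string {
	if accelerator != "" {
		return fmt.Sprintf("Waiting for %s compute to be provisioned...", accelerator)
	}
	return "Waiting for compute to start"
}

func main() {
	fmt.Println(waitingMessage(""))
	fmt.Println(waitingMessage("GPU_8xH100"))
}
```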
+	fmt.Fprintf(&b, " The SSH server job (run ID: %d) did not start within %s; its compute is most likely still being provisioned.\n", runID, opts.TaskStartupTimeout)
+	if opts.Accelerator != "" {
+		fmt.Fprintf(&b, " %s capacity can take longer than this to acquire when demand is high.\n", opts.Accelerator)
+	}
+	runLocation := "in the workspace UI (Jobs & Pipelines > Job Runs)"