ssh: propagate server logs when the bootstrap job fails #5552
Merged
The bootstrap notebook now tees the SSH server's output and, when the server process exits non-zero, fails the run with the last 2000 bytes of server logs in the exception, which "ssh connect" prints via the existing failed-run path. Also fixes the SIGCHLD subreaper handler silently swallowing the server's exit status, which could report a failed server as a successful run. Co-authored-by: Isaac
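The tee-and-tail behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the function name `run_and_tee`, the `out` parameter, and the exact layout of the message after `Last server logs:` are assumptions, and only the 2000-byte cap and the message prefix come from this PR.

```python
import subprocess
import sys

TAIL_BYTES = 2000  # cap quoted in the PR description


def run_and_tee(cmd, tail_bytes=TAIL_BYTES, out=None):
    """Run cmd, stream its output live, and keep only the last tail_bytes.

    Hypothetical helper: the real notebook code may be structured differently.
    """
    out = out if out is not None else sys.stdout.buffer
    tail = b""
    # stderr is merged into stdout so both end up in the same tail.
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    ) as proc:
        for line in proc.stdout:
            out.write(line)  # live copy (the run page, in the notebook's case)
            out.flush()
            tail = (tail + line)[-tail_bytes:]  # bounded copy for the exception
        code = proc.wait()
    if code != 0:
        logs = tail.decode("utf-8", errors="replace")
        raise RuntimeError(
            f"SSH server exited with code {code}. Last server logs:\n{logs}"
        )
    return code
```

Keeping the tail as a bounded byte string means a chatty server cannot grow the exception message past the cap, while the live copy is never truncated.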
Collaborator
Integration test report. Commit: a5d1273
26 interesting tests: 15 SKIP, 6 RECOVERED, 5 flaky
Collaborator
Integration test report. Commit: 1bf3c34
473 interesting tests: 406 MISS, 44 FAIL, 8 RECOVERED, 7 KNOWN, 3 PANIC, 3 flaky, 2 SKIP
artchen-db pushed a commit to artchen-db/cli that referenced this pull request (Jun 18, 2026)
databricks#5555)

## Changes

- The SSH server keeps its recent warning/error log records in a bounded in-memory buffer (16KB, oldest evicted) and serves them at `/logs` next to the existing `/metadata` endpoint, behind the same driver-proxy auth. Implemented as a tee `slog.Handler`, so all records still flow to stdout (the run-page logs) unchanged.
- When the spawned `ssh` client exits with a connection-level failure (code 255), `ssh connect` fetches `/logs` and prints the server's actual errors (e.g. `failed to start SSHD process: ... /usr/sbin/sshd: no such file or directory`). The generic "install openssh-server" hint remains as the fallback when no logs are available (e.g. older server versions without `/logs`); the fetch is best-effort.
- Extracted `newDriverProxyRequest` from `getServerMetadata`, shared by the new logs fetch.

## Why

When a connection attempt fails against a healthy-looking bootstrap job (FAILURE_MODES.md Mode 1: the container lacks `sshd`, the server logs the error per connection and keeps running), the real error was unreachable from the client: the Jobs API exposes no stdout logs for a running notebook task (`GetRunOutput` requires a terminal state and `RunOutput.Logs` is unsupported for notebook tasks). The server's own HTTP service behind the driver proxy is the only channel available while the job is alive. Complements databricks#5552, which covers the terminated-job case.

## Tests

- New unit tests for the log buffer and tee handler (eviction, warn+ filtering, per-connection `session` attrs, HTTP handler); `./task test-exp-ssh` and full lint pass.
- Manually verified against dogfood with a planted failing sshd path: the bootstrap job stays RUNNING, and after the connection drops the terminal prints `The SSH connection closed unexpectedly. Recent SSH server errors:` followed by the server's `failed to start SSHD process: fork/exec ...: no such file or directory` log line. A regular `ssh connect` (no plant) still connects end-to-end.
This pull request and its description were written by Isaac.
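The bounded buffer from the referenced commit can be illustrated with a short sketch. The commit implements it in Go as a tee `slog.Handler`; the version below is a Python analogue built on the standard `logging` module, kept in Python only to match the other sketches here. The class name `BoundedLogBuffer` and its eviction details (strict byte cap, whole records dropped oldest-first) are assumptions, not the commit's code.

```python
import collections
import logging


class BoundedLogBuffer(logging.Handler):
    """Keep recent WARNING+ records in memory, evicting the oldest over a cap.

    Hypothetical Python analogue of the Go handler described in the commit.
    """

    def __init__(self, max_bytes=16 * 1024):
        # Handler-level filter: records below WARNING never reach emit().
        super().__init__(level=logging.WARNING)
        self.max_bytes = max_bytes
        self.size = 0
        self.records = collections.deque()

    def emit(self, record):
        line = self.format(record)
        self.records.append(line)
        self.size += len(line.encode("utf-8"))
        # Evict whole records from the oldest end until back under the cap.
        while self.size > self.max_bytes and self.records:
            self.size -= len(self.records.popleft().encode("utf-8"))

    def text(self):
        """Return the buffered records, oldest first, one per line."""
        return "\n".join(self.records)
```

Attaching this handler next to a regular `logging.StreamHandler` on the same logger gives the tee effect: every record still reaches stdout, and only warnings and errors are copied into the buffer.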
## Changes

- The `ssh connect` bootstrap notebook now tees the SSH server's stdout/stderr (logs keep streaming live to the run page) and, when the server process exits non-zero, fails the run with the last 2000 bytes of server logs embedded in the exception. The CLI then prints that tail through the existing failed-run path (`describeRunFailure`).
- Fixes the SIGCHLD subreaper handler silently swallowing the server's exit status: when the handler reaped the server process before `Popen.wait()`, Python reported exit code 0, so a failed server could show up as a successful job run. Reaped statuses are now recorded and consulted.
- `describeRunFailure` truncates the run output `Error` to the same 2000-byte cap as the trace, and skips it when the traceback already ends with it, so the log tail isn't printed twice.

## Why

When the server binary fails, the run output only carried `Command '[...]' returned non-zero exit status N`. The server's actual logs were unreachable from the client because the Jobs API does not expose notebook-task stdout (`RunOutput.Logs` is unsupported for notebook tasks), leaving users to dig through the run page. Follow-up to #5456.

## Tests

- Unit tests for `describeRunFailure` (log-tail dedupe between `Error` and `ErrorTrace`, `Error` truncation); `./task test-exp-ssh` passes.
- Failing-server run: the output shows `RuntimeError: SSH server exited with code 7. Last server logs:` followed by the captured stdout/stderr tail (printed once) and the run page URL. A regular `ssh connect` (serverless) still connects, with logs streaming to the run page.

No NEXT_CHANGELOG.md entry: `ssh` is an experimental command.

This pull request and its description were written by Isaac.
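The exit-status problem can be reproduced in isolation. On POSIX, if something else reaps a child before `Popen.wait()` runs, `wait()` can no longer retrieve the status and reports 0. The sketch below is an illustration under assumptions: `reap`, `reaped`, and `exit_code` are hypothetical names, and `reap` is called directly as a stand-in for the SIGCHLD handler so the ordering is deterministic. It requires Python 3.9+ for `os.waitstatus_to_exitcode`.

```python
import os
import subprocess
import sys

reaped = {}  # pid -> exit code recorded by whoever reaped the child


def reap(pid):
    """Stand-in for the subreaper's SIGCHLD handler: reap and record."""
    _, status = os.waitpid(pid, 0)
    reaped[pid] = os.waitstatus_to_exitcode(status)


def exit_code(proc):
    """Prefer a recorded status over what Popen.wait() reports."""
    code = proc.wait()
    return reaped.get(proc.pid, code)


# A child that fails with exit code 7.
proc = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(7)"])
reap(proc.pid)            # the "handler" wins the race and reaps first
naive = proc.wait()       # status already consumed, so Popen falls back to 0
actual = exit_code(proc)  # the recorded status gives the real code
```

Here `naive` is 0 while `actual` is 7, which is the gap between a run reported as successful and the server's real failure.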