feat: add Headless Tunnel CLI enhancement#756
Conversation
Adds a provisional enhancement covering a headless command-line experience for Datum tunnels, distributed as a datumctl plugin. The plugin supports three execution modes against the same underlying tunnel runtime: foreground (blocking), detached background daemon, and system service installation (systemd / launchd / Windows SCM). Service accounts power unattended auth via datumctl's existing credentials helper, enabling tunnels that run for days, weeks, or years without manual re-auth. Implements the goals listed in #698.
Design review — ambiguities and contradictionsStrong overall direction, especially the "auth is owned by datumctl" constraint. A few things to clarify before this moves from Contradictions (need a clear decision)1. §State Isolation says:
But the headless tunnel work in datum-cloud/app@76987b7 moved
Either is defensible but the design has to pick one. As written, the plugin would regress the cross-project collision fix. 2. The detach-mode "ready" definition contradicts the verify-phase work and the SLO. §Background Daemon Mode:
But:
Pick one of: (a) wait for the full verify phase before exit; (b) exit non-zero (e.g. 75 — see contradiction #3) when readiness isn't confirmed within the window; (c) don't print the hostname yet, and require 3. Exit code for slow-ready detach contradicts the exit-code table. §Exit Codes defines:
But §Background Daemon Mode explicitly says the parent exits 0 when readiness exceeds the 10s window. Exiting 0 with "well, run Ambiguities (need spelling out)4.
5. §Required Library Changes item 5:
But §System Service Installation persists
6. The §Auth Integration:
If the Rust process can exec the helper itself, what does 7. §Command Surface lists 8. Heartbeat scope in plugin mode is unspecified. We just shipped 9. "Control-plane dry-run" for install validation is hand-wavy. §System Service Installation install validations item 4:
A 10. Two paths captured at install time, only one is documented. §System Service Installation says the absolute path to
Either the YAML needs a 11. If I Minor inconsistencies12. Container/sidecar scope statement is fuzzy. §Non-Goals says OCI packaging is out of scope; §User Stories Story 5 shows the use case as "Future" and the design notes "the architecture supports it." This is fine narratively but a single sentence reconciling "out of scope for this PR" vs "designed to enable" would prevent reviewer confusion later. 13. Windows/arm64 inclusion is contradictory. §Risks treats it as contingent on CI passing. §Infrastructure Needed lists it as guaranteed. Pick one; if contingent, drop it from the matrix until the build is green. 14. ApiEnv env-var conditional. §Required Library Changes item 4: "honor Suggested follow-ups (not blocking)
Otherwise: ready to start implementing the lib-side changes (#1, #2, #3 in §Required Library Changes are well-scoped and additive). I'll wait to do that until the contradictions above have a direction. |
Resolution proposal — applying recent app-side precedentsContext from offline discussion: this enhancement was drafted before a week of intensive debugging in datum-cloud/app#130 that surfaced several non-obvious requirements. The agreement is that where today's app-side work conflicts with this enhancement, today's precedent wins — the enhancement should evolve to incorporate it. The trajectory is: today's Recording the resolution table and the lib-vs-binary boundary here so the design can absorb it. Resolutions for the ambiguities and contradictions flagged above
Lib-vs-binary boundary (the trajectory the rest of this work depends on)Mapping today's Shared crate
Binary in the
Go plugin in the
Lib stays Rust-only; the Go plugin is a thin process-management shell around the Rust binary, as the design already intends. Desktop app (
Suggested first step (additive against today's app/main)
None of that requires this enhancement to be @drewr — adding this so the resolutions are recorded against the enhancement itself rather than scattered across PR-130 review comments. Use the resolution table as the authoritative answer to my previous review comment until the enhancement body is updated. |
A single 503 from the Datum API server's Envoy front-end ("upstream
connect error or disconnect/reset before headers. reset reason:
connection termination" — typical when kube apiserver briefly drops
connections behind Envoy) was killing in-progress tunnel setups that
the next 750ms poll tick would have ridden over. Observed mid-EnvoyPatch
Policy-reconcile wait on a fresh tunnel: setup conditions were on the
slow-but-working path and the run aborted at the unrelated transient.
The runtime watch already handles this correctly — log on error and
keep going. Mirror that in await_tunnel_progress with a bounded retry:
up to MAX_CONSECUTIVE_POLL_ERRORS (10 ≈ 7.5s at the current cadence)
before bailing. Long enough to ride out a brief blip; short enough that
a genuinely unreachable control plane still surfaces fast.
The change lives in await_tunnel_progress (cli/src/main.rs) but the
function is on the future connect-lib side of the boundary discussed in
datum-cloud/enhancements#756 comment 4644292554 — it's pure orchestration
over TunnelService::get_active_progress, no rendering, no clap. The
shape (consecutive-error counter + bounded retry + bail-fast on hard
signals) is the one the lib will inherit.
…s work The CLI accepts --endpoint 127.0.0.1:11434 (no scheme) and passes that string through to verify_endpoints, which hands it to reqwest. Reqwest's request builder refuses to build a request from a URL without a scheme and returns a "builder error" — which our probe was reporting as "origin not reachable" indefinitely: ✓ proxy responding (0.4s) [https://...]: HTTP 200 … origin not reachable (0s) [127.0.0.1:11434]: builder error … origin not reachable (10s) [127.0.0.1:11434]: builder error ... The actual origin was reachable the whole time — the proxy probe got HTTP 200 through the tunnel back to the same host:port. Only the CLI's local probe was wedged. Apply lib::normalize_endpoint (the same canonicalization that TunnelSummary.endpoint stores) at the top of verify_endpoints so any bare host:port works as input. The displayed URL becomes the canonical form (http://127.0.0.1:11434), matching what's stored on the HTTPProxy. verify_endpoints is on the connect-lib side of the boundary we sketched in datum-cloud/enhancements#756 comment 4644292554 — defensive normalization belongs here so other callers (UI Edit dialog, the future plugin foreground listen path) don't have to remember to canonicalize.
Two mermaid sequence diagrams in Design Details that give system context for the tunnel runtime: - Per-request data flow: browser -> Envoy -> iroh-gateway -> ListenNode -> local service, including the iroh-relay fallback path. - Device discovery: heartbeat -> PCP -> iroh-dns controller -> DNS TXT, which is what lets the gateway find the device. Together they make the role of HeartbeatAgent and ListenNode in the daemon-internal diagram below much easier to explain verbally.
Empirical measurement on the headless tunnel CLI: - Restart immediately after Ctrl-C: ~30s to proxy responding (matches the default 30s TXT TTL — the gateway's resolver serves the stale record until it expires) - Restart 35s after Ctrl-C: ~1s to proxy responding Control-plane plumbing (heartbeat / NSO reconcile / DNSRecordSet / Envoy metadata) is sub-second in both cases, so the TTL is the sole bottleneck for hot-restart UX.
Summary
Adds a provisional enhancement document under
enhancements/networking/connectors/headless-tunnel-cli/for a headless command-line experience for Datum tunnels.The enhancement covers:
datumctl connect tunnel ...plugin that wraps the existingdatum-connectRust binary.tunnel ps/stop/logs)kardianos/service)Refs #698
Status
provisional/alpha/latest-milestone: v0.1. Submitting per the template guidance to "merge early and iterate" — feedback in subsequent PRs welcome.Test plan
networking/connectors/is appropriate (sibling toinitial-proposal/).implementable.