feat: add Headless Tunnel CLI enhancement #756

Draft
drewr wants to merge 6 commits into main from headless-tunnel-cli

drewr commented Jun 5, 2026

Contributor

Summary

Adds a provisional enhancement document under enhancements/networking/connectors/headless-tunnel-cli/ for a headless command-line experience for Datum tunnels.

The enhancement covers:

  • A new datumctl connect tunnel ... plugin that wraps the existing datum-connect Rust binary.
  • Three execution modes against the same tunnel runtime:
    • Foreground (blocking, logs to stdout/stderr)
    • Background daemon (detached process, PID + log files, managed with tunnel ps/stop/logs)
    • System service (systemd / launchd / Windows SCM, via kardianos/service)
  • Service-account auth integration through datumctl's existing credentials helper, so long-running unattended tunnels refresh credentials automatically.
  • POSIX-conformant exit codes, named tunnels, user-vs-system scope, and a full PRR questionnaire.
  • Phased platform delivery: Linux/systemd in v0.1, macOS/launchd in v0.2, Windows SCM in v0.3.

Refs #698

Status

provisional / alpha / latest-milestone: v0.1. Submitting per the template guidance to "merge early and iterate" — feedback in subsequent PRs welcome.

Test plan

  • Reviewers confirm the goals/non-goals match the intent in Headless experience for Datum tunnels #698 and the discussion thread.
  • Reviewers validate the auth model (datumctl helper + service accounts) is the right unification point versus alternatives listed in §Alternatives.
  • Reviewers confirm the location under networking/connectors/ is appropriate (sibling to initial-proposal/).
  • Reviewers flag any PRR questions that need stronger answers before moving to implementable.

drewr added 3 commits June 5, 2026 03:24
Adds a provisional enhancement covering a headless command-line
experience for Datum tunnels, distributed as a datumctl plugin.

The plugin supports three execution modes against the same underlying
tunnel runtime: foreground (blocking), detached background daemon, and
system service installation (systemd / launchd / Windows SCM). Service
accounts power unattended auth via datumctl's existing credentials
helper, enabling tunnels that run for days, weeks, or years without
manual re-auth.

Implements the goals listed in #698.

drewr commented Jun 7, 2026

Contributor Author

Design review — ambiguities and contradictions

Strong overall direction, especially the "auth is owned by datumctl" constraint. A few things to clarify before this moves from provisional to implementable. Grouped by severity.

Contradictions (need a clear decision)

1. listen_key storage layout conflicts with the just-landed per-project layout.

§State Isolation says:

Each named tunnel has its own subdirectory (<repo>/tunnels/<name>/) holding its own listen_key and proxy state.

But the headless tunnel work in datum-cloud/app@76987b7 moved listen_key to <repo>/projects/<project_id>/listen_key precisely to prevent cross-project iroh-pubkey collisions (the diagnosis behind this lives in PR #130). Two open questions:

  • Should named tunnels nest inside projects (<repo>/projects/<project>/tunnels/<name>/listen_key) so the cross-project guarantee survives?
  • Or does --name imply a project binding strong enough to replace the per-project scoping?

Either is defensible but the design has to pick one. As written, the plugin would regress the cross-project collision fix.
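To make the two layouts concrete, here is a minimal sketch of the path arithmetic. The function names and the `/repo` root are hypothetical, chosen only for illustration; they are not taken from the enhancement or from the existing `Repo` code.

```rust
use std::path::{Path, PathBuf};

/// Layout as written in §State Isolation: keyed by tunnel name only.
/// (Hypothetical helper, for illustration.)
fn name_only_listen_key(repo: &Path, name: &str) -> PathBuf {
    repo.join("tunnels").join(name).join("listen_key")
}

/// First option above: named tunnels nest inside the per-project directory.
/// (Hypothetical helper, for illustration.)
fn nested_listen_key(repo: &Path, project_id: &str, name: &str) -> PathBuf {
    repo.join("projects")
        .join(project_id)
        .join("tunnels")
        .join(name)
        .join("listen_key")
}
```

With the name-only layout, a tunnel called `dev` resolves to the same key file whichever project it targets, which is the collision the per-project layout was introduced to prevent. The nested layout yields a distinct path per (project, name) pair.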

2. The detach-mode "ready" definition contradicts the verify-phase work and the SLO.

§Background Daemon Mode:

The parent Go process waits up to 10 seconds for the tunnel to reach ready, prints the assigned hostname, and exits 0. If readiness takes longer, the parent exits 0 with a hint to run tunnel status.

But:

  • §SLOs says: "99% of tunnel listen invocations reach ready state in under 30 seconds" — already longer than 10s.
  • The verify phase landing in datum-cloud/app@1ed969e has a 10-minute default for a reason: data-plane reachability on resume routinely takes 1–3 minutes (root cause traced in datum-cloud/iroh-gateway#12).
  • A parent that prints a hostname and exits 0 before the URL serves 2xx hands the user a URL that's broken-on-first-use ≥ 50% of the time today.

Pick one of: (a) wait for the full verify phase before exit; (b) exit non-zero (e.g. 75 — see contradiction #3) when readiness isn't confirmed within the window; (c) don't print the hostname yet, and require tunnel status to surface it once verified.

3. Exit code for slow-ready detach contradicts the exit-code table.

§Exit Codes defines:

75 — Temporary failure: tunnel exists but is still provisioning

But §Background Daemon Mode explicitly says the parent exits 0 when readiness exceeds the 10s window. Exiting 0 with "well, run tunnel status" is exactly the case 75 is designed for; the current design contradicts its own spec.
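For reference, option (b) from contradiction #2 would look roughly like the sketch below. `DetachOutcome` and `exit_code` are hypothetical names, and the generic failure code of 1 is an assumption; only the value 75 and its meaning come from the design's exit-code table.

```rust
/// What the detaching parent observed before its wait window closed.
/// (Hypothetical type, for illustration.)
#[derive(Debug, Clone, Copy, PartialEq)]
enum DetachOutcome {
    /// The tunnel reached ready inside the window.
    Verified,
    /// The daemon is alive but readiness was not confirmed in time.
    StillProvisioning,
    /// The daemon failed to start or reported a terminal error.
    Failed,
}

const EXIT_OK: i32 = 0;
const EXIT_FAILURE: i32 = 1; // assumption: generic failure code
const EXIT_TEMPFAIL: i32 = 75; // "Temporary failure" in the §Exit Codes table

/// Map the parent's observation to a process exit code.
fn exit_code(outcome: DetachOutcome) -> i32 {
    match outcome {
        DetachOutcome::Verified => EXIT_OK,
        DetachOutcome::StillProvisioning => EXIT_TEMPFAIL,
        DetachOutcome::Failed => EXIT_FAILURE,
    }
}
```

A caller script could then tell "retry `tunnel status` later" (75) apart from "ready" (0) without parsing output.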

Ambiguities (need spelling out)

4. --name vs --label vs --id semantics aren't sharp.

  • User Story 2: listen --detach --name dev-app --label dev-app — same value used for both. Are they synonyms? If not, why is conflating them the canonical example?
  • §Command Surface uses --name N for daemon process management and --id X for update/delete. We've already shipped --id in the standalone CLI as the server-side HTTPProxy resource name. The plugin's --id and --name must therefore be different things — local vs server-side identifiers. Spell this out.
  • The picker UX shipped in datum-cloud/app@7de50c7 and the auto-adoption in @cff37e7 aren't mentioned anywhere. Do they survive in plugin mode? If yes, how do they interact with --name?

5. --project is transient-override in some places and persisted-config in others.

§Required Library Changes item 5:

treats --project as a transient override (no set_selected_context)

But §System Service Installation persists project: proj-prod in the YAML at install time. So:

  • At install time, --project is persistent.
  • At listen time (foreground or detach), --project is transient.
  • What if a user runs tunnel start --name foo --project different-one against an installed service? Does the override take effect, get ignored, or error out?

6. The DATUM_ACCESS_TOKEN env var has an unclear role.

§Auth Integration:

The Rust binary in plugin mode receives a fresh bearer token from the Go plugin via env (DATUM_ACCESS_TOKEN) at startup and renews it by execing $DATUM_CREDENTIALS_HELPER.

If the Rust process can exec the helper itself, what does DATUM_ACCESS_TOKEN achieve? Optimization (avoid one helper exec at startup)? Or is the Rust process forbidden from execing the helper until the initial token expires? The design also says the Rust process runs the refresh loop — so the Go plugin's role after exec is just "start the Rust binary." The env var feels redundant.

7. tunnel run is "internal" but completely unspecified.

§Command Surface lists tunnel run --name N # internal; invoked by service unit. What does it accept? Does --system apply (the service unit args in §System Service Installation show it appears as --system for system-scoped services, but run itself isn't documented)? Where does it load config from when launched by systemd vs by start?
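One possible answer, matching the `tunnel run --name N [--system]` shape proposed in the resolution comment later in this thread, is a deliberately strict parser. This is a hand-rolled sketch rather than the clap definition the binary would actually use; `RunArgs` and `parse_run_args` are hypothetical names.

```rust
/// Arguments accepted by the internal `tunnel run` subcommand.
/// (Hypothetical type, for illustration.)
#[derive(Debug, PartialEq)]
struct RunArgs {
    name: String,
    system: bool,
}

/// Parse the argv that follows `tunnel run`, refusing anything unexpected.
fn parse_run_args(args: &[&str]) -> Result<RunArgs, String> {
    let mut name = None;
    let mut system = false;
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        match *arg {
            "--name" => {
                let value = it.next().ok_or("--name requires a value")?;
                name = Some(value.to_string());
            }
            "--system" => system = true,
            // Everything else is loaded from the persisted config, not argv.
            other => return Err(format!("unexpected argument: {other}")),
        }
    }
    let name = name.ok_or("--name is required")?;
    Ok(RunArgs { name, system })
}
```

Rejecting unknown flags keeps the service unit's argv and the persisted config from silently disagreeing.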

8. Heartbeat scope in plugin mode is unspecified.

We just shipped HeartbeatAgent::start_manual() (datum-cloud/app@b7e9d6b) so the CLI doesn't auto-enroll heartbeats for every accessible project. §Required Library Changes mentions HeartbeatAgent::with_pcp_client(...) but doesn't say whether plugin mode uses manual or auto-enroll. A named tunnel almost certainly wants manual mode (one project, no fan-out) — should be stated.

9. "Control-plane dry-run" for install validation is hand-wavy.

§System Service Installation install validations item 4:

The service account has permission to create tunnels in the target project (control-plane dry-run).

A SelfSubjectAccessReview against the relevant verbs is the natural fit — but say so. A literal apiserver dryRun=All Create isn't quite the same thing (it would still consume a quota check, and quota check timeouts are flaky enough today to false-fail install).

10. Two paths captured at install time, only one is documented.

§System Service Installation says the absolute path to datumctl-connect is captured for the service unit. But the credentials helper is datumctl (a different binary). The persisted YAML (<config-dir>/services/<name>.yaml) doesn't show the helper path captured anywhere — yet §Risks claims:

the plugin captures [the helper's] absolute path at install time

Either the YAML needs a credentials_helper_path field, or the design needs to clarify how the helper is found at runtime.

11. tunnel update in plugin mode vs persisted service YAML.

If I install --name prod --endpoint 127.0.0.1:8080 and later tunnel update --id prod-tunnel-xyz --endpoint 127.0.0.1:9000, the server-side HTTPProxy backend changes but the persisted YAML at <config-dir>/services/prod.yaml still says 8080. On next service restart, which one wins? Either the YAML must be re-read on each run (server is the truth) or the YAML must be updated by update (config is the truth). Pick one.

Minor inconsistencies

12. Container/sidecar scope statement is fuzzy. §Non-Goals says OCI packaging is out of scope; §User Stories Story 5 shows the use case as "Future" and the design notes "the architecture supports it." This is fine narratively but a single sentence reconciling "out of scope for this PR" vs "designed to enable" would prevent reviewer confusion later.

13. Windows/arm64 inclusion is contradictory. §Risks treats it as contingent on CI passing. §Infrastructure Needed lists it as guaranteed. Pick one; if contingent, drop it from the matrix until the build is green.

14. ApiEnv env-var conditional. §Required Library Changes item 4: "honor DATUM_API_HOST in plugin mode." How does ApiEnv know which mode? Plugin detection lives in cli/src/main.rs (item 5) — but ApiEnv is in lib/. Without thread-through, this becomes "honor DATUM_API_HOST always" or requires a new constructor variant.

Suggested follow-ups (not blocking)

  • §Implementation History could pin the per-project-key migration (@76987b7) and the idempotent-update_project (@0311960) as prerequisites — both are properties the plugin will rely on for "no resume churn" semantics, and §Required Library Changes doesn't mention them at all.
  • A short §Compatibility section noting that the desktop app, standalone datum-connect, and the plugin will each register distinct Connector objects against the same iroh public key — unless they share a Repo. The cross-agent semantics deserve explicit treatment given how much pain the cross-project collision case caused this past week.

Otherwise: ready to start implementing the lib-side changes (items 1, 2, and 3 in §Required Library Changes are well-scoped and additive). I'll wait to do that until the contradictions above have a direction.

drewr marked this pull request as draft June 7, 2026 22:17

drewr commented Jun 7, 2026

Contributor Author

Resolution proposal — applying recent app-side precedents

Context from offline discussion: this enhancement was drafted before a week of intensive debugging in datum-cloud/app#130 that surfaced several non-obvious requirements. The agreement is that where today's app-side work conflicts with this enhancement, today's precedent wins — the enhancement should evolve to incorporate it. The trajectory is: today's app/cli/ will be split into (a) a binary in a new connect repo invoked by datumctl, and (b) a shared Rust crate in the same connect repo, used by the desktop app too.

Recording the resolution table and the lib-vs-binary boundary here so the design can absorb it.

Resolutions for the ambiguities and contradictions flagged above

| # | Issue | Today's precedent | Resolution |
|---|---|---|---|
| 1 | listen_key layout (§State Isolation says <repo>/tunnels/<name>/listen_key) | datum-cloud/app@76987b7: per-project keys at <repo>/projects/<id>/listen_key, motivated by cross-project iroh-pubkey collisions | Named tunnels become per-name repos: <config-dir>/tunnels/<name>/ is itself a Repo. The per-project layout applies inside each named tunnel's repo. The plugin sets DATUM_CONNECT_REPO=<config-dir>/tunnels/<name> and Repo doesn't need to know about names. |
| 2 | Detach-mode "wait up to 10s for ready, then print hostname and exit 0" | datum-cloud/app@1ed969e: verify phase with default 10m, plus datum-cloud/iroh-gateway#12 confirming routine 1–3 min iroh dial latency on resume | Drop the wait-then-print contract. Detach writes the PID immediately and exits with a "setting up in background; tunnel status will show progress" message. Verification continues in the daemon's log. No misleading early-success print. |
| 3 | Exit code 75 ("still provisioning") defined but never used | Same as #2 | Becomes correctly unused (or removed from the table). Slow-ready isn't a state the parent ever observes once detach exits immediately. |
| 4 | --name vs --label vs --id semantics | datum-cloud/app@ca4470f (--id = HTTPProxy resource name) + @a68d8ae (--id + --endpoint must agree exactly) | --name is local-only (process management: PID file key, repo subdir name). --id is server-side. --label is decorative (display annotation). Document them as orthogonal. The picker + auto-adopt-single from @7de50c7 / @cff37e7 carry into the plugin's foreground listen. |
| 5 | --project "transient override" vs persisted in service YAML | Today's CLI requires a selected context at Tunnel handler entry | Persist project in the service YAML for installed services. Reject runtime override at tunnel run/tunnel start when it disagrees with the persisted value (fail loud, don't silently retarget). Foreground/detach: transient override is fine. |
| 6 | DATUM_ACCESS_TOKEN env var's role | n/a (new) | Drop it. The Rust process execs the helper at startup AND for refresh. One mechanism, less to keep in sync. Detect plugin mode via DATUM_SESSION (or an explicit DATUM_PLUGIN_MODE=1) instead of the presence of an initial token. |
| 7 | tunnel run subcommand surface | n/a | Spec it: tunnel run --name N [--system]. Loads the persisted YAML from the service dir. Refuses any other flags. The service unit literally writes that argv. |
| 8 | Heartbeat scope in plugin mode | datum-cloud/app@b7e9d6b: start_manual() for single-project | Plugin mode uses manual heartbeat scoped to the named tunnel's project. Drop the auto-enroll path for plugin entirely. Desktop app keeps start() because the multi-project UI actually wants that. |
| 9 | "Control-plane dry-run" install validation | n/a | Use SelfSubjectAccessReview against the required verbs (create on connectoradvertisements, httpproxies, etc.). Not dryRun=All Create: that would burn a quota-check round trip unnecessarily and is flaky enough today (@d8f7c96) to false-fail install. |
| 10 | Helper path capture | n/a | Add credentials_helper_path to the persisted YAML alongside name/project/session. The service unit reads it from there at run time. Spell out that the plugin captures it once, at install. |
| 11 | update vs persisted YAML truth | datum-cloud/app@0311960: update_project is idempotent at the lib boundary; server-side is the authoritative state | Server is the source of truth for label/endpoint. YAML stores only what the daemon needs to start: name, project, session, credentials_helper_path (and the install metadata). tunnel update mutates the server only; run queries the server on startup for current label/endpoint. |
| 12 | Container/sidecar scope | n/a | Single sentence reconciling "out of scope for this PR; designed to enable." Resolve narratively, no behavior change. |
| 13 | windows/arm64 contingent in §Risks vs guaranteed in §Infrastructure | n/a | Drop from the v0.1 matrix until the build is green; add when verified. |
| 14 | ApiEnv honoring DATUM_API_HOST "in plugin mode" | n/a | Honor DATUM_API_HOST always (it's just an env var). Plugin-mode detection lives in the CLI layer, not in ApiEnv. |
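Rows 5, 10, and 11 together imply a small persisted config and a fail-loud check on `--project`. A minimal sketch follows, assuming the four fields named in row 11. The struct name, the function name, and the error wording are hypothetical, and the YAML (de)serialization is left out.

```rust
/// What the persisted service YAML would hold under rows 10 and 11.
/// (Hypothetical struct; field names follow the resolution table.)
#[derive(Debug, Clone, PartialEq)]
struct ServiceConfig {
    name: String,
    project: String,
    session: String,
    credentials_helper_path: String,
}

/// Row 5: an installed service rejects a --project that disagrees with the
/// persisted value instead of silently retargeting.
fn resolve_project(cfg: &ServiceConfig, flag: Option<&str>) -> Result<String, String> {
    match flag {
        None => Ok(cfg.project.clone()),
        Some(p) if p == cfg.project => Ok(cfg.project.clone()),
        Some(p) => Err(format!(
            "--project {p} disagrees with persisted project {} for service {}",
            cfg.project, cfg.name
        )),
    }
}
```

Label and endpoint are intentionally absent from the struct, since row 11 makes the server authoritative for both.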

Lib-vs-binary boundary (the trajectory the rest of this work depends on)

Mapping today's app/lib/ + app/cli/ to the future connect repo:

Shared crate connect-lib (Rust) — consumed by the desktop app and by the plugin's Rust binary:

  • Repo (per-project / per-named-tunnel scoping)

  • ListenNode, ConnectNode, Node

  • TunnelService (already idempotent at the lib boundary per @0311960)

  • HeartbeatAgent with both start() and start_manual()

  • ProjectControlPlaneClient

  • TunnelProgress + ProgressStep + StepStatus

  • TunnelService::get_active_progress (the progress query)

  • await_tunnel_progress (moved out of cli/main.rs — pure orchestration on top of get_active_progress with a callback-based progress reporter; no println! baked in)

  • verify_endpoints (same — pure HTTP probe loop returning a Result; callers provide progress callbacks)

  • format_terminal_failure (pure string formatter; fits in lib)

  • New: BearerSource trait with two implementations:

    • DatumCloudClient (existing OAuth path — used by desktop)
    • ExternalTokenSource (helper-execed — used by plugin)

    ProjectControlPlaneClient::new_with_bearer_source gives both consumers a clean entry point.
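A rough shape for the trait, to anchor discussion. Everything beyond the names `BearerSource` and `ExternalTokenSource` is an assumption: the method name, the string error type, the convention that the helper prints the token on stdout, and the `StaticToken` stand-in (which exists only so the sketch can be exercised without a real helper).

```rust
use std::path::PathBuf;
use std::process::Command;

/// Anything that can produce a current bearer token on demand.
trait BearerSource {
    fn bearer_token(&self) -> Result<String, String>;
}

/// Plugin path: exec the credentials helper each time a token is needed.
/// Assumption: the helper prints the token on stdout and exits 0.
struct ExternalTokenSource {
    helper: PathBuf,
    args: Vec<String>,
}

impl BearerSource for ExternalTokenSource {
    fn bearer_token(&self) -> Result<String, String> {
        let out = Command::new(&self.helper)
            .args(&self.args)
            .output()
            .map_err(|e| format!("failed to exec helper {}: {e}", self.helper.display()))?;
        if !out.status.success() {
            return Err(format!("helper exited with {}", out.status));
        }
        let token = String::from_utf8(out.stdout)
            .map_err(|e| e.to_string())?
            .trim()
            .to_string();
        if token.is_empty() {
            Err("helper printed an empty token".to_string())
        } else {
            Ok(token)
        }
    }
}

/// Hypothetical stand-in for tests; not part of the proposal.
struct StaticToken(String);

impl BearerSource for StaticToken {
    fn bearer_token(&self) -> Result<String, String> {
        Ok(self.0.clone())
    }
}

/// Callers depend on the trait only, so desktop and plugin share this path.
fn authorization_header(source: &dyn BearerSource) -> Result<String, String> {
    Ok(format!("Bearer {}", source.bearer_token()?))
}
```

Because the token is fetched per call, the same mechanism covers startup and refresh, which is the point of dropping DATUM_ACCESS_TOKEN in row 6.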

Binary in the connect repo (Rust) — what datumctl invokes:

  • clap argument parsing
  • Plugin-mode detection (env-based)
  • Picker UX using inquire (kept out of the lib so the lib doesn't force a TUI dep on consumers)
  • Progress-line rendering (the println!(" ✓ {} ({:.1}s) [{resource}]", ...) strings) — implemented as a callback that consumes lib's TunnelProgress events
  • QuietTracing (the tracing_subscriber::reload::Handle guard) — CLI concern, not lib
  • Daemon mode (double-fork, PID files, log rotation, signal handling)
  • Service-mode run subcommand
  • Verify-endpoint output rendering
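The env-based plugin-mode detection could be as small as the sketch below. The variable names DATUM_SESSION and DATUM_PLUGIN_MODE come from row 6 of the resolution table; treating either one as sufficient is an assumption, since the table offers them as alternatives. The lookup is passed in as a closure so the function can be tested without touching the process environment.

```rust
/// Decide whether the binary was launched by the datumctl plugin.
/// `get_env` is typically `|k| std::env::var(k).ok()`.
fn is_plugin_mode(get_env: impl Fn(&str) -> Option<String>) -> bool {
    // Explicit opt-in wins.
    let explicit = get_env("DATUM_PLUGIN_MODE").as_deref() == Some("1");
    // Otherwise a non-empty session handed down by datumctl implies plugin mode.
    let has_session = get_env("DATUM_SESSION").map_or(false, |v| !v.is_empty());
    explicit || has_session
}
```

Keeping this in the binary, not in ApiEnv, is consistent with row 14.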

Go plugin in the connect repo (separate process) — what's between datumctl and the Rust binary:

  • Cobra command surface
  • Service-manager (kardianos/service) integration
  • Locates and execs the Rust binary
  • Reads DATUM_* env from datumctl; sets the right ones for the child

Lib stays Rust-only; the Go plugin is a thin process-management shell around the Rust binary, as the design already intends.

Desktop app (app/ui/) — uses connect-lib directly:

  • Picks up the per-project Repo layout automatically
  • Keeps DatumCloudClient for OAuth (unchanged)
  • Keeps HeartbeatAgent::start() (auto-enroll) — correct behavior for the multi-project UI
  • Drops any direct dependency on the cli binary (which it doesn't have today anyway)

Suggested first step (additive against today's app/main)

  1. Extract await_tunnel_progress and verify_endpoints from app/cli/src/main.rs into app/lib/ with callback-based progress reporting. No println! in lib; the CLI provides closures that do the rendering. This is the only piece of today's cli/ that's genuinely shared-library-shaped.
  2. Add BearerSource trait + the ExternalTokenSource impl alongside the existing DatumCloudClient. ProjectControlPlaneClient::new_with_bearer_source is the entry point. This is item 1 in §Required Library Changes — additive, no migration.
  3. Land both in app/ first, behind the existing --id/picker/verify-phase flow, so the new shapes get exercised in production before the repo split. When connect/ exists, the move is a git filter-branch-style extraction.

None of that requires this enhancement to be implementable — it's strictly additive against today's main.


@drewr — adding this so the resolutions are recorded against the enhancement itself rather than scattered across PR-130 review comments. Use the resolution table as the authoritative answer to my previous review comment until the enhancement body is updated.

drewr added a commit to datum-cloud/app that referenced this pull request Jun 8, 2026
A single 503 from the Datum API server's Envoy front-end ("upstream
connect error or disconnect/reset before headers. reset reason:
connection termination" — typical when kube apiserver briefly drops
connections behind Envoy) was killing in-progress tunnel setups that
the next 750ms poll tick would have ridden over. Observed
mid-EnvoyPatchPolicy-reconcile wait on a fresh tunnel: setup conditions were on the
slow-but-working path and the run aborted at the unrelated transient.

The runtime watch already handles this correctly — log on error and
keep going. Mirror that in await_tunnel_progress with a bounded retry:
up to MAX_CONSECUTIVE_POLL_ERRORS (10 ≈ 7.5s at the current cadence)
before bailing. Long enough to ride out a brief blip; short enough that
a genuinely unreachable control plane still surfaces fast.

The change lives in await_tunnel_progress (cli/src/main.rs) but the
function is on the future connect-lib side of the boundary discussed in
datum-cloud/enhancements#756 comment 4644292554 — it's pure orchestration
over TunnelService::get_active_progress, no rendering, no clap. The
shape (consecutive-error counter + bounded retry + bail-fast on hard
signals) is the one the lib will inherit.
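A minimal sketch of the consecutive-error counter described above, not the actual await_tunnel_progress. The `Poll` enum, the `await_ready` name, and the injected `sleep` closure are hypothetical; the bail-fast handling of hard signals and any overall deadline are omitted for brevity.

```rust
/// From the commit message: 10 errors at a 750ms cadence is about 7.5s.
const MAX_CONSECUTIVE_POLL_ERRORS: u32 = 10;

/// Result of one successful poll. (Hypothetical type, for illustration.)
#[derive(Debug, PartialEq)]
enum Poll {
    Pending,
    Ready,
}

/// Poll until ready, tolerating a bounded run of transient errors.
/// Returns the number of polls made on success.
fn await_ready<F>(mut poll: F, mut sleep: impl FnMut()) -> Result<u32, String>
where
    F: FnMut() -> Result<Poll, String>,
{
    let mut consecutive_errors = 0;
    let mut polls = 0;
    loop {
        polls += 1;
        match poll() {
            Ok(Poll::Ready) => return Ok(polls),
            // Any successful poll resets the counter: only an unbroken run counts.
            Ok(Poll::Pending) => consecutive_errors = 0,
            Err(e) => {
                consecutive_errors += 1;
                if consecutive_errors >= MAX_CONSECUTIVE_POLL_ERRORS {
                    return Err(format!(
                        "giving up after {consecutive_errors} consecutive poll errors: {e}"
                    ));
                }
            }
        }
        sleep();
    }
}
```

In production the `sleep` closure would wait one poll interval; in tests it can be a no-op so the retry logic runs instantly.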
drewr added a commit to datum-cloud/app that referenced this pull request Jun 8, 2026
…s work

The CLI accepts --endpoint 127.0.0.1:11434 (no scheme) and passes that
string through to verify_endpoints, which hands it to reqwest. Reqwest's
request builder refuses to build a request from a URL without a scheme
and returns a "builder error" — which our probe was reporting as
"origin not reachable" indefinitely:

  ✓ proxy responding (0.4s) [https://...]: HTTP 200
  … origin not reachable (0s) [127.0.0.1:11434]: builder error
  … origin not reachable (10s) [127.0.0.1:11434]: builder error
  ...

The actual origin was reachable the whole time — the proxy probe got
HTTP 200 through the tunnel back to the same host:port. Only the CLI's
local probe was wedged.

Apply lib::normalize_endpoint (the same canonicalization that
TunnelSummary.endpoint stores) at the top of verify_endpoints so any
bare host:port works as input. The displayed URL becomes the canonical
form (http://127.0.0.1:11434), matching what's stored on the HTTPProxy.

verify_endpoints is on the connect-lib side of the boundary we sketched
in datum-cloud/enhancements#756 comment 4644292554 — defensive
normalization belongs here so other callers (UI Edit dialog, the future
plugin foreground listen path) don't have to remember to canonicalize.
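The normalization amounts to something like the following. This is a hypothetical stand-in, not the real lib::normalize_endpoint, whose exact rules are not shown in this thread; the only behavior taken from the commit message is that a bare host:port becomes an http:// URL.

```rust
/// Give a bare host:port a default scheme; leave full URLs untouched.
/// (Illustrative stand-in; the real lib::normalize_endpoint may differ.)
fn normalize_endpoint(input: &str) -> String {
    let trimmed = input.trim();
    if trimmed.contains("://") {
        trimmed.to_string()
    } else {
        format!("http://{trimmed}")
    }
}
```

Applying this once at the top of the probe means every later step, including the displayed URL, sees the same canonical string.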
drewr added 2 commits June 9, 2026 21:48
Two mermaid sequence diagrams in Design Details that give system context
for the tunnel runtime:

- Per-request data flow: browser -> Envoy -> iroh-gateway -> ListenNode
  -> local service, including the iroh-relay fallback path.
- Device discovery: heartbeat -> PCP -> iroh-dns controller -> DNS TXT,
  which is what lets the gateway find the device.

Together they make the role of HeartbeatAgent and ListenNode in the
daemon-internal diagram below much easier to explain verbally.
Empirical measurement on the headless tunnel CLI:

- Restart immediately after Ctrl-C: ~30s to proxy responding
  (matches the default 30s TXT TTL — the gateway's resolver serves
  the stale record until it expires)
- Restart 35s after Ctrl-C: ~1s to proxy responding

Control-plane plumbing (heartbeat / NSO reconcile / DNSRecordSet /
Envoy metadata) is sub-second in both cases, so the TTL is the
sole bottleneck for hot-restart UX.
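The two measurements fit a simple model: time to "proxy responding" is whatever remains of the TXT TTL, floored at the roughly one second of control-plane plumbing. The sketch below encodes that model. The constant and function names are hypothetical, and it assumes the resolver's cached record has a full TTL remaining at the moment of Ctrl-C.

```rust
use std::time::Duration;

/// Default TXT TTL from the measurement above.
const TXT_TTL: Duration = Duration::from_secs(30);
/// Approximate control-plane plumbing time (reported as sub-second to ~1s).
const PLUMBING: Duration = Duration::from_secs(1);

/// Estimated time until the proxy responds after a restart,
/// given how long ago the previous process was stopped.
fn expected_time_to_responding(since_stop: Duration) -> Duration {
    TXT_TTL.saturating_sub(since_stop).max(PLUMBING)
}
```

Under this model a restart at 0s waits about 30s and a restart at 35s waits about 1s, matching both observations.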