
IrohDNSPublished=True flips to DeferredToOwner mid-session on an actively-serving Connector #174

Description

@drewr

Summary

I hit two cases, with different semantics, in which an otherwise healthy tunnel stopped functioning after several hours of operation. After I added more logging and ruled out other issues, the second case revealed a resource collision across two different Datum Cloud accounts.

Much of the following is Opus 4.7 context that I needed to save somewhere. But the two biggest needs I have from this issue are:

Observed behavior

On a Connector with IrohDNSPublished=True; Reason=Owner that has been serving traffic, the condition periodically flips to:

Type:    IrohDNSPublished
Status:  False
Reason:  DeferredToOwner
Message: iroh DNS record is owned by Connector /<project>/default/<other-connector>
         (uid <other-uid>)

The transition occurs on a server-side reconcile. No client action triggers it (no Create/Update/Delete on the affected Connector or its dependencies). After the flip, traffic to the corresponding tunnel returns 5xx at the edge until ownership is reclaimed.
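As an illustration of how a client could recognize this condition and pull out the cited owner, here is a minimal Python sketch. It assumes the Message always follows the template shown above, which is an inference from this issue rather than a documented format. The function name `parse_cited_owner`, the regex, and the dictionary keys are hypothetical, and the sample message is assembled from the template using the Case 1 identifiers; it is not a captured log line.

```python
import re
from typing import Optional

# Pattern inferred from the Message template in this issue (an assumption,
# not a documented format). \s+ tolerates the line wrap before "(uid ...)".
OWNER_RE = re.compile(
    r"owned by Connector /(?P<project>[^/\s]+)/(?P<namespace>[^/\s]+)/(?P<name>\S+)"
    r"\s+\(uid (?P<uid>[^)\s]+)\)"
)


def parse_cited_owner(condition: dict) -> Optional[dict]:
    """Return the cited owner if the condition is a DeferredToOwner flip, else None."""
    if condition.get("type") != "IrohDNSPublished":
        return None
    if condition.get("status") != "False" or condition.get("reason") != "DeferredToOwner":
        return None
    match = OWNER_RE.search(condition.get("message", ""))
    return match.groupdict() if match else None


# Sample condition assembled from the template, filled with Case 1 values.
sample = {
    "type": "IrohDNSPublished",
    "status": "False",
    "reason": "DeferredToOwner",
    "message": (
        "iroh DNS record is owned by Connector "
        "/drewr-y4nd1b/default/datum-connect-jttwh\n"
        "         (uid 226a90b6-3cad-4242-9eff-c2c71a335545)"
    ),
}

owner = parse_cited_owner(sample)
print(owner)
```

Extracting the project, name, and UID separately makes it easy to check whether the cited owner is in the same project as the affected Connector, which is the distinction between the two cases below.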

Two cases of this behavior have been observed in the field. They share the same user-visible pattern but the cited "owner" Connector is in different states in each.


Case 1 — cited owner does not exist in the same project

  • Active Connector: datum-connect-ltwr5, project drewr-y4nd1b, UID e43a8dd4-b10e-42de-b9ff-032788491a24.
  • Cited owner: datum-connect-jttwh, same project drewr-y4nd1b, UID 226a90b6-3cad-4242-9eff-c2c71a335545.
  • datumctl get connector datum-connect-jttwh --project drewr-y4nd1b -n default returns NotFound (not a permission error — same auth context that successfully resolves ltwr5).
  • IrohDNSPublished on ltwr5 was True; Owner from 2026-06-07T14:44:39Z and flipped to False; DeferredToOwner at 2026-06-07T18:17:58Z, after roughly 3.5 hours of active operation.
  • Live HTTPProxy tied to the affected Connector (tunnel-kl6gj, hostname talk-known-phk4r.datumproxy.net) returned HTTP 503 from the moment the flip was visible, until recovery.
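The "~3.5 hours" figure for Case 1 can be checked directly from the two timestamps quoted above. This is a small arithmetic sketch only; the helper name `parse_utc` is hypothetical.

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalize it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


became_owner = parse_utc("2026-06-07T14:44:39Z")
deferred = parse_utc("2026-06-07T18:17:58Z")

healthy_for = deferred - became_owner
print(healthy_for)                                    # 3:33:19
print(round(healthy_for.total_seconds() / 3600, 2))   # 3.56
```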

Case 2 — cited owner is a live Connector in a different user's project; ownership oscillates

  • Active Connector: datum-connect-mhxj5, project drewr-y4nd1b, account a...s@gmail.com.
  • Cited owner: datum-connect-dwq9z, project my-project-4h6v89, account d...s@datum.net. 87 days old. Confirmed to exist via datumctl get connector datum-connect-dwq9z --project my-project-4h6v89 -n default under d...s@datum.net's auth context.
  • Both Connectors hold the same iroh public key (c1469d2f5d1547edbdc2dcbd007d97f92be25e1efbbd9bb728a1d160797fd2b8) in status.connectionDetails.publicKey.id.
  • The flip from True; Owner to False; DeferredToOwner on mhxj5 occurred after the agent had been actively serving traffic for many hours.
  • Approximately 13 seconds after the runtime watch reported the flip and the agent disconnected, IrohDNSPublished on mhxj5 transitioned back to True; Owner (Last Transition Time: 2026-06-08T15:44:51Z). The arbitration result inverted in seconds, on its own.
  • At the time of the flip:
    • dwq9z's Lease had presumably been expired for an extended period (no agent active in my-project-4h6v89).
    • mhxj5's Lease was being renewed on schedule by an active heartbeat.
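The shared-key observation in Case 2 suggests a simple audit: group Connectors by public key and report any key with more than one holder. The sketch below is a hypothetical illustration in Python, not operator or CLI code. `ConnectorRef` and `find_key_collisions` are invented names, the inventory is hand-built from the identifiers in this issue, and it assumes the caller can already list Connectors across both projects (which in practice needs both auth contexts).

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class ConnectorRef(NamedTuple):
    project: str
    name: str
    public_key: str  # value of status.connectionDetails.publicKey.id


def find_key_collisions(connectors: List[ConnectorRef]) -> Dict[str, List[str]]:
    """Map each public key held by more than one Connector to its holders."""
    holders = defaultdict(list)
    for c in connectors:
        holders[c.public_key].append(f"{c.project}/{c.name}")
    return {key: refs for key, refs in holders.items() if len(refs) > 1}


KEY = "c1469d2f5d1547edbdc2dcbd007d97f92be25e1efbbd9bb728a1d160797fd2b8"
inventory = [
    ConnectorRef("drewr-y4nd1b", "datum-connect-mhxj5", KEY),
    ConnectorRef("my-project-4h6v89", "datum-connect-dwq9z", KEY),
    # Hypothetical entry with a made-up key, included only as a non-colliding control.
    ConnectorRef("example-project", "example-connector", "0" * 64),
]

collisions = find_key_collisions(inventory)
for key, refs in collisions.items():
    print(key[:12], "held by", refs)
```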

Common pattern across both cases

  • The affected Connector has been actively serving traffic, with Ready=True and a fresh Lease, at the moment of the flip.
  • The flip is unilateral from the operator side; no client request or resource mutation triggers it.
  • The flip causes data-plane breakage (5xx at the edge) for as long as the Connector remains in DeferredToOwner.

Repro coordinates

Useful for an operator-side investigation:

| Field | Case 1 | Case 2 |
| --- | --- | --- |
| Affected Connector | drewr-y4nd1b/datum-connect-ltwr5 | drewr-y4nd1b/datum-connect-mhxj5 |
| Affected Connector UID | e43a8dd4-b10e-42de-b9ff-032788491a24 | (not captured; can re-query) |
| Cited "owner" | drewr-y4nd1b/datum-connect-jttwh | my-project-4h6v89/datum-connect-dwq9z |
| "Owner" UID | 226a90b6-3cad-4242-9eff-c2c71a335545 | c0968142-c465-440a-ad40-fdb797ebf366 |
| "Owner" current existence | Does not exist (NotFound) | Exists, 87d old, in different user's project |
| Flip detected at | 2026-06-07T18:17:58Z | ~2026-06-08T15:44:38Z |
| Subsequent flip back | Not observed (agent shut down via runtime watch) | 2026-06-08T15:44:51Z (after agent shut down) |
| Live tunnel impact | HTTP 503 on talk-known-phk4r.datumproxy.net | Runtime watch caught the flip and exited the agent |

Why this is filed

Investigation notes (proposed mechanisms, narrowing the cases) are in the comments below, where they belong as evidence is gathered — not in the issue body. The behavior described above is what's observed; the causes will be characterized as the comments accumulate.

The CLI-side mitigation that catches the flip and exits the agent cleanly is datum-cloud/app@6264818 (runtime watch, fail-fast on DeferredToOwner). The CLI-side recovery path is datum-cloud/app@76987b7 (per-project listen_key, prevents future cross-project collisions within a single user's accounts).
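For readers who have not looked at the mitigation commit, the fail-fast idea can be sketched as a check for the specific True; Owner to False; DeferredToOwner transition in a stream of observed conditions. This is an assumed, simplified model written in Python for illustration; it is not the code in datum-cloud/app@6264818, and the names `Observation` and `first_ownership_loss` are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# (status, reason) of the IrohDNSPublished condition at each observation.
Observation = Tuple[str, str]


def first_ownership_loss(observations: Iterable[Observation]) -> Optional[int]:
    """Index of the first True;Owner -> False;DeferredToOwner transition, or None."""
    previous = None
    for index, current in enumerate(observations):
        if previous == ("True", "Owner") and current == ("False", "DeferredToOwner"):
            return index
        previous = current
    return None


# Shape of the Case 2 sequence: serving, flip, then ownership returns on its own.
case2 = [
    ("True", "Owner"),
    ("True", "Owner"),
    ("False", "DeferredToOwner"),
    ("True", "Owner"),
]
print(first_ownership_loss(case2))  # 2
```

A watcher built this way would exit at the first flip, which matches the Case 2 report that the agent had already shut down when ownership returned about 13 seconds later.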
