Fix tunnels staying offline at the edge after a connector connects #211

Merged
scotwells merged 1 commit into main from fix/connector-209-edge-retranslation
Jun 23, 2026

@scotwells scotwells commented Jun 23, 2026

What this fixes

A tunnel connector could come online but the edge would keep serving HTTP 503 for it — the data plane never picked up that the tunnel was ready. The interim workaround was a CLI re-patch to force it.

Fixes #209.

Why it happened

The connector's live status reaches the edge correctly, but Envoy Gateway (EG) wasn't re-translating against it. EG re-runs its translation, and the extension hook that programs tunnels, only when a resource it watches changes in a way its watch predicates accept. EG watches Connector with a generation-only predicate, and the liveness arrives as an annotation update that leaves the generation unchanged, so a Connector update doesn't qualify. Fresh liveness therefore sat in the extension server's cache while EG kept serving the stale (offline) program.
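
To make the filtering concrete, here is a minimal, stdlib-only sketch of how a generation-only predicate drops an annotation-only update. The `ObjectMeta` type, the `generationChanged` function, and the annotation key are illustrative stand-ins, not the project's or Envoy Gateway's actual code.

```go
package main

import "fmt"

// ObjectMeta is a hypothetical, trimmed-down stand-in for Kubernetes object metadata.
type ObjectMeta struct {
	Generation  int64
	Annotations map[string]string
}

// generationChanged mimics a generation-only update predicate: it admits an
// update event only when metadata.generation differs between old and new.
func generationChanged(oldObj, newObj ObjectMeta) bool {
	return oldObj.Generation != newObj.Generation
}

func main() {
	// Illustrative key only; the real annotation name is not shown in this PR.
	const statusKey = "example.com/upstream-status"

	before := ObjectMeta{Generation: 3, Annotations: map[string]string{statusKey: "offline"}}
	// Liveness flips through an annotation, so the generation stays the same.
	after := ObjectMeta{Generation: 3, Annotations: map[string]string{statusKey: "online"}}

	fmt.Println("reconcile triggered:", generationChanged(before, after))
}
```

Under these assumptions the update is filtered out even though the annotation value changed, which matches the stale-program symptom described above.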

What changed

  • A small controller co-located with the extension server watches the Connector and, when its liveness changes, nudges the owning Gateway so EG re-translates — after the fresh liveness is already in the local cache.
  • Removed the earlier trigger that ran from the project control plane and raced the connector status's propagation to the edge.
  • RBAC: the extension server may now patch Gateways.
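
The first bullet can be sketched roughly as follows. This is a simplified, stdlib-only illustration, not the controller added in this PR: `GatewayToucher`, `LivenessController`, `fakeGateways`, and `ErrGatewayNotFound` are hypothetical names, and treating a missing Gateway as a no-op is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrGatewayNotFound is a hypothetical sentinel for a missing owning Gateway.
var ErrGatewayNotFound = errors.New("gateway not found")

// GatewayToucher is an assumed abstraction over "patch the Gateway so that
// Envoy Gateway re-translates"; a real controller would call the API server.
type GatewayToucher interface {
	Touch(gateway, liveness string) error
}

// LivenessController tracks the last liveness it acted on per connector.
type LivenessController struct {
	gateways GatewayToucher
	lastSeen map[string]string
}

func NewLivenessController(g GatewayToucher) *LivenessController {
	return &LivenessController{gateways: g, lastSeen: map[string]string{}}
}

// Reconcile is assumed to run after the local cache already holds the new
// liveness. It reports whether the owning Gateway was touched.
func (c *LivenessController) Reconcile(connector, gateway, liveness string) (bool, error) {
	if prev, ok := c.lastSeen[connector]; ok && prev == liveness {
		return false, nil // unchanged liveness: no nudge
	}
	if err := c.gateways.Touch(gateway, liveness); err != nil {
		if errors.Is(err, ErrGatewayNotFound) {
			return false, nil // nothing to nudge yet; not treated as a failure here
		}
		return false, err
	}
	c.lastSeen[connector] = liveness
	return true, nil
}

// fakeGateways is an in-memory stand-in used only for this sketch.
type fakeGateways struct {
	known   map[string]bool
	touches []string
}

func (f *fakeGateways) Touch(gateway, liveness string) error {
	if !f.known[gateway] {
		return ErrGatewayNotFound
	}
	f.touches = append(f.touches, gateway+"="+liveness)
	return nil
}

func main() {
	gws := &fakeGateways{known: map[string]bool{"gw-a": true}}
	ctrl := NewLivenessController(gws)

	// The repeated "online" event should not produce a second touch.
	for _, liveness := range []string{"online", "online", "offline"} {
		touched, err := ctrl.Reconcile("conn-1", "gw-a", liveness)
		fmt.Println(liveness, touched, err)
	}
}
```

Keeping the "did liveness change" check inside the controller is what limits Gateway patches to real transitions rather than every Connector event.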

Impact

  • Connectors come online at the edge on their own, within a few seconds of connecting — no manual step.
  • The connect-lib refresh_connection_details() workaround is no longer required (safe to leave or remove).

Testing

  • Unit tests for the new controller (liveness → Gateway touch, offline value, missing Gateway, change-only triggering).
  • Existing connector and replicator suites updated and passing.

Follow-ups (not in this PR)

  • Leader election for the new controller (it runs on every replica today; patches are idempotent).
  • An e2e test for the edge re-translation path.
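
As a rough illustration of why running on every replica is tolerable without leader election, the sketch below assumes the patch writes a value derived only from the observed liveness (rather than, say, a timestamp), so repeated applications converge on the same object. The `Gateway` type, `applyLiveness`, and the annotation key are hypothetical.

```go
package main

import "fmt"

// Gateway is a hypothetical, trimmed-down stand-in for the real resource.
type Gateway struct {
	Annotations map[string]string
}

// applyLiveness writes a value derived only from the observed liveness and
// reports whether the object actually changed.
func applyLiveness(gw *Gateway, key, liveness string) bool {
	if gw.Annotations == nil {
		gw.Annotations = map[string]string{}
	}
	if gw.Annotations[key] == liveness {
		return false // same desired state already present: nothing to do
	}
	gw.Annotations[key] = liveness
	return true
}

func main() {
	const key = "example.com/connector-liveness" // illustrative key only
	gw := &Gateway{}

	// Three replicas observe the same transition and each apply the patch.
	for replica := 1; replica <= 3; replica++ {
		changed := applyLiveness(gw, key, "online")
		fmt.Printf("replica %d changed=%v\n", replica, changed)
	}
}
```

Only the first application changes the object in this model; later ones are no-ops, so duplicate work costs API calls but not correctness.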

🤖 Generated with Claude Code

@scotwells scotwells force-pushed the fix/connector-209-edge-retranslation branch from d08d4d8 to 14684dd June 23, 2026 16:22
…#209)

When a connector comes online (Ready False→True), its liveness reaches the
edge via the upstream-status annotation, but Envoy Gateway did not re-translate
against it. EG watches Connector with a generation-only predicate, so the
annotation change is ignored, and the previous project-side Gateway annotation
touch raced the annotation's hub→edge propagation. EG could translate while the
edge still saw the connector offline, serving 503 with no recovery.

Add an edge-local controller in the extension-server process that watches the
replicated Connector and touches the owning Gateway when liveness changes —
after the new liveness is already in the shared cache — forcing EG to
re-translate against fresh data. Remove the racy project-side touch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@scotwells scotwells force-pushed the fix/connector-209-edge-retranslation branch from 14684dd to 18d3516 June 23, 2026 16:28
@scotwells scotwells changed the title from "Fix tunnels staying offline at the edge after a connector connects (#209)" to "Fix tunnels staying offline at the edge after a connector connects" Jun 23, 2026
@scotwells scotwells marked this pull request as ready for review June 23, 2026 16:32
@ecv ecv self-requested a review June 23, 2026 17:06

@ecv ecv left a comment

ok

Development

Successfully merging this pull request may close these issues.

Connector upstream-status annotation not re-mirrored after Ready condition flips True
