Skip to content

feat(controller/node): publishable P2P — per-pod NLB Service + ExternalAddress gate (#360 PR-2)#362

Closed
bdchatham wants to merge 1 commit into
mainfrom
feat/publishable-p2p-service
Closed

feat(controller/node): publishable P2P — per-pod NLB Service + ExternalAddress gate (#360 PR-2)#362
bdchatham wants to merge 1 commit into
mainfrom
feat/publishable-p2p-service

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Summary

PR-2 of the publishable-P2P series (#360). PR-1 (#361) landed the contract; this PR wires the controller path that actually provisions the per-pod NLB Service and writes Status.ExternalAddress.

What this PR delivers

internal/controller/node/external_address.go (NEW)reconcileExternalAddress runs before reconcileStatefulSet:

  • Parent SND lookup via metav1.GetControllerOf (single source of truth on SND.Spec.Networking.TCP, no field on SeiNodeSpec).
  • Opt-out path — no TCP → delete Service, clear Status.ExternalAddress, PublishableReady=False/PublishableDisabled, gate=open.
  • Fail-closed when SEI_VPC_CIDR unset — the safety net for harbor's Cilium CGNAT where target-type: ip can't route.
  • Pod-IP routability check — pod outside CIDR → fail closed; pod absent (first reconcile) → stamp anyway, recheck next round.
  • Per-pod LB Service server-side-applied with the AWS LBC contract (type=nlb, scheme=internet-facing, nlb-target-type=ip, cross-zone-load-balancing-enabled=true), TCP/26656, externalTrafficPolicy: Local, selector via statefulset.kubernetes.io/pod-name=<seinode>-0.
  • ExternalAddress writerService.Status.LoadBalancer.Ingress[0].HostnameStatus.ExternalAddress = "<host>:26656". Empty-host guard: don't clear on transient absence; only opt-out clears.

internal/controller/node/controller.go — gate call before reconcileStatefulSet; Watches(&SeiNodeDeployment{}) with UID-filtered mapper so children re-reconcile when the parent toggles Networking.TCP; RBAC for seinodedeployments get/list/watch.

api/v1alpha1/seinode_types.goConditionPublishableReady + 5 reason constants.

internal/controller/nodedeployment/{networking,status}.go — three call sites migrated from Networking == nil to !Networking.HTTPEnabled(). Without this, manifests setting only networking.tcp: {} would spuriously stamp HTTPRoutes + ClusterIP from the SND reconciler.

Backward compatibility

  • networking: nil — unchanged.
  • networking: {} — unchanged (HTTPEnabled=true via backcompat).
  • networking: { http: {} } — unchanged (HTTPEnabled=true).
  • networking: { tcp: {} }now correctly does NOT trigger HTTPRoutes (was a latent bug after PR-1 if anyone set this).
  • networking: { http: {}, tcp: {} } — both layers active.

Test plan

  • 12 unit tests in external_address_test.go — standalone, opt-out, opt-out cleanup, VPC unset (fail-closed), pod outside CIDR (fail-closed), Service stamping + gate held closed, hostname populated, empty-host guard, mapper.
  • Envtest TestPublishableP2P_HoldsGateUntilHostnamePopulates — full bootstrap-gate happy path + opt-out cleanup transition.
  • go test ./api/v1alpha1/... ./internal/... — all green.
  • make test-integration — envtest suite green (59s).
  • golangci-lint run ./internal/controller/node/... ./internal/controller/nodedeployment/... ./api/v1alpha1/... — no new issues.
  • make manifests generate — no diff (no new CRD spec fields; new RBAC marker collapsed into existing role).
  • CI on this PR.

Out of scope (separate follow-ups)

Refs

🤖 Generated with Claude Code

…alAddress gate

PR-2 of the publishable-P2P series (#360).
Wires up the SeiNode reconciler to stamp a per-pod L4 LoadBalancer
Service when the parent SND has Networking.TCP set, and holds
StatefulSet creation until Status.ExternalAddress is populated so the
first pod's config.toml carries the correct p2p.external_address.

internal/controller/node/external_address.go (NEW)
- reconcileExternalAddress runs before reconcileStatefulSet:
  * Parent SND lookup via OwnerReference (no Spec.Networking on
    SeiNode itself — single source of truth on the SND).
  * Opt-out path: no TCP → delete Service, clear ExternalAddress,
    PublishableReady=False/PublishableDisabled, gate=open.
  * Fail-closed when SEI_VPC_CIDR unset (the safety net for harbor's
    Cilium CGNAT where target-type=ip can't route).
  * Pod-IP routability check: pod outside CIDR → fail closed; pod
    absent (first reconcile) → stamp anyway, recheck next round.
  * Server-side apply the per-pod LoadBalancer Service with the AWS
    LBC contract (type=nlb, scheme=internet-facing, target-type=ip,
    cross-zone=true), TCP/26656, externalTrafficPolicy=Local, pod-name
    selector targeting ordinal-0.
  * Read svc.Status.LoadBalancer.Ingress[0].Hostname → write
    Status.ExternalAddress = "<host>:26656". Empty-host guard: don't
    clear an existing value on transient absence; only opt-out clears.

internal/controller/node/controller.go
- runExternalAddressGate call slotted in before reconcileStatefulSet;
  gate-held returns short-circuit Reconcile after a status flush.
- Watches(&SeiNodeDeployment{}) with a UID-filtered mapper to wake
  child SeiNodes when the parent toggles Networking.TCP.
- +kubebuilder:rbac for seinodedeployments get/list/watch.

api/v1alpha1/seinode_types.go
- ConditionPublishableReady + 5 reason constants (Disabled,
  VPCCIDRNotConfigured, PodOutsideVPCCIDR, ServiceProvisioning,
  ExternalAddressReady) for the gate's status surface.

internal/controller/nodedeployment/{networking,status}.go
- Migrate three call sites from `Networking == nil` to
  `!Networking.HTTPEnabled()`. Required so manifests that set only
  `networking.tcp: {}` (no http sub-struct) don't get spurious
  HTTPRoutes / ClusterIP from the SND reconciler. Legacy semantics
  preserved for `networking: {}` and `networking: { http: {} }`.

internal/controller/nodedeployment/envtest/fixtures/snd.go
- WithPublishableP2P() option for envtest fixtures.

Tests
- 12 unit tests in external_address_test.go covering all branches
  (standalone, opt-out, opt-out cleanup, VPC unset, pod outside,
  Service stamping, hostname populated, empty-host guard, mapper).
- New envtest covers the full bootstrap-gate happy path and opt-out
  cleanup transition.

Out of scope (separate follow-up PRs):
- LabelPeerSource resolver preference for Status.ExternalAddress.
- Terraform SG opening for tcp/26656.
- SEI_VPC_CIDR env on the platform-side manager-patch.yaml (controller
  defaults closed without it; rollout is gated on platform PR).

Refs #360
Design: sei-protocol/platform docs/designs/sei-publishable-p2p-nlb.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 28, 2026

PR Summary

Medium Risk
Changes node bootstrap ordering and cloud networking (NLB Services) with env-dependent fail-closed behavior; misconfigured SEI_VPC_CIDR blocks rollout until platform wiring lands.

Overview
When a parent SeiNodeDeployment enables Networking.TCP, the SeiNode controller now provisions a per-pod internet-facing AWS NLB Service (<seinode>-p2p, TCP 26656) and blocks StatefulSet creation until Status.ExternalAddress is set from the NLB hostname—so the first pod’s CometBFT config gets the right p2p.external_address.

The flow is fail-closed without SEI_VPC_CIDR, refuses non–VPC-routable pod IPs (CGNAT), surfaces PublishableReady with reason constants, and only clears ExternalAddress on intentional TCP opt-out (transient empty LB hostname is preserved). SND spec changes re-queue children via a new SeiNodeDeployment watch; SND HTTP networking now keys off HTTPEnabled() so TCP-only configs no longer create HTTPRoutes.

Reviewed by Cursor Bugbot for commit c1bbf43. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c1bbf43. Configure here.

// fires from the existing Owns(&Service{}) watch on hostname update.
if done, err := r.runExternalAddressGate(ctx, node, flushStatus); done {
return ctrl.Result{}, err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PublishableReady condition not seeded before early-return paths

Low Severity

ConditionPublishableReady is first set inside reconcileExternalAddress, which executes after the PhaseFailed early-return at line 108. The setNodePausedCondition call correctly sits before that exit (line 99), but PublishableReady has no equivalent seed. A SeiNode that enters PhaseFailed before the gate ever runs will never have the condition present, violating the always-present discipline the envtest at line 162–169 intends to enforce.

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by learned rule: CRD conditions must follow School 1 doctrine — always-present, seeded on first reconcile

Reviewed by Cursor Bugbot for commit c1bbf43. Configure here.

// fires from the existing Owns(&Service{}) watch on hostname update.
if done, err := r.runExternalAddressGate(ctx, node, flushStatus); done {
return ctrl.Result{}, err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Publishable gate blocks pause from reducing replicas

Medium Severity

When the publishable gate holds closed (hostname not yet populated), the reconcile short-circuits before reconcileStatefulSet. Since paused nodes rely on reconcileStatefulSet to set Replicas=0, pausing a node during the ~2-min hostname-provisioning window (or while SEI_VPC_CIDR is unset) has no effect — the pod keeps running until the gate opens. The gate doesn't distinguish between first-boot (no STS yet) and steady-state (STS already exists and needs mutation).

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by learned rule: Convergence-tracking status fields must not advance while SND is paused

Reviewed by Cursor Bugbot for commit c1bbf43. Configure here.

@bdchatham
Copy link
Copy Markdown
Collaborator Author

Closing in favor of an architectural reframing surfaced during peer review.

The current PR puts publishable-P2P networking ownership in the SeiNode reconciler, which is the wrong layer — networking is consistently a SeiNodeDeployment concern in this codebase (HTTPRoutes and the internal ClusterIP are both SND-owned). The per-pod LB Service should live with them. The current PR also took a raw NLB hostname rather than an external-dns-fronted deterministic vanity hostname, which forced the bootstrap Service-ready gate and the per-reconcile validation gates. With a deterministic hostname known at SND-create time, the gate dissolves.

Replacement PR will:

  • Move per-child Service stamping to the SND reconciler
  • Use external-dns + a deterministic <seinode>-p2p.platform.sei.io hostname so Status.ExternalAddress is known up front
  • Drop runExternalAddressGate and the parent-SND lookup
  • Lift SEI_VPC_CIDR validation from per-reconcile to startup (or admission)

The call-site migration from Networking == nil to !HTTPEnabled() will land as its own small PR since it is cleanly correct standalone.

Tracking on #360.

@bdchatham bdchatham closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant