feat(controller/node): publishable P2P — per-pod NLB Service + ExternalAddress gate (#360 PR-2)#362
feat(controller/node): publishable P2P — per-pod NLB Service + ExternalAddress gate (#360 PR-2)#362bdchatham wants to merge 1 commit into
Conversation
…alAddress gate PR-2 of the publishable-P2P series (#360). Wires up the SeiNode reconciler to stamp a per-pod L4 LoadBalancer Service when the parent SND has Networking.TCP set, and holds StatefulSet creation until Status.ExternalAddress is populated so the first pod's config.toml carries the correct p2p.external_address. internal/controller/node/external_address.go (NEW) - reconcileExternalAddress runs before reconcileStatefulSet: * Parent SND lookup via OwnerReference (no Spec.Networking on SeiNode itself — single source of truth on the SND). * Opt-out path: no TCP → delete Service, clear ExternalAddress, PublishableReady=False/PublishableDisabled, gate=open. * Fail-closed when SEI_VPC_CIDR unset (the safety net for harbor's Cilium CGNAT where target-type=ip can't route). * Pod-IP routability check: pod outside CIDR → fail closed; pod absent (first reconcile) → stamp anyway, recheck next round. * Server-side apply the per-pod LoadBalancer Service with the AWS LBC contract (type=nlb, scheme=internet-facing, target-type=ip, cross-zone=true), TCP/26656, externalTrafficPolicy=Local, pod-name selector targeting ordinal-0. * Read svc.Status.LoadBalancer.Ingress[0].Hostname → write Status.ExternalAddress = "<host>:26656". Empty-host guard: don't clear an existing value on transient absence; only opt-out clears. internal/controller/node/controller.go - runExternalAddressGate call slotted in before reconcileStatefulSet; gate-held returns short-circuit Reconcile after a status flush. - Watches(&SeiNodeDeployment{}) with a UID-filtered mapper to wake child SeiNodes when the parent toggles Networking.TCP. - +kubebuilder:rbac for seinodedeployments get/list/watch. api/v1alpha1/seinode_types.go - ConditionPublishableReady + 5 reason constants (Disabled, VPCCIDRNotConfigured, PodOutsideVPCCIDR, ServiceProvisioning, ExternalAddressReady) for the gate's status surface. internal/controller/nodedeployment/{networking,status}.go - Migrate three call sites from `Networking == nil` to `!Networking.HTTPEnabled()`. Required so manifests that set only `networking.tcp: {}` (no http sub-struct) don't get spurious HTTPRoutes / ClusterIP from the SND reconciler. Legacy semantics preserved for `networking: {}` and `networking: { http: {} }`. internal/controller/nodedeployment/envtest/fixtures/snd.go - WithPublishableP2P() option for envtest fixtures. Tests - 12 unit tests in external_address_test.go covering all branches (standalone, opt-out, opt-out cleanup, VPC unset, pod outside, Service stamping, hostname populated, empty-host guard, mapper). - New envtest covers the full bootstrap-gate happy path and opt-out cleanup transition. Out of scope (separate follow-up PRs): - LabelPeerSource resolver preference for Status.ExternalAddress. - Terraform SG opening for tcp/26656. - SEI_VPC_CIDR env on the platform-side manager-patch.yaml (controller defaults closed without it; rollout is gated on platform PR). Refs #360 Design: sei-protocol/platform docs/designs/sei-publishable-p2p-nlb.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR SummaryMedium Risk Overview The flow is fail-closed without Reviewed by Cursor Bugbot for commit c1bbf43. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c1bbf43. Configure here.
| // fires from the existing Owns(&Service{}) watch on hostname update. | ||
| if done, err := r.runExternalAddressGate(ctx, node, flushStatus); done { | ||
| return ctrl.Result{}, err | ||
| } |
There was a problem hiding this comment.
PublishableReady condition not seeded before early-return paths
Low Severity
ConditionPublishableReady is first set inside reconcileExternalAddress, which executes after the PhaseFailed early-return at line 108. The setNodePausedCondition call correctly sits before that exit (line 99), but PublishableReady has no equivalent seed. A SeiNode that enters PhaseFailed before the gate ever runs will never have the condition present, violating the always-present discipline the envtest at line 162–169 intends to enforce.
Additional Locations (1)
Triggered by learned rule: CRD conditions must follow School 1 doctrine — always-present, seeded on first reconcile
Reviewed by Cursor Bugbot for commit c1bbf43. Configure here.
| // fires from the existing Owns(&Service{}) watch on hostname update. | ||
| if done, err := r.runExternalAddressGate(ctx, node, flushStatus); done { | ||
| return ctrl.Result{}, err | ||
| } |
There was a problem hiding this comment.
Publishable gate blocks pause from reducing replicas
Medium Severity
When the publishable gate holds closed (hostname not yet populated), the reconcile short-circuits before reconcileStatefulSet. Since paused nodes rely on reconcileStatefulSet to set Replicas=0, pausing a node during the ~2-min hostname-provisioning window (or while SEI_VPC_CIDR is unset) has no effect — the pod keeps running until the gate opens. The gate doesn't distinguish between first-boot (no STS yet) and steady-state (STS already exists and needs mutation).
Additional Locations (1)
Triggered by learned rule: Convergence-tracking status fields must not advance while SND is paused
Reviewed by Cursor Bugbot for commit c1bbf43. Configure here.
|
Closing in favor of an architectural reframing surfaced during peer review. The current PR puts publishable-P2P networking ownership in the SeiNode reconciler, which is the wrong layer — networking is consistently a SeiNodeDeployment concern in this codebase (HTTPRoutes and the internal ClusterIP are both SND-owned). The per-pod LB Service should live with them. The current PR also took a raw NLB hostname rather than an external-dns-fronted deterministic vanity hostname, which forced the bootstrap Service-ready gate and the per-reconcile validation gates. With a deterministic hostname known at SND-create time, the gate dissolves. Replacement PR will:
The call-site migration from Tracking on #360. |


Summary
PR-2 of the publishable-P2P series (#360). PR-1 (#361) landed the contract; this PR wires the controller path that actually provisions the per-pod NLB Service and writes
Status.ExternalAddress.What this PR delivers
internal/controller/node/external_address.go(NEW) —reconcileExternalAddressruns beforereconcileStatefulSet:metav1.GetControllerOf(single source of truth onSND.Spec.Networking.TCP, no field onSeiNodeSpec).Status.ExternalAddress,PublishableReady=False/PublishableDisabled, gate=open.SEI_VPC_CIDRunset — the safety net for harbor's Cilium CGNAT wheretarget-type: ipcan't route.type=nlb,scheme=internet-facing,nlb-target-type=ip,cross-zone-load-balancing-enabled=true), TCP/26656,externalTrafficPolicy: Local, selector viastatefulset.kubernetes.io/pod-name=<seinode>-0.Service.Status.LoadBalancer.Ingress[0].Hostname→Status.ExternalAddress = "<host>:26656". Empty-host guard: don't clear on transient absence; only opt-out clears.internal/controller/node/controller.go— gate call beforereconcileStatefulSet;Watches(&SeiNodeDeployment{})with UID-filtered mapper so children re-reconcile when the parent togglesNetworking.TCP; RBAC forseinodedeploymentsget/list/watch.api/v1alpha1/seinode_types.go—ConditionPublishableReady+ 5 reason constants.internal/controller/nodedeployment/{networking,status}.go— three call sites migrated fromNetworking == nilto!Networking.HTTPEnabled(). Without this, manifests setting onlynetworking.tcp: {}would spuriously stamp HTTPRoutes + ClusterIP from the SND reconciler.Backward compatibility
networking: nil— unchanged.networking: {}— unchanged (HTTPEnabled=true via backcompat).networking: { http: {} }— unchanged (HTTPEnabled=true).networking: { tcp: {} }— now correctly does NOT trigger HTTPRoutes (was a latent bug after PR-1 if anyone set this).networking: { http: {}, tcp: {} }— both layers active.Test plan
external_address_test.go— standalone, opt-out, opt-out cleanup, VPC unset (fail-closed), pod outside CIDR (fail-closed), Service stamping + gate held closed, hostname populated, empty-host guard, mapper.TestPublishableP2P_HoldsGateUntilHostnamePopulates— full bootstrap-gate happy path + opt-out cleanup transition.go test ./api/v1alpha1/... ./internal/...— all green.make test-integration— envtest suite green (59s).golangci-lint run ./internal/controller/node/... ./internal/controller/nodedeployment/... ./api/v1alpha1/...— no new issues.make manifests generate— no diff (no new CRD spec fields; new RBAC marker collapsed into existing role).Out of scope (separate follow-ups)
LabelPeerSourceresolver enhancement — make it preferStatus.ExternalAddressover headless DNS when set (per the scope cut on feat(api): networking http/tcp sub-structs + seinodeDeployment peer variant (#360 PR-1) #361). Tracked as part of Implement publishable P2P (per-pod L4 NLB + ExternalAddress) #360.tcp/26656 from 0.0.0.0/0— platform PR.SEI_VPC_CIDRenv var onmanager-patch.yaml— platform PR. Until that lands, the feature is inert (condition surfacesVPCCIDRNotConfigured, gate held closed). Safe to merge this PR independently.Refs
sei-protocol/platformdocs/designs/sei-publishable-p2p-nlb.md🤖 Generated with Claude Code