[GEP-36] Implement SelfHostedShootExposure controller using STACKIT NLB#103
[GEP-36] Implement SelfHostedShootExposure controller using STACKIT NLB#103
SelfHostedShootExposure controller using STACKIT NLB#103Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
b0238cc to
f89d103
Compare
SelfHostedShootExposure controller using STACKIT NLB
|
Ah, I have some questions that came to my mind where I'm not sure what if I can completely handle it here or if we need a follow-up? @timebertt |
|
I added a way to provide |
There was a problem hiding this comment.
Pull request overview
This PR adds first-class support for Gardener’s GEP-36 SelfHostedShootExposure resource to the STACKIT provider extension by introducing a dedicated controller/actuator backed by STACKIT Network Load Balancer (NLB), along with unit and integration coverage and the necessary wiring (RBAC, Helm values, manager flags).
Changes:
- Implement
SelfHostedShootExposurecontroller/actuator logic, including STACKIT LB reconciliation (create/update fast-path for target pools, full PUT updates for plan/ACL changes, readiness handling). - Add provider config API types + validation for STACKIT-specific exposure config (LB plan + access control source ranges) and generate API reference/deepcopy updates.
- Add integration tests + envtest CRD, plus extension manager/helm wiring (controller switch, flags, RBAC, values, Makefile target).
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/integration/testdata/upstream-crds/10-crd-extensions.gardener.cloud_selfhostedshootexposures.yaml | Adds upstream CRD testdata for envtest-based integration testing. |
| test/integration/selfhostedshootexposure/stackit/selfhostedshootexposure_test.go | End-to-end integration test exercising LB create/update/delete via STACKIT APIs. |
| test/integration/selfhostedshootexposure/stackit/selfhostedshootexposure_suite_test.go | Ginkgo suite bootstrap for the integration tests. |
| pkg/stackit/client/mock/loadbalancing_mock.go | Extends the gomock LB client with create/update/target-pool update + ProjectID. |
| pkg/stackit/client/loadbalancing.go | Extends LB client interface and implementation to support create/update/target-pool update; treats NotFound on GET as nil result. |
| pkg/controller/selfhostedshootexposure/suite_test.go | Ginkgo suite bootstrap for controller unit tests. |
| pkg/controller/selfhostedshootexposure/resources_test.go | Unit tests for resource discovery behavior (getExistingResources). |
| pkg/controller/selfhostedshootexposure/resources_loadbalancer_test.go | Unit tests for LB reconcile/delete logic and helpers (targets/ACL comparisons). |
| pkg/controller/selfhostedshootexposure/resources_loadbalancer.go | Core LB reconciliation implementation (create/update fast path, readiness checks, ACL/plan handling). |
| pkg/controller/selfhostedshootexposure/resources.go | Resource container and existing-resource lookup for reconciliation. |
| pkg/controller/selfhostedshootexposure/options_test.go | Unit tests for option derivation (plan, labels, region mapping, CIDR validation). |
| pkg/controller/selfhostedshootexposure/options.go | Derives controller options from Cluster/Infrastructure/providerConfig; validates providerConfig. |
| pkg/controller/selfhostedshootexposure/add.go | Registers the controller with controller-runtime manager using Gardener extension framework. |
| pkg/controller/selfhostedshootexposure/actuator_test.go | Basic actuator tests for error paths / delegation behavior. |
| pkg/controller/selfhostedshootexposure/actuator.go | Implements Gardener actuator interface: credentials selection, reconcile/delete flows, ingresses from LB VIP. |
| pkg/cmd/options.go | Wires the new controller into the controller switch options. |
| pkg/apis/stackit/validation/selfhostedshootexposure_test.go | Unit tests for providerConfig validation (CIDR canonicality/parse, indexed errors). |
| pkg/apis/stackit/validation/selfhostedshootexposure.go | Validation logic for selfhosted exposure providerConfig. |
| pkg/apis/stackit/v1alpha1/zz_generated.deepcopy.go | Adds deepcopy implementations for new API types. |
| pkg/apis/stackit/v1alpha1/types_selfhostedshootexposure.go | Introduces STACKIT providerConfig API types for selfhosted exposure. |
| pkg/apis/stackit/v1alpha1/register.go | Registers new API types into scheme. |
| hack/api-reference/api.md | Updates generated API reference docs to include new types/fields. |
| docs/cloudprovider.md | Updates documentation to include permissions needed for the new controller. |
| cmd/gardener-extension-provider-stackit/app/app.go | Adds controller CLI options wiring (max concurrent reconciles + reconcile options plumbing). |
| charts/gardener-extension-provider-stackit/values.yaml | Adds Helm values for selfhostedshootexposure controller concurrency. |
| charts/gardener-extension-provider-stackit/templates/rbac.yaml | Extends RBAC for selfhostedshootexposures and status subresource. |
| charts/gardener-extension-provider-stackit/templates/deployment.yaml | Adds controller flag for max concurrent reconciles from Helm values. |
| Makefile | Adds test-integration-exposure target to run the new integration test suite. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/assign |
timebertt
left a comment
There was a problem hiding this comment.
Thanks for this PR!
Here is the first part of my review. I haven't quite made it through. E.g., I haven't reviewed the tests yet.
There was a problem hiding this comment.
The gardener-extension-provider-stackit/test/integration/selfhostedshootexposure/stackit is not registered as a prow check currently, I could add it if it's worth it (the tests take about 3 minutes or so).
If we add the test code here, we should also run it in CI to ensure the tests keep working.
| Namespace: exposure.Spec.CredentialsRef.Namespace, | ||
| } | ||
| } else { | ||
| // Fall back to the default cloud-provider secret |
There was a problem hiding this comment.
We don't need this fallback. If the shoot has managed infrastructure, Gardener will publish both the CredentialsRef and the cloudprovider secret. Otherwise, it won't publish either.
I.e., you can blindly pass the values from exposure.Spec.CredentialsRef to the client constructor.
There was a problem hiding this comment.
Adapted. I also was not sure how the WorkloadIdentity flow can be integrated here, but I think I got it from the STACKIT SDK Readme now and will handle it.
| ProjectID() string | ||
|
|
||
| ListLoadBalancers(ctx context.Context) ([]loadbalancer.LoadBalancer, error) | ||
| DeleteLoadBalancer(ctx context.Context, lbName string) error |
There was a problem hiding this comment.
Nit: can you sort the methods in the "usual order" like in the iaas client?
There was a problem hiding this comment.
I'm not sure I get the order. After ProjectID, sometimes it's Create -> Get -> Delete (Network), then it's Create -> Delete -> Get (Security Group), then Get -> Create -> Delete (Keypair)?
| GetLoadBalancer(ctx context.Context, id string) (*loadbalancer.LoadBalancer, error) | ||
| CreateLoadBalancer(ctx context.Context, payload loadbalancer.CreateLoadBalancerPayload) (*loadbalancer.LoadBalancer, error) | ||
| UpdateLoadBalancer(ctx context.Context, lbName string, payload loadbalancer.UpdateLoadBalancerPayload) (*loadbalancer.LoadBalancer, error) | ||
| UpdateLoadBalancerTargetPool(ctx context.Context, lbName, tpName string, payload loadbalancer.UpdateTargetPoolPayload) (*loadbalancer.TargetPool, error) |
There was a problem hiding this comment.
Try to avoid abbreviations. I would replace lbName with name/loadBalancerName and tpName with targetPool/targetPoolName
| lb, err := l.Client.GetLoadBalancer(ctx, l.projectID, l.region, lbName).Execute() | ||
| if err != nil { | ||
| if IsNotFound(err) { | ||
| return nil, nil |
There was a problem hiding this comment.
Code calling the client might want to (and should) handle a NotFound error explicitly. Let's remove the handling here to allow that.
|
|
||
| // targetsEqual compares two target lists for equality. | ||
| // Both lists should be sorted by IP for correct comparison. | ||
| func targetsEqual(spec, lb []loadbalancer.Target) bool { |
There was a problem hiding this comment.
Consider using cmp.Equal instead of manually implementing this function. See
| // The LB API may or may not de-duplicate (causing reconciliation loop), remove duplicates | ||
| // for a clean approach. | ||
| func stringSetsEqual(a, b []string) bool { | ||
| return slices.Equal(sortedUnique(a), sortedUnique(b)) |
There was a problem hiding this comment.
I expect this to be more efficient:
return sets.New(a...).Equal(sets.New(b...))| targetPoolName, | ||
| loadbalancer.UpdateTargetPoolPayload{ | ||
| Name: new(targetPoolName), | ||
| TargetPort: &r.SelfHostedShootExposure.Spec.Port, |
There was a problem hiding this comment.
As a general rule, avoid sharing memory with k8s API objects. I.e., instead of referencing (&), create a new value on the stack and reference it (new(...) or ptr.To(...)).
| // Re-read the LB so downstream readiness checks don't see the pre-write status (STACKIT | ||
| // transitions the LB to STATUS_PENDING on any write). UpdateLoadBalancerTargetPool only | ||
| // returns the TargetPool, so a full GET is needed. | ||
| refreshed, err := r.LBClient.GetLoadBalancer(ctx, r.ResourceName) |
There was a problem hiding this comment.
Reading this, I'm asking myself whether using the UpdateLoadBalancerTargetPool is actually the "fast path". Using the UpdateLoadBalancer method spares a second API call, right?
| if err != nil { | ||
| return fmt.Errorf("error refreshing load balancer after target pool update: %w", err) | ||
| } | ||
| r.LoadBalancer = refreshed |
There was a problem hiding this comment.
Consider assigning to r.LoadBalancer directly instead of saving in an intermediate variable.
timebertt
left a comment
There was a problem hiding this comment.
Made it through the PR. Let me know if you want to discuss any comments in a chat :)
| // Endpoints are populated asynchronously by gardenlet; empty endpoints should trigger a clean requeue, | ||
| // not a fatal error. | ||
| var rae *reconcilerutils.RequeueAfterError | ||
| Expect(errors.As(err, &rae)).To(BeTrue()) |
There was a problem hiding this comment.
Nit: you could use errors.Is here
| // not a fatal error. | ||
| var rae *reconcilerutils.RequeueAfterError | ||
| Expect(errors.As(err, &rae)).To(BeTrue()) | ||
| Expect(rae.Cause.Error()).To(ContainSubstring("waiting for endpoints to be populated")) |
There was a problem hiding this comment.
Nit: you could use MatchError() here
| @@ -0,0 +1,396 @@ | |||
| --- | |||
There was a problem hiding this comment.
You should add this file to make gardener-crds
| $(INFRA_TEST_FLAGS) | ||
|
|
||
| .PHONY: test-integration-exposure | ||
| test-integration-exposure: $(REPORT_COLLECTOR) $(SETUP_ENVTEST) $(GINKGO) ## Run selfhostedshootexposure integration tests |
There was a problem hiding this comment.
Should be documented in testing.md as well
| httpClient, err := rest.HTTPClientFor(restConfig) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| mapper, err := apiutil.NewDynamicRESTMapper(restConfig, httpClient) | ||
| Expect(err).NotTo(HaveOccurred()) |
There was a problem hiding this comment.
Setting up a dedicated/custom http client and rest mapper seems unnecessary
| opts := &Options{ | ||
| SelfHostedShootExposure: exposure, | ||
| ProjectID: projectID, | ||
| ResourceName: fmt.Sprintf("%s-exposure-%s", cluster.Shoot.Status.TechnicalID, exposure.Name), |
There was a problem hiding this comment.
On a second thought, keeping this exposure.Name part would ensure we don't need any migrations or compatibility code when Gardener later on finds another use case for creating multiple SelfHostedShootExposure objects.
|
|
||
| AfterEach(func() { | ||
| // Best-effort cleanup of LB via STACKIT API (in case the controller didn't delete it) | ||
| if lbName != "" { |
| AfterEach(func() { | ||
| // Best-effort cleanup of LB via STACKIT API (in case the controller didn't delete it) | ||
| if lbName != "" { | ||
| lb, _ := lbClient.GetLoadBalancer(ctx, lbName) |
There was a problem hiding this comment.
What about transient errors calling the API?
You should probably wrap such calls in test in an Eventually
| lb, _ := lbClient.GetLoadBalancer(ctx, lbName) | ||
| if lb != nil { | ||
| log.Info("Cleaning up leftover load balancer", "name", lbName) | ||
| _ = lbClient.DeleteLoadBalancer(ctx, lbName) |
There was a problem hiding this comment.
Same here, use Eventually to tolerate transient errors in API calls and wait for the load balancer to be deleted
| }).WithTimeout(5 * time.Minute).WithPolling(5 * time.Second).Should(Succeed()) | ||
|
|
||
| By("verify SelfHostedShootExposure CR is being reconciled") | ||
| Expect(c.Get(ctx, client.ObjectKeyFromObject(exposure), exposure)).To(Succeed()) |
There was a problem hiding this comment.
Use Eventually for waiting for controller actions on API object
How to categorize this PR?
/kind enhancement
What this PR does / why we need it:
As part of the Gardener GEP-36: Self-Hosted Shoot Exposure, this PR adds a controller for the
SelfHostedShootExposureresource in the extension.The controller fulfils the respective
SelfHostedShootExposurecontract and Extension Controller Interface described in the GEP. It used the STACKIT Network Loadbalancing service as backing service to expose the endpoints referenced in the resource Spec, which can be updated by the Gardenlet.Which issue(s) this PR fixes:
Part of gardener/gardener#13602
Special notes for your reviewer:
I tested this using two approaches:
DaemonSetnginxon a worker pool and providing the worker IPs in aSelfHostedShootExposurepicked up by the Extension, you may usemake mirrord-runfor that.The
gardener-extension-provider-stackit/test/integration/selfhostedshootexposure/stackitis not registered as a prow check currently, I could add it if it's worth it (the tests take about 3 minutes or so)./cc @stackitcloud/ske-gardener