Skip to content

[GEP-36] Implement SelfHostedShootExposure controller using STACKIT NLB#103

Open
jamand wants to merge 6 commits intomainfrom
feat/self-hosted-shoot-exposure-controller
Open

[GEP-36] Implement SelfHostedShootExposure controller using STACKIT NLB#103
jamand wants to merge 6 commits intomainfrom
feat/self-hosted-shoot-exposure-controller

Conversation

@jamand
Copy link
Copy Markdown
Member

@jamand jamand commented Apr 15, 2026

How to categorize this PR?

/kind enhancement

What this PR does / why we need it:

As part of the Gardener GEP-36: Self-Hosted Shoot Exposure, this PR adds a controller for the SelfHostedShootExposure resource in the extension.

The controller fulfils the respective SelfHostedShootExposure contract and Extension Controller Interface described in the GEP. It used the STACKIT Network Loadbalancing service as backing service to expose the endpoints referenced in the resource Spec, which can be updated by the Gardenlet.

Which issue(s) this PR fixes:
Part of gardener/gardener#13602

Special notes for your reviewer:

I tested this using two approaches:

  1. Manual testing via exposing a DaemonSet nginx on a worker pool and providing the worker IPs in a SelfHostedShootExposure picked up by the Extension, you may use make mirrord-run for that.
  2. Running integration tests I added (which test basic operations like creation, update of target pools, plan, deletion).

The gardener-extension-provider-stackit/test/integration/selfhostedshootexposure/stackit is not registered as a prow check currently, I could add it if it's worth it (the tests take about 3 minutes or so).

/cc @stackitcloud/ske-gardener

@ske-prow ske-prow Bot requested a review from a team April 15, 2026 11:47
@ske-prow ske-prow Bot added the kind/enhancement Enhancement, improvement, extension label Apr 15, 2026
@ske-prow
Copy link
Copy Markdown

ske-prow Bot commented Apr 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ftl for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ske-prow ske-prow Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 15, 2026
@jamand jamand force-pushed the feat/self-hosted-shoot-exposure-controller branch from b0238cc to f89d103 Compare April 15, 2026 13:22
@timebertt timebertt changed the title feat: Implement SelfHostedShootExposure controller using STACKIT NLB (GEP-36) [GEP-36] Implement SelfHostedShootExposure controller using STACKIT NLB Apr 16, 2026
@jamand
Copy link
Copy Markdown
Member Author

jamand commented Apr 16, 2026

Ah, I have some questions that came to my mind where I'm not sure what if I can completely handle it here or if we need a follow-up? @timebertt
As I've just finished working on gardener/gardener#14420 I realised that the control-plane behind self-hosted shoot exposure does not have any access control implemented now. It's not directly part of the spec. I could add it into the Provider-specific specs and do ACLs at the load balancer level. Then we wouldn't need to care about PROXY Protocol and needing to pass the Client IP through. As it's just the control plane and not different endpoints (like discovery server, ske-api) that may require different access levels, there is no hard need for that to my understanding. We just have to check how to use existing logic to update the ACLs for dynamically changing CIDRs like we do in other use cases.

@jamand
Copy link
Copy Markdown
Member Author

jamand commented Apr 16, 2026

I added a way to provide AllowedSourceRanges in the Provider specs for now (also setting, modifying, removal tested in the integration tests) in the last commit.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds first-class support for Gardener’s GEP-36 SelfHostedShootExposure resource to the STACKIT provider extension by introducing a dedicated controller/actuator backed by STACKIT Network Load Balancer (NLB), along with unit and integration coverage and the necessary wiring (RBAC, Helm values, manager flags).

Changes:

  • Implement SelfHostedShootExposure controller/actuator logic, including STACKIT LB reconciliation (create/update fast-path for target pools, full PUT updates for plan/ACL changes, readiness handling).
  • Add provider config API types + validation for STACKIT-specific exposure config (LB plan + access control source ranges) and generate API reference/deepcopy updates.
  • Add integration tests + envtest CRD, plus extension manager/helm wiring (controller switch, flags, RBAC, values, Makefile target).

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/integration/testdata/upstream-crds/10-crd-extensions.gardener.cloud_selfhostedshootexposures.yaml Adds upstream CRD testdata for envtest-based integration testing.
test/integration/selfhostedshootexposure/stackit/selfhostedshootexposure_test.go End-to-end integration test exercising LB create/update/delete via STACKIT APIs.
test/integration/selfhostedshootexposure/stackit/selfhostedshootexposure_suite_test.go Ginkgo suite bootstrap for the integration tests.
pkg/stackit/client/mock/loadbalancing_mock.go Extends the gomock LB client with create/update/target-pool update + ProjectID.
pkg/stackit/client/loadbalancing.go Extends LB client interface and implementation to support create/update/target-pool update; treats NotFound on GET as nil result.
pkg/controller/selfhostedshootexposure/suite_test.go Ginkgo suite bootstrap for controller unit tests.
pkg/controller/selfhostedshootexposure/resources_test.go Unit tests for resource discovery behavior (getExistingResources).
pkg/controller/selfhostedshootexposure/resources_loadbalancer_test.go Unit tests for LB reconcile/delete logic and helpers (targets/ACL comparisons).
pkg/controller/selfhostedshootexposure/resources_loadbalancer.go Core LB reconciliation implementation (create/update fast path, readiness checks, ACL/plan handling).
pkg/controller/selfhostedshootexposure/resources.go Resource container and existing-resource lookup for reconciliation.
pkg/controller/selfhostedshootexposure/options_test.go Unit tests for option derivation (plan, labels, region mapping, CIDR validation).
pkg/controller/selfhostedshootexposure/options.go Derives controller options from Cluster/Infrastructure/providerConfig; validates providerConfig.
pkg/controller/selfhostedshootexposure/add.go Registers the controller with controller-runtime manager using Gardener extension framework.
pkg/controller/selfhostedshootexposure/actuator_test.go Basic actuator tests for error paths / delegation behavior.
pkg/controller/selfhostedshootexposure/actuator.go Implements Gardener actuator interface: credentials selection, reconcile/delete flows, ingresses from LB VIP.
pkg/cmd/options.go Wires the new controller into the controller switch options.
pkg/apis/stackit/validation/selfhostedshootexposure_test.go Unit tests for providerConfig validation (CIDR canonicality/parse, indexed errors).
pkg/apis/stackit/validation/selfhostedshootexposure.go Validation logic for selfhosted exposure providerConfig.
pkg/apis/stackit/v1alpha1/zz_generated.deepcopy.go Adds deepcopy implementations for new API types.
pkg/apis/stackit/v1alpha1/types_selfhostedshootexposure.go Introduces STACKIT providerConfig API types for selfhosted exposure.
pkg/apis/stackit/v1alpha1/register.go Registers new API types into scheme.
hack/api-reference/api.md Updates generated API reference docs to include new types/fields.
docs/cloudprovider.md Updates documentation to include permissions needed for the new controller.
cmd/gardener-extension-provider-stackit/app/app.go Adds controller CLI options wiring (max concurrent reconciles + reconcile options plumbing).
charts/gardener-extension-provider-stackit/values.yaml Adds Helm values for selfhostedshootexposure controller concurrency.
charts/gardener-extension-provider-stackit/templates/rbac.yaml Extends RBAC for selfhostedshootexposures and status subresource.
charts/gardener-extension-provider-stackit/templates/deployment.yaml Adds controller flag for max concurrent reconciles from Helm values.
Makefile Adds test-integration-exposure target to run the new integration test suite.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/controller/selfhostedshootexposure/actuator.go Outdated
Comment thread docs/cloudprovider.md Outdated
Comment thread pkg/controller/selfhostedshootexposure/resources_loadbalancer.go Outdated
Comment thread pkg/controller/selfhostedshootexposure/resources_loadbalancer.go Outdated
@timebertt
Copy link
Copy Markdown
Member

/assign

Copy link
Copy Markdown
Member

@timebertt timebertt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR!
Here is the first part of my review. I haven't quite made it through. E.g., I haven't reviewed the tests yet.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gardener-extension-provider-stackit/test/integration/selfhostedshootexposure/stackit is not registered as a prow check currently, I could add it if it's worth it (the tests take about 3 minutes or so).

If we add the test code here, we should also run it in CI to ensure the tests keep working.

Namespace: exposure.Spec.CredentialsRef.Namespace,
}
} else {
// Fall back to the default cloud-provider secret
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this fallback. If the shoot has managed infrastructure, Gardener will publish both the CredentialsRef and the cloudprovider secret. Otherwise, it won't publish either.
I.e., you can blindly pass the values from exposure.Spec.CredentialsRef to the client constructor.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adapted. I also was not sure how the WorkloadIdentity flow can be integrated here, but I think I got it from the STACKIT SDK Readme now and will handle it.

ProjectID() string

ListLoadBalancers(ctx context.Context) ([]loadbalancer.LoadBalancer, error)
DeleteLoadBalancer(ctx context.Context, lbName string) error
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can you sort the methods in the "usual order" like in the iaas client?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I get the order. After ProjectID, sometimes it's Create -> Get -> Delete (Network), then it's Create -> Delete -> Get (Security Group), then Get -> Create -> Delete (Keypair)?

GetLoadBalancer(ctx context.Context, id string) (*loadbalancer.LoadBalancer, error)
CreateLoadBalancer(ctx context.Context, payload loadbalancer.CreateLoadBalancerPayload) (*loadbalancer.LoadBalancer, error)
UpdateLoadBalancer(ctx context.Context, lbName string, payload loadbalancer.UpdateLoadBalancerPayload) (*loadbalancer.LoadBalancer, error)
UpdateLoadBalancerTargetPool(ctx context.Context, lbName, tpName string, payload loadbalancer.UpdateTargetPoolPayload) (*loadbalancer.TargetPool, error)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to avoid abbreviations. I would replace lbName with name/loadBalancerName and tpName with targetPool/targetPoolName

lb, err := l.Client.GetLoadBalancer(ctx, l.projectID, l.region, lbName).Execute()
if err != nil {
if IsNotFound(err) {
return nil, nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code calling the client might want to (and should) handle a NotFound error explicitly. Let's remove the handling here to allow that.


// targetsEqual compares two target lists for equality.
// Both lists should be sorted by IP for correct comparison.
func targetsEqual(spec, lb []loadbalancer.Target) bool {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using cmp.Equal instead of manually implementing this function. See

if cmp.Equal(rule, wanted, stackit.ProtocolComparison, stackit.MapStringAnyComparison, cmpopts.IgnoreFields(iaas.SecurityGroupRule{}, "Description", "Id", "CreatedAt", "UpdatedAt", "SecurityGroupId")) {
for an example.

// The LB API may or may not de-duplicate (causing reconciliation loop), remove duplicates
// for a clean approach.
func stringSetsEqual(a, b []string) bool {
return slices.Equal(sortedUnique(a), sortedUnique(b))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect this to be more efficient:

return sets.New(a...).Equal(sets.New(b...))

targetPoolName,
loadbalancer.UpdateTargetPoolPayload{
Name: new(targetPoolName),
TargetPort: &r.SelfHostedShootExposure.Spec.Port,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general rule, avoid sharing memory with k8s API objects. I.e., instead of referencing (&), create a new value on the stack and reference it (new(...) or ptr.To(...)).

// Re-read the LB so downstream readiness checks don't see the pre-write status (STACKIT
// transitions the LB to STATUS_PENDING on any write). UpdateLoadBalancerTargetPool only
// returns the TargetPool, so a full GET is needed.
refreshed, err := r.LBClient.GetLoadBalancer(ctx, r.ResourceName)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this, I'm asking myself whether using the UpdateLoadBalancerTargetPool is actually the "fast path". Using the UpdateLoadBalancer method spares a second API call, right?

if err != nil {
return fmt.Errorf("error refreshing load balancer after target pool update: %w", err)
}
r.LoadBalancer = refreshed
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider assigning to r.LoadBalancer directly instead of saving in an intermediate variable.

Copy link
Copy Markdown
Member

@timebertt timebertt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it through the PR. Let me know if you want to discuss any comments in a chat :)

// Endpoints are populated asynchronously by gardenlet; empty endpoints should trigger a clean requeue,
// not a fatal error.
var rae *reconcilerutils.RequeueAfterError
Expect(errors.As(err, &rae)).To(BeTrue())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you could use errors.Is here

// not a fatal error.
var rae *reconcilerutils.RequeueAfterError
Expect(errors.As(err, &rae)).To(BeTrue())
Expect(rae.Cause.Error()).To(ContainSubstring("waiting for endpoints to be populated"))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you could use MatchError() here

@@ -0,0 +1,396 @@
---
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add this file to make gardener-crds

Comment thread Makefile
$(INFRA_TEST_FLAGS)

.PHONY: test-integration-exposure
test-integration-exposure: $(REPORT_COLLECTOR) $(SETUP_ENVTEST) $(GINKGO) ## Run selfhostedshootexposure integration tests
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be documented in testing.md as well

httpClient, err := rest.HTTPClientFor(restConfig)
Expect(err).NotTo(HaveOccurred())
mapper, err := apiutil.NewDynamicRESTMapper(restConfig, httpClient)
Expect(err).NotTo(HaveOccurred())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting up a dedicated/custom http client and rest mapper seems unnecessary

opts := &Options{
SelfHostedShootExposure: exposure,
ProjectID: projectID,
ResourceName: fmt.Sprintf("%s-exposure-%s", cluster.Shoot.Status.TechnicalID, exposure.Name),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought, keeping this exposure.Name part would ensure we don't need any migrations or compatibility code when Gardener later on finds another use case for creating multiple SelfHostedShootExposure objects.


AfterEach(func() {
// Best-effort cleanup of LB via STACKIT API (in case the controller didn't delete it)
if lbName != "" {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can lbName ever be empty?

AfterEach(func() {
// Best-effort cleanup of LB via STACKIT API (in case the controller didn't delete it)
if lbName != "" {
lb, _ := lbClient.GetLoadBalancer(ctx, lbName)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about transient errors calling the API?
You should probably wrap such calls in test in an Eventually

lb, _ := lbClient.GetLoadBalancer(ctx, lbName)
if lb != nil {
log.Info("Cleaning up leftover load balancer", "name", lbName)
_ = lbClient.DeleteLoadBalancer(ctx, lbName)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, use Eventually to tolerate transient errors in API calls and wait for the load balancer to be deleted

}).WithTimeout(5 * time.Minute).WithPolling(5 * time.Second).Should(Succeed())

By("verify SelfHostedShootExposure CR is being reconciled")
Expect(c.Get(ctx, client.ObjectKeyFromObject(exposure), exposure)).To(Succeed())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Eventually for waiting for controller actions on API object

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/enhancement Enhancement, improvement, extension size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants