NO-JIRA: Force webhook failure policy to Ignore until cluster has bootstrapped#476
NO-JIRA: Force webhook failure policy to Ignore until cluster has bootstrapped#476JoelSpeed wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@JoelSpeed: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
📝 WalkthroughWalkthroughThe controller gains logic to detect cluster bootstrap state and conditionally mutate webhook configurations. When the cluster isn't fully bootstrapped, webhook FailurePolicy is set to Ignore to prevent blocking initialization. Once all ClusterOperators reach Available and NotProgressing states, webhooks are applied unmutated with FailurePolicy set to Fail. Changes
Sequence Diagram(s)sequenceDiagram
participant Controller
participant ClusterOperator
participant AssetMutator
participant WebhookConfig
Controller->>ClusterOperator: Query ClusterOperator conditions
ClusterOperator-->>Controller: Return status (Available, Progressing, Degraded)
alt Cluster Not Bootstrapped
Controller->>AssetMutator: Load webhook assets
AssetMutator->>AssetMutator: Set FailurePolicy to Ignore
AssetMutator->>WebhookConfig: Apply mutated webhook
WebhookConfig-->>AssetMutator: Configuration applied
else Cluster Bootstrapped
Controller->>AssetMutator: Load webhook assets
AssetMutator->>AssetMutator: Set FailurePolicy to Fail
AssetMutator->>WebhookConfig: Apply unmutated webhook
WebhookConfig-->>AssetMutator: Configuration applied
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/pipeline required I need to see E2E results for this, even if it must change later |
|
Scheduling tests matching the |
|
@JoelSpeed: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
pkg/controllers/staticresourceinstaller/controller_test.go (1)
116-118: Scope FailurePolicy assertions to expected webhook configurations.Line 116 and Line 149 assert over the entire list with
HaveEach. That can pass on an empty list and can fail when unrelatedValidatingWebhookConfigurationobjects exist. Prefer explicit name-scoped assertions plusNot(BeEmpty()).Suggested assertion pattern
-Eventually(kWithCtx(ctx).ObjectList(&admissionregistrationv1.ValidatingWebhookConfigurationList{}), 10*time.Second).WithContext(ctx).Should(HaveField("Items", HaveEach( - HaveField("Webhooks", ConsistOf(HaveField("FailurePolicy", HaveValue(Equal(admissionregistrationv1.Ignore))))), -))) +Eventually(kWithCtx(ctx).ObjectList(&admissionregistrationv1.ValidatingWebhookConfigurationList{}), 10*time.Second).WithContext(ctx).Should( + HaveField("Items", SatisfyAll( + Not(BeEmpty()), + ContainElement(SatisfyAll( + HaveField("ObjectMeta.Name", Equal("openshift-compatibility-requirements-apiextensions-openshift-io-v1alpha1-compatibilityrequirement-validation")), + HaveField("Webhooks", HaveEach(HaveField("FailurePolicy", HaveValue(Equal(admissionregistrationv1.Ignore))))), + )), + ContainElement(SatisfyAll( + HaveField("ObjectMeta.Name", Equal("openshift-compatibility-requirements-apiextensions-k8s-io-v1-customresourcedefinition-validation")), + HaveField("Webhooks", HaveEach(HaveField("FailurePolicy", HaveValue(Equal(admissionregistrationv1.Ignore))))), + )), + )), +)Apply the same structure to the Fail-policy assertion.
Also applies to: 149-151
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/staticresourceinstaller/controller_test.go` around lines 116 - 118, The test currently asserts FailurePolicy across the entire ValidatingWebhookConfigurationList using HaveEach, which can pass incorrectly on empty lists or be affected by unrelated objects; update the assertions to scope to the expected webhook configurations by selecting the specific ValidatingWebhookConfiguration by name (e.g., filter the ValidatingWebhookConfigurationList returned by kWithCtx(ctx).ObjectList(...) for the expected metadata.Name) and assert Not(BeEmpty()) on that filtered result, then assert on its Webhooks field that each expected webhook has FailurePolicy == admissionregistrationv1.Ignore; apply the same pattern to the second occurrence around lines 149-151 so both checks explicitly target the named webhook configs instead of the whole list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controllers/staticresourceinstaller/controller.go`:
- Around line 205-220: The NotFound branch currently forces
obj.Webhooks[i].FailurePolicy = ptr.To(admissionregistrationv1.Ignore) whenever
the validating webhook is missing; change it to only apply this "ignore on
create" behavior when the cluster is still bootstrapping (e.g., check a
bootstrap-complete flag or method on the controller such as c.isBootstrapped()
or c.bootstrapComplete) and otherwise return an error or proceed to create with
the default/Fail policy; update the same gating logic in the other NotFound
handling block around lines 259-274 so both places only set FailurePolicy to
Ignore if the controller reports the cluster is not yet bootstrapped.
- Line 260: The error message in mutateMutatingWebhookConfiguration incorrectly
says "validating webhook configuration"; update the fmt.Errorf call that returns
an error for obj.Name to refer to "mutating webhook configuration" instead, i.e.
change the string literal passed to fmt.Errorf in
mutateMutatingWebhookConfiguration so it reads "failed to get mutating webhook
configuration %s: %w" (preserving obj.Name and err).
- Around line 311-327: The clusterBootstrapped function currently only updates
policiesMatch when names match, so missing required webhook names in
currentWebhooks can leave policiesMatch true; fix by for each webhook in
webhooks, search for a matching name in currentWebhooks and if none is found set
policiesMatch = false (and continue), and if found compare failurePolicy
pointers as before (require both non-nil and equal) — update the logic in
clusterBootstrapped to explicitly check existence of a matching currentWebhook
for every desired webhook (webhooks/currentWebhooks/policiesMatch).
- Around line 54-58: Replace the plain bool field
initialClusterOperatorsBootstrapped on staticResourceInstallerController with
sync/atomic's atomic.Bool and update all accesses: use
initialClusterOperatorsBootstrapped.Load() where the predicate callbacks and
reconciliation read the flag and
initialClusterOperatorsBootstrapped.Store(true/false) where reconciliation
writes it (ensure you add the "sync/atomic" import). Also fix the error text
that currently mentions "validating webhook configuration" when handling a
MutatingWebhookConfiguration—change the message to "mutating webhook
configuration" so it accurately reflects the MutatingWebhookConfiguration being
mutated.
---
Nitpick comments:
In `@pkg/controllers/staticresourceinstaller/controller_test.go`:
- Around line 116-118: The test currently asserts FailurePolicy across the
entire ValidatingWebhookConfigurationList using HaveEach, which can pass
incorrectly on empty lists or be affected by unrelated objects; update the
assertions to scope to the expected webhook configurations by selecting the
specific ValidatingWebhookConfiguration by name (e.g., filter the
ValidatingWebhookConfigurationList returned by kWithCtx(ctx).ObjectList(...) for
the expected metadata.Name) and assert Not(BeEmpty()) on that filtered result,
then assert on its Webhooks field that each expected webhook has FailurePolicy
== admissionregistrationv1.Ignore; apply the same pattern to the second
occurrence around lines 149-151 so both checks explicitly target the named
webhook configs instead of the whole list.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (2)
pkg/controllers/staticresourceinstaller/controller.gopkg/controllers/staticresourceinstaller/controller_test.go
| type staticResourceInstallerController struct { | ||
| assetNames []string // The names of the assets to install. | ||
| kubeClient kubernetes.Interface | ||
| assetNames []string // The names of the assets to install. | ||
| client client.Client | ||
| kubeClient kubernetes.Interface | ||
| initialClusterOperatorsBootstrapped bool |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, check the Go version
cat go.mod | head -20Repository: openshift/cluster-capi-operator
Length of output: 1001
🏁 Script executed:
# Read the relevant sections of the controller file
wc -l pkg/controllers/staticresourceinstaller/controller.goRepository: openshift/cluster-capi-operator
Length of output: 132
🏁 Script executed:
# Check the struct definition and field declaration
sed -n '50,70p' pkg/controllers/staticresourceinstaller/controller.goRepository: openshift/cluster-capi-operator
Length of output: 814
🏁 Script executed:
# Check the predicate callback at line 97
sed -n '90,110p' pkg/controllers/staticresourceinstaller/controller.goRepository: openshift/cluster-capi-operator
Length of output: 1110
🏁 Script executed:
# Check the reconcile method and initialClusterOperatorsBootstrapped access
sed -n '330,360p' pkg/controllers/staticresourceinstaller/controller.goRepository: openshift/cluster-capi-operator
Length of output: 1145
🏁 Script executed:
# Check the error text at line 260
sed -n '255,265p' pkg/controllers/staticresourceinstaller/controller.goRepository: openshift/cluster-capi-operator
Length of output: 884
🏁 Script executed:
# Look for the clusterOperatorsBootstrapped method implementation
sed -n '332,355p' pkg/controllers/staticresourceinstaller/controller.goRepository: openshift/cluster-capi-operator
Length of output: 926
Protect initialClusterOperatorsBootstrapped with synchronization to prevent data race.
Line 97 reads this flag from predicate callbacks that run in the watch event loop, while lines 333 and 352 read and write it during reconciliation in the worker pool. These concurrent execution contexts create an unsynchronized access to a shared boolean field, which is a data race. Use atomic.Bool to synchronize access.
Additionally, the error message at line 260 incorrectly refers to "validating webhook configuration" when the function mutates a MutatingWebhookConfiguration. Update the error text to say "mutating webhook configuration".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controllers/staticresourceinstaller/controller.go` around lines 54 - 58,
Replace the plain bool field initialClusterOperatorsBootstrapped on
staticResourceInstallerController with sync/atomic's atomic.Bool and update all
accesses: use initialClusterOperatorsBootstrapped.Load() where the predicate
callbacks and reconciliation read the flag and
initialClusterOperatorsBootstrapped.Store(true/false) where reconciliation
writes it (ensure you add the "sync/atomic" import). Also fix the error text
that currently mentions "validating webhook configuration" when handling a
MutatingWebhookConfiguration—change the message to "mutating webhook
configuration" so it accurately reflects the MutatingWebhookConfiguration being
mutated.
| if err := c.client.Get(ctx, client.ObjectKey{Name: obj.Name}, currentObj); err != nil && !apierrors.IsNotFound(err) { | ||
| return nil, fmt.Errorf("failed to get validating webhook configuration %s: %w", obj.Name, err) | ||
| } else if err != nil && apierrors.IsNotFound(err) { | ||
| // If the object doesn't currently exist, apply it initially with the failure policy set to ignore | ||
| // so that we don't block cluster operators during cluster bootstrap. | ||
| for i := range obj.Webhooks { | ||
| obj.Webhooks[i].FailurePolicy = ptr.To(admissionregistrationv1.Ignore) | ||
| } | ||
|
|
||
| data, err := k8syaml.Marshal(obj) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to marshal object: %w", err) | ||
| } | ||
|
|
||
| return data, nil | ||
| } |
There was a problem hiding this comment.
Avoid forcing Ignore on NotFound after bootstrap is complete.
Line 207 and Line 261 return early with Ignore whenever the webhook object is missing. In a bootstrapped cluster, this still weakens policy even though Fail should be allowed.
Suggested fix (gate NotFound path on bootstrap state)
- } else if err != nil && apierrors.IsNotFound(err) {
+ } else if err != nil && apierrors.IsNotFound(err) {
+ bootstrapped, berr := c.clusterOperatorsBootstrapped(ctx)
+ if berr != nil {
+ return nil, fmt.Errorf("failed to check if cluster operators are bootstrapped: %w", berr)
+ }
+ if bootstrapped {
+ return raw, nil
+ }
// If the object doesn't currently exist, apply it initially with the failure policy set to ignore
// so that we don't block cluster operators during cluster bootstrap.
for i := range obj.Webhooks {
obj.Webhooks[i].FailurePolicy = ptr.To(admissionregistrationv1.Ignore)
}Also applies to: 259-274
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controllers/staticresourceinstaller/controller.go` around lines 205 -
220, The NotFound branch currently forces obj.Webhooks[i].FailurePolicy =
ptr.To(admissionregistrationv1.Ignore) whenever the validating webhook is
missing; change it to only apply this "ignore on create" behavior when the
cluster is still bootstrapping (e.g., check a bootstrap-complete flag or method
on the controller such as c.isBootstrapped() or c.bootstrapComplete) and
otherwise return an error or proceed to create with the default/Fail policy;
update the same gating logic in the other NotFound handling block around lines
259-274 so both places only set FailurePolicy to Ignore if the controller
reports the cluster is not yet bootstrapped.
| currentObj := &admissionregistrationv1.MutatingWebhookConfiguration{} | ||
|
|
||
| if err := c.client.Get(ctx, client.ObjectKey{Name: obj.Name}, currentObj); err != nil && !apierrors.IsNotFound(err) { | ||
| return nil, fmt.Errorf("failed to get validating webhook configuration %s: %w", obj.Name, err) |
There was a problem hiding this comment.
Fix incorrect resource type in error message.
Line 260 is in mutateMutatingWebhookConfiguration but the message says “validating webhook configuration”, which is misleading when debugging failures.
Suggested fix
- return nil, fmt.Errorf("failed to get validating webhook configuration %s: %w", obj.Name, err)
+ return nil, fmt.Errorf("failed to get mutating webhook configuration %s: %w", obj.Name, err)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| return nil, fmt.Errorf("failed to get validating webhook configuration %s: %w", obj.Name, err) | |
| return nil, fmt.Errorf("failed to get mutating webhook configuration %s: %w", obj.Name, err) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controllers/staticresourceinstaller/controller.go` at line 260, The error
message in mutateMutatingWebhookConfiguration incorrectly says "validating
webhook configuration"; update the fmt.Errorf call that returns an error for
obj.Name to refer to "mutating webhook configuration" instead, i.e. change the
string literal passed to fmt.Errorf in mutateMutatingWebhookConfiguration so it
reads "failed to get mutating webhook configuration %s: %w" (preserving obj.Name
and err).
| func (c *staticResourceInstallerController) clusterBootstrapped(ctx context.Context, webhooks, currentWebhooks []webhookPolicy) (bool, error) { | ||
| // First check if the existing policies match the desired policies. | ||
| // This means we already applied the manifest as it comes from the assets directly | ||
| // without any mutation. To do that, the cluster must already have been bootstrapped. | ||
| policiesMatch := true | ||
|
|
||
| for _, webhook := range webhooks { | ||
| for _, currentWebhook := range currentWebhooks { | ||
| if webhook.Name == currentWebhook.Name { | ||
| policiesMatch = policiesMatch && webhook.FailurePolicy != nil && currentWebhook.FailurePolicy != nil && *webhook.FailurePolicy == *currentWebhook.FailurePolicy | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if policiesMatch { | ||
| return true, nil | ||
| } |
There was a problem hiding this comment.
clusterBootstrapped can return true for mismatched webhook sets.
At Line 317, the nested comparison only updates policiesMatch when names match. If a required webhook name is missing in currentWebhooks, policiesMatch may stay true and skip bootstrap gating.
Suggested fix (compare by name with existence checks)
func (c *staticResourceInstallerController) clusterBootstrapped(ctx context.Context, webhooks, currentWebhooks []webhookPolicy) (bool, error) {
- policiesMatch := true
-
- for _, webhook := range webhooks {
- for _, currentWebhook := range currentWebhooks {
- if webhook.Name == currentWebhook.Name {
- policiesMatch = policiesMatch && webhook.FailurePolicy != nil && currentWebhook.FailurePolicy != nil && *webhook.FailurePolicy == *currentWebhook.FailurePolicy
- }
- }
- }
+ if len(webhooks) != len(currentWebhooks) {
+ return c.clusterOperatorsBootstrapped(ctx)
+ }
+
+ currentByName := map[string]webhookPolicy{}
+ for _, currentWebhook := range currentWebhooks {
+ currentByName[currentWebhook.Name] = currentWebhook
+ }
+
+ policiesMatch := true
+ for _, webhook := range webhooks {
+ currentWebhook, ok := currentByName[webhook.Name]
+ if !ok {
+ policiesMatch = false
+ break
+ }
+ policiesMatch = policiesMatch &&
+ webhook.FailurePolicy != nil &&
+ currentWebhook.FailurePolicy != nil &&
+ *webhook.FailurePolicy == *currentWebhook.FailurePolicy
+ }
if policiesMatch {
return true, nil
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controllers/staticresourceinstaller/controller.go` around lines 311 -
327, The clusterBootstrapped function currently only updates policiesMatch when
names match, so missing required webhook names in currentWebhooks can leave
policiesMatch true; fix by for each webhook in webhooks, search for a matching
name in currentWebhooks and if none is found set policiesMatch = false (and
continue), and if found compare failurePolicy pointers as before (require both
non-nil and equal) — update the logic in clusterBootstrapped to explicitly check
existence of a matching currentWebhook for every desired webhook
(webhooks/currentWebhooks/policiesMatch).
|
I think the Coderabbit races are probably correct, however LGTM on approach for a stop gap during dev. Maybe comments somewhere that this is temporary? |
|
@JoelSpeed: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Pretty sure we replaced this with #479. @JoelSpeed reopen if I'm missing something. /close |
|
@mdbooth: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
At the moment, there are connectivity issues sometimes on bootstrap when webhooks are set to FailurePolicy Fail.
In the long term, we hope to introduce a connection between the bootstrap API server and the in cluster running webhooks, but for now, we need a workaround to prevent cluster bootstrap issues failing on broken webhooks.
This PR add mutation to VWC and MWC in the static resource controller to set them to Ignore, until such a time that we see either the existing resource has Fail (or matches what's in the manifest itself), or all ClusterOperators are reporting Available=true, Progressing=false, Degraded=false. In a traditional cluster bootstrap, this happens some time after the bootstrap API server has already shut down, and means we can be confident then at setting the policy to Fail.
Summary by CodeRabbit
New Features
Tests