AGENT-1443: IRI Add certificate regeneration to MCS cert rotation controller #5721
rwsu wants to merge 2 commits into openshift:main
…troller

When the MCS CA rotates, the IRI TLS certificate (internal-release-image-tls) was not being regenerated, leaving it signed by the old CA. This adds a reconcileIRICertificate() method that generates a new IRI server cert signed by the current MCS CA and creates/updates the secret.

The rotation is triggered on MCS CA rotation via CA bundle ConfigMap events. When the machine-config-server-ca Secret is deleted or expires, library-go's CertRotationController regenerates the CA key pair and updates the machine-config-server-ca ConfigMap (CA bundle). Our addConfigMap and updateConfigMap handlers detect this ConfigMap change and call reconcileIRICertificate() alongside reconcileUserDataSecrets().

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
…cert rotation

Gate IRI certificate reconciliation on the NoRegistryClusterInstall feature flag so it only runs on clusters that use the IRI registry.

Add localhost, 127.0.0.1, and ::1 to the IRI certificate SANs to match the installer-generated certificate.

Add idempotency check that verifies the existing IRI cert against the current MCS CA before regenerating, preventing unnecessary Secret updates and node rollouts on controller restarts.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
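The idempotency check described in the second commit can be pictured as a chain verification of the existing certificate against the current CA. The following is a minimal, self-contained sketch using only the Go standard library; `certSignedBy`, `newCA`, and `newServerCert` are hypothetical helpers for illustration and are not the PR's `isIRICertValid()` implementation.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// certSignedBy reports whether leaf chains to ca and is currently valid
// for TLS server authentication. (Hypothetical helper, not the PR's code.)
func certSignedBy(leaf, ca *x509.Certificate) bool {
	roots := x509.NewCertPool()
	roots.AddCert(ca)
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     roots,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	})
	return err == nil
}

// newCA creates a throwaway self-signed CA for the demonstration.
func newCA(cn string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newServerCert issues a short-lived server certificate signed by ca.
func newServerCert(ca *x509.Certificate, caKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "iri-demo"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		DNSNames:     []string{"localhost"},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	oldCA, oldKey, err := newCA("old-ca")
	if err != nil {
		panic(err)
	}
	rotatedCA, _, err := newCA("rotated-ca")
	if err != nil {
		panic(err)
	}
	leaf, err := newServerCert(oldCA, oldKey)
	if err != nil {
		panic(err)
	}
	// A cert that still verifies needs no regeneration; one that fails does.
	fmt.Println("valid under old CA:", certSignedBy(leaf, oldCA))
	fmt.Println("valid under rotated CA:", certSignedBy(leaf, rotatedCA))
}
```

In this sketch, a certificate that still verifies under the current CA would be left alone, while one that fails verification (as after a CA rotation) would be regenerated.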
@rwsu: This pull request references AGENT-1443 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: rwsu
/retest-required

/cc @djoshy @andfasano

@rwsu: all tests passed!

/verified by TestIRICertificateRotation unit test and @rwsu

@rwsu: This PR has been marked as verified.
djoshy left a comment
Overall makes sense, left a few comments.
	return
}

// Get hostnames from the dynamic serving rotation (includes api-int hostname and platform VIPs)
Are the hostnames from the infra object required for the IRI cert? I was under the impression they weren't.
It's possible for the infra hostnames to be updated without the configmap being updated (very unlikely, but plausible), which would leave the hostnames in the IRI cert stale. The MCS TLS cert handles this because it feeds off the dynamic serving rotation, so I thought it would be worth mentioning.
If we do want infra hostname changes to be accounted for, we can call reconcileIRICertificate when the hostname queue gets updated, and I would also recommend adding a check for the IPs in isIRICertValid(). If not, we can just take out this bit that uses the hostnames from the dynamic serving rotation.
During installation, when the initial IRI TLS certificate is generated (see here), we don't use hostnames, just localhost and the apiInt URL.
The entry point for the IRI (logical) service is, and must remain, the apiInt. Localhost must be supported as a special case so that masters can consume their own local registries after a reboot or disconnect (i.e. when the apiInt is not reachable for any reason). To summarize, I don't think the hostnames are really required, just apiInt and localhost (the same behavior as the installer asset).
	c.reconcileUserDataSecrets()
}()
go func() {
	c.reconcileIRICertificate()
}()
I forgot about these being individual goroutines in my original implementation. There is a very small chance of a race here on creates or updates. Given that the CA rotates once every 8 years or so, the risk should be minimal. Even if there is a race, the fresh GETs on the CA should result in valid creates/updates.
If we want to be careful we can combine these into one thread and set up a mutex, but IMO that's not a blocker for merging this. The safest way to do this would be via a workqueue like our other controllers, but again definitely not a blocker, just me lamenting my original decisions 😄
What about serializing them? I.e.:
go func() {
	c.reconcileUserDataSecrets()
	c.reconcileIRICertificate()
}()
}

for _, test := range tests {
	test := test
Not a blocker, but this pattern is no longer needed in Go 1.22+. Ref: https://go.dev/doc/go1.22#language
	return
}
klog.Infof("Reconciling IRI certificate")
It is also required to check for the presence of the IRI cluster resource. If it is not present, the feature is not enabled.
	return nil
}

func (c *CertRotationController) reconcileIRICertificate() {
This internal method is very long; please consider refactoring it into smaller methods.
	klog.Errorf("Cannot get IRI TLS secret: %v", err)
	return
}
It looks like this whole block could be simplified by following the happy path and removing secretExists. I.e.:
iriSecret, err := c.kubeClient.CoreV1().Secrets(ctrlcommon.MCONamespace).Get(context.TODO(), ctrlcommon.InternalReleaseImageTLSSecretName, metav1.GetOptions{})
if err != nil && !errors.IsNotFound(err) {
	klog.Errorf("Cannot get IRI TLS secret: %v", err)
	return
}
if iriSecret != nil && c.isIRICertValid(iriSecret, ca) {
	klog.Infof("IRI TLS certificate is still valid under the current MCS CA, skipping rotation")
	return
}
Later you could simply check for iriSecret != nil.
	return
}

if !secretExists {
Why do we need to take this case into account? It doesn't seem to be a valid one (at least here): if the secret does not exist for any reason, the IRI controller will be broken and won't work either.
I think it would be really useful to have at least one e2e test for the rotation, if it's not too complex, here: https://github.com/openshift/machine-config-operator/tree/main/test/e2e-iri. As soon as openshift/release#73866 lands, it will be possible to verify the e2e tests for the IRI controller directly in the MCO presubmit jobs.
- What I did
- How to verify it
oc delete secret machine-config-server-ca -n openshift-machine-config-operator
curl --cacert /etc/iri-registry/certs/tls.crt https://api-int.:22625/v2/_catalog
- Description for the changelog
When the MCS CA rotates, the IRI TLS certificate (internal-release-image-tls) was not being regenerated, leaving it signed by the old CA. This adds a reconcileIRICertificate() method that generates a new IRI server cert signed by the current MCS CA and creates/updates the secret, triggered alongside user-data secret reconciliation on CA configmap add/update events.