AGENT-1443: IRI Add certificate regeneration to MCS cert rotation controller#5721

Open
rwsu wants to merge 2 commits into openshift:main from rwsu:AGENT-1443-IRI-cert-rotation

@rwsu rwsu commented Feb 27, 2026

- What I did

  • Updated certrotation_controller to generate new IRI TLS certificates when the MCS CA rotates.
  • Created the reconcileIRICertificate function to update the IRI TLS certificates.
  • The function only runs if the NoRegistryClusterInstall feature gate is enabled.
  • Added localhost, 127.0.0.1, and ::1 to the IRI certificate SANs to match the installer-generated certificate.
  • Added an idempotency check that verifies the existing IRI cert against the current MCS CA before regenerating, preventing unnecessary Secret updates and node rollouts on controller restarts.

- How to verify it

  • Force the MCS CA to rotate by deleting the secret
    oc delete secret machine-config-server-ca -n openshift-machine-config-operator
  • Verify the IRI TLS certificates have been updated at /etc/iri-registry/certs/tls.crt
  • Verify the IRI registry continues to work with the new certificates
    curl --cacert /etc/iri-registry/certs/tls.crt https://api-int.:22625/v2/_catalog

- Description for the changelog

When the MCS CA rotates, the IRI TLS certificate (internal-release-image-tls) was not being regenerated, leaving it signed by the old CA. This adds a reconcileIRICertificate() method that generates a new IRI server cert signed by the current MCS CA and creates/updates the secret, triggered alongside user-data secret reconciliation on CA configmap add/update events.

rwsu added 2 commits February 27, 2026 17:14
…troller

When the MCS CA rotates, the IRI TLS certificate (internal-release-image-tls)
was not being regenerated, leaving it signed by the old CA. This adds a
reconcileIRICertificate() method that generates a new IRI server cert signed
by the current MCS CA and creates/updates the secret.

The rotation is triggered on MCS CA rotation via CA bundle ConfigMap events.
When the machine-config-server-ca Secret is deleted or expires, library-go's
CertRotationController regenerates the CA key pair and updates the
machine-config-server-ca ConfigMap (CA bundle). Our addConfigMap and
updateConfigMap handlers detect this ConfigMap change and call
reconcileIRICertificate() alongside reconcileUserDataSecrets().

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
…cert rotation

Gate IRI certificate reconciliation on the NoRegistryClusterInstall
feature flag so it only runs on clusters that use the IRI registry.
Add localhost, 127.0.0.1, and ::1 to the IRI certificate SANs to
match the installer-generated certificate. Add idempotency check
that verifies the existing IRI cert against the current MCS CA
before regenerating, preventing unnecessary Secret updates and
node rollouts on controller restarts.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci-robot commented Feb 27, 2026

@rwsu: This pull request references AGENT-1443 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 27, 2026

openshift-ci bot commented Feb 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rwsu
Once this PR has been reviewed and has the lgtm label, please assign isabella-janssen for approval. For more information see the Code Review Process.

rwsu commented Mar 2, 2026

/retest-required

rwsu commented Mar 2, 2026

/cc @djoshy @andfasano

@openshift-ci openshift-ci bot requested review from andfasano and djoshy March 2, 2026 15:30

openshift-ci bot commented Mar 2, 2026

@rwsu: all tests passed!

rwsu commented Mar 2, 2026

/verified by TestIRICertificateRotation unit test and @rwsu

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 2, 2026
@openshift-ci-robot

@rwsu: This PR has been marked as verified by TestIRICertificateRotation unit test and @rwsu.

@djoshy djoshy left a comment

Overall makes sense, left a few comments.

return
}

// Get hostnames from the dynamic serving rotation (includes api-int hostname and platform VIPs)

@djoshy djoshy Mar 3, 2026

Are the hostnames from the infra object required for the IRI cert? I was under the impression it wasn't.

It's possible for the infra hostnames to be updated without the configmap being updated (very unlikely, but plausible), which could leave the hostnames in the IRI cert stale. The MCS TLS cert handles this because it feeds off the dynamic serving rotation, so I thought it would be worth mentioning:

CertCreator: &certrotation.ServingRotation{
	Hostnames:        c.hostnamesRotation.GetHostnames,
	HostnamesChanged: c.hostnamesRotation.hostnamesChanged,
},

If we do want infra hostname changes to be accounted for, we can call reconcileIRICertificate when the hostname queue gets updated, and I would also recommend adding a check for the IPs in isIRICertValid(). If not, we can just take out this bit that uses the hostnames from the dynamic serving rotation.

Contributor

During the installation, when the initial IRI TLS certificate is generated (see here), we don't use hostnames, just localhost and the apiInt URL.
The entry point for the IRI (logical) service is, and must remain, the apiInt. Localhost must be supported as a special case so that masters can consume their own local registries after a reboot or disconnect (i.e. when the apiInt is not reachable for any reason). To summarize, I don't think the hostnames are really required, just apiInt and localhost (the same behavior as the installer asset).

Comment on lines 348 to +352
	c.reconcileUserDataSecrets()
}()
go func() {
	c.reconcileIRICertificate()
}()
Contributor

I forgot about these being individual goroutines in my original implementation. There is a very small chance of a race here on creates or updates. Given that the CA rotates once every 8 years or so, the risk should be minimal. Even if there is a race, the fresh GETs on the CA should result in valid creates/updates.

If we want to be careful we can combine these into one goroutine and set up a mutex, but IMO that's not a blocker for merging this. The safest way to do this would be via a workqueue like our other controllers, but again, definitely not a blocker, just me lamenting my original decisions 😄

Contributor

What about serializing them? I.e.:

go func() {
	c.reconcileUserDataSecrets()
	c.reconcileIRICertificate()
}()

}

for _, test := range tests {
test := test
Contributor

Not a blocker, but this pattern is no longer needed in Go 1.22+, since each loop iteration now gets its own copy of the loop variable. Ref: https://go.dev/doc/go1.22#language

return
}
klog.Infof("Reconciling IRI certificate")

Contributor

It is also necessary to check for the presence of the IRI cluster resource. If it is not present, the feature is not enabled.

return nil
}

func (c *CertRotationController) reconcileIRICertificate() {
Contributor

This internal method is very long; please consider refactoring it into smaller methods.

klog.Errorf("Cannot get IRI TLS secret: %v", err)
return
}

Contributor

It looks like this whole block could be simplified with the happy path, and secretExists removed. Ie:

iriSecret, err := c.kubeClient.CoreV1().Secrets(ctrlcommon.MCONamespace).Get(context.TODO(), ctrlcommon.InternalReleaseImageTLSSecretName, metav1.GetOptions{})
if err != nil && !errors.IsNotFound(err) {
	klog.Errorf("Cannot get IRI TLS secret: %v", err)
	return
}
if iriSecret != nil && c.isIRICertValid(iriSecret, ca) {
	klog.Infof("IRI TLS certificate is still valid under the current MCS CA, skipping rotation")
	return
}

Later you could simply check for iriSecret != nil

return
}

if !secretExists {
Contributor

Why do we need to take this case into account? It doesn't seem to be a valid one (at least here): if the secret does not exist for any reason, the IRI controller will be broken and won't work either.

Contributor

I think it would be really useful to have at least one e2e test for the rotation, if it's not too complex, here: https://github.com/openshift/machine-config-operator/tree/main/test/e2e-iri. As soon as openshift/release#73866 lands, it will be possible to run the e2e tests for the IRI controller directly in the MCO presubmit jobs.
