Skip to content

feat(azure): enable ACR image pulls via managed identity on worker nodes#8205

Closed
twolff-gh wants to merge 5 commits into
openshift:mainfrom
twolff-gh:azure-acr-mi-api
Closed

feat(azure): enable ACR image pulls via managed identity on worker nodes#8205
twolff-gh wants to merge 5 commits into
openshift:mainfrom
twolff-gh:azure-acr-mi-api

Conversation

@twolff-gh

@twolff-gh twolff-gh commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Followup to #7793

Add support for configuring kubelet's image credential provider to
authenticate to Azure Container Registry (ACR) using a user-assigned
managed identity. This eliminates the need for static pull secrets
by leveraging IMDS-based token acquisition on worker nodes.

Summary by CodeRabbit

  • New Features

    • Azure node pools can include image registry credentials (managed identity + registries) to enable ACR authentication on worker nodes.
    • When provided, worker VMs are assigned the configured identity and a MachineConfig is generated to install the ACR credential provider and supporting config files.
  • Bug Fixes

    • Serialization compatibility between versions preserved for Azure node pool platform and registry credential data.
  • Tests

    • Added comprehensive tests covering config generation, ignition payloads, validation, and JSON/YAML serialization.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2026
@openshift-ci

openshift-ci Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an optional ImageRegistryCredentials field to AzureNodePoolPlatform and a new AzureImageRegistryCredentials type with validation (ManagedIdentity, Registries). The operator sets VMIdentityUserAssigned and UserAssignedIdentities on Azure VM specs when credentials are present. It generates a worker MachineConfig (Ignition v3.2.0) embedding a kubelet credential-provider YAML and an acr-azure.json payload derived from Azure identifiers. Tests cover API JSON compatibility, credential-provider and acr JSON generation, Ignition contents, and MachineConfig creation. Kubeapilinter exclusions were added for the new field and test-only structs.

Sequence Diagram(s)

sequenceDiagram
    actor Admin as Admin/User
    participant NP as NodePool Spec
    participant Op as Operator
    participant AzureSpec as Azure VM Template
    participant MCO as MachineConfig Generator
    participant Ign as Ignition Config
    participant Node as Worker Node

    Admin->>NP: Create/Update NodePool with ImageRegistryCredentials
    Op->>NP: Reconcile NodePool
    Op->>AzureSpec: Build azureMachineTemplateSpec()
    alt ImageRegistryCredentials set
        AzureSpec->>AzureSpec: Set Identity = VMIdentityUserAssigned
        AzureSpec->>AzureSpec: Add UserAssignedIdentities (ManagedIdentity)
    end
    Op->>MCO: generateMCORawConfig()
    MCO->>MCO: getACRCredentialProviderConfig()
    alt Azure + ImageRegistryCredentials
        MCO->>MCO: generateACRCredentialProviderMachineConfig()
        MCO->>Ign: Create Ignition v3.2.0 with:
        Ign->>Ign: /etc/kubernetes/credential-providers/acr-credential-provider.yaml
        Ign->>Ign: /etc/kubernetes/acr-azure.json
        MCO->>Op: Return MachineConfig (wrapped in ConfigMap)
    end
    Op->>Node: Apply MachineConfig / Ignition
    Node->>Node: Kubelet uses credential provider (ManagedIdentity)
Loading
🚥 Pre-merge checks | ✅ 8 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Topology-Aware Scheduling Compatibility ⚠️ Warning PR unconditionally labels MachineConfig as worker role without topology-aware validation, causing silent failures on SNO/TNF/TNA clusters with no dedicated worker nodes. Add topology validation to detect non-HyperShift topologies (SNO, TNF, TNA) and return explicit errors, or document the limitation prominently.
✅ Passed checks (8 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main feature: enabling ACR image pulls via managed identity on worker nodes, which aligns with the primary changes adding ImageRegistryCredentials support to AzureNodePoolPlatform and generating ACR credential provider machine configs.
Stable And Deterministic Test Names ✅ Passed All test function names are static and descriptive following Go conventions; table-driven test cases use stable names without dynamic values like timestamps or UUIDs.
Test Structure And Quality ✅ Passed Tests follow established quality patterns: single responsibility per test, table-driven organization, no problematic setup/cleanup, synchronous execution without timeouts, and descriptive assertion messages consistent with codebase.
Microshift Test Compatibility ✅ Passed The pull request does not add any Ginkgo e2e tests. All three test files added are standard Go unit tests using the testing package rather than Ginkgo framework tests.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This pull request does not add any Ginkgo e2e tests. All newly added test files use standard Go unit testing with the func TestXXX(t *testing.T) pattern. No Ginkgo patterns like It(), Describe(), Context(), or When() are present.
Ote Binary Stdout Contract ✅ Passed The pull request contains no violations of the OTE Binary Stdout Contract. No process-level code writes to stdout.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The custom check for IPv6 and disconnected network compatibility targets Ginkgo e2e tests, but this PR adds only standard Go unit tests using table-driven patterns with no Ginkgo imports or BDD-style declarations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added do-not-merge/needs-area needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 10, 2026
@openshift-ci

openshift-ci Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: twolff-gh
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/azure PR/issue for Azure (AzurePlatform) platform and removed do-not-merge/needs-area labels Apr 10, 2026
short-lived tokens for ACR image pulls instead of relying on static pull secrets.
Changing this field will trigger a node rollout.
properties:
managedIdentity:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for standalone or other cases, do we want to support system-assigned identity? If so, we wouldn't have a customer pass these in at creation time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the scope of this feature, I believe it was only for user-assigned identities. I'll confirm.
If we needed to support system-assigned, I wonder if it would need to be populated differently / new field?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, user-assigned was the main ask, but system-assigned would likely follow in the future.
Do they make sense as separate work - user now and system as another PR after?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if system-assigned identities are added per-VM, then this doesn't make sense from a feature perspective. Maybe try and see what happens, then ignore it or implement it as you see fit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I see, a cluster needs to be created before getting a principal id which some process would need permissions to access customer resources.

Does the customer trust the platform to write IAM policies on their container registry? (Don't know)

Seems like a followup item. I'll gladly look into

twolff-gh added a commit to twolff-gh/hypershift that referenced this pull request Apr 13, 2026
…/R3)

Address bennerv review comments R2 and R3 on PR openshift#8205: make the
registries field optional (remove +required and MinItems=1), allow empty
registries to fall through to default wildcard patterns (*.azurecr.io,
etc.), and update tests accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hypershift-operator/controllers/nodepool/config.go`:
- Around line 242-248: The code dereferences
cg.hostedCluster.Spec.Platform.Azure directly which can panic; before calling
generateACRCredentialProviderMachineConfig, add a guard that
cg.hostedCluster.Spec.Platform != nil and cg.hostedCluster.Spec.Platform.Azure
!= nil, and handle the nil case (return an appropriate error or skip registry
credential config) so you don't call generateACRCredentialProviderMachineConfig
with a nil azureSpec; keep references to
cg.nodePool.Spec.Platform.Azure.ImageRegistryCredentials and the
generateACRCredentialProviderMachineConfig call when adding the guard and
subsequent handling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 51b240dc-cd08-46f4-8ba4-b589a0fd067b

📥 Commits

Reviewing files that changed from the base of the PR and between 4402b11 and 6578833.

⛔ Files ignored due to path filters (14)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/azureimageregistrycredentials.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/azurenodepoolplatform.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • cmd/install/assets/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/azure.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (8)
  • api/.golangci.yml
  • api/hypershift/v1beta1/azure.go
  • api/hypershift/v1beta1/azure_test.go
  • hypershift-operator/controllers/nodepool/azure.go
  • hypershift-operator/controllers/nodepool/azure_acr.go
  • hypershift-operator/controllers/nodepool/azure_acr_test.go
  • hypershift-operator/controllers/nodepool/azure_test.go
  • hypershift-operator/controllers/nodepool/config.go

Comment thread hypershift-operator/controllers/nodepool/config.go
@twolff-gh twolff-gh changed the title WIP: feat(azure): enable ACR image pulls via managed identity on worker nodes feat(azure): enable ACR image pulls via managed identity on worker nodes Apr 14, 2026
@twolff-gh twolff-gh marked this pull request as ready for review April 14, 2026 22:34
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and jparrill April 14, 2026 22:34
twolff-gh added a commit to twolff-gh/hypershift that referenced this pull request Apr 14, 2026
…/R3)

Address bennerv review comments R2 and R3 on PR openshift#8205: make the
registries field optional (remove +required and MinItems=1), allow empty
registries to fall through to default wildcard patterns (*.azurecr.io,
etc.), and update tests accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 14, 2026
Comment thread api/.golangci.yml Outdated
Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread api/hypershift/v1beta1/azure_test.go Outdated
// acrDefaultMatchImages lists the default wildcard patterns for Azure Container Registry
// endpoints across all Azure clouds. These patterns are always appended to the user-specified
// registries to ensure the credential provider covers all standard ACR endpoints.
var acrDefaultMatchImages = []string{

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these not be defaulted in the api for the registries?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I found, these come from MCO credential provider config, if changed upstream, API default could go stale. Does that sound accurate?

Can or should we change this at the MCO config?

Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread api/hypershift/v1beta1/azure.go Outdated
// multiple registries. When specified, these registries are added to the credential
// provider's matchImages list alongside the default wildcard patterns (*.azurecr.io,
// *.azurecr.cn, *.azurecr.de, *.azurecr.us). When omitted, only the default wildcard
// patterns are used, which cover all standard Azure Container Registry endpoints.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call out a customer only gets 16 of the 20 spots since 4 are used for the wildcards?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this is additive. If the hard limit is 20 including the defaults, you should limit this to only what the user can actually provide.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MaxItems=16 added. user gets 16 slots, 4 defaults added by controller.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the registries field. It was in the original WIP, but doesnt make sense in this final version.
Adds more complexity than necessary
The default patterns copied in plus the RBAC for ids should be enough

Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread hypershift-operator/controllers/nodepool/azure.go Outdated

azureMachineTemplate.Template.Spec.SSHPublicKey = dummySSHKey

if nodePool.Spec.Platform.Azure.ImageRegistryCredentials != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment documenting that identity assignment currently replaces (not appends), and should be refactored to append if additional identities are needed in the future.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Comment thread hypershift-operator/controllers/nodepool/azure_acr.go Outdated
Comment thread hypershift-operator/controllers/nodepool/azure_acr_test.go Outdated
Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread api/hypershift/v1beta1/azure.go
Comment thread api/hypershift/v1beta1/azure.go Outdated
// multiple registries. When specified, these registries are added to the credential
// provider's matchImages list alongside the default wildcard patterns (*.azurecr.io,
// *.azurecr.cn, *.azurecr.de, *.azurecr.us). When omitted, only the default wildcard
// patterns are used, which cover all standard Azure Container Registry endpoints.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this is additive. If the hard limit is 20 including the defaults, you should limit this to only what the user can actually provide.

Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment on lines +142 to +144
// multiple registries. When specified, these registries are added to the credential
// provider's matchImages list alongside the default wildcard patterns (*.azurecr.io,
// *.azurecr.cn, *.azurecr.de, *.azurecr.us). When omitted, only the default wildcard

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I allowed to specify an entry that matches one of these default patterns?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Controller duplicates silently. Duplicate is dropped.
Does this work?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

De-duplication is fine - should we prevent users from specifying the wildcard patterns that are applied by default though? It doesn't look like there would be any value to the users to be able to specify those aside from unnecessarily consuming some of their 16 slots.

Do we anticipate these defaults potentially changing much in the future (i.e removing one of them)? If so, it could make sense to allow end-users to explicitly provide them to be more resilient to changes in the defaults.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the registries field. It was in the original WIP, but doesnt make sense in this final version.
Adds more complexity than necessary
The default patterns copied in plus the RBAC for ids should be enough

Comment thread api/hypershift/v1beta1/azure.go Outdated
@twolff-gh

Copy link
Copy Markdown
Contributor Author

/retest

Comment thread api/hypershift/v1beta1/azure.go Outdated
// +kubebuilder:validation:MaxItems=16
// +kubebuilder:validation:items:MinLength=1
// +kubebuilder:validation:items:MaxLength=253
// +kubebuilder:validation:items:Pattern=`^(\*\.)?[a-zA-Z0-9]([a-zA-Z0-9\-]*[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9\-]*[a-zA-Z0-9])?)+$`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using the Pattern marker, we generally recommend using the XValidation marker to enforce the pattern via a CEL expression and supplying a human-readable error message on validation failure.

Using the pattern marker means that the end user is provided the literal regex as part of the validation failure instead of a user-friendly message that describes the constraints.

Comment thread api/hypershift/v1beta1/azure.go Outdated
// Format: /subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identityName}
//
// +required
ManagedIdentity AzureManagedIdentityResourceID `json:"managedIdentity,omitempty"`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include the length constraints for this field (enforced by the type) in the GoDoc here as well please

@cblecker

Copy link
Copy Markdown
Member

/uncc

@openshift-ci openshift-ci Bot removed the request for review from cblecker April 16, 2026 16:06
Comment thread hypershift-operator/controllers/nodepool/azure_acr.go
@twolff-gh twolff-gh force-pushed the azure-acr-mi-api branch 2 times, most recently from 2af9a37 to de3bfd7 Compare April 17, 2026 14:26
Comment on lines +139 to +141
// resourceID is the ARM resource ID of the user-assigned managed identity.
//
// Format: /subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identityName}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully document the length constraints as well please

Comment thread api/hypershift/v1beta1/azure.go Outdated
Comment on lines +146 to +150
// clientID is the client ID of the user-assigned managed identity. This field is optional
// and is mainly used for CI purposes.
//
// +optional
ClientID AzureClientID `json:"clientID,omitempty"`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a use case where this field provides value to end users?

We shouldn't design the API surface for what we need for testing (unless I'm misunderstanding "CI purposes").

twolff-gh and others added 5 commits April 22, 2026 20:00
…Identity

Add AzureImageRegistryCredentials struct on AzureNodePoolPlatform to
configure kubelet credential provider for ACR authentication using a
user-assigned managed identity on worker node VMs.

- UserAssignedManagedIdentity struct with resourceID (required)
- AzureManagedIdentityResourceID named type with CEL ARM resource ID
  validation, MinLength=1, MaxLength=512
- ImageRegistryCredentials is optional, value type with omitzero,
  triggers node rollout when changed
- No Registries field: controller hardcodes wildcard patterns covering
  all standard ACR endpoints (*.azurecr.io/cn/de/us)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ntroller

Add ACR credential provider MachineConfig generation and CAPI managed
identity attachment for Azure NodePools with imageRegistryCredentials.

When imageRegistryCredentials.managedIdentity is set on a NodePool:
- AzureMachineTemplate gets Identity: UserAssigned with the MI
- MachineConfig overrides MCO credential provider config to point at
  a separate acr-azure.json with useManagedIdentityExtension: true
- Kubelet uses acr-credential-provider binary for short-lived IMDS
  tokens, no static pull secrets needed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add YAML-driven envtest cases for imageRegistryCredentials validation:
- Valid managedIdentity with camelCase resourceGroups (pass)
- Lowercase resourcegroups rejected (fail, consistent with other
  Azure resource ID validations in this file)
- Invalid format string (fail)
- Wrong provider type (fail)
- Missing identity name (fail)
- Extra path segments (fail)
- imageRegistryCredentials not set (pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

@twolff-gh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

Copy link
Copy Markdown

The PR branch is diverged and behind_by: 2 — it does not include the fix. Now I have all the evidence needed.

Test Failure Analysis Complete

Job Information

  • Prow Job: Envtest Vanilla Kube 1.35.0 (GitHub Actions)
  • Build ID: 24809099130 / Job ID 72609904879
  • PR: #8205 — feat(azure): enable ACR image pulls via managed identity on worker nodes
  • Branch: azure-acr-mi-api (head: b4ae9c17)
  • Conclusion Job: Job ID 72610671370 — aggregator gate that failed because the envtest job failed

Test Failure Analysis

Error

[FAILED] Timed out after 30.001s.
CRD clusterresourcesets.addons.cluster.x-k8s.io should be fully removed
Expected
    <bool>: false
to be true

[FAIL] CRD Installation [It] should install all CRDs for feature set "TechPreviewNoUpgrade"
  /home/runner/_work/hypershift/hypershift/test/envtest/generator.go:262

609 Passed | 1 Failed | 0 Pending | 0 Skipped

Summary

The Envtest Vanilla Kube 1.35.0 job failed because the CRD clusterresourcesets.addons.cluster.x-k8s.io was not fully removed within the 30-second timeout during the TechPreviewNoUpgrade CRD installation test. This is a known flaky test (tracked as OCPBUGS-83585) that was already fixed on main by PR #8261 (merged April 17, 2026). The PR #8205 branch does not include this fix because it has not been rebased onto current main. The Conclusion job is simply an aggregator gate that fails whenever any upstream envtest job fails — it has no independent failure. The PR's own changes (Azure ACR managed identity for worker nodes) are completely unrelated to the failing test.

Root Cause

The root cause is a race condition in CRD cleanup within test/envtest/generator.go, function GenerateCRDInstallTest().

The bug (present on PR branch): After installing and validating CRDs for a feature set, the test calls envtest.UninstallCRDs() but returns immediately without waiting for the API server to actually finalize CRD deletion. When the next test suite starts, the clusterresourcesets.addons.cluster.x-k8s.io CRD is still in a deletion-pending state, causing the 30-second timeout assertion to fail.

The fix (on main via PR #8261, commit fa1cecda): Added an Eventually loop after UninstallCRDs() that polls each CRD until apierrors.IsNotFound(err) returns true, ensuring deletion is fully complete before proceeding.

Why only Kube 1.35.0? CRD garbage collection speed varies by Kubernetes version. The Kube 1.35.0 envtest binary has slower CRD finalization timing, making it the most susceptible to this race condition. Other Kube versions and OCP-variant envtests happened to finalize fast enough to avoid the timeout.

Why is PR #8205 affected? The PR branch was created on April 10, 2026, before the fix was merged on April 17, 2026. The branch is 2 commits behind main and has diverged status, meaning the fix commit fa1cecda is not present.

Recommendations
  1. Rebase PR feat(azure): enable ACR image pulls via managed identity on worker nodes #8205 onto current main to pick up the fix from PR OCPBUGS-83585: Wait for CRD removal in GenerateCRDInstallTest to fix flaky envtest #8261 (commit fa1cecda). This is the only action needed — no changes to the PR's own code are required.

    git fetch origin main
    git rebase origin/main
    git push --force-with-lease
  2. Re-run the workflow after rebasing. The Envtest Vanilla Kube 1.35.0 test should pass, and the Conclusion aggregator gate will pass as a consequence.

  3. No PR code changes needed — the PR's Azure ACR managed identity feature is unrelated to this CRD installation test failure.

Evidence
Evidence Detail
Failing test CRD Installation [It] should install all CRDs for feature set "TechPreviewNoUpgrade"
Failing CRD clusterresourcesets.addons.cluster.x-k8s.io timed out on removal (30s)
Test file test/envtest/generator.go:262
Known bug OCPBUGS-83585 — race condition in CRD cleanup
Fix PR #8261 — merged April 17, 2026 (commit fa1cecda)
PR branch status diverged, 2 commits behind main — missing the fix
PR changes scope Azure ACR MI API types, nodepool CRDs, ACR credential controller — no CAPI CRDs touched
Conclusion job Aggregator gate — failed solely because envtest job failed
Test results 609 Passed, 1 Failed, 0 Pending, 0 Skipped

// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=512
// +kubebuilder:validation:XValidation:rule="self.matches('^/subscriptions/[^/]+/resourceGroups/[^/]+/providers/Microsoft\\\\.ManagedIdentity/userAssignedIdentities/[^/]+$')",message="must be a valid ARM resource ID for a user-assigned managed identity (e.g., /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{name})"
type AzureManagedIdentityResourceID string

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From another PR I reviewed recently that enforces this format, it looks like subscription ID and resource group can be further validated to be the correct format like:

// +kubebuilder:validation:XValidation:rule="oldSelf.hasValue() && oldSelf.value() == self || size(self.split('/')) == 9 && self.split('/')[2].matches('^[0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}$')",message="the subscriptionId in the imageID must be a valid UUID. It should be 5 groups of hyphen separated hexadecimal characters in the form 8-4-4-4-12",optionalOldSelf=true
// +kubebuilder:validation:XValidation:rule=`oldSelf.hasValue() && oldSelf.value() == self || size(self.split('/')) == 9 && self.split('/')[4].matches('^[a-zA-Z0-9_\\(\\)\\.-]{1,90}$')`,message="the resourceGroupName should be between 1 and 90 characters, consisting only of alphanumeric characters, hyphens, underscores, periods and parenthesis",optionalOldSelf=true

Reference: https://github.com/openshift/hypershift/pull/8106/changes#diff-11b7c56a5faf0c5e2154fc35474456e2aac1b6ea8e22e5bbda6c3a3996d91afaR143-R144

@twolff-gh twolff-gh closed this Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/azure PR/issue for Azure (AzurePlatform) platform ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants