Skip to content

SPLAT-2167: Added dedicated host support for AWS#374

Open
vr4manta wants to merge 6 commits intoopenshift:mainfrom
vr4manta:SPLAT-2167
Open

SPLAT-2167: Added dedicated host support for AWS#374
vr4manta wants to merge 6 commits intoopenshift:mainfrom
vr4manta:SPLAT-2167

Conversation

@vr4manta
Copy link

@vr4manta vr4manta commented Oct 1, 2025

SPLAT-2167

Changes

  • Added dedicated host support for AWS
  • Created new dedicated host tests
  • Fixed tests that were breaking due to updates upstream that were pulled in
  • Added missing permission ec2:DescribeInstanceTypes to cluster api credentials request

Dependencies

Notes

There seems to be a required permission that was missing for dedicated host support. When running these changes, a warning event was observed in the cluster capi operator namespace.

6s          Warning   FailedDescribeInstanceTypes   awscluster/ngirard-dh-5bb5w                           insufficient permissions to describe instance types for instance type "m6i.xlarge", falling back to the default architecture of "x86_64": operation error EC2: DescribeInstanceTypes, https response error StatusCode: 403, RequestID: 387549b4-ab58-48af-b14d-3882b6c7da52, api error UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::726924432237:user/ngirard-dh-5bb5w-openshift-cluster-api-aws-72f7q is not authorized to perform: ec2:DescribeInstanceTypes because no identity-based policy allows the ec2:DescribeInstanceTypes action

Summary by CodeRabbit

  • New Features

    • Added AWS dedicated-host support with tenancy modes ("default", "dedicated", "host"), HostID validation, and deterministic dynamic host-allocation tag ordering.
  • Security / Permissions

    • AWS credentials policy expanded to allow instance-type discovery (ec2:DescribeInstanceTypes).
  • Tests

    • Expanded conversion and fuzz tests covering host-affinity, HostID formats, dynamic host allocation, and validation scenarios.
  • Bug Fixes

    • Infra diffs now include host-affinity and HostID, enabling drift detection.
  • Chores

    • Dependency/module version updates.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 1, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 1, 2025

@vr4manta: This pull request references SPLAT-2167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

SPLAT-2167

Changes

  • Added dedicated host support for AWS

Dependencies

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 1, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link

coderabbitai bot commented Oct 1, 2025

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds AWS dedicated-host placement support across CAPA↔MAPI conversions, extends fuzz and unit tests for placement scenarios, re-enables host field diffing in infra comparisons, bumps several module deps, and adds ec2:DescribeInstanceTypes to the credentials-request.

Changes

Cohort / File(s) Summary
Credentials Request
manifests/0000_30_cluster-api_01_credentials-request.yaml
Added ec2:DescribeInstanceTypes to AWS provider IAM actions.
CAPA → MAPI conversion
pkg/conversion/capi2mapi/aws.go, pkg/conversion/capi2mapi/aws_fuzz_test.go, pkg/conversion/capi2mapi/aws_test.go
Add dedicated-host support: tenancy constants (TenancyDefault, TenancyDedicated, TenancyHost), HostID regex/validation and new error messages, host-affinity conversion helpers, deterministic tag sorting, integrate Placement into toProviderSpec/status, and expand fuzz/tests for HostAffinity/HostID cases.
MAPI → CAPA conversion
pkg/conversion/mapi2capi/aws.go, pkg/conversion/mapi2capi/aws_fuzz_test.go, pkg/conversion/mapi2capi/aws_test.go
Implement conversion helpers mapping MAPI HostPlacement → CAPA fields (HostAffinity, HostID, DynamicHostAllocation); populate AWSMachineSpec/status from placement; add fuzzPlacement and placement tests; sort dynamic allocation tags deterministically.
Controllers — machineset / machinesync
pkg/controllers/machinesetsync/machineset_sync_controller.go, pkg/controllers/machinesync/machine_sync_mapi2capi_infrastructure.go
Removed diff-ignore options for AWS spec.hostID and spec.hostAffinity; those fields are now considered in infra template and infra machine comparisons.
Module updates
go.mod, e2e/go.mod, hack/tools/go.mod
Bumped sigs.k8s.io/cluster-api-provider-aws and openshift/api references, added a replace for github.com/openshift/api, and bumped kustomize/tool versions in hack/tools/go.mod.

Sequence Diagram(s)

sequenceDiagram
  participant CAPA as CAPA AWSMachine
  participant C2M as capi2mapi.converter
  participant MAPI as MAPI ProviderConfig/Placement
  participant M2C as mapi2capi.converter
  participant Controller as machineset/machinesync

  CAPA->>C2M: provide AWSMachineSpec (Tenancy, HostAffinity, HostID, DynamicHostAllocation)
  C2M->>MAPI: convert to ProviderConfig.Placement & ProviderStatus.DedicatedHost
  MAPI->>M2C: expose ProviderConfig.Placement
  M2C->>CAPA: convert Placement back to AWSMachineSpec/Status
  Controller->>MAPI: compute diffs (includes hostID/hostAffinity)
  Controller->>Controller: act on drift/updates
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through specs and regex with glee,

Mapped HostAffinity, HostID, tags in a spree,
Fuzzers twirled through eight playful ways,
Diffs now mind hosts by nights and days,
A tiny rabbit patch, neat as can be.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 1, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 1, 2025

@vr4manta: This pull request references SPLAT-2167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

SPLAT-2167

Changes

  • Added dedicated host support for AWS

Dependencies

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 1, 2025
@vr4manta vr4manta force-pushed the SPLAT-2167 branch 2 times, most recently from 6355141 to 7ba97c3 Compare November 10, 2025 16:22
@openshift-ci-robot
Copy link

openshift-ci-robot commented Nov 11, 2025

@vr4manta: This pull request references SPLAT-2167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

SPLAT-2167

Changes

  • Added dedicated host support for AWS

Dependencies

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@vr4manta vr4manta force-pushed the SPLAT-2167 branch 2 times, most recently from dc53d82 to e5cbce2 Compare November 11, 2025 17:16
@vr4manta
Copy link
Author

/test all

@vr4manta
Copy link
Author

/test all

@vr4manta vr4manta force-pushed the SPLAT-2167 branch 2 times, most recently from ed41da7 to 320eac0 Compare November 13, 2025 14:40
@openshift-ci-robot
Copy link

openshift-ci-robot commented Nov 13, 2025

@vr4manta: This pull request references SPLAT-2167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

SPLAT-2167

Changes

  • Added dedicated host support for AWS

Dependencies

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Nov 14, 2025

@vr4manta: This pull request references SPLAT-2167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

SPLAT-2167

Changes

  • Added dedicated host support for AWS
  • Created new dedicated host tests
  • Fixed tests that were breaking due to updates upstream that were pulled in
  • Added missing permission ec2:DescribeInstanceTypes to cluster api credentials request

Dependencies

Notes

There seems to be a required permission that was missing for dedicated host support. When running these changes, a warning event was observed in the cluster capi operator namespace.

6s          Warning   FailedDescribeInstanceTypes   awscluster/ngirard-dh-5bb5w                           insufficient permissions to describe instance types for instance type "m6i.xlarge", falling back to the default architecture of "x86_64": operation error EC2: DescribeInstanceTypes, https response error StatusCode: 403, RequestID: 387549b4-ab58-48af-b14d-3882b6c7da52, api error UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::726924432237:user/ngirard-dh-5bb5w-openshift-cluster-api-aws-72f7q is not authorized to perform: ec2:DescribeInstanceTypes because no identity-based policy allows the ec2:DescribeInstanceTypes action

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@vr4manta vr4manta marked this pull request as ready for review November 14, 2025 13:15
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 14, 2025
@damdo
Copy link
Member

damdo commented Feb 27, 2026

@huali9 yes we need an upstream release for these:
kubernetes-sigs/cluster-api-provider-aws@v2.10.1...release-2.10

and then a upstream -> downstream sync

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 3, 2026
@sunzhaohua2
Copy link
Contributor

/test e2e-aws-capi-techpreview

@vr4manta
Copy link
Author

vr4manta commented Mar 5, 2026

I'm looking into the e2e failure to see what is up

@huali9
Copy link
Contributor

huali9 commented Mar 6, 2026

"failed to update MAPI machine set: admission webhook \"validation.machineset.machine.openshift.io\" denied the request: spec.placement.host: Forbidden: host may only be specified when tenancy is 'host'"

The error is caused by inconsistent between MAPI and CAPI.
Only set hostAffinity: default is allowed in CAPI. But the corresponding configuration is not allowed in MAPI:
Only set

            host:
              affinity: AnyAvailable  

I already raised a bug https://issues.redhat.com/browse/OCPBUGS-73821 for this before.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 9, 2026
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 9, 2026
@vr4manta
Copy link
Author

vr4manta commented Mar 9, 2026

/test e2e-aws-capi-techpreview

@vr4manta
Copy link
Author

vr4manta commented Mar 9, 2026

/test e2e-aws-capi-techpreview

@vr4manta
Copy link
Author

vr4manta commented Mar 9, 2026

Fixed e2e. A few minor issues found in conversion logic (both ways -> capi2mapi and mapi2capi). Now manually running some of the tests from OCPBUGS-73821. Some tests may not be completely valid, but making sure behavior is same on both side.

@huali9
Copy link
Contributor

huali9 commented Mar 10, 2026

/test regression-clusterinfra-aws-ipi-techpreview-capi

@huali9
Copy link
Contributor

huali9 commented Mar 10, 2026

Thank you for addressing the e2e issues. I'm currently testing the feature. Given the scope of the changes, I'll need some time to perform a full retest of the functionality to ensure everything works as expected.

However, if you prefer to merge the PR sooner, I can add the verified label now, as regression tests (e2e-aws-capi-techpreview and regression-clusterinfra-aws-ipi-techpreview-capi) have passed successfully. Please let me know which option works best for you.

@JoelSpeed
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2026
@openshift-ci-robot
Copy link

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@vr4manta
Copy link
Author

/test e2e-aws-ovn e2e-aws-ovn-serial-1of2 e2e-aws-ovn-serial-2of2 e2e-aws-ovn-techpreview e2e-aws-ovn-techpreview-upgrade

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 10, 2026

@vr4manta: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn 3820d2b link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-openstack-ovn-techpreview 68b083f link true /test e2e-openstack-ovn-techpreview
ci/prow/e2e-metal3-capi-techpreview 68b083f link false /test e2e-metal3-capi-techpreview

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@vr4manta
Copy link
Author

/retest

@vr4manta
Copy link
Author

Thank you for addressing the e2e issues. I'm currently testing the feature. Given the scope of the changes, I'll need some time to perform a full retest of the functionality to ensure everything works as expected.

However, if you prefer to merge the PR sooner, I can add the verified label now, as regression tests (e2e-aws-capi-techpreview and regression-clusterinfra-aws-ipi-techpreview-capi) have passed successfully. Please let me know which option works best for you.

@huali9 I am fine waiting on the finishing of the tests as long as it doesn't take too long. I may have to rebase again and rerun all tests if it takes too long and I'll be out tomorrow and Friday. Thanks for taking a look and for all the feedback!

@huali9
Copy link
Contributor

huali9 commented Mar 11, 2026

Thank you! @vr4manta I'll aim to complete the tests and updates by the end of this week. Thanks for your patience and feedback!

@huali9
Copy link
Contributor

huali9 commented Mar 13, 2026

I've completed testing/retesting this feature for both CAPI and MAPI. However, I found four issues related to dynamicHostAllocation (3 for CAPI and 1 for CPMS):

  1. CAPI machine stuck in Pending: When dynamicHostAllocation is enabled, the CAPI machine gets stuck in Pending state, and the awsmachine shows a 403 UnauthorizedOperation error.
  tenancy: host
  hostAffinity: host
  dynamicHostAllocation:
    tags:
      testkey: qatest
  1. CAPI dynamicHostAllocation cleanup issue: MAPI dynamicHostAllocation works correctly (MAPI machine becomes Running in the automatically created dedicated host). However, when I change the machine authoritativeAPI from MachineAPI to ClusterAPI and delete the machine, it gets stuck in Deleting state with a panic error in CAPA logs:
I0313 03:20:21.953077       1 awsmachine_controller.go:367] "EC2 instance terminated successfully" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api/huliu-aws313a-vcqgl-worker-us-east-2a-ghkw2" namespace="openshift-cluster-api" name="huliu-aws313a-vcqgl-worker-us-east-2a-ghkw2" reconcileID="0f67b182-dbfd-48c8-923d-f68c54c0b969" machine="openshift-cluster-api/huliu-aws313a-vcqgl-worker-us-east-2a-ghkw2" cluster="openshift-cluster-api/huliu-aws313a-vcqgl" instance-id="i-0d16cf72f447e86c2"
I0313 03:20:21.953106       1 awsmachine_controller.go:376] "Releasing dynamically allocated dedicated host" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api/huliu-aws313a-vcqgl-worker-us-east-2a-ghkw2" namespace="openshift-cluster-api" name="huliu-aws313a-vcqgl-worker-us-east-2a-ghkw2" reconcileID="0f67b182-dbfd-48c8-923d-f68c54c0b969" machine="openshift-cluster-api/huliu-aws313a-vcqgl-worker-us-east-2a-ghkw2" cluster="openshift-cluster-api/huliu-aws313a-vcqgl" hostID="h-04b410609213753f7"
E0313 03:20:22.079916       1 runtime.go:142] "Observed a panic" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api/huliu-aws313a-vcqgl-worker-us-east-2a-ghkw2" namespace="openshift-cluster-api" name="huliu-aws313a-vcqgl-worker-us-east-2a-ghkw2" reconcileID="0f67b182-dbfd-48c8-923d-f68c54c0b969" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<
goroutine 480 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x66a44b8, 0xc002de0fc0}, {0x55a7a60, 0x8d95450})
/build/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:132 +0xbc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile.func1()
/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:108 +0x112
panic({0x55a7a60?, 0x8d95450?})
/usr/lib/golang/src/runtime/panic.go:783 +0x132
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/ec2.(*Service).getReleaseHostsOutput(0xc001817c08?, 0x6638b80?)
/build/pkg/cloud/services/ec2/dedicatedhosts.go:242 +0x1d
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/ec2.(*Service).ReleaseDedicatedHost(0xc0006f08c0, {0x66a44b8, 0xc002de0fc0}, {0xc002c6daa0, 0x13})
/build/pkg/cloud/services/ec2/dedicatedhosts.go:123 +0x1c8
sigs.k8s.io/cluster-api-provider-aws/v2/controllers.(*AWSMachineReconciler).reconcileDelete(0xc001177700, {0x66a44b8, 0xc002de0fc0}, 0xc0021257a0, {0x66d33b0, 0xc001817c08}, {0x66e4490, 0xc001817c08}, {0x66e3a40, 0xc001817c08}, ...)
/build/controllers/awsmachine_controller.go:377 +0xcbd
sigs.k8s.io/cluster-api-provider-aws/v2/controllers.(*AWSMachineReconciler).Reconcile(0xc001177700, {0x66a44b8, 0xc002de0fc0}, {{{0xc002c6d980?, 0x30?}, {0xc001bc8510?, 0xc000100008?}}})
/build/controllers/awsmachine_controller.go:236 +0xbd6

It appears that when CAPI tries to delete a dedicated host created by MAPI, it encounters a nil pointer error. I believe we should support cross-API operations (MAPI create → CAPI delete, and vice versa) since we already support MAPI2CAPI and CAPI2MAPI conversions.

On the AWS console, the instance Terminated, but the dedicated host remains Available. Even after destroying the cluster, the dedicated host is still Available and must be manually released.

Destroy job failed (I'm not sure the reason), some log shows:

03-12 20:00:43.514 level=debug msg=unrecognized EC2 resource type dedicated-host arn=arn:aws:ec2:us-east-2:301721915996:dedicated-host/h-0741d78d65cb2dbc6

I think the installer destroy process should support dedicated host cleanup as a safety net.

  1. CAPI to MAPI sync validation error: When I configure:
  tenancy: host
  dynamicHostAllocation:
    tags:
      testkey: qatest

Syncing to MAPI fails with:

    message: 'failed to convert CAPI machine set to MAPI machine set: spec.dynamicHostAllocation:
      Invalid value: {"tags":{"testkey":"qatest"}}: dynamicHostAllocation is only
      allowed when hostAffinity is host'
    reason: FailedToConvertCAPIMachineSetToMAPI

However, the equivalent MAPI configuration works fine:

  tenancy: host
  host:
    affinity: AnyAvailable
    dedicatedHost:
      allocationStrategy: Dynamic
      dynamicHostAllocation:
        tags:
        - name: team
          value: qa

I'm not sure why this sync error occurs. But this validation seems incorrect and should be removed, as the CAPI configuration is valid and also works in MAPI.

  1. CPMS dedicated host not released: When using dynamicHostAllocation with ControlPlaneMachineSet (CPMS), the dedicated host is not released when the control plane machine is deleted. Note that deleting worker machines correctly releases the dedicated host automatically, so this issue appears to be specific to CPMS. Additionally, destroy the cluster also failed to release it, and I had to manually release the dedicated host.

MAO logs show the host was allocated but not released:

liuhuali@Lius-MacBook-Pro huali-test % oc logs  machine-api-controllers-94458854-ldlg9  -c machine-controller |grep h-00f27f8d415ec44a8 
I0313 06:29:17.683178       1 dedicatedhosts.go:53] Successfully allocated dedicated host h-00f27f8d415ec44a8 for machine huliu-aws313a-vcqgl-master-4mgws-2
I0313 06:29:17.683194       1 instances.go:482] Allocated dedicated host h-00f27f8d415ec44a8 for machine huliu-aws313a-vcqgl-master-4mgws-2
liuhuali@Lius-MacBook-Pro huali-test % 

I've also updated the test cases with more detailed information:
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-87006
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-86951

@vr4manta Could you please help investigate these issues? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants