[integ-tests-framework] Make capacity reservations for all instance types#7461

Open
hanwen-cluster wants to merge 1 commit into aws:develop from hanwen-cluster:developjun29

@hanwen-cluster hanwen-cluster commented Jun 29, 2026

Description of changes

  1. With [integ-tests] Improve test_proxy to avoid insufficient capacity error #7440, we started making capacity reservations for {"c5.xlarge", "m6g.xlarge", "m6i.xlarge"} and falling back to similar instance types if a capacity reservation fails to be created. This commit extends that logic to all instance types.
    1.1. For instance types <= .xlarge, we make duplicate capacity reservations, because multiple tests running in parallel could use the same instance type and therefore need multiple capacity reservations. For instance types > .xlarge, we make only one capacity reservation, because tests with larger instance types usually make capacity reservations early in the test definition (e.g. test_efa in commercial makes its capacity reservation in develop.yaml), so this second layer of capacity reservation shouldn't create duplicates.
    1.2. For instance types that support EFA, create the capacity reservation in a placement group. For instance types that do not support EFA, create it without a placement group.
  2. With this commit, resolve_instance_with_capacity accepts an explicit alternative_instance_types. Previously, alternative_instance_types was always computed with get_similar_instance_types, which can be too restrictive and return too few alternatives for instance types like c5n.18xlarge.
  3. Improve test_efa in isolated_regions to accept a flag that lets it use any EFA instance type, to avoid Insufficient Capacity Errors. test_efa in commercial doesn't need this because it can try different regions; in isolated regions, the test has to run in a specific region.
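
For reviewers, the rules in items 1 and 2 can be sketched roughly as below. This is an illustration only, not the code in this PR: `is_small`, `reservation_plan`, `first_reservable`, the `try_reserve` callback, the `SMALL_SIZES` set, and the default of two duplicates are all hypothetical names and values chosen for the sketch, and the real `resolve_instance_with_capacity` signature may differ.

```python
from typing import Callable, Optional, Sequence

# Sizes treated as "<= .xlarge". This set is an assumption made for the sketch.
SMALL_SIZES = {"nano", "micro", "small", "medium", "large", "xlarge"}


def is_small(instance_type: str) -> bool:
    """Return True when the size suffix (after the dot) is .xlarge or smaller."""
    _, _, size = instance_type.partition(".")
    return size in SMALL_SIZES


def reservation_plan(instance_type: str, efa_supported: bool, duplicates: int = 2) -> dict:
    """Describe how many reservations to make and whether to use a placement group.

    `duplicates` is a hypothetical parameter; the PR text does not state the
    actual number of duplicate reservations.
    """
    return {
        "instance_type": instance_type,
        # Small types may be shared by parallel tests, so reserve more than once.
        "count": duplicates if is_small(instance_type) else 1,
        # EFA-capable types get a placement group; others do not.
        "use_placement_group": efa_supported,
    }


def first_reservable(
    instance_type: str,
    try_reserve: Callable[[str], bool],
    alternative_instance_types: Optional[Sequence[str]] = None,
) -> Optional[str]:
    """Try the requested type first, then each alternative in order.

    Returns the first type for which `try_reserve` succeeds, or None.
    """
    for candidate in [instance_type, *(alternative_instance_types or [])]:
        if try_reserve(candidate):
            return candidate
    return None
```

Passing the reservation attempt in as a callback keeps the sketch free of AWS calls; in the test framework that step would be the actual capacity reservation request.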

Tests

test-suites:
  efa:
    test_efa.py::test_efa:
      dimensions:
        - regions: ["ap-southeast-5"]
          instances: ["c5n.18xlarge"]
          oss: ["alinux2023"]
          schedulers: ["slurm"]
          flags: ["any-efa-instances"]
        - regions: ["us-east-1"]
          instances: ["c5n.18xlarge"]
          oss: ["alinux2023"]
          schedulers: ["slurm"]

Of the tests above, the one in us-east-1 passed completely. The one in ap-southeast-5 failed some fabtest checks because it ran on g6.8xlarge instead of c5n.18xlarge. This failure is not a regression from this PR and won't surface in isolated regions, because fabtest is not run there.
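
The substitution seen in ap-southeast-5 (c5n.18xlarge requested, g6.8xlarge used) follows from the `any-efa-instances` flag. A minimal sketch of that behaviour, assuming the flag simply widens the candidate list: `efa_candidates` and the `efa_instance_types` argument are hypothetical and not taken from the PR, which may discover and order EFA-capable types differently.

```python
from typing import Iterable, List, Optional, Sequence


def efa_candidates(
    requested: str,
    flags: Optional[Iterable[str]],
    efa_instance_types: Sequence[str],
) -> List[str]:
    """Return the instance types to try, with the requested type first.

    Without the flag only the requested type is returned. With it, every
    other EFA-capable type from the supplied list is appended as a fallback.
    """
    candidates = [requested]
    if "any-efa-instances" in (flags or ()):
        candidates += [t for t in efa_instance_types if t != requested]
    return candidates


# With the flag, g6.8xlarge becomes an acceptable fallback for c5n.18xlarge.
print(efa_candidates("c5n.18xlarge", ["any-efa-instances"], ["c5n.18xlarge", "g6.8xlarge"]))
# → ['c5n.18xlarge', 'g6.8xlarge']
```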

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners June 29, 2026 21:05
@hanwen-cluster hanwen-cluster added the skip-changelog-update Disables the check that enforces changelog updates in PRs label Jun 29, 2026