Skip to content

[Test] Add end2end test to cover cluster patching.#7456

Open
gmarciani wants to merge 1 commit into
aws:developfrom
gmarciani:wip/mgiacomo/3160/test-self-patch-0623-1
Open

[Test] Add end2end test to cover cluster patching.#7456
gmarciani wants to merge 1 commit into
aws:developfrom
gmarciani:wip/mgiacomo/3160/test-self-patch-0623-1

Conversation

@gmarciani

@gmarciani gmarciani commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Description of changes

Add end2end test to cover cluster patching.

Minor changes made in this PR:

  1. move the gpu workload script to a common folder so that it can be reused by this new test.

Test flow

  1. Retrieve the AMI the cluster will use
  2. Run EC2 instance with the source AMI, run patching commands and create a new patched AMI.
  3. Create a cluster that uses the source AMI
  4. Validate that the cluster is functional (GPU workload submitted from login node)
  5. Stop the login nodes.
  6. Update the cluster so that compute and login nodes use the patched AMI
  7. Patch the head node in place and reboot it.
  8. Start login nodes
  9. Validate that the cluster is functional (GPU workload submitted from login node)

Tests

ONGOING:

test-suites:
  patching:
    test_patching.py::test_patching_cluster:
      dimensions:
        - regions: [{{ g4dn_8xlarge_CAPACITY_RESERVATION_3_INSTANCES_2_HOURS_NOPG_rhel8 }}]
          instances: ["g4dn.8xlarge"]
          oss: ["rhel8"]
          schedulers: ["slurm"]
        - regions: [{{ g4dn_8xlarge_CAPACITY_RESERVATION_3_INSTANCES_2_HOURS_NOPG_alinux2023 }}]
          instances: ["g4dn.8xlarge"]
          oss: ["alinux2023"]
          schedulers: ["slurm"]
        - regions: [{{ g4dn_8xlarge_CAPACITY_RESERVATION_3_INSTANCES_2_HOURS_NOPG_ubuntu2204 }}]
          instances: ["g4dn.8xlarge"]
          oss: ["ubuntu2204"]
          schedulers: ["slurm"]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani added skip-changelog-update Disables the check that enforces changelog updates in PRs 3.x Test labels Jun 23, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-self-patch-0623-1 branch 4 times, most recently from 4568e29 to c86eb8d Compare June 24, 2026 17:09
@gmarciani gmarciani changed the title [Test] Add test to cover self-patching. [Test] Add end2end test to cover cluster patching. Jun 24, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-self-patch-0623-1 branch from c86eb8d to 29db146 Compare June 24, 2026 22:32
@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.92%. Comparing base (bdf724a) to head (29db146).
⚠️ Report is 12 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #7456   +/-   ##
========================================
  Coverage    89.91%   89.92%           
========================================
  Files          180      180           
  Lines        16200    16226   +26     
========================================
+ Hits         14567    14592   +25     
- Misses        1633     1634    +1     
Flag Coverage Δ
unittests 89.92% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-self-patch-0623-1 branch from 29db146 to f08e36f Compare June 25, 2026 11:09
@gmarciani gmarciani marked this pull request as ready for review June 25, 2026 11:22
@gmarciani gmarciani requested review from a team as code owners June 25, 2026 11:22
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-self-patch-0623-1 branch 9 times, most recently from 6d3cf9e to 7093bf8 Compare June 25, 2026 19:48

# Retrieve the AMI the cluster will use *before* creation. Since the create
# config does not set a custom AMI, the cluster uses this default pcluster AMI.
base_ami = retrieve_latest_ami(region, os, ami_type="pcluster", request=request)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this respect the "AMI_OWNER" specification in the testing infrastructure? If not, how about retrieving AMIs from the CloudFormation template of the first cluster creation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Godd catch, it may not. Fixing

# Assert the kernel modules loaded before patching are still loaded on the head,
# compute and login nodes (which were all replaced or patched).
for node_type in NODE_TYPES:
_assert_kernel_modules_loaded(cluster, scheduler_commands_factory, node_type, kernel_modules_before)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure the number of kernel modules don't change before and after the update?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, what we want to enforce is that all the kernel modules loaded before the patching are still loaded after the patching (will fix the assertion).

That said, it may be that this assertion will be anyways too hard because a kernel bump could legitimately invalidate some kernel modules. My proposal is to start with a hard assertion as long as it succeeds and we will relax if we face legitimate removals.

@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-self-patch-0623-1 branch from 7093bf8 to 54f8309 Compare June 26, 2026 12:10
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-self-patch-0623-1 branch from 54f8309 to 15669a7 Compare June 26, 2026 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.x skip-changelog-update Disables the check that enforces changelog updates in PRs Test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants