Skip to content

fix: skip VMs with host passthrough during DRS plan generation#13099

Closed
raniers1 wants to merge 2 commits intoapache:mainfrom
raniers1:fix/4.21-drs-skip-passthrough-vms
Closed

fix: skip VMs with host passthrough during DRS plan generation#13099
raniers1 wants to merge 2 commits intoapache:mainfrom
raniers1:fix/4.21-drs-skip-passthrough-vms

Conversation

@raniers1
Copy link
Copy Markdown

@raniers1 raniers1 commented May 4, 2026

When iterating over VMs in getBestMigration, the call to listHostsForMigrationOfVM throws InvalidParameterValueException for VMs using a GPU passthrough profile. This exception was not caught inside the loop, causing it to propagate up and abort the entire DRS plan generation for the cluster, leaving all other eligible VMs without a migration plan.

Fix wraps the listHostsForMigrationOfVM call in a try-catch block. When the exception is caught, the VM is skipped with a debug log message and the loop continues evaluating the remaining VMs.

Also adds unit test testGetBestMigrationSkipsPassthroughVm to verify that a VM throwing InvalidParameterValueException is skipped while other eligible VMs are still considered for migration.

Fixes: #13098

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Bug Severity

  • Minor

How Has This Been Tested?

  • Added unit test testGetBestMigrationSkipsPassthroughVm in ClusterDrsServiceImplTest. The test mocks one VM throwing InvalidParameterValueException and verifies that the exception is caught, the VM is skipped, and the remaining eligible VM is still selected for migration.
  • All 15 tests in ClusterDrsServiceImplTest pass with no failures.
  • Validated in a live environment. After the fix, the VM with host passthrough is correctly skipped with a debug log instead of aborting the entire cluster DRS plan:

DEBUG [o.a.c.c.ClusterDrsServiceImpl] Skipping VM VM instance
{"id":621,"instanceName":"i-23-621-VM","state":"Running","type":"User"}
for DRS, unsupported operation: Unsupported operation, VM uses host passthrough, cannot migrate.

When iterating over VMs in getBestMigration, the call to
listHostsForMigrationOfVM throws InvalidParameterValueException
for VMs using a vGPU passthrough profile. This exception was not
caught inside the loop, causing it to propagate up and abort the
entire DRS plan generation for the cluster, leaving all other
eligible VMs without a migration plan.

Fix wraps the listHostsForMigrationOfVM call in a try-catch block.
When the exception is caught, the VM is skipped with a debug log
message and the loop continues evaluating the remaining VMs.

Also adds unit test testGetBestMigrationSkipsPassthroughVm to verify
that a VM throwing InvalidParameterValueException is skipped while
other eligible VMs are still considered for migration.

Fixes: apache#13098
@boring-cyborg
Copy link
Copy Markdown

boring-cyborg Bot commented May 4, 2026

Congratulations on your first Pull Request and welcome to the Apache CloudStack community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/cloudstack/blob/main/CONTRIBUTING.md)
Here are some useful points:

@sureshanaparti
Copy link
Copy Markdown
Contributor

@raniers1 the mentioned fix got reverted during the merge commit, can you check.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 3.52%. Comparing base (1e512ab) to head (4bdfe1b).

❗ There is a different number of reports uploaded between BASE (1e512ab) and HEAD (4bdfe1b). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (1e512ab) HEAD (4bdfe1b)
unittests 1 0
Additional details and impacted files
@@              Coverage Diff              @@
##               main   #13099       +/-   ##
=============================================
- Coverage     18.08%    3.52%   -14.56%     
=============================================
  Files          6037      464     -5573     
  Lines        542437    40137   -502300     
  Branches      66420     7555    -58865     
=============================================
- Hits          98088     1415    -96673     
+ Misses       433333    38534   -394799     
+ Partials      11016      188    -10828     
Flag Coverage Δ
uitests 3.52% <ø> (ø)
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@vishesh92
Copy link
Copy Markdown
Member

@raniers1 can you create the fix against 4.20 branch.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent cluster DRS plan generation from aborting when encountering VMs that cannot be migrated due to host passthrough (e.g., vGPU passthrough), by skipping those VMs and continuing to evaluate the remaining candidates.

Changes:

  • Skip VMs that trigger InvalidParameterValueException during migration-host evaluation so DRS can still produce a plan for other eligible VMs.
  • Add a unit test intended to verify passthrough VMs are skipped while other VMs are still considered.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +884 to +892
Mockito.when(managementServer.listHostsForMigrationOfVM(vmPassthrough, 0L, 500L, null, vmList))
.thenThrow(new InvalidParameterValueException("Unsupported operation, VM uses host passthrough, cannot migrate"));
Mockito.when(managementServer.listHostsForMigrationOfVM(vmNormal, 0L, 500L, null, vmList)).thenReturn(
new Ternary<>(new Pair<>(List.of(destHost), 1), List.of(destHost), Map.of(destHost, false)));
Mockito.when(balancedAlgorithm.getMetrics(cluster, vmNormal, serviceOffering, destHost, new HashMap<>(),
new HashMap<>(), false)).thenReturn(new Ternary<>(1.0, 0.5, 1.5));

Pair<VirtualMachine, Host> bestMigration = clusterDrsService.getBestMigration(cluster, balancedAlgorithm,
vmList, vmIdServiceOfferingMap, new HashMap<>(), new HashMap<>());
@raniers1 raniers1 closed this May 5, 2026
@raniers1
Copy link
Copy Markdown
Author

raniers1 commented May 5, 2026

Closing this PR. The fix was accidentally reverted during conflict resolution. Will reopen targeting the 4.20 branch as requested by @vishesh92

@raniers1
Copy link
Copy Markdown
Author

raniers1 commented May 5, 2026

After further investigation, the fix was already introduced in the 4.20 branch via a prior refactoring of getCompatibleHostAndVmStorageMotionCache, which includes a catch (Exception e) block that prevents InvalidParameterValueException from propagating and aborting the DRS plan generation.

The bug specifically affects the 4.21.0.0 release tag, which does not have an active maintenance branch. All currently maintained branches (4.20, 4.22, main) already handle this case.

Thank you for the review and guidance, @vishesh92

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DRS plan generation fails for entire cluster when a VM uses host passthrough

5 participants