Skip to content

server: check server capacity when start/deploy a vm#5339

Merged
nvazquez merged 1 commit into
apache:4.15from
shapeblue:4.16-check-server-capacity-start-or-migrate-vm
Sep 3, 2021
Merged

server: check server capacity when start/deploy a vm#5339
nvazquez merged 1 commit into
apache:4.15from
shapeblue:4.16-check-server-capacity-start-or-migrate-vm

Conversation

@weizhouapache
Copy link
Copy Markdown
Member

@weizhouapache weizhouapache commented Aug 19, 2021

Description

This PR fixes the issue that server capacity check is skipped, when start or migrate a vm with specified hostid.

=======UPDATE 2021-09-01===========
We have decided to only add capacity check. the other checks (host state, affinity group, etc) are not added.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

@weizhouapache
Copy link
Copy Markdown
Member Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@weizhouapache
Copy link
Copy Markdown
Member Author

weizhouapache commented Aug 19, 2021

@andrijapanicsb @nvazquez

the following checks are added when deploy/start a vm with specified id.
(1) host state (Up and Enabled)
(2) DPDK enabled (if it is enabled on vm, host should be enabled also)
(3) hypervisor capability (max guest limits)
(4) server capacity

not sure if users are happy with the checks.

ps: (1) to (3) already exist in destination host check in vm migration. (4) is a new check in vm migration.

=======UPDATE 2021-09-01===========

We have decided to only add capacity check. the other checks are not added.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 932

@weizhouapache
Copy link
Copy Markdown
Member Author

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@weizhouapache a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@yadvr yadvr added this to the 4.16.0.0 milestone Aug 19, 2021
@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-1721)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 34502 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5339-t1721-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_snapshots.py
Smoke tests completed. 89 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

Copy link
Copy Markdown
Contributor

@nvazquez nvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @weizhouapache looks good. What about host tags checks against the offering? cc. @andrijapanicsb

@nvazquez nvazquez closed this Aug 20, 2021
@nvazquez nvazquez reopened this Aug 20, 2021
Comment thread server/src/main/java/com/cloud/vm/UserVmManagerImpl.java Outdated
Comment thread server/src/main/java/com/cloud/vm/UserVmManagerImpl.java Outdated
Copy link
Copy Markdown
Contributor

@sureshanaparti sureshanaparti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm, minor changes to code format required (check suggestions)

@weizhouapache
Copy link
Copy Markdown
Member Author

@nvazquez
actually vm cannot be started or migrated if there is no enough server capacity in 4.15 and master.

the process to deploy/start a vm without destination host is
(1) check server capacity of all hosts
(2) select a host
(3) double check if host has enough capacity (because step 1 might take long time if there are many hosts). this step was added by #3728

if destination host is specified, step 1 and 2 are skipped for now. but there is still a new check in step 3.
I am not sure if necessary to add an extra step to check server capacity (it takes few milliseconds) in this PR.

current error

  1. allow.deploy.vm.if.deploy.on.given.host.fails=false
    Screenshot from 2021-08-23 10-05-23

  2. allow.deploy.vm.if.deploy.on.given.host.fails=true
    Screenshot from 2021-08-23 10-09-46

@weizhouapache
Copy link
Copy Markdown
Member Author

clgtm, minor changes to code format required (check suggestions)

thanks @sureshanaparti , changes have been committed.

deployOnGivenHost = true;
}
} catch (InvalidParameterValueException ex) {
s_logger.error("Cannot deploy the VM to specified host due to: " + ex.getMessage());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could pass the exception to the log.

if (!AllowDeployVmIfGivenHostFails.value()) {
deployOnGivenHost = true;
try {
checkVmDestinationServerCapacity(vm, destinationHost);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be simpler to use _capacityMgr.checkIfHostHasCpuCapabilityAndCapacity ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidjumani I will change it.thanks

@weizhouapache
Copy link
Copy Markdown
Member Author

@nvazquez @davidjumani updated this pr.

allow.deploy.vm.if.deploy.on.given.host.fails=false
Screenshot from 2021-08-24 10-04-26

allow.deploy.vm.if.deploy.on.given.host.fails=true
Screenshot from 2021-08-24 10-05-16

@weizhouapache weizhouapache changed the title server: check server capacity when start/deploy or migrate a vm server: check server capacity when start/deploy migrate a vm Aug 25, 2021
@weizhouapache weizhouapache changed the title server: check server capacity when start/deploy migrate a vm server: check server capacity when start/deploy a vm Aug 25, 2021
@weizhouapache weizhouapache reopened this Aug 25, 2021
@weizhouapache weizhouapache reopened this Aug 26, 2021
@yadvr
Copy link
Copy Markdown
Member

yadvr commented Aug 27, 2021

@weizhouapache does this affect 4.15 too? If so, pl raise PR against 4.15 branch.

@weizhouapache weizhouapache force-pushed the 4.16-check-server-capacity-start-or-migrate-vm branch from 0e47334 to de39782 Compare August 27, 2021 11:14
@weizhouapache weizhouapache changed the base branch from main to 4.15 August 27, 2021 11:18
@weizhouapache
Copy link
Copy Markdown
Member Author

@weizhouapache does this affect 4.15 too? If so, pl raise PR against 4.15 branch.

@rhtyd @nvazquez changed base branch to 4.15

@nvazquez
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian. SL-JID 1065

@nvazquez
Copy link
Copy Markdown
Contributor

@blueorangutan test ubuntu18 kvm-ubuntu18

@apache apache deleted a comment from blueorangutan Aug 31, 2021
@apache apache deleted a comment from blueorangutan Aug 31, 2021
@apache apache deleted a comment from blueorangutan Aug 31, 2021
@apache apache deleted a comment from blueorangutan Aug 31, 2021
@blueorangutan
Copy link
Copy Markdown

@nvazquez a Trillian-Jenkins test job (ubuntu18 mgmt + kvm-ubuntu18) has been kicked to run smoke tests

@yadvr yadvr modified the milestones: 4.16.0.0, 4.15.2.0 Aug 31, 2021
@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-1845)
Environment: kvm-ubuntu18 (x2), Advanced Networking with Mgmt server u18
Total time taken: 47260 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5339-t1845-kvm-ubuntu18.zip
Intermittent failure detected: /marvin/tests/smoke/test_accounts.py
Intermittent failure detected: /marvin/tests/smoke/test_deploy_vm_iso.py
Intermittent failure detected: /marvin/tests/smoke/test_domain_vpc_offerings.py
Intermittent failure detected: /marvin/tests/smoke/test_iso.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 83 look OK, 4 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestAccounts>:setup Error 0.00 test_accounts.py
ContextSuite context=TestAddVmToSubDomain>:setup Error 0.00 test_accounts.py
test_DeleteDomain Error 0.89 test_accounts.py
test_forceDeleteDomain Failure 0.88 test_accounts.py
test_forceDeleteDomain Error 3.10 test_accounts.py
ContextSuite context=TestRemoveUserFromAccount>:setup Error 5.39 test_accounts.py
ContextSuite context=TestTemplateHierarchy>:teardown Error 425.51 test_accounts.py
test_03_1_create_iso_with_checksum_md5_negative Error 999.88 test_iso.py
test_03_create_iso_with_checksum_md5 Error 65.47 test_iso.py
test_01_migrate_VM_and_root_volume Error 72.08 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 51.01 test_vm_life_cycle.py
test_01_secure_vm_migration Error 112.08 test_vm_life_cycle.py
test_02_unsecure_vm_migration Error 170.18 test_vm_life_cycle.py
test_08_migrate_vm Error 48.24 test_vm_life_cycle.py
test_hostha_kvm_host_degraded Error 681.86 test_hostha_kvm.py
test_hostha_kvm_host_fencing Error 666.57 test_hostha_kvm.py

@nvazquez
Copy link
Copy Markdown
Contributor

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-1869)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 52093 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5339-t1869-kvm-centos7.zip
Smoke tests completed. 88 look OK, 3 have errors
Only failed tests results shown below:

Test Result Time (s) Test File
test_deploy_vm_start_failure Error 63.18 test_deploy_vm.py
test_deploy_vm_volume_creation_failure Error 66.22 test_deploy_vm.py
test_vm_ha Error 60.51 test_vm_ha.py
test_vm_sync Error 125.86 test_vm_sync.py

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Sep 2, 2021

Lgtm, failures not related to this PR.
@weizhouapache @nvazquez @borisstoyanov @vladimirpetrov anybody testing this?

@nvazquez
Copy link
Copy Markdown
Contributor

nvazquez commented Sep 2, 2021

Kicking a new round of tests while is being tested
@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Copy link
Copy Markdown
Contributor

@vladimirpetrov vladimirpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on manual testing. Tested scenarios:

(1) cpu speed must be smaller than or equals to cpu mhz of physical host (this does not consider over provisioning)
(2) VM with more cores than available on the host and enough CPU speed and memory
(3) host must have enough cpu resource (vm cpu*speed)
(4) host must have enough memory

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-1943)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 34076 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5339-t1943-kvm-centos7.zip
Smoke tests completed. 87 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

@nvazquez nvazquez merged commit cf6dc66 into apache:4.15 Sep 3, 2021
@weizhouapache weizhouapache deleted the 4.16-check-server-capacity-start-or-migrate-vm branch December 9, 2022 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants