Randomize managed volume copy host #5789

Merged
yadvr merged 3 commits into apache:4.16 from mlsorensen:merge-shuffle-volume-migration
Dec 30, 2021

Conversation

@mlsorensen
Contributor

  • Managed volume copy was always returning first host that could see storage pools

  • Fix null value in logging for ScaleIOPrimaryDataStoreDriver due to if/else logic error

Signed-off-by: Marcus Sorensen <mls@apple.com>

Description

This PR fixes #5788

It shuffles the list of valid host candidates that are used for processing volume migrations.
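As a rough illustration of the shuffle, here is a minimal, self-contained sketch. `HostPicker`, `pickHost`, and the `usable` predicate are hypothetical names used only for this example; they are not CloudStack APIs, and the loop after the shuffle is an assumption about how candidates are then checked. Unlike the one-line change in this PR, the sketch shuffles a copy rather than the caller's list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class HostPicker {
    // Hypothetical helper: visits candidates in random order and returns the
    // first one that passes the check, or null if the list is empty or none passes.
    static Long pickHost(List<Long> hostIds, Predicate<Long> usable) {
        if (hostIds.isEmpty()) {
            return null;
        }
        // Shuffle a copy so the caller's list keeps its order (assumption of this sketch).
        List<Long> candidates = new ArrayList<>(hostIds);
        Collections.shuffle(candidates);
        for (Long id : candidates) {
            if (usable.test(id)) {
                return id;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Long> hosts = List.of(1L, 2L, 3L, 4L);
        // Even ids stand in for hosts that are up, enabled, and can see the pools.
        System.out.println(pickHost(hosts, id -> id % 2 == 0)); // prints 2 or 4
    }
}
```

Without the shuffle, the same call would always return host 2, which is the behaviour the linked issue describes.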

It also fixes a logic error in debug logging that caused 'null' to be printed. The original code concatenated a static string with a potentially null value and then tested whether the result was null, which could never be true.
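The logging problem comes down to Java operator precedence, and a simplified stand-in may make it easier to see. This sketch uses a nullable `Long` in place of the host object; `NullLogDemo`, `buggy`, and `fixed` are illustrative names only, and the "not specified" wording follows what was agreed later in this conversation.

```java
final class NullLogDemo {
    // Buggy shape: '+' binds tighter than '!=', so the condition is
    // ("on host " + hostId) != null, which is always true. The prefix is
    // consumed by the condition and the bare value is returned.
    static Object buggy(Long hostId) {
        return "on host " + hostId != null ? hostId : "";
    }

    // Fixed shape: test the value itself, then format the whole message.
    static String fixed(Long hostId) {
        return String.format("on host %s", hostId != null ? hostId : "not specified");
    }

    public static void main(String[] args) {
        System.out.println(buggy(null)); // prints "null"
        System.out.println(buggy(7L));   // prints "7" (the prefix is lost)
        System.out.println(fixed(null)); // prints "on host not specified"
        System.out.println(fixed(7L));   // prints "on host 7"
    }
}
```

Wrapping the ternary in parentheses would fix the precedence as well; `String.format` simply keeps the message and the null check visibly separate.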

A more ideal solution might be to track the jobs on each host and attempt to balance the load according to how busy each host is. This was investigated, but there's currently no obvious way to look up job assignments to hosts via the DB: the Commands sent to Agents are kept live in memory via the AgentAttache. The ideal solution would be a much larger feature to coordinate Command assignments across the management server cluster, and would possibly need improved agent comms to provide live status of Command progress.

I additionally spent some time looking into how to isolate and unit test findUpAndEnabledHostWithAccessToStoragePools(). Currently it would require mocking three Daos, the DataStoreProviderManager, a DataStoreProvider, and a DataStoreDriver. It's possible to write a test for this; however, I was concerned that the unit test would be coding the implementation of three methods into the test. These mocked Daos exist up to three methods deep, so if the implementation of any of these methods changes, the test needs to be rewritten to code in the new implementation. I wasn't sure this was the right thing to do just to add a simple one-line shuffle, but I'm happy to go that route if needed.
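For reference, the shuffle on its own can be exercised without any mocks if the source of randomness is passed in. This is only a sketch of that idea, not something this PR does: `SeededShuffleDemo` and `shuffledCopy` are hypothetical names, and it assumes the ordering step could be separated from the Dao lookups.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SeededShuffleDemo {
    // Hypothetical seam: the caller supplies the Random, so a test can fix the seed.
    static List<Long> shuffledCopy(List<Long> hostIds, Random random) {
        List<Long> copy = new ArrayList<>(hostIds);
        Collections.shuffle(copy, random);
        return copy;
    }

    public static void main(String[] args) {
        List<Long> hosts = List.of(10L, 11L, 12L, 13L, 14L);
        List<Long> first = shuffledCopy(hosts, new Random(42L));
        List<Long> second = shuffledCopy(hosts, new Random(42L));
        // The same seed gives the same order, so a test can be deterministic.
        System.out.println(first.equals(second)); // prints "true"
        // The shuffle only reorders: every candidate is still present.
        System.out.println(first.containsAll(hosts) && first.size() == hosts.size()); // prints "true"
    }
}
```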

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

Have run this code against a live environment and tested volume migration.

@sureshanaparti sureshanaparti added this to the 4.16.1.0 milestone Dec 20, 2021
Contributor

@sureshanaparti sureshanaparti left a comment

thanks for the PR @mlsorensen , code LGTM

@sureshanaparti sureshanaparti changed the base branch from main to 4.16 December 20, 2021 05:24
@sureshanaparti sureshanaparti changed the base branch from 4.16 to main December 20, 2021 05:25
@sureshanaparti
Contributor

Hi @mlsorensen I think, this is good to go in 4.16.1. Can you rebase with 4.16 branch. Thanks.

@sureshanaparti
Contributor

@blueorangutan package

@blueorangutan

@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1956

@apache apache deleted a comment from blueorangutan Dec 20, 2021
@apache apache deleted a comment from blueorangutan Dec 20, 2021
@mlsorensen mlsorensen force-pushed the merge-shuffle-volume-migration branch from 9b73c17 to 53e5c41 Compare December 20, 2021 16:15
* Managed volume copy was always returning first host that could see storage pools

* Fix null value in logging for ScaleIOPrimaryDataStoreDriver due to if/else logic error

Signed-off-by: Marcus Sorensen <mls@apple.com>
@mlsorensen mlsorensen force-pushed the merge-shuffle-volume-migration branch from 53e5c41 to 1d1e128 Compare December 20, 2021 16:17
@mlsorensen mlsorensen changed the base branch from main to 4.16 December 20, 2021 16:18
@mlsorensen
Contributor Author

> Hi @mlsorensen I think, this is good to go in 4.16.1. Can you rebase with 4.16 branch. Thanks.

Done. Do I need a separate PR for main or will this get into main?

@sureshanaparti
Contributor

> > Hi @mlsorensen I think, this is good to go in 4.16.1. Can you rebase with 4.16 branch. Thanks.
>
> Done. Do I need a separate PR for main or will this get into main?

Thanks @mlsorensen, no need for a separate PR for main; this will be forward-merged to main.

@sureshanaparti
Contributor

@blueorangutan package

@sureshanaparti
Contributor

@blueorangutan package

@blueorangutan

@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖️ el7 ✔️ el8 ✖️ debian ✖️ suse15. SL-JID 1971

@shwstppr
Contributor

@blueorangutan package

@blueorangutan

@shwstppr a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1973

@sureshanaparti
Contributor

@blueorangutan test

@blueorangutan

@sureshanaparti a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Comment on lines +632 to +633
String debugMessage = "Initiating copy from PowerFlex template volume on host ";
LOGGER.debug(destHost != null ? debugMessage + destHost.getId() : debugMessage);
Contributor

We could use String.format here:

Suggested change
- String debugMessage = "Initiating copy from PowerFlex template volume on host ";
- LOGGER.debug(destHost != null ? debugMessage + destHost.getId() : debugMessage);
+ LOGGER.debug(String.format("Initiating copy from PowerFlex template volume on host %s", destHost != null ? destHost.getId() : ""));

Contributor Author

Good call!

Comment on lines +652 to +653
String debugMessage = "Initiating copy from PowerFlex volume on host ";
LOGGER.debug(destHost != null ? debugMessage + destHost.getId() : debugMessage);
Contributor

We also could use String.format here.

if (hostIds.isEmpty()) {
return null;
}
Collections.shuffle(hostIds);
Contributor

I think we could add some log to this method, about which host was selected (or not).

Contributor Author

There's logging elsewhere, where the returned value is used.

Member

I might be wrong on this one, but I don't think there is much logging about which host is chosen for the migration.
A quick check of the call hierarchy below did not turn up any such logs. Going by your comment, the host would be logged further along the call hierarchy (I haven't checked all the way). Even if it is, many exceptions can be thrown shortly after the host ID is retrieved.

> StorageManagerImpl.findUpAndEnabledHostWithAccessToStoragePools(List<Long>)  (com.cloud.storage)
    > VolumeServiceImpl.copyManagedVolume(VolumeInfo, DataStore)  (org.apache.cloudstack.storage.volume)
        > VolumeServiceImpl.copyVolume(VolumeInfo, DataStore)  (org.apache.cloudstack.storage.volume)
            > VolumeOrchestrator.copyVolumeFromSecToPrimary(VolumeInfo, VirtualMachine, VirtualMachineTemplate, DataCenter, Pod, Long, ServiceOffering, ...)  (org.apache.cloudstack.engine.orchestration)
            ...
            > VolumeOrchestrator.migrateVolume(Volume, StoragePool)  (org.apache.cloudstack.engine.orchestration)
            ...
            > VolumeApiServiceImpl.orchestrateExtractVolume(long, long)  (com.cloud.storage)
            ...

Additionally, canHostAccessStoragePools, which is called inside findUpAndEnabledHostWithAccessToStoragePools, can return false. When a host cannot access the storage pools, not much is logged either.

Please let us know if we are missing something here.

Member

@yadvr yadvr Dec 30, 2021

The randomised list itself is not logged, but the selected host is logged by the deployment planner, or during provisioning or the volume operation.

@blueorangutan

Trillian test result (tid-2698)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 31743 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5789-t2698-kvm-centos7.zip
Smoke tests completed. 91 look OK, 0 have errors
Only failed tests results shown below:

Test | Result | Time (s) | Test File

Signed-off-by: Marcus Sorensen <mls@apple.com>
@sureshanaparti
Contributor

@blueorangutan package

@blueorangutan

@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1985

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm (non-blocking suggestion added for the log message)

  private Answer copyTemplateToVolume(DataObject srcData, DataObject destData, Host destHost) {
  // Copy PowerFlex/ScaleIO template to volume
- LOGGER.debug("Initiating copy from PowerFlex template volume on host " + destHost != null ? destHost.getId() : "");
+ LOGGER.debug(String.format("Initiating copy from PowerFlex template volume on host %s", destHost != null ? destHost.getId() : ""));
Contributor

Suggested change
- LOGGER.debug(String.format("Initiating copy from PowerFlex template volume on host %s", destHost != null ? destHost.getId() : ""));
+ LOGGER.debug(String.format("Initiating copy from PowerFlex template volume on host %s", destHost != null ? destHost.getId() : "<unknown>"));

Contributor Author

Sure, why not?

Contributor Author

Implementing this. Going to use "not specified" @DaanHoogland since that sounds less like an error. There is code later to resolve a destination host if this is not specified.

  private Answer copyVolume(DataObject srcData, DataObject destData, Host destHost) {
  // Copy PowerFlex/ScaleIO volume
- LOGGER.debug("Initiating copy from PowerFlex volume on host " + destHost != null ? destHost.getId() : "");
+ LOGGER.debug(String.format("Initiating copy from PowerFlex template volume on host %s", destHost != null ? destHost.getId() : ""));
Contributor

same here

Signed-off-by: Marcus Sorensen <mls@apple.com>
@yadvr
Member

yadvr commented Dec 23, 2021

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖️ el7 ✔️ el8 ✖️ debian ✔️ suse15. SL-JID 2003

@sureshanaparti
Contributor

@blueorangutan package

@blueorangutan

@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

Member

@weizhouapache weizhouapache left a comment

code LGTM !

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2007

@apache apache deleted a comment from blueorangutan Dec 24, 2021
@apache apache deleted a comment from blueorangutan Dec 24, 2021
@apache apache deleted a comment from blueorangutan Dec 24, 2021
@sureshanaparti
Contributor

@blueorangutan test keepEnv

@blueorangutan

@sureshanaparti a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-2738)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30871 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5789-t2738-kvm-centos7.zip
Smoke tests completed. 91 look OK, 0 have errors
Only failed tests results shown below:

Test | Result | Time (s) | Test File

Member

@GabrielBrascher GabrielBrascher left a comment

Thanks for the PR, @mlsorensen.
Overall looks good; I just added a comment related to @GutoVeronezi's log suggestion.

I sometimes struggle to debug CloudStack issues due to the lack of information in the logs, and this flow could be one of those cases. I am not sure, though, so I would need to double-check this one.

I appreciate any feedback regarding it.

Member

@GabrielBrascher GabrielBrascher left a comment

Code LGTM

Labels

None yet

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.

Parallel volume migrations on KVM are processed on same host

9 participants