Randomize managed volume copy host #5789
sureshanaparti
left a comment
thanks for the PR @mlsorensen , code LGTM
|
Hi @mlsorensen, I think this is good to go in 4.16.1. Can you rebase onto the 4.16 branch? Thanks. |
|
@blueorangutan package |
|
@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1956 |
9b73c17 to 53e5c41
* Managed volume copy was always returning first host that could see storage pools
* Fix null value in logging for ScaleIOPrimaryDataStoreDriver due to if/else logic error
Signed-off-by: Marcus Sorensen <mls@apple.com>
53e5c41 to 1d1e128
Done. Do I need a separate PR for main or will this get into main? |
Thanks @mlsorensen, no need for a separate PR for main; this will be forward-merged to main. |
|
@blueorangutan package |
|
@blueorangutan package |
|
@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖️ el7 ✔️ el8 ✖️ debian ✖️ suse15. SL-JID 1971 |
|
@blueorangutan package |
|
@shwstppr a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1973 |
|
@blueorangutan test |
|
@sureshanaparti a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
String debugMessage = "Initiating copy from PowerFlex template volume on host ";
LOGGER.debug(destHost != null ? debugMessage + destHost.getId() : debugMessage);
We could use String.format here:
- String debugMessage = "Initiating copy from PowerFlex template volume on host ";
- LOGGER.debug(destHost != null ? debugMessage + destHost.getId() : debugMessage);
+ LOGGER.debug(String.format("Initiating copy from PowerFlex template volume on host %s", destHost != null ? destHost.getId() : ""));
String debugMessage = "Initiating copy from PowerFlex volume on host ";
LOGGER.debug(destHost != null ? debugMessage + destHost.getId() : debugMessage);
We could also use String.format here.
if (hostIds.isEmpty()) {
    return null;
}
Collections.shuffle(hostIds);
I think we could add some logging to this method about which host was selected (or that none was).
There's logging elsewhere, where the returned value is used.
I might be wrong on this one, but I don't think there are many logs about the host chosen for the migration.
A quick check of the call hierarchy turned up no logs. Going by your comment, this would be logged later on in the call hierarchy (I haven't checked all the way). Even if it is, many exceptions can be thrown shortly after the host ID is retrieved.
> StorageManagerImpl.findUpAndEnabledHostWithAccessToStoragePools(List<Long>) (com.cloud.storage)
> VolumeServiceImpl.copyManagedVolume(VolumeInfo, DataStore) (org.apache.cloudstack.storage.volume)
> VolumeServiceImpl.copyVolume(VolumeInfo, DataStore) (org.apache.cloudstack.storage.volume)
> VolumeOrchestrator.copyVolumeFromSecToPrimary(VolumeInfo, VirtualMachine, VirtualMachineTemplate, DataCenter, Pod, Long, ServiceOffering, ...) (org.apache.cloudstack.engine.orchestration)
...
> VolumeOrchestrator.migrateVolume(Volume, StoragePool) (org.apache.cloudstack.engine.orchestration)
...
> VolumeApiServiceImpl.orchestrateExtractVolume(long, long) (com.cloud.storage)
...
Additionally, canHostAccessStoragePools, which is called inside findUpAndEnabledHostWithAccessToStoragePools, can return false. If the host cannot access the pools, not much is logged either.
Please let us know if we are missing something here.
The randomised list itself is not logged, but the selected host is logged by the deployment planner, or during provisioning or the volume operation.
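For illustration, the kind of log line suggested in this thread might look like the sketch below. It is not code from the PR: the class and method names (HostSelectionLoggingSketch, selectionMessage, logAndReturn) are hypothetical, the message wording is made up, and java.util.logging stands in for the project's own LOGGER only so the example is self-contained.

```java
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch, not CloudStack code.
final class HostSelectionLoggingSketch {
    private static final Logger LOGGER =
            Logger.getLogger(HostSelectionLoggingSketch.class.getName());

    // Builds one message for both outcomes: a host was selected, or none was found.
    static String selectionMessage(Long selectedHostId, List<Long> poolIds) {
        return selectedHostId != null
                ? String.format("Selected host %d with access to storage pools %s", selectedHostId, poolIds)
                : String.format("No up and enabled host found with access to storage pools %s", poolIds);
    }

    // Logs the outcome at a debug-like level and passes the host ID through unchanged.
    static Long logAndReturn(Long selectedHostId, List<Long> poolIds) {
        LOGGER.fine(selectionMessage(selectedHostId, poolIds));
        return selectedHostId;
    }
}
```

Logging at the point of selection would cover the case raised above, where an exception is thrown before any later code gets to log the host.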
|
Trillian test result (tid-2698)
|
Signed-off-by: Marcus Sorensen <mls@apple.com>
|
@blueorangutan package |
|
@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1985 |
DaanHoogland
left a comment
clgtm (non-blocking suggestion added for the log message)
  private Answer copyTemplateToVolume(DataObject srcData, DataObject destData, Host destHost) {
      // Copy PowerFlex/ScaleIO template to volume
-     LOGGER.debug("Initiating copy from PowerFlex template volume on host " + destHost != null ? destHost.getId() : "");
+     LOGGER.debug(String.format("Initiating copy from PowerFlex template volume on host %s", destHost != null ? destHost.getId() : ""));
- LOGGER.debug(String.format("Initiating copy from PowerFlex template volume on host %s", destHost != null ? destHost.getId() : ""));
+ LOGGER.debug(String.format("Initiating copy from PowerFlex template volume on host %s", destHost != null ? destHost.getId() : "<unknown>"));
Implementing this. Going to use "not specified" @DaanHoogland since that sounds less like an error. There is code later to resolve a destination host if this is not specified.
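As a rough sketch of where this lands, the message could be built as below. The class and helper names (CopyLogMessageSketch, describeCopyTarget) and the nullable Long parameter are assumptions made to keep the example self-contained; the real method takes a Host and calls destHost.getId().

```java
// Hypothetical sketch, not the ScaleIOPrimaryDataStoreDriver code.
final class CopyLogMessageSketch {

    // Formats the debug message, falling back to "not specified" when no
    // destination host was passed in.
    static String describeCopyTarget(Long destHostId) {
        Object hostLabel = destHostId != null ? destHostId : "not specified";
        return String.format("Initiating copy from PowerFlex template volume on host %s", hostLabel);
    }
}
```

Because the ternary is evaluated as its own argument to String.format, the fixed prefix is always part of the message, whichever branch is taken.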
  private Answer copyVolume(DataObject srcData, DataObject destData, Host destHost) {
      // Copy PowerFlex/ScaleIO volume
-     LOGGER.debug("Initiating copy from PowerFlex volume on host " + destHost != null ? destHost.getId() : "");
+     LOGGER.debug(String.format("Initiating copy from PowerFlex volume on host %s", destHost != null ? destHost.getId() : ""));
Signed-off-by: Marcus Sorensen <mls@apple.com>
|
@blueorangutan package |
|
@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖️ el7 ✔️ el8 ✖️ debian ✔️ suse15. SL-JID 2003 |
|
@blueorangutan package |
|
@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2007 |
|
@blueorangutan test keepEnv |
|
@sureshanaparti a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
Trillian test result (tid-2738)
|
GabrielBrascher
left a comment
Thanks for the PR, @mlsorensen.
Overall it looks good; I just added a comment related to @GutoVeronezi's log suggestion.
I sometimes struggle to debug CloudStack issues because the logs lack information, and this flow could be one of those cases. I am not sure, though, so I would need to double-check this one.
I would appreciate any feedback on it.
Managed volume copy was always returning first host that could see storage pools
Fix null value in logging for ScaleIOPrimaryDataStoreDriver due to if/else logic error
Signed-off-by: Marcus Sorensen <mls@apple.com>
Description
This PR fixes #5788
It shuffles the list of valid host candidates that are used for processing volume migrations.
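A minimal sketch of the idea, assuming the candidate IDs are shuffled and the first one that passes an access check is returned. The class, the pickHost method, and the canAccess predicate are hypothetical stand-ins (the predicate plays the role of canHostAccessStoragePools), and the Random parameter is only there to make the sketch repeatable. Unlike the PR, which shuffles the list in place, the sketch shuffles a copy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Hypothetical sketch, not the CloudStack implementation.
final class HostSelectionSketch {

    // Returns the first host ID, in shuffled order, that passes the access
    // check, or null when there is no such host.
    static Long pickHost(List<Long> candidateHostIds, Predicate<Long> canAccess, Random random) {
        if (candidateHostIds.isEmpty()) {
            return null;
        }
        // Shuffle a copy so the caller's list keeps its original order.
        List<Long> shuffled = new ArrayList<>(candidateHostIds);
        Collections.shuffle(shuffled, random);
        for (Long hostId : shuffled) {
            if (canAccess.test(hostId)) {
                return hostId;
            }
        }
        return null;
    }
}
```

Without the shuffle, the loop always returns the first eligible host in list order, which is the behaviour this PR changes.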
It also fixes a logic error in debug logging that caused 'null' to be printed. The original code concatenated a static string with a potentially null value and then tested whether the result was null, which could never be true.
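The underlying cause is operator precedence: in Java, + binds tighter than !=, which in turn binds tighter than ?:, so the concatenation happens before the null check. The sketch below reproduces the shape of the bug in simplified form. The class and method names are hypothetical, and a nullable Long stands in for the Host object, so the details differ from the driver code.

```java
// Hypothetical sketch of the precedence problem, not the driver code.
final class TernaryPrecedenceSketch {

    // Parsed as ("Initiating copy on host " + hostId) != null ? ... : "".
    // The concatenated string is never null, so the first branch is always
    // taken and the prefix is dropped from the result.
    static String buggy(Long hostId) {
        return "Initiating copy on host " + hostId != null ? String.valueOf(hostId) : "";
    }

    // Parenthesising the ternary applies the null check to hostId alone.
    static String fixed(Long hostId) {
        return "Initiating copy on host " + (hostId != null ? String.valueOf(hostId) : "");
    }
}
```

With a null argument, buggy returns the literal text "null", while fixed returns the prefix followed by an empty host label.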
A more ideal solution might be to track the jobs on each host and balance the load according to how busy each host is. This was investigated, and there is currently no obvious way to look up job assignments to hosts via the DB; the Commands sent to Agents are kept live in memory via the AgentAttache. The ideal solution would be a much larger feature to coordinate Command assignments across the management server cluster, and it might also need improved agent comms to provide live status of Command progress.
I additionally spent some time looking into how to isolate and unit test findUpAndEnabledHostWithAccessToStoragePools(). Currently it would require mocking three Daos, the DataStoreProviderManager, a DataStoreProvider, and a DataStoreDriver. It's possible to write a test for this; however, I was concerned that the unit test would be coding the implementation of three methods into the test. These mocked Daos exist up to three methods deep, so if the implementation of any of these methods changes, the test needs to be rewritten to code in the new implementation. I wasn't sure this was the right thing to do just to add a simple one-line shuffle, but I'm happy to go that route if needed.
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
How Has This Been Tested?
Have run this code against a live environment and tested volume migration.