Suspending the VM prior to deleting snapshots to avoid corruption, th…#4032
Conversation
|
@blueorangutan package |
|
@andrijapanicsb a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
@weizhouapache does this one makes more sense/checks than the #4033 - i.e. shall we work on this one? |
DaanHoogland
left a comment
There was a problem hiding this comment.
we should only start/resume a VM if it was running before we suspended it, not if it was stopped or paused by some other operation.
I suggest we add a var with oldState and check if oldState.equals(DomainInfo.DomainState.VIR_DOMAIN_RUNNING) before the resume statements.
|
Packaging result: ✔centos7 ✔debian. JID-1172 |
|
@ggoodrich-ipp my suggestion there ^^ won't catch all possible conflicts, and won't give us a full mutex, but for the case we are handling now it should do. (cc @weizhouapache @rhtyd again) |
| snapshot = dm.snapshotLookupByName(cmd.getTarget().getSnapshotName()); | ||
|
|
||
| // Suspend the VM prior to deleting the snapshot to guard against corruption | ||
| dm.suspend(); |
There was a problem hiding this comment.
Should we check if VM is running? (i.e. not in any other states)
There was a problem hiding this comment.
does this mean you agree with my statement above, @rhtyd ?
the try catch won't work, it will succeed leaving a corrupt disk image (according to @andrijapanicsb 's tests)
| if (dm != null) { | ||
| // Make sure if the VM is paused, then resume it, in case we got an exception during our delete() and didn't have the chance before | ||
| try { | ||
| dm = libvirtComputingResource.getDomain(conn, vmName); |
There was a problem hiding this comment.
dm is already not null, why overwrite it again?
There was a problem hiding this comment.
I was under the impression that dm needed to be refreshed after a state change, as I saw this same thing being done in KVMStorageProcessor -> backupSnapshot(). Rather than validate that it was necessary or not, I decided that safer is always better, as refreshing the domain variable won't cause harm.
yadvr
left a comment
There was a problem hiding this comment.
Left some remarks, needs testing/verification. LGTM otherwise.
@weizhouapache @DaanHoogland can you comment as well?
|
as we are in release phase already and #4033 is simpler let's go with that. the logic is essentially the same. |
|
@ggoodrich-ipp can you fix the conflicts, thanks. |
Conflicts resolved. |
|
@blueorangutan package |
|
@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖centos7 ✖debian. JID-1353 |
|
@ggoodrich-ipp could you check if |
That seems questionable, due to the potential for corruption, in this case. I wouldn't think we would want an option that could lead to corruption. |
| // Make sure if the VM is paused, then resume it, in case we got an exception during our delete() and didn't have the chance before | ||
| try { | ||
| dm = libvirtComputingResource.getDomain(conn, vmName); | ||
| state = dm.getInfo().state; | ||
| if (state == DomainInfo.DomainState.VIR_DOMAIN_PAUSED) { | ||
| dm.resume(); | ||
| } | ||
| } catch (LibvirtException e) { | ||
| s_logger.error("Failed to resume vm after delete snapshot " + cmd.getTarget().getSnapshotName() + " on vm: " + vmName + " return true : " + e); | ||
| } | ||
|
|
There was a problem hiding this comment.
can you extract this in a separate method please?
| didDelete = true; | ||
|
|
||
| // Resume the VM | ||
| dm = libvirtComputingResource.getDomain(conn, vmName); | ||
| state = dm.getInfo().state; | ||
| if (state == DomainInfo.DomainState.VIR_DOMAIN_PAUSED) { | ||
| dm.resume(); | ||
| } |
There was a problem hiding this comment.
resume action leads to an implicit assumption which must be guarded during maintenance by future generations: failing resume is concluded by examining a var named didDelete. dangerous assumption even with the current logic computing. see below.
| } else if (didDelete) { | ||
| s_logger.error("Failed to resume vm after delete snapshot " + cmd.getTarget().getSnapshotName() + " on vm: " + vmName + " return true : " + e); | ||
| return new DeleteVMSnapshotAnswer(cmd, cmd.getVolumeTOs()); |
There was a problem hiding this comment.
here a var didDelete leads us to conclude that the current exception thrown is caused by dm.resume(). this boolean should be named some thing like tryingResume instead. better would be to investigate the nature of the exception.
| } | ||
| } catch (final Exception ex) { | ||
| s_logger.debug("Failed to delete snapshots on primary", ex); | ||
| s_logger.error("Failed to delete snapshots on primary", ex); |
|
I see @DaanHoogland re-targeted this to 4.15, which must have resolved the conflicts. Are there any other issues with this PR that need to be addressed prior to approval, @weizhouapache @shwstppr ? Let me know, and I'll try to accommodate. Thanks! |
|
travis is reporting a checkstyle error but i cannot find it closing |
|
re-opening to re-kick travis |
|
@ggoodrich-ipp could you please fix the build error ? |
|
@blueorangutan package |
|
@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖centos7 ✖centos8 ✖debian. JID-2745 |
|
@blueorangutan package |
|
@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖centos7 ✖centos8 ✖debian. JID-2759 |
|
@blueorangutan package |
|
@shwstppr a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖centos7 ✖centos8 ✖debian. JID-2780 |
|
Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2783 |
|
@blueorangutan test |
|
@shwstppr a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
Trillian test result (tid-3594)
|
shwstppr
left a comment
There was a problem hiding this comment.
LGTM on code changes and test results. May need manual testing
|
@weizhouapache have you done any manual testing on this? |
|
Did another round of code review, LGTM |
…en resuming
Description
These changes are related to PR #3194, but include suspending/resuming the VM when doing a VM snapshot as well, when deleting a VM snapshot, as it is performing the same operations via Libvirt. Also, there was an issue with the UI/localization changes in the prior PR, as that PR was altering the Volume snapshot behavior, but was altering the VM snapshot wording. Both have been altered in this PR.
Issuing this in response to the work happening in PR #4029.
Types of changes
Screenshots (if appropriate):
How Has This Been Tested?