Skip to content

Suspending the VM prior to deleting snapshots to avoid corruption, th…#4032

Merged
yadvr merged 5 commits into
apache:4.15from
ippathways:gjg-snapshot-pause-vm
Feb 25, 2021
Merged

Suspending the VM prior to deleting snapshots to avoid corruption, th…#4032
yadvr merged 5 commits into
apache:4.15from
ippathways:gjg-snapshot-pause-vm

Conversation

@ggoodrich-ipp
Copy link
Copy Markdown
Contributor

@ggoodrich-ipp ggoodrich-ipp commented Apr 15, 2020

…en resuming

Description

These changes are related to PR #3194, but include suspending/resuming the VM when doing a VM snapshot as well, when deleting a VM snapshot, as it is performing the same operations via Libvirt. Also, there was an issue with the UI/localization changes in the prior PR, as that PR was altering the Volume snapshot behavior, but was altering the VM snapshot wording. Both have been altered in this PR.

Issuing this in response to the work happening in PR #4029.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

@andrijapanicsb
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@andrijapanicsb a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@andrijapanicsb
Copy link
Copy Markdown
Contributor

@weizhouapache does this one makes more sense/checks than the #4033 - i.e. shall we work on this one?

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should only start/resume a VM if it was running before we suspended it, not if it was stopped or paused by some other operation.
I suggest we add a var with oldState and check if oldState.equals(DomainInfo.DomainState.VIR_DOMAIN_RUNNING) before the resume statements.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔centos7 ✔debian. JID-1172

@DaanHoogland
Copy link
Copy Markdown
Contributor

@ggoodrich-ipp my suggestion there ^^ won't catch all possible conflicts, and won't give us a full mutex, but for the case we are handling now it should do. (cc @weizhouapache @rhtyd again)

snapshot = dm.snapshotLookupByName(cmd.getTarget().getSnapshotName());

// Suspend the VM prior to deleting the snapshot to guard against corruption
dm.suspend();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check if VM is running? (i.e. not in any other states)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May wrap in a try-with?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean you agree with my statement above, @rhtyd ?
the try catch won't work, it will succeed leaving a corrupt disk image (according to @andrijapanicsb 's tests)

if (dm != null) {
// Make sure if the VM is paused, then resume it, in case we got an exception during our delete() and didn't have the chance before
try {
dm = libvirtComputingResource.getDomain(conn, vmName);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dm is already not null, why overwrite it again?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that dm needed to be refreshed after a state change, as I saw this same thing being done in KVMStorageProcessor -> backupSnapshot(). Rather than validate that it was necessary or not, I decided that safer is always better, as refreshing the domain variable won't cause harm.

Copy link
Copy Markdown
Member

@yadvr yadvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some remarks, needs testing/verification. LGTM otherwise.
@weizhouapache @DaanHoogland can you comment as well?

@yadvr yadvr requested a review from weizhouapache April 16, 2020 11:46
@DaanHoogland
Copy link
Copy Markdown
Contributor

as we are in release phase already and #4033 is simpler let's go with that. the logic is essentially the same.

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Jun 11, 2020

@ggoodrich-ipp can you fix the conflicts, thanks.

@ggoodrich-ipp
Copy link
Copy Markdown
Contributor Author

@ggoodrich-ipp can you fix the conflicts, thanks.

Conflicts resolved.

@yadvr yadvr requested review from DaanHoogland and yadvr June 12, 2020 04:48
@yadvr
Copy link
Copy Markdown
Member

yadvr commented Jun 12, 2020

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✖centos7 ✖debian. JID-1353

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Jun 17, 2020

@ggoodrich-ipp could you check if quiescevm is true in the createSnapshot API? Would that be better than doing it by default?

@ggoodrich-ipp
Copy link
Copy Markdown
Contributor Author

@ggoodrich-ipp could you check if quiescevm is true in the createSnapshot API? Would that be better than doing it by default?

That seems questionable, due to the potential for corruption, in this case. I wouldn't think we would want an option that could lead to corruption.

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor remarks

Comment on lines +120 to +130
// Make sure if the VM is paused, then resume it, in case we got an exception during our delete() and didn't have the chance before
try {
dm = libvirtComputingResource.getDomain(conn, vmName);
state = dm.getInfo().state;
if (state == DomainInfo.DomainState.VIR_DOMAIN_PAUSED) {
dm.resume();
}
} catch (LibvirtException e) {
s_logger.error("Failed to resume vm after delete snapshot " + cmd.getTarget().getSnapshotName() + " on vm: " + vmName + " return true : " + e);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you extract this in a separate method please?

Comment on lines +69 to +76
didDelete = true;

// Resume the VM
dm = libvirtComputingResource.getDomain(conn, vmName);
state = dm.getInfo().state;
if (state == DomainInfo.DomainState.VIR_DOMAIN_PAUSED) {
dm.resume();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resume action leads to an implicit assumption which must be guarded during maintenance by future generations: failing resume is concluded by examining a var named didDelete. dangerous assumption even with the current logic computing. see below.

Comment on lines +111 to +113
} else if (didDelete) {
s_logger.error("Failed to resume vm after delete snapshot " + cmd.getTarget().getSnapshotName() + " on vm: " + vmName + " return true : " + e);
return new DeleteVMSnapshotAnswer(cmd, cmd.getVolumeTOs());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here a var didDelete leads us to conclude that the current exception thrown is caused by dm.resume(). this boolean should be named some thing like tryingResume instead. better would be to investigate the nature of the exception.

}
} catch (final Exception ex) {
s_logger.debug("Failed to delete snapshots on primary", ex);
s_logger.error("Failed to delete snapshots on primary", ex);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@DaanHoogland DaanHoogland added this to the 4.15.1.0 milestone Jan 26, 2021
@ggoodrich-ipp
Copy link
Copy Markdown
Contributor Author

I see @DaanHoogland re-targeted this to 4.15, which must have resolved the conflicts. Are there any other issues with this PR that need to be addressed prior to approval, @weizhouapache @shwstppr ? Let me know, and I'll try to accommodate. Thanks!

@DaanHoogland
Copy link
Copy Markdown
Contributor

travis is reporting a checkstyle error but i cannot find it closing

@DaanHoogland
Copy link
Copy Markdown
Contributor

re-opening to re-kick travis

@DaanHoogland DaanHoogland reopened this Jan 29, 2021
@weizhouapache
Copy link
Copy Markdown
Member

@ggoodrich-ipp could you please fix the build error ?

@yadvr yadvr closed this Feb 19, 2021
@yadvr yadvr reopened this Feb 19, 2021
@yadvr
Copy link
Copy Markdown
Member

yadvr commented Feb 19, 2021

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✖centos7 ✖centos8 ✖debian. JID-2745

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Feb 19, 2021

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✖centos7 ✖centos8 ✖debian. JID-2759

@shwstppr
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@shwstppr a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✖centos7 ✖centos8 ✖debian. JID-2780

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2783

@shwstppr
Copy link
Copy Markdown
Contributor

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@shwstppr a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-3594)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30418 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4032-t3594-kvm-centos7.zip
Smoke tests completed. 86 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

Copy link
Copy Markdown
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on code changes and test results. May need manual testing

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Feb 23, 2021

@weizhouapache have you done any manual testing on this?

@yadvr yadvr closed this Feb 23, 2021
@yadvr yadvr reopened this Feb 23, 2021
@yadvr
Copy link
Copy Markdown
Member

yadvr commented Feb 25, 2021

Did another round of code review, LGTM

@yadvr yadvr merged commit af0f642 into apache:4.15 Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants