
Suspending a VM before snapshot deletion (see PR #3193) #3194

Merged
yadvr merged 2 commits into apache:4.11 from khmarochos:4.11.2.0_careful_snapshot_deletion on Jun 4, 2019

Conversation

@khmarochos
Contributor

Description

To make sure that a qcow2 image won't be corrupted by the snapshot deletion procedure, which is performed after the snapshot has been copied to a secondary store, I'd propose putting the VM into a suspended state.

Fixes: #3193
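
For illustration, the idea is roughly the following. This is a minimal sketch using the libvirt Java bindings (org.libvirt); the connection URI, VM name, and snapshot name are placeholders, not the actual agent code:

import org.libvirt.Connect;
import org.libvirt.Domain;
import org.libvirt.DomainInfo;
import org.libvirt.DomainSnapshot;

public class SuspendedSnapshotDelete {
    public static void main(String[] args) throws Exception {
        Connect conn = new Connect("qemu:///system");      // placeholder URI
        Domain vm = conn.domainLookupByName("i-2-42-VM");  // placeholder VM name

        boolean wasRunning =
                vm.getInfo().state == DomainInfo.DomainState.VIR_DOMAIN_RUNNING;
        if (wasRunning) {
            vm.suspend(); // pause the guest before touching its qcow2 image
        }
        try {
            // Delete the internal snapshot while the guest is paused.
            DomainSnapshot snapshot = vm.snapshotLookupByName("mysnapshot");
            snapshot.delete(0);
        } finally {
            if (wasRunning) {
                vm.resume(); // bring the guest back whatever the outcome
            }
        }
    }
}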

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I tried it in my environment: the VM was suspended, the snapshot was removed, and then the VM's work was resumed.

@ustcweizhou
Contributor

Two questions:
(1) Should the VM be resumed after deletion?
(2) Should the VM be paused when backing up a snapshot?

@khmarochos
Contributor Author

Hello and thank you for your attention. Here are the answers.

(1) Should the VM be resumed after deletion?

The VM gets resumed right after the snapshot deletion procedure:

/*
 * libvirt on RHEL6 doesn't handle resume event emitted from
 * qemu
 */
vm = resource.getDomain(conn, vmName);
state = vm.getInfo().state;
if (state == DomainInfo.DomainState.VIR_DOMAIN_PAUSED) {
    vm.resume();
}

(2) Should the VM be paused when backing up a snapshot?

The VM is paused while the snapshot is being created, but then it resumes its work. The snapshot is then copied to the secondary storage, some time passes, and finally the snapshot is deleted from the image file without suspending the VM, which could be quite a risky maneuver.

@ustcweizhou
Contributor

@Melnik13 thanks for your reply.
As you said, the VM will be suspended while the snapshot is created and then resumed; the snapshot will be backed up to secondary storage after that.
However, I do not think removing a VM snapshot on a running VM will cause the qcow2 to be corrupted.
It would be better if you (and anyone else) could provide more information. For example, do you use recurring snapshots?

@khmarochos
Contributor Author

Hello,

I have seen it happen about a dozen times, really. When it happened to VMs that are constantly monitored by Nagios (there were at least 2 occurrences), I saw that these VMs were working fine during the "backing up" period (while their snapshots were being copied to the secondary storage), but they failed right after their snapshots had been deleted from their volumes' images on primary storage.

Sometimes it's possible to repair these images with qemu-img check -r all, but in most cases it's fatal (especially if the guest has XFS).
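
For reference, a minimal way to drive that repair attempt from Java; the image path is hypothetical, and the image must not be in use while qemu-img is repairing it:

public class QcowRepairAttempt {
    public static void main(String[] args) throws Exception {
        // Hypothetical volume path on primary storage; stop the VM first so
        // that nothing has the image open while qemu-img writes to it.
        String image = "/mnt/primary/a1b2c3d4-volume.qcow2";
        Process p = new ProcessBuilder("qemu-img", "check", "-r", "all", image)
                .inheritIO()
                .start();
        System.out.println("qemu-img check exited with " + p.waitFor());
    }
}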

@Slair1
Contributor

Slair1 commented Mar 4, 2019

We have also had corrupted VMs. We are on 4.9.3 with CentOS/KVM. I can't say which part of the snapshot process is killing our VMs, but it has been horrible to deal with. We are using NFS.

@ustcweizhou
Contributor

@Melnik13 Good, thanks for your reply.
This PR is good for me now.

@ustcweizhou
Contributor

@Melnik13 By the way, could you please add another commit to suspend the VM when deleting a VM snapshot?

@borisstoyanov
Contributor

@Melnik13 I just wonder if there are any steps to reproduce this corruption? Could a debugger and step-by-step execution come in handy here?

@NuxRo
Contributor

NuxRo commented Apr 2, 2019

Hello,

Has there been any more progress on this?
It's terribly worrying for this corruption to happen, especially as I was planning on using the snapshot feature for backups.

@bwsw
Contributor

bwsw commented Apr 2, 2019

@borisstoyanov Push high I/O to the VM volume and the overall storage, and try playing with volume snapshots. After doing 10-20+ of them, adding new ones and removing old ones (e.g. a scheduled scenario), stop the VM and there's a chance it will not boot because of a corrupted qcow2 image.
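
A rough sketch of that reproduction loop, again with the libvirt Java bindings and placeholder names (the heavy I/O, e.g. fio inside the guest, is assumed to be running separately):

import org.libvirt.Connect;
import org.libvirt.Domain;

public class SnapshotChurn {
    public static void main(String[] args) throws Exception {
        Connect conn = new Connect("qemu:///system");
        Domain vm = conn.domainLookupByName("i-2-42-VM"); // placeholder name

        // Repeatedly create and delete internal snapshots while the guest is
        // under heavy I/O, mimicking a recurring-snapshot schedule.
        for (int i = 0; i < 20; i++) {
            String name = "churn-" + i;
            vm.snapshotCreateXML("<domainsnapshot><name>" + name
                    + "</name></domainsnapshot>");
            Thread.sleep(60_000); // let some I/O happen in between
            vm.snapshotLookupByName(name).delete(0);
        }
    }
}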

@Slair1
Contributor

Slair1 commented Apr 12, 2019

@Melnik13 any updates on this PR?

@khmarochos
Contributor Author

Dear colleagues, sorry for the pause!

On the 29th of April, I'm going to install the patched version in one of my production environments to make sure that my patch can't harm anything. I tested it in the lab, but I'd also like to check it in a populous environment.

@NuxRo
Contributor

NuxRo commented May 9, 2019

Hello, how's it going so far? Any more corruptions?

@yadvr yadvr added this to the 4.13.0.0 milestone May 27, 2019
@yadvr
Member

yadvr commented May 27, 2019

@Melnik13 any updates?

@yadvr yadvr changed the base branch from 4.11.2.0 to master May 27, 2019 12:38
@yadvr yadvr changed the base branch from master to 4.11 May 27, 2019 12:38
@yadvr yadvr modified the milestones: 4.13.0.0, 4.11.3.0 May 27, 2019
@yadvr yadvr added the type:bug label May 27, 2019
@yadvr yadvr closed this May 30, 2019
@yadvr yadvr reopened this May 30, 2019
@yadvr
Member

yadvr commented May 30, 2019

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2810

@khmarochos
Contributor Author

Dear colleagues, I'm really sorry for such a long pause. There were certain circumstances that you would consider a "good excuse", so I hope you won't blame me too much for my silence. Thanks!

@apache apache deleted a comment from blueorangutan May 30, 2019
@yadvr
Member

yadvr commented May 30, 2019

@blueorangutan test centos7 kvm-centos6

@blueorangutan

Trillian test result (tid-3610)
Environment: kvm-centos6 (x2), Advanced Networking with Mgmt server 7
Total time taken: 27219 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3194-t3610-kvm-centos6.zip
Intermittent failure detected: /marvin/tests/smoke/test_service_offerings.py
Smoke tests completed. 67 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_service_offering_cpu_limit_use Error 1.33 test_service_offerings.py

@Slair1
Contributor

Slair1 commented May 30, 2019

@Melnik13 @rhtyd
From everything I've been able to find, it isn't supported to take snapshots of volumes on running VMs with KVM. I see in the CloudStack code that this functionality is also disabled by default, with "kvm.snapshot.enabled" set to false. The code does, however, allow snapshots if the VM is stopped or the volume is detached.

Here is a link to an article talking about not using qemu-img to take snapshots of running VMs (there are lots of other posts/pages agreeing).

https://www.cyberciti.biz/faq/how-to-create-create-snapshot-in-linux-kvm-vmdomain/

Should we be suspending the VM before taking the snapshot? Is suspending enough to prevent corruption, or is it just "safer"? We are just very gun-shy with snapshots on KVM after experiencing corruption on several VMs...

Thanks for any input!

@khmarochos
Contributor Author

khmarochos commented May 30, 2019

@Slair1

Here is a link to an article talking about not using qemu-img to take snapshots of running VMs (there are lots of other posts/pages agreeing).
https://www.cyberciti.biz/faq/how-to-create-create-snapshot-in-linux-kvm-vmdomain/
Should we be suspending the VM before taking the snapshot? Is suspending enough to prevent corruption, or is it just "safer"? We are just very gun-shy with snapshots on KVM after experiencing corruption on several VMs...

You're absolutely right, but ACS does not use qemu-img to create a snapshot of a running instance. It runs qemu-img only if the instance is stopped; for a running one, it calls libvirt to perform the job (and libvirt always suspends the VM).

The problem occurs when ACS deletes the snapshot from the running instance's volume. While the deletion procedure is being performed, libvirt doesn't suspend the instance (well, the instance freezes somehow, but virsh list still shows it as running).

I don't think it's a bug in ACS; it seems to be a bug in libvirt or QEMU/KVM (by the way, Proxmox users are facing these issues too), but our workaround (manually suspending the instance before the snapshot deletion) could help us outflank the problem.
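
To illustrate the two paths, here is a sketch with placeholder names, not the actual CloudStack code:

import org.libvirt.Connect;
import org.libvirt.Domain;
import org.libvirt.DomainInfo;

public class SnapshotPathChoice {
    public static void main(String[] args) throws Exception {
        Connect conn = new Connect("qemu:///system");
        Domain vm = conn.domainLookupByName("i-2-42-VM"); // placeholder

        if (vm.getInfo().state == DomainInfo.DomainState.VIR_DOMAIN_RUNNING) {
            // Running instance: go through libvirt, which coordinates with
            // the qemu process that owns the image (and pauses the guest
            // while the snapshot is taken).
            vm.snapshotCreateXML("<domainsnapshot><name>s1</name></domainsnapshot>");
        } else {
            // Stopped instance: nothing has the image open, so calling
            // qemu-img directly is safe.
            new ProcessBuilder("qemu-img", "snapshot", "-c", "s1",
                    "/mnt/primary/a1b2c3d4-volume.qcow2") // hypothetical path
                    .inheritIO().start().waitFor();
        }
    }
}

Deletion is the odd one out: on a running instance it also goes through libvirt, but the guest is not paused for it, which is exactly what this PR changes.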

Thanks.

@yadvr
Member

yadvr commented May 30, 2019

@blueorangutan test

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@ustcweizhou
Contributor

Code LGTM, tested OK.
Can you also add some information as part of this change to inform CloudStack/KVM users, in the API and in the UI, that the VM will be suspended and resumed when taking or removing a VM snapshot?

@blueorangutan

Trillian test result (tid-3618)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 28889 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3194-t3618-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 67 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_hostha_enable_ha_when_host_disabled Error 0.64 test_hostha_kvm.py
test_hostha_enable_ha_when_host_disconected Error 936.12 test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance Error 302.81 test_hostha_kvm.py

@yadvr
Member

yadvr commented May 31, 2019

@Melnik13 we can merge this as soon as @ustcweizhou's review comment is addressed. Thanks.

@khmarochos
Contributor Author

khmarochos commented May 31, 2019

@ustcweizhou

Code LGTM, tested OK.
Can you also add some information as part of this change to inform CloudStack/KVM users, in the API and in the UI, that the VM will be suspended and resumed when taking or removing a VM snapshot?

Thank you for your comment. Where would it be best to add this information?

I'd also like to mention that nothing will change from the user's point of view, because the instance actually freezes while the snapshot is being deleted. We suspect that the instance isn't being "properly" suspended (it doesn't react to anything, but virsh list still shows it as running), and that is exactly what causes the image corruption.

@yadvr
Member

yadvr commented Jun 3, 2019

@Melnik13 perhaps you can send a PR to cloudstack-documentation (https://github.com/apache/cloudstack-documentation), and in the CloudStack UI, the create/delete VM snapshot form/dialog could say that the operation will pause the guest VM.

@ustcweizhou
Contributor

@Melnik13
Besides what @rhtyd said, you can change the description in CreateVMSnapshotCmd.java and DeleteVMSnapshotCmd.java.

We already have a message in ui/l10n/ warning users that "Please notice that the instance will be paused during the snapshoting, and resumed after snapshotting, if it runs on KVM." when creating a VM snapshot.
It would be better to add a similar warning when deleting a VM snapshot.
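
As an illustrative fragment only (the real annotation in DeleteVMSnapshotCmd.java has more parameters, and the exact wording is up for discussion), the API-level description could mention the pause like this:

import org.apache.cloudstack.api.APICommand;
import org.apache.cloudstack.api.BaseAsyncCmd;
import org.apache.cloudstack.api.response.SuccessResponse;

// Sketch of the kind of change suggested above, not the actual source.
@APICommand(name = "deleteVMSnapshot",
        description = "Deletes a VM snapshot. On KVM, a running VM is paused "
                + "while the snapshot is removed and resumed afterwards.",
        responseObject = SuccessResponse.class)
public class DeleteVMSnapshotCmd extends BaseAsyncCmd {
    // ...
}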

@anuragaw
Contributor

LGTM based on code and discussions here.

@yadvr
Member

yadvr commented Jun 4, 2019

Looks like it is already handled. The UI already says this:
[screenshot from 2019-06-04 showing the existing VM snapshot dialog message]

I've updated the text for snapshot deletion confirmation. Will merge based on this.

@ustcweizhou
Contributor

@rhtyd Good.
LGTM now.

@yadvr yadvr merged commit c94ee14 into apache:4.11 Jun 4, 2019
@khmarochos khmarochos deleted the 4.11.2.0_careful_snapshot_deletion branch June 4, 2019 10:41
@Slair1 mentioned this pull request Apr 9, 2020
ggoodrich-ipp pushed a commit to ippathways/cloudstack that referenced this pull request Apr 14, 2020
…he#3194)

To make sure that a qcow2 image won't be corrupted by the snapshot deletion procedure, which is performed after the snapshot has been copied to a secondary store, I'd propose putting the VM into a suspended state.

Additional reference: https://bugzilla.redhat.com/show_bug.cgi?id=920020#c5

Fixes apache#3193

(cherry picked from commit c94ee14)
yadvr pushed a commit that referenced this pull request Feb 25, 2021
…n, th… (#4032)

These changes are related to PR #3194, but include suspending/resuming the VM when taking a VM snapshot as well as when deleting one, since both perform the same operations via libvirt. Also, there was an issue with the UI/localization changes in the prior PR: that PR altered the volume snapshot behavior but changed the VM snapshot wording. Both have been corrected in this PR.

Issuing this in response to the work happening in PR #4029.
