
Garbage snapshots are left on Primary and Secondary#3649

Closed
GabrielBrascher wants to merge 6 commits into
apache:4.13 from
CLDIN:snapshot-deletion-issues

Conversation

@GabrielBrascher
Member

@GabrielBrascher GabrielBrascher commented Oct 21, 2019

Description

Tested both cases:

  • Snapshot backed up on Secondary (stored on primary and secondary)
  • Snapshot stored only on primary storage.

This fix works for KVM and XenServer.

Note: I changed the name of the strategy from XenserverSnapshotStrategy to DefaultSnapshotStrategy because the "Xenserver" strategy also handles Ceph and KVM, and returns StrategyPriority.DEFAULT in every case except REVERT.

    @Override
    public StrategyPriority canHandle(Snapshot snapshot, SnapshotOperation op) {
        if (SnapshotOperation.REVERT.equals(op)) {
            long volumeId = snapshot.getVolumeId();
            VolumeVO volumeVO = volumeDao.findById(volumeId);

            if (volumeVO != null && ImageFormat.QCOW2.equals(volumeVO.getFormat())) {
                return StrategyPriority.DEFAULT;
            }
            return StrategyPriority.CANT_HANDLE;
        }
        return StrategyPriority.DEFAULT;
    }
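For context on why the rename makes sense: CloudStack asks each registered strategy for a priority via canHandle and uses the one reporting the highest, so a strategy answering DEFAULT acts as the fallback. A minimal, self-contained sketch of that selection logic (the enum values and strategy names here are simplified assumptions for illustration, not the actual StorageStrategyFactory code):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class StrategySelectionSketch {
    // Simplified stand-in for StrategyPriority; only the relative order matters.
    enum Priority { CANT_HANDLE, DEFAULT, HYPERVISOR, PLUGIN, HIGHEST }

    // Hypothetical minimal strategy shape for illustration.
    record Strategy(String name, Priority priority) {}

    // Pick the strategy reporting the highest priority, skipping CANT_HANDLE.
    static Optional<Strategy> pick(List<Strategy> strategies) {
        return strategies.stream()
                .filter(s -> s.priority() != Priority.CANT_HANDLE)
                .max(Comparator.comparing(Strategy::priority));
    }

    public static void main(String[] args) {
        List<Strategy> strategies = List.of(
                new Strategy("DefaultSnapshotStrategy", Priority.DEFAULT),
                new Strategy("CephSnapshotStrategy", Priority.HYPERVISOR));
        // The more specific strategy outranks the default fallback.
        System.out.println(pick(strategies).map(Strategy::name).orElse("none"));
    }
}
```

Since the renamed DefaultSnapshotStrategy answers DEFAULT for everything but REVERT, any hypervisor- or plugin-specific strategy can still outrank it.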

Fixes: #3646

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

@GabrielBrascher
Member Author

@andrijapanicsb Can you please test this PR and check if this fixes the snapshot deletion garbage issue on XenServer + NFS? Thanks!

@GabrielBrascher GabrielBrascher changed the title [WIP] Garbage snapshots are left on Primary and Secondary Garbage snapshots are left on Primary and Secondary Nov 11, 2019
@GabrielBrascher
Member Author

@andrijapanicsb can you please check this commit? It worked with KVM + Ceph and might also fix the reported snapshot deletion issue for XenServer + NFS.

@andrijapanicsb
Contributor

thx @GabrielBrascher. Will do, just need to find some free time for it.

@andrijapanicsb
Contributor

andrijapanicsb commented Nov 22, 2019

Alright - so far tested only KVM (centos 7) with NFS:

With backup to secondary = true, the results are as follows:

  • snap created on Prim Store, copied to Sec Store, snap deleted from Prim Store (exists only on Sec Store)
  • revert from snap is possible (image moved back and overwritten on Prim Store, as a FULL clone)
  • when deleting a snap, it's instantly removed from the Sec Store NFS; DB records are changed to the "destroyed" state for both the snap reference on Prim Store and the one on Sec Store, but no DB records are removed so far.
  • the cleanup thread finds the snaps in snapshot_store_ref and removes them from the snapshot_store_ref table - but it only removes the ones pointing to the Sec Store, not the ones pointing to Prim Store (they are marked as deleted, but the rows are still physically present in the table)

In general, looking good, except that those records for the snap reference on Prim Store are not deleted from the snapshot_store_ref table - not sure what the original behaviour was here, years ago.
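To make the asymmetry above concrete, here is a toy model (not CloudStack code; the row shape, names, and cleanup logic are assumptions based on the observations above) contrasting a cleanup pass that only purges destroyed secondary-store rows with one that purges destroyed rows for both roles:

```java
import java.util.ArrayList;
import java.util.List;

public class StoreRefCleanupSketch {
    enum Role { PRIMARY, SECONDARY }
    enum State { READY, DESTROYED }

    // Toy stand-in for a snapshot_store_ref row.
    record Ref(long snapshotId, Role role, State state) {}

    // The behaviour observed above: only destroyed SECONDARY rows get purged,
    // leaving destroyed PRIMARY rows behind in the table.
    static List<Ref> cleanupSecondaryOnly(List<Ref> rows) {
        List<Ref> kept = new ArrayList<>();
        for (Ref r : rows) {
            boolean purge = r.state() == State.DESTROYED && r.role() == Role.SECONDARY;
            if (!purge) kept.add(r);
        }
        return kept;
    }

    // The expected behaviour: purge destroyed rows regardless of role.
    static List<Ref> cleanupAllRoles(List<Ref> rows) {
        List<Ref> kept = new ArrayList<>();
        for (Ref r : rows) {
            if (r.state() != State.DESTROYED) kept.add(r);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Ref> rows = List.of(
                new Ref(7, Role.PRIMARY, State.DESTROYED),
                new Ref(7, Role.SECONDARY, State.DESTROYED));
        System.out.println(cleanupSecondaryOnly(rows).size()); // 1: primary row left behind
        System.out.println(cleanupAllRoles(rows).size());      // 0: nothing left behind
    }
}
```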

With backup to secondary = false, the results are as follows:

  • snap created and stays only on Prim Store, all good.
  • can't delete the snap - getting the error "Failed to delete snapshot:java.util.MissingFormatArgumentException: Format specifier '%d'"
  • can't revert the volume from the snap (it seems the snapshot is required to be on secondary storage; the API confirms this if you try to revert via API) - what is the use of this setup then? I'm certainly missing something, or perhaps the idea is that we have identical behaviour to many years ago, just this time we don't copy it over to the Sec Store @GabrielBrascher (you can only create a template or a volume out of it - can't restore it)

Here is a more detailed error when deleting a snap that exists only on Prim Store:

2019-11-22 22:18:42,983 DEBUG [c.c.a.ApiServlet] (qtp504527234-18:ctx-12265ddd ctx-75ecde1c) (logid:83e43335) ===END=== 10.1.0.1 -- GET command=deleteSnapshot&id=73120b09-10d7-48e0-bddb-42b442bbd79e&response=json&_=1574460913233
2019-11-22 22:18:43,030 DEBUG [c.c.s.s.SnapshotManagerImpl] (API-Job-Executor-3:ctx-2cf618ce job-75 ctx-39866784) (logid:6a5a4b68) Failed to delete snapshot: 7:java.util.MissingFormatArgumentException: Format specifier '%d'
2019-11-22 22:18:43,038 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-3:ctx-2cf618ce job-75) (logid:6a5a4b68) Unexpected exception while executing org.apache.cloudstack.api.command.user.snapshot.DeleteSnapshotCmd
com.cloud.utils.exception.CloudRuntimeException: Failed to delete snapshot:java.util.MissingFormatArgumentException: Format specifier '%d'
at com.cloud.storage.snapshot.SnapshotManagerImpl.deleteSnapshot(SnapshotManagerImpl.java:611)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:338)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:197)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.apache.cloudstack.network.contrail.management.EventUtils$EventInterceptor.invoke(EventUtils.java:107)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:174)
at com.cloud.event.ActionEventInterceptor.invoke(ActionEventInterceptor.java:51)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:174)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:212)
at com.sun.proxy.$Proxy199.deleteSnapshot(Unknown Source)
at org.apache.cloudstack.api.command.user.snapshot.DeleteSnapshotCmd.execute(DeleteSnapshotCmd.java:103)
at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:156)
at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:108)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:583)
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:531)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

2019-11-22 22:18:43,035 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-1:ctx-3d310368 job-102) (logid:ed551d42) Complete async job-102, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed to delete snapshot:java.util.MissingFormatArgumentException: Format specifier \u0027%d\u0027"}
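For reference, the MissingFormatArgumentException above is what String.format throws when a format string contains more specifiers than supplied arguments. A minimal repro (the message text here is hypothetical, not the actual CloudStack format string):

```java
import java.util.MissingFormatArgumentException;

public class FormatBugDemo {
    // One argument feeds the '%s', leaving '%d' without an argument,
    // which reproduces the exception seen in the log above.
    static String failingMessage(String name) {
        return String.format("Failed to delete snapshot %s on pool %d", name);
    }

    public static void main(String[] args) {
        try {
            failingMessage("snap-1");
        } catch (MissingFormatArgumentException e) {
            // The exception message names the orphaned specifier.
            System.out.println(e.getMessage()); // prints: Format specifier '%d'
        }
    }
}
```

So the fix is simply to pass an argument for every specifier in the deletion error message (or drop the unused specifier).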

@GabrielBrascher I'm wondering, is the behaviour the SAME in your tests with Ceph? (You obviously do NOT have issues deleting the snap from Ceph only, i.e. backup to Sec Store = false.)
The current behaviour that I see (snap removed from Prim Store once it's copied over to Sec Store) is new in this PR - i.e. previously the snap was left on Prim Store even though it was copied over to Sec Store.

@andrijapanicsb
Contributor

I would say: if you can just fix, if possible, that deletion from Primary (when backup to Sec Store = false), then we are mostly good to go?

Let me briefly test XS with NFS (but I expect identical results).

@andrijapanicsb
Contributor

/CC @rhtyd

@andrijapanicsb
Contributor

Well, things are a bit more complicated on XenServer (7.1.0); it needs some more fixes (nothing major, I hope).

While backup to secondary = TRUE, results as follows:

  • can create snaps - after the snap is created, it stays on both Primary and Secondary storage (here, incremental/delta snaps are copied to Secondary Storage, and each newer snap references the older snap via the snapshot_store_ref table, the parent_snapshot_id field)
  • deletion needs to be done from the newest to the oldest snap (climbing up the chain, so to speak); otherwise deleting a snap in the middle of the chain (when there are multiple snaps) will be reported as failed and will leave partial records in the DB, but if you later remove all older snaps in the chain etc., everything falls into place eventually... I assume this is OK/acceptable and potentially the old behaviour
  • deleting snaps WILL REMOVE them from Sec Storage fine - no garbage - FINE
  • deleting snaps WILL NOT REMOVE them from Primary Storage - here we have garbage left
  • when the storage cleanup thread runs, it removes the needed records from the DB just fine (inside snapshot_store_ref), both the ones referencing snaps on Primary and on Secondary storage - so this is working FINE for XenServer, but not for KVM - for KVM garbage is left in the rows pointing to Primary Storage (it's the storage cleanup thread that might need some fix for KVM)
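The newest-to-oldest deletion constraint follows from the parent links: a delta snapshot can only be physically removed once no newer snapshot references it. A toy sketch of that dependency (the parent_snapshot_id field name is borrowed from the comment above; the structure is an assumption for illustration, not CloudStack code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SnapshotChainSketch {
    // snapshotId -> parentSnapshotId (0 = no parent), mirroring parent_snapshot_id.
    private final Map<Long, Long> parentOf = new HashMap<>();

    void add(long id, long parentId) { parentOf.put(id, parentId); }

    // A snapshot is safely deletable only if no other snapshot lists it as its parent.
    boolean isDeletable(long id) {
        Set<Long> referencedParents = new HashSet<>(parentOf.values());
        return parentOf.containsKey(id) && !referencedParents.contains(id);
    }

    void delete(long id) {
        if (!isDeletable(id)) {
            throw new IllegalStateException("snapshot " + id + " still has children");
        }
        parentOf.remove(id);
    }

    public static void main(String[] args) {
        SnapshotChainSketch chain = new SnapshotChainSketch();
        chain.add(1, 0); // full snapshot
        chain.add(2, 1); // delta on top of 1
        chain.add(3, 2); // delta on top of 2
        System.out.println(chain.isDeletable(2)); // false: 3 still references it
        chain.delete(3);
        System.out.println(chain.isDeletable(2)); // true once the child is gone
    }
}
```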

If the garbage that is left on Primary Storage could be fixed, it would be great. Otherwise, I can live with the current situation and will then raise a new blocker issue on GitHub (which boils down to "let's fix it now") /CC @rhtyd @nvazquez

While backup to secondary = FALSE, results as follows:

  • creating a snap will produce a snap just on Primary Storage - FINE
  • trying to delete a snap FAILS with an identical message as for KVM - "Failed to delete snapshot:java.util.MissingFormatArgumentException: Format specifier '%d'" - that should hopefully be a quick fix
  • can't test whether garbage is left on Primary Storage since deletion fails

@GabrielBrascher
Member Author

Thanks for the detailed review/test @andrijapanicsb. I am going to check that exception.

@GabrielBrascher
Member Author

@andrijapanicsb the code has been updated with the last commit; that exception should no longer be thrown.

@GabrielBrascher GabrielBrascher changed the title Garbage snapshots are left on Primary and Secondary [WIP DO NOT MERGE] Garbage snapshots are left on Primary and Secondary Dec 12, 2019
@DaanHoogland
Contributor

ping @GabrielBrascher , do we still want this in before 4.13.1/4.14?

@GabrielBrascher
Member Author

@DaanHoogland The idea of this PR is to solve it for KVM/Ceph and XenServer/NFS as well.

I have been able to test it on Ceph + KVM. However, I was not able to test whether it fixes XenServer. That is why it is a "WIP". I can close this one and aim for an isolated fix for Ceph only.

@DaanHoogland
Contributor

Well, the question is: is it feasible (cc @andrijapanicsb)? If you want a fix in 4.13, your safest bet is of course not to rely on us having time, @GabrielBrascher ;)

@GabrielBrascher GabrielBrascher modified the milestones: 4.13.1.0, 4.15.0.0 Jan 3, 2020
@apache apache deleted a comment from blueorangutan Feb 19, 2020
@apache apache deleted a comment from blueorangutan Feb 19, 2020
@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-900

@DaanHoogland
Contributor

@blueorangutan test matrix

@blueorangutan

@DaanHoogland a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@GabrielBrascher
Member Author

@rhtyd @andrijapanicsb @DaanHoogland I have not been able to solve those issues yet, I am still investigating and hammering it :-(

@DaanHoogland
Contributor

tnx @GabrielBrascher. Note that the Travis run has some errors as well; maybe this can help.

@blueorangutan

Trillian test result (tid-1044)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30889 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3649-t1044-vmware-65u2.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Smoke tests completed. 76 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_vpc_privategw_static_routes Failure 272.25 test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup Failure 255.34 test_privategw_acl.py
test_04_rvpc_privategw_static_routes Failure 408.50 test_privategw_acl.py

@apache apache deleted a comment from blueorangutan Feb 20, 2020
@apache apache deleted a comment from blueorangutan Feb 20, 2020
@yadvr
Member

yadvr commented Feb 28, 2020

Ping - update on this @GabrielBrascher @andrijapanicsb @DaanHoogland ?

@GabrielBrascher
Member Author

@rhtyd not much progress so far from my side.

@GabrielBrascher
Member Author

Quick update: today I have been digging into the code and debugging. This week I will be dedicating more time here to solve it as soon as possible.

@andrijapanicsb
Contributor

thx @GabrielBrascher

@yadvr yadvr closed this Mar 5, 2020
@yadvr yadvr reopened this Mar 5, 2020
@DaanHoogland
Contributor

@GabrielBrascher did you manage? as code freeze is upon us, will you need more time or resources?

@GabrielBrascher
Member Author

@DaanHoogland I am working on it; the fix that I am implementing works for KVM + NFS (primary). It still needs more testing (the fix is not yet pushed here). I am also now implementing the KVM + NFS (secondary) snapshot deletion.

@GabrielBrascher GabrielBrascher force-pushed the snapshot-deletion-issues branch from d25b4eb to c9c1888 Compare March 13, 2020 04:38
@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✖centos7 ✖debian. JID-1049

@DaanHoogland
Contributor

you have trailing whitespace in KVMStorageProcessor @GabrielBrascher; checkstyle is objecting.

@DaanHoogland DaanHoogland mentioned this pull request Mar 16, 2020
5 tasks
@DaanHoogland
Contributor

joint effort continuing in #3669
