Skip reuse of removed layers in copyFileFromOtherLayer#890
Conversation
|
I am not sure how to test this since it fixes the flaking test. |
|
If you fix flakes it is always good to include a log and/or example failure output so reviewers can verify that your theory makes sense. |
mtrmac
left a comment
There was a problem hiding this comment.
What is the failure log + path?
Intuitively, this fix looks unlikely.
- Most importantly, the storage should be in control of the fallback path: if
convert_imagesis set, we must not fallback (“the user wants composefs and we don’t have a non-partial-pull composefs code path”) - IIRC nothing in the partial pull code path references the
BlobInfoCache. PrepareStagedLayercan easily fail on network issues, but that should be handled by a higher-level retry likec/common/pkg/retry. This fallback is not the right mechanism for handling transient network errors.
= return |
When `copyFileFromOtherLayer` opens a source layer directory that was removed by a concurrent `rmi`, return "not found" instead of a fatal error so the differ fetches content from the remote blob instead. Fixes podman flake: "quadlet - kube build from unavailable image with no tag Signed-off-by: Jan Rodák <hony.com@seznam.cz>
4e35efb to
6114e0a
Compare
mtrmac
left a comment
There was a problem hiding this comment.
= we build the cache in layersCache.load (within chunked.NewDiffer), that temporarily locks the store but we don’t hold any locks between then and PrepareStagedLayer, so the layer can be removed in the meantime.
Thanks, that makes sense to me.
Cc: @giuseppe
Note that PrepareStagedLayer does not hold a storage lock at all, so we might succeed in opening srcDirfd here, but still fail opening srcFile if the layer is concurrently being deleted.
I think the findChunkInOtherLayers path should also check for a missing file… and that might be a more difficult fix because we build a list of ranges to download, and later we might find that the chunk has been removed in the meantime.
I’m not sure what is a good fix here — in principle we need to fall back to downloading from the remote, that might be a fairly invasive restructuring.
Maybe, for now, we can just have the initial checksum computation resilient, and still flake/fail if the file is removed concurrently (+ filing an issue to track the outstanding case) — that way the recurring flake would be fixed and not blocked on a larger redesign. But I might well be missing something here, maybe the restructuring is easy, maybe we can do something else here. (Keep an open file descriptor? Probably not.)
we will need some more fixes around handling ENOENT, but I think this should fix the root issue because the file will still be accessible. |
I've asked Claude Code to implement this logic (keep an open file descriptor and use that), I've got this: I've reviewed and it makes sense to me, but I've not tested it yet |
Without some extra trick, I don’t think we can rely on keeping… potentially millions of file descriptors open concurrently. |
giuseppe
left a comment
There was a problem hiding this comment.
LGTM
I'll deal separately with the chunks deduplication race condition
|
This needs to rebased to properly trigger CI. As for the logic I cannot really speak to that code paths and have no time to dive deeper into but if @giuseppe is good with it that is good enough for me. |
When
copyFileFromOtherLayeropens a source layer directory that was removed by a concurrentrmi, return "not found" instead of a fatal error so the differ fetches content from the remote blob instead.Fixes podman flake:
quadlet - kube build from unavailable image with no tagnot ok 271 |252| quadlet - kube build from unavailable image with no tag in 8211ms
The test (tagged ci:parallel) removes quay.io/libpod/busybox:latest with
rmi -i, then a systemd service re-pulls it viapodman kube play --build. During the partial pull, the chunked differ's layers cache (getLayersCache(store)) finds a reference to local layer7ffd7a2b...for content reuse. WhencopyFileFromOtherLayertries to open that layer's diff directory, it getsENOENTbecause rmi already deleted the overlay.Fix (in
storage/pkg/chunked/storage_linux.go): The fix belongs in storage where the layers cache is consulted. WhencopyFileFromOtherLayeropens a source layer directory that no longer exists, it should return "not found" (skip reuse) rather than a fatal error.