don't cache prefix-filtered listings as complete directory entries #1034
Merged
martindurant merged 1 commit on Jun 29, 2026
delimiter="/",
prefix="",
versions=False,
partial=False,
Member
Why is prefix alone not enough to indicate whether the listing is partial or not?
Contributor
Author
I think you are right. The only issue was that _lsdir overwrites the prefix argument with the key-derived directory prefix (prefix = key.lstrip("/") + "/" + prefix) before the cache-write, at which point it's non-empty for every non-root path (e.g. a plain ls("bucket/data") has prefix == "data/"). Gating on it there would disable caching for all subdirectories. Capturing the caller's prefix up front (partial = bool(prefix)) avoids that and lets us drop the extra kwarg.
I have updated the code accordingly.
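The idea in the reply above, capturing the caller's prefix before it is rewritten, can be sketched roughly as follows. This is a toy model and not the s3fs code: the function `lsdir`, its `keys` and `dircache` arguments, and the simplified delimiter handling are assumptions made for illustration. Only the `partial = bool(prefix)` capture and the `prefix = key.lstrip("/") + "/" + prefix` rewrite come from the discussion.

```python
def lsdir(keys, dircache, path, prefix=""):
    """Toy listing over bucket-relative keys; hypothetical, not the s3fs method."""
    # Capture the caller's filter before prefix is rewritten below.
    partial = bool(prefix)
    bucket, _, key = path.partition("/")
    if key:
        # After this line prefix is non-empty for every non-root path,
        # so it can no longer tell a filtered call from a plain one.
        prefix = key.lstrip("/") + "/" + prefix
    # Simplified delimiter listing: direct children only, no common prefixes.
    files = [
        bucket + "/" + k
        for k in keys
        if k.startswith(prefix) and "/" not in k[len(prefix):]
    ]
    if not partial:
        # Only an unfiltered listing is stored as the directory's contents.
        dircache[path] = files
    return files


keys = ["data/train-0", "data/test-0"]
cache = {}
filtered = lsdir(keys, cache, "bucket/data", prefix="train-")
cached_after_filtered = dict(cache)  # snapshot: still empty at this point
full = lsdir(keys, cache, "bucket/data")
```

In this sketch the filtered call leaves the cache untouched, while the plain listing of the same subdirectory is still cached, which is the behaviour that gating on the rewritten prefix would have lost.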
34e33ed to 253a8c8
Problem
When glob("dir/train-*") is called, fsspec extracts train- as a stem prefix and calls _find(path, prefix="train-", maxdepth=1) for server-side filtering. Inside _find's maxdepth+prefix branch, _lsdir runs a delimiter-based S3 listing filtered to that stem, but then unconditionally writes the partial result into dircache[path] as if it were a complete directory listing.
A subsequent glob("dir/test-*") or ls("dir") hits this stale cache and misses every file that didn't match the first stem, returning empty results even though the files exist on S3.
The same poisoning happens without glob.
Only s3fs has this defect. gcsfs already guards its equivalent write with if not prefix; adlfs never writes dircache from _find at all.
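The poisoning described above can be reproduced in miniature with a toy in-memory model. The `ToyFS` class, its `_list_remote` helper, and the key names are hypothetical stand-ins chosen for illustration; they are not s3fs internals, and the real listing logic is more involved.

```python
class ToyFS:
    """Hypothetical in-memory stand-in for a bucket; not s3fs code."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.dircache = {}

    def _list_remote(self, path, prefix=""):
        # Simulates a delimiter-based listing filtered server-side by prefix.
        base = path.rstrip("/") + "/"
        return [
            k for k in self.keys
            if k.startswith(base + prefix) and "/" not in k[len(base):]
        ]

    def _lsdir(self, path, prefix=""):
        if path not in self.dircache:
            # The defect being illustrated: a prefix-filtered result is
            # cached as if it were the complete directory listing.
            self.dircache[path] = self._list_remote(path, prefix)
        base = path.rstrip("/") + "/"
        return [k for k in self.dircache[path] if k.startswith(base + prefix)]

    def ls(self, path):
        return self._lsdir(path)


fs = ToyFS(["dir/train-0", "dir/train-1", "dir/test-0"])
first = fs._lsdir("dir", prefix="train-")   # populates the cache with a partial listing
listing = fs.ls("dir")                      # served from the poisoned cache
second = fs._lsdir("dir", prefix="test-")   # also served from the poisoned cache
```

Here `listing` lacks `dir/test-0` and `second` is empty, even though the key exists in the toy bucket.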
Fix
Add a partial=False keyword to _lsdir. When True, the dircache[path] = files write is skipped. _find's maxdepth+prefix branch is the only call site that passes partial=True. All other callers keep partial=False and continue to populate dircache normally. A pre-existing full cache entry is returned unchanged via the existing if path not in self.dircache guard.
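A minimal sketch of the partial=False approach described above, again on a toy model. The `ToyFS` class is hypothetical and only mirrors the names `_lsdir`, `_find`, `dircache`, and `partial` from the description; the branches of `_find` other than maxdepth+prefix are deliberately left out.

```python
class ToyFS:
    """Hypothetical in-memory stand-in for a bucket; not s3fs code."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.dircache = {}

    def _lsdir(self, path, prefix="", partial=False):
        base = path.rstrip("/") + "/"
        if path not in self.dircache:
            files = [
                k for k in self.keys
                if k.startswith(base + prefix) and "/" not in k[len(base):]
            ]
            if partial:
                # Filtered listing: return it without touching the cache.
                return files
            self.dircache[path] = files
        # A full cache entry is read, never overwritten, on later calls.
        return [k for k in self.dircache[path] if k.startswith(base + prefix)]

    def _find(self, path, prefix="", maxdepth=None):
        if maxdepth == 1 and prefix:
            # The only call site in this sketch that passes partial=True.
            return self._lsdir(path, prefix=prefix, partial=True)
        raise NotImplementedError("other branches are omitted in this sketch")


fs = ToyFS(["dir/train-0", "dir/train-1", "dir/test-0"])
train = fs._find("dir", prefix="train-", maxdepth=1)
cached_after_find = "dir" in fs.dircache    # False: nothing was cached
full = fs._lsdir("dir")                     # unfiltered call populates the cache
test = fs._find("dir", prefix="test-", maxdepth=1)
```

With the guard in place, the filtered call no longer decides what the cache holds, so the later `ls`-style call and the second prefix see every key.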
Linked to fsspec:2054