Skip to content

don't cache prefix-filtered listings as complete directory entries#1034

Merged
martindurant merged 1 commit into
fsspec:mainfrom
bruno-hays:fix/partial-listing-dircache-poisoning
Jun 29, 2026
Merged

don't cache prefix-filtered listings as complete directory entries#1034
martindurant merged 1 commit into
fsspec:mainfrom
bruno-hays:fix/partial-listing-dircache-poisoning

Conversation

@bruno-hays

Copy link
Copy Markdown
Contributor

Problem

When glob("dir/train-*") is called, fsspec extracts train- as a stem prefix and calls _find(path, prefix="train-", maxdepth=1) for server-side filtering. Inside _find's maxdepth+prefix branch, _lsdir runs a delimiter-based S3 listing filtered to that stem, but then unconditionally writes the partial result into dircache[path] as if it were a complete directory listing.
A subsequent glob("dir/test-*") or ls("dir") hits this stale cache and misses every file that didn't match the first stem, returning empty results even though the files exist on S3.
The same poisoning happens without glob:

fs.find("bucket/data", prefix="train-", maxdepth=1)
fs.ls("bucket/data")  # returns only train-* files

Only s3fs has this defect. gcsfs already guards its equivalent write with if not prefix; adlfs never writes dircache from _find at all.

Fix

Add a partial=False keyword to _lsdir. When True, the dircache[path] = files write is skipped. _find's maxdepth+prefix branch is the only call site that passes partial=True. All other callers keep partial=False and continue to populate dircache normally. A pre-existing full cache entry is returned unchanged via the existing if path not in self.dircache guard.

Linked to fsspec:2054

Comment thread s3fs/core.py Outdated
delimiter="/",
prefix="",
versions=False,
partial=False,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is prefix alone not enough to indicate whether the listing is partial or not?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right. The only issue was that _lsdir overwrites the prefix argument with the key-derived directory prefix (prefix = key.lstrip("/") + "/" + prefix) before the cache-write, at which point it's non-empty for every non-root path (e.g. a plain ls("bucket/data") has prefix == "data/"). Gating on it there would disable caching for all subdirectories. Capturing the caller's prefix up front (partial = bool(prefix)) avoids that and lets us drop the extra kwarg.
I have updated the code accordingly.

@bruno-hays bruno-hays force-pushed the fix/partial-listing-dircache-poisoning branch from 34e33ed to 253a8c8 Compare June 28, 2026 10:09
@martindurant martindurant merged commit b7ec8db into fsspec:main Jun 29, 2026
31 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants