fix(glob): discard partial dircache entry after prefix-filtered _find #2058

Closed
bruno-hays wants to merge 2 commits into fsspec:master from bruno-hays:fix/glob-prefix-poisons-dircache

@bruno-hays


Problem

_glob passes a prefix= hint to _find so backends (s3fs, gcsfs, adlfs) can filter server-side up to the first wildcard. Those backends store the filtered result in dircache under the parent directory key, but the entry contains only prefix-matching files, making it a partial (and misleading) directory listing.
Consequence: after glob("dir/train-*"), a call to glob("dir/test-*") or fs.exists("dir/test-file") hits the stale partial entry and silently returns nothing / False, even though the files exist.
This regression was introduced in 2026.4.0 by #1996.
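
For illustration, the failure mode can be reproduced without any real backend. The sketch below is a toy model only: `ToyListingFS` and its `find`/`exists` methods are hypothetical stand-ins, not fsspec or s3fs code, and assume a backend that caches whatever a prefixed listing returns under the parent directory key.

```python
class ToyListingFS:
    """Hypothetical stand-in for a caching backend; not fsspec or s3fs code."""

    def __init__(self, remote_files):
        self.remote = sorted(remote_files)  # pretend remote object store
        self.dircache = {}                  # directory -> cached listing

    def find(self, root, prefix=""):
        # Serve from the cache when an entry exists, as if it were complete.
        if root in self.dircache:
            listing = self.dircache[root]
        else:
            # "Server-side" filtering: only prefix-matching names come back.
            listing = [p for p in self.remote if p.startswith(f"{root}/{prefix}")]
            self.dircache[root] = listing  # the bug: a partial listing is cached
        return [p for p in listing if p.startswith(f"{root}/{prefix}")]

    def exists(self, path):
        root = path.rsplit("/", 1)[0]
        return path in self.find(root)


fs = ToyListingFS(["dir/train-0", "dir/train-1", "dir/test-0"])
train = fs.find("dir", prefix="train-")  # caches a train-only listing for "dir"
test = fs.find("dir", prefix="test-")    # served from the poisoned cache
print(train, test, fs.exists("dir/test-0"))
```

In this model the second lookup and the `exists` check both consult the train-only entry, so `dir/test-0` appears to be missing even though it is in `remote`.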

Fix

After _find returns with a prefixed query, discard the dircache entry for root. The result is already captured in allpaths so nothing is lost; the next lookup for the same directory fetches a fresh full listing.

if prefix and root:
    self.dircache.pop(root.rstrip("/"), None)

@martindurant

Member

I had not anticipated such a solution! Does this miss the case that the dircache was already populated before calling find(), so we end up discarding legitimate listings?

@martindurant

Member

Seems to be causing explicit failures

When `_glob` passes `prefix=` to `_find`, backends like s3fs, gcsfs and
adlfs perform a server-side filtered listing and store the result in
`dircache` under the parent directory key.  That entry is a *partial*
listing (only files matching the prefix), but it gets treated as a
complete directory listing by every subsequent operation on the same
path.

Consequence: after `glob("dir/train-*")`, a call to `glob("dir/test-*")`
or `fs.exists("dir/test-file")` hits the cached train-only listing and
returns an empty result / False, even though the test files exist on the
remote storage.  The regression was introduced in 2026.4.0 by the
prefix= optimisation (PR fsspec#1996).

Fix: after `_find` returns, remove the `dircache` entry for `root` when
a prefix was used.  The next lookup for the same directory will perform
a fresh full listing and cache it correctly.

Adds a regression test using a mock backend that faithfully simulates
the partial-caching behaviour (cache-hit path returns only the
prefix-filtered subset, triggering the exact failure mode).

Co-authored-by: Cursor <cursoragent@cursor.com>
@bruno-hays bruno-hays force-pushed the fix/glob-prefix-poisons-dircache branch from 749f2df to 65cbeb8 Compare June 26, 2026 17:29
@bruno-hays

Author

I fixed the test, sorry for the oversight.
We were indeed discarding existing cache entries, so I added a guard.
After checking the code, it seems that only s3fs is affected by this bug: gcsfs does not cache prefixed listings, and adlfs does not write to dircache at all.
I still think it's worth fixing in fsspec, since that is where the issue originated and the fix prevents the same error from arising in another backend. I'm also not 100% sure about the other backends.
It's also worth fixing in s3fs, as other functions there could suffer from the same bug.
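
The guard itself is not shown in this thread. One plausible shape, sketched here purely as an assumption, is to record whether the directory was already cached before calling `_find` and to discard the entry only when the prefixed call created it. `PartialCachingFS` and `find_with_guarded_discard` are hypothetical names used for illustration; they are not the PR's code.

```python
class PartialCachingFS:
    """Hypothetical backend that caches whatever a prefixed listing returns."""

    def __init__(self, remote_files):
        self.remote = sorted(remote_files)
        self.dircache = {}

    def find(self, root, prefix=""):
        if root not in self.dircache:
            self.dircache[root] = [
                p for p in self.remote if p.startswith(f"{root}/{prefix}")
            ]
        return [p for p in self.dircache[root] if p.startswith(f"{root}/{prefix}")]


def find_with_guarded_discard(fs, root, prefix):
    """Call fs.find and drop the cache entry only if this call created it."""
    key = root.rstrip("/")
    had_entry = key in fs.dircache   # remember a pre-existing listing
    paths = fs.find(key, prefix=prefix)
    if prefix and key and not had_entry:
        fs.dircache.pop(key, None)   # discard the partial listing just written
    return paths


# Cold cache: each prefixed call sees fresh data and leaves no partial entry.
fs = PartialCachingFS(["dir/train-0", "dir/test-0"])
train = find_with_guarded_discard(fs, "dir", "train-")
test = find_with_guarded_discard(fs, "dir", "test-")

# Warm cache: a full listing cached beforehand survives the prefixed call.
warm = PartialCachingFS(["dir/train-0", "dir/test-0"])
warm.find("dir")
find_with_guarded_discard(warm, "dir", "train-")
```

Under these assumptions a legitimate full listing that predates the call is kept, which is the case raised in the review question above, while a partial entry written by the prefixed call is removed.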

@bruno-hays

Author

Here is my proposed fix for s3fs: fsspec/s3fs#1034
