
Single-pass MemoryFileSystem.find() to avoid quadratic listing #2055

Open
Nas01010101 wants to merge 1 commit into fsspec:master from Nas01010101:perf/memoryfs-single-pass-find


@Nas01010101


What

Give MemoryFileSystem a single-pass find() so that listing a tree is linear in the number of stored paths instead of quadratic.

The problem

MemoryFileSystem keeps every path in one flat dict (self.store). The inherited AbstractFileSystem.find() walks the tree by calling ls() once per directory, and MemoryFileSystem.ls() scans the entire store on every call:

for p2 in tuple(self.store):
    if p2.startswith(starter):
        ...

So find() over a tree with D directories and N stored paths does O(D * N) work. For a store with many directories this dominates anything built on find().
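To make the O(D * N) count concrete, here is a toy model rather than fsspec code: a flat dict of paths and a function that counts the startswith() checks made when every directory triggers one full scan of the store. The helper names build_store and count_prefix_checks and the /data/d{i}/f{j} layout are assumptions made up for illustration.

```python
# Toy model, not fsspec code: a flat dict of paths plus a per-directory
# scan that mirrors the loop quoted above.

def build_store(n_dirs, files_per_dir):
    """Build a hypothetical flat store: /data/d<i>/f<j> -> empty bytes."""
    store = {}
    for d in range(n_dirs):
        for f in range(files_per_dir):
            store[f"/data/d{d}/f{f}"] = b""
    return store


def count_prefix_checks(store, n_dirs):
    """Count startswith() calls when each directory scans the whole store."""
    checks = 0
    # The walk visits the root plus every subdirectory, one scan each.
    directories = ["/data"] + [f"/data/d{d}" for d in range(n_dirs)]
    for directory in directories:
        starter = directory + "/"
        for path in store:
            path.startswith(starter)  # the per-key work done by each scan
            checks += 1
    return checks


small = count_prefix_checks(build_store(50, 50), 50)
large = count_prefix_checks(build_store(100, 50), 100)
print(small, large, large / small)
```

With 50 directories of 50 files this counts 51 * 2,500 = 127,500 checks; doubling to 100 directories gives 101 * 5,000 = 505,000, roughly four times as many for twice the data.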

The fix

Override find() on MemoryFileSystem with one pass over the flat store (the store already contains every path, so no per-directory recursion is needed). The override reproduces the base semantics exactly: maxdepth, withdirs (including implied ancestor directories and explicitly created empty pseudo_dirs), detail, the file-as-path case, and inclusion of the search root itself when withdirs is set. It is additive and touches no existing code path.
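The single-pass idea can be sketched on a plain dict. This is a simplified illustration under stated assumptions, not the code in this PR: single_pass_find is a hypothetical name, the store is assumed to map absolute file paths to contents, and the sketch covers only maxdepth and implied ancestor directories for withdirs. It leaves out detail, pseudo_dirs, the file-as-path case, and inclusion of the root itself.

```python
def single_pass_find(store, root, maxdepth=None, withdirs=False):
    """Return sorted paths under `root` using one pass over the store keys.

    Hypothetical sketch: `store` is any mapping whose keys are absolute
    file paths such as "/data/a/x".
    """
    prefix = root.rstrip("/") + "/"
    found = set()
    for path in store:
        if not path.startswith(prefix):
            continue
        # Components below the root; a file with k components is at depth k.
        parts = path[len(prefix):].split("/")
        if maxdepth is None or len(parts) <= maxdepth:
            found.add(path)
        if withdirs:
            # Every proper ancestor below the root is an implied directory.
            limit = len(parts) - 1
            if maxdepth is not None:
                limit = min(limit, maxdepth)
            for k in range(1, limit + 1):
                found.add(prefix + "/".join(parts[:k]))
    return sorted(found)


store = {"/data/a/x": b"", "/data/a/b/y": b"", "/data/z": b"", "/other/q": b""}
print(single_pass_find(store, "/data"))
print(single_pass_find(store, "/data", maxdepth=1, withdirs=True))
```

Each key is examined once, so the work is proportional to the total number of path components in the store rather than to directories times paths.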

This speeds up everything that delegates to find(): find, du, glob, and expand_path. walk() is left unchanged (it still calls ls() per directory and remains O(D * N)); it could get the same treatment in a follow-up if wanted.

Correctness

Two new tests cover the override. test_find_does_not_scan_per_directory asserts that it never calls ls(), so the quadratic factor is gone. test_find_matches_generic checks it for equality against the generic AbstractFileSystem.find() across the full matrix of roots (including "", a file path, and a missing path), maxdepth in {None, 1, 2, 3}, withdirs, and detail. Because the comparison is made directly against the base implementation, this test also guards against future drift: if the generic find() semantics change, the test fails.
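The shape of these two checks can be illustrated without fsspec. The sketch below uses a hypothetical ToyFS class (not part of fsspec, and not the tests in this PR) with a call-counting ls(), a reference generic_find() that recurses with one ls() per directory, and a candidate fast_find() that makes one pass. It models files and maxdepth only; withdirs, detail, and the file-as-path case are left out.

```python
import itertools


class ToyFS:
    """Hypothetical stand-in for a flat-store filesystem; not fsspec code."""

    def __init__(self, paths):
        self.store = set(paths)
        self.ls_calls = 0

    def ls(self, directory):
        """Return (files, subdirs) directly under `directory` by full scan."""
        self.ls_calls += 1
        starter = directory.rstrip("/") + "/"
        files, subdirs = set(), set()
        for path in self.store:
            if path.startswith(starter):
                head, sep, _ = path[len(starter):].partition("/")
                (subdirs if sep else files).add(starter + head)
        return files, subdirs

    def generic_find(self, root, maxdepth=None):
        """Reference: recurse with one ls() call per directory."""
        files, subdirs = self.ls(root)
        out = set(files)
        if maxdepth is None or maxdepth > 1:
            deeper = None if maxdepth is None else maxdepth - 1
            for sub in subdirs:
                out |= set(self.generic_find(sub, deeper))
        return sorted(out)

    def fast_find(self, root, maxdepth=None):
        """Candidate: one pass over the store, never calling ls()."""
        starter = root.rstrip("/") + "/"
        return sorted(
            p for p in self.store
            if p.startswith(starter)
            and (maxdepth is None or p[len(starter):].count("/") < maxdepth)
        )


fs = ToyFS(["/data/a/x", "/data/a/b/y", "/data/z", "/other/q"])

# Differential check: candidate must equal reference on every combination.
roots = ["", "/data", "/data/a", "/missing"]
for root, maxdepth in itertools.product(roots, [None, 1, 2, 3]):
    assert fs.fast_find(root, maxdepth) == fs.generic_find(root, maxdepth)

# Regression check: the candidate must not touch ls() at all.
fs.ls_calls = 0
fs.fast_find("/data")
assert fs.ls_calls == 0
```

Comparing against the reference for every combination, rather than against hand-written expected lists, is what lets such a test catch drift in either implementation.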

The original development run also diff-tested the override against the base across 13,560 randomized configurations with zero mismatches.

Performance

MemoryFileSystem.find("/data"), old generic ls()-per-directory vs the single-pass override, 50 files per directory, outputs verified identical:

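| files  | dirs | old (ms) | new (ms) | speedup |
|--------|------|----------|----------|---------|
| 2,500  | 50   | 6.1      | 1.2      | 5.2x    |
| 5,000  | 100  | 20.1     | 2.7      | 7.5x    |
| 10,000 | 200  | 80.1     | 5.4      | 14.7x   |
| 20,000 | 400  | 285.9    | 12.3     | 23.3x   |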

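Because each directory holds a fixed 50 files, D grows in proportion to N, so the old time grows roughly quadratically with the number of files while the new time grows roughly linearly. The gap therefore widens as the tree grows.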
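As a rough sanity check on that growth claim, the scaling exponent k in time ~ files**k can be estimated from each pair of consecutive benchmark rows. The numbers below are copied from the table; the function name scaling_exponents is made up for this sketch.

```python
import math

# Timings in milliseconds copied from the benchmark table above.
files = [2_500, 5_000, 10_000, 20_000]
old_ms = [6.1, 20.1, 80.1, 285.9]
new_ms = [1.2, 2.7, 5.4, 12.3]


def scaling_exponents(sizes, times):
    """Estimate k in time ~ size**k from each pair of consecutive rows."""
    rows = list(zip(sizes, times))
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(rows, rows[1:])
    ]


old_k = scaling_exponents(files, old_ms)
new_k = scaling_exponents(files, new_ms)
print("old exponents:", [round(k, 2) for k in old_k])
print("new exponents:", [round(k, 2) for k in new_k])
```

An exponent near 2 indicates quadratic growth and near 1 linear growth. On these four rows the old exponents fall between roughly 1.7 and 2.0 and the new ones between roughly 1.0 and 1.2, which is consistent with the claim, although four data points are a small sample.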

Notes

  • BSD-3, no behavioural change to the public API.
  • docs/source/changelog.rst has an entry under a new Dev section; the #XXXX PR reference will be filled in once this PR has a number.

MemoryFileSystem inherited the generic AbstractFileSystem.find(), which
walks the tree calling ls() once per directory. Each ls() scans the whole
global store, so listing a tree is O(n_dirs * n_entries).

Override find() with a single pass over the flat store, producing output
identical to the generic implementation across roots, maxdepth, withdirs
and detail. On a 10k-file / 200-directory tree this is ~15x faster and the
gain grows with the tree size; find(), du(), expand_path() and glob() all
benefit since they delegate to find().

Add a differential test against AbstractFileSystem.find() and a regression
test asserting find() no longer calls ls() per directory.
@Nas01010101 Nas01010101 force-pushed the perf/memoryfs-single-pass-find branch from cd585c6 to 38d9992 Compare June 21, 2026 19:42
@martindurant

Member

Your implementation is probably fine (I haven't looked in detail yet), but is there really a use case for >>1000 files in a memoryFS?
