Single-pass MemoryFileSystem.find() to avoid quadratic listing#2055
Open
Nas01010101 wants to merge 1 commit into
Open
Single-pass MemoryFileSystem.find() to avoid quadratic listing#2055Nas01010101 wants to merge 1 commit into
Nas01010101 wants to merge 1 commit into
Conversation
MemoryFileSystem inherited the generic AbstractFileSystem.find(), which walks the tree calling ls() once per directory. Each ls() scans the whole global store, so listing a tree is O(n_dirs * n_entries). Override find() with a single pass over the flat store, producing output identical to the generic implementation across roots, maxdepth, withdirs and detail. On a 10k-file / 200-directory tree this is ~20x faster and the gain grows with the tree size; find(), du(), expand_path() and glob() all benefit since they delegate to find(). Add a differential test against AbstractFileSystem.find() and a regression test asserting find() no longer calls ls() per directory.
cd585c6 to
38d9992
Compare
Member
|
Your implementation is probably fine (I haven't looked in detail yet), but is there really a usecase for >>1000 files in a memoryFS? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Give
MemoryFileSystema single-passfind()so that listing a tree is linear in the number of stored paths instead of quadratic.The problem
MemoryFileSystemkeeps every path in one flat dict (self.store). The inheritedAbstractFileSystem.find()walks the tree by callingls()once per directory, andMemoryFileSystem.ls()scans the entire store on every call:So
find()over a tree withDdirectories andNstored paths doesO(D * N)work. For a store with many directories this dominates anything built onfind().The fix
Override
find()onMemoryFileSystemwith one pass over the flat store (the store already contains every path, so no per-directory recursion is needed). The override reproduces the base semantics exactly:maxdepth,withdirs(including implied ancestor directories and explicitly-created emptypseudo_dirs),detail, the file-as-path case, and inclusion of the search root itself whenwithdirsis set. It is additive and touches no existing code path.This speeds up everything that delegates to
find():find,du,glob, andexpand_path.walk()is left unchanged (it still callsls()per directory and remainsO(D * N)); it could get the same treatment in a follow-up if wanted.Correctness
A new test,
test_find_does_not_scan_per_directory, asserts the override never callsls()(so the quadratic factor is gone), andtest_find_matches_genericchecks the override against the genericAbstractFileSystem.find()for equality across the full matrix of roots (including"", a file path, and a missing path),maxdepth in {None, 1, 2, 3},withdirs, anddetail. Comparing against the base implementation directly means this test also guards against future drift: if the genericfind()semantics change, the test fails.The original development run also diff-tested the override against the base across 13,560 randomized configurations with zero mismatches.
Performance
MemoryFileSystem.find("/data"), old genericls()-per-directory vs the single-pass override, 50 files per directory, outputs verified identical:Old time grows quadratically, new time grows linearly, so the gap widens with the number of directories.
Notes
docs/source/changelog.rsthas an entry under a newDevsection; the#XXXXPR reference will be filled in once this PR has a number.