cache parsed trees in ParserManager to skip re-parsing unchanged files #18
lexasub merged 4 commits into lexasub:main
Conversation

---

Great work on the parse-tree caching! 👍 The implementation is clean, the tests are comprehensive, and this will definitely speed up repeated indexing. One suggestion for a follow-up: the cache is currently in-memory only, so it's lost when the indexing utility restarts. For large repos, re-parsing everything on each ast-rag index run still adds up. Could we add persistence (e.g., SQLite) so the cache survives restarts? Along with that, it would be good to have a size limit with LRU eviction so the cache can't grow without bound. This way, frequent re-indexing benefits from previously parsed trees without re-parsing unchanged files. Happy to help with the implementation if you'd like!

---

Hey, thanks for the detailed feedback. Valid point — the current in-memory cache helps within a single run but doesn't survive restarts, which is exactly the pain point for anyone running ast-rag index repeatedly on a large codebase. SQLite sounds like the right fit here — lightweight, no server needed, and already in Python's stdlib. The main challenge is that tree-sitter Tree objects aren't directly serializable, but we can work around that by persisting (file_path, content_hash, source_bytes) rows and re-parsing from the stored source bytes when a cached tree is actually needed. For the size limit + LRU eviction, I was thinking a last_accessed timestamp column in SQLite and a configurable max_entries / max_size_mb in the config (there's already ast_rag_config.json in the project root that could absorb those settings). Happy to take a crack at it as a follow-up PR if you're okay with the general direction. Would you prefer the SQLite file to sit alongside the existing .ast_rag_file_cache.json, or in a dedicated cache directory?
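A possible shape for those settings in ast_rag_config.json — the parse_cache section name and both key names come from this thread, but the nesting and values here are purely illustrative:

```json
{
  "parse_cache": {
    "max_entries": 10000,
    "max_size_mb": 256
  }
}
```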

---

moved all the cache stuff into its own parse_cache.py file. ParserManager now just holds a ParseCache instance and calls .get()/.put() — no hashing or counters sitting inside the parser. also made the cache injectable via the constructor, so switching to SQLite later is just a matter of passing in a different class — no changes needed to ParserManager itself. threw in some config placeholders for max_entries/max_size_mb while i was at it, for the LRU follow-up.
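The constructor-injection shape described here can be sketched roughly like this — the class and parameter names are illustrative stand-ins, not the project's actual code:

```python
class ParseCache:
    """Minimal in-memory stand-in exposing the .get()/.put() surface."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, value):
        self._store[key] = value


class ParserManager:
    def __init__(self, cache=None):
        # an injected cache wins; otherwise fall back to the in-memory
        # default, so swapping in a SQLite-backed cache later is just
        # a constructor argument — ParserManager itself never changes
        self._cache = cache if cache is not None else ParseCache()
```

Swapping backends then needs no edits inside ParserManager: `ParserManager(cache=SomeOtherCache(...))`.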

---

Nice! The ParseCache extraction is exactly what I had in mind 👍 A few thoughts before you dive into the SQLite backend:
Serialization of tree-sitter Tree objects
On your questions:

---

yeah, would love the schema review before jumping in — here's what i'm thinking: CREATE TABLE parse_cache (…). one thing i want to flag before implementing: the current get() interface returns a Tree directly, while the SQLite backend would hand back stored bytes that still need parsing — so the two backends naturally want different return types.
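For reference, a runnable sketch of that schema using sqlite3 from the stdlib — the column names (file_path, content_hash, source_bytes, last_accessed) come from this thread, but the column types are my assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real cache would use a file path
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS parse_cache (
        file_path     TEXT PRIMARY KEY,   -- absolute path, the cache key
        content_hash  TEXT NOT NULL,      -- sha256 of the source bytes
        source_bytes  BLOB NOT NULL,      -- raw file contents for lazy re-parse
        last_accessed REAL NOT NULL       -- unix timestamp for LRU eviction
    )
    """
)
# index so LRU eviction queries ordered by last_accessed stay cheap
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_last_accessed ON parse_cache(last_accessed)"
)
conn.commit()
```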

---

Before you dive into the SQLite implementation, I'd like to propose an architectural pattern that solves the "different return types" problem elegantly:

```python
from typing import Any, Callable, Optional

from tree_sitter import Tree


class LazyTree:
    """
    A thin wrapper that defers tree loading until first attribute access.

    Both in-memory and SQLite backends return LazyTree | None.
    On first attribute/method access, the wrapped loader is invoked.
    Subsequent accesses use the cached Tree.

    Usage::

        lazy = cache.get(path, source)
        if lazy:
            root = lazy.root_node      # triggers load on first access
            children = root.children   # subsequent accesses are free
    """

    def __init__(self, loader: Callable[[], Tree]) -> None:
        self._loader = loader
        self._tree: Optional[Tree] = None

    def _ensure(self) -> None:
        """Load the Tree on first access."""
        if self._tree is None:
            self._tree = self._loader()

    def __getattr__(self, name: str) -> Any:
        """Delegate all attribute access to the underlying Tree."""
        self._ensure()
        return getattr(self._tree, name)

    def __repr__(self) -> str:
        self._ensure()
        return f"<LazyTree: {self._tree!r}>"
```

The in-memory backend already holds a parsed Tree, so its loader is trivial:

```python
def get(self, abs_path: str, source: bytes) -> Optional[LazyTree]:
    entry = self._store.get(abs_path)  # (hash, tree) pair stored by put()
    if entry and entry[0] == self.hash_source(source):
        self._hits += 1
        # Tree is already loaded; the wrapper just hands it back
        return LazyTree(lambda: entry[1])
    self._misses += 1
    return None


def put(self, abs_path: str, source: bytes, tree: Tree) -> None:
    # Store the hash alongside the tree
    self._store[abs_path] = (self.hash_source(source), tree)
```

The SQLite backend defers the re-parse until the tree is actually used:

```python
def get(self, abs_path: str, source: bytes) -> Optional[LazyTree]:
    row = db.get(abs_path)
    if row and row.content_hash == self.hash_source(source):
        self._hits += 1
        # Lazy load from SQLite on first access
        return LazyTree(lambda: self._parser.parse(row.source_bytes))
    self._misses += 1
    return None
```

ParserManager then treats both backends identically:

```python
def parse_file(self, file_path: str, source: Optional[bytes] = None,
               old_tree: Optional[Tree] = None) -> Optional[Tree]:
    # ... (read source, detect language)
    lazy = self._cache.get(abs_path, source)
    if lazy is not None:
        # LazyTree proxies every attribute access, so callers can use it as a Tree
        return lazy  # type: ignore[return-value]
    # Cold miss — parse and cache
    tree = parser.parse(source, old_tree) if old_tree else parser.parse(source)
    self._cache.put(abs_path, source, tree)
    return tree
```

To avoid re-wrapping on every hit, the in-memory store can hold the LazyTree instances themselves:

```python
class ParseCache:
    def __init__(self) -> None:
        self._store: dict[str, LazyTree] = {}
        # ...

    def get(self, abs_path: str, source: bytes) -> Optional[LazyTree]:
        # Return the SAME LazyTree instance on every hit
        lazy = self._store.get(abs_path)
        if lazy and lazy._hash == self.hash_source(source):  # _hash set by put()
            return lazy  # same instance, shared _tree cache
        return None
```

What do you think about such a solution, where the AST is given away lazily?
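To make the laziness concrete, here's a self-contained toy version of the LazyTree wrapper proposed above, using a stand-in loader (a plain string stands in for a tree_sitter.Tree), showing that the loader runs exactly once:

```python
from typing import Any, Callable, Optional


class LazyTree:
    """Toy version of the proxy: load on first attribute access, then cache."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._tree: Optional[Any] = None

    def __getattr__(self, name: str) -> Any:
        if self._tree is None:
            self._tree = self._loader()  # first access triggers the load
        return getattr(self._tree, name)


calls = []

def loader():
    calls.append(1)
    return "fake tree"  # stands in for a parsed tree_sitter.Tree

lazy = LazyTree(loader)
assert calls == []                  # constructing the wrapper loads nothing
assert lazy.upper() == "FAKE TREE"  # first attribute access invokes the loader
lazy.split()                        # later accesses reuse the cached value
assert calls == [1]                 # loader ran exactly once
```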

---

One more thing I wanted to flag — about separation of concerns:

```python
class ParseCache:
    def get(self, abs_path: str, source: bytes,
            loader: Optional[Callable[[], Tree]] = None) -> Optional[LazyTree]:
        """
        On cache hit: return LazyTree(loader).
        On cache miss: return None.

        The loader is provided by ParserManager, so ParseCache stays
        agnostic of how Trees are created.
        """
        entry = self._store.get(abs_path)
        if entry and entry.hash == self.hash_source(source):
            self._hits += 1
            return LazyTree(loader) if loader else None
        self._misses += 1
        return None
```

ParserManager supplies the loader when calling into the cache:

```python
def parse_file(self, file_path: str, source: Optional[bytes] = None) -> Optional[Tree]:
    # ... (read source, detect language)
    lazy = self._cache.get(abs_path, source,
                           loader=lambda: self._parsers[lang].parse(source))
    if lazy:
        return lazy  # type: ignore
    # Cold miss
    tree = parser.parse(source)
    self._cache.put(abs_path, source, tree)
    return tree
```

---

the loader approach is cleaner — ParseCache stays fully agnostic of tree-sitter, which is the right separation. two things i want to flag before implementing:

for the in-memory backend, if get() returns LazyTree(caller_loader) on a hit, the stored Tree in the dict is never used — every "hit" still re-parses, just without the disk I/O. to preserve the no-reparse guarantee, i think the in-memory put() should pre-load the LazyTree (lazy._tree = tree) and get() should return that same instance rather than wrapping the caller's loader. so the two backends diverge slightly in how they use loader — in-memory ignores it, SQLite uses it.

also caught this from the code:

if those two are fine with you i'll start implementing.

We can add a resolve: bool = False flag to parse_file() — when True, it forces resolution before returning (for workers). Default is False for the main process.

Summary:
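For context on why the resolve flag matters for worker processes — a LazyTree whose loader is a lambda can't cross a process boundary, because pickle refuses lambdas:

```python
import pickle

failed = False
try:
    # what sending an unresolved LazyTree holding a loader lambda would need
    pickle.dumps(lambda: 42)
except Exception:
    # pickle raises (PicklingError: Can't pickle <function <lambda> ...>)
    failed = True

assert failed  # hence workers must force eager resolution before returning
```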

---

@r0h1tb also, please rename the commits to follow the conventional types: feat: New feature

---

@r0h1tb Any progress on adding SQLite persistence to the parse cache? Happy to help with any part of it if you need a hand |

---

Hi @lexasub, been busy with my full-time work. I will have a look at it soon and keep you in the loop.
… loader injection

- Add LazyTree: a thin proxy that defers tree loading until first attribute access. The in-memory backend pre-populates _tree on put() so no re-parse ever occurs; SQLiteParseCache wraps the caller-supplied loader= for deferred construction. Call .resolve() to force eager loading before crossing process boundaries.
- Update ParseCache to store dict[str, LazyTree] and return the same pre-loaded instance on every cache hit, so all callers share one Tree object.
- Update ParseCache.get() to accept an optional loader= param for interface parity with SQLiteParseCache (the in-memory backend ignores it — the tree is already stored).
- Update ParserManager.parse_file() to pass loader=lambda to cache.get() so ParseCache stays fully agnostic of tree-sitter (per lexasub's review feedback).
- Add resolve: bool = False param to parse_file() — worker processes in ProcessPoolExecutor must pass resolve=True to avoid pickling lambdas.
- SQLiteParseCache: persistent backend backed by a local SQLite database that survives process restarts. Stores (file_path, content_hash, source_bytes). On a hit, returns LazyTree(loader), so the tree is lazily re-parsed from the stored bytes only on first attribute access.
- ParserManager: add a factory in __init__ to select the backend from config. Caller-supplied cache > config-driven > default in-memory. Set parse_cache.persistence_enabled = true to opt in.
- ast_rag_config.json: add persistence_enabled flag and db_path.
- tests/test_ast_cache.py: add comprehensive tests (TestLazyTree, TestSQLiteParseCache, TestParserManagerIntegration, TestParserManagerSQLite, TestWorkerResolve).

---

hey @lexasub — sorry for the delay, pushed the SQLite implementation just now. here's what the two new commits do:

the refactor commit pulls all the cache logic out of ParserManager into its own parse_cache.py file. ParserManager now just holds a cache instance and calls .get()/.put() — no hashing or counters sitting inside the parser itself. also went with your LazyTree suggestion — both backends return the same type, so callers don't need to know which backend is active.

the feat commit adds SQLiteParseCache alongside the existing in-memory ParseCache. the factory in ParserManager.__init__ reads persistence_enabled from config and picks the right backend automatically. you can also inject a cache directly via the constructor, which made testing much cleaner.

one thing worth flagging on the loader injection — the in-memory backend ignores the loader entirely, since the tree is already pre-populated on put(). the SQLite backend wraps it in a LazyTree so the actual parse only happens on first attribute access. also added resolve=True on parse_file() for the worker processes, since lambdas inside LazyTree can't be pickled across process boundaries.

for the LRU eviction — the last_accessed column and index are in the schema, and max_entries/max_size_mb are in the config, but i intentionally left the actual eviction logic out for now since you mentioned it fits better alongside the tree serialization work. the structure is ready for it though — it just needs the DELETE logic in put() when size exceeds the limit.

the next PR would cover: reading source_bytes back from SQLite so we actually skip the disk read on cache hits after a restart (right now we still read from disk to get the hash)
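The backend-selection precedence described here (caller-supplied > config-driven > default in-memory) can be sketched roughly like this — make_cache and the stub classes are illustrative names, not the project's actual code:

```python
class ParseCache:
    """Stand-in for the in-memory backend."""


class SQLiteParseCache:
    """Stand-in for the persistent backend."""

    def __init__(self, db_path):
        self.db_path = db_path


def make_cache(config, injected=None):
    # 1) an explicitly injected cache always wins (handy for tests)
    if injected is not None:
        return injected
    # 2) otherwise the config decides: parse_cache.persistence_enabled opts in
    pc = config.get("parse_cache", {})
    if pc.get("persistence_enabled"):
        return SQLiteParseCache(pc.get("db_path", ".ast_rag_cache.db"))
    # 3) default: plain in-memory cache
    return ParseCache()
```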

---

@r0h1tb merged, thanks for the clean refactor! The separation of concerns with parse_cache.py is much nicer now. Looking forward to the LRU + serialization follow-up. |

---

@r0h1tb the code in the main branch was refactored — you may need to rebase your LOCAL changes
when the same file is passed to parse_file() multiple times without any content change, we were calling the tree-sitter parser every single time. for large repos this adds up fast.
added a simple dict cache keyed by absolute file path. each entry stores a (sha256_hash, Tree) pair. on every parse_file() call we hash the source bytes and compare - if the hash matches we return the cached tree immediately, otherwise we parse fresh and update the slot.
the incremental parse path (old_tree parameter) is preserved as-is and still refreshes the cache slot with the newly produced tree.
two small helper methods:
also added tests/test_ast_cache.py covering hit, miss on content change, caller-supplied source bytes, cache clear, stats structure, and the incremental parse refresh path (11 tests, all passing)
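As a reader's sketch of the mechanism described above (class, method, and field names are assumed, not copied from the PR):

```python
import hashlib


class InMemoryParseCache:
    """Dict cache keyed by absolute path, storing (sha256_hex, tree) pairs."""

    def __init__(self):
        self._store = {}
        self._hits = 0
        self._misses = 0

    @staticmethod
    def hash_source(source: bytes) -> str:
        return hashlib.sha256(source).hexdigest()

    def get(self, abs_path: str, source: bytes):
        entry = self._store.get(abs_path)
        if entry and entry[0] == self.hash_source(source):
            self._hits += 1
            return entry[1]  # unchanged content: reuse the cached tree
        self._misses += 1
        return None          # caller parses fresh and calls put()

    def put(self, abs_path: str, source: bytes, tree) -> None:
        self._store[abs_path] = (self.hash_source(source), tree)
```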