
cache parsed trees in ParserManager to skip re-parsing unchanged files#18

Merged
lexasub merged 4 commits into lexasub:main from r0h1tb:main
Mar 10, 2026

Conversation

@r0h1tb

@r0h1tb r0h1tb commented Feb 27, 2026

when the same file is passed to parse_file() multiple times without any content change, we were calling the tree-sitter parser every single time. for large repos this adds up fast.

added a simple dict cache keyed by absolute file path. each entry stores a (sha256_hash, Tree) pair. on every parse_file() call we hash the source bytes and compare - if the hash matches we return the cached tree immediately, otherwise we parse fresh and update the slot.

the incremental parse path (old_tree parameter) is preserved as-is and still refreshes the cache slot with the newly produced tree.

two small helper methods:

  • clear_tree_cache() - evict everything (useful after a full re-index)
  • tree_cache_stats() - returns hits/misses/size/hit_rate for debugging

also added tests/test_ast_cache.py covering hit, miss on content change, caller-supplied source bytes, cache clear, stats structure, and the incremental parse refresh path (11 tests, all passing)

Description

Briefly describe the changes in this PR.

Related Issue

Fixes #(issue number)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • My code follows the code style of this project
  • I have added tests that prove my fix/feature works
  • All new and existing tests passed (pytest tests/ -v)
  • I have updated the documentation accordingly
  • I have run ast-rag evaluate --all and quality is maintained
  • My changes generate no new warnings

Testing

Describe how you tested these changes:

```shell
# Example:
pytest tests/test_new_feature.py -v
ast-rag evaluate --all
```

Screenshots (if applicable)

Add screenshots or logs if relevant.

Additional Notes

Any other information that would help reviewers understand your PR.

@r0h1tb r0h1tb requested a review from lexasub as a code owner February 27, 2026 20:12
@lexasub
Owner

lexasub commented Feb 28, 2026

Great work on the parse-tree caching! 👍 The implementation is clean, tests are comprehensive, and this will definitely speed up repeated indexing.

One suggestion for a follow-up: currently the cache is in-memory only, so it's lost when the indexing utility restarts. For large repos, re-parsing everything on each ast-rag index run still adds up.

Could we add persistence (e.g., SQLite) so the cache survives restarts? Along with that, it would be good to have:
- Cache size limit (e.g., max entries or max MB)
- LRU eviction to prevent unbounded growth

This way, frequent re-indexing benefits from previously parsed trees without re-parsing unchanged files.

Happy to help with the implementation if you'd like!

@r0h1tb
Author

r0h1tb commented Feb 28, 2026

Hey,

Thanks for the detailed feedback. valid point — the current in-memory cache helps within a single run, but doesn't survive restarts, which is exactly the pain point for anyone running ast-rag index repeatedly on a large codebase.

SQLite sounds like the right fit here — lightweight, no server needed, and already in Python's stdlib. The main challenge is that tree-sitter Tree objects aren't directly serializable, but we can work around that by persisting the (file_path, content_hash, source_bytes) triplet and doing a lazy re-parse on cache load — so we still skip unchanged files even if we have to do the actual parse on first access after a restart.

For the size limit + LRU eviction, I was thinking a last_accessed timestamp column in SQLite and a configurable max_entries / max_size_mb in the config (there's already ast_rag_config.json in the project root that could absorb those settings).

Happy to take a crack at it as a follow-up PR if you're okay with the general direction. Would you prefer the SQLite file to sit alongside the existing .ast_rag_file_cache.json, or in a dedicated cache directory?

@r0h1tb
Author

r0h1tb commented Feb 28, 2026

moved all the cache stuff into its own parse_cache.py file.

ParserManager now just holds a ParseCache instance and calls .get()/.put() — no hashing or counters sitting inside the parser.

also made the cache injectable via the constructor so switching to sqlite later is just passing a different class in, no changes needed to ParserManager itself.

threw in some config placeholders for max_entries/max_size_mb while i was at it for the lru follow-up.

@lexasub
Owner

lexasub commented Feb 28, 2026

Nice! The ParseCache extraction is exactly what I had in mind 👍

A few thoughts before you dive into the SQLite backend:

  1. Persistence strategy:
    For a first pass, we could store (file_path, content_hash, source_bytes) in SQLite and re-parse on cache load. Yes, this still means parsing on each restart, but we still get the benefit of not re-reading files from disk and having a stable cache across runs.

    Serialization of Tree objects (via tree.root_node.sexp() or a binary format) is non-trivial and would benefit from its own dedicated PR with proper benchmarking — happy to review that as a follow-up.

  2. LRU design:
    Your last_accessed timestamp idea is solid. For the initial SQLite impl, we can keep it simple and just add the column — full LRU eviction logic can be part of the serialization PR.

  3. Config integration:
    Love that you added parse_cache to ast_rag_config.json. Suggest we also add a persistence_enabled: bool flag so users can opt out if they prefer ephemeral caching.

  4. Interface compatibility:
    The cache= parameter in ParserManager.__init__ is perfect — makes the SQLite backend a drop-in replacement. Just ensure the SQLite class implements .get(), .put(), .evict(), .clear(), and .stats() with identical signatures.
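For illustration, the parse_cache block in ast_rag_config.json could look something like this (key names and values are tentative — max_entries/max_size_mb from your placeholders, the others suggested above):

```json
{
  "parse_cache": {
    "persistence_enabled": false,
    "db_path": ".ast_rag_parse_cache.sqlite",
    "max_entries": 10000,
    "max_size_mb": 256
  }
}
```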

On your questions:
- SQLite location: I'd put it alongside .ast_rag_file_cache.json for consistency — maybe .ast_rag_parse_cache.sqlite?
- Next steps: Go ahead with the SQLite impl. I'd suggest:
  1. Keep ParseCache as the default in-memory backend
  2. Add SQLiteParseCache in the same module
  3. Factory function in ParserManager to pick based on config

Let me know if you want me to review the SQLite schema before you implement!

@r0h1tb
Author

r0h1tb commented Feb 28, 2026

yeah would love the schema review before jumping in — here's what i'm thinking:
```sql
CREATE TABLE parse_cache (
    file_path     TEXT PRIMARY KEY,
    content_hash  TEXT NOT NULL,
    source_bytes  BLOB NOT NULL,
    last_accessed REAL NOT NULL  -- unix timestamp, for future LRU
);
CREATE INDEX idx_last_accessed ON parse_cache(last_accessed);
```
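for reference, the eviction pass this column enables could be a single DELETE — a rough sketch (table/column names from the schema above, helper names made up):

```python
import sqlite3
import time


def evict_lru(conn: sqlite3.Connection, max_entries: int) -> int:
    """Delete the oldest rows beyond max_entries; returns number evicted.
    Sketch only — assumes the parse_cache schema proposed above."""
    (count,) = conn.execute("SELECT COUNT(*) FROM parse_cache").fetchone()
    excess = count - max_entries
    if excess <= 0:
        return 0
    conn.execute(
        """DELETE FROM parse_cache WHERE file_path IN (
               SELECT file_path FROM parse_cache
               ORDER BY last_accessed ASC LIMIT ?)""",
        (excess,),
    )
    conn.commit()
    return excess


def touch(conn: sqlite3.Connection, file_path: str) -> None:
    """Bump last_accessed on a cache hit so the LRU order stays current."""
    conn.execute("UPDATE parse_cache SET last_accessed = ? WHERE file_path = ?",
                 (time.time(), file_path))
    conn.commit()
```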

one thing i want to flag before implementing: the current get() interface returns a Tree object. SQLite can't store Trees, only the source bytes. so a SQLite "hit" would mean: hash matches → skip disk I/O, re-parse from the stored bytes. but get() still technically returns None (no cached Tree), just with source bytes pre-fetched.

i'm thinking parse_file() in ParserManager would call a separate get_source_bytes(abs_path, content_hash) → bytes | None on the SQLite backend to avoid the disk read, then always parse. this keeps the get() signature identical across both backends.

does that approach make sense, or do you see a cleaner way to handle it?

@lexasub
Owner

lexasub commented Feb 28, 2026

Before you dive into the SQLite implementation, I'd like to propose an architectural pattern that solves the "different return types" problem elegantly:

---                                                                    

Proposed Architecture: LazyTree Wrapper
                                                                           
The problem: In-memory get() returns Tree, but SQLite can only return bytes. This forces either:
 - Different interfaces per backend, or
 - Returning None from SQLite even on a "hit"
                                                                           
The solution: A lazy wrapper that loads on first access:
```python
from typing import Any, Callable, Optional

from tree_sitter import Tree


class LazyTree:
    """
    A thin wrapper that defers tree loading until first attribute access.

    Both in-memory and SQLite backends return LazyTree | None.
    On first attribute/method access, the wrapped loader is invoked.
    Subsequent accesses use the cached Tree.

    Usage::
        lazy = cache.get(path, source)
        if lazy:
            root = lazy.root_node      # triggers load on first access
            children = root.children   # subsequent accesses are free
    """

    def __init__(self, loader: Callable[[], Tree]) -> None:
        self._loader = loader
        self._tree: Optional[Tree] = None

    def _ensure(self) -> None:
        """Load the Tree on first access."""
        if self._tree is None:
            self._tree = self._loader()

    def __getattr__(self, name: str) -> Any:
        """Delegate all attribute access to the underlying Tree."""
        self._ensure()
        return getattr(self._tree, name)

    def __repr__(self) -> str:
        self._ensure()
        return f"<LazyTree: {self._tree!r}>"
```
How Both Backends Use It

In-Memory (`ParseCache`):
```python
    def get(self, abs_path: str, source: bytes) -> Optional[LazyTree]:
        entry = self._store.get(abs_path)
        if entry and entry[0] == self.hash_source(source):
            self._hits += 1
            # Tree is already loaded; the LazyTree wrapper is cheap
            return LazyTree(lambda: entry[1])
        self._misses += 1
        return None

    def put(self, abs_path: str, source: bytes, tree: Tree) -> None:
        # Store a (hash, tree) pair under the absolute path
        self._store[abs_path] = (self.hash_source(source), tree)
```
SQLite (`SQLiteParseCache`):
```python
    def get(self, abs_path: str, source: bytes) -> Optional[LazyTree]:
        row = db.get(abs_path)
        if row and row.content_hash == self.hash_source(source):
            self._hits += 1
            # Lazy: re-parse from stored bytes on first access
            return LazyTree(lambda: self._parser.parse(row.source_bytes))
        self._misses += 1
        return None
```
`ParserManager.parse_file()` — unchanged interface:
```python
    def parse_file(self, file_path: str, source: Optional[bytes] = None,
                   old_tree: Optional[Tree] = None) -> Optional[Tree]:
        # ... (read source, detect language)

        lazy = self._cache.get(abs_path, source)
        if lazy is not None:
            # LazyTree behaves like a Tree via attribute delegation
            return lazy  # type: ignore[return-value]

        # Cold miss — parse and cache
        tree = parser.parse(source, old_tree) if old_tree else parser.parse(source)
        self._cache.put(abs_path, source, tree)
        return tree
```
Why This Works


┌───────────────────┬─────────────────────────────────────────┐
│ Aspect            │ Benefit                                 │
├───────────────────┼─────────────────────────────────────────┤
│ Unified interface │ Both backends return `LazyTree | None`  │
│ Transparent usage │ Callers use it like a normal Tree       │
│ Lazy loading      │ SQLite doesn't parse until first access │
│ No API changes    │ parse_file() signature stays the same   │
└───────────────────┴─────────────────────────────────────────┘

---

Important: Caching the LazyTree Instance

To avoid re-parsing on multiple get() calls, cache the LazyTree itself:
```python
class ParseCache:
    def __init__(self) -> None:
        self._store: dict[str, LazyTree] = {}
        # ...

    def get(self, abs_path: str, source: bytes) -> Optional[LazyTree]:
        # Return the SAME LazyTree instance on every hit
        # (_hash is set on the wrapper alongside put())
        lazy = self._store.get(abs_path)
        if lazy and lazy._hash == self.hash_source(source):
            return lazy  # same instance, shared _tree cache
        return None
```
This ensures:

```python
lazy1 = cache.get(path, source)
lazy2 = cache.get(path, source)
assert lazy1 is lazy2  # same instance

lazy1.root_node  # parses once
lazy2.root_node  # uses cached _tree, no re-parse!
```

---

Revised Schema Suggestion

With this approach, the schema can stay simple for now (your suggestion):

 
Future additions (can be separate PR) :
 - size_bytes INTEGER — for max_size_mb limits
 - cache_version INTEGER DEFAULT 1 — for format migrations
 - Serialized tree column (when we figure out tree-sitter serialization)

---

Next Steps

If this architecture looks good:

 1. Add LazyTree class to ast_rag/parse_cache.py
 2. Update ParseCache to return LazyTree
 3. Implement SQLiteParseCache with the same interface
 4. Add factory in ParserManager to pick backend based on config

Let me know if you'd like me to sketch out the LazyTree tests or the SQLite factory logic!

what do you think about a solution like this, where the AST is handed out lazily?

@lexasub
Owner

lexasub commented Feb 28, 2026

One more thing I wanted to flag — about separation of concerns:

Looking at the SQLite implementation, there's a potential issue: if SQLiteParseCache.get() needs to return a Tree, it would have to know about the tree-sitter Parser to call .parse() on the loaded bytes. This couples the cache layer to the parser, which breaks encapsulation.

Cleaner approach: Pass the loader from the caller:
```python
class ParseCache:
    def get(self, abs_path: str, source: bytes,
            loader: Optional[Callable[[], Tree]] = None) -> Optional[LazyTree]:
        """
        On cache hit: return LazyTree(loader).
        On cache miss: return None.

        The loader is provided by ParserManager, so ParseCache stays
        agnostic of how Trees are created.
        """
        entry = self._store.get(abs_path)
        if entry and entry.hash == self.hash_source(source):
            self._hits += 1
            return LazyTree(loader) if loader else None
        self._misses += 1
        return None
```
Usage in `ParserManager`:
```python
    def parse_file(self, file_path: str, source: Optional[bytes] = None) -> Optional[Tree]:
        # ... (read source, detect language)

        lazy = self._cache.get(abs_path, source,
                               loader=lambda: self._parsers[lang].parse(source))
        if lazy:
            return lazy  # type: ignore

        # Cold miss
        tree = parser.parse(source)
        self._cache.put(abs_path, source, tree)
        return tree
```
Revised architecture:

 ParserManager (knows about parsers)
         │
         │ provides loader=
         ▼
 ParseCache (knows about storage only)
         │
         │ returns LazyTree(loader)
         ▼
  LazyTree (defers execution)

@r0h1tb
Author

r0h1tb commented Feb 28, 2026

the loader approach is cleaner — ParseCache stays fully agnostic of tree-sitter, which is the right separation.

two things i want to flag before implementing:

for the in-memory backend, if get() returns LazyTree(caller_loader) on a hit, the stored Tree in the dict is never used — every "hit" still re-parses, just without disk I/O. to preserve the no-reparse guarantee, i think in-memory put() should pre-load the LazyTree (lazy._tree = tree) and get() should return the same instance rather than wrapping the caller's loader. so the two backends diverge slightly in how they use loader — in-memory ignores it, SQLite uses it.

also caught this from the code:
cli.py's index_folder command uses ProcessPoolExecutor and the results of parse_file() cross process boundaries. lambdas inside LazyTree can't be pickled, so any LazyTree returned by parse_file() would fail there. probably worth thinking about how to handle that path — maybe ParseCache just resolves eagerly when created inside a subprocess worker?
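quick repro of the pickling constraint, no tree-sitter needed (class and attribute names made up):

```python
import pickle


class LazyProxy:
    """Stand-in for LazyTree: holds a lambda loader, like the sketch above."""

    def __init__(self, loader):
        self._loader = loader


proxy = LazyProxy(loader=lambda: "parsed tree")

try:
    # ProcessPoolExecutor does this implicitly when returning results
    pickle.dumps(proxy)
    picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    picklable = False

# The lambda stored inside the proxy makes the whole object unpicklable,
# so workers have to resolve to a plain value before returning.
```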

if those two are fine with you i'll start implementing.

@lexasub
Owner

lexasub commented Feb 28, 2026

  1. In-memory backend: You're right — it should return the already-cached Tree without re-parsing. So LazyTree._tree is pre-populated on put(), and get() returns the same resolved instance. The loader pattern is SQLite-only.

  2. Multiprocessing: LazyTree with lambdas can't be pickled. Simplest solution: LazyTree is an in-process optimization only. Worker processes in ProcessPoolExecutor should resolve eagerly before returning.

We can add a resolve: bool = False flag to parse_file() — when True, it forces resolution before returning (for workers). Default is False for main process.

Summary:
- In-memory: LazyTree already resolved (_tree set)
- SQLite: LazyTree defers via loader
- Workers: call parse_file(resolve=True) to get plain Tree
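For completeness, the worker-side resolution could look roughly like this (a sketch only — this minimal LazyTree mirrors the earlier one, and the resolve() helper is illustrative, not the actual implementation):

```python
from typing import Any, Callable, Optional


class LazyTree:
    """Minimal LazyTree as sketched earlier in the thread."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._tree: Optional[Any] = None

    def _ensure(self) -> None:
        if self._tree is None:
            self._tree = self._loader()


def resolve(obj: Any) -> Any:
    """Return a plain, picklable object from either a LazyTree or a Tree."""
    if isinstance(obj, LazyTree):
        obj._ensure()       # force the deferred parse now
        return obj._tree    # plain object, safe to cross process boundaries
    return obj


# Inside a ProcessPoolExecutor worker, parse_file(resolve=True) would do
# the equivalent of: return resolve(lazy_or_tree)
```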

@github-project-automation github-project-automation bot moved this to Backlog in raged kanban Feb 28, 2026
@lexasub lexasub moved this from Backlog to In progress in raged kanban Feb 28, 2026
@lexasub lexasub self-assigned this Feb 28, 2026
@lexasub
Owner

lexasub commented Feb 28, 2026

@r0h1tb also, please rename the commits to
refactor(parsing_cache), feat(parsing) (and maybe squash them)

```
<type>(<scope>): <description>

[optional body]
```

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation changes
  • style: Code style changes (formatting)
  • refactor: Code refactoring
  • test: Test additions/changes
  • chore: Build/config changes

@lexasub
Owner

lexasub commented Mar 8, 2026

@r0h1tb Any progress on adding SQLite persistence to the parse cache? Happy to help with any part of it if you need a hand

@r0h1tb
Author

r0h1tb commented Mar 9, 2026

Hi @lexasub, I've been busy with my full-time work.

I will have a look at it soon.

Will keep you in the loop.

… loader injection

- Add LazyTree: thin proxy that defers tree loading until first attribute access.
  In-memory backend pre-populates _tree on put() so no re-parse ever occurs.
  SQLiteParseCache wraps caller-supplied loader= for deferred construction.
  Call .resolve() to force eager loading before crossing process boundaries.

- Update ParseCache to store dict[str, LazyTree] and return the same pre-loaded
  instance on every cache hit so all callers share one Tree object.

- Update ParseCache.get() to accept optional loader= param for interface parity
  with SQLiteParseCache (in-memory backend ignores it — tree already stored).

- Update ParserManager.parse_file() to pass loader=lambda to cache.get() so
  ParseCache stays fully agnostic of tree-sitter (per lexasub's review feedback).

- Add resolve: bool = False param to parse_file() — worker processes in
  ProcessPoolExecutor must pass resolve=True to avoid pickling lambdas.
- SQLiteParseCache: persistent backend backed by a local SQLite database that
  survives process restarts. Stores (file_path, content_hash, source_bytes).
  On hit, returns LazyTree(loader) so tree is re-parsed lazily from stored
  bytes only when first accessed — no parsing until first attribute access.

- ParserManager: add factory in __init__ to select backend from config.
  Caller-supplied cache > config-driven > default in-memory.
  Set parse_cache.persistence_enabled = true to opt-in.

- ast_rag_config.json: add persistence_enabled flag and db_path.

- tests/test_ast_cache.py: add comprehensive tests (TestLazyTree, TestSQLiteParseCache,
  TestParserManagerIntegration, TestParserManagerSQLite, TestWorkerResolve).
@r0h1tb
Author

r0h1tb commented Mar 9, 2026

hey @lexasub — sorry for the delay, pushed the SQLite implementation just now.

here's what the two new commits do:

the refactor commit pulls all the cache logic out of ParserManager into its own parse_cache.py file. ParserManager now just holds a cache instance and calls .get()/.put() — no hashing or counters sitting inside the parser itself. also went with your LazyTree suggestion — both backends return the same type, callers don't need to know which backend is active.

the feat commit adds SQLiteParseCache alongside the existing in-memory ParseCache. the factory in ParserManager.__init__ reads persistence_enabled from config and picks the right backend automatically. you can also inject a cache directly via the constructor, which made testing much cleaner.

one thing worth flagging on the loader injection — the in-memory backend ignores the loader entirely since the tree is already pre-populated on put(). the SQLite backend wraps it in a LazyTree so the actual parse only happens on first attribute access. also added resolve=True on parse_file() for the worker processes since lambdas inside LazyTree can't be pickled across process boundaries.

for the LRU eviction — last_accessed column and index are in the schema, and max_entries/max_size_mb are in the config, but i intentionally left the actual eviction logic out for now since you mentioned that fits better alongside the tree serialization work. the structure is ready for it though, just needs the DELETE logic in put() when size exceeds the limit.

next PR would cover:

  • reading source_bytes back from SQLite so we actually skip the disk read on cache hits after restart (right now we still read from disk to get the hash)
  • enforcing max_entries and max_size_mb with LRU eviction
  • tree serialization once we figure out the right format

let me know if anything needs changing before you merge!

@lexasub lexasub merged commit 570ee9a into lexasub:main Mar 10, 2026
@github-project-automation github-project-automation bot moved this from In progress to Done in raged kanban Mar 10, 2026
@lexasub
Owner

lexasub commented Mar 10, 2026

@r0h1tb merged, thanks for the clean refactor! The separation of concerns with parse_cache.py is much nicer now. Looking forward to the LRU + serialization follow-up.

@lexasub
Owner

lexasub commented Mar 13, 2026

@r0h1tb code in the main branch was refactored — you may need to rebase your LOCAL changes
