refactor(indexing): eliminate Tier 1, add hash gate & cooldown, rename to AST/Embed #9

Merged
SerPeter merged 6 commits into main from refactor/two-tier-pipeline on Mar 3, 2026

SerPeter (Owner) commented Mar 3, 2026

Summary

  • Remove Tier 1 pass-through consumer: Tier1GraphConsumer converted FileChanged → ASTDirty with zero value-add. Pipeline simplified from 3-tier to 2-stage: Watcher → file-changed → AST → embed-dirty → Embed
  • Add file hash gate: SHA-256 content hash (with whitespace normalization) skips unchanged files before parsing. Stored on Module/Package nodes in Memgraph via batch read/write.
  • Add per-file cooldown: Throttles rapid re-processing of the same file in daemon mode (default 10s). Deferred events are re-published after the cooldown expires. Disabled in CLI reindex mode.
  • Rename Tier 2/3 → AST/Embed: Classes, consumer groups, log prefixes, span names, variables, and all documentation updated across 19 files.
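The per-file cooldown described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the class name `FileCooldown`, its methods, and the injectable `clock` parameter are all hypothetical. It assumes that a file seen again within `cooldown_s` seconds is deferred, and that the caller re-publishes deferred paths once they become ready.

```python
import time


class FileCooldown:
    """Hypothetical per-file throttle; names are illustrative, not from the PR."""

    def __init__(self, cooldown_s: float = 10.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._last_processed: dict[str, float] = {}
        self._deferred: dict[str, float] = {}  # path -> time at which it is ready

    def admit(self, path: str) -> bool:
        """Return True if the path may be processed now, otherwise defer it."""
        if self.cooldown_s <= 0:
            # Cooldown disabled (e.g. a one-shot CLI reindex).
            return True
        now = self._clock()
        last = self._last_processed.get(path)
        if last is not None and now - last < self.cooldown_s:
            self._deferred[path] = last + self.cooldown_s
            return False
        self._last_processed[path] = now
        return True

    def pop_ready(self) -> list[str]:
        """Return deferred paths whose cooldown has expired, for re-publishing."""
        now = self._clock()
        ready = sorted(p for p, t in self._deferred.items() if t <= now)
        for p in ready:
            del self._deferred[p]
        return ready
```

Passing `cooldown_s=0` models the "disabled in CLI reindex mode" case; a daemon loop would call `pop_ready()` periodically and publish the returned paths again as file-changed events.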

Commits

  1. refactor(indexing): remove Tier 1 consumer, simplify to two-tier pipeline
  2. feat(indexing): add file hash gate to skip unchanged files
  3. feat(indexing): add per-file cooldown for daemon mode
  4. test(indexing): add integration tests for two-tier pipeline, hash gate, and cooldown
  5. refactor(indexing): rename Tier 2/3 to AST/Embed stage across code and docs

Test plan

  • Unit tests pass (463 passed)
  • Integration tests pass (7 passed, including 5 new tests against live Memgraph + Valkey)
  • Ruff lint + format clean
  • ty check clean (2 pre-existing warnings only)

SerPeter added 5 commits March 3, 2026 20:22
…line

Tier 1 was a pure pass-through converting FileChanged → ASTDirty with
zero value-add. Remove it to reduce latency, eliminate an extra Valkey
stream hop, and simplify the architecture.

Before: Watcher → file-changed → Tier1 → ast-dirty → Tier2 → Tier3
After:  Watcher → file-changed → Tier2 → embed-dirty → Tier3

Compute SHA-256 of file contents before parsing and compare against
stored hashes in Memgraph. Files with matching hashes are skipped
entirely, avoiding unnecessary AST parsing and graph writes.

- strip_whitespace mode normalizes formatting before hashing so
  formatter-only changes (e.g. ruff format) are ignored
- Hash gate is bypassed for deleted files and full reindexes (where
  stored hashes are empty)
- Pre-read file bytes are passed to the parser to avoid double I/O
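As a rough illustration of the hash gate in this commit message, the sketch below hashes pre-read file bytes and keeps only files whose digest differs from the stored one. The function names and the dict-based store are hypothetical stand-ins for the Memgraph batch read. The whitespace normalization shown here (dropping all ASCII whitespace) is an assumption; it would also hide indentation-only changes, so the real normalization may well be more careful.

```python
import hashlib


def content_hash(data: bytes, strip_whitespace: bool = False) -> str:
    """Return the SHA-256 hex digest of file bytes, optionally ignoring whitespace."""
    if strip_whitespace:
        # bytes.split() with no argument splits on runs of ASCII whitespace,
        # so joining the pieces removes all of it (assumed normalization).
        data = b"".join(data.split())
    return hashlib.sha256(data).hexdigest()


def files_to_parse(
    current: dict[str, bytes],
    stored: dict[str, str],
    strip_whitespace: bool = False,
) -> dict[str, str]:
    """Return {path: new_hash} for files whose hash differs from the stored hash."""
    changed: dict[str, str] = {}
    for path, data in current.items():
        digest = content_hash(data, strip_whitespace)
        if stored.get(path) != digest:
            changed[path] = digest
    return changed
```

With an empty `stored` mapping every file passes the gate, which matches the full-reindex case described above. The returned digests are what a caller would write back in one batch after parsing.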
Copilot AI review requested due to automatic review settings March 3, 2026 22:28

Copilot AI left a comment

Pull request overview

Refactors the indexing pipeline from three tiers to two stages (AST → Embed), adding a content-hash gate to skip unchanged files and a per-file cooldown to throttle rapid reprocessing in daemon mode.

Changes:

  • Remove the Tier 1 pass-through consumer and wire FileChanged directly into the AST stage.
  • Add a SHA-256-based file hash gate (with optional whitespace normalization) persisted on Module/Package nodes in Memgraph.
  • Add per-file cooldown deferral/re-publish logic for daemon mode; rename Tier2/3 terminology to AST/Embed across code, tests, and docs.
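The two-stage flow summarized in these changes can be pictured with an in-memory toy. The event names `FileChanged` and `EmbedDirty` come from the PR; their fields, the `run_pipeline` function, and the use of plain deques in place of Valkey streams are assumptions made only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class FileChanged:
    path: str  # hypothetical field


@dataclass(frozen=True)
class EmbedDirty:
    entity: str  # hypothetical field


def run_pipeline(paths: list[str]) -> list[str]:
    """Push paths through file-changed -> AST -> embed-dirty -> Embed."""
    file_changed = deque(FileChanged(p) for p in paths)
    embed_dirty: deque[EmbedDirty] = deque()
    embedded: list[str] = []

    # AST stage: consumes FileChanged directly (no Tier 1 hop in between)
    # and emits EmbedDirty for whatever it parsed.
    while file_changed:
        event = file_changed.popleft()
        embed_dirty.append(EmbedDirty(entity=event.path))

    # Embed stage: consumes EmbedDirty and records the embedded entity.
    while embed_dirty:
        embedded.append(embed_dirty.popleft().entity)

    return embedded
```

The point of the toy is the topology: one queue between the watcher and the AST stage, one between AST and Embed, and nothing else.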

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/unit/search/test_embeddings.py | Updates embed consumer naming and doc-section strings for AST/Embed rename. |
| tests/integration/indexing/test_consumers.py | Adds integration tests for AST consumer, hash gate, and cooldown behavior. |
| tests/conftest.py | Renames “Tier3” wording to “embed stage” in NO_EMBED docstring. |
| src/code_atlas/settings.py | Adds index.file_hash_gate, index.strip_whitespace, and watcher.cooldown_s settings; updates embed settings wording. |
| src/code_atlas/search/embeddings.py | Updates doc-section breadcrumb example text for “AST Stage”. |
| src/code_atlas/indexing/orchestrator.py | Removes Tier1/Tier2 wiring; runs AST + optional Embed consumers; updates drain/publish logic. |
| src/code_atlas/indexing/daemon.py | Starts AST/Embed consumers and passes watcher cooldown into AST consumer. |
| src/code_atlas/indexing/consumers.py | Deletes Tier1; implements AST consumer hash gate + cooldown; renames Tier3 to Embed consumer. |
| src/code_atlas/indexing/__init__.py | Re-exports ASTConsumer/EmbedConsumer instead of Tier1/2/3. |
| src/code_atlas/graph/client.py | Adds batch read/write helpers for persisting file hashes on Module/Package nodes. |
| src/code_atlas/events.py | Removes ASTDirty event/topic; updates event union and EmbedDirty docstring. |
| scripts/profile_index.py | Renames profiling spans/labels from tier2/tier3 to ast/embed. |
| docs/guides/repo-guidelines.md | Updates example consumer name to ASTConsumer. |
| docs/benchmarks.md | Renames benchmark stage labels to AST/Embed. |
| docs/architecture.md | Updates pipeline diagrams and narrative to the two-stage AST/Embed pipeline. |
| docs/adr/0006-pure-python-tree-sitter.md | Updates ADR wording to AST consumer/stage naming. |
| docs/adr/0005-deployment-process-model.md | Updates deployment/process model diagrams and wording to AST/Embed. |
| docs/adr/0004-event-driven-tiered-pipeline.md | Updates ADR to describe the two-stage pipeline (FileChanged → AST → EmbedDirty → Embed). |
| CLAUDE.md | Updates repository architecture and event model documentation for two-stage pipeline. |
| CHANGELOG.md | Updates historical changelog wording to AST/Embed terminology. |
| .gitattributes | Adds repo-wide gitattributes for text/binary handling and LFS patterns. |

SerPeter merged commit 154c283 into main on Mar 3, 2026
7 checks passed
SerPeter deleted the refactor/two-tier-pipeline branch on March 3, 2026 23:52

2 participants