Merged
84 changes: 84 additions & 0 deletions .gitattributes
@@ -0,0 +1,84 @@
# Default
# ==================
* text=auto eol=lf

# Python Source files
# =================
*.pxd text diff=python
*.py text diff=python
*.py3 text diff=python
*.pyw text diff=python
*.pyx text diff=python
*.pyz text diff=python
*.pyi text diff=python

# Python Binary files
# =================
*.db binary
*.p binary
*.pkl binary
*.pickle binary
*.pyc binary export-ignore
*.pyo binary export-ignore
*.pyd binary

# Jupyter notebook
# =================
*.ipynb text

# ML models
# =================
*.h5 filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
unigram.json filter=lfs diff=lfs merge=lfs -text

# Data files
# =================
*.csv filter=lfs diff=lfs merge=lfs -text
*.tsv filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text

# Document files
# =================
*.pptx filter=lfs diff=lfs merge=lfs -text
*.docx filter=lfs diff=lfs merge=lfs -text
*.xlsx filter=lfs diff=lfs merge=lfs -text
*.xls filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text

# Archives
# =================
*.7z filter=lfs diff=lfs merge=lfs -text
*.br filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text

# Image files
# =================
*.jpg filter=lfs diff=lfs merge=lfs -text
*.jpeg filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.gif filter=lfs diff=lfs merge=lfs -text
*.webp filter=lfs diff=lfs merge=lfs -text
*.bmp filter=lfs diff=lfs merge=lfs -text
*.svg filter=lfs diff=lfs merge=lfs -text
*.tiff filter=lfs diff=lfs merge=lfs -text

# Other
# =================
*.exe filter=lfs diff=lfs merge=lfs -text

# Windows scripts - keep CRLF
*.bat text eol=crlf
*.cmd text eol=crlf
*.ps1 text eol=crlf
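As a rough illustration of how these rules resolve, patterns are evaluated top to bottom and the last matching line wins for each attribute. The matcher below is a hypothetical simplification (real git matching also handles `**`, path anchoring, and macro attributes), with a hand-picked subset of the rules above:

```python
from fnmatch import fnmatch

# Hypothetical, simplified resolver: a subset of the rules above, in file
# order. Later matching lines override earlier ones per attribute.
RULES = [
    ("*",      {"text": "auto", "eol": "lf"}),
    ("*.py",   {"text": "set", "diff": "python"}),
    ("*.pkl",  {"binary": "set"}),
    ("*.onnx", {"filter": "lfs", "diff": "lfs", "merge": "lfs", "text": "unset"}),
    ("*.bat",  {"text": "set", "eol": "crlf"}),
]

def attributes_for(path: str) -> dict:
    attrs: dict = {}
    for pattern, updates in RULES:
        if fnmatch(path, pattern):
            attrs.update(updates)  # last matching line wins
    return attrs
```

For example, `attributes_for("model.onnx")` resolves to the LFS filter with `text` unset, while `attributes_for("setup.bat")` keeps CRLF line endings.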
6 changes: 3 additions & 3 deletions CHANGELOG.md
@@ -21,7 +21,7 @@
- **ci**: Resolve ty check failures with --all-extras in CI
([`3f74816`](https://github.com/SerPeter/code-atlas/commit/3f7481635091d2d676aed75c3fbcaa5db4332242))

- **consumers**: Group batches by project in Tier1/Tier2
- **consumers**: Group batches by project in AST/Embed consumers
([#2](https://github.com/SerPeter/code-atlas/pull/2),
[`5107b24`](https://github.com/SerPeter/code-atlas/commit/5107b24a7dfbcb44cadc7917f632ae6a9743c057))

@@ -190,7 +190,7 @@
- **docs**: Add markdown parser with tree-sitter-markdown
([`e8d372c`](https://github.com/SerPeter/code-atlas/commit/e8d372c162652d6d73d1f66da5e14a61fcb2136a))

- **embeddings**: Add EmbedClient with litellm routing and Tier 3 pipeline
- **embeddings**: Add EmbedClient with litellm routing and embed pipeline
([`ad7c972`](https://github.com/SerPeter/code-atlas/commit/ad7c9726f2e48fdb8746b50547089c5c483bcb75))

- **embeddings**: Add three-tier embedding cache with Valkey backend
@@ -241,7 +241,7 @@
- **naming**: Worktree-aware naming and monorepo sub-project prefixing
([`2acdfb3`](https://github.com/SerPeter/code-atlas/commit/2acdfb33ba4b486f966272a01cf8a37f670661f6))

- **parser**: Add py-tree-sitter parser, implement Tier 2 pipeline, drop Rust
- **parser**: Add py-tree-sitter parser, implement AST pipeline, drop Rust
([`d56e7d2`](https://github.com/SerPeter/code-atlas/commit/d56e7d2a686ec279a52d85bbc4903f4d85f51a4e))

- **parsing**: Add multi-language support (10 languages, 7 modules)
8 changes: 4 additions & 4 deletions CLAUDE.md
@@ -51,7 +51,7 @@ src/code_atlas/
├── __init__.py # __version__ only
├── schema.py # Graph schema (labels, relationships, DDL generators)
├── settings.py # Pydantic configuration (atlas.toml + env vars)
├── events.py # Event types (FileChanged, ASTDirty, EmbedDirty) + Valkey Streams EventBus
├── events.py # Event types (FileChanged, EmbedDirty) + Valkey Streams EventBus
├── telemetry.py # OpenTelemetry integration
├── cli.py # Typer CLI entrypoint (index, search, status, mcp, daemon commands)
@@ -69,7 +69,7 @@ src/code_atlas/
├── indexing/
│ ├── orchestrator.py # Full-index, monorepo detection, staleness checking
│ ├── consumers.py # Tier 1/2/3 event consumers (batch-pull pattern)
│ ├── consumers.py # AST + Embed event consumers (batch-pull pattern)
│ ├── watcher.py # Filesystem watcher (watchfiles + hybrid debounce)
│ └── daemon.py # Daemon lifecycle manager (watcher + pipeline)
@@ -78,13 +78,13 @@
└── health.py # Infrastructure health checks + diagnostics
```

**Event Pipeline:** File Watcher → Valkey Streams → Tier 1 (graph metadata) → Tier 2 (AST diff + gate) → Tier 3 (embeddings) → Memgraph
**Event Pipeline:** File Watcher → Valkey Streams → AST stage (hash gate + parse + diff) → Embed stage (embeddings) → Memgraph

**Query Pipeline:** MCP Server → Query Router → [Graph Search | Vector Search | BM25 Search] → RRF Fusion → Results

**Deployment:** Daemon (`atlas daemon start`) for indexing + MCP (`atlas mcp`) per agent session, decoupled via Valkey + Memgraph

**Event model:** Events are atomic — one logical change per event (one file per ASTDirty, one entity per EmbedDirty). Never bundle lists of work items into a single event; use `EventBus.publish_many()` for network-efficient batch publishing. The consumer's `max_batch_size` must directly control work volume, not just message count.
**Event model:** Events are atomic — one logical change per event (one file per FileChanged, one entity per EmbedDirty). Never bundle lists of work items into a single event; use `EventBus.publish_many()` for network-efficient batch publishing. The consumer's `max_batch_size` must directly control work volume, not just message count.
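A minimal sketch of the atomic-event rule, with illustrative field names (EmbedDirty's exact fields are assumptions here): each event carries exactly one work item, and batch publishing only amortizes the network round-trip, not the work accounting.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbedDirty:
    qualified_name: str   # exactly one entity per event — never a list
    significance: str

def publish_many(events: list) -> int:
    # Stand-in for EventBus.publish_many(): one pipelined round-trip,
    # but still one stream entry per event, so a consumer's
    # max_batch_size caps actual work items, not just messages.
    return len(events)    # number of stream entries written

events = [EmbedDirty(f"pkg.mod.fn{i}", "minor") for i in range(5)]
assert publish_many(events) == 5   # five entries, one round-trip
```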

**Infrastructure:** Memgraph (graph DB, port 7687), TEI (embeddings, port 8080), Valkey (event bus, port 6379)

67 changes: 32 additions & 35 deletions docs/adr/0004-event-driven-tiered-pipeline.md
@@ -45,53 +45,50 @@ Redis Streams provide the pub/sub backbone with consumer groups:
Typed frozen dataclasses with JSON serialization for Redis transport:

- `FileChanged(path, change_type, timestamp)` — published by file watcher
- `ASTDirty(paths, batch_id)` — published by Tier 1
- `EmbedDirty(entities: list[EntityRef], significance, batch_id)` — published by Tier 2
- `EmbedDirty(entities: list[EntityRef], significance, batch_id)` — published by AST stage
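The typed-event pattern can be sketched as follows; the serialization helpers are illustrative, not the project's actual API:

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)   # frozen: events are immutable once published
class FileChanged:
    path: str
    change_type: str
    timestamp: float

    def to_json(self) -> str:
        # Flat JSON payload suitable for a Redis/Valkey stream field
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "FileChanged":
        return cls(**json.loads(raw))

evt = FileChanged("src/app.py", "modified", 1700000000.0)
assert FileChanged.from_json(evt.to_json()) == evt  # lossless round-trip
```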

### Three-Stream Pipeline
### Two-Stage Pipeline

```
                   atlas:file-changed        atlas:ast-dirty          atlas:embed-dirty
                         stream                   stream                   stream
                           │                        │                        │
                    ┌──────▼───────┐         ┌──────▼───────┐         ┌──────▼───────┐
File Watcher ────►  │    Tier 1    │ ──────► │    Tier 2    │ ─gate─► │    Tier 3    │
                    │Graph Metadata│  always │  AST Diff +  │  only   │  Embeddings  │
                    │ (0.5s batch) │         │ Graph Update │  if sig │ (15s batch)  │
                    └──────────────┘         │  (3s batch)  │  change └──────────────┘
                                             └──────────────┘

                   atlas:file-changed                            atlas:embed-dirty
                         stream                                       stream
                           │                                            │
                    ┌──────▼───────┐                             ┌──────▼───────┐
File Watcher ────►  │  AST Stage   │ ── significance gate ─────► │ Embed Stage  │
                    │  hash gate + │    only if semantically     │  Embeddings  │
                    │ parse + diff │    changed                  │ (15s batch)  │
                    │  (3s batch)  │                             └──────────────┘
                    └──────────────┘
```
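The staged pipeline above can be mimicked with two queue-fed coroutines — a toy sketch in which `asyncio.Queue` stands in for a Valkey stream and the file-extension check stands in for the significance gate (all names here are illustrative):

```python
import asyncio

async def ast_stage(files: asyncio.Queue, entities: asyncio.Queue) -> None:
    while True:
        path = await files.get()
        if path.endswith(".py"):                # stand-in significance gate
            await entities.put(f"{path}::entity")
        files.task_done()

async def embed_stage(entities: asyncio.Queue, results: list) -> None:
    while True:
        entity = await entities.get()
        results.append(entity)                  # stand-in for embedding work
        entities.task_done()

async def run_pipeline(paths: list) -> list:
    files: asyncio.Queue = asyncio.Queue()
    entities: asyncio.Queue = asyncio.Queue()
    results: list = []
    tasks = [asyncio.create_task(ast_stage(files, entities)),
             asyncio.create_task(embed_stage(entities, results))]
    for p in paths:
        await files.put(p)
    await files.join()       # AST stage has drained its stream
    await entities.join()    # Embed stage has drained its stream
    for t in tasks:
        t.cancel()
    return results
```

Running `asyncio.run(run_pipeline(["a.py", "notes.txt", "b.py"]))` yields `["a.py::entity", "b.py::entity"]` — the non-Python file is gated out before the embed stage ever sees it.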

Each tier pulls at its own pace via `XREADGROUP`, deduplicates within its batch window, and publishes downstream only if
warranted.
Each stage pulls at its own pace via `XREADGROUP`, deduplicates within its batch window, and publishes downstream only
if warranted.
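The within-batch deduplication can be sketched as a pure function over one pulled batch — one work item per dedup key (file path for the AST stage), with the latest message winning:

```python
def dedupe_batch(messages: list) -> list:
    # Collapse a batch pulled in one window to one work item per path;
    # the latest message for a path wins ("same file changed 5x = 1 item").
    latest: dict = {}
    for msg in messages:          # messages arrive in stream order
        latest[msg["path"]] = msg
    return list(latest.values())  # insertion order preserved

batch = [
    {"path": "a.py", "change_type": "modified"},
    {"path": "b.py", "change_type": "modified"},
    {"path": "a.py", "change_type": "deleted"},  # supersedes the first a.py
]
assert len(dedupe_batch(batch)) == 2
```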

### Per-Consumer Batch Policy

| Tier | Window | Max Batch | Dedup Key |
| -------------- | ------ | --------- | --------------------- |
| Tier 1 (Graph) | 0.5s | 50 | File path |
| Tier 2 (AST) | 3.0s | 20 | File path |
| Tier 3 (Embed) | 15.0s | 100 | Entity qualified name |
| Stage | Window | Max Batch | Dedup Key |
| ----- | ------ | --------- | --------------------- |
| AST | 3.0s | 30 | File path |
| Embed | 15.0s | 100 | Entity qualified name |

Hybrid batching: flush when count OR time threshold hit, whichever first. Same file changed 5× in window = 1 work item.
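A minimal hybrid batcher looks roughly like this (class and method names are illustrative, not the project's API): flush when either the count threshold or the time window is hit, whichever comes first.

```python
import time

class HybridBatcher:
    """Flush when max_batch items accumulate OR window_s elapses."""

    def __init__(self, window_s: float, max_batch: int) -> None:
        self.window_s = window_s
        self.max_batch = max_batch
        self.items: list = []
        self.opened_at = 0.0

    def add(self, item) -> None:
        if not self.items:                       # first item opens the window
            self.opened_at = time.monotonic()
        self.items.append(item)

    def should_flush(self) -> bool:
        if not self.items:
            return False
        return (len(self.items) >= self.max_batch
                or time.monotonic() - self.opened_at >= self.window_s)

    def flush(self) -> list:
        batch, self.items = self.items, []
        return batch
```

With `window_s=3.0, max_batch=30` (the AST row above), a burst of 30 events flushes immediately on count; a trickle of 2 events flushes when the 3-second window expires.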

### Event Data Flow

```
  FileChanged                 ASTDirty                       EmbedDirty
┌─────────────┐           ┌──────────────────┐           ┌──────────────────────────┐
│ path: str   │           │ paths: [str]     │           │ entities: [EntityRef]    │
│ change_type │ ─Tier 1─► │ batch_id: str    │ ─Tier 2─► │ significance: str        │
│ timestamp   │           └──────────────────┘    gate   │ batch_id: str            │
└─────────────┘                                          └──────────────────────────┘
                                                           EntityRef:
                                                             qualified_name, node_type,
                                                             file_path

  FileChanged                                            EmbedDirty
┌─────────────┐                                        ┌──────────────────────────┐
│ path: str   │                                        │ entity: EntityRef        │
│ change_type │ ─── AST stage ── sig gate ───────────► │ significance: str        │
│ timestamp   │                                        └──────────────────────────┘
└─────────────┘                                          EntityRef:
                                                           qualified_name, node_type,
                                                           file_path
```

### Significance Gating (Tier 2 → 3)
### Significance Gating (AST → Embed)

Tier 2 evaluates whether a change is semantically significant enough to warrant re-embedding:
The AST stage evaluates whether a change is semantically significant enough to warrant re-embedding:
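The gate itself can be sketched as a pure function from diff facts to a significance level (the condition names here are assumptions, not the project's actual heuristics):

```python
def classify(change: dict) -> str:
    # Hypothetical conditions; the real heuristics live in the AST-diff
    # stage and are summarized in the table that follows.
    if change.get("signature_changed"):
        return "major"
    if change.get("body_changed"):
        return "minor"
    return "none"                      # e.g. formatting/comment-only edits

def should_reembed(level: str) -> bool:
    return level != "none"             # the gate: skip insignificant diffs

assert should_reembed(classify({"signature_changed": True}))
assert not should_reembed(classify({"formatting_only": True}))
```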

| Condition | Level | Action |
| --------------------------- | -------- | ------------------- |
@@ -115,25 +112,25 @@
- Cheap operations (staleness flags, graph metadata) are near-instant — MCP queries reflect changes within ~1s
- Expensive operations (embeddings) only run when semantically justified — significant cost reduction
- Decoupled stages can be developed, tested, and scaled independently
- Batching per tier matches the cost profile of each operation
- Batching per stage matches the cost profile of each operation
- Multi-process from day one — no rewrite needed when scaling
- Dual-use of Valkey for event bus + embedding cache
- Natural extension point: new tiers or event types can be added without restructuring
- Natural extension point: new stages or event types can be added without restructuring

### Negative

- More architectural complexity than a simple "reindex everything on change"
- Significance threshold heuristics need tuning and may produce false negatives (skipping re-embeds that should have
happened)
- Debugging event flow across tiers is harder than a linear pipeline
- Debugging event flow across stages is harder than a linear pipeline
- Additional infrastructure dependency (Valkey), though lightweight

### Risks

- Threshold tuning: too aggressive = stale embeddings, too conservative = excessive TEI calls. Need observability on
gate decisions.
- Event ordering: if Tier 2 processes file A before file B, but B depends on A's entities, the diff may be incorrect.
Batch boundaries must align with dependency boundaries.
- Event ordering: if the AST stage processes file A before file B, but B depends on A's entities, the diff may be
incorrect. Batch boundaries must align with dependency boundaries.
- Complexity creep: the event bus must stay simple. If we find ourselves adding routing rules, dead-letter queues, or
retry logic, we've gone too far.

22 changes: 8 additions & 14 deletions docs/adr/0005-deployment-process-model.md
@@ -95,12 +95,12 @@ decoupled via Valkey Streams and Memgraph:
└───────┬────────┘
┌────────────────┐
│ Create Consumer│ Idempotent XGROUP CREATE for all 3 streams
│ Create Consumer│ Idempotent XGROUP CREATE for pipeline streams
│ Groups │
└───────┬────────┘
┌────────────────┐
│ Start Tier │ asyncio.gather(tier1.run(), tier2.run(), tier3.run())
│ Start Pipeline │ asyncio.gather(ast.run(), embed.run())
│ Consumers │
└───────┬────────┘
@@ -111,7 +111,7 @@
┌────────────────┐ Git-based fast path: diff stored_commit..HEAD
│ Reconcile │ Fallback: mtime comparison for non-git or rebases
│ (progressive) │ Enqueue stale files → Tier 1 → 2 → 3
│ (progressive) │ Enqueue stale files → AST → Embed
└───────┬────────┘
┌────────────────┐
@@ -230,17 +230,11 @@ Queries: Agent calls MCP tools ─────► Memgraph ◄──── Da
### Data Flow at Runtime

```
┌──────────┐  FileChanged          ┌─────────┐  ASTDirty             ┌─────────┐
│  File    │ ──► events ────────►  │ Tier 1  │ ──► events ────────►  │ Tier 2  │
│ Watcher  │  (Valkey Stream)      │ (graph) │  (Valkey Stream)      │  (AST)  │
└──────────┘                       └─────────┘                       └────┬────┘
                                                                 gate     │
                                                            EmbedDirty    │
                                                             (if sig)     │
                                                                     ┌────▼────┐
                                                                     │ Tier 3  │
                                                                     │ (embed) │
                                                                     └────┬────┘

┌──────────┐  FileChanged          ┌───────────┐  EmbedDirty        ┌───────────┐
│  File    │ ──► events ────────►  │ AST Stage │ ──► events ──────► │   Embed   │
│ Watcher  │  (Valkey Stream)      │  (parse)  │  (Valkey Stream)   │   Stage   │
└──────────┘                       └─────┬─────┘                    └─────┬────┘
                                         │                                │
            ┌──────────┐                 │                                │
Agent ◄──── MCP Server ◄──── reads  │ Memgraph │ ◄──── writes ────────────┘
8 changes: 4 additions & 4 deletions docs/adr/0006-pure-python-tree-sitter.md
@@ -17,29 +17,29 @@ actual cost breakdown:
- **Subprocess overhead** (spawn, JSON serialization, IPC) exceeded the parse time itself for typical files
- **Build complexity** required both `uv` and `cargo` toolchains in dev/CI/Docker
- **Contributor friction** — Rust was isolated to one component, but still required a full toolchain install
- **Parallelism** is already handled by the event bus (multiple Tier 2 consumer instances via Valkey Streams), not by
- **Parallelism** is already handled by the event bus (multiple AST consumer instances via Valkey Streams), not by
Rust's threading model

Meanwhile, `py-tree-sitter` uses the exact same C parsing library (tree-sitter) via Python bindings. The grammar
packages (`tree-sitter-python`, etc.) ship pre-compiled wheels — no compilation step needed.

## Decision

Drop the Rust binary (`crates/atlas-parser`) and use **py-tree-sitter** called in-process within the Tier 2 pipeline
Drop the Rust binary (`crates/atlas-parser`) and use **py-tree-sitter** called in-process within the AST pipeline
consumer. The parser module lives at `src/code_atlas/parser.py`.

### Architecture

```
Tier 2 Consumer
AST Consumer
└── parser.parse_file(path, source, project_name)
└── tree-sitter C engine (via py-tree-sitter bindings)
└── tree-sitter-python grammar (pre-compiled wheel)
```

### Parallelism Model

Multiple Tier 2 consumer instances can run concurrently — each pulls from the `atlas:ast-dirty` Valkey Stream via its
Multiple AST consumer instances can run concurrently — each pulls from the `atlas:file-changed` Valkey Stream via its
own consumer group member. This gives process-level parallelism without the GIL concern, since each consumer is an
independent process.
