Merged
67 changes: 60 additions & 7 deletions README.md
@@ -3,7 +3,7 @@
<img src=".github/assets/logo.png" width="280" alt="repowise" /><br />
**Codebase intelligence for AI-assisted engineering teams.**

Four intelligence layers. Eight MCP tools. One `pip install`.
Four intelligence layers. Ten MCP tools. One `pip install`.

[![PyPI version](https://img.shields.io/pypi/v/repowise?color=F59520&labelColor=0A0A0A)](https://pypi.org/project/repowise/)
[![License: AGPL v3](https://img.shields.io/badge/license-AGPL--v3-F59520?labelColor=0A0A0A)](https://www.gnu.org/licenses/agpl-3.0)
@@ -23,12 +23,63 @@ Four intelligence layers. Eight MCP tools. One `pip install`.

When Claude Code reads a 3,000-file codebase, it reads files. It does not know who owns them, which ones change together, which ones are dead, or why they were built the way they were.

repowise fixes that. It indexes your codebase into four intelligence layers — dependency graph, git history, auto-generated documentation, and architectural decisions — and exposes them to Claude Code (and any MCP-compatible AI agent) through eight precisely designed tools.
repowise fixes that. It indexes your codebase into four intelligence layers — dependency graph, git history, auto-generated documentation, and architectural decisions — and exposes them to Claude Code (and any MCP-compatible AI agent) through ten precisely designed tools.

The result: Claude Code answers *"why does auth work this way?"* instead of *"here is what auth.ts contains."*

---

## What's new

### Faster indexing
Indexing now runs in parallel. A `ProcessPoolExecutor` distributes AST parsing across all CPU cores. Graph construction and git history indexing run concurrently via `asyncio.gather`. Per-file git history is fetched through a thread executor, with a semaphore capping concurrency so the work parallelizes without overwhelming the system. Large repos index noticeably faster.
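
The concurrency pattern described above can be sketched roughly as follows. This is a minimal illustration, not repowise's actual code: `count_ast_nodes`, `read_history`, `fetch_history`, `index`, and `max_git_workers` are hypothetical names, and the git lookup is replaced by a trivial stand-in.

```python
import ast
import asyncio
from concurrent.futures import Executor


def count_ast_nodes(source: str) -> int:
    """CPU-bound stand-in for AST parsing (hypothetical helper)."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


def read_history(path: str) -> str:
    """Blocking stand-in for a per-file git history lookup (hypothetical)."""
    return f"history:{path}"


async def fetch_history(path: str, limit: asyncio.Semaphore) -> str:
    # The semaphore caps how many blocking lookups run in threads at once.
    async with limit:
        return await asyncio.to_thread(read_history, path)


async def index(sources: dict[str, str], pool: Executor, max_git_workers: int = 4):
    loop = asyncio.get_running_loop()
    limit = asyncio.Semaphore(max_git_workers)
    # Parsing jobs are handed to the executor; history jobs go to threads.
    parse_jobs = [loop.run_in_executor(pool, count_ast_nodes, src) for src in sources.values()]
    history_jobs = [fetch_history(path, limit) for path in sources]
    # Both groups make progress concurrently under one gather call.
    parsed, histories = await asyncio.gather(
        asyncio.gather(*parse_jobs), asyncio.gather(*history_jobs)
    )
    return dict(zip(sources, parsed)), dict(zip(sources, histories))
```

Passing a `concurrent.futures.ProcessPoolExecutor` as `pool` spreads the parsing across CPU cores; the executor is a parameter here only so the sketch can also be exercised with a thread pool.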

### RAG-aware documentation generation
Every wiki page is generated with richer context: before calling the LLM, repowise fetches the already-generated summaries of each file's direct dependencies from the vector store and injects them into the prompt. Generation is topologically sorted so leaf files are always written first. The LLM sees what its dependencies actually do, not just their names — producing more accurate, cross-referenced documentation.
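
As a rough sketch of the ordering idea, assuming the dependency graph is available as a mapping from each file to the files it imports: the standard-library `graphlib.TopologicalSorter` yields dependencies before their dependents. The function names and prompt layout below are illustrative assumptions, not the project's real prompt or API.

```python
from graphlib import TopologicalSorter


def generation_order(deps: dict[str, set[str]]) -> list[str]:
    """Order files so that each file appears after everything it depends on."""
    return list(TopologicalSorter(deps).static_order())


def build_prompt(path: str, deps: dict[str, set[str]], summaries: dict[str, str]) -> str:
    """Inject already-generated dependency summaries into a (hypothetical) prompt."""
    context = "\n".join(f"- {dep}: {summaries[dep]}" for dep in sorted(deps.get(path, ())))
    return f"Document {path}.\nDependency summaries:\n{context}"


deps = {"app.py": {"auth.py", "db.py"}, "auth.py": {"db.py"}, "db.py": set()}
order = generation_order(deps)  # leaf file db.py comes first
```

Because leaves are generated first, every summary that `build_prompt` looks up already exists by the time a dependent file is processed.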

### Atomic three-store transactions
`AtomicStorageCoordinator` buffers writes across the SQL database, the in-memory dependency graph, and the vector store, then flushes them in a single coordinated operation. If any store fails, all three are rolled back — no partial writes, no silent drift. Run `repowise doctor` to inspect drift across all three stores and repair mismatches.
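
The buffer-then-flush behaviour can be illustrated with a toy model. This is not the real `AtomicStorageCoordinator`: `DictStore` and `Coordinator` are hypothetical classes, and rollback is done with in-memory snapshots, which is simpler than what real SQL and vector stores would need.

```python
class DictStore:
    """Toy in-memory store standing in for the SQL, graph, or vector store."""

    def __init__(self, fail: bool = False):
        self.data: dict[str, str] = {}
        self.fail = fail

    def apply(self, writes: dict[str, str]) -> dict[str, str]:
        if self.fail:
            raise RuntimeError("store unavailable")
        snapshot = dict(self.data)  # state to restore on rollback
        self.data.update(writes)
        return snapshot

    def restore(self, snapshot: dict[str, str]) -> None:
        self.data = snapshot


class Coordinator:
    """Buffers writes per store, then applies them all or rolls them all back."""

    def __init__(self, *stores: DictStore):
        self.stores = stores
        self.pending = [{} for _ in stores]

    def buffer(self, store_index: int, key: str, value: str) -> None:
        self.pending[store_index][key] = value

    def flush(self) -> None:
        applied = []
        try:
            for store, writes in zip(self.stores, self.pending):
                applied.append((store, store.apply(writes)))
        except Exception:
            # Undo the stores that already accepted their writes.
            for store, snapshot in reversed(applied):
                store.restore(snapshot)
            raise
        finally:
            self.pending = [{} for _ in self.stores]
```

If the third store raises during `flush()`, the first two are restored to their pre-flush state, which is the "no partial writes" property described above.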

### Dynamic import hints
The dependency graph now captures edges that pure AST parsing misses:
- Django `INSTALLED_APPS`, `ROOT_URLCONF`, and `MIDDLEWARE` settings
- pytest fixture wiring through `conftest.py`
- Node/TypeScript path aliases from `tsconfig.json` `paths` and `package.json` `exports`

These edges appear in `get_context`, `get_risk`, and `get_dependency_path` like any other dependency.

### Single-call answers via `get_answer`
A new `get_answer(question)` MCP tool collapses the typical "search → read → reason" loop into one call. It runs retrieval over the wiki, gates on confidence (top-hit dominance ratio), and synthesizes a 2–5 sentence answer with concrete file/symbol citations. High-confidence answers can be cited directly; ambiguous ones return ranked excerpts so the agent grounds in source. Responses are cached per repository by question hash, so repeated questions incur no additional LLM cost.
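
A minimal sketch of confidence gating plus caching, under stated assumptions: the dominance ratio is taken to be the top retrieval score divided by the runner-up, the `1.5` threshold is invented for illustration, questions are normalized before hashing, and both high- and low-confidence responses are cached. None of these details are confirmed by the text above, and `answer`, `dominance_ratio`, and `question_hash` are hypothetical names.

```python
import hashlib


def dominance_ratio(scores: list[float]) -> float:
    """Best retrieval score divided by the runner-up (assumed definition)."""
    ranked = sorted(scores, reverse=True)
    if not ranked or ranked[0] <= 0:
        return 0.0
    if len(ranked) == 1 or ranked[1] <= 0:
        return float("inf")
    return ranked[0] / ranked[1]


def question_hash(question: str) -> str:
    """Hash a case- and whitespace-normalized question (normalization is assumed)."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def answer(repo_id, question, scores, synthesize, cache, threshold=1.5):
    key = (repo_id, question_hash(question))
    if key in cache:
        return cache[key]  # repeated question: no synthesis call
    if dominance_ratio(scores) >= threshold:
        result = {"confidence": "high", "answer": synthesize(question)}
    else:
        # Ambiguous retrieval: hand back ranked material instead of an answer.
        result = {"confidence": "low", "excerpts": sorted(scores, reverse=True)}
    cache[key] = result
    return result
```

Keying the cache on `(repo_id, question_hash)` keeps entries scoped to one repository, so the same question asked against another repository is synthesized separately.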

### Symbol lookup via `get_symbol`
A new `get_symbol(symbol_id)` MCP tool resolves a fully-qualified symbol identifier (e.g. `pkg/module.py::Class::method`) to its definition, returning the source body, signature, file location, and any cross-referenced docstring — without the agent having to grep then read.
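
To make the identifier format concrete, here is a small parser for the `path::Class::method` shape shown above. `SymbolRef` and `parse_symbol_id` are hypothetical helpers written for illustration; the tool's real resolution logic is not shown in this document.

```python
from typing import NamedTuple


class SymbolRef(NamedTuple):
    path: str                  # file part, e.g. "pkg/module.py"
    qualname: tuple[str, ...]  # nested names, e.g. ("Class", "method")


def parse_symbol_id(symbol_id: str) -> SymbolRef:
    """Split a '::'-separated symbol id into its file path and nested names."""
    path, *parts = symbol_id.split("::")
    if not path or not parts or not all(parts):
        raise ValueError(f"malformed symbol id: {symbol_id!r}")
    return SymbolRef(path, tuple(parts))


ref = parse_symbol_id("pkg/module.py::Class::method")
```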

### Test files in the documentation layer
The page generator now treats test files as first-class wiki targets. They have near-zero PageRank (nothing imports them back), but they answer real questions like "what test exercises X" or "where is Y verified", and the doc layer is the right place to surface those answers. Filtering remains available via `skip_tests` for users who prefer to exclude them.

### Temporal hotspot decay
Hotspot scoring now uses an exponentially time-decayed score with a 180-day half-life layered on top of the raw 90-day churn count. A commit from a year ago contributes roughly 25% as much as a commit from today. The score reflects recent activity, not just total volume. Surfaced in `get_overview` and `get_risk`.
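
The "roughly 25%" figure follows directly from the half-life: a commit aged `t` days is weighted by `0.5 ** (t / 180)`. The sketch below shows only this decay term. How repowise combines it with the raw churn count is not specified here, and the function names are illustrative.

```python
def decay_weight(age_days: float, half_life_days: float = 180.0) -> float:
    """Exponential decay: the weight halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def decayed_commit_total(commit_ages_days: list[float]) -> float:
    """Sum of decayed weights, one term per commit (illustrative only)."""
    return sum(decay_weight(age) for age in commit_ages_days)


one_year = decay_weight(365)  # about 0.245, i.e. roughly a quarter of a fresh commit
```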

### Percentile ranks via SQL window function
Incremental updates now recompute global percentile ranks for every file using a single `PERCENT_RANK()` SQL window function. Previously this required loading all rows into Python. The new approach is both faster and correct on large repos — no sampling, no approximation.
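
For readers unfamiliar with `PERCENT_RANK()`, the following sketch runs it against an in-memory SQLite database. The `file_metrics` table and its columns are hypothetical, not repowise's schema, and window functions require SQLite 3.25 or newer.

```python
import sqlite3


def percentile_ranks(churn_by_path: dict[str, int]) -> dict[str, float]:
    """Rank every file by churn in one query, with no per-row Python math."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE file_metrics (path TEXT PRIMARY KEY, churn INTEGER)")
        conn.executemany(
            "INSERT INTO file_metrics VALUES (?, ?)", list(churn_by_path.items())
        )
        rows = conn.execute(
            "SELECT path, PERCENT_RANK() OVER (ORDER BY churn) FROM file_metrics"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


ranks = percentile_ranks({"a.py": 1, "b.py": 5, "c.py": 9})
```

`PERCENT_RANK()` is defined as `(rank - 1) / (rows - 1)`, so the lowest-churn file gets `0.0` and the highest gets `1.0`.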

### PR blast radius
`get_risk(changed_files=[...])` now returns a full blast-radius report: transitive affected files, co-change warnings for historical co-change partners not included in the PR, recommended reviewers ranked by temporal ownership, test gap detection, and an overall 0–10 risk score. Same flat tool surface — substantially more signal per call.

### Knowledge map in `get_overview`
`get_overview` now surfaces: top owners across the codebase, "bus factor 1" knowledge silos (files where one person owns >80% of commits), and onboarding targets — high-centrality files with the weakest documentation coverage. Useful for team planning and risk review.
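
The "bus factor 1" rule can be expressed in a few lines. This sketch assumes ownership is measured by commit counts per author, as the parenthetical above suggests; `knowledge_silos` and its input shape are hypothetical.

```python
from collections import Counter


def knowledge_silos(authors_by_file: dict[str, list[str]], threshold: float = 0.8) -> dict[str, str]:
    """Map each file to its dominant author when one person owns >80% of commits."""
    silos = {}
    for path, authors in authors_by_file.items():
        if not authors:
            continue
        owner, count = Counter(authors).most_common(1)[0]
        if count / len(authors) > threshold:
            silos[path] = owner
    return silos


silos = knowledge_silos({
    "billing.py": ["ann"] * 9 + ["bob"],  # ann has 90% of commits
    "api.py": ["ann", "bob"],             # evenly shared
})
```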

### Test gaps and security signals in `get_risk`
`get_risk` now includes a `test_gap` flag per file (no test file co-changes detected) and `security_signals` — static pattern detection for common risk categories: authentication bypass patterns, `eval`-family calls, raw SQL string construction, and weak cryptography. Signals appear alongside the existing hotspot and ownership data.

### LLM cost tracking
Every LLM call is logged to a new `llm_costs` table with operation type, model, token counts, and estimated cost. A new `repowise costs` CLI command lets you group spending by operation, model, or day. The indexing progress bar now shows a live `Cost: $X.XXX` counter next to the spinner.

### Configurable dead-code sensitivity
The `repowise dead-code` command and the `get_dead_code` MCP tool now expose sensitivity controls: `--min-confidence` (default 0.70), `--include-internals` (include private/underscore-prefixed symbols), and `--include-zombie-packages` (packages present in `package.json` / `pyproject.toml` but unused in the graph). Tune the output to your cleanup goals.

---

## What repowise builds

repowise runs once, builds everything, then keeps it in sync on every commit.
@@ -84,17 +135,19 @@ Add to your Claude Code config (`~/.claude/claude_desktop_config.json`):

---

## Eight MCP tools
## Ten MCP tools

Most tools are designed around data entities — one module, one file, one symbol — which forces AI agents into long chains of sequential calls. repowise tools are designed around **tasks**. Pass multiple targets in one call. Get complete context back.

| Tool | What it answers | When Claude Code calls it |
|---|---|---|
| `get_answer(question)` | One-call RAG: retrieves over the wiki, gates on confidence, and synthesizes a cited 2–5 sentence answer. High-confidence answers cite directly; ambiguous queries return ranked excerpts. Responses are cached per repository by question hash. | First call on any code question — collapses search → read → reason into one round-trip |
| `get_symbol(symbol_id)` | Resolves a qualified symbol id (`path::Class::method`) to its source body, signature, and docstring | When the question names a specific class, function, or method |
| `get_overview()` | Architecture summary, module map, entry points | First call on any unfamiliar codebase |
| `get_context(targets, include?)` | Docs, ownership, decisions, freshness for any targets — files, modules, or symbols | Before reading or modifying code. Pass all relevant targets in one call. |
| `get_context(targets, include?, compact?)` | Docs, ownership, decisions, freshness for any targets — files, modules, or symbols. `compact=True` is the default and bounds the response to ~10K characters; pass `compact=False` for the full structure block, importer list, and per-symbol docstrings | Before reading or modifying code. Pass all relevant targets in one call. |
| `get_risk(targets?, changed_files?)` | Hotspot scores, dependents, co-change partners, blast radius, recommended reviewers, test gaps, security signals, 0–10 risk score | Before modifying files — understand what could break |
| `get_why(query?)` | Three modes: NL search over decisions · path-based decisions for a file · no-arg health dashboard | Before architectural changes — understand existing intent |
| `search_codebase(query)` | Semantic search over the full wiki. Natural language. | When you don't know where something lives |
| `search_codebase(query)` | Semantic search over the full wiki. Natural language. | When `get_answer` returned low confidence and you need to discover candidate pages by topic |
| `get_dependency_path(from, to)` | Connection path between two files, modules, or symbols | When tracing how two things are connected |
| `get_dead_code(min_confidence?, include_internals?, include_zombie_packages?)` | Unreachable code sorted by confidence and cleanup impact | Cleanup tasks |
| `get_architecture_diagram(module?)` | Mermaid diagram for the repo or a specific module | Documentation and presentation |
@@ -106,7 +159,7 @@ Most tools are designed around data entities — one module, one file, one symbo
| Approach | Tool calls | Time to first change | What it misses |
|---|---|---|---|
| Claude Code alone (no MCP) | grep + read ~30 files | ~8 min | Ownership, prior decisions, hidden coupling |
| **repowise (8 tools)** | **5 calls** | **~2 min** | **Nothing** |
| **repowise (10 tools)** | **5 calls** | **~2 min** | **Nothing** |

The 5 calls for that task:

@@ -284,7 +337,7 @@ When a senior engineer leaves, the "why" usually leaves with them. Decision inte
| Git intelligence (hotspots, ownership, co-changes) | ✅ | ❌ | ❌ | ❌ | ✅ |
| Bus factor analysis | ✅ | ❌ | ❌ | ❌ | ✅ |
| Architectural decision records | ✅ | ❌ | ❌ | ❌ | ❌ |
| MCP server for AI agents | ✅ 8 tools | ❌ | ✅ 3 tools | ✅ | ✅ |
| MCP server for AI agents | ✅ 10 tools | ❌ | ✅ 3 tools | ✅ | ✅ |
| Auto-generated CLAUDE.md | ✅ | ❌ | ❌ | ❌ | ❌ |
| Doc freshness scoring | ✅ | ❌ | ❌ | ⚠️ staleness only | ❌ |
| Incremental updates on commit | ✅ <30s | ✅ | ❌ | ✅ | ✅ |
21 changes: 15 additions & 6 deletions docs/ARCHITECTURE.md
@@ -16,7 +16,7 @@ For per-package detail (installation, full API reference, all CLI flags, file ma
|---------|--------|----------------|
| `packages/core` | [`packages/core/README.md`](../packages/core/README.md) | Ingestion, generation, persistence, providers — all key classes with code examples |
| `packages/cli` | [`packages/cli/README.md`](../packages/cli/README.md) | All 10 CLI commands with every flag documented |
| `packages/server` | [`packages/server/README.md`](../packages/server/README.md) | All REST API endpoints, 8 MCP tools, webhook setup, scheduler jobs |
| `packages/server` | [`packages/server/README.md`](../packages/server/README.md) | All REST API endpoints, 10 MCP tools, webhook setup, scheduler jobs |
| `packages/web` | [`packages/web/README.md`](../packages/web/README.md) | Every frontend file with purpose — API client, hooks, components, pages |

---
@@ -78,7 +78,7 @@ For per-package detail (installation, full API reference, all CLI flags, file ma
│ Three Stores │ │ Consumers │
│ │ │ │
│ SQL (wiki pages, │ │ Web UI MCP Server GitHub Action │
│ jobs, symbols, │ │ (Next.js) (9 tools) (CI/CD) │
│ jobs, symbols, │ │ (Next.js) (10 tools) (CI/CD) │
│ versions) │ │ │
│ │ │ repowise CLI │
│ Vector (LanceDB / │ │ (init, update, watch, │
@@ -167,7 +167,7 @@ repowise/
│ ├── server/ # Python: FastAPI REST API + MCP server
│ │ └── src/repowise/server/
│ │ ├── routers/ # FastAPI routers (repos, pages, jobs, symbols, graph, git, dead-code, decisions, search, claude-md)
│ │ ├── mcp_server/ # MCP server package (8 tools, split into focused modules)
│ │ ├── mcp_server/ # MCP server package (10 tools, split into focused modules)
│ │ ├── webhooks/ # GitHub + GitLab handlers
│ │ ├── job_executor.py # Background pipeline executor — bridges REST endpoints to core pipeline
│ │ └── scheduler.py # APScheduler background jobs
@@ -219,9 +219,10 @@ Key tables:
| Table | Purpose |
|-------|---------|
| `repos` | Registered repositories, sync state, provider config |
| `wiki_pages` | All generated wiki pages with content, metadata, confidence score |
| `wiki_pages` | All generated wiki pages with content, metadata, confidence score, and a short LLM-extracted `summary` (1–3 sentences) used by `get_context` to keep responses bounded |
| `page_versions` | Full version history of every page (for diff view) |
| `symbols` | Symbol index: every function, class, method across all files |
| `answer_cache` | Memoised `get_answer` responses keyed by `(repository_id, question_hash)` plus the provider/model used. Repeated questions return at zero LLM cost; cache entries are invalidated by repository re-indexing. |
| `generation_jobs` | Job state machine with checkpoint fields for resumability |
| `webhook_events` | Every received webhook event (deduplication, audit, retry) |
| `symbol_rename_history` | Detected renames for auditing and targeted text patching |
@@ -424,6 +425,14 @@ cross-package edges tracked in the graph.
Each `FileInfo` is tagged with: `language`, `is_test`, `is_config`, `is_api_contract`,
`is_entry_point`, `git_hash`. These tags influence generation priority and prompt choice.

**Test files are first-class wiki targets.** The page generator includes any file
tagged `is_test=True` that has at least one extracted symbol, even if the file's
PageRank is near zero (which is typical: nothing imports test files back, so
graph-centrality metrics never select them on their own). Test files answer
questions of the form *"what test exercises X"* / *"where is Y verified"*, and
the doc layer is the right place to surface those. Users who want to exclude
tests from the wiki entirely can pass `--skip-tests` to `repowise init`.
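
The selection rule in the paragraph above might look roughly like this. It is a sketch, not the real `_is_significant_file()`: `FileTags` is a simplified stand-in for `FileInfo`, and the `min_pagerank` cutoff for non-test files is an invented placeholder.

```python
from dataclasses import dataclass


@dataclass
class FileTags:
    """Simplified stand-in for the tags carried by a FileInfo (hypothetical)."""
    path: str
    is_test: bool
    symbol_count: int
    pagerank: float


def is_significant(info: FileTags, skip_tests: bool = False, min_pagerank: float = 0.01) -> bool:
    """Decide whether a file gets a wiki page."""
    if info.is_test:
        if skip_tests:
            return False  # user opted out of test pages
        if info.symbol_count > 0:
            return True   # included regardless of PageRank
    # Everything else is judged on graph centrality (placeholder threshold).
    return info.pagerank >= min_pagerank
```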

### 5.2 AST Parsing

`ASTParser` is a single class that handles all supported languages. There are no
@@ -1103,7 +1112,7 @@ file, tokens used, estimated cost, estimated time remaining).
repowise includes an interactive chat interface that lets users ask questions about
their codebase and receive answers grounded in the wiki, dependency graph, git
history, and architectural decisions. The chat agent uses whichever LLM provider
the user has configured and has access to all 8 MCP tools.
the user has configured and has access to all 10 MCP tools.

See [`docs/CHAT.md`](CHAT.md) for the full technical reference covering the
backend agentic loop, SSE streaming protocol, provider abstraction extensions,
@@ -1114,7 +1123,7 @@ database schema, frontend component architecture, and artifact rendering system.
- **Provider-agnostic** — the chat agent goes through the same provider abstraction
as documentation generation. A `ChatProvider` protocol extends `BaseProvider` with
`stream_chat()` for streaming + tool use without breaking existing callers.
- **Tool reuse** — the 8 MCP tools are called directly as Python functions (no
- **Tool reuse** — the 10 MCP tools are called directly as Python functions (no
subprocess round-trip). Tool schemas are defined once in `chat_tools.py` and
fed to both the LLM and the executor.
- **SSE streaming** — `POST /api/repos/{repo_id}/chat/messages` runs the agentic
6 changes: 6 additions & 0 deletions docs/CHANGELOG.md
@@ -12,6 +12,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- **`get_answer` MCP tool** (`tool_answer.py`) — single-call RAG over the wiki layer. Runs retrieval, gates synthesis on top-hit dominance ratio, and returns a 2–5 sentence answer with concrete file/symbol citations plus a `confidence` label. High-confidence responses can be cited directly without verification reads. Backed by an `AnswerCache` table so repeated questions on the same repository cost nothing on the second call.
- **`get_symbol` MCP tool** (`tool_symbol.py`) — resolves a fully-qualified symbol id (`path::Class::method`, also accepts `Class.method`) to its source body, signature, file location, line range, and docstring. Returns the rich source-line signature (with base classes, decorators, and full type annotations preserved) instead of the stripped DB form.
- **`Page.summary` column** — short LLM-extracted summary (1–3 sentences) attached to every wiki page during generation. Used by `get_context` to keep context payloads bounded on dense files. Added by alembic migration `0012_page_summary`.
- **`AnswerCache` table** — memoised `get_answer` responses keyed by `(repository_id, question_hash)` plus the provider/model used. Added by alembic migration `0013_answer_cache`. Cache entries are repository-scoped and invalidated by re-indexing.
- **Test files in the wiki** — `page_generator._is_significant_file()` now treats any file tagged `is_test=True` (with at least one extracted symbol) as significant, regardless of PageRank. Test files have near-zero centrality because nothing imports them back, but they answer "what test exercises X" / "where is Y verified" questions; the doc layer is the right place to surface those. Filtering remains available via `--skip-tests`.
- **Overview dashboard** (`/repos/[id]/overview`) — new landing page for each repository with:
- Health score ring (composite of doc coverage, freshness, dead code, hotspot density, silo risk)
- Attention panel highlighting items needing action (stale docs, high-risk hotspots, dead code)
@@ -27,6 +32,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **Health score utility** (`web/src/lib/utils/health-score.ts`) — composite health score computation, attention item builder, and language aggregation for the overview dashboard

### Changed
- **`get_context` default is now `compact=True`** — drops the `structure` block, the `imported_by` list, and per-symbol docstring/end-line fields to keep the response under ~10K characters. Pass `compact=False` for the full payload (e.g. when you specifically need import-graph dependents on a large file).
- `init_cmd.py` refactored to use shared `persist_pipeline_result()` instead of inline persistence logic
- Pipeline orchestrator uses async-friendly patterns to keep the event loop responsive during ingestion
- Sidebar and mobile nav updated to include "Overview" link
4 changes: 2 additions & 2 deletions docs/CHAT.md
@@ -2,7 +2,7 @@

The codebase chat feature lets users have an interactive conversation with their
codebase. The agent uses whichever LLM provider the user has configured, has
access to all 8 MCP tools, and streams responses back to the browser in real time
access to all 10 MCP tools, and streams responses back to the browser in real time
showing tool calls as they happen and rendering results in an artifact panel.

---
@@ -158,7 +158,7 @@ class ChatProvider(Protocol):

Defined in `packages/server/src/repowise/server/chat_tools.py`.

Single source of truth for tool schemas and execution. Imports the 8 MCP tool
Single source of truth for tool schemas and execution. Imports the 10 MCP tool
functions directly from `repowise.server.mcp_server`.

```python