feat: cross-project search and branch-aware indexing#23
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds branch-aware project ID suffixing, linked-project discovery, and cross-project multi-collection search with client-side Reciprocal Rank Fusion (RRF) and deduplication. Exposes Changes
Sequence DiagramsequenceDiagram
participant Client as User/Tool Call
participant QueryHandler as Query Handler
participant ConfigResolver as Config Resolver
participant QdrantMulti as Multi-Collection Search
participant QdrantSingle as Qdrant (per collection)
participant Merger as RRF Merger
Client->>QueryHandler: codebase_search(query, includeLinked=true)
QueryHandler->>ConfigResolver: resolveLinkedCollections(projectPath)
ConfigResolver-->>QueryHandler: [{ name, label }, ...]
QueryHandler->>QdrantMulti: searchMultipleCollections(collections, query)
par Parallel Collection Queries
QdrantMulti->>QdrantSingle: searchChunks(collection[0], query)
QdrantSingle-->>QdrantMulti: results[0]
QdrantMulti->>QdrantSingle: searchChunks(collection[1], query)
QdrantSingle-->>QdrantMulti: results[1]
QdrantMulti->>QdrantSingle: searchChunks(collection[n], query)
QdrantSingle-->>QdrantMulti: results[n]
end
QdrantMulti->>Merger: mergeMultiCollectionResults(collectionResults, limit)
Merger->>Merger: compute RRF, deduplicate by relativePath
Merger-->>QdrantMulti: merged & ranked results
QdrantMulti-->>QueryHandler: SearchResult[] (with project labels)
QueryHandler-->>Client: Formatted results with [project] tags
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/config.ts`:
- Around line 180-186: coreProjectId(linkedPath) currently strips
explicit/shared SOCRATICODE_PROJECT_ID and collectionName(linkedId) is compared
against collections[0].name (which may include branch suffix), causing wrong
collection resolution and missed same-repo skipping; fix by resolving the linked
project's canonical project id without dropping an explicit/shared ID (use the
function that preserves SOCRATICODE_PROJECT_ID or read the repo's configured
project id instead of coreProjectId), then compute collectionName from that
preserved id, and when checking for same-repo worktrees compare base collection
names (strip branch suffix from collections[0].name or use a baseCollectionName
helper) so the skip uses branch-agnostic comparison.
In `@src/services/qdrant.ts`:
- Around line 421-425: searchMultipleCollections currently calls searchChunks
for each collection which recomputes the same dense embedding N times; refactor
so you compute the embedding once in searchMultipleCollections (call the
existing embedding routine used by searchChunks) and then replace the
per-collection call with an internal helper (e.g., searchChunksWithEmbedding or
a private function inside qdrant.ts) that accepts the precomputed embedding plus
collection-specific args (name, label, perCollectionLimit, fileFilter,
languageFilter) and only performs the Qdrant query/aggregation. Update all
callers to use the helper path and remove embedding logic from searchChunks (or
make searchChunks delegate to the helper when an embedding is passed) so the
embedding is generated once per user query instead of per collection.
- Around line 367-378: The dedupe key is currently just r.relativePath so files
with the same path across collections collapse; change the key used for the
scored Map to include the collection/project identity (e.g., combine label and
r.relativePath) so deduping is scoped per collection. Update the key creation in
the loop over collectionResults (where key is set and where scored.get/set is
called) to use the composite key and ensure the stored object still sets
project: label (leave stored r.relativePath unchanged for later use).
In `@tests/unit/config.test.ts`:
- Around line 346-380: Tests rely on the outer repo state so
detectGitBranch(process.cwd()) and projectIdFromPath(process.cwd()) can be null
in CI; make the tests self-contained by creating a temporary git repo with an
explicit branch (init, create a commit, create/checkout a named branch) and use
that repo path when calling detectGitBranch and projectIdFromPath, set
SOCRATICODE_BRANCH_AWARE as needed, and clean up the temp repo and env var after
each test; alternatively stub/mock the git lookup used by
detectGitBranch/projectIdFromPath, but prefer the temp-repo approach so tests
exercise real git behavior.
In `@tests/unit/query-tools.test.ts`:
- Around line 245-247: The test's assertion assumes POSIX separators; update it
to be platform-neutral by normalizing TEST_PATH before matching—import Node's
path module in tests and use path.normalize(TEST_PATH) (or replace '/' with
path.sep) when calling expect.stringContaining so mockResolveLinkedCollections
is asserted against the resolved projectPath produced by handleQueryTool().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f509e967-316f-4e17-b4c6-725382149baa
📒 Files selected for processing (11)
DEVELOPER.mdREADME.mdsrc/config.tssrc/index.tssrc/services/indexer.tssrc/services/qdrant.tssrc/tools/query-tools.tssrc/types.tstests/unit/config.test.tstests/unit/multi-collection-search.test.tstests/unit/query-tools.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/unit/config.test.ts`:
- Line 352: The tests call execFileSync("git", ["commit", "--allow-empty", ...])
in multiple places and fail in CI because no local git user identity is set; add
a small helper (e.g., initGitRepoWithIdentity or setupTempGitRepo) that runs the
existing git init steps then runs execFileSync("git", ["config", "user.name",
"Test User"], { cwd: tmpDir }) and execFileSync("git", ["config", "user.email",
"test@example.com"], { cwd: tmpDir }) before any commits, replace the repeated
commit-init pattern (the locations invoking execFileSync("git", ["commit",
...])) to call that helper so all temp repos have a local user.name/user.email
configured.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/unit/config.test.ts`:
- Around line 246-264: The test suite mutates SOCRATICODE_LINKED_PROJECTS
without preserving its prior value; in the beforeEach capture the original value
(e.g., const originalLinked = process.env.SOCRATICODE_LINKED_PROJECTS) before
deleting it, and in afterEach restore it (set
process.env.SOCRATICODE_LINKED_PROJECTS = originalLinked or delete if undefined)
so the environment is returned to its original state; update the
beforeEach/afterEach in the describe("resolveLinkedCollections") block
accordingly to reference the captured variable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit/query-tools.test.ts (1)
287-308: Consider a more robust negative assertion.The regex
/\[.*-.*\]/checks for any bracketed content with a hyphen, which could match unrelated patterns (e.g., a hypothetical[ISO-8601]in code content). For this test, a more explicit assertion might be cleaner:♻️ Alternative assertion
// Should have the file but no project tag brackets expect(output).toContain("src/index.ts"); - expect(output).not.toMatch(/\[.*-.*\]/); // no [project-name] tag + // Project tags appear after language tag, e.g. "[typescript] [project-label]" + expect(output).not.toContain("[typescript] [");This is a minor nit—the current regex works given the mock data doesn't contain hyphenated bracketed content.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/query-tools.test.ts` around lines 287 - 308, The negative regex is too permissive; in the "does not include project tag when project field is absent" test replace the broad /\\[.*-.*\\]/ check with a stricter pattern that only matches bracketed hyphenated tags (e.g. /\\[\\w+-\\w+\\]/) or assert there is no bracketed hyphenated token adjacent to the returned path (e.g. /\\[\\w+-\\w+\\].*src\\/index\\.ts/), updating the assertion near the expect(output).not.toMatch call that follows the mockSearchChunks and handleQueryTool usage (TEST_PATH, mockSearchChunks, handleQueryTool) so the test only fails for actual project-tag-like patterns.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/config.ts`:
- Around line 34-39: sanitizeBranchName can return an empty string for branch
names made only of invalid chars, causing downstream IDs like `id__`; update
sanitizeBranchName to return a safe fallback when the sanitized result is empty
(e.g., return "unnamed" or "default" instead of ""), and also ensure any place
that does the conditional append (the code that checks `if (branch)` before
adding a suffix) checks the sanitized value (e.g., `const safeBranch =
sanitizeBranchName(branch); if (safeBranch) { ... }`) so you never append an
empty suffix.
---
Nitpick comments:
In `@tests/unit/query-tools.test.ts`:
- Around line 287-308: The negative regex is too permissive; in the "does not
include project tag when project field is absent" test replace the broad
/\\[.*-.*\\]/ check with a stricter pattern that only matches bracketed
hyphenated tags (e.g. /\\[\\w+-\\w+\\]/) or assert there is no bracketed
hyphenated token adjacent to the returned path (e.g.
/\\[\\w+-\\w+\\].*src\\/index\\.ts/), updating the assertion near the
expect(output).not.toMatch call that follows the mockSearchChunks and
handleQueryTool usage (TEST_PATH, mockSearchChunks, handleQueryTool) so the test
only fails for actual project-tag-like patterns.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a2aba72d-0c98-476f-99e5-33c23901f205
📒 Files selected for processing (4)
src/config.tssrc/services/qdrant.tstests/unit/multi-collection-search.test.tstests/unit/query-tools.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/unit/multi-collection-search.test.ts
- src/services/qdrant.ts
| export function sanitizeBranchName(branch: string): string { | ||
| return branch | ||
| .replace(/[^a-zA-Z0-9_-]/g, "_") | ||
| .replace(/_+/g, "_") | ||
| .replace(/^_|_$/g, ""); | ||
| } |
There was a problem hiding this comment.
Edge case: sanitized branch name could be empty.
If a branch name contains only invalid characters (e.g., "///" or "@@@"), sanitizeBranchName returns an empty string. Combined with the check at line 69 (if (branch)), this would append an empty suffix, producing id__.
🛡️ Proposed fix to guard against empty sanitized names
if (process.env.SOCRATICODE_BRANCH_AWARE === "true") {
const branch = detectGitBranch(path.resolve(folderPath));
if (branch) {
- id = `${id}__${sanitizeBranchName(branch)}`;
+ const sanitized = sanitizeBranchName(branch);
+ if (sanitized) {
+ id = `${id}__${sanitized}`;
+ }
}
}Also applies to: 67-72
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/config.ts` around lines 34 - 39, sanitizeBranchName can return an empty
string for branch names made only of invalid chars, causing downstream IDs like
`id__`; update sanitizeBranchName to return a safe fallback when the sanitized
result is empty (e.g., return "unnamed" or "default" instead of ""), and also
ensure any place that does the conditional append (the code that checks `if
(branch)` before adding a suffix) checks the sanitized value (e.g., `const
safeBranch = sanitizeBranchName(branch); if (safeBranch) { ... }`) so you never
append an empty suffix.
✅ Addressed in commit 0d45215
7a0f081 to
0d45215
Compare
…cation Add searchMultipleCollections() and mergeMultiCollectionResults() to qdrant.ts. Queries multiple Qdrant collections in parallel, merges results using Reciprocal Rank Fusion (k=60), and deduplicates by relativePath (first collection wins). Add optional 'project' field to SearchResult type for source attribution. This is the shared foundation for linked projects (#20) and branch-aware indexing (#19). Refs: #19, #20
…INKED_PROJECTS - loadLinkedProjects() reads .socraticode.json config and env var - resolveLinkedCollections() maps linked projects to collection names - codebase_search gains includeLinked parameter for cross-project queries - Results tagged with [project-label] when searching across projects - 10 unit tests for config loading and collection resolution Relates to #20
- detectGitBranch() detects current git branch via git rev-parse - sanitizeBranchName() converts branch names to Qdrant-safe suffixes - When SOCRATICODE_BRANCH_AWARE=true, projectIdFromPath appends __branch to create separate indexes per branch - Explicit SOCRATICODE_PROJECT_ID takes precedence (no branch suffix) - 14 unit tests for sanitization, detection, and integration Relates to #19
- README: new Features entries, full sections with config examples - README: new env vars SOCRATICODE_BRANCH_AWARE, SOCRATICODE_LINKED_PROJECTS - README: updated codebase_search tool description for includeLinked - DEVELOPER.md: updated config.ts description, branch-aware and linked projects sections
resolveLinkedCollections() was calling projectIdFromPath() for linked projects, which with SOCRATICODE_BRANCH_AWARE=true appended the linked project's git branch to its ID. Linked projects indexed without branch-aware mode don't have __branch collections, so lookups failed. Extract coreProjectId() helper (hash-only, no branch suffix) and use it for linked projects. Also fix dedup to compare collection names instead of raw IDs to handle the branch suffix mismatch. Adds unit test verifying linked collection names are branch-agnostic.
- query-tools.test.ts: 6 new tests for includeLinked parameter: - omitted/false → calls searchChunks (not searchMultipleCollections) - true → calls searchMultipleCollections via resolveLinkedCollections - passes collections and args correctly - project label appears in output - no project tag when field absent - multi-collection-search.test.ts: 1 new test for searchMultipleCollections empty-input short-circuit (internal searchChunks calls cannot be mocked from the same module — integration covered by query-tools tests)
CI checks out in detached HEAD state, causing detectGitBranch(process.cwd()) to return null. Replace process.cwd() with temporary git repos that have a known branch and initial commit, making tests deterministic everywhere.
CI runners have no git user.name/email configured, causing 'git commit --allow-empty' to fail with 'Author identity unknown'. Pass GIT_AUTHOR_NAME/EMAIL and GIT_COMMITTER_NAME/EMAIL env vars.
- Refactor git temp repos into initTempRepo() helper using git config instead of env vars (resolves Critical finding) - Restore SOCRATICODE_LINKED_PROJECTS and SOCRATICODE_PROJECT_ID to original values in resolveLinkedCollections afterEach - Fix TypeScript 'never' type errors in query-tools mocks by adding explicit SearchResult[] return types
- Scope dedupe key to label::relativePath in mergeMultiCollectionResults so same file in different projects is not collapsed - Compute embedding once in searchMultipleCollections via internal searchChunksWithVector helper (avoids N redundant API calls) - Compare base project hashes (not branch-suffixed names) when skipping duplicate linked projects in resolveLinkedCollections - Use path.resolve() for platform-neutral assertion in query-tools tests - Update multi-collection-search tests for new cross-project dedup semantics
0d45215 to
f745d59
Compare
Summary
Adds cross-project search (linked projects) and branch-aware indexing — the shared foundation for issues #19 and #20.
Users can now search across multiple related repositories in a single query, and optionally maintain separate indexes per git branch for CI/CD and PR review workflows.
Changes
searchMultipleCollections()queries linked collections in parallel and merges results via client-side Reciprocal Rank Fusion, with deduplication by file path (current project wins).socraticode.jsonconfig file andSOCRATICODE_LINKED_PROJECTSenv var to define related projects.includeLinked: trueparameter oncodebase_searchactivates cross-project search. Results are tagged with[project-name]labelsSOCRATICODE_BRANCH_AWARE=trueappends the sanitized git branch name to collection IDs, creating isolated indexes per branch (e.g.codebase_abc123__feat_my-feature). Detached HEAD falls back to branchless ID. Ignored whenSOCRATICODE_PROJECT_IDis setSOCRATICODE_BRANCH_AWAREmodeType of change
Testing
npm run test:unit) — 755 tests, 31 filesnpm run test:integration) — not applicable (no Qdrant integration changes)npx tsc --noEmit).socraticode.jsonand env varChecklist
Related issues
Fixes #19
Fixes #20
Summary by CodeRabbit
New Features
Types
Documentation
Tests