Skip to content

feat: Add thoughtbox_repo tool for Code Context (Ephemeral Clone)#152

Open
glassBead-tc wants to merge 2 commits intomainfrom
feature/ephemeral-repo-tool
Open

feat: Add thoughtbox_repo tool for Code Context (Ephemeral Clone)#152
glassBead-tc wants to merge 2 commits intomainfrom
feature/ephemeral-repo-tool

Conversation

@glassBead-tc
Copy link
Member

code intelligence for TB

@chatgpt-codex-connector
Copy link

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@augmentcode
Copy link

augmentcode bot commented Mar 9, 2026

🤖 Augment PR Summary

Summary: Adds a new MCP tool (thoughtbox_repo) that provides read-only code context by shallow-cloning remote git repositories into ephemeral storage.

Changes:

  • Add RepoHandler plus Zod schemas for clone, read_file, list_directory, and grep_search operations
  • Implement a git-manager to manage temp clone directories, repo eviction, and safe path resolution
  • Register the new tool in src/server-factory.ts with a descriptive help string
  • Add a stdio-based test script to exercise clone/list/read/grep/path-traversal behavior
  • Update the Docker image to include git and grep needed by the repo operations

Technical Notes: Clones use git clone --depth 1 --single-branch and are keyed by a short hash-based repoId under a temp directory.

🤖 Was this summary useful? React with 👍 or 👎

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 6 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

// Clean up if we are holding too many repos in memory
enforceCapacityLimit(5);

const safeBranch = branch || 'main'; // Assume main but git clone will naturally default if we just omit it, though we pass it explicitly here if provided.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When branch is omitted, safeBranch defaults to 'main' for repoId generation, but the git clone call omits --branch and may clone a different default (e.g. master), which can cause cache collisions and misreported branch metadata.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

fs.rmSync(repoPath, { recursive: true, force: true });
}

console.error(`[GitManager] Cloning ${url} (branch: ${safeBranch}) into ${repoPath}`);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log prints the raw clone URL; if callers include PATs in the URL, credentials can end up persisted in logs. Same applies to other log lines that interpolate ${url}.

Severity: high

Other Locations
  • src/repo/git-manager.ts:98
  • src/repo/repo-handler.ts:47

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

try {
const branchFlag = branch ? `--branch ${branch} ` : '';
// Use --single-branch and --depth 1 to make it extremely fast
const cmd = `git clone --depth 1 --single-branch ${branchFlag}"${url}" "${repoPath}"`;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

execSync is invoked with a shell command string built from branch/url; a crafted value containing quotes or command substitutions could lead to command injection.

Severity: high

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

// instead of a simple fs.readdir to see the project shape better.
try {
const output = execSync(`find . -maxdepth 2 -not -path '*/.git/*' | sort`, {
cwd: targetPath,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleListDirectory doesn’t validate that targetPath is a directory; if a file path is passed, both find (via cwd) and the fallback readdirSync can throw, yielding an opaque error.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

const flags = `-rnE ${data.ignoreCase ? '-i' : ''}`;
const excludeFlag = `--exclude-dir=.git`; // skip the massive .git folder

const cmd = `grep ${flags} ${excludeFlag} "${data.pattern}" .`;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The grep command string interpolates user-controlled pattern into a shell command; patterns containing quotes or $() can escape the quoting and execute arbitrary commands.

Severity: high

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

} catch (error) {
console.error('❌ Test failed:', error);
} finally {
process.exit(0);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process.exit(0) runs even after a failure, so automated runs would report success even when the test failed.

Severity: low

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR introduces a new thoughtbox_repo MCP tool that enables LLM clients to shallow-clone a remote git repository into ephemeral /tmp storage and then browse it via read_file, list_directory, and grep_search operations. The implementation correctly uses execFileSync to avoid shell injection, applies path-traversal guards with resolveSafePath, scrubs PATs from most log and response output, and enforces a 5-repo capacity limit.

Two correctness/security issues remain after previous review rounds:

  • Cache key mismatch (git-manager.ts:57-58): When no branch is specified, safeBranch falls back to 'main' and becomes part of the cache key, but the actual git clone runs without --branch and fetches the remote default (which may be master, develop, etc.). A subsequent explicit clone call for branch: 'main' will receive a cache hit and return the wrong content.
  • PAT leaked in clone log (repo-handler.ts:47): The raw URL (potentially containing an embedded Personal Access Token) is written to stderr before any scrubbing, while every other log call in the codebase already uses the scrubbed form.
  • http:// scheme permitted (git-manager.ts:47): Plain HTTP is allowed alongside HTTPS, which means embedded PATs can be transmitted and logged in plaintext. Restricting to https:// alone is the safer default given that the schema description encourages PAT-in-URL usage.
  • Variable shadowing in handleGrepSearch (repo-handler.ts:143): The local const args array shadows the outer method parameter of the same name, which can confuse readers and linters.

Confidence Score: 3/5

  • Safe to merge after fixing the cache-key mismatch and PAT log leak; the remaining items are low-risk style issues.
  • The core security concerns flagged in prior rounds (shell injection, file:// SSRF, traversal) have been properly addressed. Two new issues lower confidence: the branch/cache-key mismatch is a correctness bug that silently returns wrong branch content, and the unscrubed PAT log in repo-handler.ts is a credential-hygiene regression given that git-manager.ts already scrubs the same URL. Neither is a critical vulnerability, but both should be fixed before merging to avoid surprising behaviour and accidental token exposure in production logs.
  • src/repo/git-manager.ts (cache key logic) and src/repo/repo-handler.ts (clone log scrubbing)

Important Files Changed

Filename Overview
src/repo/git-manager.ts Core cloning and repo lifecycle management; has a cache key mismatch bug when no branch is specified, and allows insecure http:// URLs that can expose PATs in plaintext.
src/repo/repo-handler.ts Dispatches MCP tool operations to git-manager and fs; leaks raw PAT-containing URLs in the clone log line and has a local variable shadowing the outer args parameter in handleGrepSearch.
src/repo/operations.ts Zod schemas for all repo operations; well-structured, though the CloneArgsSchema description explicitly encourages PAT-in-URL which amplifies the credential-leak risk elsewhere.
src/server-factory.ts Registers thoughtbox_repo tool using the new RepoHandler; integration is straightforward and correctly passes through isError.
Dockerfile Adds git and grep to the production image; correct placement in the production stage after dependency installation.
scripts/test-repo-tool.ts Manual integration test script; covers clone, list, read, grep, and path-traversal checks, though tests 4 and 5 have inconsistent indentation and the path-traversal test result doesn't fail the process exit code.

Sequence Diagram

sequenceDiagram
    participant Client as MCP Client
    participant SF as server-factory.ts
    participant RH as RepoHandler
    participant GM as GitManager
    participant FS as Filesystem (/tmp)
    participant Git as git (execFileSync)

    Client->>SF: callTool("thoughtbox_repo", { operation: "clone", args: { url, branch? } })
    SF->>RH: handle({ operation, args })
    RH->>RH: CloneArgsSchema.parse(args)
    RH->>GM: cloneRepository(url, branch?)
    GM->>GM: validate URL scheme (https/http only)
    GM->>GM: enforceCapacityLimit(5)
    GM->>GM: generateRepoId(url, safeBranch) → id
    alt Cache hit
        GM-->>RH: return cached ClonedRepo
    else Cache miss
        GM->>Git: execFileSync("git", ["clone", "--depth", "1", ...])
        Git->>FS: shallow clone → /tmp/thoughtbox-repos/{id}/
        GM->>GM: activeRepos.set(id, ClonedRepo)
        GM-->>RH: return ClonedRepo
    end
    RH-->>Client: { repoId, branch, url (scrubbed) }

    Client->>SF: callTool("thoughtbox_repo", { operation: "read_file", args: { repoId, path } })
    SF->>RH: handle(...)
    RH->>GM: resolveSafePath(repoId, path)
    GM->>GM: check ".." in path
    GM->>FS: realpathSync → absoluteTarget
    GM->>GM: assert absoluteTarget within repoBoundary
    GM-->>RH: safe absolute path
    RH->>FS: readFileSync(targetPath)
    RH-->>Client: file contents

    Client->>SF: callTool("thoughtbox_repo", { operation: "grep_search", args: { repoId, pattern, ... } })
    SF->>RH: handle(...)
    RH->>GM: resolveSafePath(repoId, path)
    RH->>Git: execFileSync("grep", ["-rnE", pattern, "."], { cwd: searchRoot })
    RH-->>Client: grep output or "No matches found."
Loading

Last reviewed commit: 4c2c4a8

const flags = `-rnE ${data.ignoreCase ? '-i' : ''}`;
const excludeFlag = `--exclude-dir=.git`; // skip the massive .git folder

const cmd = `grep ${flags} ${excludeFlag} "${data.pattern}" .`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shell injection via unsanitized data.pattern

data.pattern is interpolated directly into the shell command string inside double quotes. A caller who passes a pattern containing a double-quote character can break out of the quoting context and execute arbitrary shell commands on the server.

Example payload: pattern = 'foo" && cat /etc/passwd #' would produce:

grep -rnE -i --exclude-dir=.git "foo" && cat /etc/passwd #" .

…which executes cat /etc/passwd with full server privileges.

Use execFileSync (which never invokes a shell) and pass arguments as an array instead of building a string:

import { execFileSync } from 'node:child_process';

const grepArgs = [
    '-rnE',
    ...(data.ignoreCase ? ['-i'] : []),
    '--exclude-dir=.git',
    data.pattern,
    '.'
];

const output = execFileSync('grep', grepArgs, {
    cwd: searchRoot,
    encoding: 'utf-8',
    timeout: 10000
});

This eliminates the shell entirely so no amount of shell metacharacters in the pattern can cause injection.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/repo/repo-handler.ts
Line: 144

Comment:
**Shell injection via unsanitized `data.pattern`**

`data.pattern` is interpolated directly into the shell command string inside double quotes. A caller who passes a pattern containing a double-quote character can break out of the quoting context and execute arbitrary shell commands on the server.

Example payload: `pattern = 'foo" && cat /etc/passwd #'` would produce:

```
grep -rnE -i --exclude-dir=.git "foo" && cat /etc/passwd #" .
```

…which executes `cat /etc/passwd` with full server privileges.

Use `execFileSync` (which never invokes a shell) and pass arguments as an array instead of building a string:

```typescript
import { execFileSync } from 'node:child_process';

const grepArgs = [
    '-rnE',
    ...(data.ignoreCase ? ['-i'] : []),
    '--exclude-dir=.git',
    data.pattern,
    '.'
];

const output = execFileSync('grep', grepArgs, {
    cwd: searchRoot,
    encoding: 'utf-8',
    timeout: 10000
});
```

This eliminates the shell entirely so no amount of shell metacharacters in the pattern can cause injection.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +69 to +71
const branchFlag = branch ? `--branch ${branch} ` : '';
// Use --single-branch and --depth 1 to make it extremely fast
const cmd = `git clone --depth 1 --single-branch ${branchFlag}"${url}" "${repoPath}"`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shell injection via unquoted branch name and injectable url

Both the branch and url values are built directly into a shell command string, creating two injection vectors:

  1. branch is not quoted in branchFlag. A branch such as main --upload-pack=bash${IFS}-c${IFS}id or simply one containing a space or semicolon will inject shell tokens.
  2. url is wrapped in double quotes, but any double-quote character in the URL breaks out of that quoting context (e.g. https://example.com/" && malicious-cmd #).

Replace execSync with execFileSync and pass all arguments as an array so the shell is never involved:

import { execFileSync } from 'node:child_process';

const gitArgs = ['clone', '--depth', '1', '--single-branch'];
if (branch) {
    gitArgs.push('--branch', branch);
}
gitArgs.push(url, repoPath);

execFileSync('git', gitArgs, {
    stdio: 'pipe',
    timeout: 60000
});

This completely removes the shell from the execution path and makes both branch and url safe to pass without any escaping.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/repo/git-manager.ts
Line: 69-71

Comment:
**Shell injection via unquoted `branch` name and injectable `url`**

Both the `branch` and `url` values are built directly into a shell command string, creating two injection vectors:

1. `branch` is **not quoted** in `branchFlag`. A branch such as `main --upload-pack=bash${IFS}-c${IFS}id` or simply one containing a space or semicolon will inject shell tokens.
2. `url` is wrapped in double quotes, but any double-quote character in the URL breaks out of that quoting context (e.g. `https://example.com/" && malicious-cmd #`).

Replace `execSync` with `execFileSync` and pass all arguments as an array so the shell is never involved:

```typescript
import { execFileSync } from 'node:child_process';

const gitArgs = ['clone', '--depth', '1', '--single-branch'];
if (branch) {
    gitArgs.push('--branch', branch);
}
gitArgs.push(url, repoPath);

execFileSync('git', gitArgs, {
    stdio: 'pipe',
    timeout: 60000
});
```

This completely removes the shell from the execution path and makes both `branch` and `url` safe to pass without any escaping.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +43 to +71
export async function cloneRepository(url: string, branch?: string): Promise<ClonedRepo> {
ensureBaseDir();

// Clean up if we are holding too many repos in memory
enforceCapacityLimit(5);

const safeBranch = branch || 'main'; // Assume main but git clone will naturally default if we just omit it, though we pass it explicitly here if provided.
const id = generateRepoId(url, safeBranch);
const repoPath = path.join(REPO_BASE_DIR, id);

// If we already have this exact url+branch cloned, just return it (hot cache).
if (activeRepos.has(id) && fs.existsSync(repoPath)) {
console.error(`[GitManager] Repo ${id} already cloned. Using cache.`);
// Update timestamp to keep it alive
activeRepos.get(id)!.createdAt = new Date();
return activeRepos.get(id)!;
}

// Clear directory just in case there's stale data that wasn't tracked
if (fs.existsSync(repoPath)) {
fs.rmSync(repoPath, { recursive: true, force: true });
}

console.error(`[GitManager] Cloning ${url} (branch: ${safeBranch}) into ${repoPath}`);

try {
const branchFlag = branch ? `--branch ${branch} ` : '';
// Use --single-branch and --depth 1 to make it extremely fast
const cmd = `git clone --depth 1 --single-branch ${branchFlag}"${url}" "${repoPath}"`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No URL scheme validation allows server-side file access via file://

The url parameter is passed to git clone without any scheme check. Git supports the file:// protocol, so an attacker (or a misbehaving LLM client) can clone directly from the server's own filesystem:

clone({ url: "file:///etc" })

Git would happily copy /etc into the ephemeral directory, and read_file / grep_search could then expose any world-readable file on the host.

Validate the scheme before executing the clone:

const allowedSchemes = ['https://', 'http://', 'ssh://'];
const hasAllowedScheme = allowedSchemes.some(s => url.startsWith(s));
if (!hasAllowedScheme) {
    throw new Error(`Unsupported URL scheme. Only HTTPS/SSH git URLs are permitted.`);
}

If SSH is not needed, restricting to https:// alone is the tightest option.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/repo/git-manager.ts
Line: 43-71

Comment:
**No URL scheme validation allows server-side file access via `file://`**

The `url` parameter is passed to `git clone` without any scheme check. Git supports the `file://` protocol, so an attacker (or a misbehaving LLM client) can clone directly from the server's own filesystem:

```
clone({ url: "file:///etc" })
```

Git would happily copy `/etc` into the ephemeral directory, and `read_file` / `grep_search` could then expose any world-readable file on the host.

Validate the scheme before executing the clone:

```typescript
const allowedSchemes = ['https://', 'http://', 'ssh://'];
const hasAllowedScheme = allowedSchemes.some(s => url.startsWith(s));
if (!hasAllowedScheme) {
    throw new Error(`Unsupported URL scheme. Only HTTPS/SSH git URLs are permitted.`);
}
```

If SSH is not needed, restricting to `https://` alone is the tightest option.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +114 to +131
function enforceCapacityLimit(maxRepos: number) {
if (activeRepos.size >= maxRepos) {
// Find the oldest
let oldestId: string | null = null;
let oldestDate = new Date();

for (const [id, repo] of activeRepos.entries()) {
if (repo.createdAt < oldestDate) {
oldestDate = repo.createdAt;
oldestId = id;
}
}

if (oldestId) {
removeRepository(oldestId);
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capacity limit fails to evict when all repos share the same timestamp

oldestDate is initialised to new Date() (the current moment). The loop only enters the if body when repo.createdAt < oldestDate. If all five repos were created at exactly the same millisecond, or if the system clock has low resolution and they all share the same timestamp, no repo satisfies the condition, oldestId stays null, eviction is skipped, and the sixth repo is added unchecked — allowing the map to grow beyond the stated limit indefinitely.

Initialise oldestDate to Infinity (or the max Date value) to ensure the first repo always wins:

function enforceCapacityLimit(maxRepos: number) {
    if (activeRepos.size >= maxRepos) {
        let oldestId: string | null = null;
        let oldestTime = Infinity;

        for (const [id, repo] of activeRepos.entries()) {
            const t = repo.createdAt.getTime();
            if (t < oldestTime) {
                oldestTime = t;
                oldestId = id;
            }
        }

        if (oldestId) {
            removeRepository(oldestId);
        }
    }
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/repo/git-manager.ts
Line: 114-131

Comment:
**Capacity limit fails to evict when all repos share the same timestamp**

`oldestDate` is initialised to `new Date()` (the current moment). The loop only enters the `if` body when `repo.createdAt < oldestDate`. If all five repos were created at exactly the same millisecond, or if the system clock has low resolution and they all share the same timestamp, no repo satisfies the condition, `oldestId` stays `null`, eviction is skipped, and the sixth repo is added unchecked — allowing the map to grow beyond the stated limit indefinitely.

Initialise `oldestDate` to `Infinity` (or the max Date value) to ensure the first repo always wins:

```typescript
function enforceCapacityLimit(maxRepos: number) {
    if (activeRepos.size >= maxRepos) {
        let oldestId: string | null = null;
        let oldestTime = Infinity;

        for (const [id, repo] of activeRepos.entries()) {
            const t = repo.createdAt.getTime();
            if (t < oldestTime) {
                oldestTime = t;
                oldestId = id;
            }
        }

        if (oldestId) {
            removeRepository(oldestId);
        }
    }
}
```

How can I resolve this? If you propose a fix, please make it concise.

fs.rmSync(repoPath, { recursive: true, force: true });
}

console.error(`[GitManager] Cloning ${url} (branch: ${safeBranch}) into ${repoPath}`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PAT credential exposed in log output

The raw url is logged to stderr without any scrubbing. The schema's own description explicitly encourages callers to embed Personal Access Tokens directly in the URL (e.g. an HTTPS URL with user:token@host). This means the full credential will appear in Cloud Run logs — or any other log sink — where it is retained and searchable.

Scrub the credential before logging, just as the error path and the clone response already attempt to do:

const safeUrl = url.replace(/:[^:@]+@/g, ':***@');
console.error(`[GitManager] Cloning ${safeUrl} (branch: ${safeBranch}) into ${repoPath}`);

Also note that the scrubbing regex on line 97 (replace(/:[^:@]+@github.com/...)) only matches github.com. Tokens for GitLab, Bitbucket, or self-hosted instances are still exposed in error-path logs. The broader pattern :[^:@]+@ handles all hosts.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/repo/git-manager.ts
Line: 66

Comment:
**PAT credential exposed in log output**

The raw `url` is logged to stderr without any scrubbing. The schema's own description explicitly encourages callers to embed Personal Access Tokens directly in the URL (e.g. an HTTPS URL with `user:token@host`). This means the full credential will appear in Cloud Run logs — or any other log sink — where it is retained and searchable.

Scrub the credential before logging, just as the error path and the clone response already attempt to do:

```typescript
const safeUrl = url.replace(/:[^:@]+@/g, ':***@');
console.error(`[GitManager] Cloning ${safeUrl} (branch: ${safeBranch}) into ${repoPath}`);
```

Also note that the scrubbing regex on line 97 (`replace(/:[^:@]+@github.com/...)`) only matches `github.com`. Tokens for GitLab, Bitbucket, or self-hosted instances are still exposed in error-path logs. The broader pattern `:[^:@]+@` handles all hosts.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +57 to +58
const safeBranch = branch || 'main'; // Assume main but git clone will naturally default if we just omit it, though we pass it explicitly here if provided.
const id = generateRepoId(url, safeBranch);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache key mismatch when no branch is specified

safeBranch is set to 'main' as a fallback when branch is undefined, and this value is used to generate the cache id. However, the actual git clone command only passes --branch when branch is explicitly provided (line 78: if (branch)). This creates a subtle but real correctness bug:

  1. A caller invokes cloneRepository(url, undefined) on a repo whose default branch is master. Git clones master, but the cache entry is keyed as url:main.
  2. A second caller invokes cloneRepository(url, 'main') on the same URL. The cache key resolves to the same url:main hash → cache hit returns the master branch content, silently serving the wrong branch.

The fix is to decouple the fallback label used for in-memory metadata from the cache key used for identity. Only use a hard-coded branch fallback in the cache key if no branch was explicitly requested, or better, use a distinct sentinel such as '__default__' for "use whatever the remote default is":

// Use a stable sentinel for the cache key when no branch is specified.
// This prevents conflating 'no preference' with 'explicitly wants main'.
const cacheKeyBranch = branch ?? '__default__';
const id = generateRepoId(url, cacheKeyBranch);

And keep safeBranch only for the metadata label stored in ClonedRepo.branch.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/repo/git-manager.ts
Line: 57-58

Comment:
**Cache key mismatch when no branch is specified**

`safeBranch` is set to `'main'` as a fallback when `branch` is `undefined`, and this value is used to generate the cache `id`. However, the actual `git clone` command only passes `--branch` when `branch` is explicitly provided (line 78: `if (branch)`). This creates a subtle but real correctness bug:

1. A caller invokes `cloneRepository(url, undefined)` on a repo whose default branch is `master`. Git clones `master`, but the cache entry is keyed as `url:main`.
2. A second caller invokes `cloneRepository(url, 'main')` on the same URL. The cache key resolves to the same `url:main` hash → **cache hit returns the `master` branch content**, silently serving the wrong branch.

The fix is to decouple the fallback label used for in-memory metadata from the cache key used for identity. Only use a hard-coded branch fallback in the cache key if no branch was explicitly requested, or better, use a distinct sentinel such as `'__default__'` for "use whatever the remote default is":

```typescript
// Use a stable sentinel for the cache key when no branch is specified.
// This prevents conflating 'no preference' with 'explicitly wants main'.
const cacheKeyBranch = branch ?? '__default__';
const id = generateRepoId(url, cacheKeyBranch);
```

And keep `safeBranch` only for the metadata label stored in `ClonedRepo.branch`.

How can I resolve this? If you propose a fix, please make it concise.


try {
// Use egrep (grep -E) format.
const args = ['-rnE'];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable args shadows outer function parameter

const args = ['-rnE'] declares a local variable with the same name as the method parameter args?: Record<string, unknown>. While TypeScript allows this and the inner args is never confused for the outer one in practice, the shadowing can be confusing to readers and will trigger most linters (e.g. no-shadow).

Suggested change
const args = ['-rnE'];
const grepArgs = ['-rnE'];

(And update the subsequent references to args.push(...) and execFileSync('grep', args, ...) accordingly.)

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/repo/repo-handler.ts
Line: 143

Comment:
**Variable `args` shadows outer function parameter**

`const args = ['-rnE']` declares a local variable with the same name as the method parameter `args?: Record<string, unknown>`. While TypeScript allows this and the inner `args` is never confused for the outer one in practice, the shadowing can be confusing to readers and will trigger most linters (e.g. `no-shadow`).

```suggestion
            const grepArgs = ['-rnE'];
```

(And update the subsequent references to `args.push(...)` and `execFileSync('grep', args, ...)` accordingly.)

How can I resolve this? If you propose a fix, please make it concise.


private async handleClone(args?: Record<string, unknown>) {
const data = CloneArgsSchema.parse(args || {});
console.error(`[RepoHandler] Initiating clone for ${data.url}`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PAT credential leaked before scrubbing

data.url is logged to stderr in its raw form here, before any scrubbing. Since the tool schema explicitly encourages embedding Personal Access Tokens inline in the URL, those tokens will appear in Cloud Run logs (or any log sink) every time a clone operation is initiated.

The git-manager.ts already demonstrates the correct pattern — scrub the URL before logging:

Suggested change
console.error(`[RepoHandler] Initiating clone for ${data.url}`);
console.error(`[RepoHandler] Initiating clone for ${data.url.replace(/:\/\/[^@]+@/, '://***@')}`);
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/repo/repo-handler.ts
Line: 47

Comment:
**PAT credential leaked before scrubbing**

`data.url` is logged to stderr in its raw form here, before any scrubbing. Since the tool schema explicitly encourages embedding Personal Access Tokens inline in the URL, those tokens will appear in Cloud Run logs (or any log sink) every time a `clone` operation is initiated.

The `git-manager.ts` already demonstrates the correct pattern — scrub the URL before logging:

```suggestion
        console.error(`[RepoHandler] Initiating clone for ${data.url.replace(/:\/\/[^@]+@/, '://***@')}`);
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +46 to +49
// Prevent SSRF: Only allow HTTP/HTTPS, reject file://
if (!url.startsWith('http://') && !url.startsWith('https://')) {
throw new Error(`Invalid URL scheme. Only http:// and https:// are supported.`);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http:// scheme allows PATs to be sent in plaintext

The URL scheme check permits http:// alongside https://. The tool's own schema description actively encourages embedding PATs inline in the URL. Allowing http:// means those credentials can be transmitted in plaintext over the network, making them trivially interceptable.

Unless there is a specific requirement for plain-HTTP corporate/internal git servers, restricting to https:// only is the safer default:

Suggested change
// Prevent SSRF: Only allow HTTP/HTTPS, reject file://
if (!url.startsWith('http://') && !url.startsWith('https://')) {
throw new Error(`Invalid URL scheme. Only http:// and https:// are supported.`);
}
if (!url.startsWith('https://')) {
throw new Error(`Invalid URL scheme. Only https:// is supported.`);
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/repo/git-manager.ts
Line: 46-49

Comment:
**`http://` scheme allows PATs to be sent in plaintext**

The URL scheme check permits `http://` alongside `https://`. The tool's own schema description actively encourages embedding PATs inline in the URL. Allowing `http://` means those credentials can be transmitted in plaintext over the network, making them trivially interceptable.

Unless there is a specific requirement for plain-HTTP corporate/internal git servers, restricting to `https://` only is the safer default:

```suggestion
    if (!url.startsWith('https://')) {
        throw new Error(`Invalid URL scheme. Only https:// is supported.`);
    }
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant