
feat: refactor architecture to identify snippets #14

Merged
mattiacerutti merged 10 commits into main from feat/db-refactor
Nov 18, 2025

Conversation

@mattiacerutti (Owner) commented Nov 18, 2025

Summary by CodeRabbit

  • New Features

    • Added automated code snippet extraction, formatting and quality filters.
    • Added GitHub-backed seeding to populate files, file versions, and snippets.
  • Refactor

    • Restructured database to track Files → FileVersions → Snippets for per-commit history.
    • Simplified snippet retrieval pipeline and updated filtering/selection logic.


coderabbitai bot commented Nov 18, 2025

Walkthrough

Adds a GitHub API client with retries and rate-limit handling, a Tree-sitter–based snippet extraction/processing library, a new hierarchical Prisma data model (File → FileVersion → Snippet), revamped seed procedures that fetch and persist files/snippets, and replaces legacy CLI/scripts with integrated seed modules and a Prisma random-extension integration.

Changes

Cohort / File(s) Summary
GitHub API client
lib/github.ts
New module exporting callGitHubApi(path, options) and GitHubRequestOptions with token handling, URL builder, retry/shouldRetry hooks, rate-limit detection/handling, and wait/backoff utilities.
Snippet library
lib/snippets/config.ts, lib/snippets/index.ts, lib/snippets/parsing/parser.ts, lib/snippets/processing/*
New modular snippet pipeline: config constants, getSnippetsFromFile, Tree-sitter parsing (extractSnippets, node utilities), formatting (formatCode), filtering, indentation helpers, and hashSnippet (SHA‑256).
Prisma schema & migrations
prisma/schema.prisma, prisma/migrations/...
Replaces flat Snippet model with hierarchical models: File, FileVersion, Snippet; adds migration creating Language/File/FileVersion/Snippet tables, indexes, and FK constraints; removes older initial migrations.
Prisma client ext.
src/shared/db/prisma.ts, package.json
Adds prisma-extension-random; factory createPrisma() extends PrismaClient with random extension; exports prisma as ExtendedPrismaClient and new type alias.
Seed orchestration
prisma/seed.ts, prisma/seed/files.ts, prisma/seed/snippets.ts, prisma/seed/languages.ts
Centralized seeding: seedFiles queries GitHub (uses callGitHubApi) to create File+FileVersion records; seedSnippets fetches raw files, extracts snippets via library, inserts snippets, and marks parsedAt; languages seed logs updated.
Removed legacy scripts/utilities
scripts/seed-*.ts, scripts/utils/snippet-helpers.ts, scripts/utils/cli.ts
Deleted CLI seed scripts and helper utilities (flag parsing, old snippet fetch/seed helpers).
App snippet service & repos
src/features/snippets/...
Reworked snippet service/repository: removed URL-based fetching and server-side processing; added findRandomSnippets using Prisma random extension; new filter module; removed processSnippets and SnippetSourceFile type.
Build config
tsconfig.json
Adds path alias @lib/* mapping to ./lib/*.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Seed as Seed Pipeline
    participant GH as GitHub API (callGitHubApi)
    participant Parser as Snippet Library (Tree‑sitter)
    participant DB as Prisma (Extended)
    participant Service as Snippet Service

    Seed->>GH: Search code per language (paginated)
    GH-->>Seed: List of files (repo, path, sha, commit)
    Seed->>DB: Create File + FileVersion records

    Seed->>GH: Fetch raw content for unparsed FileVersions
    GH-->>Seed: File content
    Seed->>Parser: getSnippetsFromFile(content, languageId)
    Parser-->>Seed: snippet texts[]

    Seed->>DB: Insert Snippet rows (hash, content, rawContent) and set parsedAt
    DB-->>Seed: persisted

    Service->>DB: findRandomSnippets(languageId, qty)
    DB-->>Service: random snippets (id, content)
    Service->>Service: filterSnippets(content) → compute disabledRanges
    Service-->>Client: final shuffled snippets

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Pay special attention to:
    • lib/github.ts: retry/shouldRetry behavior, rate-limit parsing (X-RateLimit-Reset, Retry-After) and timing edge cases.
    • prisma/seed/files.ts and seed/snippets.ts: GitHub query correctness, pagination, concurrency (p-limit), error handling for 404/400 and unique constraint handling.
    • prisma/schema + migrations: FK cascade semantics, unique/index constraints (repository+path, fileId+sha, snippet sha).
    • src/shared/db/prisma.ts: correct extension via $extends and type compatibility of ExtendedPrismaClient and global singleton usage.
    • Replaced/removed APIs: ensure callers updated to new function names/types (findRandomSnippets, removed processSnippets/getFilesFromUrls/etc.).
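When reviewing the rate-limit timing edge cases flagged for lib/github.ts, it can help to compare against a reference sketch. The helper below is hypothetical (getRateLimitWaitMs is not a function in this PR); it only illustrates the Retry-After / X-RateLimit-Reset precedence and the clock-skew edge case worth checking for.

```typescript
// Hypothetical sketch, not the code under review: derive a backoff delay
// from GitHub rate-limit headers. Retry-After carries seconds to wait;
// X-RateLimit-Reset carries an epoch timestamp in seconds.
function getRateLimitWaitMs(
  headers: { retryAfter?: string; rateLimitReset?: string },
  nowMs: number,
): number {
  if (headers.retryAfter !== undefined) {
    const seconds = Number(headers.retryAfter);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }
  if (headers.rateLimitReset !== undefined) {
    const resetMs = Number(headers.rateLimitReset) * 1000;
    // Clock-skew guard: never return a negative wait.
    return Math.max(0, resetMs - nowMs);
  }
  // Neither header present: fixed fallback backoff.
  return 1000;
}
```

Passing `nowMs` explicitly keeps the function deterministic and easy to test, which is exactly the property that makes the "timing edge cases" above verifiable.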

Possibly related PRs

Poem

🐰 I hopped through repos bright and far,

built retries and a rate-limit star,
parsed each node with Tree‑sitter's art,
gave files versions and snippets a start,
now seeds and hashes play their part. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve it.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the title accurately reflects the main architectural refactoring: shifting from a simple snippet model to a hierarchical file-versioning system with GitHub integration for snippet identification.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/db-refactor

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (13)
src/shared/db/prisma.ts (1)

4-6: Global Prisma singleton pattern is sound; optional type/ordering tweak

The globalThis cast plus caching in non‑production environments follows the standard Prisma singleton pattern and should avoid connection churn during hot reloads. If you want slightly clearer typings, you could optionally reorder to define ExtendedPrismaClient before the globalForPrisma cast, so the type alias isn’t forward‑referenced.

If you reuse this pattern elsewhere, ensure all imports reference ExtendedPrismaClient (or the prisma export) rather than PrismaClient directly so types remain consistent with the extended client.

Also applies to: 16-22

lib/snippets/processing/utils.ts (1)

27-32: Remove redundant trim() call.

Line 29 removes all whitespace using replace(/\s+/g, ""), making the subsequent .trim() call redundant since there's no whitespace left to trim.

Apply this diff:

-  const newContent = content.replace(/\s+/g, "").trim();
+  const newContent = content.replace(/\s+/g, "");
lib/snippets/processing/filter.ts (1)

25-28: Clarify intent of filterTabsInBetween and its interaction with filterSnippets

Right now filterTabsInBetween returns true only if the snippet contains either a literal \t sequence or an actual tab character, and filterSnippets requires this to be true. That means any snippet with no tabs at all (including multi‑line, non‑indented code) will always be rejected.

If the goal is:

  • “Reject snippets that contain tabs/\t literals” → invert the predicate in filterTabsInBetween.
  • “Require at least one indented line using tabs” → the current behavior is closer, but consider:
    • Renaming the helper (e.g. hasTabsIndentation) or adding a short comment explaining the requirement.
    • Simplifying the regex (e.g., snippet.trim().includes("\t") if you only care about actual tab characters).

Please confirm which behavior you want so this filter doesn’t silently drop more snippets than intended.

Also applies to: 34-35
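For reference, the two readings can be expressed as small predicates. Both helpers below are illustrative stand-ins with made-up names (the actual filterTabsInBetween regex is not reproduced here):

```typescript
// Reading 1 (closer to the current behavior as described above):
// pass when the snippet contains a real tab or a literal "\t" sequence.
const hasTabsIndentation = (snippet: string): boolean => /\\t|\t/.test(snippet);

// Reading 2 (the simplified variant suggested above):
// pass only when a real tab survives trimming, i.e. a tab "in between".
const hasTabInBetween = (snippet: string): boolean => snippet.trim().includes("\t");
```

Note the behavioral gap: a snippet whose only tab is leading or trailing passes Reading 1 but fails Reading 2.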

lib/snippets/processing/process.ts (1)

11-14: hasProhibitedCharacters is very strict (no backslashes, no CR/other whitespace)

The regex /((?:(?![ \t\n])\s)|\\)/ rejects:

  • Any whitespace other than space, tab, and \n (e.g. \r, \f, NBSP).
  • Any backslash \ anywhere in the snippet.

This will drop, for example, CRLF‑encoded files, Windows paths, string/regex literals with escapes, etc. If that’s intentional, it’s fine; otherwise you may want to narrow the rule (e.g. only reject specific problematic characters) instead of all backslashes and non‑standard whitespace.

Consider documenting the intent, or relaxing the regex if you expect such snippets.
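To make the rule's reach concrete, here is the regex exercised standalone (copied from the comment above; the sample strings are made-up examples, not project code):

```typescript
// Matches: any whitespace other than space, tab, or "\n"; or any backslash.
const prohibited = /((?:(?![ \t\n])\s)|\\)/;

// Rejected by the filter:
const windowsPath = 'const p = "C:\\\\temp";'; // contains backslashes
const crlfFile = "line1\r\nline2";             // contains a carriage return

// Accepted by the filter:
const plain = "if (x > 0) {\n\treturn x;\n}";
```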

prisma/seed/snippets.ts (2)

27-29: Consider not throwing when there are no unparsed files

When unparsedFiles.length === 0 you currently throw:

throw new Error("All files are parsed. No snippets to seed.");

For an idempotent seed pipeline it’s often more convenient to just log and return early, so re‑running seeds doesn’t fail:

-  if (unparsedFiles.length === 0) {
-    throw new Error("All files are parsed. No snippets to seed.");
-  }
+  if (unparsedFiles.length === 0) {
+    console.log("All files are parsed. No snippets to seed.");
+    return;
+  }

31-86: DB operations run with potentially high concurrency

p-limit caps concurrent fetches to 100, but:

  • getSnippetsFromFile runs for all unparsed files in parallel.
  • For each file, snippet inserts use Promise.all(snippets.map(...)) without any concurrency limit.
  • All fileVersion updates also run under the outer Promise.all.

On large datasets this can produce a high number of concurrent DB calls. If you expect many file versions/snippets, consider:

  • Moving the limit to wrap the whole per‑file task instead of just fetch, or
  • Using another p-limit for snippet inserts per file, or
  • Chunking unparsedFiles into batches.

That will keep seeding friendlier to the DB without changing behavior.
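Of those options, batching is the simplest to sketch. The helper below is generic and hypothetical (chunk is not part of this PR); the commented usage shows where it could slot into the seed loop:

```typescript
// Hypothetical batching helper: split work into fixed-size groups so each
// group's fetches, inserts, and updates finish before the next one starts.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Usage sketch (names illustrative, not the actual seed code):
// for (const batch of chunk(unparsedFiles, 25)) {
//   await Promise.all(batch.map((file) => processFileVersion(file)));
// }
```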

lib/snippets/config.ts (1)

1-27: Config values look good; consider typing VALID_NODES more strongly

The thresholds and node type list are clear and centralized. To get better type‑safety where these node types are used, you could mark the array as const and optionally expose a union type:

-export const VALID_NODES = [
+export const VALID_NODES = [
   "function_declaration",
   "function_definition",
   "class_declaration",
   "class_definition",
   "function_item",
   "method_declaration",
   "method_definition",
   "method_item",
   "class_item",
   "interface_declaration",
   "interface_definition",
   "interface_item",
   "module",
   "switch_statement",
   "while_statement",
   "for_statement",
   "if_statement",
-];
+] as const;
+
+export type ValidNodeType = (typeof VALID_NODES)[number];

This is optional but helps prevent typos in node type strings elsewhere.

src/features/snippets/services/get-random-snippets.server.ts (2)

11-26: Handle duplicates and optionally cap to the minimum number of snippets

The loop accumulates snippets until snippets.length >= MIN_SNIPPETS_NUMBER or attempts are exhausted, but:

  • The same snippet can be returned multiple times by findRandomSnippets across attempts; there’s no deduplication by snippet ID/content.
  • Once you exceed MIN_SNIPPETS_NUMBER, you keep all accumulated snippets; callers might expect “exactly N” rather than “at least N”.

If you care about uniqueness and an exact count, consider something like:

const bySha = new Map<string, ISnippet>();

// when adding:
bySha.set(snippetSha, snippet);

// after the loop:
const uniqueSnippets = Array.from(bySha.values()).slice(0, MIN_SNIPPETS_NUMBER);

This would avoid duplicates and cap the result while preserving your attempt logic.


30-35: Shuffling via sort(() => Math.random() - 0.5) is simple but biased

Using array.sort(() => Math.random() - 0.5) works for small datasets, but it’s known to produce a biased distribution and relies on engine‑specific sort behavior.

If you want a more robust shuffle, a small Fisher–Yates implementation is safer:

function shuffle<T>(arr: T[]): T[] {
  const result = [...arr];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// ...
return shuffle(snippets);

Not critical, but more predictable.

lib/snippets/parsing/parser.ts (1)

6-18: Avoid repeated array concatenations in findValidNodes

The recursive nodes = nodes.concat(findValidNodes(child)) pattern builds many intermediate arrays and can degrade to O(n²) on large trees. An iterative DFS or passing an accumulator array would be more efficient while keeping behavior identical.

You could refactor along these lines:

export function findValidNodes(root: IParser.SyntaxNode): IParser.SyntaxNode[] {
  const result: IParser.SyntaxNode[] = [];
  const stack: IParser.SyntaxNode[] = [root];

  while (stack.length > 0) {
    const node = stack.pop()!;
    if (isValidNode(node)) {
      result.push(node);
    }
    // children are never null in tree-sitter, so no null check needed
    for (const child of node.children) {
      stack.push(child);
    }
  }

  return result;
}

Please confirm this matches your expectations with a few representative trees from tree-sitter.

prisma/seed/files.ts (2)

39-47: Align log messages with what’s actually being seeded

This function seeds File/FileVersion records, but the logs say “snippets”:

console.log("All languages already have snippets.");
console.log(`Seeding snippets for missing languages: ${missingLanguageIds.join(", ")}`);

For clarity when running seeds, consider updating these to talk about “files” or “tracked files” instead.


85-96: Add an explicit type for files in getFilesForLanguage

Right now files is inferred as any[]:

const files = [];

Given the function’s return type is Promise<FileMetadata[]>, it’s better to enforce that locally:

async function getFilesForLanguage(
  language: ILanguage,
  {pagesPerLanguage = 10}: {pagesPerLanguage?: number} = {},
): Promise<FileMetadata[]> {
  const files: FileMetadata[] = [];
  // ...
}

This will catch accidental pushes of incorrectly shaped data at compile time.

lib/github.ts (1)

14-70: GitHub client structure looks solid; consider enriching the final error

The retry + rate‑limit handling in callGitHubApi is well structured and should be easy to reuse.

One small improvement: the final error after exhausting attempts currently doesn’t expose the last status code or body, which can make debugging harder:

throw new Error(`GitHub API request failed after ${attempts} attempts for ${url}`);

You could capture the last response status/text (and maybe a truncated body) and include it:

let lastStatus: number | undefined;
let lastStatusText: string | undefined;

for (let attempt = 1; attempt <= attempts; attempt += 1) {
  const response = await fetch(url, { ...fetchOptions, headers });
  lastStatus = response.status;
  lastStatusText = response.statusText;
  // existing logic...
}

throw new Error(
  `GitHub API request failed after ${attempts} attempts for ${url} (last status: ${lastStatus} ${lastStatusText})`,
);

This keeps behavior the same while giving more context when things go wrong.

Please re-check TypeScript types for fetch/Response in your runtime (Node vs. other) after this change.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ffa042 and 67c87ab.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (31)
  • lib/github.ts (1 hunks)
  • lib/snippets/config.ts (1 hunks)
  • lib/snippets/index.ts (1 hunks)
  • lib/snippets/parsing/parser.ts (1 hunks)
  • lib/snippets/processing/filter.ts (1 hunks)
  • lib/snippets/processing/indentation.ts (1 hunks)
  • lib/snippets/processing/process.ts (1 hunks)
  • lib/snippets/processing/utils.ts (2 hunks)
  • package.json (1 hunks)
  • prisma/migrations/20250402143241_init/migration.sql (0 hunks)
  • prisma/migrations/20250402153753_add_languageid_index/migration.sql (0 hunks)
  • prisma/migrations/20251117225653_init/migration.sql (1 hunks)
  • prisma/schema.prisma (1 hunks)
  • prisma/seed.ts (1 hunks)
  • prisma/seed/files.ts (1 hunks)
  • prisma/seed/languages.ts (1 hunks)
  • prisma/seed/snippets.ts (1 hunks)
  • scripts/seed-missing-snippets.ts (0 hunks)
  • scripts/seed-snippets.ts (0 hunks)
  • scripts/utils/cli.ts (0 hunks)
  • scripts/utils/snippet-helpers.ts (0 hunks)
  • src/features/snippets/config/snippets.server.ts (1 hunks)
  • src/features/snippets/infrastructure/adapters/snippet-fetch.server.ts (0 hunks)
  • src/features/snippets/infrastructure/repositories/snippet.repository.server.ts (1 hunks)
  • src/features/snippets/logic/filter.ts (1 hunks)
  • src/features/snippets/logic/parsing/snippet-parser.server.ts (1 hunks)
  • src/features/snippets/logic/processing/snippet-process.server.ts (0 hunks)
  • src/features/snippets/services/get-random-snippets.server.ts (1 hunks)
  • src/features/snippets/types/snippet-source.ts (0 hunks)
  • src/shared/db/prisma.ts (1 hunks)
  • tsconfig.json (1 hunks)
💤 Files with no reviewable changes (9)
  • src/features/snippets/logic/processing/snippet-process.server.ts
  • src/features/snippets/types/snippet-source.ts
  • scripts/utils/cli.ts
  • scripts/seed-snippets.ts
  • scripts/seed-missing-snippets.ts
  • prisma/migrations/20250402153753_add_languageid_index/migration.sql
  • prisma/migrations/20250402143241_init/migration.sql
  • scripts/utils/snippet-helpers.ts
  • src/features/snippets/infrastructure/adapters/snippet-fetch.server.ts
🧰 Additional context used
🧬 Code graph analysis (12)
lib/snippets/processing/indentation.ts (3)
src/features/snippets/logic/processing/text.server.ts (2)
  • countInitialWhitespaces (1-11)
  • removeInitialCharacters (13-23)
src/features/snippets/logic/processing/snippet-process.server.ts (3)
  • cleanText (16-21)
  • formatCode (28-66)
  • line (47-57)
src/features/snippets/logic/processing/indentation.server.ts (2)
  • applyIndentationToEmptyLines (73-90)
  • adjustIndentationOffset (47-52)
lib/snippets/processing/process.ts (2)
lib/snippets/processing/indentation.ts (3)
  • detectIndentationStyle (4-45)
  • adjustIndentationOffset (47-52)
  • applyIndentationToEmptyLines (73-90)
lib/snippets/processing/utils.ts (1)
  • countInitialWhitespaces (3-13)
src/features/snippets/logic/parsing/snippet-parser.server.ts (1)
src/features/snippets/logic/processing/snippet-process.server.ts (2)
  • processSnippets (68-85)
  • snippet (75-82)
lib/snippets/parsing/parser.ts (4)
lib/snippets/processing/filter.ts (1)
  • isValidNode (30-32)
lib/snippets/processing/indentation.ts (1)
  • getInitialIndentation (54-71)
lib/snippets/parsing/parser-factory.ts (1)
  • getTSParser (19-27)
src/features/snippets/logic/processing/snippet-process.server.ts (1)
  • processSnippets (68-85)
src/features/snippets/logic/filter.ts (3)
src/features/snippets/config/snippets.server.ts (5)
  • MAX_LINES (1-1)
  • MIN_LINES (2-2)
  • MAX_LINE_LENGTH (4-4)
  • MIN_SNIPPET_LENGTH (7-7)
  • MAX_SNIPPET_LENGTH (6-6)
src/features/snippets/logic/processing/snippet-process.server.ts (5)
  • filterSnippets (7-14)
  • snippet (10-10)
  • snippet (11-11)
  • snippet (13-13)
  • snippet (12-12)
src/features/snippets/logic/processing/snippet-filters.server.ts (5)
  • filterLineLength (9-12)
  • filterLinesNumber (4-7)
  • filterSnippetLength (14-16)
  • filterTabsInBetween (25-28)
  • filterSnippetSpecialCharacters (18-23)
lib/snippets/index.ts (4)
lib/snippets/parsing/parser.ts (1)
  • extractSnippets (26-42)
lib/snippets/processing/utils.ts (1)
  • hashSnippet (27-32)
lib/snippets/processing/process.ts (1)
  • formatCode (16-54)
lib/snippets/processing/filter.ts (1)
  • filterSnippets (34-36)
prisma/seed/snippets.ts (2)
src/shared/db/prisma.ts (1)
  • prisma (18-18)
lib/snippets/index.ts (1)
  • getSnippetsFromFile (12-26)
lib/snippets/processing/filter.ts (1)
lib/snippets/config.ts (6)
  • MAX_LINES (1-1)
  • MIN_LINES (2-2)
  • MAX_LINE_LENGTH (4-4)
  • MIN_SNIPPET_LENGTH (7-7)
  • MAX_SNIPPET_LENGTH (6-6)
  • VALID_NODES (9-27)
prisma/seed/files.ts (3)
prisma/seed/languages.ts (1)
  • SUPPORTED_LANGUAGES (4-59)
src/shared/types/language.ts (1)
  • ILanguage (1-6)
lib/github.ts (1)
  • callGitHubApi (14-70)
prisma/seed.ts (3)
prisma/seed/languages.ts (1)
  • seedLanguages (61-81)
prisma/seed/files.ts (1)
  • seedFiles (26-53)
prisma/seed/snippets.ts (1)
  • seedSnippets (8-89)
src/features/snippets/infrastructure/repositories/snippet.repository.server.ts (1)
src/shared/db/prisma.ts (1)
  • prisma (18-18)
src/features/snippets/services/get-random-snippets.server.ts (5)
src/shared/types/snippet.server.ts (1)
  • ISnippet (1-4)
src/features/snippets/config/snippets.server.ts (3)
  • MIN_SNIPPETS_NUMBER (29-29)
  • MAX_GET_SNIPPETS_ATTEMPTS (31-31)
  • SNIPPETS_RETRIEVED_PER_QUERY (30-30)
src/features/snippets/infrastructure/repositories/snippet.repository.server.ts (1)
  • findRandomSnippets (4-20)
src/features/snippets/logic/filter.ts (1)
  • filterSnippets (29-31)
src/features/snippets/logic/parsing/snippet-parser.server.ts (1)
  • extractAutoCompleteDisabledRanges (4-51)
🪛 Biome (2.1.2)
src/features/snippets/logic/filter.ts

[error] 18-18: Unexpected control character in a regular expression.

Control characters are unusual and potentially incorrect inputs, so they are disallowed.

(lint/suspicious/noControlCharactersInRegex)

lib/snippets/processing/filter.ts

[error] 19-19: Unexpected control character in a regular expression.

Control characters are unusual and potentially incorrect inputs, so they are disallowed.

(lint/suspicious/noControlCharactersInRegex)

🔇 Additional comments (18)
package.json (1)

32-32: Runtime dependency addition matches prisma client extension usage

Adding prisma-extension-random as a regular dependency aligns with its runtime use in src/shared/db/prisma.ts and keeps the dependency graph consistent with the new Prisma client extension pattern.

Please double‑check that prisma-extension-random@^0.2.2 is compatible with @prisma/client@^6.5.0 and your target Node.js 22.11.0 runtime.

src/shared/db/prisma.ts (1)

2-2: Prisma factory and extension wiring look correct

Using a createPrisma factory that instantiates PrismaClient once and immediately extends it with prismaRandom() is a clean way to centralize extension setup, and the returned extended client type is inferred correctly via ReturnType<typeof createPrisma>.

Since this relies on prisma-extension-random’s Prisma 6 extension API, please confirm against the library docs that base.$extends(prismaRandom()) is the recommended usage for your version.

Also applies to: 8-14

tsconfig.json (1)

28-31: LGTM!

The new path alias @lib/* correctly maps to ./lib/* and aligns with the new library modules introduced in this PR.

lib/snippets/processing/indentation.ts (1)

1-1: LGTM!

The updated import path consolidates utility functions into a single module, which improves code organization.

prisma/seed/languages.ts (1)

80-80: LGTM!

The success log provides useful feedback for the seeding process and aligns with the logging pattern in the new seeding flow.

prisma/seed.ts (2)

4-5: LGTM!

The new imports support the extended seeding pipeline introduced in this PR.


13-14: LGTM!

The sequential seeding flow is correct: languages must be seeded before files, and files before snippets. The orchestration aligns with the new hierarchical data model.

lib/snippets/index.ts (1)

15-21: Verify that hash collisions from whitespace-normalized content are acceptable.

Line 17 computes the SHA hash using hashSnippet(snippet), which removes all whitespace before hashing (as seen in lib/snippets/processing/utils.ts). This means that two different raw snippets differing only in whitespace/formatting will produce identical SHA values, potentially leading to duplicate detection issues or unintended collisions.

If the intent is to identify semantically identical code regardless of formatting, this is correct. Otherwise, consider hashing the formatted content instead of the raw content.
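The collision behavior is easy to confirm with a standalone reimplementation of the normalize-then-hash step (mirroring, not importing, hashSnippet from lib/snippets/processing/utils.ts):

```typescript
import { createHash } from "node:crypto";

// Mirror of the described behavior: strip all whitespace, then SHA-256.
function hashSnippet(content: string): string {
  const normalized = content.replace(/\s+/g, "");
  return createHash("sha256").update(normalized).digest("hex");
}

// These two formattings of the same code hash identically:
const compact = hashSnippet("function f() { return 1; }");
const indented = hashSnippet("function f() {\n\treturn 1;\n}");
// compact === indented: whitespace differences do not change the hash
```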

src/features/snippets/config/snippets.server.ts (1)

29-31: All references to renamed constants have been successfully updated.

Verification confirms:

  • Old constant names (MIN_SNIPPETS_PER_LANGUAGE, RANDOM_FILES_FETCHED, MAX_SNIPPETS_FETCH_ATTEMPTS) have no remaining references in the codebase
  • New constants (MIN_SNIPPETS_NUMBER, SNIPPETS_RETRIEVED_PER_QUERY, MAX_GET_SNIPPETS_ATTEMPTS) are properly exported and imported in src/features/snippets/services/get-random-snippets.server.ts where they are actively used
  • The breaking API change has been fully migrated
src/features/snippets/logic/parsing/snippet-parser.server.ts (1)

1-1: The review comment is based on inaccurate assumptions about the refactoring.

The functions (findValidNodes, convertSnippetToText, extractSnippets, getNodeText) were not removed—they still exist and are exported from both the old location (src/features/snippets/logic/parsing/snippet-parser.server.ts) and the new location (lib/snippets/parsing/parser.ts). Additionally, not all consumers have been updated: src/features/snippets/services/get-random-snippets.server.ts:4 still imports extractAutoCompleteDisabledRanges from the old path @/features/snippets/logic/parsing/snippet-parser.server.

Likely an incorrect or invalid review comment.

lib/snippets/processing/filter.ts (1)

4-16: Core snippet size/shape filters and isValidNode look consistent

Line/length constraints and isValidNode are straightforward and align well with the centralized config values in lib/snippets/config.ts. The combined use in filterSnippets gives a clear, composable gate for snippet validity.

Also applies to: 30-32

lib/snippets/processing/process.ts (1)

16-54: Indentation normalization pipeline is coherent

formatCode cleanly composes:

  • Early rejection on prohibited characters and mixed indentation.
  • A no‑op path for unindented code (type === "none").
  • Space→tab normalization using detectIndentationStyle and countInitialWhitespaces.
  • Offset removal and empty‑line indentation via adjustIndentationOffset / applyIndentationToEmptyLines.

Assuming the helpers behave as in indentation.ts, this should produce consistently tab‑indented snippets for downstream processing.

prisma/seed/snippets.ts (1)

62-77: Unique‑constraint handling is reasonable

Catching PrismaClientKnownRequestError with code "P2002" and ignoring that case is a good way to deal with duplicate snippet content across files during parallel seeding, while still surfacing any other unexpected write errors.

src/features/snippets/services/get-random-snippets.server.ts (1)

12-23: End‑to‑end filtering/parsing pipeline is clean

The combination of:

  • findRandomSnippets as a DB‑level random selector,
  • filterSnippets as a content gate, and
  • extractAutoCompleteDisabledRanges as a tree‑sitter‑based post‑processor

produces a clear, layered pipeline for random snippet retrieval. The resulting ISnippet shape (content + disabledRanges) is well‑factored.

src/features/snippets/infrastructure/repositories/snippet.repository.server.ts (1)

4-19: Repository query is straightforward and matches new data model

findRandomSnippets cleanly delegates randomness to prisma.snippet.findManyRandom, selecting only id and content and scoping by fileVersion.file.languageId. That aligns well with the new File/FileVersion‑based schema and keeps the repository’s surface small and focused.

lib/snippets/parsing/parser.ts (1)

20-24: Verify indentation behavior in convertSnippetToText

convertSnippetToText prefixes getInitialIndentation(node.startIndex, sourceCode) to node.text. If node.text already includes its original leading indentation (which it typically does with tree-sitter), this will effectively double the outer indentation for nested nodes.

If the goal is to preserve original indentation only, consider either:

  • Dropping the extra prefix and returning just getNodeText(node), or
  • Normalizing the node text first (e.g., stripping its original leading indentation) before prefixing the computed context indentation.

This is likely fine, but it’s worth validating with a few nested constructs (functions, methods, blocks) to ensure snippets don’t end up over-indented.

Also applies to: 44-45
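A minimal stand-in for getInitialIndentation (hypothetical; the real helper lives in lib/snippets/processing/indentation.ts) makes the over-indentation question easy to probe with plain strings before validating against real trees:

```typescript
// Stand-in: the whitespace between the start of the node's line and the node.
// If this prefix is prepended to text that already carries indentation,
// the result is doubled indentation.
function getInitialIndentation(startIndex: number, sourceCode: string): string {
  const lineStart = sourceCode.lastIndexOf("\n", startIndex - 1) + 1;
  const prefix = sourceCode.slice(lineStart, startIndex);
  // Only a pure-whitespace prefix counts as indentation.
  return /^[ \t]*$/.test(prefix) ? prefix : "";
}

const source = "class A {\n\tmethod() {}\n}";
const methodStart = source.indexOf("method"); // a nested node starting mid-line
```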

prisma/migrations/20251117225653_init/migration.sql (1)

1-66: Schema for Language/File/FileVersion/Snippet looks consistent and well-indexed

The migration cleanly models the new hierarchy:

  • Language keyed by id with a unique name.
  • File unique on (repository, path) with an index on languageId.
  • FileVersion unique per (fileId, sha) and linked back to File.
  • Snippet referencing FileVersion with a unique sha.

The foreign keys and indexes align with the seeding logic in prisma/seed/files.ts and should support efficient lookups. No changes needed from my side.

prisma/schema.prisma (1)

10-16: Language.id design is intentional and correct — no breaking change exists.

The concern about missing @default generator is invalid. Evidence shows Language records are created exclusively through the seed function (prisma/seed/languages.ts), which provides explicit ids for all supported languages ("js", "ts", "c", "cpp", "cs", "java", "py", "lua", "r"). The application never attempts to create Language records dynamically without providing an id. The schema design is deliberate: languages are managed through seeding only, not through application-level record creation. No @default is needed or expected.

Comment on lines +18 to 23
function filterSnippetSpecialCharacters(snippet: string): boolean {
const regex = /^[\x00-\x7F]+$/;
const test = regex.test(snippet);

return test;
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

ASCII-only regex triggers Biome warning and could be modernized

filterSnippetSpecialCharacters uses /^[\x00-\x7F]+$/, which Biome flags (noControlCharactersInRegex) and enforces strict ASCII-only content. If you want “ASCII-only” semantics, you can avoid explicit control-character ranges and satisfy the linter with Unicode properties:

-function filterSnippetSpecialCharacters(snippet: string): boolean {
-  const regex = /^[\x00-\x7F]+$/;
-  const test = regex.test(snippet);
-
-  return test;
-}
+function filterSnippetSpecialCharacters(snippet: string): boolean {
+  // Allow only ASCII characters.
+  const regex = /^[\p{ASCII}]+$/u;
+  return regex.test(snippet);
+}

Also double‑check that rejecting any non‑ASCII characters (identifiers, comments, emojis, etc.) is intentional, as this will exclude many valid snippets.










🏁 Script executed:

#!/bin/bash
# Verify file exists and inspect the function
cat lib/snippets/processing/filter.ts

Length of output: 1255


🏁 Script executed:

#!/bin/bash
# Search for Biome configuration files
find . -name ".biomeignore" -o -name "biome.json" -o -name "biome.jsonc" 2>/dev/null | head -20

Length of output: 12


🏁 Script executed:

#!/bin/bash
# Find usages of filterSnippetSpecialCharacters to understand purpose
rg -n "filterSnippetSpecialCharacters" --type ts --type js

Length of output: 637


🏁 Script executed:

#!/bin/bash
# Check TypeScript configuration for target/lib settings
cat tsconfig.json 2>/dev/null || cat tsconfig.*.json 2>/dev/null | head -50

Length of output: 746


🏁 Script executed:

#!/bin/bash
# Look for any linting output or CI config that might show Biome warnings
find . -name ".github" -type d | xargs -I {} find {} -name "*.yml" -o -name "*.yaml" 2>/dev/null | xargs cat 2>/dev/null | grep -i biome -A 5 -B 5 | head -50

Length of output: 0


🌐 Web query:

Does TypeScript/JavaScript support \p{ASCII} Unicode property in regex?

💡 Result:

Yes — JavaScript (and therefore TypeScript at runtime) supports the Unicode binary property ASCII via Unicode property escapes. Use the u (Unicode) flag, e.g. /\p{ASCII}/u. [1][2]

Sources:
[1] MDN — Unicode character class escape (\p{...}, \P{...}).
[2] MDN (Chinese) — notes listing binary properties including ASCII.


🌐 Web query:

What does Biome noControlCharactersInRegex rule flag?

💡 Result:

The Biome rule noControlCharactersInRegex flags control (non-printable) ASCII characters in regular expressions — i.e. characters U+0000–U+001F — when they appear as:

  • hex escapes (\x00–\x1F)
  • Unicode escapes (\u0000–\u001F)
  • Unicode code point escapes (\u{0}–\u{1F})
  • unescaped raw characters U+0000–U+001F

It allows common control escape sequences like \t and \n. [1][2]



---

🏁 Script executed:

#!/bin/bash
# Inspect Biome configuration
cat biome.json

Length of output: 1774


🏁 Script executed:

#!/bin/bash
# Check if Next.js project uses custom Babel config or build tools
ls -la | grep -i babel
cat .babelrc* 2>/dev/null || cat babel.config.* 2>/dev/null

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for context around snippet filtering — comments or docs explaining ASCII requirement
rg -n "ASCII|control|character|\\\\x00" --type ts --type js lib/snippets src/features/snippets -B 2 -A 2

Length of output: 3868


🏁 Script executed:

#!/bin/bash
# Check if project has a next.config or similar that might affect regex compilation
cat next.config.* 2>/dev/null | head -100

Length of output: 594


The suggested Unicode property syntax may not work with your ES2017 TypeScript target without additional transpilation config

The Biome warning (noControlCharactersInRegex) is real and correctly triggered by /^[\x00-\x7F]+$/. However, the suggested fix /^[\p{ASCII}]+$/u uses Unicode properties, which were added in ES2018—your tsconfig targets ES2017.

To satisfy Biome without risking transpilation issues, simplify the regex to avoid explicit control-character ranges:

function filterSnippetSpecialCharacters(snippet: string): boolean {
-  const regex = /^[\x00-\x7F]+$/;
-  const test = regex.test(snippet);
-
-  return test;
+  // Allow only printable ASCII characters (0x20–0x7E).
+  const regex = /^[ -~]+$/;
+  return regex.test(snippet);
}

This avoids control characters entirely and satisfies Biome without transpilation concerns. Alternatively, confirm your build system handles Unicode property transpilation, then the /^\p{ASCII}+$/u approach works.

Also verify that rejecting all non-ASCII content (identifiers, comments, emojis, etc.) is intentional—this filter may be too restrictive for valid snippets.

🧰 Tools
🪛 Biome (2.1.2)

[error] 19-19: Unexpected control character in a regular expression.

Control characters are unusual and potentially incorrect inputs, so they are disallowed.

(lint/suspicious/noControlCharactersInRegex)

🤖 Prompt for AI Agents
In lib/snippets/processing/filter.ts around lines 18-23, the current regex
/^[\x00-\x7F]+$/ triggers Biome's noControlCharactersInRegex and the suggested
Unicode property /\p{ASCII}/u requires ES2018+, so either replace the regex with
a printable-ASCII-only pattern (e.g. use the visible ASCII range to exclude
control characters, such as /^[ -~]+$/) to satisfy Biome without changing
transpilation, or if you truly need full ASCII control characters support,
update the project target to ES2018+ (or enable Unicode property transpilation)
and switch to /^\p{ASCII}+$/u; also confirm that rejecting all non-ASCII input
is intentional before applying the change.
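For illustration, the behavioral differences between the original regex, the printable-only alternative, and a loop-based check can be sketched as follows (the function names here are hypothetical, not from the codebase):

```typescript
// Original: accepts all ASCII, including control characters such as \n and \t.
const fullAscii = (s: string): boolean => /^[\x00-\x7F]+$/.test(s);

// Lint-friendly alternative: printable ASCII only (0x20–0x7E).
const printableAscii = (s: string): boolean => /^[ -~]+$/.test(s);

// Loop-based check: same semantics as the original, no regex at all.
const asciiLoop = (s: string): boolean => {
  if (s.length === 0) return false;
  for (let i = 0; i < s.length; i += 1) {
    if (s.charCodeAt(i) > 0x7f) return false;
  }
  return true;
};

// All three agree on plain printable input and on non-ASCII input:
console.log(fullAscii("const x = 1;"), printableAscii("const x = 1;"), asciiLoop("const x = 1;")); // true true true
console.log(fullAscii("émoji"), printableAscii("émoji"), asciiLoop("émoji"));                      // false false false

// They diverge on control characters:
console.log(fullAscii("a\nb"));      // true
console.log(printableAscii("a\nb")); // false
console.log(asciiLoop("a\nb"));      // true
```

Note that the printable-only variant also rejects newlines, so if this filter ever runs on whole multi-line snippets rather than on single lines, the character class would need to admit tabs and line breaks explicitly (e.g. /^[\t\n\r -~]+$/, which Biome permits since \t and \n are allowed escapes).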

Comment on lines +17 to +22
function filterSnippetSpecialCharacters(snippet: string): boolean {
const regex = /^[\x00-\x7F]+$/;
const test = regex.test(snippet);

return test;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Fix filterSnippetSpecialCharacters to satisfy lint and clarify intent

Biome flags ^[\x00-\x7F]+$ for control characters; it’s also unclear whether you intend to allow all ASCII (including control chars) or just printable characters.

If the goal is “no non-ASCII characters”, a clearer and lint-friendly implementation is:

function filterSnippetSpecialCharacters(snippet: string): boolean {
  // allow only ASCII (code points <= 0x7f)
  for (let i = 0; i < snippet.length; i += 1) {
    if (snippet.charCodeAt(i) > 0x7f) {
      return false;
    }
  }
  return true;
}

This avoids control characters in the regex entirely and makes the intent explicit.

Please re-run Biome after changing this to ensure the lint error is resolved.


🏁 Script executed:

#!/bin/bash
# Check Biome availability and configuration
which biome && biome --version

# Display the file content to understand context
echo "=== Current file content ==="
cat src/features/snippets/logic/filter.ts

# Check if biome.json or similar config exists
echo "=== Biome config ==="
ls -la | grep -E "biome|eslint" || echo "No Biome config found at root"

Length of output: 1364


🏁 Script executed:

#!/bin/bash
# Run Biome lint on the specific file to see actual errors
if command -v biome &> /dev/null; then
  echo "=== Running Biome lint on filter.ts ==="
  biome lint src/features/snippets/logic/filter.ts || true
fi

Length of output: 724


🏁 Script executed:

#!/bin/bash
# Search for any tests or documentation about these filter functions
echo "=== Looking for tests ==="
find . -type f \( -name "*.test.ts" -o -name "*.spec.ts" -o -name "*.test.js" -o -name "*.spec.js" \) | xargs grep -l "filterSnippet\|filterTabs" 2>/dev/null || echo "No tests found"

echo "=== Looking for usage of filterSnippetSpecialCharacters ==="
rg -n "filterSnippetSpecialCharacters" --type=ts --type=js

Length of output: 738


Fix filterSnippetSpecialCharacters to resolve Biome lint error

Biome flags the regex /^[\x00-\x7F]+$/ with noControlCharactersInRegex. To fix this while maintaining the same logic (allow only ASCII, reject non-ASCII), use a loop-based implementation:

function filterSnippetSpecialCharacters(snippet: string): boolean {
  for (let i = 0; i < snippet.length; i += 1) {
    if (snippet.charCodeAt(i) > 0x7f) {
      return false;
    }
  }
  return true;
}

This avoids control characters in regex and makes the intent explicit.

🧰 Tools
🪛 Biome (2.1.2)

[error] 18-18: Unexpected control character in a regular expression.

Control characters are unusual and potentially incorrect inputs, so they are disallowed.

(lint/suspicious/noControlCharactersInRegex)

🤖 Prompt for AI Agents
In src/features/snippets/logic/filter.ts around lines 17 to 22, the current
regex /^[\x00-\x7F]+$/ triggers Biome's noControlCharactersInRegex rule; replace
the regex approach with a simple loop over the snippet characters that checks
charCodeAt(i) and returns false if any codepoint is > 0x7F, otherwise return
true—this preserves the original "only ASCII" logic while avoiding control
characters in a regex.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
lib/snippets/processing/filter.ts (1)

18-23: Biome control character warning remains unresolved.

The regex /^[\x00-\x7F]+$/ still triggers Biome's noControlCharactersInRegex rule. As noted in the previous review, your ES2017 target prevents using Unicode properties without additional transpilation.

To satisfy Biome without transpilation concerns, replace with printable ASCII only:

 function filterSnippetSpecialCharacters(snippet: string): boolean {
-  const regex = /^[\x00-\x7F]+$/;
-  const test = regex.test(snippet);
-
-  return test;
+  // Allow only printable ASCII characters (0x20–0x7E).
+  const regex = /^[ -~]+$/;
+  return regex.test(snippet);
 }

Also verify that rejecting all non-ASCII content (identifiers, comments, emojis) is intentional.

prisma/schema.prisma (1)

43-50: Global sha uniqueness must be replaced with a composite constraint.

The sha field at line 45 is marked globally @unique, but each Snippet is scoped to a specific FileVersion (non-nullable foreign key). This prevents different FileVersion records from legitimately storing snippets with identical content hashes—a design requirement confirmed by the seeding code in prisma/seed/snippets.ts (lines 74-75), which explicitly catches and ignores unique constraint violations.

Apply this diff to enforce uniqueness per file version:

 model Snippet {
   id            String      @id @default(uuid())
-  sha           String      @unique
+  sha           String
   content       String
   rawContent    String
   fileVersionId String
   fileVersion   FileVersion @relation(fields: [fileVersionId], references: [id], onDelete: Cascade)
+
+  @@unique([fileVersionId, sha])
 }

After updating the schema, run npx prisma migrate dev and remove the try-catch workaround in prisma/seed/snippets.ts that silently ignores P2002 errors.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67c87ab and 62e4b78.

📒 Files selected for processing (3)
  • lib/snippets/processing/filter.ts (1 hunks)
  • prisma/schema.prisma (1 hunks)
  • prisma/seed/snippets.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
prisma/seed/snippets.ts (1)
lib/snippets/index.ts (1)
  • getSnippetsFromFile (12-26)
lib/snippets/processing/filter.ts (1)
lib/snippets/config.ts (6)
  • MAX_LINES (1-1)
  • MIN_LINES (2-2)
  • MAX_LINE_LENGTH (4-4)
  • MIN_SNIPPET_LENGTH (7-7)
  • MAX_SNIPPET_LENGTH (6-6)
  • VALID_NODES (9-27)
🪛 Biome (2.1.2)
lib/snippets/processing/filter.ts

[error] 19-19: Unexpected control character in a regular expression.

Control characters are unusual and potentially incorrect inputs, so they are disallowed.

(lint/suspicious/noControlCharactersInRegex)

🔇 Additional comments (5)
prisma/seed/snippets.ts (1)

6-6: Verify concurrency limit of 100 is appropriate for GitHub raw content API.

The p-limit(100) setting allows 100 concurrent fetches from raw.githubusercontent.com. While raw content endpoints are generally more permissive than the REST API, this limit might still be aggressive and could trigger rate limiting or timeouts.

Consider reducing to a more conservative value (e.g., 10-20) unless you've confirmed this limit works reliably in production.
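For context, p-limit does nothing more than cap the number of in-flight promises; a minimal sketch of the idea (not the library's actual implementation) looks like this:

```typescript
// Minimal p-limit-style concurrency cap: run at most `concurrency` tasks at once,
// queueing the rest until a slot frees up.
function createLimit(concurrency: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  const next = () => {
    active -= 1;
    const run = queue.shift();
    if (run) run();
  };

  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = () => {
        active += 1;
        task().then(resolve, reject).finally(next);
      };
      if (active < concurrency) run();
      else queue.push(run);
    });
}

// Example: five tasks, but never more than two in flight at a time.
const limit = createLimit(2);
let inFlight = 0;
let peak = 0;
const work = () =>
  limit(async () => {
    inFlight += 1;
    peak = Math.max(peak, inFlight);
    await new Promise((r) => setTimeout(r, 10));
    inFlight -= 1;
  });

await Promise.all([work(), work(), work(), work(), work()]);
console.log(peak); // never exceeds 2
```

With a cap of 100 against raw.githubusercontent.com, any transient throttling turns into a burst of failures across many queued tasks at once, which is why a smaller cap tends to degrade more gracefully.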

prisma/schema.prisma (3)

23-23: onDelete cascade behavior has been correctly added.

The onDelete: Cascade specification ensures that deleting a Language will cascade to all associated File records, maintaining data integrity.


37-37: onDelete cascade behavior has been correctly added.

The onDelete: Cascade specification ensures that deleting a File will cascade to all associated FileVersion records, maintaining referential integrity.


49-49: onDelete cascade behavior has been correctly added.

The onDelete: Cascade specification ensures that deleting a FileVersion will cascade to all associated Snippet records.

lib/snippets/processing/filter.ts (1)

25-28: Critical: Tab regex matches literal "\t" string, not tab characters.

The regex /^.+\\t.+$/ uses a double-escaped backslash (\\t), which matches the two-character literal string \t rather than an actual tab character. This means the filter fails to detect real tabs in snippets.

Apply this diff to fix the regex:

 function filterTabsInBetween(snippet: string): boolean {
-  const regex = /^.+\\t.+$/;
+  const regex = /^.+\t.+$/;
   return !regex.test(snippet);
 }

Likely an incorrect or invalid review comment.
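The escaping distinction the comment turns on is easy to verify in isolation — in a regex literal, \\t is a backslash followed by the letter t, while \t is a real tab:

```typescript
// Standalone demonstration of the two regexes discussed above.
const literalBackslashT = /^.+\\t.+$/; // matches the two-character text "\" + "t"
const realTab = /^.+\t.+$/;            // matches an actual tab character

const withRealTab = "foo\tbar";   // string containing a tab
const withEscapedT = "foo\\tbar"; // string containing backslash + "t"

console.log(literalBackslashT.test(withRealTab));  // false: no literal "\t" text present
console.log(realTab.test(withRealTab));            // true
console.log(literalBackslashT.test(withEscapedT)); // true
console.log(realTab.test(withEscapedT));           // false
```

So whether the suggested fix is correct depends entirely on whether the source file spells the pattern with one backslash or two, which is worth checking in the actual file rather than the rendered diff.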

Comment on lines +46 to +51
if (res.status === 404 || res.status === 400) {
await prisma.fileVersion.delete({
where: {id: file.id},
});
console.warn(`Deleted file version ${file.id} due to 404 (file missing).`);
}

⚠️ Potential issue | 🟡 Minor

FileVersion deletion on 404/400 may orphan parent File records.

When a file is not found (404/400), the code deletes the FileVersion but leaves the parent File intact. If all versions of a file are deleted, the File becomes an orphan with no fileVersions.

Consider checking after deletion whether the parent File has any remaining fileVersions, and if not, delete the File as well:

         if (res.status === 404 || res.status === 400) {
           await prisma.fileVersion.delete({
             where: {id: file.id},
           });
+          // Clean up orphaned parent File if this was the last version
+          const remainingVersions = await prisma.fileVersion.count({
+            where: {fileId: file.file.id},
+          });
+          if (remainingVersions === 0) {
+            await prisma.file.delete({where: {id: file.file.id}});
+          }
           console.warn(`Deleted file version ${file.id} due to 404 (file missing).`);
         }

However, you'll need to adjust the query to fetch file.file.id in the initial select (line 13).

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In prisma/seed/snippets.ts around lines 46 to 51, deleting the FileVersion on a
404/400 can leave an orphaned parent File; update the initial query (around line
13) to also select the parent file id (file.file.id), then after await
prisma.fileVersion.delete(...) check whether any fileVersions remain for that
parent (using prisma.fileVersion.count or prisma.file.findUnique with
select/count); if the count is zero, delete the parent File via
prisma.file.delete({ where: { id: parentId } }), and log the action. Ensure this
logic runs atomically enough for your seed context (or retry on constraint
errors).

Comment on lines +64 to +77
try {
await prisma.snippet.create({
data: {
sha: snippet.sha,
content: snippet.content,
rawContent: snippet.rawContent,
fileVersionId: file.id,
},
});
} catch (error) {
// Ignore unique constraint violation (since multiple files can produce the same snippet). Upsert is not an option since we are seeding in parallel.
if (error instanceof Prisma.PrismaClientKnownRequestError && error.code === "P2002") return;
throw error;
}

🛠️ Refactor suggestion | 🟠 Major

Remove unique constraint workaround once schema is corrected.

The try-catch block silently ignores P2002 errors with the comment "since multiple files can produce the same snippet." This is a workaround for the incorrect global @unique constraint on Snippet.sha in the schema.

Once the schema is updated to use @@unique([fileVersionId, sha]) (as noted in the schema review), remove this try-catch block to let real errors surface:

       snippets.map(async (snippet) => {
-        try {
-          await prisma.snippet.create({
-            data: {
-              sha: snippet.sha,
-              content: snippet.content,
-              rawContent: snippet.rawContent,
-              fileVersionId: file.id,
-            },
-          });
-        } catch (error) {
-          // Ignore unique constraint violation (since multiple files can produce the same snippet). Upsert is not an option since we are seeding in parallel.
-          if (error instanceof Prisma.PrismaClientKnownRequestError && error.code === "P2002") return;
-          throw error;
-        }
+        await prisma.snippet.create({
+          data: {
+            sha: snippet.sha,
+            content: snippet.content,
+            rawContent: snippet.rawContent,
+            fileVersionId: file.id,
+          },
+        });
       })
🤖 Prompt for AI Agents
In prisma/seed/snippets.ts around lines 64 to 77 there is a try-catch that
swallows Prisma P2002 unique-constraint errors for Snippet.sha as a workaround
for an incorrect global unique; once the schema is corrected to use
@@unique([fileVersionId, sha]) remove this try-catch entirely so
duplicate-snippet conflicts are handled by the schema and real errors will
surface during seeding; after removing the block, run the migration and re-run
the seed to verify no P2002 suppression remains and legitimate errors are
reported.

@mattiacerutti mattiacerutti merged commit e936fdd into main Nov 18, 2025
4 checks passed
@mattiacerutti mattiacerutti deleted the feat/db-refactor branch November 18, 2025 18:30