feat: refactor architecture to identify snippets #14
Conversation
Walkthrough

Adds a GitHub API client with retries and rate-limit handling, a Tree-sitter–based snippet extraction/processing library, a new hierarchical Prisma data model (File → FileVersion → Snippet), and revamped seed procedures that fetch and persist files/snippets; replaces legacy CLI/scripts with integrated seed modules and a Prisma random-extension integration.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Seed as Seed Pipeline
    participant GH as GitHub API (callGitHubApi)
    participant Parser as Snippet Library (Tree‑sitter)
    participant DB as Prisma (Extended)
    participant Service as Snippet Service

    Seed->>GH: Search code per language (paginated)
    GH-->>Seed: List of files (repo, path, sha, commit)
    Seed->>DB: Create File + FileVersion records
    Seed->>GH: Fetch raw content for unparsed FileVersions
    GH-->>Seed: File content
    Seed->>Parser: getSnippetsFromFile(content, languageId)
    Parser-->>Seed: snippet texts[]
    Seed->>DB: Insert Snippet rows (hash, content, rawContent) and set parsedAt
    DB-->>Seed: persisted
    Service->>DB: findRandomSnippets(languageId, qty)
    DB-->>Service: random snippets (id, content)
    Service->>Service: filterSnippets(content) → compute disabledRanges
    Service-->>Client: final shuffled snippets
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Possibly related PRs
Poem
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 6
🧹 Nitpick comments (13)
src/shared/db/prisma.ts (1)
4-6: Global Prisma singleton pattern is sound; optional type/ordering tweak

The `globalThis` cast plus caching in non-production environments follows the standard Prisma singleton pattern and should avoid connection churn during hot reloads. If you want slightly clearer typings, you could optionally reorder to define `ExtendedPrismaClient` before the `globalForPrisma` cast, so the type alias isn't forward-referenced.

If you reuse this pattern elsewhere, ensure all imports reference `ExtendedPrismaClient` (or the `prisma` export) rather than `PrismaClient` directly so types remain consistent with the extended client.

Also applies to: 16-22
lib/snippets/processing/utils.ts (1)
27-32: Remove redundant `trim()` call.

Line 29 removes all whitespace using `replace(/\s+/g, "")`, making the subsequent `.trim()` call redundant since there's no whitespace left to trim.

Apply this diff:

```diff
- const newContent = content.replace(/\s+/g, "").trim();
+ const newContent = content.replace(/\s+/g, "");
```

lib/snippets/processing/filter.ts (1)
25-28: Clarify intent of `filterTabsInBetween` and its interaction with `filterSnippets`

Right now `filterTabsInBetween` returns `true` only if the snippet contains either a literal `\t` sequence or an actual tab character, and `filterSnippets` requires this to be `true`. That means any snippet with no tabs at all (including multi-line, non-indented code) will always be rejected.

If the goal is:

- "Reject snippets that contain tabs/`\t` literals" → invert the predicate in `filterTabsInBetween`.
- "Require at least one indented line using tabs" → the current behavior is closer, but consider:
  - Renaming the helper (e.g. `hasTabsIndentation`) or adding a short comment explaining the requirement.
  - Simplifying the regex (e.g., `snippet.trim().includes("\t")`) if you only care about actual tab characters.

Please confirm which behavior you want so this filter doesn't silently drop more snippets than intended.
Also applies to: 34-35
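If the "require at least one tab character" reading is the intended one, the rename could look like this minimal sketch (the name `hasTabIndentation` is hypothetical, not from the PR):

```typescript
// Hypothetical rename sketch for the "require tab indentation" reading.
// True when the snippet contains at least one real tab character.
function hasTabIndentation(snippet: string): boolean {
  return snippet.includes("\t");
}
```

Unlike the current regex-based helper, this only matches actual tab characters, not literal `\t` sequences in source text.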
lib/snippets/processing/process.ts (1)
11-14: `hasProhibitedCharacters` is very strict (no backslashes, no CR/other whitespace)

The regex `/((?:(?![ \t\n])\s)|\\)/` rejects:

- Any whitespace other than space, tab, and `\n` (e.g. `\r`, `\f`, NBSP).
- Any backslash `\` anywhere in the snippet.

This will drop, for example, CRLF-encoded files, Windows paths, string/regex literals with escapes, etc. If that's intentional, it's fine; otherwise you may want to narrow the rule (e.g. only reject specific problematic characters) instead of all backslashes and non-standard whitespace.
Consider documenting the intent, or relaxing the regex if you expect such snippets.
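As one possible narrowing under the "relax the regex" option, rejecting only whitespace other than space, tab, and line endings while allowing backslashes, a sketch might be (the helper name `hasProhibitedWhitespace` is hypothetical):

```typescript
// Hypothetical narrowed rule: flag only whitespace that is not space, tab,
// \n, or \r, so CRLF files and backslash escapes survive.
function hasProhibitedWhitespace(snippet: string): boolean {
  // Negative lookahead excludes the allowed whitespace before \s matches.
  return /(?![ \t\n\r])\s/.test(snippet);
}
```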
prisma/seed/snippets.ts (2)
27-29: Consider not throwing when there are no unparsed files

When `unparsedFiles.length === 0` you currently throw:

```ts
throw new Error("All files are parsed. No snippets to seed.");
```

For an idempotent seed pipeline it's often more convenient to just log and return early, so re-running seeds doesn't fail:

```diff
- if (unparsedFiles.length === 0) {
-   throw new Error("All files are parsed. No snippets to seed.");
- }
+ if (unparsedFiles.length === 0) {
+   console.log("All files are parsed. No snippets to seed.");
+   return;
+ }
```
31-86: DB operations run with potentially high concurrency
`p-limit` caps concurrent fetches to 100, but:

- `getSnippetsFromFile` runs for all unparsed files in parallel.
- For each file, snippet inserts use `Promise.all(snippets.map(...))` without any concurrency limit.
- All fileVersion updates also run under the outer `Promise.all`.

On large datasets this can produce a high number of concurrent DB calls. If you expect many file versions/snippets, consider:

- Moving the `limit` to wrap the whole per-file task instead of just `fetch`, or
- Using another `p-limit` for snippet inserts per file, or
- Chunking `unparsedFiles` into batches.

That will keep seeding friendlier to the DB without changing behavior.
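The batching option can be sketched without any extra dependency; `processFile` here stands in for the per-file fetch/parse/insert task and is hypothetical:

```typescript
// Minimal batching sketch for the "chunk unparsedFiles into batches" option.
// Only one batch is in flight at a time, capping concurrency at batchSize.
async function processInBatches<T>(
  items: T[],
  batchSize: number,
  task: (item: T) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    await Promise.all(batch.map(task));
  }
}
```

Compared to a single `p-limit` around `fetch`, this caps everything a batch does, DB writes included, at `batchSize` concurrent tasks.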
lib/snippets/config.ts (1)
1-27: Config values look good; consider typing `VALID_NODES` more strongly

The thresholds and node type list are clear and centralized. To get better type-safety where these node types are used, you could mark the array as `const` and optionally expose a union type:

```diff
 export const VALID_NODES = [
   "function_declaration",
   "function_definition",
   "class_declaration",
   "class_definition",
   "function_item",
   "method_declaration",
   "method_definition",
   "method_item",
   "class_item",
   "interface_declaration",
   "interface_definition",
   "interface_item",
   "module",
   "switch_statement",
   "while_statement",
   "for_statement",
   "if_statement",
-];
+] as const;
+
+export type ValidNodeType = (typeof VALID_NODES)[number];
```

This is optional but helps prevent typos in node type strings elsewhere.
src/features/snippets/services/get-random-snippets.server.ts (2)
11-26: Handle duplicates and optionally cap to the minimum number of snippets

The loop accumulates snippets until `snippets.length >= MIN_SNIPPETS_NUMBER` or attempts are exhausted, but:

- The same snippet can be returned multiple times by `findRandomSnippets` across attempts; there's no deduplication by snippet ID/content.
- Once you exceed `MIN_SNIPPETS_NUMBER`, you keep all accumulated snippets; callers might expect "exactly N" rather than "at least N".

If you care about uniqueness and an exact count, consider something like:

```ts
const bySha = new Map<string, ISnippet>();

// when adding:
bySha.set(snippetSha, snippet);

// after the loop:
const uniqueSnippets = Array.from(bySha.values()).slice(0, MIN_SNIPPETS_NUMBER);
```

This would avoid duplicates and cap the result while preserving your attempt logic.
30-35: Shuffling via `sort(() => Math.random() - 0.5)` is simple but biased

Using `array.sort(() => Math.random() - 0.5)` works for small datasets, but it's known to produce a biased distribution and relies on engine-specific sort behavior.

If you want a more robust shuffle, a small Fisher–Yates implementation is safer:

```ts
function shuffle<T>(arr: T[]): T[] {
  const result = [...arr];

  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }

  return result;
}

// ...
return shuffle(snippets);
```

Not critical, but more predictable.
lib/snippets/parsing/parser.ts (1)
6-18: Avoid repeated array concatenations in `findValidNodes`

The recursive `nodes = nodes.concat(findValidNodes(child))` pattern builds many intermediate arrays and can degrade to O(n²) on large trees. An iterative DFS or passing an accumulator array would be more efficient while keeping behavior identical.

You could refactor along these lines:

```ts
export function findValidNodes(root: IParser.SyntaxNode): IParser.SyntaxNode[] {
  const result: IParser.SyntaxNode[] = [];
  const stack: IParser.SyntaxNode[] = [root];

  while (stack.length > 0) {
    const node = stack.pop()!;

    if (isValidNode(node)) {
      result.push(node);
    }

    // children are never null in tree-sitter, so no null check needed
    for (const child of node.children) {
      stack.push(child);
    }
  }

  return result;
}
```

Please confirm this matches your expectations with a few representative trees from tree-sitter.

prisma/seed/files.ts (2)
39-47: Align log messages with what's actually being seeded

This function seeds `File`/`FileVersion` records, but the logs say "snippets":

```ts
console.log("All languages already have snippets.");
console.log(`Seeding snippets for missing languages: ${missingLanguageIds.join(", ")}`);
```

For clarity when running seeds, consider updating these to talk about "files" or "tracked files" instead.

85-96: Add an explicit type for `files` in `getFilesForLanguage`

Right now `files` is inferred as `any[]`:

```ts
const files = [];
```

Given the function's return type is `Promise<FileMetadata[]>`, it's better to enforce that locally:

```ts
async function getFilesForLanguage(
  language: ILanguage,
  {pagesPerLanguage = 10}: {pagesPerLanguage?: number} = {},
): Promise<FileMetadata[]> {
  const files: FileMetadata[] = [];
  // ...
}
```

This will catch accidental pushes of incorrectly shaped data at compile time.
lib/github.ts (1)
14-70: GitHub client structure looks solid; consider enriching the final error

The retry + rate-limit handling in `callGitHubApi` is well structured and should be easy to reuse.

One small improvement: the final error after exhausting attempts currently doesn't expose the last status code or body, which can make debugging harder:

```ts
throw new Error(`GitHub API request failed after ${attempts} attempts for ${url}`);
```

You could capture the last response status/text (and maybe a truncated body) and include it:

```ts
let lastStatus: number | undefined;
let lastStatusText: string | undefined;

for (let attempt = 1; attempt <= attempts; attempt += 1) {
  const response = await fetch(url, { ...fetchOptions, headers });
  lastStatus = response.status;
  lastStatusText = response.statusText;
  // existing logic...
}

throw new Error(
  `GitHub API request failed after ${attempts} attempts for ${url} (last status: ${lastStatus} ${lastStatusText})`,
);
```

This keeps behavior the same while giving more context when things go wrong.

Please re-check TypeScript types for `fetch`/`Response` in your runtime (Node vs. other) after this change.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`package-lock.json` is excluded by `!**/package-lock.json`
📒 Files selected for processing (31)
- lib/github.ts (1 hunks)
- lib/snippets/config.ts (1 hunks)
- lib/snippets/index.ts (1 hunks)
- lib/snippets/parsing/parser.ts (1 hunks)
- lib/snippets/processing/filter.ts (1 hunks)
- lib/snippets/processing/indentation.ts (1 hunks)
- lib/snippets/processing/process.ts (1 hunks)
- lib/snippets/processing/utils.ts (2 hunks)
- package.json (1 hunks)
- prisma/migrations/20250402143241_init/migration.sql (0 hunks)
- prisma/migrations/20250402153753_add_languageid_index/migration.sql (0 hunks)
- prisma/migrations/20251117225653_init/migration.sql (1 hunks)
- prisma/schema.prisma (1 hunks)
- prisma/seed.ts (1 hunks)
- prisma/seed/files.ts (1 hunks)
- prisma/seed/languages.ts (1 hunks)
- prisma/seed/snippets.ts (1 hunks)
- scripts/seed-missing-snippets.ts (0 hunks)
- scripts/seed-snippets.ts (0 hunks)
- scripts/utils/cli.ts (0 hunks)
- scripts/utils/snippet-helpers.ts (0 hunks)
- src/features/snippets/config/snippets.server.ts (1 hunks)
- src/features/snippets/infrastructure/adapters/snippet-fetch.server.ts (0 hunks)
- src/features/snippets/infrastructure/repositories/snippet.repository.server.ts (1 hunks)
- src/features/snippets/logic/filter.ts (1 hunks)
- src/features/snippets/logic/parsing/snippet-parser.server.ts (1 hunks)
- src/features/snippets/logic/processing/snippet-process.server.ts (0 hunks)
- src/features/snippets/services/get-random-snippets.server.ts (1 hunks)
- src/features/snippets/types/snippet-source.ts (0 hunks)
- src/shared/db/prisma.ts (1 hunks)
- tsconfig.json (1 hunks)
💤 Files with no reviewable changes (9)
- src/features/snippets/logic/processing/snippet-process.server.ts
- src/features/snippets/types/snippet-source.ts
- scripts/utils/cli.ts
- scripts/seed-snippets.ts
- scripts/seed-missing-snippets.ts
- prisma/migrations/20250402153753_add_languageid_index/migration.sql
- prisma/migrations/20250402143241_init/migration.sql
- scripts/utils/snippet-helpers.ts
- src/features/snippets/infrastructure/adapters/snippet-fetch.server.ts
🧰 Additional context used
🧬 Code graph analysis (12)
lib/snippets/processing/indentation.ts (3)
src/features/snippets/logic/processing/text.server.ts (2)
- countInitialWhitespaces (1-11)
- removeInitialCharacters (13-23)

src/features/snippets/logic/processing/snippet-process.server.ts (3)
- cleanText (16-21)
- formatCode (28-66)
- line (47-57)

src/features/snippets/logic/processing/indentation.server.ts (2)
- applyIndentationToEmptyLines (73-90)
- adjustIndentationOffset (47-52)
lib/snippets/processing/process.ts (2)
lib/snippets/processing/indentation.ts (3)
- detectIndentationStyle (4-45)
- adjustIndentationOffset (47-52)
- applyIndentationToEmptyLines (73-90)

lib/snippets/processing/utils.ts (1)
- countInitialWhitespaces (3-13)
src/features/snippets/logic/parsing/snippet-parser.server.ts (1)
src/features/snippets/logic/processing/snippet-process.server.ts (2)
- processSnippets (68-85)
- snippet (75-82)
lib/snippets/parsing/parser.ts (4)
lib/snippets/processing/filter.ts (1)
- isValidNode (30-32)

lib/snippets/processing/indentation.ts (1)
- getInitialIndentation (54-71)

lib/snippets/parsing/parser-factory.ts (1)
- getTSParser (19-27)

src/features/snippets/logic/processing/snippet-process.server.ts (1)
- processSnippets (68-85)
src/features/snippets/logic/filter.ts (3)
src/features/snippets/config/snippets.server.ts (5)
- MAX_LINES (1-1)
- MIN_LINES (2-2)
- MAX_LINE_LENGTH (4-4)
- MIN_SNIPPET_LENGTH (7-7)
- MAX_SNIPPET_LENGTH (6-6)

src/features/snippets/logic/processing/snippet-process.server.ts (5)
- filterSnippets (7-14)
- snippet (10-10)
- snippet (11-11)
- snippet (13-13)
- snippet (12-12)

src/features/snippets/logic/processing/snippet-filters.server.ts (5)
- filterLineLength (9-12)
- filterLinesNumber (4-7)
- filterSnippetLength (14-16)
- filterTabsInBetween (25-28)
- filterSnippetSpecialCharacters (18-23)
lib/snippets/index.ts (4)
lib/snippets/parsing/parser.ts (1)
- extractSnippets (26-42)

lib/snippets/processing/utils.ts (1)
- hashSnippet (27-32)

lib/snippets/processing/process.ts (1)
- formatCode (16-54)

lib/snippets/processing/filter.ts (1)
- filterSnippets (34-36)
prisma/seed/snippets.ts (2)
src/shared/db/prisma.ts (1)
- prisma (18-18)

lib/snippets/index.ts (1)
- getSnippetsFromFile (12-26)
lib/snippets/processing/filter.ts (1)
lib/snippets/config.ts (6)
- MAX_LINES (1-1)
- MIN_LINES (2-2)
- MAX_LINE_LENGTH (4-4)
- MIN_SNIPPET_LENGTH (7-7)
- MAX_SNIPPET_LENGTH (6-6)
- VALID_NODES (9-27)
prisma/seed/files.ts (3)
prisma/seed/languages.ts (1)
- SUPPORTED_LANGUAGES (4-59)

src/shared/types/language.ts (1)
- ILanguage (1-6)

lib/github.ts (1)
- callGitHubApi (14-70)
prisma/seed.ts (3)
prisma/seed/languages.ts (1)
- seedLanguages (61-81)

prisma/seed/files.ts (1)
- seedFiles (26-53)

prisma/seed/snippets.ts (1)
- seedSnippets (8-89)
src/features/snippets/infrastructure/repositories/snippet.repository.server.ts (1)
src/shared/db/prisma.ts (1)
- prisma (18-18)
src/features/snippets/services/get-random-snippets.server.ts (5)
src/shared/types/snippet.server.ts (1)
- ISnippet (1-4)

src/features/snippets/config/snippets.server.ts (3)
- MIN_SNIPPETS_NUMBER (29-29)
- MAX_GET_SNIPPETS_ATTEMPTS (31-31)
- SNIPPETS_RETRIEVED_PER_QUERY (30-30)

src/features/snippets/infrastructure/repositories/snippet.repository.server.ts (1)
- findRandomSnippets (4-20)

src/features/snippets/logic/filter.ts (1)
- filterSnippets (29-31)

src/features/snippets/logic/parsing/snippet-parser.server.ts (1)
- extractAutoCompleteDisabledRanges (4-51)
🪛 Biome (2.1.2)
src/features/snippets/logic/filter.ts
[error] 18-18: Unexpected control character in a regular expression.
Control characters are unusual and potentially incorrect inputs, so they are disallowed.
(lint/suspicious/noControlCharactersInRegex)
lib/snippets/processing/filter.ts
[error] 19-19: Unexpected control character in a regular expression.
Control characters are unusual and potentially incorrect inputs, so they are disallowed.
(lint/suspicious/noControlCharactersInRegex)
🔇 Additional comments (18)
package.json (1)
32-32: Runtime dependency addition matches prisma client extension usage

Adding `prisma-extension-random` as a regular dependency aligns with its runtime use in `src/shared/db/prisma.ts` and keeps the dependency graph consistent with the new Prisma client extension pattern.

Please double-check that `prisma-extension-random@^0.2.2` is compatible with `@prisma/client@^6.5.0` and your target Node.js 22.11.0 runtime.

src/shared/db/prisma.ts (1)
2-2: Prisma factory and extension wiring look correct

Using a `createPrisma` factory that instantiates `PrismaClient` once and immediately extends it with `prismaRandom()` is a clean way to centralize extension setup, and the returned extended client type is inferred correctly via `ReturnType<typeof createPrisma>`.

Since this relies on `prisma-extension-random`'s Prisma 6 extension API, please confirm against the library docs that `base.$extends(prismaRandom())` is the recommended usage for your version.

Also applies to: 8-14
tsconfig.json (1)
28-31: LGTM!

The new path alias `@lib/*` correctly maps to `./lib/*` and aligns with the new library modules introduced in this PR.

lib/snippets/processing/indentation.ts (1)
1-1: LGTM!

The updated import path consolidates utility functions into a single module, which improves code organization.
prisma/seed/languages.ts (1)
80-80: LGTM!

The success log provides useful feedback for the seeding process and aligns with the logging pattern in the new seeding flow.
prisma/seed.ts (2)
4-5: LGTM!

The new imports support the extended seeding pipeline introduced in this PR.
13-14: LGTM!

The sequential seeding flow is correct: languages must be seeded before files, and files before snippets. The orchestration aligns with the new hierarchical data model.
lib/snippets/index.ts (1)
15-21: Verify that hash collisions from whitespace-normalized content are acceptable.

Line 17 computes the SHA hash using `hashSnippet(snippet)`, which removes all whitespace before hashing (as seen in `lib/snippets/processing/utils.ts`). This means that two different raw snippets differing only in whitespace/formatting will produce identical SHA values, potentially leading to duplicate detection issues or unintended collisions.

If the intent is to identify semantically identical code regardless of formatting, this is correct. Otherwise, consider hashing the formatted content instead of the raw content.
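To make the collision concrete, here is a small self-contained sketch; `hashNormalized` mimics the whitespace-stripping behavior described above and is not the project's actual `hashSnippet`:

```typescript
import { createHash } from "node:crypto";

// Mimics the described behavior: strip all whitespace, then hash.
function hashNormalized(content: string): string {
  return createHash("sha256").update(content.replace(/\s+/g, "")).digest("hex");
}

// These two snippets differ only in formatting, so they hash identically
// and would collide under a unique constraint on the hash.
const compact = "function add(a,b){return a+b;}";
const pretty = "function add(a, b) {\n\treturn a + b;\n}";
```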
src/features/snippets/config/snippets.server.ts (1)
29-31: All references to renamed constants have been successfully updated.

Verification confirms:
- Old constant names (MIN_SNIPPETS_PER_LANGUAGE, RANDOM_FILES_FETCHED, MAX_SNIPPETS_FETCH_ATTEMPTS) have no remaining references in the codebase
- New constants (MIN_SNIPPETS_NUMBER, SNIPPETS_RETRIEVED_PER_QUERY, MAX_GET_SNIPPETS_ATTEMPTS) are properly exported and imported in src/features/snippets/services/get-random-snippets.server.ts where they are actively used
- The breaking API change has been fully migrated
src/features/snippets/logic/parsing/snippet-parser.server.ts (1)
1-1: The review comment is based on inaccurate assumptions about the refactoring.

The functions (`findValidNodes`, `convertSnippetToText`, `extractSnippets`, `getNodeText`) were not removed—they still exist and are exported from both the old location (`src/features/snippets/logic/parsing/snippet-parser.server.ts`) and the new location (`lib/snippets/parsing/parser.ts`). Additionally, not all consumers have been updated: `src/features/snippets/services/get-random-snippets.server.ts:4` still imports `extractAutoCompleteDisabledRanges` from the old path `@/features/snippets/logic/parsing/snippet-parser.server`.

Likely an incorrect or invalid review comment.
lib/snippets/processing/filter.ts (1)
4-16: Core snippet size/shape filters and `isValidNode` look consistent

Line/length constraints and `isValidNode` are straightforward and align well with the centralized config values in `lib/snippets/config.ts`. The combined use in `filterSnippets` gives a clear, composable gate for snippet validity.

Also applies to: 30-32
lib/snippets/processing/process.ts (1)
16-54: Indentation normalization pipeline is coherent
`formatCode` cleanly composes:

- Early rejection on prohibited characters and mixed indentation.
- A no-op path for unindented code (`type === "none"`).
- Space→tab normalization using `detectIndentationStyle` and `countInitialWhitespaces`.
- Offset removal and empty-line indentation via `adjustIndentationOffset`/`applyIndentationToEmptyLines`.

Assuming the helpers behave as in `indentation.ts`, this should produce consistently tab-indented snippets for downstream processing.

prisma/seed/snippets.ts (1)
62-77: Unique-constraint handling is reasonable

Catching `PrismaClientKnownRequestError` with code `"P2002"` and ignoring that case is a good way to deal with duplicate snippet content across files during parallel seeding, while still surfacing any other unexpected write errors.
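The pattern can be sketched generically; `insertIgnoringDuplicates` and `isUniqueViolation` are illustrative names, and in the real code the predicate checks Prisma's `PrismaClientKnownRequestError` with `code === "P2002"`:

```typescript
// Hedged sketch of the P2002-tolerant insert pattern described above.
// The predicate is injected so the sketch stays independent of Prisma.
async function insertIgnoringDuplicates<T>(
  insert: () => Promise<T>,
  isUniqueViolation: (err: unknown) => boolean,
): Promise<T | null> {
  try {
    return await insert();
  } catch (err) {
    if (isUniqueViolation(err)) {
      // Duplicate content raced in from another file; safe to skip.
      return null;
    }
    throw err; // surface unexpected write errors
  }
}
```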
12-23: End-to-end filtering/parsing pipeline is clean

The combination of:

- `findRandomSnippets` as a DB-level random selector,
- `filterSnippets` as a content gate, and
- `extractAutoCompleteDisabledRanges` as a tree-sitter–based post-processor

produces a clear, layered pipeline for random snippet retrieval. The resulting `ISnippet` shape (content + disabledRanges) is well-factored.

src/features/snippets/infrastructure/repositories/snippet.repository.server.ts (1)
4-19: Repository query is straightforward and matches new data model
`findRandomSnippets` cleanly delegates randomness to `prisma.snippet.findManyRandom`, selecting only `id` and `content` and scoping by `fileVersion.file.languageId`. That aligns well with the new File/FileVersion-based schema and keeps the repository's surface small and focused.
20-24: Verify indentation behavior in `convertSnippetToText`

`convertSnippetToText` prefixes `getInitialIndentation(node.startIndex, sourceCode)` to `node.text`. If `node.text` already includes its original leading indentation (which it typically does with tree-sitter), this will effectively double the outer indentation for nested nodes.

If the goal is to preserve original indentation only, consider either:

- Dropping the extra prefix and returning just `getNodeText(node)`, or
- Normalizing the node text first (e.g., stripping its original leading indentation) before prefixing the computed context indentation.
This is likely fine, but it’s worth validating with a few nested constructs (functions, methods, blocks) to ensure snippets don’t end up over-indented.
Also applies to: 44-45
prisma/migrations/20251117225653_init/migration.sql (1)
1-66: Schema for Language/File/FileVersion/Snippet looks consistent and well-indexed

The migration cleanly models the new hierarchy:

- `Language` keyed by `id` with a unique `name`.
- `File` unique on `(repository, path)` with an index on `languageId`.
- `FileVersion` unique per `(fileId, sha)` and linked back to `File`.
- `Snippet` referencing `FileVersion` with a unique `sha`.

The foreign keys and indexes align with the seeding logic in `prisma/seed/files.ts` and should support efficient lookups. No changes needed from my side.
10-16: Language.id design is intentional and correct — no breaking change exists.

The concern about a missing `@default` generator is invalid. Evidence shows Language records are created exclusively through the seed function (`prisma/seed/languages.ts`), which provides explicit ids for all supported languages (`"js"`, `"ts"`, `"c"`, `"cpp"`, `"cs"`, `"java"`, `"py"`, `"lua"`, `"r"`). The application never attempts to create Language records dynamically without providing an id. The schema design is deliberate: languages are managed through seeding only, not through application-level record creation. No `@default` is needed or expected.
```ts
function filterSnippetSpecialCharacters(snippet: string): boolean {
  const regex = /^[\x00-\x7F]+$/;
  const test = regex.test(snippet);

  return test;
}
```
🧩 Analysis chain
ASCII-only regex triggers Biome warning and could be modernized
filterSnippetSpecialCharacters uses /^[\x00-\x7F]+$/, which Biome flags (noControlCharactersInRegex) and enforces strict ASCII-only content. If you want “ASCII-only” semantics, you can avoid explicit control-character ranges and satisfy the linter with Unicode properties:
```diff
-function filterSnippetSpecialCharacters(snippet: string): boolean {
-  const regex = /^[\x00-\x7F]+$/;
-  const test = regex.test(snippet);
-
-  return test;
-}
+function filterSnippetSpecialCharacters(snippet: string): boolean {
+  // Allow only ASCII characters.
+  const regex = /^[\p{ASCII}]+$/u;
+  return regex.test(snippet);
+}
```

Also double-check that rejecting any non-ASCII characters (identifiers, comments, emojis, etc.) is intentional, as this will exclude many valid snippets.
🏁 Script executed:
```shell
#!/bin/bash
# Verify file exists and inspect the function
cat lib/snippets/processing/filter.ts
```

Length of output: 1255
🏁 Script executed:
```shell
#!/bin/bash
# Search for Biome configuration files
find . -name ".biomeignore" -o -name "biome.json" -o -name "biome.jsonc" 2>/dev/null | head -20
```

Length of output: 12
🏁 Script executed:
```shell
#!/bin/bash
# Find usages of filterSnippetSpecialCharacters to understand purpose
rg -n "filterSnippetSpecialCharacters" --type ts --type js
```

Length of output: 637
🏁 Script executed:
```shell
#!/bin/bash
# Check TypeScript configuration for target/lib settings
cat tsconfig.json 2>/dev/null || cat tsconfig.*.json 2>/dev/null | head -50
```

Length of output: 746
🏁 Script executed:
```shell
#!/bin/bash
# Look for any linting output or CI config that might show Biome warnings
find . -name ".github" -type d | xargs -I {} find {} -name "*.yml" -o -name "*.yaml" 2>/dev/null | xargs cat 2>/dev/null | grep -i biome -A 5 -B 5 | head -50
```

Length of output: 0
🌐 Web query:
Does TypeScript/JavaScript support \p{ASCII} Unicode property in regex?
💡 Result:
Yes — JavaScript (and therefore TypeScript at runtime) supports the Unicode binary property ASCII via Unicode property escapes. Use the u (Unicode) flag, e.g. /\p{ASCII}/u. [1][2]
Sources:
[1] MDN — Unicode character class escape (\p{...}, \P{...}).
[2] MDN (Chinese) — notes listing binary properties including ASCII.
🌐 Web query:
What does Biome noControlCharactersInRegex rule flag?
💡 Result:
The Biome rule noControlCharactersInRegex flags control (non-printable) ASCII characters in regular expressions — i.e. characters U+0000–U+001F — when they appear as:
- hex escapes (\x00–\x1F)
- Unicode escapes (\u0000–\u001F)
- Unicode code point escapes (\u{0}–\u{1F})
- unescaped raw characters U+0000–U+001F
It allows common control escape sequences like \t and \n. [1][2]
---
🏁 Script executed:
```shell
#!/bin/bash
# Inspect Biome configuration
cat biome.json
```

Length of output: 1774
🏁 Script executed:
```shell
#!/bin/bash
# Check if Next.js project uses custom Babel config or build tools
ls -la | grep -i babel
cat .babelrc* 2>/dev/null || cat babel.config.* 2>/dev/null
```

Length of output: 0
🏁 Script executed:
```shell
#!/bin/bash
# Search for context around snippet filtering — comments or docs explaining ASCII requirement
rg -n "ASCII|control|character|\\\\x00" --type ts --type js lib/snippets src/features/snippets -B 2 -A 2
```

Length of output: 3868
🏁 Script executed:
```shell
#!/bin/bash
# Check if project has a next.config or similar that might affect regex compilation
cat next.config.* 2>/dev/null | head -100
```

Length of output: 594
The suggested Unicode property syntax may not work with your ES2017 TypeScript target without additional transpilation config
The Biome warning (noControlCharactersInRegex) is real and correctly triggered by /^[\x00-\x7F]+$/. However, the suggested fix /^[\p{ASCII}]+$/u uses Unicode properties, which were added in ES2018—your tsconfig targets ES2017.
To satisfy Biome without risking transpilation issues, simplify the regex to avoid explicit control-character ranges:
```diff
 function filterSnippetSpecialCharacters(snippet: string): boolean {
-  const regex = /^[\x00-\x7F]+$/;
-  const test = regex.test(snippet);
-
-  return test;
+  // Allow only printable ASCII characters (0x20–0x7E).
+  const regex = /^[ -~]*$/;
+  return regex.test(snippet);
 }
```

This avoids control characters entirely and satisfies Biome without transpilation concerns. Alternatively, confirm your build system handles Unicode property transpilation, then the `/^\p{ASCII}+$/u` approach works.
Also verify that rejecting all non-ASCII content (identifiers, comments, emojis, etc.) is intentional—this filter may be too restrictive for valid snippets.
🧰 Tools
🪛 Biome (2.1.2)
[error] 19-19: Unexpected control character in a regular expression.
Control characters are unusual and potentially incorrect inputs, so they are disallowed.
(lint/suspicious/noControlCharactersInRegex)
🤖 Prompt for AI Agents
In lib/snippets/processing/filter.ts around lines 18-23, the current regex
/^[\x00-\x7F]+$/ triggers Biome's noControlCharactersInRegex and the suggested
Unicode property /\p{ASCII}/u requires ES2018+, so either replace the regex with
a printable-ASCII-only pattern (e.g. use the visible ASCII range to exclude
control characters, such as /^[ -~]+$/) to satisfy Biome without changing
transpilation, or if you truly need full ASCII control characters support,
update the project target to ES2018+ (or enable Unicode property transpilation)
and switch to /^\p{ASCII}+$/u; also confirm that rejecting all non-ASCII input
is intentional before applying the change.
```ts
function filterSnippetSpecialCharacters(snippet: string): boolean {
  const regex = /^[\x00-\x7F]+$/;
  const test = regex.test(snippet);

  return test;
}
```
🧩 Analysis chain
Fix filterSnippetSpecialCharacters to satisfy lint and clarify intent
Biome flags ^[\x00-\x7F]+$ for control characters; it’s also unclear whether you intend to allow all ASCII (including control chars) or just printable characters.
If the goal is “no non-ASCII characters”, a clearer and lint-friendly implementation is:
```ts
function filterSnippetSpecialCharacters(snippet: string): boolean {
  // allow only ASCII (code points <= 0x7f)
  for (let i = 0; i < snippet.length; i += 1) {
    if (snippet.charCodeAt(i) > 0x7f) {
      return false;
    }
  }

  return true;
}
```

This avoids control characters in the regex entirely and makes the intent explicit.
Please re-run Biome after changing this to ensure the lint error is resolved.
🏁 Script executed:
```shell
#!/bin/bash
# Check Biome availability and configuration
which biome && biome --version

# Display the file content to understand context
echo "=== Current file content ==="
cat src/features/snippets/logic/filter.ts

# Check if biome.json or similar config exists
echo "=== Biome config ==="
ls -la | grep -E "biome|eslint" || echo "No Biome config found at root"
```

Length of output: 1364
🏁 Script executed:
```shell
#!/bin/bash
# Run Biome lint on the specific file to see actual errors
if command -v biome &> /dev/null; then
  echo "=== Running Biome lint on filter.ts ==="
  biome lint src/features/snippets/logic/filter.ts || true
fi
```

Length of output: 724
🏁 Script executed:
```shell
#!/bin/bash
# Search for any tests or documentation about these filter functions
echo "=== Looking for tests ==="
find . -type f \( -name "*.test.ts" -o -name "*.spec.ts" -o -name "*.test.js" -o -name "*.spec.js" \) | xargs grep -l "filterSnippet\|filterTabs" 2>/dev/null || echo "No tests found"
echo "=== Looking for usage of filterSnippetSpecialCharacters ==="
rg -n "filterSnippetSpecialCharacters" --type=ts --type=js
```

Length of output: 738
Fix filterSnippetSpecialCharacters to resolve Biome lint error
Biome flags the regex /^[\x00-\x7F]+$/ with noControlCharactersInRegex. To fix this while maintaining the same logic (allow only ASCII, reject non-ASCII), use a loop-based implementation:
```typescript
function filterSnippetSpecialCharacters(snippet: string): boolean {
  for (let i = 0; i < snippet.length; i += 1) {
    if (snippet.charCodeAt(i) > 0x7f) {
      return false;
    }
  }
  return true;
}
```

This avoids control characters in regex and makes the intent explicit.
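For illustration, here is how the loop-based filter behaves on a few inputs; note that, unlike a printable-only check, ASCII control characters such as tabs still pass (the function is repeated so the sketch is self-contained):

```typescript
// Self-contained copy of the loop-based filter for a quick behavioral check.
function filterSnippetSpecialCharacters(snippet: string): boolean {
  for (let i = 0; i < snippet.length; i += 1) {
    if (snippet.charCodeAt(i) > 0x7f) {
      return false;
    }
  }
  return true;
}

console.log(filterSnippetSpecialCharacters("const x = 42;")); // true: plain ASCII
console.log(filterSnippetSpecialCharacters("const café = 1;")); // false: "é" is non-ASCII
console.log(filterSnippetSpecialCharacters("let x = 1;\t// tab")); // true: tab is an ASCII control character
```

Whether control characters should pass is exactly the "all ASCII vs. printable" question raised above.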
🧰 Tools
🪛 Biome (2.1.2)
[error] 18-18: Unexpected control character in a regular expression.
Control characters are unusual and potentially incorrect inputs, so they are disallowed.
(lint/suspicious/noControlCharactersInRegex)
🤖 Prompt for AI Agents
In src/features/snippets/logic/filter.ts around lines 17 to 22, the current regex /^[\x00-\x7F]+$/ triggers Biome's noControlCharactersInRegex rule; replace the regex approach with a simple loop over the snippet characters that checks charCodeAt(i) and returns false if any codepoint is > 0x7F, otherwise return true; this preserves the original "only ASCII" logic while avoiding control characters in a regex.
Actionable comments posted: 2
♻️ Duplicate comments (2)
lib/snippets/processing/filter.ts (1)
18-23: Biome control character warning remains unresolved.

The regex `/^[\x00-\x7F]+$/` still triggers Biome's `noControlCharactersInRegex` rule. As noted in the previous review, your ES2017 target prevents using Unicode properties without additional transpilation.

To satisfy Biome without transpilation concerns, replace with printable ASCII only:

```diff
 function filterSnippetSpecialCharacters(snippet: string): boolean {
-  const regex = /^[\x00-\x7F]+$/;
-  const test = regex.test(snippet);
-
-  return test;
+  // Allow only printable ASCII characters (0x20–0x7E).
+  const regex = /^[ -~]+$/;
+  return regex.test(snippet);
 }
```

Also verify that rejecting all non-ASCII content (identifiers, comments, emojis) is intentional.
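One caveat worth noting (not raised above): `[ -~]` excludes tabs and newlines as well, so applying this regex to a whole multi-line snippet would reject every snippet; it would need to run per line. A quick sketch:

```typescript
// /^[ -~]+$/ accepts only printable ASCII: 0x20 (space) through 0x7e (~).
const printableAscii = /^[ -~]+$/;

console.log(printableAscii.test("const x = 1;")); // true: printable ASCII only
console.log(printableAscii.test("const café = 1;")); // false: non-ASCII character
console.log(printableAscii.test("line1\nline2")); // false: newline is a control character
```

If multi-line input must pass, split first, e.g. `snippet.split("\n").every((line) => printableAscii.test(line))` (empty lines would still fail the `+` quantifier; use `*` if blank lines are allowed).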
prisma/schema.prisma (1)
43-50: Global `sha` uniqueness must be replaced with a composite constraint.

The `sha` field at line 45 is marked globally `@unique`, but each `Snippet` is scoped to a specific `FileVersion` (non-nullable foreign key). This prevents different `FileVersion` records from legitimately storing snippets with identical content hashes, a design requirement confirmed by the seeding code in `prisma/seed/snippets.ts` (lines 74-75), which explicitly catches and ignores unique constraint violations.

Apply this diff to enforce uniqueness per file version:

```diff
 model Snippet {
   id            String      @id @default(uuid())
-  sha           String      @unique
+  sha           String
   content       String
   rawContent    String
   fileVersionId String
   fileVersion   FileVersion @relation(fields: [fileVersionId], references: [id], onDelete: Cascade)
+
+  @@unique([fileVersionId, sha])
 }
```

After updating the schema, run `npx prisma migrate dev` and remove the try-catch workaround in `prisma/seed/snippets.ts` that silently ignores `P2002` errors.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `lib/snippets/processing/filter.ts` (1 hunks)
- `prisma/schema.prisma` (1 hunks)
- `prisma/seed/snippets.ts` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
prisma/seed/snippets.ts (1)
lib/snippets/index.ts (1)
`getSnippetsFromFile` (12-26)
lib/snippets/processing/filter.ts (1)
lib/snippets/config.ts (6)
- `MAX_LINES` (1-1)
- `MIN_LINES` (2-2)
- `MAX_LINE_LENGTH` (4-4)
- `MIN_SNIPPET_LENGTH` (7-7)
- `MAX_SNIPPET_LENGTH` (6-6)
- `VALID_NODES` (9-27)
🪛 Biome (2.1.2)
lib/snippets/processing/filter.ts
[error] 19-19: Unexpected control character in a regular expression.
Control characters are unusual and potentially incorrect inputs, so they are disallowed.
(lint/suspicious/noControlCharactersInRegex)
🔇 Additional comments (5)
prisma/seed/snippets.ts (1)
6-6: Verify the concurrency limit of 100 is appropriate for the GitHub raw content API.

The `p-limit(100)` setting allows 100 concurrent fetches from `raw.githubusercontent.com`. While raw content endpoints are generally more permissive than the REST API, this limit might still be aggressive and could trigger rate limiting or timeouts.

Consider reducing to a more conservative value (e.g., 10-20) unless you've confirmed this limit works reliably in production.
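The p-limit package already provides this gate; purely to illustrate what a lower cap enforces, here is a minimal, self-contained limiter sketch (a hypothetical helper with an illustrative cap of 3, not the seed's actual code):

```typescript
// Minimal sketch of a p-limit-style concurrency gate (illustrative only; the
// seed itself uses the p-limit package). A cap of 3 stands in for e.g. 10-20.
function createLimiter(max: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const release = () => {
    const next = queue.shift();
    if (next) {
      next(); // hand the slot directly to the next waiter (active stays the same)
    } else {
      active -= 1;
    }
  };
  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= max) {
      await new Promise<void>((resolve) => queue.push(resolve));
    } else {
      active += 1;
    }
    try {
      return await task();
    } finally {
      release();
    }
  };
}

// Verify the cap holds by tracking the highest observed concurrency.
async function demo(): Promise<number> {
  const limit = createLimiter(3);
  let running = 0;
  let peak = 0;
  const tasks = Array.from({length: 12}, (_, i) =>
    limit(async () => {
      running += 1;
      peak = Math.max(peak, running);
      await new Promise((resolve) => setTimeout(resolve, 5));
      running -= 1;
      return i;
    }),
  );
  await Promise.all(tasks);
  return peak;
}

demo().then((peak) => console.log("peak concurrency:", peak));
```

Handing the slot off in `release` (rather than decrementing and then resolving) prevents a late caller from stealing the slot between a task finishing and a queued task resuming.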
prisma/schema.prisma (3)
23-23: onDelete cascade behavior has been correctly added.

The `onDelete: Cascade` specification ensures that deleting a `Language` will cascade to all associated `File` records, maintaining data integrity.
37-37: onDelete cascade behavior has been correctly added.

The `onDelete: Cascade` specification ensures that deleting a `File` will cascade to all associated `FileVersion` records, maintaining referential integrity.
49-49: onDelete cascade behavior has been correctly added.

The `onDelete: Cascade` specification ensures that deleting a `FileVersion` will cascade to all associated `Snippet` records.

lib/snippets/processing/filter.ts (1)
25-28: Critical: Tab regex matches the literal "\t" string, not tab characters.

The regex `/^.+\\t.+$/` uses a double-escaped backslash (`\\t`), which matches the two-character literal string `\t` rather than an actual tab character. This means the filter fails to detect real tabs in snippets.

Apply this diff to fix the regex:

```diff
 function filterTabsInBetween(snippet: string): boolean {
-  const regex = /^.+\\t.+$/;
+  const regex = /^.+\t.+$/;
   return !regex.test(snippet);
 }
```

Likely an incorrect or invalid review comment.
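The behavioral difference between the two regex literals is easy to verify in isolation:

```typescript
// In a regex literal, \\t is an escaped backslash followed by "t" (the two
// characters "\t"), while \t is the tab character itself.
const literalBackslashT = /^.+\\t.+$/;
const realTab = /^.+\t.+$/;

const tabbed = "const a = 1;\tconst b = 2;"; // contains an actual tab
const escapedText = "path\\to\\tmp"; // contains the two characters "\t"

console.log(literalBackslashT.test(tabbed)); // false: no literal "\t" sequence
console.log(realTab.test(tabbed)); // true: real tab detected
console.log(literalBackslashT.test(escapedText)); // true: matches the "\t" text
```

Running this against the repo's actual source would settle whether the review comment applies.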
```typescript
if (res.status === 404 || res.status === 400) {
  await prisma.fileVersion.delete({
    where: {id: file.id},
  });
  console.warn(`Deleted file version ${file.id} due to 404 (file missing).`);
}
```
FileVersion deletion on 404/400 may orphan parent File records.
When a file is not found (404/400), the code deletes the FileVersion but leaves the parent File intact. If all versions of a file are deleted, the File becomes an orphan with no fileVersions.
Consider checking after deletion whether the parent File has any remaining fileVersions, and if not, delete the File as well:
```diff
 if (res.status === 404 || res.status === 400) {
   await prisma.fileVersion.delete({
     where: {id: file.id},
   });
+  // Clean up orphaned parent File if this was the last version
+  const remainingVersions = await prisma.fileVersion.count({
+    where: {fileId: file.file.id},
+  });
+  if (remainingVersions === 0) {
+    await prisma.file.delete({where: {id: file.file.id}});
+  }
   console.warn(`Deleted file version ${file.id} due to 404 (file missing).`);
 }
```
}However, you'll need to adjust the query to fetch file.file.id in the initial select (line 13).
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In prisma/seed/snippets.ts around lines 46 to 51, deleting the FileVersion on a 404/400 can leave an orphaned parent File; update the initial query (around line 13) to also select the parent file id (file.file.id), then after await prisma.fileVersion.delete(...) check whether any fileVersions remain for that parent (using prisma.fileVersion.count or prisma.file.findUnique with select/count); if the count is zero, delete the parent File via prisma.file.delete({ where: { id: parentId } }), and log the action. Ensure this logic runs atomically enough for your seed context (or retry on constraint errors).
```typescript
try {
  await prisma.snippet.create({
    data: {
      sha: snippet.sha,
      content: snippet.content,
      rawContent: snippet.rawContent,
      fileVersionId: file.id,
    },
  });
} catch (error) {
  // Ignore unique constraint violation (since multiple files can produce the same snippet). Upsert is not an option since we are seeding in parallel.
  if (error instanceof Prisma.PrismaClientKnownRequestError && error.code === "P2002") return;
  throw error;
}
```
🛠️ Refactor suggestion | 🟠 Major
Remove unique constraint workaround once schema is corrected.
The try-catch block silently ignores `P2002` errors with the comment "since multiple files can produce the same snippet." This is a workaround for the incorrect global `@unique` constraint on `Snippet.sha` in the schema.

Once the schema is updated to use `@@unique([fileVersionId, sha])` (as noted in the schema review), remove this try-catch block to let real errors surface:
```diff
 snippets.map(async (snippet) => {
-  try {
-    await prisma.snippet.create({
-      data: {
-        sha: snippet.sha,
-        content: snippet.content,
-        rawContent: snippet.rawContent,
-        fileVersionId: file.id,
-      },
-    });
-  } catch (error) {
-    // Ignore unique constraint violation (since multiple files can produce the same snippet). Upsert is not an option since we are seeding in parallel.
-    if (error instanceof Prisma.PrismaClientKnownRequestError && error.code === "P2002") return;
-    throw error;
-  }
+  await prisma.snippet.create({
+    data: {
+      sha: snippet.sha,
+      content: snippet.content,
+      rawContent: snippet.rawContent,
+      fileVersionId: file.id,
+    },
+  });
 })
```

🤖 Prompt for AI Agents
In prisma/seed/snippets.ts around lines 64 to 77 there is a try-catch that swallows Prisma P2002 unique-constraint errors for Snippet.sha as a workaround for an incorrect global unique; once the schema is corrected to use @@unique([fileVersionId, sha]) remove this try-catch entirely so duplicate-snippet conflicts are handled by the schema and real errors will surface during seeding; after removing the block, run the migration and re-run the seed to verify no P2002 suppression remains and legitimate errors are reported.
Summary by CodeRabbit
New Features
Refactor