Skip to content

refactor: per-language registry — eliminate cross-PR conflict surface for language additions#20

Open
mschreib28 wants to merge 2 commits into
mainfrom
upstream/refactor/language-registry
Open

refactor: per-language registry — eliminate cross-PR conflict surface for language additions#20
mschreib28 wants to merge 2 commits into
mainfrom
upstream/refactor/language-registry

Conversation

@mschreib28
Copy link
Copy Markdown
Owner

Reviewers: for a 5-minute review path that surfaces only what to verify, see the review checklist comment below. The original description is preserved here for full technical detail.\n\n---\n\n## Summary\n\nAdding a new language used to require coordinated edits to 6 shared lists across 4 files. Two PRs adding different languages (HCL, R, SQL, Scala, Vue, ReScript, mql5, …) typically conflicted on every one of them.\n\nAfter this refactor, adding a new language is:\n\n1. Drop a file at src/extraction/languages/<name>.ts exporting an <NAME>_DEF: LanguageDef constant.\n2. Add one import line and one array entry to src/extraction/languages/registry.ts (alphabetical position — adjacent additions still possible but rare).\n\nEverything else (grammars.ts, types.ts, tree-sitter.ts dispatch, DEFAULT_CONFIG.include) is derived from the registry.\n\n## What's a LanguageDef?\n\nts\ninterface LanguageDef {\n name: string; // canonical id, used in Node/Edge.language\n displayName: string; // "Pascal / Delphi"\n extensions: readonly string[]; // ['.pas', '.dpr', '.dfm', ...]\n includeGlobs: readonly string[]; // ['**/*.pas', ...]\n grammar?: { // tree-sitter-backed languages\n wasmFile: string;\n vendored?: boolean;\n extractor: LanguageExtractor;\n };\n customExtractor?: CustomExtractorFn; // for Liquid, Svelte, etc.\n extensionOverrides?: Record<string, { customExtractor }>; // Pascal .dfm/.fmx\n}\n\n\nEach existing language file now exports both its xxxExtractor (unchanged) AND a new XXX_DEF. New files were added for tsx, jsx, svelte, liquid — bringing all 19 currently-supported languages into the registry uniformly.\n\n## Refactored consumers\n\n| File | Before | After |\n|---|---|---|\n| src/extraction/grammars.ts | Hardcoded WASM_GRAMMAR_FILES, EXTENSION_MAP, getLanguageDisplayName map | All derived from registry. EXTENSION_MAP is now a Proxy that lazy-builds on first access (avoids TDZ in cyclic load paths). WASM_GRAMMAR_FILES removed (was internal-only). |\n| src/extraction/tree-sitter.ts | if (lang === 'svelte') ... if 'liquid' ... if 'pascal' && ... chain in extractFromSource | One lookup: def.extensionOverrides[ext]?.customExtractor || def.customExtractor || tree-sitter. Drops direct imports of LiquidExtractor, SvelteExtractor, DfmExtractor. |\n| src/types.ts | DEFAULT_CONFIG literal with hardcoded include globs per language | Moved to src/default-config.ts; types.ts re-exports for backward compat. include array built lazily from LanguageDef.includeGlobs. |\n| src/extraction/languages/index.ts | Hardcoded EXTRACTORS: Partial<Record<Language, LanguageExtractor>> | Now derived from registry. |\n\n## What still requires a one-line edit\n\nThe Language string union in types.ts still hard-codes the known languages (typescript | javascript | … | unknown). New languages added to the registry work at runtime as strings, but adding the literal here is required IF the resolver wants exhaustive narrowing on the new language — src/resolution/index.ts and src/resolution/import-resolver.ts have a few language === 'X' branches. Most new languages don't need any.\n\nTrade-off: strict narrowing preserved for the existing handful of language-specific code paths; everything else is registry-driven.\n\n## Tests\n\n380/380 existing tests pass + 16 new structural-invariant tests in __tests__/language-registry.test.ts (added in the fix commit e43a618 after the reviewer pass — see below).\n\nThe 16 invariants assert that every derived consumer (EXTRACTORS, EXTENSION_MAP, detectLanguage, isLanguageSupported, getSupportedLanguages, getLanguageDisplayName) exactly mirrors the registry contents — drift between the registry and any derived view is impossible. Existing extraction.test.ts and pr19-improvements.test.ts continue to exercise detectLanguage, isLanguageSupported, getSupportedLanguages, loadAllGrammars, and the per-language extractors — all green.\n\n## Impact on open PRs\n\nEvery open language PR (colbymchenry#92 HCL, colbymchenry#94 R, colbymchenry#95 SQL, colbymchenry#91 Scala, colbymchenry#66 Vue, colbymchenry#58 ReScript, colbymchenry#57 mql5) currently conflicts with main + with each other on the 6 shared lists. After this lands, they each shrink to:\n\n- 1 new file: src/extraction/languages/<name>.ts\n- 2 lines in registry.ts (the import + the array entry)\n- 1 line in Language union if exhaustive narrowing is needed (rare)\n\nThat's it. Two language PRs no longer conflict on each other unless their alphabetical positions overlap in registry.ts (and even then, it's a trivial "add both lines" resolution).\n\n## Follow-ups (out of scope, tracked in commit message)\n\n- Auto-discovery in registry.ts (fs.readdirSync) — works in built dist/ but vite-node doesn't support extensionless require() of TS source. A small build-time generator could replace the static import list entirely.\n- Splitting __tests__/extraction.test.ts into per-language files — eliminates the test-end-of-file conflict surface that every language PR currently also hits.\n- Similar registry refactors for: MCP tool definitions, migration files, index/sync hooks. Each follows the same pattern (self-registering files, derived registries) and would unblock another class of PR conflicts.\n\n## Test plan\n\n- [x] tsc --noEmit clean\n- [x] npm test — 380/380 passing\n- [x] End-to-end smoke: npm run build && node -e "require('./dist').getSupportedLanguages()" returns all 19 languages\n- [x] DEFAULT_CONFIG.include built lazily — verified via require('./dist/types').DEFAULT_CONFIG.include.length === 29\n- [x] Pascal .dfm files still route to DfmExtractor via extensionOverrides\n- [x] Svelte and Liquid still route to their custom extractors\n\n🤖 Generated with Claude Code\n\n\n


Copied from colbymchenry/codegraph#116

andreinknv and others added 2 commits April 27, 2026 16:27
Adding a new language used to require coordinated edits to 6
shared lists across 4 files (Language union in types.ts;
DEFAULT_CONFIG.include; WASM_GRAMMAR_FILES, EXTENSION_MAP, and
getLanguageDisplayName in grammars.ts; EXTRACTORS map in
languages/index.ts). Two PRs adding different languages typically
conflicted on every one of those.

After this refactor, adding a new language is:

  1. Drop a file at src/extraction/languages/<name>.ts exporting an
     <NAME>_DEF: LanguageDef constant.
  2. Add ONE import line and ONE array entry to
     src/extraction/languages/registry.ts (alphabetical position —
     adjacent additions are still possible but rare).

That's it. grammars.ts, types.ts, tree-sitter.ts dispatch, and the
default include globs are all derived from the registry.

## What's in a LanguageDef

```ts
interface LanguageDef {
  name: string;                  // canonical id
  displayName: string;           // "Pascal / Delphi"
  extensions: readonly string[]; // ['.pas', '.dpr', ...]
  includeGlobs: readonly string[];
  grammar?: { wasmFile, vendored?, extractor };  // tree-sitter
  customExtractor?: (fp, src) => ExtractionResult;  // Liquid, Svelte
  extensionOverrides?: { '.dfm': { customExtractor } };  // Pascal forms
}
```

Each existing language file now exports both its `xxxExtractor`
(unchanged) AND a new `XXX_DEF`. New files were added for tsx, jsx,
svelte, liquid (the latter two wrap their existing custom extractor
classes via the customExtractor field).

## Refactored consumers

- src/extraction/grammars.ts: WASM_GRAMMAR_FILES removed (was
  internal-only); EXTENSION_MAP now a Proxy that lazy-builds from
  the registry on first access (avoids TDZ in cyclic load paths).
  loadGrammarsForLanguages, isLanguageSupported, isGrammarLoaded,
  getSupportedLanguages, getLanguageDisplayName, detectLanguage —
  all read from registry.
- src/extraction/tree-sitter.ts: extractFromSource's if-chain
  (svelte / liquid / pascal+.dfm/.fmx) replaced with one lookup:
  def.extensionOverrides[ext]?.customExtractor || def.customExtractor.
  Drops direct imports of LiquidExtractor, SvelteExtractor,
  DfmExtractor.
- src/types.ts: DEFAULT_CONFIG moved to src/default-config.ts (cycle
  break). types.ts re-exports for backward compat. The `include`
  array is now built lazily from each LanguageDef's includeGlobs.

## What still requires a one-line edit

The Language string union in types.ts still hard-codes the known
languages (typescript | javascript | … | unknown). New languages
added to the registry work at runtime as strings, but adding the
literal here is required IF the resolver wants to do exhaustive
narrowing on the new language (resolution/index.ts and
resolution/import-resolver.ts have a few `language === 'X'`
branches). Most new languages don't need such branches.

This trade-off keeps strict narrowing for the existing handful of
language-specific code paths while making everything else
registry-driven.

## Tests

380/380 pass. No new tests; behavior is identical. Existing
extraction.test.ts and pr19-improvements.test.ts heavily exercise
detectLanguage, isLanguageSupported, getSupportedLanguages, and
loadAllGrammars — all green.

## Follow-ups (out of scope)

- Auto-discovery in registry.ts via fs.readdirSync — works in
  built dist/ but vite-node doesn't support extensionless require()
  of TS source. A small build-time generator could remove the
  static import list entirely.
- Splitting __tests__/extraction.test.ts into per-language test
  files — eliminates the test-end-of-file conflict surface that
  every language PR currently hits.
- Similar registry refactors for:
  - MCP tool definitions (each tool self-registers; no shared
    tools[] array or case-switch in execute())
  - Migration files (each migration in src/db/migrations/NNN-*.ts;
    auto-discovered by version)
  - Index/sync hooks (centrality, churn, issue-history,
    config-refs, sql-refs, cochange all currently mutate
    CodeGraph.indexAll/sync; an IndexHook interface would make
    each pass self-contained)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tractor

Reviewer caught a real bug: the original commit kept the
EXTRACTORS map in src/extraction/languages/index.ts as a separate
hand-curated registry that TreeSitterExtractor read from. Adding
a new grammar-backed language would have required editing
EXTRACTORS too, undermining the refactor's stated single-source-of-
truth claim. A future contributor missing the EXTRACTORS update
would silently produce empty extraction results.

Fix:
- TreeSitterExtractor now reads its extractor straight off the
  language def: getLanguageDefByName(this.language)?.grammar?.extractor
- EXTRACTORS in languages/index.ts becomes a Proxy that derives
  lazily from the registry (kept for backward compat — readers
  unchanged).
- Add 16 structural-invariant tests in __tests__/language-registry.test.ts
  that fail loudly if any derived consumer drifts from the registry:
  EXTRACTORS / EXTENSION_MAP / detectLanguage / isLanguageSupported /
  getSupportedLanguages / getLanguageDisplayName all asserted to
  exactly mirror the registry contents.

Adding a new grammar-backed language is now genuinely "one new file
+ two lines in registry.ts" — no other files to touch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants