refactor: per-language registry — eliminate cross-PR conflict surface for language additions#116
refactor: per-language registry — eliminate cross-PR conflict surface for language additions#116andreinknv wants to merge 2 commits into
Conversation
Adding a new language used to require coordinated edits to 6
shared lists across 4 files (Language union in types.ts;
DEFAULT_CONFIG.include; WASM_GRAMMAR_FILES, EXTENSION_MAP, and
getLanguageDisplayName in grammars.ts; EXTRACTORS map in
languages/index.ts). Two PRs adding different languages typically
conflicted on every one of those.
After this refactor, adding a new language is:
1. Drop a file at src/extraction/languages/<name>.ts exporting an
<NAME>_DEF: LanguageDef constant.
2. Add ONE import line and ONE array entry to
src/extraction/languages/registry.ts (alphabetical position —
adjacent additions are still possible but rare).
That's it. grammars.ts, types.ts, tree-sitter.ts dispatch, and the
default include globs are all derived from the registry.
## What's in a LanguageDef
```ts
interface LanguageDef {
name: string; // canonical id
displayName: string; // "Pascal / Delphi"
extensions: readonly string[]; // ['.pas', '.dpr', ...]
includeGlobs: readonly string[];
grammar?: { wasmFile, vendored?, extractor }; // tree-sitter
customExtractor?: (fp, src) => ExtractionResult; // Liquid, Svelte
extensionOverrides?: { '.dfm': { customExtractor } }; // Pascal forms
}
```
Each existing language file now exports both its `xxxExtractor`
(unchanged) AND a new `XXX_DEF`. New files were added for tsx, jsx,
svelte, liquid (the latter two wrap their existing custom extractor
classes via the customExtractor field).
## Refactored consumers
- src/extraction/grammars.ts: WASM_GRAMMAR_FILES removed (was
internal-only); EXTENSION_MAP now a Proxy that lazy-builds from
the registry on first access (avoids TDZ in cyclic load paths).
loadGrammarsForLanguages, isLanguageSupported, isGrammarLoaded,
getSupportedLanguages, getLanguageDisplayName, detectLanguage —
all read from registry.
- src/extraction/tree-sitter.ts: extractFromSource's if-chain
(svelte / liquid / pascal+.dfm/.fmx) replaced with one lookup:
def.extensionOverrides[ext]?.customExtractor || def.customExtractor.
Drops direct imports of LiquidExtractor, SvelteExtractor,
DfmExtractor.
- src/types.ts: DEFAULT_CONFIG moved to src/default-config.ts (cycle
break). types.ts re-exports for backward compat. The `include`
array is now built lazily from each LanguageDef's includeGlobs.
## What still requires a one-line edit
The Language string union in types.ts still hard-codes the known
languages (typescript | javascript | … | unknown). New languages
added to the registry work at runtime as strings, but adding the
literal here is required IF the resolver wants to do exhaustive
narrowing on the new language (resolution/index.ts and
resolution/import-resolver.ts have a few `language === 'X'`
branches). Most new languages don't need such branches.
This trade-off keeps strict narrowing for the existing handful of
language-specific code paths while making everything else
registry-driven.
## Tests
380/380 pass. No new tests; behavior is identical. Existing
extraction.test.ts and pr19-improvements.test.ts heavily exercise
detectLanguage, isLanguageSupported, getSupportedLanguages, and
loadAllGrammars — all green.
## Follow-ups (out of scope)
- Auto-discovery in registry.ts via fs.readdirSync — works in
built dist/ but vite-node doesn't support extensionless require()
of TS source. A small build-time generator could remove the
static import list entirely.
- Splitting __tests__/extraction.test.ts into per-language test
files — eliminates the test-end-of-file conflict surface that
every language PR currently hits.
- Similar registry refactors for:
- MCP tool definitions (each tool self-registers; no shared
tools[] array or case-switch in execute())
- Migration files (each migration in src/db/migrations/NNN-*.ts;
auto-discovered by version)
- Index/sync hooks (centrality, churn, issue-history,
config-refs, sql-refs, cochange all currently mutate
CodeGraph.indexAll/sync; an IndexHook interface would make
each pass self-contained)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tractor Reviewer caught a real bug: the original commit kept the EXTRACTORS map in src/extraction/languages/index.ts as a separate hand-curated registry that TreeSitterExtractor read from. Adding a new grammar-backed language would have required editing EXTRACTORS too, undermining the refactor's stated single-source-of- truth claim. A future contributor missing the EXTRACTORS update would silently produce empty extraction results. Fix: - TreeSitterExtractor now reads its extractor straight off the language def: getLanguageDefByName(this.language)?.grammar?.extractor - EXTRACTORS in languages/index.ts becomes a Proxy that derives lazily from the registry (kept for backward compat — readers unchanged). - Add 16 structural-invariant tests in __tests__/language-registry.test.ts that fail loudly if any derived consumer drifts from the registry: EXTRACTORS / EXTENSION_MAP / detectLanguage / isLanguageSupported / getSupportedLanguages / getLanguageDisplayName all asserted to exactly mirror the registry contents. Adding a new grammar-backed language is now genuinely "one new file + two lines in registry.ts" — no other files to touch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Reviewer caught a real bug — the original commit left a parallel Pushed fix in e43a618:
396/396 tests pass. |
|
Part of a coordinated 4-PR refactor that unblocks the open PR backlog. See #120 for the full merge-order guide explaining how this PR fits into the picture and how each open language/feature PR rebases onto the new pattern. |
Originally based on monolithic grammars.ts/tree-sitter.ts; rebased to the per-language registry pattern (PR colbymchenry#116): - src/extraction/languages/hcl.ts (LanguageDef with grammar+custom) - 'hcl' added to Language union - Vendored tree-sitter-hcl.wasm - HclExtractor (custom block extractor) - 220 extraction tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sections 3, 5a, 5b previously taught the monolithic-file pattern that PR colbymchenry#116 obsoletes. After colbymchenry#116, adding a language is one new file in src/extraction/languages/ + 2 lines in registry.ts (and an optional 1-line addition to the Language union for TypeScript narrowing). Updated: - Section 3: full rewrite. Was 3-file mutation (types.ts, grammars.ts, CLAUDE.md). Now: 1 LanguageDef file + registry import + Language union entry. Includes a "why per-file" sidebar pointing at the cross-PR conflict bottleneck the registry resolves. - Section 5a: drops the EXTRACTORS-map registration step. The extractor is referenced from the LanguageDef directly. - Section 5b: drops the tree-sitter.ts dispatch wiring. customExtractor on the LanguageDef takes the dispatch — no per-language if-branches. Section 1 (sourcing wasm), Section 2 (probing AST), Sections 6/7/8 (NodeKind mapping, tests, PR checklist), and the existing-extractors reference table are unchanged — those parts of the workflow didn't change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Maintainer review checklist (5-min path)This refactor is behavior-preserving. Below is the shortest path to verify that without re-reading the 26-file patch. 1. What changed structurally
2. What stays exactly the same — verify in seconds
Existing imports keep working unchanged. The 3. Invariants now enforced by tests (
|
Summary
Adding a new language used to require coordinated edits to 6 shared lists across 4 files. Two PRs adding different languages (HCL, R, SQL, Scala, Vue, ReScript, mql5, …) typically conflicted on every one of them.
After this refactor, adding a new language is:
src/extraction/languages/<name>.tsexporting an<NAME>_DEF: LanguageDefconstant.src/extraction/languages/registry.ts(alphabetical position — adjacent additions still possible but rare).Everything else (
grammars.ts,types.ts,tree-sitter.tsdispatch,DEFAULT_CONFIG.include) is derived from the registry.What's a
LanguageDef?Each existing language file now exports both its
xxxExtractor(unchanged) AND a newXXX_DEF. New files were added fortsx,jsx,svelte,liquid— bringing all 19 currently-supported languages into the registry uniformly.Refactored consumers
src/extraction/grammars.tsWASM_GRAMMAR_FILES,EXTENSION_MAP,getLanguageDisplayNamemapEXTENSION_MAPis now a Proxy that lazy-builds on first access (avoids TDZ in cyclic load paths).WASM_GRAMMAR_FILESremoved (was internal-only).src/extraction/tree-sitter.tsif (lang === 'svelte') ... if 'liquid' ... if 'pascal' && ...chain inextractFromSourcesrc/types.tsDEFAULT_CONFIGliteral with hardcoded include globs per languagesrc/default-config.ts; types.ts re-exports for backward compat.includearray built lazily fromLanguageDef.includeGlobs.src/extraction/languages/index.tsEXTRACTORS: Partial<Record<Language, LanguageExtractor>>What still requires a one-line edit
The
Languagestring union intypes.tsstill hard-codes the known languages (typescript | javascript | … | unknown). New languages added to the registry work at runtime as strings, but adding the literal here is required IF the resolver wants exhaustive narrowing on the new language —src/resolution/index.tsandsrc/resolution/import-resolver.tshave a fewlanguage === 'X'branches. Most new languages don't need any.Trade-off: strict narrowing preserved for the existing handful of language-specific code paths; everything else is registry-driven.
Tests
380/380 existing tests pass + 16 new structural-invariant tests in
__tests__/language-registry.test.ts(added in the fix commite43a618after the reviewer pass — see below).The 16 invariants assert that every derived consumer (
EXTRACTORS,EXTENSION_MAP,detectLanguage,isLanguageSupported,getSupportedLanguages,getLanguageDisplayName) exactly mirrors the registry contents — drift between the registry and any derived view is impossible. Existingextraction.test.tsandpr19-improvements.test.tscontinue to exercisedetectLanguage,isLanguageSupported,getSupportedLanguages,loadAllGrammars, and the per-language extractors — all green.Impact on open PRs
Every open language PR (#92 HCL, #94 R, #95 SQL, #91 Scala, #66 Vue, #58 ReScript, #57 mql5) currently conflicts with main + with each other on the 6 shared lists. After this lands, they each shrink to:
src/extraction/languages/<name>.tsregistry.ts(the import + the array entry)Languageunion if exhaustive narrowing is needed (rare)That's it. Two language PRs no longer conflict on each other unless their alphabetical positions overlap in registry.ts (and even then, it's a trivial "add both lines" resolution).
Follow-ups (out of scope, tracked in commit message)
registry.ts(fs.readdirSync) — works in builtdist/but vite-node doesn't support extensionlessrequire()of TS source. A small build-time generator could replace the static import list entirely.__tests__/extraction.test.tsinto per-language files — eliminates the test-end-of-file conflict surface that every language PR currently also hits.Test plan
tsc --noEmitcleannpm test— 380/380 passingnpm run build && node -e "require('./dist').getSupportedLanguages()"returns all 19 languagesrequire('./dist/types').DEFAULT_CONFIG.include.length === 29.dfmfiles still route to DfmExtractor viaextensionOverrides🤖 Generated with Claude Code