
docs: cookbook for adding a new language extractor#39

Open: mschreib28 wants to merge 2 commits into main from upstream/docs/adding-a-language


@mschreib28 (Owner)

🔄 Rebased onto refactor colbymchenry#116 (per-language registry). The cookbook now teaches the post-refactor flow: one new file in src/extraction/languages/ instead of the previous 3-file mutation across types.ts / grammars.ts / CLAUDE.md. Sections 1, 2, 4, 6, 7, 8 and the reference table are unchanged.

---

## Summary

Adds docs/ADDING-A-LANGUAGE.md, a cookbook for contributors adding a new language extractor. Closes colbymchenry#55.

The doc was prompted by @cfournel's question on colbymchenry#55 (they published tree-sitter-mql5 and asked how to plug it in), but applies equally to anyone adding HCL/Terraform, R, SQL, dbt, Scala, Vue, or any of the other language requests in the issue tracker. Now there's a single self-serve walkthrough.

## What it covers

1. Sourcing the wasm grammar via three real-world paths: already in tree-sitter-wasms, pre-built in a GitHub release / npm tarball, or built from source via tree-sitter-cli's bundled wasi-sdk (no Docker / local emcc needed).
2. Probing the AST with a 15-line scratch script before writing any extractor code.
3. Registering the language: one new file (src/extraction/languages/<name>.ts exporting a LanguageDef) plus a 2-line registry update; reflects the per-file registry pattern from colbymchenry#116.
4. Type-check before extraction logic, which catches wiring bugs early.
5. Two extractor patterns plugged into the same LanguageDef:
   - LanguageExtractor config (procedural / OO: Python, Ruby, R)
   - Self-contained extractor class (declarative / template / non-OO: HCL, SQL, Liquid)
6. NodeKind / EdgeKind mapping table so contributors don't introduce new kinds unnecessarily (those are cross-cutting).
7. Test patterns with extractFromSource + the end-to-end CLI smoke recipe.
8. PR description checklist including grammar-source provenance, sha256 for vendored wasms, and being honest about constructs the grammar can't parse.

Each section points at concrete extractors in the repo as worked examples: R for the OO path, HCL / SQL / Liquid for the custom-class path, Pascal + DFM for the cross-format case.

## Files changed

| File | Change |
|---|---|
| docs/ADDING-A-LANGUAGE.md | New cookbook (~460 lines) |
| README.md | Pointer to the cookbook from the "Supported Languages" section |
| CLAUDE.md | Pointer from the "Supported Languages" architecture line so the LLM-readable doc stays in sync |

## Test plan

- [x] Reviewed the rendered Markdown locally
- [x] Verified every code snippet against the actual code I just wrote across HCL (colbymchenry#92), R (colbymchenry#94), and SQL (colbymchenry#95); these PRs followed exactly the steps the cookbook lays out
- [x] Verified the tree-sitter-cli build --wasm workflow on the SQL grammar (the wasi-sdk path the doc recommends)

🤖 Generated with Claude Code
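The checklist item about recording a sha256 for vendored wasms could be satisfied with a few lines of Node. The following is a minimal sketch, not the cookbook's own recipe: the helper names `sha256Hex` and `sha256OfFile` are hypothetical, and it assumes a Node runtime that provides the built-in `node:crypto` and `node:fs` modules.

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Hypothetical helper: hex-encoded SHA-256 of raw bytes or a string.
function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Hypothetical helper: digest of a file on disk, such as a vendored grammar.
// The caller supplies the path; nothing here assumes where wasms live in the repo.
function sha256OfFile(path: string): string {
  return sha256Hex(readFileSync(path));
}

// Self-check against the well-known SHA-256 digest of the ASCII string "abc".
const abcDigest = sha256Hex("abc");
console.log(abcDigest);
```

Pasting the value returned by `sha256OfFile` for the vendored file into the PR description lets a reviewer re-hash the same file and compare.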


Copied from colbymchenry/codegraph#97

andreinknv and others added 2 commits April 26, 2026 01:04
Adds docs/ADDING-A-LANGUAGE.md walking through every step a contributor
needs to add a new language extractor:

  1. Source a tree-sitter wasm grammar — covers the three real-world
     paths (already in tree-sitter-wasms, pre-built release artifact,
     build from source via tree-sitter-cli's bundled wasi-sdk).
  2. Probe the AST with a small scratch script before writing code.
  3. Register in src/types.ts + src/extraction/grammars.ts.
  4. Type-check before adding extraction logic.
  5. Pick a pattern: LanguageExtractor config (procedural / OO) or a
     self-contained extractor class (declarative / template / non-OO).
  6. Map onto existing NodeKind / EdgeKind values.
  7. Tests + end-to-end CLI smoke.
  8. PR description checklist.
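Step 2 (probing the AST) might look roughly like the sketch below. It is not the scratch script from the cookbook: `ProbeNode` and `dumpTree` are hypothetical names, the node shape is an assumed minimal subset of what a tree-sitter binding exposes (exact field types vary by binding and version), and a hand-built tree stands in for real parser output so the block runs without a grammar. The node type names are illustrative only.

```typescript
// Hypothetical minimal node shape, assumed for illustration.
interface ProbeNode {
  type: string;
  startPosition: { row: number; column: number };
  namedChildren: ProbeNode[];
}

// Render a node and its named descendants as an indented outline,
// stopping once maxDepth is reached so large files stay readable.
function dumpTree(node: ProbeNode, depth: number = 0, maxDepth: number = 4): string[] {
  const pos = `${node.startPosition.row + 1}:${node.startPosition.column}`;
  const lines: string[] = [`${"  ".repeat(depth)}${node.type} @${pos}`];
  if (depth >= maxDepth) return lines;
  for (const child of node.namedChildren) {
    lines.push(...dumpTree(child, depth + 1, maxDepth));
  }
  return lines;
}

// Hand-built stand-in for a parsed tree.
const sample: ProbeNode = {
  type: "source_file",
  startPosition: { row: 0, column: 0 },
  namedChildren: [
    {
      type: "function_definition",
      startPosition: { row: 0, column: 0 },
      namedChildren: [
        { type: "identifier", startPosition: { row: 0, column: 9 }, namedChildren: [] },
      ],
    },
  ],
};

const outline = dumpTree(sample);
console.log(outline.join("\n"));
```

Reading such an outline for a few representative source files shows which node types carry names and bodies before any extractor code is written.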

Each section points at the existing extractors as worked examples
(R for the OO path, HCL/SQL/Liquid for the custom path, Pascal+DFM
for the cross-format case). README.md and CLAUDE.md gain a one-line
pointer to the cookbook.

Closes colbymchenry#55.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sections 3, 5a, 5b previously taught the monolithic-file pattern that
PR colbymchenry#116 obsoletes. After colbymchenry#116, adding a language is one new file in
src/extraction/languages/ + 2 lines in registry.ts (and an optional
1-line addition to the Language union for TypeScript narrowing).

Updated:
- Section 3: full rewrite. Was 3-file mutation (types.ts, grammars.ts,
  CLAUDE.md). Now: 1 LanguageDef file + registry import + Language
  union entry. Includes a "why per-file" sidebar pointing at the
  cross-PR conflict bottleneck the registry resolves.
- Section 5a: drops the EXTRACTORS-map registration step. The
  extractor is referenced from the LanguageDef directly.
- Section 5b: drops the tree-sitter.ts dispatch wiring. customExtractor
  on the LanguageDef takes the dispatch — no per-language if-branches.

Section 1 (sourcing wasm), Section 2 (probing AST), Sections 6/7/8
(NodeKind mapping, tests, PR checklist), and the existing-extractors
reference table are unchanged — those parts of the workflow didn't
change.
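To illustrate the per-language registry idea described above, here is a minimal self-contained sketch. It is an assumption-laden toy, not the repo's code: the fields on `LanguageDef`, the `ExtractedNode` shape, the `register` and `extract` helpers, and the `.toy` language are all hypothetical. Only the names `LanguageDef` and `customExtractor` come from the commit message, and their real signatures may differ.

```typescript
// Hypothetical shapes: the real LanguageDef may carry different fields.
interface ExtractedNode {
  kind: string; // illustrative value only, not a confirmed NodeKind
  name: string;
}

interface LanguageDef {
  name: string;
  extensions: string[];
  // Optional self-contained extractor; when present it handles the file.
  customExtractor?: (source: string) => ExtractedNode[];
}

// One shared map keyed by extension, so dispatch needs no per-language branches.
const registry = new Map<string, LanguageDef>();

function register(def: LanguageDef): void {
  for (const ext of def.extensions) registry.set(ext, def);
}

// Stand-in for the generic, config-driven extraction path.
function defaultExtract(_source: string): ExtractedNode[] {
  return [];
}

function extract(filePath: string, source: string): ExtractedNode[] {
  const dot = filePath.lastIndexOf(".");
  const def = registry.get(dot >= 0 ? filePath.slice(dot) : "");
  if (!def) return [];
  return (def.customExtractor ?? defaultExtract)(source);
}

// A toy language definition: every line starting with "def " is a function.
const toyLang: LanguageDef = {
  name: "toy",
  extensions: [".toy"],
  customExtractor: (source) =>
    source
      .split("\n")
      .filter((line) => line.startsWith("def "))
      .map((line) => ({ kind: "function", name: line.slice(4).trim() })),
};

register(toyLang);
const result = extract("demo.toy", "def alpha\nx = 1\ndef beta");
console.log(result);
```

In this sketch, adding a language touches only its own definition plus one `register` call, which is the property the commit message attributes to the registry.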

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Linked issue (may close when this pull request merges): Cookbook to add another language support