diff --git a/specs/GH9955/product.md b/specs/GH9955/product.md new file mode 100644 index 000000000..174d73533 --- /dev/null +++ b/specs/GH9955/product.md @@ -0,0 +1,450 @@ +# Product spec: Generic syntax-highlight definition mechanism (GH-9955) + +## Problem + +Adding a new language to Warp's syntax-highlighting today requires +changes in 5+ places, four of them in `crates/languages/src/lib.rs`: + +1. The `SUPPORTED_LANGUAGES: [&str; 32]` array. +2. The `language_by_filename` extension-to-language match (one + match arm). +3. The `to_arborium_name` aliasing match (only if the name differs). +4. The `get_arborium_highlight_query` match (one arm with a hard + reference to `arborium::lang_X::HIGHLIGHTS_QUERY`). +5. A `crates/languages/grammars//` folder with `config.yaml`, + `identifiers.scm`, and `indents.scm`. + +Steps 1–4 require modifying compiled Rust code, which means a new +language requires a Warp release. Step 4 also requires the language +to be supported by the upstream `arborium` tree-sitter aggregator +crate (which is itself an internal dependency of Warp), which means +adding a language Warp does not yet support requires either: +- Waiting for `arborium` to add it, or +- Vendoring a tree-sitter grammar into the Warp source tree. + +The closed registry blocks the most-requested kind of community +contribution: "I use $LANG and would happily contribute the +highlighting definition." Today that contribution requires touching +internal crate dependencies and shipping a release. + +The reporter explicitly cited this as the bottleneck for +distributing syntax-highlight work to individual contributors and +referenced Sublime Text, TextMate, Midnight Commander, and modern +tree-sitter-based editors as prior art for pluggable grammar +mechanisms. + +## Goal + +> **Correction (review #10129):** earlier drafts conflated the two +> paths under one "no compiled Rust changes / no release" goal. 
A +> source-tree bundled grammar still ships in Warp and the tech spec +> requires Cargo / parser-map changes for it. The goal is split below. + +The contributor experience has two distinct paths: + +### G1 — User-local grammars: no Warp release required (V1.5 / V2) + +> **Correction (re-review #10129):** the previous draft promised the +> G1 capability in V1, but G1 depends on `tree_sitter::WasmStore`, +> which requires a tree-sitter version with WASM support. Warp's +> bundled grammars currently come through the internal `arborium` +> crate (version 2, used at `Cargo.toml`) — confirming whether +> arborium re-exports a WASM-capable tree-sitter, or whether +> `crates/languages` would need a parallel direct `tree-sitter` +> dependency, requires maintainer input. G1 is therefore deferred +> behind that resolution. See also the open question at the bottom +> of tech.md. + +The eventual G1 contract: a contributor with admin access to their +own machine adds a new language **without modifying compiled Rust +code and without releasing Warp** by dropping a directory of files +into a user-local config directory +(`$XDG_CONFIG_HOME/warp/grammars//` or +`~/.warp/grammars//`). The grammar loads at next Warp startup, +parsed via WASM for full sandboxing. + +V1 of THIS spec does NOT ship G1. It ships only G2 below — the +bundled-grammar discovery layer. User-local WASM is wired through +the loader as `LoadResult::Failed { reason: +UserLocalWasmNotYetSupported }` so the API shape stabilizes +without enabling the path. Once the WASM-tree-sitter version +question resolves, a follow-up PR flips the gate and the +contributor experience matches G1 above. + +### G2 — Bundled (source-tree) grammars: no hand-written match arms, +### but does ship with Warp + +> **Correction (re-review #10129):** the previous wording said "no +> edits to lib.rs were required," which understated the actual +> Rust/Cargo changes. The honest list is below. 
+ +A contributor sending a PR to Warp adds a new language by: + +1. Creating `crates/languages/grammars//` with + `language.toml`, `highlights.scm`, and the optional + `*.scm` query files. +2. **Adding the parser** in one of two ways: + - **Cargo-dep parser:** add a new line to + `crates/languages/Cargo.toml` (`tree-sitter- = + "..."`), and a single entry mapping `""` → + `tree_sitter_::LANGUAGE` in + `crates/languages/src/bundled_parsers.rs`. This is a Rust + edit, but it is a **mechanical one-line addition in two + places**, not the five-place hand-coded match-statement + spread the issue was asking us to remove. + - **Bundled WASM parser:** drop `grammar.wasm` in the same + directory. No Rust edits at all (once G1 is enabled — until + then, bundled WASM is treated as `Failed`). +3. Sending the PR; the new language ships with the next Warp + release after merge. + +**No edits required** in any case to: +`language_by_filename`, `language_by_name`, +`to_arborium_name`, or `get_arborium_highlight_query`. + +The bundled path still requires `cargo build` and a Warp release; +it does NOT satisfy G1's "no release" property. What it satisfies is +the original issue's "distribute work on syntax-highlight feature +requests to individual contributors" outcome by removing the +five-place hand-edit, the cross-crate `arborium`-upstream gate, and +the implicit "you must understand the lookup match-statements" +learning curve. + +### Substrate + +Both paths preserve Warp's tree-sitter substrate (the right +substrate; not switching to TextMate-style regex grammars), and the +existing 32 first-class languages keep working with no behavior +change. 
+ +## Non-goals (V1) + +- **Switching away from tree-sitter.** Tree-sitter is the right + substrate for accurate parsing. TextMate / Sublime grammars are + regex-based and inferior for accuracy. They are referenced in the + issue as community-distribution exemplars, not as recommended + technology. +- **Runtime-compiled tree-sitter grammars (loadable .so/.dylib).** + Loading native `.so` files is a security and portability hazard. + V1 supports only **WASM-compiled** tree-sitter grammars and + **vendored Rust grammars**; native dynamic loading is explicitly + rejected. +- **Per-user theme / color-scheme definition mechanism.** This spec + is about adding new languages to highlighting, not about styling + the captures. +- **LSP integration mechanism.** The `Language` struct comment hints + at LSP being the next addition; that is a separate spec. +- **Sub-language injection** (e.g. SQL inside a Python string, + CSS inside a Vue template). Currently handled via the existing + `vue` / `tsx` special cases. Out of V1. +- **Hot-reload of grammars.** Grammars load once at startup; user + edits require a Warp restart. Hot-reload is a follow-up. +- **First-class user-contributed grammars in the cloud / via a + package manager.** V1 is local files only. A package-manager + layer can be built on top later. + +## Behavior contract (V1) + +### B1 — Drop-in directory definition + +A new language is defined by a directory containing: +- `language.toml` — display name, file extensions, filename matches, + alias names, comment prefix, brackets, indent unit. +- `highlights.scm` — tree-sitter highlight query. +- `indents.scm` — (optional) tree-sitter indent query. +- `identifiers.scm` — (optional) tree-sitter symbol query. +- `grammar.wasm` OR a `cargo` reference to a vendored Rust + grammar — the parser itself. + +The `language.toml` schema is the single contract a contributor +must learn. All other files are tree-sitter standard files. 
+ +### B2 — Two load paths: bundled and user-local + +**Bundled:** A `crates/languages/grammars//` directory is +discovered at compile time via the existing `RustEmbed` mechanism. +Adding a language means adding a directory plus, for a Cargo-dep +parser, the two mechanical entries listed in G2; no hand-written +match arms are needed. + +**User-local:** A `~/.warp/grammars//` (or +`$XDG_CONFIG_HOME/warp/grammars//`) directory is discovered at +startup. User-local grammars are loaded after bundled grammars and +do not override them by default (preventing a malicious +user-grammar from masquerading as Rust). A user-local grammar +whose `language.toml` declares a name that collides with a bundled +language is logged at `warn` level and ignored. + +### B3 — Schema-driven file association + +`language.toml` declares its own filename / extension / shebang +patterns: + +```toml +[language] +display_name = "Nim" +internal_name = "nim" + +[file_associations] +extensions = ["nim", "nims"] +filenames = ["nim.cfg"] +shebangs = ["nim"] # for `#!/usr/bin/env nim` scripts +aliases = ["nim-lang"] # markdown ```nim-lang code blocks +``` + +The hand-coded `language_by_filename` and `normalize_language_name` +match statements are replaced with a registry-driven lookup. + +> **Correction (review #10129):** earlier drafts described the +> 32-language migration as "a single mechanical PR" while B4 and +> tech.md require independently revertable per-language migrations. +> The single canonical strategy is below. + +The existing 32 languages migrate **one language per PR**, each +independently revertable, with the hardcoded path remaining as a +fallthrough for unmigrated languages. The migration template is in +tech.md §"Migration strategy for the 32 existing languages." V1 +of the discovery PR migrates **zero** languages — it only adds the +discovery layer beside the hardcoded match statements. There is no +"single mechanical PR" follow-up; each language's migration is its +own PR. 
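The hardcoded-first, registry-fallback lookup that B3 describes could look roughly like the sketch below. It is std-only and illustrative: `hardcoded_by_extension`, `Registry`, and `language_for` are hypothetical names, the two hardcoded arms are placeholders for the real match, and the exact-filename-before-extension ordering is an assumption rather than something the spec fixes.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical stand-in for the existing hardcoded match; only two arms
/// are shown here for illustration.
fn hardcoded_by_extension(ext: &str) -> Option<&'static str> {
    match ext {
        "rs" => Some("rust"),
        "py" => Some("python"),
        _ => None,
    }
}

/// Hypothetical registry populated from each grammar's `[file_associations]`.
#[derive(Default)]
struct Registry {
    by_extension: HashMap<String, String>,
    by_filename: HashMap<String, String>,
}

impl Registry {
    /// Record one grammar's associations. The first grammar to claim an
    /// extension or filename keeps it (assumed tie-break rule).
    fn register(&mut self, internal_name: &str, extensions: &[&str], filenames: &[&str]) {
        for ext in extensions {
            self.by_extension
                .entry(ext.to_string())
                .or_insert_with(|| internal_name.to_string());
        }
        for name in filenames {
            self.by_filename
                .entry(name.to_string())
                .or_insert_with(|| internal_name.to_string());
        }
    }

    /// Resolve a path to an internal language name.
    fn language_for(&self, path: &Path) -> Option<String> {
        let ext = path.extension().and_then(|e| e.to_str());
        // 1. The hardcoded path always wins, so unmigrated languages
        //    behave exactly as they do today.
        if let Some(name) = ext.and_then(hardcoded_by_extension) {
            return Some(name.to_string());
        }
        // 2. Exact filename matches such as `nim.cfg`.
        let file_name = path.file_name().and_then(|f| f.to_str());
        if let Some(name) = file_name.and_then(|f| self.by_filename.get(f)) {
            return Some(name.clone());
        }
        // 3. Extension matches from discovered grammars.
        ext.and_then(|e| self.by_extension.get(e)).cloned()
    }
}
```

Because step 1 runs before any registry lookup, a discovered grammar that claims `rs` cannot displace the hardcoded Rust entry, which is the fallthrough property the per-language migration relies on.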
+ +### B4 — Backwards compatibility for the existing 32 languages + +The 32 existing languages continue to work bit-for-bit identically. +Their grammars stay in `arborium` (V1 does not vendor or rewrite +them). The discovery mechanism is added beside the existing +hardcoded match statements, not in place of them. A bundled +language defined via the new mechanism takes precedence over a +hardcoded one only after manual migration of that language +(staged migration; not a flag day). + +### B5 — Security: WASM only for runtime-loaded grammars + +User-local grammars must ship as `grammar.wasm`. Native dynamic +libraries (`.so`, `.dylib`, `.dll`) are explicitly rejected and +never loaded. The WASM is loaded via tree-sitter's existing WASM +runtime. + +Bundled grammars (the Warp source tree) can be either WASM or a +Rust crate reference. The Rust crate reference is for the existing +`arborium` languages and for any future first-class language that +warrants a Cargo dependency. + +> **Correction (review #10129, security):** earlier drafts treated +> "WASM" as a sufficient sandboxing claim. WASM by itself does not +> bound CPU, memory, or input size. The contract is below. + +**WASM safety contract for user-local grammars:** + +- **No host capabilities.** The tree-sitter WASM runtime exposes no + filesystem, network, or process capabilities to grammar code by + design. The loader rejects any WASM module that attempts to import + symbols outside tree-sitter's required exports. +- **CPU bound — parse timeout.** Each parse invocation is gated by + `parser.set_timeout_micros(WARP_GRAMMAR_PARSE_TIMEOUT_US)` + (default 100ms). Grammars whose parse exceeds the timeout return + partial results; the editor falls back to no-syntax-tree rendering + for that buffer until the next edit. +- **CPU bound — query execution timeout.** The parse timeout above + bounds *parsing*, not query matching against the produced tree. 
+ `Query::matches` / `Query::captures` runs in a separate code path + with its own potential pathologies (regex predicates, deeply + nested captures). User-supplied `.scm` queries get a wall-clock + budget of `WARP_GRAMMAR_QUERY_TIMEOUT_MS` (default 50ms per + buffer per query type) enforced via a `tree_sitter::QueryCursor` + wrapper that polls an `AtomicBool` from a watchdog thread; on + timeout, the cursor is cancelled, partial results are discarded, + and the buffer falls through to plain rendering with a one-time + warn log per (grammar, query-kind) pair. The same bound applies + to indent and identifiers queries. +- **Memory bound — query output size.** Per-buffer query results + are capped at `WARP_GRAMMAR_MAX_QUERY_CAPTURES` (default 100k + captures). A query that emits more captures is truncated and + emits a one-time warn log. This bounds memory for pathological + highlight queries that match every token. +- **Memory bound — input-size cap.** Grammars are not invoked on + buffers larger than `WARP_GRAMMAR_MAX_INPUT_BYTES` (default 8MiB, + matching the existing editor large-file threshold). Larger + buffers fall through to plain rendering. +- **Memory bound — runtime cap.** The WASM runtime is configured + with a hard memory cap of `WARP_GRAMMAR_MAX_RUNTIME_BYTES` (default + 64MiB) per parser instance. Exceeding it triggers a parser reset + and a one-time warn log per grammar. +- **Startup-load timeout.** WASM module instantiation is wrapped in + a 5-second hard timeout. A grammar that fails to instantiate in + time is treated as a load failure (B6). +- **No worker isolation in V1.** All parsers share the editor + thread. A grammar that hangs (despite the timeout) can starve the + syntax-tree refresh on other buffers; this is documented as a known + limitation. Worker-thread isolation is a follow-up. 
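The defaults and env-var overrides in the list above can be sketched as a small configuration type. This is a std-only illustration, not the loader's actual API: `GrammarLimits`, `override_or`, `from_lookup`, `should_parse`, and `cap_captures` are hypothetical names, and falling back to the default on an unparsable override is an assumption. Only the variable names and default values come from the spec.

```rust
use std::time::Duration;

/// Resource limits from B5; defaults mirror the values stated in the spec.
#[derive(Debug, Clone, PartialEq)]
struct GrammarLimits {
    parse_timeout: Duration,   // WARP_GRAMMAR_PARSE_TIMEOUT_US
    query_timeout: Duration,   // WARP_GRAMMAR_QUERY_TIMEOUT_MS
    max_query_captures: usize, // WARP_GRAMMAR_MAX_QUERY_CAPTURES
    max_input_bytes: usize,    // WARP_GRAMMAR_MAX_INPUT_BYTES
    max_runtime_bytes: usize,  // WARP_GRAMMAR_MAX_RUNTIME_BYTES
}

impl Default for GrammarLimits {
    fn default() -> Self {
        Self {
            parse_timeout: Duration::from_millis(100),
            query_timeout: Duration::from_millis(50),
            max_query_captures: 100_000,
            max_input_bytes: 8 * 1024 * 1024,
            max_runtime_bytes: 64 * 1024 * 1024,
        }
    }
}

/// Parse an override; absent or unparsable values keep the default
/// (assumed behavior, the spec does not say what happens on bad input).
fn override_or(raw: Option<&str>, default: u64) -> u64 {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}

impl GrammarLimits {
    /// Build limits from a lookup function. In production this would be
    /// something like `|k| std::env::var(k).ok()`; taking a closure keeps
    /// the sketch testable without touching the process environment.
    fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        let d = Self::default();
        let get = |key: &str, default: u64| override_or(lookup(key).as_deref(), default);
        Self {
            parse_timeout: Duration::from_micros(get(
                "WARP_GRAMMAR_PARSE_TIMEOUT_US",
                d.parse_timeout.as_micros() as u64,
            )),
            query_timeout: Duration::from_millis(get(
                "WARP_GRAMMAR_QUERY_TIMEOUT_MS",
                d.query_timeout.as_millis() as u64,
            )),
            max_query_captures: get(
                "WARP_GRAMMAR_MAX_QUERY_CAPTURES",
                d.max_query_captures as u64,
            ) as usize,
            max_input_bytes: get("WARP_GRAMMAR_MAX_INPUT_BYTES", d.max_input_bytes as u64)
                as usize,
            max_runtime_bytes: get("WARP_GRAMMAR_MAX_RUNTIME_BYTES", d.max_runtime_bytes as u64)
                as usize,
        }
    }

    /// Input-size cap: larger buffers fall through to plain rendering.
    fn should_parse(&self, input_len: usize) -> bool {
        input_len <= self.max_input_bytes
    }

    /// Capture cap: truncates in place and reports whether truncation
    /// happened so the caller can emit its one-time warn log.
    fn cap_captures<T>(&self, captures: &mut Vec<T>) -> bool {
        if captures.len() > self.max_query_captures {
            captures.truncate(self.max_query_captures);
            true
        } else {
            false
        }
    }
}
```

The timeouts and the runtime memory cap are only carried as values here; enforcing them depends on the tree-sitter and WASM runtime APIs, which this sketch deliberately does not model.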
+ +The above limits are tunable per-platform via env vars in case +specific Linux/Windows configurations need different defaults; the +defaults are conservative. + +### B6 — Validation and clear failure modes + +A grammar directory that fails to load (malformed `language.toml`, +WASM that fails to instantiate, `highlights.scm` that fails to +parse against the grammar) does NOT break Warp startup. Instead: +- A `log::error!` fires with the **basename of the directory** and + the failure reason. The full directory path is NOT logged (see + privacy note below). +- A persistent in-app notification surfaces the failure (one + notification per failed grammar, dismissible). The in-app UI + shows the full path because that's local to the user's machine + and useful for debugging. +- The language is omitted from the registry but other languages + load normally. + +> **Correction (review #10129, security):** earlier drafts logged +> the full grammar directory path, which can leak usernames or +> private project paths in shared logs. tech.md's telemetry section +> separately identified paths as PII. The two were inconsistent. +> Resolved: logs use basenames only; full paths appear only in the +> local Settings UI. + +A grammar with valid `language.toml` and parser but a **missing** +`highlights.scm` loads as a syntax-tree-aware language with no +coloring (you still get bracket pairing, indent, etc.). This makes +a "minimum viable" grammar contribution low-effort. + +> **Correction (review #10129):** earlier drafts said "missing +> highlights.scm loads without coloring" while the tech loader +> rejected highlight-query failures as `LoadFailure`. tech.md is +> updated to distinguish missing-file (load without coloring, +> emit info-level log) from invalid-query (load with no language +> at all, emit error-level + notification). 
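The missing-versus-invalid distinction and the basename-only logging rule in B6 can be illustrated with a short std-only sketch. `HighlightOutcome`, `classify_highlights`, and `log_safe_name` are hypothetical names; the `compile` closure stands in for whatever query compilation the loader performs, and the `<unknown>` fallback is an assumption.

```rust
use std::path::Path;

/// What happens to a grammar depending on the state of `highlights.scm`.
#[derive(Debug, PartialEq)]
enum HighlightOutcome {
    /// File absent: the language loads without coloring (info-level log).
    LoadedWithoutColoring,
    /// File present and the query compiled.
    Loaded,
    /// File present but the query was rejected: the grammar is omitted
    /// from the registry (error-level log plus in-app notification).
    Failed(String),
}

/// `source` is the file contents when the file exists, `None` otherwise.
/// `compile` is a stand-in for the real query compilation step.
fn classify_highlights(
    source: Option<&str>,
    compile: impl Fn(&str) -> Result<(), String>,
) -> HighlightOutcome {
    match source {
        None => HighlightOutcome::LoadedWithoutColoring,
        Some(text) => match compile(text) {
            Ok(()) => HighlightOutcome::Loaded,
            Err(reason) => HighlightOutcome::Failed(reason),
        },
    }
}

/// Name that is safe to write to logs: the directory basename only,
/// never the full path (which may contain a username or project path).
fn log_safe_name(dir: &Path) -> String {
    dir.file_name()
        .map(|name| name.to_string_lossy().into_owned())
        .unwrap_or_else(|| "<unknown>".to_string())
}
```

A log call would then format `log_safe_name(dir)` into the message, while the Settings UI keeps using the full `dir` because it never leaves the user's machine.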
+ +### B7 — Discoverability of installed grammars + +> **Correction (re-review #10129):** the previous draft offered +> "CLI command OR settings page" as alternatives, but A7 +> requires the settings page. Resolved: the **Settings → Editor +> → Languages page is the required deliverable.** The CLI +> command is a non-V1 follow-up. + +`Settings → Editor → Languages` shows: +- Each loaded language, its source (bundled / user-local), its + parser revision, and the file extensions it claims. +- Each failed-to-load grammar with its failure reason. + +This is the diagnostic surface a contributor uses to confirm their +new grammar loaded. + +### B8 — Existing settings keys preserve forward compatibility + +The existing `editor.indent_unit` per-language settings, the +`renderer.theme` highlight color mappings, and any other downstream +consumer of `Language` continues to work. The `Language` struct +stays the same shape; only its construction path changes. + +## Acceptance criteria + +A1. A contributor adds `crates/languages/grammars/nim/` containing + `language.toml`, `highlights.scm`, `indents.scm`, and a Cargo + dependency on a Nim tree-sitter grammar. After `cargo build`, + `.nim` files render with syntax highlighting in Warp. + No edits to `lib.rs` were required. + +> **Correction (re-review #10129):** the previous A2/A4/A5/A6 +> required user-local WASM grammars to load and render in V1, but +> the goal section explicitly defers G1 (user-local) until the +> tree-sitter version question resolves. The criteria below are +> rewritten so V1 ships only G2 (bundled). The user-local +> acceptance criteria are kept as **A2.future / A4.future / etc.** +> for the follow-up PR that flips the gate. + +A2. 
A user-local grammar directory at + `~/.warp/grammars/zig/` is **detected** by `discover_grammars()` + and surfaces in `Settings → Editor → Languages` as + `LoadResult::Failed { reason: UserLocalWasmNotYetSupported }` + with a friendly message ("User-local grammars are coming + soon"). The directory is NOT loaded as a parser in V1. + +A3. The 32 existing languages render bit-for-bit identically to + today. The existing test suite (`crates/languages/src/lib_tests.rs`, + `crates/syntax_tree/src/queries/indent_query_tests.rs`) passes + unchanged. + +A4. (V1 — bundled-only path) A second bundled grammar with the + same `internal_name` as a hardcoded language is dropped from + the merged list and a basename-only `log::warn!` fires. + Hardcoded > Bundled precedence is preserved. + +A5. A bundled grammar with malformed `language.toml` does not + break Warp startup; the failure surfaces via the in-app + notification and `Settings → Editor → Languages` view. + +A6. An attempt to declare `parser.native_lib = "grammar.so"` (in + bundled OR user-local) is rejected at schema-validate time + with the B5 error message. No `dlopen` is attempted. + +A7. The `Settings → Editor → Languages` page lists all loaded + grammars and any failures (including + `UserLocalWasmNotYetSupported` entries). + +**Future acceptance criteria (deferred to G1 follow-up PR):** + +- A2.future. A user drops `~/.warp/grammars/zig/` containing + `language.toml`, `highlights.scm`, and `grammar.wasm`. After + restarting Warp, `.zig` files render with syntax highlighting. +- A4.future. A user-local grammar that collides with a hardcoded + language is dropped; hardcoded wins. +- A5.future. A user-local grammar with malformed `language.toml` + surfaces the failure but doesn't break startup. + +## Risks and decisions for tech.md + +1. **WASM tree-sitter runtime cost.** WASM grammars are slower than + native (compiled Rust) grammars. 
The TECH spec must define: + - The benchmark we run before / after to establish the + regression budget. + - Whether bundled grammars stay native by default and only + user-local grammars use WASM (recommended). + +2. **`language.toml` schema versioning.** Future Warp releases will + want to add fields (e.g. LSP server binary path). The schema + needs a `schema_version` field at the root and a migration story + for older grammars. + +3. **The migration of the 32 existing languages.** This spec + explicitly does NOT migrate them in V1 (B4). The TECH spec + should sketch the per-language migration PR template so a + follow-up can be done incrementally without coordinated + flag-day risk. + +4. **Sub-language injection** (Vue, TSX, Markdown code blocks). + The current Vue/TSX special casing is hand-written. New + user-local grammars cannot define injections in V1. This is + acknowledged in non-goals. + +5. **Theme integration.** Highlight queries reference capture names + (`@keyword`, `@string`, `@function.method`, etc.) that the + theme then colors. A user-local grammar that uses a non-standard + capture name gets no color. The TECH spec must define: + - The list of capture names the theme guarantees support for + (the "standard set"), AND + - The fallback color for unknown capture names (recommended: + foreground, no styling). + +6. **Per-user grammar cache and parser-revision pinning.** Tree- + sitter ABI changes have caused breakage in other editors when + user-local grammars are compiled against a different ABI than + the host editor uses. The TECH spec must define how the loader + detects ABI mismatch and reports it. + +## Reporter-supplied context (preserved) + +The reporter explicitly cited Midnight Commander's syntax definition +folder as inspiration, and modern reference points: Sublime Text, +TextMate, and the Rust syntax-highlighting library ecosystem. 
+The reporter's stated motivation is to "distribute work on all the +syntax highlight feature requests to individual contributors" — i.e., +the unblocking outcome is contributor velocity, not parser +expressiveness. diff --git a/specs/GH9955/tech.md b/specs/GH9955/tech.md new file mode 100644 index 000000000..9e348163d --- /dev/null +++ b/specs/GH9955/tech.md @@ -0,0 +1,489 @@ +# Technical spec: Generic syntax-highlight definition mechanism (GH-9955) + +This spec is the implementation companion to `product.md`. It picks +the discovery mechanism, the schema, the loader architecture, and +the migration strategy for the 32 existing languages. + +## Current state (recap from product.md investigation) + +- `crates/languages/src/lib.rs` defines `SUPPORTED_LANGUAGES: [&str; + 32]` (line 23), `language_by_name`, `language_by_filename`, + `to_arborium_name`, `get_arborium_highlight_query`, `load_language`. +- `crates/languages/grammars//` provides per-language + `config.yaml`, `identifiers.scm`, `indents.scm`. Embedded via + `RustEmbed`. +- `arborium` (internal crate) provides parsers and bundled + highlight queries via `arborium::lang_::HIGHLIGHTS_QUERY` + consts. +- Consumers of `Language`: `crates/syntax_tree/src/queries/` + (highlight query, indent query), `app/src/code/editor`, the AI + context indexers, the workflow view's `syntax_highlightable.rs`. + +## Architecture overview + +Add a `LanguageRegistry` discovery layer that loads from THREE +sources, in priority order: + +1. **Compile-time hardcoded** (existing path) — the current + `to_arborium_name` / `get_arborium_highlight_query` matches. + This is the "first-class" path; it stays for the existing 32 + languages until per-language migration PRs convert them. +2. **Bundled directory** — `crates/languages/grammars//` with + a `language.toml` driving discovery. New languages can be added + here without touching `lib.rs`. +3. **User-local directory** — `~/.warp/grammars//` (or + `$XDG_CONFIG_HOME/warp/grammars//`). 
WASM-only grammars, + loaded at startup. + +A bundled grammar takes precedence over a user-local one with the +same `internal_name` (B2 invariant). A compile-time hardcoded +language takes precedence over a bundled grammar with the same +name; this gives us the staged migration path of B4. + +## `language.toml` schema + +```toml +schema_version = 1 + +[language] +display_name = "Nim" +internal_name = "nim" +comment_prefix = "#" +indent_unit = { spaces = 2 } # or { tabs = 1 } + +[file_associations] +extensions = ["nim", "nims"] +filenames = ["nim.cfg"] +shebangs = ["nim"] +aliases = ["nim-lang"] + +[brackets] +pairs = [ + { start = "(", end = ")" }, + { start = "{", end = "}" }, + { start = "[", end = "]" }, +] + +[parser] +# Exactly one of `rust_crate` (bundled only) or `wasm` (bundled or +# user-local) must be set. The schema validate() rejects setting +# both or neither. The two examples below show the canonical +# bundled and user-local shapes. + +# Optional: pin the tree-sitter ABI version this grammar was +# compiled against; loader rejects mismatches with a clear error. +ts_abi = 14 +``` + +> **Correction (review #10129):** earlier drafts showed both +> `rust_crate` and `wasm` set in the same example block while the +> comments said they were mutually exclusive. The two canonical +> shapes are split below. + +**Bundled-grammar shape (Rust crate parser):** +```toml +[parser] +rust_crate = "tree-sitter-nim" +ts_abi = 14 +``` + +**Bundled or user-local shape (WASM parser):** +```toml +[parser] +wasm = "grammar.wasm" # path relative to the grammar dir +ts_abi = 14 +``` + +The schema lives in a new module `crates/languages/src/schema.rs` +with `serde::Deserialize` derives and a `validate()` method that: +- Rejects setting both `rust_crate` and `wasm`. +- Rejects setting neither. +- Rejects `rust_crate` in user-local grammars (only WASM is allowed + there per B5). +- Rejects unknown bracket characters. 
+- Rejects unknown top-level keys to surface typos to contributors. + +## Loader architecture + +### `crates/languages/src/loader.rs` (new) + +> **Correction (review #10129):** earlier drafts had `LoadedLanguage` +> with a mandatory `language: Arc<Language>` plus an optional +> `failure`. That can't represent a grammar that fails before a +> `Language` is constructed (e.g. malformed `language.toml`). The +> shape below is a tagged sum so failed grammars are first-class. + +```rust +pub enum LanguageSource { + Hardcoded, + Bundled { dir: PathBuf }, + UserLocal { dir: PathBuf }, +} + +/// One result of attempting to load a grammar from a directory or +/// from the hardcoded path. Either we got a `Language`, or we got +/// a `FailedGrammar` describing what went wrong. +pub enum LoadResult { + Loaded(LoadedLanguage), + Failed(FailedGrammar), +} + +pub struct LoadedLanguage { + pub language: Arc<Language>, + pub source: LanguageSource, + /// Non-fatal warnings (e.g. missing optional `highlights.scm`). + /// The grammar is in the registry; these are surfaced in the + /// Settings UI but do not prevent the language from loading. + pub warnings: Vec<LoadWarning>, +} + +pub struct FailedGrammar { + pub source: LanguageSource, + /// Best-effort name extracted from `language.toml` if it parsed + /// far enough; `None` if even the TOML parse failed. + pub internal_name: Option<String>, + pub reason: LoadFailureReason, + pub schema_version: Option<u32>, +} + +pub enum LoadFailureReason { + SchemaParse(String), + SchemaVersionMismatch { found: u32 }, + NativeLibAttempted, + ParserCrateNotFound { crate_name: String }, + UserLocalWasmNotYetSupported, // V1 gate required by product.md G1 / A2 + WasmInstantiate(String), + WasmAbiMismatch { host: u32, grammar: u32 }, + HighlightQueryInvalid(String), // syntactically wrong vs grammar + IndentQueryInvalid(String), + SymbolsQueryInvalid(String), +} + +pub enum LoadWarning { + HighlightsScmMissing, // optional file absent — no coloring, + // grammar still loads + IndentsScmMissing, + IdentifiersScmMissing, +} + +pub fn discover_grammars() -> Vec<LoadResult> { ... 
} +``` + +`discover_grammars()` is called once at startup. It walks the three +sources in priority order, deduplicates by `internal_name` across +loaded results (failed grammars are kept regardless of dedup so +their failure surfaces in Settings), and returns one `LoadResult` +per attempted directory. + +### Loading a single grammar + +> **Correction (review #10129):** earlier drafts treated highlight- +> query load failures as `LoadFailure` and returned, contradicting +> product B6 which allowed missing `highlights.scm` to load without +> coloring. The two cases are now distinct. + +For each grammar directory: +1. Parse `language.toml` (`schema.rs::parse`). On failure: return + `LoadResult::Failed { reason: SchemaParse }`. +2. Validate schema constraints (`schema.rs::validate`). On failure: + return `LoadResult::Failed`. +3. Resolve the parser: + - `rust_crate`: look up via the compile-time `bundled_parsers.rs` + map. On miss: return `LoadResult::Failed { reason: + ParserCrateNotFound }`. + - `wasm`: `tree_sitter::WasmStore::load_language(&wasm_bytes)`. + On instantiate failure: return `LoadResult::Failed { reason: + WasmInstantiate }`. On ABI mismatch with the host's + `tree_sitter::TREE_SITTER_LANGUAGE_VERSION`: return + `LoadResult::Failed { reason: WasmAbiMismatch }`. +4. **`highlights.scm` (optional file):** + + > **Correction (re-review #10129):** the previous draft said + > missing-`highlights.scm` "loads without coloring," but + > [`crates/languages/src/lib.rs`](crates/languages/src/lib.rs) + > defines `pub struct Language { ..., pub highlight_query: Query, + > ... }` — `highlight_query` is **not** `Option`, so a + > `Language` cannot be constructed without one. The corrected + > design uses an empty query as the missing-file substitute + > rather than changing the `Language` API. + + - **File missing:** synthesize an empty highlight query via + `Query::new(grammar, "")`. Tree-sitter accepts empty source + (zero patterns). 
The language loads with the same `Language` + struct shape; matches at runtime return zero captures so no + coloring is applied. Record `LoadWarning::HighlightsScmMissing` + so the diagnostic surface in Settings still flags the missing + file. **The `Language` API stays unchanged**, preserving B8. + - **File present but `Query::new` fails:** the contributor + intended to provide a query and got it wrong; return + `LoadResult::Failed { reason: HighlightQueryInvalid }`. This + is treated as a hard failure because shipping a grammar with + a broken query is worse than no query at all. + + The empty-query synthesis is not needed for the optional indent + and identifiers queries: their `Language` fields ARE + `Option` already, so missing-file = `None`, and + invalid-file = `LoadResult::Failed` per (5) below. +5. **`indents.scm` and `identifiers.scm` (optional files):** same + missing-vs-invalid split. Missing → `LoadWarning`. Invalid → + `LoadResult::Failed`. +6. Construct the `Language` struct with all available fields. +7. Return `LoadResult::Loaded(LoadedLanguage { ..., warnings })`. + +### Native dynamic-library rejection + +The loader explicitly checks the `parser` table for any field other +than `rust_crate` or `wasm`. If a `native_lib = "grammar.so"` field +is present (or any unknown field starting with `dl` / `native` / +`so` / `dylib` / `dll`), the loader rejects the grammar with the +B5 error message and never attempts to open the file. No +`libloading::Library::new` call exists anywhere on the loader path. + +### File-association registration + +After all grammars load, the loader populates five lookup maps: + +```rust +struct AssociationIndex { + by_extension: HashMap<String, Arc<Language>>, + by_filename: HashMap<String, Arc<Language>>, + by_shebang: HashMap<String, Arc<Language>>, + by_alias: HashMap<String, Arc<Language>>, + by_internal_name: HashMap<String, Arc<Language>>, +} +``` + +The hardcoded path (the current `language_by_filename` match) is +queried first; if it returns `None`, fall through to the +`AssociationIndex`. 
This keeps existing behavior identical for the +32 languages until they migrate. + +## Public API changes + +`crates/languages/src/lib.rs`: + +- `language_by_name(name: &str)` — unchanged signature; internally + consults hardcoded match first, then `AssociationIndex.by_internal_name` + / `by_alias`. +- `language_by_filename(path: &Path)` — unchanged signature; + consults hardcoded path first, then `AssociationIndex` extension + / filename / shebang lookups. +- New: `loaded_languages() -> &[LoadResult]` — for the new + Settings → Editor → Languages page. Returns one entry per + attempted directory (loaded, with-warnings, or failed). + +No change for any current consumer; they continue to call the same +two functions. + +## Settings page integration + +`Settings → Editor → Languages` (new sub-page): + +- Lists each loaded language: display name, internal name, source + (Hardcoded / Bundled / User-local), file extensions claimed, + parser revision (from `language.toml` `ts_abi`). +- Lists each failed grammar: directory, reason. +- A "Reveal in Finder/Files" button next to user-local grammars. +- A "Refresh after restart" pill at the top reminding users that + changes require restart (V1 has no hot-reload). + +## Migration strategy for the 32 existing languages + +> **Correction (review #10129):** earlier drafts called this "a +> single mechanical PR." It is actually **one PR per language**, +> each independently revertable. The product spec is now consistent +> with this. Each PR follows this template: + +1. Create `crates/languages/grammars//language.toml` with the + file associations and parser reference matching the current + hardcoded behavior. +2. Move the `arborium::lang_X::HIGHLIGHTS_QUERY` const into a + `highlights.scm` file in the same directory. +3. Remove the language's arms from `to_arborium_name`, + `get_arborium_highlight_query`, `language_by_filename`, and the + `SUPPORTED_LANGUAGES` array. +4. Verify `crates/languages/src/lib_tests.rs` still passes. 
+5. Verify any language-specific indent / highlight tests in + `crates/syntax_tree/` still pass. + +The `bundled_parsers.rs` map gets one new entry per migration. The +priority-order rule (hardcoded > bundled) means a partial migration +is safe: an unmigrated language uses the hardcoded path; a migrated +one uses the bundled path. There is no flag day. + +V1 of THIS PR migrates **zero** existing languages — only adds the +discovery mechanism beside them. Each follow-up PR migrates one +language and is independently revertable. + +## Telemetry and logging privacy + +> **Correction (review #10129):** product B6 said logs include the +> grammar directory path; this section said paths are PII for +> telemetry. The two were inconsistent. Resolved: paths are PII +> across both surfaces. + +> **Correction (re-review #10129, security):** the previous draft +> sent `internal_name` for all grammar sources. Hardcoded and +> bundled names are well-known (they ship in the Warp binary), but +> user-local `internal_name` is **user-controlled** — a customer's +> private project might define a grammar named `acme-internal-dsl` +> and disclose that name to analytics on every startup. Resolved +> below by stripping `internal_name` for `UserLocal`-sourced +> events. + +**Telemetry events** (sent to Warp's analytics): +- `grammar_loaded` (one-time at startup): + - For **Hardcoded** and **Bundled** sources: payload includes + `internal_name`, `source` tag, `parser_kind` (rust_crate / + wasm), `ts_abi`. Names are public (they ship in Warp). + - For **UserLocal** sources: payload is `{ source: + "user_local", parser_kind, ts_abi }` — **no `internal_name`, + no path, no `name_hash`.** The team gets aggregate counts of + user-local-grammar adoption without learning *which* grammars + individual users installed. 
+- `grammar_load_failed` (one-time): + - For **Hardcoded** and **Bundled**: `internal_name` (if the + TOML parsed far enough to extract it), `reason_kind` (one of + `schema_parse`, `schema_version`, `native_lib`, + `parser_crate_not_found`, `wasm_instantiate`, `wasm_abi`, + `highlight_query`, `indent_query`, `symbols_query`), + `source_kind`. No paths. + - For **UserLocal**: `{ source_kind: "user_local", + reason_kind }` — no `internal_name`, no path. The + reason_kind alone is enough to identify systemic user-local + failure modes (e.g. `wasm_abi` mismatches after a Warp + upgrade) without disclosing user-controlled strings. +- Both events respect Warp's existing global telemetry opt-out. + +**Local logs** (`log::error!`, `log::warn!`, `log::info!`): +- Use the **basename** of the grammar directory only (e.g., + `nim`, `zig`). The full path is never logged. +- The exception is the in-app Settings UI, which DOES show the + full path because it is local to the user and useful for + debugging. The Settings UI is not log output. + +**No payload contains:** raw `language.toml` contents, the +contents of `.scm` files, the WASM binary, or absolute paths. + +## Test plan + +### Unit tests (`crates/languages/src/schema_test.rs` — new) + +- T1: Parse a minimal valid `language.toml` (display_name + + internal_name + extensions + parser). +- T2: Parse a fully-populated `language.toml` and verify all fields + round-trip. +- T3: Parser table with both `rust_crate` and `wasm` fields fails + validation. +- T4: Parser table with `native_lib = "..."` is rejected with the + B5 error message. +- T5: `schema_version` mismatch (e.g., 999) is rejected with a + clear "unsupported schema version" error. + +### Unit tests (`crates/languages/src/loader_test.rs` — new) + +- T6: A bundled grammar with a stub WASM that fails to instantiate + surfaces as `LoadResult::Failed { reason: WasmInstantiate, .. }` + and does NOT panic. 
+- T7: A user-local grammar whose `internal_name` collides with a + hardcoded language is dropped from the merged list and a warn + fires (basename only, no full path in the log message). +- T8: A user-local grammar whose `internal_name` collides with a + bundled grammar (after a hypothetical migration) is dropped; + bundled wins. +- T9: ABI mismatch (host ABI 14, grammar declares ABI 13) surfaces + as `LoadResult::Failed { reason: WasmAbiMismatch { host: 14, + grammar: 13 } }`. +- T10 (new): Missing `highlights.scm` produces + `LoadResult::Loaded { warnings: [HighlightsScmMissing], .. }`. + The grammar IS in the registry; coloring is absent. +- T11 (new): Present-but-invalid `highlights.scm` (parses against + a different grammar) produces `LoadResult::Failed { reason: + HighlightQueryInvalid }`. The grammar is NOT in the registry. +- T12 (new): Same missing-vs-invalid distinction for `indents.scm` + and `identifiers.scm`. +- T13 (new): A WASM grammar whose import list includes a non- + tree-sitter symbol (e.g. fs read) is rejected at load. +- T14 (new): A grammar that triggers `parser.set_timeout_micros` + (parse exceeds 100ms on a fixture buffer) returns partial parse + results and emits one warn-level log. + +### Integration test (`crates/languages/src/integration_test.rs` — new) + +- IT1: Create a temp dir with a fixture grammar (a real + tree-sitter-toml grammar shrunk to a minimal subset), point + `WARP_USER_GRAMMAR_DIR` env var at it, call + `discover_grammars()`. Assert the language returns as + `LoadResult::Loaded(...)` and + `language_by_filename(Path::new("test.example"))` returns it. +- IT2: Same as IT1 but with malformed TOML; assert the rest of the + registry loads normally and the failure is reported as + `LoadResult::Failed { reason: SchemaParse(_) }`. +- IT3: Call `loaded_languages()` after discovery and assert the 32 + hardcoded languages are present alongside the test fixture. 
+- IT4 (new): Drop a fixture WASM grammar with the same + `internal_name` as a hardcoded language; verify the user-local + one is dropped, a warn fires with basename only, and the + hardcoded language continues to handle that name. +- IT5 (new): Feed a 16MiB buffer through a user-local grammar; + verify the input-size cap kicks in and the buffer falls back to + plain rendering with a one-time info log. + +### Regression (existing test files unchanged) + +- Existing `crates/languages/src/lib_tests.rs` (which iterates + `SUPPORTED_LANGUAGES` and verifies each loads) must pass with no + modifications. The 32 hardcoded languages remain in + `SUPPORTED_LANGUAGES` until their migration PRs. +- Existing `crates/syntax_tree/src/queries/*_tests.rs` calling + `language_by_filename` must pass with no modifications. + +## Files touched + +V1 (this PR — discovery mechanism only, zero migrations): + +- `crates/languages/src/lib.rs` — fall-through call to the new + `AssociationIndex` after the existing hardcoded match. +- `crates/languages/src/schema.rs` (new) — `language.toml` parser. +- `crates/languages/src/loader.rs` (new) — discovery + load. +- `crates/languages/src/bundled_parsers.rs` (new) — empty map + initially; entries added per-migration. +- `crates/languages/src/association_index.rs` (new) — lookup maps. +- `crates/languages/Cargo.toml` — add `tree-sitter` (for + WasmStore) and `toml` deps if not already present. +- `crates/languages/src/schema_test.rs` (new) — T1–T5. +- `crates/languages/src/loader_test.rs` (new) — T6–T14. +- `crates/languages/src/integration_test.rs` (new) — IT1–IT5. +- `app/src/settings_view/editor_languages_page.rs` (new) — the + Languages sub-page. + +V1 explicitly does NOT touch: +- The 32 hardcoded language arms. +- Any consumer of `Language`. +- `arborium` crate. + +## Out-of-scope follow-ups (each independently revertable) + +- Per-language migration PRs (one per language) moving from + hardcoded to `crates/languages/grammars//`. 
+- Sub-language injection mechanism (Vue, TSX, Markdown code + blocks). +- Hot-reload of user-local grammars. +- Package-manager / cloud distribution of community grammars. +- LSP integration via a `[lsp]` section in `language.toml`. + +## Open questions for maintainer review + +1. WASM grammars require a tree-sitter version that supports + `WasmStore`. Confirm the version Warp uses today (verify + `Cargo.lock`) supports it. +2. The user-local grammar directory: `~/.warp/grammars/` vs. + `$XDG_CONFIG_HOME/warp/grammars/`. Recommendation: XDG when set, + fall back to `~/.warp/`. +3. Should the Settings → Editor → Languages page allow disabling + individual loaded languages? Recommendation: yes, but a + follow-up PR; not V1. +4. The `bundled_parsers.rs` compile-time map is the only hand- + edited file for adding bundled grammars. Can we use an + `inventory`-style auto-registration pattern instead? (Would + eliminate the only remaining hand-edit but adds a build-time + crate dependency.)
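The recommendation in open question 2 (XDG when set, otherwise `~/.warp/`) could be sketched roughly as follows. This is an illustration under assumptions, not the loader's actual code: the function names `resolve_user_grammar_dir` and `user_grammar_dir_from_env` are hypothetical, the `WARP_USER_GRAMMAR_DIR` override used by the integration tests is deliberately left out, and an empty `XDG_CONFIG_HOME` is treated as unset, as the XDG Base Directory Specification describes.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical helper: picks the user-local grammar root from the two
/// candidate locations named in open question 2.
/// The values are passed in (rather than read from the environment here)
/// so the precedence rule can be tested without mutating process state.
pub fn resolve_user_grammar_dir(
    xdg_config_home: Option<&str>,
    home: Option<&str>,
) -> Option<PathBuf> {
    // Prefer XDG when it is set to a non-empty value.
    if let Some(xdg) = xdg_config_home.filter(|v| !v.is_empty()) {
        return Some(PathBuf::from(xdg).join("warp").join("grammars"));
    }
    // Otherwise fall back to ~/.warp/grammars, if a home directory is known.
    home.filter(|v| !v.is_empty())
        .map(|h| PathBuf::from(h).join(".warp").join("grammars"))
}

/// Hypothetical wrapper: reads the real environment and delegates to the
/// pure function above. Returns `None` when neither variable is usable.
pub fn user_grammar_dir_from_env() -> Option<PathBuf> {
    let xdg = env::var("XDG_CONFIG_HOME").ok();
    let home = env::var("HOME").ok();
    resolve_user_grammar_dir(xdg.as_deref(), home.as_deref())
}
```

Keeping the precedence rule in a pure function means a loader test could cover all three cases (XDG set, XDG empty, neither set) without touching environment variables, which tend to be awkward in parallel test runs.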