M2: corpus + tokenizer byte offsets + outline parser & LSP symbols (#304, #306)#308
Conversation
Continues the remaining M2 work. #304 (conformance corpus): add styled-page and slot-component accept cases and document the inherent single-file scope. Reactive g: directives, endpoints, layouts, wasm, and assets need project context (a Go-typed state contract, Go handlers, sibling files) that single-file CheckSource cannot resolve, so they stay covered by package/build-level tests rather than the corpus; this is now stated in docs/language/conformance.md. #306 (ADR 0010 phase 1): the shared lang tokenizer now records each token's 0-based byte Offset, tracked alongside the existing rune scan. A test verifies the offsets are byte-accurate and consistent with source.OffsetOf. This is the substrate the planned recursive-descent parser consumes; phases 2-3 (the parser itself and the per-declaration cutover) remain open on #306. Tests: go test ./internal/... ./cmd/... . pass; gofmt/vet clean; node 39/0.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 270b09b150
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| func (scanner *scanner) advance() rune { | ||
| ch := scanner.source[scanner.index] | ||
| scanner.index++ | ||
| scanner.byteIndex += utf8.RuneLen(ch) |
There was a problem hiding this comment.
Preserve byte widths for malformed UTF-8
When a .gwdk file contains malformed UTF-8, Lex first converts the original string to []rune, so each invalid byte becomes utf8.RuneError; utf8.RuneLen(utf8.RuneError) is 3, even though the original malformed byte occupied 1 byte. That makes every subsequent Token.Offset drift from the original source buffer (for example, after one bad byte the next newline/metadata offset is two bytes too large), which defeats the new exact-span contract and can make the planned parser/diagnostic edits point at the wrong lexeme or past valid slice bounds. Track widths while ranging over the original string, or otherwise preserve the decoder-reported byte size instead of recomputing it from the rune.
Useful? React with 👍 / 👎.
Codex review: deriving Token.Offset by summing utf8.RuneLen drifts on malformed UTF-8. []rune turns each invalid byte into a 3-byte U+FFFD, so RuneLen reports 3 for a byte that occupied 1, pushing every later token's offset past its true position and breaking the exact-span contract. Derive byte offsets from ranging the original string (which reports true byte positions) into a per-rune byteOffsets table the scanner indexes, instead of accumulating RuneLen. Add a malformed-UTF-8 test that asserts offsets stay anchored to the byte buffer. Tests: go test ./internal/... ./cmd/... . pass; gofmt/vet clean.
…ase 2) Builds the first real consumer of the ADR 0010 parser direction on top of the phase-1 tokenizer, so the byte-offset work is no longer dormant substrate. - internal/lang/outline.go: a recursive-descent pass over the shared tokenizer that parses the top-level declaration structure (package, metadata, imports, uses, blocks, endpoints, component/page) into a flat outline with byte-offset spans. It recovers from unrecognized lines by skipping to the next line, so a malformed line never hides the rest of the outline, and block ranges span to the matching close brace counted over tokens (string literals are single tokens, so braces inside strings never miscount). - internal/lsp: a textDocument/documentSymbol provider consuming the outline, with the documentSymbolProvider capability, mapping outline kinds to LSP SymbolKinds. Tests cover outline parsing, error recovery, brace-aware block ranges, offset spans, and the LSP handler end to end. Tests: go test ./internal/... ./cmd/... . pass; gofmt/vet clean; node 39/0.
Continues the remaining open M2 work after closing the eight finished issues (#294–#299, #302, #305).
#304 — Conformance corpus (expanded + honestly scoped)
Added
styled_pageandslot_componentaccept cases. Documented a real boundary: full per-construct accept coverage is infeasible single-file — componentstateis a Go-typed contract, so reactiveg:directives can't check clean without the Go types;act/apineed handlers;layout/wasm/asset/cssneed sibling files or config.conformance.mdnow scopes the corpus to what single-file checking can verify.#306 phase 1 — shared tokenizer byte offsets
The
internal/langtokenizer records each token's 0-based byteOffset. Codex caught a real bug here: deriving offsets by summingutf8.RuneLendrifts on malformed UTF-8, since[]runeturns each bad byte into a 3-byte U+FFFD. Fixed by deriving offsets from ranging the original string (true byte positions) into a per-rune table; added a malformed-UTF-8 regression test.#306 phase 2 — recursive-descent outline parser + LSP document symbols
internal/lang/outline.gois the first real consumer of the ADR 0010 direction: a recursive-descent pass over the tokenizer that parses the top-level declaration structure into an outline with byte-offset spans. It recovers from unrecognized lines (a malformed line doesn't hide the rest — the headline #306 capability) and spans block ranges to the matching close brace counted over tokens (strings are single tokens, so interpolation/CSS/JS braces never miscount). Wired as atextDocument/documentSymbolprovider so the byte-offset work is not dormant substrate — it powers a live editor outline.Phase 3 (the full recursive-descent parser producing
gwdkast.Fileand the per-declaration cutover that retires the line-oriented parser) remains open on #306.Verification
go test ./internal/... ./cmd/... .— passgofmt -l+go vet— cleannode --test editors/vscode/*.test.js— 39/0🤖 Generated with Claude Code