M2: corpus + tokenizer byte offsets + outline parser & LSP symbols (#304, #306) by cssbruno · Pull Request #308 · cssbruno/GoWDK

cssbruno · 2026-06-11T21:43:47Z

Continues the remaining open M2 work after closing the eight finished issues (#294–#299, #302, #305).

#304 — Conformance corpus (expanded + honestly scoped)

Added styled_page and slot_component accept cases. Documented a real boundary: full per-construct accept coverage is infeasible single-file — component state is a Go-typed contract, so reactive g: directives can't check clean without the Go types; act/api need handlers; layout/wasm/asset/css need sibling files or config. conformance.md now scopes the corpus to what single-file checking can verify.

#306 phase 1 — shared tokenizer byte offsets

The internal/lang tokenizer records each token's 0-based byte Offset. Codex caught a real bug here: deriving offsets by summing utf8.RuneLen drifts on malformed UTF-8, since []rune turns each bad byte into a 3-byte U+FFFD. Fixed by deriving offsets from ranging the original string (true byte positions) into a per-rune table; added a malformed-UTF-8 regression test.

#306 phase 2 — recursive-descent outline parser + LSP document symbols

internal/lang/outline.go is the first real consumer of the ADR 0010 direction: a recursive-descent pass over the tokenizer that parses the top-level declaration structure into an outline with byte-offset spans. It recovers from unrecognized lines (a malformed line doesn't hide the rest — the headline #306 capability) and spans block ranges to the matching close brace counted over tokens (strings are single tokens, so interpolation/CSS/JS braces never miscount). Wired as a textDocument/documentSymbol provider so the byte-offset work is not dormant substrate — it powers a live editor outline.

Phase 3 (the full recursive-descent parser producing gwdkast.File and the per-declaration cutover that retires the line-oriented parser) remains open on #306.

Verification

go test ./internal/... ./cmd/... . — pass
gofmt -l + go vet — clean
node --test editors/vscode/*.test.js — 39/0

🤖 Generated with Claude Code

Continues the remaining M2 work. #304 (conformance corpus): add styled-page and slot-component accept cases and document the inherent single-file scope. Reactive g: directives, endpoints, layouts, wasm, and assets need project context (a Go-typed state contract, Go handlers, sibling files) that single-file CheckSource cannot resolve, so they stay covered by package/build-level tests rather than the corpus; this is now stated in docs/language/conformance.md. #306 (ADR 0010 phase 1): the shared lang tokenizer now records each token's 0-based byte Offset, tracked alongside the existing rune scan. A test verifies the offsets are byte-accurate and consistent with source.OffsetOf. This is the substrate the planned recursive-descent parser consumes; phases 2-3 (the parser itself and the per-declaration cutover) remain open on #306. Tests: go test ./internal/... ./cmd/... . pass; gofmt/vet clean; node 39/0.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 270b09b150

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T21:49:31Z

 func (scanner *scanner) advance() rune {
 	ch := scanner.source[scanner.index]
 	scanner.index++
+	scanner.byteIndex += utf8.RuneLen(ch)


Preserve byte widths for malformed UTF-8

When a .gwdk file contains malformed UTF-8, Lex first converts the original string to []rune, so each invalid byte becomes utf8.RuneError; utf8.RuneLen(utf8.RuneError) is 3, even though the original malformed byte occupied 1 byte. That makes every subsequent Token.Offset drift from the original source buffer (for example, after one bad byte the next newline/metadata offset is two bytes too large), which defeats the new exact-span contract and can make the planned parser/diagnostic edits point at the wrong lexeme or past valid slice bounds. Track widths while ranging over the original string, or otherwise preserve the decoder-reported byte size instead of recomputing it from the rune.

Useful? React with 👍 / 👎.

Codex review: deriving Token.Offset by summing utf8.RuneLen drifts on malformed UTF-8. []rune turns each invalid byte into a 3-byte U+FFFD, so RuneLen reports 3 for a byte that occupied 1, pushing every later token's offset past its true position and breaking the exact-span contract. Derive byte offsets from ranging the original string (which reports true byte positions) into a per-rune byteOffsets table the scanner indexes, instead of accumulating RuneLen. Add a malformed-UTF-8 test that asserts offsets stay anchored to the byte buffer. Tests: go test ./internal/... ./cmd/... . pass; gofmt/vet clean.

…ase 2) Builds the first real consumer of the ADR 0010 parser direction on top of the phase-1 tokenizer, so the byte-offset work is no longer dormant substrate. - internal/lang/outline.go: a recursive-descent pass over the shared tokenizer that parses the top-level declaration structure (package, metadata, imports, uses, blocks, endpoints, component/page) into a flat outline with byte-offset spans. It recovers from unrecognized lines by skipping to the next line, so a malformed line never hides the rest of the outline, and block ranges span to the matching close brace counted over tokens (string literals are single tokens, so braces inside strings never miscount). - internal/lsp: a textDocument/documentSymbol provider consuming the outline, with the documentSymbolProvider capability, mapping outline kinds to LSP SymbolKinds. Tests cover outline parsing, error recovery, brace-aware block ranges, offset spans, and the LSP handler end to end. Tests: go test ./internal/... ./cmd/... . pass; gofmt/vet clean; node 39/0.

This was referenced Jun 11, 2026

[Compiler] Implement shared tokenizer + recursive-descent parser (ADR 0010) #306

Open

[Language] Expand conformance corpus to full construct coverage #304

Open

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

cssbruno added 2 commits June 11, 2026 18:59

cssbruno changed the title ~~M2 continued: corpus coverage + tokenizer byte offsets (#304, #306 phase 1)~~ M2: corpus + tokenizer byte offsets + outline parser & LSP symbols (#304, #306) Jun 11, 2026

cssbruno merged commit 9c1d4f3 into main Jun 12, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

M2: corpus + tokenizer byte offsets + outline parser & LSP symbols (#304, #306)#308

M2: corpus + tokenizer byte offsets + outline parser & LSP symbols (#304, #306)#308
cssbruno merged 3 commits into
mainfrom
codex/m2-corpus-tokenizer

cssbruno commented Jun 11, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cssbruno commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

#304 — Conformance corpus (expanded + honestly scoped)

#306 phase 1 — shared tokenizer byte offsets

#306 phase 2 — recursive-descent outline parser + LSP document symbols

Verification

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cssbruno commented Jun 11, 2026 •

edited

Loading