Skip to content

M2: corpus + tokenizer byte offsets + outline parser & LSP symbols (#304, #306)#308

Merged
cssbruno merged 3 commits into
mainfrom
codex/m2-corpus-tokenizer
Jun 12, 2026
Merged

M2: corpus + tokenizer byte offsets + outline parser & LSP symbols (#304, #306)#308
cssbruno merged 3 commits into
mainfrom
codex/m2-corpus-tokenizer

Conversation

@cssbruno

@cssbruno cssbruno commented Jun 11, 2026

Copy link
Copy Markdown
Owner

Continues the remaining open M2 work after closing the eight finished issues (#294#299, #302, #305).

#304 — Conformance corpus (expanded + honestly scoped)

Added styled_page and slot_component accept cases. Documented a real boundary: full per-construct accept coverage is infeasible single-file — component state is a Go-typed contract, so reactive g: directives can't check clean without the Go types; act/api need handlers; layout/wasm/asset/css need sibling files or config. conformance.md now scopes the corpus to what single-file checking can verify.

#306 phase 1 — shared tokenizer byte offsets

The internal/lang tokenizer records each token's 0-based byte Offset. Codex caught a real bug here: deriving offsets by summing utf8.RuneLen drifts on malformed UTF-8, since []rune turns each bad byte into a 3-byte U+FFFD. Fixed by deriving offsets from ranging the original string (true byte positions) into a per-rune table; added a malformed-UTF-8 regression test.

#306 phase 2 — recursive-descent outline parser + LSP document symbols

internal/lang/outline.go is the first real consumer of the ADR 0010 direction: a recursive-descent pass over the tokenizer that parses the top-level declaration structure into an outline with byte-offset spans. It recovers from unrecognized lines (a malformed line doesn't hide the rest — the headline #306 capability) and spans block ranges to the matching close brace counted over tokens (strings are single tokens, so interpolation/CSS/JS braces never miscount). Wired as a textDocument/documentSymbol provider so the byte-offset work is not dormant substrate — it powers a live editor outline.

Phase 3 (the full recursive-descent parser producing gwdkast.File and the per-declaration cutover that retires the line-oriented parser) remains open on #306.

Verification

  • go test ./internal/... ./cmd/... . — pass
  • gofmt -l + go vet — clean
  • node --test editors/vscode/*.test.js — 39/0

🤖 Generated with Claude Code

Continues the remaining M2 work.

#304 (conformance corpus): add styled-page and slot-component accept cases and
document the inherent single-file scope. Reactive g: directives, endpoints,
layouts, wasm, and assets need project context (a Go-typed state contract, Go
handlers, sibling files) that single-file CheckSource cannot resolve, so they
stay covered by package/build-level tests rather than the corpus; this is now
stated in docs/language/conformance.md.

#306 (ADR 0010 phase 1): the shared lang tokenizer now records each token's
0-based byte Offset, tracked alongside the existing rune scan. A test verifies
the offsets are byte-accurate and consistent with source.OffsetOf. This is the
substrate the planned recursive-descent parser consumes; phases 2-3 (the parser
itself and the per-declaration cutover) remain open on #306.

Tests: go test ./internal/... ./cmd/... . pass; gofmt/vet clean; node 39/0.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 270b09b150

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/lang/lexer.go Outdated
func (scanner *scanner) advance() rune {
ch := scanner.source[scanner.index]
scanner.index++
scanner.byteIndex += utf8.RuneLen(ch)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve byte widths for malformed UTF-8

When a .gwdk file contains malformed UTF-8, Lex first converts the original string to []rune, so each invalid byte becomes utf8.RuneError; utf8.RuneLen(utf8.RuneError) is 3, even though the original malformed byte occupied 1 byte. That makes every subsequent Token.Offset drift from the original source buffer (for example, after one bad byte the next newline/metadata offset is two bytes too large), which defeats the new exact-span contract and can make the planned parser/diagnostic edits point at the wrong lexeme or past valid slice bounds. Track widths while ranging over the original string, or otherwise preserve the decoder-reported byte size instead of recomputing it from the rune.

Useful? React with 👍 / 👎.

cssbruno added 2 commits June 11, 2026 18:59
Codex review: deriving Token.Offset by summing utf8.RuneLen drifts on malformed
UTF-8. []rune turns each invalid byte into a 3-byte U+FFFD, so RuneLen reports 3
for a byte that occupied 1, pushing every later token's offset past its true
position and breaking the exact-span contract.

Derive byte offsets from ranging the original string (which reports true byte
positions) into a per-rune byteOffsets table the scanner indexes, instead of
accumulating RuneLen. Add a malformed-UTF-8 test that asserts offsets stay
anchored to the byte buffer.

Tests: go test ./internal/... ./cmd/... . pass; gofmt/vet clean.
…ase 2)

Builds the first real consumer of the ADR 0010 parser direction on top of the
phase-1 tokenizer, so the byte-offset work is no longer dormant substrate.

- internal/lang/outline.go: a recursive-descent pass over the shared tokenizer
  that parses the top-level declaration structure (package, metadata, imports,
  uses, blocks, endpoints, component/page) into a flat outline with byte-offset
  spans. It recovers from unrecognized lines by skipping to the next line, so a
  malformed line never hides the rest of the outline, and block ranges span to
  the matching close brace counted over tokens (string literals are single
  tokens, so braces inside strings never miscount).
- internal/lsp: a textDocument/documentSymbol provider consuming the outline,
  with the documentSymbolProvider capability, mapping outline kinds to LSP
  SymbolKinds.

Tests cover outline parsing, error recovery, brace-aware block ranges, offset
spans, and the LSP handler end to end.

Tests: go test ./internal/... ./cmd/... . pass; gofmt/vet clean; node 39/0.
@cssbruno cssbruno changed the title M2 continued: corpus coverage + tokenizer byte offsets (#304, #306 phase 1) M2: corpus + tokenizer byte offsets + outline parser & LSP symbols (#304, #306) Jun 11, 2026
@cssbruno cssbruno merged commit 9c1d4f3 into main Jun 12, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant