Bound prefer_inner_replay_corrections by depth_diff to substitute siblings only by stefanobaghino · Pull Request #682 · trishume/syntect

stefanobaghino · 2026-04-29T13:16:49Z

Review just this PR's changes: 631-java-cross-line-bp-fall-through...631-java-cross-line-allexhaust-pops

Stacked on #681.

The previous gate in `prefer_inner_replay_corrections` (commit 0a2139a) skipped substitution iff `inner.stack_depth > outer.stack_depth`.
That predicate collapsed two structurally distinct cases:

- Sibling refinement (substitute needed): inner is one level deeper than outer on the same line, e.g. cluster 1's `@Anno\n.\nAnno\n(par=1)\nenum E {}` with outer=`declarations`(3), inner=`annotation-identifier`(4). Inner's `qualified-identifier` alt brings `meta.path.java` that outer's locally-computed parse drops.
- Child of resolved alt (substitute must skip): inner is much deeper, nested inside contexts outer's resolved alt pushed, e.g. multigen16's outer=`class-members`(4), inner=`object-type`(11). Inner's reparse adds atoms outer's alt already provides (the `deeper_inner_bp_correction_does_not_double_outer_meta_scope` guard).

This PR tightens the gate to depth_diff in {0, 1}, separating
the two by the smallest viable structural signal. Java syntest
baseline drops 119 → 117. The cluster-1 ignored repro becomes a
passing regression test
(cross_line_all_exhaust_with_pop_count_emits_popped_meta_scope_pops).
The doubling guard stays green.
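
The tightened gate can be sketched as a small predicate. This is only an illustration: `should_substitute` is a hypothetical name, `depth_diff` is taken to mean `inner.stack_depth - outer.stack_depth`, and rejecting an inner that is shallower than outer is an assumption the description above does not confirm.

```rust
/// Hypothetical stand-in for the gate inside `prefer_inner_replay_corrections`.
/// Returns true when the inner correction should replace the outer's ops.
fn should_substitute(inner_depth: usize, outer_depth: usize) -> bool {
    // Assumption: depth_diff = inner - outer. A shallower inner
    // (negative diff) is rejected here; the PR text does not say so.
    match inner_depth.checked_sub(outer_depth) {
        Some(depth_diff) => depth_diff <= 1, // depth_diff in {0, 1}
        None => false,
    }
}
```

With the depths quoted above, cluster 1 (outer 3, inner 4) passes the gate and multigen16 (outer 4, inner 11) is skipped.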

Cluster 2's multi-line qualified field type (depth_diff=5) and
cluster 3's pop: 2 miscount remain unresolved; per investigation
captured in scratch on the stacked branch, neither has a
local-signal discriminator distinguishing it from the doubling
cascade. Those are deferred to a separate pass.

push_meta_ops emitted the compound Pop before the Restore when the leaving context carried `clear_scopes:` and `set_pop_count > 1`. The intermediate popped frame's meta_content_scope is split across the live scope stack and clear_stack; Pop — counting the full pre-Clear total — then ate atoms from frames below the popped range. Observed on Batch File `cmd-set-quoted-value-inner-end` (`clear_scopes: 1`) firing `pop: 2, set: ignored-tail-outer`, which dropped `meta.command.set.dosbatch` from the trailing content of every quoted `set "var"=...` line. Emit Restore before Pop when `set_pop_count > 1`; plain `set:` keeps the existing Pop-then-Restore order (gated by the Lisp `defun` test `v2_set_to_target_with_clear_scopes_clears_parent_meta_content_scope`). syntax_test_batch_file.bat: 74 → 0 on both backends; no other baseline entry changes. New regression `pop_n_set_with_cur_clear_scopes_restores_before_popping_deeper_frames`. Refs: trishume#631

Sublime Text applies `captures: N:` to the overlap between group N's span and the rule's consumed match range. syntect dropped lookaround-internal captures in `parse_captures` and would have emitted Pop past match_end in `build_capture_ops` had they reached it. Keep every non-negative `captures:` key at load time; clip `(cap_start, cap_end)` to `regions.pos(0)` at apply time. Removes the now-unused `get_consuming_capture_indexes` walker and its tests. Baseline (both backends): clears ASP/syntax_test_asp.asp (53), C#/tests/syntax_test_Generics.cs (3), Rails/tests/syntax_test_rails.html.erb (23). Refs: trishume#631
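
A minimal sketch of the clipping step, assuming byte-offset spans with exclusive ends. `clip_capture` is a hypothetical helper, not a syntect function, and treating an empty overlap as "no capture" is an assumption made here for illustration.

```rust
/// Clip a capture group's span to the rule's consumed match range.
/// Both spans are (start, end) byte offsets with an exclusive end.
/// Returns `None` when the capture lies entirely outside the match
/// (assumed behaviour for an empty overlap).
fn clip_capture(cap: (usize, usize), matched: (usize, usize)) -> Option<(usize, usize)> {
    let start = cap.0.max(matched.0);
    let end = cap.1.min(matched.1);
    if start < end {
        Some((start, end))
    } else {
        None
    }
}
```

A capture inside a trailing lookahead that runs from 3 to 9 against a match of 0 to 5 is trimmed to 3 to 5, so no Pop lands past `match_end`.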

ST drops the popped context's `meta_scope` and `meta_content_scope` from the trigger match's text for `pop: N + embed:`, unlike `pop: N + set:` which preserves both. Rules in the wild re-add the meta_scope atom explicitly in their match `scope:` so it still appears exactly once on the trigger — HTML (JSP).sublime-syntax's `tag-jsp-{declaration,expression,scriptlet}-attributes` all do. syntect's Embed → synthetic Set routing in `push_meta_ops` inherited plain-Set semantics, so cur.meta_scope stayed on the stack and the match's explicit scope duplicated it on top. Fix: when `pop_count > 0`, emit initial-phase Pops for cur's mcs and ms, then pass a scope-stripped clone of cur through to the recursive Set call so its non-initial `num_to_pop` doesn't double-account atoms that are already off the stack. Probe and ordering invariant in `v2_pop_embed_suppresses_cur_meta_scope_on_match`. Net syntest: Java/jsp 44 → 39 on both baselines (5 `- meta.tag.jsp meta.tag.jsp` assertions cleared). The other 39 are three unrelated root causes, not addressed here. Refs: trishume#631

A cross-line `fail` replay commits a Push(meta_scope) to `flushed_ops` for a speculative context; a later same-line `fail` for a branch_point created *during* the replay then truncates the owning context out of `self.stack` without emitting a balancing Pop (the Push is in `flushed_ops`, beyond `ops.truncate`'s reach). `exec_escape` pops based on the truncated stack, leaving an orphan atom at the top. Track a `shadow: ScopeStack` mirror of the consumer view, synced at `parse_line` boundaries the same way `syntest` applies `replayed` + `ops`. `exec_escape` now emits a corrective Pop for any atoms exceeding the sum of `self.stack`'s `meta_scope` / gated `meta_content_scope` contributions. Drops `syntax_test_latex.tex: 76` from both `testdata/known_syntest_failures{,_fancy}.txt`.

`IncludeWithPrototype` in `MatchIter` pushed the included target on top of the prototype, and `MatchIter::next` reads the stack top (`ctx_stack[len-1]`) — so the target's patterns were iterated first and the external prototype's second. The parser's tie-break on match_start is strict `<`, so whichever rule is enumerated first wins a same-position match. ST's `apply_prototype` semantics and `ParseState::find_best_match`'s own `context.prototype` chaining (`chain(cur_prototype).chain(cur_context)`) both put prototype patterns ahead of the target. Swap the two pushes so the prototype lands on top of the stack and is iterated first. Concretely: in HAML `tag-attributes-content`, `ruby-code` does `include: scope:source.ruby.rails.embedded.haml apply_prototype: true`. Ruby-for-HAML's prototype injects HAML's `pipe-continuations` (match `|\s*$`). Before this fix, Ruby's bitwise-or rule (`[~|^]`) at the same position was iterated first and won the tie, so `|` at EOL got `keyword.operator.bitwise.ruby` instead of `punctuation.separator.continuation.haml`; the attribute braces popped at the newline, and every continuation assertion below cascaded. After the swap, the prototype's pipe-continuation wins the tie. Refs: trishume#631
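
Why enumeration order decides the tie can be shown with a small sketch. `best_match` and the rule names passed to it are illustrative only; the real selection lives in `ParseState::find_best_match` and compares more than the start offset.

```rust
/// Pick the winning rule from candidates listed in enumeration order.
/// Each candidate is (rule name, match_start).
fn best_match<'a>(candidates: &[(&'a str, usize)]) -> Option<&'a str> {
    let mut best: Option<(&'a str, usize)> = None;
    for &(name, start) in candidates {
        // Strict `<`: a later rule wins only if it starts earlier,
        // so the first-enumerated rule keeps a same-position tie.
        let better = match best {
            None => true,
            Some((_, best_start)) => start < best_start,
        };
        if better {
            best = Some((name, start));
        }
    }
    best.map(|(name, _)| name)
}
```

Listing the prototype's rule first makes it win a tie at the same position; listing it second hands the tie to the target's rule, which is the pre-fix behaviour described above.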

Strengthens the existing `apply_prototype_includes_external_prototype` from build-only to parse-and-assert. Adds precedence, opt-out, and HAML-Rails end-to-end guards alongside it in `src/parsing/syntax_set.rs`. Refs: trishume#631

All 65 assertions in `syntax_test_rails.haml` pass after the apply_prototype ordering fix. Delta applies to both baselines.

ST's `text_point(row, col)` overflows past-EOL columns into the next row, so its syntax-test framework evaluates past-EOL assertions against the corresponding column on the next line. syntect's harness was instead testing against the consumed `\n`'s scope — silent divergence whenever the `\n` carried parent meta_scopes that the EOL pop chain dropped. Reorder the loop to parse-before-assert; thread the first post-target line's scopes into `process_assertions` (`examples/syntest.rs`); fall back to the previous behaviour when next-line scopes aren't available (EOF, replay path). Closes 17 `syntax_test_git_config` and 1 `syntax_test_clojure` stale baselines. Refs: trishume#631

Two cross-line branches failing on the same parse_line grew `flushed_ops` by append, so `ParseLineOutput::replayed` doubled and consumers that pair `replayed[i]` with the i-th pending line slid ops from one buffered line onto another's text. Observed as the byte-77 panic at `syntax_test_java.java` line 624. Track `flushed_ops_start` alongside `flushed_ops` and merge subsequent fails against the prior snapshot's range. See `ParseState::merge_flushed` docs for the composition rule. `known_syntest_failures{,_fancy}` absorb the unmasking: Python / TypeScript / Bash / Zsh files previously panicking now report their real path-1 counts. Java stays at `1` — next panic site is a pre-existing stale-`line_number` on branches created during replay, tracked as follow-up. Refs: trishume#631

Branches created while `handle_fail` re-parses a buffered past line snapshotted `self.line_number` / `self.pending_lines.len()`, which still reflect the *outer* `parse_line`'s current line. A later fail on the outer line would then see `bp.line_number == cur_line`, classify the branch as same-line, and apply the branch's replay-line-relative `match_start` to a shorter outer line — shipped as `byte index 20 out of bounds of " foo = BAR,\n"` on `syntax_test_java.java:10263` inside `@MultiLineAnnotation(...)`, and the matching byte-2 panic in `syntax_test_markdown.md`'s multi-line math blocks. Introduce a `replay_ctx: Option<ReplayCtx>` set around each inner `parse_line_inner*` call in both replay loops. Branch creation and `handle_fail`'s `cur_line` read through it, so branches born in the re-parse of line `L+i` record `line_number = L+i` and `pending_lines_snapshot_len = <slot for L+i>`. Baselines absorb the unmasking: TypeScript drops 230 to 12 (cascading replay-branch misclassifications fixed), Markdown moves 1 to 897 (the `1` was the byte-2 panic artefact; real count surfaces). `syntax_test_java.java` stays at `1`: a distinct pre-existing `NoClearedScopesToRestore` surfaces further into the same file, tracked as a follow-up. Refs: trishume#631

A `branch_point` born inside `handle_fail`'s cross-line replay recorded only the inner re-parse's local `res` Vec as its `prefix_ops`. When that nested branch later failed cross-line, its own replay reconstructed the line from an empty prefix and the captures emitted before the *outer* branch trigger vanished. Shipped as `[foo]: /url` losing its `meta.link.reference.def.markdown` and capture scopes whenever the outer `link-def-title-continuation` branch's `immediately-pop2` alt-1 spawned a nested `link-def-attr-continuation` whose own fail then replayed line 3 without the original LRD opener captures. Compose the first-line prefix (outer `prefix_ops` + new-alt meta/pat/capture/meta_content) up front in both cross-line replay paths, surface it via `ParseState::replay_prefix_ops`, and prepend it to inner branch creations' `prefix_ops`. Baselines: Markdown 897 → 565, TypeScript 12 → 0 (file disappears — `syntax_test_typescript.ts` exercised the same nested-replay shape). Refs: trishume#631

`parse_line` captured the buffered shadow snapshot BEFORE the line ran, and the syntest consumer captured `stack_before` similarly. A replay applied during that line's parse may have corrected ops for prior buffered lines, leaving the captured snapshot reflecting the uncorrected baseline. A LATER replay covering the same line then resets to that stale snapshot, re-applies the corrected ops on top, and resurrects any meta_scope the prior replay had unwound. Manifested as `meta.link.reference.def.markdown` leaking past back-to-back Markdown link reference definitions and polluting all subsequent paragraphs, code blocks, blockquotes, autolinks, footnotes, etc. for the rest of the file (~408 chars / 88 assertions in `syntax_test_markdown.md`). After applying replays in `parse_line`, overwrite each buffered `pending_line_start_shadows[start_idx + i + 1]` with the post-i shadow, and use the post-replay shadow as the snapshot for the current line being pushed. Mirror the same correction in `syntest`'s consumer loop on `parsed_line_buffer[..].stack_before`.
Baselines:
- Markdown 565 → 158 (the LRD-leak family)
- Java 1 (panic) → 18953 (real failures unmasked; the `NoClearedScopesToRestore` panic that the same drift was triggering is gone)
Refs: trishume#631

A `pop: N + branch_point` snapshots `stack_depth` pre-pop; the synthetic Set's post-Set retain (`bp.stack_depth <= final_len`) and `handle_fail`'s validity check (`stack.len() < bp.stack_depth`) both ignored that `pop_count`, dropping the freshly-created bp at creation. Same-line re-emit also missed the popped contexts' meta_scope clearance Pop — route it through `push_meta_ops` like the original push. Symptom: `meta.annotation.identifier.java meta.path.java` leaking past nested-annotation extends paths in `syntax_test_java.java`. Drops Java baseline 18953 -> 9956.

Mirrors the trishume#660 same-line fix into the cross-line branch — the bespoke re-emit of the new alternative's meta_scope/meta_content_scope was missing the popped contexts' Pop, leaking the popped meta_scope (annotation-qualified-identifier's meta_scope in Java) plus the surrounding declaration's meta_scope when an annotation crosses a line into a class/enum/interface declaration.

When an outer cross-line `fail`'s replay re-parses buffered lines, an inner cross-line `fail` firing during the loop writes its correction into `self.flushed_ops`. Previously, the outer's locally-computed `replayed_ops[i]` overwrote that correction via `merge_flushed`, freezing a stale interpretation for indices the inner had already corrected. Fixes the leak in `src/parsing/parser.rs::handle_fail` for both the alt-N and exhaustion cross-line paths. Repro: Java `@A.B\n(par=1)\nenum E {}\n`, where the outer `declarations` fail's line-1 reparse froze the dotted annotation as `path` alt before the inner `annotation-qualified-identifier` fail's `name`-alt resolution landed. Drops Java syntest baseline 9935 → 9774; no regressions in other languages or in `Markdown` (still 158).

When a same-line branch_point exhausts at a zero-width lookahead, rewind the cursor to the BP's original position and skip the same-name Branch pattern on retry — letting the parent context's next rule fire instead of advancing past the lookahead, which let stale keyword rules match inside identifiers (`package` in `$package`, `class` in `Foo.class;`). Drops Java syntest 9774 → 1987 (-7787, -80%); jsp 39 → 0; Zsh 604 → 410. Markdown unchanged at 158. No regressions elsewhere. See parser.rs::handle_fail same-line exhaustion handler and the new `skipped_branches` field; new test `exhausted_branch_point_falls_through_to_parent_next_rule`.

`push_meta_ops`'s non-initial phase emitted the deep-context meta_scope/mcs Pops before restoring `cur_context.clear_scopes`. When the cleared atom belonged to one of the deeper contexts being popped, the Pops landed on the wrong (still-visible) scope — observed on Java's `case DayType when -> "incomplete"`, where `case-label-expression`'s `clear_scopes: 1` hid `case-label`'s `meta.case.java` and `case-label-end`'s `pop: 2` then popped the surrounding switch block off the consumer's stack. Move the cur_context Restore to before the depth loop so the previously-cleared atom is visible again when the deeper-context Pop lands on it. Drops Java syntest 1987 → 949 (-1038, additional -50%); fixes C#'s `syntax_test_GeneralStructure.cs` (was 2 → 0) and Haskell -1. Markdown unchanged at 158, no other regressions. See `parser.rs::push_meta_ops` Pop arm and the new test `pop_n_restores_clear_before_unwinding_deeper_meta_scopes`.

The YAML loader checked `set:`, `branch:`, and `embed:` after a `pop:` key but never `push:`. Combined `pop: N + push: X` rules degraded to a plain `Pop(N)` and silently dropped the push, leaving the parser on the outer context instead of the intended target. Affected rules in vendored syntaxes: Java's `pop: 2 + push: annotation-parameters-body` (lambda3 line 10069 and many others) and `pop: 1 + push: case-label-expression`; Python's `pop: 2 + push: function-parameter-list-body` and `type-parameter-list-body`. Java syntest 641 → 245 (-396); Python 66 → 45 (-21). Other language baselines unchanged.

The Set initial-phase Pop at parser.rs:1992 unconditionally popped `cur_context.meta_content_scope.len()` even when cur_context's mcs was never pushed because the context immediately below has `embed_scope_replaces=true`. This dropped the topmost wrapper-pushed embed_scope token. Mirrors the skip already in the Pop branch at parser.rs:1912. Markdown 158 -> 31; Python 45 -> 32 (free benefit).

Plain `set:` (no `pop_count`) into a target with `clear_scopes` emitted that Clear in `push_meta_ops`'s initial phase even when the leaving context carried its own `meta_scope` / `meta_content_scope`. Cur's ms sits on top of the visible stack at that point; Clear hid it instead of the parent atom the optimization was meant to strip. The non-initial Pop then ate atoms below cur's hidden ms, and the trailing Restore resurrected cur's ms — leaving cur's meta_scope where the parent's atom used to be. Bash repro `: ~/`: `~` set: `tilde-modifier` (clear+ms); `''` zero-width set: `tilde-modifier-username` (clear+mcs); `/` lookahead pops. ST scopes `/` as `meta.string.glob.shell string.unquoted.shell`; syntect emitted `meta.interpolation.tilde.shell string.unquoted.shell`. Fix: when cur has `meta_scope` or `meta_content_scope`, defer the single-context-set target Clear to the non-initial phase, after Pop+Restore (so Pop finds cur's ms visible and Restore brings the parent atoms back) and before pushing target's ms/mcs. The cur-empty case (Lisp `(defun fn (...)`, pinned by `v2_set_to_target_with_clear_scopes_clears_parent_meta_content_scope`) is unchanged. Net syntest: bash 249 → 30, zsh 410 → 25, java 245 → 221 on both regex backends; no other baseline lines change. New regression `cur_meta_scope_set_to_target_with_clear_scopes` mirrors the bash shape. Refs: trishume#631

Multi-context `set:` whose target body has both `clear_scopes: N` and a non-empty `meta_scope`, fired from a cur with no ms/mcs/clear, needs an extra atom dropped on the trigger token beyond Clear's reach. ST drops `N + 1` atoms on the trigger and `N` on the body content, anchoring the extra drop on the target's `meta_scope`. `push_meta_ops` previously kept both atoms on the trigger, leaking nested `meta.function.php` / `meta.function.return-type.php` into the `:` of PHP `function bye(): never {`. The fix emits a combined `Clear(N + 1)` in the initial phase and a paired `Restore` in the non-initial phase, leaving the body content's existing per-context Clear+Push to land it at the same place as before. Gated on the clear-bearing target carrying a non-empty `meta_scope` so syntaxes whose target has only `meta_content_scope` are unaffected — Zsh's `zsh-redirection-glob-range-end` (clear+mcs, no ms) on the `<` redirection trigger otherwise loses `source.shell.zsh` and `meta.function-call.arguments.shell`. PHP 1 -> 0.

push_meta_ops's `MatchOperation::Set` arm with `set_pop_count > 1` lumped target.ms + cur.ms + every popped deeper frame's mcs+ms into a single Pop. Per-frame clear_scopes were never restored — their cleared atoms stayed in clear_stack out of reach, and the new target's clear_scopes then bit one atom too deep. Observed on Python `r'''(?ix:some text(?-i:hello))(?iLmsux)(?a)foo'''`: the `(?ix:` rule's `pop: 3 + set:[group-body-extended, maybe-unexpected-quantifiers]` left `group-body-extended_outer`'s cleared `meta.mode.extended.regexp` in clear_stack; `group-body-extended_target`'s `clear_scopes: 1` then cleared `source.regexp.python` (the embed wrapper's mcs) instead of `mode_outer`. ST keeps `source.regexp.python` visible from col 22 through col 47+; syntect previously dropped it from col 27 onward. Split the lumped Pop into a head Pop (target.ms + cur.ms) and a per-depth Pop+Restore loop mirroring `MatchOperation::Pop` arm at parser.rs:1954-1971. New regression tests `pop_n_set_restores_deeper_frame_clear_scopes` (positive) and `pop_n_set_without_deeper_clear_scopes_unaffected` (negative gate against regressing Java's `pop:2 + push:annotation-parameters-body` shape). Refs: trishume#631

Resolved by per-depth clear_scopes Restore on pop:N + set:. Refs: trishume#631

`yaml_load`'s `parse_embed_op` was setting `embed_scope_replaces=true` on the wrapper unconditionally. That flag tells the per-target loop in `parser.rs` to suppress the next context's `meta_content_scope` push, to avoid duplicating the embedded syntax's top-level scope (auto-inserted into `main`'s mcs at `yaml_load.rs:706-713`) with the wrapper's last `embed_scope` atom. That dedup is only needed when the embed enters via `main`. Fragment embeds (e.g. `embed: scope:source.toml.embedded.python#toml`) bypass `main`, so the fragment context's mcs is independent of the syntax's top-level scope. Suppressing it strips a real grammar atom (TOML's `meta.mapping.toml`) and the next `clear_scopes:` then bites the wrapper instead of the intended grammar atom, leaking the wrapper out of every nested scope inside the embed. Mark the wrapper as `embed_scope_replaces=true` only when the embed target has no `#fragment`.
Two regression tests:
- `fragment_embed_preserves_target_meta_content_scope` (positive)
- `non_fragment_embed_still_suppresses_main_mcs` (negative gate)
The b31b727 test `embed_scope_replaces_preserves_wrapper_mcs_across_inner_set` is unaffected: Markdown's bash code-fence embed has no fragment. Python 32 -> 0 on both regex backends; no other baseline moves.
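
The new condition reduces to a check on the embed target string. The sketch below assumes the target is available as plain text and that a `#` can only introduce a fragment; the real loader works on parsed context references, and `embed_scope_replaces_for` is a hypothetical helper name.

```rust
/// Decide whether the embed wrapper should suppress the next context's
/// `meta_content_scope` push. Only embeds that enter via `main`
/// (no `#fragment` in the target) need the dedup.
fn embed_scope_replaces_for(embed_target: &str) -> bool {
    // Assumption: any '#' in the target marks a fragment reference.
    !embed_target.contains('#')
}
```

For the TOML fragment embed quoted above this returns `false`, so the fragment context's own `meta_content_scope` is pushed as written in the grammar.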

When a child syntax has multiple parents in `extends:` and the parents disagree on a shared context or variable, a parent's directly-defined entry now outranks another parent's inherited entry. Same-provenance ties still resolve last-wins. Fixes the indented zsh shebang in Markdown fenced blocks: `Zsh (for Markdown)` extends `[Bash (for Markdown), Zsh]`. Bash (for Markdown) owns a lenient `main` (`^(?=\s*#!)`); Zsh inherits Bash's strict column-0 main. The previous last-wins merge let Zsh's inherited main override, so the indented ` #!/usr/bin/env zsh` fell into the regular comments rule.
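
The precedence rule can be sketched as a ranked pick over the parents' versions of one context. `Provenance` and `resolve` are illustrative names, and the sketch models only the two provenance levels named above.

```rust
/// Where a parent got its version of a context or variable.
/// Variant order matters: later variants rank higher under `Ord`.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Provenance {
    Inherited, // the parent received the entry from its own parent
    Direct,    // the parent defines the entry itself
}

/// `entries` lists each parent's version of one entry, in `extends:` order,
/// as (provenance, owning parent). Directly-defined entries outrank
/// inherited ones; ties between equal provenance resolve last-wins.
fn resolve<'a>(entries: &[(Provenance, &'a str)]) -> Option<&'a str> {
    // `max_by_key` returns the last of several equally-maximal elements,
    // which gives the last-wins tie-break.
    entries
        .iter()
        .max_by_key(|(provenance, _)| *provenance)
        .map(|(_, owner)| *owner)
}
```

For `extends: [Bash (for Markdown), Zsh]`, the `main` that Bash (for Markdown) defines directly now beats the `main` that Zsh merely inherits, even though Zsh is listed last.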

`get_line_assertion_details` recognised any line where the testtoken appeared mid-text and where valid assertion markers followed. ST's syntax-test format only allows assertions on dedicated comment-only lines, so source code preceding the testtoken means the markers are coincidental. The harness was processing such lines as assertions anyway, producing spurious failures and pinning `test_against_line_number` away from the source line so the *next* genuine assertion tested against stale scopes. Fix: early-return `None` when source code precedes the testtoken or non-whitespace follows the closing testtoken_end. The two bash repros are `: ${#^pattern}` (the `#` is the parameter-length operator) and `[ <<doc ] # <- ]` (a trailing comment whose body starts with `<-`). Doing this also exposed a latent bug in `only_whitespace_after_token_end`: `after_token_end` was the substring *from* the end-token, so the end-token glyphs themselves always counted as non-whitespace, and `/* ^ scope */` lines were silently classified as non-pure. Under the old gate this was harmless (the flag only fed the `parse_test_lines` path), but the early-return turned every C-style block-comment assertion into a non-assertion source line. Skip the end-token before checking the trailing content. Three new harness unit tests cover the corrected predicate and both shell repros. Two existing tests already exercise the pure-assertion path; their `is_pure_assertion_line` field assertions are now invariant-true at the constructor, but kept as documentation.
Net syntest deltas (both regex backends):
- Bash 30 -> 4 (residual: backtick `for...done` interaction)
- Zsh 25 -> 10 (residual: zsh glob-range scoping)
- Haskell 49 -> 43
Stacks on trishume#673.

The per-line search cache stored full-line `regex.search` results keyed by MatchPattern pointer, then reused them on every later search regardless of `search_end`. Inside an embed where `search_end` is clipped to the escape position, that reuse can flip rule outcomes whose lookaheads sit exactly at the boundary — the cached "no match" was computed against the escape glyph, but a fresh truncated search would see end-of-input there. Concretely: in `` `for i in $(seq 100); do echo $i; done` `` the `done{{cmd_break}}` rule (`done(?!cmd_char)`) was searched at the outer level with full-line text, where the lookahead saw the closing backtick (itself a cmd_char) and failed. That `None` was cached. Inside the backtick embed, with search_end clipped to the close, the cache short-circuited the lookup before the regex could re-run with end-of-input semantics, so `done` fell through to `cmd-name-body` and got `variable.function.shell` instead of `keyword.control.loop.end.shell`. Skip the cache lookup whenever `search_end < line.len()`. Insertion was already gated on `search_end == line.len()`, so the cache stays populated by full-line answers; truncated searches just re-run.
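
The gating can be sketched as a cache that only answers, and only stores, full-line searches. `SearchCache` and its integer pattern ids are hypothetical; the cache described above is keyed by `MatchPattern` pointer and stores regex regions rather than a plain span.

```rust
use std::collections::HashMap;

/// (start, end) byte offsets of a match.
type Span = (usize, usize);

/// Hypothetical per-line cache of full-line search results.
#[derive(Default)]
struct SearchCache {
    full_line: HashMap<usize, Option<Span>>,
}

impl SearchCache {
    /// Outer `None` means "no usable cache entry, run the regex".
    /// Inner `None` is a cached "no match" from a full-line search.
    fn lookup(&self, pattern: usize, search_end: usize, line_len: usize) -> Option<Option<Span>> {
        if search_end < line_len {
            // Truncated search: lookaheads at the boundary may behave
            // differently, so never reuse a full-line answer.
            return None;
        }
        self.full_line.get(&pattern).copied()
    }

    /// Store a result only when it was computed against the whole line.
    fn insert(&mut self, pattern: usize, search_end: usize, line_len: usize, result: Option<Span>) {
        if search_end == line_len {
            self.full_line.insert(pattern, result);
        }
    }
}
```

In the backtick example, the full-line "no match" for the `done` rule stays cached, but the clipped search inside the embed bypasses it and re-runs the regex.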

In a multi-context `set:` whose non-topmost target declares `clear_scopes: N` plus a `meta_content_scope`-only body (empty `meta_scope`), Sublime applies the Clear to atoms that earlier targets pushed via their `meta_scope` and the strip is visible to the trigger match's own scope/captures. Syntect was deferring the Clear to the non-initial phase, so the trigger token leaked the cleared atom even though body content saw it removed. Surfaces on Zsh glob-range openings inside `[ <1-2> ]` etc.: the `zsh-redirection-glob-range-begin` `set:` lists `string-path-pattern-body` (meta_scope `meta.string.glob.shell string.unquoted.shell`) before `zsh-redirection-glob-range-end` (`clear_scopes: 1` + `meta_content_scope: meta.range.shell.zsh`), and the `<` carries a capture scope asserted with `- string`. Drops the residual 10-char Zsh syntest failure on both backends.

Two-part guard against `branch_point` exhaustion collapsing a parent `meta_scope` one line boundary too early on empty lines:
1. In `parse_next_token`, a non-consuming `Branch` match that lands at or past the replay line's end is skipped when inside `replay_ctx`. Without this, the outer fail-replay would chain another `branch_point` at end-of-replay-line whose own cross-line exhaustion later attaches pops to the wrong line.
2. In `handle_fail`'s same-line path, when the rewind position is 0 of a purely empty line (length ≤ 1, just `\n`), advance the cursor to `line.len()`. The next-iteration `match: ''` of an `immediately-pop`-style alt then emits its scope pops past-EOL, which `ScopeRegionIterator` wraps onto the next line's baseline.
Together they make Markdown's non-terminated link reference definition keep `meta.link.reference.def.markdown` on the empty line between `blah` and the closing `text` paragraph, matching ST. Baseline: Markdown 1 → 0 (the `syntax_test_markdown.md` line drops from `known_syntest_failures{,_fancy}.txt`). No other rows change.

The harness's `SYNTAX_TEST_HEADER_PATTERN` restricted `testtoken_end` to punctuation glyphs (`*/`, `-->`, …), assuming alphabetic tails like `dmd`, `clojure`, or `dotnet run` were shebang-style instructions to ignore. ST disagrees: those tails *are* the closing testtoken, and ST clips each assertion line's selector at the first substring match. The D shebang test's ` #! <- keyword.operator.logical.d dmd` and the Clojure shebang's `<- comment.line.shebang.clojure …` both relied on that clipping; under the old regex `dmd` / `clojure` leaked into the selector and the assertions failed against scopes the parser had correct.
Two-part fix in `examples/syntest.rs`:
- Broaden the `testtoken_end` capture to the entire whitespace-stripped trailing tail (`\S(?:.*\S)?`), so multi-word tails like `dotnet run` also round-trip cleanly.
- Drop the `only_whitespace_after_token_end` gate. The Clojure case has `clojure` inside `comment.line.shebang.clojure`, so clipping succeeds but content follows the closing token; ST still treats the line as a pure assertion (with the clipped selector) rather than as source code, and so should we. The before-`testtoken_start` whitespace check alone is enough to reject the bash `: ${#^pat}` and `[ <<doc ] # <- ]` repros that motivated the gate.
Baseline drops both `syntax_test_shebang.d` and `syntax_test_shebang.clj` rows from `testdata/known_syntest_failures{,_fancy}.txt`. No other rows change. Stacked on trishume#677.

Pre-fix `recursively_mark_no_prototype` followed every `Push` / `Set` / `Branch` / `Embed` AND every nested `include` from the prototype's include chain unconditionally, marking every reachable context as "don't include the prototype". For Haskell that meant marking `function-name`, `variable-name`, and `variable-name-end` because of the chain
prototype → preprocessor-pragmas → push: preprocessor-pragma-body → embed: preprocessor-pragma-signature-value → include: functions → branch: variable-name, function-name → push: variable-name-end
With the prototype's `line-comments` rule no longer applied inside `variable-name-end`, the `(?=\S)` pop:2 rule fired on every `--` of the assertion-comment lines that sit between an infix operator declaration and its `:: a -> Bool` continuation. That popped the branch alternative off the stack mid-air, orphaned the `functions` branch_point, and prevented the `(?=::)` `fail: functions` rule from ever installing `meta.function.identifier.haskell` via cross-line replay. ST verified via `scope_at_test`: every position the harness flagged as wrong is `source.haskell meta.function.identifier.haskell …` in ST. The fix tracks a `via_push` flag through the recursion: includes are followed only while still in the prototype's include chain (`via_push: false`); once we've crossed a Push/Set/Branch/Embed we keep following further match-op targets but stop following the body's own `include:`s. That preserves the YAML and Lua cases (where prototype-pushed bodies chain via `set:` to other prototype-pushed bodies that DO need the no_prototype mark to break the loop: `property → property-body`, `line-doc-comment-body → maybe-line-doc-comment → line-doc-comment-body`) while keeping prototype attached to general code-parsing contexts that are merely included from a body for its local rule access. Baseline: `syntax_test_haskell.hs` 43 → 1 (just the orthogonal `variable.other..haskell` double-dot selector failure remains, fixed in the next commit). `syntax_test_java.java` 221 → 212 incidentally: the same underlying mechanism unmasked nine column-failures that the over-marking had been hiding. Stacked on trishume#678.

`Scope::new("variable.other..haskell")` (double dot from a typo or a test author writing `variable.other..haskell` to bypass ST's symbol-test heuristics) used to pack `""` as a real atom, producing a 4-atom scope `[variable, other, "", haskell]` that no longer prefix-matched the 3-atom `variable.other.haskell` it was meant to equal. ST's selector engine collapses runs of dots: `score_selector('variable.other..haskell', 'source.haskell variable.other.haskell')` returns 48, the same as the single-dot form. Mirror that by filtering empty segments in `ScopeRepository::build`. Symmetric: applies to both selector parsing in syntest assertions and to scope construction where a syntax accidentally has `scope: foo..bar`. Surfaces as the last `syntax_test_haskell.hs` failure (`syntax_test_haskell.hs:2348` line `:: a -> Bool`, `-- ^ variable.other..haskell` against scope `source.haskell variable.other.haskell`). Baseline: `syntax_test_haskell.hs` drops out of both `testdata/known_syntest_failures{,_fancy}.txt`. Java incidentally went from 221 to 212 with the prior commit's prototype-attachment fix; the new line is recorded here.
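
The empty-segment filter amounts to the following sketch. `atoms` is a hypothetical free function returning string slices; the real `ScopeRepository::build` packs atoms into a compact numeric representation instead.

```rust
/// Split a scope string into its atoms, skipping the empty segments
/// that a run of dots (e.g. `foo..bar`) would otherwise produce.
fn atoms(scope: &str) -> Vec<&str> {
    scope
        .split('.')
        .filter(|segment| !segment.is_empty())
        .collect()
}
```

Under this rule `variable.other..haskell` and `variable.other.haskell` yield the same three atoms, so prefix matching between them succeeds.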

Submodule moves from `1ba99a47` (`v4201-119-g1ba99a47`) to the shipped `v4202` tag (`91ad8085`, "[D, Makefile, Rust] Standardize build output scopes"). v4202 is the most recent stable release tag before the C# v2 migration `8621831d` and the regex embed grammar `c735169b`; pinning here keeps `regex_string` on the legacy `embed: scope:source.regexp; embed_scope: meta.string.cs meta.regexp.cs` form, sidestepping the wrapper-mcs divergence between syntect and ST DEV's renderer that produced the `syntax_test_C#11.cs: 35` baseline entry.
Compared with v4200 and v4204/v4205:
- v4200 requires regenerating `testdata/test4.html` against the older `Cargo.sublime-syntax` (pre-`91ad8085` scope rename); v4202 matches the existing fixture as-is.
- v4204/v4205 reintroduce the C#11 row (35) plus a `parser.rs::can_parse_preprocessor_rules` divergence from the C directive-scope refactor `44871676`.
Baseline movement: `make syntest` and `make syntest-fancy` both end clean ("No new failures!"). C#11 row drops (-35); Java row at 212 unchanged. Net -35 failures.
Companion fixes for v4202's older fixtures:
- `parsing::syntax_set::tests::can_load`: Rails `main`'s `context_iter` count drops from 185 to 184 (one context added upstream post-v4202).
- `parser.rs::push_meta_ops`: keep the auto-injected top-level scope across v2 set's cur.mcs Pop. The initial-phase Pop was popping `cur_context.meta_content_scope.len()` atoms at `match_start` so the matched text wouldn't see cur's `meta_content_scope`. That overcounts when cur is `main`: `add_initial_contexts` injects the syntax's top-level scope at `main.meta_content_scope[0]`, which ST keeps on the visible stack across the trigger (verified against ST 4200 stable on TOML's `[section]` rule, where the `[` trigger emits `source.toml` alongside `meta.section.toml`). Without this, the v4202-era `Rust/tests/syntax_test_frontmatter.{rs,md}` would fail at the `[section]` trigger position; the upstream fix `20212766` for the same divergence is post-v4202 and not in scope. Regression coverage in `v2_set_does_not_apply_parent_meta_content_scope_to_matched_text` still pins user-declared cur.mcs as popped.

Cross-line all-exhaustion in `handle_fail` advanced one char past the branch_point's lookahead, leaving the rest of the matched identifier to be reparsed without the branch_point in scope. The same-line arm already does the rewind+skipped_branches dance from f3e497a; extend it to the cross-line arm so the parent context's NEXT rule fires at the BP's match position. Drops Java syntest 212 → 119 (-93 char-assertions).

The three unique-line wins are `package apple dot` line 572, and the ` variable` after `import no.terminator` / `import static no.terminator` on lines 656 and 671: top-level-`java` cases where `declarations` exhausts and ST falls through to `else-expressions → expressions → constant-expressions → variables`.

Drops `outer_cross_line_replay_prefers_inner_correction`. The test was added in trishume#663 to guard the inner-correction-preference machinery under the path "outer `declarations` 0 → 1, inner `annotation-qualified-identifier` 0 → 1". Intervening parser fixes between trishume#663's baseline (9774) and current HEAD (212) shifted control flow so that the test's 3-line input now hits the cross-line all-exhaust path instead, with the outer cycling all 5 alts. The test's coverage of `prefer_inner_replay_corrections` was already lost before this change; deleting it reflects that. The current Java baseline failures still exercise the alt-N path through other inputs.

stefanobaghino · 2026-04-29T18:35:35Z

Cluster-2 / multigen16 investigation log

A separate investigation pass on a stacked branch tried several architectural directions to also fix cluster 2 (and unblock the meta.field.type.java regression at syntax_test_java.java:3436). All were blocked by the cascade nature of the duplication. Recording the attempts so the next investigator doesn't repeat them.

What the probe captured

Parsing syntax_test_java.java lines 2730–3460 with full tracing on, the cascade at parser line 702 (file 3432) fires five prefer_inner_replay_corrections calls. Three are skip cases under the depth gate; two are substitute cases:

| outer | inner | depth_diff | decision | outer.creator | inner.creator |
| --- | --- | --- | --- | --- | --- |
| `object-type(9)` | `qualified-object-type(10)` | 1 | substitute | `Some(4)` | `Some(4)` |
| `object-type(9)` | `qualified-object-type(10)` | 1 | substitute | `Some(4)` | `Some(4)` |
| `class-members(4)` | `object-type(9)` | 5 | skip (cluster-2 seat) | `None` | `Some(4)` |
| `object-type(11)` | `class-members(4)` | -7 | skip | `Some(4)` | `None` |
| `class-members(4)` | `object-type(11)` | 7 | skip (doubling seat) | `None` | `Some(4)` |

Cluster 2 (substitute wanted) and the multigen16 doubling (skip wanted) have identical lineage signatures — only inner.stack_depth differs (9 vs 11). No purely-local signal at the gate distinguishes them.
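For reference, the depth gate that produces the decisions in the table can be sketched as below. This is an illustration under assumptions, not the code in `prefer_inner_replay_corrections`: `GateDecision` and `depth_gate` are hypothetical names, and the sketch takes only the two stack depths as input.

```rust
/// Hypothetical decision type for the sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum GateDecision {
    Substitute,
    Skip,
}

/// Sketch of the tightened gate: substitute only when the inner branch point
/// sits at the same depth as the outer one or exactly one level deeper
/// (depth_diff in {0, 1}); skip everything else, including negative diffs.
fn depth_gate(outer_depth: usize, inner_depth: usize) -> GateDecision {
    // Signed difference, because inner can be shallower than outer.
    let depth_diff = inner_depth as i64 - outer_depth as i64;
    if depth_diff == 0 || depth_diff == 1 {
        GateDecision::Substitute
    } else {
        GateDecision::Skip
    }
}
```

Fed the depths from the table, `depth_gate(9, 10)` substitutes, while `depth_gate(4, 9)` and `depth_gate(4, 11)` both skip. That is exactly the problem: the cluster-2 seat and the doubling seat land on the same branch, so depth alone cannot recover cluster 2 without reopening the doubling.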

Directions tried, all reverted

Content discriminator — skip iff inner_count(s) > outer_count(s) > 0 for any scope s. Failed: the doubling at line 3436 is a cascade of earlier substitutions whose individual contents don't exhibit the duplication signature locally. Java syntest 117 → 424.
BP lineage — creator_bp_depth (stack_depth of branch_points.first() at creation). Both cluster 2 and the doubling case have inner.creator_bp_depth = Some(4) (the outer class-members(4) was active at inner's creation). 117 → 393 if used as override.
BP lineage — created_during_replay (replay_ctx.is_some() at creation). Both cases have it true. 117 → 423 if used as override.
Sibling-by-creator-match — substitute when outer.creator_bp_depth == inner.creator_bp_depth. No-op on the Java region: cluster 2 and doubling both have asymmetric creator depths (outer.creator=None, inner.creator=Some(4)), so the override never fires. 117 stays at 117.
merge_flushed semantics — keep prior when new is deeper. Idea: don't let late writes overwrite earlier-resolved refinements. Broke cross_line_pop_n_branch_point_alt_fail_unwinds_meta_scope (`@A.B\nclass E {}` annotation cleanup) — that test's correctness relies on a deeper alt's later write replacing the prior. The current replace_all / keep_prefix_replace_suffix rules are load-bearing for cross-line cases beyond cluster 2.
Output coalescing — per-line relative-stack Push(X) skip. Lifted the depth gate and scrubbed duplicates with a per-line tracker that skipped Push(X) when X was at top of the relative stack. Doesn't help: the bad Push fires after an intervening Clear(TopN(1)) / Restore sequence resets the per-line tracker, while the running stack underneath retains the original push (re-asserted by Restore). Pattern:
```
Push(type), Push(modifier), Pop(2)
Restore                                    ; running stack: [..., type, modifier]
Push(identifier), Clear(TopN(1))           ; running stack: [..., type, modifier]
Push(type)                                 ; bad duplicate — under modifier
```
Detecting the duplicate requires fully simulating Clear's snapshot and Restore — i.e. reimplementing the parser's stack machinery in the dedup pass. 117 + coalescing → still 424.
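The content discriminator from the first attempt in the list above (skip iff `inner_count(s) > outer_count(s) > 0` for any scope `s`) is simple to state in code, which makes its locality easy to see. The sketch below assumes each correction can be summarised as a flat list of pushed scope names; `scope_counts` and `content_says_skip` are hypothetical helpers, not functions in the parser.

```rust
use std::collections::HashMap;

/// Hypothetical helper: count how often each scope name occurs in a list.
fn scope_counts<'a>(scopes: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for &scope in scopes {
        *counts.entry(scope).or_insert(0) += 1;
    }
    counts
}

/// Sketch of the predicate: skip substitution when some scope that outer
/// already provides appears strictly more often in inner's content.
fn content_says_skip(outer: &[&str], inner: &[&str]) -> bool {
    let outer_counts = scope_counts(outer);
    let inner_counts = scope_counts(inner);
    inner_counts.iter().any(|(scope, &n_inner)| {
        let n_outer = outer_counts.get(*scope).copied().unwrap_or(0);
        n_inner > n_outer && n_outer > 0
    })
}
```

The predicate only sees one outer/inner pair at a time. As the log notes, the doubling at line 3436 is built up across several earlier substitutions, none of which shows the duplication signature on its own, so a per-pair check of this shape does not catch it.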

Findings summary

The doubling at syntax_test_java.java:3436 is a pre-existing cascade expressed via Restore re-pushing scopes that outer's resolved alt also pushes. Locally-correct gate decisions accumulate duplicate atoms across multiple merges.
Local signals (lineage, content, simple coalesce) can't see the cascade.
merge_flushed's last-write-wins semantics are load-bearing — naive inversions break unrelated cross-line cases.

Remaining viable directions (deferred)

Full stack simulation in coalescing — track Clear/Restore snapshots, dedup Push(X) when X is on the running stack. Invasive (reimplements the parser's stack semantics in the dedup pass).
Stack-relative inner corrections — change the parser to express inner's corrections as a delta-form so substitution doesn't reset/conflict with outer's stack ancestry. Architectural.
Syntax-definition-level rework — change Java.sublime-syntax to avoid the Clear/Restore interaction with the qualified-object-type's meta_scope. Out of syntect's scope.

Diagnostic infrastructure

Added but not included in this PR (kept on a stacked branch for future pickup, since it doesn't drive syntest down):

A tracing instrumentation pass with hierarchical span dumps via tracing-tree, covering BP creation/removal/retry, merge_flushed decisions with prior/new BP info, gate decisions, replay-iteration spans, op-content traces. Reproduces the cascade tree at a glance.
_probe_cluster2_and_doubling — an ignored test that parses lines 2730–3460 of syntax_test_java.java and dumps the trace; the table above came from this probe.
creator_bp_depth and created_during_replay fields on BranchPoint / BpInfo.

If picking this back up, that infrastructure is the right starting point.

…lings only The previous gate skipped substitution iff `inner.stack_depth > outer.stack_depth`, which collapsed two structurally distinct cases — sibling refinement (substitute needed, e.g. `outer=declarations(3)`, `inner=annotation-identifier(4)` on the cluster-1 input `@Anno\n.\nAnno\n(par=1)\nenum E {}`) and child-of-resolved-alt nesting (substitute must skip, the multigen16 doubling guarded by `deeper_inner_bp_correction_does_not_double_outer_meta_scope`). Tighten to `depth_diff in {0, 1}`. Java syntest 119 → 117. Adds `cross_line_all_exhaust_with_pop_count_emits_popped_meta_scope_pops` as a passing regression test for the cluster-1 input. Doubling guard stays green.

stefanobaghino added 30 commits April 26, 2026 08:24

Cover apply_prototype external-prototype precedence

69cd4b3

Strengthens the existing `apply_prototype_includes_external_prototype` from build-only to parse-and-assert. Adds precedence, opt-out, and HAML-Rails end-to-end guards alongside it in `src/parsing/syntax_set.rs`. Refs: trishume#631

Drop HAML-Rails from syntest baselines

347ea76

All 65 assertions in `syntax_test_rails.haml` pass after the apply_prototype ordering fix. Delta applies to both baselines.

Skip inner replay correction when nested deeper than outer BP

0a2139a

Drop python_strings.py from syntest baselines

61a796e

Resolved by per-depth clear_scopes Restore on pop:N + set:. Refs: trishume#631

Drop python.py from syntest baselines

d4b6a5b

stefanobaghino added 6 commits April 28, 2026 18:22

stefanobaghino force-pushed the 631-java-cross-line-allexhaust-pops branch from d7fde05 to 1a893f2 Compare April 29, 2026 18:32

stefanobaghino changed the title ~~Add ignored repros for syntax_test_java.java residual failures~~ Bound prefer_inner_replay_corrections by depth_diff to substitute siblings only Apr 29, 2026

stefanobaghino mentioned this pull request Apr 29, 2026

Track remaining syntest failures after #630 #631

Open

stefanobaghino force-pushed the 631-java-cross-line-allexhaust-pops branch from 1a893f2 to ae78419 Compare April 29, 2026 18:39

This was referenced Apr 29, 2026

Pop deeper popped meta_scope before pop:N+set: trigger token #683

Draft

Allow immediately-pop tail-extension from deeper inner BP correction #686

Draft

stefanobaghino wants to merge 37 commits into trishume:master from stefanobaghino:631-java-cross-line-allexhaust-pops
