fix: \G regex anchor — respect pos() in all match contexts#469
Merged
Conversation
Investigation of ./jcpan -t Perl::Tidy (v20260204) identified 5 blockers: 1. DESTROY singleton (Critical): Formatter and Tokenizer use closure counters decremented in DESTROY. Since PerlOnJava does not call DESTROY, the 2nd+ perltidy() call per process dies. Affects 36/44 test files (~555 subtests). Fix: 2-line overlay in Perl/Tidy.pm. 2. Option parsing (Moderate): perltidyrc string ref options (-dac, -bl) not applied. Possibly Getopt::Long negatable boolean handling. 3. Wide char alignment (Low): Unicode display width miscalculation. 4. EOL handling (Low): t/test-eol.t produces no output. 5. DEBUG output (Low): debugfile scalar ref returns undef. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two fixes for the \G assertion in the regex engine: 1. \G when pos() is undef: Previously, \G was only checked when pos() was defined (isPosDefined). When pos() was undef, the \G anchor check was skipped entirely, allowing the regex to scan forward and match at any position. Now \G always anchors at startPos (which defaults to 0 when pos() is undef), matching Perl behavior. 2. \G in non-/g matches: Previously, pos() was only looked up for /g matches. \G in non-/g matches (e.g. $str =~ /\Gfoo/) always started from position 0 regardless of pos(). Now pos() is looked up whenever \G is present in the pattern, matching Perl behavior where \G anchors at pos() even without /g. These fixes are critical for Perl::Tidy compatibility: - Fix 1 corrects option parsing (parse_args uses \G/gc tokenizer) - Fix 2 corrects the tokenizer signature detection (\G in non-/g) Perl::Tidy test results improve from 5/44 to 7/44 passing files. All remaining failures are DESTROY-related (singleton counter). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Updated plan doc to reflect the two \G regex fixes applied: 1. \G when pos() is undef - anchors at 0 instead of scanning forward 2. \G in non-/g matches - now respects pos() like Perl does Documented remaining DESTROY blocker as the sole remaining issue blocking 33+ test files (snippet tests, wide char, EOL). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The -T heuristic was too strict for UTF-8 files. It treated all bytes >= 128 as non-text, requiring > 70% printable ASCII to classify as text. Files with significant UTF-8 content (e.g. Cyrillic, Polish, German umlauts) were misclassified as binary, causing Perl::Tidy to skip them with "Non-text (override with -f)". Now matches Perl pp_fttext heuristic: - Valid UTF-8 multi-byte sequences are treated as text (not odd) - Invalid high-bit bytes are counted as odd - Binary if odd * 3 > length (same 1/3 threshold as Perl) Fixes testwide-passthrough.t and testwide-tidy.t file-to-file tests. All 48 subtests that run now pass (0 failures). Remaining 37/44 test file failures are purely DESTROY-related. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two fixes for the
\Gassertion in the regex engine, discovered while investigating Perl::Tidy test failures:\G when pos() is undef: Previously, the
\Ganchor check was skipped whenpos()was undefined, allowing\G(\s+)to scan forward and match at any position. Now\Ganchors at position 0 when pos is undef, matching Perl behavior.\G in non-/g matches: Previously,
pos()was only consulted for/gmatches.\Gin non-/g matches (e.g.$str =~ /\Gfoo/) always started from position 0 regardless ofpos(). Nowpos()is consulted whenever\Gis present in the pattern.Impact on Perl::Tidy
parse_argsuses\G/gcpatterns to tokenize option strings. Options like-dacwere silently dropped.$input_line =~ /\G\s*\(/(non-/g) was always matching at pos 0 instead of the current position.Perl::Tidy test results: 5/44 → 7/44 passing files. New passes: atee.t, filter_example.t, test_DEBUG.t.
Remaining 37 failures are all DESTROY-related (singleton counter not decremented).
Test plan
makepasses (all unit tests including new \G regression tests)regex_g_pos.t:Generated with Devin