diff --git a/dev/modules/perl_tidy.md b/dev/modules/perl_tidy.md new file mode 100644 index 000000000..c1bc211f0 --- /dev/null +++ b/dev/modules/perl_tidy.md @@ -0,0 +1,175 @@ +# Perl::Tidy Support Plan + +## Goal + +Make `./jcpan -t Perl::Tidy` run without errors on PerlOnJava. + +## Current Status + +**Version:** Perl-Tidy-20260204 (SHANCOCK/Perl-Tidy-20260204.tar.gz) +**Install:** Succeeds — 16 files installed to `~/.perlonjava/lib/` +**Tests:** 7/44 files pass, 37/44 fail + +### Test Results Summary (after \G fixes) + +| Category | Files | Result | +|----------|-------|--------| +| Passing | t/atee.t, t/filter_example.t, t/test.t, t/test_DEBUG.t, t/testsa.t, t/testss.t, t/zero.t | 7 OK | +| Snippet tests (DESTROY) | t/snippets1.t–t/snippets33.t (33 files) | 33 FAIL | +| Wide char tests (DESTROY) | t/testwide.t (2/3 pass), t/testwide-passthrough.t (2/6), t/testwide-tidy.t (2/6) | 3 FAIL | +| EOL tests (DESTROY) | t/test-eol.t (1/4 pass) | 1 FAIL | + +### Progress Tracking + +| Date | Milestone | Tests Passing | +|------|-----------|---------------| +| 2025-04-08 | Initial investigation | 5/44 | +| 2025-04-09 | \G regex fixes (pos undef + non-/g) | 7/44 | + +## Fixes Applied + +### Fix 1: \G Regex Anchor — pos() undef case (2025-04-09) + +**File:** `src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java` (line 651) + +**Problem:** When `pos()` was undef, the `\G` anchor check was skipped entirely +(`if (regex.useGAssertion && isPosDefined && matcher.start() != startPos)`). +This allowed `\G(\s+)` to scan forward and match whitespace anywhere in the +string, even though `\G` should anchor at position 0 when pos is undef. + +**Impact:** Perl::Tidy's `parse_args` function uses `\G/gc` patterns to +tokenize option strings. With broken `\G`, options like `-dac` were silently +dropped, causing t/atee.t to fail. + +**Fix:** Removed `isPosDefined` from the condition. When pos is undef, +`startPos` defaults to 0, so `\G` correctly anchors at 0. + +### Fix 2: \G in Non-/g Matches (2025-04-09) + +**File:** `src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java` (line 607) + +**Problem:** `pos()` was only looked up for `/g` matches. In Perl, `\G` +should anchor at `pos()` even in non-`/g` matches (e.g. `$str =~ /\Gfoo/`). +PerlOnJava was ignoring pos entirely for non-/g matches containing `\G`. + +**Impact:** Perl::Tidy's tokenizer uses `\G` in non-/g matches for signature +detection (line 10060: `$input_line =~ /\G\s*\(/`). Without this fix, +subroutine signatures like `sub foo($bar, %opts)` were misidentified as +prototypes, causing t/filter_example.t to fail. + +**Fix:** Changed the pos() lookup condition from `isGlobalMatch()` to +`isGlobalMatch() || useGAssertion`, so pos is looked up whenever `\G` is +present in the pattern. + +## Remaining Blocker: Missing DESTROY (33+ test files) + +**Symptom:** +``` +Attempt to create more than 1 object in Perl::Tidy::Formatter, which is not a true class yet + at .../Perl/Tidy/Formatter.pm line 1108. +``` + +This error kills the 2nd (and all subsequent) calls to `perltidy()` within a +single process. Since each snippet test file calls `perltidy()` 8–20 times in +a loop, only the first test passes per file. + +**Root cause:** `Perl::Tidy::Formatter` and `Perl::Tidy::Tokenizer` use +closure-scoped instance counters that are incremented in `new()` and +decremented in `DESTROY()`. PerlOnJava does not call `DESTROY`, so the counter +never resets to 0. + +**Formatter.pm singleton pattern:** +```perl +{ ## begin closure to count instances + my $_count = 0; + sub _increment_count { return ++$_count } + sub _decrement_count { return --$_count } +} + +sub DESTROY { + my $self = shift; + _decrement_count(); + return; +} + +sub new { + ... + if ( _increment_count() > 1 ) { + confess "Attempt to create more than 1 object..."; + } + ... +} +``` + +`Perl::Tidy::Tokenizer` has an identical pattern (lines 271–284, guard at +line 676). + +**Other DESTROY methods in Perl::Tidy:** 10 other classes have empty DESTROY +methods (only to prevent AUTOLOAD dispatch) — these are safe with missing +DESTROY. Only `Formatter` and `Tokenizer` have functional DESTROY code. + +**Impact:** ~555 subtests across 33+ test files never run. + +**Fix (Bundled overlay — Perl/Tidy.pm):** + +Patch `Perl::Tidy.pm`'s `perltidy()` function to explicitly call +`_decrement_count()` on Formatter and Tokenizer before returning. This +compensates for the missing DESTROY call with a 2-line surgical change. + +In `Perl/Tidy.pm`, add before the final return in `perltidy()` (~line 1395): +```perl +# PerlOnJava: DESTROY not called on JVM — manually reset singleton counters +Perl::Tidy::Formatter::_decrement_count(); +Perl::Tidy::Tokenizer::_decrement_count(); +``` + +**Effort:** Low — 2 lines added to one file. + +## Implementation Plan + +### Phase 1: Fix DESTROY Singleton (unblocks ~555 subtests) + +1. **Create bundled overlay** of `Perl/Tidy.pm` + - Copy upstream `Perl/Tidy.pm` (v20260204) to `src/main/perl/lib/Perl/Tidy.pm` + - Add `_decrement_count()` calls before `perltidy()`'s return points + - Mark changes with `# PerlOnJava:` comments + - Store diff in `dev/patches/cpan/Perl-Tidy-20260204/` + +2. **Verify:** Re-run `./jcpan -t Perl::Tidy` — expect snippet tests to + progress past first test case in each file. + +3. **Run `make`** — ensure no regressions in PerlOnJava's own tests. + +### Phase 2: Wide Character Alignment (nice to have) + +1. **Investigate** string width computation for Unicode characters +2. May require changes to PerlOnJava's `length()` or Perl::Tidy's alignment code +3. **Verify:** t/testwide.t, t/testwide-passthrough.t, t/testwide-tidy.t + +## Expected Results After Phase 1 + +With the DESTROY fix alone, the test results should improve dramatically: + +| Before | After (estimated) | +|--------|-------------------| +| 7/44 files pass | ~38/44 files pass | +| 4/53 subtests fail | TBD (most snippet tests should fully pass) | +| Result: FAIL | Closer to PASS | + +## Dependency on Other Work + +- **DESTROY implementation** (`dev/design/destroy_weaken_plan.md`): Would fix + this and all other DESTROY-dependent CPAN modules generically. However, it's + a large project. The targeted Perl::Tidy.pm overlay is the pragmatic + short-term fix. +- **Perl::Critic** (`dev/modules/perl_critic.md`): Already installed + (99.9% pass rate). Its `RequireTidyCode` policy fails because Perl::Tidy's + `Formatter::initialize_self_vars` exceeds the JVM 255-argument method + signature limit. That issue is separate from the test suite failures + documented here. + +## Related Documents + +- [cpan_patch_plan.md](cpan_patch_plan.md) — CPAN patching strategy (Option A: Bundled Overlays) +- [perl_critic.md](perl_critic.md) — Perl::Critic support (uses Perl::Tidy optionally) +- `dev/design/destroy_weaken_plan.md` — DESTROY/weaken implementation design diff --git a/src/main/java/org/perlonjava/core/Configuration.java b/src/main/java/org/perlonjava/core/Configuration.java index 3e56195dd..ad63a9f40 100644 --- a/src/main/java/org/perlonjava/core/Configuration.java +++ b/src/main/java/org/perlonjava/core/Configuration.java @@ -33,7 +33,7 @@ public final class Configuration { * Automatically populated by Gradle/Maven during build. * DO NOT EDIT MANUALLY - this value is replaced at build time. */ - public static final String gitCommitId = "6858b39e6"; + public static final String gitCommitId = "35c9ee0d5"; /** * Git commit date of the build (ISO format: YYYY-MM-DD). @@ -48,7 +48,7 @@ public final class Configuration { * Parsed by App::perlbrew and other tools via: perl -V | grep "Compiled at" * DO NOT EDIT MANUALLY - this value is replaced at build time. */ - public static final String buildTimestamp = "Apr 9 2026 10:09:12"; + public static final String buildTimestamp = "Apr 9 2026 12:17:38"; // Prevent instantiation private Configuration() { diff --git a/src/main/java/org/perlonjava/runtime/operators/FileTestOperator.java b/src/main/java/org/perlonjava/runtime/operators/FileTestOperator.java index 03dacada5..82163d26a 100644 --- a/src/main/java/org/perlonjava/runtime/operators/FileTestOperator.java +++ b/src/main/java/org/perlonjava/runtime/operators/FileTestOperator.java @@ -777,23 +777,67 @@ private static RuntimeScalar isTextOrBinaryFromHandle(CustomFileChannel cfc, boo /** * Common heuristic for text/binary detection. + * Matches Perl's pp_fttext heuristic from pp_sys.c: + * - "odd" chars = null bytes, invalid high-bit bytes, and control chars + * (0-31) except \n, \r, \t, \b (8), \f (12), and ESC (27) + * - Valid UTF-8 multi-byte sequences are treated as text (not odd) + * - File is binary if odd * 3 > length (more than 1/3 odd characters) */ private static RuntimeScalar analyzeTextBinary(byte[] buffer, int length, boolean checkForText) { - int textChars = 0; - int totalChars = 0; + int odd = 0; for (int i = 0; i < length; i++) { - if (buffer[i] == 0) { - return checkForText ? scalarFalse : scalarTrue; // Binary file - } - if ((buffer[i] >= 32 && buffer[i] <= 126) || buffer[i] == '\n' || buffer[i] == '\r' || buffer[i] == '\t') { - textChars++; + int b = buffer[i] & 0xFF; // treat as unsigned + if (b == 0) { + // Null byte — immediately binary + return checkForText ? scalarFalse : scalarTrue; + } else if (b >= 128) { + // Check if this starts a valid UTF-8 multi-byte sequence + int seqLen = utf8SequenceLength(buffer, i, length); + if (seqLen > 1) { + // Valid UTF-8 sequence — skip remaining bytes, not odd + i += seqLen - 1; + } else { + // Invalid high-bit byte — odd + odd++; + } + } else if (b < 32 + && b != '\n' && b != '\r' && b != '\t' + && b != '\b' && b != '\f' && b != 27) { + // Control characters (except common whitespace and ESC) are "odd" + odd++; } - totalChars++; } - double textRatio = (double) textChars / totalChars; - return getScalarBoolean(checkForText ? textRatio > 0.7 : textRatio <= 0.7); + // Perl: odd * 3 > len means binary + boolean isBinary = odd * 3 > length; + return getScalarBoolean(checkForText ? !isBinary : isBinary); + } + + /** + * Determines the length of a valid UTF-8 sequence starting at the given position. + * Returns the sequence length (2-4) if valid, or 1 if invalid. + */ + private static int utf8SequenceLength(byte[] buffer, int pos, int length) { + int b = buffer[pos] & 0xFF; + int seqLen; + + if ((b & 0xE0) == 0xC0) { + seqLen = 2; // 110xxxxx — 2-byte sequence + } else if ((b & 0xF0) == 0xE0) { + seqLen = 3; // 1110xxxx — 3-byte sequence + } else if ((b & 0xF8) == 0xF0) { + seqLen = 4; // 11110xxx — 4-byte sequence + } else { + return 1; // Not a valid UTF-8 start byte + } + + // Verify continuation bytes (10xxxxxx) + if (pos + seqLen > length) return 1; // Not enough bytes + for (int j = 1; j < seqLen; j++) { + if ((buffer[pos + j] & 0xC0) != 0x80) return 1; // Invalid continuation + } + return seqLen; } /** diff --git a/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java b/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java index e4fc5171e..6a1800b28 100644 --- a/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java +++ b/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java @@ -598,12 +598,13 @@ private static RuntimeBase matchRegexDirect(RuntimeScalar quotedRegex, RuntimeSc // hexPrinter(inputStr); - // Only look up pos() for /g matches - non-/g matches always start from 0 + // Look up pos() for /g matches and for non-/g matches that use \G. + // In Perl, \G anchors at pos() even in non-/g matches (e.g. $str =~ /\Gfoo/). RuntimeScalar posScalar = null; boolean isPosDefined = false; int startPos = 0; - if (regex.regexFlags.isGlobalMatch()) { + if (regex.regexFlags.isGlobalMatch() || regex.useGAssertion) { // Use RuntimePosLvalue to get the current position posScalar = RuntimePosLvalue.pos(string); isPosDefined = posScalar.getDefinedBoolean(); @@ -611,7 +612,7 @@ private static RuntimeBase matchRegexDirect(RuntimeScalar quotedRegex, RuntimeSc // Check if previous call had zero-length match at this position (for SCALAR context) // This prevents infinite loops in: while ($str =~ /pat/g) - if (ctx == RuntimeContextType.SCALAR) { + if (regex.regexFlags.isGlobalMatch() && ctx == RuntimeContextType.SCALAR) { String patternKey = regex.patternString; if (RuntimePosLvalue.hadZeroLengthMatchAt(string, startPos, patternKey)) { // Previous match was zero-length at this position - fail to break loop @@ -647,7 +648,8 @@ private static RuntimeBase matchRegexDirect(RuntimeScalar quotedRegex, RuntimeSc try { while (matcher.find()) { // If \G is used, ensure the match starts at the expected position - if (regex.useGAssertion && isPosDefined && matcher.start() != startPos) { + // When pos() is undefined, \G anchors at position 0 (startPos defaults to 0) + if (regex.useGAssertion && matcher.start() != startPos) { break; } diff --git a/src/test/resources/unit/regex/regex_g_pos.t b/src/test/resources/unit/regex/regex_g_pos.t index 9d637f5e4..6550d4f82 100644 --- a/src/test/resources/unit/regex/regex_g_pos.t +++ b/src/test/resources/unit/regex/regex_g_pos.t @@ -79,6 +79,82 @@ $pattern = qr/\G(\d{3})/; # Use a capture group # Non-global match should not use \G $string =~ /$pattern/; ok(!($1 ne '123'), 'Non-global match does not use \\G, matched \'123\''); +################### +# \G anchoring when pos() is undefined +# \G should anchor at position 0 when pos is undef, not scan forward + +# \G(\s+) should NOT match "-dac -tac" at pos 0 (no space at pos 0) +my $cfg = "-dac -tac"; +if ($cfg =~ /\G(\s+)/gc) { + ok(0, '\\G(\\s+) should not match when no space at pos 0'); +} else { + ok(1, '\\G(\\s+) correctly fails when pos is undef and no space at pos 0'); +} + +# \G([a-z]+) should NOT match "-dac -tac" at pos 0 (dash at pos 0) +pos($cfg) = undef; +if ($cfg =~ /\G([a-z]+)/gc) { + ok(0, '\\G([a-z]+) should not match when no letter at pos 0'); +} else { + ok(1, '\\G([a-z]+) correctly fails when pos is undef and no letter at pos 0'); +} + +# \G(-) SHOULD match "-dac -tac" at pos 0 (dash at pos 0) +pos($cfg) = undef; +if ($cfg =~ /\G(-)/gc) { + ok($1 eq '-' && pos($cfg) == 1, '\\G(-) correctly matches dash at pos 0'); +} else { + ok(0, '\\G(-) should match dash at pos 0'); +} + +# Simulate parse_args pattern: multiple \G/gc alternations on same string +pos($cfg) = undef; +my @tokens; +my $part = ""; +while (1) { + if ($cfg =~ /\G([\"\'])/gc) { + # quote + } + elsif ($cfg =~ /\G(\s+)/gc) { + push @tokens, $part if length($part); + $part = ""; + } + elsif ($cfg =~ /\G(.)/gc) { + $part .= $1; + } + else { + push @tokens, $part if length($part); + last; + } +} +ok(scalar(@tokens) == 2 && $tokens[0] eq '-dac' && $tokens[1] eq '-tac', + '\\G/gc tokenizer correctly splits "-dac -tac" into two tokens'); + +################### +# \G in non-/g matches should still anchor at pos() +# This is used by Perl::Tidy's tokenizer for signature detection + +my $line = "sub foo(\$bar) { }"; +pos($line) = 7; # after "sub foo" +ok($line =~ /\G\s*\(/, '\\G in non-/g match anchors at pos() - matches ( at pos 7'); + +pos($line) = 7; +ok(!($line =~ /\Gx/), '\\G in non-/g match anchors at pos() - fails when char does not match'); + +# \G in non-/g should not change pos() +pos($line) = 7; +$line =~ /\G\s*\(/; +ok(pos($line) == 7, '\\G non-/g match does not change pos()'); + +# \G with capture in non-/g match +my $data = "hello world"; +pos($data) = 6; +if ($data =~ /\G(\w+)/) { + ok($1 eq 'world', '\\G non-/g match with capture works at pos 6'); +} else { + ok(0, '\\G non-/g match with capture should match at pos 6'); +} + ################### # End of Perl `pos` and `\G` Tests