diff --git a/dev/modules/list_moreutils.md b/dev/modules/list_moreutils.md
new file mode 100644
index 000000000..6f377e61c
--- /dev/null
+++ b/dev/modules/list_moreutils.md
@@ -0,0 +1,238 @@
+# List::MoreUtils - Fix Plan
+
+Target: make every subtest in `./jcpan -t List::MoreUtils` (v0.430) pass.
+
+Initial run (master): `Files=61, Tests=4492`, with 7 failing subtests and 2 aborted scripts across 8 test files.
+
+```
+t/pureperl/binsert.t     Failed test 19 (is_dying)
+t/pureperl/bremove.t     Failed test 415 (is_dying)
+t/pureperl/indexes.t     Failed test 18 (is_dying)
+t/pureperl/mesh.t        Failed test 7 (is_dying)
+t/pureperl/zip6.t        Failed test 5 (is_dying)
+t/pureperl/mode.t        Failed tests 2, 4 (lorem mode)
+t/pureperl/minmaxstr.t   aborts at line 14: Undefined subroutine &POSIX::setlocale
+t/pureperl/part.t        aborts at line 84: Global symbol "@long_list" requires explicit package name
+```
+
+## Root causes
+
+All failures are PerlOnJava bugs (the test code is correct and the module's XS/PP code is pristine). There are four distinct root causes, each mapping to one or more failing tests.
+
+### RC1 - Strict-refs not enforced on numeric-valued scalars
+
+`RuntimeScalar.arrayDeref()` in `src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java` silently returns an empty `RuntimeArray` when `type == INTEGER` or `DOUBLE`. That branch was added to keep `1->[0]` quiet for literal constants, but it is also hit for plain scalar variables that happen to hold a number.
+
+```perl
+use strict;
+my $x = 1;
+my @a = @$x;      # perl:  dies "Can't use string (\"1\") as an ARRAY ref"
+                  # jperl: silently returns ()
+my $m = $#$x;     # perl:  dies with same message
+                  # jperl: returns -1
+```
+
+`hashDeref()` has a related defect: it does die, but with `Not a HASH reference` instead of the strict-refs error.
+
+Literal constants like `1->[0]` are handled by `RuntimeScalarReadOnly`, which overrides `arrayDeref()` / `hashDeref()`, so fixing the base-class method only affects non-readonly values.
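The split between the base-class method and the read-only override described above can be pictured with a toy model. This is a minimal sketch under stated assumptions, not PerlOnJava code: `SketchScalar`, `SketchReadOnlyScalar`, and `StrictRefsSketch` are hypothetical stand-ins for `RuntimeScalar` / `RuntimeScalarReadOnly`, and `List<Object>` stands in for `RuntimeArray`.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every class here is a hypothetical stand-in, not PerlOnJava code.
class StrictRefsSketch {
    // Returns the error message, or null when the dereference stays silent.
    static String tryArrayDeref(SketchScalar scalar) {
        try {
            scalar.arrayDeref();
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Plain variable holding 1 (`my $x = 1; @$x`): prints the strict-refs message.
        System.out.println(tryArrayDeref(new SketchScalar(1)));
        // Literal constant (`1->[0]`): the override keeps it silent, so this prints "null".
        System.out.println(tryArrayDeref(new SketchReadOnlyScalar(1)));
    }
}

// Stand-in for the base class after the proposed fix.
class SketchScalar {
    final long value;

    SketchScalar(long value) {
        this.value = value;
    }

    // A plain numeric scalar is not a reference, so dereferencing it dies.
    List<Object> arrayDeref() {
        throw new IllegalStateException(
                "Can't use string (\"" + value + "\") as an ARRAY ref while \"strict refs\" in use");
    }
}

// Stand-in for a read-only literal such as the `1` in `1->[0]`.
class SketchReadOnlyScalar extends SketchScalar {
    SketchReadOnlyScalar(long value) {
        super(value);
    }

    // The override returns an empty array, so the base-class change never reaches literals.
    @Override
    List<Object> arrayDeref() {
        return new ArrayList<>();
    }
}
```

Because dispatch is virtual, changing only the base-class arm leaves the literal path untouched, which is the property the paragraph above relies on.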
+
+Affected tests (all call something like `&foo(42, ...)`; the `&` prefix disables the prototype, so the number reaches `@$_` / `$#$_` / `$_->()` inside the PP implementation):
+
+- `binsert.t` test 19 - `&binsert(42, @even)` → `lower_bound` → `$_[0]->()` (numeric used as CODE ref)
+- `bremove.t` test 415 - `&bremove(42, ...)` → same
+- `indexes.t` test 18 - `&indexes(42, 4711)` → `$test->()` with numeric
+- `mesh.t` test 7 - `&mesh(1, 2)` → `$#$_`
+- `zip6.t` test 5 - `&zip6(1, 2)` → `$#$_`
+
+Parallel defect in `scalarDeref()` / code-ref deref: `my $x = 42; $x->();` currently throws `Not a CODE reference` (`my $x = "42"` → `Undefined subroutine &main::42`) where Perl says:
+
+```
+Can't use string ("42") as a subroutine ref while "strict refs" in use
+```
+
+We should align both the message and the dying behaviour, so that dereferencing a numeric scalar always dies and the `is_dying()` assertions pass. Five of the seven failing subtests collapse into this one fix once the numeric path dies correctly (even with a slightly different message, because `is_dying` only checks for any exception).
+
+### RC2 - `my @var = ... for LIST;` not visible in enclosing block
+
+`part.t` line 83:
+
+```perl
+my @long_list = int rand(1000) for 0 .. 1E7;
+my @part = part { $_ == 1E7 and die "Too much!"; ($_ % 10) * 2 } @long_list;
+```
+
+In real Perl, a `my` declared inside a statement-modifier loop is visible for the rest of the enclosing block (the `for` modifier re-executes the whole statement, but the `my` declaration still belongs to the surrounding scope). PerlOnJava's parser scopes the `my` only to the modifier's statement, so the second line sees an undeclared `@long_list` and `use strict` bails out at compile time.
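The scoping difference can be illustrated with a toy symbol table. This is only a sketch under stated assumptions: `HoistSketch`, `Scope`, and `laterUseResolves` are hypothetical names that model lexical lookup in general, not PerlOnJava's parser or its scope classes.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: a hypothetical symbol table, not PerlOnJava's parser.
class HoistSketch {
    // A lexical scope: the names it declares plus a link to the enclosing scope.
    static final class Scope {
        final Scope parent;
        final Set<String> names = new HashSet<>();

        Scope(Scope parent) {
            this.parent = parent;
        }

        void declare(String name) {
            names.add(name);
        }

        // A name resolves if this scope or any enclosing scope declares it.
        boolean resolves(String name) {
            for (Scope s = this; s != null; s = s.parent) {
                if (s.names.contains(name)) {
                    return true;
                }
            }
            return false;
        }
    }

    // Models `my @long_list = ... for LIST;` followed by a later use of
    // @long_list in the enclosing block. Returns whether that later use resolves.
    static boolean laterUseResolves(boolean hoistIntoEnclosingScope) {
        Scope enclosing = new Scope(null);
        Scope modifierStatement = new Scope(enclosing);
        if (hoistIntoEnclosingScope) {
            // Perl's rule: the name belongs to the enclosing block.
            enclosing.declare("@long_list");
        }
        // The `my` as seen while compiling the modified statement itself.
        modifierStatement.declare("@long_list");
        return enclosing.resolves("@long_list");
    }

    public static void main(String[] args) {
        // false: the next statement would fail with "Global symbol ..." under strict.
        System.out.println("statement-scoped: " + laterUseResolves(false));
        // true: the next statement compiles.
        System.out.println("hoisted: " + laterUseResolves(true));
    }
}
```

Lookup only walks outward, so a declaration made in the inner scope is invisible to the following statement unless the name is also declared in the enclosing one.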
+
+### RC3 - `POSIX::setlocale` / `POSIX::localeconv` are declared but not implemented
+
+`src/main/perl/lib/POSIX.pm` exports `setlocale`, `localeconv`, `LC_ALL`, `LC_COLLATE`, … but there is no implementation (neither a Perl `sub` in POSIX.pm nor a Java binding in `org/perlonjava/perlmodule/Posix.java`). Any call raises `Undefined subroutine &POSIX::setlocale`.
+
+Affected test: `minmaxstr.t` (aborts at line 14).
+
+### RC4 - `split /(?:\b|\s)/` leaves whitespace inside fields
+
+```perl
+split /(?:\b|\s)/, "Lorem ipsum,";
+# perl:  ("Lorem", "", "ipsum", ",")
+# jperl: ("Lorem", " ", "ipsum", ",")
+```
+
+When the split pattern is an alternation of a zero-width assertion (`\b`) and a consuming class (`\s`), jperl's split consumes the zero-width match, advances one position, then re-matches the next `\b` past the whitespace, so the space ends up inside a field instead of being treated as a separator. Real perl advances past both the zero-width match and the following consuming match.
+
+Affected test: `mode.t` tests 2 and 4 (the Lorem sample builds a roughly 720-word list via `split /(?:\b|\s)/`; expected mode is `(106, ',')`; we get `(11, …)`).
+
+## Implementation plan
+
+Work in branch `fix/list-moreutils`. Commit each root cause separately so a regression can be bisected.
+
+### Step 1 - RC1: strict refs on numeric scalar dereference
+
+Files:
+
+- `src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java`
+  - `arrayDeref()`: change `INTEGER` / `DOUBLE` arms to throw `Can't use string ("N") as an ARRAY ref while "strict refs" in use`, where `N` is the scalar's value.
+  - `hashDeref()`: change `INTEGER` / `DOUBLE` arms to throw `Can't use string ("N") as a HASH ref while "strict refs" in use`.
+  - Code-ref path (`codeDeref` / CODE invocation on non-reference scalar): unify the error message to the strict-refs form for `INTEGER` / `DOUBLE` / `STRING` / `BYTE_STRING`.
+- `RuntimeScalarReadOnly` already overrides `arrayDeref()` and `hashDeref()` to return empty / throw `Not a HASH reference`, so `1->[0]` and `1->{a}` keep their current behaviour.
+
+Verification:
+
+```
+./jperl -e 'use strict; my $x=1; my @a=@$x'   # must die
+./jperl -e 'use strict; my $x=1; my %h=%$x'   # must die
+./jperl -e 'use strict; my $x=1; $x->()'      # must die
+./jperl -e 'print 1->[0], 1->{a}'             # must remain silent
+./jperl -e 'my $x=1; my @a=@$x'               # no strict, still works (non-strict path)
+./jperl t/op/* ...                            # make test-unit
+```
+
+Expected to flip green: `binsert.t`, `bremove.t`, `indexes.t`, `mesh.t`, `zip6.t`.
+
+### Step 2 - RC3: POSIX::setlocale / POSIX::localeconv / LC_* constants
+
+Scope: minimal, enough to satisfy `minmaxstr.t` and any similar locale-using tests. We do not need a real locale; `setlocale(LC_COLLATE, "C")` must return a truthy string (`"C"`) and not raise.
+
+Plan:
+
+- Add stubs to `src/main/perl/lib/POSIX.pm`:
+  - `sub setlocale { my ($cat, $loc) = @_; defined $loc ? $loc : 'C' }`
+  - `sub localeconv { +{ decimal_point => '.', thousands_sep => '', grouping => '' } }`
+  - `sub LC_ALL () { 0 }`
+  - `sub LC_COLLATE () { 1 }`
+  - `sub LC_CTYPE () { 2 }`
+  - `sub LC_MONETARY () { 3 }`
+  - `sub LC_NUMERIC () { 4 }`
+  - `sub LC_TIME () { 5 }`
+  - `sub LC_MESSAGES () { 6 }`
+  - Values only matter as unique integers; they're opaque category handles to callers.
+- Document in `docs/FEATURE_MATRIX.md` that locale support is stubbed (no real locale switching).
+
+Verification:
+
+```
+./jperl -MPOSIX=setlocale,LC_COLLATE -e 'print setlocale(LC_COLLATE, "C"), "\n"'
+```
+
+### Step 3 - RC2: `my` in statement-modifier loop
+
+The fix is in the parser: when a statement is followed by a `for` / `foreach` / `while` / `until` / `if` / `unless` modifier, any `my` inside the statement body still belongs to the enclosing block, not a new scope introduced by the modifier.
+
+Plan:
+
+1. Find the statement-modifier transformation site in `org.perlonjava.parser` (likely `StatementParser`). It rewrites `EXPR for LIST;` into a `for`-loop AST.
+2. Ensure the `my` variable's `OperatorNode("my", ...)` keeps its original lexical scope (the enclosing block), not the new synthetic for-loop body. Concretely: do not wrap the `EXPR` in a new block; or if a block wrapper is introduced, mark `my` declarations as hoisted to the enclosing scope.
+3. Diff against a minimal reproducer:
+   ```perl
+   use strict;
+   my @x = 1 for 1..3;
+   print scalar @x, "\n";   # must compile and print 0 (the outer @x stays empty), not error
+   ```
+
+Verification: `./jperl` runs the reproducer, `part.t` gets past line 84, full `make test-unit` still passes.
+
+Risk: this touches a very hot parser path. Add a focused unit test under `src/test/resources/unit/` covering `my`/`our`/`local` with each modifier (`for`, `foreach`, `while`, `until`, `if`, `unless`).
+
+### Step 4 - RC4: split with `\b` alternation
+
+Location: `org.perlonjava.operators.RegexOperators.split` (or wherever `split` is dispatched). The regex engine itself is fine for `m//g`; the problem is split-specific handling of zero-width matches mixed with consuming matches.
+
+Plan:
+
+1. Write a minimal reproducer comparison harness:
+   ```perl
+   for my $s ("Lorem ipsum,", "a b c", "foo-bar") {
+       my @a = split /(?:\b|\s)/, $s;
+       print join("|", map "[$_]", @a), "\n";
+   }
+   ```
+   Capture expected output from system perl.
+2. Identify how PerlOnJava's split loop advances `pos` after a match. The bug signature is: after a zero-width match at position P, the next search must begin at P+1 **without** re-matching a zero-width assertion at P+1 that overlaps the just-skipped character; if such a match happens, the consumed character between P and P+1 must still be treated as a separator (emit an empty field) rather than ending up inside the next field.
+3. The usual Perl semantics are: after a zero-width match at P, try again at P+1 with the constraint that a zero-width match at P+1 is allowed, *but* the text between P and the next match start is the next field's contents. In perl, when the next match at P+1 is also zero-width, the field between is empty. The bug is probably that PerlOnJava emits a field of length 1 (the skipped char) instead of length 0.
+4. Fix the split loop's field-boundary computation.
+
+Verification:
+
+- Small harness matches system perl for a set of cases (including `split / /`, `split /\s/`, `split /\b/`, `split /(?:\b|\s)/`, `split //`, `split /(\s)/` with capture).
+- `mode.t` passes.
+- `make test-unit` still green.
+
+### Step 5 - Re-run the full distribution tests
+
+```
+./jcpan -t List::MoreUtils
+```
+
+All 4492 subtests must pass. Rerun `make` to ensure no unit-test regressions.
+
+## Progress tracking
+
+### Current status
+
+`./jcpan -t List::MoreUtils` goes from 7 failing subtests (8 test files) on master down to **1 failing subtest (`indexes.t` test 18) in 1 test file**, and that remaining failure is deferred to the parallel weaken branch (see RC6 below).
+
+### Completed phases
+
+- [x] **RC1** - numeric scalar strict-refs (2026-04-20) - commit `db94a5ae1`
+  - `RuntimeScalar.arrayDeref()` / `hashDeref()` now throw `Can't use string ("N") as an ARRAY|HASH ref while "strict refs" in use` for `INTEGER` / `DOUBLE`.
+  - `RuntimeScalarReadOnly` gains the same behavior for read-only scalars holding numeric values, while keeping `1->[0]` / `1->{a}` silent via new `arrayDerefGet` / `hashDerefGet` overrides.
+  - Fixes: `binsert.t`, `bremove.t`, `mesh.t`, `zip6.t` (4 tests).
+- [x] **RC3** - POSIX stubs (2026-04-20) - commit `a161fa284`
+  - Adds `setlocale`, `localeconv`, `LC_ALL` / `LC_COLLATE` / `LC_CTYPE` / `LC_MONETARY` / `LC_NUMERIC` / `LC_TIME` / `LC_MESSAGES` as Perl stubs in `src/main/perl/lib/POSIX.pm`.
+  - Fixes: `minmaxstr.t`.
+- [x] **RC2** - `my` hoist in statement-modifier loops (2026-04-20) - commit `3bfaffda3`
+  - `StatementResolver.parseStatement` for `for` / `foreach` / `while` / `until` now detects `my DECL = RHS for LIST` / `my DECL = RHS while COND` and emits a bare `my DECL;` in the enclosing scope. For `while` / `until` the loop body is also wrapped in a BlockNode so the inner `my` shadows the outer one on each iteration (matching perl: the outer variable stays empty/undef).
+  - Fixes: compile-time `Global symbol @long_list …` error in `part.t`.
+- [x] **RC4** - split with zero-width vs consuming alternation (2026-04-20) - commit `c9b8e05dd`
+  - `Operator.split` now probes each zero-width match via `Matcher.matches()` on progressively larger regions. When a consuming alternative also matches at the same offset, an extra empty field is emitted between the two separators and the consumed characters are skipped, matching perl's `REG_NOTEMPTY_ATSTART` retry loop.
+  - Fixes: `mode.t` tests 2 and 4.
+- [x] **RC5** - undef-as-subscript warning (2026-04-20) - commit `96c4f92d5`
+  - `RuntimeArray.get` / `RuntimeHash.get` now emit `Use of uninitialized value in array|hash element` (category `uninitialized`) when called with an `UNDEF` index. This was exposed after RC2 unblocked the later leak-free tests in `part.t`.
+  - Fixes: `part.t` tests 12 and 13.
+
+### Deferred
+
+- **RC6 - `Scalar::Util::weaken` on a reference to a temporary** (`indexes.t` test 18). The test does
+  ```perl
+  my $ref = \(indexes(sub { 1 }, 123));
+  Scalar::Util::weaken($ref);
+  is($ref, undef, "weakened away");
+  ```
+  In real perl the temporary returned by `indexes(...)` has a refcount of 1 held by `$ref`; weakening that ref drops the refcount to 0 and the temporary is freed, so `$ref` becomes undef. PerlOnJava's cooperative-refcount overlay (see `dev/architecture/weaken-destroy.md`) only tracks objects blessed into a class with `DESTROY`. For an unblessed numeric scalar like this one, weaken transitions it to `WEAKLY_TRACKED` but does not clear the weak ref at scope exit because we can't distinguish "last strong ref was this one" from "symbol table still holds a ref" without full refcounting. This is a known architectural limitation being addressed on a separate branch; this PR does not touch it.
+
+### Final summary
+
+- `binsert.t` ok
+- `bremove.t` ok
+- `mesh.t` ok
+- `zip6.t` ok
+- `minmaxstr.t` ok
+- `mode.t` ok
+- `part.t` ok
+- `indexes.t` - 1 subtest still fails (RC6, deferred to weaken branch)
+
+Run `./jcpan -t List::MoreUtils` and observe: `Files=61, Tests=4533. Failed 1/61 test programs. 1/4533 subtests failed.` (was 8 failing test programs / 7 failing subtests before this branch).
+
+### Open questions
+- None open for RC1–RC5.
+- RC6 is tracked on the separate weaken branch.
+
diff --git a/src/main/java/org/perlonjava/core/Configuration.java b/src/main/java/org/perlonjava/core/Configuration.java
index 408428e8e..65ac902be 100644
--- a/src/main/java/org/perlonjava/core/Configuration.java
+++ b/src/main/java/org/perlonjava/core/Configuration.java
@@ -33,7 +33,7 @@ public final class Configuration {
      * Automatically populated by Gradle/Maven during build.
      * DO NOT EDIT MANUALLY - this value is replaced at build time.
      */
-    public static final String gitCommitId = "4b2bcf8dd";
+    public static final String gitCommitId = "c9b8e05dd";
 
     /**
      * Git commit date of the build (ISO format: YYYY-MM-DD).
@@ -48,7 +48,7 @@ public final class Configuration {
      * Parsed by App::perlbrew and other tools via: perl -V | grep "Compiled at"
      * DO NOT EDIT MANUALLY - this value is replaced at build time.
      */
-    public static final String buildTimestamp = "Apr 20 2026 16:15:32";
+    public static final String buildTimestamp = "Apr 20 2026 17:15:38";
 
     // Prevent instantiation
     private Configuration() {
diff --git a/src/main/java/org/perlonjava/frontend/parser/StatementResolver.java b/src/main/java/org/perlonjava/frontend/parser/StatementResolver.java
index 89eaa52fe..ee0dbfb4a 100644
--- a/src/main/java/org/perlonjava/frontend/parser/StatementResolver.java
+++ b/src/main/java/org/perlonjava/frontend/parser/StatementResolver.java
@@ -755,6 +755,15 @@ yield dieWarnNode(parser, "die", new ListNode(List.of(
                     }
                 }
 
+                // Hoist 'my' declarations from the loop-body expression, too.
+                // "my @long_list = EXPR for LIST" declares @long_list in the
+                // enclosing scope in perl. Each iteration still gets a fresh
+                // @long_list (the inner `my` shadows the outer one), so the
+                // outer value remains empty; we must not transform the body
+                // itself. Emitting a bare `my @long_list;` before the loop is
+                // enough to make the name visible in the enclosing scope.
+                Node bodyHoistedMyDecl = hoistMyFromAssignment(expression);
+
                 // Statement modifier for loop: EXPR for LIST
                 // $_ is global, so needs array-of-alias and local wrapping
                 Node varNode = scalarUnderscore(parser);
@@ -769,14 +778,22 @@ yield dieWarnNode(parser, "die", new ListNode(List.of(
                             new OperatorNode("local", varNode, parser.tokenIndex),
                             forNode
                     ), parser.tokenIndex);
-                    if (hoistedMyDecl != null) {
-                        yield new ListNode(List.of(hoistedMyDecl, result), parser.tokenIndex);
+                    if (hoistedMyDecl != null || bodyHoistedMyDecl != null) {
+                        java.util.List<Node> hoisted = new java.util.ArrayList<>();
+                        if (bodyHoistedMyDecl != null) hoisted.add(bodyHoistedMyDecl);
+                        if (hoistedMyDecl != null) hoisted.add(hoistedMyDecl);
+                        hoisted.add(result);
+                        yield new ListNode(hoisted, parser.tokenIndex);
                     }
                     yield result;
                 }
                 Node result = new For1Node(null, false, varNode, modifierExpression, expression, null, parser.tokenIndex);
-                if (hoistedMyDecl != null) {
-                    yield new ListNode(List.of(hoistedMyDecl, result), parser.tokenIndex);
+                if (hoistedMyDecl != null || bodyHoistedMyDecl != null) {
+                    java.util.List<Node> hoisted = new java.util.ArrayList<>();
+                    if (bodyHoistedMyDecl != null) hoisted.add(bodyHoistedMyDecl);
+                    if (hoistedMyDecl != null) hoisted.add(hoistedMyDecl);
+                    hoisted.add(result);
+                    yield new ListNode(hoisted, parser.tokenIndex);
                 }
                 yield result;
             }
@@ -794,12 +811,28 @@ yield dieWarnNode(parser, "die", new ListNode(List.of(
                     // Executes the loop at least once
                     if (CompilerOptions.DEBUG_ENABLED) parser.ctx.logDebug("do-while " + expression);
                 }
-                yield new For3Node(null,
+                // Hoist 'my' from the loop body, same as the for/foreach modifier.
+                // Skip for do-while: the `my` is inside an explicit BLOCK so its
+                // scope is already what perl expects.
+                // If we hoist, wrap the body in a BlockNode so the inner `my`
+                // shadows the outer hoisted one, matching perl's behavior where
+                // the outer variable stays untouched while each iteration
+                // creates a fresh instance.
+                Node bodyHoistedMyDecl = isDoWhile ? null : hoistMyFromAssignment(expression);
+                Node body = expression;
+                if (bodyHoistedMyDecl != null) {
+                    body = new BlockNode(java.util.List.of(expression), parser.tokenIndex);
+                }
+                Node result = new For3Node(null,
                         false, null, modifierExpression,
-                        null, expression, null,
+                        null, body, null,
                         isDoWhile, false,
                         parser.tokenIndex);
+                if (bodyHoistedMyDecl != null) {
+                    yield new ListNode(java.util.List.of(bodyHoistedMyDecl, result), parser.tokenIndex);
+                }
+                yield result;
             }
 
             default -> {
@@ -1004,6 +1037,27 @@ public static void parseStatementTerminator(Parser parser) {
         }
     }
 
+    /**
+     * If {@code expression} is an assignment whose left side is a `my`
+     * declaration (e.g. {@code my $x = EXPR} or {@code my @w = LIST}), return
+     * the bare `my` declaration node. Callers use this to hoist the `my` out
+     * of a statement-modifier body so the variable is visible in the enclosing
+     * scope, matching perl's semantics for
+     * {@code my @x = EXPR for LIST} / {@code my $x = EXPR while COND}.
+     *
+     * Returns null if no hoist is applicable. Does not mutate {@code expression};
+     * the caller is responsible for rewriting it.
+     */
+    private static Node hoistMyFromAssignment(Node expression) {
+        if (expression instanceof BinaryOperatorNode assignNode
+                && assignNode.operator.equals("=")
+                && assignNode.left instanceof OperatorNode myNode
+                && myNode.operator.equals("my")) {
+            return myNode;
+        }
+        return null;
+    }
+
     /**
      * Handle statement modifiers (if/unless) with my/our/state declarations.
      * For "my $x = EXPR if COND", the variable must be declared even when condition is false.
diff --git a/src/main/java/org/perlonjava/runtime/operators/Operator.java b/src/main/java/org/perlonjava/runtime/operators/Operator.java
index 4d48a26d9..0cdb44801 100644
--- a/src/main/java/org/perlonjava/runtime/operators/Operator.java
+++ b/src/main/java/org/perlonjava/runtime/operators/Operator.java
@@ -146,19 +146,20 @@ public static RuntimeList split(RuntimeScalar quotedRegex, RuntimeList args, int
 
         try {
             while (matcher.find() && (limit <= 0 || splitCount < limit - 1)) {
+                int matchStart = matcher.start();
+                int matchEnd = matcher.end();
+
                 // Add the part before the match
-                // System.out.println("matcher lastend " + lastEnd + " start " + matcher.start() + " end " + matcher.end() + " length " + inputStr.length());
-                if (lastEnd == 0 && matcher.end() == 0) {
-                    // if (lastEnd == 0 && matchStr.isEmpty()) {
+                // System.out.println("matcher lastend " + lastEnd + " start " + matchStart + " end " + matchEnd + " length " + inputStr.length());
+                if (lastEnd == 0 && matchEnd == 0) {
                     // A zero-width match at the beginning of EXPR never produces an empty field
-                    // System.out.println("matcher skip first");
-                } else if (matcher.start() == matcher.end() && matcher.start() == lastEnd) {
+                } else if (matchStart == matchEnd && matchStart == lastEnd) {
                     // Skip consecutive zero-width matches at the same position
                     // This handles patterns like / */ that can match zero spaces
                     continue;
                 } else {
-                    splitElements.add(new RuntimeScalar(inputStr.substring(lastEnd, matcher.start())));
+                    splitElements.add(new RuntimeScalar(inputStr.substring(lastEnd, matchStart)));
                 }
 
                 // Add captured groups if any (but skip code block captures)
@@ -183,8 +184,40 @@ public static RuntimeList split(RuntimeScalar quotedRegex, RuntimeList args, int
                     }
                 }
 
-                lastEnd = matcher.end();
+                lastEnd = matchEnd;
                 splitCount++;
+
+                // Perl's split re-runs the regex at matchEnd with
+                // REG_NOTEMPTY_ATSTART after a zero-width match, so a
+                // consuming alternative at the same position counts as an
+                // additional separator (producing an empty field between
+                // the two separators). Without this, e.g.
+                // `split /(?:\b|\s)/, "Lorem ipsum"` loses the empty field
+                // that should appear between the `\b` and `\s` matches at
+                // offset 5, and the space leaks into the next field when
+                // Java's matcher auto-advances.
+                //
+                // Java regex always tries alternation left-to-right, so a
+                // pattern like `(?:\b|\s)` returns the zero-width `\b`
+                // match even when `\s` could have consumed a character.
+                // To find a consuming alternative we use `matches()` on
+                // progressively larger regions starting at matchEnd: the
+                // shortest region the whole pattern consumes is the
+                // length of the consuming alternative.
+                if (matchStart == matchEnd
+                        && matchEnd < inputStr.length()
+                        && (limit <= 0 || splitCount < limit - 1)) {
+                    int consumedEnd = findConsumingMatch(pattern, inputStr, matchEnd);
+                    if (consumedEnd > matchEnd) {
+                        // Emit the (empty) field between the two separators
+                        splitElements.add(new RuntimeScalar(""));
+                        lastEnd = consumedEnd;
+                        splitCount++;
+                        // Advance the primary matcher past the consumed
+                        // region so its next find() doesn't re-match inside.
+                        matcher.region(lastEnd, inputStr.length());
+                    }
+                }
             }
         } catch (RegexTimeoutException e) {
             WarnDie.warn(new RuntimeScalar(e.getMessage() + "\n"), RuntimeScalarCache.scalarEmptyString);
@@ -250,6 +283,32 @@ public static RuntimeList split(RuntimeScalar quotedRegex, RuntimeList args, int
         return result;
     }
 
+    /**
+     * After a zero-width match at {@code pos}, return the end offset of the
+     * shortest non-zero-width match of {@code pattern} starting exactly at
+     * {@code pos}, or {@code pos} if no consuming match exists there.
+     *
+     * Java's Matcher always tries alternation left-to-right, so a pattern like
+     * {@code (?:\b|\s)} returns the zero-width {@code \b} branch even when
+     * {@code \s} could have consumed a character at the same position. We use
+     * {@link Matcher#matches()} on progressively larger regions starting at
+     * {@code pos}: the pattern matches the whole region only when some
+     * alternative consumes exactly that many characters. This approximates
+     * perl's {@code REG_NOTEMPTY_ATSTART} retry that split uses after each
+     * zero-width match.
+     */
+    private static int findConsumingMatch(Pattern pattern, String input, int pos) {
+        int max = Math.min(input.length() - pos, 64);
+        Matcher probe = pattern.matcher(input);
+        for (int len = 1; len <= max; len++) {
+            probe.region(pos, pos + len);
+            if (probe.matches()) {
+                return pos + len;
+            }
+        }
+        return pos;
+    }
+
     /**
      * Extracts a substring from a given RuntimeScalar based on the provided offset and length.
      * This method mimics Perl's substr function, handling negative offsets and lengths.
diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeArray.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeArray.java
index 9c0629370..4c8201ef7 100644
--- a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeArray.java
+++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeArray.java
@@ -5,6 +5,8 @@
 import java.util.List;
 import java.util.Stack;
 
+import org.perlonjava.runtime.operators.WarnDie;
+
 import static org.perlonjava.runtime.runtimetypes.RuntimeScalarCache.*;
 import static org.perlonjava.runtime.runtimetypes.RuntimeScalarType.TIED_SCALAR;
 
@@ -586,6 +588,16 @@ public RuntimeScalar get(int index) {
      */
     public RuntimeScalar get(RuntimeScalar value) {
 
+        // Perl warns "Use of uninitialized value in array element" whenever an
+        // undef value is used as an array subscript (both read and lvalue/
+        // autoviv use this path). Matches perl's behavior under `use warnings`.
+        if (value != null && value.type == RuntimeScalarType.UNDEF) {
+            WarnDie.warnWithCategory(
+                    new RuntimeScalar("Use of uninitialized value in array element"),
+                    RuntimeScalarCache.scalarEmptyString,
+                    "uninitialized");
+        }
+
         if (this.type == TIED_ARRAY) {
             int idx = value.getInt();
             Integer outOfRangeOriginal = null;
diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeHash.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeHash.java
index df1085a3a..38cd69e34 100644
--- a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeHash.java
+++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeHash.java
@@ -391,6 +391,15 @@ public RuntimeScalar get(String key) {
      * @return The value associated with the key, or a proxy for lazy autovivification if the key does not exist.
      */
     public RuntimeScalar get(RuntimeScalar keyScalar) {
+        // Perl warns "Use of uninitialized value in hash element" whenever an
+        // undef value is used as a hash key (both read and lvalue/autoviv use
+        // this path). Matches perl's behavior under `use warnings`.
+        if (keyScalar != null && keyScalar.type == RuntimeScalarType.UNDEF) {
+            org.perlonjava.runtime.operators.WarnDie.warnWithCategory(
+                    new RuntimeScalar("Use of uninitialized value in hash element"),
+                    RuntimeScalarCache.scalarEmptyString,
+                    "uninitialized");
+        }
         return switch (this.type) {
             case PLAIN_HASH, AUTOVIVIFY_HASH -> {
                 // Note: get() does not autovivify the hash, so we don't call AutovivificationHash.vivify()
diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java
index 83e5ffbf7..25e46d5fc 100644
--- a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java
+++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java
@@ -1398,12 +1398,12 @@ public RuntimeArray arrayDeref() {
         // Cases 0-11 are listed in order from RuntimeScalarType, and compile to fast tableswitch
         return switch (type) {
             case INTEGER -> // 0
-                // For numeric constants (like 1->[0]), return an empty array
-                // This matches Perl's behavior where 1->[0] returns undef without error
-                    new RuntimeArray();
+                // Under strict refs, dereferencing a non-readonly numeric scalar as an ARRAY ref
+                // is a strict-refs violation. (Read-only constants like `1->[0]` take the
+                // RuntimeScalarReadOnly.arrayDerefGet() override instead and stay quiet.)
+                    throw new PerlCompilerException("Can't use string (\"" + this + "\") as an ARRAY ref while \"strict refs\" in use");
             case DOUBLE -> // 1
-                // For numeric constants (like 1->[0]), return an empty array
-                    new RuntimeArray();
+                    throw new PerlCompilerException("Can't use string (\"" + this + "\") as an ARRAY ref while \"strict refs\" in use");
             case STRING -> // 2
                     throw new PerlCompilerException("Can't use string (\"" + this + "\") as an ARRAY ref while \"strict refs\" in use");
             case BYTE_STRING -> // 3
@@ -1501,9 +1501,12 @@ public RuntimeHash hashDeref() {
         // Cases 0-11 are listed in order from RuntimeScalarType, and compile to fast tableswitch
         return switch (type) {
             case INTEGER -> // 0
-                    throw new PerlCompilerException("Not a HASH reference");
+                // Under strict refs, dereferencing a non-readonly numeric scalar as a HASH ref
+                // is a strict-refs violation. (Read-only constants like `1->{a}` take the
+                // RuntimeScalarReadOnly.hashDerefGet() override instead.)
+                    throw new PerlCompilerException("Can't use string (\"" + this + "\") as a HASH ref while \"strict refs\" in use");
             case DOUBLE -> // 1
-                    throw new PerlCompilerException("Not a HASH reference");
+                    throw new PerlCompilerException("Can't use string (\"" + this + "\") as a HASH ref while \"strict refs\" in use");
             case STRING -> // 2
                 // Strict refs violation: attempting to use a string as a hash ref
                     throw new PerlCompilerException("Can't use string (\"" + this + "\") as a HASH ref while \"strict refs\" in use");
diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalarReadOnly.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalarReadOnly.java
index 3a46554e6..ce9278c61 100644
--- a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalarReadOnly.java
+++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalarReadOnly.java
@@ -1,5 +1,9 @@
 package org.perlonjava.runtime.runtimetypes;
 
+import static org.perlonjava.runtime.runtimetypes.RuntimeScalarType.BYTE_STRING;
+import static org.perlonjava.runtime.runtimetypes.RuntimeScalarType.DOUBLE;
+import static org.perlonjava.runtime.runtimetypes.RuntimeScalarType.INTEGER;
+import static org.perlonjava.runtime.runtimetypes.RuntimeScalarType.STRING;
 import static org.perlonjava.runtime.runtimetypes.RuntimeScalarType.UNDEF;
 
 /**
@@ -168,19 +172,57 @@ public RuntimeHash hashDeref() {
         if (this.type == UNDEF) {
             throw new PerlCompilerException("Can't use an undefined value as a HASH reference");
         }
+        // Dereferencing a read-only scalar whose content is a non-reference value
+        // (e.g. `for my $x (1) { %$x }`) is a strict-refs violation in perl.
+        // Literal arrow dereferences like `1->{a}` bypass this via the
+        // hashDerefGet() override below.
+        if (this.type == INTEGER || this.type == DOUBLE
+                || this.type == STRING || this.type == BYTE_STRING) {
+            throw new PerlCompilerException("Can't use string (\"" + this + "\") as a HASH ref while \"strict refs\" in use");
+        }
         throw new PerlCompilerException("Can't use value as a HASH reference");
     }
 
+    /**
+     * `1->{a}` style arrow dereference on a read-only literal number stays silent
+     * to match perl's behavior for compile-time numeric literals.
+     */
+    @Override
+    public RuntimeScalar hashDerefGet(RuntimeScalar index) {
+        if (this.type == INTEGER || this.type == DOUBLE) {
+            return new RuntimeScalar();
+        }
+        return super.hashDerefGet(index);
+    }
+
     @Override
     public RuntimeArray arrayDeref() {
         if (this.type == UNDEF) {
             throw new PerlCompilerException("Can't use an undefined value as an ARRAY reference");
         }
-        // For non-reference values (like constants), return an empty array
-        // This matches Perl's behavior where 1->[0] returns undef without error
+        // Dereferencing a read-only scalar whose content is a non-reference value
+        // (e.g. `for my $x (1) { @$x }`) is a strict-refs violation in perl.
+        // Literal arrow dereferences like `1->[0]` bypass this via the
+        // arrayDerefGet() override below.
+        if (this.type == INTEGER || this.type == DOUBLE
+                || this.type == STRING || this.type == BYTE_STRING) {
+            throw new PerlCompilerException("Can't use string (\"" + this + "\") as an ARRAY ref while \"strict refs\" in use");
+        }
         return new RuntimeArray();
     }
 
+    /**
+     * `1->[0]` style arrow dereference on a read-only literal number stays silent
+     * to match perl's behavior for compile-time numeric literals.
+     */
+    @Override
+    public RuntimeScalar arrayDerefGet(RuntimeScalar index) {
+        if (this.type == INTEGER || this.type == DOUBLE) {
+            return new RuntimeScalar();
+        }
+        return super.arrayDerefGet(index);
+    }
+
     @Override
     public RuntimeArray arrayDerefNonStrict(String packageName) {
         // Don't call vivify() for read-only scalars
diff --git a/src/main/perl/lib/POSIX.pm b/src/main/perl/lib/POSIX.pm
index bbaf9605c..9f855bb7b 100644
--- a/src/main/perl/lib/POSIX.pm
+++ b/src/main/perl/lib/POSIX.pm
@@ -274,6 +274,47 @@ sub getegid { POSIX::_getegid() }
 sub setuid { POSIX::_setuid(@_) }
 sub setgid { POSIX::_setgid(@_) }
 
+# Locale support (stubbed: PerlOnJava does not switch C library locales,
+# but many modules rely on these existing and being callable).
+sub LC_ALL ()      { 0 }
+sub LC_COLLATE ()  { 1 }
+sub LC_CTYPE ()    { 2 }
+sub LC_MONETARY () { 3 }
+sub LC_NUMERIC ()  { 4 }
+sub LC_TIME ()     { 5 }
+sub LC_MESSAGES () { 6 }
+
+sub setlocale {
+    my ($category, $locale) = @_;
+    # Returning the requested locale (or the current/default one) is enough
+    # for callers that use setlocale() purely for its return value, e.g.
+    # `setlocale(LC_COLLATE, "C")`.
+    return defined $locale ? $locale : 'C';
+}
+
+sub localeconv {
+    return {
+        decimal_point     => '.',
+        thousands_sep     => '',
+        grouping          => '',
+        int_curr_symbol   => '',
+        currency_symbol   => '',
+        mon_decimal_point => '',
+        mon_thousands_sep => '',
+        mon_grouping      => '',
+        positive_sign     => '',
+        negative_sign     => '-',
+        int_frac_digits   => -1,
+        frac_digits       => -1,
+        p_cs_precedes     => -1,
+        p_sep_by_space    => -1,
+        n_cs_precedes     => -1,
+        n_sep_by_space    => -1,
+        p_sign_posn       => -1,
+        n_sign_posn       => -1,
+    };
+}
+
 # User/Group functions
 sub getpwnam { POSIX::_getpwnam(@_) }
 sub getpwuid { POSIX::_getpwuid(@_) }