diff --git a/dev/modules/smoke_test_investigation.md b/dev/modules/smoke_test_investigation.md new file mode 100644 index 000000000..ee4fd00dc --- /dev/null +++ b/dev/modules/smoke_test_investigation.md @@ -0,0 +1,376 @@ +# CPAN Smoke Test — Failure Investigation Plan + +## Overview + +This document tracks test failures across all modules in the CPAN smoke test +(`dev/tools/cpan_smoke_test.pl`). The goal is to systematically investigate +failures, identify root causes in PerlOnJava, and prioritize fixes that unblock +the most modules. + +**Run the smoke test**: `perl dev/tools/cpan_smoke_test.pl` +**Quick regression check**: `perl dev/tools/cpan_smoke_test.pl --quick` + +## Module Status Summary + +| Category | Count | Description | +|----------|-------|-------------| +| known-good | 11 | All or nearly all tests pass | +| partial | 19 | Installs but has test failures | +| blocked | 6 | Cannot install or test due to missing deps | + +## Priority: Shared Root Causes + +These are PerlOnJava issues that affect multiple modules. Fixing them has +the highest impact. + +### P1: Clone::PP missing + +**Affects**: HTTP::Message, LWP::UserAgent, Plack (indirectly) + +Clone.pm exists in PerlOnJava but falls back to Clone::PP which doesn't exist. +Creating a pure-Perl Clone::PP that implements `clone()` via Storable's +`dclone` would unblock the entire HTTP stack. + +**Fix**: Create `src/main/perl/lib/Clone/PP.pm` with: +```perl +package Clone::PP; +use Storable 'dclone'; +sub clone { Storable::dclone($_[0]) } +1; +``` + +**Unblocks**: HTTP::Message → LWP::UserAgent → Plack → Devel::Cover (partially) + +### P2: MIME::Base64 $VERSION undef + +**Affects**: MIME::Base64, HTTP::Message (version check fails) + +The Java XS implementation in `MIMEBase64.java` doesn't set `$VERSION`. +Modules that require `MIME::Base64 >= 2.1` fail at version check. + +**Fix**: Set `$MIME::Base64::VERSION` in the Java backend or in a wrapper .pm. + +### P3: Encode::Alias — find_encoding bypassed Perl alias system (DONE) + +**Affects**: Encode::Locale, HTTP::Message, LWP::UserAgent + +Java `Encode.find_encoding()` completely bypassed the Perl `Encode::Alias` system, +only checking hardcoded `CHARSET_ALIASES` and `Charset.forName()`. + +**Fix**: Modified bundled `src/main/perl/lib/Encode.pm` to wrap `find_encoding` +with Perl-level alias resolution. Removed `"Encode"` from `preloadedModules` in +`GlobalContext.java` so XSLoader executes the .pm file. The .pm saves a reference +to the Java `find_encoding` before overriding it with a wrapper that tries Java +first, then falls back to `Encode::Alias::find_alias()`. Also added `encodings()` +method to `Encode.java`. + +### P4: File::Temp close / PerlIO::encoding stub + +**Affects**: IO::HTML, File::Temp-dependent modules + +File::Temp objects don't properly handle `close` via AUTOLOAD, and +PerlIO::encoding has no stub. + +**Fix**: Add PerlIO::encoding stub; review File::Temp close delegation. + +### P5: Exit code handling (DONE) + +**Affects**: Test::Needs (now 227/227 — PASS), potentially other test harness modules + +`WarnDie.exit()` called `runEndBlocks()` which reset `$?` to 0, ignoring END +block modifications. Also, `require` error messages didn't include +"Compilation failed" prefix. + +**Fix**: Set `$?` before END blocks, use `runEndBlocks(false)` to skip reset, +read `$?` back after END blocks for final exit code. Fixed in `WarnDie.java` +and `ModuleOperators.java`. Test::Needs now passes all 227 tests. + +### P6: Regex engine — `\|` quantifier error in alternations (DONE) + +**Affects**: Template Toolkit (was 0/247, now 170/2072 with full test suite) + +Two bugs fixed in `RegexPreprocessor.java`: +1. **Escaped pipe `\|` with quantifier**: Replaced brittle `sb` character inspection + with `lastWasQuantifiable` flag check. Removed duplicate `lastChar == '|'` blocks. +2. **Lookaheads/lookbehinds/atomic groups**: `(?=...)`, `(?!...)`, `(?<=...)`, `(?...)`, `(?:...)` were routed through `handleRegularParentheses` which only + appended `(` and parsed from `offset+1`, causing `?` to be treated as quantifier. + Fixed by appending full group opener and calling `handleRegex` with correct offset. + +### P7: HTML::Parser/HTML::Entities — Java XS backend (DONE — Phase 1) + +**Affects**: HTML::Parser, HTTP::Message, Devel::Cover, any module using HTML::Entities + +HTML::Parser is an XS module. HTML::Entities requires `HTML::Parser` to provide +`decode_entities` and `_decode_entities` as XS-accelerated functions. + +**Fix**: Created `HTMLParser.java` — a single Java XS file matching the original +`Parser.xs` layout with both `PACKAGE = HTML::Parser` and `PACKAGE = HTML::Entities`. + +Phase 1 provides: +- **HTML::Entities** (fully functional): `decode_entities`, `_decode_entities` (with + numeric decimal/hex, named entities, surrogate pairs, prefix expansion), + `UNICODE_SUPPORT`, `_probably_utf8_chunk` +- **HTML::Parser** (basic): `_alloc_pstate`, `parse`/`eof` with basic event-driven + parsing, 13 boolean accessors, `handler` registration, tag list methods + +Cross-package registration uses direct `GlobalVariable.getGlobalCodeRef()` calls +since `registerMethod` prefixes with the module name. + +**Results**: HTTP::Message → PASS, Devel::Cover → PASS, HTML::Parser 190/415 + +**Phase 2** (future): Full hparser.c port (~1900 lines), argspec compilation, +array-ref accumulator handlers (needed for TokeParser/PullParser). + +### P8: IO::Compress::Gzip — Java backend (NOT STARTED) + +**Affects**: IO::Compress chain, modules needing gzip support + +Needs `Compress::Raw::Zlib` (C library). Could implement via `java.util.zip`. + +## Per-Module Investigation + +### Partial Modules (install OK, tests partially fail) + +#### Try::Tiny — 91/94 + +**Known issue**: 3 tests fail due to fork (not available in PerlOnJava). +**Action**: None needed — expected failure. Could mark as known-good with a +note. + +#### Test::Warn — partial + +**Known issue**: Dep test issues with Sub::Uplevel. +**Investigation**: Run `./jcpan -t Test::Warn` and capture failing subtests. +Check if Sub::Uplevel's stack manipulation works. + +#### Path::Tiny — 1489/1542 + +**Known issue**: 53 failing subtests. +**Investigation**: Run tests individually to categorize failures: +- File permission tests (chmod/chown)? +- Symlink tests? +- Encoding-related? +- Temp file handling? + +#### Parse::RecDescent — untested via jcpan + +**Known issue**: Heavy use of `eval STRING` and complex regex. +**Investigation**: Run `./jcpan -t Parse::RecDescent` and record results. + +#### Spreadsheet::WriteExcel — untested via jcpan + +**Investigation**: Run `./jcpan -t Spreadsheet::WriteExcel` and record results. +Deps (OLE::Storage_Lite, Parse::RecDescent) should already be installable. + +#### Image::ExifTool — 590/600 (98%) + +**Known issue**: 10 failing tests across Writer.t, XMP.t, Geotag.t, PDF.t, etc. +See `.cognition/skills/debug-exiftool/SKILL.md` for detailed analysis. +**Action**: Low priority — already very high pass rate. + +#### MIME::Base64 — partial + +**Known issues**: $VERSION undef, missing url-safe variants. +**Fix**: See P2 above. Also implement `encode_base64url`/`decode_base64url`. + +#### URI — 896/947 + +**Known issues**: UTF-8 encoding differences. +**Investigation**: Categorize failures by URI subtype (URI::http, URI::data, +URI::_query, etc.) and encoding edge cases. + +#### IO::HTML — 32/52 + +**Known issues**: File::Temp close, encoding name differences. +**Fix**: See P4 above. Also check encoding name normalization +(e.g., "utf-8-strict" vs "utf-8"). + +#### LWP::MediaTypes — 41/47 + +**Known issues**: MIME type differences. +**Investigation**: Compare MIME type database — may be version/platform +differences in the type mappings. + +#### Test::Needs — 227/227 (PASS) + +**Fixed**: Exit code handling (P5). All tests pass now. + +#### Test::Warnings — 86/88 + +**Known issues**: 2 failing subtests. +**Investigation**: Identify which 2 tests fail and whether they relate to +warning propagation or $SIG{__WARN__} handling. + +#### Encode::Locale — partial + +**Known issues**: Cannot determine system locale encoding. +**Fix**: See P3 above. + +#### Log::Log4perl — partial + +**Known issues**: Mostly works. See `dev/modules/log4perl-compatibility.md`. +**Investigation**: Run full test suite and record specific failures. + +#### JSON — CONFIG_FAIL + +**Root cause**: Bundled `src/main/perl/lib/JSON.pm` was missing `$VERSION`. +CPAN's JSON-4.11 `Makefile.PL` does `use JSON` and checks `$JSON::VERSION`, +which returned undef causing a fatal "version check failed" error. +**Fix**: Added `our $VERSION = '4.11'` to bundled JSON.pm. (DONE) + +#### Type::Tiny — untested via jcpan + +**Expected behavior**: Pure Perl, should work with Moo. +**Investigation**: Run `./jcpan -t Type::Tiny`. Many deps — check whether +the dependency chain installs cleanly. + +#### List::MoreUtils — untested via jcpan + +**Expected behavior**: Should fall back to PP implementation. +**Investigation**: Run `./jcpan -t List::MoreUtils`. Check that +List::MoreUtils::XS is skipped gracefully. + +#### Template (Template Toolkit) — 170/2072 + +**Previously 0/247** (regex bug blocked all tests). After P6 regex fix, tests +can now run. The full test suite has 2072 subtests, of which 170 pass. +Remaining failures need investigation — likely a mix of regex edge cases, +missing features, and test infrastructure issues. + +#### Mojolicious — untested via jcpan + +**Expected behavior**: Zero non-core deps, but uses IO::Socket::SSL, +subprocesses, event loops. +**Investigation**: Run `./jcpan -t Mojolicious`. Expect partial results — +socket/async features may not work. + +### Blocked Modules (cannot install/test) + +#### HTTP::Message — PASS (unblocked) + +**Previously blocked** on Clone::PP and HTML::Entities. +**Fixed by**: P1 (Clone::PP) and P7 (HTML::Parser Java XS backend). +Now passes all tests (243s runtime). + +#### Devel::Cover — PASS (unblocked) + +**Previously blocked** on HTML::Entities dep chain. +**Fixed by**: P7 (HTML::Parser Java XS backend). Now passes 1/1 tests. + +#### HTML::Parser — 190/415 (partially working) + +**Previously blocked**: No Java XS backend. +**Fixed by**: P7 (HTMLParser.java Phase 1). Basic parsing works. +190/415 tests pass. Remaining failures likely need Phase 2 (full hparser.c +port with argspec compilation and accumulator handlers for TokeParser). + +#### IO::Compress::Gzip — XS required + +**Blocker**: Needs Compress::Raw::Zlib (C library). +**Action**: Could potentially implement via Java's `java.util.zip`. + +#### Moose — XS required (skipped from smoke tests) + +**Blocker**: Needs B module subroutine name introspection, Class::MOP XS. +**Action**: Long-term — Moo is the recommended alternative. + +#### Plack — blocked on dep chain (skipped from smoke tests) + +**Blocker**: Needs HTTP::Message (now fixed), and other HTTP modules. +HTTP::Message is now unblocked. May be worth re-enabling in smoke tests. + +#### LWP::UserAgent — may be unblocked + +HTTP::Message now passes. LWP::UserAgent may work now. +**Action**: Re-test with `./jcpan -t LWP::UserAgent`. + +#### DBIx::Class — blocked on DBI + +**Blocker**: DBI is XS-only. +**Action**: Would need a Java DBI backend. Long-term goal. + +#### DBI — XS required + +**Blocker**: Pure C XS implementation. +**Action**: Could implement a Java backend using JDBC. Significant effort. + +## Fix Priority Order + +Based on impact (modules unblocked) and effort: + +1. **P1: Clone::PP** — trivial fix, unblocks HTTP::Message → LWP → Plack **(DONE)** +2. **P2: MIME::Base64 $VERSION** — trivial fix, unblocks version checks **(DONE)** +3. **P3: Encode::Alias** — Perl-level alias resolution for find_encoding **(DONE)** +4. **P4: PerlIO::encoding stub** — small fix, helps IO::HTML **(DONE)** +5. **P5: Exit code handling** — END blocks now see correct $? **(DONE)** +6. **P6: Regex `\|` + lookaheads** — two bugs in RegexPreprocessor **(DONE)** +7. **P7: HTML::Parser Java XS** — Phase 1: entities + basic parser **(DONE)** +8. **P8: IO::Compress::Gzip Java backend** — large effort, unblocks Compress chain + +## How to Update This Document + +After running the smoke test, update the per-module sections with: +1. Current pass/fail counts +2. New root causes discovered +3. Move modules between categories as fixes land + +```bash +# Run full smoke test and save results +perl dev/tools/cpan_smoke_test.pl --output smoke_results.log + +# Compare with previous run +perl dev/tools/cpan_smoke_test.pl --compare cpan_smoke_PREVIOUS.dat +``` + +## Progress Tracking + +### Current Status: P1–P7 all completed + +### Completed +- [x] Module registry created with 39 modules (2026-03-31) +- [x] Top-20 CPAN modules identified and added +- [x] Shared root causes documented +- [x] P1: Clone::PP created (`src/main/perl/lib/Clone/PP.pm`) — uses Storable::dclone +- [x] P2: MIME::Base64 $VERSION set in both `.pm` (`3.16`) and `.java` backend +- [x] P4: PerlIO::encoding stub created (`src/main/perl/lib/PerlIO/encoding.pm`) +- [x] JSON $VERSION added to bundled `src/main/perl/lib/JSON.pm` +- [x] Template::Stash::XS shim created (falls back to pure-Perl Template::Stash) +- [x] P6: Regex preprocessor — lookaheads and escaped pipes fix (2026-03-31) +- [x] P3: Encode::Alias — find_encoding wrapper + XSLoader deferred load (2026-03-31) +- [x] P5: exit() lets END blocks modify $? + require shows full error (2026-03-31) +- [x] P7: HTML::Parser/HTML::Entities Java XS backend Phase 1 (2026-03-31) + - Created `HTMLParser.java` with entity decoding + basic HTML parser + - Cross-package registration for HTML::Entities namespace + - Unblocked HTTP::Message (PASS), Devel::Cover (PASS) + - HTML::Parser 190/415 tests passing + +### Smoke Test Results (2026-03-31, post P1–P7) + +| Module | Status | Tests | +|--------|--------|-------| +| Test::Deep | FAIL | 1266/1268 | +| Try::Tiny | FAIL | 91/94 | +| Test::Fatal | PASS | 19/19 | +| MIME::Base32 | PASS | 31/31 | +| HTML::Tagset | PASS | 33/33 | +| Path::Tiny | FAIL | 1488/1542 | +| Spreadsheet::WriteExcel | FAIL | 1124/1189 | +| Image::ExifTool | PASS | 600/600 | +| Spreadsheet::ParseExcel | PASS | 1612/1612 | +| IO::Stringy | PASS | 127/127 | +| Moo | FAIL | 809/840 | +| URI | FAIL | 844/947 | +| Test::Needs | PASS | 227/227 | +| Test::Warnings | FAIL | 86/88 | +| Log::Log4perl | FAIL | 715/719 | +| JSON | FAIL | 23683/24886 | +| Devel::Cover | PASS | 1/1 | +| HTTP::Message | PASS | (all) | +| HTML::Parser | FAIL | 190/415 | + +### Next Steps +1. Investigate remaining HTML::Parser failures (Phase 2: argspec, accumulator handlers) +2. Fix P8 (IO::Compress::Gzip) — Java backend using java.util.zip +3. Investigate Template Toolkit remaining failures (170/2072) +4. Look at near-PASS modules: Test::Deep (1266/1268), Moo (809/840), Log::Log4perl (715/719) diff --git a/dev/tools/cpan_smoke_test.pl b/dev/tools/cpan_smoke_test.pl index 43d20dc0a..588871bae 100755 --- a/dev/tools/cpan_smoke_test.pl +++ b/dev/tools/cpan_smoke_test.pl @@ -53,6 +53,9 @@ ['Test::Warn', 'partial', 'pure-perl', undef, 'Dep test issues with Sub::Uplevel'], ['Path::Tiny', 'partial', 'pure-perl', undef, '1489/1542 subtests'], ['namespace::clean', 'known-good', 'pure-perl', undef, 'Namespace cleanup'], + ['Parse::RecDescent', 'partial', 'pure-perl', undef, 'Recursive descent parser; depends on Text::Balanced'], + ['Spreadsheet::WriteExcel', 'partial', 'pure-perl', undef, 'Write .xls files; unlocks ParseExcel t/46_save_parser.t'], + ['Image::ExifTool', 'partial', 'pure-perl', undef, '590/600 subtests (98%) via dedicated runner'], # ── Known-good: Java XS implementations ── ['DateTime', 'known-good', 'java-xs', undef, 'Date/time (Java java.time backend)'], @@ -71,13 +74,22 @@ ['Test::Warnings', 'partial', 'pure-perl', undef, '86/88 subtests'], ['Encode::Locale', 'partial', 'pure-perl', undef, 'Unknown encoding: locale'], ['Log::Log4perl', 'partial', 'pure-perl', undef, 'Mostly works'], + ['JSON', 'partial', 'pure-perl', undef, 'Wrapper; uses JSON::PP fallback (top-20 CPAN)'], + ['Type::Tiny', 'partial', 'pure-perl', undef, 'Type constraints for Moo/Moose (top-20 CPAN)'], + ['List::MoreUtils', 'partial', 'xs-with-pp-fallback', undef, 'Extra list utils; XS in separate dist (top-20 CPAN)'], + ['Template', 'partial', 'xs-with-pp-fallback', undef, 'Template Toolkit; Stash::XS optional (top-20 CPAN)'], + # ['Mojolicious', 'partial', 'pure-perl', undef, 'Web framework, zero non-core deps (top-20 CPAN)'], # ── Blocked: need fixes before they can work ── ['Devel::Cover', 'blocked', 'xs-required', undef, 'Blocked on HTML::Entities dep chain'], ['HTTP::Message', 'blocked', 'pure-perl', undef, 'Blocked on Clone::PP missing'], - ['HTML::Parser', 'blocked', 'xs-required', undef, 'XS module, no Java backend'], + ['HTML::Parser', 'blocked', 'xs-required', undef, 'XS module, has Java backend Phase 1'], ['IO::Compress::Gzip', 'blocked', 'xs-required', undef, 'Needs Compress::Raw::Zlib'], - ['Moose', 'blocked', 'xs-required', undef, 'Needs B module subroutine names'], + # ['Moose', 'blocked', 'xs-required', undef, 'Needs B module subroutine names'], + # ['Plack', 'blocked', 'pure-perl', undef, 'PSGI toolkit; blocked on dep chain (top-20 CPAN)'], + ['LWP::UserAgent', 'blocked', 'pure-perl', undef, 'HTTP client; blocked on HTTP::Message (top-20 CPAN)'], + ['DBIx::Class', 'blocked', 'pure-perl', undef, 'ORM; blocked on DBI (XS) (top-20 CPAN)'], + ['DBI', 'blocked', 'xs-required', undef, 'DB interface; needs Java backend (top-20 CPAN)'], # ── XS with PP fallback, need env vars ── ['Params::Util', 'partial', 'xs-with-pp-fallback', { PERL_PARAMS_UTIL_PP => 1 }, 'Needs PP env var'], diff --git a/src/main/java/org/perlonjava/core/Configuration.java b/src/main/java/org/perlonjava/core/Configuration.java index 411aba70b..568edf892 100644 --- a/src/main/java/org/perlonjava/core/Configuration.java +++ b/src/main/java/org/perlonjava/core/Configuration.java @@ -33,7 +33,7 @@ public final class Configuration { * Automatically populated by Gradle/Maven during build. * DO NOT EDIT MANUALLY - this value is replaced at build time. */ - public static final String gitCommitId = "9d7f75afe"; + public static final String gitCommitId = "995455306"; /** * Git commit date of the build (ISO format: YYYY-MM-DD). diff --git a/src/main/java/org/perlonjava/runtime/operators/ModuleOperators.java b/src/main/java/org/perlonjava/runtime/operators/ModuleOperators.java index 0d567b1de..0382f2cf5 100644 --- a/src/main/java/org/perlonjava/runtime/operators/ModuleOperators.java +++ b/src/main/java/org/perlonjava/runtime/operators/ModuleOperators.java @@ -797,9 +797,16 @@ public static RuntimeScalar require(RuntimeScalar runtimeScalar) { // Don't set %INC for file not found errors throw new PerlCompilerException(message); } else { - message = "Compilation failed in require: " + err; + // Build Perl 5-compatible error: original error + "Compilation failed in require" + String fullErr = err; + if (!fullErr.endsWith("\n")) { + fullErr += "\n"; + } + message = fullErr + "Compilation failed in require"; // Set %INC as undef to mark compilation failure incHash.put(fileName, new RuntimeScalar()); + // Update $@ so eval{} sees the full message (catchEval preserves $@ for PerlCompilerException) + getGlobalVariable("main::@").set(message); throw new PerlCompilerException(message); } } diff --git a/src/main/java/org/perlonjava/runtime/operators/WarnDie.java b/src/main/java/org/perlonjava/runtime/operators/WarnDie.java index 71ac16970..d59e1cf34 100644 --- a/src/main/java/org/perlonjava/runtime/operators/WarnDie.java +++ b/src/main/java/org/perlonjava/runtime/operators/WarnDie.java @@ -400,8 +400,13 @@ private static RuntimeBase dieEmptyMessage(RuntimeScalar oldErr, String fileName */ public static RuntimeScalar exit(RuntimeScalar runtimeScalar) { int exitCode = runtimeScalar.getInt(); + // Set $? to the exit code before running END blocks (Perl 5 semantics). + // From perlvar: "Inside an END subroutine $? contains the value that + // is going to be given to exit(). You can modify $? in an END + // subroutine to change the exit status of your program." + getGlobalVariable("main::?").set(exitCode); try { - runEndBlocks(); + runEndBlocks(false); // Don't reset $? - we just set it to the exit code } catch (Throwable t) { RuntimeIO.closeAllHandles(); String errorMessage = ErrorMessageUtil.stringifyException(t); @@ -409,7 +414,9 @@ public static RuntimeScalar exit(RuntimeScalar runtimeScalar) { throw new PerlExitException(1); } RuntimeIO.closeAllHandles(); - throw new PerlExitException(exitCode); + // Use $? as the final exit code - END blocks may have modified it + int finalExitCode = getGlobalVariable("main::?").getInt(); + throw new PerlExitException(finalExitCode); } /** diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java b/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java index bf66f05f6..8b97a23f1 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java @@ -84,7 +84,7 @@ public class Encode extends PerlModuleBase { } public Encode() { - super("Encode", true); + super("Encode", false); // Don't set %INC - let Encode.pm run via XSLoader } public static void initialize() { @@ -335,6 +335,8 @@ public static RuntimeList is_utf8(RuntimeArray args, int ctx) { * find_encoding($encoding) * Returns a blessed Encode::Encoding object for the given encoding name. * The object supports ->encode($string) and ->decode($octets) methods. + * Note: This is the Java fast path for known charsets. The Perl wrapper + * in Encode.pm adds Encode::Alias fallback for custom aliases like "locale". */ public static RuntimeList find_encoding(RuntimeArray args, int ctx) { if (args.isEmpty()) { diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/HTMLParser.java b/src/main/java/org/perlonjava/runtime/perlmodule/HTMLParser.java new file mode 100644 index 000000000..98d75d199 --- /dev/null +++ b/src/main/java/org/perlonjava/runtime/perlmodule/HTMLParser.java @@ -0,0 +1,791 @@ +package org.perlonjava.runtime.perlmodule; + +import org.perlonjava.runtime.runtimetypes.*; + +import org.perlonjava.runtime.mro.InheritanceResolver; +import static org.perlonjava.runtime.runtimetypes.RuntimeScalarCache.*; + +/** + * Java XS implementation of HTML::Parser and HTML::Entities. + *

+ * Mirrors the original Parser.xs which contains both PACKAGE = HTML::Parser + * and PACKAGE = HTML::Entities in a single file, loaded via + * XSLoader::load('HTML::Parser'). + *

+ * Phase 1: Full HTML::Entities support (decode_entities, _decode_entities) + * plus HTML::Parser stubs for construction and configuration. + */ +public class HTMLParser extends PerlModuleBase { + + public static final String XS_VERSION = "3.83"; + + public HTMLParser() { + super("HTML::Parser", false); + } + + public static void initialize() { + HTMLParser module = new HTMLParser(); + + try { + // ============================================================ + // PACKAGE = HTML::Parser + // ============================================================ + module.registerMethod("_alloc_pstate", null); + module.registerMethod("parse", null); + module.registerMethod("eof", "parserEof", null); + + // 13 boolean attribute accessors (aliased in XS via strict_comment) + module.registerMethod("strict_comment", null); + module.registerMethod("strict_names", null); + module.registerMethod("xml_mode", null); + module.registerMethod("unbroken_text", null); + module.registerMethod("marked_sections", null); + module.registerMethod("attr_encoded", null); + module.registerMethod("case_sensitive", null); + module.registerMethod("strict_end", null); + module.registerMethod("closing_plaintext", null); + module.registerMethod("utf8_mode", null); + module.registerMethod("empty_element_tags", null); + module.registerMethod("xml_pic", null); + module.registerMethod("backquote", null); + + module.registerMethod("boolean_attribute_value", null); + module.registerMethod("handler", null); + module.registerMethod("report_tags", "tagListAccessor", null); + module.registerMethod("ignore_tags", "tagListAccessor", null); + module.registerMethod("ignore_elements", "tagListAccessor", null); + + } catch (NoSuchMethodException e) { + System.err.println("Warning: Missing HTMLParser method: " + e.getMessage()); + } + + // ============================================================ + // PACKAGE = HTML::Entities + // Cross-package registration: these functions go into the + // HTML::Entities:: namespace but are loaded by HTML::Parser's XS. + // ============================================================ + try { + java.lang.invoke.MethodHandle mh; + RuntimeCode code; + + mh = RuntimeCode.lookup.findStatic(HTMLParser.class, "decode_entities", RuntimeCode.methodType); + code = new RuntimeCode(mh, null, null); + code.isStatic = true; + GlobalVariable.getGlobalCodeRef("HTML::Entities::decode_entities").set(new RuntimeScalar(code)); + + mh = RuntimeCode.lookup.findStatic(HTMLParser.class, "_decode_entities", RuntimeCode.methodType); + code = new RuntimeCode(mh, null, null); + code.isStatic = true; + GlobalVariable.getGlobalCodeRef("HTML::Entities::_decode_entities").set(new RuntimeScalar(code)); + + mh = RuntimeCode.lookup.findStatic(HTMLParser.class, "UNICODE_SUPPORT", RuntimeCode.methodType); + code = new RuntimeCode(mh, null, null); + code.isStatic = true; + GlobalVariable.getGlobalCodeRef("HTML::Entities::UNICODE_SUPPORT").set(new RuntimeScalar(code)); + + mh = RuntimeCode.lookup.findStatic(HTMLParser.class, "_probably_utf8_chunk", RuntimeCode.methodType); + code = new RuntimeCode(mh, null, null); + code.isStatic = true; + GlobalVariable.getGlobalCodeRef("HTML::Entities::_probably_utf8_chunk").set(new RuntimeScalar(code)); + + } catch (NoSuchMethodException | IllegalAccessException e) { + System.err.println("Warning: Missing HTMLEntities method: " + e.getMessage()); + } + } + + // ================================================================ + // HTML::Parser methods + // ================================================================ + + /** + * _alloc_pstate($self) + * Allocates parser state and stores it in $self->{_hparser_xs_state}. + * We use a RuntimeHash to hold the parser configuration. + */ + public static RuntimeList _alloc_pstate(RuntimeArray args, int ctx) { + RuntimeScalar self = args.get(0); + RuntimeHash selfHash = self.hashDeref(); + + // Create parser state as a hash holding boolean flags, handlers, etc. + RuntimeHash pstate = new RuntimeHash(); + + // Initialize boolean flags to false + String[] boolFlags = { + "strict_comment", "strict_names", "xml_mode", "unbroken_text", + "marked_sections", "attr_encoded", "case_sensitive", "strict_end", + "closing_plaintext", "utf8_mode", "empty_element_tags", "xml_pic", + "backquote" + }; + for (String flag : boolFlags) { + pstate.put(flag, scalarFalse); + } + + // Initialize handler slots + String[] events = { + "declaration", "comment", "start", "end", "text", + "process", "start_document", "end_document", "default" + }; + RuntimeHash handlers = new RuntimeHash(); + for (String event : events) { + handlers.put(event + "_cb", scalarUndef); + handlers.put(event + "_argspec", scalarUndef); + } + pstate.put("_handlers", handlers.createReference()); + + // State tracking + pstate.put("_parsing", scalarFalse); + pstate.put("_eof", scalarFalse); + pstate.put("_buf", new RuntimeScalar("")); + pstate.put("_bool_attr_val", scalarUndef); + + // Store in self + selfHash.put("_hparser_xs_state", pstate.createReference()); + + return new RuntimeList(); + } + + /** + * parse($self, $chunk) + * Feeds HTML to the parser. Phase 1: basic event-driven parsing. + */ + public static RuntimeList parse(RuntimeArray args, int ctx) { + RuntimeScalar self = args.get(0); + RuntimeHash selfHash = self.hashDeref(); + RuntimeHash pstate = getPstate(selfHash); + + if (pstate.get("_parsing").getBoolean()) { + throw new RuntimeException("Parse loop not allowed"); + } + + pstate.put("_parsing", scalarTrue); + + try { + if (args.size() > 1) { + RuntimeScalar chunk = args.get(1); + if (chunk.getDefinedBoolean()) { + String html = pstate.get("_buf").toString() + chunk.toString(); + pstate.put("_buf", new RuntimeScalar("")); + parseHtml(selfHash, pstate, html); + } + } + } finally { + pstate.put("_parsing", scalarFalse); + } + + if (pstate.get("_eof").getBoolean()) { + pstate.put("_eof", scalarFalse); + return scalarUndef.getList(); + } + return self.getList(); + } + + /** + * eof($self) + * Signals end-of-document, flushes buffered text. + */ + public static RuntimeList parserEof(RuntimeArray args, int ctx) { + RuntimeScalar self = args.get(0); + RuntimeHash selfHash = self.hashDeref(); + RuntimeHash pstate = getPstate(selfHash); + + if (pstate.get("_parsing").getBoolean()) { + pstate.put("_eof", scalarTrue); + } else { + pstate.put("_parsing", scalarTrue); + try { + // Flush any remaining buffered text + String remaining = pstate.get("_buf").toString(); + if (!remaining.isEmpty()) { + pstate.put("_buf", new RuntimeScalar("")); + parseHtml(selfHash, pstate, remaining); + } + // Fire end_document event + fireEvent(selfHash, pstate, "end_document"); + } finally { + pstate.put("_parsing", scalarFalse); + } + } + return self.getList(); + } + + // 13 boolean attribute getter/setters - each delegates to booleanAccessorHelper + public static RuntimeList strict_comment(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "strict_comment"); } + public static RuntimeList strict_names(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "strict_names"); } + public static RuntimeList xml_mode(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "xml_mode"); } + public static RuntimeList unbroken_text(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "unbroken_text"); } + public static RuntimeList marked_sections(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "marked_sections"); } + public static RuntimeList attr_encoded(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "attr_encoded"); } + public static RuntimeList case_sensitive(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "case_sensitive"); } + public static RuntimeList strict_end(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "strict_end"); } + public static RuntimeList closing_plaintext(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "closing_plaintext"); } + public static RuntimeList utf8_mode(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "utf8_mode"); } + public static RuntimeList empty_element_tags(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "empty_element_tags"); } + public static RuntimeList xml_pic(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "xml_pic"); } + public static RuntimeList backquote(RuntimeArray args, int ctx) { return booleanAccessorHelper(args, "backquote"); } + + private static RuntimeList booleanAccessorHelper(RuntimeArray args, String attrName) { + RuntimeScalar self = args.get(0); + RuntimeHash selfHash = self.hashDeref(); + RuntimeHash pstate = getPstate(selfHash); + + RuntimeScalar old = pstate.get(attrName); + RuntimeScalar retval = (old != null && old.getBoolean()) ? scalarTrue : scalarFalse; + if (args.size() > 1) { + pstate.put(attrName, args.get(1).getBoolean() ? scalarTrue : scalarFalse); + } + return retval.getList(); + } + + /** + * boolean_attribute_value($pstate, [$new_value]) + */ + public static RuntimeList boolean_attribute_value(RuntimeArray args, int ctx) { + RuntimeScalar self = args.get(0); + RuntimeHash selfHash = self.hashDeref(); + RuntimeHash pstate = getPstate(selfHash); + + RuntimeScalar old = pstate.get("_bool_attr_val"); + if (args.size() > 1) { + pstate.put("_bool_attr_val", args.get(1)); + } + return old.getList(); + } + + /** + * handler($pstate, $eventname, [$callback, $argspec]) + */ + public static RuntimeList handler(RuntimeArray args, int ctx) { + RuntimeScalar self = args.get(0); + RuntimeHash selfHash = self.hashDeref(); + RuntimeHash pstate = getPstate(selfHash); + + if (args.size() < 2) { + throw new RuntimeException("Usage: $p->handler(event => cb, argspec)"); + } + + String eventName = args.get(1).toString(); + RuntimeHash handlers = pstate.get("_handlers").hashDeref(); + + // Return old handler + RuntimeScalar oldCb = handlers.get(eventName + "_cb"); + + // Update handler if new callback provided + if (args.size() > 2) { + RuntimeScalar newCb = args.get(2); + handlers.put(eventName + "_cb", newCb); + } + if (args.size() > 3) { + RuntimeScalar argspec = args.get(3); + handlers.put(eventName + "_argspec", argspec); + } + + return (oldCb != null) ? oldCb.getList() : scalarUndef.getList(); + } + + /** + * Tag list accessor (report_tags, ignore_tags, ignore_elements). + */ + public static RuntimeList tagListAccessor(RuntimeArray args, int ctx) { + // Phase 1 stub - tag filtering not yet implemented + return new RuntimeList(); + } + + // ================================================================ + // HTML::Entities methods (PACKAGE = HTML::Entities in Parser.xs) + // ================================================================ + + /** + * decode_entities(...) + *

+ * In void context: decodes entities in-place in the arguments. + * In scalar context with multiple args: only processes first argument, returns copy. + * In list context: returns decoded copies of all arguments. + */ + public static RuntimeList decode_entities(RuntimeArray args, int ctx) { + RuntimeHash entity2char = GlobalVariable.getGlobalHash("HTML::Entities::entity2char"); + + int items = args.size(); + if (ctx == RuntimeContextType.SCALAR && items > 1) { + items = 1; + } + + if (ctx == RuntimeContextType.VOID) { + // Void context: modify in-place + for (int i = 0; i < items; i++) { + RuntimeScalar sv = args.get(i); + String decoded = decodeEntitiesString(sv.toString(), entity2char, false); + sv.set(decoded); + } + return new RuntimeList(); + } else { + // Scalar/list context: return decoded copies + RuntimeList result = new RuntimeList(); + for (int i = 0; i < items; i++) { + String decoded = decodeEntitiesString(args.get(i).toString(), entity2char, false); + result.add(new RuntimeScalar(decoded)); + } + return result; + } + } + + /** + * _decode_entities($string, \%entity2char, $expand_prefix) + * In-place decode with explicit entity hash and optional prefix expansion. + */ + public static RuntimeList _decode_entities(RuntimeArray args, int ctx) { + if (args.size() < 2) { + throw new RuntimeException("Usage: _decode_entities(string, entity2char_hash, [expand_prefix])"); + } + + RuntimeScalar stringSv = args.get(0); + RuntimeScalar entitiesSv = args.get(1); + boolean expandPrefix = args.size() > 2 && args.get(2).getBoolean(); + + RuntimeHash entityHash = null; + if (entitiesSv.getDefinedBoolean()) { + if (RuntimeScalarType.isReference(entitiesSv)) { + try { + entityHash = entitiesSv.hashDeref(); + } catch (Exception e) { + // Not a hash reference + } + } + if (entityHash == null) { + throw new RuntimeException("2nd argument must be hash reference"); + } + } + + String decoded = decodeEntitiesString(stringSv.toString(), entityHash, expandPrefix); + stringSv.set(decoded); + + return new RuntimeList(); + } + + /** + * UNICODE_SUPPORT() - always returns 1 + */ + public static RuntimeList UNICODE_SUPPORT(RuntimeArray args, int ctx) { + return new RuntimeScalar(1).getList(); + } + + /** + * _probably_utf8_chunk($string) - checks if string looks like valid UTF-8 + */ + public static RuntimeList _probably_utf8_chunk(RuntimeArray args, int ctx) { + if (args.isEmpty()) { + return scalarFalse.getList(); + } + String s = args.get(0).toString(); + for (int i = 0; i < s.length(); i++) { + if (s.charAt(i) > 0x7F) { + return scalarTrue.getList(); + } + } + return scalarFalse.getList(); + } + + // ================================================================ + // Internal helpers + // ================================================================ + + /** + * Retrieve the parser state hash from $self->{_hparser_xs_state}. + */ + private static RuntimeHash getPstate(RuntimeHash selfHash) { + RuntimeScalar ref = selfHash.get("_hparser_xs_state"); + if (ref == null || !ref.getDefinedBoolean()) { + throw new RuntimeException("HTML::Parser not initialized (missing _hparser_xs_state)"); + } + return ref.hashDeref(); + } + + /** + * Fire a parser event by calling the registered handler. + */ + private static void fireEvent(RuntimeHash selfHash, RuntimeHash pstate, String eventName, RuntimeScalar... eventArgs) { + RuntimeHash handlers = pstate.get("_handlers").hashDeref(); + RuntimeScalar cb = handlers.get(eventName + "_cb"); + + if (cb == null || !cb.getDefinedBoolean()) { + return; + } + + RuntimeArray callArgs = new RuntimeArray(); + + // Parse argspec to determine what arguments to pass + RuntimeScalar argspecSv = handlers.get(eventName + "_argspec"); + String argspec = (argspecSv != null && argspecSv.getDefinedBoolean()) ? + argspecSv.toString() : ""; + + if (cb.type == RuntimeScalarType.STRING) { + // Method name - call as $self->method(...) + String methodName = cb.toString(); + RuntimeScalar selfRef = selfHash.createReference(); + RuntimeArray.push(callArgs, selfRef); + for (RuntimeScalar arg : eventArgs) { + RuntimeArray.push(callArgs, arg); + } + // Look up method in the object's class hierarchy + int blessId = RuntimeScalarType.blessedId(selfRef); + String className = (blessId != 0) ? + NameNormalizer.getBlessStr(blessId) : "HTML::Parser"; + RuntimeScalar method = InheritanceResolver.findMethodInHierarchy( + methodName, className, null, 0); + if (method != null) { + RuntimeCode.apply(method, callArgs, RuntimeContextType.VOID); + } + } else if (cb.type == RuntimeScalarType.REFERENCE || cb.type == RuntimeScalarType.CODE) { + // Code reference - call directly + for (RuntimeScalar arg : eventArgs) { + RuntimeArray.push(callArgs, arg); + } + RuntimeCode.apply(cb, callArgs, RuntimeContextType.VOID); + } + } + + /** + * Basic HTML parser - fires text, start, end events. + * This is a simplified version; Phase 2 will port the full hparser.c logic. + */ + private static void parseHtml(RuntimeHash selfHash, RuntimeHash pstate, String html) { + int len = html.length(); + int i = 0; + int textStart = 0; + + while (i < len) { + if (html.charAt(i) == '<') { + // Flush pending text + if (i > textStart) { + fireEvent(selfHash, pstate, "text", + new RuntimeScalar(html.substring(textStart, i))); + } + + int tagStart = i; + i++; // skip '<' + + if (i < len && html.charAt(i) == '/') { + // End tag + i++; + int nameStart = i; + while (i < len && html.charAt(i) != '>' && !Character.isWhitespace(html.charAt(i))) { + i++; + } + String tagName = html.substring(nameStart, i).toLowerCase(); + while (i < len && html.charAt(i) != '>') i++; + if (i < len) i++; // skip '>' + + fireEvent(selfHash, pstate, "end", + new RuntimeScalar(tagName), + new RuntimeScalar(html.substring(tagStart, i))); + textStart = i; + } else if (i < len && html.charAt(i) == '!') { + // Comment or declaration + i++; + if (i + 1 < len && html.charAt(i) == '-' && html.charAt(i + 1) == '-') { + // Comment + i += 2; + int commentStart = i; + int endIdx = html.indexOf("-->", i); + if (endIdx >= 0) { + String comment = html.substring(commentStart, endIdx); + i = endIdx + 3; + fireEvent(selfHash, pstate, "comment", + new RuntimeScalar(comment)); + } else { + // Unterminated comment - buffer it + pstate.put("_buf", new RuntimeScalar(html.substring(tagStart))); + return; + } + } else { + // Declaration + int endIdx = html.indexOf('>', i); + if (endIdx >= 0) { + String decl = html.substring(tagStart, endIdx + 1); + i = endIdx + 1; + fireEvent(selfHash, pstate, "declaration", + new RuntimeScalar(decl)); + } else { + pstate.put("_buf", new RuntimeScalar(html.substring(tagStart))); + return; + } + } + textStart = i; + } else if (i < len && html.charAt(i) == '?') { + // Processing instruction + int endIdx = html.indexOf("?>", i); + if (endIdx >= 0) { + String pi = html.substring(tagStart, endIdx + 2); + i = endIdx + 2; + fireEvent(selfHash, pstate, "process", + new RuntimeScalar(pi)); + } else { + pstate.put("_buf", new RuntimeScalar(html.substring(tagStart))); + return; + } + textStart = i; + } else { + // Start tag + int nameStart = i; + while (i < len && html.charAt(i) != '>' && html.charAt(i) != '/' + && !Character.isWhitespace(html.charAt(i))) { + i++; + } + String tagName = html.substring(nameStart, i).toLowerCase(); + + // Parse attributes + RuntimeHash attrs = new RuntimeHash(); + RuntimeArray attrSeq = new RuntimeArray(); + + while (i < len && html.charAt(i) != '>' && html.charAt(i) != '/') { + // Skip whitespace + while (i < len && Character.isWhitespace(html.charAt(i))) i++; + if (i >= len || html.charAt(i) == '>' || html.charAt(i) == '/') break; + + // Attribute name + int attrNameStart = i; + while (i < len && html.charAt(i) != '=' && html.charAt(i) != '>' + && html.charAt(i) != '/' && !Character.isWhitespace(html.charAt(i))) { + i++; + } + String attrName = html.substring(attrNameStart, i).toLowerCase(); + RuntimeArray.push(attrSeq, new RuntimeScalar(attrName)); + + // Skip whitespace + while (i < len && Character.isWhitespace(html.charAt(i))) i++; + + String attrValue = attrName; // boolean attribute default + if (i < len && html.charAt(i) == '=') { + i++; // skip '=' + while (i < len && Character.isWhitespace(html.charAt(i))) i++; + + if (i < len) { + if (html.charAt(i) == '"' || html.charAt(i) == '\'') { + char quote = html.charAt(i); + i++; + int valStart = i; + while (i < len && html.charAt(i) != quote) i++; + attrValue = html.substring(valStart, i); + if (i < len) i++; // skip closing quote + } else { + int valStart = i; + while (i < len && html.charAt(i) != '>' + && !Character.isWhitespace(html.charAt(i))) { + i++; + } + attrValue = html.substring(valStart, i); + } + } + } + attrs.put(attrName, new RuntimeScalar(attrValue)); + } + + // Handle self-closing /> + boolean selfClosing = false; + if (i < len && html.charAt(i) == '/') { + selfClosing = true; + i++; + } + if (i < len && html.charAt(i) == '>') i++; + + String origText = html.substring(tagStart, i); + + fireEvent(selfHash, pstate, "start", + new RuntimeScalar(tagName), + attrs.createReference(), + attrSeq.createReference(), + new RuntimeScalar(origText)); + + if (selfClosing) { + fireEvent(selfHash, pstate, "end", + new RuntimeScalar(tagName), + new RuntimeScalar("")); + } + textStart = i; + } + } else { + i++; + } + } + + // Flush remaining text + if (textStart < len) { + fireEvent(selfHash, pstate, "text", + new RuntimeScalar(html.substring(textStart))); + } + } + + // ================================================================ + // Entity decoding core (ported from util.c decode_entities) + // ================================================================ + + /** + * Core entity decoding algorithm, ported from util.c decode_entities(). + * Handles numeric entities (decimal/hex), named entities, surrogate pairs, + * and prefix expansion mode. + */ + private static String decodeEntitiesString(String input, RuntimeHash entity2char, boolean expandPrefix) { + if (input == null || input.isEmpty()) { + return input; + } + + StringBuilder result = new StringBuilder(input.length()); + int len = input.length(); + int i = 0; + int highSurrogate = 0; + + while (i < len) { + char ch = input.charAt(i); + if (ch != '&') { + result.append(ch); + i++; + continue; + } + + int entStart = i; + i++; // skip '&' + + if (i >= len) { + result.append('&'); + continue; + } + + String replacement = null; + + if (input.charAt(i) == '#') { + // Numeric entity + i++; // skip '#' + long num = 0; + boolean ok = false; + + if (i < len && (input.charAt(i) == 'x' || input.charAt(i) == 'X')) { + // Hex: &#xHH; + i++; + while (i < len) { + int digit = hexDigit(input.charAt(i)); + if (digit < 0) break; + num = (num << 4) | digit; + if (num > 0x10FFFF) { ok = false; break; } + i++; + ok = true; + } + } else { + // Decimal: &#NNN; + while (i < len && input.charAt(i) >= '0' && input.charAt(i) <= '9') { + num = num * 10 + (input.charAt(i) - '0'); + if (num > 0x10FFFF) { ok = false; break; } + i++; + ok = true; + } + } + + if (num > 0 && ok) { + // Handle surrogates (matching C code logic) + if ((num & 0xFFFFFC00L) == 0xDC00L) { + // Low surrogate + if (highSurrogate != 0) { + // Combine with previous high surrogate + num = ((long) (highSurrogate - 0xD800) << 10) + + (num - 0xDC00) + 0x10000; + highSurrogate = 0; + // Remove previously appended U+FFFD + if (result.length() > 0 && result.charAt(result.length() - 1) == '\uFFFD') { + result.setLength(result.length() - 1); + } + } else { + num = 0xFFFD; + } + } else if ((num & 0xFFFFFC00L) == 0xD800L) { + // High surrogate + highSurrogate = (int) num; + num = 0xFFFD; + } else { + highSurrogate = 0; + if (num == 0xFFFE || num == 0xFFFF) { + // Illegal + ok = false; + } else if ((num >= 0xFDD0 && num <= 0xFDEF) || + ((num & 0xFFFE) == 0xFFFE) || + num > 0x10FFFF) { + num = 0xFFFD; + } + } + + if (ok && num > 0) { + replacement = new String(Character.toChars((int) num)); + } + } + + if (replacement == null) { + highSurrogate = 0; + } + } else { + // Named entity + int nameStart = i; + while (i < len && isAlnum(input.charAt(i))) { + i++; + } + + if (i > nameStart && entity2char != null) { + String entityName = input.substring(nameStart, i); + + // Try exact match without semicolon + RuntimeScalar val = entity2char.get(entityName); + if (val != null && val.getDefinedBoolean()) { + replacement = val.toString(); + } + + // Try with semicolon appended (for entities keyed with trailing ";") + if (replacement == null && i < len && input.charAt(i) == ';') { + val = entity2char.get(entityName + ";"); + if (val != null && val.getDefinedBoolean()) { + replacement = val.toString(); + } + } + + // Prefix expansion mode + if (replacement == null && expandPrefix) { + for (int end = i - 1; end > nameStart; end--) { + String prefix = input.substring(nameStart, end); + val = entity2char.get(prefix); + if (val != null && val.getDefinedBoolean()) { + replacement = val.toString(); + i = end; + break; + } + } + } + } + highSurrogate = 0; + } + + if (replacement != null) { + // Consume trailing semicolon if present + if (i < len && input.charAt(i) == ';') { + i++; + } + result.append(replacement); + + if (i >= len || input.charAt(i) != '&') { + highSurrogate = 0; + } + } else { + // No match - output original text + result.append(input, entStart, i); + } + } + + return result.toString(); + } + + private static int hexDigit(char c) { + if (c >= '0' && c <= '9') return c - '0'; + if (c >= 'a' && c <= 'f') return c - 'a' + 10; + if (c >= 'A' && c <= 'F') return c - 'A' + 10; + return -1; + } + + private static boolean isAlnum(char c) { + return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || + (c >= '0' && c <= '9') || c == '_'; + } +} diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/MIMEBase64.java b/src/main/java/org/perlonjava/runtime/perlmodule/MIMEBase64.java index 8e102f090..d218eb383 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/MIMEBase64.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/MIMEBase64.java @@ -5,6 +5,7 @@ import org.perlonjava.runtime.runtimetypes.RuntimeList; import org.perlonjava.runtime.runtimetypes.RuntimeScalar; import org.perlonjava.runtime.util.Base64Util; +import org.perlonjava.runtime.runtimetypes.GlobalVariable; import java.nio.charset.StandardCharsets; @@ -18,6 +19,7 @@ public static void initialize() { MIMEBase64 base64 = new MIMEBase64(); base64.initializeExporter(); base64.defineExport("EXPORT", "encode_base64", "decode_base64"); + GlobalVariable.getGlobalVariable("MIME::Base64::VERSION").set(new RuntimeScalar("3.16")); try { base64.registerMethod("encode_base64", null); base64.registerMethod("decode_base64", null); diff --git a/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessor.java b/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessor.java index 737b8179c..1244badf4 100644 --- a/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessor.java +++ b/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessor.java @@ -741,6 +741,14 @@ static int handleRegex(String s, int offset, StringBuilder sb, RegexFlags regexF // Remove \G from the pattern string for Java compilation if (s.startsWith("\\G", offset)) { offset += 2; + // Also strip any quantifier that followed \G (e.g. \G? in (\G?[ac])?) + // Since \G is a zero-width assertion, \G? is effectively nothing + if (offset < length) { + char next = s.charAt(offset); + if (next == '?' || next == '*' || next == '+') { + offset++; + } + } } while (offset < length) { @@ -751,18 +759,18 @@ static int handleRegex(String s, int offset, StringBuilder sb, RegexFlags regexF case '*': case '+': case '?': - // Check if this is at the start or after certain characters - if (offset == 0 || sb.length() == 0) { - regexError(s, offset + 1, "Quantifier follows nothing"); - } - - // Check what the last character was - if (sb.length() > 0) { - char lastChar = sb.charAt(sb.length() - 1); - // Check if quantifier follows | - if (lastChar == '|') { - regexError(s, offset + 1, "Quantifier follows nothing"); + // Check if the previous item can be quantified + if (!lastWasQuantifiable) { + // Distinguish "Nested quantifiers" from "Quantifier follows nothing": + // If the last char in sb is a quantifier, this is a nested quantifier + if (sb.length() > 0) { + char lc = sb.charAt(sb.length() - 1); + boolean esc = sb.length() >= 2 && sb.charAt(sb.length() - 2) == '\\'; + if (!esc && (lc == '*' || lc == '+' || lc == '?' || lc == '}')) { + regexError(s, offset + 1, "Nested quantifiers"); + } } + regexError(s, offset + 1, "Quantifier follows nothing"); } // Check if this might be a possessive quantifier @@ -771,15 +779,6 @@ static int handleRegex(String s, int offset, StringBuilder sb, RegexFlags regexF // Check if this might be a non-greedy quantifier boolean isNonGreedy = offset + 1 < length && s.charAt(offset + 1) == '?'; - // Check what the last character was - if (sb.length() > 0) { - char lastChar = sb.charAt(sb.length() - 1); - // Check if quantifier follows | - if (lastChar == '|') { - regexError(s, offset + 1, "Quantifier follows nothing"); - } - } - // Check for nested quantifiers (but not possessive or non-greedy) if (!isPossessive && !isNonGreedy && sb.length() > 0) { char lastChar = sb.charAt(sb.length() - 1); @@ -1036,11 +1035,13 @@ private static int handleParentheses(String s, int offset, int length, StringBui } else if (c3 == '<' && c4 == '=') { // Positive lookbehind (?<=...) validateLookbehindLength(s, offset); - offset = handleRegularParentheses(s, offset, length, sb, regexFlags); + sb.append("(?<="); + offset = handleRegex(s, offset + 4, sb, regexFlags, true); } else if (c3 == '<' && c4 == '!') { // Negative lookbehind (? ... ) offset = handleNamedCapture(c3, s, offset, length, sb, regexFlags); @@ -1064,16 +1065,20 @@ private static int handleParentheses(String s, int offset, int length, StringBui return RegexPreprocessorHelper.handleFlagModifiers(s, offset, sb, regexFlags); } else if (c3 == ':') { // Handle non-capturing group (?:...) - offset = handleRegularParentheses(s, offset, length, sb, regexFlags); + sb.append("(?:"); + offset = handleRegex(s, offset + 3, sb, regexFlags, true); } else if (c3 == '=') { // Positive lookahead (?=...) - offset = handleRegularParentheses(s, offset, length, sb, regexFlags); + sb.append("(?="); + offset = handleRegex(s, offset + 3, sb, regexFlags, true); } else if (c3 == '!') { // Negative lookahead (?!...) - offset = handleRegularParentheses(s, offset, length, sb, regexFlags); + sb.append("(?!"); + offset = handleRegex(s, offset + 3, sb, regexFlags, true); } else if (c3 == '>') { // Atomic group (?>...) - non-backtracking group - offset = handleRegularParentheses(s, offset, length, sb, regexFlags); + sb.append("(?>"); + offset = handleRegex(s, offset + 3, sb, regexFlags, true); } else if (c3 == '|') { // Handle (?|...) branch reset groups offset = handleBranchReset(s, offset, length, sb, regexFlags); diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalContext.java b/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalContext.java index b3a5659d9..52f97f42b 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalContext.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalContext.java @@ -248,11 +248,11 @@ public static void initializeGlobals(CompilerOptions compilerOptions) { // Deferred to XSLoader::load() for faster startup - only loaded when actually used: // UnicodeNormalize.initialize(); // Has XSLoader in Perl file // TimeHiRes.initialize(); // Has XSLoader in Perl file + // Encode.initialize(); // Has XSLoader in Perl file - deferred for Encode::Alias support UnicodeUCD.initialize(); // No XSLoader in Perl file - needed at startup TermReadLine.initialize(); // No Perl file - needed at startup TermReadKey.initialize(); // No Perl file - needed at startup FileTemp.initialize(); // Perl uses eval require - keep for cleanup hooks - Encode.initialize(); // Common enough to keep // JavaSystem.initialize(); // Only for java:: integration PerlIO.initialize(); IOHandle.initialize(); // IO::Handle methods (_sync, _error, etc.) diff --git a/src/main/perl/lib/Clone/PP.pm b/src/main/perl/lib/Clone/PP.pm new file mode 100644 index 000000000..c65179205 --- /dev/null +++ b/src/main/perl/lib/Clone/PP.pm @@ -0,0 +1,77 @@ +package Clone::PP; + +use strict; +use warnings; + +our $VERSION = '1.09'; + +use Scalar::Util qw(reftype blessed refaddr); +use Exporter 'import'; +our @EXPORT_OK = qw(clone); + +sub clone { + return _clone_data($_[0], {}); +} + +sub _clone_data { + my ($data, $seen) = @_; + + return $data unless ref($data); + + # Handle circular references + my $addr = refaddr($data); + return $seen->{$addr} if exists $seen->{$addr}; + + my $rtype = reftype($data); + my $class = blessed($data); + + if ($rtype eq 'HASH') { + my $clone = {}; + $seen->{$addr} = $clone; + for my $key (keys %$data) { + $clone->{$key} = _clone_data($data->{$key}, $seen); + } + bless $clone, $class if defined $class; + return $clone; + } + + if ($rtype eq 'ARRAY') { + my $clone = []; + $seen->{$addr} = $clone; + for my $item (@$data) { + push @$clone, _clone_data($item, $seen); + } + bless $clone, $class if defined $class; + return $clone; + } + + if ($rtype eq 'SCALAR' || $rtype eq 'REF') { + my $clone = \(my $copy = $$data); + $seen->{$addr} = $clone; + bless $clone, $class if defined $class; + return $clone; + } + + # CODE, GLOB, IO, Regexp - return as-is (immutable or not deep-cloneable) + return $data; +} + +1; +__END__ + +=head1 NAME + +Clone::PP - Recursively copy Perl datatypes (pure Perl) + +=head1 SYNOPSIS + + use Clone::PP 'clone'; + my $copy = clone($data); + +=head1 DESCRIPTION + +Pure Perl deep clone implementation. Handles hashes, arrays, scalar refs, +and circular references. Code refs, globs, and regexps are returned as-is +(shared, not copied) since they are immutable or not safely cloneable. + +=cut diff --git a/src/main/perl/lib/Encode.pm b/src/main/perl/lib/Encode.pm index 491604a8f..730177b1f 100644 --- a/src/main/perl/lib/Encode.pm +++ b/src/main/perl/lib/Encode.pm @@ -17,4 +17,44 @@ our @EXPORT_OK = qw( use XSLoader; XSLoader::load('Encode', $VERSION); +# Save reference to the Java-backed find_encoding before overriding. +# The Java method handles known charsets (UTF-8, latin1, ascii, etc.) directly. +my $_java_find_encoding = \&find_encoding; + +# Override find_encoding to add Encode::Alias support. +# The Java backend only recognises hardcoded charset names. This wrapper +# consults Encode::Alias (loaded by modules like Encode::Locale) to resolve +# custom aliases (coderef, regex, string) before delegating to Java. +{ + no warnings 'redefine'; + my %_resolving; # per-name recursion guard + *find_encoding = sub { + my ($name, $skip_external) = @_; + return undef unless defined $name; + + # Fast path: try the Java charset lookup first + my $enc = eval { $_java_find_encoding->($name) }; + return $enc if defined $enc; + + # Guard against circular alias chains for the same name + return undef if $_resolving{$name}; + local $_resolving{$name} = 1; + + # Consult Encode::Alias if it has been loaded + if (defined &Encode::Alias::find_alias) { + my $resolved = eval { Encode::Alias::find_alias("Encode", $name) }; + return $resolved if defined $resolved; + } + + return undef; + }; +} + +sub resolve_alias { + my ($name) = @_; + my $enc = find_encoding($name); + return unless defined $enc; + return ref($enc) ? $enc->{Name} : $enc; +} + 1; diff --git a/src/main/perl/lib/ExtUtils/MakeMaker.pm b/src/main/perl/lib/ExtUtils/MakeMaker.pm index f7bc8dc7e..7b964f2b6 100644 --- a/src/main/perl/lib/ExtUtils/MakeMaker.pm +++ b/src/main/perl/lib/ExtUtils/MakeMaker.pm @@ -252,74 +252,71 @@ sub _install_pure_perl { if (!%pm) { print "Warning: No installable files found (no .pm, .pl, .dat, etc.).\n"; print "Expected structure: lib/Your/Module.pm\n\n"; - return PerlOnJava::MM::Installed->new($args); + my $mm = PerlOnJava::MM::Installed->new($args); + _create_mymeta($name, $version, $args); + _create_install_makefile($name, $version, $args, {}, {}, $mm); + return $mm; } - print "\nInstalling to: $INSTALL_BASE\n\n"; + print "\nWill install to: $INSTALL_BASE\n\n"; - # Install .pm files - my $installed = 0; + # List files to be installed (deferred to 'make' for proper CPAN.pm dep resolution) for my $src (sort keys %pm) { - my $dest = $pm{$src}; - my $dir = dirname($dest); - - if (!-d $dir) { - make_path($dir) or warn "Failed to create $dir: $!\n"; - } - - print " $src -> $dest\n"; - if (copy($src, $dest)) { - $installed++; - } else { - warn " Failed to copy: $!\n"; + print " $src -> $pm{$src}\n"; + } + + # Collect share directory files (don't copy yet) + my %share_files = _build_share_file_mapping($name, $args); + if (%share_files) { + print "\nShare files:\n"; + for my $src (sort keys %share_files) { + print " $src -> $share_files{$src}\n"; } + %pm = (%pm, %share_files); } - # Install scripts + # Collect script files + my %scripts; if ($args->{EXE_FILES} && @{$args->{EXE_FILES}}) { - print "\nInstalling scripts:\n"; my $bin_dir = File::Spec->catdir($INSTALL_BASE, '..', 'bin'); - make_path($bin_dir) unless -d $bin_dir; - for my $script (@{$args->{EXE_FILES}}) { - my $dest = File::Spec->catfile($bin_dir, basename($script)); - print " $script -> $dest\n"; - copy($script, $dest) or warn " Failed to copy: $!\n"; + $scripts{$script} = File::Spec->catfile($bin_dir, basename($script)); + } + print "\nScripts:\n"; + for my $src (sort keys %scripts) { + print " $src -> $scripts{$src}\n"; } } - # Install share directories (File::ShareDir::Install support) - $installed += _install_share_dirs($name, $args); - print "\n"; - print "=" x 60, "\n"; - print "Installation complete! ($installed files installed)\n"; - print "=" x 60, "\n\n"; # Create MM object first (needed by postamble) my $mm = PerlOnJava::MM::Installed->new($args); - # Create a stub Makefile to satisfy CPAN.pm's check - _create_stub_makefile($name, $version, $args, $mm); - # Create MYMETA.yml for CPAN.pm dependency resolution _create_mymeta($name, $version, $args); + # Create Makefile with install commands (actual install deferred to 'make') + _create_install_makefile($name, $version, $args, \%pm, \%scripts, $mm); + + my $total = scalar(keys %pm) + scalar(keys %scripts); + print "=" x 60, "\n"; + print "Configured! $total files will be installed when 'make' runs.\n"; + print "=" x 60, "\n\n"; + return $mm; } -sub _install_share_dirs { +sub _build_share_file_mapping { my ($name, $args) = @_; - my $installed = 0; + my %files; # Check if File::ShareDir::Install was used - return 0 unless @File::ShareDir::Install::DIRS; + return %files unless eval { @File::ShareDir::Install::DIRS }; # Convert module name to dist name (My::Module -> My-Module) (my $dist_name = $name) =~ s/::/-/g; - print "\nInstalling share directories:\n"; - for my $def (@File::ShareDir::Install::DIRS) { my $type = $def->{type} || 'dist'; next if $type =~ /^delete/; # Skip delete directives @@ -332,7 +329,7 @@ sub _install_share_dirs { @src_dirs = ($def->{dir}); } - # Handle directory specification (scan and copy all files) + # Scan files (don't copy yet - deferred to 'make') for my $src_dir (@src_dirs) { next unless -d $src_dir; @@ -354,23 +351,14 @@ sub _install_share_dirs { my $src = $File::Find::name; (my $rel = $src) =~ s{^\Q$src_dir\E/?}{}; - my $dest = File::Spec->catfile($dest_base, $rel); - my $dest_dir = dirname($dest); - make_path($dest_dir) unless -d $dest_dir; - - if (copy($src, $dest)) { - $installed++; - } else { - warn " Failed to copy $src: $!\n"; - } + $files{$src} = File::Spec->catfile($dest_base, $rel); }, no_chdir => 1, }, $src_dir); } } - print " Installed $installed share files\n" if $installed; - return $installed; + return %files; } sub _extract_version { @@ -393,16 +381,17 @@ sub _extract_version { return '0'; } -sub _create_stub_makefile { - my ($name, $version, $args, $mm) = @_; +sub _create_install_makefile { + my ($name, $version, $args, $pm, $scripts, $mm) = @_; - # Create a minimal Makefile that CPAN.pm can parse - # This allows CPAN.pm to proceed through its make/test/install workflow + # Create a Makefile that actually installs files when 'make' runs. + # This defers installation to after CPAN.pm has resolved and installed + # dependencies, enabling proper dependency resolution for any CPAN module. # Respect custom MAKEFILE name if provided my $makefile = $args->{MAKEFILE} || 'Makefile'; open my $fh, '>', $makefile or do { - warn "Note: Could not create stub Makefile: $!\n"; + warn "Note: Could not create Makefile: $!\n"; return; }; @@ -427,10 +416,40 @@ sub _create_stub_makefile { my $inst_lib = $args->{INST_LIB} || 'blib/lib'; my $installsitelib = $args->{INSTALLSITELIB} || $INSTALL_BASE; - # Minimal Makefile that works with CPAN.pm + # Build install commands for module/data/share files + my @install_cmds; + my %dirs_seen; + for my $src (sort keys %$pm) { + my $dest = $pm->{$src}; + my $dir = dirname($dest); + unless ($dirs_seen{$dir}++) { + push @install_cmds, _shell_mkdir($dir); + } + push @install_cmds, _shell_cp($src, $dest); + } + + # Build install commands for scripts + my @script_cmds; + if ($scripts && %$scripts) { + my %sdirs; + for my $src (sort keys %$scripts) { + my $dest = $scripts->{$src}; + my $dir = dirname($dest); + unless ($sdirs{$dir}++) { + push @script_cmds, _shell_mkdir($dir); + } + push @script_cmds, _shell_cp($src, $dest); + } + } + + my $install_cmds_str = join("\n", @install_cmds) || "\t\@true"; + my $script_cmds_str = join("\n", @script_cmds) || "\t\@true"; + my $file_count = scalar(keys %$pm) + scalar(keys %$scripts); + print $fh <<"MAKEFILE"; -# Stub Makefile for PerlOnJava -# This module was installed directly without 'make' +# Makefile generated by PerlOnJava MakeMaker +# Files are installed during 'make' (not during Makefile.PL) +# This enables CPAN.pm to resolve dependencies before installation NAME = $name DISTNAME = $distname @@ -442,9 +461,16 @@ INSTALLSITELIB = $installsitelib NOECHO = \@ RM_RF = rm -rf -# PerlOnJava installs modules directly - 'all' runs config for share directories -all: config -\t\@echo "PerlOnJava: Module already installed" +all: pm_to_blib config +\t\@echo "PerlOnJava: $name v$version installed ($file_count files)" + +# Copy module and data files to installation directory +pm_to_blib: +$install_cmds_str + +# Install executable scripts +install_scripts: +$script_cmds_str # Use double-colon for config to allow postamble to add rules config:: @@ -452,10 +478,8 @@ config:: test: \t$test_cmd -install: all -\t\@echo "PerlOnJava: Installing to \$(INSTALLSITELIB)" -\t\@mkdir -p \$(INSTALLSITELIB)/auto -\t-\@cp -r \$(INST_LIB)/auto/share \$(INSTALLSITELIB)/auto/ 2>/dev/null || true +install: all install_scripts +\t\@echo "PerlOnJava: $name installed to \$(INSTALLSITELIB)" clean: \t\@echo "PerlOnJava: Nothing to clean" @@ -464,7 +488,7 @@ realclean: clean distclean: clean -.PHONY: all config test install clean realclean distclean +.PHONY: all pm_to_blib config test install clean realclean distclean install_scripts MAKEFILE # Call MY::postamble if it exists (File::ShareDir::Install uses this) @@ -480,6 +504,21 @@ MAKEFILE close $fh; } +# Helper: generate a shell mkdir -p command for Makefile +sub _shell_mkdir { + my ($dir) = @_; + $dir =~ s/'/'\\''/g; # escape single quotes + return "\t\@mkdir -p '$dir'"; +} + +# Helper: generate a shell cp command for Makefile +sub _shell_cp { + my ($src, $dest) = @_; + $src =~ s/'/'\\''/g; + $dest =~ s/'/'\\''/g; + return "\t\@cp '$src' '$dest'"; +} + sub _create_mymeta { my ($name, $version, $args) = @_; diff --git a/src/main/perl/lib/IO/Socket/INET.pm b/src/main/perl/lib/IO/Socket/INET.pm index 45fd139d9..527d6a95b 100644 --- a/src/main/perl/lib/IO/Socket/INET.pm +++ b/src/main/perl/lib/IO/Socket/INET.pm @@ -16,7 +16,7 @@ use Errno; our @ISA = qw(IO::Socket); our $VERSION = "1.56"; -my $EINVAL = exists(&Errno::EINVAL) ? Errno::EINVAL() : 1; +my $EINVAL = eval { Errno::EINVAL() } || 1; IO::Socket::INET->register_domain( AF_INET ); diff --git a/src/main/perl/lib/JSON.pm b/src/main/perl/lib/JSON.pm index 4df7fa8d6..2baa3cf13 100644 --- a/src/main/perl/lib/JSON.pm +++ b/src/main/perl/lib/JSON.pm @@ -1,5 +1,7 @@ package JSON; +our $VERSION = '4.11'; + use Exporter "import"; use warnings; use strict; diff --git a/src/main/perl/lib/MIME/Base64.pm b/src/main/perl/lib/MIME/Base64.pm index 031566a90..c4956d0e4 100644 --- a/src/main/perl/lib/MIME/Base64.pm +++ b/src/main/perl/lib/MIME/Base64.pm @@ -11,6 +11,8 @@ package MIME::Base64; # The implementation is in: src/main/java/org/perlonjava/perlmodule/MimeBase64.java # +our $VERSION = '3.16'; + use XSLoader; XSLoader::load( 'MIME::Base64' ); diff --git a/src/main/perl/lib/PerlIO/encoding.pm b/src/main/perl/lib/PerlIO/encoding.pm new file mode 100644 index 000000000..4998fee3d --- /dev/null +++ b/src/main/perl/lib/PerlIO/encoding.pm @@ -0,0 +1,29 @@ +package PerlIO::encoding; + +use strict; +use warnings; + +our $VERSION = '0.30'; + +# Fallback flags for encoding error handling. +# Modules check this variable to decide how to handle encoding errors. +our $fallback = 0; # Encode::FB_QUIET equivalent + +# In standard Perl, PerlIO::encoding is an XS module that provides +# the :encoding() PerlIO layer. In PerlOnJava, IO layers are handled +# by the Java LayeredIOHandle implementation. This stub provides the +# package variables that other modules (IO::HTML, etc.) expect. + +1; +__END__ + +=head1 NAME + +PerlIO::encoding - encoding layer stub for PerlOnJava + +=head1 DESCRIPTION + +Stub module providing the C<$PerlIO::encoding::fallback> variable. +The actual encoding layer functionality is implemented in the Java backend. + +=cut diff --git a/src/main/perl/lib/Template/Stash/XS.pm b/src/main/perl/lib/Template/Stash/XS.pm new file mode 100644 index 000000000..e3a5ff5c6 --- /dev/null +++ b/src/main/perl/lib/Template/Stash/XS.pm @@ -0,0 +1,12 @@ +package Template::Stash::XS; + +# PerlOnJava: XS Stash is not available. Fall back to the pure Perl +# Template::Stash which provides identical functionality. + +use strict; +use warnings; +use Template::Stash; + +our @ISA = ('Template::Stash'); + +1;