Add XML::Parser Java XS implementation (JDK SAX backend) #457

Merged
fglock merged 14 commits into master from feature/xml-parser on Apr 7, 2026

@fglock fglock commented Apr 7, 2026

Summary

  • Implement XML::Parser::Expat as a Java XS class (XMLParserExpat.java) using JDK's built-in SAX parser
  • Reuse CPAN's Parser.pm (pure Perl) — only replace the XS backend
  • Bundled Expat.pm shim in jar:PERL5LIB that delegates to Java via XSLoader
  • 41/47 test files pass (87%); 377/397 subtests pass (95%)

Key features implemented

  • All 20 handler types (Start, End, Char, Comment, PI, CDATA, DTD, etc.)
  • Stream/string/file parsing with delimiter support
  • External entity resolution with ExternEnt/ExternEntFin handlers
  • Entity expansion tracking and originalString
  • ParseParamEnt with conditional SAX features
  • ProtocolEncoding support
  • Position tracking (line, column, byte offset, position_in_context)
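
As a rough illustration of how expat-style Start/End/Char handlers can be fed from the JDK SAX parser, here is a minimal sketch. The class name `SaxEventSketch` and the `Start:`/`End:`/`Char:` event strings are hypothetical and are not taken from XMLParserExpat.java; only the `javax.xml.parsers` and `org.xml.sax` calls are standard JDK API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: records expat-style events emitted by the JDK SAX parser.
final class SaxEventSketch extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("Start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("End:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks; each chunk is one event here.
        events.add("Char:" + new String(ch, start, length));
    }

    static List<String> parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            SaxEventSketch handler = new SaxEventSketch();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("<doc><item>hi</item></doc>"));
    }
}
```

A real backend would call the registered Perl handlers in place of appending to a list.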

Remaining (6 test files)

  • encoding.t: Custom encoding mapping (x-sjis-unicode) — Phase 4 planned in design doc
  • foreign_dtd.t: UseForeignDTD not implemented
  • parament.t / decl.t: Blocked by x-sjis-unicode in foo.dtd
  • checklib_*.t: Devel::CheckLib stubs, not XML-related

Test plan

  • make passes (all unit tests)
  • 41/47 XML::Parser test files pass
  • No regressions in existing tests

Replaces #455 (branch consolidation).

Generated with Devin

fglock and others added 14 commits April 7, 2026 17:04
Implements XML::Parser::Expat as a Java XS module using JDK's built-in
SAX parser instead of the native expat C library.

Key features:
- Full SAX-based parsing with Start/End/Char/PI/Comment handlers
- Namespace support using dualvar scalars (string=localname, int=ns_index)
  matching expat's gen_ns_name() dual PV/IV behavior
- XMLDecl, element/attlist declaration handlers
- Namespace prefix tracking (new_ns_prefixes, expand_ns_prefix, current_ns_prefixes)
- Error string mapping, ExpatVersion, security API stubs
- Byte position tracking via accumulated token lengths
- CPAN::Distribution helpers for XS module fallback installation

Test results: 24 of 47 XML::Parser tests pass, including all 15 namespace
tests. No unit test regressions.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Plan to implement XML::Parser::Expat as a Java XS class using the JDK
built-in javax.xml.parsers.SAXParser as the XML engine. No new Maven
dependencies required.

Covers: 47 test files, 55 XS functions, 20 handler types, 5 phases.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Major improvements to XML::Parser Java XS implementation:

- Fix UTF-8 double-encoding: use ISO_8859_1 for BYTE_STRING input
  to avoid re-encoding raw UTF-8 bytes (fixes utf8_handling.t,
  debug_multibyte.t - 32 tests)
- Fix specified vs defaulted attributes: use Attributes2.isSpecified()
  to separate and reorder attributes, matching expat convention
  (fixes defaulted.t - 4 tests)
- Fix error messages: format SAX errors as not well-formed (invalid
  token) with escaping hints, matching libexpat output format
  (fixes error_hint.t - 5 tests)
- Fix systemId resolution: un-resolve SAX-resolved absolute URIs back
  to relative paths by tracking parseBaseUri on InputSource
  (fixes decl.t tests 5/35 - 44 tests now pass)
- Fix string interpolation: support ${ref}{key} subscript access
  after braced variable expressions in double-quoted strings
  (fixes styles.t Objects style - 11 tests)
- Fix IO handle class detection: treat GLOB ref class as IO::Handle
  for input_record_separator calls (fixes stream.t partial)
- Fix MakeMaker BASEEXT scanning: recursively find .pm files in the
  module base directory for Style submodule installation
- Fix extern_ent_lexical_glob.t: handle file:/path compact URI form
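
The UTF-8 double-encoding item above can be illustrated with a small, self-contained sketch. It assumes a "byte string" is a Java String whose chars each stand for one raw byte (0..255); that representation and the class name `ByteStringSketch` are assumptions for illustration, not code from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of why ISO_8859_1 preserves raw bytes while UTF_8 re-encodes them.
final class ByteStringSketch {
    // Wrap raw bytes in a String, one char per byte.
    static String toByteString(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Correct: every char 0..255 maps back to exactly one byte.
    static byte[] preservedBytes(String byteString) {
        return byteString.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Wrong for byte strings: chars >= 0x80 become two bytes each (double encoding).
    static byte[] doubleEncodedBytes(String byteString) {
        return byteString.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "\u00e9".getBytes(StandardCharsets.UTF_8); // U+00E9 as UTF-8: C3 A9
        String byteString = toByteString(raw);
        System.out.println(Arrays.equals(raw, preservedBytes(byteString)));
        System.out.println(doubleEncodedBytes(byteString).length);
    }
}
```

The parser should receive the preserved bytes, so the document's own UTF-8 sequences reach it unchanged.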

XML::Parser test results: 35/47 files pass (74%), 365/385 subtests (95%)
Previously: 29/47 files, ~262/308 subtests

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Documents architecture, test status (35/47 pass, 95% subtests),
known limitations, and TODO items including self-closing tag
column recognition fix.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Fixes:
- Stream delimiter parsing: read line-by-line via readline() respecting
  $/ set by Expat.pm, enabling resumable delimited stream parsing
- Self-closing tag detection: scan inputBytes to detect <foo/> vs
  <foo></foo> for correct column tracking in both start and end handlers
- Entity expansion tracking: use startEntity/endEntity from LexicalHandler
  to set original_string to unexpanded entity ref (e.g. "&draft.day;")
- ExternEntFin handler: now called for both filehandle and string returns
  from ExternEnt handler
- Element index stack: maintain per-element index via push/pop so
  element_index returns same value in start and end handlers
- ProtocolEncoding: store and apply encoding from ParserCreate to
  InputSource, fixing ISO-8859-1 encoded documents
- PositionContext: implement position_in_context() returning surrounding
  lines and correct linepos for pointer insertion
- ParseParamEnt: conditionally enable external-parameter-entities and
  load-external-dtd SAX features based on ParseParamEnt option
- Entity resolver: preserve systemId on returned InputSource so SAX can
  resolve relative references within external DTDs
- Context pop order: pop Context array AFTER end handler callback,
  matching libexpat behavior for depth() consistency
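
The entity expansion tracking item above might look roughly like the following sketch. It uses the standard `LexicalHandler` callbacks through `DefaultHandler2`; the class name `EntityTrackingSketch`, the `originalStrings` list, and the handling of only one level of entity nesting are simplifying assumptions, not the PR's actual code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical sketch: report "&name;" instead of the expanded text while inside an entity.
final class EntityTrackingSketch extends DefaultHandler2 {
    final List<String> originalStrings = new ArrayList<>();
    private String currentEntity; // non-null while inside a general entity expansion

    @Override
    public void startEntity(String name) {
        // Ignore the "[dtd]" pseudo-entity and parameter entities ("%name").
        if (!name.startsWith("[") && !name.startsWith("%")) {
            currentEntity = name;
        }
    }

    @Override
    public void endEntity(String name) {
        if (name.equals(currentEntity)) {
            currentEntity = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        originalStrings.add(currentEntity != null ? "&" + currentEntity + ";" : text);
    }

    static List<String> parse(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            EntityTrackingSketch handler = new EntityTrackingSketch();
            reader.setContentHandler(handler);
            // Standard SAX2 property for registering a LexicalHandler.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.originalStrings;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE doc [<!ENTITY draft.day \"10\">]><doc>on &draft.day;</doc>";
        System.out.println(parse(xml));
    }
}
```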

Test results: 41/47 files pass (377/397 subtests, 95.0%)
Newly passing: astress.t, g_void.t, partial.t, stream.t,
  position_overflow.t, parament_internal.t
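
For the ParseParamEnt item, a minimal sketch of toggling the two features could look like this. The feature URIs are the standard SAX2 and Xerces identifiers that the JDK's built-in parser is assumed to recognize; the class and method names are hypothetical, and the way the real code reads the ParseParamEnt option is not shown in the commit message.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical sketch: enable external DTD and parameter entity loading only on request.
final class ParamEntFeaturesSketch {
    static final String EXTERNAL_PARAMETER_ENTITIES =
            "http://xml.org/sax/features/external-parameter-entities";
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static XMLReader newReader(boolean parseParamEnt) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(EXTERNAL_PARAMETER_ENTITIES, parseParamEnt);
            reader.setFeature(LOAD_EXTERNAL_DTD, parseParamEnt);
            return reader;
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("cannot configure SAX reader", e);
        }
    }

    static boolean isEnabled(XMLReader reader, String feature) {
        try {
            return reader.getFeature(feature);
        } catch (SAXException e) {
            throw new IllegalStateException("unknown feature: " + feature, e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader(false);
        System.out.println(isEnabled(reader, EXTERNAL_PARAMETER_ENTITIES));
        System.out.println(isEnabled(reader, LOAD_EXTERNAL_DTD));
    }
}
```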

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Documents approach for handling expat-specific encoding names
(x-sjis-unicode -> Shift_JIS) that JDK SAX does not support natively.
Covers encoding.t, parament.t, and decl.t test improvements.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
xml_parser.md now points to dev/design/xml_parser_xs.md as the single
source of truth for progress tracking and TODOs. Updated stale status
from 'Not yet started' to '41/47 tests pass (95%)'.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Phase 4 encoding conversion:
- Map expat-specific encoding names (x-sjis-unicode, x-euc-jp-unicode)
  to JDK charsets (Shift_JIS, EUC-JP)
- Pre-parse encoding detection and byte re-encoding to UTF-8
- Applied in ParseString, ParseStream, ParseDone, resolveEntity, doParse
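
The name mapping and re-encoding step can be sketched as below. Only the two mappings named in this commit message are included; the class and method names are hypothetical, and the sketch leaves out any rewriting of the `encoding="..."` pseudo-attribute in the XML declaration, which a real implementation would presumably also need.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: translate expat-specific encoding names and re-encode input to UTF-8.
final class ExpatEncodingSketch {
    private static final Map<String, String> EXPAT_TO_JDK = Map.of(
            "x-sjis-unicode", "Shift_JIS",
            "x-euc-jp-unicode", "EUC-JP");

    // Returns the JDK charset name for an expat name, or the input unchanged if unmapped.
    static String jdkCharsetName(String declared) {
        return EXPAT_TO_JDK.getOrDefault(declared.toLowerCase(Locale.ROOT), declared);
    }

    // Decodes the bytes with the mapped charset and encodes the text again as UTF-8.
    static byte[] toUtf8(byte[] input, String declaredEncoding) {
        Charset source = Charset.forName(jdkCharsetName(declaredEncoding));
        return new String(input, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] sjis = {(byte) 0x82, (byte) 0xA0}; // Hiragana "a" (U+3042) in Shift_JIS
        byte[] utf8 = toUtf8(sjis, "x-sjis-unicode");
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals("\u3042"));
    }
}
```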

Tail call trampoline in RuntimeCode.apply():
- Handle goto &func returning TAILCALL control flow from static callers
- Needed for XML::Parser initial_ext_ent_handler which uses goto &func
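
The general trampoline idea can be sketched as follows. This is not the code in RuntimeCode.apply(): the `Result` type stands in for whatever TAILCALL marker the runtime uses, and all names here are hypothetical.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a tail call trampoline: the caller loops instead of recursing.
final class TrampolineSketch {
    /** Either a final value or a pending tail call. */
    static final class Result {
        final Object value;
        final Supplier<Result> next; // non-null: the callee requested a tail call

        private Result(Object value, Supplier<Result> next) {
            this.value = value;
            this.next = next;
        }

        static Result done(Object value) {
            return new Result(value, null);
        }

        static Result tailCall(Supplier<Result> next) {
            return new Result(null, next);
        }
    }

    /** Runs the code, then keeps resolving tail calls in a loop so the Java stack stays flat. */
    static Object apply(Supplier<Result> code) {
        Result result = code.get();
        while (result.next != null) {
            result = result.next.get();
        }
        return result.value;
    }

    // Example callee that "tail calls" itself n times.
    static Result countDown(int n) {
        return n == 0 ? Result.done("done") : Result.tailCall(() -> countDown(n - 1));
    }

    public static void main(String[] args) {
        System.out.println(apply(() -> countDown(1_000_000))); // prints "done"
    }
}
```

Each tail call returns to the loop before the next one starts, so even very deep chains do not overflow the Java stack.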

Test results: 43/47 files pass (97.7% subtests), up from 41/47 (95%)
- encoding.t: 0 -> 43/43
- parament.t: 4/13 -> 13/13

Generated with Devin (https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Document root causes, exact line numbers, and suggested fixes for:
- decl.t: NOTATION type off-by-one bug (substring(8) should be 9)
  and missing XMLDecl for external entity text declarations
- foreign_dtd.t: UseForeignDTD not implemented, with 3 approaches
- checklib_findcc.t: stub inc/Devel/CheckLib.pm lacks source patterns
- checklib_tmpdir.t: same stub, missing tempfile/mktemp calls

Generated with Devin (https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Phase 5 final fixes:
- NOTATION type format: fix off-by-one (substring(8) -> substring(9))
- XMLDecl text declarations: fire handler from resolveEntity() for
  external entity text declarations (version=undef, original encoding)
- UseForeignDTD: synthesize ExternEnt handler call with undef sysid/pubid,
  inject DOCTYPE with synthetic system ID, resolve in resolveEntity()
- Error messages: map SAX 'was referenced, but not declared' to expat
  'undefined entity' format
- Devel::CheckLib: replaced stub with real upstream source (not tracked)
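
The NOTATION off-by-one is easy to see in isolation. The sketch assumes the input is the SAX `DeclHandler.attributeDecl` type string, which for notation attributes is the word `NOTATION`, a space, and a parenthesized token group; the class and method names are hypothetical, and what the real code does with the extracted group is not shown here.

```java
// Hypothetical sketch: "NOTATION " is 9 characters including the space,
// so substring(8) keeps a stray leading space and substring(9) does not.
final class NotationTypeSketch {
    private static final String PREFIX = "NOTATION ";

    // Returns the token group, e.g. "(gif|jpeg)", or the input unchanged for other types.
    static String tokenGroup(String saxType) {
        return saxType.startsWith(PREFIX) ? saxType.substring(PREFIX.length()) : saxType;
    }

    public static void main(String[] args) {
        String saxType = "NOTATION (gif|jpeg)";
        System.out.println("[" + saxType.substring(8) + "]"); // prints "[ (gif|jpeg)]"
        System.out.println("[" + tokenGroup(saxType) + "]");  // prints "[(gif|jpeg)]"
    }
}
```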

Generated with Devin (https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Add XML::Parser 2.56 PM files to bundled lib:
- XML/Parser.pm (upstream, unmodified)
- XML/Parser/Style/{Debug,Objects,Stream,Subs,Tree}.pm
- XML/Parser/LWPExternEnt.pl (optional LWP entity handler)

XML::Parser::Expat.pm (Java SAX-backed shim) was already bundled.
All dependencies (Carp, XSLoader, File::Spec, IO::Handle, etc.)
were already bundled. No new dependencies needed.

Update docs:
- README.md: add XML::Parser to module list
- changelog.md: add to v5.42.3 module list
- feature-matrix.md: add to non-core modules section

Generated with Devin (https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Add a JUnit 5 test runner (ModuleTestExecutionTest.java) for bundled
CPAN module tests. Tests live under src/test/resources/module/{Name}/
and are executed with chdir to the module directory so relative paths
resolve correctly.

- 45 XML::Parser tests stored in module/XML-Parser/{t/,samples/}
  (2 Devel::CheckLib C compiler tests excluded as irrelevant)
- Gradle task 'testModule' with @Tag("module") filter
- Makefile target 'make test-bundled-modules'
- .gitignore exception for src/test/resources/module/*/t/

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…script

In Perl, explicit braces in ${var} terminate the variable name, so
subsequent [...] should be a character class (regex) or literal text
(string), not an array subscript. Only deref expressions like
${$ref}[0] should parse subscripts after braces.

Before: qr/${var}[$idx]/ → treated [$idx] as $var[$idx] (wrong)
After:  qr/${var}[$idx]/ → scalar $var + char class [$idx] (correct)

This also fixes "${arr}[0]" in strings, which was incorrectly
producing $arr[0] instead of $arr followed by literal "[0]".

Fixes unit/regex/array_element_strict_vs_nonstrict.t CI failure.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock merged commit 6565ef3 into master Apr 7, 2026
2 checks passed
@fglock fglock deleted the feature/xml-parser branch April 7, 2026 18:09