Add XML::Parser implementation plan (Java XS via JDK SAX) #455

Closed
fglock wants to merge 7 commits into master from feature/xml-parser-plan

Conversation

fglock (Owner) commented Apr 7, 2026

Summary

  • Add dev/modules/xml_parser.md — comprehensive plan for implementing XML::Parser on PerlOnJava
  • Update dev/modules/README.md with cross-reference

Approach

Implement XML::Parser::Expat as a Java XS class (XMLParserExpat.java) using the JDK's built-in javax.xml.parsers.SAXParser as the XML parsing engine. No new Maven dependencies required.

Key design decisions

  • Reuse CPAN's Parser.pm (pure Perl) — only replace the XS backend
  • Bundled Expat.pm shim in jar:PERL5LIB that delegates to Java via XSLoader
  • JDK SAX maps cleanly to expat's 20 handler types (Start, End, Char, Comment, PI, CDATA, DTD declarations, etc.)
  • SAX Locator provides line/column; byte offsets approximated or stubbed
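
The mapping from SAX callbacks to expat-style handlers described above could look roughly like the following. This is a minimal sketch, not the code from `XMLParserExpat.java`: the class name `ExpatEventSketch` and the string event log are hypothetical, and a real backend would invoke Perl callbacks instead of appending to a list. The event labels follow XML::Parser's handler names (where the processing-instruction handler is called `Proc`).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical sketch: records JDK SAX callbacks under expat-style handler names.
class ExpatEventSketch extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();
    private Locator locator; // supplied by the parser; gives line/column only

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // SAX reports line/column, not byte offsets, so only the line is logged here.
        String where = (locator == null) ? "" : " @line " + locator.getLineNumber();
        events.add("Start " + qName + where);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("End " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text run in several chunks, much like expat's Char handler.
        events.add("Char " + new String(ch, start, length));
    }

    @Override
    public void processingInstruction(String target, String data) {
        events.add("Proc " + target);
    }

    // The next three callbacks come from LexicalHandler (via DefaultHandler2).
    @Override
    public void comment(char[] ch, int start, int length) {
        events.add("Comment " + new String(ch, start, length));
    }

    @Override
    public void startCDATA() {
        events.add("CdataStart");
    }

    @Override
    public void endCDATA() {
        events.add("CdataEnd");
    }

    static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ExpatEventSketch handler = new ExpatEventSketch();
        // Comments and CDATA boundaries are only reported if a lexical handler is registered.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        parse("<a>hi<!--c--><?pi x?></a>").forEach(System.out::println);
    }
}
```

Extending `DefaultHandler2` keeps the content, lexical, and declaration callbacks in one class, which is one plausible way to cover the DTD declaration handlers as well.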

Scope

  • 47 test files in XML::Parser 2.56
  • 55 XS functions to implement
  • 5 implementation phases (infrastructure → core parsing → DTD → advanced → polish)
  • Unblocks XML::Simple's XML::Parser tests, XML::SAX::Expat, XML::Twig, XML::RSS, etc.

Test plan

  • This PR is documentation only — no code changes
  • Plan reviewed for completeness and feasibility

Generated with Devin

fglock force-pushed the feature/xml-parser-plan branch 2 times, most recently from dd298dd to e7f4bf0 (April 7, 2026 14:58)
fglock and others added 7 commits April 7, 2026 17:04
Implements XML::Parser::Expat as a Java XS module using JDK's built-in
SAX parser instead of the native expat C library.

Key features:
- Full SAX-based parsing with Start/End/Char/PI/Comment handlers
- Namespace support using dualvar scalars (string=localname, int=ns_index)
  matching expat's gen_ns_name() dual PV/IV behavior
- XMLDecl, element/attlist declaration handlers
- Namespace prefix tracking (new_ns_prefixes, expand_ns_prefix, current_ns_prefixes)
- Error string mapping, ExpatVersion, security API stubs
- Byte position tracking via accumulated token lengths
- CPAN::Distribution helpers for XS module fallback installation

Test results: 24 of 47 XML::Parser tests pass, including all 15 namespace
tests. No unit test regressions.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Plan to implement XML::Parser::Expat as a Java XS class using the JDK
built-in javax.xml.parsers.SAXParser as the XML engine. No new Maven
dependencies required.

Covers: 47 test files, 55 XS functions, 20 handler types, 5 phases.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Major improvements to XML::Parser Java XS implementation:

- Fix UTF-8 double-encoding: use ISO_8859_1 for BYTE_STRING input
  to avoid re-encoding raw UTF-8 bytes (fixes utf8_handling.t,
  debug_multibyte.t - 32 tests)
- Fix specified vs defaulted attributes: use Attributes2.isSpecified()
  to separate and reorder attributes, matching expat convention
  (fixes defaulted.t - 4 tests)
- Fix error messages: format SAX errors as "not well-formed (invalid
  token)" with escaping hints, matching libexpat's output format
  (fixes error_hint.t - 5 tests)
- Fix systemId resolution: un-resolve SAX-resolved absolute URIs back
  to relative paths by tracking parseBaseUri on InputSource
  (fixes decl.t tests 5/35 - 44 tests now pass)
- Fix string interpolation: support ${ref}{key} subscript access
  after braced variable expressions in double-quoted strings
  (fixes styles.t Objects style - 11 tests)
- Fix IO handle class detection: treat GLOB ref class as IO::Handle
  for input_record_separator calls (fixes stream.t partial)
- Fix MakeMaker BASEEXT scanning: recursively find .pm files in the
  module base directory for Style submodule installation
- Fix extern_ent_lexical_glob.t: handle file:/path compact URI form

XML::Parser test results: 35/47 files pass (74%), 365/385 subtests (95%)
Previously: 29/47 files, ~262/308 subtests

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
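
The UTF-8 double-encoding fix in the commit above can be illustrated in isolation. The sketch below assumes a byte string is held as a Java `String` whose characters each carry one raw byte value (0..255). That representation is an assumption made for illustration, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of why ISO_8859_1 avoids double-encoding a byte string.
class ByteStringSketch {
    // Maps each char 0..255 back to the single byte it stands for.
    static byte[] toBytesLatin1(String byteString) {
        return byteString.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Re-encodes every char >= 0x80 as two bytes, corrupting data that was already UTF-8.
    static byte[] toBytesUtf8(String byteString) {
        return byteString.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00E9 is the two bytes C3 A9, held here as two chars.
        String raw = "\u00C3\u00A9";
        System.out.println(toBytesLatin1(raw).length); // 2: original bytes preserved
        System.out.println(toBytesUtf8(raw).length);   // 4: bytes were encoded a second time
    }
}
```

Decoding the two preserved bytes as UTF-8 yields the single character U+00E9 again, which is what the XML parser should see.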
Documents architecture, test status (35/47 pass, 95% subtests),
known limitations, and TODO items including self-closing tag
column recognition fix.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Fixes:
- Stream delimiter parsing: read line-by-line via readline() respecting
  $/ set by Expat.pm, enabling resumable delimited stream parsing
- Self-closing tag detection: scan inputBytes to detect <foo/> vs
  <foo></foo> for correct column tracking in both start and end handlers
- Entity expansion tracking: use startEntity/endEntity from LexicalHandler
  to set original_string to unexpanded entity ref (e.g. "&draft.day;")
- ExternEntFin handler: now called for both filehandle and string returns
  from ExternEnt handler
- Element index stack: maintain per-element index via push/pop so
  element_index returns same value in start and end handlers
- ProtocolEncoding: store and apply encoding from ParserCreate to
  InputSource, fixing ISO-8859-1 encoded documents
- PositionContext: implement position_in_context() returning surrounding
  lines and correct linepos for pointer insertion
- ParseParamEnt: conditionally enable external-parameter-entities and
  load-external-dtd SAX features based on ParseParamEnt option
- Entity resolver: preserve systemId on returned InputSource so SAX can
  resolve relative references within external DTDs
- Context pop order: pop Context array AFTER end handler callback,
  matching libexpat behavior for depth() consistency

Test results: 41/47 files pass (377/397 subtests, 95.0%)
Newly passing: astress.t, g_void.t, partial.t, stream.t,
  position_overflow.t, parament_internal.t

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
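
The ParseParamEnt item in the commit above toggles two SAX features. A minimal sketch of that toggle follows, assuming the JDK's built-in parser: the `load-external-dtd` feature URI is Xerces-specific, so other SAX implementations may reject it. The class and method names are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical sketch: derive SAX feature settings from a ParseParamEnt-style flag.
class ParamEntFeatureSketch {
    static final String EXT_PARAM_ENTITIES =
        "http://xml.org/sax/features/external-parameter-entities";
    // Xerces-specific feature; recognized by the parser bundled with the JDK.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static SAXParserFactory configure(boolean parseParamEnt) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // With the flag off, the external DTD subset and parameter entities are not fetched.
        factory.setFeature(EXT_PARAM_ENTITIES, parseParamEnt);
        factory.setFeature(LOAD_EXTERNAL_DTD, parseParamEnt);
        return factory;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory off = configure(false);
        System.out.println(off.getFeature(EXT_PARAM_ENTITIES));
        System.out.println(off.getFeature(LOAD_EXTERNAL_DTD));
    }
}
```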
Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Documents approach for handling expat-specific encoding names
(x-sjis-unicode -> Shift_JIS) that JDK SAX does not support natively.
Covers encoding.t, parament.t, and decl.t test improvements.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
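
One way to handle the expat-specific encoding names mentioned in the commit above is a small alias table consulted before the name is handed to the JDK. The sketch below is illustrative only: it contains just the one mapping named in the commit message, and the class and method names are hypothetical. Whether `Shift_JIS` is available at runtime depends on the JDK's extended charsets being installed.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: translate expat-only encoding names to names the JDK knows.
class EncodingAliasSketch {
    // Only the mapping mentioned in the commit message; a real table would likely have more.
    private static final Map<String, String> ALIASES =
        Map.of("x-sjis-unicode", "Shift_JIS");

    static String toJavaName(String declared) {
        if (declared == null) {
            return null;
        }
        // Encoding names are matched case-insensitively; unknown names pass through unchanged.
        return ALIASES.getOrDefault(declared.toLowerCase(Locale.ROOT), declared);
    }

    public static void main(String[] args) {
        System.out.println(toJavaName("x-sjis-unicode")); // Shift_JIS
        System.out.println(toJavaName("UTF-8"));          // UTF-8
    }
}
```

The translated name could then be applied with `InputSource.setEncoding`, in the same way the ProtocolEncoding fix applies an encoding to the `InputSource`.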
fglock force-pushed the feature/xml-parser-plan branch from e7f4bf0 to c90c7be (April 7, 2026 15:05)
fglock (Owner, Author) commented Apr 7, 2026

Replacing with new PR from feature/xml-parser branch (consolidating branches).

fglock closed this on Apr 7, 2026
fglock deleted the feature/xml-parser-plan branch on April 7, 2026 15:07