feat(XML::LibXML): bundled Java-backed implementation (t/10ns.t 137/137)#654
Merged
Add several missing methods and fix several bugs:
New methods/aliases:
- getElementsByTagName/NS/LocalName on Document nodes (in addition to Element)
- getChildrenByTagName, getChildrenByLocalName, getChildrenByTagNameNS
- createPI alias for createProcessingInstruction
- actualEncoding alias for documentEncoding
- docToFH (toFH method) to write document to a Perl filehandle
- parse_string now accepts scalar refs (\$xml) and binary UTF-16/UTF-32 XML
Bug fixes:
- getElementsByLocalName on Document now includes document element itself
- parse_string now dereferences unblessed scalar refs (\$string)
- parse_string detects binary-encoded XML (UTF-16/UTF-32 with or without BOM)
and parses it as a byte stream so the encoding is detected correctly
- serializeNode (toString) now removes encoding= attribute when setEncoding()
was called with no args (explicit encoding clear)
- serializeNode adds newline after XML declaration to match libxml2 output
- setEncoding/getEncoding now correctly store parsed document encoding
- Correct UDATA_ENCODING sentinel handling
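The binary-encoding detection described for parse_string can be illustrated with a minimal sketch. It is not the code from this PR: `XmlEncodingSniffer` and `sniff` are hypothetical names, and the byte patterns follow the well-known autodetection table from the XML 1.0 specification (Appendix F). Without a BOM, the sketch only recognises input that starts with `<` (UTF-32) or `<?` (UTF-16).

```java
/** Hypothetical helper: guesses whether raw bytes hold UTF-16/UTF-32 XML. */
final class XmlEncodingSniffer {
    private XmlEncodingSniffer() {}

    /** Returns an encoding name, or null when the input looks like an 8-bit encoding. */
    static String sniff(byte[] b) {
        if (b.length < 4) return null;
        int b0 = b[0] & 0xFF, b1 = b[1] & 0xFF, b2 = b[2] & 0xFF, b3 = b[3] & 0xFF;

        // With BOM. UTF-32 is tested first because the UTF-32LE BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE).
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UTF-32BE";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";

        // Without BOM: look for '<' (0x3C) and '?' (0x3F) padded with zero bytes.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0x00 && b3 == 0x3C) return "UTF-32BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";

        return null; // treat as UTF-8 or another 8-bit encoding
    }
}
```

When `sniff` returns a non-null name, the caller would hand the raw bytes to the parser instead of decoding them as a Perl character string first.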
RuntimeScalar.java:
- hashDerefRaw() added to bypass Perl %{} overload when accessing internal
node hash (prevents infinite recursion in XML::LibXML::Element %{} overload)
Result: t/03doc.t goes from ~5/193 to 175/193 pass; overall full-pass count
increases from ~22 to 37 test files (plus 7 partial).
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Implement push/incremental parsing and SAX event generation for XML::LibXML,
bringing the upstream test suite pass rate from 40.9% to 53.7% (1307/2435).
- Add `PushContext` inner class to buffer incremental XML chunks
- Implement `_start_push`, `_push`, `_end_push` Java methods
- Implement `_parse_xml_chunk` to parse well-balanced fragments (wraps in
synthetic root, creates DocumentFragment)
- Add Perl-level `init_push`, `push`, `parse_chunk`, `finish_push`,
`parse_xml_chunk`, `parse_balanced_chunk` methods to XML/LibXML.pm
- Add `_init_callbacks` / `_cleanup_callbacks` stubs for SAX compat
- Implement `_fire_sax_events` / `_fire_sax_element` in SAX.pm
- Walks DOM tree and fires SAX2 events to the handler chain
- Handles Text, CDATA (start_cdata/end_cdata), Comment, PI,
DocumentFragment, Element node types
- Uses `getData()` for node text (works on Comment, CDATA, Text)
- Builder.pm now receives events and builds a real DOM result
- `attributes()` in LIST context now returns individual Attr nodes (flat list)
instead of a NamedNodeMap object
- Add `appendText($text)` to Node: creates + appends a text node child
- Add `createRawElement` / `createRawElementNS` as aliases to
`createElement` / `createElementNS` on Document
- `setNamespace($uri, $prefix, $act)`: always add xmlns: declaration
attribute, even when `$act` is false
- `insertBefore` / `insertAfter`: auto-import nodes from foreign documents
(fixes WRONG_DOCUMENT_ERR when appending fragments from different parsers)
- Add `importNodeIfNeeded` helper for cross-document node moves
- `_parse_html_string` now self-closes void HTML elements (base, br, meta,
img, input, hr, link, etc.) before XML-parsing, enabling HTML parsing tests
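The synthetic-root approach noted for `_parse_xml_chunk` can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `ChunkParser`, `parseBalancedChunk`, and the `synthetic-root` element name are hypothetical, and only standard JAXP/DOM calls are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical helper: parses a well-balanced XML chunk into a DocumentFragment. */
final class ChunkParser {
    private ChunkParser() {}

    static DocumentFragment parseBalancedChunk(String chunk) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();

            // A chunk may have several top-level nodes, so wrap it in one root
            // element to make it a parseable document.
            String wrapped = "<synthetic-root>" + chunk + "</synthetic-root>";
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

            // Move every child of the wrapper into a fragment; appendChild
            // detaches the node from its old parent, so the loop terminates.
            Element root = doc.getDocumentElement();
            DocumentFragment fragment = doc.createDocumentFragment();
            while (root.getFirstChild() != null) {
                fragment.appendChild(root.getFirstChild());
            }
            return fragment;
        } catch (Exception e) {
            throw new IllegalArgumentException("chunk is not well-balanced XML", e);
        }
    }
}
```

For example, `ChunkParser.parseBalancedChunk("<a/>text<b>x</b>")` yields a fragment with three children. A chunk that closes and reopens the wrapper element itself would slip through this simple version.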
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Major improvements to XML::LibXML namespace handling:
- When removing a namespace declaration (`setNamespaceDeclURI('xxx', undef)`),
the removal now cascades: the element and all child elements/attrs using that
prefix are renamed to have no namespace. Added `removePrefixFromSubtree` helper.
- `setNamespaceDeclPrefix` on a non-existent prefix now returns 1 (no-op)
instead of dying.
- `appendChild` now strips redundant namespace declarations from the child
node when the same prefix→URI is already declared on an ancestor.
Added `reconcileNamespaces` helper.
- `removeChild` now re-adds namespace declarations for prefixes used by
the detached node's attributes/prefix that were only declared in the
former parent context. Added `readdMissingNsDecls` helper.
- New helper used by both reconciliation paths and `getNamespaceDeclURI`.
- For non-empty prefixes, also detects implicit declarations from an
ancestor element's own namespace binding (createElementNS without explicit xmlns).
- Extended the namespace URI fallback to non-empty prefixes: when an element
uses a prefix but no ancestor declares it, `getAttribute('xmlns:prefix')`
returns the namespace URI (synthesised from the element's own NS binding).
- Both empty and non-empty prefix fallbacks skip synthesis when an ancestor
already declares the same prefix→URI.
- Handles empty/null ns argument to remove namespace and prefix from element
or attribute nodes via `renameNode(el, null, localName)`.
- `getElementById` on Document: tree walk checking xml:id and id attributes
- `setAttributeNodeNS` on Element: wrapper for DOM setAttributeNodeNS
- `getName` on Node: alias for nodeName
- `replaceChild` now adopts the new child if it belongs to a different
document (matches libxml2 behavior), fixing WRONG_DOCUMENT_ERR.
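The `appendChild` reconciliation described above (dropping declarations an ancestor already provides) might look like the following sketch. The class and method names (`NsReconciler`, `declaredUri`, `appendAndReconcile`, `newDocument`) are hypothetical and the logic is a simplified assumption: it only compares explicit `xmlns` attributes and ignores implicit bindings.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

/** Hypothetical sketch of stripping redundant xmlns declarations on append. */
final class NsReconciler {
    static final String XMLNS = XMLConstants.XMLNS_ATTRIBUTE_NS_URI;

    private NsReconciler() {}

    /** Creates an empty namespace-aware document without checked exceptions. */
    static Document newDocument() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Nearest explicit declaration of prefix ("" = default) on start or an ancestor, else null. */
    static String declaredUri(Node start, String prefix) {
        String attrName = prefix.isEmpty() ? "xmlns" : "xmlns:" + prefix;
        for (Node n = start; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute(attrName)) return e.getAttribute(attrName);
        }
        return null;
    }

    /** Appends child, first removing xmlns attributes that repeat an in-scope prefix→URI binding. */
    static void appendAndReconcile(Element parent, Element child) {
        NamedNodeMap attrs = child.getAttributes();
        List<Attr> redundant = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr a = (Attr) attrs.item(i);
            String name = a.getName();
            if (!name.equals("xmlns") && !name.startsWith("xmlns:")) continue;
            String prefix = name.equals("xmlns") ? "" : name.substring("xmlns:".length());
            if (a.getValue().equals(declaredUri(parent, prefix))) redundant.add(a);
        }
        // Remove after the scan so the live NamedNodeMap is not mutated mid-iteration.
        for (Attr a : redundant) child.removeAttributeNode(a);
        parent.appendChild(child);
    }
}
```

Declarations that bind a prefix to a different URI than the ancestor does are kept, since removing them would change the child's meaning.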
Test results: 135/137 passing in t/10ns.t (up from ~97/137).
Remaining 2 failures: libxml2-specific quirks (getElementById after
node detachment, setAttributeNodeNS declaring on root element).
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two fixes in XMLLibXML.java:
1. setAttributeNodeNS: when setting an attribute whose namespace prefix is
not yet declared in the ancestor chain, libxml2 places the xmlns:
declaration on the document root element (not on the receiving element).
Emulate this by calling lookupNamespaceURI after setAttributeNodeNS; if
the prefix is still undeclared, add it to the document root.
2. getElementById: libxml2 maintains a persistent ID index so nodes are
still findable after they are detached from the tree. Emulate this via
a per-Document HashMap stored in setUserData("__xmlIdCache__").
Each successful live-tree lookup refreshes the cache; if the live walk
finds nothing, the cache is consulted (returning detached nodes).
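The cache-backed lookup in item 2 can be sketched as below. Only the `__xmlIdCache__` key comes from the commit message. The class `IdLookup` and its methods are hypothetical, and the live walk is a simplified assumption that checks the `xml:id` and `id` attributes by name.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical sketch of getElementById with a per-Document fallback cache. */
final class IdLookup {
    static final String CACHE_KEY = "__xmlIdCache__"; // key name from the commit message

    private IdLookup() {}

    /** Creates an empty document without checked exceptions. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the cache stored in the document's user data, creating it on first use. */
    @SuppressWarnings("unchecked")
    static Map<String, Element> cache(Document doc) {
        Map<String, Element> c = (Map<String, Element>) doc.getUserData(CACHE_KEY);
        if (c == null) {
            c = new HashMap<>();
            doc.setUserData(CACHE_KEY, c, null);
        }
        return c;
    }

    /** Depth-first walk of the live tree looking for a matching xml:id or id attribute. */
    static Element findLive(Node n, String id) {
        if (n.getNodeType() == Node.ELEMENT_NODE) {
            Element e = (Element) n;
            if (id.equals(e.getAttribute("xml:id")) || id.equals(e.getAttribute("id"))) return e;
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            Element hit = findLive(c, id);
            if (hit != null) return hit;
        }
        return null;
    }

    /** A live hit refreshes the cache; a live miss falls back to the cache. */
    static Element getElementById(Document doc, String id) {
        Element live = findLive(doc, id);
        if (live != null) {
            cache(doc).put(id, live);
            return live;
        }
        return cache(doc).get(id); // may be a detached node, or null
    }
}
```

One consequence of this design is that a node only becomes findable after detachment if it was looked up at least once while still attached.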
Result: t/10ns.t now passes 137/137 (was 135/137).
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
… text merge
Fixes for multiple XML::LibXML test improvements:
- lookupNamespaceURI: check element's own namespace prefix (createElementNS
elements without explicit xmlns: declaration)
- getNamespaces: include element's own namespace if not already declared
- replaceNode: now returns the replaced (old) node, matching libxml2 behavior
- replaceNode: handle attribute nodes via ownerElement (parentNode is null for Attr)
- unbindNode: handle attribute nodes via ownerElement (removeAttributeNS/removeAttribute)
- nodeAddSibling: handle attribute nodes via ownerElement with auto xmlns:
declaration; merge text nodes when addSibling called on detached text nodes
- attrToString: preserve entity references in attribute children (&foo; notation)
- serializeNode(Attr): preserve entity reference children
- _parse_html_file: detect and handle UTF-16 BOM (LE/BE), strip BOM bytes before parsing
- quotemeta (StringOperators): use code-point iteration for supplementary Unicode chars
Test improvements:
- t/04node.t: 195/195 (was ~179/195)
- t/05text.t: 59/59 (was 57/59)
- t/09xpath.t: 54/54 (was 47 then crash)
- t/10ns.t: 137/137 (unchanged)
- t/16docnodes.t: 11/11 (unchanged)
- Overall: 30/77 test programs pass (was 19/77)
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
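The code-point iteration mentioned for quotemeta can be illustrated with a small sketch. It is not the `StringOperators` code, and the escaping rule is a simplifying assumption rather than Perl's exact rule set: every code point that is not a letter, digit, or underscore gets a backslash. The point it shows is that a supplementary character is classified and emitted as one unit, so a backslash is never inserted between the two halves of a surrogate pair.

```java
/** Hypothetical sketch of a quotemeta-style escaper that iterates by code point. */
final class QuoteMeta {
    private QuoteMeta() {}

    static String quotemeta(String s) {
        StringBuilder out = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); ) {
            // codePointAt joins a surrogate pair into one supplementary code point.
            int cp = s.codePointAt(i);
            // Assumed rule for this sketch: letters, digits and '_' are safe.
            boolean safe = Character.isLetterOrDigit(cp) || cp == '_';
            if (!safe) out.append('\\');
            out.appendCodePoint(cp);
            // Advance by one or two chars depending on the code point's width.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With a char-by-char loop, each surrogate half would be seen as a non-letter and escaped separately, which corrupts the character.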
Summary
This PR adds a bundled Java-backed `XML::LibXML` implementation to PerlOnJava, enabling XML processing without requiring native libxml2.
What's included
- `XMLLibXML.java`: XS-replacement Java implementation of ~160 XML::LibXML methods
- `XML/LibXML.pm` and friends: pure-Perl glue layer (XPathContext, NodeList, SAX, etc.)
- `XML-LibXML.yml` configures `jcpan` to skip XS compilation and load the bundled module instead
Test results
- `t/10ns.t` (namespace handling): 137/137 passing
Key features implemented:
- `javax.xml.xpath`
- `setAttributeNodeNS` places undeclared namespace on document root (libxml2 quirk)
- `getElementById` with persistent ID cache so detached nodes remain findable
Notable libxml2 quirks emulated
- `setAttributeNodeNS`: when an attribute's prefix is not in scope, the `xmlns:` declaration is placed on the document root element (not the element receiving the attribute).
- `getElementById`: maintains a persistent per-Document ID cache so nodes still resolve after being detached from the tree.
Test plan
- `make` passes (all unit tests green)
- `./jperl /path/to/XML-LibXML-2.0210/t/10ns.t` → 137/137 passing
Generated with Devin
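The `setAttributeNodeNS` quirk listed above can be sketched with standard DOM calls. This is an assumption-laden illustration, not the code in `XMLLibXML.java`: `RootNsDecl` and its methods are hypothetical names, and the check relies on `lookupNamespaceURI` as the commit message describes.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical sketch: declare an attribute's unbound prefix on the document root. */
final class RootNsDecl {
    static final String XMLNS = XMLConstants.XMLNS_ATTRIBUTE_NS_URI;

    private RootNsDecl() {}

    /** Creates an empty namespace-aware document without checked exceptions. */
    static Document newDocument() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    static void setAttributeNodeNS(Element target, Attr attr) {
        target.setAttributeNodeNS(attr);

        String prefix = attr.getPrefix();
        String uri = attr.getNamespaceURI();
        if (prefix == null || uri == null) return; // nothing to declare

        // If the prefix already resolves to this URI from the target, we are done.
        if (uri.equals(target.lookupNamespaceURI(prefix))) return;

        // Otherwise place the declaration on the document root element,
        // not on the element that received the attribute.
        Element root = target.getOwnerDocument().getDocumentElement();
        if (root != null) {
            root.setAttributeNS(XMLNS, "xmlns:" + prefix, uri);
        }
    }
}
```

If the target element is not attached under the document root, the root declaration does not bring the prefix into scope for it. This sketch does not handle that case.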