WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes #54
joshday wants to merge 33 commits into
|
Hey @joshday . I've only had a very superficial look so far but it looks great. Thanks! In terms of impact on XLSX.jl, I think it looks significant. It isn't just More of a challenge will be the removal of These obviously aren't insuperable, but will likely need a bit of time while I get to grips with Thanks, Tim |
Hi @joshday, I've been a bit distracted recently by transferring XLSX.jl to JuliaData and subsequently making a v0.11 release, but my attention will be back on this again after the Easter break. I have to say I'd welcome any PR you could make on XLSX.jl to help facilitate this upgrade. Thanks! |
Drops the underscore prefixes from internal names (module is unexported, the clutter was only needed back when these names leaked into XML.jl). Replaces the name-byte predicate with a 256-entry const lookup table. Also fixes a 1-based indexing off-by-one in read_doctype_body: the '<!--' detection guarded with `pos >= 2` while reading `codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
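A minimal sketch of the two ideas in the commit message above, not the PR's actual code. The names `is_name_byte_slow`, `NAME_BYTE`, `is_name_byte`, and `has_bang_open_before` are hypothetical, and the byte set is an ASCII-oriented approximation of XML name characters. It shows a predicate precomputed into a 256-entry table, and a bounds guard that matches the lowest index actually read.

```julia
# Hypothetical reference predicate (assumption: ASCII letters, digits,
# '_', ':', '-', '.', plus any byte >= 0x80 from a multi-byte UTF-8 sequence).
is_name_byte_slow(b::UInt8) =
    (UInt8('a') <= b <= UInt8('z')) || (UInt8('A') <= b <= UInt8('Z')) ||
    (UInt8('0') <= b <= UInt8('9')) ||
    (b in (UInt8('_'), UInt8(':'), UInt8('-'), UInt8('.'))) ||
    (b >= 0x80)

# Precompute the answer for every possible byte value once.
const NAME_BYTE = Bool[is_name_byte_slow(UInt8(b)) for b in 0:255]

# Julia arrays are 1-based, so byte value b lives at index b + 1.
is_name_byte(b::UInt8) = @inbounds NAME_BYTE[Int(b) + 1]

# Hypothetical helper: do the two bytes immediately before `pos` spell "<!"?
# The lowest index read is pos - 2, so the guard must be pos >= 3;
# with pos >= 2 the call codeunit(data, 0) would throw a BoundsError.
function has_bang_open_before(data::String, pos::Int)
    (3 <= pos <= ncodeunits(data) + 1) || return false
    return codeunit(data, pos - 2) == UInt8('<') &&
           codeunit(data, pos - 1) == UInt8('!')
end
```

The general rule the fix follows: derive the guard from the smallest index the body reads, not from the first index that happens to be convenient.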
tag, value, keys, and attributes on LazyNode now return
SubString{String} views into the source rather than allocating
fresh Strings, so traversing a large document lazily does not
duplicate its text data.
Introduces a small _as_substring helper to promote the String that
`unescape` can return into a SubString so Attributes stays homogeneous.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
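To illustrate the commit message above, here is a small sketch using only Base Julia. `as_substring` below is a hypothetical stand-in for the `_as_substring` helper mentioned in the commit, whose real definition is not shown in this thread. The sample document and index positions are made up for the example.

```julia
src = "<item id=\"a&amp;b\">hello</item>"

# A SubString stores an offset and a length into its parent String;
# no bytes are copied.
tag = SubString(src, 2, 5)
@assert tag == "item"
@assert parent(tag) === src

# Hypothetical promotion helper: views pass through, Strings get wrapped
# in a view that covers the whole string.
as_substring(s::SubString{String}) = s
as_substring(s::String) = SubString(s)

# An attribute value that needed unescaping comes back as a fresh String ...
raw_value = SubString(src, 11, 17)            # the bytes a&amp;b
fresh = replace(raw_value, "&amp;" => "&")    # replace returns a String

# ... and promoting it keeps the container's element type homogeneous.
attrs = Pair{SubString{String},SubString{String}}[]
push!(attrs, SubString(src, 7, 8) => as_substring(fresh))
```

Keeping one concrete element type means code that iterates over the attributes does not have to handle a `Union` of `String` and `SubString{String}`.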
_write_xml now inspects children before reformatting: if any Text
child has non-whitespace content (or any CData child exists), the
element is treated as mixed content and its whitespace is preserved
verbatim. Otherwise the writer drops the whitespace-only Text nodes
the parser emits for round-tripping source formatting and generates
fresh indentation. Same filter is applied at the Document level.
Also adds an unescape(::SubString{String}) specialization that
returns the input unchanged when it contains no '&', avoiding an
allocation on the lazy scanning path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
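The decision rule described in the commit message above can be sketched on a toy node model. Everything here (`Kind`, `ToyNode`, `is_mixed`, `writable_children`, `toy_unescape`) is hypothetical and is not XML.jl's implementation. `toy_unescape` handles only the five predefined entities and assumes Julia 1.7 or later for multi-pair `replace`.

```julia
@enum Kind ElementKind TextKind CDataKind

struct ToyNode
    kind::Kind
    value::String
    children::Vector{ToyNode}
end
ToyNode(kind, value) = ToyNode(kind, value, ToyNode[])

is_blank(s) = all(isspace, s)

# Mixed content: any CData child, or any Text child with non-whitespace content.
is_mixed(node) = any(node.children) do c
    c.kind == CDataKind || (c.kind == TextKind && !is_blank(c.value))
end

# Children a writer would emit: verbatim for mixed content, otherwise
# without the whitespace-only Text nodes (fresh indentation is generated later).
function writable_children(node)
    is_mixed(node) && return node.children
    return filter(c -> !(c.kind == TextKind && is_blank(c.value)), node.children)
end

# Fast path: with no '&' there is nothing to unescape, so return the view itself.
function toy_unescape(s::SubString{String})
    occursin('&', s) || return s
    return replace(s, "&lt;" => "<", "&gt;" => ">", "&quot;" => "\"",
                   "&apos;" => "'", "&amp;" => "&")
end
```

For example, an element whose only Text children are `"\n  "` and `"\n"` is not mixed, so those nodes are dropped before re-indenting, while an element containing the Text child `"hi "` keeps every child untouched.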
The medium-file workloads show a ~10–25% regression vs the numbers captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains a 70–80% improvement, so this is a post-release follow-up, not a release blocker. Suspected culprit is the eager Pair{S,S}[] alloc per TOKEN_OPEN_TAG introduced in 2f71f9a — see follow-up issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hi @joshday — the v0.4 rewrite looks good; the tokenizer architecture reads faster on eager mode. I'm currently evaluating the v0.4 upgrade on FastKML.jl following your comment on my PR #58. The eager-path improvements are substantial on real-world KML: ×2–2.6 speedup and 37–69% memory reduction across four reference files (5k to 163k Placemarks), versus There is a trade-off I wanted to surface before #54 is merged: v0.4 removed the linear-traversal API on On real-world KML files with non-trivial structure, the
On those same files, the I've written up the decomposition (synthetic bench + FastKML real workloads + cost attribution + a SOTA-informed design space) as a separate design issue #61 rather than clutter this PR thread. Full data: Happy to refine the benchmark or prototype any direction if useful. |
|
Hi, @joshday. I've posted some results on #61 — a |
|
Hey all, sincere apologies for my lack of communication here. I need to step back from finishing this PR and maintaining XML for the foreseeable future. I wish I had wrapped up this PR before getting into a busy season of life and I'm more than happy to share my thoughts and general vision for v0.4, but I should no longer be in the critical path of development and/or decision making. I'm self-employed and (for better or worse) have been successful enough that I no longer have time to allocate for things that aren't (1) paying the bills or (2) spending time with my family. I'd love to get back to this someday! There's a sick twisted part of me that genuinely likes working on awful XML edge cases 😅. @mathieu17g @TimG1964 My recommendation would be to transfer XML.jl to JuliaData. This will need to be initiated by someone at JuliaHub, but I think they'll be on board with it. |
|
Thanks for the update and very sorry to hear you are "moving on". I wish you success in your business. Will you be facilitating the transfer to JuliaData? I really hope they are able to take XML.jl on and can find a worthy successor. Before you go, would you have time to review and possibly merge any of the pending PRs into a final v0.3.9? I fear it may be a while before a v0.4 can be finalized following the transfer of ownership. Thanks! |
|
+1 on @TimG1964's v0.3.9. The five open PRs that fit a 0.3.x patch are all CI-green on current Optionally, two small regression tests, verified against the PR branches and ready to fold in:
# #56: prev must cross a CDATA section (the prev call itself crashes on v0.3.8)
doc = parse("<r><a>x</a><![CDATA[hello]]><b>y</b></r>", LazyNode)
b = children(children(doc)[1])[3] # the <b> element
p = nothing
@test (p = XML.prev(b)) isa LazyNode # asserts the call does not throw
@test XML.nodetype(p) == XML.CData
@test XML.value(p) == "hello"
# #60: escape on a SubString (MethodError on v0.3.8)
@test XML.escape(SubString("a&b<c>", 1)) == "a&amp;b&lt;c&gt;" |
|
I don't have merge permissions here |
|
@quinnj has agreed to facilitate a transfer of XML.jl to JuliaData. I've suggested he keeps you, @joshday, as a maintainer and adds @mathieu17g as well. I've also asked if there are any other members of the JuliaData community who may be willing to provide additional maintenance support. Hopefully, this can happen quite quickly now. |
|
To do a transfer IIUC we need someone in both JuliaComputing and the destination org. I think @ViralBShah can transfer this repo to JuliaIO, then @quinnj could transfer from JuliaIO to JuliaData? |
|
Invited the folks from the discussion here to have access to the repo as well. |
Summary of Changes
I revived an old rewrite I had halfway finished with the help of Claude Code. It produced some good results!
- `src/XMLTokenizer.jl` module for speedy tokenization
- `Node{T}` now parameterized by the string storage type, enabling quick reads via `SubString` or StringViews.jl
- `XML.mmap("file.xml", LazyNode)` for memory-mapped parsing of very large files
- `xpath(node, path)` with a practical subset of XPath 1.0

Downstream
@TimG1964 you are likely the most impacted by these changes. The Downstream.yml action does indicate a failure in XLSX.jl tests related to `Raw` no longer existing. I'd appreciate your review here! I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Addressed Issues

Benchmarks: See `benchmarks/compare.jl`

Here `(SS)` refers to using `SubString{String}` as storage type.