WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes #54

Open
joshday wants to merge 33 commits into JuliaData:main from joshday:main

@joshday joshday commented Mar 6, 2026

Collaborator

Summary of Changes

I revived an old rewrite I had halfway finished with the help of Claude Code. It produced some good results!

  • Major rewrite of XML.jl's internals that addresses many open issues
  • Self-contained src/XMLTokenizer.jl module for speedy tokenization
  • Node{T} now parameterized by the string storage type, enabling quick reads via SubString or StringViews.jl
  • StringViews extension — XML.mmap("file.xml", LazyNode) for memory-mapped parsing of very large files
  • XPath support — xpath(node, path) with a practical subset of XPath 1.0
  • Greatly expanded test suite — 243 libxml2 test cases, pugixml and libexpat compatibility tests, W3C conformance tests

Downstream

@TimG1964 you are likely the most impacted by these changes. The Downstream.yml action does indicate a failure in the XLSX.jl tests, caused by Raw no longer existing. I'd appreciate your review here! I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Addressed Issues

Benchmarks: See benchmarks/compare.jl

In the results below, (SS) means SubString{String} was used as the storage type.

julia --project=. benchmarks/compare.jl
============================================================
  XML.jl Benchmark Comparison
  Current (dev) vs v0.3.8
============================================================

Running dev benchmarks... done
Setting up v0.3.8 worktree... done
Running v0.3.8 benchmarks... done

------------------------------------------------------------

  Parse (small)
          v0.3.8      0.114 ms
             dev     0.0335 ms  (70.6% faster)

  Parse (small, SS)
          v0.3.8           n/a
             dev     0.0285 ms

  Parse (medium)
          v0.3.8   634.7153 ms
             dev   161.0888 ms  (74.6% faster)

  Parse (medium, SS)
          v0.3.8           n/a
             dev   151.3025 ms

  Write (small)
          v0.3.8     0.0227 ms
             dev     0.0176 ms  (22.4% faster)

  Write (medium)
          v0.3.8   118.1504 ms
             dev     77.619 ms  (34.3% faster)

  Read file (medium)
          v0.3.8   645.5785 ms
             dev   170.8398 ms  (73.5% faster)

  Collect tags (small)
          v0.3.8     0.0005 ms
             dev     0.0006 ms  (10.3% slower)

  Collect tags (medium)
          v0.3.8    21.0988 ms
             dev    11.1532 ms  (47.1% faster)

============================================================
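
The percentages above are consistent with a plain relative-change calculation on the timings. The sketch below assumes that formula; benchmarks/compare.jl may compute it differently or from unrounded timings, and `pct_faster` is a hypothetical helper name, not something taken from the script.

```julia
# Hypothetical helper: percent reduction in run time going from `old` to `new`.
# A positive result means `new` is faster; a negative result means it is slower.
pct_faster(old, new) = round(100 * (old - new) / old; digits = 1)

pct_faster(0.114, 0.0335)        # Parse (small)
pct_faster(634.7153, 161.0888)   # Parse (medium)
```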

@TimG1964

TimG1964 commented Mar 8, 2026

Collaborator

Hey @joshday. I've only had a very superficial look so far but it looks great. Thanks!

In terms of impact on XLSX.jl, I think it looks significant. It isn't just Raw. Since @nhz2 first suggested using Raw, I've known it was internal and therefore subject to change. On first inspection, I think the rework involved should be manageable.

More of a challenge will be the removal of prev and next, which are currently exported functions. I rely on these for fundamental elements of XLSX.jl like the sheetrow and tablerow iterators, and for reading and writing the XML files from/to the zip archive .xlsx file.

These obviously aren't insuperable, but will likely need a bit of time while I get to grips with xpath and the tokenizer. Optimistic me thinks the new functionality will simplify the code of XLSX.jl, but I usually find things are considerably harder than I first imagine! I'll give more feedback once I've had more of a go at getting XLSX.jl working.

Thanks,

Tim

Comment thread ext/XMLStringViewsExt.jl Outdated
@TimG1964

TimG1964 commented Apr 2, 2026

Collaborator

> I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Hi @joshday, I've been a bit distracted recently by transferring XLSX.jl to JuliaData and subsequently making a v0.11 release, but my attention will be back on this again after the Easter break. I have to say I'd welcome any PR you could make on XLSX.jl to help facilitate this upgrade.

Thanks!

joshday and others added 6 commits April 2, 2026 16:49
Drops the underscore prefixes from internal names (module is unexported,
the clutter was only needed back when these names leaked into XML.jl).
Replaces the name-byte predicate with a 256-entry const lookup table.

Also fixes a 1-based indexing off-by-one in read_doctype_body: the
'<!--' detection guarded with `pos >= 2` while reading
`codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
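
A minimal sketch of the two changes this commit message describes. It is not the actual src/XMLTokenizer.jl code: the set of bytes treated as name bytes is an assumption, and `is_name_byte_slow`, `NAME_BYTE`, `is_name_byte`, and `preceded_by_bang_open` are hypothetical names.

```julia
# Assumption: name bytes are ASCII letters, digits, '_', ':', '-', '.',
# plus any non-ASCII byte (>= 0x80) that belongs to a UTF-8 sequence.
is_name_byte_slow(b::UInt8) =
    (UInt8('a') <= b <= UInt8('z')) || (UInt8('A') <= b <= UInt8('Z')) ||
    (UInt8('0') <= b <= UInt8('9')) ||
    b in (UInt8('_'), UInt8(':'), UInt8('-'), UInt8('.')) || b >= 0x80

# 256-entry table built once at load time. The index is the byte value + 1
# because Julia collections are 1-based.
const NAME_BYTE = ntuple(i -> is_name_byte_slow(UInt8(i - 1)), 256)

is_name_byte(b::UInt8) = NAME_BYTE[Int(b) + 1]

# Lookback guard: reading codeunit(data, pos - 2) is only valid when
# pos - 2 >= 1, i.e. pos >= 3. Guarding with `pos >= 2` would allow index 0.
function preceded_by_bang_open(data::String, pos::Int)
    pos >= 3 || return false
    return codeunit(data, pos - 2) == UInt8('<') &&
           codeunit(data, pos - 1) == UInt8('!')
end
```

The table turns a chain of range comparisons into a single indexed load, which is the usual motivation for this kind of change in a byte-level scanner.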

tag, value, keys, and attributes on LazyNode now return
SubString{String} views into the source rather than allocating
fresh Strings, so traversing a large document lazily does not
duplicate its text data.

Introduces a small _as_substring helper to promote the String that
`unescape` can return into a SubString so Attributes stays homogeneous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
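
To illustrate the idea in this commit message, here is a small sketch that uses only Base. `as_substring` is a hypothetical stand-in for the `_as_substring` helper mentioned above, and the attribute layout is an assumption rather than the package's actual `Attributes` type.

```julia
# Hypothetical stand-in for the helper: make sure the result is a SubString{String}.
as_substring(s::SubString{String}) = s     # already a view, returned as-is
as_substring(s::String) = SubString(s)     # wrap the whole String in a view

src = "<item id=\"42\">text</item>"

# Views into `src`: they share its storage instead of copying the text.
tag = SubString(src, 2, 5)                 # "item"
attrs = Pair{SubString{String},SubString{String}}[
    as_substring("id") => SubString(src, 11, 12),
]
```

Because every key and value has the same concrete type, the vector stays homogeneous even when one side came from a freshly allocated String.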

_write_xml now inspects children before reformatting: if any Text
child has non-whitespace content (or any CData child exists), the
element is treated as mixed content and its whitespace is preserved
verbatim. Otherwise the writer drops the whitespace-only Text nodes
the parser emits for round-tripping source formatting and generates
fresh indentation. Same filter is applied at the Document level.
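
The rule in this paragraph can be sketched with a toy node type. Everything below (`ToyKind`, `ToyNode`, `is_mixed`, `drop_whitespace_text`) is hypothetical and only mirrors the description; it is not the `_write_xml` implementation.

```julia
@enum ToyKind TextNode CDataNode ElementNode

struct ToyNode
    kind::ToyKind
    value::String
end

# Mixed content: any Text child with non-whitespace characters, or any CData child.
function is_mixed(children)
    return any(children) do c
        c.kind == CDataNode || (c.kind == TextNode && !all(isspace, c.value))
    end
end

# For a non-mixed element, drop the whitespace-only Text children so the
# writer can generate fresh indentation.
drop_whitespace_text(children) =
    filter(c -> !(c.kind == TextNode && all(isspace, c.value)), children)

kids = [ToyNode(TextNode, "\n  "), ToyNode(ElementNode, "b"), ToyNode(TextNode, "\n")]
to_write = is_mixed(kids) ? kids : drop_whitespace_text(kids)
```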

Also adds an unescape(::SubString{String}) specialization that
returns the input unchanged when it contains no '&', avoiding an
allocation on the lazy scanning path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
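
A sketch of the fast path described in this commit message, assuming only the five predefined XML entities need handling (numeric character references are left out). `unescape_sketch` is a hypothetical name and not the package's `unescape`.

```julia
function unescape_sketch(s::SubString{String})
    # Fast path: without '&' there is no entity reference, so return the same view.
    '&' in s || return s
    # Slow path: one left-to-right pass over the predefined entities
    # (multi-pair `replace` needs Julia 1.7 or newer).
    out = replace(String(s),
        "&lt;" => "<", "&gt;" => ">", "&quot;" => "\"",
        "&apos;" => "'", "&amp;" => "&")
    return SubString(out)   # keep the return type a SubString{String}
end

src = "a &amp; b plain"
plain = SubString(src, 11, 15)              # "plain", contains no '&'
unchanged = unescape_sketch(plain)          # returned as-is
decoded = unescape_sketch(SubString(src))   # takes the slow path
```

Wrapping the slow-path result in a SubString keeps the function type-stable, which is the same concern the earlier `_as_substring` commit addresses.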

The medium-file workloads show a ~10–25% regression vs the numbers
captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains
a 70–80% improvement, so this is a post-release follow-up, not a
release blocker. Suspected culprit is the eager Pair{S,S}[] alloc
per TOKEN_OPEN_TAG introduced in 2f71f9a — see follow-up issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mathieu17g

Contributor

Hi @joshday, the v0.4 rewrite looks good; the new tokenizer architecture is noticeably faster in eager mode.

I'm currently evaluating the v0.4 upgrade on FastKML.jl following your comment on my PR #58. The eager-path improvements are substantial on real-world KML: ×2–2.6 speedup and 37–69% memory reduction across four reference files (5k to 163k Placemarks), versus v0.3.8 + #58 + #59 eager.

There is a trade-off I wanted to surface before #54 is merged: v0.4 removed the linear-traversal API on LazyNode, next / prev (and the next! / prev! zero-alloc variants from PR #59), and replaced it with eachchildnode / children, which allocate per child. That leaves no equivalent API for the zero-allocation lazy walk that PR #59 provided under v0.3.8.

On real-world KML files with non-trivial structure, the v0.4 lazy path regresses by ×1.4 to ×2.6 vs v0.3.8 + #58 + #59 lazy across the full 4-file reference set. Two concrete cases:

  • USGS WRS-2 tiles — 28k Polygon Placemarks (each a 5-vertex LinearRing) in a single flat top-level layer. Regresses ×2.6.
  • EPA Facility Registry — 163k Point Placemarks across 19k nested folders. Regresses ×2.3.

On those same files, the v0.3.8+PRs lazy path was actually faster than v0.4 eager too, so a strict migration loses the previously optimal path. Full per-file profile and methodology in the linked results doc below.

I've written up the decomposition (synthetic bench + FastKML real workloads + cost attribution + a SOTA-informed design space) as a separate design issue #61 rather than clutter this PR thread.

Full data: benchmark/results_eager_vs_lazy_3way_2026-05-11.md on the FastKML wip-xml-v0.4 branch.

Happy to refine the benchmark or prototype any direction if useful.

@mathieu17g

Contributor

Hi, @joshday. I've posted some results on #61: a Cursor streaming primitive plus an isbits Token that makes the iterate tuple allocation-free. The cursor is additive, but making Token isbits touches this PR's core Token, so I'm flagging it here too: I'd value your read on whether that change fits v0.4.
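
For readers unfamiliar with the isbits distinction, here is a minimal sketch with two hypothetical token layouts. Neither is XML.jl's actual Token nor the design proposed in #61; the field names and the Int32 offsets are assumptions made for illustration.

```julia
# Layout that stores a view: SubString{String} holds a reference to its parent String.
struct ViewToken
    kind::UInt8
    raw::SubString{String}
end

# Layout that stores only plain integers (offsets into the source).
# Assumption: Int32 offsets, which limits sources to about 2 GiB.
struct SpanToken
    kind::UInt8
    start::Int32   # first code unit of the token
    stop::Int32    # last code unit of the token
end

isbitstype(ViewToken)   # false
isbitstype(SpanToken)   # true

# The text is recovered on demand from the source and the stored span.
token_text(src::String, t::SpanToken) = SubString(src, Int(t.start), Int(t.stop))

tok = SpanToken(0x01, 4, 5)
token_text("<a>hi</a>", tok)
```

An isbits value can be stored inline in the `(item, state)` tuple that `iterate` returns, which is generally what allows an iteration loop to avoid heap allocation. Whether a non-isbits token actually allocates depends on the Julia version and on what the compiler can prove.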

@joshday

joshday commented Jun 9, 2026

Collaborator Author

Hey all, sincere apologies for my lack of communication here. I need to step back from finishing this PR and maintaining XML for the foreseeable future. I wish I had wrapped up this PR before getting into a busy season of life and I'm more than happy to share my thoughts and general vision for v0.4, but I should no longer be in the critical path of development and/or decision making.

I'm self-employed and (for better or worse) have been successful enough that I no longer have time to allocate for things that aren't (1) paying the bills or (2) spending time with my family. I'd love to get back to this someday! There's a sick twisted part of me that genuinely likes working on awful XML edge cases 😅.

@mathieu17g @TimG1964 My recommendation would be to transfer XML.jl to JuliaData. This will need to be initiated by someone at JuliaHub, but I think they'll be on board with it.

@TimG1964

TimG1964 commented Jun 9, 2026

Collaborator

Thanks for the update and very sorry to hear you are "moving on". I wish you success in your business.

Will you be facilitating the transfer to JuliaData? I really hope they are able to take XML.jl on and can find a worthy successor.

Before you go, would you have time to review and possibly merge any of the pending PRs into a final v0.3.9? I fear it may be a while before a v0.4 can be finalized following the transfer of ownership.

Thanks!

@mathieu17g

Contributor

+1 on @TimG1964's v0.3.9. The five open PRs that fit a 0.3.x patch are all CI-green on current main and mutually conflict-free. Suggested merge order, #64 last since it carries the version bump: #60 → #56 → #58 → #59 → #64, then register on the final merge commit.

Optionally, two small regression tests, verified against the PR branches and ready to fold in:

using XML, Test

# #56: prev must cross a CDATA section (the prev call itself crashes on v0.3.8)
doc = parse("<r><a>x</a><![CDATA[hello]]><b>y</b></r>", LazyNode)
b = children(children(doc)[1])[3]              # the <b> element
p = nothing
@test (p = XML.prev(b)) isa LazyNode           # asserts the call does not throw
@test XML.nodetype(p) == XML.CData
@test XML.value(p) == "hello"

# #60 — escape on a SubString (MethodError on v0.3.8)
@test XML.escape(SubString("a&b<c>", 1)) == "a&amp;b&lt;c&gt;"

@joshday

joshday commented Jun 10, 2026

Collaborator Author

I don't have merge permissions here

@TimG1964

Collaborator

> My recommendation would be to transfer XML.jl to JuliaData. This will need to be initiated by someone at JuliaHub, but I think they'll be on board with it.

@quinnj has agreed to facilitate a transfer of XML.jl to JuliaData. I've suggested he keeps you, @joshday, as a maintainer and adds @mathieu17g as well. I've also asked if there are any other members of the JuliaData community who may be willing to provide additional maintenance support.

Hopefully, this can happen quite quickly now.

@nhz2

nhz2 commented Jun 18, 2026

Collaborator

To do a transfer IIUC we need someone in both JuliaComputing and the destination org. I think @ViralBShah can transfer this repo to JuliaIO, then @quinnj could transfer from JuliaIO to JuliaData?

@ViralBShah

Contributor

Invited the folks from the discussion here to have access to the repo as well.

5 participants