Honor byte-order mark when parsing UTF-16 and UTF-8 input by SAY-5 · Pull Request #65 · JuliaData/XML.jl

SAY-5 · 2026-06-18T02:32:39Z

Fixes #62. Documents encoded as UTF-16, or as UTF-8 with a BOM, crashed with a BoundsError. The byte tokenizer was handed the raw bytes without any encoding detection, so the interleaved zero bytes of UTF-16 desynchronized the scanning for < and >, and the tokenizer walked off the end of the buffer.

This adds a small normalization step at the single Vector{UInt8} entry point: a leading BOM (FF FE, FE FF, or EF BB BF) is honored, UTF-16 is transcoded to UTF-8, and a UTF-8 BOM is stripped, so the tokenizer always sees UTF-8. Input without a BOM is returned unchanged. Added a regression test covering UTF-16 LE/BE and UTF-8 BOM round-trips.
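For readers who want a concrete picture of the normalization described above, here is a minimal sketch, not the code from this PR. The names normalize_bom and utf16_to_utf8 are hypothetical, and the choice to throw on an odd-length UTF-16 payload is an assumption; only transcode is a real Base function.

```julia
# Hypothetical helper: dispatch on a leading byte-order mark.
# The function names here are illustrative and are not taken from the PR.
function normalize_bom(bytes::Vector{UInt8})
    n = length(bytes)
    if n >= 3 && bytes[1] == 0xEF && bytes[2] == 0xBB && bytes[3] == 0xBF
        return bytes[4:end]                  # UTF-8 BOM: strip it
    elseif n >= 2 && bytes[1] == 0xFF && bytes[2] == 0xFE
        return utf16_to_utf8(bytes, false)   # UTF-16 little-endian
    elseif n >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        return utf16_to_utf8(bytes, true)    # UTF-16 big-endian
    end
    return bytes                             # no BOM: returned unchanged
end

# Hypothetical helper: decode the bytes after the 2-byte BOM into UTF-16
# code units, then let Base.transcode produce UTF-8 bytes.
function utf16_to_utf8(bytes::Vector{UInt8}, bigendian::Bool)
    payload = length(bytes) - 2
    # Assumption: a dangling odd byte is treated as an error.
    iseven(payload) || throw(ArgumentError("UTF-16 input has an odd number of bytes"))
    units = Vector{UInt16}(undef, payload ÷ 2)
    for i in eachindex(units)
        a = UInt16(bytes[2i + 1])            # first byte of the i-th code unit
        b = UInt16(bytes[2i + 2])            # second byte of the i-th code unit
        units[i] = bigendian ? (a << 8) | b : (b << 8) | a
    end
    return transcode(UInt8, units)           # UTF-16 code units to UTF-8 bytes
end
```

Assembling the code units byte by byte keeps the sketch independent of the host's endianness. With these helpers, the little-endian bytes FF FE 3C 00 61 00 2F 00 3E 00 normalize to the four UTF-8 bytes of <a/>.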

Signed-off-by: Sai Asish Y <say.apm35@gmail.com>

SAY-5 added 2 commits June 17, 2026 19:32

Honor byte-order mark when parsing UTF-16 and UTF-8 input

0c67195

Signed-off-by: Sai Asish Y <say.apm35@gmail.com>

test: select root element when parsing declaration-prefixed BOM input

3b091a0

Signed-off-by: Sai Asish Y <say.apm35@gmail.com>

SAY-5 wants to merge 2 commits into JuliaData:main from SAY-5:fix-utf16-bom-parsing
