Honor byte-order mark when parsing UTF-16 and UTF-8 input #65

Open
SAY-5 wants to merge 2 commits into JuliaData:main from SAY-5:fix-utf16-bom-parsing

SAY-5 commented Jun 18, 2026

Fixes #62. Documents encoded as UTF-16, or as UTF-8 with a BOM, crashed with a BoundsError. The byte tokenizer was handed the raw bytes with no encoding detection, so the interleaved zero bytes of UTF-16 desynchronized the </> scanning and the scanner walked off the end of the buffer.

This adds a small normalization step at the single Vector{UInt8} entry point. A leading BOM is honored (FF FE for UTF-16 LE, FE FF for UTF-16 BE, EF BB BF for UTF-8): UTF-16 is transcoded to UTF-8 and a UTF-8 BOM is stripped, so the tokenizer always sees UTF-8. Input without a BOM is returned unchanged. Also adds a regression test covering UTF-16 LE/BE and UTF-8 BOM round-trips.
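
For readers who want to see the shape of the normalization described above, here is a minimal sketch. It is not the code in this PR: the function names `normalize_to_utf8` and `utf16_to_utf8` are hypothetical, and dropping a trailing odd byte in UTF-16 input is an assumption made for the illustration. Only `transcode` from Julia's Base is relied on.

```julia
# Hypothetical sketch of BOM-based normalization; names are illustrative,
# not the identifiers used in this PR.

# Decode UTF-16 bytes (after a 2-byte BOM) into UTF-8 bytes.
function utf16_to_utf8(bytes::Vector{UInt8}, bigendian::Bool)
    # Number of complete 16-bit code units after the BOM.
    # Assumption: a trailing odd byte is silently dropped.
    ncu = (length(bytes) - 2) ÷ 2
    units = Vector{UInt16}(undef, ncu)
    for i in 1:ncu
        a = UInt16(bytes[2i + 1])   # first byte of the code unit
        b = UInt16(bytes[2i + 2])   # second byte of the code unit
        units[i] = bigendian ? ((a << 8) | b) : ((b << 8) | a)
    end
    # Base.transcode converts UTF-16 code units to UTF-8 bytes.
    return transcode(UInt8, units)
end

# Return UTF-8 bytes for the tokenizer, based on a leading BOM if present.
function normalize_to_utf8(bytes::Vector{UInt8})
    n = length(bytes)
    if n >= 3 && bytes[1] == 0xEF && bytes[2] == 0xBB && bytes[3] == 0xBF
        return bytes[4:end]                  # UTF-8 BOM: strip it
    elseif n >= 2 && bytes[1] == 0xFF && bytes[2] == 0xFE
        return utf16_to_utf8(bytes, false)   # UTF-16 LE
    elseif n >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        return utf16_to_utf8(bytes, true)    # UTF-16 BE
    end
    return bytes                             # no BOM: unchanged
end

# "<a/>" as UTF-16 LE with BOM: every ASCII byte is followed by 0x00,
# which is what confuses a scanner that assumes UTF-8.
le = UInt8[0xFF, 0xFE, 0x3C, 0x00, 0x61, 0x00, 0x2F, 0x00, 0x3E, 0x00]
println(String(normalize_to_utf8(le)))
```

Assembling each code unit byte by byte keeps the sketch independent of the host's endianness, at the cost of a manual loop.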

SAY-5 added 2 commits June 17, 2026 19:32
Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
Signed-off-by: Sai Asish Y <say.apm35@gmail.com>

Development

Successfully merging this pull request may close these issues.

UTF-16 XML input crashes parser with BoundsError (no encoding detection)