feat(scanner): add HTML entity resolution in SimpleHTMLScanner by marevol · Pull Request #22 · codelibs/nekohtml

marevol · 2026-03-11T23:33:19Z

Summary

Implements HTML character entity resolution directly in SimpleHTMLScanner, decoding entities in both text content and attribute values during parsing.

Changes Made

Added resolveEntities(String text) and resolveEntities(String text, boolean inAttribute) methods to SimpleHTMLScanner
Added static ENTITY_PATTERN regex covering decimal (&#214;), hex (&#xD6;), and named (&Ouml;) entity references with optional trailing semicolons
Added resolveCodePoint() helper that replaces invalid/noncharacter code points with U+FFFD per the HTML5 spec
Text content: entities are always resolved before dispatching characters() events
Attribute values: semicolon-less named entities followed by [A-Za-z0-9=] are intentionally left unresolved to avoid corrupting URLs (e.g. &not=, &copy=)
Updated all affected tests to expect correctly entity-resolved output
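
The regex-based resolution described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the pattern, the tiny NAMED map (a stand-in for HTMLEntities.get()), and the parseCodePoint helper are all assumptions made for the example, and it skips details such as longest-prefix matching of named references and XML-illegal characters.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityResolutionSketch {
    // Hypothetical pattern; the PR's actual ENTITY_PATTERN is not shown in the description.
    // Group 1: decimal digits, group 2: hex digits, group 3: entity name.
    private static final Pattern ENTITY_PATTERN =
            Pattern.compile("&(?:#([0-9]+)|#[xX]([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9]*));?");

    // Stand-in for HTMLEntities.get(); only a handful of names for illustration.
    private static final Map<String, Integer> NAMED =
            Map.of("amp", 0x26, "lt", 0x3C, "gt", 0x3E, "Ouml", 0xD6, "copy", 0xA9);

    // Maps null, surrogates, out-of-range values, and noncharacters to U+FFFD.
    static int resolveCodePoint(int cp) {
        boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        boolean outOfRange = cp < 0 || cp > 0x10FFFF;
        boolean nonchar = (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
        return (cp == 0 || surrogate || outOfRange || nonchar) ? 0xFFFD : cp;
    }

    // Parses digits in the given radix; returns -1 when the value cannot be a code point.
    static int parseCodePoint(String digits, int radix) {
        try {
            long value = Long.parseLong(digits, radix);
            return value > 0x10FFFF ? -1 : (int) value;
        } catch (NumberFormatException e) {
            return -1; // too many digits to fit in a long
        }
    }

    static String resolveEntities(String text) {
        Matcher m = ENTITY_PATTERN.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement;
            if (m.group(1) != null) {
                int cp = resolveCodePoint(parseCodePoint(m.group(1), 10));
                replacement = new String(Character.toChars(cp));
            } else if (m.group(2) != null) {
                int cp = resolveCodePoint(parseCodePoint(m.group(2), 16));
                replacement = new String(Character.toChars(cp));
            } else {
                Integer cp = NAMED.get(m.group(3));
                // Unknown names are left exactly as they appeared in the input.
                replacement = (cp == null) ? m.group() : new String(Character.toChars(cp));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(resolveEntities("a &lt; b &amp; c &#214; &#xD6; &Ouml;"));
    }
}
```

Unknown names are passed through unchanged, so text that merely contains an ampersand is not altered.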

Testing

Existing test suite updated to reflect resolved entity output
Tests cover numeric decimal/hex entities, named entities, edge cases (null char, surrogates, out-of-Unicode-range, XML-illegal chars, Unicode noncharacters)

Breaking Changes

Text content and attribute values that previously contained literal &...; sequences are now emitted as their decoded Unicode equivalents. Consumers that relied on raw entity text in SAX events will need to adjust their handling; the new behavior matches standard HTML parsing.

Additional Notes

Uses HTMLEntities.get() for named entity lookup — no new dependencies introduced
Attribute context handling follows the HTML5 tokenizer attribute value state spec to avoid false positives on query strings and URL parameters
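
To illustrate the attribute rule, here is a small sketch of the check for semicolon-less named references. It is an assumption-laden example rather than the PR's code: NAMED_REF, the NAMED map, and resolveNamed are hypothetical names, and only named references are handled to keep it short.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeEntitySketch {
    // Named references only; group 2 captures the optional trailing semicolon.
    private static final Pattern NAMED_REF = Pattern.compile("&([A-Za-z][A-Za-z0-9]*)(;?)");

    // Stand-in for HTMLEntities.get(), limited to three names for illustration.
    private static final Map<String, String> NAMED =
            Map.of("amp", "&", "copy", "\u00A9", "not", "\u00AC");

    static boolean isAsciiAlnumOrEquals(char c) {
        return c == '=' || (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    static String resolveNamed(String text, boolean inAttribute) {
        Matcher m = NAMED_REF.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = NAMED.get(m.group(1));
            boolean hasSemicolon = !m.group(2).isEmpty();
            // In attributes, a reference without ';' followed by [A-Za-z0-9=]
            // is treated as ordinary text (likely a URL query parameter).
            boolean ambiguous = inAttribute && !hasSemicolon
                    && m.end() < text.length()
                    && isAsciiAlnumOrEquals(text.charAt(m.end()));
            String replacement = (value == null || ambiguous) ? m.group() : value;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String href = "/search?a=1&copy=2";
        System.out.println(resolveNamed(href, true));  // attribute context: query string kept intact
        System.out.println(resolveNamed(href, false)); // text context: &copy is decoded
    }
}
```

With this check, an href such as /search?a=1&copy=2 passes through unchanged in attribute context, while the same string in text content has &copy decoded.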

Implement resolveEntities() in SimpleHTMLScanner to decode HTML character references in text content and attribute values during parsing.

- Handle numeric decimal (&#214;), hex (&#xD6;), and named (&Ouml;) entities
- Follow HTML5 attribute value state: skip semicolon-less named entities followed by [A-Za-z0-9=] to avoid corrupting URLs (e.g. &not=, &copy=)
- Replace invalid code points (null, surrogates, out-of-range, noncharacters) with U+FFFD per HTML5 spec
- Update test suite to reflect correct entity-resolved output

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marevol merged commit 7c319cf into master Mar 11, 2026
1 check passed

marevol merged 1 commit into master from feat/html-entity-resolution