feat(scanner): add HTML entity resolution in SimpleHTMLScanner #22

Merged
marevol merged 1 commit into master from feat/html-entity-resolution
Mar 11, 2026

marevol (Collaborator) commented Mar 11, 2026

Summary

Implements HTML character entity resolution directly in SimpleHTMLScanner, decoding entities in both text content and attribute values during parsing.

Changes Made

  • Added resolveEntities(String text) and resolveEntities(String text, boolean inAttribute) methods to SimpleHTMLScanner
  • Added static ENTITY_PATTERN regex covering decimal (&#214;), hex (&#xD6;), and named (&Ouml;) entity references with optional trailing semicolons
  • Added resolveCodePoint() helper that replaces invalid/noncharacter code points with U+FFFD per the HTML5 spec
  • Text content: entities are always resolved before dispatching characters() events
  • Attribute values: semicolon-less named entities followed by [A-Za-z0-9=] are intentionally left unresolved to avoid corrupting URLs (e.g. &not=, &copy=)
  • Updated all affected tests to expect correctly entity-resolved output
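
For reviewers who want a feel for the approach, the numeric and named resolution described above could be sketched roughly as follows. This is an illustrative stand-in, not the merged code: the class name, the regex, the helper names sanitize and parseCapped, and the small NAMED table are all assumptions (the PR itself looks names up through HTMLEntities.get()). Unlike the merged change, this sketch only resolves exact name matches.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and structure are assumptions, not the PR's code.
class EntityResolutionSketch {
    // group 1 = decimal digits, group 2 = hex digits, group 3 = entity name;
    // the trailing semicolon is optional in all three forms.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:#([0-9]+)|#[xX]([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9]*));?");

    // Hypothetical stand-in for the real named-entity table.
    private static final Map<String, Integer> NAMED = Map.of(
            "amp", 0x26, "lt", 0x3C, "gt", 0x3E, "quot", 0x22,
            "Ouml", 0xD6, "copy", 0xA9, "not", 0xAC);

    // Maps null, surrogates, out-of-range values, and noncharacters to U+FFFD,
    // mirroring the behavior the PR describes for resolveCodePoint().
    static int sanitize(long cp) {
        if (cp == 0 || cp > 0x10FFFF) return 0xFFFD;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0xFFFD;
        if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE) return 0xFFFD;
        return (int) cp;
    }

    // Parses digits in the given radix, stopping early once the value leaves
    // the Unicode range so that very long digit runs cannot overflow.
    static long parseCapped(String digits, int radix) {
        long value = 0;
        for (int i = 0; i < digits.length(); i++) {
            value = value * radix + Character.digit(digits.charAt(i), radix);
            if (value > 0x10FFFF) return 0x110000;
        }
        return value;
    }

    static String resolve(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement;
            if (m.group(1) != null) {
                replacement = new String(Character.toChars(sanitize(parseCapped(m.group(1), 10))));
            } else if (m.group(2) != null) {
                replacement = new String(Character.toChars(sanitize(parseCapped(m.group(2), 16))));
            } else {
                Integer cp = NAMED.get(m.group(3));
                // Unknown names are left exactly as written.
                replacement = (cp == null) ? m.group() : new String(Character.toChars(cp));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("&lt;b&gt; &#214; &#xD6; &Ouml; AT&T"));
    }
}
```

Capping the parsed value before it can overflow means an absurdly long digit run ends up on the same U+FFFD path as any other out-of-range reference.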

Testing

  • Existing test suite updated to reflect resolved entity output
  • Tests cover numeric decimal/hex entities, named entities, edge cases (null char, surrogates, out-of-Unicode-range, XML-illegal chars, Unicode noncharacters)

Breaking Changes

  • Text content and attribute values that previously contained literal &...; sequences are now emitted as their decoded Unicode equivalents. Consumers that rely on raw entity text in SAX events will need to be updated, but the new output matches how an HTML parser is expected to decode character references.

Additional Notes

  • Uses HTMLEntities.get() for named entity lookup — no new dependencies introduced
  • Attribute context handling follows the HTML5 tokenizer attribute value state spec to avoid false positives on query strings and URL parameters
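
The attribute-context rule can be illustrated with a small, self-contained sketch. It is not the merged implementation: the class name, the LEGACY table (a handful of names that HTML allows without a trailing semicolon), and the method resolveNamed are hypothetical, and numeric references are left out to keep the focus on the semicolon-less case.

```java
import java.util.Map;

// Illustrative sketch only; names and structure are assumptions, not the PR's code.
class AttributeEntitySketch {
    // Hypothetical table of names that may appear without a trailing semicolon.
    private static final Map<String, String> LEGACY = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "copy", "\u00A9", "not", "\u00AC");

    private static boolean isAsciiAlnum(char c) {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    static String resolveNamed(String text, boolean inAttribute) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // Find the longest known name that starts right after the ampersand.
            String best = null;
            for (String name : LEGACY.keySet()) {
                if (text.startsWith(name, i + 1) && (best == null || name.length() > best.length())) {
                    best = name;
                }
            }
            if (best == null) {
                out.append(c);
                i++;
                continue;
            }
            int end = i + 1 + best.length();
            boolean hasSemicolon = end < text.length() && text.charAt(end) == ';';
            // In attribute values, a semicolon-less name followed by '=' or an
            // ASCII alphanumeric is kept as raw text, so query strings survive.
            boolean keepRaw = false;
            if (inAttribute && !hasSemicolon && end < text.length()) {
                char next = text.charAt(end);
                keepRaw = next == '=' || isAsciiAlnum(next);
            }
            if (keepRaw) {
                out.append(c);
                i++;
                continue;
            }
            out.append(LEGACY.get(best));
            i = hasSemicolon ? end + 1 : end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same input, two contexts: attribute value vs. text content.
        System.out.println(resolveNamed("/search?q=1&copy=2", true));
        System.out.println(resolveNamed("/search?q=1&copy=2", false));
    }
}
```

With the same input, the attribute call leaves `&copy=2` untouched while the text-content call decodes it, which is the distinction the inAttribute flag exists to make.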

Implement resolveEntities() in SimpleHTMLScanner to decode HTML character
references in text content and attribute values during parsing.

- Handle numeric decimal (&#214;), hex (&#xD6;), and named (&Ouml;) entities
- Follow HTML5 attribute value state: skip semicolon-less named entities
  followed by [A-Za-z0-9=] to avoid corrupting URLs (e.g. &not=, &copy=)
- Replace invalid code points (null, surrogates, out-of-range, noncharacters)
  with U+FFFD per HTML5 spec
- Update test suite to reflect correct entity-resolved output

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marevol merged commit 7c319cf into master on Mar 11, 2026
1 check passed